
In this webinar, we’re joined by Eri Rubin, VP of research and development at DeepCube (a cnvrg.io customer), and NVIDIA Deep Learning Solutions Architect Adam Tetelman to discuss how to optimize distributed training across multiple nodes and multiple GPUs to maximize performance.

Distributed deep learning can be complex, with many factors contributing to the overall success of a deployment. Training at scale requires a well-designed data center with proper storage, networking, compute, and software design. In this webinar, we will hear from industry experts in distributed deep learning training and go over best practices for building dynamic distributed training clusters using containers, PyTorch software tips for distributed training, and strategies for data center design and workload management to maximize NVIDIA GPU utilization. Together with the cnvrg.io software platform, these best practices for deep learning software and hardware will help individual training jobs run faster while boosting cluster utilization and data center ROI.

We’ll follow with a live Megatron-LM example using PyTorch in cnvrg.io. Together with DeepCube, NVIDIA, and cnvrg.io CEO Yochay Ettun, we will share performance optimization tips covering:

  • PyTorch tips and tricks for optimized multi-node, multi-GPU distributed training (see the sketch after this list)
  • Building dynamic distributed training clusters using NVIDIA NGC containers, Kubernetes, OpenMPI, and other open-source solutions
  • Designing a topology-aware GPU scheduler with an emphasis on bandwidth optimization and warm/cold data tiers
  • Walking through real examples such as Megatron-LM (https://github.com/NVIDIA/Megatron-LM), BERT, GPT-2, and other use cases
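
For reference, here is a minimal sketch of the kind of multi-node, multi-GPU setup discussed above, using PyTorch DistributedDataParallel. The model, dataset, and hyperparameters are placeholders rather than anything shown in the webinar, and it assumes a launch via torchrun (or an equivalent MPI/Kubernetes launcher) that sets the usual rank environment variables.

```python
# Minimal PyTorch DistributedDataParallel sketch (placeholder model and data).
# Assumed launch, one process per GPU on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic dataset.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

    # DistributedSampler gives each rank a disjoint shard of the data.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad(set_to_none=True)
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a multi-node run, the same script is started once per GPU on every node; NCCL handles the gradient all-reduce, and the DistributedSampler keeps each rank training on its own shard of the data.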
