Running Efficient Distributed Training

Distributed training can be complex, with many factors contributing to the overall success of a deployment. Training at scale requires a well-designed AI infrastructure with proper storage, networking, compute, and software design. In this webinar, you’ll learn strategies to optimize your infrastructure for AI. We’ll sit down with Supermicro System Engineer Jeff Liu and cnvrg.io AI Architect Sean Rowan to discuss strategies for unifying your hardware and software technology to maximize AI performance. Supermicro, a global leader in high-performance, high-efficiency server technology innovation, will share information on their latest purpose-built AI servers and how to maximize utilization for AI workloads. You’ll also hear from cnvrg.io how to build a container-based infrastructure and unified control plane that enables scalability and simple allocation of resources. We’ll share a few end-to-end examples and use cases for performing distributed training, and show how to achieve maximum performance and server utilization across the entire workflow.

In this webinar, you will also learn how cnvrg.io removes the headaches of configuring distributed training by layering compute templates on top of your infrastructure, allowing users to work with their preferred framework, such as Horovod or PyTorch distributed. These templates provide a simple interface for specifying the resources, such as the number of GPUs, required for training. cnvrg.io automatically tracks key resource metrics such as GPU utilization and network throughput, so data scientists and ML engineers can easily spot model training bottlenecks.
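
For context on what the training code itself might look like, below is a minimal sketch of a PyTorch DistributedDataParallel script of the kind such a compute template could launch on the GPUs it allocates. The toy model, synthetic data, and torchrun-style environment variables are illustrative assumptions, not cnvrg.io specifics; the template decides where and with how many GPUs the script runs.

    # Minimal PyTorch DDP sketch (illustrative; not cnvrg.io-specific).
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    def main():
        # A launcher such as torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model and synthetic data; replace with your own.
        model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
        dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
        sampler = DistributedSampler(dataset)          # shards data across ranks
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)                   # reshuffle across ranks each epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()                        # gradients are all-reduced across GPUs
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, for example, torchrun --nproc_per_node=4 train.py, the same script runs one process per GPU; a compute template abstracts this launch step so users only declare how many GPUs they need.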

What you’ll learn:

  • How to optimize your AI infrastructure for performance 
  • How to use sophisticated meta-scheduling to accomplish high infrastructure utilization
  • Important considerations for IT leaders
  • Important considerations for data science practitioners
  • How to build infrastructure for simple server scalability
