Managing GPU HPC environment for Data scientists

NGC containers

Industry

Education

Resources

Use-case

Deep Learning Computer Vision

TOOLS

cnvrg.io replaced the need for a team of 10 engineers to manage the hundreds of experiments and vast array of resources we have in our organization

Derrick Cho

AI Director at ST Unitas

About ST Unitas

ST Unitas is a leading Global EdTech company and owner of The Princeton Review – the top US based education service provider. Their vision is to “remove the gap between the world’s rich and poor through education.” It has seen accelerated growth in the last few years and has deployed multiple AI applications across their services. The company is responsible for artificial intelligence (AI) education services STELLA, a personalized system to help each student overcome their weaknesses, as well as a knowledge platform called CONECTS.

Overview

ST Unitas is a fast growing Korean enterprise enhancing education systems with state of the art machine learning and AI. ST Unitas serves over 3 million students with its “CONECTS Q&A” application, processing millions of images of math problems and providing instant materials to boost learning and increase test scores.

Challenges

Managing rapid team growth & resource utilization

The applied AI team at ST Unitas builds state-of-the-art models and algorithms, and continuously improves their models by implementing the latest published papers and research. Demand for the research team’s services grew rapidly after multiple successful deployments across its services. The team was faced with the challenge of massively scaling model development to deliver more applications in a short period of time. Resource management and maintenance of ML compute more than 40 Titan RTX GPUs (and growing) in a hybrid-cloud setup, slowed project agility and performance of the AI team’s progress, adding complexity and delays in setting up dependencies, leveraging open source technologies and tracking all built models. The team diverged from their focus on research and spent more than 50% of their time on low-value DevOps and MLOps. ST Unitas needed a solution to reduce complexity and allow the team to realign their focus on the what was important – the machine learning algorithms, at scale with high impact.

The team is able to save time with cnvrg.io’s integration to NVIDIA GPU Cloud, enabling researchers to quickly launch any hyper-optimized Deep Learning container on any compute, in a single click.
• Automated DevOps and eliminated need for installation and setting dependencies.
• 100% utilization of all on-prem GPUs, with the ability to burst to the cloud when needed.
• Saved time with instant launch of Pytorch and Tensorflow.
• Optimized model management and ability to compare complex models with Tensorboard.

Solution

ADOPTED CNVRG.IO TO ACCELERATE TEAM PRODUCTIVITY

With cnvrg.io, the ST Unitas AI team is able to build and deliver high performing deep learning models at a pace that matches the company’s growth. The team has a unified environment for all their machine learning compute. With cnvrg.io’s meta-scheduling technology, the 40+ Titan RTX GPUs are easily managed along with their cloud compute instances; leveraging on-premise compute and bursting to cloud when necessary The team is able to save time with cnvrg.io’s integration to NVIDIA GPU Cloud, enabling researchers to quickly launch any hyper-optimized Deep Learning container on any compute, in a single click.
• Automated DevOps and eliminated need for installation and setting
dependencies.
• 100% utilization of all on-prem GPUs, with the ability to burst to the
cloud when needed.
• Saved time with instant launch of Pytorch and Tensorflow.
• Optimized model management and ability to compare complex models
with Tensorboard.

Results

Using cnvrg.io, the ST Unitas AI team seamlessly scaled AI and data science activity by 10x, in a few months, without hiring more engineers.

The team decreased time spent on DevOps by 80% for each job and gained 100% control of utilizing their GPUs, saving them over 1 million dollars in wasted GPU and human resources.

ML pipelines that took weeks to set up are now fully automated and run in less than 2 hours with cnvrg.io flows, and continuously update in production maintaining accuracy above 80% at all times.

• Maintain over 40 Titan RTX GPUs on-premise with Kubernetes (AIOps).
• Fast development of OCR and image vector representation applications.
• Quick iteration for peak performing models.
• Automated transition from on prem to cloud.
• Data scientists can focus on research and algorithms instead of IT/DevOps.
• Building reproducible research with cnvrg.io Flows pipelines.