Reducing inference times and increasing throughput for model deployment on GPUs

Speakers:

Mark Moyou, PhD, Senior Data Scientist, NVIDIA

In this talk we will discuss best practices for reducing inference times and increasing model throughput when deploying machine learning models on GPUs. We will explore how model compression and speedup are achieved by building hardware-specific inference engines. Then we will show how open-source inference servers can maximize throughput on those GPUs by hosting multiple optimized models and taking advantage of multiple backends for PyTorch, TensorFlow, and Python-based models.