In recent years, increasingly large Transformer-based models such as BERT have demonstrated remarkable state-of-the-art (SoTA) performance on many NLP tasks. However, these models are computationally expensive, requiring massive compute resources and large amounts of data for training and deployment. As a result, the scalability and deployment of NLP-based systems across the industry are severely hindered. In this talk I’ll present a few methods for deploying NLP models efficiently in production, among them Quantization, Sparsity, and Distillation.
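As a small taste of the quantization approach, here is a minimal sketch using PyTorch's post-training dynamic quantization on a BERT checkpoint from Hugging Face Transformers. This is an illustrative example only, not the specific method covered in the talk, and it assumes the `torch` and `transformers` libraries are installed:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load a standard BERT model (any fine-tuned checkpoint works the same way)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time (CPU only)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Because int8 weights take a quarter of the space of fp32, this typically shrinks the Linear layers (the bulk of BERT's parameters) by roughly 4x and speeds up CPU inference, usually with only a small accuracy drop and no retraining.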