Save up to 80% in cloud costs when building machine learning models

By Yochay Ettun

Building machine learning models is expensive. It’s expensive not only because of the high salaries of data scientists, but also because of the high cost of cloud compute. If you want your model to be built properly, you can’t compromise on compute power. The best way to reduce the cost of machine learning is to maximize your savings on compute.

This article will present a way for you to save on cloud costs so you can focus on the model that needs to be built. Not only that, but it will increase the ROI of your model, which will make your stakeholders happy.

Why is machine learning development so expensive?

Machine learning models are inherently expensive to build. Building a model often requires a lot of computational power. Depending on the size of your dataset, you might need massive CPU and memory instances, running applications like Spark for distributed computing. Or, if you’re training deep learning models, high-end GPU instances are almost always a must. These use cases are pretty common in the machine learning world, and the expensive compute resources can rack up thousands of dollars in cloud costs.

The second costly condition of machine learning is that it requires a lot of experimentation. Training a model to converge usually does not happen in a single run. In fact, many of our users report running hundreds or even thousands of experiments per machine learning project. These costs are unavoidable when searching for your champion model.

The solution: Spot instances

Spot instances on AWS are your way of saving up to 80% in cloud costs when building models. According to the AWS documentation, “A Spot Instance is an unused EC2 instance that is available for less than the On-Demand price.” What does that mean? It means that AWS has many unused servers available at any given time. To get the most out of these servers, AWS has found a clever way to monetize them: selling them at a major discount. With Spot instances, you’re in control of your budget. Once you set the maximum price you are willing to pay, the system finds you an available instance at or below that price, usually within a few seconds.
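To make the "set your maximum price" step concrete, here is a minimal sketch of how such a request could look with boto3, the AWS SDK for Python. It assumes boto3 is installed and AWS credentials are configured. The function names, the AMI ID, and the price are hypothetical placeholders, not values from this article.

```python
def spot_launch_params(image_id, instance_type, max_price_usd):
    """Build the arguments for an EC2 run_instances call that requests a spot instance."""
    return {
        "ImageId": image_id,            # hypothetical AMI ID supplied by the caller
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                # The most you are willing to pay per hour, passed as a string
                "MaxPrice": f"{max_price_usd:.4f}",
                "SpotInstanceType": "one-time",
            },
        },
    }


def launch_spot_instance(params, region="us-east-1"):
    """Send the request to EC2 and return the new instance ID."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]


# Example: cap the price for a P2.8xlarge at an assumed 2.50 USD per hour
params = spot_launch_params("ami-EXAMPLE", "p2.8xlarge", 2.5)
```

Calling `launch_spot_instance(params)` would submit the request. If the current spot price is above your maximum, the request simply is not fulfilled.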

GPU instances on AWS: On-Demand vs. Spot price

AWS Instance	Spec	On-Demand (USD/hour)	Spot Price (USD/hour)
P2.xlarge	1x NVIDIA K80 GPU	0.9	0.27
P2.8xlarge	8x NVIDIA K80 GPU	7.2	2.16
P2.16xlarge	16x NVIDIA K80 GPU	14.4	4.32
P3.2xlarge	1x Tesla V100 GPU	3.06	0.918
P3.8xlarge	4x Tesla V100 GPU	12.24	3.672
P3.16xlarge	8x Tesla V100 GPU	24.48	7.344
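To see what these numbers mean for a real training run, here is a short sketch that computes the discount for a few rows of the table and the cost of a week-long run. The prices are copied from the table, and the 168-hour run length is an assumption for illustration. Spot prices fluctuate, so treat the results as a snapshot.

```python
# (on-demand, spot) prices in USD per hour, copied from the table above
prices = {
    "P2.8xlarge": (7.2, 2.16),
    "P3.2xlarge": (3.06, 0.918),
    "P3.16xlarge": (24.48, 7.344),
}


def savings_pct(on_demand, spot):
    """Percentage saved by paying the spot price instead of the on-demand price."""
    return round((1 - spot / on_demand) * 100, 1)


def run_cost(hourly_price, hours):
    """Total cost in USD of running one instance for the given number of hours."""
    return round(hourly_price * hours, 2)


HOURS_PER_WEEK = 7 * 24  # assumed length of one long training run

for name, (on_demand, spot) in prices.items():
    print(
        name,
        f"saves {savings_pct(on_demand, spot)}%,",
        f"one week: {run_cost(on_demand, HOURS_PER_WEEK)} USD on-demand",
        f"vs. {run_cost(spot, HOURS_PER_WEEK)} USD spot",
    )
```

At the prices listed, every row works out to the same 70% discount. Reaching the 80% end of the range depends on the instance type, region, and time of day.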

This hidden gem is almost too good to be true. So, before you get carried away with excitement, keep two things in mind when using spot instances:

1. Price surges – The spot price is determined by demand and bidding trends. This means that when the demand for that instance is high, the spot price increases by the second. Just like Uber, you want to avoid busy times. You can optimize your savings by monitoring fluctuations in prices with Spot Instance Pricing History and by running at low-demand times.

2. Termination – Spot instances can be automatically terminated by AWS with only a 2-minute notice. This can happen if your bid price is too low and someone else has bid higher, or if AWS needs the capacity back for on-demand instances.
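The 2-minute notice can be detected from inside the instance. AWS publishes it through the instance metadata service, which returns HTTP 404 until an interruption is scheduled. Below is a minimal sketch of a check you could call between training batches. It assumes the instance allows unauthenticated (IMDSv1) metadata requests; instances that require IMDSv2 need a session token first, which is not shown. The function names are our own.

```python
import json
import urllib.error
import urllib.request

# Metadata endpoint where AWS posts the spot interruption notice
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_instance_action(body):
    """Extract (action, time) from the JSON body of an interruption notice."""
    notice = json.loads(body)
    return notice["action"], notice["time"]


def check_for_interruption(url=SPOT_ACTION_URL, timeout=1.0):
    """Return (action, time) if an interruption is scheduled, otherwise None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_instance_action(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption scheduled
        raise
    except OSError:
        return None  # metadata service unreachable, e.g. not running on EC2
```

When the check returns a value, the training loop has roughly two minutes to save a checkpoint and exit cleanly.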

Spot instances and your machine learning

So, we have summarized spot instances, as well as their pros and cons. How does this fit into your machine learning workflow? As outlined above, there are a few risks to using spot instances. While it’s great to save all that money, what if you train a model for a week and then your instance is terminated? How can you take that risk? That is where checkpoints and S3 come into play.

Many machine learning and deep learning frameworks have introduced checkpointing capabilities, which allow data scientists and developers to save versions of the model created during training. Then, if something happens, you can continue training from the last saved checkpoint.
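A checkpoint only helps if it outlives the instance, which is where S3 comes in. Here is a minimal sketch of copying a checkpoint file to and from a bucket with boto3. It assumes boto3 is installed and credentials are configured. The helper names, the key prefix, and the bucket name in the usage note are hypothetical.

```python
import os


def checkpoint_key(local_path, prefix="checkpoints"):
    """Build the S3 object key for a local checkpoint file."""
    return f"{prefix.rstrip('/')}/{os.path.basename(local_path)}"


def upload_checkpoint(local_path, bucket, prefix="checkpoints"):
    """Copy a checkpoint file to S3 so it survives instance termination."""
    import boto3  # imported lazily; needs AWS credentials at call time

    key = checkpoint_key(local_path, prefix)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key


def download_checkpoint(local_path, bucket, prefix="checkpoints"):
    """Fetch the checkpoint back from S3 onto a fresh instance."""
    import boto3

    key = checkpoint_key(local_path, prefix)
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```

For example, calling `upload_checkpoint("checkpoints/weights_checkpoint.h5", "my-training-bucket")` after each save keeps a copy off the instance, and `download_checkpoint` with the same arguments restores it on a replacement instance.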

How to create model checkpoints

To make this practical, here are a few code samples to help you checkpoint a model when training with TensorFlow and Keras. We’re simply using the Keras callbacks API, and more specifically ModelCheckpoint:

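```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save the model whenever validation accuracy improves; the monitored name must match a metric your model logs
checkpoint = ModelCheckpoint('checkpoints/weights_checkpoint.h5', monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks = [checkpoint]
model.fit(X, Y, validation_split=split, epochs=epochs, batch_size=batch_size, callbacks=callbacks, verbose=0)
```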

Load a checkpoint and continue training

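```python
from tensorflow.keras.models import load_model

# Reload the model saved by the checkpoint callback
model = load_model('checkpoints/weights_checkpoint.h5')
# Continue training from the restored state
model.fit(X, Y, validation_split=split, epochs=epochs, batch_size=batch_size, callbacks=callbacks, verbose=0)
```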

We also recommend storing some metadata about the latest stored checkpoint. It can go in the filename or even in a simple text file.
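As one way to do this, the sketch below writes a small JSON file next to the checkpoint and reads it back when training restarts. The file layout and field names are our own assumptions, not a Keras convention. The value it returns can be passed to the `initial_epoch` argument of `model.fit` so that a resumed run continues its epoch count instead of starting from zero.

```python
import json
import os
import time


def write_checkpoint_metadata(checkpoint_path, epochs_completed, val_acc):
    """Record which checkpoint was saved, after how many epochs, and how good it was."""
    meta_path = checkpoint_path + ".json"
    meta = {
        "checkpoint": os.path.basename(checkpoint_path),
        "epochs_completed": epochs_completed,
        "val_acc": val_acc,
        "saved_at": time.time(),
    }
    # Write to a temporary file first so a termination mid-write
    # cannot leave a half-written metadata file behind.
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(meta, f)
    os.replace(tmp_path, meta_path)
    return meta_path


def read_epochs_completed(checkpoint_path):
    """Return the number of completed epochs, or 0 if no metadata exists yet."""
    meta_path = checkpoint_path + ".json"
    if not os.path.exists(meta_path):
        return 0
    with open(meta_path) as f:
        return json.load(f)["epochs_completed"]
```

On restart, `read_epochs_completed('checkpoints/weights_checkpoint.h5')` tells you where the previous run stopped, so you can log it and skip the epochs that are already done.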

Summary

Data science teams that use cnvrg.io automatically benefit from its built-in spot instances integration. Every experiment can run on a spot instance, and cnvrg.io will handle the bidding and the spot instance request. Not only that, but cnvrg.io has a built-in solution for automatically persisting and syncing checkpoints and data, so you don’t have to worry about termination. Upon termination, cnvrg.io will automatically relaunch training on a new instance, using the latest checkpoint, without you lifting a finger.

Customers have trained massive GANs over weeks on spot instances, saving them thousands of dollars. Think that’s too good to be true? See for yourself with a free demo.