Quick tips on ML Infrastructure

Dawid Laszuk

ML infrastructure plays a crucial role in the success of a machine learning project. Building and maintaining it can be complex and time-consuming, especially when it's built from scratch for every project. Obviously, horses for courses: some projects are one-offs where optimising would only derail your enthusiasm, and some are so large that everything becomes an exception. Most projects, however, are (un)fortunately going to look very similar from a high-level perspective. Each will need some data storage, some compute, and a means to deploy models. Here are a few short tips.

Plan ahead

First and foremost, it's important to have a clear understanding of the project's requirements. This includes the types of data that will be used, the size of the datasets, and the computational resources needed. Think a bit ahead, but don't overcomplicate. Yes, we all have a grand plan that eventually our models are going to be multi-modal and that we'll somehow be able to add every dataset we find. It might even be true in your case. However, be honest and think about how long it would take you to get there. Tooling and service offerings are improving at such a pace that most projections beyond two years are unlikely to hold, and beyond three years are unreasonable. Plan the next year, and think ahead about where your growth opportunities are going to be.

Once you know where you're heading, consider how you are going to get there. Start locally, with just your computer and small datasets. The goal should be to run an end-to-end solution in the simplest manner possible. Eventually you will probably put it into the public cloud (AWS or Azure) or some dedicated service (Hugging Face or Netlify), but don't start there; there's plenty to debug in the beginning, and debugging "somewhere else" is simply slow.
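As a rough sketch, such a local end-to-end run can fit in a single script. The file path, column name, and model choice below are placeholders for whatever your project actually uses:

```python
# A minimal local end-to-end sketch: load a small CSV, train, evaluate, save.
# "data/sample.csv" and the "target" column are placeholders for your own data.
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/sample.csv")               # small, local dataset
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

Path("artifacts").mkdir(exist_ok=True)
joblib.dump(model, "artifacts/model.joblib")      # the artefact you'd later deploy
```

Once this runs on your laptop in a minute, moving it "somewhere else" becomes a deployment problem rather than a debugging one.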

Automate

While you're developing and debugging, think about how you're going to automate all the components. Yes, get those computers to finally do something for you. Not only will you shave off some minutes (hours, at scale), you'll also reduce fat-finger errors. Automation through scripts and tools also makes it easier to bring other people into the project: scripts can be followed, and third-party tools typically come with documentation. Codifying everything can be a bit of an investment, especially in the beginning, but(!) the code is going to live and grow with the solution, and it'll be the source of truth when the documentation is outdated.
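For instance, even a tiny entry point that runs every step in order is already a big win over a checklist in someone's head. The step scripts named below (prepare_data.py, train.py, evaluate.py) are hypothetical placeholders:

```python
# A tiny automation entry point: one command runs every step in order, so
# nobody has to remember (or mistype) the sequence. The step scripts named
# here are hypothetical placeholders.
import subprocess
import sys

STEPS = [
    ["python", "prepare_data.py"],
    ["python", "train.py"],
    ["python", "evaluate.py"],
]

def main() -> int:
    for cmd in STEPS:
        print(f"--- running: {' '.join(cmd)} ---")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"step failed: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode   # fail fast instead of silently continuing
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```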

As a side note, codifying infrastructure (e.g. with Terraform) and deployment (e.g. Docker swarm mode or Kubernetes) allows for quick start-up and tear-down. This is very useful if you're on a budget and always looking for the best deal, or if you're a startup running on credits from the cloud providers. Having a flexible deployment can literally save you thousands of dollars.
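A minimal sketch of scripted start-up and tear-down might just wrap the Terraform CLI, assuming Terraform is installed and the configuration lives in an infra/ directory (both assumptions here):

```python
# A minimal start-up / tear-down wrapper around the Terraform CLI, assuming
# Terraform is installed and the configuration lives in an "infra/" directory.
import subprocess
import sys

def terraform(*args: str) -> None:
    # -chdir points the CLI at the directory holding the .tf files
    subprocess.run(["terraform", "-chdir=infra", *args], check=True)

def up() -> None:
    terraform("init")
    terraform("apply", "-auto-approve")     # create (or update) the stack

def down() -> None:
    terraform("destroy", "-auto-approve")   # tear everything down again

if __name__ == "__main__":
    {"up": up, "down": down}[sys.argv[1]]()
```

Run it before a training session and again afterwards, and the idle hours stop showing up on the bill.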

Reproducibility

Automation also helps with reproducibility. Models that can't reproduce their scores are like developers saying that it worked on their machine. A "must" that will save you a lot of time in the future is adding plenty of functional and integration tests to your continuous integration (CI) and continuous deployment (CD) pipelines. Capture metrics and check whether a change is sensible. If neither the model nor the input data has changed, the scores should be the same. Note that the reverse is also true: if the model or the input data has changed, then (some of) the results should also change. The basics are: all code is in version control (e.g. git), the environment stays the same (e.g. Docker), you keep a golden dataset on the side (e.g. a CSV in S3), and you keep previously trained model artefacts to reuse and compare against.

To be clear, reproducibility is very important from a scientific point of view, but it's even more important in industry, where things will break and the higher-ups will come down looking for scapegoats. Don't be an easy target.
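Concretely, the kind of CI test that keeps you off that list can be very plain: score a previously trained artefact against the golden dataset and compare it to the number recorded at training time. The paths and the expected score below are placeholders:

```python
# A sketch of a reproducibility check for CI, assuming a golden dataset and a
# previously trained artefact exist at the (placeholder) paths below. If
# neither the model nor the data changed, the score should not change either.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

GOLDEN_DATA = "s3://my-bucket/golden/validation.csv"   # hypothetical location
MODEL_ARTEFACT = "artifacts/model.joblib"
EXPECTED_ACCURACY = 0.87                               # recorded at training time

def test_model_reproduces_golden_score():
    df = pd.read_csv(GOLDEN_DATA)
    model = joblib.load(MODEL_ARTEFACT)
    score = accuracy_score(df["target"], model.predict(df.drop(columns=["target"])))
    # exact match for deterministic models; allow a small tolerance otherwise
    assert abs(score - EXPECTED_ACCURACY) < 1e-6
```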

Monitoring

The best way of avoiding those looks, however, is being on top of the game. Proper monitoring and logging are an invaluable way of understanding a model's health. In code, add plenty of sanity checks for trivial stuff, e.g. that the shape of a matrix/dataframe is exactly what you expect and that there aren't any NaNs or duplicate records. Panic the application if needed, but definitely log and capture these.

For metrics, you likely have some metric that you optimise the model for, and that one needs tracking. But there are also plenty of descriptive metrics, like the shape of the input data, the time processing takes, the ranges of intermediate values... If you're working with ML models you likely enjoy watching plots, so give yourself a treat and spare no effort here. We're very good at spotting patterns, and the lack of them, in good visualisations and graphs. And by "we" I don't mean only ML folks but the general population. Sharing graphs with others makes it easier to explain your expectations and the issues you're facing.

It's also almost guaranteed that rather quickly you'll notice something you didn't anticipate or thought would never happen. For these, it's great to have a system with alerts. Similarly to logs in code, put alerts on metric values that are obviously bad, e.g. more than 10,000 people detected in a single photo, or a negative price. This not only catches weirdness in the data, it also lets future you simply adjust a threshold rather than come up with the most correct threshold up front (which doesn't exist).
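To make this concrete, a sketch of such sanity checks and "obviously bad" alerts could look like the following; the expected column count, the thresholds, and plain logging as the alert channel are all assumptions to adapt to your own pipeline:

```python
# A sketch of the sanity checks and "obviously bad" alerts described above.
# The expected column count, the thresholds, and plain logging as the alert
# channel are all assumptions to adapt to your own pipeline.
import logging

import pandas as pd

logger = logging.getLogger("pipeline")

def check_input(df: pd.DataFrame, expected_columns: int = 12) -> None:
    if df.shape[1] != expected_columns:
        raise ValueError(f"expected {expected_columns} columns, got {df.shape[1]}")
    if df.isna().any().any():
        raise ValueError("input contains NaNs")
    if df.duplicated().any():
        logger.warning("input contains %d duplicate rows", df.duplicated().sum())

def check_outputs(people_in_photo: int, price: float) -> None:
    # thresholds only need to catch the obviously wrong; adjust them later
    if people_in_photo > 10_000:
        logger.error("suspicious count: %d people in one photo", people_in_photo)
    if price < 0:
        logger.error("negative price: %s", price)
```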

Monitoring and alarming is crucial. It's the "if a tree falls in the forest": how do you know your model is running if you don't observe it? It's the "are we there yet": when do we need to work on a better version of the model? It's the "can't we just": why are we spending money on boxes full of parameters instead of just writing a few custom cases? And, if the other ifs are iffy: it allows you to bluntly stare at the screen and tell others that you're working.