ML Expectation vs. Reality, Part 1: Don't build a house on sand!

Machine Learning has seen a huge boom in the past decade, with many industries now investing heavily in Machine Learning-powered solutions. As a result, we’ve seen an uptick in the number of job opportunities for Data Scientists, and several universities have begun offering degrees specializing in Data Science and Artificial Intelligence.

Naturally, there is a huge influx of talent into AI/ML right now. While universities and online courses focus on the math and theoretical concepts, the skillset and knowledge required to train and deploy Machine Learning models to solve a real-world problem can be quite different from what a student expects. As experienced Machine Learning practitioners, we detail in this four-part blog post series how our expectations were subverted, and how we adapted.

When you’re first introduced to Machine Learning, it seems so magical; the possibilities seem endless. The mind starts imagining all the problems that could so easily be solved using Random Forests and Deep Neural Networks. But when you embark on your first real-world project, that’s when you realize …

In the wild, there is a different set of criteria to determine whether a problem can be solved using ML or not. You realize that more often than not, it is the availability of data that determines whether Machine Learning can be used to solve a certain problem. So one of the most important questions you can ask yourself before you start working on a new idea is:

Will a model trained on this data produce the right answers most of the time?

This question is completely independent of the model, library or language you choose for your ML experiments. It is also a multi-part question. Your model is only as good as the data you feed it, so you need to ask yourself:

Do I have enough data to train a good model? Unless you hit your hardware budget, using more data is almost always the right thing to do.
(For a supervised learning problem) Can I trust my labels? Am I feeding my model the right information?
Is this data an accurate representation of the real-world distribution? Do I have enough variation in my samples to cover the problem space?
Do I have access to a constant stream of new data that I can use to update my model and keep it current?
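To make a couple of these questions concrete, here is a minimal sketch of a first-pass dataset audit. It assumes your data can be reduced to (sample_id, label) pairs. The function name audit_dataset and the example records are hypothetical, made up for illustration. It reports how many samples you have, how the labels are balanced, and which samples appear with more than one label, which is a cheap first signal that your labels may not be trustworthy.

```python
from collections import Counter, defaultdict


def audit_dataset(samples):
    """Summarize size, label balance and label conflicts.

    `samples` is an iterable of (sample_id, label) pairs (an assumed format).
    """
    samples = list(samples)
    total = len(samples)
    label_counts = Counter(label for _, label in samples)

    # Collect every label seen for each sample id.
    labels_by_id = defaultdict(set)
    for sample_id, label in samples:
        labels_by_id[sample_id].add(label)

    # A sample that shows up with more than one label is suspicious.
    conflicts = sorted(sid for sid, labels in labels_by_id.items() if len(labels) > 1)

    label_share = {label: count / total for label, count in label_counts.items()}
    return {"total": total, "label_share": label_share, "conflicting_ids": conflicts}


# Hypothetical records, for illustration only.
report = audit_dataset([
    ("a1", "malware"),
    ("b2", "benign"),
    ("c3", "benign"),
    ("a1", "benign"),  # same sample, different label
])
print(report)
```

A report like this will not tell you whether your data matches the real-world distribution, but it will quickly tell you when you clearly do not have enough samples of one class, or when two sources disagree about the same sample.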

Getting your data together (AKA herding cats)

Often, when you’re trying to build a dataset to solve a problem with ML, the data is distributed among several different sources. Different parts of a sample are collected via different products, and managed by different teams on different platforms. So, the next step in the process is often to aggregate all this data into a single format and store it in a way that makes the data easily accessible.

Internally within Sophos AI, we manage our data using Amazon Redshift. We ingest feeds from various sources, including vendor aggregation services and user telemetry. We then aggregate this data into several downstream tables, and use these tables to collect datasets for experimentation.
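As a rough illustration of the aggregation step (not our actual Redshift pipeline), the sketch below maps records from two differently shaped feeds onto one shared schema. The feed contents, the field names and the helper to_common_schema are all hypothetical. The one design choice worth copying is the extra source field: keeping track of where each record came from makes it much easier to spot source-specific quirks later on.

```python
def to_common_schema(record, field_map, source):
    """Rename source-specific fields to shared names and tag provenance.

    `field_map` maps common field name -> field name used by this source.
    Fields the source does not provide are filled with None.
    """
    row = {common: record.get(original) for common, original in field_map.items()}
    row["source"] = source
    return row


# Two hypothetical feeds that describe the same kind of sample differently.
vendor_feed = [{"hash": "abc", "verdict": "malware"}]
telemetry_feed = [{"sha256": "def", "label": "benign", "first_seen": "2021-01-05"}]

VENDOR_MAP = {"sample_id": "hash", "label": "verdict", "first_seen": "seen"}
TELEMETRY_MAP = {"sample_id": "sha256", "label": "label", "first_seen": "first_seen"}

dataset = (
    [to_common_schema(r, VENDOR_MAP, "vendor") for r in vendor_feed]
    + [to_common_schema(r, TELEMETRY_MAP, "telemetry") for r in telemetry_feed]
)

for row in dataset:
    print(row)
```

In a real system this mapping would more likely live in SQL or in an ETL job than in a Python list comprehension, but the idea is the same: one schema, explicit handling of missing fields, and provenance kept alongside the data.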

Because nothing lasts forever, and you’ll have to retrain…

Concept/data drift is an important problem you’ll need to address when designing your ML system. Once you train a model, the model becomes less accurate over time as the distribution of new incoming data changes.

This drift is a lot faster in some domains than others, so you’ll need to decide how often to update your model to make sure it still performs within expected bounds. In the security domain, for instance, we see a lot of drift as threat actors change their exploits and behavior over time, and as vulnerabilities are discovered and patched.

Does your data spark joy?

Once you’ve got all your data in one place, it’s time to FINALLY get started and use that shiny new ML algorithm you can’t wait to try, right? Well not yet, because with real-world data, what often happens is…

Once you collect and aggregate your data, you will inevitably find issues that you’ll need to work through before you can train anything. Some of the most common ones are:

Missing data: Sometimes, you may not have valid values for all your observations. Data might have gotten corrupted during collection, storage or transfer, and you’ll need to find those missing data points and consider purging them from your dataset.
Duplicate data: While not an especially alarming issue when it comes to model performance, you might want to clear duplicate data out of your data store in order to make the model training process more efficient, and potentially avoid overfitting.
Different normalization schemes: Minor differences in the way your data is processed and stored can cause major headaches when training a model. For instance, different products may trim the same free text field to different lengths, or anonymize data differently, causing inconsistencies in your data. When one of these sources predominantly contains malware and the other contains benign samples, then your ML model might learn to identify them based on (for instance) the trim length, which becomes utterly useless during deployment.
Free text field data: This deserves a category all to itself, because it can be so hard to deal with. Free text fields are the bane of the data engineer’s life, since you’ll need to deal with typos, slang, near-duplicates, variations in case, whitespace, punctuation and a whole host of other inconsistencies.
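A minimal cleaning pass covering several of these issues might look like the sketch below. The record layout (sample_id, label, description), the function names and the 64-character limit are all assumptions made for illustration. The point is that every record goes through the same normalization, whichever product it came from, before duplicates are removed.

```python
import re
import unicodedata


def normalize_text(value, max_len=64):
    """Apply one normalization scheme to a free text field."""
    text = unicodedata.normalize("NFKC", value)  # unify Unicode variants
    text = text.casefold()                       # ignore differences in case
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    text = text.strip(".,;:!?")                  # drop stray edge punctuation
    return text[:max_len]                        # same trim length for every source


def clean(records, required=("sample_id", "label", "description")):
    """Drop incomplete records, normalize free text, remove exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Missing data: skip records with an absent or empty required field.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        row = dict(rec, description=normalize_text(rec["description"]))
        # Duplicate data: compare records only after normalization.
        key = (row["sample_id"], row["label"], row["description"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical records, for illustration only.
raw = [
    {"sample_id": "a1", "label": "malware", "description": "  Trojan   Dropper!! "},
    {"sample_id": "a1", "label": "malware", "description": "trojan dropper"},
    {"sample_id": "b2", "label": None, "description": "installer"},
]
cleaned = clean(raw)
print(cleaned)
```

Here the first two records collapse into one once case, whitespace and punctuation are normalized, and the third is dropped because its label is missing. This does nothing for typos, slang or near-duplicates, which usually need fuzzy matching or domain-specific rules, and dropping incomplete records is only one option; depending on your problem, imputing the missing values may be the better choice.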

Conclusion

That was a quick summary of the typical steps that need to be taken to choose, collect and clean data for your ML solution.

If you’ve got this far, you probably now have a clean dataset and can finally begin to experiment (about time, right?). You are now ready to move on to the second part of this blog series (here!), which deals with the unexpected challenges you may face while experimenting with different models and feature schemes. Good luck and may you never overfit!