ML Expectation vs. Reality, Part 2: Doing the Actual Analysis

So, you’ve followed the advice in Part 1 of this series.  Now you’ve got a nice big data set and you’re pretty sure that it covers the space of data that your model is likely to see in production. You’ve spent a lot of time thinking about how to label it and feel like you’ve got that problem under control.  You’re tracking concept drift and refreshing your data regularly, and you’ve got a nice data cleaning pipeline set up so that you can normalize samples and remove duplicates.  With all that out of the way, now you’re excited to get to the fun part and actually fit the model.  Now’s the time you get to flex your ML expertise, show off all the latest techniques you read about from the last big machine learning conference, and impress your boss with your skill and talent by building a dazzlingly complex model that fits the data perfectly, right? 

I’m gonna stop you right there…

Hold up. 

Industry research vs academic research

Academic and industrial machine learning groups differ in many ways, but perhaps the most significant is that, for research groups, the models are often ends in themselves: a fancy new model that solves a hard problem (regardless of how much that model cost to train and test) earns publications, citations, and more research funding for the group that produced it, and raises the academic status of the people who worked on it.  For industrial ML groups, by contrast, the models they produce are means to an end, and usually part of a much larger and more complex system. Using an ML model imposes costs: not just the memory or CPU footprint of the model and how long it takes to get a prediction from it, but also the cost of regular retraining, the complexity of deployment (see the next entry in the series about that), and ongoing maintenance and support.

“Pure” ML research | Industrial ML research
Get to pick the problem to work on | Problems are driven by business need
Models are the complete solution; high accuracy is paramount | Models are usually a component in a larger process; “good enough” accuracy can be mitigated by compensating controls
Get to choose how to measure success | Definition of “success” is determined by business need
Must be able to claim “state of the art” results under chosen metric | Must be able to show business value
Bias towards new techniques | Bias towards effective techniques
Models can be as big, slow, expensive to train, and complex as needed to obtain state of the art results | Models must balance effectiveness against cost of development and maintenance
Standard benchmark datasets, often with fixed features | Novel, never-before-seen datasets with lots of space for feature engineering
Done for the paper, generally no ongoing maintenance burden | Will be used continuously for a long time; maintenance and retraining are a significant concern
Some major differences between industrial and academic research

So, while that fancy new technique that you saw in an ICML paper seems great, and will no doubt be tremendous fun to implement and play with, it’s important to remember that the field of ML as a whole favors research that 1) achieves “state of the art” performance on some well-known benchmark task and 2) does so using a novel technique.  While some papers that introduce a new model will consider the cost, speed, or robustness of their method, this investigation is frequently brief relative to the time and effort spent analyzing the algorithm itself.  Almost no published papers report negative results, and it can be extremely difficult to get a more detailed analysis of an older technique accepted into a prestigious academic venue.  All of this biases academic research towards a heavy focus on novelty, regardless of time, complexity, or expense (for instance: GPT-3 has over 170 billion parameters and would cost over $4,600,000 to train on a commercial cloud provider).

Industrial ML, by contrast, is about making valuable decisions on messy, constantly changing data on an ongoing basis, and frequently within hard constraints on speed, model size, and training time. This means that you’re almost always going to be biased towards using simpler models first, particularly ones that are available in robust, off-the-shelf, well-tested libraries that don’t require a lot of time and effort to implement and test.  Models such as logistic regression (LR) or random forests (RF) are well studied, battle-tested, and very often provide surprisingly good results for almost no effort.  And most industrial ML projects stop right there; a few rounds of testing regularization parameters for LR, or tweaking hyperparameters for an RF model, get you a “good enough” model that’s relatively quick to train and retrain and is built using software that someone else is maintaining for you.
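As a rough illustration of what such a baseline pass might look like, the sketch below fits LR and RF with scikit-learn and runs a short search over a few hyperparameters. The synthetic dataset, the parameter grids, and the choice of ROC AUC as the metric are all assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your real, cleaned data set (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate baselines with deliberately small, illustrative search grids.
candidates = {
    "LR": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},  # regularization
    ),
    "RF": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 8]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 3-fold cross-validated grid search, refit on the full training split.
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=3)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "test_auc": search.score(X_test, y_test),  # ROC AUC on held-out data
    }
    print(name, results[name])
```

If either number already clears the bar the business has set, this may well be the whole project.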

But what if off-the-shelf models can’t quite meet your needs?  Surely then it’s time to reach for your stack of dog-eared ICML and NeurIPS papers, right?   

Not quite.   

The magic of domain expertise 

The next thing to do is reach out to domain experts – if you haven’t already – and talk to them about features.  You can often get 80% of the benefit of deep learning models with about 10% of the effort by simply improving your feature representation of the data based on domain expertise (you could make a pretty good argument that deep learning is simply a really complicated feature extractor bolted on to a linear model, but that’s a topic for a different post).  A good starting point here is to ask domain experts how they typically tackle the problem you’re trying to solve and see whether you can replicate that process in code, at least up to a point.  Taking any tools that they might use and wrapping some simple automation around them is another excellent strategy, as is showing them the kinds of features you’re already using and seeing if they have any recommendations.  While some feature extraction processes may be too expensive to pursue for some models (e.g. running every sample in a sandbox for an hour), or might not be able to run in the same place that the model will live (e.g. sandboxes on a customer endpoint), very often with a little expert help you can find simple-to-collect features that provide a significant performance improvement.
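To make that concrete, suppose (purely for illustration) that an analyst tells you they start by checking how “random” a file’s bytes look and whether it contains readable strings such as URLs. A minimal sketch of turning that habit into cheap features might look like this; the function names, feature names, and the sample bytes are all hypothetical.

```python
import math
import re
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Runs of printable ASCII at least min_len bytes long."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)


def extract_features(data: bytes) -> dict:
    """Hypothetical expert-suggested features; the choices are illustrative."""
    strings = printable_strings(data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "num_strings": len(strings),
        "has_url": any(b"http://" in s or b"https://" in s for s in strings),
    }


# Made-up sample: a few binary bytes around one readable string.
sample = b"\x00\x01\x02\x03 visit http://example.com now \xff\xfe"
print(extract_features(sample))
```

Features like these are cheap enough to compute wherever the model runs, which is exactly the property the sandbox example above lacks.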

So, now that we’ve set a good baseline with some off-the-shelf approach, and improved our features… surely now we can reach for those academic papers and try the latest, greatest machine learning techniques, right?

Nope! But we’re getting closer!  

…just not quite yet.

Always try what has worked before 

Before you try to get fancy, go get your favorite stock feedforward neural network (or your favorite set of gradient boosted decision tree hyperparameters).  The dirty secret of the ML community is that most deep learning models are massively over-parameterized and rescued from poor performance only through careful regularization. This means that the difference between any two models on a particular problem will typically not be all that big, especially once each has had a little hyperparameter tuning, so it’s always worth checking whether a model you already use for a different problem will work.

This is an in-house version of the “start with an open source library” approach.  Take a deep learning model you’ve already used and understand and try that out with a short hyperparameter search.  Again, the bias here is towards something that is quick and easy to implement and that you’ve already used successfully and presumably have had some experience debugging.   

Just by way of example: Sophos AI has around 12 deep learning models out in the field (depending on how you count them); most of them use the same basic building block (dropout, dense layer, some sort of normalization, ELU or ReLU activation) with varying layer size and number of blocks, and half of what’s left use a basic embedding+convolution architecture.  Fewer than a quarter of our models have unique architectures that share nothing with any other model.
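A building block of that general shape can be sketched in a few lines of PyTorch. This is an illustration under assumptions, not the actual production architecture: the choice of LayerNorm for “some sort of normalization”, the dropout rate, the layer widths, and the single-logit output are all made up for the example.

```python
import torch
from torch import nn


def make_block(in_dim: int, out_dim: int, dropout: float = 0.1) -> nn.Sequential:
    """One block: dropout, dense layer, normalization, activation."""
    return nn.Sequential(
        nn.Dropout(dropout),
        nn.Linear(in_dim, out_dim),
        nn.LayerNorm(out_dim),  # assumption: any normalization could go here
        nn.ELU(),               # or nn.ReLU()
    )


def make_model(in_dim: int, hidden_dims: list, dropout: float = 0.1) -> nn.Sequential:
    """Stack blocks of varying width, then a single-logit output layer."""
    layers = []
    prev = in_dim
    for width in hidden_dims:
        layers.append(make_block(prev, width, dropout))
        prev = width
    layers.append(nn.Linear(prev, 1))
    return nn.Sequential(*layers)


# Layer sizes and number of blocks are the knobs that vary per problem.
model = make_model(in_dim=32, hidden_dims=[64, 64, 16])
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    logits = model(torch.randn(8, 32))
print(logits.shape)
```

Reusing a block like this means a new problem mostly comes down to picking `hidden_dims` and running a short hyperparameter search.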

Alright.  So now 1) we’ve set baselines with LR and RF and decided that neither is good enough; 2) we’ve spent a while iterating on our features, and haven’t really made enough progress on that front; 3) we’ve used a tried-and-true deep learning architecture and found that it doesn’t quite fit the bill.  Now, at long last, we’re ready to try out some fancy architectures and show off our ML skills, right?

So close. One last step before we get there.   

Remember that your model is part of a system 

Now it’s time to step away from the data analysis altogether and go back to whoever is using your model, show them what you’ve got as far as results, and ask them if they can change their process to accommodate the model performance.  Very often it’s easier to implement some simple compensating controls for things like false positive suppression than it is to sink a lot of time and energy into squeezing the last 0.001% out of the FPR (fun though that is to do).  Doing some quick analysis on what kinds of errors the model tends to make – are there, for instance, specific classes of items that it just seems to do poorly on – can often lead to modeling insights and help you develop rule-based controls around the model that will let you field what you have without further research.  In other cases, focusing the model on a subset of the data where it does better might be another deployment strategy that will let you get value from the model without spending expensive development time on it.   
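The quick error analysis and a simple rule-based control described above might look something like the following sketch. The record format, the group names, and the 50% threshold are hypothetical, chosen only to show the shape of the idea.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """FPR per group, computed over benign (label 0) records only."""
    fp = defaultdict(int)
    benign = defaultdict(int)
    for r in records:
        if r["label"] == 0:
            benign[r["group"]] += 1
            if r["predicted"] == 1:
                fp[r["group"]] += 1
    return {g: fp[g] / n for g, n in benign.items()}


def apply_controls(record, suppressed_groups):
    """Compensating control: drop detections for groups the model handles poorly."""
    if record["predicted"] == 1 and record["group"] in suppressed_groups:
        return 0
    return record["predicted"]


# Made-up evaluation records: true label, model verdict, and a sample category.
records = [
    {"group": "installer", "label": 0, "predicted": 1},
    {"group": "installer", "label": 0, "predicted": 1},
    {"group": "installer", "label": 0, "predicted": 0},
    {"group": "document", "label": 0, "predicted": 0},
    {"group": "document", "label": 0, "predicted": 0},
    {"group": "document", "label": 1, "predicted": 1},
]

fpr = false_positive_rate_by_group(records)
# Hypothetical policy: stop trusting the model on groups with FPR above 50%.
suppressed = {g for g, rate in fpr.items() if rate > 0.5}
final = [apply_controls(r, suppressed) for r in records]
print(fpr, suppressed, final)
```

Note the trade-off: suppressing a whole group also discards any true detections in it, so a control like this only makes sense when something else in the larger system covers that group.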

Or it might not. It might be that you really do need to improve the model for it to serve the function it needs to in production.  But you need to remember that going off the map and trying to develop a new architecture or loss function or training technique is a potentially expensive endeavor, and historically most new research fails.  Make sure the potential upside is worth it, and before you take the next step into completely new model development territory, make sure that you absolutely have to. 

Now? Now.

And so now, at long last, we have arrived.   

That’s like two whole epochs of GPT-3 training on CPU

We can’t get to good enough performance any other way.  We’ve checked with our customer and we know how big a model we can make and how fast it has to run.  We know the boundaries within which we can explore, and we’re ready to start doing some fancy ML research.  Now you’re allowed to get that fancy new architecture you’ve been dying to try out and start rolling some complicated loss functions.

If it sounds like we’ve done everything possible to avoid coming to the point of having to do novel research on modeling approaches and architectures to solve our unique problem, it’s because we have.  While this is – arguably – one of the most fun and interesting parts of ML research, it’s also the riskiest and the most expensive.  Every time we have to deploy a completely new model, that’s another large investment of effort into a project of unknown outcomes, where we’ll be assuming a heavy maintenance burden for as long as the model is out in the field.  You have to be sure that you actually need to run that risk and pay that cost to build and deploy the model.   

Remember: we’re not here to write conference papers (though that, too, is a lot of fun), we’re here to make ML models that meet a business need.  Make sure you understand the need and meet it as efficiently as possible.