Garbage In, Garbage Out: How Purportedly Great Machine Learning Models Can Be Screwed Up by Bad Data

As processing power and deep learning techniques have improved, deep learning has become a powerful tool to detect and classify increasingly complex and obfuscated malware at scale.

A plethora of white papers tout impressive malware detection and false positive rates using machine learning, often deep learning. However, virtually all of these rates are shown only in the context of the single source of data the authors chose to train and test on. Accuracy statistics are generally the result of training on one portion of a dataset (such as VirusTotal data) and testing on a different portion of the same dataset. But model effectiveness (specifically, detection rates in the extremely low false-positive-rate region) may vary significantly when the model is used on new, different datasets, in particular when it is deployed in the wild on actual consumer data.
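
To make "detection rate in the low false-positive-rate region" concrete, here is a minimal Python sketch of one way such a number could be computed. It assumes the model emits a score per sample where higher means more likely malicious; the function name, the toy scores, and the strict `>` flagging rule are illustrative assumptions, not the method used in this work.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Return (detection_rate, threshold) at a false-positive budget.

    scores: model outputs, higher = more likely malicious (assumption).
    labels: 1 for malicious, 0 for benign.
    max_fpr: largest tolerated fraction of benign samples flagged.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples we may flag; the epsilon guards against
    # float rounding such as 0.29 * 100 == 28.999999999999996.
    allowed = int(max_fpr * len(benign) + 1e-9)

    # A sample is flagged when score > threshold. Using the (allowed+1)-th
    # highest benign score as the threshold keeps flagged benign <= allowed.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")

    detected = sum(1 for s in malicious if s > threshold)
    return detected / len(malicious), threshold


# Hypothetical toy scores, made up purely for illustration.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95,  # benign
              0.99, 0.97, 0.92, 0.5]                              # malicious
toy_labels = [0] * 10 + [1] * 4

rate, threshold = detection_rate_at_fpr(toy_scores, toy_labels, max_fpr=0.1)
print(f"detection rate at 10% FPR: {rate:.2f} (threshold {threshold})")
```

Because the threshold is set from the benign scores of whichever dataset is being evaluated, the same model can report very different detection rates at the same false-positive budget once the data source changes.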

In this presentation, I will present sensitivity results from a single deep learning model designed to detect malicious URLs, trained and tested across three different sources of URL data. After reviewing the results, we’ll dive into what caused them by looking into: 1) surface differences between the different sources of data, and 2) higher-level feature activations that our neural net identified in certain datasets but failed to identify in others.
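
The train-on-one-source, test-on-every-source design described above can be sketched as a small Python harness. Everything here is an assumption for illustration: the function names, the `(url, label)` sample format, and especially the stand-in "model", which is a trivial hostname blocklist rather than a neural net. The toy URLs are invented and deliberately share no hostnames across sources.

```python
from itertools import product
from urllib.parse import urlsplit


def cross_source_grid(sources, train_fn, eval_fn):
    """Train one model per source, then score it on every source's test split.

    sources maps a source name to a (train_split, test_split) pair.
    Returns {(trained_on, tested_on): metric}.
    """
    models = {name: train_fn(train) for name, (train, _test) in sources.items()}
    return {
        (trained_on, tested_on): eval_fn(models[trained_on], sources[tested_on][1])
        for trained_on, tested_on in product(sources, repeat=2)
    }


# Toy stand-ins (NOT a deep learning model): the "model" is just the set of
# hostnames seen in malicious training URLs.
def train_host_blocklist(samples):
    return {urlsplit(url).hostname for url, label in samples if label == 1}


def detection_rate(blocklist, samples):
    malicious = [url for url, label in samples if label == 1]
    hits = sum(1 for url in malicious if urlsplit(url).hostname in blocklist)
    return hits / len(malicious)


# Hypothetical data: label 1 = malicious, 0 = benign.
toy_sources = {
    "source_a": (
        [("http://bad-a.example/login", 1), ("http://ok-a.example/", 0)],
        [("http://bad-a.example/update", 1), ("http://ok-a.example/about", 0)],
    ),
    "source_b": (
        [("http://bad-b.example/pay", 1), ("http://ok-b.example/", 0)],
        [("http://bad-b.example/verify", 1), ("http://ok-b.example/faq", 0)],
    ),
}

grid = cross_source_grid(toy_sources, train_host_blocklist, detection_rate)
for (trained_on, tested_on), rate in sorted(grid.items()):
    print(f"train={trained_on} test={tested_on} detection_rate={rate:.2f}")
```

The diagonal of the grid corresponds to the usual same-dataset evaluation, and the off-diagonal cells show how a model trained on one source fares on another. The toy data is contrived so that the off-diagonal cells collapse, which only illustrates the shape of the comparison and says nothing about the actual results.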

Deep learning models use a massive number of unseen complex features to predict results, which enables them to fit beautifully to datasets. But it also means that if the training and testing data is even slightly biased with respect to the data seen in real-world use, some of those unseen complex features will end up damaging accuracy instead of bolstering it. Even with great labels and a lot of data, if the data we use to train our deep learning models doesn’t mimic the data they will eventually be tested on in the wild, our models are likely to miss a lot.