Scalable Infrastructure for Malware Labeling and Analysis

Konstantin Berlin

One of the best-known secrets of machine learning (ML) is that the most reliable way to get more accurate models is simply to get more training data and more accurate labels. This observation also holds for malware detection models like the ones we deploy at Sophos. Unfortunately, generating larger, more accurate datasets is arguably much harder in the security domain than in most others, and it poses unique challenges. Malware labeling information is usually not available at the time of observation; instead, it comes from various internal and external intelligence feeds, months or even years after a given sample is first observed. Furthermore, labels from these feeds can be inaccurate, incomplete, and, even worse, change over time, necessitating joining multiple feeds and frequently adjusting the labeling methodology over entire datasets. Finally, realistic evaluations of an antimalware ML model often require “rolling back” to a previous label set, which in turn requires a historical record of the labels as they stood at training and evaluation time.

All this, under the constraint of a limited budget.

In this presentation, we will show how to use AWS infrastructure to solve the above problems in a fast, efficient, scalable, and affordable manner. The infrastructure we describe can support the data collection and analysis needs of a global Data Science team, all while maintaining GDPR compliance and efficiently exporting data to edge services that power a global reputation lookup.

We start by describing why developing the above infrastructure at reasonable cost is surprisingly difficult. We focus specifically on the different requirements, such as the need to correlate information from internal and external sources across large time ranges, as well as the ability to roll back knowledge to particular timeframes in order to properly develop and validate detection models.
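
To make the roll-back requirement concrete, here is a minimal sketch in Python (with invented sample identifiers, feed names, and dates, not drawn from our data): labels as of a given date are reconstructed by discarding any feed observations that arrived after that date and keeping the most recent remaining verdict per sample.

```python
from datetime import datetime

# Hypothetical per-sample label events from different intelligence feeds:
# (sample_id, feed, verdict, observed_at). Real feed verdicts arrive months
# or years after a sample is first seen, and can change over time.
label_events = [
    ("sample_1", "feed_a", "benign",    datetime(2017, 1, 10)),
    ("sample_1", "feed_b", "malicious", datetime(2018, 3, 15)),
]


def labels_as_of(events, cutoff):
    """Roll knowledge back to `cutoff`: ignore anything observed afterwards
    and keep the most recent remaining verdict per sample."""
    latest = {}
    for sample_id, feed, verdict, observed_at in sorted(events, key=lambda e: e[3]):
        if observed_at <= cutoff:
            latest[sample_id] = verdict
    return latest


# The label set a model trained in mid-2017 would have seen...
print(labels_as_of(label_events, datetime(2017, 7, 1)))  # {'sample_1': 'benign'}
# ...versus the labels known a year later, after feed_b's verdict arrived.
print(labels_as_of(label_events, datetime(2018, 7, 1)))  # {'sample_1': 'malicious'}
```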

Next, we describe how to build such a system in a way that is scalable, agile, and affordable, using AWS cloud infrastructure. We first describe how to effectively aggregate and batch data across regions into a regional data lake in a GDPR-compliant way, utilizing SQS, auto-scaling spot instance clusters, and S3 replication. We then describe how we store, join, and analyze this data at scale by batch inserting the replicated data into a distributed columnar Redshift database.
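
As a rough illustration of that ingestion path, the sketch below (using boto3, with hypothetical queue, bucket, table, and IAM role names) drains telemetry messages from SQS, lands them in the regional data lake as a gzipped newline-delimited JSON object, and shows the kind of Redshift COPY statement that batch-loads the replicated objects; the production pipeline differs in its details.

```python
import gzip
import uuid
from datetime import date

import boto3

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/telemetry"  # hypothetical
LAKE_BUCKET = "example-regional-data-lake"                                # hypothetical

sqs = boto3.client("sqs")
s3 = boto3.client("s3")


def drain_batch(max_messages=1000):
    """Pull telemetry events off SQS and write them to the regional data lake
    as a single gzipped, newline-delimited JSON object in S3."""
    bodies, handles = [], []
    while len(bodies) < max_messages:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            bodies.append(m["Body"])
            handles.append(m["ReceiptHandle"])
    if not bodies:
        return None

    # Partitioning the key by date keeps downstream batch loads incremental.
    key = f"telemetry/dt={date.today().isoformat()}/{uuid.uuid4()}.json.gz"
    s3.put_object(Bucket=LAKE_BUCKET, Key=key,
                  Body=gzip.compress("\n".join(bodies).encode("utf-8")))

    # Acknowledge the messages only after the S3 write has succeeded,
    # deleting in chunks of 10 (the SQS batch limit).
    for i in range(0, len(handles), 10):
        entries = [{"Id": str(j), "ReceiptHandle": h}
                   for j, h in enumerate(handles[i:i + 10])]
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    return key


# Once S3 replication has copied the objects into the analysis region, a single
# COPY statement batch-loads an entire prefix into Redshift (illustrative names
# and an example day partition):
REDSHIFT_COPY = """
    COPY telemetry_events
    FROM 's3://example-regional-data-lake/telemetry/dt=2018-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    GZIP JSON 'auto';
"""
```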

We then describe how we organize the data effectively in Redshift tables, taking proper care to define distribution and sort keys, deal with duplicate entries (as key uniqueness is not enforced), and perform SQL joins efficiently on a daily cadence. Finally, we show how we can export the data at scale to local storage for ML training, or propagate it to edge services, such as replicated DynamoDB tables, to power a global reputation service.
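
The sketch below (again with illustrative table, column, cluster, bucket, and role names) shows the flavor of what is involved: a table distributed on the sample hash and sorted on observation time, a window-function query that drops duplicate rows since Redshift does not enforce key uniqueness, an UNLOAD that exports a labeled snapshot to S3 for ML training, and a boto3 batch write that pushes verdicts into a replicated DynamoDB table for the reputation lookup.

```python
import os

import boto3
import psycopg2  # standard PostgreSQL driver; Redshift speaks the same protocol

# Distribute on the sample hash so joins against other per-sample tables are
# co-located, and sort on observation time so daily jobs scan only recent blocks.
CREATE_LABELS = """
CREATE TABLE IF NOT EXISTS sample_labels (
    sha256      CHAR(64)    NOT NULL,
    feed        VARCHAR(64) NOT NULL,
    verdict     VARCHAR(32),
    observed_at TIMESTAMP   NOT NULL
)
DISTKEY (sha256)
SORTKEY (observed_at);
"""

# Repeated batch loads can introduce duplicates because key uniqueness is not
# enforced; a window function keeps one row per (sha256, feed, observed_at).
DEDUPED_LABELS = """
SELECT sha256, feed, verdict, observed_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY sha256, feed, observed_at
                              ORDER BY observed_at) AS rn
    FROM sample_labels
) t
WHERE rn = 1
"""

# Export a deduplicated, labeled snapshot back to S3 for ML training.
UNLOAD_SNAPSHOT = f"""
UNLOAD ('{DEDUPED_LABELS.strip()}')
TO 's3://example-ml-exports/labels/2018-01-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
GZIP ALLOWOVERWRITE;
"""

conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="intel", user="etl",
                        password=os.environ["REDSHIFT_PASSWORD"])
with conn, conn.cursor() as cur:
    cur.execute(CREATE_LABELS)
    cur.execute(UNLOAD_SNAPSHOT)

# Propagate current verdicts to the edge: a replicated DynamoDB table keyed on
# sha256 backs the low-latency global reputation lookups (dummy item shown).
reputation = boto3.resource("dynamodb").Table("reputation")
with reputation.batch_writer() as batch:
    batch.put_item(Item={"sha256": "0" * 64, "verdict": "malicious"})
```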