Infrastructure

One of the worst-kept secrets of machine learning (ML) is that the most reliable way to get more accurate models is to get more training data and more accurate labels. Unfortunately, building larger, more relevant datasets is arguably a bigger challenge in the security domain than in most other domains, due to two major complications. The first is that labeling information is usually not available at the time of observation; it slowly evolves over time (days to months) as more information is observed. The second is constant concept drift, both in the in-the-field distributions and in the accuracy of the labeling algorithm itself. In such an environment, developing a model around a single “gold” dataset is not enough: any deployed model must be constantly retrained on newer data, using labeling strategies that themselves keep evolving and whose changes need to be propagated to all observables immediately.
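To make the label-evolution problem concrete, here is a minimal sketch, assuming a hypothetical pandas-based store with made-up column names, of how newly arrived verdicts might be propagated back onto every earlier observation of the same sample so that the next retraining run sees up-to-date labels. It is an illustration of the idea, not our production pipeline.

```python
import pandas as pd

def propagate_labels(observations: pd.DataFrame, new_verdicts: pd.DataFrame) -> pd.DataFrame:
    """Apply the newest verdict per sample to all historical observations.

    observations: columns [sha256, first_seen, label, label_updated_at]
    new_verdicts: columns [sha256, label, verdict_time]
    (Hypothetical schema, for illustration only.)
    """
    latest = (
        new_verdicts.sort_values("verdict_time")
        .drop_duplicates("sha256", keep="last")   # keep only the newest verdict per sample
        .set_index("sha256")
    )
    obs = observations.set_index("sha256")
    updated = obs.index.isin(latest.index)        # observations that now have a newer verdict
    obs.loc[updated, "label"] = latest.loc[obs.index[updated], "label"].values
    obs.loc[updated, "label_updated_at"] = latest.loc[obs.index[updated], "verdict_time"].values
    return obs.reset_index()
```

In practice this kind of backfill runs continuously, so every retraining job reads labels as they are known today rather than as they were known when the sample was first seen.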

These requirements are complex, and often cannot be met with off-the-shelf solutions, even ones designed for ML-focused groups such as ours. Sophos AI has built a dedicated team of engineers that works closely with our researchers to develop the infrastructure and tooling required to research, develop, deploy, monitor, and maintain our many ML models and the products associated with them. Some common problems we work on every day:

– How do we collect telemetry from across the company in a scalable way and ingest it into our cloud infrastructure at reasonable cost?

– How do we index and store our data internally?

– How do we monitor the performance of our ML models in the field? (See the sketch after this list.)

– How do we consistently deploy dozens of ML model updates every month?

– How do we make developing and retraining new models as simple as possible for our researchers and engineers?
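As one concrete example of the monitoring question above, a common, model-agnostic check is to compare the score distribution a model produces in the field against the distribution it produced on the training set, for instance with a population stability index (PSI). The sketch below is an illustration under assumed conditions (scores are probabilities in [0, 1]); it is not a description of our production monitoring.

```python
import numpy as np

def population_stability_index(train_scores, field_scores, bins: int = 10) -> float:
    """PSI between training-time and in-the-field model score distributions.

    Assumes scores are probabilities in [0, 1]; a rising PSI is a simple
    signal that the field distribution has drifted away from training.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    expected, _ = np.histogram(train_scores, bins=edges)
    actual, _ = np.histogram(field_scores, bins=edges)
    expected = np.clip(expected / expected.sum(), 1e-6, None)  # avoid log(0)
    actual = np.clip(actual / actual.sum(), 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Rule-of-thumb thresholds vary by team; values above roughly 0.25 are often
# treated as a sign that retraining (or at least investigation) is warranted.
```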

Our engineering team is focused on several major infrastructure areas:

– Working with other teams in the organization to architect the company-wide big data strategy.

– Developing internal infrastructure to ingest data in a scalable way from a variety of internal and external sources.

– Creating a flexible set of tools and databases that give data scientists simple access to all the collected data, so they can easily combine and slice terabytes of data as requirements evolve (see the query sketch below).

– Developing tools to easily retrain and deploy a large number of models and monitor them in the field.
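To illustrate the kind of ad-hoc slicing we want to make easy for researchers, the sketch below queries date-partitioned Parquet telemetry with DuckDB from Python. The dataset path and column names are invented for the example; they are not our actual schema or tooling.

```python
import duckdb

con = duckdb.connect()

# Fraction of samples per day that a (hypothetical) model flagged above threshold,
# computed directly over partitioned Parquet files without loading them all into memory.
daily_flag_rate = con.execute(
    """
    SELECT observation_date,
           AVG(CASE WHEN model_score >= 0.5 THEN 1 ELSE 0 END) AS flagged_fraction
    FROM read_parquet('telemetry/pe_files/*/*.parquet')
    WHERE observation_date BETWEEN DATE '2023-01-01' AND DATE '2023-03-31'
    GROUP BY observation_date
    ORDER BY observation_date
    """
).df()
```

The same pattern scales from a laptop-sized sample to a warehouse-backed engine; what matters is that a researcher can express a new slice in a few lines rather than filing a data-engineering request.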