Sophos AI Peer-Reviewed Research

Generating up-to-date, well-labeled datasets for machine learning (ML) security models is a unique engineering challenge: large data volumes, the complexity of labeling, and constant concept drift make it difficult to generate effective training datasets. Here we describe a simple, resilient cloud infrastructure for generating ML training and testing datasets that has increased the speed at which our team is able to research a multitude of security ML models and keep them in production.

Publications

Our documented research findings

Garbage in, garbage out: how purportedly great ML models can be screwed up by bad data

Malware Data Science: Attack Detection and Attribution

De-anonymizing programmers via code stylometry

Crafting adversarial input sequences for recurrent neural networks

Git blame who?: Stylistic authorship attribution of small, incomplete source code fragments

MEADE: Towards a Malicious Email Attachment Detection Engine

Towards Principled Uncertainty Estimation for Deep Neural Networks

SeqDroid: Obfuscated Android Malware Detection Using Stacked Convolutional and Recurrent Neural Networks

A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content

A Simple and Agile Cloud Infrastructure to Support Cybersecurity Oriented Machine Learning Workflows