Konstantin Berlin, Author at Sophos AI

Generating up-to-date, well-labeled datasets for machine learning (ML) security models is a unique engineering challenge, as large data volumes, complexity of labeling, and constant concept drift make it difficult to generate effective training datasets. Here we describe a simple, resilient cloud infrastructure for generating ML training and testing datasets that has enhanced the speed at which our team is able to research and keep in production a multitude of security ML models.

Malware detection is a popular application of Machine Learning for Information Security (ML-Sec), in which an ML classifier is trained to predict whether a given file is malware or benignware. Parameters of this classifier are typically optimized such that outputs from the model over a set of input samples most closely match the samples' true malicious/benign (1/0) target labels. However, there are often a number of other sources of contextual metadata for each malware sample, beyond an aggregate malicious/benign label, including multiple labeling sources and malware type information (e.g., ransomware, trojan, etc.), which we can feed to the classifier as auxiliary prediction targets. In this work, we fit deep neural networks to multiple additional targets derived from metadata in a threat intelligence feed for Portable Executable (PE) malware and benignware, including a multi-source malicious/benign loss, a count loss on multi-source detections, and a semantic malware attribute tag loss. We find that incorporating multiple auxiliary loss terms yields a marked improvement in performance on the main detection task. We also demonstrate that these gains likely stem from a more informed neural network representation and are not due to a regularization artifact of multi-target learning. Our auxiliary loss architecture yields a significant reduction in detection error rate (false negatives) of 42.6% at a false positive rate (FPR) of 10^-3 when compared to a similar model with only one target, and a decrease of 53.8% at 10^-5 FPR.
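A minimal sketch of how a main detection loss and the three auxiliary losses named above might be combined into one training objective. The abstract does not give the functional form of each term, so this assumes binary cross-entropy for the malicious/benign and tag targets and a Poisson negative log-likelihood for the detection count; the dictionary keys and the weights are hypothetical.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_bce(preds, labels):
    """Average binary cross-entropy over several independent binary targets."""
    return sum(bce(p, y) for p, y in zip(preds, labels)) / len(labels)

def poisson_nll(rate, count):
    """Poisson negative log-likelihood of an observed count (assumed form of the count loss)."""
    rate = max(rate, 1e-7)
    return rate - count * math.log(rate) + math.lgamma(count + 1)

def combined_loss(pred, target, weights):
    """Main malicious/benign loss plus weighted auxiliary losses for one sample."""
    main = bce(pred["malicious"], target["malicious"])
    sources = mean_bce(pred["sources"], target["sources"])  # per-source labels
    count = poisson_nll(pred["count"], target["count"])     # number of detections
    tags = mean_bce(pred["tags"], target["tags"])           # semantic attribute tags
    return (main
            + weights["sources"] * sources
            + weights["count"] * count
            + weights["tags"] * tags)

# Hypothetical predictions and targets for a single malicious sample.
pred = {"malicious": 0.9, "sources": [0.8, 0.6], "count": 1.5, "tags": [0.7, 0.1]}
target = {"malicious": 1, "sources": [1, 0], "count": 1, "tags": [1, 0]}
loss = combined_loss(pred, target, {"sources": 0.1, "count": 0.1, "tags": 0.1})
```

Setting every auxiliary weight to zero reduces this to the single-target objective, which is the kind of baseline the reported error reductions are measured against.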

Machine learning (ML) used for static portable executable (PE) malware detection typically employs per-file numerical feature vector representations as input with one or more target labels during training. However, there is much orthogonal information that can be gleaned from the context in which the file was seen. In this paper, we propose utilizing a static source of contextual information, the path of the PE file, as an auxiliary input to the classifier. While file paths are not malicious or benign in and of themselves, they do provide valuable context for a malicious/benign determination. Unlike dynamic contextual information, file paths are available with little overhead and can seamlessly be integrated into a multi-view static ML detector, yielding higher detection rates at very high throughput with minimal infrastructural changes. Here we propose a multi-view neural network, which takes feature vectors from PE file content as well as corresponding file paths as inputs and outputs a detection score. To ensure realistic evaluation, we use a dataset of approximately 10 million samples (files and file paths from user endpoints of an actual security vendor network). We then conduct an interpretability analysis via LIME modeling to ensure that our classifier has learned a sensible representation and to see which parts of the file path contributed most to changes in the classifier's score. We find that our model learns useful aspects of the file path for classification, while also learning artifacts from customers testing the vendor's product, e.g., by downloading a directory of malware samples each named as their hash. We prune these artifacts from our test dataset and demonstrate reductions in false negative rate of 32.3% at a 10^-3 false positive rate (FPR) and 33.1% at 10^-4 FPR, over a similar-topology, single-input model that uses PE file content only.
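Results such as "a reduction in false negative rate at a fixed FPR" can be computed from raw classifier scores as sketched below. This is an illustrative calculation, not the paper's evaluation code; the function names and the toy scores are hypothetical.

```python
def fnr_at_fpr(scores, labels, target_fpr):
    """False negative rate at a threshold whose FPR does not exceed target_fpr.

    A sample is flagged as malicious when its score is strictly above the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(target_fpr * len(benign))  # benign files we may flag
    # Use the (allowed_fp + 1)-th highest benign score as the threshold, so at
    # most allowed_fp benign scores lie strictly above it.
    threshold = benign[allowed_fp] if allowed_fp < len(benign) else float("-inf")
    missed = sum(1 for s in malicious if s <= threshold)
    return missed / len(malicious)

def relative_reduction(baseline, improved):
    """Fractional reduction of an error rate relative to a baseline model."""
    return (baseline - improved) / baseline

# Toy data: four benign (label 0) and four malicious (label 1) samples.
scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.8, 0.5, 0.25]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
strict = fnr_at_fpr(scores, labels, 0.0)    # no false positives allowed
relaxed = fnr_at_fpr(scores, labels, 0.25)  # one of four benign files may be flagged
```

At operating points such as 10^-3 or 10^-4 FPR, the benign test set has to be large enough for `allowed_fp` to be a meaningful count, which is one reason a test set on the order of millions of samples matters.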

With the rapid proliferation and increased sophistication of malicious software (malware), detection methods no longer rely only on manually generated signatures but have also incorporated more general approaches like machine learning detection. Although powerful for conviction of malicious artifacts, these methods do not produce any further information about the type of threat that has been detected, nor do they allow relationships between malware samples to be identified. In this work, we address the information gap between machine learning and signature-based detection methods by learning a representation space for malware samples in which files with similar malicious behaviors appear close to each other. We do so by introducing a deep learning based tagging model trained to generate human-interpretable semantic descriptions of malicious software, which at the same time provides potentially more useful and flexible information than malware family names.
We show that the malware descriptions generated with the proposed approach correctly identify more than 95% of eleven possible tag descriptions for a given sample, at a deployable false positive rate of 1% per tag. Furthermore, we use the learned representation space to introduce a similarity index between malware files, and empirically demonstrate, using dynamic traces from files' execution, that it is not only more effective at identifying samples from the same families, but also 32 times smaller than those based on raw feature vectors.
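As a rough illustration of a similarity index over a learned representation space, the sketch below ranks stored samples by cosine similarity to a query embedding. The abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the sample names and two-dimensional embeddings are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, index, top_k=3):
    """Return the top_k (name, similarity) pairs from index, most similar first."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in index.items()]
    scored.sort(reverse=True)
    return [(name, sim) for sim, name in scored[:top_k]]

# Hypothetical index mapping sample identifiers to low-dimensional embeddings.
index = {
    "sample_a": [1.0, 0.0],
    "sample_b": [0.0, 1.0],
    "sample_c": [1.0, 1.0],
}
neighbors = most_similar([1.0, 0.0], index, top_k=2)
```

Storing a compact embedding per file instead of its raw feature vector is what makes such an index smaller.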

For years security machine learning research has promised to obviate the need for signature-based detection by automatically learning to detect indicators of attack. Unfortunately, this vision hasn’t come to fruition: in fact, developing and maintaining today’s security machine learning systems can require engineering resources that are comparable to those of signature-based detection systems, due in part to the need to develop and continuously tune the “features” these machine learning systems look at as attacks evolve. Deep learning, a subfield of machine learning, promises to change this by operating on raw input signals and automating the process of feature design and extraction. In this paper we propose the eXpose neural network, which uses a deep learning approach we have developed to take generic, raw short character strings as input (a common case for security inputs, which include artifacts like potentially malicious URLs, file paths, named pipes, named mutexes, and registry keys), and learns to simultaneously extract features and classify using character-level embeddings and a convolutional neural network. In addition to completely automating the feature design and extraction process, eXpose outperforms manual feature extraction based baselines on all of the intrusion detection problems we tested it on, yielding a 5%-10% detection rate gain at 0.1% false positive rate compared to these baselines.
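A character-level model of this kind needs its raw strings turned into fixed-length integer sequences before an embedding layer can consume them. The sketch below shows one plausible encoding; the vocabulary (printable ASCII plus padding and unknown ids), the right-padding, and the maximum length are assumptions for illustration, not details taken from the abstract.

```python
PAD_ID = 0      # fills unused positions
UNKNOWN_ID = 1  # any character outside printable ASCII

def encode_string(text, max_len=200):
    """Map a raw string to max_len integer ids suitable for an embedding lookup."""
    ids = []
    for ch in text[:max_len]:  # truncate overly long inputs
        code = ord(ch)
        if 32 <= code <= 126:
            ids.append(code - 32 + 2)  # printable ASCII maps to ids 2..96
        else:
            ids.append(UNKNOWN_ID)
    ids.extend([PAD_ID] * (max_len - len(ids)))  # right-pad to a fixed length
    return ids

# The same encoder applies to URLs, file paths, and registry keys alike.
encoded = encode_string("a/b", max_len=5)
```

Because the encoder makes no assumptions about the structure of the string, the downstream embedding and convolutional layers are left to learn which character patterns matter.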

Enterprise networks are in constant danger of being breached by cyber-attackers, but making the decision about what security tools to deploy to mitigate this risk requires carefully designed evaluation of security products. One of the most important metrics for a protection product is how well it is able to stop malware, specifically “zero-day” malware that has not been seen by the security community before. However, evaluating zero-day performance is difficult, because of the large number of previously unseen samples that are needed to properly measure the true and false positive rates, and the challenges involved in accurately labeling these samples. This paper addresses these issues from a statistical and practical perspective. Our contributions include first showing that the number of benign files needed for proper evaluation is on the order of millions, and the number of malware samples needed is on the order of tens of thousands. We then propose and justify a time-delay method for easily collecting large numbers of previously unseen, but labeled, samples. This enables cheap and accurate evaluation of zero-day true and false positive rates. Finally, we propose a more fine-grained labeling of the malware/benignware in order to better model the heterogeneous distribution of files on various networks.
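To give a feel for why so many samples are needed, the sketch below estimates how many independent samples it takes to measure a small rate to within a chosen relative error, using the normal approximation to the binomial confidence interval. This is a textbook back-of-the-envelope calculation rather than the paper's own derivation, and the example rate and tolerance are hypothetical.

```python
import math

def samples_needed(rate, relative_error, z=1.96):
    """Samples required so the confidence interval half-width is relative_error * rate.

    Solves z * sqrt(rate * (1 - rate) / n) = relative_error * rate for n;
    z = 1.96 corresponds to a 95% confidence level.
    """
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must be strictly between 0 and 1")
    if relative_error <= 0.0:
        raise ValueError("relative_error must be positive")
    n = (z * z) * (1.0 - rate) / (relative_error ** 2 * rate)
    return math.ceil(n)

# Hypothetical target: confirm a 1-in-10,000 false positive rate to within 20%.
benign_needed = samples_needed(rate=1e-4, relative_error=0.2)
```

The required count grows roughly in inverse proportion to the rate being measured, so very low false positive rates drive the benign sample requirement far above what is needed to measure a detection rate.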

As antivirus and network intrusion detection systems have increasingly proven insufficient to detect advanced threats, large security operations centers have moved to deploy endpoint-based sensors that provide deeper visibility into low-level events across their enterprises. Unfortunately, for many organizations in government and industry, the installation, maintenance, and resource requirements of these newer solutions pose barriers to adoption and are perceived as risks to organizations’ missions. To mitigate this problem we investigated the utility of agentless detection of malicious endpoint behavior, using only the standard built-in Windows audit logging facility as our signal. We found that Windows audit logs, while emitting manageably sized data streams on the endpoints, provide enough information to allow robust detection of malicious behavior. Audit logs provide an effective, low-cost alternative to deploying additional expensive agent-based breach detection systems in many government and industrial settings, and can be used to detect, in our tests, 83% of malware samples with a 0.1% false positive rate. They can also supplement already existing host signature-based antivirus solutions, like Kaspersky, Symantec, and McAfee, detecting, in our testing environment, 78% of malware missed by those antivirus systems.

Malware remains a serious problem for corporations, government agencies, and individuals, as attackers continue to use it as a tool to effect frequent and costly network intrusions. Today malware detection is still done mainly with heuristic and signature-based methods that struggle to keep up with malware evolution. Machine learning holds the promise of automating the work required to detect newly discovered malware families, and could potentially learn generalizations about malware and benign software (benignware) that support the detection of entirely new, unknown malware families. Unfortunately, few proposed machine learning based malware detection methods have achieved the low false positive rates and high scalability required to deliver deployable detectors.
In this paper we introduce an approach that addresses these issues, describing in reproducible detail the deep neural network based malware detection system that Invincea has developed. Our system achieves a usable detection rate at an extremely low false positive rate and scales to real-world training example volumes on commodity hardware. Specifically, we show that our system achieves a 95% detection rate at 0.1% false positive rate (FPR), based on more than 400,000 software binaries sourced directly from our customers and internal malware databases. We achieve these results by directly learning on all binaries, without any filtering, unpacking, or manually separating binary files into categories. Further, we confirm our false positive rates directly on a live stream of files coming in from Invincea’s deployed endpoint solution, provide an estimate of how many new binary files we expect to see per day on an enterprise network, and describe how that relates to the false positive rate and translates into an intuitive threat score.
Our results demonstrate that it is now feasible to quickly train and deploy a low-resource, highly accurate machine learning classification model, with false positive rates that approach traditional labor-intensive signature-based methods, while also detecting previously unseen malware. Since machine learning models tend to improve with larger data sizes, we foresee deep neural network classification models gaining in importance as part of a layered network defense strategy in coming years.
∗Authors contributed equally to the work.
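The relationship between a false positive rate and the volume of new files on a network comes down to simple arithmetic, sketched below. The daily file volume used here is a hypothetical placeholder; the abstract mentions such an estimate but does not state its value.

```python
def expected_false_positives_per_day(new_files_per_day, fpr):
    """Expected number of benign files wrongly flagged each day."""
    return new_files_per_day * fpr

def days_between_false_positives(new_files_per_day, fpr):
    """Average number of days between false alarms (infinite if none are expected)."""
    expected = expected_false_positives_per_day(new_files_per_day, fpr)
    return float("inf") if expected == 0 else 1.0 / expected

# Hypothetical network seeing 1,000 previously unseen benign binaries per day,
# scanned by a detector operating at the 0.1% FPR quoted above.
alarms_per_day = expected_false_positives_per_day(1000, 0.001)
quiet_days = days_between_false_positives(100, 0.001)
```

Framing the FPR as alarms per day on a network of a given size is one way to turn a raw rate into a figure that an operator can act on.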

Author Konstantin Berlin’s publications

A Simple and Agile Cloud Infrastructure to Support Cybersecurity Oriented Machine Learning Workflows

ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation

Learning from Context: Exploiting and Interpreting File Path Information for Better Malware Detection

Automatic Malware Description via Attribute Tagging and Similarity Embedding

eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys

Improving zero-day malware testing methodology using statistically significant time-lagged test samples

Malicious Behavior Detection Using Windows Audit Logs

Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features