Malware remains a serious problem for corporations, government agencies, and individuals, as attackers continue to use it as a tool to effect frequent and
costly network intrusions. Today malware detection
is still done mainly with heuristic and signature-based
methods that struggle to keep up with malware evolution. Machine learning holds the promise of automating
the work required to detect newly discovered malware
families, and could potentially learn generalizations
about malware and benign software (benignware) that
support the detection of entirely new, unknown malware
families. Unfortunately, few proposed machine-learning-based malware detection methods have achieved the
low false positive rates and high scalability required to
deliver deployable detectors.
In this paper we introduce an approach that addresses these issues, describing in reproducible detail
the deep neural network based malware detection system that Invincea has developed. Our system achieves
a usable detection rate at an extremely low false positive rate and scales to real-world training example volumes on commodity hardware. Specifically, we show
that our system achieves a 95% detection rate at 0.1%
false positive rate (FPR), based on more than 400,000
software binaries sourced directly from our customers
and internal malware databases. We achieve these results by directly learning on all binaries, without any
filtering, unpacking, or manually separating binary files
into categories. Further, we confirm our false positive
rates directly on a live stream of files coming in from
Invincea’s deployed endpoint solution, provide an estimate of how many new binary files we expect to see
per day on an enterprise network, and describe how that
relates to the false positive rate and translates into an
intuitive threat score.
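
To make the relationship between a fixed false positive rate, the detection rate, and daily alert volume concrete, the following is a minimal sketch and not the system's actual evaluation code. It assumes the classifier emits a score per file, with higher scores meaning more likely malicious. The function names and the example file volume are hypothetical and chosen only for illustration.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a score threshold so that the fraction of benign files
    scoring strictly above it does not exceed target_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign files we can afford to flag at this FPR.
    allowed = int(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every file may be flagged
    # At most `allowed` scores lie strictly above ranked[allowed].
    return ranked[allowed]


def detection_rate(malware_scores, threshold):
    """Fraction of malware files scoring strictly above the threshold."""
    flagged = sum(1 for s in malware_scores if s > threshold)
    return flagged / len(malware_scores)


def expected_false_alarms_per_day(new_benign_files_per_day, fpr):
    """Expected number of benign files flagged per day at a given FPR."""
    return new_benign_files_per_day * fpr


if __name__ == "__main__":
    # Toy scores, not real data.
    benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    malware = [0.9, 0.65, 0.5, 0.95]
    t = threshold_at_fpr(benign, target_fpr=0.25)
    print("threshold:", t)
    print("detection rate:", detection_rate(malware, t))
    # 1000 new files per day is a placeholder volume, not a figure from the paper.
    print("false alarms/day:", expected_false_alarms_per_day(1000, 0.001))
```

Under these assumptions, a 0.1% FPR applied to a hypothetical 1,000 previously unseen benign binaries per day would yield about one false alarm per day. This is the kind of arithmetic that ties the false positive rate to an operator's daily workload.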
Our results demonstrate that it is now feasible to
quickly train and deploy a low-resource, highly accurate
machine learning classification model, with false positive rates that approach traditional labor-intensive signature-based methods, while also detecting previously
unseen malware. Since machine learning models tend
to improve with larger data sizes, we foresee deep neural network classification models gaining in importance
as part of a layered network defense strategy in coming
years.

∗Authors contributed equally to the work.