An AI Approach to Malware Similarity Analysis: Mapping the Malware Genome With a Deep Neural Network

In recent years, cyber defenders protecting enterprise networks have started incorporating malware code sharing identification tools into their workflows. These tools compare new malware samples to a large databases of known malware samples, in order to identify samples with shared code relationships. When unknown malware binaries are found to share code “fingerprints” with malware from known adversaries, they provides a key clue into which adversary is generating these new binaries, thus helping develop a general mitigation strategy against that family of threats. The efficacy of code sharing identification systems is demonstrated every day, as new family of threats are discovered, and countermeasures are rapidly developed for them.

Unfortunately, these systems are hard to maintain, deploy, and adapt to evolving threats. First and foremost, these systems do not learn to adapt to new malware obfuscation strategies, meaning they will continuously fall out of date with adversary tradecraft, requiring, periodically, a manually intensive tuning in order to adjust the formulae used for similarity between malware. In addition, these systems require an up to date, well maintained database of recent threats in order to provide relevant results. Such a database is difficult to deploy, and hard and expensive to maintain for smaller organizations. In order to address these issues we developed a new malware similarity detection approach. This approach, not only significantly reduces the need for manual tuning of the similarity formulate, but also allows for significantly smaller deployment footprint and provides significant increase in accuracy. Our family/similarity detection system is the first to use deep neural networks for code sharing identification, automatically learning to see through adversary tradecraft, thereby staying up to date with adversary evolution. Using traditional string similarity features our approach increased accuracy by 10%, from 65% to 75%. Using an advanced set of features that we specifically designed for malware classification, our approach has 98% accuracy. In this presentation we describe how our method works, why it is able to significantly improve upon current approaches, and how this approach can be easily adapted and tuned to individual/organization needs of the attendees.