Although powerful for convicting malicious artifacts, machine learning (ML) based detection methods generally do not produce further information about the type of malware that has been detected. In this work, we address the information gap between ML and signature-based detection methods by introducing an ML-based tagging model that is trained to generate human-interpretable semantic descriptions of malicious software (e.g. file-infector, downloader).
Even though much has changed over the last 30 years of malware detection, most anti-malware solutions still rely on the concept of malware families for describing the capabilities of malicious software. The increased number of malware specimens, along with the introduction of techniques such as polymorphism, packing, and obfuscation, has turned the task of malware description via family classification into a difficult and oftentimes intractable one. This has led to a very large number of mutually exclusive malware families that are typically highly vendor-specific, often inconsistent across vendors, and not necessarily designed for human consumption.
We propose an alternative approach to malware description based on semantic tags. In contrast to (family) detection names, semantic tags aim to convey high-level descriptions of the capabilities and properties of a given malware sample. A tag can refer to the sample's purpose (e.g. ‘dropper’, ‘downloader’), its malware family (e.g. ‘ransomware’), its file characteristics (e.g. ‘packed’), etc. Semantic tags are non-exclusive, meaning that a malware campaign can be associated with multiple tags, and a given tag can be associated with multiple malware families. By moving the focus of malware description from a large set of mutually exclusive malware families to an intelligible set of malware tags, we also make it possible to learn the relationship between files and semantic tags with machine learning techniques.
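To make the non-exclusive nature of tags concrete, the following minimal sketch encodes a sample's tags as a multi-hot vector, in which several positions can be set at once. The vocabulary below is a hypothetical four-tag subset built from the tag names quoted above, and the two samples are made up for illustration.

```python
# Hypothetical tag vocabulary; only the tag names themselves come from the text.
TAGS = ["dropper", "downloader", "ransomware", "packed"]


def encode_tags(sample_tags, vocabulary=TAGS):
    """Return a multi-hot vector: 1 where the tag applies, 0 otherwise."""
    unknown = set(sample_tags) - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return [1 if tag in sample_tags else 0 for tag in vocabulary]


# Two made-up samples. Tags are non-exclusive, so a vector may contain
# several 1s, and the same tag ("packed") can appear on different samples.
sample_a = encode_tags({"dropper", "packed"})
sample_b = encode_tags({"ransomware", "packed"})
```

A family label would instead select exactly one class per sample, which is the restriction the tagging formulation removes.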
With this in mind, we first introduce a simple annotation method for deriving high-level descriptions of malware files based on (but not necessarily constrained to) an ensemble of vendor family names. We then formulate the problem of malware description as a tagging problem and formalize it under the framework of multi-label learning. We further propose a joint-embedding deep neural network architecture that maps both semantic tags and Windows portable executable files to the same low-dimensional embedding space. We can then use the similarity between files and tags in this embedding space to automatically annotate previously unseen samples.
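The annotation step based on embedding similarity can be sketched as follows. This is an illustration under stated assumptions, not the architecture itself: the embeddings are hard-coded three-dimensional vectors standing in for the outputs of a trained network, and the dot product followed by a sigmoid, the 0.5 threshold, and the names `score_tags` and `predict_tags` are all hypothetical choices.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_tags(file_embedding, tag_embeddings):
    """Map each tag to a score in (0, 1) from its similarity to the file."""
    return {
        tag: sigmoid(dot(file_embedding, vec))
        for tag, vec in tag_embeddings.items()
    }


def predict_tags(file_embedding, tag_embeddings, threshold=0.5):
    """Return every tag whose score reaches the (assumed) threshold."""
    scores = score_tags(file_embedding, tag_embeddings)
    return sorted(tag for tag, s in scores.items() if s >= threshold)


# Made-up embeddings; in practice both would be learned jointly.
TAG_EMBEDDINGS = {
    "dropper": [1.0, 0.0, 0.0],
    "downloader": [0.0, 1.0, 0.0],
    "packed": [0.0, 0.0, 1.0],
}
file_embedding = [2.0, -1.5, 0.5]  # stand-in for the network's output for one file

predicted = predict_tags(file_embedding, TAG_EMBEDDINGS)
```

Because each tag is scored independently, a file can receive any number of tags, which matches the multi-label formulation.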
We empirically demonstrate that when evaluated against tags extracted from an ensemble of anti-virus detection names, the proposed tagging model correctly identifies about 94% of eleven possible tag descriptions for a given sample, at a deployable false positive rate (FPR) of 1% per tag. Furthermore, we show that it is feasible to learn behavioral characteristics of malicious software samples from a static representation of the file by fitting a deep neural network to predict the proposed set of tags and evaluating the results on ground truth tags extracted from behavioral traces of files’ execution.
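Operating at a fixed per-tag FPR implies choosing a decision threshold for each tag from held-out scores. One way this could be done is sketched below; the helper names, the strict `>` decision rule, and the toy scores are assumptions for illustration and do not reproduce the reported results.

```python
def threshold_at_fpr(scores, labels, target_fpr=0.01):
    """Smallest threshold whose FPR on this data does not exceed target_fpr.

    A sample is predicted positive for the tag when score > threshold.
    """
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not negatives:
        raise ValueError("need at least one negative sample")
    allowed = int(target_fpr * len(negatives))  # false positives we can accept
    # At most `allowed` negatives score strictly above this value.
    return negatives[allowed]


def recall_at_threshold(scores, labels, threshold):
    """Fraction of positive samples scored strictly above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("need at least one positive sample")
    return sum(s > threshold for s in positives) / len(positives)


# Toy scores for a single tag (1 = tag applies, 0 = it does not).
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 0, 0, 0, 0]

# A 25% target is used only because the toy set has four negatives.
toy_threshold = threshold_at_fpr(toy_scores, toy_labels, target_fpr=0.25)
toy_recall = recall_at_threshold(toy_scores, toy_labels, toy_threshold)
```

Repeating this calibration independently for each tag yields one threshold per tag, so that every tag operates at the same target FPR.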