Getting Insight Out Of and Back Into Deep Neural Networks

Deep learning has emerged as a powerful tool for classifying malicious software artifacts. However, the generic black-box nature of these classifiers makes it difficult to evaluate their results, diagnose model failures, or effectively incorporate existing knowledge into them. In particular, a single numerical output for an artifact, whether a binary label or a ‘maliciousness’ score, offers no insight into what might be malicious about that artifact and no starting point for further analysis. This limitation is particularly important when examining artifacts such as malicious HTML pages, which often have small portions of malicious content distributed among much larger amounts of completely benign content.

In this applied talk, we present the LIME method developed by Ribeiro, Singh, and Guestrin, and show, with numerous demonstrations, how it can be adapted from the relatively straightforward domain of “explaining” text or image classifications to the much harder problem of supporting analysts in performing forensic analysis of malicious HTML documents. In particular, we can not only identify features of the document that are critical to the performance of the model (as in the original work), but also use this approach to identify key components of the document that the model “thinks” are likely to contain malicious elements. This allows analysts both to quickly assess the validity of the model’s conclusion and to rapidly identify regions that require additional inspection and evaluation. The approach converts the deep learning model from a gnomic “black box” into a useful exploratory tool for malicious artifacts, even when the deep learning model itself may label the sample incorrectly.
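
To make the idea of a local explanation concrete, the following is a minimal LIME-style sketch in Python, not the implementation presented in the talk. It assumes the HTML document has already been split into segments, randomly removes segments, re-scores each perturbed document with the classifier, and fits a proximity-weighted linear surrogate whose coefficients indicate which segments drive the score. The names `toy_score` and `explain_segments`, the segment boundaries, the distance measure, and the kernel width are all illustrative assumptions; `toy_score` is a hypothetical stand-in for a trained deep learning model.

```python
import numpy as np


def toy_score(html: str) -> float:
    """Hypothetical stand-in for a trained classifier's 'maliciousness' score."""
    return 0.9 if "eval(unescape(" in html else 0.1


def explain_segments(segments, score_fn, n_samples=500,
                     kernel_width=0.75, ridge=1e-3, seed=0):
    """Return one importance weight per segment from a local linear surrogate."""
    rng = np.random.default_rng(seed)
    d = len(segments)

    # Each row is a perturbed document: 1 keeps a segment, 0 removes it.
    masks = rng.integers(0, 2, size=(n_samples, d))
    masks[0] = 1  # always include the unperturbed document

    # Query the black-box model on every perturbed document.
    scores = np.array([
        score_fn("".join(s for s, keep in zip(segments, m) if keep))
        for m in masks
    ])

    # Assumed proximity measure: fraction of segments removed,
    # passed through an exponential kernel so nearby samples count more.
    dist = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Weighted ridge regression with an unpenalized intercept.
    X = np.hstack([np.ones((n_samples, 1)), masks.astype(float)])
    penalty = ridge * np.eye(d + 1)
    penalty[0, 0] = 0.0
    A = X.T @ (weights[:, None] * X) + penalty
    b = X.T @ (weights * scores)
    coef = np.linalg.solve(A, b)
    return coef[1:]  # drop the intercept


# Illustrative document: one suspicious script among benign markup.
segments = [
    "<html><body>",
    "<p>Welcome to our store</p>",
    "<script>eval(unescape('%61%6c'))</script>",
    "<p>Contact us</p>",
    "</body></html>",
]

importance = explain_segments(segments, toy_score)
for idx in np.argsort(-importance):
    print(f"{importance[idx]:+.3f}  {segments[idx]}")
```

Segments are ranked by their surrogate coefficient, so an analyst would start with the highest-weighted region of the document. Choosing how to segment real HTML (by tag, by script block, by fixed-size window) is a design decision this sketch leaves open.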

We complement this work by showing how knowledge extracted by this method – as well as existing expert knowledge – can be readily re-incorporated into deep learning models.