In the wild, we often see that malware in user systems persists well hidden in obfuscated or randomized file locations. It does this to avoid detection by basic methods like disk scans or suspicious users looking through their file systems. Since many legitimate files also reside in these randomized file paths, and sometimes follow similar obfuscated naming conventions, these locations present opportunities for malware to disguise itself as just another file.
Wouldn’t it be great if we could identify malicious files in these partially obfuscated or randomized file locations by identifying common patterns using Machine Learning models? This is especially useful if access to the actual file content is limited!
We set out to do exactly this, with the view that file paths would provide additional contextual information that a machine learning model could leverage during static detection.
To achieve this, we used file paths as an additional input to a Deep Learning-based detector that learns to identify common patterns in malicious files, considering both the file content and the file path. In this article, we show that combining information derived from the content of the PE with the contextual information provided by the file path can lead to better convictions with greater confidence.
In order to train a machine learning model to detect malware using file paths, we needed an initial dataset with an adequate number of examples of both malware and benign files, together with their associated file paths. For this purpose, we combined information from two sources: files obtained from a vendor aggregation service, and file paths from our internal telemetry. We then included in our experiments any samples for which we had both the file and its file path.
We collected three distinct datasets: a training set, a validation set, and a test set. The training set comprised 9,148,143 samples first seen between June 1 and November 15, 2018, of which 693,272 were labeled malicious. The validation set consisted of 2,225,094 distinct samples seen between November 16 and December 1, 2018, of which 85,041 were labeled malicious. Finally, the test set had 249,783 total samples, seen between January 1 and January 30, 2019, of which 38,767 were labeled malicious.
For each sample, we processed the PE content into a feature vector of length 1024, and the file path into a feature vector of length 100.
The features extracted from the PE file included the following:
· A two-dimensional 16×16 histogram of byte-entropy values, computed by sliding a 1024-byte window over the PE binary and computing the entropy across those 1024 bytes. Each entropy value is then paired with the individual byte values in its window, and a 2D histogram is created with the entropy value along one axis and the byte value along the other. Finally, this histogram is flattened into a 256-dimensional vector.
· A 256-dimensional histogram of hashed DLL-name and import-function pairs from the import address table of the PE binary.
· A 256-dimensional histogram of hashed metadata from the PE header, including PE metadata, imports, exports, etc.
· A 256-dimensional (16×16) 2D string histogram.
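As a rough illustration, the byte-entropy histogram feature could be computed along the following lines. This is a sketch only: the window stride and the normalization at the end are our assumptions, not details from the article.

```python
import numpy as np

def byte_entropy_histogram(data: bytes, window=1024, step=256, bins=16):
    """Sketch of the 16x16 byte/entropy histogram described above.

    Slides a window over the binary, computes the entropy of each
    window, and pairs it with the byte values inside that window.
    The step size and normalization are illustrative assumptions.
    """
    hist = np.zeros((bins, bins), dtype=np.float64)
    arr = np.frombuffer(data, dtype=np.uint8)
    for start in range(0, max(len(arr) - window + 1, 1), step):
        chunk = arr[start:start + window]
        counts = np.bincount(chunk, minlength=256)
        probs = counts[counts > 0] / len(chunk)
        entropy = -np.sum(probs * np.log2(probs))   # 0..8 bits per byte
        ent_bin = min(int(entropy * bins / 8), bins - 1)
        byte_bins = chunk // (256 // bins)          # map 0..255 -> 0..15
        for b, c in zip(*np.unique(byte_bins, return_counts=True)):
            hist[ent_bin, b] += c
    v = hist.flatten()                              # 256-dimensional vector
    return v / (v.sum() + 1e-9)
```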
For file paths, we first analyzed the character counts across all of our file path data and built a vocabulary of the 150 most common characters. We assigned an integer to each character, ranging from 0 to 150: the numbers 0-149 represented the 150 most common characters, and 150 represented any character not in that list. We then trimmed each file path to its first 100 characters and converted it into a 100-element integer vector using this vocabulary.
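A minimal sketch of this encoding follows; the helper names are hypothetical, and the zero-padding of short paths is our assumption rather than a detail from the article.

```python
import numpy as np
from collections import Counter

OOV = 150      # index reserved for characters outside the top-150 vocabulary
MAX_LEN = 100  # file paths are trimmed to their first 100 characters

def build_vocab(paths):
    """Map the 150 most common characters to integers 0-149."""
    counts = Counter("".join(paths))
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(150))}

def encode_path(path, vocab):
    """Trim to MAX_LEN characters and encode via the vocabulary.

    Shorter paths are zero-padded here (an assumption; the article
    does not specify the padding scheme).
    """
    ids = [vocab.get(ch, OOV) for ch in path[:MAX_LEN]]
    ids += [0] * (MAX_LEN - len(ids))
    return np.array(ids, dtype=np.int32)
```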
More details on how we pre-processed the data and generated features from it are available here: https://arxiv.org/pdf/1905.06987.pdf
The neural network architecture we use is shown in the figure above. The model has two input branches, and one output. The input layers are represented as white boxes on the left end of the model.
The top branch contains an input layer of length 1024 that consumes the features extracted from the PE file. This input is passed to a stack of several neural network blocks of decreasing size, from 1024 down to 512 nodes. Each of these purple blocks in the figure contains a densely connected layer, a layer normalization layer, and a dropout layer.
The bottom branch contains an input layer that consumes the 100-element integer vector with features extracted from the file path. This is passed to an embedding layer, which produces a 100×32 output: the embedding layer converts each character in the vocabulary into a learned 32-dimensional float vector that reflects the frequency and location of its occurrence in relation to other characters.
The output of the embedding layer is then passed to four convolution blocks. Each convolution block contains a 1D convolution layer (with kernel sizes 2, 3, 4, and 5, respectively), a layer normalization layer, and a 1D sum layer. The 1D convolution layer, with 128 filters, slides over the 100×32 embedding output and produces a 100×128 output. This is passed to the layer normalization layer and then to the 1D sum layer, which sums the values along the sequence axis of the 100×128 input to produce an output of size 128.
Finally, the outputs of both branches are concatenated and passed through a series of neural network blocks of sizes 512, 256, and 128, each containing a densely connected layer, a layer normalization layer, and a dropout layer. The output layer is a single-node dense layer with a sigmoid activation that produces a value between 0 and 1 indicating whether the sample is malware.
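The architecture described above can be sketched in Keras roughly as follows. The exact intermediate layer sizes and the dropout rate are our assumptions (the article only states that the PE branch shrinks from 1024 to 512 nodes), so treat this as an outline rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def dense_block(x, units, dropout=0.05):
    """Dense -> layer normalization -> dropout, as in the purple blocks."""
    x = layers.Dense(units, activation="relu")(x)
    x = layers.LayerNormalization()(x)
    return layers.Dropout(dropout)(x)

def build_model():
    # PE-content branch: 1024-dim feature vector through shrinking blocks
    # (the 768-node middle block and dropout rate are assumptions).
    pe_in = keras.Input(shape=(1024,), name="pe_features")
    x = pe_in
    for units in (1024, 768, 512):
        x = dense_block(x, units)

    # File-path branch: 100 character ids -> embedding -> 1D conv blocks.
    fp_in = keras.Input(shape=(100,), name="path_ids")
    emb = layers.Embedding(input_dim=151, output_dim=32)(fp_in)  # 100x32
    conv_outs = []
    for k in (2, 3, 4, 5):
        c = layers.Conv1D(128, kernel_size=k, padding="same")(emb)  # 100x128
        c = layers.LayerNormalization()(c)
        # 1D sum layer: sum along the sequence axis -> 128-dim vector.
        c = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(c)
        conv_outs.append(c)

    # Concatenate both branches and finish with 512/256/128 blocks.
    merged = layers.Concatenate()([x] + conv_outs)
    for units in (512, 256, 128):
        merged = dense_block(merged, units)
    out = layers.Dense(1, activation="sigmoid", name="malware_score")(merged)
    return keras.Model([pe_in, fp_in], out)
```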
Here are some useful links that explain the inner workings of these different layers in detail:
· 1D convolutions: https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/
· Layer Normalization: https://arxiv.org/abs/1607.06450
· Embedding layers: https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526
· Dropout layers: https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
· Activation functions: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
In order to evaluate the performance of our proposed model (PE + FP), we compared it to two other models: a neural network that looks only at the static PE content (PE), and another that looks only at the file paths (FP). The comparison of all three models in the form of ROC curves is shown in the figure above. Each curve represents the mean and variance of the ROC curve across five individual training runs, so that model stability is also taken into account in our performance analysis.
We see that the proposed (PE content + file path) model substantially outperforms the content-only model in terms of net AUC and across the vast majority of the ROC curve, dipping slightly below the PE baseline between 10⁻² and 10⁻³ FPR, an effect that could potentially be alleviated with a larger training set. At lower FPRs, the performance improvements of the PE+FP model over both baselines are substantial: there is a 27% increase in true positive rate for the PE+FP model over the PE model at 10⁻³ FPR, and a 64% increase at 10⁻⁴ FPR. This increase is also accompanied by a reduction in the variance of performance, making the PE+FP model a better choice in terms of both stability and overall detection performance.
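The "true positive rate at a fixed false positive rate" metric used in this comparison is easy to compute from raw scores with scikit-learn; `tpr_at_fpr` below is a hypothetical helper name, not part of our pipeline.

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr):
    """Interpolate the ROC curve to read off TPR at a fixed FPR."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.interp(target_fpr, fpr, tpr))
```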
How do file paths influence detection?
Now that we know using file paths leads to better detection rates, let’s dig deeper and analyze how the file path actually contributes to a prediction!
We performed three different analyses using LIME (Local Interpretable Model-Agnostic Explanations). This paper provides more details on how LIME works: https://arxiv.org/abs/1602.04938
First, we picked one of our trained PE + FP models. Using LIME, we generated a weight for each token in the file path, representing that token's contribution to the final classification outcome. Some example file paths with (a) positive and (b) negative ground-truth labels are shown in the figure above. Red highlights indicate that the token increased the overall malware score, blue highlights indicate that it reduced the score, and darker shades of either color indicate a larger-magnitude impact.
We see several interesting tokens in the figure above. In the first positive example, we can see that the token “kmsauto” is identified as a maliciousness indicator by our PE+FP model. KMS Auto is a legally dubious Microsoft product activator, and this file is identified as “PUA:Win32/AutoKMS” by Microsoft. Similarly, in the second positive example our PE+FP model gave a high score to “pcrepairkit”. Repair kits are often questionable software products that frequently contain spyware or malware.
On the other hand, in several negative examples we can see that management tools are down-weighted by the PE+FP model as compared to the PE model. Management tools are notoriously difficult to distinguish from spyware: their functionality is essentially the same, and the only difference is the intent of the user. In this case, the file path information provided additional context for the detection, allowing more accurate identification by the PE+FP model.
In our second analysis, we replaced the PE file associated with a file path with another, randomly chosen PE file, and noted the resulting difference in weight contributions. It was encouraging to see that the contributions of tokens in the file path changed based on the PE content. This indicates that the file path contribution has a non-linear dependence on the PE content, meaning the two inputs are not considered independently by the model, which makes it more robust. For example, when we kept the same path for the first negative example but replaced the file with a randomly chosen malicious file, the importance of the token “management” was significantly reduced.
Finally, we performed an aggregate LIME analysis to identify prominent tokens throughout our dataset. We performed isotonic regression to calibrate the sigmoid outputs of the PE and the PE+FP model, and identified 200 samples which had the highest score variation between the two models – 100 samples which saw the largest increase in score and 100 samples which saw the largest decrease in score over the baseline. We then aggregated LIME parameter weights across tokens and normalized by token frequency, looking at tokens of highest and lowest weights for the selected 200 samples. The top 10 tokens which increased and decreased response are shown in the above table.
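The isotonic regression step used to calibrate the two models' sigmoid outputs can be sketched with scikit-learn as follows; the validation scores and labels here are synthetic stand-ins for our data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: raw sigmoid scores and binary labels.
scores = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < scores).astype(int)

# Fit a monotone map from raw scores to calibrated probabilities.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
calibrated = iso.fit_transform(scores, labels)

# The fitted map rescales scores from a model so that outputs of the
# PE and PE+FP models become directly comparable.
new_scores = iso.predict(np.array([0.2, 0.8]))
```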
For malicious samples, we see that the highest-weight tokens consisted of strings with randomized content that were not cryptographic digests, perhaps an attempt at obfuscation. The remaining high-weight token, “setup”, is perhaps indicative of an infected installer. Tokens with large negative weights consist of common-looking benign software names, as one might expect. For the benign samples that we assessed, tokens that increased the response tended to be very short, e.g., “t”, “d”, and “z”, to have very high or very low entropy, e.g., “219805786” and “xxxxx”, or to contain “miner”, e.g., “miner”, “mineropt” – indicating the likely presence of a (benign) cryptocurrency miner, potentially downloaded voluntarily by the user. It is not surprising that the string “miner” increased the response, as many types of malware and potentially unwanted benignware steal CPU cycles to mine cryptocurrency. The tokens that most attenuated the response appear to be components of standard software.
Our main takeaway from this research is that file paths are a valuable source of information that can be leveraged during static detection to improve model performance. We have demonstrated this with a detailed study using a dataset of about 10 million examples. We also performed LIME analysis on the trained models to demonstrate how a file path contributes to the final model verdict. In doing so, we identified that the model is able to learn not only statistical correlations in the data that point to possible malicious content, but also contextual information that points to actual malicious/benign concepts.
The results we obtained are very encouraging; they suggest that the PE content + file path model could practically be deployed to endpoints to detect malware. In addition, we think that this model and the accompanying LIME analysis are great candidates for application in an Endpoint Detection and Response context: analytic tools built on LIME model explanations can better visualize the threat associated with a given file.