DeepSign: Deep Learning For Automatic Malware Signature Generation And Classification

April 27, 2023

This is a summary and discussion of the paper DeepSign: Deep learning for automatic malware signature generation and classification by Omid E. David and Nathan S. Netanyahu.

Summary

YARA rules can be used to identify specific malware based on patterns in a file
YARA rules are typically written when anti-virus companies first discover a piece of malware, since adding the new samples to a training dataset takes a lot of time
Are we able to identify new variants of the same malware?
- YES, that is what this paper gets into
160,000 - 360,000 new malware files are generated every day, and most conventional signature techniques fail to identify new variants of these malware files
This paper asks whether it is possible to develop a signature that represents a program's behavior and is invariant to small changes

Background Knowledge

Deep Belief Neural Networks

Composed of unsupervised networks like Restricted Boltzmann Machines (RBMs)

RBMs are two-layer networks (a visible layer and a hidden layer); each RBM is trained and then its weights are frozen
The output of one RBM is the input to the next
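
The train-then-freeze stacking described above can be sketched in a few lines. This is only a toy illustration, not the paper's implementation: the `RBM` class, the CD-1 update, the layer sizes, and the learning rate are all assumptions picked to keep the example small.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Tiny binary RBM trained with one step of contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def visible_probs(self, h):
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def train(self, data, epochs=50, lr=0.1):
        for _ in range(epochs):
            for v0 in data:
                h0 = self.hidden_probs(v0)                      # positive phase
                h_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
                v1 = self.visible_probs(h_sample)               # reconstruction
                h1 = self.hidden_probs(v1)                      # negative phase
                for i in range(len(v0)):
                    for j in range(len(h0)):
                        self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                    self.b_vis[i] += lr * (v0[i] - v1[i])
                for j in range(len(h0)):
                    self.b_hid[j] += lr * (h0[j] - h1[j])


def train_stack(data, layer_sizes, seed=0):
    """Greedy layer-wise training: train one RBM, freeze it, feed its output onward."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        rbm.train(layer_input)      # only this layer's weights are updated
        layers.append(rbm)          # never touched again, i.e. frozen
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
    return layers


def encode(layers, v):
    """Pass an input through every trained layer in order."""
    for rbm in layers:
        v = rbm.hidden_probs(v)
    return v


# Toy data (made up): two complementary 6-bit patterns, repeated.
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 2
layers = train_stack(data, layer_sizes=[4, 2])
code = encode(layers, data[0])  # one value per unit in the last layer
```

Each call to `rbm.train` only ever sees the previous layer's outputs, which is what makes the procedure greedy: earlier layers are never revisited.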

Denoising AutoEncoders

Trained to reconstruct the original, uncorrupted data when given a corrupted version of it as input
- the classic example is adding noise to MNIST digits and training the network to recover the clean images
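
A minimal sketch of a single denoising autoencoder layer may make the idea concrete. Everything here is an assumption made for illustration (masking noise, tied weights, cross-entropy loss, the toy patterns, the hyperparameters); the notes do not say which variants the paper uses.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def corrupt(x, noise_rate, rng):
    """Masking noise: zero each input value with probability noise_rate."""
    return [0.0 if rng.random() < noise_rate else xi for xi in x]


class DenoisingAutoencoder:
    """One hidden layer, tied weights (the decoder reuses the encoder weights)."""

    def __init__(self, n_in, n_hidden, rng):
        self.rng = rng
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_in)]
        self.b_hid = [0.0] * n_hidden
        self.b_out = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(self.b_hid[j] + sum(x[i] * self.w[i][j] for i in range(len(x))))
                for j in range(len(self.b_hid))]

    def decode(self, h):
        return [sigmoid(self.b_out[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.b_out))]

    def train_step(self, x, noise_rate, lr):
        x_noisy = corrupt(x, noise_rate, self.rng)  # the network sees corrupted input
        h = self.encode(x_noisy)
        y = self.decode(h)
        # Cross-entropy against the CLEAN x: gradient at the output pre-activation.
        d_out = [y[i] - x[i] for i in range(len(x))]
        # Backpropagate to the hidden pre-activation.
        d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[i] * self.w[i][j] for i in range(len(x)))
                 for j in range(len(h))]
        for i in range(len(x)):
            for j in range(len(h)):
                # Tied weights collect gradient from both the encoder and the decoder.
                self.w[i][j] -= lr * (x_noisy[i] * d_hid[j] + d_out[i] * h[j])
            self.b_out[i] -= lr * d_out[i]
        for j in range(len(h)):
            self.b_hid[j] -= lr * d_hid[j]


def reconstruction_error(dae, data):
    """Mean squared error when reconstructing clean inputs."""
    total = 0.0
    for x in data:
        y = dae.decode(dae.encode(x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / len(data)


# Toy data (made up): two complementary 6-bit patterns.
data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
dae = DenoisingAutoencoder(n_in=6, n_hidden=3, rng=random.Random(0))
error_before = reconstruction_error(dae, data)
for _ in range(300):
    for x in data:
        dae.train_step(x, noise_rate=0.3, lr=0.5)
error_after = reconstruction_error(dae, data)
```

The key detail is in `train_step`: the forward pass uses `x_noisy`, but the error is measured against the clean `x`, so the layer is pushed to ignore the corruption.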

Generating signatures for malware files

Run each malware file in a sandbox and obtain the log file
Parse the log and convert it into a binary bit string over unigrams (1-grams), where each bit records whether a given unigram appears in the log -> this is the input to the neural network
Train a deep DAE by training one layer at a time
Test to see how well signatures represent malware using a supervised classification technique
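
The log-to-bit-string step can be sketched roughly as follows. The whitespace tokenizer, the toy log lines, and the use of per-file counts as "frequency" are assumptions for illustration; real sandbox logs are structured reports and would need a proper parser.

```python
from collections import Counter


def unigrams(log_text):
    """Treat every whitespace-separated token as a unigram (a simplifying assumption)."""
    return set(log_text.split())


def build_vocabulary(logs, top_k):
    """Keep the top_k unigrams by number of files they appear in,
    skipping unigrams present in every file (they carry no information)."""
    file_counts = Counter()
    for log in logs:
        file_counts.update(unigrams(log))
    candidates = [(tok, n) for tok, n in file_counts.items() if n < len(logs)]
    # Sort by count (descending), then alphabetically so the result is deterministic.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in candidates[:top_k]]


def to_bit_string(log_text, vocabulary):
    """One bit per vocabulary entry: 1 if the unigram occurs in this log."""
    present = unigrams(log_text)
    return [1 if tok in present else 0 for tok in vocabulary]


# Made-up log contents, only to show the shape of the data.
logs = [
    "CreateFile a.tmp RegSetValue Run connect host1",
    "CreateFile b.tmp RegSetValue Run",
    "CreateFile c.tmp connect host2",
]
vocabulary = build_vocabulary(logs, top_k=3)
bit_strings = [to_bit_string(log, vocabulary) for log in logs]
```

Here `CreateFile` is dropped because it shows up in all three logs, and the per-file tokens such as `a.tmp` fall outside the top 3, so each log ends up as a 3-bit vector.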

When extracting the unigrams, remove those that appear in every single file (they do not reveal any new information)
Select the top 20,000 most frequent unigrams
For the DAE, noise is added to the input before it is given to the model -> the noise here stands in for variations between malware files
1800 malware files (1200 for training, 600 for testing); Kaspersky labels were used as ground truth
The final layer of the network has 30 units, so each file is reduced to a signature of 30 values
An SVM and a KNN classifier trained on the signatures gave pretty good accuracy; a neural network trained for the classification gave even greater accuracy
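
To show what "classifying on signatures" looks like, here is a small k-nearest-neighbours sketch over signature vectors. The 3-value signatures, the family names, and `k=3` are invented for the example; they are not taken from the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_signatures, train_labels, query, k=3):
    """Label a query signature by majority vote among its k nearest training signatures."""
    order = sorted(range(len(train_signatures)),
                   key=lambda i: euclidean(train_signatures[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical signatures (shortened to 3 values) with hypothetical family labels.
train_signatures = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.8],
    [0.2, 0.8, 0.9],
]
train_labels = ["family_a", "family_a", "family_b", "family_b"]

prediction_a = knn_predict(train_signatures, train_labels, [0.85, 0.15, 0.15])
prediction_b = knn_predict(train_signatures, train_labels, [0.15, 0.85, 0.85])
```

The classifier never sees the raw logs, only the compact signatures, so a variant whose signature lands near known samples of a family gets that family's label.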

Discussion

Where in the pipeline would we put this?
- Since we need to make use of dynamic analysis here, this would be placed at the end of the pipeline, just after the malware detector
- It would likely be used only for files that are not scored as malware with high confidence, since dynamic analysis is VERY expensive to run compared to static analysis
Why not use signatures for both static features and dynamic features?
- Dynamic analysis shows you what is actually happening when you execute a file - obviously you will not be running this on a user's machine
Is this approach about generalization?
- NO, you are intentionally biasing the dataset (they kept the dataset fairly small here!) so you can identify specific malware!
- There are many different malware families, each with its own behaviors
There is SOME generalization here. For example, the C:\\ path will change from system to system; on some systems it will be D:\\, etc.
Unigrams are able to capture small variations
Why use this approach? The pipeline is built to generalize, and then you go fine-grained if required. This can be used as an oracle for drift detection
What will happen if there is a new class of malware introduced?
- Probably not detected by a machine learning model -> it may be classified as an unknown class of malware or as goodware
- An analyst will need to confirm the new class of malware
- Outlier detection is typically used to catch new classes
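
One simple way to picture the outlier-detection idea is a distance threshold around each known family's centroid. This is just one possible sketch, not something the paper proposes; the signatures, family names, and the threshold of 0.5 are all hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroids(signatures, labels):
    """Average the signatures of each known family."""
    grouped = {}
    for sig, label in zip(signatures, labels):
        grouped.setdefault(label, []).append(sig)
    return {label: [sum(col) / len(sigs) for col in zip(*sigs)]
            for label, sigs in grouped.items()}


def classify_or_flag(centroids, query, threshold):
    """Return the nearest family, or "unknown" if even the nearest one is too far away."""
    label, distance = min(((lbl, euclidean(c, query)) for lbl, c in centroids.items()),
                          key=lambda item: item[1])
    return label if distance <= threshold else "unknown"


# Hypothetical signatures and labels; the threshold is an arbitrary choice.
signatures = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.1, 0.9, 0.8], [0.2, 0.8, 0.9]]
labels = ["family_a", "family_a", "family_b", "family_b"]
centroids = fit_centroids(signatures, labels)

known = classify_or_flag(centroids, [0.8, 0.2, 0.2], threshold=0.5)
novel = classify_or_flag(centroids, [0.5, 0.5, 0.5], threshold=0.5)
```

Anything flagged as `"unknown"` would then go to an analyst for confirmation, matching the workflow described above.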
Which has a greater level of drift, new variations or new classes?
- There is a greater level of concept drift with new variations of malware files
- It is much simpler to create static drift than dynamic drift: new variants still behave the same way when run dynamically, but the way they are presented statically can be different
How can we fool a dynamic analysis approach?
- Include many malware variants in one file to fool the system, so that it is unable to assign a specific malware class, or it provides one label where there should be many
- Mix malware in between goodware functions
- Use distributed malware
What is the difference between YARA rules and the approach proposed in this paper?
- YARA makes use of static analysis; this paper uses dynamic analysis to create signatures
- YARA rules are limited to a specific variant, whereas dynamic signatures are much broader
ML can be applied to things that change often
Signatures can be used for API functions