Enterprise Security

My first experience with using Machine Learning in a cyber-security context

Introduction

For my Cyber Security Project, my team and I were tasked with designing and developing an enterprise security architecture that prioritized resiliency, response, best practices and unknown threat detection. Here is our chosen attack scenario.

Having some basic experience in Malware Analysis from Matt Kiely's PMAT course, I chose the Malware Attack Category and as such was in charge of implementing malware mitigation and detection measures.

Security Stack

Here is the security stack I implemented for the detection and prevention of malware on the Windows endpoint.

The main security tool I used was Wazuh, an open-source endpoint security tool.

The two main features that I want to highlight and explain in this blog are Malware PE Classification with Machine Learning and Automated Bulk Analysis using Cuckoo's API.

Malware PE Classification with Machine Learning

To detect unknown malware threats, I used machine learning to classify whether a PE file was malicious or legitimate based on its PE attributes.

PE stands for Portable Executable and is a widely used file format across Windows.

Some common kinds of PE files include DLLs (Dynamic Link Libraries) and EXEs (Executables).

My motivation for this feature was that I wanted a fast and easy way to classify executables without relying on traditional static detection methods such as VirusTotal or YARA.

Data Collection

To start off, I first had to collect my samples. I downloaded over 200,000 PE executables from https://practicalsecurityanalytics.com/, comprising 86,000 legitimate and 114,000 malicious files.

To create my labelled dataset, I automatically extracted 54 features from each executable's PE file header using https://github.com/Te-k/malware-classification/blob/master/generatedata.py.

These results were then stored in a CSV file, creating the labelled dataset.
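As a rough illustration of what header-based extraction looks like, here is a minimal sketch that reads only the seven COFF header fields with the standard library and writes them to a CSV with a label column. It is not the linked generatedata.py script, which extracts far more fields. The `fake_pe` helper, the file names and the `legitimate` column name are assumptions made for the example.

```python
import csv
import io
import struct

# COFF file header layout (20 bytes, little-endian), per the PE specification.
COFF_FIELDS = (
    "Machine", "NumberOfSections", "TimeDateStamp", "PointerToSymbolTable",
    "NumberOfSymbols", "SizeOfOptionalHeader", "Characteristics",
)
COFF_STRUCT = struct.Struct("<HHIIIHH")


def coff_features(data: bytes) -> dict:
    """Return the COFF header fields of a PE image held in memory."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: 4-byte offset to the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    values = COFF_STRUCT.unpack_from(data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def write_dataset(samples, out):
    """samples: iterable of (name, bytes, label); label 1 = legitimate (assumed)."""
    writer = csv.DictWriter(out, fieldnames=("Name", *COFF_FIELDS, "legitimate"))
    writer.writeheader()
    for name, data, label in samples:
        writer.writerow({"Name": name, **coff_features(data), "legitimate": label})


def fake_pe(machine: int) -> bytes:
    """Build a tiny synthetic header for demonstration (not a runnable binary)."""
    dos = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40)
    coff = COFF_STRUCT.pack(machine, 3, 0, 0, 0, 0xE0, 0x0102)
    return dos + b"PE\x00\x00" + coff


# Hypothetical samples: one x86 "legitimate" and one x64 "malicious" header.
buf = io.StringIO()
write_dataset([("a.exe", fake_pe(0x14C), 1), ("b.exe", fake_pe(0x8664), 0)], buf)
print(buf.getvalue())
```

In practice the bytes would come from reading each sample off disk, and the label would come from which collection the sample belongs to.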

Data Analysis & Feature Extraction

During Data Analysis, the goal is to reduce the number of extracted features and keep only the most informative ones for training. This can be done easily with feature importance algorithms.

I made use of the Extra Trees Classifier algorithm to narrow down to the 12 most relevant features for differentiating malicious and legitimate PE files.

Together, these 12 features account for 0.71709 of the total feature importance (which sums to 1), meaning they carry most of the weight of the 54 features that I extracted at the start.
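The selection step can be sketched with scikit-learn as follows. This is an illustration on synthetic data generated by `make_classification`, not the real PE dataset, so the indices and the importance share it prints will differ from the figures above. The sample count and hyperparameters are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the 54-feature PE dataset (not the real data).
X, y = make_classification(n_samples=2000, n_features=54, n_informative=12,
                           random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is normalised to sum to 1 across all features.
importances = forest.feature_importances_
top12 = np.argsort(importances)[::-1][:12]   # indices of the 12 highest scores
share = importances[top12].sum()

print("top 12 feature indices:", top12.tolist())
print(f"share of total importance: {share:.3f}")

X_reduced = X[:, top12]                      # keep only the selected columns
```

Because the importances are normalised, the summed share of the top 12 is directly comparable to the "over 1" figure quoted above.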

Feature Analysis

To verify that the features selected by the algorithm made sense, I also manually performed feature analysis to see the distribution of values.

We can first inspect the Machine attribute. Microsoft's documentation states the following:

The Machine field has one of the following values, which specify the CPU type. An image file can be run only on the specified machine or on a system that emulates the specified machine.

We can also see a table of the different hexadecimal values and their corresponding CPU architectures.

We can easily gauge the values of legitimate and malicious PE files with the use of value_counts().

If we convert the values 332 and 34404 to hexadecimal, we get 0x14C and 0x8664, which represent the x86 and x64 architectures respectively.

We can see that the legitimate PE files are split mainly between x86 and x64, while an overwhelming majority of the malicious PE files target the x86 architecture.

In my opinion, this further validates why the Machine attribute is a good feature and has such a high importance score, as there is a clear difference in values between the malicious and legitimate PE files.
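The hexadecimal conversion and the per-class counting can be reproduced with a few lines. The sketch below uses `collections.Counter` as a dependency-free stand-in for pandas' value_counts(), and the rows are made-up values that only mimic the shape of the real CSV.

```python
from collections import Counter

# Two of the Machine constants from the Microsoft PE format documentation.
MACHINE_NAMES = {0x14C: "x86 (I386)", 0x8664: "x64 (AMD64)"}

# Hypothetical (Machine, legitimate) rows standing in for the real CSV.
rows = [(332, 1), (34404, 1), (34404, 1), (332, 0), (332, 0), (332, 0)]


def machine_counts(rows, label):
    """Count Machine values among rows carrying the given label."""
    return Counter(machine for machine, lab in rows if lab == label)


for label, title in ((1, "legitimate"), (0, "malicious")):
    print(title)
    for machine, n in machine_counts(rows, label).most_common():
        name = MACHINE_NAMES.get(machine, "other")
        print(f"  {machine} = {hex(machine)} -> {name}: {n}")
```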

We can also look at the second most highly scored feature, SectionsMaxEntropy, which calculates the highest entropy value of all the sections within a PE file.

High-entropy sections may be indicative of obfuscation or encryption techniques, which are common malware characteristics.

Similarly, for the feature SectionsMaxEntropy, we can see an obvious correlation: the majority of the malware binaries (denoted in red) have an entropy value between 7 and 8, while legitimate samples (denoted in green) mostly have an entropy value below 7.

Thus, there is once again a clear difference in values between malicious and legitimate PE files, which confirms that the features selected by the algorithm are good.
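For reference, the entropy in question is the Shannon entropy of a section's bytes, measured in bits per byte, so it ranges from 0 to 8. Here is a minimal sketch of the calculation; the function names are my own, and the inputs are toy byte strings rather than real PE sections.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(data).values())


def sections_max_entropy(sections) -> float:
    """Highest entropy across an iterable of raw section byte strings."""
    return max((shannon_entropy(s) for s in sections), default=0.0)


padding = b"\x00" * 4096           # a single repeated byte: entropy 0.0
uniform = bytes(range(256)) * 16   # every byte value equally likely: entropy 8.0
text = b"This program cannot be run in DOS mode." * 10

print(shannon_entropy(padding))
print(shannon_entropy(uniform))
print(sections_max_entropy([padding, text, uniform]))
```

Packed or encrypted sections look close to uniformly random, which is why they land near the top of the scale.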

Training and Evaluation

Since the act of identifying malicious and legitimate PE files is a classification problem, we can start by testing different classification algorithms via scikit-learn to determine the best classification algorithm.

Using an 80/20 train/test split of my labelled dataset, I managed to achieve 98.3% accuracy with the Random Forest algorithm.

I then exported the model to be used for integration.
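The train, evaluate and export loop looks roughly like the sketch below. It runs on synthetic data from `make_classification`, so the accuracy it prints says nothing about the 98.3% figure above. The hyperparameters, the file name and the use of joblib for the export are assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 12 selected PE features (not the real dataset).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=8,
                           random_state=0)

# 80/20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"hold-out accuracy: {accuracy:.3f}")

# Persist the fitted model so a separate script can load it later.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")  # hypothetical name
joblib.dump(clf, model_path)
restored = joblib.load(model_path)
```

Swapping `RandomForestClassifier` for another scikit-learn classifier in the same loop is a simple way to compare algorithms on the same split.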

Integration

This was arguably the hardest part of implementing this feature. What is the point of coming up with a good AI/ML model if you have no real way to deploy and integrate it in the real world?

After much trial and error, I managed to integrate the ML model with Wazuh's Active Response through the following steps:

Creating a Python script to extract the PE information to be analyzed by the ML model
Compiling the script into an executable using cx_Freeze
Creating a custom batch script that does the following
1. Uses the ML model to classify the PE file as malicious or legitimate
2. Places scanned malicious and legitimate PE files in their respective write-protected directories
Configuring Wazuh with custom rules and decoders to process the logs and create the respective alerts
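The routing and logging part of these steps can be sketched in Python as follows. The real implementation used a batch script, so this is a swapped-in illustration, and the folder names, log fields and file names are all hypothetical. It does not set up write protection or any Wazuh configuration. It only shows the idea of moving a file by verdict and emitting one structured log line that custom decoders and rules could then match.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def route_sample(sample: Path, is_malicious: bool, base: Path, log: Path) -> Path:
    """Move a scanned file into its verdict folder and log one JSON line."""
    verdict = "malicious" if is_malicious else "legitimate"
    target_dir = base / verdict          # hypothetical folder names
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / sample.name
    shutil.move(str(sample), str(target))
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "file": str(target),
        "verdict": verdict,
    }
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return target


# Demonstration in a throwaway directory with a dummy file.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "dropper.exe"
sample.write_bytes(b"MZ")
log_file = workdir / "pe_classifier.log"

# In the real pipeline the verdict would come from the exported model.
moved = route_sample(sample, is_malicious=True, base=workdir / "quarantine",
                     log=log_file)
print(moved)
print(log_file.read_text())
```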

Automated Bulk Analysis via Cuckoo's API

What if your machine learning model is wrong? How are you going to verify that the PE files scanned by the model are really malicious or legitimate, as it claims?

This was my motivation behind this feature, as I wanted an alternative method to validate and verify the results of my machine learning model besides raw accuracy.

To implement this, I did the following:

Setting up Cuckoo
1. Ensured all required services were running (e.g. cuckoo-api, cuckoo, cuckoo-web)
2. Used cuckoo community, as it comes with extra signatures that deepen the analysis
Creating file_sync.bat on the client machine
1. Bulk transfers all quarantined files (legitimate/malicious) to the sandbox server via PuTTY
2. Automated with Task Scheduler, which executes the script at midnight daily
Creating cuckoo_scan.py on the sandbox machine
1. Scans all files in the quarantine folder and moves each executable and its report into another folder once finished
2. Creates a custom YARA rule for the executable with yarGen
3. Automated with a cron job that executes the script at midnight daily
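The submission step of a script like cuckoo_scan.py can be sketched with the standard library alone. This is not the actual script. The endpoint paths (`/tasks/create/file`, `/tasks/report/<id>`), the bearer-token header and the `task_id` response field are how I recall the Cuckoo 2.x REST API, so check them against the version you run. The host, port and token below are placeholders.

```python
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def submit_file(api_url: str, token: str, filename: str, payload: bytes) -> int:
    """POST a sample to /tasks/create/file and return the task id."""
    body, content_type = build_multipart("file", filename, payload)
    request = urllib.request.Request(
        f"{api_url}/tasks/create/file",
        data=body,
        headers={"Content-Type": content_type,
                 "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task_id"]


def report_url(api_url: str, task_id: int) -> str:
    """URL of the JSON report for a finished task."""
    return f"{api_url}/tasks/report/{task_id}"


# Placeholder address; no request is sent until submit_file() is called.
API_URL = "http://localhost:8090"
print(report_url(API_URL, 1))
```

A bulk run would loop over the quarantine folder, call `submit_file` for each sample, then poll the report URL until the analysis finishes before moving the files.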

Reflection

I was pretty satisfied with the outcome of this project, as I managed to apply what I learnt about the PE format to a somewhat practical use case.

I also learnt the importance of understanding my data in the context of ML/AI.

If I were to ever do something like this again, I would like to explore Deep Learning applications in malware analysis, such as this example.

If you are interested in looking at my slides for this project, you can click here.

Thanks for reading!

