Microsoft Malware Detection Using Machine Learning

Rayudu yarlagadda
10 min read · Nov 20, 2018


Dataset location @ https://www.kaggle.com/c/malware-classification/data

What is Malware?

The term malware is a contraction of "malicious software". Put simply, malware is any piece of software written with the intent of doing harm to data, devices or people.

Problem Statement

In the past few years, the malware industry has grown very rapidly. Attackers invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build ever more robust software to detect and terminate these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.

Some background on dataset:

Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group files and identify their respective malware families.

This dataset provided by Microsoft contains 9 classes of malware.

Source: https://www.kaggle.com/c/malware-classification

Here my goal is to analyze this data set:

Problem / Objective: There are nine different classes of malware, and we need to classify each given data point into one of them.

Here are the constraints for this dataset:

I. For constructing a model -

  1. Minimize the multi-class error.
  2. Multi-class probability estimates are required.
  3. Malware detection should not take hours and block the user’s computer; it should finish in a few seconds to a minute.

II. Performance metrics/measurements -

  1. Multi-class log-loss (chosen because it uses the predicted probabilities directly, which matches our business constraint)
  2. Confusion matrix

Solution:

The following are the high-level steps to analyze the dataset:

  1. Data overview
  2. Mapping the real-world problem to an ML problem
  3. Data Preprocessing and Exploratory Data Analysis (EDA)
  4. Train, Test and CV split
  5. Modeling
  6. Prediction

1. Data Overview

  • Source : https://www.kaggle.com/c/malware-classification/data
  • For every malware sample, we have two files:
  • a .asm file (read more: https://www.reviversoft.com/file-extensions/asm)
  • a .bytes file (raw data containing the hexadecimal representation of the file’s binary content, without the PE header)
  • In total, the train dataset is about 200 GB, of which roughly 50 GB is .bytes files and 150 GB is .asm files.
  • There are 10,868 .bytes files and 10,868 .asm files, 21,736 files in total.
  • There are 9 types of malware (9 classes) in this dataset.
  • Here is the list of malware types present in the dataset:
  1. Ramnit
  2. Lollipop
  3. Kelihos_ver3
  4. Vundo
  5. Simda
  6. Tracur
  7. Kelihos_ver1
  8. Obfuscator.ACY
  9. Gatak

What is .ASM file?

Data with assembly language code is saved in the ASM format; each .asm file here is a disassembly listing of the corresponding binary.

.bytes files

As shown further below, .bytes files are in hexadecimal format. Hexadecimal uses sixteen symbols: the decimal digits 0-9 plus the six letters A-F, which represent the values 10 to 15.
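As a quick illustration of the hexadecimal encoding described above, Python can parse these tokens directly:

```python
# Hexadecimal uses 16 symbols: the digits 0-9 and the letters A-F (A=10 ... F=15).
# int() with base 16 parses a hex token; this is handy when reading .bytes files.
assert int("0A", 16) == 10
assert int("FF", 16) == 255   # the largest value a single byte can hold
assert format(255, "02X") == "FF"  # and back again
```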

Below is how the asm file looks in the dataset

.text:00401000                                       assume es:nothing, ss:nothing, ds:_data,    fs:nothing, gs:nothing
.text:00401000 56 push esi
.text:00401001 8D 44 24 08 lea eax, [esp+8]
.text:00401005 50 push eax
.text:00401006 8B F1 mov esi, ecx
.text:00401008 E8 1C 1B 00 00 call ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const &)
.text:0040100D C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08
.text:00401013 8B C6 mov eax, esi
.text:00401015 5E pop esi
.text:00401016 C2 04 00 retn 4
.text:00401016 ; ---------------------------------------------------------------------------
.text:00401019 CC CC CC CC CC CC CC align 10h
.text:00401020 C7 01 08 BB 42 00 mov dword ptr [ecx], offset off_42BB08
.text:00401026 E9 26 1C 00 00 jmp sub_402C51
.text:00401026 ; ---------------------------------------------------------------------------
.text:0040102B CC CC CC CC CC align 10h
.text:00401030 56 push esi
.text:00401031 8B F1 mov esi, ecx
.text:00401033 C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08
.text:00401039 E8 13 1C 00 00 call sub_402C51
.text:0040103E F6 44 24 08 01 test byte ptr [esp+8], 1
.text:00401043 74 09 jz short loc_40104E
.text:00401045 56 push esi
.text:00401046 E8 6C 1E 00 00 call ??3@YAXPAX@Z ; operator delete(void *)
.text:0040104B 83 C4 04 add esp, 4
.text:0040104E
.text:0040104E loc_40104E: ; CODE XREF: .text:00401043j
.text:0040104E 8B C6 mov eax, esi
.text:00401050 5E pop esi
.text:00401051 C2 04 00 retn 4
.text:00401051

The .bytes file looks as below:

00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00

2. Mapping the real problem to ML Problem

From this dataset, there are nine different classes of malware that we need to classify a given data point into. Since there are 9 classes, we can map this to a multi-class classification problem in ML.

3. Data Preprocessing and Exploratory data analysis

In general, as part of data pre-processing, we do the following:

I. Data loading with Python (Pandas), II. Data Analysis, III. EDA, IV. Vectorization.

We need the following Python libraries for this dataset:
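The original post showed the import list as an image; a typical set for this pipeline (an assumption, reconstructed from the steps described in the article) would be:

```python
# Assumed imports for this pipeline; the original post showed them as an image.
import os
import shutil

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss, confusion_matrix
```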

The first thing we do in EDA is separate the .bytes files and the .asm files into different folders. To understand whether we are dealing with balanced or imbalanced data, let's plot a histogram of the class labels.
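The original post shows this plot as an image. A minimal sketch of the histogram step, assuming the Kaggle `trainLabels.csv` (with `Id` and `Class` columns) has been downloaded:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to display the plot
import matplotlib.pyplot as plt

def plot_class_distribution(labels: pd.DataFrame) -> pd.Series:
    """Bar plot of files per malware class; returns the per-class counts."""
    counts = labels["Class"].value_counts().sort_index()
    counts.plot(kind="bar")
    plt.xlabel("malware class")
    plt.ylabel("number of files")
    plt.title("Class distribution")
    return counts

# trainLabels.csv (from the Kaggle data page) maps each file Id to a Class (1-9):
# labels = pd.read_csv("trainLabels.csv")
# plot_class_distribution(labels)
```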

In the histogram we can see that the counts are high for classes 1, 2 and 3, while the remaining classes occur significantly less often. From this we can tell that our dataset is imbalanced.

Let's do some feature engineering to see whether the size of a file carries useful information for prediction.

Here we engineer a feature based on file size: get the list of files from the dataset, measure the size of each file, and then, looking at the per-class file sizes, decide whether this feature is useful.

For this we will use a simple box plot.
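A sketch of the file-size featurization, assuming the .bytes files sit in a folder (here called `byteFiles`, a hypothetical name) and `labels` is the trainLabels DataFrame:

```python
import os

import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to display the plot
import matplotlib.pyplot as plt

def file_size_feature(byte_dir: str, labels: pd.DataFrame) -> pd.DataFrame:
    """Attach the on-disk size (in MB) of each .bytes file as a new column."""
    sizes = [
        os.path.getsize(os.path.join(byte_dir, fid + ".bytes")) / 1e6
        for fid in labels["Id"]
    ]
    return labels.assign(size_mb=sizes)

# df = file_size_feature("byteFiles", labels)
# df.boxplot(column="size_mb", by="Class")   # one box per malware class
# plt.ylabel("file size (MB)")
```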

From the box plot, based on file size, we see that classes 2 and 5 are well separated compared to the others, and the file-size distributions differ across classes.

Now we will convert all the hexadecimal files into text features using unigrams.

Note: another approach would be to convert the text with CountVectorizer, but in this use case the data does not fit in main memory: holding very large (hundreds of GBs of) data in RAM is not a good idea here.

Let's implement the unigrams with our own bag-of-words code, streaming each file so the full corpus never has to sit in memory.
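The original code was shown as an image; a minimal sketch of the unigram bag of words over a .bytes file (the 256 byte values plus the "??" token that appears for unreadable bytes) could look like this:

```python
import numpy as np

# Vocabulary: 256 possible byte values plus the "??" token the disassembler
# emits for bytes it could not read.
VOCAB = [format(i, "02X") for i in range(256)] + ["??"]
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def bytes_unigram(path: str) -> np.ndarray:
    """Unigram (bag-of-words) counts for one .bytes file, streamed line by
    line so the whole 50 GB corpus never has to sit in memory."""
    counts = np.zeros(len(VOCAB), dtype=np.int64)
    with open(path) as f:
        for line in f:
            tokens = line.split()
            # The first token on each line is the address column; skip it.
            for tok in tokens[1:]:
                idx = INDEX.get(tok.upper())
                if idx is not None:
                    counts[idx] += 1
    return counts
```

Each file then becomes one row of 257 counts, to which the file-size feature is appended.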

After that we do simple column normalization. Now, for each file, we have the file size and the bag-of-words counts as features. Let's see how this featurization helps.

In this way we perform feature extraction on the .bytes files by converting them into text and applying a bag-of-words technique.

Let's visualize this data using t-SNE, coloured by class label.
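A sketch of the t-SNE step, assuming `X` holds the (file-size + unigram) feature matrix and `y` the class labels:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to display the plot
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(X: np.ndarray, y: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    """Project the features to 2-D with t-SNE and colour points by class."""
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42).fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
    plt.title("t-SNE of .bytes features")
    return emb
```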

We can see that our features are somewhat useful because the classes form reasonably distinct groups. To be really sure whether the features are useful, we have to actually build models. Remember, we are not using the .asm files; we are only using the .bytes files for this analysis.

4. Train, Test and Cross-validation

We divide our data randomly into train (64%), test (20%) and cross-validation (16%) sets. Let's check how the classes are distributed across train, test and CV.

Since we split the data randomly, the class distributions come out roughly equal across train, test and cross-validation.
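The 64/16/20 split above can be produced with two calls to train_test_split (a sketch; the original post did the split with its own code):

```python
from sklearn.model_selection import train_test_split

def train_cv_test_split(X, y, seed: int = 42):
    """64% train / 16% CV / 20% test, done with two random splits."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    # 20% of the remaining 80% is 16% of the whole dataset.
    X_train, X_cv, y_train, y_cv = train_test_split(
        X_rest, y_rest, test_size=0.20, random_state=seed)
    return X_train, X_cv, X_test, y_train, y_cv, y_test
```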

5. Modeling

Here we model only on the .bytes files. We start with a random model to establish the worst-case log-loss, then try other models such as K-NN and Logistic Regression, compare the log-loss of each, and decide whether a model is good; the random model's log-loss acts as the threshold.

In general, multi-class log-loss has a minimum value of 0 and a maximum of infinity, but we need to know the worst log-loss a model could realistically give.

To find the worst log-loss we use a random model: for each data point it outputs a random probability vector over the classes that sums to one.
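A sketch of that baseline: generate random probability rows, normalize each to sum to 1, and score them with log-loss.

```python
import numpy as np
from sklearn.metrics import log_loss

def random_model_log_loss(y_true, n_classes: int = 9, seed: int = 42) -> float:
    """Log-loss of a model that outputs random probabilities summing to 1."""
    rng = np.random.default_rng(seed)
    probs = rng.random((len(y_true), n_classes))
    probs /= probs.sum(axis=1, keepdims=True)  # each row now sums to 1
    return log_loss(y_true, probs, labels=list(range(1, n_classes + 1)))
```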

For our data this random model gives a log-loss of 2.485 with 88.5% of points misclassified. This is our baseline: if a model gives a log-loss lower than the random model's, we call it a good model.

K-NN Model:

Now let's train a K-NN model, using a calibrated classifier to get well-behaved probability values. We have to find the best value of K, and we use cross-validation to pick it.
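A sketch of that procedure with scikit-learn (the candidate `k_values` here are an assumption, not the post's exact grid):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

def best_knn(X_train, y_train, X_cv, y_cv, k_values=(1, 3, 5, 11, 15)):
    """Pick K by cross-validation log-loss; wrap K-NN in a calibrated
    classifier so predict_proba gives well-behaved probabilities."""
    best = None
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        clf = CalibratedClassifierCV(knn, method="sigmoid", cv=3)
        clf.fit(X_train, y_train)
        loss = log_loss(y_cv, clf.predict_proba(X_cv), labels=clf.classes_)
        if best is None or loss < best[0]:
            best = (loss, k, clf)
    return best  # (cv log-loss, best k, fitted calibrated model)
```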

For the best K, the train log-loss is 0.0782, the cross-validation log-loss is 0.0225 and the test log-loss is 0.24, with 4.5% of points misclassified. Compared with the random model, K-NN is doing well, but the test log-loss is much higher than the train log-loss, which suggests the model is somewhat overfitting.

Precision and recall matrices for the K-NN model.

Our model is slightly confused on class 5; we will see whether other models can fix this. Otherwise it works well on the remaining classes.

Logistic regression.

Logistic regression is a classification technique used extensively for categorical data. We use SGDClassifier; as long as our loss is log loss, it acts as logistic regression, and its hyperparameter is alpha (the regularization strength, playing the role of lambda in logistic regression). To find the best alpha we perform cross-validation.

Precision matrix for the logistic regression model.

Here we can see that the log-loss is high: 0.49 on train, 0.54 on test and 0.54 on cross-validation. The precision matrix also shows that no points are predicted for class 5. Our logistic regression is not performing well compared to K-NN.

K-NN gives a lower log-loss than Logistic Regression, so K-NN is the better choice for modeling in this use case.

Conclusion:

From this dataset, we built a model that minimizes misclassified points and predicts which of the 9 malware classes a given input file belongs to. Based on that classification, we can judge whether a given file is malware.
