Malware Detection With Convolutional Neural Networks in Python
Learn the basics of artificial neural network architectures and how to use Convolutional Neural Networks (CNNs) to help malware analysts and information security professionals detect and classify malicious code.
In this post, we will learn about artificial neural network architectures and how to use one of them, the Convolutional Neural Network (CNN), to help malware analysts and information security professionals detect and classify malicious code.
Malware is a nightmare for every modern organization. Attackers and cybercriminals are always coming up with new malicious software to attack their targets. Security vendors do their best to defend against malware attacks but, with millions of new malware samples discovered every month, they cannot keep up. Thus, novel approaches such as deep learning are needed.
Before diving into the technical details and the practical implementation of the deep learning (DL) method, it is essential to get to know the different architectures of artificial neural networks. The major ones are discussed below.
This excerpt is taken from the book Mastering Machine Learning for Penetration Testing by Packt Publishing. This book teaches you extensive skills to become a master at penetration testing using machine learning with Python.
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are a deep learning approach to image classification and, more broadly, to computer vision problems. Classic computer programs struggle to identify objects in images for many reasons, including lighting, viewpoint, deformation, and segmentation.
This technique is inspired by how vision works in animals, in particular by the visual cortex. CNNs are arranged in three-dimensional structures characterized by width, height, and depth. In the case of images, the height is the image height, the width is the image width, and the depth is the number of color channels (three for RGB).
To build a CNN, we need three main types of layers:
Convolutional layer: The convolution operation extracts features from the input image by sliding a filter over it, multiplying the values in the filter with the pixel values underneath, and summing the products
Pooling layer: The pooling operation reduces the dimensionality of each feature map
Fully-connected layer: The fully-connected layer is a classic multi-layer perceptron with a softmax activation function in the output layer
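To make the first two layer types concrete, here is a minimal pure-Python sketch of a convolution followed by max pooling. The 5x5 image, the 2x2 filter, and the helper names convolve2d and max_pool are made up for illustration; a real CNN learns its filter values during training and relies on a library such as Keras.

```python
# Toy 5x5 grayscale "image" and a 2x2 filter (values chosen for illustration only).
image = [
    [1, 2, 0, 1, 2],
    [0, 1, 3, 1, 0],
    [2, 1, 0, 0, 1],
    [1, 0, 1, 2, 3],
    [0, 2, 1, 1, 0],
]
kernel = [
    [1, 0],
    [0, -1],
]

def convolve2d(img, k):
    """Slide the filter over the image (stride 1, no padding) and sum the products.

    As in most deep learning libraries, the filter is not flipped.
    """
    kh, kw = len(k), len(k[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [
        [
            sum(img[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

feature_map = convolve2d(image, kernel)  # 4x4 feature map
pooled = max_pool(feature_map)           # 2x2 map after pooling
```

The convolution turns the 5x5 input into a 4x4 feature map, and the pooling step halves each dimension again, which is the dimensionality reduction described above.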
To implement a CNN with Python, you can use the following Python script:
import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import to_categorical
from keras import backend
# Channels-first layout, so each input has shape (channels, height, width)
backend.set_image_data_format('channels_first')
num_classes = 10  # MNIST has ten digit classes
model = Sequential()
# The ReLU activations, the loss, and the optimizer below are assumptions
model.add(Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
# Softmax output layer with one unit per class
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are artificial neural networks that make use of sequential information, such as sentences. In other words, RNNs perform the same task for every element of a sequence, with the output depending on the previous computations. RNNs are widely used in language modeling and text generation, as well as in machine translation, speech recognition, and many other applications. However, RNNs do not remember things for a long time.
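The idea of performing the same task for every element while depending on previous computations can be sketched with a single scalar recurrent unit. The weights w_x, w_h, and b are arbitrary values picked for illustration, not trained parameters.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same update to every element, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The same values in a different order lead to different hidden states
states_a = rnn_forward([1.0, 0.0, 0.0])
states_b = rnn_forward([0.0, 0.0, 1.0])
```

In states_a, the trace left by the first input shrinks at every later step, which is a small-scale picture of the short memory mentioned above.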
Long Short Term Memory networks
Long Short Term Memory (LSTM) networks address the short-memory issue of recurrent neural networks by building a memory block. This block is sometimes called a memory cell.
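As a rough illustration of how a memory cell holds on to information, the sketch below runs one scalar LSTM cell over a short sequence. It uses the standard gate equations, but all parameters in params are hypothetical values picked by hand so that the forget gate stays close to 1; a trained network would learn them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds hand-picked, hypothetical parameters."""
    f = sigmoid(p["wf_x"] * x + p["wf_h"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi_x"] * x + p["wi_h"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo_x"] * x + p["wo_h"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg_x"] * x + p["wg_h"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g  # memory cell: keep old content, add new content
    h = o * math.tanh(c)    # hidden state passed to the next step
    return h, c

params = {
    "wf_x": 0.0, "wf_h": 0.0, "bf": 5.0,    # forget gate stays close to 1
    "wi_x": 10.0, "wi_h": 0.0, "bi": -5.0,  # input gate opens only when x is near 1
    "wo_x": 0.0, "wo_h": 0.0, "bo": 0.0,
    "wg_x": 2.0, "wg_h": 0.0, "bg": 0.0,
}

h, c = 0.0, 0.0
cell_states = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    cell_states.append(c)
```

With these settings the cell state written at the first step is still mostly intact four steps later, because the forget gate multiplies it by a value close to 1 at each step.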
Hopfield networks
Hopfield networks were developed by John Hopfield in 1982. The main goal of Hopfield networks is auto-association and optimization. We have two categories of Hopfield network: discrete and continuous.
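Auto-association means that the network recovers a stored pattern from a corrupted copy of it. The following is a minimal sketch of a discrete Hopfield network with Hebbian weights and synchronous updates (a simplification, since updates are often done one unit at a time). The function names and the six-unit pattern are illustrative assumptions.

```python
def train_hopfield(patterns):
    """Build the weight matrix with the Hebbian rule; units have no self-connections."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights

def recall(weights, state, steps=5):
    """Update all units together until the state stops changing or steps run out."""
    state = list(state)
    for _ in range(steps):
        new_state = []
        for i, row in enumerate(weights):
            activation = sum(w * s for w, s in zip(row, state))
            if activation > 0:
                new_state.append(1)
            elif activation < 0:
                new_state.append(-1)
            else:
                new_state.append(state[i])  # tie: keep the current value
        if new_state == state:
            break
        state = new_state
    return state

stored = [1, -1, 1, -1, 1, -1]       # bipolar pattern to memorize
weights = train_hopfield([stored])
noisy = [1, -1, -1, -1, 1, -1]       # same pattern with the third unit flipped
recovered = recall(weights, noisy)
```

Here recovered equals the stored pattern: the flipped unit is pulled back by the other five units.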
Boltzmann Machine Networks
Boltzmann machine networks use recurrent structures and only locally available information. They were developed by Geoffrey Hinton and Terry Sejnowski in 1985. The goal of a Boltzmann machine is also to optimize solutions.
Malware Detection With CNNs
For this new model, we are going to discover how to build a malware classifier with CNNs. But I bet you are wondering how we can do that when CNNs take images as inputs. The answer is really simple: the trick is to convert the malware into an image. Is this possible? Yes, it is. Malware visualization has been an active research topic over the past few years. One of the proposed solutions comes from a research study called Malware Images: Visualization and Automatic Classification by Lakshmanan Nataraj from the Vision Research Lab, University of California, Santa Barbara.
The following diagram details how to convert malware into an image:
The following is an image of the Alueron.gen!J malware:
This technique also gives us the ability to visualize malware sections in a detailed way:
Once malware can be fed as images to machine learning classifiers built on CNNs, information security professionals can use the power of CNNs to train models. One of the malware datasets most often used to feed CNNs is the Malimg dataset. It contains 9,339 malware samples from 25 different malware families. You can download it from Kaggle.
These are the malware families:
Allaple.L
Allaple.A
Yuner.A
Lolyda.AA 1
Lolyda.AA 2
Lolyda.AA 3
C2Lop.P
C2Lop.gen!G
Instantaccess
Swizzor.gen!I
Swizzor.gen!E
VB.AT
Fakerean
Alueron.gen!J
Malex.gen!J
Lolyda.AT
Adialer.C
Wintrim.BX
Dialplatform.B
Dontovo.A
Obfuscator.AD
Agent.FYI
Autorun.K
Rbot!gen
Skintrim.N
After converting malware into grayscale images, you get the following malware representations, which you can use later to feed the machine learning model:
The conversion of each malware sample to a grayscale image can be done using the following Python script:
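import os
import array
import numpy
from PIL import Image  # Pillow writes the PNG; scipy.misc.imsave is no longer available in SciPy
filename = '<Malware_File_Name_Here>'
width = 256  # fixed image width in pixels
ln = os.path.getsize(filename)
rem = ln % width  # trailing bytes that do not fill a whole row
a = array.array("B")  # array of unsigned bytes
with open(filename, 'rb') as f:
    a.fromfile(f, ln - rem)
# Integer division keeps the row count an int under Python 3
g = numpy.reshape(a, (len(a) // width, width))
g = numpy.uint8(g)
Image.fromarray(g).save('<Malware_File_Name_Here>.png')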
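To see what the truncation and reshape steps do, here is a small pure-Python sketch that runs on an in-memory byte string instead of a file. The helper name bytes_to_rows and the width of 4 are assumptions chosen to keep the example readable; the script above uses a width of 256.

```python
def bytes_to_rows(data, width):
    """Drop trailing bytes that do not fill a row, then split into rows of pixel values."""
    usable = len(data) - (len(data) % width)
    return [list(data[i:i + width]) for i in range(0, usable, width)]

sample = bytes(range(10))        # stand-in for the raw bytes of a binary
rows = bytes_to_rows(sample, 4)  # two full rows; the last two bytes are dropped
```

Each byte becomes one pixel with an intensity between 0 and 255, and each row of the result becomes one row of the grayscale image.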
For feature selection, you can extract or use any image characteristics, such as texture patterns, frequencies in the image, intensity, or color features, using techniques such as Euclidean distance or mean and standard deviation, to generate feature vectors later. In our case, we can use algorithms such as a color layout descriptor, a homogeneous texture descriptor, or global image descriptors (GIST). Let's suppose that we selected GIST; pyleargist is a great Python library to compute it. To install it, use pip as usual:
# pip install pyleargist=.0.1
As a use case, to compute a GIST, you can use the following Python script:
from PIL import Image
import leargist
image = Image.open('<Image_Name_Here>.png')
New_im = image.resize((64, 64))
des = leargist.color_gist(New_im)
Feature_Vector = des[0:320]
Here, 320 refers to the first 320 values of the descriptor, which are the ones we keep because we are using grayscale images. Don't forget to save the feature vectors as NumPy arrays so that you can use them later to train the model.
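GIST is only one option. As a much simpler illustration of turning an image into a feature vector, the mean and standard deviation of the pixel intensities mentioned earlier can be computed with the standard library. The function name intensity_features and the 2x2 toy image are assumptions for this sketch, and two numbers are of course a far weaker descriptor than GIST.

```python
import statistics

def intensity_features(rows):
    """Return [mean, population standard deviation] of all pixel intensities."""
    pixels = [value for row in rows for value in row]
    return [statistics.fmean(pixels), statistics.pstdev(pixels)]

toy_image = [
    [0, 255],
    [255, 0],
]
feature_vector = intensity_features(toy_image)
```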
After getting the feature vectors, we can train many different models, including SVM, k-means, and artificial neural networks. One particularly useful choice is the CNN.
Once the feature selection and engineering are done, we can build a CNN. For our model, for example, we will build a convolutional network with two convolutional layers and 32 * 32 inputs. To build the model using Python libraries, we can implement it with the previously installed TensorFlow and utils libraries.
So, the overall CNN architecture will be as in the following diagram:
This CNN architecture is not the only way to build the model, but it is the one we are going to use for the implementation.
To build the model and CNN in general, I highly recommend Keras. The required imports are the following:
import keras
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import BatchNormalization
from keras.layers import LeakyReLU
As we discussed before, a grayscale image has pixel values that range from 0 to 255, and we need to feed the network with images of dimension 32 * 32 * 1. To do so, reshape the data as follows:
train_X = train_X.reshape(-1, 32,32, 1)
test_X = test_X.reshape(-1, 32,32, 1)
We will train our network with these parameters:
batch_size = 64
epochs = 20
num_classes = 25
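The categorical cross-entropy loss used for this model expects each label as a one-hot vector with num_classes entries, which is what the name test_Y_one_hot in the evaluation step refers to. The sketch below shows the encoding in plain Python; the function name one_hot and the label values are illustrative assumptions. For NumPy label arrays, keras.utils.to_categorical performs the same conversion.

```python
def one_hot(labels, num_classes):
    """Turn integer class indices into rows with a single 1 at the label's position."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        encoded.append(row)
    return encoded

num_classes = 25
example_labels = [0, 3, 24]  # hypothetical family indices
example_one_hot = one_hot(example_labels, num_classes)
```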
To build the architecture, with regards to its format, use the following:
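Malware_Model = Sequential()
Malware_Model.add(Conv2D(32, kernel_size=(3, 3), input_shape=(32, 32, 1)))
Malware_Model.add(LeakyReLU(0.1))
Malware_Model.add(MaxPooling2D(pool_size=(2, 2)))
Malware_Model.add(Conv2D(64, (3, 3)))
Malware_Model.add(LeakyReLU(0.1))
# Flatten the feature maps into a vector before the dense layers
Malware_Model.add(Flatten())
Malware_Model.add(Dense(1024))
Malware_Model.add(LeakyReLU(0.1))
Malware_Model.add(Dropout(0.4))
# Softmax output is an assumption, chosen to match the categorical cross-entropy loss
Malware_Model.add(Dense(num_classes, activation='softmax'))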
To compile the model, use the following:
Malware_Model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])
Fit and train the model:
Malware_Model.fit(train_X, train_label, batch_size=batch_size, epochs=epochs, validation_data=(valid_X, valid_label))
As you noticed, we are respecting the flow of training a neural network that was discussed in previous chapters. To evaluate the model, use the following code:
test_eval = Malware_Model.evaluate(test_X, test_Y_one_hot)
print('The accuracy of the Test is:', test_eval[1])
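The accuracy reported by the model is the share of samples whose most probable predicted class matches the true class. The following sketch computes it by hand on made-up softmax outputs for three samples and three classes; the helper names and all numbers are hypothetical.

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(probabilities, true_labels):
    """Fraction of samples whose most probable class equals the true label."""
    predicted = [argmax(row) for row in probabilities]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)

# Made-up softmax outputs for three samples and three classes
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
true_labels = [0, 2, 1]
acc = accuracy(probabilities, true_labels)  # two of the three predictions are correct
```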
Promises and Challenges in Applying Deep Learning to Malware Detection
Many different deep network architectures were proposed by machine learning practitioners and malware analysts to detect both known and unknown malware; some of the proposed architectures include restricted Boltzmann machines and hybrid methods.
New approaches to detecting malware show many promising results. However, malware analysts face many challenges when detecting malware with deep learning networks, especially when analyzing PE files. To analyze a PE file, we take each byte as an input unit, so we have to classify sequences with millions of steps. In addition, we need to keep track of complicated spatial correlations across functions that arise from function calls and jump commands.
You just read an excerpt from the book Mastering Machine Learning for Penetration Testing written by Chiheb Chebbi and published by Packt Publishing.
We just discovered how to build malware detectors using different machine learning algorithms, especially using the power of deep learning techniques.
Interested in reading more? Here’s how you can learn how to detect botnets by building and developing robust intelligent systems.