Download the source code for Predator here
Without data there is no machine learning
Without data there is no machine learning. That is because ML is about finding "structure" in data; it really is that simple. This also means that you can never get better results from ML than your data allows. A bad model choice will often still provide decent results if the data is good, but with bad data no model will ever provide good results. So while this blog post might be a little boring, its subject is of significant importance.
I used dummy data for my two previous ML posts, which focused more on what ML methods are and what they can do, and less on the data. Now I'm faced with real data, and this blog post is about how I'll treat my data before I even get to think about models and methods for finding structure. Ironically, these steps are rarely considered in a generalized fashion in the literature I've read; usually it goes straight to the methods for finding structure. I suppose this is because what you actually need to do depends heavily on what kind of data you're working with. This means, of course, that this blog post will not be well founded in the scientific literature, but rather in my experience – but trust me, I'm good with data. If you have no prior knowledge and haven't read the first two blog posts on machine learning with malware, you might wish to read those first before proceeding.
In my mind, the process of doing econometrics – or in this case ML – runs through the stages below. Unlike with a shopping list, you'll often find yourself going back and improving previous steps as you become acquainted with the data. The steps as I see them are:
· Data acquisition: getting raw data
· Feature engineering: getting from raw data to a data set (observations)
· Preprocessing: getting from the data set to input data
· Model selection: getting from input data to information
· Model analysis: analyzing results and becoming aware of problems with your model
This blog post will consider only the first three steps. In real-life applications the distinction between the steps is often much less clear than my list above suggests. Nevertheless, I find the distinction useful when working with the data. When I'm all done I'll have a data set that you could, in theory, use for machine learning on malware – arguably a very crude data set, but that's the way it'll always be given my time constraints.
Data acquisition
When getting raw data you need to think ahead. For questionnaire-style data acquisition it's obvious that you get different results if you ask about "Obamacare" as opposed to "death panels". It's less obvious in a malware setting, but no less important. Say you collect your raw malware data through honeypots. The type of malware you collect will probably be influenced by whether your honeypots are placed on servers of "advanced weapons technology" companies or on Aunt Emma's home PC. The first is far more likely to see custom-built and highly sophisticated attacks than the second, and much less likely to fall prey to rapidly spreading email worms. If you're trying to deduce general things about malware from this data, you'll be biased in either scenario, because the data already reflects different realities. This bias is called selection bias, and we'll visit it again below. So beware of your data: ML is data driven, and it's very much a garbage-in, garbage-out discipline. In my case I didn't have honeypots, but a password to a collection from which I could download some malware. I suspect that the chance of a sample making it into this collection is much higher for Aunt Emma-style malware than for sophisticated attacks.
Feature engineering
In all ML, except "deep learning", finding the structure of the data requires us to specify the features of the data. You don't just stuff an executable into a machine and out comes whether it's malicious or not. In my first blog post I stuffed the section of the entry point into Bayes, and out came an estimate of whether the file was packed or not. The "section of the entry point" is usually called a feature of the data, and getting from raw data to features is what feature engineering is about. Sometimes features are obvious; unfortunately, that is not so in malware research. Here we'll have to think hard about what we believe differs between malware and non-malware (at least if that's what we're examining with ML). The good thing is that we don't need to care about the actual relationships between the features; we just need to extract the features we think are relevant. Figuring out relationships in data is the job of the model or algorithm. Extracting too much is better than too little: ML has ways of dropping what is superfluous in data-driven ways, instead of by ad hoc assumptions. When we start working with the data we might come to think of a feature that wasn't part of the raw data we got, and that might send us back to acquire more data. As a student I got access to a large database of socio-economic indicators over time for 1% of the Danish work force. I started analyzing unemployment and found that I had to go back and find data on the business cycle, because business cycles are fairly important in the labour market. Once you start working with the data things often become clearer, and you need to be open to reworking your decisions again and again. As you may recognize from the experience I just mentioned, feature extraction is a job for people who understand the subject at hand. This is a good reason why being malware knowledgeable cannot be replaced by machine learning – though leveraging machine learning can be extremely powerful.
Preprocessing
Usually there are significant advantages to preprocessing data before feeding it into a model. When machine learning is used to read handwritten postal addresses, classic image processing – which as such has nothing to do with machine learning – is often applied first. For example, CCD noise is removed, the picture of each letter is scaled to the same size in pixels, and if there is a slight tilt in the handwriting the pictures might be rotated, etc. Some ML methods also place requirements on their input. If you wish to do, say, principal component analysis on the data, you should at least consider normalizing each feature to mean 0 and variance 1, whereas, say, an ordinary least squares model is much more forgiving.
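To make that normalization concrete, here is a minimal sketch (my own illustration, not code from Predator) that standardizes one feature column to mean 0 and variance 1:

```cpp
#include <cmath>
#include <vector>

// Standardize one feature column to mean 0 and variance 1,
// the normalization often recommended before PCA.
void standardize(std::vector<double>& feature) {
    if (feature.size() < 2) return;

    double mean = 0.0;
    for (double x : feature) mean += x;
    mean /= feature.size();

    double var = 0.0;
    for (double x : feature) var += (x - mean) * (x - mean);
    var /= feature.size();

    if (var == 0.0) return; // constant feature, nothing to scale
    const double sd = std::sqrt(var);
    for (double& x : feature) x = (x - mean) / sd;
}
```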
The distinction between preprocessing and feature extraction is not always clear. In the same way, when reading out features from malware it often makes good sense to preprocess. For example, one thing we'd wish to look at when trying to determine whether an executable is malware could be the protection characteristics of the sections in the executable. In the Portable Executable (PE) format that's just a 32-bit bit field for each section. First off, no simple model will make much use of such a bit field as input: it is grammar, not semantics. But the "writeable" bit might be very interesting. Is the feature the characteristics bit field, or is it the separate bits? Further, we might consider combinations of bits to be features: a section that is both executable and writeable might seem like an insecure combination, whereas observing either bit on its own in an executable would be considered perfectly normal. We might take these things into account when extracting the features, but often we'll have to go back and rework the data before we can proceed with a model.
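As a concrete sketch of that decomposition (my own illustration, not Predator's actual extraction code; the constant values are from the PE/COFF spec, redefined here so the example is self-contained):

```cpp
#include <cstdint>

// Section characteristics flags from the PE/COFF spec (winnt.h).
constexpr uint32_t kScnMemExecute = 0x20000000; // IMAGE_SCN_MEM_EXECUTE
constexpr uint32_t kScnMemWrite   = 0x80000000; // IMAGE_SCN_MEM_WRITE

struct SectionFlags {
    bool executable;
    bool writeable;
    bool writeExecute; // the suspicious combination discussed above
};

// Decompose the raw 32-bit characteristics bit field into separate
// boolean features that a simple model can actually use.
SectionFlags decomposeCharacteristics(uint32_t characteristics) {
    SectionFlags f;
    f.executable   = (characteristics & kScnMemExecute) != 0;
    f.writeable    = (characteristics & kScnMemWrite) != 0;
    f.writeExecute = f.executable && f.writeable;
    return f;
}
```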
My recommendation is to do the obvious preprocessing early on, as it'll save work. For any preprocessing that is less than obvious, make a mental note and return to it once you get to know your data and a picture starts to emerge of where you wish to go with it. For instance, don't normalize your data until you know whether it is an advantage in the model you wish to use – and even then, analyzing the raw data beforehand might yield insights, even if the end goal is a PCA. I once estimated a fairly advanced model where I was trying to predict business cycles. In my first attempt I used something close to raw quarterly data for my variables. After hours of waiting for my model to converge, I had a model that was able to perfectly predict at what time of year Christmas sales picked up. I had done my preprocessing poorly, and my model picked up Christmas sales instead of business cycles. Had I preprocessed my data to correct for seasonal effects, I would've saved myself a lot of trouble.
So much for theory
Data acquisition for me was, as stated, just plain downloading of virus_share_0000.zip (and obtaining access – thanks to those who made this possible). This archive contains around 130,000 malware files in some 20 gigabytes. 130,000 is a big number for ad hoc management. To further complicate things, the file names are only a hash with no mention of the original extension, which means there is no simple way to identify JavaScript, C source code, DLLs, EXEs, etc. Manual analysis is out of the question because of the amount of data. Enter "Predator". Predator is the stupid code name I had in my head, and as I didn't come up with a better name for the tool, it's now called Predator. Predator is a simple tool that extracts features from all files in a directory and organizes them. A simple text file or a spreadsheet seemed not quite up to the task of organization; a SQL database seemed the right way to go. Enter SQLite. I was up and running with the SQLite library in less than 20 minutes, including creating tables, adding observations and sending queries. Unfortunately, the SQLite interface is just bare bones, to put it politely. Initially I thought I'd just install the POCO library using biicode and use POCO's abstraction for SQLite. Then I thought: hey, I can write something up in a couple of hours and in the process learn something about SQLite, and that might prove beneficial down the road. So that's how that code came to be.
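For a sense of how bare bones the C interface is, here is a minimal sketch of opening a database, creating a table and inserting one observation. The table and column names are illustrative, not Predator's actual schema:

```cpp
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("predator.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    // One statement to create the table; errors ignored for brevity.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS samples ("
        "  filename TEXT, numberOfSections INTEGER, entropy REAL);",
        nullptr, nullptr, nullptr);

    // Prepared statement with bound parameters for one observation.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO samples VALUES (?1, ?2, ?3);", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "0a1b2c", -1, SQLITE_TRANSIENT); // dummy hash
    sqlite3_bind_int(stmt, 2, 4);
    sqlite3_bind_double(stmt, 3, 6.3);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}
```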
The second thing I needed was to figure out which features to extract to get to a data set. In step with keeping things simple, I went for features of the PE format. My logic was:
A) There are going to be plenty of PE files, as opposed to, say, PDF exploits.
B) I'm knowledgeable about the PE format, and PE features are relatively easy to extract.
C) They are very relevant in real-world malware.
Obviously I cannot extract PE features from a JavaScript or PDF-based exploit, so I needed to throw away anything without a PE header. A general rule in feature engineering is that throwing data away is bad unless you have strong reasons to do so, and if you do have such reasons you need to articulate them and consider the effects. This is because throwing away information can lead to selection bias – just imagine throwing out smokers in a lung cancer study. And don't forget that you won't be the first to have data problems, so the literature is full of methods for dealing with most issues in better ways than discarding information. However, I do have strong reasons: PE files are likely to be an entirely different kind of malware than JavaScript. They have different kinds of features, and they work in different ways to achieve mostly different things. In fact, I'd go so far as to say that the only thing all malware has in common is that it is malicious. The price I pay is that any result I get from now on is no longer general to malware, but general only to PE files. We might find it reasonable to cut down further at some point (how about analyzing viruses separately from adware?). But since at this point I have no a priori (data-based) reason to think they cannot be analyzed with the same features and the same methodology, in line with the reasoning above I keep everything for now. As I wrote at the beginning of this blog post, ML is more of an iterative process than working through a shopping list. The keep/discard test itself only requires two fields from the PE/COFF spec, as sketched below.
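Here is a minimal sketch of such a filter (my own illustration, not the actual Predator code; it assumes a little-endian host, which matches the PE format's own byte order):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Keep a file only if it has the DOS 'MZ' magic and a 'PE\0\0'
// signature at the offset stored at 0x3C in the DOS header.
bool looksLikePE(const uint8_t* file, size_t size) {
    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z')
        return false;

    uint32_t peOffset;
    std::memcpy(&peOffset, file + 0x3C, sizeof(peOffset));

    // Bounds check written so it cannot wrap around on hostile input.
    if (peOffset > size - 4)
        return false;
    return std::memcmp(file + peOffset, "PE\0\0", 4) == 0;
}
```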
The original idea was to write up a small PE parsing library based on my previous work. I did not have much luck with that. The PE files in the archive were off-spec, broken and shattered. After a few hours of trying to harden up my own code, I realized there was no way I could get both sufficient flexibility and stability on a time schedule like mine (remember, I try to stay under 24 hours of total time for a blog post, and I'd already spent time on SQL and my own PE stuff). So I decided to go for a finished library, despite not being keen on using too many libraries, especially in core functions: my blog should not descend into a "using tools" blog, but rather be about how stuff works. I considered going with Katja Hahn's PortEx Java library but dropped the idea because I'm not too strong in Java and didn't want to interface it with my C++ code, especially since I'm considering, for another blog post, adding features from the code utilizing my emulator (see the unpacking-using-emulation blog). Instead I went with PeLib by Sebastian Porst. I think I actually met Mr. Porst once, which, ironically, became the real reason for going with it. I know, sloppy, but hey… It's a very nice library (unfortunately it seems to be orphaned, as there have been no changes since 2005 or so), but nevertheless it too was not quite up to the task of working with these mangled files. In fact, it asserted on the very first file I parsed with it. So instead of just linking it, I took it into my own project and started fixing bugs as I went. Thus you'll find Mr. Porst's code in my project. Keep in mind that I was not trying to do good fixes for the bugs; I was trying to get the parsing done. Most bugs in the code were integer overflows. I did my best to keep every PE file in the sample, for the reasons I've discussed above. I might have excluded a few files, but I am certain that those files could not possibly be executed on any NT platform, so the selection bias from this point of view should be minimal.
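To illustrate the kind of integer overflow involved (my own illustration of the general pattern, not PeLib's actual code): a naive bounds check on a field from a mangled header can wrap around and pass, so it has to be rearranged so that no addition can overflow.

```cpp
#include <cstddef>
#include <cstdint>

// Is the byte range [offset, offset + size) inside the file?
bool rangeInFile(uint32_t offset, uint32_t size, size_t fileSize) {
    // Naive check: "offset + size <= fileSize" may wrap around for
    // hostile values and wrongly report the range as valid.
    // Overflow-safe version: no addition, so nothing can wrap.
    return offset <= fileSize && size <= fileSize - offset;
}
```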
The features I've selected are summarized in the table below, and I've sometimes tried to explain my motivation for choosing the feature:
Feature | Reason
Filename |
Machine | Maybe I'll wish to look at 32-bit and 64-bit EXEs separately at some point
NumberOfSections | Many sections can be an indicator of evil
MajorLinkerVersion | The linker version is often used as an "infected" marker by viruses
MinorLinkerVersion |
SizeOfCode |
SizeOfInitializedData |
SizeOfUninitializedData |
ImageBaseNonDefault | A non-default image base might be an indicator of being a DLL
EntropyOfEP | The entropy of the first 1000 bytes after the entry point might be an indicator of polymorphic encryption of a virus
EPAsText | See below
Hash | If it doesn't match the file name, the file is probably damaged
Entropy | The entropy of the entire file will indicate whether the file is encrypted or packed, as opposed to infected. It might of course be both, but that is another question
EPInFirstSection | This is normal in most executables; it's not quite as common for packed and infected files
EPeq0 | An entry point in the PE header is definitely an indicator that something evil is going on. Such files will, however, execute!
EPExecutable | A virus might forget to mark the entry point as executable; a linker won't. The EXE will still run on 32-bit
WriteExecuteSection | A writeable code section is definitely a sign of evil
NumberOfExports | Exports are an indicator of a DLL
ExportError | An error occurred during export parsing – unlikely for a real linker
NumberOfImports |
NumberOfImportDll |
IsDotNet | .NET might call for an entirely different kind of analysis
ImportError | Same as export error
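Since two of the features above are entropy-based, here is a minimal sketch of the standard Shannon entropy computation over a byte buffer (my own illustration; Predator's actual implementation may differ). It yields bits per byte: near 0 for constant data, approaching 8 for encrypted or packed data.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Shannon entropy of a byte buffer in bits per byte, range [0, 8].
double shannonEntropy(const uint8_t* data, size_t length) {
    if (length == 0) return 0.0;

    size_t counts[256] = {0};
    for (size_t i = 0; i < length; ++i)
        ++counts[data[i]];

    double entropy = 0.0;
    for (size_t b = 0; b < 256; ++b) {
        if (counts[b] == 0) continue;
        const double p = static_cast<double>(counts[b]) / length;
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```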
I ended up with 86,447 PE files in my database: 84,925 32-bit native code executables, 1,503 .NET executables and 19 of other architectures. It seems my malware collection is a bit dated, as 19 is a tiny number considering that it includes the 64-bit PEs. In selecting my features, my recollection of Peter Szor's excellent book was useful.
Resources:
Hahn, Katja: PortEx. https://github.com/katjahahn/PortEx
Szor, Peter: The Art of Computer Virus Research and Defense