Tuesday, April 21, 2015

Machine learning with Malware Part 1: Data

Download my database here
Download source code for predator here 

Without data there is no machine learning

Without data there is no machine learning. That is because ML is about finding "structure" in data; it really is that simple. This also means that you can never get better results from ML than your data allows. Often a bad model choice will provide decent results if the data is good, but with bad data no model will ever provide good results. Thus, while this blog post might be a little boring, its subject is of significant importance.
I've used dummy data for my two previous ML posts, which focused more on what ML methods are and what they can do and less on the data. Now I'm faced with real data, and this blog post is about how I'll treat my data before I even get to think about models and methods for finding structure. Ironically, these steps are rarely even considered in a generalized fashion in the literature I've read. Usually it goes straight to the methods for finding structure. I suppose this is because what you actually need to do depends heavily on what kind of data we are talking about. This of course means that this blog post will not be well founded in the scientific literature, but rather in my experience. But trust me, I’m good with data. If you have no prior knowledge and haven't read the first two blog posts on machine learning with malware, you might wish to read those first before proceeding.
In my mind I've always seen the process of doing econometrics, or in this case ML, as running through the stages below. Unlike with a shopping list, you'll often find yourself going back and improving previous steps as you become acquainted with the data. The steps in my mind are:
·         Data acquisition: Getting raw data
·         Feature engineering: Getting from raw data to a data set (observations)
·         Preprocessing: Getting from data set to input data
·         Model selection: Getting from input data to information
·         Model analysis: Analyze results and become aware of problems with your model

This blog post will consider only the first three steps. Also, the distinction between the steps is often much less clear in real-life applications than in my list above. Nevertheless, I think of the distinction as useful when working with the data. When I’m all done I’ll have a data set that you could in theory use for machine learning on malware; arguably a very crude data set, but that’s the way it’ll always be given my time constraints.

Data acquisition

When getting raw data you need to think ahead. For questionnaire-type data acquisition it's obvious that you get different results if you ask about "Obamacare" as opposed to "Death panels". It's less obvious in a malware setting, but no less important. Say you collect your raw malware data through honeypots. It'll probably influence the type of malware you collect if your honeypots are placed on servers in "Advanced weapons technology" companies as opposed to on aunt Emma's home PC. The former are far more likely to see custom-built and highly sophisticated attacks than the latter, and much less likely to fall prey to rapidly spreading email worms. If you're trying to deduce general things about malware from this data, you'll be biased in either scenario because the data already reflects different realities. This bias is called selection bias. We’ll visit selection bias again below. Thus beware of your data: ML is data driven and it’s very much a garbage in, garbage out discipline. In my case I didn't have honeypots, but a password to a collection where I could download some malware. Thus I suspect that the chance of a piece of malware making it into the collection is much higher for aunt Emma style malware than for sophisticated attacks.

Feature engineering

In all ML, except “deep learning”, finding the structure of the data requires us to specify the features of the data. You don't just stuff an executable into a machine and out comes whether it's malicious or not. In my first blog post I stuffed the section of the entry point into Bayes and out came an estimate of whether the file was packed or not. The "section of the entry point" is usually called a feature of the data, and getting from raw data to features is what feature engineering is about. Sometimes features are obvious; unfortunately that is not so in malware research. Here we'll have to think hard about what we think is different between malware and non-malware (at least if that's what we are examining with ML). The good thing is that we don't need to care about the actual relationship between the features; we just need to extract the features we think are relevant. This is because figuring out relationships in data is the job of the model or algorithm. Extracting too much is better than too little. ML has data-driven ways of dropping what is superfluous, instead of relying on ad hoc assumptions. When we start working with the data we might come to think of a feature that wasn't part of the raw data we got, and that might send us back to acquire more data. As a student I got access to a large database of socio-economic indicators over time for 1% of the Danish work force. I started analyzing unemployment and found out that I had to go back and find data on the business cycle, because business cycles are fairly important in the labour market. Once you start working with the data, things often become clearer and you need to be open to reworking your decisions again and again. As you may recognize from the experience I just mentioned, feature extraction is a job for people who understand the subject at hand. So this is a good reason why being knowledgeable about malware cannot be replaced by machine learning; however, leveraging machine learning can be extremely powerful.
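To make the "section of the entry point" feature concrete, here is a minimal sketch in Python. It is not the code from Predator; the `Section` type, the function name and the example addresses are made up for illustration, and a real parser would read these values from the PE section headers.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    # Hypothetical, simplified view of a PE section header.
    name: str
    virtual_address: int  # RVA where the section starts in memory
    virtual_size: int     # size of the section in memory


def entry_point_section(entry_rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Return the index of the section containing the entry point, or None."""
    for index, section in enumerate(sections):
        start = section.virtual_address
        end = start + section.virtual_size
        if start <= entry_rva < end:
            return index
    # The entry point lies outside every section, for example in the headers.
    return None


# Made-up section layout for demonstration.
sections = [
    Section(".text", 0x1000, 0x5000),
    Section(".data", 0x6000, 0x1000),
    Section(".rsrc", 0x7000, 0x2000),
]
in_first_section = entry_point_section(0x1234, sections)
in_last_section = entry_point_section(0x7800, sections)
in_headers = entry_point_section(0x0200, sections)
```

The returned index (or `None`) is the kind of value that could be stored as one column of an observation and handed to a model later.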

Preprocessing

Usually there are significant advantages in preprocessing data before entering it into a model. When trying to use machine learning to read handwritten postal addresses, classic image processing (which as such has nothing to do with machine learning) is often used. For example, CCD noise is often removed, the picture of each letter is made the same size (in pixels), and if there is a slight tilt in the handwriting the pictures might be rotated, etc. Also, some ML methods have requirements on the input in question. If you wish to do, say, principal component analysis on the data, you should at least consider normalizing the mean to 0 and the variance to 1 for each feature, whereas, say, an “ordinary least squares model” is much more forgiving. The distinction between preprocessing and feature extraction is not always clear. In the same way, when reading out features from malware it often makes good sense to preprocess. For example, a thing we'd wish to look at when trying to determine if an executable is malware or not could be the protection characteristics of the sections in the executable. In the Portable Executable (PE) format that's just a simple 32-bit bit field for each section. First off, no simple model will make much use of such a bit field as input; it is grammar, not semantics. But the "writeable" bit might be very interesting. Is the feature the characteristics bit field, or is it the separate bits? Further, we might consider combinations of bits to be features: executable and writable together might seem like an insecure combination, whereas observing either bit on its own in an executable would be considered perfectly normal. The thing is that we might take these things into account when extracting the features, but oftentimes we'll have to go back and work the data again before we can proceed with a model. My recommendation is to do obvious preprocessing early on, as it'll save work.
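The question of bit field versus separate bits can be sketched as follows. The three flag values are the memory protection flags from the PE section Characteristics field; the function name and the dictionary keys are only illustrative choices, not something taken from Predator.

```python
# Memory protection flags of the PE section Characteristics field.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def section_flag_features(characteristics: int) -> dict:
    """Split the raw bit field into separate boolean features."""
    executable = bool(characteristics & IMAGE_SCN_MEM_EXECUTE)
    readable = bool(characteristics & IMAGE_SCN_MEM_READ)
    writable = bool(characteristics & IMAGE_SCN_MEM_WRITE)
    return {
        "executable": executable,
        "readable": readable,
        "writable": writable,
        # Combination feature: each bit alone is normal, together they are suspicious.
        "writable_and_executable": executable and writable,
    }


# 0x60000020: contains code, executable, readable (a typical code section).
ordinary_code = section_flag_features(0x60000020)
# 0xE0000020: the same, but also writable.
writable_code = section_flag_features(0xE0000020)
```

A model gets far more out of four booleans like these than out of one opaque integer.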
For any preprocessing that is less than obvious, make a mental note and return to it once you get to know your data and a picture starts to emerge of where you wish to go with the data. For instance, don’t normalize your data until you know if it is an advantage in the model you wish to use, and even then analyzing the raw data beforehand might yield insights even if the end goal might be a PCA. I once estimated a fairly advanced model where I was trying to predict business cycles. In my first attempt I used something close to raw quarterly data for my variables. After hours of waiting for my model to converge I had a model that was able to perfectly predict at what time of year Christmas sales picked up. I had done my preprocessing poorly and my model picked up Christmas sales instead of business cycles. Had I preprocessed my data to correct for seasonal effects I would’ve saved myself a lot of trouble.
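For reference, normalizing a feature to mean 0 and variance 1 is a small operation. The sketch below uses only the Python standard library; the function name and the example numbers are made up, and whether you should apply it at all depends on the model, as discussed above.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale one feature column to mean 0 and variance 1 (z-scores)."""
    mean = fmean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        # A constant feature has no variance to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / spread for value in values]


section_counts = [3, 4, 5, 4, 9]  # made-up example values
scaled = standardize(section_counts)
```

Keeping the raw column around next to the scaled one costs little and leaves the door open for going back.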

So much for theory

Data acquisition for me was, as stated, just plain downloading virus_share_0000.zip (and obtaining access; thanks to those who made this possible). This archive contains around 130,000 malware files in some 20 gigabytes. 130,000 is a big number for ad hoc management. To further complicate things, the file names are only a hash with no mention of the original extension, which means there is no simple way to identify JavaScript, C source code, DLLs, EXEs etc. Manual analysis is out of the question because of the amount of data. Enter “Predator”. Predator is the stupid code name I had in my head, and as I didn’t figure out a better name for the tool, it’s now called Predator. Predator is a simple tool that’ll extract features from all files in a directory and organize them. A simple text file or a spreadsheet seemed not quite up to the task of organizing them. A SQL database seemed the right way to go. Enter SQLite. I was up and running with the SQLite library in less than 20 minutes, including creating tables, adding observations and sending queries. Unfortunately the SQLite interface is just bare bones, to put it politely. Initially I thought I'd just install the POCO library using biicode and use POCO's abstraction for SQLite. Then I thought, hey, I can write something up in a couple of hours and in the process learn something about SQLite, and that might prove beneficial down the road. So that's how that code came to be.
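The three operations mentioned above (creating tables, adding observations and sending queries) look roughly like this. This is a sketch using Python's built-in `sqlite3` module, not Predator's C++ wrapper, and the table name, column names and values are hypothetical.

```python
import sqlite3

# In-memory database for illustration; a file path would persist the data.
connection = sqlite3.connect(":memory:")

# Create a table with one row per file and one column per feature.
connection.execute(
    """
    CREATE TABLE pe_features (
        file_hash     TEXT PRIMARY KEY,  -- hypothetical column names
        section_count INTEGER,
        file_entropy  REAL,
        is_dotnet     INTEGER
    )
    """
)

# Add a few made-up observations.
observations = [
    ("hash-a", 3, 6.1, 0),
    ("hash-b", 9, 7.9, 0),
    ("hash-c", 4, 5.2, 1),
]
connection.executemany(
    "INSERT INTO pe_features VALUES (?, ?, ?, ?)", observations
)
connection.commit()

# Query: how many native files have unusually many sections?
(many_sections,) = connection.execute(
    "SELECT COUNT(*) FROM pe_features WHERE is_dotnet = 0 AND section_count > 8"
).fetchone()
```

With the features in a database, questions about subsets of the sample become one-line queries instead of ad hoc scripts.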

The second thing I needed to figure out was which features to extract to get to a data set. In step with keeping things simple, I went for features of the PE format. My logic was:
A) There are going to be plenty of PE files, as opposed to, say, PDF exploits
B) I'm knowledgeable in the PE format and PE features are relatively easy to extract
C) They are very relevant in real-world malware.
Obviously I cannot extract PE features from a JavaScript or PDF-based exploit. So I needed to throw away anything without a PE header. A general rule when working with feature engineering is that throwing data away is bad unless you have strong reasons to do so, and if you do have these strong reasons you need to articulate them and consider the effects. This is because throwing away information can lead to selection bias; just imagine throwing out smokers in a lung cancer study. And don’t forget you’ll not be the first to have data problems, so the literature is full of methods to deal with most issues in better ways than throwing information away. However, I do have strong reasons: PE files are likely to be an entirely different kind of malware than JavaScript files. They have different kinds of features; they work in different ways to achieve mostly different things. In fact I’d go so far as to say that the only thing all malware has in common is that it is malicious. The price I pay is that any result I get from now on is no longer general about malware but general only to PE files. We might find it reasonable to cut down further at some point (how about analyzing viruses separately from adware?). But at this point I have no a priori (data-based) reason to think they cannot be analyzed with the same features and the same methodology, so in line with the reasoning above I keep everything for now. As I wrote in the beginning of this blog post, ML is more of an iterative process than working through a shopping list.
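Deciding whether a file has a PE header at all can be sketched in a few lines. This is a minimal check under simplifying assumptions, not Predator's actual filter: it only looks for the "MZ" magic, follows the `e_lfanew` field at offset 0x3C and compares the signature found there. The function name and the synthetic test data are made up.

```python
import struct


def has_pe_header(data: bytes) -> bool:
    """Return True if the data has a DOS header pointing at a PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: 32-bit little-endian file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset > len(data) - 4:
        return False  # the offset points outside the file
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


# Tiny synthetic header, for demonstration purposes only.
fake_pe = bytearray(0x80)
fake_pe[:2] = b"MZ"
struct.pack_into("<I", fake_pe, 0x3C, 0x40)
fake_pe[0x40:0x44] = b"PE\x00\x00"

# Something that is clearly not an executable.
script = b"var x = 1; // not an executable" + b" " * 64
```

Passing this check says nothing about whether the rest of the file is well formed, which matters a great deal with mangled samples.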

The original idea was to write up a small PE parsing library based on my previous work. I did not have much luck with that. The PE files in the archive were off spec, broken and shattered. So after a few hours of trying to harden my own code I realized there was no way I could get both sufficient flexibility and stability on a time schedule like mine (remember, I try to stay under 24 hours total time for a blog post and I’d already spent time on SQL and my own PE stuff). Thus I decided to go for a finished library, despite not being keen on using too many libraries, especially in core functions, because my blog should not descend into a “using tools” blog, but rather be about how stuff works. I considered going with Katja Hahn’s PortEx Java library but dropped the idea because I’m not too strong in Java and didn’t want to interface it with my C++ code, especially since I’m considering adding features from the code, utilizing my emulator (see the Unpacking using emulation blog post), in another blog post. Instead I went with PeLib by Sebastian Porst. I think I actually met Mr. Porst once, which ironically became the real reason for going with it. I know, sloppy, but hey… It’s a very nice library (unfortunately it seems to be orphaned, as there have been no changes since 2005 or so), but nevertheless it too was not quite up to the task of working with these mangled files. In fact it asserted on the very first file I parsed with it. So instead of just linking it, I took it into my own project and started fixing bugs as I went. Thus you’ll find Mr. Porst’s code in my project. Now keep in mind that I was not trying to do good fixes for the bugs; I was trying to get the parsing done. Most bugs in the code were integer overflows. I did my best to keep every PE file in the sample for the reasons I’ve discussed above. I might have excluded a few files, but I am certain that those files could not possibly be executed on any NT platform. So I think the selection bias from this point of view should be minimal.
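To illustrate why integer overflows are such a common problem when parsing mangled PE files, here is a generic sketch of a bounds check that wraps around. It is not taken from PeLib or from my fixes; the function names and values are invented, and since Python integers do not overflow, the 32-bit arithmetic of C or C++ is emulated with a mask.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic as in C/C++


def naive_in_bounds(offset: int, size: int, file_size: int) -> bool:
    # The sum wraps around at 2**32, so huge values can slip through.
    return ((offset + size) & MASK32) <= file_size


def safe_in_bounds(offset: int, size: int, file_size: int) -> bool:
    # Compare without adding, so nothing can wrap around.
    return size <= file_size and offset <= file_size - size


file_size = 0x10000
offset, size = 0x2000, 0xFFFFF000  # made-up values from a mangled header

naive_result = naive_in_bounds(offset, size, file_size)
safe_result = safe_in_bounds(offset, size, file_size)
```

The naive check accepts the bogus section because the sum wraps to a small number; the rearranged check rejects it while still accepting ordinary values.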

The features I’ve selected are summarized in this table, and for some of them I’ve tried to explain my motivation behind choosing the feature:

·         Maybe I’ll wish to look at 32-bit and 64-bit EXEs separately at some point.
·         Many sections can be an indicator of evil.
·         Linker version is often used as an “infected” marker by viruses.
·         This might be used as an indicator for being a DLL.
·         The entropy of the first 1000 bytes after the entry point might be an indicator of polymorphic encryption by a virus.
·         See below.
·         If it doesn’t match the file name, the file is probably damaged.
·         The entropy of the entire file will indicate if the file is encrypted or packed as opposed to infected. It might of course be both, but that is another question.
·         This is normal in most executables. It’s not quite as common for packed and infected files.
·         An entry point in the PE header is definitely an indicator that something evil is going on. These files will execute, however!
·         A virus might forget to set the entry point as executable. A linker won’t. The EXE will run on 32-bit.
·         A writeable code section is definitely a sign of evil.
·         Exports are an indicator of a DLL.
·         An error occurred during export parsing. Unlikely for a real linker.
·         .NET might be an entirely different kind of analysis.
·         Same as export error.
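The two entropy features mentioned above can be computed along these lines. This is a sketch of plain Shannon entropy over byte values, not necessarily the exact calculation Predator performs. The function names are illustrative, and `entry_offset` is assumed to be the file offset of the entry point (the conversion from RVA to file offset is left out).

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant) up to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entry_point_entropy(data: bytes, entry_offset: int, window: int = 1000) -> float:
    """Entropy of the first `window` bytes starting at the entry point's file offset."""
    return shannon_entropy(data[entry_offset:entry_offset + window])
```

Encrypted or compressed data tends toward 8 bits per byte, while ordinary machine code and padding score noticeably lower, which is what makes the value usable as a feature.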

I ended up with 86,447 PE files in my database. They came as 84,925 32-bit native code files, 1,503 .NET executables and 19 of other architectures. It seems my malware collection is a bit dated, as 19 is a tiny number considering that it includes the 64-bit PEs. In selecting my features, my recollection of Peter Szor’s excellent book was useful.
SQLite. http://www.sqlite.org
Porst, Sebastian: PeLib. http://www.pelib.com
Hahn, Katja: PortEx. https://github.com/katjahahn/PortEx

Szor, Peter: The Art of Computer Virus Research and Defense
