Monday, June 29, 2015

Machine learning with malware part 2: Model selection


Unrelated background information:
I wish I'd gotten a lot further a lot sooner with this. Also, this blog post is only barely up to the standard I set for myself. Compressing so much statistics and machine learning into a blog post of reasonable length while avoiding excessive math is a terribly difficult thing to do. I hope it lifts the veil a bit on what machine learning is and how it can be leveraged in a malware setting, but if you really wish to get down with it, there can be no two opinions about it: you need a real textbook with all the rigor that requires.


It also took a long time because most of my family conspired to celebrate round birthdays, and climbing season started, which keeps me away from the computer. Finally, I have a feeling that the "Predator" from the Machine Learning Part 1 blog produced an insufficient number of features. So I spent a significant amount of time embedding my code emulator (from "Unpacking with primitive emulation") into Predator, at first thinking it would be easy. Emulating 80,000 malformed files, a significant number of which have missing DLLs etc., turned out to be way more time consuming than I imagined.

The blog post is here:
Machine learning with malware part 2: Model Selection

Thursday, June 18, 2015

LangSec opinions and a case INTO overflows


On an entirely different note: The next machine learning and malware post is almost finished...

A very short introduction to LangSec

Sassaman, Patterson & Bratus give a better and more formal introduction, and they are probably less boring than I am and definitely more accurate.
When I first read about LangSec (language security) I did as I always do when reading about a new topic: boil it down as much as I can. My distillate was "If you don't make errors, your code will be secure". It is a truism, and an annoying one at that, because it either asks the impossible or points fingers at you for being what you are - a human. That is in many ways still how I think of LangSec, but digging below the surface of this new computer science discipline is nevertheless very much worthwhile. In fact I think it poses one of the most promising approaches to one of the roots of the insecurity problem, which is of course why I bother to write about it. Though it is mostly defensive, anybody playing offense might benefit from the way of thinking.

I think the two most important insights I've found in LangSec are:
1)      User input should not always be treated as data. Once user input is unexpected it is no longer data; it's a program running on a computer made up of your software, your hardware and their state - a so-called "weird machine". That program is written in a language that runs on the weird machine, it does weird things, and weird is bad.
2)      Postel’s law that states that you should be “liberal in what data you accept, and strict in what you send”. Kill this law and replace it with “Be strict with user data”.
The reason why 1) is really important is that it translates a security problem, which is often diffuse, into a very well defined language-theoretic problem. Writing secure code becomes a question of developing a language that is well defined for any user data. And what I mean by "well defined" is a notion backed by standard theory. Awesome!
If 1) describes the way to think about information security, 2) is a large part of how you do it in practice. Being strict with input and output radically reduces the chance that we as coders will retreat to assumptions. Assumptions are the mother of all f-ups.

Despite not working in classical infosec, I've spent a significant part of my career exploiting that people made their protocols more complex than they had to be, or that those who implemented a protocol weren't particularly strict in interpreting it. As an example, I once developed a DVD video copy protection; though it isn't infosec, it's very much an exercise in utilizing that programmers had not taken 1) and 2) to heart. Part of that copy protection is just a denial-of-service (DoS) attack on ripping software. Three components made copy protection for video DVDs possible in the first place. The first is that the DVD Video specification has inconsistencies, undefined behavior and unnecessary flexibility; it is huge and confusing - the complete documentation takes up about two feet of shelf space. This almost surely made programmers of rippers and players offer me a weird machine to program in the first place. Secondly, neither DVD rippers nor players are strict with the input data. The third element is that rippers and players react differently to fringe input. The challenge then boils down to writing a program for the weird machine in rippers that'll cause a denial of service of some kind, while making sure that particular program does not disrupt players.

A lot of LangSec has focused on parsers (and the "reverse parser" that builds the data to be parsed), and this seems reasonable. With two shelf-feet of documentation, most of it written in the notoriously-difficult-to-parse language of human-readable English, errors are bound to be made when implementing it. LangSec has recommendations on how to improve the process of writing the documentation in the first place - for instance, replacing .h files with something that also describes the relations between fields. LangSec also has recommendations on how you should deal with implementing a parser, and this is something most coders should read up on and take to heart. It will significantly improve the security of software. It's a different angle of attack than the classic approach of leaving security to compilers, operating systems and hardware. Now, I'm a great fan of strcpy_s-type functions, ASLR, DEP, CFG, sandboxes etc. and all the approaches made in this spirit, but they obviously aren't sufficient for security.

A real life integer overflow programming error

Below I have listed the sanity checking of the Import Directory in Sebastian Porst's PeLib (ImportDirectory.h). I've chosen this as an example of a classic int overflow problem, for a couple of reasons. The first is that I'd stumbled upon it recently, so it was readily available to me. The second is that it's a pretty classic example of not taking 1) and 2) above to heart. The third is that it's written by somebody who was sensitive to security issues: Mr. Porst has made himself an impressive career in infosec, yet he made a mistake. I'm not arguing that Mr. Porst is a bad coder (quite the contrary). I'm arguing that if he made such a mistake, most programmers not only can but are likely to make this kind of mistake at one point or another. Before this leads to misunderstandings: I consider Mr. Porst's code generally of good quality, and Mr. Porst did indeed have good reasons to ignore 1) and 2) above - he wanted his lib to be able to repair broken PE files, which means he cannot be strict when parsing PEs.
So let's turn our attention to the code:
      

/**
       * Read an import directory from a file.
       * \todo Check if streams failed.
       * @param strFilename Name of the file which will be read.
       * @param uiOffset Offset of the import directory (see #PeLib::PeHeader::getIDImportRVA).
       * @param uiSize Size of the import directory (see #PeLib::PeHeader::getIDImportSize).
       * @param pehHeader A valid PE header.
       **/
1: int ImportDirectory<bits>::read(const std::string& strFilename, unsigned int
2:    uiOffset, unsigned int uiSize, const PeHeaderT<bits>& pehHeader)
3:{
4:     std::ifstream ifFile(strFilename.c_str(), std::ios_base::binary);
5:     if (!ifFile)
6:     {
7:            return ERROR_OPENING_FILE;
8:     }
9:          
10:    unsigned int uiFileSize = fileSize(ifFile);
11:         
12:    if (uiFileSize < uiOffset + uiSize)
13:    {
14:           return ERROR_INVALID_FILE;
15:    }
16: ifFile.seekg(uiOffset, std::ios_base::beg);
17: std::vector<unsigned char> vImportdirectory(uiSize);
18: ifFile.read(reinterpret_cast<char*>(&vImportdirectory[0]), uiSize);

19: PELIB_IMAGE_IMPORT_DIRECTORY<bits> iidCurr;
20: unsigned int uiDesccounter = 0;

21: InputBuffer inpBuffer(vImportdirectory);

22: std::vector<PELIB_IMAGE_IMPORT_DIRECTORY<bits> > vOldIidCurr;

23: do // Read and store all descriptors
24: {
25:    inpBuffer >> iidCurr.impdesc.OriginalFirstThunk;
26:    inpBuffer >> iidCurr.impdesc.TimeDateStamp;
27:    inpBuffer >> iidCurr.impdesc.ForwarderChain;
28:    inpBuffer >> iidCurr.impdesc.Name;
29:    inpBuffer >> iidCurr.impdesc.FirstThunk;
30:
31:    if (iidCurr.impdesc.OriginalFirstThunk != 0 || iidCurr.impdesc.TimeDateStamp !=
32:           0 || iidCurr.impdesc.ForwarderChain != 0 ||
33:          iidCurr.impdesc.Name != 0 || iidCurr.impdesc.FirstThunk != 0)
34:    {
35:           vOldIidCurr.push_back(iidCurr);
36:    }
37:
38:    uiDesccounter++;
39:
40:    if (uiSize < (uiDesccounter + 1) * PELIB_IMAGE_IMPORT_DESCRIPTOR::size()) break;
41: } while (iidCurr.impdesc.OriginalFirstThunk != 0 ||
42:    iidCurr.impdesc.TimeDateStamp != 0      ||
43:    iidCurr.impdesc.ForwarderChain != 0 ||
44:    iidCurr.impdesc.Name != 0 || iidCurr.impdesc.FirstThunk != 0);


        

Though there are a few layers of code above this, the "uiSize" and "uiOffset" parameters are essentially unverified user data (uiOffset is checked against 0, but there are no checks otherwise). We have the verification of the parameters in line 12. What Mr. Porst must have thought is pretty clear: if the sum of these two is bigger than the file size, it's wrong. What he forgot was that uiFileSize > uiOffset + uiSize if uiOffset = 0xFFFFFFFF and uiSize = 2, because of an unsigned integer overflow in the calculation[1]. In fact we can craft PE files with arbitrary values of uiOffset, and that is not expected. We are now programming a weird machine. In Mr. Porst's code, what we can do with our overflow error is fairly limited. We can cause a regular crash, but beyond that the code looks solid - see what happens if we use uiSize=0xFFFFFFFF and uiOffset=2. What would have happened if we changed lines 16, 17, 18 and 21 a wee bit, so that we read the entire file and just parse the right portion of it instead of reading only the import section:
16: ifFile.seekg(0, std::ios_base::beg);
17: std::vector<unsigned char> vImportdirectory(uiFileSize /*uiSize*/);
18: ifFile.read(reinterpret_cast<char*>(&vImportdirectory[0]), uiFileSize /*uiSize*/);
21: InputBuffer inpBuffer(vImportdirectory); inpBuffer.set(uiOffset);

In the well-behaved case everything remains functional, but we now have the potential to leak information. The point being: the sanity checking doesn't work. With uiSize > uiFileSize, and uiOffset making sure that the check in line 12 passes, we'd be able to read beyond the buffer allocated by the vector as much as we want. If some web service dumps the imports to the user using this function, we'd be able to dump heap content of the web service following the vector from line 17, and that might contain information not meant for anybody's eyes - which can be quite valuable to attackers; go google Heartbleed. If we had a write operation instead, we'd be writing arbitrary memory, and with a memory full of user data and lots of vtables lying around we'd have code execution in no time! It's pretty much standard exploit stuff - except we call it by a different name: programming the weird machine.

The LangSec perspective on this error
There are any number of ways to fix the problem above. For example, checking that the file read in line 18 succeeds would, in the unmodified case, stop the DoS from happening. You could easily do fixes of this type, and a great many developers do.
What LangSec suggests instead is that you honor what I've written as point 2). We should be strict. A check for uiSize < uiFileSize should be added, and a check for the overflow itself too. Both should abort parsing if they fail. That would solve the problem. Also, being strict, the check in line 40 should return an error too instead of proceeding. Even though you could probably still find DLL names etc., it's a breach of protocol, and aborting processing will minimize the risk of relying on assumptions that'll lead to another instance of a weird machine. Ideally you'd even go so far as to do that as part of sanity checking, before you start pushing values into other variables, say in line 35. Be sure the data is correct and you know exactly what to do with it before you use it.
If we step back to 1), what we need to notice is that in the malformed case, what happens becomes dependent on what's on the heap after the vector - that is, our code isn't well defined. We could use a theorem prover to check for this, at least in this case. I found the bug by running some 80,000 malware files through it, which I suppose would count as a kind of fuzzing. The key point is: if we first make sure any data gives a predictable outcome, even if that outcome means turning down the request for processing, we have written safe code.

The old school compiler solution for this bug

The x86 platform has always had a number of exception interrupts. I sometimes think of them in terms of old-school CPU developers thinking deeply about what could go wrong in a computer. Probably because the first exception I always think of happens to be division by zero, and that happens to be the first in the x86 list - the first I'd think about. In 5th place on the founding fathers' x86 list comes the "overflow" interrupt. It's triggered by the INTO instruction, which essentially checks if the overflow flag is set and, if so, causes an interrupt 4. In short, the CPU has since the dawn of time held the solution to the integer overflow problem: add eax, ebx; into. Done. Overflows no longer serve as "mov"-type instructions in a weird machine, but are always reduced to DoS - in fact a kind of DoS most developers know very well how to deal with, using structured exception handling. Unfortunately the INTO instruction is almost never generated by compilers. Even the jo/jno instructions wired to the overflow flag are hardly ever generated by modern compilers. All three are listed in Z0mbie's opcode frequency list with 0%, and that they are in there in the first place is more likely due to errors in Z0mbie's disassembler than because they actually see use. So this remains an illusion. To make it worse, operators on built-in integers cannot be overloaded in C++, so I can't even just make up an overflow-checking + operator for plain ints. I have no clue how many security breaches are the result of over/underflows in addition and subtraction, but it's probably not that uncommon. As we've seen above, it's an easy mistake to make, because the mathematical + which we all know and love turns out to behave differently than the x86 "add". And while I'd love to have an "overflow free" integer available in C++, the LangSec solution of doing things right seems like where we'd want to go. (Well, if I had a choice I'd do both.)

No bugs, no insecurity. Even if it's a truism.

Literature:

Sassaman, Patterson & Bratus: http://langsec.org/insecurity-theory.pdf
Z0mbie's opcode statistics: http://z0mbie.daemonlab.org/opcodes.html





[1]              There is another error in the code. uiSize < sizeof(PELIB_IMAGE_IMPORT_DIRECTORY<bits>) will lead to leaking information too. I'll not discuss that error here.

Speculation on rowhammer

Update:

27.07.2015: Daniel Gruss, Clementine Maurice and Stefan Mangard released a paper today detailing how row hammer can be done in JavaScript on Sandy Bridge, Ivy Bridge and Haswell. The method is indeed a pattern of reads in memory to evict the aggressor address from the cache. They utilize that a cache set only has 12 or 16 entries (ways), and thus (repeatedly) using 13 or 17 addresses that map to a single cache set will cause L3 cache misses on the 13th/17th access - instead of my experiments of keeping the entire cache full (which works like a charm, but seems too slow for row hammer). What they ended up with is certainly more advanced than what I've played with, but it's the same ball game. Thus the speculation in the old post below unfortunately holds true.

It's time to test your RAM! Disabling JavaScript will go a long way towards protecting you. Whitelist sites that you consider safe, and HTTPS is important to avoid MitM injection of JavaScript. However, these measures are insufficient on their own - other advanced scripting languages could probably be abused too (Flash, Silverlight, ...).

A bit more speculation: Apple made an update to their EFI BIOS to mitigate it. I speculate that they increased the refresh rate so that there is only 32 ms between refreshes instead of the usual 64 ms - the reason being that a 32 ms refresh rate is recommended for running at high temperature and is thus likely to be readily available. This is insufficient to entirely rule out row hammer, as you'd need to drop the interval as low as 8 ms to be safe. It's important to realize that less time between refreshes means speed penalties, because you cannot read during a refresh. For 8 ms intervals the penalty would be pretty steep, though it is probably acceptable for 32 ms. Also, more refreshes cause more power consumption. I have seen wildly differing estimates of this, but I speculate that on a running laptop it's not a real issue, and if implemented right (say, 64 ms interval refresh while the computer is sleeping) no issue at all.




The paper:

For the technically inclined, Mark Seaborn's blog is an awesome resource:



Original Post
It's cool to be a malware hobbyist. I can write purely speculative blogs. It's like being a comedian doing news - it's just comedy... And this blog is pretty speculative - sorry.



The speculation

In my last post on the subject, "Row hammer fix POC and generic root kit detection using performance counters", I wrote that I doubted we'd ever see a script version of this bug. My reasons were that scripts are unlikely to use clflush or any non-temporal (MOVNT) instructions in any predictable manner, and would probably be too slow to flush the cache by accessing it in a pattern the CPU wouldn't predict while still hammering quickly enough. The first thing that made me change my mind was when I stumbled upon this article: http://iss.oy.ne.ro/SpyInTheSandbox.pdf, which manipulates the cache and then uses it as a side channel to obtain information from outside a sandbox through JavaScript. This chipped away at my confidence. Then came this article https://xuanwulab.github.io/2015/06/09/Research-report-on-using-JIT-to-trigger-RowHammer/ which concludes that JIT compilers would not generate the instructions we needed for row hammering. This scared me, because I'd forgotten all about JIT compiling - while we might be fast enough without JIT, having JIT would definitely make it fast enough. And finally came a tweet from Lava (@lavados):

Yesterday @BloodyTangerine and I flipped bits on Ivy Bridge without clflush... #rowhammer

So the plot thickens. I don't know how BloodyTangerine and Lavados are flipping the bits, but if I were a betting man I'd place my money on an approach like Spy in the Sandbox. There is a bit of evidence in that direction: the cache is built differently on different chipsets, and this could be a reason to mention them. Conclusion: I was wrong. My new opinion is that we'll see a JavaScript row hammer exploit.
Full disclosure: I had a 280-char chat with Lava after I wrote this. That conversation is not part of this blog, because everything was said in private. I dare say that if this stuff interests you, it would probably be worthwhile following Lava's tweets.

My Experiments

I played around a bit with row hammer after my last blog. The reason row hammer doesn't occur naturally all the time is that the cache catches reads to any address being hammered. This is also why the performance counter on the last-level cache works well for detecting and preventing row hammering. To get around this, Dullien and Seaborn (the original row hammer work) used the clflush instruction. I tried to hammer using the MOVNTQ instruction and was not successful. Then, after reading "Spy in the Sandbox", I started writing a cache-flush routine using multiple threads, inspired by that paper. The idea behind using multiple threads is that the biggest cache on modern CPUs is the level 3 cache, which is shared between all cores (and hyper-threads), making it much easier to keep the cache full of content unrelated to the row I'm hammering. The 1st and 2nd level caches I don't consider too much of a problem, since they are quite small and could probably be kept full of unrelated stuff by a single thread. Unfortunately I never finished this code, so I'm not sure if it'd actually work for row hammer. The real issue isn't keeping my hammering row out of the cache, but doing it while hammering enough to cause bit flips.

Why row hammer in scripts would be really bad news

If Lavados and BloodyTangerine are indeed using the method I was playing with - or even a derivative of it - then it's really bad. Mark Seaborn's fix of blacklisting clflush in the validator of the NaCl sandbox would not extend, because now normal mov instructions would suffice. Even adc, cmp, sub, inc, ... would suffice, and these are common and useful instructions. Even worse, since browsers all too readily run JavaScript, row hammer could easily be extended from a local privilege elevation to a full remote breach of the attacked host. Worse yet, such an attack could be inserted automatically through a man-in-the-middle on non-HTTPS connections, like China's "Great Cannon", in a worst-case scenario. You might argue that ECC RAM would mitigate this, but that's only a half truth. ECC will most of the time reduce it to a DoS attack, but row hammer would from time to time flip more than 1 bit and dance right through the ECC on the RAM. It wouldn't be the perfect storm, because there is a tiny bit of good news, though not much: row hammer is difficult to weaponize, and we know how to defeat it - even if the method is far more complex than traditional fixes.

Literature:

xuanwulab (2015): https://xuanwulab.github.io/2015/06/09/Research-report-on-using-JIT-to-trigger-RowHammer/
Oren et al. (2015): http://iss.oy.ne.ro/SpyInTheSandbox.pdf
Fogh (2015): Row hammer fix POC and generic root kit detection using performance counters (This blog)

Tuesday, April 21, 2015

Anti-virus 1980ies style


This post was originally part of the "Machine Learning with malware Part 1: Data" post, but to avoid confusion I've made it a separate post. The reason is that it's only marginally about machine learning and I do not wish to confuse the exposition. Unfortunately you'd probably need to read at least the last section of the previous blog post for this one to make sense. Also, if you wish to play along with me, you definitely need the data from that post.

In the late eighties, viruses began to pop up in the wild. My first contact with a computer virus (which I had until then known only from magazines) was when I got my hands on a version of, I believe, IBM anti virus. It had a signature database which was a plain text file with, I think, some twenty entries. They looked like the hex strings you'd see in the hex editor of choice from that time, Norton Disk Edit. And by poking around (not really knowing what I was doing) I managed to make IBM anti virus identify Norton Disk Edit as a virus, by modifying a string to match what I'd extracted with NDE about itself. In those days viruses were really rare; there were only a few, and they were all carefully examined by malware researchers. To detect them, plain strings were picked from the virus, and if a file (or boot sector) contained the string it probably was that virus. Mutating code was not yet an issue. With an SQL database full of malware we can easily search for likely malware strings. For this reason, one of the features in my previous blog on data for malware machine learning was the bytes at the entry point of each file. Since I don't expect each file to be a unique malware, but rather multiple manifestations of a smaller number of malwares, I can now just take a look at how many entry point strings are common. Using an SQLite browser (download from SQLite.org) I can enter this command in the "Execute SQL" section after loading the database:

select EPAsText, count(EPAsText) as StrCount from MALWARE group by EPAsText order by StrCount

The most common entry point strings could be compiler generated. I couldn't tell without either comparing with non-malware binaries or looking at an actual disassembly, so for now I'll just talk about those that look suspicious. The first suspicious entry is FF 25 00 02 04 00 00 00 00... This I can disassemble in my head: it would be jmp [402000], which is a strange thing to do at an entry point. However, you quickly discover that these are simply .NET files. The next immediately suspicious one is this: EB 01 68 60 E8 00 00 00 00... Off the top of my head I recognize jmp $+3 and call $+5. Jmp $+3 is just outright weird, unless its purpose is obfuscation, which makes perfect sense in badly behaved code. Call $+5 is not weird at all: it's a classic way for a virus or other piece of code to push EIP to the stack, to find out where the loader actually put it in memory. It's quite common in viruses; I used it myself in my PE-executable packers, and I've never seen it in a well-behaved program. In short, I just found a signature string of either a virus or perhaps a packer. To minimize the chances of it being a packer we can do this:

select * from malware where EPASText like 'eb016860e8000%'

And whoopty, they all have an entropy between 6 and 8, which is normal for unpacked files. Obviously there are PE encrypters that use silly encryption schemes that keep entropy in that range - my first PE crypter is among them; see my "Unpacking with primitive emulation" blog for more on that. So there is good evidence that we just found a string signature for a virus. Looking at the number of imports, I'd say we just found a signature that matches two variants of the same virus. If you want to play more with strings, try looking for the call $+5:

select count(EPAsText) as MyCount, EPAstext from malware where EPASText like 
'%e800000000%' group by EPAsText order by MyCount

My guess is that this simple query finds signatures for about 5% of the sample, but I didn't care to check.
We could easily automate what I did above, but there are better ways. Obviously, if we had a database of non-malware executables we could use simple Bayes machine learning on these entry point strings (see my Machine Learning and Bayes blog - if you look in the source code for that blog you'll find that I had actually planned to do exactly this, without a database and not just with entry point strings, but dropped the idea because of the time I'd already spent). I have no idea how big a role strings play in the modern anti-virus business. My immediate guess would be that it's the method of choice for everything simple, which remains the bulk of malware in executables. The last 30 years have made the method more sophisticated, but the core probably remains exactly what I outlined here.

Literature:
Szor, Peter: The Art of Computer Virus Research and Defense


Machine learning with Malware Part 1: Data

Download my database here
Download source code for predator here 

Without data there is no machine learning

Without data there is no machine learning. That is because ML is about finding "structure" in data; it really is that simple. This also means that you can never get better results from ML than your data allows. Often a bad model choice will provide decent results if the data is good, but with bad data no model will ever provide good results. Thus, while this blog might be a little boring, its subject is of significant importance.
I've used dummy data for my two previous ML posts, which focused more on what ML methods are and what they can do, and less on the data. Now I'm faced with real data, and this blog post is about how I'll treat my data before I even get to think about models and methods for finding structure. Ironically, these steps are rarely even considered in a generalized fashion in the literature I've read; usually it goes straight to the methods for finding structure. I suppose this is because what you actually need to do depends heavily on what kind of data we are talking about. This means of course that this blog post will not be well founded in the scientific literature, but rather on my experience - but trust me, I'm good with data. If you have no prior knowledge and haven't read the first two blog posts on machine learning with malware, you might wish to read those first before proceeding.
In my mind I've always seen the process of doing econometrics, or in this case ML, as running over the stages below. Unlike a shopping list, you'll often find yourself going back and improving previous steps as you become acquainted with the data. The steps in my mind are:
·         Data acquisition: Getting Raw data
·         Feature engineering: Getting from raw data to a data set ( observations)
·         Preprocessing: Getting from data set to input data
·         Model selection: Getting from input data to information
·         Model analysis: Analyze results and become aware of problems with your model

This blog post will consider only the first three steps. Also, the distinction between the steps is, in real-life applications, often much less clear than my list above. Nevertheless I think the distinction is useful when working with the data. When I'm all done I'll have a data set that you could in theory use for machine learning on malware - arguably a very crude data set, but that's the way it'll always be given my time constraints.

Data acquisition

When getting raw data you need to think ahead. For questionnaire-type data acquisition it's obvious that you get different results if you ask about "Obamacare" as opposed to "death panels". It's less obvious in a malware setting, but no less important. Say you collect your raw malware data through honey pots. It'll probably influence the type of malware you collect whether your honey pots are placed on servers in "advanced weapons technology" companies or on aunt Emma's home PC. The first are far more likely to see custom-built and highly sophisticated attacks than the second, and much less likely to fall prey to rapidly spreading email worms. If you're trying to deduce general things about malware from this data, you'll be biased in either scenario, because the data already reflects different realities. This bias is called selection bias; we'll visit selection bias again below. Thus, beware of your data: ML is data driven and it's very much a garbage-in, garbage-out discipline. In my case I didn't have honey pots, but a password to a collection where I could download some malware. Thus I ponder that the chance of a malware making it into the collection is much higher for aunt-Emma-style malware than for sophisticated attacks.

Feature engineering

In all ML except "deep learning", finding the structure of the data requires us to specify the features of the data. You don't just stuff an executable into a machine and out comes whether it's malicious or not. In my first blog post I stuffed the section of the entry point into Bayes, and out came an estimate of whether the file was packed or not. The "section of the entry point" is usually called a feature of the data, and getting from raw data to features is what feature engineering is about. Sometimes features are obvious; unfortunately that is not so in malware research. Here we'll have to think hard about what we think is different between malware and non-malware (at least if that's what we are examining with ML). The good thing is that we don't need to care about the actual relationship between the features; we just need to extract the features we think are relevant. Figuring out relationships in data is the job of the model or algorithm. Extracting too much is better than too little: ML has ways of dropping what is too much in data-driven ways, instead of the ad hoc assumption style. When we start working with the data we might come to think of a feature that wasn't part of the raw data we got, and that might send us back to acquire more data. As a student I got access to a large database of socio-economic indicators over time for 1% of the Danish work force. I started analyzing unemployment and found out that I had to go back and find data on the business cycle, because business cycles are fairly important in the labour market. Once you start working with the data, things often become clearer, and you need to be open to reworking your decisions again and again. As you may recognize from the experience I just mentioned, feature extraction is a job for people who understand the subject at hand. This is a good reason why being malware-knowledgeable cannot be replaced by machine learning; however, leveraging machine learning can be extremely powerful.

 

Preprocessing

Usually there are significant advantages in preprocessing data before feeding it into a model. When machine learning is used to read handwritten postal addresses, classic image processing - which as such has nothing to do with machine learning - is often applied first: CCD noise is removed, the picture of each letter is scaled to the same size in pixels, and if there is a slight tilt in the handwriting the pictures might be rotated, etc. Some ML methods also place requirements on their input. If you wish to do, say, principal component analysis, you should at least consider normalizing each feature to mean 0 and variance 1, whereas an ordinary least squares model is much more forgiving. The distinction between preprocessing and feature extraction is not always clear. In the same way, when reading features out of malware it often makes sense to preprocess. For example, one thing we would wish to look at when trying to determine whether an executable is malware could be the protection characteristics of the sections in the executable. In the Portable Executable (PE) format that is just a 32-bit flag field for each section. No simple model will make much use of such a bit field as raw input - it is syntax, not semantics - but the "writeable" bit on its own might be very interesting. Is the feature the characteristics bit field, or the separate bits? Further, we might consider combinations of bits to be features: executable and writeable together looks like an insecure combination, whereas observing either bit alone in an executable would be perfectly normal. We may take these things into account when extracting the features, but often we will have to go back and rework the data before we can proceed with a model. My recommendation is to do the obvious preprocessing early on, as it will save work.
Any preprocessing that is less than obvious, make a mental note of and return to once you know your data and a picture starts to emerge of where you wish to go with it. For instance, don't normalize your data until you know it is an advantage in the model you wish to use - and even then, analyzing the raw data beforehand might yield insights, even if the end goal is a PCA. I once estimated a fairly advanced model in which I was trying to predict business cycles. In my first attempt I used something close to raw quarterly data for my variables. After hours of waiting for the model to converge, I had a model that could perfectly predict at what time of year Christmas sales pick up. I had done my preprocessing poorly, and my model picked up Christmas sales instead of business cycles. Had I preprocessed my data to correct for seasonal effects, I would have saved myself a lot of trouble.

So much for theory

Data acquisition for me was, as stated, just plain downloading virus_share_0000.zip (and obtaining access - thanks to those who made that possible). This archive contains around 130.000 malware files in some 20 gigabytes. 130.000 is a big number for ad hoc management. To further complicate things, the file names are only a hash with no mention of the original extension, which means there is no simple way to identify JavaScript, C source code, DLLs, EXEs, etc. Manual analysis is out of the question because of the amount of data. Enter “Predator”. Predator is the stupid code name I had in my head, and as I never figured out a better name for the tool, it's now called Predator. Predator is a simple tool that extracts features from all files in a directory and organizes them. A simple text file or a spreadsheet seemed not quite up to the task of organization; a SQL database seemed the right way to go. Enter SQLite. I was up and running with the SQLite library in less than 20 minutes, including creating tables, adding observations and sending queries. Unfortunately the SQLite interface is just bare bones, to put it politely. Initially I thought I'd install the POCO library using biicode and use POCO's abstraction for SQLite. Then I thought: hey, I can write something up in a couple of hours and in the process learn something about SQLite that might prove beneficial down the road. So that's how that code came to be.
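As a hypothetical illustration - this is not Predator's actual schema, just a sketch of the idea, with column names of my own invention - such a feature table in SQLite could look like:

```sql
-- Hypothetical layout; the real Predator schema may differ.
CREATE TABLE IF NOT EXISTS features (
    filename               TEXT PRIMARY KEY,  -- the hash-named file from the archive
    machine                INTEGER,
    number_of_sections     INTEGER,
    entropy                REAL,
    write_execute_section  INTEGER            -- boolean stored as 0/1
);

-- One observation per file, inserted as features are extracted.
INSERT INTO features VALUES ('00a1b2c3d4', 332, 5, 7.92, 1);

-- Queries then replace ad hoc file management, e.g.:
-- how many samples have a writeable code section?
SELECT COUNT(*) FROM features WHERE write_execute_section = 1;
```

The point of the database is exactly this last step: once 80000+ observations are rows, questions about the sample become one-line queries instead of scripts crawling a directory.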

The second thing I needed was to figure out which features to extract to get to a data set. In keeping with keeping things simple, I went for features of the PE format. My reasoning was:
A) There are going to be plenty of PE files, as opposed to, say, PDF exploits.
B) I'm knowledgeable in the PE format, and PE features are relatively easy to extract.
C) They are very relevant to real-world malware.
Obviously I cannot extract PE features from a JavaScript or PDF-based exploit, so I needed to throw away anything without a PE header. A general rule in feature engineering is that throwing data away is bad unless you have strong reasons to do so - and if you do have such reasons, you need to articulate them and consider the effects. Throwing away information can lead to selection bias; just imagine throwing the smokers out of a lung cancer study. And don't forget that you won't be the first to have data problems: the literature is full of methods for dealing with most issues in better ways than discarding information. However, I do have strong reasons here. PE files are likely to be an entirely different kind of malware than JavaScript files. They have different kinds of features, and they work in different ways to achieve mostly different things. In fact I'd go so far as to say that the only thing all malware has in common is being malicious. The price I pay is that any result I get from now on is no longer general to malware, but general only to PE files. We might find it reasonable to cut down further at some point (how about analyzing viruses separately from adware?), but since at this point I have no a priori or data-based reason to think they cannot be analyzed with the same features and the same methodology, in line with the reasoning above I keep everything for now. As I wrote at the beginning of this blog post, ML is more an iterative process than working through a shopping list.

The original idea was to write a small PE parsing library based on my previous work. I did not have much luck with that. The PE files in the archive were off-spec, broken and shattered. After a few hours of trying to harden my own code, I realized there was no way I could get both sufficient flexibility and stability on a schedule like mine (remember, I try to stay under 24 hours total for a blog post, and I'd already spent time on SQL and my own PE code). So I decided to go with a finished library, despite not being keen on using too many libraries, especially in core functions - my blog should not descend into a "using tools" blog, but rather be about how stuff works. I considered Katja Hahn's PortEx Java library but dropped the idea because I'm not too strong in Java and didn't want to interface it with my C++ code, especially since for another blog post I'm considering adding features from the code by utilizing my emulator (see the unpacking-using-emulation blog). Instead I went with PeLib by Sebastian Porst. I think I actually met Mr. Porst once, which ironically became the real reason for going with it. Sloppy, I know, but hey... It's a very nice library (unfortunately it seems to be orphaned, as there have been no changes since 2005 or so), but nevertheless it too was not quite up to the task of working with these mangled files. In fact it asserted on the very first file I parsed with it. So instead of just linking it, I took it into my own project and started fixing bugs as I went; thus you'll find Mr. Porst's code in my project. Keep in mind that I was not trying to produce good fixes for the bugs, I was trying to get the parsing done. Most bugs in the code were integer overflows. I did my best to keep every PE file in the sample, for the reasons discussed above. I might have excluded a few files, but I am certain those files could not possibly be executed on any NT platform, so the selection bias from this point of view should be minimal.

The features I’ve selected are summarized in this table, and where relevant I’ve explained my motivation for choosing the feature:

Feature                  | Reason
-------------------------|-------------------------------------------------------------
Filename                 |
Machine                  | Maybe I’ll wish to look at 32-bit and 64-bit executables separately at some point.
NumberOfSections         | Many sections can be an indicator of evil.
MajorLinkerVersion       | The linker version is often used as an “infected” marker by viruses.
MinorLinkerVersion       |
SizeOfCode               |
SizeOfInitializedData    |
SizeOfUninitializedData  |
ImageBaseNonDefault      | A non-default image base might be used as an indicator of being a DLL.
EntropyOfEP              | The entropy of the first 1000 bytes after the entry point might indicate polymorphic encryption of a virus.
EPAsText                 | See below.
Hash                     | If it doesn’t match the file name, the file is probably damaged.
Entropy                  | The entropy of the entire file indicates whether the file is encrypted or packed, as opposed to infected. It might of course be both, but that is another question.
EPInFirstSection         | This is normal in most executables; it’s not quite as common for packed and infected files.
EPeq0                    | An entry point in the PE header is definitely an indicator that something evil is going on. These files will nevertheless execute!
EPExecutable             | A virus might forget to mark the entry point executable; a linker won’t. Such an exe will still run on 32-bit Windows.
WriteExecuteSection      | A writeable code section is definitely a sign of evil.
NumberOfExports          | Exports are an indicator of a DLL.
ExportError              | An error occurred during export parsing - unlikely for a real linker.
NumberOfImports          |
NumberOfImportDll        |
IsDotNet                 | .NET might call for an entirely different kind of analysis.
ImportError              | Same as ExportError.


I ended up with 86447 PE files in my database: 84925 32-bit native code, 1503 .NET executables, and 19 of other architectures. It seems my malware collection is a bit dated, as 19 is a tiny number considering it includes all the 64-bit PEs. In selecting my features, my recollection of Peter Szor’s excellent book was useful.
Resources:
SQLite: http://www.sqlite.org
Porst, Sebastian: PeLib. http://www.pelib.com
Hahn, Katja: PortEx. https://github.com/katjahahn/PortEx
Szor, Peter: The Art of Computer Virus Research and Defense