Tuesday, April 21, 2015

Machine learning with Malware Part 1: Data

Download my database here
Download source code for predator here 

Without data there is no machine learning

Without data there is no machine learning. That is because ML is about finding "structure" in data; it really is that simple. This also means that you can never get better results from ML than your data allows. Often a bad model choice will still provide decent results if the data is good, but with bad data no model will ever provide good results. So while this blog post might be a little boring, its subject is of significant importance.
I've used dummy data for my two previous ML posts, which focused more on what ML methods are and what they can do and less on the data. Now I'm faced with real data, and this blog post is about how I'll treat my data before I even get to think about models and methods for finding structure. Ironically these steps are rarely considered in a generalized fashion in the literature I've read. Usually it goes straight to the methods for finding structure. I suppose this is because what you actually need to do depends heavily on what kind of data we are talking about. This means of course that this blog post will not be well founded in the scientific literature, but rather in my experience – but trust me, I’m good with data. If you have no prior knowledge and haven't read the first two blog posts on machine learning with malware, you might wish to read those first before proceeding.
In my mind I've always seen the process of doing econometrics, or in this case ML, as running through the stages below. Unlike a shopping list, you'll often find yourself going back and improving previous steps as you become acquainted with the data. The steps in my mind are:
·         Data acquisition: Getting Raw data
·         Feature engineering: Getting from raw data to a data set (observations)
·         Preprocessing: Getting from data set to input data
·         Model selection: Getting from input data to information
·         Model analysis: Analyze results and become aware of problems with your model

This blog post will consider only the first three steps. Also, in real-life applications the distinction between the steps is often much less clear than my list above suggests. Nevertheless I think the distinction is useful when working with the data. When I’m all done I’ll have a data set that you could in theory use for machine learning on malware – arguably a very crude data set, but that’s the way it’ll always be given my time constraints.

Data acquisition

When getting raw data you need to think ahead. For questionnaire-type data acquisition it's obvious that you get different results if you ask about "Obamacare" as opposed to "Death panels". It's less obvious in a malware setting, but no less important. Say you collect your raw malware data through honey pots. It'll probably influence the type of malware you collect whether your honey pots are placed on servers of "advanced weapons technology" companies or on aunt Emma's home PC. The first is far more likely to see custom-built and highly sophisticated attacks than the second, and much less likely to fall prey to rapidly spreading email worms. If you're trying to deduce general things about malware from this data, you'll be biased in either scenario because the data already reflects different realities. This bias is called selection bias. We’ll visit selection bias again below. So beware of your data; ML is data driven and very much a garbage-in-garbage-out discipline. In my case I didn't have honey pots, but a password to a collection where I could download some malware. I suspect that the chance of a malware sample making it into such a collection is much higher for aunt Emma style malware than for sophisticated attacks.

Feature engineering

In all ML, except “deep learning”, finding the structure of the data requires us to specify the features of the data. You don't just stuff an executable into a machine and out comes whether it's malicious or not. In my first blog post I stuffed the section of the entry point into Bayes and out came an estimate of whether the file was packed or not. The "section of the entry point" is usually called a feature of the data, and getting from raw data to features is what feature engineering is about. Sometimes features are obvious; unfortunately that is not so in malware research. Here we'll have to think hard about what we think is different between malware and non-malware (at least if that's what we are examining with ML). The good thing is that we don't need to care about the actual relationships between the features, we just need to extract the features we think are relevant. Figuring out relationships in data is the job of the model or algorithm. Extracting too much is better than too little: ML has ways of dropping what is too much in data-driven ways instead of the ad hoc assumption style. When we start working with the data we might come to think of a feature that wasn't part of the raw data we got, and that might send us back to acquire more data. As a student I got access to a large database of socio-economic indicators over time for 1% of the Danish work force. I started analyzing unemployment and found out that I had to go back and find data on the business cycle, because business cycles are fairly important in the labour market. Once you start working with the data things often become clearer, and you need to be open to reworking your decisions again and again. As you may recognize from the experience I just mentioned, feature extraction is a job for people who understand the subject at hand. This is a good reason why being malware knowledgeable cannot be replaced by machine learning; however, leveraging machine learning can be extremely powerful.


Preprocessing

Usually there are significant advantages in preprocessing data before entering it into a model. When using machine learning to read handwritten postal addresses, classic image processing – which as such has nothing to do with machine learning – is often applied first. For example, CCD noise is removed, the picture of each letter is scaled to the same size (in pixels), and if there is a slight tilt in the handwriting the pictures might be rotated, etc. Also, some ML methods have requirements on their input. If you wish to do, say, principal component analysis on the data, you should at least consider normalizing each feature to mean 0 and variance 1, whereas an ordinary least squares model is much more forgiving. The distinction between preprocessing and feature extraction is not always clear. In the same way, when reading features out of malware it often makes good sense to preprocess. For example, one thing we'd wish to look at when trying to determine whether an executable is malware could be the protection characteristics of the sections in the executable. In the Portable Executable (PE) format that's just a 32-bit flag field for each section. First off, no simple model will make much use of such a bit field as raw input; it is syntax, not semantics. But the "writeable" bit might be very interesting. Is the feature the characteristics bit field, or is it the separate bits? Further, we might consider combinations of bits to be features: say, executable and writeable together might seem like an insecure combination, whereas observing either bit alone in an executable would be considered perfectly normal. We might take these things into account when extracting the features, but often we'll have to go back and work the data again before we can proceed with a model. My recommendation is to do obvious preprocessing early on, as it'll save work.
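To make the bit-field example concrete, here is a minimal sketch (my own illustration, not code from Predator) of turning a section's 32-bit Characteristics word into separate boolean features, including the suspicious write+execute combination:

```cpp
#include <cstdint>

// Flag values from the PE specification (IMAGE_SECTION_HEADER.Characteristics).
constexpr uint32_t IMAGE_SCN_MEM_EXECUTE = 0x20000000;
constexpr uint32_t IMAGE_SCN_MEM_WRITE   = 0x80000000;

struct SectionFeatures {
    bool writable;      // the "writeable" bit on its own
    bool executable;    // the "executable" bit on its own
    bool writeExecute;  // the combination that looks insecure
};

// Turn the raw bit field (syntax) into features a model can use (semantics).
SectionFeatures sectionFeatures(uint32_t characteristics) {
    SectionFeatures f{};
    f.writable     = (characteristics & IMAGE_SCN_MEM_WRITE) != 0;
    f.executable   = (characteristics & IMAGE_SCN_MEM_EXECUTE) != 0;
    f.writeExecute = f.writable && f.executable;
    return f;
}
```

Whether you feed the model the raw field, the single bits, or the combination is exactly the feature-versus-preprocessing judgment call discussed above.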
For any preprocessing that is less than obvious, make a mental note and return to it once you get to know your data and a picture starts to emerge of where you wish to go with it. For instance, don’t normalize your data until you know it is an advantage in the model you wish to use – and even then, analyzing the raw data beforehand might yield insights even if the end goal is a PCA. I once estimated a fairly advanced model where I was trying to predict business cycles. In my first attempt I used something close to raw quarterly data for my variables. After hours of waiting for my model to converge, I had a model that was able to perfectly predict at what time of year Christmas sales picked up. I had done my preprocessing poorly and my model picked up Christmas sales instead of business cycles. Had I preprocessed my data to correct for seasonal effects, I would’ve saved myself a lot of trouble.

So much for theory

Data acquisition for me was, as stated, just plain downloading virus_share_0000.zip (and obtaining access – thanks to those who made this possible). This archive contains around 130,000 malware files in some 20 gigabytes. 130,000 is a big number for ad hoc management. To further complicate things, the file names are only a hash with no mention of the original extension, which means there is no simple way to identify JavaScript, C source code, DLLs, EXEs, etc. Manual analysis is out of the question because of the amount of data. Enter “Predator”. Predator is the stupid code name I had in my head, and as I didn’t figure out a better name for the tool, it’s now called Predator. Predator is a simple tool that extracts features from all files in a directory and organizes them. A simple text file or a spreadsheet seemed not quite up to the task of organization; a SQL database seemed the right way to go. Enter SQLite. I was up and running with the SQLite library in less than 20 minutes, including creating tables, adding observations and sending queries. Unfortunately the SQLite interface is just bare bones, to put it politely. Initially I thought I'd just install the POCO library using biicode and use POCO's abstraction for SQLite. Then I thought: hey, I can write something up in a couple of hours and in the process learn something about SQLite, and that might prove beneficial down the road. So that's how that code came to be.

The second thing I needed was to figure out which features to extract to get to a data set. In step with keeping things simple I went for features of the PE format. My logic was:
A) there are going to be plenty of PE files as opposed to analyzing PDF exploits
B) I'm knowledgeable in the PE format and PE features are relatively easy to extract
C) PE features are very relevant in real-world malware.
Obviously I cannot extract PE features from a JavaScript or PDF-based exploit, so I needed to throw away anything without a PE header. A general rule in feature engineering is that throwing data away is bad unless you have strong reasons to do so, and if you do have those strong reasons you need to articulate them and consider the effects. This is because throwing away information will lead to selection bias – just imagine throwing out smokers in a lung cancer study. And don’t forget you’ll not be the first to have data problems, so the literature is full of methods to deal with most issues in better ways than throwing information away. However, I do have strong reasons: PE files are likely to be an entirely different kind of malware than JavaScript. They have different kinds of features; they work in different ways to achieve mostly different things. In fact I’d go so far as to say that the only thing all malware has in common is that it is malicious. The price I pay is that any result I get from now on is no longer general about malware, but general only to PE files. We might find it reasonable to cut down further at some point (how about analyzing viruses separately from adware?). But since at this point I have no a priori (data-based) reason to think they cannot be analyzed with the same features and the same methodology, in line with the reasoning above I keep everything for now. As I wrote in the beginning of this blog post, ML is more of an iterative process than working a shopping list.
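The filter itself can be crude. A sketch of the kind of check I have in mind (hypothetical illustration – Predator relies on real PE parsing, not this): keep a file only if it carries a DOS "MZ" stub and a "PE\0\0" signature at e_lfanew.

```cpp
#include <cstdint>
#include <vector>

// Crude pre-filter: does this buffer look like a PE file at all?
bool looks_like_pe(const std::vector<uint8_t>& file) {
    if (file.size() < 0x40) return false;                 // too small for a DOS header
    if (file[0] != 'M' || file[1] != 'Z') return false;   // DOS "MZ" magic
    // e_lfanew: little-endian dword at offset 0x3C pointing to the PE signature.
    uint32_t e_lfanew = file[0x3C] | (file[0x3D] << 8) |
                        (file[0x3E] << 16) | (uint32_t(file[0x3F]) << 24);
    if (uint64_t(e_lfanew) + 4 > file.size()) return false;
    return file[e_lfanew] == 'P' && file[e_lfanew + 1] == 'E' &&
           file[e_lfanew + 2] == 0 && file[e_lfanew + 3] == 0;
}
```

Anything failing this test (scripts, PDFs, damaged junk) is exactly the data I am deliberately, and with stated reasons, throwing away.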

The original idea was to write up a small PE parsing library based on my previous works. I did not have much luck with that. The PE files in the archive were off-spec, broken and shattered. After a few hours of trying to harden my own code, I realized there was no way I could get both sufficient flexibility and stability on a time schedule like mine (remember, I try to stay under 24 hours total time for a blog post, and I’d already spent time on SQL and my own PE stuff). Thus I decided to go for a finished library, despite not being keen on using too many libraries, especially in core functions, because my blog should not descend into a “using tools” blog, but rather be about how stuff works. I considered going with Katja Hahn’s PortEx Java library but dropped the idea because I’m not too strong in Java and didn’t want to interface it with my C++ code – especially since I’m considering, for another blog post, adding features from the code utilizing my emulator (see the unpacking-using-emulation blog). Instead I went with PeLib by Sebastian Porst. I think I actually met Mr. Porst once, which ironically became the real reason for going with it. I know, sloppy, but hey… It’s a very nice library (unfortunately it seems to be orphaned, as there have been no changes since 2005 or so), but nevertheless it too was not quite up to the task of working with these mangled files. In fact it asserted on the very first file I parsed with it. So instead of just linking it, I took it into my own project and started fixing bugs as I went; thus you’ll find Mr. Porst's code in my project. Keep in mind that I was not trying to do good fixes for the bugs, I was trying to get the parsing done. Most bugs in the code were integer overflows. I did my best to keep every PE file in the sample, for the reasons I’ve discussed above. I might have excluded a few files, but I am certain that those files could not possibly be executed on any NT platform.
So I think the selection bias from this point of view should be minimal.

The features I’ve selected are summarized in this table, and where relevant I’ve tried to explain my motivation for choosing the feature:
Feature
Reason
Filename

Machine
Maybe I’ll wish to look at 32-bit and 64-bit EXEs separately at some point
NumberOfSections
Many sections can be an indicator of evil
MajorLinkerVersion
Linker version is often used as “infected” marker by viruses
MinorLinkerVersion

SizeOfCode

SizeOfInitializedData

SizeOfUninitializedData

ImageBaseNonDefault
This might be used as an indicator of being a DLL
EntropyOfEP
The entropy of the first 1000 bytes after the entry point might be an indicator of polymorphic encryption of a virus.
EPAsText
See below
Hash
If it doesn’t match the filename, the file is probably damaged
Entropy
The entropy of the entire file indicates whether the file is encrypted or packed as opposed to infected. It might of course be both, but that is another question
EPInFirstSection
This is normal in most executables. It’s not quite as common for packed and infected files.
EPeq0
An entry point pointing into the PE header is definitely an indicator that something evil is going on. Such files will nevertheless execute!
EPExecutable
A virus might forget to mark the entry point section as executable; a linker won’t. Such an EXE will still run on 32-bit.
WriteExecuteSection
Writeable code section is definitely a sign of evil
NumberOfExports
Exports are an indicator of DLL
ExportError
An error occurred during export parsing. Unlikely for a real linker.
NumberOfImports

NumberOfImportDll

IsDotNet
.NET might require an entirely different kind of analysis.
ImportError
Same as export error
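Two of the features above, EntropyOfEP and Entropy, rest on Shannon entropy. As a hedged illustration (Predator's exact implementation may differ), entropy in bits per byte can be computed like this:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Shannon entropy of a buffer, in bits per byte (0 = constant data,
// 8 = uniformly random, which is what packed/encrypted data approaches).
double shannon_entropy(const std::vector<uint8_t>& buf) {
    if (buf.empty()) return 0.0;
    double counts[256] = {0};
    for (uint8_t b : buf) counts[b] += 1.0;
    double h = 0.0;
    for (double c : counts) {
        if (c == 0.0) continue;           // 0 * log2(0) is taken as 0
        double p = c / buf.size();
        h -= p * std::log2(p);
    }
    return h;
}
```

For EntropyOfEP you would feed this the first 1000 bytes after the entry point; for Entropy, the whole file.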


I ended up with 86,447 PE files in my database: 84,925 32-bit native code executables, 1,503 .NET executables and 19 of other architectures. It seems my malware collection is a bit dated, as 19 is a tiny number considering that it includes all the 64-bit PEs. In selecting my features, my recollection of Peter Szor’s excellent book was useful.
Resources:
SQLite. http://www.sqlite.org
Porst, Sebastian: PeLib. http://www.pelib.com
Hahn, Katja – PortEx. https://github.com/katjahahn/PortEx

Szor, Peter: The Art of Computer Virus Research and Defense

Thursday, April 2, 2015

Thoughts on H.A.R.E.S, Harvard, mining and leveraging

Update 19-6-2015

I didn't explicitly write it in the original blog, but I should have: the methods below would be useful for detecting Shadow Walker-type rootkits too.

Original post starts here:

I finally got a new hard disk and I'm now spending most of my malware hobby time extracting features for machine learning from 70,000 PE executables and putting them into a database. This unfortunately takes a while. But stay with me, I'll get back to ML soon enough.

Download frequency tables here

Introduction

While I'm working on other stuff, here is a short note on something I found extraordinarily interesting: the H.A.R.E.S. white paper by Jacob Torrey. The white paper is pretty short, and the slides from his presentation at SyScan (which I did not attend) don't shed much more light on the thing. So forgive me if I'm inaccurate. Also, I'm writing this on the side while running 70,000 malware executables through my feature grabbing tool and monitoring that everything is OK. So no code this time, just words.

What is H.A.R.E.S and why do I like it and fear it at once

To summarize really shortly:
Mr. Torrey wants to build a DRM-style utility in which the protection is that the code segment can be executed but not read or written. To implement this he exploits the fact that the memory management unit (MMU) in modern x86 computers keeps separate translation caches (TLBs) for data accesses and instruction fetches. He then manipulates the instruction TLB to point to the real code, and the data TLB to point into "nirvana". This is the same trick that makes PaX on Linux work, and the Shadow Walker rootkit. To keep the attacker from reading the code in the original file, he proposes to encrypt it with AES-128, for which modern Intel processors have instructions. To protect the key and the page tables he installs everything in a "thin" virtual machine, so that everything the host OS (and that includes the attacker) does goes directly to hardware, except for decryption and protection of the TLBs. Read his white paper for details – I find it quite good.

I find this interesting because in many ways it's a step towards a new computer architecture. x86 was originally built on the von Neumann architecture. For this discussion the important part of the von Neumann architecture is that there is no distinction between code and data. A competing design, the Harvard architecture, distinguishes between them. A great deal of the bad behavior that is possible on the x86 architecture is possible in the von Neumann architecture but not in the Harvard one (stack execution, trampoline hooks, self-modifying code, many code injection techniques...). Mr. Torrey's idea effectively imposes this Harvard feature on a von Neumann platform, and that is a very interesting thing – at least from a theoretical perspective. The Grugq however noted that though at first it sounds like a blessing for infosec, it really is a curse in disguise. Imagine no way to audit code for maliciousness: no signatures, no code heuristics, only behavior analysis to identify if something is malicious or not. Mr. Torrey suggests some weaknesses in his method that might be usable. I'll get back to that in a moment.

I doubt that Mr. Torrey's thing will find much real-world application, but it might be an important stepping stone towards an architectural change that is much needed. That alone makes it worth peeking into, not to mention it seemed like a good semi-technical topic for a fun blog post. I have no time to do code on this, and since I doubt it will have real-world implications I find it OK to stick with speculation.

A detour

The first problem that Mr. Torrey has is that code in real executables is often interspersed with data. Here a string, there a jump table, etc. If you've disassembled enough you've seen it. Since not even the code itself can read this data, we have a problem. I'd suggest that a solution is to identify as much code as you can in the code section using standard methods (performance-counter-based coverage analysis and static coverage analysis). You make a copy of the entire code section, destroy (overwrite with garbage) what you could positively identify as code, and use the data TLB to project the copy for data accesses. You sacrifice protection for ease of implementation, and that always seems to me to be a worthwhile sacrifice.

Attacking H.A.R.E.S

Now back to the weaknesses Mr. Torrey suggested. I think the real potential is in attacking the implementation of the thin VM itself. At least currently the VM cannot protect itself, and thus reproducing the VM exactly but leaking the key seems to me the most promising way to go about playing attacker. He also suggests hardware attacks (reading physical memory directly), but shipping hardware with an antivirus product is uneconomical to say the least. This leaves me with his third suggested attack vector: side channel attacks, which will be my topic for the rest of this post.
The side channel attacks I'll talk about are tracing the code and seeing what happens, and using performance counters to gain additional information about the code you cannot read. You could probably also use cache timing for an attack, but I'll not get into that because I don't know enough about modern caches yet, and of the three it seems to me to be the least productive. Forgive me again – after my 15-year de facto infosec pause, I've only been reading up for 3 months.

The problem of tracing around, observing registers and memory to find out what the code looks like, seems daunting at first. With an N-byte code segment we have 256 to the power of N possible programs. Even if we say we just need to get close, that looks like a terribly large number. Also, with the sheer amount of instructions on x86 platforms it seems near impossible to get them all correctly deduced just from looking at their behavior. Think of the clflush instruction. It gets worse. Imagine the code segment being full of data that essentially is instructions that crash some hardware in your computer if run in ring 0 and cause exceptions if not. Since you cannot really execute such an instruction, you cannot use the trace method, and the performance counters are unlikely to be of much use to you. Unfortunately such a scenario can easily be imagined – imagine a long string of outb instructions that'll crash some attached hardware if executed. Game over?
Not quite. The problems mentioned above are all real, but they probably won't shut us down totally. We can probably squeeze most of the data out of the unreadable code segment. We should start by noticing that not all opcodes are equally likely; some are used more than others. Zombie did a survey a couple of years back, and just one opcode made up 15% of all opcodes in his sample. Only 54 opcodes made up 95% of the opcodes used in his sample, and the majority of these 54 opcodes indeed seem to be analyzable. Looking just at the frequencies with which bytes occur in calc.exe from Windows XP, I found almost 10% to be zeroes. Opcode zero is "add r/m8, r8". About half of those zeroes were followed by another zero, making up the instruction "add [eax], al". You can find my tables as an Excel file. This hints that the problem of squeezing enough information out of the code segment might not be quite impossible.
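The calc.exe numbers above come from simple byte counting. A sketch of such a frequency measurement (my own illustration; my actual tables are in the linked Excel file) that reports the fraction of zero bytes and of zero-zero pairs:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns {fraction of 0x00 bytes, fraction of positions starting a 0x00 0x00
// pair} in the buffer - the two statistics quoted for calc.exe above.
std::pair<double, double> zero_stats(const std::vector<uint8_t>& buf) {
    if (buf.empty()) return {0.0, 0.0};
    size_t zeros = 0, zeroPairs = 0;
    for (size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != 0) continue;
        ++zeros;
        if (i + 1 < buf.size() && buf[i + 1] == 0) ++zeroPairs;  // "add [eax], al"
    }
    return {double(zeros) / buf.size(), double(zeroPairs) / buf.size()};
}
```

Run over a corpus of clean executables, counts like these become the frequency maps that later direct the mining effort.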

Mining 
The amount of instructions on the x86 platform is ironically going to help us. Since the encodings are so crammed, many opcodes can contain other, smaller opcodes. For example, mov eax, 90h encodes as 0xB8 0x90 0x00 0x00 0x00 – and the immediate byte 0x90 is itself the opcode for "nop". We can use this to leverage our search.
Let me define some terminology before I start. By "mining" I mean finding out an opcode and/or its parameters without any prior knowledge of it. By "leveraging" I mean using known bytes in the code segment to figure out more bytes, by using opcodes embedded in previously mined opcodes.
Obviously, since we know nothing, we need to mine as a first step, and my first mining step would go for the easiest stuff first. I'm essentially going to set up a context where I can easily identify if somebody is using my register values – e.g. set EAX to 0xa1a2a3a4, EBX to 0xb1b2b3b4 etc., including the memory on the stack – and then set a thread to execute exactly one instruction at each byte offset in the loaded target executable. I do not care where the real opcodes meant to be executed start; I'm looking for anything that I can identify. Once we trace, we'll either see an exception (other than the expected int 1) or we'll get a new context after execution of the instruction. I'll return to exceptions later. When talking about a register in the input context I'll use that register with a .0 appended, and similarly for the output context a .1 will be appended. Thus eip.1 is the output eip. For most instructions I can calculate the length of the instruction by simply doing eip.1-eip.0. This works for all but branching instructions; I'll refer to this imperfect measure as the opcode size (OpcSize) nevertheless. Below is a table of instructions that are particularly easy to mine (and how to mine them):
Opcode
Identification criteria
Comment
0x40-0x4f Inc/dec reg
OpcSize==1 && reg.1==reg.0±1

0x50-0x57 Push reg
OpcSize==1 && esp.0==esp.1+4 && @esp.1==reg.0

0x58-0x5f pop reg
OpcSize==1 && esp.0==esp.1-4 && reg.1 == @esp.0

0x68/0x6A push imm32/imm8
OpcSize==5/2 && esp.0==esp.1+4
This gives us 4/1 parameter bytes, which we can read at @esp.1
0xe8 call rel imm32
esp.0==esp.1+4 && @esp.1==eip.0+5
We can only partially mine this easily as sometimes it'll cause an exception and not give us a new context
0xC2 retn imm16
esp.1==esp.0+4+imm16 && abs(eip.1-eip.0)>16
We can only positively identify this instruction if the branch made was so "long" it couldn't be an instruction, and the stack adjustment is bigger than 4 (to differentiate it from a plain 0xC3 ret).
xchg reg, eax; scasb/w/d; stosb/w/d; pushfd; popfd; push es/cs/ds/ss; int; ...

Mining criteria left as an exercise... sorry.
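In code, the mining criteria of the table boil down to predicates over a before/after context pair. A toy sketch for two of the rows (the Context struct, its fields and the single tracked register are my own invention for illustration; a real miner would check every register):

```cpp
#include <cstdint>

// Snapshot of the thread context around one single-stepped instruction.
struct Context {
    uint32_t eip, esp, eax;
    uint32_t stackTop;  // dword at [esp] in this context
};

enum class Guess { Unknown, IncDecReg, PushReg };

// Apply the table's identification criteria to a before (.0) / after (.1) pair.
Guess classify(const Context& c0, const Context& c1) {
    uint32_t opcSize = c1.eip - c0.eip;  // the imperfect "OpcSize" measure
    // 0x40-0x4F inc/dec reg: one byte, stack untouched, register +/- 1.
    if (opcSize == 1 && c1.esp == c0.esp &&
        (c1.eax == c0.eax + 1 || c1.eax == c0.eax - 1))
        return Guess::IncDecReg;
    // 0x50-0x57 push reg: one byte, esp down by 4, register value on top.
    if (opcSize == 1 && c0.esp == c1.esp + 4 && c1.stackTop == c0.eax)
        return Guess::PushReg;
    return Guess::Unknown;
}
```

The marker values (0xa1a2a3a4 etc.) from the setup above are what make matches like c1.stackTop == c0.eax unlikely to fire by accident.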

Say we perfectly mined for 30 or so opcodes and identified 1000 bytes. The complexity we are now trying to resolve is only 226 to the power of N-1000. Still a ridiculous number, but notice that our progress is exponential: finding just one byte reduces the problem by a factor of 256!
The above was the simple mining – things we could establish by tracing just once per byte. Tracing more than once can provide even more information. Take the cli/sti instructions: tracing such an instruction twice with different preset I-flags will let you see a toggle of this flag and an OpcSize of 1, and thus we can identify these. We can also mine conditional short jumps with more than one trace: trace each byte once with all branching flags off and once with them on. If the OpcSize is 2 in one trace and something else in the other, we have a conditional short jump. There goes a significant number of bytes again, and since we can mine these perfectly we can subtract another 14 from the number of possible choices for each opcode.

Leveraging
Once we've mined these, we can start leveraging. For this, the parameter bytes we got from call rel imm32, from push imm8/32 or from the conditional short jumps are particularly interesting. Imagine that one of those known bytes is 0x04. We'd then know that the opcode starting there is add al, imm8: zero out al.0, execute at that byte, and you'll have the next byte in al.1. Imagine this next byte being 0x05 (opcode add eax, imm32); we can then do the trick again with eax. Not all opcodes can be leveraged, however, because some don't leak more information than mining the opcode itself would. So we cannot expect a cascade effect, but we will make progress. We can easily leverage add/adc/sbb/xor/sub al/eax, imm8/32, as well as all short jump opcodes – and how about mov reg, imm8/32? We could also look at two-byte leveraging: say, 0x8D 0x98 would be lea ebx, [eax+imm32]. There is more easy leverage to be had.

Harnessing exceptions
Up to this point I've treated exceptions as something which prevents you from getting information while mining. They can be rather interesting though. Not only does the type of exception tell you something about the type of opcode, it also gives you information on whether it was a read or a write that caused it. Imagine leveraging the known byte 0x8B (mov reg32, r/m32) and wishing to know its parameters. Simply set as much of the memory in the process you're working with as possible to guard pages; chances are tracing this instruction will now throw an exception. Trace a couple of times with different register settings: if the read address in the exception doesn't change, we have a simple disp32 indirection. We can also find out which registers change the address and use this information to figure out more about the RM and MOD bits of the ModRM byte, which again will allow us to identify the SIB byte (if present). Once this is done you could get lucky and be able to make the instruction not throw an exception, and that'll give you the REG bits of the ModRM, leaving you with the complete instruction. Even if you don't know the instruction, you can use this technique to mine for anything that looks like a ModRM byte and figure it and the additional parameters out.

Last words
I've put enough on the table in this direction now. There is plenty more to be harvested in terms of trace mining and leveraging. A hint though: use frequency maps from old, unprotected executables to direct your mining operations towards where they are most likely to give you the highest success rate for your effort. Mining for 0x00, 0x00 worked well in calc.exe on Windows XP and sounds like a great idea for other executables too. You can mine for imperfect information that won't allow you to find a given byte with certainty, but will rule out almost everything, making an educated guess easy later. You should also look into info leaks from performance counters, or into hooking system APIs to get return addresses, which will allow you to better mine for API calls. Moving yourself into ring 0 might also bring you something. Finally, we can make use of bindiffing methods: if we've managed to get enough information from the above measures, we can compare our incomplete functions with those of standard libraries. You can find more about this on the Diaphora web page, including a link to a good article by Thomas Dullien and Rolf Rolles.

Literature
Diaphora web page: http://joxeankoret.com/blog/2015/03/13/diaphora-a-program-diffing-plugin-for-ida-pro/
Torrey, Jacob. White paper: http://jacobtorrey.com/HARES-WP.pdf
Torrey, Jacob. SyScan slides: http://jacobtorrey.com/HARESSyscanTalk.pdf

Zombie's opcode statistics: http://z0mbie.daemonlab.org/opcodes.html

Saturday, March 21, 2015

Row hammer fix POC and generic root kit detection using performance counters

Update 2/4-2015
I had a chat with Thomas Dullien about row hammer detection, and the Google people developed the same idea as I did in parallel. Their results pretty much match mine from my second post, on row hammer detectability. I went one step further than Google by setting the INT bit in the MSR, which provides me with the context of the cache miss and allows me to better analyze and handle the false positives which I unfortunately think remain a part of the row hammer detection scheme. My current feeling is that parts of the industry are mostly interested in ignoring the row hammer bug to death, and that makes me feel queasy about the prospects of a fix ever being made available. My prediction is that only the detection method will be used in real life, and even that not widespread.

Erratum regarding ROP

When I wrote the original text I was focused on the "architectural" events on Intel processors. With those events the original text about tracking ROP using performance counters remains correct. However, Georg Wicherski (ochsff) pointed out to me that there is a "Ret Miss" non-architectural counter (event select 0x90 / umask 0 – for the Core Duo, table 19.22 in Intel's systems programming manual) which would suit this perfectly, and thus my method for finding normal code execution outside of known areas in kernel mode could be extended to find ROP'ing in user as well as kernel mode. User mode will probably require a severe trade-off between false negatives (not finding ROP when it's there) and performance penalty. This is mainly because all rets after a task switch will be mispredicted. Still, I think something useful could be made. Unfortunately, push/ret combinations are common in copy protection and packing wrappers, and they will probably cause false positives with this method as well.



Original Post

Before I start with the subject of this blog, just a few words on the way. The most viewed of my blogs these past days was "Machine Learning and malware with Bayes". Ironically I consider this my worst post so far (not counting the first two row hammer posts as "real" posts). In that and some other posts I used proxies for malware because I didn't have a malware collection. I do now. It'll take me a while to get it sorted out, but then I will return to ML topics again after this short row hammer distraction.
Accompanying source code is here:

Row hammer fix POC and generic root kit detection using performance counters


Many thanks to Jerome, whose sharp mind helped me out with this nightmare. Thanks go to Joey too.

An anecdote

I think it was in 2008, meeting up with guys from the late 1990s and early 2000s scene, away from where I lived. It was a happy and alcohol-infused occasion with an awesome amount of brain power present. One of the guys - I'll call him N - had just written a kernel mode privilege escalation attack, for Windows XP I believe, and was developing ideas on how to hide himself from detection. I did what I do best and started trolling him. My first plan of attack was pretty simple: invoke a blue screen and search the kernel mode memory dump that ensued. A counter attack is difficult, but possible. This idea is actually still pretty good - so I should write up something on it some time. N argued about the obvious problems with this method. So I countered with how I would profile the entire computer using performance counters. You can hide, but you cannot run, not without me knowing - at least probability-wise. The discussion drowned on that point, I think. I've long planned to write up this blog but always postponed it because it's not so closely aligned with my current interest in machine learning. So profiling was in the back of my head. Enter row hammer. Google Project Zero wrote up their row hammer attack, and I found myself thinking: if performance counters can do anything in infosec, then they can do this. Thomas Dullien wrote an entry on Facebook saying he was happy to be done with row hammer, and up came the troll in me. I trolled him with the same thing with which I'd trolled N back in the day. Obviously, when you have a good idea that you haven't published, trolling someone about it in an open Facebook post (and especially on Thomas Dullien's Facebook) is a bad idea. And that forced me into publishing half-finished stuff. Sorry. Unfortunately with maximally bad timing in my personal life, and on the very day my hard disk had crashed.

 

So now for a bit better write up


Intel CPUs support performance counters to help software developers profile their software. Profiling is essentially the process of seeing how much, what kind, and where computing resources are being spent, to provide clues for optimizing your code. Intel has a number of model specific registers (MSRs) to deal with this. You can enable counting for a number of different performance events, such as time spent, cache misses, and so on, and read out the counter register using the WRMSR/RDMSR instructions, which are available only in kernel mode. Obviously row hammer is hammering away on the memory, causing bottlenecks. Start the profiling and you can detect somebody hammering your system and trigger a search for the culprit. Particularly easy if you're just running a sandbox. This is pretty much the concept that I showed was viable in my last post on the subject.
It gets better. You can even have the counters trigger an interrupt when they overflow, allowing you to get information not only on how often things are happening but also where. Let the thing run long enough and you'll see where the interrupts in your code start to rack up a tally, and you'll know your bottleneck - or in my case, the row hammer.
The source code of my proof of concept is available. It's kept very basic, so this should not be considered production code in any sense (also, it's dead insecure!!!!). Remember, I never try to write a blog if I don't believe I can finish it in 24 work hours. The first decision I made was to write the POC code for only one operating system, and that operating system was going to be WinXP32. There are two reasons for this: the first is that I had a WinXP computer available to me for testing, and the second is that WinXP32 does not protect the interrupt descriptor table, which I would need. This means less messing around with the operating system. It of course also means my code won't port directly to modern Windows versions, but that's irrelevant in showing that a software fix is possible, since the entire code is based on hardware features. However, you would probably need to be less than well behaved, or rewrite parts of the operating system. Now, my XP system didn't have a kernel mode debugger on it and I decided it wasn't worth spending time setting up the system for debugging with WinDbg. Which, it's fair to say, was a bad decision. Starting out, the first task was to find out which interrupt I needed to hook. The interrupt is in the Intel documentation called the PMI (performance monitor interrupt), but unlike the usual interrupts it isn't assigned a fixed number; rather, you have to ask the APIC for its mapping. Having done enough documentation reading for a day, I decided that LiveKD with symbols would probably give me a few hints. So I downloaded LiveKD, set up the symbols and executed "!idt", and I had been right: "hal!HalpPerfInterrupt" sounded just about right. The interrupt is 0xFE.


Second, I needed to hook the interrupt. I've done this before for a piece of copy protection software I worked on. The essentials are: use KeSetSystemAffinityThread in a loop to get on all processors in the system, then use the SIDT instruction to find the interrupt descriptor table. Then modify it. I then realized that my only source codes for hooking interrupts (in fact all drivers for which I still have source code but one) were company property and that I couldn't publish them in a blog. So instead of rewriting everything I started a major cut'n'paste adventure using internet sources - sorry, but the clock was ticking; see Literature for my cut'n'paste sources.
Next was setting up the performance counters to generate the interrupt. The performance counting mechanism originally allowed two programmable performance counters and has since been vastly expanded. I decided to concentrate on the original implementation, since that would be supported on most computers, including my ageing Core 2 Duo system. Obviously you shouldn't just assume that the hardware supports these features, but ask the CPU whether they are present using the CPUID instruction. I skipped this step for the usual reasons. To get up and running we need three model specific registers:
ia32_perf_global_ctrl (0x38f)

This register controls which performance counters are enabled.

ia32_pmc0 (0xC1)
This register is the actual counter. Depending on the hardware it's either 40 or 48 bits wide; one should ask the CPUID instruction which it is. A pmc1 exists too, but I don't need it.
ia32_perfevtsel0 (0x186)

This register is the control register for the counter. Here you select the properties of your counter and what events you wish to count.
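For completeness, here is roughly what the CPUID check I skipped would look at: leaf 0x0A of CPUID reports the architectural performance monitoring version, the number of general-purpose counters, and the counter width in its EAX register. A minimal decoding sketch; the sample EAX value in the test is illustrative (roughly what a Core 2 era part reports), not taken from my machine:

```c
#include <stdint.h>

/* Layout of CPUID.0AH:EAX per the Intel SDM (architectural
 * performance monitoring): bits 7:0 = version ID, bits 15:8 =
 * number of general-purpose counters per logical core,
 * bits 23:16 = counter bit width. */
typedef struct {
    unsigned version;       /* architectural perfmon version   */
    unsigned num_counters;  /* general-purpose counters        */
    unsigned counter_bits;  /* width of each counter, in bits  */
} perfmon_caps;

static perfmon_caps decode_cpuid_0a_eax(uint32_t eax)
{
    perfmon_caps c;
    c.version      = eax & 0xFF;
    c.num_counters = (eax >> 8) & 0xFF;
    c.counter_bits = (eax >> 16) & 0xFF;
    return c;
}
```

On my Core 2 the interesting answers would be two counters of 40 bits each; newer parts report more counters and 48 bits.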

The first step is to make sure there is no profiling going on: simply set ia32_perf_global_ctrl to 0. Since we wish to let the counter overflow to generate an interrupt, we set ia32_pmc0 to a value close to actually overflowing. Interestingly enough, you cannot set the bits beyond the 32nd bit (on my processor; some processors can, so again you should use CPUID to figure out what to do); these are automatically filled according to the sign bit of the first 32 bits. So in my case I could just set it to -1000 and my overflow would trigger in 1000 events. You need to reset this value if you wish to continue interrupting. This too is a tuning parameter, trading performance overhead from interrupting against accuracy. I didn't care for tuning anything, for the usual reason. Once the counter is set up you need to deal with perfevtsel0. For me there are four important things to notice here. The first is where I set that I wish to monitor user mode only; no reason to complicate matters for row hammer by watching kernel mode, but theoretically you can. Second and third are the event you wish to monitor - I was looking for cache misses, which is set in the event field; the umask field follows from the event you choose, and you can find it in a table in the Intel documentation (documented in the source). Fourth, I wish to set the interrupt flag to generate the interrupt I want. Finally, ia32_perf_global_ctrl needs to be set to activate performance counting.
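The steps above can be sketched as bit arithmetic. The field layout of ia32_perfevtsel0 is from the Intel SDM; the event/umask pair used in the test (0x2E/0x41, the architectural "LLC Misses" event) is an illustrative stand-in for whichever cache miss event you pick from the Intel tables, not necessarily the one my POC uses:

```c
#include <stdint.h>

/* IA32_PERFEVTSEL0 bit fields (Intel SDM). */
#define EVTSEL_USR (1u << 16)  /* count in user mode (ring 3)        */
#define EVTSEL_OS  (1u << 17)  /* count in kernel mode (ring 0)      */
#define EVTSEL_INT (1u << 20)  /* raise the PMI on counter overflow  */
#define EVTSEL_EN  (1u << 22)  /* enable the counter                 */

/* Build the value to write into ia32_perfevtsel0. */
static uint32_t make_evtsel(uint8_t event, uint8_t umask, uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}

/* Preload ia32_pmc0 so it overflows after n events. Only the low
 * 32 bits are writable here; the hardware sign-extends them to the
 * full counter width, so writing -n does the right thing. */
static uint32_t preload_for(uint32_t n)
{
    return (uint32_t)-(int32_t)n;
}
```

So a user-mode-only, interrupt-on-overflow configuration is make_evtsel(event, umask, EVTSEL_USR | EVTSEL_INT | EVTSEL_EN), and my -1000 preload is preload_for(1000).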
Finally I should mention the interrupt service routine (ISR). To develop a real checking ISR you need to consider lots of things, the most important being the state of the operating system when you receive this interrupt. The operating system has been interrupted without the kernel being given a chance to come to rest with the situation (any serious discussion of IRQLs, etc. is way beyond the scope here), and thus kernel API calls are a bad idea. You can deal with it, of course, as the kernel indeed does. Being lazy, I figured the original ISR would do it all for me, so I set a global variable with the EIP of the caller and call the original handler, then deal with the EIP later. I figured right. The only problem is that the original handler immediately stops profiling, so I get my interrupt just once. I consider that enough for a POC.
Now you just need to check whether your interrupt is surrounded by row hammer type code and act accordingly. Be aware that interrupts get return addresses, so the offending instruction is the one prior to where the return address is pointing. Equally, there can be some latency with profiling, causing interrupts to be delayed by another couple of instructions. This can be dealt with, but not in any pretty way, and it shouldn't matter anyway.
Here is the result:




Finally, I didn't write a driver loader routine. Just use the OSR driver loader or write one.


My Views on Row hammer

I think row hammer is the coolest exploit I've read about in a long, long time. It's novel in so many ways. It utilizes a hardware "bug" - and when I say "bug", it's more of an accepted compromise between price and speed/size on one side and reliability on the other. I don't think that row hammer will ever be fixed in the manner I described here or in my last blog. It would work, but a good integration into the OS is a lot of work, and since Microsoft and Apple won't feel responsible for bad memory, they have nothing to do with it. For Linux maybe somebody will include a fix one day, but it's more likely to be of the type in my previous blog; because it's easier to implement, the price will possibly be false positives and negatives. In short, you should buy ECC memory next time, and then row hammer will only be a DOS exploit. As for triggering the problem with instructions other than clflush: there is, in my opinion, a slim chance that it's possible. My guess is that the non-temporal instructions, combined to read from multiple different locations in a pattern that'll force cache flushes, would be the best bet (running on multiple cores at once??). The same could apply to normal integer reads, but I don't think it will. As for Java exploits of the same, I sincerely doubt that the world will ever see one. Finally, Intel should make clflush a privileged instruction in future processors; its benefits are too small to warrant it being accessible given that it's potentially dangerous to RAM.

 

Detecting root kits with performance counters in general

Now back to what I had in mind back in 2008. Imagine using profiling on kernel mode only - which is possible. I will assume kernel mode only for the rest of this blog, though there might be generalizations to user mode. Now you cannot run without risking the counter overflowing in your code. It is quite easy to imagine writing an interrupt handler which ignores interrupts in a list of memory regions corresponding to known signed drivers, but raises an alert that the system has been infected by malware when triggered anywhere else. This effectively limits what root kits in general can do in allocated memory. Regardless of whether the memory is hidden from the Windows memory manager or not: if it runs, it gets detected.
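The core of such a handler is just a range check on the sampled instruction pointer. A minimal sketch, assuming the PMI handler has already extracted the EIP; the region addresses in the demo table are made up for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* One whitelisted code region, e.g. the image range of a known
 * signed driver. */
typedef struct {
    uintptr_t base;
    uintptr_t size;
} code_region;

/* Hypothetical ranges, for illustration only. */
static const code_region demo_regions[] = {
    { 0x80400000u, 0x100000u },  /* e.g. the kernel image */
    { 0xF8800000u, 0x20000u },   /* e.g. a signed driver  */
};

/* Returns 1 if the sampled instruction pointer falls inside any
 * known-good region. A PMI landing outside every region is the
 * alert condition: code is running where no known driver lives. */
static int eip_is_whitelisted(uintptr_t eip,
                              const code_region *regions, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (eip >= regions[i].base &&
            eip < regions[i].base + regions[i].size)
            return 1;
    }
    return 0;
}
```

In a real handler the region list would be built from the loaded module list and kept sorted for a binary search, since this runs at interrupt time.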

Counter attack

The obvious counter measure is to use just one instruction on entry to the malware to reset the MSR. I'm unlikely to catch this instruction with a performance counter unless it's called a lot (or I'm willing to accept huge performance penalties). But I can do statistics on the timestamp counter and how many performance events I'd expect in a given number of clock cycles. If performance counting is shut off, these two measures will diverge, and that could potentially be used for detection of malware. My guess is that short hook routines etc. could get away with this most of the time without being detected, but you will have a disturbance in the signal, and statistical analysis could be used to pick out the change. Another counter attack would be to remove the interrupt hook entirely. That, on the other hand, is easily detectable directly in memory, or could be analyzed using the same method as above.
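The divergence test could look something like the sketch below. Both expected_rate (events per cycle on a clean system) and tolerance are numbers you would have to calibrate by measurement; the values in the test are made up:

```c
#include <stdint.h>

/* If something disables counting on entry, the observed event count
 * falls well below what the elapsed TSC cycles predict.
 * expected_rate is a calibrated events-per-cycle baseline and
 * tolerance (0..1) absorbs normal variation; both are assumptions
 * to be measured, not constants from any manual. */
static int counting_was_tampered(uint64_t observed_events,
                                 uint64_t elapsed_cycles,
                                 double expected_rate,
                                 double tolerance)
{
    double expected = expected_rate * (double)elapsed_cycles;
    return (double)observed_events < expected * (1.0 - tolerance);
}
```

A short hook that disables counting for a few hundred cycles would barely move this statistic, which is exactly the limitation described above; longer-running hidden code should show up clearly.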

Performance counters and ROP

Obviously, in my scenario ROP attacks remain possible. ROP'ing is using control over the stack to control execution flow by executing tidbits of existing code. Say you need a call to WriteFile: instead of writing a buffer with a call to WriteFile, you find a suitable code sequence in existing code and use your stack control to reach that instead - in theory you can do that ad infinitum and build up any code you'd like. Such attacks are much, much harder to produce than code in allocated memory and would limit functionality (hooks would be nasty to implement as ROP - if even possible). However, performance counters can do something against these kinds of attacks too. One of the things that makes ROP easier to implement is ROP'ing into the middle of existing instructions. Say 0xFF 0x15 0xC0 0x55 0xC3 0x00 would normally be a call instruction with indirection, but contained in it (entering at the fourth byte) is also a push ebp / ret, which is useful for ROP'ing. If you carefully analyze instruction locations and match them up with where you find performance counter interrupts, you'd be able to detect this kind of abuse too. I don't think it'll ever have any practical application though.

Literature:
My driver code was shamelessly stolen from here (and the DDK storage filter sample): https://sites.google.com/site/jozsefbekes/Home/windows-programming/windriver-hello-world
The interrupt hooks came from here:
http://www.codeproject.com/Articles/36585/Hook-Interrupts-and-Call-Kernel-Routines-in-User-M

The basis for my performance register code came from here:
http://www.mindfruit.co.uk/2012/11/fun-with-msrs-counting-performance.html
See also my previous row hammer blogs on dreamsofastone.blogspot.com

The original row hammer report is here:
http://googleprojectzero.blogspot.de/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
Finally this is the important part of the Intel documentation:
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B & 3C)


Saturday, March 14, 2015

Row hammer detection is possible



Introduction

First off, detection isn't fixing, but it's a good step in that direction, and I'm growing continually more confident in my claim that it probably is fixable as I work with this. Anyway, following up on the original idea sent me deep into writing drivers, and I quickly began to think that modifying my method slightly would yield detection, and that writing this up would give me a lot of confirmation ahead of time for my original idea - which I continue to think is superior, though significantly more technically complicated to write a proof of concept for. And I seriously needed a break from doing my day job first and then coming home to put in a lot of hours of row hammering.

Simplified method

First, what is the modification to my original idea? Well, it's easy: just run the performance counters without setting the request for an interrupt on counter overflow in the MSR. The interrupt would give me the EIP/RIP of the offending instructions, and this again would allow me very strong identification of whether something is row hammering or not. Without the interrupt I won't have the EIP/RIP, so I'll essentially just be guessing based on the performance counts over an interval. This gives me two parameters to tune for detection. The first is the interval at which I poll the performance counter, and the second is a "cut off level" - by which I mean how many performance counts are not normal but hammering. Essentially this means that false positives and false negatives can occur. Of course I also have to decide on which performance counter I should use. L2 cache misses sound promising, because if the instructions don't get through to the real memory, there is no row hammer, and the cache is optimized so that normal code won't trigger many misses.
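The resulting detector logic is tiny; a sketch with both tuning knobs as parameters. The streak requirement is my own addition to damp one-off spikes, not part of the scheme described above, and the demo counts are made-up numbers of the same magnitude as my measurements:

```c
#include <stddef.h>

/* Made-up per-interval cache miss counts for illustration:
 * sustained high counts vs. normal load with one lone spike. */
static const unsigned demo_hammer[] = { 1350, 1402, 1377, 1298 };
static const unsigned demo_normal[] = { 3, 7, 1200, 5 };

/* counts[i] is the miss count seen in polling interval i (the
 * polling interval is the first tuning knob, applied by the
 * caller). cutoff is the second knob: the count above which an
 * interval is flagged. min_hits consecutive flagged intervals are
 * required before alerting. */
static int detect_hammering(const unsigned *counts, size_t n,
                            unsigned cutoff, unsigned min_hits)
{
    unsigned streak = 0;
    for (size_t i = 0; i < n; i++) {
        streak = (counts[i] > cutoff) ? streak + 1 : 0;
        if (streak >= min_hits)
            return 1;
    }
    return 0;
}
```

Raising cutoff or min_hits trades false positives for false negatives, which is exactly the tuning problem described above.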
My conclusion up front: yes, it works; well enough that I believe the correct identification of row hammering is near perfect!

My method for the proof of concept is plainly using vsperfcmd, which comes with Visual Studio 2013 Community Edition. Visual Studio 2013 is free and awesome; unfortunately that does not extend to vsperfcmd. It's free and horrible, but with a bit of tweaking it works.
First I had to write up a row hammer program for Windows. Since I don't actually want to mess with bits, I just made a row hammer "look-alike" program:
UCHAR *Buffer = (UCHAR *)malloc(1024 * 1024);
MessageBox(0, L"Hello world", L"Hello world", MB_OK);
__asm {
        pushad
        mov     ecx, 1000000
        mov     eax, Buffer
        mov     ebx, eax
        add     ebx, 0x1000
code1a:
        mov     edx, [eax]
        mov     esi, [ebx]
        clflush [eax]
        clflush [ebx]
        dec     ecx
        jnz     code1a
        popad
}
It hammers away like a real row hammer would, but obviously it's just a simulation (but hey, it could still flip bits on you, so be careful). I commented out the clflush instructions to play "normal program".

Now, profiling it. It took a while before I figured out that you should avoid the "launch" parameter of vsperfcmd, because it seems to insist on using instrumentation instead of sampling when this parameter is used, and that is just plain silly. I found out because, with the clflush instructions, it complained that it couldn't instrument; replacing them with 15 nops and vsperfcmd was back in its unintended bad business. Ugly. However, "attach" works like a charm, but obviously requires me to have a way of first starting the row hammer program, then attaching, and then starting the actual row hammering code. Hence the message box above (Sleep() produces better results if you wish to batch test).
This would be a command line for vsperfcmd that you guys out there can play with. I automated the sampling in a batch file. The output file (c.vsp) can be opened with Visual Studio.
vsperfcmd /Start:sample /output:c.vsp /attach:5688 /Counter:L2Misses,1000,"L2Misses" /shutdown:20
Now the job was just to sample a lot of runs and see what results I would get. I got on average 1367 samples with the clflush instructions and 7 without. In my testing I never got less than 1200 samples with clflush, and always single-digit results without. This of course isn't "proof" in any meaningful sense, but it's enough evidence that I'll stick my neck out and say it works. There is a lot more work to be done on this, but I'll take a break for now. Though I promise I will blog again on my original idea.

Wait there is more

Wondering how something like this could be implemented in a real life scenario: the obvious answer is to send an email to the administrator when detected and shut down the computer before the attack's root access can do damage. This is fine in most corporate settings, but for home users it's just plain not the user experience you wish for. So I'm thinking that this check would best be implemented in the scheduler of the operating system, which could profile every slice of time it allocates to user mode threads. If it comes up with a "row hammer" incident, it should just take the offending thread out of the scheduling loop so that it cannot do any more damage, and set a signal for a verification program. Chances are that the thread would have been preempted near or in the row hammering core loop, and just reading out the context would give you an EIP/RIP to work with for some additional analysis. If the additional analysis so concludes, it could just terminate the process, and the system can keep running.

Wednesday, March 11, 2015

Row hammer privilege elevation attack is probably software fixable


To find out what row hammering is, please read this Project Zero blog from Google (http://googleprojectzero.blogspot.de/). Unfortunately I’m extremely busy at the moment, so I’ll have to follow up on this post later with code. My initial thought was that not only the clflush instruction would be able to trigger this problem but also the non-temporal mov instructions, and unlike clflush these instructions are very useful in user mode. Say you’re decoding 4K video or running video filtering as you go: these are instructions you’ll find quite useful in optimizing your code. Also, the clflush instruction can only be banned in good sandboxes, as Google did for theirs. Unfortunately bad sandboxes and real computers remain vulnerable.
The fix I’d like to propose is to use the performance counters in the Intel CPU. Essentially they allow ring 0 code to trigger an interrupt on performance events by setting a model specific register in the CPU. Row hammer is essentially hammering away on the memory, and this is bound to generate performance events; in fact it’ll light up performance counting like a Christmas tree. Such events are counted internally, and an interrupt can be generated once this counter overflows. This provides a defense strategy on two levels. One, the performance counting and interrupt generation are costly time-wise, lowering the frequency with which the RAM is being hammered. Second, and most importantly, it’ll give a direct pointer to the code that generated the event, which can then be checked for offending instructions and behavior. Obviously the downside is that there will be a performance penalty for using performance counting in this way.

I’ve still got lots of work to do on this:
-          Write proof of concept code.
-          Figure out which performance counter is best
-          And estimate the performance cost


Performance counters are described in Intel Architecture Software Developer’s Manual, Volume 3, chapter 15.