Saturday, March 21, 2015

Row hammer fix POC and generic root kit detection using performance counters

Update 2/4-2015
I had a chat with Thomas Dullien about row hammer detection, and the Google people developed the same idea as I did in parallel. Their results pretty much match mine from my second post, "Row hammer detection is possible". I went one step further than Google by setting the INT bit in the MSR, which gives me the context of the cache miss. That lets me analyze and weed out false positives better, which I unfortunately think will remain a part of any row hammer detection scheme. My current feeling is that parts of the industry are mostly interested in ignoring the row hammer bug to death, and that makes me feel queasy about the prospects of a fix ever being made available. My prediction is that only the detection method will be used in real life, and even that not widely.

Erratum regarding ROP

When I wrote the original text I was focused on the "architectural" events on Intel processors. With these events the original text about tracking ROP using performance counters remains correct. However, Georg Wicherski (ochsff) pointed out to me that there is a non-architectural "Ret Miss" counter (event select 0x90, umask 0; for the Core Duo see table 19.22 in the Intel system programming guide) which would suit this perfectly. Thus my method for finding normal code execution outside of known areas in kernel mode could be extended to find ROP'ing in user as well as kernel mode. User mode will probably require a severe trade-off between false negatives (not finding ROP when it's there) and performance penalty. This is mainly due to the fact that all rets after a task switch will be mispredicted. I still think something useful could be made. Unfortunately push/ret combinations are common in copy protection and packing wrappers, and they will probably cause false positives with this method as well.

Original Post

Before I start with the subject of this blog, just a few words on the way. The most viewed of my blogs these past days was "Machine Learning and malware with Bayes". Ironically I consider this my worst post so far (not counting the first two row hammer posts as "real" posts). In that and some other posts I used proxies for malware because I didn't have a malware collection. I do now. It'll take me a while to get it sorted out, but then I will return to ML topics again after this short row hammer distraction.
Accompanying source code is here:

Row hammer fix POC and generic root kit detection using performance counters

Many thanks to Jerome and his sharp mind that helped me out with this nightmare. Thanks go to Joey too.

An anecdote

I think it was in 2008, meeting up with the guys from the late 1990s and early 2000s, not from where I lived. It was a happy and alcohol-infused occasion with an awesome amount of brain power present. One of the guys - I'll call him N - had just written a kernel mode privilege attack, for Windows XP I believe, and was developing ideas on how to hide himself from detection. I did what I do best and started trolling him. My first plan of attack was pretty simple: invoke a blue screen and search the kernel mode memory dump that ensued. A counter attack is difficult, but possible. This idea is actually still pretty good, so I should write up something on it some time. N argued about the obvious problems with this method. So I countered with how I could profile the entire computer using performance counters. You can hide, but you cannot run, not without me knowing, at least probability wise. The discussion drowned on that point, I think. I've long planned to write up this blog but always postponed it because it's not so closely aligned with my current interest in machine learning. So profiling was in the back of my head. Enter row hammer. Google Project Zero wrote about their row hammer attack and I found myself thinking: if performance counters can do anything for infosec, then they can do this. Thomas Dullien wrote an entry on Facebook that he was happy to be done with row hammer, and up came the troll in me. I trolled him with the same thing I'd trolled N with back in the day. Obviously, when you have a good idea you haven't published, trolling someone about it in an open Facebook post, and especially on Thomas Dullien's Facebook, is a bad idea. And that forced me into publishing half-finished stuff. Sorry. Unfortunately it came with maximally bad timing for my personal life and on the very day my hard disk had crashed.


So now for a bit better write up

Intel CPUs support performance counters to help software developers profile their software. Profiling is essentially the process of seeing how much, what kind of, and where computing resources are being spent, to provide clues for optimizing your code. Intel has a number of model specific registers (MSRs) to deal with this. You can enable counting for a number of different performance events such as time spent, cache misses, and so on, and set up and read out the counter registers using the WRMSR/RDMSR instructions, which are available only in kernel mode. Obviously row hammer is hammering away on the memory, causing bottlenecks. Start the profiling and you could detect somebody hammering your system and trigger a search for the culprit. This is particularly easy if you're just running a sandbox. This is pretty much the concept that I showed was viable with my last post on the subject.
It gets better. You can even have the counters trigger an interrupt when they overflow, allowing you to get information not only on how often things are happening but also where. Let the thing run long enough and you'll see where the interrupts in your code start to rack up a tally, and you'll know your bottleneck, or in my case the row hammer.
The source code of my proof of concept is available. It's kept very basic, so this should not be considered production code in any sense (also it's dead insecure!!!!). Remember, I never try to write a blog if I don't believe I can finish it in 24 work hours. The first decision I made was to write the POC code for only one operating system, and that operating system was going to be WinXP32. There are two reasons for this: the first is that I had a WinXP computer available to me for testing, and the second is that WinXP32 does not protect the interrupt descriptor table, which I would need to modify. This means less messing around with the operating system. It of course also means my code won't port directly to modern Windows versions, but that's irrelevant in showing that a software fix is possible, since the entire code is based on hardware features. However, on a modern system you would probably need to be less than well behaved, or rewrite parts of the operating system. Now my XP system didn't have a kernel mode debugger on it and I decided it wasn't worth spending time setting up the system for debugging with WinDbg. It is fair to say that was a bad decision.
Starting out, the first task was to find out which interrupt I needed to hook. The interrupt is called PMI (performance monitoring interrupt) in the Intel documentation, but unlike the usual interrupts it isn't assigned a fixed number; rather you have to ask the APIC for its mapping. Having done enough docu reading for a day, I decided that LiveKD with symbols would probably give me a few hints. So I downloaded LiveKD, set up the symbols and executed "!idt", and I had been right: "hal!HalpPerfInterrupt" sounded just about right. The interrupt is 0xFE.

Second, I needed to hook the interrupt. I've done this before for a piece of copy protection software I worked on. The essentials are to use KeSetSystemAffinityThread in a loop to get onto every processor in the system, then use the SIDT instruction to find the interrupt descriptor table, and then modify it. I then realized that my only source code for hooking interrupts (in fact all drivers for which I still have source code but one) was company property and that I couldn't publish it in a blog. So instead of rewriting everything I started a major cut'n'paste adventure using internet sources. Sorry, but the clock was ticking; see Literature for my cut'n'paste sources.
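As an illustration of the "then modify it" step, here is a minimal sketch of how a handler address is stored in a 32-bit interrupt gate and how a hook would swap it. The struct and helper names are hypothetical and not taken from the POC source; the field layout is the standard 8-byte protected-mode gate. A real driver would do this on every processor with interrupts disabled, which this user-mode sketch does not attempt.

```c
#include <stdint.h>

/* 32-bit protected-mode interrupt gate as it sits in the IDT (8 bytes). */
typedef struct {
    uint16_t offset_low;   /* handler address bits 0..15   */
    uint16_t selector;     /* code segment selector        */
    uint8_t  reserved;     /* always zero                  */
    uint8_t  type_attr;    /* gate type, DPL, present bit  */
    uint16_t offset_high;  /* handler address bits 16..31  */
} idt_gate32;

/* Reassemble the handler address that is split across the gate. */
uint32_t gate_get_handler(const idt_gate32 *g)
{
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Point the gate at a new handler. Returns the old address so the
 * hook can chain to the original handler afterwards. */
uint32_t gate_set_handler(idt_gate32 *g, uint32_t new_handler)
{
    uint32_t old = gate_get_handler(g);
    g->offset_low  = (uint16_t)(new_handler & 0xFFFFu);
    g->offset_high = (uint16_t)(new_handler >> 16);
    return old;
}
```

With the IDT base from SIDT, the gate for vector 0xFE would be entry 0xFE of an array of such structs.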
Next was setting up the performance counters to generate the interrupt. The performance counting mechanism originally allowed two programmable performance counters to run and has since been vastly expanded. I decided to concentrate on the original implementation since that would be supported on most computers, including my ageing Core2Duo system. Obviously you shouldn't just assume that the hardware supports these functions, but ask the CPU if the feature is present using the CPUID instruction. I skipped this step for the usual reasons. To get up and running we need 3 model specific registers:

ia32_perf_global_ctrl (0x38f)
This register controls which performance counters are enabled.

ia32_pmc0 (0xC1)
This register is the actual counter. Depending on the hardware it's either 40 or 48 bits wide. One should ask the CPUID instruction which it is. pmc1 would exist too, but I don't need it.

ia32_perfevtsel0 (0x186)
This register is the control register for the counter. Here you select the properties of your counter and what events you wish to count.

First step is to make sure there is no profiling going on: simply set ia32_perf_global_ctrl to 0. Now, since we wish to let the counter overflow to generate an interrupt, we set the counter (ia32_pmc0) to a value close to overflowing. Interestingly enough, you cannot set the bits beyond the 32nd bit (on my processor; some processors can, so again you should use CPUID to figure out what to do). These are automatically filled according to the sign bit of the first 32 bits. So in my case I could just set it to -1000 and my overflow would trigger after 1000 events. You would need to reset this if you wish to continue interrupting. This too is a tuning parameter: performance overhead from interrupting versus accuracy. I didn't care to tune anything, for the usual reason. Once the counter is set up you need to deal with perfevtsel0. For me there are 4 important things to notice here. The first is where I set that I wish to monitor user mode only; there is no reason to complicate matters for row hammer by watching kernel mode, but theoretically you can. Second and third are the event you wish to monitor: I was looking for cache misses, which is set in the event field, and the umask field follows from the event you choose; you can find it in a table in the Intel documentation (documented in the source). Fourth, I set the interrupt flag to generate the interrupt I want. Finally, ia32_perf_global_ctrl needs to be set to activate performance counting.
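The register values described above can be sketched as plain arithmetic. This is a user-mode illustration only, not the POC driver code: the helper names are made up, and the event chosen here (the architectural "LLC Misses" event, event select 0x2E with umask 0x41) is an assumption about which cache miss event one might pick. The bit positions are those of IA32_PERFEVTSELx in the Intel manual.

```c
#include <stdint.h>

/* MSR addresses named in the text. */
#define IA32_PMC0             0x0C1u
#define IA32_PERFEVTSEL0      0x186u
#define IA32_PERF_GLOBAL_CTRL 0x38Fu

/* Flag bits in IA32_PERFEVTSELx. */
#define EVTSEL_USR (1u << 16)  /* count in user mode (CPL > 0)  */
#define EVTSEL_OS  (1u << 17)  /* count in kernel mode (CPL 0)  */
#define EVTSEL_INT (1u << 20)  /* raise the PMI on overflow     */
#define EVTSEL_EN  (1u << 22)  /* enable this counter           */

/* Bit 0 of IA32_PERF_GLOBAL_CTRL enables PMC0. */
#define GLOBAL_CTRL_PMC0 1ull

/* Assumed event: architectural "LLC Misses". */
#define EVT_LLC_MISS   0x2Eu
#define UMASK_LLC_MISS 0x41u

/* Event select in bits 0..7, umask in bits 8..15, flags above. */
uint32_t evtsel_encode(uint32_t event, uint32_t umask, uint32_t flags)
{
    return (event & 0xFFu) | ((umask & 0xFFu) << 8) | flags;
}

/* Counter value that overflows after n events on a counter that is
 * width_bits wide (40 or 48 according to the text). */
uint64_t pmc_preload(uint64_t n, unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64) ? ~0ull : ((1ull << width_bits) - 1);
    return (0ull - n) & mask;
}

/* The low 32 bits to write when the CPU sign-extends bit 31 upward,
 * i.e. "-1000" as a 32-bit value. */
uint32_t pmc_preload_low32(uint32_t n)
{
    return 0u - n;
}
```

For user mode only with an interrupt on overflow, `evtsel_encode(EVT_LLC_MISS, UMASK_LLC_MISS, EVTSEL_USR | EVTSEL_INT | EVTSEL_EN)` would be written to IA32_PERFEVTSEL0, the preload value to IA32_PMC0, and `GLOBAL_CTRL_PMC0` to IA32_PERF_GLOBAL_CTRL last.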
Finally I should mention the interrupt service routine (ISR). To develop a real checking ISR you need to consider lots of things, the most important being the state of the operating system when you receive this interrupt. The operating system has been interrupted and the kernel has not been given a chance to come to rest with the situation (any serious discussion of IRQL etc. is way beyond the scope here), and thus kernel API calls are a bad idea. You can deal with it of course, as the kernel indeed does. Being lazy, I figured the original ISR would do it all for me, and thus I set a global variable with the EIP of the interrupted code and call the original handler, then deal with the EIP later. I figured right. The only problem is that the original handler immediately stops profiling, so I get my interrupt just once. I consider that enough for a POC.
Now you just need to check if the interrupted code is surrounded by row hammer type code and act accordingly. Be aware that interrupts get return addresses, so the offending instruction is the one prior to where the return address is pointing. Equally, there can be some latency with profiling, causing interrupts to be delayed by another couple of instructions. This can be dealt with, but not in any pretty way, and it shouldn't matter anyway.
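One crude way to check whether an address is "surrounded by row hammer type code" is to scan a small window of bytes around the saved EIP for a CLFLUSH encoding (0F AE /7 with a memory operand). The sketch below is an assumption about how such a check could look, not what the POC does: it is a naive byte scan rather than a disassembler, so it can match in the middle of another instruction, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte window contains a CLFLUSH encoding.
 * CLFLUSH is 0F AE with a ModRM byte whose reg field is 7 and whose
 * mod field is not 3 (0F AE with mod 3, reg 7 is SFENCE). */
int window_has_clflush(const uint8_t *code, size_t len)
{
    size_t i;
    for (i = 0; i + 2 < len; i++) {
        uint8_t modrm = code[i + 2];
        if (code[i] == 0x0F && code[i + 1] == 0xAE &&
            ((modrm >> 3) & 7) == 7 &&   /* reg field selects /7   */
            (modrm >> 6) != 3)           /* memory operand only    */
            return 1;
    }
    return 0;
}
```

Because of the return-address offset and the interrupt latency mentioned above, the window would have to extend some bytes before the saved EIP.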
Here is the result:

Finally, I didn't write a driver loader routine. Just use the OSR driver loader or write one.

My Views on Row hammer

I think row hammer was the coolest exploit I've read about in a long, long time. It's novel in so many ways. It utilizes a hardware "bug". When I say "bug", it's more of an accepted compromise on price and speed/size versus reliability. I don't think that row hammer will ever be fixed in the manner I described here or in my last blog. It would work, but it's a lot of work to integrate a fix well into the OS, and Microsoft and Apple won't feel responsible for bad memory they have nothing to do with. For Linux maybe somebody will include a fix one day, but it's more likely to be of the type in my previous blog because it's easier to implement; the price will possibly be false positives and negatives. In short, you should buy ECC memory next time, and then row hammer will only be a DoS exploit. As for triggering the problem with instructions other than clflush: there is, in my opinion, a slim chance that it's possible. My best guess would be the non-temporal instructions, combined to read from multiple different locations in a pattern that'll force cache flushes. (Running on multiple cores at once??) The same could apply to normal integer register reads, but I don't think it will. As for Java exploits of the same, I sincerely doubt that the world will ever see one. Finally, Intel should make clflush a privileged instruction in future processors; the benefits are too small to warrant it being accessible, given that it's potentially dangerous to RAM.


Detecting root kits with performance counters in general

Now back to what I had in mind back in 2008. Imagine using profiling on kernel mode only, which is possible. I will assume kernel mode only for the rest of this blog, though there might be generalizations to user mode. Now you cannot run without risking that the counter overflows in your code. It is quite easy to imagine writing an interrupt handler which ignores interrupts in a list of memory regions that correspond to known signed drivers, but otherwise raises an alert that the system has been infected by malware. This effectively limits what root kits in general can do in allocated memory. Regardless of whether the memory is hidden from the Windows memory manager or not, if it runs, it gets detected.
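The region check such a handler would perform can be sketched as a simple range lookup. The struct, the function name and the idea of a flat array of regions are assumptions for illustration; how the list of signed driver images would be built is left open here.

```c
#include <stddef.h>
#include <stdint.h>

/* One known-good code region, e.g. a loaded signed driver image. */
typedef struct {
    uint32_t base;  /* first address of the region */
    uint32_t size;  /* length in bytes             */
} code_region;

/* Returns 1 if eip lies inside any known region, 0 if it should
 * raise an alert. Written so that base + size cannot overflow. */
int address_is_known(const code_region *regions, size_t count, uint32_t eip)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (eip >= regions[i].base && eip - regions[i].base < regions[i].size)
            return 1;
    }
    return 0;
}
```

A linear scan keeps the sketch short; inside an interrupt handler a sorted table with a binary search would be the more likely choice.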

Counter attack

The obvious counter measure is to use just one instruction on entry to the malware to reset the MSR. I'm unlikely to catch this instruction with a performance counter unless it's called a lot (or I'm willing to accept huge performance penalties). But I can do statistics on the timestamp counter and how many performance events I'd expect in a given number of clock cycles. If performance counting is shut off, these two measures will diverge, and that could potentially be used for detection of malware. My guess is that short hook routines etc. could get away with this most of the time without being detected, but there will be a disturbance in the signal, and statistical analysis could be used to filter out the change. Another counter attack would be to remove the interrupt hook entirely. That, on the other hand, is easily detectable directly in memory, or could be analyzed using the same method as above.
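The divergence test between the timestamp counter and the event count could look roughly like this. It is a deliberately simple sketch: a single baseline rate and a fixed tolerance stand in for the statistical analysis the text has in mind, and all names and parameters are hypothetical.

```c
#include <stdint.h>

/* Compares the events seen in an interval with what a baseline rate
 * (base_events per base_cycles) predicts for the same number of
 * timestamp-counter cycles. tolerance is a fraction, e.g. 0.25 means
 * "alert if more than 25% of the expected events are missing".
 * Returns 1 if counting looks like it was switched off for a while. */
int counting_looks_suppressed(uint64_t base_events, uint64_t base_cycles,
                              uint64_t seen_events, uint64_t seen_cycles,
                              double tolerance)
{
    double expected;
    if (base_cycles == 0)
        return 0;  /* no baseline, no verdict */
    expected = (double)base_events / (double)base_cycles * (double)seen_cycles;
    return (double)seen_events < expected * (1.0 - tolerance);
}
```

A short hook routine would only remove a small share of the expected events, which is why a real implementation would need many intervals and a proper statistical test rather than one threshold.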

Performance counters and ROP

Obviously in my scenario ROP attacks remain possible. ROP'ing is using control over the stack to control execution flow by executing tidbits of existing code. Say you need a call to WriteFile: instead of writing a buffer with a call to WriteFile, you find a suitable code sequence in existing code and use your stack control to reach that instead. You can in theory do that ad infinitum and write up any code you'd like. Such attacks are much, much harder to produce than code in allocated memory and would limit functionality (hooks would be nasty to implement as ROP, if even possible). However, performance counters can do something against these kinds of attacks too. One of the things that makes ROP easier to implement is ROP'ing into the middle of existing instructions. Say 0xFF 0x15 0xc0 0x55 0xc3 0x00 would normally be a call instruction with indirection, but contained in it are also the bytes 0x55 0xc3, which decode as push ebp, ret and are useful for ROP'ing. If you carefully analyze instruction locations and match them up with where you find performance counter interrupts, you'd be able to detect this kind of abuse too. I don't think it'll ever have any practical application though.
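Matching interrupt addresses against instruction locations can be sketched as a lookup in a sorted table of instruction start addresses. The table is assumed to come from a disassembly of the module, which is not shown, and the function name is hypothetical. Interrupt latency would blur the picture in practice, so a single mismatch proves nothing.

```c
#include <stddef.h>
#include <stdint.h>

/* Binary search in a sorted array of instruction start addresses.
 * Returns 1 if addr is the start of a known instruction, 0 if it
 * falls inside one (a hint that overlapping bytes were executed). */
int is_instruction_start(const uint32_t *starts, size_t count, uint32_t addr)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] == addr)
            return 1;
        if (starts[mid] < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

For a 6-byte indirect call at a hypothetical address 0x1000, the table would hold 0x1000 and 0x1006; a sample at 0x1003, where the embedded 0x55 0xc3 bytes sit, would not be found.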

My driver code was shamelessly stolen from here (and the DDK storage filter sample):
The interrupt hooks came from here:

The basis for my performance register code came from here:
See also my previous row hammer blogs on

The original row hammer report is here:
Finally this is the important part of the Intel documentation:
Intel® 64 and IA-32 Architectures,Software Developer’s Manual,Volume 3 (3A, 3B & 3C):

Saturday, March 14, 2015

Row hammer detection is possible



First off, detection isn't fixing, but it's a good step in that direction, and I'm growing continually more confident in my claim that it probably is fixable as I work with this. Anyway, following the original idea sent me deep into writing drivers, and I quickly began to think that modifying my method slightly would bring detection, and that writing this up would give me a lot of confirmation ahead of time for my original idea, which I continue to think is superior, though significantly more technically complicated to write a proof of concept for. And I seriously need a break from doing my day job first and then coming home and doing a lot of hours of row hammering.

Simplified method

First, what is the modification to my original idea? Well, it's easy: just run the performance counters without setting the request for an interrupt on counter overflow in the MSR. The interrupt would give me the eip/rip of the offending instructions, and this in turn would allow me a very strong identification of whether something is row hammering or not. Without the interrupt I won't have eip/rip, so I'll essentially just be guessing based on the performance counts over an interval. This gives me two parameters to tune for detection. The first is the interval at which I poll the performance counter and the second is a "cut off level". By "cut off level" I mean how many performance counts are no longer normal but hammering. Essentially this means that false positives and false negatives can occur. Of course I also have to decide on which performance counter I should use. L2 cache misses sounds promising, because if the instruction doesn't get through to the real memory, there is no row hammer. On the other hand, the cache is optimized so that normal code won't trigger it.
My conclusion up front: yes, it works well enough. That is, I believe the correct identification of row hammering is near perfect!
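The polling variant with its two tuning parameters can be sketched as follows. The poll interval is whatever the caller chooses as the time between calls, and the cut off level is a field in the state. Struct and function names are hypothetical, and no particular cut off value is suggested.

```c
#include <stdint.h>

/* State kept between two polls of the free-running counter. */
typedef struct {
    uint64_t last_count;  /* counter value at the previous poll       */
    uint64_t width_mask;  /* e.g. (1ull << 40) - 1 for a 40-bit PMC   */
    uint64_t cutoff;      /* events per interval treated as hammering */
} hammer_poll;

/* Call once per poll interval with the current counter value.
 * Returns 1 if the events since the last poll reach the cut off level.
 * The mask makes the difference come out right across a counter wrap. */
int hammer_poll_update(hammer_poll *p, uint64_t current_count)
{
    uint64_t delta = (current_count - p->last_count) & p->width_mask;
    p->last_count = current_count;
    return delta >= p->cutoff;
}
```

A longer interval or a higher cut off trades false positives for false negatives, which is exactly the tuning problem described above.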

My method for the proof of concept is plainly using vsperfcmd, which comes with Visual Studio 2013 Community Edition. Visual Studio 2013 is free and awesome; unfortunately that does not extend to vsperfcmd. It's free and horrible, but with a bit of tweaking it works.
First I had to write up a row hammer program for Windows. Since I don't actually want to mess with bits, I just made a row hammer "look-alike" program:
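UCHAR * Buffer = (UCHAR *)malloc(1024 * 1024);
MessageBox(0, L"Hello world", L"Hello world", MB_OK);  // pause so the profiler can attach
__asm  {
             mov ecx, 1000000      // loop counter
             mov eax, Buffer       // first address
             mov ebx, eax
             add ebx, 0x1000       // second address, 4 KB further on
code1a:
              mov edx, [eax]       // read both addresses
              mov esi, [ebx]
              clflush [eax]        // flush them so the next reads miss the cache
              clflush [ebx]
              dec ecx
              jnz code1a
}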
It hammers away like a real row hammer would, but obviously it's just a simulation (but hey, it could still flip bits on you, so be careful). I commented out the clflush instructions to play "normal program".

Now for profiling it. It took a while before I figured out that you should avoid the "launch" parameter of vsperfcmd, because it seems to insist on using instrumentation instead of sampling when this parameter is used, and that is just plain silly. I found this out because without the clflush instructions it complained that it couldn't instrument. Replace them with 15 nops and vsperfcmd was back in its unintended bad business. Ugly. However, "attach" works like a charm, but obviously requires me to have a way of first starting the row hammer program, then attaching, and then starting the actual row hammering code. Hence the MessageBox above (Sleep() produces better results if you wish to batch test).
This would be a command line for vsperfcmd that you guys out there can play with. I automated sampling in a batch file. The output file (c.vsp) can be opened with visual studio.
vsperfcmd /Start:sample /output:c.vsp /attach:5688 /Counter:L2Misses,1000,"L2Misses" /shutdown:20
Now the job was just to sample a lot of runs and see what results I would get. I got on average 1367 samples with the clflush instructions and 7 without. In my testing I never got fewer than 1200 samples with clflush and always single-digit results without. This of course isn't "proof" in any meaningful sense, but it's enough evidence that I'll stick my neck out and say it works. There is a lot more work to be done on this, but I'll take a break for now. Though I promise I will blog again on my original idea.

Wait there is more

Wondering how something like this could be implemented in a real life scenario, the obvious answer is to send an email to the administrator when row hammering is detected and shut down the computer before the attack can gain root access and do damage. This is fine in most corporate settings, but for home users it's just plain not the user experience you wish for. So I'm thinking that implementing this check would best be done in the scheduler of the operating system. It could profile every slice of time it allocates to user mode threads. If it comes up with a "row hammer" incident, it should just take the offending thread out of the scheduling loop so that it cannot do any more damage, and set a signal for a verification program. Chances are that the row hammering thread would have been preempted in or near the row hammering core loop, and just reading out the context would give you an eip/rip to work with for some additional analysis. If the additional analysis so chooses, it could just terminate the process, and the system can keep running.
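The scheduler idea can be illustrated with a toy model of a thread and two hooks: one at the end of each time slice and one for the verdict of the verification program. Everything here is hypothetical, including the names; a real scheduler integration would look very different, and this sketch only shows the control flow.

```c
#include <stdint.h>

/* Toy model of the per-thread state the scheduler would track. */
typedef struct {
    int      runnable;      /* 1 while the scheduler may pick the thread */
    int      needs_review;  /* signal for the verification program       */
    uint32_t suspect_eip;   /* eip read from the context at preemption   */
} sim_thread;

/* Called at the end of a time slice with the cache misses counted
 * during that slice. Returns 1 if the thread was taken out of the
 * scheduling loop. */
int slice_end_check(sim_thread *t, uint64_t misses_in_slice,
                    uint64_t cutoff, uint32_t context_eip)
{
    if (!t->runnable || misses_in_slice < cutoff)
        return 0;
    t->runnable = 0;
    t->needs_review = 1;
    t->suspect_eip = context_eip;
    return 1;
}

/* Verdict from the verification program: a cleared thread goes back
 * into the scheduling loop, a hammering one stays out of it. */
void review_done(sim_thread *t, int is_hammering)
{
    t->needs_review = 0;
    if (!is_hammering)
        t->runnable = 1;
}
```

Keeping a false positive suspended only until the review finishes is what makes this gentler than shutting the machine down.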

Wednesday, March 11, 2015

Row hammer privilege elevation attack is probably software fixable

To find out what row hammering is, please read this Project Zero post from Google. Unfortunately I’m extremely busy at the moment, so I’ll have to follow up on this post later with code. My initial thought was that not only the clflush instruction would be able to trigger this problem but also the non-temporal mov instructions, and unlike clflush these instructions are very useful in user mode. Say you’re decoding 4k video or running video filtering as you go: these are instructions you’ll find quite useful in optimizing your code. Also, the clflush instruction can only be banned in good sandboxes, as Google did for theirs. Unfortunately bad sandboxes and real computers remain vulnerable.
The fix I’d like to propose is to use the performance counters in the Intel CPU. Essentially they allow ring 0 code to trigger an interrupt on performance events by setting a model specific register in the CPU. Row hammer is essentially hammering away on the memory and this is bound to generate performance events. In fact it’ll light up performance counting like a Christmas tree. Such events are counted internally, and an interrupt can be generated once this counter overflows. This provides a defense strategy on two levels. First, the performance counting and interrupt generation are costly time wise, lowering the frequency with which the RAM is being hammered. Second, and most importantly, it’ll give a direct pointer to the code that generated the event, which can then be checked for offending instructions and behavior. Obviously the down side is that there will be a performance penalty for using performance counting in this way.

I’ve still got lots of work to do on this:
-          Write proof of concept code.
-          Figure out which performance counter is best
-          And estimate the performance cost

Performance counters are described in Intel Architecture Software Developer’s Manual, Volume 3, chapter 15.