Update 2/4-2015: I had a chat with Thomas Dullien about row hammer detection, and the Google people developed the same idea as I did in parallel. Their results pretty much match mine from my second post, "Row hammer detectable". I went one step further than Google by setting the INT bit in the MSR, which gives me the context of the cache miss. That lets me better analyze and guard against false positives, which I unfortunately think remain a part of the row hammer detection scheme. My current feeling is that parts of the industry are mostly interested in ignoring the row hammer bug to death, and that makes me feel queasy about the prospects of a fix ever being made available. My prediction is that only the detection method will be used in real life, and even that not widely.
Erratum regarding ROP
When I wrote the original text I was focused on the "architectural" events on Intel processors. With these events, the original text about tracking ROP using performance counters remains correct. However, Georg Wicherski (ochsff) commented to me that there is a non-architectural "Ret Miss" counter (event select 0x90, umask 0; for the Core Duo see table 19.22 in the Intel system programming guide) which would suit this perfectly. Thus my method for finding normal code execution outside of known areas in kernel mode could be extended to find ROP'ing in user mode as well as kernel mode. User mode will probably require a severe trade-off between false negatives (not finding ROP when it's there) and performance penalty. This is mainly due to the fact that all rets after a task switch will be mispredicted. I think something useful could be made. Unfortunately, push/ret combinations are common in copy protection and packing wrappers, and they will probably cause false positives with this method as well.
Accompanying source code is here:
Row hammer fix POC and generic root kit detection using performance counters
Many thanks to Jerome and his sharp mind, which helped me out with this nightmare. Thanks go to Joey too.
I think it was in 2008 that I met up with the guys from the late 1990s and early 2000s, not from where I lived. It was a happy and alcohol-infused occasion with an awesome amount of brain power present. One of the guys, I'll call him N, had just written a kernel mode privilege attack (for Windows XP, I believe) and was developing ideas on how to hide himself from detection. I did what I do best and started trolling him. My first plan of attack was pretty simple: invoke a blue screen and search the kernel mode memory dump that ensued. A counter attack is difficult, but possible. This idea is actually still pretty good, so I should write something up on it some time. N argued about the obvious problems with this method. So I countered with how I would profile the entire computer using performance counters. You can hide, but you cannot run, not without me knowing, at least probability-wise. The discussion drowned at that point, I think. I've long planned to write up this blog but always postponed it because it's not so closely aligned with my current interest in machine learning. So profiling was in the back of my head.
Enter row hammer. Google Project Zero wrote about their row hammer attack, and I found myself thinking that if performance counters can do anything in infosec, then they can do this. Thomas Dullien wrote an entry on Facebook saying that he was happy to be done with row hammer, and up came the troll in me. I trolled him with the same thing I'd trolled N with back in the day. Obviously, when you have a good idea that you haven't published, trolling someone with it in an open Facebook post, and especially on Thomas Dullien's Facebook, is a bad idea. That forced me into publishing half-finished stuff. Sorry. Unfortunately it came with maximally bad timing in my personal life, and on the very day my hard disk had crashed.
So now for a somewhat better write-up
Intel CPUs support performance counters to help software developers profile their software. Profiling is essentially the process of seeing how much, what kind and where computing resources are being spent, to provide clues for optimizing your code. Intel has a number of model-specific registers (MSRs) to deal with this. You can enable counting for a number of different performance events such as time spent, cache misses and so on, and read out the counter register using the WRMSR/RDMSR instructions, which are available only in kernel mode. Obviously row hammer is hammering away on the memory, causing bottlenecks. Start the profiling and you could detect somebody hammering your system and trigger a search for the culprit. This is particularly easy if you're just running a sandbox. This is pretty much the concept that I showed was viable in my last post on the subject.
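As a rough illustration of that polling idea, here is a minimal user-mode C sketch of the decision logic only. It assumes some kernel component samples the cache-miss counter with RDMSR at a fixed interval; the function name and the threshold are hypothetical and are not taken from the POC, and a usable threshold would have to be measured on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: cache misses per sampling interval that we
 * treat as suspicious. A real value must be measured, not guessed. */
#define MISS_THRESHOLD 100000u

/* Returns true if the number of cache misses counted between two
 * consecutive counter samples exceeds the threshold. The counter is
 * `width` bits wide (40 or 48 on the hardware discussed here) and is
 * allowed to wrap around between the two samples. */
bool looks_like_hammering(uint64_t prev, uint64_t now, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? ~0ull : ((1ull << width) - 1);
    uint64_t delta = (now - prev) & mask;   /* wrap-safe difference */
    return delta > MISS_THRESHOLD;
}
```

Masking the difference with the counter width keeps the comparison correct even when the counter wrapped between the two reads.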
It gets better. You can even have the counters trigger an interrupt when they overflow, allowing you to get information not only on how often things are happening but also where. Let the thing run long enough and you'll see where in your code the interrupts start to rack up a tally, and you'll know your bottleneck, or in my case the row hammer.
The source code of my proof of concept is available. It's kept very basic, so it should not be considered production code in any sense (also, it's dead insecure!). Remember, I never try to write a blog if I don't believe I can finish it in 24 work hours. The first decision I made was to write the POC code for only one operating system, and that operating system was going to be 32-bit Windows XP. There are two reasons for this: the first is that I had a WinXP computer available for testing, and the second is that 32-bit WinXP does not protect the interrupt descriptor table, which I would need to modify. This means less messing around with the operating system. It of course also means my code won't port directly to modern Windows versions, but that is irrelevant for showing that a software fix is possible, since the entire code is based on hardware features. However, on a modern system you would probably need to be less than well behaved, or rewrite parts of the operating system. Now, my XP system didn't have a kernel mode debugger on it, and I decided it wasn't worth spending time setting up the system for debugging with WinDbg. It is fair to say that was a bad decision.
Starting out, the first task was to find out which interrupt I needed to hook. The interrupt is called PMI (performance monitoring interrupt) in the Intel documentation, but unlike the usual interrupts it isn't assigned a fixed number; rather, you have to ask the APIC for its mapping. Having done enough documentation reading for one day, I decided that LiveKD with symbols would probably give me a few hints. So I downloaded LiveKD, set up the symbols and executed "!idt", and I had been right: "hal!HalpPerfInterrupt" sounded just about right. The interrupt is 0xFE.
Second, I needed to hook the interrupt. I've done this before for a piece of copy protection software I worked on. The essentials are to use KeSetThreadAffinity in a loop to get onto every processor in the system, use the SIDT instruction to find the interrupt descriptor table, and then modify it. I then realized that my only source code for hooking interrupts (in fact, all but one of the drivers I still have source code for) was company property and that I couldn't publish it in a blog. So instead of rewriting everything I started a major cut'n'paste adventure using internet sources. Sorry, but the clock was ticking; see Literature for my cut'n'paste sources.
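To make the "then modify it" step a little more concrete, here is a small C sketch of how a 32-bit x86 interrupt gate stores its handler address and how a hook would swap it. This is not the POC's code: the struct and function names are made up for illustration, and it runs against an ordinary in-memory copy of a gate rather than the real IDT, which can only be touched from kernel mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of one 8-byte entry in the 32-bit x86 IDT (interrupt gate).
 * The handler address is split into two 16-bit halves. */
typedef struct {
    uint16_t offset_low;   /* handler address, bits 0..15   */
    uint16_t selector;     /* code segment selector         */
    uint8_t  reserved;     /* must be zero                  */
    uint8_t  type_attr;    /* gate type, DPL, present bit   */
    uint16_t offset_high;  /* handler address, bits 16..31  */
} idt_gate32;

/* Reassemble the 32-bit handler address from the two halves. */
uint32_t gate_get_handler(const idt_gate32 *g)
{
    assert(g != NULL);
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Point the gate at new_handler and return the previous handler so the
 * hook can chain to it. Selector and attributes are left untouched. */
uint32_t gate_set_handler(idt_gate32 *g, uint32_t new_handler)
{
    uint32_t old = gate_get_handler(g);
    g->offset_low  = (uint16_t)(new_handler & 0xFFFFu);
    g->offset_high = (uint16_t)(new_handler >> 16);
    return old;
}
```

In a driver, the gate for vector 0xFE would be located through the IDT base that SIDT returns on each processor, and the two halves should presumably be written with interrupts disabled so the entry is never seen half-updated.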
Next was setting up the performance counters to generate the interrupt. The performance counting mechanism originally allowed two programmable performance counters to run and has since been vastly expanded. I decided to concentrate on the original implementation, since that would be supported on most computers, including my ageing Core2Duo system. Obviously you shouldn't just assume that the hardware supports these functions, but ask the CPU whether the feature is present using the CPUID instruction. I skipped this step for the usual reasons. To get up and running we need three model-specific registers:
ia32_perf_global_ctrl: This register controls which performance counters are enabled.
pmc0: This register is the actual counter. Depending on the hardware it's either 40 or 48 bits wide. One should ask the CPUID instruction which it is. pmc1 exists too, but I don't need it.
perfevtsel0: This register is the control register for the counter. Here you select the properties of your counter and which events you wish to count.
The first step is to make sure there is no profiling going on: simply set ia32_perf_global_ctrl to 0. Now, since we wish to let the counter overflow to generate an interrupt, we set pmc0 to a value close to overflowing. Interestingly enough, you cannot set the bits beyond the 32nd bit (on my processor; some processors can, so again you should use CPUID to figure out what to do). These are filled in automatically according to the sign bit of the first 32 bits. So in my case I could just set it to -1000 and my overflow would trigger after 1000 events. You would need to reset it if you wish to keep interrupting. This value is also a tuning parameter: performance overhead from interrupting versus accuracy. I didn't care about tuning anything, for the usual reason.
Once the counter is set up you need to deal with perfevtsel0. For me there are four important things to notice here. The first is that I set it to monitor user mode only; there's no reason to complicate matters for row hammer by watching kernel mode, but theoretically you can. Second and third are the event you wish to monitor: I was looking for cache misses, which is set in the event field, and the umask field follows from the event you choose; you can find it in a table in the Intel documentation (documented in the source). Fourth, I set the interrupt flag to generate the interrupt I want. Finally, ia32_perf_global_ctrl needs to be set to activate performance counting.
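The register values described above can be sketched in plain C as follows. The MSR indices and bit positions are the ones given in the Intel manual for architectural performance monitoring; the choice of the architectural "LLC Misses" event (0x2E, umask 0x41) is an assumption on my part, since the text only says "cache misses", and the helper names are hypothetical. The actual WRMSR writes need kernel mode and are left out.

```c
#include <assert.h>
#include <stdint.h>

/* MSR indices (Intel SDM, architectural performance monitoring). */
#define MSR_IA32_PMC0             0x0C1u
#define MSR_IA32_PERFEVTSEL0      0x186u
#define MSR_IA32_PERF_GLOBAL_CTRL 0x38Fu

/* IA32_PERFEVTSELx flag bits. */
#define EVTSEL_USR (1u << 16)  /* count while in user mode (CPL > 0) */
#define EVTSEL_OS  (1u << 17)  /* count while in kernel mode (CPL 0) */
#define EVTSEL_INT (1u << 20)  /* raise the PMI when the counter overflows */
#define EVTSEL_EN  (1u << 22)  /* enable this counter */

/* Assumed event: architectural "LLC Misses". */
#define EVENT_LLC_MISS 0x2Eu
#define UMASK_LLC_MISS 0x41u

/* Event select goes in bits 0..7, umask in bits 8..15, flags on top. */
uint32_t make_evtsel(uint8_t event, uint8_t umask, uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}

/* Low 32 bits to write to pmc0 so it overflows after `events` events,
 * i.e. the two's complement representation of -events. */
uint32_t pmc_preload(uint32_t events)
{
    assert(events > 0 && events <= 0x7FFFFFFFu);
    return 0u - events;
}

/* Model of what a `width`-bit counter holds after such a write on a CPU
 * that sign-extends bit 31 into the upper bits, as described above. */
uint64_t pmc_after_write(uint32_t low32, unsigned width)
{
    assert(width > 32 && width < 64);
    uint64_t extended = (low32 & 0x80000000u)
                      ? (0xFFFFFFFF00000000ull | low32)
                      : (uint64_t)low32;
    return extended & ((1ull << width) - 1);
}
```

With these helpers, the sequence in the text would be: write 0 to MSR_IA32_PERF_GLOBAL_CTRL, write pmc_preload(1000) to MSR_IA32_PMC0, write make_evtsel(EVENT_LLC_MISS, UMASK_LLC_MISS, EVTSEL_USR | EVTSEL_INT | EVTSEL_EN) to MSR_IA32_PERFEVTSEL0, and finally set bit 0 of MSR_IA32_PERF_GLOBAL_CTRL to start pmc0.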
Finally I should mention the interrupt service routine (ISR). To develop a real checking ISR you need to consider lots of things, the most important being the state of the operating system when you receive this interrupt. The operating system has been interrupted and the kernel has not been given a chance to settle into a consistent state (any serious discussion of IRQL etc. is way beyond the scope here), and thus kernel API calls are a bad idea. You can deal with it of course, as the kernel indeed does. Being lazy, I figured the original ISR would do it all for me, and thus I set a global variable with the EIP of the interrupted code, call the original handler, and deal with the EIP later. I figured right. The only problem is that the original handler immediately stops profiling, so I get my interrupt just once. I consider that enough for a POC.
Now you just need to check whether the interrupted code is surrounded by row hammer type code and act accordingly. Be aware that interrupts push a return address, so the offending instruction is the one prior to where the return address points. Equally, there can be some latency with profiling, causing interrupts to be delayed by another couple of instructions. This can be dealt with, but not in any pretty way, and it shouldn't matter anyway.
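One simple way to do such a check, sketched here under my own assumptions rather than taken from the POC, is to scan a window of bytes on both sides of the saved EIP for a clflush encoding (0F AE /7 with a memory operand). Scanning both directions sidesteps the return-address and latency issues just mentioned. The function names and window size are hypothetical, and a plain byte scan like this is not a disassembler, so it can produce false positives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CLFLUSH is encoded as 0F AE /7 with a memory operand: the ModRM reg
 * field is 7 and mod is not 3 (0F AE with mod == 3, reg == 7 is SFENCE). */
static bool is_clflush_at(const uint8_t *p)
{
    uint8_t modrm = p[2];
    return p[0] == 0x0F && p[1] == 0xAE &&
           ((modrm >> 3) & 7) == 7 && (modrm >> 6) != 3;
}

/* Look for a clflush within `window` bytes before or after offset `ip`
 * in the buffer `code` of length `len`. Instruction prefixes and
 * instruction boundaries are ignored in this sketch. */
bool clflush_near(const uint8_t *code, size_t len, size_t ip, size_t window)
{
    assert(code != NULL && ip <= len);
    size_t start = (ip > window) ? ip - window : 0;
    size_t end   = (len - ip > window) ? ip + window : len;
    for (size_t i = start; i + 3 <= end; i++) {
        if (is_clflush_at(code + i))
            return true;
    }
    return false;
}
```

For a loop of the form "read, read, clflush, clflush, mfence, jump back", a window of a few dozen bytes around the sampled EIP should be enough to see the flushes.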
Here is the result:
Finally, I didn't write a driver loader routine. Just use the OSR Driver Loader or write one.
My Views on Row hammer
I think row hammer was the coolest exploit I've read about in a long, long time. It's novel in so many ways. It utilizes a hardware "bug". When I say "bug", it's more of an accepted compromise of price and speed/size versus reliability. I don't think that row hammer will ever be fixed in the manner I described here or in my last blog. It would work, but it's a lot of work to integrate a fix well into the OS, and Microsoft and Apple won't feel responsible for bad memory that they have nothing to do with. For Linux, maybe somebody will include a fix one day, but it's more likely to be of the type in my previous blog because that is easier to implement; the price will possibly be false positives and negatives. In short, you should buy ECC memory next time, and then row hammer will only be a DoS exploit. As for triggering the problem with instructions other than clflush: there is, in my opinion, a slim chance that it's possible. My best guess would be the non-temporal instructions combined with reads from multiple different locations in a pattern that forces cache flushes (running on multiple cores at once?). The same could apply to normal integer register reads, but I don't think it will. As for Java exploits of the same, I sincerely doubt that the world will ever see one. Finally, Intel should make clflush a privileged instruction in future processors; the benefits are too small to warrant it being accessible, given that it's potentially dangerous to RAM.
Detecting root kits with performance counters in general
Now back to what I had in mind back in 2008. Imagine profiling kernel mode only, which is possible. I will assume kernel mode only for the rest of this blog, though there might be generalizations to user mode. Now you cannot run without risking that the counter overflows in your code. It is quite easy to imagine writing an interrupt handler which ignores interrupts in a list of memory regions that correspond to known signed drivers, but which otherwise raises an alert that the system has been infected by malware. This effectively limits what root kits in general can do in allocated memory. Regardless of whether the memory is hidden from the Windows memory manager or not, if it runs, it gets detected.
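The core of such a handler is just a range lookup. A minimal C sketch, assuming the list of known driver regions has already been collected somewhere (how to build and trust that list is not covered here); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One memory region belonging to a known, signed driver. */
typedef struct {
    uint32_t base;  /* first address of the region */
    uint32_t size;  /* length of the region in bytes */
} code_region;

/* Returns true if the sampled EIP lies inside any known region.
 * A false result is what would raise the malware alert. */
bool eip_is_known(const code_region *regions, size_t count, uint32_t eip)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* Unsigned subtraction: if eip is below base the result wraps
         * to a huge value and the comparison fails, as intended. */
        if (eip - regions[i].base < regions[i].size)
            return true;
    }
    return false;
}
```

A linear scan keeps the sketch short; with many drivers loaded, a sorted table and a binary search would keep the time spent inside the interrupt handler down.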
The obvious counter measure is to use just one instruction on entry to the malware to reset the MSR. I'm unlikely to catch this instruction with a performance counter unless it's called a lot (or I'm willing to accept huge performance penalties). But I can do statistics on the timestamp counter and how many performance events I'd expect in a given number of clock cycles. If performance counting is shut off, these two measures will diverge, and that could potentially be used for detection of malware. My guess is that short hook routines etc. could get away with this most of the time without being detected, but there will be a disturbance in the signal, and statistical analysis could be used to filter out the change. Another counter attack would be to remove the interrupt hook entirely. That, on the other hand, is easily detectable directly in memory, or could be analyzed using the same method as above.
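As a very rough sketch of that divergence test, assume a baseline event rate per million clock cycles has been measured on a clean system. The function name, the baseline and the tolerance are all hypothetical, and a real implementation would want proper statistics rather than a fixed percentage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compare the events actually counted over an interval with what the
 * elapsed timestamp counter cycles would lead us to expect.
 *   tsc_delta         cycles elapsed according to the timestamp counter
 *   events_seen       performance events counted in the same interval
 *   events_per_mcycle assumed baseline: events per million cycles
 *   tolerance_pct     how far below the expectation is still accepted
 * Returns true if counting appears to have been switched off for part
 * of the interval. */
bool counting_was_suppressed(uint64_t tsc_delta, uint64_t events_seen,
                             uint64_t events_per_mcycle,
                             unsigned tolerance_pct)
{
    assert(tolerance_pct <= 100);
    uint64_t expected = tsc_delta / 1000000u * events_per_mcycle;
    uint64_t lower_bound = expected - expected / 100u * tolerance_pct;
    return events_seen < lower_bound;
}
```

A short hook that pauses counting for a few hundred cycles would stay well inside any sensible tolerance, which matches the guess above that only longer-running code produces a visible disturbance.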
Performance counters and ROP
Obviously, in my scenario ROP attacks remain possible. ROP'ing is using control over the stack to control execution flow by executing tidbits of existing code. Say you need a call to WriteFile: instead of writing a buffer containing a call to WriteFile, you find a suitable code sequence in existing code and use your stack control to reach that instead. You can in theory do that ad infinitum and put together any code you'd like. ROP chains are much, much harder to produce than code in allocated memory, and they would limit functionality (hooks would be nasty to implement as ROP, if even possible). However, performance counters can do something against these kinds of attacks too. One of the things that makes ROP easier to implement is ROP'ing into the middle of existing instructions. Say 0xFF 0x15 0xc0 0x5d 0xc3 0x00 would normally be a call instruction with indirection, but contained in it is also a pop ebp, ret (0x5d 0xc3), which is useful for ROP'ing. If you carefully analyze instruction locations and match them up with where you find performance counter interrupts, you'd be able to detect this kind of abuse too. I don't think it'll ever have any practical application though.
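The matching step could look something like the C sketch below. It assumes a sorted table of legitimate instruction start addresses has been produced beforehand by disassembling the known code; building that table is the hard part and is not shown. The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Binary search in a sorted table of known instruction start addresses.
 * Returns true if addr is the start of a legitimately decoded
 * instruction; false means execution was sampled in the middle of one,
 * which hints at a gadget hidden inside a longer instruction. */
bool is_instruction_start(const uint32_t *starts, size_t count, uint32_t addr)
{
    assert(starts != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] == addr)
            return true;
        if (starts[mid] < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}
```

For a six-byte indirect call starting at a hypothetical address 0x1000, the table would hold 0x1000 and 0x1006; an interrupt whose return address is 0x1004, in the middle of the call, would be flagged.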
Literature
My driver code was shamelessly stolen from here (and the DDK storage filter sample): https://sites.google.com/site/jozsefbekes/Home/windows-programming/windriver-hello-world
The interrupt hooks came from here:
The basis for my performance register code came from here:
See also my previous row hammer blogs on dreamsofastone.blogspot.com
The original row hammer report is here:
Finally this is the important part of the Intel documentation:
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B & 3C):