Update 2/4-2015
I had a chat with Thomas Dullien about row hammer detection, and the Google people developed the same idea as I did in parallel. Their results pretty much match mine from my second post on "Row hammer dectable". I went one step further than Google by setting the INT bit in the MSR. This gives me the context of the cache miss, which lets me analyze and weed out false positives better, though I unfortunately think false positives will remain a part of any "row hammer" detection scheme. My current feeling is that parts of the industry are mostly interested in ignoring the row hammer bug to death, and that makes me feel queasy about the prospects of a fix ever being made available. My prediction is that only the detection method will be used in real life, and even that not widely.
Erratum regarding ROP
When I wrote the original text I was focused on the "architectural" events on Intel processors. With these events, the original text about tracking ROP using performance counters remains correct. However, Georg Wicherski (ochsff) pointed out to me that there is a non-architectural "Ret Miss" counter (event select 0x90, umask 0 for the Core Duo, table 19.22 in the Intel system programming guide) which would suit this perfectly. Thus my method for finding normal code execution outside of known areas in kernel mode could be extended to find ROP'ing in user mode as well as kernel mode. User mode will probably require a severe trade-off between false negatives (not finding ROP when it's there) and performance penalty. This is mainly due to the fact that all rets after a task switch will be mispredicted. I think something useful could be made. Unfortunately push/ret combinations are common in copy protection and packing wrappers, and they will probably cause false positives with this method as well.
Original Post
Accompanying source code is here: Row hammer fix POC and generic root kit detection using performance counters
Many thanks to Jerome and his sharp mind that helped me out with this nightmare. Thanks go to Joey too.
An anecdote
I think it was in 2008, meeting up with the guys from the late 1990s and early 2000s not from where I lived. It was a happy and alcohol-infused occasion with an awesome amount of brain power present. One of the guys - I'll call him N - had just written a kernel mode privilege attack, for Windows XP I believe, and was developing ideas on how to hide himself from detection. I did what I do best and started trolling him. My first plan of attack was pretty simple: invoke a blue screen and search the kernel mode memory dump that ensued. A counter attack is difficult, but possible. This idea is actually still pretty good, so I should write up something on it some time. N argued about the obvious problems with this method. So I countered with how I could profile the entire computer using performance counters. You can hide, but you cannot run, not without me knowing, at least probability-wise. The discussion drowned at that point, I think. I've long planned to write this blog but always postponed it because it's not so closely aligned with my current interest in machine learning. So profiling was in the back of my head.
Enter row hammer. Google Project Zero wrote about their row hammer attack and I found myself thinking: if performance counters can do anything in infosec, then they can do this. Thomas Dullien wrote an entry on Facebook that he was happy to be done with row hammer, and up came the troll in me. I trolled him with the same thing I had trolled N with back in the day. Obviously, when you have a good idea you haven't published, trolling someone with it in an open Facebook post, and especially on Thomas Dullien's Facebook, is a bad idea. And that forced me into publishing half-finished stuff. Sorry. Unfortunately it came at the worst possible time for my personal life, and on the very day my hard disk had crashed.
So now for a bit better write-up
Intel CPUs support performance counters to help software developers profile their software. Profiling is essentially the process of seeing how much, what kind of, and where computing resources are being spent, to provide clues for optimizing your code. Intel has a number of model specific registers to deal with this. You can enable counting for a number of different performance events, such as time spent, cache misses, and so on, and access the counter registers using the WRMSR/RDMSR instructions, which are available only in kernel mode. Obviously row hammer is hammering away on the memory, causing bottlenecks. Start the profiling and you can detect somebody hammering your system and trigger a search for the culprit. This is particularly easy if you're just running a sandbox. This is pretty much the concept that I showed was viable in my last post on the subject.
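A minimal sketch of that polling idea in C. It assumes the kernel-mode side has already read the raw counter with RDMSR at the start and end of an interval. The function names and the threshold are my own illustrations and not taken from the POC, and a real threshold would need tuning per machine.

```c
#include <assert.h>
#include <stdint.h>

/* Difference between two raw counter reads, tolerating one wrap-around.
   width_bits is the hardware counter width (40 or 48 bits on the CPUs
   discussed in this post). */
static inline uint64_t pmc_delta(uint64_t before, uint64_t after,
                                 unsigned width_bits)
{
    assert(width_bits > 0 && width_bits < 64);
    uint64_t mask = (UINT64_C(1) << width_bits) - 1;
    return (after - before) & mask;
}

/* Returns 1 when the number of cache misses counted in one polling
   interval exceeds a threshold (hypothetical value, to be tuned). */
static inline int looks_like_hammering(uint64_t before, uint64_t after,
                                       unsigned width_bits,
                                       uint64_t threshold)
{
    return pmc_delta(before, after, width_bits) > threshold;
}
```

The masking is there because the counter is narrower than 64 bits, so a plain subtraction gives garbage whenever the counter wrapped between the two reads.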
It gets better. You can even have the counters trigger an interrupt when they overflow, allowing you to get information not only on how often things are happening but also where. Let the thing run long enough and you'll see where in your code the interrupts start to rack up a tally, and you'll know your bottleneck or, in my case, the row hammer.
The source code of my proof of concept is available. It's kept very basic, so it should not be considered production code in any sense (also it's dead insecure!!!!). Remember, I never try to write a blog if I don't believe I can finish it in 24 work hours. The first decision I made was to write the POC code for only one operating system, and that operating system was going to be WinXP32. There are two reasons for this: the first is that I had a WinXP computer available to me for testing, and the second is that WinXP32 does not protect the interrupt descriptor table, which I would need. This means less messing around with the operating system. This of course means my code won't port directly to modern Windows versions, but that is irrelevant for showing that a software fix is possible, since the entire code is based on hardware features. However, on a modern system you would probably need to be less than well behaved, or rewrite parts of the operating system. Now my XP system didn't have a kernel mode debugger on it, and I decided it wasn't worth spending time setting up the system for debugging with WinDbg. It's fair to say that was a bad decision.
Starting out, the first task was to find out which interrupt I needed to hook. The interrupt is called PMI (performance monitor interrupt) in the Intel documentation, but unlike the usual interrupts it isn't assigned a fixed number. Rather, you have to ask the APIC for its mapping. Having done enough docu reading for a day, I decided that LiveKD with symbols would probably give me a few hints. So I downloaded LiveKD, set up the symbols and executed "!idt", and I had been right: "hal!HalpPerfInterrupt" sounded just about right. The interrupt is 0xFE.
Second, I needed to hook the interrupt. I've done this before for a piece of copy protection software I worked on. The essentials are to use KeSetSystemAffinityThread in a loop to get onto every processor in the system, then use the SIDT instruction to find the interrupt descriptor table, and then modify it. I then realized that my only source code for hooking interrupts (in fact all drivers but one for which I still have source code) was company property and that I couldn't publish it in a blog. So instead of rewriting everything I started a major cut'n'paste adventure using internet sources. Sorry, but the clock was ticking. See Literature for my cut'n'paste sources.
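To make the "then modify it" step a bit more concrete, here is a rough C sketch of how a handler address is stored in a 32-bit IDT gate and how it could be swapped. This is not the POC code. The struct and function names are my own, and a real driver would have to do this on each processor with interrupts disabled, which the sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of one 32-bit protected-mode IDT gate descriptor (8 bytes).
   The handler address is split into two 16-bit halves. */
typedef struct {
    uint16_t offset_low;   /* handler address bits 0..15   */
    uint16_t selector;     /* code segment selector        */
    uint8_t  reserved;     /* always zero                  */
    uint8_t  type_attr;    /* gate type, DPL, present bit  */
    uint16_t offset_high;  /* handler address bits 16..31  */
} idt_gate32;

/* Compile-time check that the struct has no padding. */
typedef char idt_gate32_size_check[sizeof(idt_gate32) == 8 ? 1 : -1];

/* Reassemble the 32-bit handler address from the two halves. */
static inline uint32_t gate_get_handler(const idt_gate32 *g)
{
    assert(g != NULL);
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Point the gate at new_handler and return the previous handler,
   so the hook can chain to it later. */
static inline uint32_t gate_set_handler(idt_gate32 *g, uint32_t new_handler)
{
    uint32_t old = gate_get_handler(g);
    g->offset_low  = (uint16_t)(new_handler & 0xFFFFu);
    g->offset_high = (uint16_t)(new_handler >> 16);
    return old;
}
```

With the PMI on vector 0xFE, the gate to patch would be entry 0xFE of the table whose base SIDT returns.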
Next was setting up the performance counters to generate the interrupt. The performance counting mechanism originally allowed two programmable performance counters to run and has since been vastly expanded. I decided to concentrate on the original implementation since that would be supported on most computers, including my ageing Core2Duo system. Obviously you shouldn't just assume that the hardware supports these functions, but ask the CPU if the feature is present using the CPUID instruction. I skipped this step for the usual reasons. To get up and running we need 3 model specific registers:
ia32_perf_global_ctrl (0x38f) | This register controls what performance counters are enabled
ia32_pmc0 (0xC1) | This register is the actual counter. Depending on the hardware it's either 40 or 48 bits wide. One should ask the CPUID instruction which it is. pmc1 would exist too, but I don't need it.
ia32_perfevtsel0 (0x186) | This register is the control register for the counter. Here you select the properties of your counter and what events you wish to count.
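The CPUID check that I skipped could look roughly like the sketch below. It decodes EAX from CPUID leaf 0x0A, the architectural performance monitoring leaf, which reports the version, the number of general purpose counters and their bit width. The struct and function names are made up for illustration. Getting the EAX value itself is left out. With GCC or Clang on x86 one could use `__get_cpuid` from `<cpuid.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of CPUID leaf 0x0A, register EAX. */
typedef struct {
    unsigned version;        /* bits 7..0: perfmon version, 0 = not supported */
    unsigned num_counters;   /* bits 15..8: general purpose counters per core */
    unsigned counter_width;  /* bits 23..16: width of each counter in bits    */
} pmu_info;

/* Split the raw EAX value into its fields. */
static inline pmu_info decode_cpuid_0a_eax(uint32_t eax)
{
    pmu_info info;
    info.version       = eax & 0xFFu;
    info.num_counters  = (eax >> 8) & 0xFFu;
    info.counter_width = (eax >> 16) & 0xFFu;
    return info;
}

/* Largest value a counter of the given width can hold. */
static inline uint64_t pmc_max_value(unsigned width_bits)
{
    assert(width_bits > 0 && width_bits < 64);
    return (UINT64_C(1) << width_bits) - 1;
}
```

A version of 0 means the architectural performance monitoring described here is not available, and the setup should not be attempted.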
The first step is to make sure there is no profiling going on. Simply set ia32_perf_global_ctrl to 0. Now, since we wish to let the counter overflow to generate an interrupt, we set the counter (ia32_pmc0) to a value close to actually overflowing. Interestingly enough, you cannot set the bits beyond the 32nd bit (on my processor; some processors can, so again you should use CPUID to figure out what to do). These are automatically filled according to the sign bit of the first 32 bits. So in my case I could just set it to -1000 and my overflow would trigger after 1000 events. You would need to reset this if you wish to continue interrupting. This too is a tuning parameter: performance overhead from interrupting versus accuracy. I didn't care for tuning anything, for the usual reason. Once the counter is set up you need to deal with perfevtsel0. For me there are 4 important things to notice here. The first is where I set that I wish to monitor user mode only. There is no reason to complicate matters for row hammer by watching kernel mode, but theoretically you can. Second and third are the event you wish to monitor. I was looking for cache misses, which is set in the event field; the umask field follows from the event you choose, and you can find it in a table in the Intel documentation (documented in the source). Fourth, I wish to set the interrupt flag to generate the interrupt I want. Finally, ia32_perf_global_ctrl needs to be set to activate performance counting.
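Here is a hedged C sketch of the values that go into the registers in the steps above. It only computes the values. The actual WRMSR writes need kernel mode and are listed in the comment. The bit positions are the documented IA32_PERFEVTSELx layout. The choice of the architectural "LLC Misses" event (event 0x2E, umask 0x41) is my assumption for "cache misses" here, so check the POC source for the event it really uses. The helper names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits in IA32_PERFEVTSELx. */
#define PERFEVTSEL_USR (UINT32_C(1) << 16)  /* count in user mode          */
#define PERFEVTSEL_OS  (UINT32_C(1) << 17)  /* count in kernel mode        */
#define PERFEVTSEL_INT (UINT32_C(1) << 20)  /* raise PMI on overflow       */
#define PERFEVTSEL_EN  (UINT32_C(1) << 22)  /* enable this counter         */

/* Event select lives in bits 0..7, umask in bits 8..15. */
static inline uint32_t make_perfevtsel(uint8_t event, uint8_t umask,
                                       uint32_t flags)
{
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}

/* Counter value that overflows after `events` more events. Software
   writes the low 32 bits (here: -events) and the CPU sign-extends them
   to the full counter width, which is what this function models. */
static inline uint64_t counter_preset(uint32_t events, unsigned width_bits)
{
    assert(width_bits > 32 && width_bits < 64);
    assert(events > 0 && events <= UINT32_C(0x7FFFFFFF));
    int32_t low = -(int32_t)events;
    uint64_t mask = (UINT64_C(1) << width_bits) - 1;
    return (uint64_t)(int64_t)low & mask;
}

/* Order of the MSR writes described in the text (kernel mode only):
     1. WRMSR 0x38F <- 0                      stop all counting
     2. WRMSR 0xC1  <- (uint32_t)-1000        preset ia32_pmc0
     3. WRMSR 0x186 <- make_perfevtsel(...)   select event, USR, INT, EN
     4. WRMSR 0x38F <- 1                      bit 0 enables PMC0 */
```

For example, `make_perfevtsel(0x2E, 0x41, PERFEVTSEL_USR | PERFEVTSEL_INT | PERFEVTSEL_EN)` gives 0x0051412E, and a preset of -1000 on a 40-bit counter sits 1000 events below the overflow point.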
Finally I should mention the interrupt service routine (ISR). To develop a real checking ISR you need to consider lots of things, the most important being the state of the operating system when you receive this interrupt. The operating system has been interrupted and the kernel has not been given a chance to come to rest with the situation (any serious discussion of IRQL, etc. is way beyond the scope here), and thus kernel API calls are a bad idea. You can deal with it of course, as the kernel indeed does. Being lazy, I figured the original ISR would do it all for me, and thus I set a global variable with the EIP of the interrupted code and call the original handler, then deal with the EIP later. I figured right. The only problem is that the original handler immediately stops profiling, so I just get my interrupt once. I consider that enough for a POC.
Now you just need to check if the interrupted code is surrounded by row hammer type code and act accordingly. Be aware that interrupts push return addresses, so the offending instruction is the one prior to where the return address is pointing. Equally, there can be some latency with profiling, causing the interrupt to be delayed by another couple of instructions. This can be dealt with, but not in any pretty way, and it shouldn't matter anyway.
Here is the
result:
Finally, I didn't write a driver loader routine. Just use the OSR Driver Loader or write one.
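As a rough illustration of the "is the interrupt surrounded by row hammer type code" check described earlier, the C sketch below counts clflush instructions in a window of bytes copied from around the saved EIP. It is a naive byte scan under my own assumptions, not the POC's check. It looks for the clflush encoding 0F AE /7 with a memory operand and does not respect instruction boundaries, so it can produce false hits inside other instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the bytes at index i encode clflush: opcode 0F AE with
   ModRM.reg == 7 and a memory operand (ModRM.mod != 3). With mod == 3
   the same opcode is sfence, which must not be counted. */
static inline int is_clflush_at(const uint8_t *code, size_t len, size_t i)
{
    assert(code != NULL);
    if (len < 3 || i > len - 3)
        return 0;
    uint8_t modrm = code[i + 2];
    return code[i] == 0x0F && code[i + 1] == 0xAE &&
           ((modrm >> 3) & 7) == 7 && (modrm >> 6) != 3;
}

/* Counts clflush encodings in a window around the interrupted address. */
static inline unsigned count_clflush(const uint8_t *code, size_t len)
{
    unsigned hits = 0;
    for (size_t i = 0; i + 3 <= len; i++)
        hits += (unsigned)is_clflush_at(code, len, i);
    return hits;
}
```

Because of the interrupt latency mentioned above, the window should extend some bytes before and after the saved EIP rather than start exactly at it.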
My Views on Row hammer
I think row hammer was the coolest exploit I've read about in a long, long time. It's novel in so many ways. It utilizes a hardware "bug". When I say "bug", it's more of an accepted compromise on price and speed/size versus reliability. I don't think that row hammer will ever be fixed in the manner I described here or in my last blog. It would work, but a good integration of the fix into the OS is a lot of work, and Microsoft and Apple won't feel responsible for bad memory they have nothing to do with. For Linux maybe somebody will include a fix one day, but it's more likely to be of the type in my previous blog because it's easier to implement. The price will possibly be false positives and negatives. In short, you should buy ECC memory next time, and then row hammer will only be a DoS exploit. As for triggering the problem with instructions other than clflush: there is, in my opinion, a slim chance that it's possible. My best guess would be the non-temporal instructions, combined with reads from multiple different locations in a pattern that forces cache flushes. (Running on multiple cores at once??) The same could apply to normal integer register reads, but I don't think it will. As for Java exploits of the same, I sincerely doubt that the world will ever see one. Finally, Intel should make clflush a privileged instruction in future processors. The benefits are too small to warrant it being accessible, given that it's potentially dangerous to RAM.
Detecting root kits with performance counters in general
Now back to what I had in mind back in 2008. Imagine using profiling on kernel mode only, which is possible. I will assume kernel mode only for the rest of this blog, though there might be generalizations to user mode. Now you cannot run without risking that the counter overflows in your code. It is quite easy to imagine writing an interrupt handler which ignores interrupts in a list of memory regions that correspond to known signed drivers, but which otherwise raises an alert that the system has been infected by malware. This effectively limits what root kits in general can do in allocated memory. Regardless of whether the memory is hidden from the Windows memory manager or not, if it runs, it gets detected.
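The core of such a handler is just a range lookup. Here is a minimal C sketch, assuming the list of known driver regions has been collected beforehand. How you build and protect that list is the hard part and is not shown. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One half-open address range [start, end) belonging to a known,
   signed driver image. 32-bit addresses, matching the XP32 setting. */
typedef struct {
    uint32_t start;
    uint32_t end;
} code_region;

/* Returns 1 if the interrupted EIP lies inside any known region and
   0 otherwise. A 0 result is what would raise the malware alert. */
static inline int eip_in_known_region(const code_region *regions,
                                      size_t count, uint32_t eip)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (eip >= regions[i].start && eip < regions[i].end)
            return 1;
    }
    return 0;
}
```

A linear scan keeps the sketch simple. Inside a real ISR you would want a sorted table and a binary search, since the handler runs on every counter overflow.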
Counter attack
The obvious countermeasure is to use just one instruction on entry to the malware to reset the MSR. I'm unlikely to catch this instruction with a performance counter unless it's called a lot (or I'm willing to accept huge performance penalties). But I can do statistics on the timestamp counter and how many performance events I'd expect in a given number of clock cycles. Now if performance accounting is shut off, these two measures will diverge, and that could potentially be used for detection of malware. My guess is that short hook routines etc. could get away with this most of the time without being detected, but there will be a disturbance in the signal, and statistical analysis could be used to filter out the change. Another counter attack would be to remove the interrupt hook entirely. That, on the other hand, is easily detectable directly in memory, or could be analyzed using the same method as above.
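A very crude C version of that divergence test could look like this. It compares the events actually counted in an interval against what a baseline rate predicts for the elapsed timestamp counter cycles. The baseline rate, the tolerance and all names are assumptions of mine. A serious version would use proper statistics over many intervals instead of a single fixed cut-off.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when fewer events were counted than the baseline predicts,
   beyond the allowed tolerance.
   tsc_delta           elapsed timestamp counter cycles in the interval
   event_delta         performance events counted in the same interval
   events_per_mcycle   baseline: expected events per million cycles
   tolerance_percent   how far below the expectation is still accepted
   Inputs are assumed small enough that the products fit in 64 bits. */
static inline int counting_looks_suppressed(uint64_t tsc_delta,
                                            uint64_t event_delta,
                                            uint64_t events_per_mcycle,
                                            unsigned tolerance_percent)
{
    assert(tolerance_percent <= 100);
    uint64_t expected = tsc_delta * events_per_mcycle / 1000000u;
    return event_delta * 100u < expected * (100u - tolerance_percent);
}
```

With a baseline of 2000 events per million cycles, an interval of 10 million cycles should show about 20000 events. Seeing only 9000 would trip a 20 percent tolerance, while 19000 would not.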
Performance counters and ROP
Obviously ROP attacks remain possible in my scenario. ROP'ing is using control over the stack to control execution flow by executing tidbits of existing code. Say you need a call to WriteFile: instead of writing a buffer containing a call to WriteFile, you find a suitable code sequence in existing code and use your stack control to reach that instead. You can in theory do that ad infinitum and write up any code you'd like. ROP chains are much, much harder to produce than code in allocated memory and would limit functionality (hooks would be nasty to implement as ROP, if even possible). However, performance counters can do something against these kinds of attacks too. One of the things that makes ROP easier to implement is ROP'ing into the middle of existing instructions. Say 0xFF 0x15 0xc0 0x5d 0xc3 0x00 would normally be a call instruction with indirection, but contained in it (0x5d 0xc3) is also a pop xx, ret, which is useful for ROP'ing. If you carefully analyze instruction locations and match them up with where you find performance counter interrupts, you'd be able to detect this kind of abuse too. I don't think it'll ever have any practical application though.
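The matching step could be sketched in C as below, assuming you have disassembled the known code ahead of time and kept a sorted list of offsets where legitimate instructions start. A sampled address that lands in the middle of an instruction, like the bytes hidden inside the six-byte call above, is then suspicious. The names are illustrative, and interrupt latency would have to be accounted for before trusting a single sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `offset` is the start of a known instruction.
   `starts` must be sorted in ascending order and come from a
   disassembly of the module done in advance. Binary search. */
static inline int is_instruction_start(const uint32_t *starts, size_t count,
                                       uint32_t offset)
{
    assert(starts != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] == offset)
            return 1;
        if (starts[mid] < offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

For a module where the call sits at offset 0 and the next instructions start at offsets 6 and 7, a sample at offset 3 is not an instruction start and would be flagged.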
Literature:
My driver code was shamelessly stolen from here (and the DDK storage filter sample): https://sites.google.com/site/jozsefbekes/Home/windows-programming/windriver-hello-world
The interrupt hooks came from here: http://www.codeproject.com/Articles/36585/Hook-Interrupts-and-Call-Kernel-Routines-in-User-M
The basis for my performance register code came from here: http://www.mindfruit.co.uk/2012/11/fun-with-msrs-counting-performance.html
See also my previous row hammer blogs on dreamsofastone.blogspot.com
The original row hammer report is here: http://googleprojectzero.blogspot.de/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
Finally this is the important part of the Intel documentation: Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3 (3A, 3B & 3C):