Row hammer detection is possible
First off, detection isn't fixing, but it's a good step in that direction, and I'm growing continually more confident in my claim that it probably is fixable as I work with this. Anyway, following the original idea sent me deep into writing drivers, and I quickly began to think that modifying my method slightly would bring detection, and that writing this up would give me a lot of confirmation ahead of time for my original idea, which I continue to think is superior, though significantly more technically complicated to write a proof of concept for. And I seriously need a break from doing my day job first and then coming home and doing a lot of hours of row hammering.
First, what is the modification to my original idea? Well, it's easy: just run the performance counters without setting the request for an interrupt on counter overflow in the MSR. The interrupt would give me the eip/rip of the offending instructions, and this again would allow me to identify very strongly whether something is row hammering or not. Without the interrupt I won't have eip/rip, so I'll essentially just be guessing based on the performance counts over an interval. This gives me two parameters to tune for detection. The first is the interval at which I poll the performance counter, and the second is a "cut off level". By "cut off level" I mean how many performance counts it takes before something is no longer normal but hammering. Essentially, this means that false positives and false negatives can occur. Of course I also have to decide which performance counter I should use. L2 cache misses sound promising, because if the instruction doesn't get through to the real memory, there is no row hammer. On the other hand, the cache is optimized so that normal code won't trigger many of them.
My conclusion up front: Yes, it works well enough. That is, I believe the correct identification of row hammering is near perfect!
My method for the proof of concept is plainly to use vsperfcmd, which comes with Visual Studio 2013 Community Edition. Visual Studio 2013 is free and awesome; unfortunately that does not extend to vsperfcmd. It's free and horrible, but with a bit of tweaking it works.
First I had to write up a row hammer program for Windows. Since I don't actually want to mess with bits, I just made a row hammer "look-alike" program:
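To make the two tuning parameters a bit more concrete, here is a minimal sketch in C of the polling idea. It assumes something else has already read a miss counter once per polling interval and stored the cumulative readings in an array; the function name, the array and the cutoff value are all hypothetical and only there for illustration. It also assumes the counter never wraps between readings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any real API.
 * readings[] holds cumulative counter values, one per polling interval.
 * Returns how many intervals saw at least `cutoff` new events.
 * Assumption: the counter does not wrap between two readings. */
size_t count_suspicious_intervals(const uint64_t *readings, size_t n,
                                  uint64_t cutoff)
{
    size_t flagged = 0;

    assert(readings != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        /* Events that happened during this one interval. */
        uint64_t delta = readings[i] - readings[i - 1];
        if (delta >= cutoff)
            flagged++;
    }
    return flagged;
}
```

A shorter interval reacts faster but sees fewer events per reading, so the cutoff has to shrink with it. A cutoff set too low gives false positives, and one set too high gives false negatives.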
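UCHAR *Buffer = (UCHAR *)malloc(1024 * 1024);
MessageBox(0, L"Hello world", L"Hello world", MB_OK); // pause here so the profiler can attach
__asm {
    mov ecx, 1000000     ; number of iterations
    mov eax, Buffer      ; first address
    mov ebx, eax
    add ebx, 0x1000      ; second address, 4 KB further in
hammer:
    mov edx, [eax]       ; read both addresses
    mov esi, [ebx]
    clflush [eax]        ; flush them so the next read has to go to memory
    clflush [ebx]
    dec ecx
    jnz hammer
}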
It hammers away like a real row hammer would, but obviously it's just a simulation (but hey, it could still flip bits on you, so be careful). I commented out the clflush instructions to play "normal program".
Now profiling it. It took a while before I figured out that you should avoid the "launch" parameter of vsperfcmd, because it seems to insist on using instrumentation instead of sampling when this parameter is used, and that is just plain silly. I found this out because, without the clflush instructions, it complained that it couldn't instrument. Replacing them with 15 nops, and vsperfcmd was back in its unintended bad business. Ugly. However, "attach" works like a charm, but it obviously requires me to have a way of first starting the row hammer program, then attaching, and then starting the actual row hammering code. Hence the message box above (Sleep() produces better results if you wish to batch test).
This would be a command line for vsperfcmd that you guys out there can play with. I automated sampling in a batch file. The output file (c.vsp) can be opened with visual studio.
vsperfcmd /Start:sample /output:c.vsp /attach:5688 /Counter:L2Misses,1000,"L2Misses" /shutdown:20
Now the job was just to sample a lot of runs and see what results I would get. I got on average 1367 samples with the clflush instructions and 7 without. In my testing I never got fewer than 1200 samples with clflush and always single-digit results without. This of course isn't "proof" in any meaningful sense, but it's enough evidence that I'll stick my neck out and say it works. There is a lot more work to be done on this, but I'll take a break for now. Though I promise I will blog again on my original idea.
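With a gap that wide, picking a "cut off level" is not hard. As a rough sketch, one simple choice is the midpoint between the highest count seen for the normal program and the lowest count seen while hammering. The function names below are hypothetical, and the midpoint rule is just one possible choice, not something I have tuned. The numbers in the usage note are the ones from my runs and only apply to this setup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: halfway point between the highest "normal" sample
 * count and the lowest "hammering" sample count observed in testing. */
uint32_t midpoint_cutoff(uint32_t max_normal, uint32_t min_hammer)
{
    /* The two ranges must not overlap, or no single cutoff separates them. */
    assert(max_normal < min_hammer);
    return max_normal + (min_hammer - max_normal) / 2;
}

/* Hypothetical helper: classify one run by its sample count. */
int is_hammering(uint32_t samples, uint32_t cutoff)
{
    return samples >= cutoff;
}
```

Taking 9 as the largest single-digit result and 1200 as the lowest result with clflush, midpoint_cutoff(9, 1200) gives 604. The average run with clflush (1367 samples) lands well above that and the average run without (7 samples) well below.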
Wait there is more
Wondering how something like this could be implemented in a real-life scenario, the obvious answer is to send an email to the administrator when hammering is detected and shut down the computer before an attacker with root access could do damage. This is fine in most corporate settings, but for home users it's just plain not the user experience you wish for. So I'm thinking that implementing this check would best be done in the scheduler of the operating system. It could profile every slice of time it allocates to user mode threads. If it comes up with a "row hammer" incident, it should just take the offending thread out of the scheduling loop so that it cannot do any more damage, and set a signal for a verification program. Chances are that the thread would have been preempted in or near the row hammering core loop, so just reading out the context would give you an eip/rip to work with for some additional analysis. If the additional analysis so chooses, it could just terminate the process, and the system can keep running.
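To show what I mean, here is a small sketch in C of the per-slice check. It is a plain simulation and not real scheduler code: the struct, the state names and the function are all hypothetical, and it assumes the scheduler can get the miss count for the slice that just ended as well as the saved instruction pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread states for the simulation. */
enum thread_state { RUNNABLE, QUARANTINED };

/* Hypothetical, stripped-down view of a thread as the scheduler sees it. */
struct sim_thread {
    int id;
    enum thread_state state;
    uint64_t last_ip;   /* eip/rip saved when the thread was preempted */
};

/* Called when a time slice ends. Records the saved instruction pointer,
 * and if the slice saw at least `cutoff` misses, takes the thread out of
 * scheduling so a verification program can look at it.
 * Returns 1 if the thread was quarantined by this call, otherwise 0. */
int end_of_slice(struct sim_thread *t, uint64_t misses_in_slice,
                 uint64_t cutoff, uint64_t saved_ip)
{
    assert(t != NULL);

    t->last_ip = saved_ip;
    if (t->state == RUNNABLE && misses_in_slice >= cutoff) {
        t->state = QUARANTINED;
        return 1;
    }
    return 0;
}
```

The return value stands in for the signal to the verification program, which would then use last_ip to look at the code around the point of preemption and decide whether to release the thread or terminate the process.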