Monday, February 1, 2016

Row hammer, JavaScript and MESI

Ramblings

There is no tech info here. If you don't care about my ramblings, skip to the introduction. It's been a while since my last blog post. With 22 blog posts of varying quality in my first year of blogging, it was time to see to it that I got around to my other hobbies, and the usual social commitments around the silly season allowed me to enjoy life in other ways. Also, I finally got myself a computer, and installing it takes forever. Most of this post was written while installing operating systems on the new machine, so it is likely the last blog post written on my wife's Sandy Bridge. From now on I'm on a cheap Broadwell laptop (which, much to my dismay, doesn't row hammer either). The good thing is that my new laptop can be crashed, and I hope to get back to some of the projects that require kernel mode code real soon.

Introduction

I was reading around for another blog post that I'll get around to soon when I stumbled upon an Intel document that caught my attention. It hinted that a read/write pattern could flush the cache, and probably relatively fast. In other words: row hammer. It seemed like I could do this project in just a few hours, and given how long it had been since I'd posted something, I figured I'd do this first and then return to the project I was reading up on. I actually thought this was going to fail, so the working title was "Fail at Row hammer js 2.0". There is a tendency that people never publish their failures. But the reality of this stuff is that you fail quite frequently (I do at least), yet you always learn something in the process, and I think it's a shame that this knowledge generally isn't shared. I turned out to be wrong about two things. I thought the idea could easily be dismissed, but despite not making the final proof, it looks like the idea has merit. The second thing I was wrong about was how long it'd take me to finish. Most of that time was spent fighting Visual Studio. Optimize and it'll move your instructions around - e.g. memory access, then clflush instead of the other way around, which makes a huge difference. Don't optimize and you'll have any amount of stack access within your timing section. With just a few hours to spare here and there, I was trying again and again to get it to work instead of just going to handwritten asm, something I ended up doing anyway. As usual, since the topic is offensive in nature, I'll share source code only on request.

Why do we need RowHammer.js 2.0?

The short answer is that we don't. The current rowhammer.js works by figuring out which addresses correspond to a cache set and using that information to build a set of addresses that, when touched, evicts the two addresses required to row hammer. There are currently two ways of doing that: either by assuming large pages, which leak information about addresses and cache sets but is an assumption rarely met in the real world, or by timing memory access with an accurate timer. However, JavaScript is increasingly restricting access to a timer accurate enough to tell a single cache hit from a cache miss. It does this to prevent spying through cache side channel attacks, but it has a side effect on row hammer: it makes it difficult to figure out which addresses correspond to a cache set. Not impossible, though, since, unlike for spying, it's possible to repeat the experiment until enough signal is available to pick it up even with a low accuracy timer.

Another reason could be that we might be able to work around the current cache-miss performance counter detection of row hammer that has been suggested by myself and Nishat Herath of Qualys. Now, before you attackers out there start to cheer too much: I'm fairly certain that a new detection and mitigation along these lines is feasible.

The rough overview

Take this scenario: Core 1 and Core 2 are both reading the same cache line. Core 1 doesn't read it much and gets it from L3, whereas Core 2 reads it a lot and thus retrieves it from L2. At some point Core 1 decides to write to the cache line. Since the L2 at Core 2 is disconnected from Core 1, this becomes a coherency problem. The two most likely solutions are these:

1) When writing to a cache line used across multiple caches: stop all access to this cache line, write back the modified cache line to memory, flush the cache hierarchy and "reload" the memory. There is an Intel document that can be read as suggesting that this is what happens: "The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded."

2) The second solution is a bit simpler. There is no reason to actually flush the L3 to memory and reload it - coherency is guaranteed if L1 and L2 are flushed on all cores, since L3 is shared. Thus the scheme could just be: stop all access to the cache line, modify L3, and finally invalidate all lower latency caches (L1 and L2) on all cores.

In scenario 1 the possibility of using this to avoid clflush exists, and this might give us a way to do row hammer without clflush - potentially even in JavaScript. Because this method only requires reading and writing of variables, the only micro-architectural knowledge we'd actually need would be figuring out whether two addresses belong to the same bank, plus having a thread running on two different cores. [1] Seaborn and Dullien (2015) demonstrated that finding two addresses that belong to the same bank can be done by chance alone. It also seems likely that we could get a thread running on another core. Thus, if it worked, we'd eliminate the need for a timer and no special opcodes would be required. In fact there is a good chance that the method could apply to other devices with similar cache implementations.

The cache coherency protocol

I'll spend a tiny bit of time on the cache coherency protocol of modern Intel CPUs. The protocol ensures that multiple caches contain a coherent representation of memory. It is a slightly modified version of the MESI protocol, called MESIF, but the literature is much better on MESI, so if you choose to research this you should start there. The MESIF protocol states that any cache entry is always in one of five states: Modified, Exclusive, Shared, Invalid or Forward. The MESI(F) order is not relevant - it just makes the acronym pronounceable. The states are as follows:

Invalid: The cache entry starts out its life as "Invalid", meaning no meaningful data is stored here. It is also the state a cache entry enters when we use clflush.

Exclusive: This state means only one cache contains this cache line. Thus in this state there are by definition no cache coherency issues.

Shared: This state means that the cache line is loaded in two or more caches.

Modified: This state signifies that the cache line is different in the cache than in main memory - thus a memory store operation is required should this cache entry be flushed.

Forward: This is a special case of "Shared". It essentially means that if multiple caches hold a shared copy of a cache line, any read operation should be served from this one.

The protocol then specifies that, for a given cache line, only transitions between these states are possible:

        M     E     S     I     F
  M     x     x     x     Ok    x
  E     x     x     x     Ok    x
  S     x     x     Ok    Ok    Ok
  I     Ok    Ok    Ok    Ok    Ok
  F     x     x     Ok    Ok    x

Of interest here is that if any cache (say L3) has the line in the S state, the cache line cannot be directly modified, because there is no valid transition from S to M. It is this feature we hope to utilize.
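For readability, here is the table above encoded as a small C lookup. This is purely my own illustration, not anything taken from Intel documentation; the 'x' at row S, column M is exactly the combination the rest of this post tries to exploit.

#include <stdbool.h>

/* MESIF states in the row/column order of the table above. */
enum mesif { M, E, S, I, F };

/* The table encoded directly: ok[row][col] is true where the table
   says "Ok" and false where it says "x". Purely illustrative. */
static const bool ok[5][5] = {
    /*           M      E      S      I      F   */
    /* M */ { false, false, false, true,  false },
    /* E */ { false, false, false, true,  false },
    /* S */ { false, false, true,  true,  true  },
    /* I */ { true,  true,  true,  true,  true  },
    /* F */ { false, false, true,  true,  false },
};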

Another feature of the Intel cache manager is snooping. Any memory request is sent out on the bus so that every core can report back the state of the cache - that is, whether and in what state the appropriate cache line is present in that processor's caches.

Also I need to mention the Request For Ownership (RFO for short) transaction. When you're writing to a cache entry that is either shared or invalid, you use an RFO. The RFO is an exclusive command - that is, no other command executes on the cache hierarchy for the cache line while the RFO is being processed. The RFO invalidates all other copies of the cache line loaded anywhere in the cache hierarchy, thus clearing the way for modifying an entry.

We can now outline the two scenarios again.

Core 1 issues a read. The memory is not in the L3 cache, which means it'll be loaded into a way in the appropriate cache set: either there is already an invalid way in the set (one of the ways is marked as I) or an eviction takes place and one of the ways is marked as I. Then data is loaded from main memory and the way is marked as E(xclusive).

Core 2 now reads the same memory. The cache manager decides for some reason that core 2 needs this cache line a lot and thus wants to fill L2 with this particular cache line. Consequently it finds an empty way (or makes one through eviction) in the appropriate set in L2. Then it loads the data from L3. Now both the L3 and the L2 cache hold a copy (the Intel cache is inclusive - what is present in L2 must be in L3 too) and thus both entries now get stamped as S(hared).

Core 1 now places the write operation on the bus. The L3 cache can immediately tell that the line is present and shared. Consequently the L2 must be invalidated. Thus an RFO is sent out on the bus and the L2 cache entry is invalidated. The question is what happens next.
Option 1: Modify L3 and set it to Modified.
Option 2: Modify L3, write it back to main memory. Then set the appropriate entry to Exclusive.
Option 3: Modify L3, write it back to main memory. Invalidate L3. Reload from main memory.

For us to use this method for row hammering we'd need either option 2 or 3: since both reading from and writing to main memory activate the row, either can be used for row hammering.

Empirical data

Unfortunately grey theory is just that. Intel tweaks their processors, and details may have been changed for better performance. But we do have the option of seeing whether we can verify it empirically.

As always when timing things in the memory subsystem, there is a lot of noise. There are many sources of this noise, but the bottom line is that we need to deal with it. My rough method has been to sample between 200 and 1000 timings, depending on just how much noise I get in my measurements. Then I remove the 10% slowest times, assuming they were slow due to noise. This is not something that would make my old statistics professors proud, but it's easy to implement and probably doesn't introduce too much bias anyway. Second, we have a ballpark idea about how long a given memory operation takes. Say an uncached memory operation usually takes more than 100 CLKs on my wife's Sandy Bridge. Thus, if I'm timing uncached operations, I assume something went wrong if the time spent was less than 100 CLKs. This approach significantly reduces the noise and actually allows me, by looking at the remaining data, to get an idea about what's happening while I'm writing the code. Once I'm confident in the code I calculate the average and standard deviation and formally test my hypothesis.
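As an illustration, a minimal sketch of this sample-and-trim approach could look like the C snippet below. The function name, the sanity floor and the sample counts are my own placeholders for the idea, not the exact code I used.

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sample a timing routine N times, discard obviously bogus readings below
   a sanity floor, drop the slowest 10% as noise, and average the rest. */
uint64_t trimmed_average(uint64_t (*measure_once)(void),
                         size_t samples, uint64_t sanity_floor)
{
    uint64_t *t = malloc(samples * sizeof *t);
    size_t kept = 0;

    for (size_t i = 0; i < samples; i++) {
        uint64_t clk = measure_once();
        if (clk >= sanity_floor)        /* below the floor: assume something went wrong */
            t[kept++] = clk;
    }

    qsort(t, kept, sizeof *t, cmp_u64); /* sort so the slowest 10% sit at the end */
    size_t used = kept - kept / 10;

    uint64_t sum = 0;
    for (size_t i = 0; i < used; i++)
        sum += t[i];

    free(t);
    return used ? sum / used : 0;
}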

So my first step was to get some ballpark figures. Because it's difficult to benchmark how long an "uncached" memory write takes, as it's somewhat undefined, I use the memory read as a proxy for the amount of time it takes to actually access memory. Cached memory access takes around 66 CLKs on average, while its uncached counterpart (using clflush+mfence to make sure it's not in the cache) takes around 200 CLKs. A clflush on an address loaded just prior to it takes some 112 CLKs. Finally I time a write operation (mov [Pointer],1) both when I had previously loaded the address into L3 and when I hadn't. Both take around 90 CLKs.
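For reference, the read-timing primitive behind these numbers is essentially the classic rdtsc-bracketed load. A minimal sketch, using compiler intrinsics rather than my handwritten asm (so the exact cycle counts will differ somewhat):

#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc */
#else
#include <x86intrin.h>
#endif

/* Time a single read of *p in CLKs. If flush_first is set, the line is
   clflush'ed and fenced first, so the read has to go all the way to memory. */
static uint64_t timed_read(volatile char *p, int flush_first)
{
    if (flush_first) {
        _mm_clflush((const void *)p);
        _mm_mfence();                /* wait for the flush to complete */
    }
    _mm_mfence();
    uint64_t start = __rdtsc();
    (void)*p;                        /* the load being measured */
    _mm_mfence();
    return __rdtsc() - start;
}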

These data make a fair amount of sense. Obviously it's a lot faster to read from L3 than from memory. A clflush operation in essence needs to block the cache line in the hierarchy and then actually invalidate the cache entry, and with my fencing we probably need to wait for the process to complete. Consequently we'd expect it to take longer than a cached read, but less than an uncached read that actually has to drop down to memory and fetch stuff.

Now I code up the scenario sketched above (a rough sketch in C follows the list):
1. Main Thread: Make sure it's running on core 1. Set affinity and sleep to invoke the scheduler.
2. Main Thread: Access the address of interest (AOI) 6 times, which in the past has given me good results on being loaded into L3 and only L3.
3. Main Thread: Start the second thread and wait for it to finish.
4. Second Thread: Make sure it's running on core 2. As above.
5. Second Thread: Access the AOI 10000 times to load it into L2.
6. Second Thread: Finish.
7. Main Thread: Modify the AOI and time it with RDTSC - mov [AOI],1
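A rough sketch of those seven steps in C, using Windows threading since this was built in Visual Studio. The aoi pointer, the iteration counts and the missing error handling are placeholders, and the real measurement was ultimately done with handwritten asm.

#include <windows.h>
#include <stdint.h>
#include <intrin.h>

static volatile char *aoi;           /* address of interest */

static DWORD WINAPI second_thread(LPVOID arg)
{
    SetThreadAffinityMask(GetCurrentThread(), 1 << 1);  /* step 4: core 2 */
    Sleep(0);                                           /* let the scheduler move us */
    for (int i = 0; i < 10000; i++)                     /* step 5: pull the AOI into L2 */
        (void)*aoi;
    return 0;                                           /* step 6 */
}

uint64_t timed_shared_write(volatile char *p)
{
    aoi = p;
    SetThreadAffinityMask(GetCurrentThread(), 1 << 0);  /* step 1: core 1 */
    Sleep(0);

    for (int i = 0; i < 6; i++)                         /* step 2: AOI into L3 */
        (void)*aoi;

    HANDLE h = CreateThread(NULL, 0, second_thread, NULL, 0, NULL);
    WaitForSingleObject(h, INFINITE);                   /* step 3 */
    CloseHandle(h);

    _mm_mfence();                                       /* step 7: time the write */
    uint64_t start = __rdtsc();
    *p = 1;
    _mm_mfence();
    return __rdtsc() - start;
}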

And voila: now it mostly takes around 170-210 CLKs to write the memory. This tells me that I'm on to something. It's much slower than a normal write operation, so we can assume that something is going on. It's also far slower than the clflush from my baseline, which should be a flush of a cache entry in the E(xclusive) state. In fact the time suggests that we are hitting main memory. That it isn't always between 170-210 CLKs and is sometimes far below seems to suggest that from time to time my address of interest gets removed from L2 on core 2 before I actually modify it. Maybe the combination of destroying the second thread and waiting for it to be done is enough for other system activity to evict the AOI from L2.

But we don't actually have to live with this circumstantial evidence. We might be able to measure whether we actually hit memory. It goes like this: when writing to physical memory, the row you wish to write to is loaded into the row buffer of the RAM. Then it's modified with the data being written - remember the row buffer is 8 KB whereas a cache line is only 64 bytes. Finally the row buffer is written back into the appropriate row. Consequently, if we make sure that another row is in the row buffer, by reading it from physical memory before modifying the address of interest, we can do a row buffer side channel attack to see whether that row was subsequently removed.

Just prior to modifying, I therefore added a function that randomly accesses physical memory in a large buffer 200 times, thus making it unlikely that the address of interest's row is in the row buffer. I then modify the address of interest as usual and then access the row of the address of interest. To avoid reusing the AOI's cache line, I make sure to place the AOI on a page boundary and access the memory 2 cache lines further ahead. Thus I have no cache conflicts, but from [5] Seaborn (2015)'s mapping of physical memory I can be sure to access the same row.
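A sketch of that row buffer probe is below. The 200 random accesses and the two-cache-line offset come from the description above; big_buffer, buffer_size and the assumption that the probe line is not already cached are my own placeholders.

#include <stdint.h>
#include <stdlib.h>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

/* aoi is assumed to sit on a page boundary, so aoi + 128 (two cache lines
   further) maps to the same DRAM row but a different, uncached cache line.
   big_buffer should be large enough that random reads mostly miss cache. */
uint64_t row_buffer_probe(volatile char *aoi,
                          volatile char *big_buffer, size_t buffer_size)
{
    for (int i = 0; i < 200; i++)   /* kick the AOI's row out of the row buffer */
        (void)big_buffer[(size_t)rand() % buffer_size];

    *aoi = 1;                       /* the modification under test */
    _mm_mfence();

    /* Time a read of the same row, two cache lines ahead. If the write really
       went to DRAM, this row should now be in the row buffer again. */
    uint64_t start = __rdtsc();
    (void)*(aoi + 128);
    _mm_mfence();
    return __rdtsc() - start;
}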

When I read an address from a row that is in the row buffer, as I did for my baseline, I get read times of around 200 CLKs. Interestingly, if I immediately access the same row in the scheme above, it takes around 350 CLKs for the request to finish. This seems to indicate that a required resource is unavailable at that time; I have only unfounded speculation as to what that could be. However, if I execute a loop with a significant number of dummy instructions (mfence) and then access the memory, I get times around 208 CLKs, which is consistent with the current row being in the row buffer. Thus I think I have a strong case that I actually activate the row in DRAM, which is the ultimate prerequisite for row hammering.

Can we simplify the attack vector?

If you are smarter than I was, you probably already know what I'm hinting at. MESI(F) is a protocol to keep coherency between caches - not cores. The core story might just have been a detour in my thinking. What if we load the address of interest into L2 and then modify it on the same core? We'd have exactly the same effect without the threads. At least with a high number of accesses to make sure that the AOI is elevated to L2, this does indeed seem to work, though the latency of the modification drops from 200 to around 150 CLKs.
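The non-threaded variant is then just the following (again a sketch, with an illustrative iteration count):

#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

/* Single-core variant: elevate the AOI with many reads on the same core,
   then time the write to it. */
static uint64_t timed_local_write(volatile char *p)
{
    for (int i = 0; i < 10000; i++)
        (void)*p;                    /* pull the line up the hierarchy */

    _mm_mfence();
    uint64_t start = __rdtsc();
    *p = 1;                          /* the modification under test */
    _mm_mfence();
    return __rdtsc() - start;
}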

Good enough?

I don't know if we can get this fast enough for actual row hammering. I don't know how fast we can get two addresses into L2 and then cause them to be written back out again. And since I'm currently without access to a computer with RAM susceptible to row hammer, I'll leave it to other people to figure out if it works. Also, I'm not really strong in JavaScript and thus have no clue whether the threaded version would really be implementable in JS. I cannot imagine a world where the non-threaded version isn't implementable in JS.

Yet another cache side channel attack

Obviously we have another type of cache side channel here. With it taking longer to write to shared memory, we know for sure that we can use the method for a side channel attack - thus we can detect whether a victim uses a read/write address enough to place it into L1 or L2. For lack of a better name I call it share+modify. There can be many different actual implementations of this cache attack. The simplest might look like this in pseudo code:

1. mfence
2. StartTime = rdtsc()
3. Modify(address of interest)
4. mfence
5. Information = rdtsc() - StartTime
However, it's an impractical side channel attack. The reason is that shared memory with read/write access is required. While shared read-only memory is common through deduplication and shared libraries, shared read/write memory is quite scarce. The advantageous thing from an attacker's viewpoint is that it might trigger other performance counters than the classical flush+reload and thus avoid the detection suggested for that attack. Might. Also, the chance of a victim actually loading stuff into L2 is much lower than L3, and consequently a lot less information is likely to be revealed by this attack.

Literature

[1] Mark Seaborn; Thomas Dullien (March 9, 2015). "Exploiting the DRAM rowhammer bug to gain kernel privileges". googleprojectzero.blogspot.com. Google. Retrieved March 10, 2015.
[2] Daniel Gruss, Clémentine Maurice, Stefan Mangard (2015). "Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript". http://arxiv.org/abs/1507.06955
[3] David Levinthal (201x). "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors". https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[4] Intel (201x). "Avoiding and identifying false sharing among threads". https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
[5] Mark Seaborn (2015). "How physical addresses map to rows and banks in DRAM".

 


