Saturday, February 20, 2016

Breaking KASLR with micro architecture, Part 1


Hack in the box & stuff

Skip to the Introduction if you are here for the tech stuff. I submitted a CFP to “Hack In The Box” in Amsterdam at the end of May and the talk has been accepted. So I’ll be spending part of my vacation this year presenting some of my research on cache side channel attacks from the last year or so. I’m really excited because I think I actually have some really nice stuff to present – not only is it technically interesting, but I think it’s far more relevant than the common perception in the infosec community suggests. If you have a chance, I think it’ll be worthwhile going. Among those presenting is Jacob Torrey, whose material I’ve covered twice on my blog and who in my opinion really is one of those pushing the boundaries of infosec. Felix Wilhelm is also going; I currently have his master’s thesis on my night table and thus far it is indeed very cool reading.

Preparing a talk obviously takes a lot of time, and since I’m an amateur this is going to cut down on the time I can spend on this blog. I have two more posts in the pipeline (including part 2 of this post), but no real idea when I’ll get them done.

Finally, posting this blog post has been a big dilemma for me. I had the basic idea almost a year ago and I’ve been putting it off again and again, because it’s offensive and posting it is a moral borderline case for me. I decided to go ahead, but with a heavy heart. It also weighs on me that when I started this blog I wanted to target the ambitious newbie; however, my work has tended to become progressively less accessible, and this particular post is probably the least accessible thus far.

Introduction


KASLR stands for “kernel address space layout randomization”. It is a feature of most modern operating systems meant to protect against privilege escalation attacks that rely on code reuse, such as ROP (return oriented programming). The idea is that if an attacker doesn’t know where the code is, he cannot set up a ROP chain to reuse it. A modern operating system therefore loads drivers at different addresses on each boot, leaving an attacker in the dark about where the code is located. Code reuse attacks have grown in importance for privilege escalation because of driver signing requirements and the widespread use of “no-execute” settings on stacks and other data in kernel mode, forcing the attacker to either bypass driver signing to load a driver, attack the page tables, or reuse existing code.

Attackers have typically relied on information leaks in known drivers that give away addresses that can be used to locate code. Last year, however, I had an idea for how we might use the CPU design itself for our information leak, and that is the topic of this blog post. That we can use the CPU design against KASLR is particularly bad news, because CPUs are not updated the way software is, and in the past Intel has not been forthcoming with security fixes in this respect. Relying on the CPU also means the attack generalizes across operating systems – many of the findings I’ve made here for Windows 7 will lend themselves to other platforms. On the other hand, these kinds of attacks may apply only to a subset of CPUs. I think that all modern Intel CPUs with a 3rd level cache – i3, i5, i7 and Xeons from Sandy Bridge and up – are likely to be affected, and possibly other CPUs as well.

Finally, I’d like to say that attacks of this kind have been made before, but I’m not aware of any with the same methodology as mine.

Finding drivers

The idea revolved around the “prefetch” instructions. These instructions let software give the prefetcher a hint to load a specific address into a certain cache level because the software will need it shortly. This can significantly improve performance, as the prefetcher runs independently of the CPU and memory is thus fetched before it is used. There is, however, an interesting thing about the prefetch instructions that caught my eye: they don’t cause exceptions (except that they are incompatible with a lock prefix) [2] Intel. Yet they take a virtual address as input and need a physical address to load memory and place it in the cache. Thus they may bypass memory protection mechanisms entirely, which seems plausible enough: since they aren’t actually reading data, they shouldn’t in any way allow an attacker to read memory he isn’t supposed to. Consequently Intel might have decided that there wasn’t a security issue.
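To make the no-fault property concrete, here is a minimal C sketch (compiler intrinsics; the helper name and the mfence/rdtscp serialization are my choices, not the original code) that times a single prefetcht2. Note that it can be handed any address, mapped or not, without crashing:

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one prefetcht2 of addr in cycles. Prefetch never faults,
 * so addr may be anything, including a kernel address. */
static inline uint64_t time_prefetch(const void *addr)
{
    unsigned aux;
    _mm_mfence();                  /* drain earlier memory operations */
    uint64_t t0 = __rdtscp(&aux);
    _mm_prefetch((const char *)addr, _MM_HINT_T2);
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return t1 - t0;
}
```

The observed cycle counts are of course machine specific.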

My first idea was to use the prefetch instruction in a modified prime+probe or evict+time attack. That is: make a guess about an address in a driver, use the prefetch instruction to load the cache entry, and then use either evict+time or prime+probe to tell if you actually hit the address. This of course is pretty slow, and there is a lot of memory you’d need to run this on before you’d find your way around the kernel address space. I’ll get back to this in another post.

The second idea: what if the prefetch instructions leak information through their execution time? I thought of 3 scenarios:

1)      Prefetch does not leak information (e.g. the DS descriptor has a limit, the prefetcher doesn’t do anything on kernel memory unless called from ring 0, it is designed to be constant time, etc.)
2)      Prefetch is slower on virtual addresses which actually map to a page, perhaps because telling the prefetcher to load the cache takes time
3)      Prefetch is slower on addresses which do not map to a page, because searching all page table entries for an address is slower than stopping the search once you hit something

Well. First I dug into the descriptor issue mentioned in 1), because that would be a no-go: I recall that 32-bit XP actually has a limit on its DS descriptor, making it useless for accessing kernel-mode memory. It seems this is not the case in 64-bit mode, as section 5.3.1 of [1] Intel clearly says that limits are not checked at run time (I think there are some exceptions for FS/GS). So, time to load up a debugger and write some code. First I loaded live KD and dumped a map of the drivers using the “lm n t” command. Then I wrote some user-mode code that times the prefetcht2 instruction. I picked prefetcht2 because I assumed that loading the largest-latency cache was most likely to actually be carried out by the prefetcher and thus most likely to be successful. I then wrote code to time prefetching an address 20 times and pick the lowest timing found. My reasoning is that there is noise in the memory subsystem (hence the multiple accesses, to try to cancel it out) and that the lower bound would tell me more than an average. Then I picked two addresses to time in this way: one a few thousand pages below the first loaded driver, and one inside the first loaded driver. Finally I repeated the experiment 1,000 times. And voila: the prefetcht2 instruction is faster on a virtual address that is actually mapped. I speculate that this is because the search through the PTEs is aborted early and is thus faster. I think I know how to figure out whether this is true and I may get back to it in another blog post. So I moved on and wrote a small program that scans a small portion of the kernel address space and tells me whether a page is mapped or not, and compared its output to my module list. It worked beyond any expectation.
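A sketch of the measurement loop described above – 20 timed prefetches, keeping the minimum – could look like this (my reconstruction, not the original code):

```c
#include <stdint.h>
#include <x86intrin.h>

#define PROBES 20

/* Lowest observed prefetcht2 time for addr over PROBES tries.
 * The minimum is used because noise only ever adds latency. */
static uint64_t probe_min(const void *addr)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < PROBES; i++) {
        unsigned aux;
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        _mm_prefetch((const char *)addr, _MM_HINT_T2);
        uint64_t t1 = __rdtscp(&aux);
        _mm_mfence();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Comparing `probe_min` for a candidate kernel address against a known-unmapped one is then what separates hit from miss.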

So how fast can I scan? It takes about 112 CLKs for a miss and 85 CLKs for a hit on my wife’s Sandy Bridge computer. An approximate value (with 20 timings per address), including code overhead, is that I run around 500k probes per second. With a 64-bit address space it is impractical to scan everything. However, modern operating systems use paging, so we only need to probe once per page, and a page is usually 4096 bytes. Also, on Windows 7 it appears that the 16 upper bits are all set for drivers, further reducing the problem. Finally, Windows 7 loads drivers contiguously – on my computer around 100 megabytes of them – so I only have to run through a 48-bit address space in 100-megabyte steps to find where the drivers are hiding. This reduces the problem to a practical 6-second one. I do get a few false positives on this scan: other things than drivers may be located in this space, and I may have some tuning problems; additional probing could reduce these further. But I consistently find the drivers, with only a few false positives.
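The page-granular scan could be sketched like this (my reconstruction; the threshold separating hit from miss is machine specific – something around 100 CLKs would sit between the ~85 CLK hits and ~112 CLK misses above):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* Classify addresses in [start,end) at the given stride as mapped or
 * not, using a caller-supplied cycle threshold. Returns the hit count. */
static unsigned scan(uintptr_t start, uintptr_t end, uintptr_t stride,
                     uint64_t threshold)
{
    unsigned hits = 0;
    for (uintptr_t a = start; a < end; a += stride) {
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < 20; i++) {   /* min of 20 timings, as above */
            unsigned aux;
            _mm_mfence();
            uint64_t t0 = __rdtscp(&aux);
            _mm_prefetch((const char *)a, _MM_HINT_T2);
            uint64_t t1 = __rdtscp(&aux);
            _mm_mfence();
            if (t1 - t0 < best)
                best = t1 - t0;
        }
        if (best < threshold) {
            printf("candidate mapped page at %p (%llu CLKs)\n",
                   (void *)a, (unsigned long long)best);
            hits++;
        }
    }
    return hits;
}
```

For the driver hunt the stride would first be 100 megabytes over the 48-bit space, then 4096 within a cluster that was found.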

What was a blessing above – that the drivers are clustered into one big chunk of pages – is a curse going forward. The problem is that we do not get the size of the individual drivers, and thus cannot tell which is which for a gadget search. However, I believe we can solve this problem with modified versions of the evict+time and prime+probe attacks. That will have to wait for part 2 of this blog post.

Thanks

I need to express my gratitude to Nishat Herath of Qualys for discussions on this topic I had with him.

Literature

[1] Intel: “Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 3A: System Programming Guide, Part 1”. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[2] Intel: “Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z”.


Sunday, February 7, 2016

Comments on CATalyst: the first use of CAT to mitigate Cache Side Channel Attacks

Recently Liu et al. [1] released a very interesting new paper, which will be the topic of this blog post. In my “Row buffer side channel attacks” post I speculated that CAT would be interesting as a mitigation for cache side channel attacks. It turns out that I was right about that. I did, however, get some of the details of how CAT works wrong; Yuval Yarom of the University of Adelaide was kind enough to correct me, and that became an errata. This new paper, which Yuval co-authored, proves that CAT could indeed play an important role in defending against cache attacks. They call their method “CATalyst”. In this blog post I give an ultra-short and incomplete summary of how CAT and CATalyst work, for context; if you wish to get into the details you really need to read the paper. Though I will be critical of CATalyst in this blog post, I should make it very clear that I consider CATalyst a big step forward towards defending against cache side channel attacks, and that I’d much prefer to have CATalyst than nothing at all – which, it’s my impression, is the current state of cloud computing.

Ultra short summary of how CAT works

Cache Allocation Technology, or CAT, is a feature of some newer Xeons; it’s not meant to make it into consumer PCs. “Newer” means that some Haswells have CAT; I’m not sure how common it is in Broadwell or Skylake. CAT was meant as a performance-enhancement tool that allows a supervisor or hypervisor to allocate more L3 cache to performance-critical operations. Essentially, the hypervisor or supervisor assigns a “class of service” (COS for short) to an application through an MSR. There are currently 4 COSs. Each COS is associated with a bitmask that describes which ways of the cache can be evicted under that class of service. Thus any application can have reads, writes and flushes served from any way just as before, but can only evict ways compatible with its COS. The intended effect is to shield high-priority applications from competitive eviction by lower-priority applications.
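The way masks themselves are just contiguous bitmasks. As a hedged illustration (the actual MSR programming – IA32_PQR_ASSOC and the L3 mask MSRs – happens in ring 0 and is not shown; the way counts here are made-up examples, not from the paper):

```c
#include <stdint.h>

/* Build a capacity bitmask for one class of service: `ways`
 * consecutive ways starting at way `first`. CAT requires the mask
 * to be a single contiguous run of set bits. */
static uint32_t cos_mask(unsigned first, unsigned ways)
{
    return ((1u << ways) - 1u) << first;
}
```

On a hypothetical 20-way L3, `cos_mask(0, 2)` could pin 2 ways for a secure partition while `cos_mask(2, 18)` covers the remaining insecure partition; the two masks are disjoint, so one COS can never evict the other’s ways.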

Using CAT to defend against CSC

Since a flush can be served from any way in the cache, as always, CAT is incapable of defending against flush+reload and flush+flush. However, those attacks can be thwarted by avoiding shared memory across trust boundaries, and Liu et al. [1] use this to defend against them.

This makes the most important enemy prime+probe. Recall that prime+probe works by priming a cache set so it contains only attacker data, waiting for victim activity, and finally accessing that data again – if an access incurs a cache miss, the attacker knows that the victim used the cache set. Since CAT can prohibit evictions in parts of the cache, it can make priming, and thus prime+probe, impossible there.
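For reference, the prime and probe phases can be sketched as below. This assumes `set[]` already holds one address per way of the target cache set – building such an eviction set is the hard part on CPUs with a complex addressing function – and the hit/miss threshold is machine specific:

```c
#include <stdint.h>
#include <x86intrin.h>

#define WAYS 16   /* assumed L3 associativity */

/* Prime: walk the eviction set so the target set holds only our lines. */
static void prime(uint8_t **set)
{
    for (int i = 0; i < WAYS; i++)
        (void)*(volatile uint8_t *)set[i];
}

/* Probe: re-access the set and count slow (i.e. evicted) lines. */
static int probe(uint8_t **set, uint64_t threshold)
{
    int misses = 0;
    for (int i = 0; i < WAYS; i++) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*(volatile uint8_t *)set[i];
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 > threshold)
            misses++;
    }
    return misses;
}
```

With CAT denying evictions in the secure partition, the prime phase simply cannot displace the pinned lines, which is the whole point of CATalyst.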

However, it would be bad karma to simply deny eviction as a general rule, because that comes at a huge performance cost: deny eviction and the cache cannot adapt to the ongoing work, becoming inefficient. With 4 COS values it is also only possible to partition the cache into 4 pieces. That again is a bad idea, because in the cloud you typically have, say, 16 to 64 simultaneous VMs on one computer. It would only make co-location harder for the attacker by a factor of 4, but not thwart anything.

Liu et al. [1] deal with this problem rather elegantly. They divide the cache in two: a secure partition and an insecure partition. Since the secure partition is shared among all VMs, prime+probe would still be possible there; to deal with this, only so much data may be loaded into the secure partition that absolutely all of it can be cached, and they make sure all of it actually is cached. The size of the secure partition is kept small to minimize the performance penalty. Finally, the VMM keeps control of the COS and bitmask to prevent any evictions in the secure partition. Thus everything in the secure partition is pinned in the L3 cache, and accesses to memory in the secure region will never miss L3. The VMM exports an API to the VMs’ operating systems to load and unload pages in this secure partition for applications in the VMs to use. To prevent a DoS attack, each VM can only load its share of secure memory.

Implementation is made possible by the fact that the VMM gets called when MSRs are written. Thus it can reserve a COS for itself for loading pages into the secure partition, prevent VMs from obtaining this COS, and keep them from setting a bitmask that would allow eviction there.
The really cool thing is that this is the first suggested mitigation so far that looks iron-clad, is feasible on current hardware without being tied to other micro-architectural details such as the complex addressing function for cache allocation, and comes at a reasonable performance cost.

Another feature of CATalyst is that it makes row buffer side channels quite difficult, as reads are served entirely from cache and thus do not activate a row in DRAM. It might not be altogether impossible to use row buffer side channels here, since the secure partition may share rows with insecure memory, and read/write memory can cause row activation when write operations trigger write-backs for reasons of cache coherency. I don’t think this is a real problem though.

Problems with this method


I see three problems with this method however:
1)      Modern hardware is required
2)      Changes to VMM, OS and applications are required
3)      The secure partition is small.

Modern hardware
The requirement for modern hardware is a short term problem. In time modern hardware will be available in the cloud and thus it’s not really a big issue.

Changes to VMM, OS and applications
As for the second problem, the OS and VMM could probably be changed easily given the current focus on the cloud. In fact, I think the OS changes needed could be implemented by third-party drivers and thus easily made available. Applications pose a much bigger issue: a cloud customer typically uses multiple applications, each with their own security requirements, and each needs to be updated to support CATalyst. This could be overcome, but it is certainly a bigger issue than the first. To my knowledge, security measures that require changes to a range of applications have historically not fared well in the market place.

The small partition size
The real issue for me is that the secure partition is small. The obvious maximum size is the cache size of the system, shared by all VMs (say 20 megabytes). That, however, would be prohibitively expensive performance-wise – which is why the authors talk about taking only a tenth of that, effectively limiting the secure partition to some 4-16 pages per VM. Increasing the size probably carries a (near) exponential performance cost: caching the most-used cache line brings more benefit than caching the 20th most-used cache line.

The authors argue that this isn’t a problem: “A few secure pages are usually sufficient for protecting the security-sensitive code and data since these are usually small”. I beg to differ. For a single crypto implementation this is obviously true. The problem is that many VMs will have a number of crypto implementations, and operations are typically threaded. Remote access technologies may very well use a different implementation than the web shop, which again uses a different one than SSH, and so on and so forth – not to mention custom implementations. But security requirements don’t limit themselves to crypto. Oren et al. [2] showed us that we can infer mouse movement and implement covert channels with cache side channel attacks. Gruss, Spreitzer & Mangard [3] showed us that we can spy on keyboard input in some circumstances. Why care about crypto if your password leaks? I dare speculate that, using machine learning combined with cache side channel attacks on the non-secure partition, a great many secrets could be leaked from a co-located victim. After all, which information each of us considers sensitive is subjective, and cache side channel attacks apply to a very wide range of data.

Even if each of these things fits individually in the secure partition, bad programming practices, hangs and crashes are likely, at some point, to allow a VM to DoS itself by filling the secure partition and failing to release it appropriately – not to mention the overhead that would come from frequently loading pages into secure memory. Further, I wouldn’t trust software developers in general to correctly identify security-critical code in the first place.

Literature

[1] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, Ruby B. Lee (2016). CATalyst: Defeating Last-Level Cache Side Channel Attacks in Cloud Computing. https://ssrg.nicta.com.au/publications/nictaabstracts/8984.pdf

[2] Oren et al. (2015): “Spy in the Sandbox, Practical Cache Attacks in Javascript”; Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan and Angelos D. Keromytis. arXiv:1502.07373v2


[3] Gruss, Spreitzer & Mangard (2015): “Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches”

Monday, February 1, 2016

Row hammer, java script and MESI

Ramblings

There is no tech info here; if you don't care about my ramblings, skip to the intro. It's been a while since my last blog post. With 22 blog posts of varying quality in my first year of blogging, it was time to see to it that I got around to my other hobbies, and the usual social commitments around the silly season let me enjoy life in other ways. Also, I finally got myself a new computer, and it takes forever to install. Most of this post was written while installing operating systems on it, so this blog post is likely the last one written on my wife's Sandy Bridge. From now on I'm on a cheap Broadwell laptop (which, much to my dismay, doesn't row hammer either). The good thing is that my new laptop can be crashed, and I hope to get back to some of the projects that require kernel-mode code real soon.

Introduction

I was reading around for another blog post that I’ll get around to soon when I stumbled upon an Intel document that caught my attention. It hinted that a read/write pattern could flush the cache, and probably relatively fast. In other words: row hammer. It seemed like I could do this project in just a few hours, and given how long it had been since I’d posted something, I figured I’d do this first and then return to the project I was reading up on. I actually thought this was going to fail, so the working title was “Fail at Row hammer js 2.0”. There is a tendency that people never publish their failures, but the reality of this stuff is that you fail quite frequently (I do, at least); you always learn something in the process, and I think it’s a shame that this knowledge generally isn’t shared. I turned out to be wrong about two things. I thought the idea could easily be dismissed, but despite not making the final proof it looks like the idea has merit. The second thing I was wrong about was how long it would take me to finish. Most of that time was spent fighting Visual Studio: optimize, and it’ll move your instructions around – e.g. memory access then clflush instead of the other way around, which makes a huge difference; don’t optimize, and you’ll have any amount of stack accesses within your timing section. With just a few hours to spare here and there, I was trying again and again to get it to work instead of just going with handwritten asm – something I ended up doing anyway. As usual, since the topic is offensive in nature, I’ll share source code only on request.

Why do we need RowHammer.js 2.0?

The short answer is that we don't. The current rowhammer.js works by figuring out which addresses correspond to a cache set and using that information to build a set of addresses that, when touched, evict the two addresses required to row hammer. There are currently two ways of doing that: either by assuming large pages, which leak information about addresses and cache sets but are an assumption rarely met in the real world, or by timing memory accesses with an accurate timer. However, JavaScript is increasingly banning access to a timer accurate enough to tell a single cache hit from a cache miss. It does this to prevent spying via cache side channel attacks, but it has a side effect on row hammer: it makes it difficult to figure out which addresses correspond to a cache set. Not impossible, though, since – unlike for spying – it's possible to repeat the experiment until enough signal is available to pick it up even with a low-accuracy timer.

Another reason could be that we might be able to work around the current cache-miss performance counter detection of row hammer that has been suggested by myself and Nishat Herath of Qualys. Now, before you attackers out there start to cheer too much: I'm fairly certain that a new detection and mitigation along these lines is feasible.

The rough overview

Take this scenario: Core 1 and Core 2 are both reading the same cache line. Core 1 doesn't read it much and gets it from L3, whereas Core 2 reads it a lot and thus retrieves it from its L2. At some point Core 1 decides to write to the cache line. Since Core 2's L2 is disconnected from Core 1, this becomes a coherency problem. The two most likely solutions are these:

1) When writing to a cache line used across multiple caches: stop all access to this cache line, write the modified cache line back to memory, flush the cache hierarchy and "reload" the memory. There is an Intel document that can be read as suggesting that this is what happens: "The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded."

2) The second solution is a bit simpler. There is no reason to actually flush L3 to memory and reload – coherency is guaranteed if L1 and L2 are flushed on all cores, since L3 is shared. Thus the scheme could just be: stop all access to the cache line, modify L3, and finally invalidate all lower-latency caches (L1 and L2) on all cores.

In scenario 1 the possibility of avoiding clflush exists, and this might give us a way to do row hammer without clflush – potentially even in JavaScript. Because this method only requires reading and writing of variables, the only micro-architectural knowledge we'd actually need would be figuring out whether two addresses belong to the same bank, plus having threads running on two different cores. Seaborn and Dullien [1] (2015) demonstrated that finding two addresses that belong to the same bank can be done by chance alone. It also seems likely that we could get a thread running on another core. Thus, if it worked, we'd eliminate the need for a timer, and no special opcodes would be required. In fact there is a good chance that the method could apply to other devices with similar cache implementations.

The cache coherency protocol

I'll spend a tiny bit of time on the cache coherency protocol of modern Intel CPUs. The protocol ensures that multiple caches contain a coherent representation of memory. It is a slightly modified version of the MESI protocol, called MESIF; the literature is much better on MESI, so if you choose to research this you should start there. The MESIF protocol states that any cache entry is always in one of 5 states: Modified, Exclusive, Shared, Invalid or Forward. The MESI(F) ordering is not relevant – it just makes the acronym pronounceable. The states are as follows:

Invalid: The cache entry starts out its life as "Invalid", meaning no meaningful data is stored here. It is also the state a cache entry enters when we use clflush.

Exclusive: Only one cache contains this cache line. In this state there are, by definition, no cache coherency issues.

Shared: The cache line is loaded in two or more caches.

Modified: The cache line is different in the cache than in main memory – thus a memory store operation is required should this cache entry be flushed.

Forward: A special case of "Shared": when multiple caches hold a shared copy of a cache line, the one in the Forward state is the one any read operation should be served from.

The protocol then specifies that, for a given cache line, only the following transitions between these states are possible (rows are the current state, columns the new state):

       M    E    S    I    F
  M    x    x    x    Ok   x
  E    x    x    x    Ok   x
  S    x    x    Ok   Ok   Ok
  I    Ok   Ok   Ok   Ok   Ok
  F    x    x    Ok   Ok   x

Of interest here is that if any cache (say L3) has the line in the S state, the cache line cannot be directly modified, because there is no valid transition from S to M. It is this feature we hope to utilize.
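The transition table can be encoded directly, which also makes the S-to-M observation mechanically checkable (a toy encoding of the table above, not anything the hardware exposes):

```c
#include <stdbool.h>

enum mesif { M, E, S, I, F };

/* ok[from][to] mirrors the transition table above. */
static const bool ok[5][5] = {
    /*            M      E      S      I      F   */
    /* M */ { false, false, false, true,  false },
    /* E */ { false, false, false, true,  false },
    /* S */ { false, false, true,  true,  true  },
    /* I */ { true,  true,  true,  true,  true  },
    /* F */ { false, false, true,  true,  false },
};
```

In particular `ok[S][M]` is false: a shared line must first be invalidated elsewhere (via an RFO) before it can become Modified.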

Another feature of the Intel cache manager is snooping: any memory request is sent out on the bus so that every core can report back the state of the cache – that is, whether and in what state the appropriate cache line is present in the processor's caches.

I also need to mention the Request For Ownership (RFO for short) transaction. When you're writing to a cache entry that is either shared or invalid, you use an RFO. The RFO is an exclusive command – no other command executes on the cache hierarchy for that cache line while the RFO is processing. The RFO invalidates all other copies of the cache line in the entire cache hierarchy, thus clearing the way for modifying the entry.
We can now outline the two scenarios again.

Core 1 issues a read. The memory is not in the L3 cache, which means it'll be loaded into a way in the appropriate cache set: either there is already an invalid way in the set (one of the ways is marked I), or an eviction takes place and one of the ways is marked I. Then data is loaded from main memory and the way is marked E(xclusive).

Core 2 now reads the same memory. The cache manager decides, for some reason, that Core 2 needs this cache line a lot and thus wants to fill L2 with it. Consequently it finds an empty way (or makes one through eviction) in the appropriate set of L2 and loads the data from L3. Now both L3 and L2 hold a copy (the Intel cache is inclusive – what is present in L2 must be in L3 too), and both entries are now stamped S(hared).

Core 1 now places the write operation on the bus. The L3 cache can immediately tell that the line is present and shared; consequently the L2 copy must be invalidated, so an RFO is sent out on the bus and the L2 entry is invalidated. The question is what happens next.
Option 1: Modify L3 and set it to Modified.
Option 2: Modify L3, write it back to main memory, then set the appropriate entry to Exclusive.
Option 3: Modify L3, write it back to main memory, invalidate L3, and reload from main memory.

For us to use this method for row hammering we'd need either option 2 or 3: since both reading from and writing to main memory activate the row, either could be used for row hammering.

Empirical data

Unfortunately, grey theory is just that. Intel tweaks their processors, and details may be changed for better performance. But we do have the option of verifying it empirically.

As always when timing things in the memory subsystem, there is a lot of noise from many sources, and the bottom line is that we need to deal with it. My rough method has been to sample between 200 and 1000 timings, depending on just how much noise I get in my measurements, and then remove the 10% slowest times, assuming they were slow due to noise. This is not something that would make my old statistics professors proud, but it’s easy to implement and probably doesn't introduce too much bias anyway. Second, we have a ballpark idea of how long a given memory operation takes: an uncached memory operation usually takes more than 100 CLKs on my wife’s Sandy Bridge, so if I’m timing uncached operations I assume something went wrong if the time spent was less than 100 CLKs. This approach significantly reduces the noise and actually allows me, by looking at the remaining data, to get an idea of what’s happening while I’m writing the code. Once I’m confident in the code, I calculate the average and standard deviation and formally test my hypothesis.
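The filtering described above is straightforward to implement. A sketch (my code, not the original; the 10% trim and the plausibility floor are the two knobs from the text):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Drop the slowest 10% of samples, then average the rest, discarding
 * readings below floor_clk as "something went wrong" (e.g. < 100 CLKs
 * for an operation known to go to memory). Sorts samples in place. */
static uint64_t trimmed_mean(uint64_t *samples, size_t n, uint64_t floor_clk)
{
    qsort(samples, n, sizeof *samples, cmp_u64);
    size_t keep = n - n / 10;
    uint64_t sum = 0;
    size_t cnt = 0;
    for (size_t i = 0; i < keep; i++) {
        if (samples[i] < floor_clk)
            continue;
        sum += samples[i];
        cnt++;
    }
    return cnt ? sum / cnt : 0;
}
```

For example, with samples {100,110,...,180,9999} the outlier 9999 falls in the trimmed 10% and the mean of the remaining nine is 140.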

So my first step was to get some ballpark figures. Because it’s difficult to benchmark how long an “uncached” memory write takes – it’s somewhat undefined – I use memory reads as a proxy for the time it takes to actually access memory. A cached memory access takes around 66 CLKs on average, while its uncached counterpart (using clflush+mfence to make sure it’s not in the cache) takes around 200 CLKs. A clflush of an address loaded just prior takes some 112 CLKs. Finally, I time a write operation (mov [pointer],1) both when I had previously loaded the address into L3 and when I hadn’t; both take around 90 CLKs.
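The baseline measurements can be reproduced with something like the following sketch (my reconstruction; absolute CLK numbers differ per machine, so treat the 66/200/112 figures above as examples only):

```c
#include <stdint.h>
#include <x86intrin.h>

static uint8_t g_buf[4096];

static uint64_t time_read(volatile uint8_t *p)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return t1 - t0;
}

/* Baseline: read a line that was just touched (should hit cache). */
static uint64_t cached_read(void)
{
    volatile uint8_t *p = g_buf;
    (void)*p;                 /* warm the line */
    return time_read(p);
}

/* Baseline: read a line evicted with clflush+mfence (goes to memory). */
static uint64_t uncached_read(void)
{
    volatile uint8_t *p = g_buf;
    (void)*p;
    _mm_clflush(g_buf);
    _mm_mfence();
    return time_read(p);
}
```

Taking the minimum of many such runs, per the noise-handling recipe above, is what produces stable figures.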

These data make a fair amount of sense. Obviously it’s a lot faster to read from L3 than from memory. A clflush operation in essence needs to block the cache line in the hierarchy and then actually invalidate the cache entry, and with my fencing we probably need to wait for the process to complete. Consequently we’d expect it to take longer than a cached read, but less than an uncached read that actually has to drop down to memory and fetch stuff.

Now I code up the scenario sketched above:
1.       Main thread: Make sure it’s running on core 1. Set affinity and sleep to invoke the scheduler.
2.       Main thread: Access the address of interest (AOI) 6 times, which in the past has given me good results for it being loaded into L3, and only L3.
3.       Main thread: Start the second thread and wait for it to finish.
4.       Second thread: Make sure it’s running on core 2, as above.
5.       Second thread: Access the AOI 10000 times to load it into L2.
6.       Second thread: Finish.
7.       Main thread: Modify the AOI and time it with RDTSC: mov [AOI],1
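A Linux/pthreads sketch of these seven steps might look as follows (the original was Windows code; note also that on some machines logical CPUs 0 and 1 are hyperthread siblings sharing an L2, in which case different core numbers must be chosen):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>

static volatile uint8_t aoi[64] __attribute__((aligned(64))); /* address of interest */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* best effort: on a single-core box this fails and we run anyway */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *second_thread(void *arg)
{
    (void)arg;
    pin_to_cpu(1);                    /* step 4 */
    for (int i = 0; i < 10000; i++)   /* step 5: pull the AOI into this core's L2 */
        (void)aoi[0];
    return NULL;                      /* step 6 */
}

/* Steps 1-7: returns the cycle count of the modifying write. */
static uint64_t run_experiment(void)
{
    pthread_t t;
    unsigned aux;

    pin_to_cpu(0);                    /* step 1 */
    for (int i = 0; i < 6; i++)       /* step 2: place the AOI in (ideally only) L3 */
        (void)aoi[0];

    pthread_create(&t, NULL, second_thread, NULL);  /* step 3 */
    pthread_join(t, NULL);

    _mm_mfence();                     /* step 7: time mov [AOI],1 */
    uint64_t t0 = __rdtscp(&aux);
    aoi[0] = 1;
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return t1 - t0;
}
```

The returned cycle count is what the next paragraph interprets.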

And voila: now it mostly takes around 170-210 CLKs to write the memory. This tells me I’m on to something. It’s much slower than a normal write operation, so we can assume something is going on. It’s also far slower than the clflush from my baseline, which should correspond to flushing a cache entry in the E(xclusive) state. In fact, the time suggests that we are hitting main memory. That it isn’t always between 170-210 CLKs, and sometimes far below, seems to suggest that from time to time my address of interest gets removed from L2 on core 2 before I actually modify it. Maybe the combination of destroying the second thread and waiting for it to be done is enough for other system activity to evict the AOI from L2.

But we don't actually have to live with this circumstantial evidence. We might be able to measure whether we actually hit memory. It goes like this: when writing to physical memory, the row you wish to write to is loaded into the row buffer of the RAM. Then it's modified with the data being written - remember the row buffer is 8kb, whereas a cache line is only 64 bytes. Finally the row buffer is written back into the appropriate row. Consequently, if we make sure that another row is in the row buffer by reading it from physical memory before modifying the address of interest, we can do a row buffer side channel attack to see if that row was then removed.

Just prior to modifying, I therefore added a function that randomly accesses physical memory in a large buffer 200 times, thus making it unlikely that the address of interest's row is in the row buffer. I then modify the address of interest as usual and then access the row of the address of interest. To avoid using the cache line of the AOI, I make sure to place the AOI on a page boundary and access the memory 2 cache lines further ahead. Thus I have no cache conflicts, but from [5]Seaborn(2015)'s mapping of physical memory I can be sure to access the same row.

When I read an address from a row that is in the row buffer, as I did for my baseline, I get read times of around 200 CLKs. Interestingly, if I immediately access the same row in the scheme above, it takes around 350 CLKs for the request to finish. This seems to indicate that a required resource is unavailable at that time; I have only unfounded speculation about what that could be. However, if I execute a loop of a significant number of dummy instructions (mfence) and then access the memory, I get times around 208 CLKs, which is consistent with the current row being in the row buffer. Thus I think I have a strong case that I actually activate the row in DRAM, which is the ultimate prerequisite for row hammering.

Can we simplify the attack vector?

If you are smarter than I was, you probably already know what I'm hinting at. MESI(F) is a protocol to keep coherency between caches - not cores. The core story might just have been a detour in my thinking. What if we load the address of interest into L2 and then modify it on the same core? We'd have exactly the same effect without the threads. At least with a high number of accesses to make sure that the AOI is elevated to L2, this does indeed seem to work, though the latency on modifying drops from 200 to around 150 CLKs.

Good enough?

I don't know if we can get this fast enough for actual row hammering. I don't know how fast we can get two addresses into L2 and then cause them to be written back out again. And since I'm currently without access to a computer with RAM susceptible to row hammering, I'll leave it to other people to figure out whether it works. Also, I'm not really strong in JavaScript and thus have no clue whether the threaded version would really be implementable in JS. I cannot imagine a world where the non-threaded version isn't implementable in JS.

Yet another cache side channel attack

Obviously we have another type of cache side channel here. Since it takes longer to write to shared memory that another core has cached, we know for sure that we can use the method for a side channel attack: we can detect whether a victim uses a read/write address enough to place it into L1 or L2. For lack of a better name I call it share+modify. There can be many different actual implementations of this cache attack. The simplest might look like this in pseudo code:

1. mfence
2. StartTime = rdtsc()
3. Modify(address of interest)
4. mfence
5. Information = rdtsc() - StartTime
However, it's an impractical side channel attack. The reason is that shared memory with read/write access is required. While shared read-only memory is common through deduplication and shared libraries, shared read/write memory is quite scarce. The advantageous thing from an attacker's viewpoint is that it might trigger other performance counters than the classical flush+reload and thus avoid the suggested detection for that attack. Might. Also, the chances of a victim actually loading stuff into L2 are much lower than L3, and consequently a lot less information is likely to be revealed by this attack.

Literature

[1] Mark Seaborn; Thomas Dullien (March 9, 2015): "Exploiting the DRAM rowhammer bug to gain kernel privileges". googleprojectzero.blogspot.com. Google. Retrieved March 10, 2015.
[2] Daniel Gruss, Clémentine Maurice, Stefan Mangard (2015): "Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript". http://arxiv.org/abs/1507.06955
[3] David Levinthal (201x): "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors". https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[4] Intel (201x): "Avoiding and identifying false sharing among threads". https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
[5] Seaborn (2015): "How physical addresses map to rows and banks in DRAM". http://lackingrhoticity.blogspot.de/2015/05/how-physical-addresses-map-to-rows-and-banks.html

 



Tuesday, December 1, 2015

Rowbuffer side channel attacks

Update 30.11.2015

Errata regarding CAT

Yuval Yarom (University of Adelaide) was so kind as to email me regarding my speculation on CAT changing the cache side channel game and its impact on row buffer side channels. Before I proceed with what Yuval had to say, I'd like to recommend a paper Yuval recently co-authored called "Amplifying Side Channels Through Performance Degradation". You can find it here: http://eprint.iacr.org/2015/1141.pdf
Yuval corrected me on some points and verified my thinking on others. Below is my summary of his comments:
1) CAT is available on some Haswell processors, e.g. the Xeon 2618L v3.
2) CAT partitions on ways.
3) CAT only controls evictions. If a process can access memory, CAT will not intervene with writing, reading or flushing from the L3 cache.
4) Thus CAT cannot protect against attacks that rely on clflush, in particular flush+reload and flush+flush. It could be used to protect against attacks that rely on eviction, such as prime+probe.
5) With only 4 COS values it will be of limited use in many cloud use cases.
6) If you wish to defeat clflush based attacks, the approach should be to avoid sharing memory.
My comments:
I did not foresee that CAT would only control evictions and not flushes. Yuval's point on CAT not protecting against flush+reload and flush+flush extends to row buffer side channel attacks.

Comment on concurrency

Peter Pessl, Daniel Gruss, Clémentine Maurice and Stefan Mangard released a paper called "Reverse Engineering Intel DRAM Addressing and Exploitation"; you can find it here: http://arxiv.org/abs/1511.08756. This paper hinges on the exact same idea that I present in my blog post and was released almost simultaneously. Daniel Gruss and I were discussing the flush+flush paper prior to its release when I mentioned my work on RBSC attacks. He explained that he and others were working in the same direction, and we agreed to work concurrently on our own but synchronize our publications towards the end of November. I received a draft of their paper hours before they published. The only relevant influence I've had on their work is reporting a typo, and probably my concurrent work moved up their schedule. They attack the micro architecture problems I've been skirting around in this blog head on, which is pretty cool. I would've loved to play with those issues if I'd had the time. Interestingly, their work might be relevant for my current "pet project".



Introduction

Cache side channel attacks have recently had a renaissance in the literature due to the emergence of a 3rd level cache that is shared between cores on a system. However, since the dawn of time a feature of DRAM has been available for side channel attacks very similar to CSCs: the row buffer. I'll make the case here that the row buffer is usable for side channel attacks. I strongly recommend you read my "Cache Side Channel Attacks" blog post before proceeding, because there is a lot of crossover between how cache side channel attacks work and how a row buffer side channel attack could work. I will not spend much time on these issues in this post and generally refer you to my older post. My actual PoC code for an attack is not public. As usual I do not make offensive security resources public.

Before anybody gets too excited: the practical importance of a row buffer side channel attack is probably very limited. The reason is that the emergence of the 3rd level cache has in many ways made it obsolete, and the presence of the 3rd level cache makes the attack impractical because the effect of the cache will tend to be in the way. The practical importance of such an attack is thus limited to corner cases or computers prior to the introduction of big caches. From a theoretical perspective, I'll argue that my tiny effort has been worthwhile. Also, my guess is that noise problems will significantly hamper its use. However, the attack may very well scale to other platforms. This is because what I utilize here is a feature of DRAM and not a feature of the processor. This means potentially any device using DRAM could be vulnerable to this kind of attack.

Background

Current DRAM comes in modules called DIMMs. If you buy a modern memory module for your PC, you're buying a DIMM. If you look at a DIMM, most DIMMs will have chips on both sides. Each side of the DIMM is a rank. Each rank again consists of a number of banks. The banks are typically the physical individual chips you can see. Inside a bank you'd find a two dimensional matrix of memory cells. There are 32k rows in the matrix and 64k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify a 1 bit, and no charge for a 0 bit. Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM you have in front of you. When you read from DRAM, an entire row is always read at once. Reading a row means checking whether each capacitor is charged or not. Since this operation inevitably discharges the capacitors, DRAM first reads the entire row into a buffer and returns the read-out values to the bus from there; then the entire row is written back into the matrix to renew the charge. However, if two reads to the same row occur, there is no need to fetch the memory from the matrix the second time, and thus the row buffer works as a cache (this is why row hammering requires reading from 2 rows within a bank to be successful). Such a row buffer is either 8kb or 64kb - obviously the same size as the row. For a more accurate description I refer you to [2]Wikipedia.

How does my RBSC work

Just like a standard cache, we can use this to gain information about the memory access patterns of another process. The obvious requirement is that we need to share row buffers with the victim. Technically we can share a row buffer with somebody without sharing memory, and in fact this is bound to happen to some degree, since the page size is smaller than the row size and the operating system allocates memory through paging. I shall not discuss this variant, however. The reason is that utilizing it for an attack requires knowledge of the micro architecture of the processor, in particular the physical address mapping. Mark Seaborn[1] has done some work on this, but going further in this direction is way beyond my time schedule: there is no way I can research this in less than 24 hours, and I don't write blog posts that take longer than that. Also, it's likely that we'd need an additional information leak to really leverage RBSC in this fashion. I may get back to this in another blog post, but don't count on it. This leaves me with spying on accesses to memory that is shared between victim and attacker. Unfortunately there is plenty of shared memory between processes: typically the operating system avoids loading shared modules twice and instead just maps a single instance of physical memory logically into all processes that use it. Even virtual machines often share memory. Thus my attack has similarities to the flush+reload cache side channel attack. Where flush+reload CSCs work on cache lines and the attacker is able to determine access within a 64 byte cache line, with row buffer side channel attacks we'll only be able to tell about an 8/64kb row buffer. The mechanics of the attack I've written up are:

1) First, read tons of memory to statistically make sure another row is in the row buffer. This could surely be done in a more efficient fashion, but again that would require knowledge of the micro architecture; however, if you read the original row hammer paper you'd know that my approach has a very good chance of working correctly. See Seaborn & Dullien (2015) for more. To be specific, I read 200 times from random locations in a large allocated buffer to give me a good chance of having read from another row in the bank I'm interested in. Now I theoretically could just do normal read operations, but you'll have less noise if you make sure the memory you're trying to read isn't in the cache, so I've added a clflush instruction prior to each read. Also, since we need a baseline memory access timing, I use this phase of the attack to time the memory accesses using the RDTSC instruction. Because modern processors are capable of executing out of order to make the most of waiting for memory, it's important that the timed operation is bracketed by mfence instructions to avoid reordering. It is important that the baseline is timed on the same address which you wish to attack. The reason is that if the address we use for our baseline belongs to a different core's L3 slice than the one we wish to attack, querying the L3 cache has a different timing due to the ring bus.

2) Finally, I time how long it takes to read the memory I'm interested in as an attacker. A fast read means the victim used this row: the data was returned from the row buffer. A slow read means the victim did not use this row: the data got returned from the matrix. Whether a read was fast or slow is measured against the baseline from 1). Here I also use the clflush instruction beforehand to make sure the read is not served from the cache, as well as the mfence instruction to protect against reordering.
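The two phases can be sketched in C (GCC intrinsics; the 200-read displacement count comes from the text, while the buffer size and the threshold parameter are my own illustrative choices - the threshold would be derived per machine from the baseline phase):

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

// Timed uncached read: flush first so the access must go to DRAM, then
// measure whether it was served from the row buffer (fast) or required
// activating the row (slow).
static uint64_t uncached_read_time(volatile char *p)
{
    _mm_clflush((void *)p);
    _mm_mfence();
    uint64_t start = __rdtsc();
    (void)*p;
    _mm_mfence();
    return __rdtsc() - start;
}

// Phase 1: statistically displace the target's row from the row buffer
// by doing 200 uncached reads at random locations in a large buffer.
static void displace_row_buffer(volatile char *buf, size_t size)
{
    for (int i = 0; i < 200; i++)
        (void)uncached_read_time(buf + (rand() % size));
}

// Phase 2: probe. Returns 1 if the read looks row-buffer fast, i.e. the
// victim probably touched the target's row since we displaced it.
static int probe(volatile char *target, uint64_t threshold)
{
    return uncached_read_time(target) < threshold;
}
```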

Now there is lots of noise in this attack. Although there is a measurable difference between the timing of buffered and unbuffered reads, the difference is very small compared to a traditional flush+reload attack, and on modern multi core computers you always have "noise" in execution times (especially on memory), leading to frequent misidentification. Another reason is that the chance that another, non-victim process uses another row is much, much larger than the chance of another core causing a cache eviction (the cache has 14/16 64 byte ways; a bank has 32k rows of 8kb memory) or even using the same cache line as the victim. And finally, another core using the same DIMM (RAM module) while you are polling will cause delays in excess of the saving from using the row buffer as a cache. Another potential source of noise is the memory refresh, which will read and rewrite row buffers periodically. With so much noise, coarse grained attacks are unlikely to bring significant rates of success. Additionally, high levels of memory activity on other cores are poison to the attack. When running under favorable conditions and probing, I only get around 70% correct predictions on whether the victim used my row buffer or not.

I need to mention that my way of launching this attack can be improved upon: instead of just statistically making sure that another row is loaded into the row buffer, you could do it exactly if you had knowledge of the micro architecture. I often observe very slow reads (factor 3 or more) for reasons other than missing the row buffer. The amount of time spent, however, clearly indicates that we're seeing noise and should disregard the reading, both when finding the baseline and while attacking. A likely cause for these long waiting times is that we are forced to wait for another core's request to access memory on the DIMM.

I said in the introduction that this attack vector probably isn't that important on modern computers. The reason is that the 3rd level cache is bound to pick up on any memory operation occurring, and thus the second read that is required is likely to never make it to DRAM. Now we can work around this using clflush or equivalent means - this is what I did as described above. But why would an attacker bypass a better side channel to use a worse one, and in the process leave behind all the hallmarks of a true CSC attack? Nevertheless, there might be corner cases where RBSC attacks could provide a viable alternative to CSC attacks.

Mitigation

I think the solution to RBSC attacks has to be a (micro?) architectural one. But since we probably aren't going to get one, we may wish to consider software mitigations. Mostly the same methodology for mitigation would apply to row buffer attacks as to cache attacks. First, on modern computers with an L3 cache the attacker needs to flush the cache to make the attack work, and this makes my performance counter detection a viable solution for high frequency polling: to access the row buffer instead of the cache, the attacker must cause a cache miss, and typically where we'd normally expect a hit. Another good solution would be making the timer less accurate, as I've suggested for cache side channel attacks, which would make it virtually impossible to distinguish between row buffer reads and normal reads.

Conclusion

From a theoretical point of view, RBSC is a lesson in how apparently beneficial design decisions, such as using the row buffer as a cache, can quickly turn into a security issue. The row buffer is required in the DRAM scheme; however, its cache function probably isn't (I'm no DRAM expert), though it seems like a great idea performance wise. As soon as we share resources across trust boundaries there will always be a potential for side channels. It also serves as a warning that fixing cache side channel attacks in the micro architecture of the processor requires that thought be given to RBSC attacks. That being said, RBSC attacks seem unlikely to be a real world threat - at least on the x86 architecture. However, they might be quite interesting for other architectures.


Ending on a speculative note, CATs and more advanced attacks

This part was really added long after I'd finished up the above text. So consider this a last minute addition.

More advanced attacks

The first modification we can do to the attack above is to look not at what is in the row buffer, but at what isn't. If, in the above attack, we prime the row buffer with an attacker memory location, checking to see if the victim removed it gives us information about his activities. But this isn't much information, because we'll only know whether the victim used the same bank or not. In essence, knowing that the victim used any one of 32k 8kb rows isn't too valuable.

But we can improve on this. The details are micro architecture specific, and I'll stick to my wife's i3 Sandy Bridge dual channel architecture here. My guess is that other architectures will be different, but have enough similarities for the basics to apply. Sometimes the attacker shares a row with a victim without actually sharing memory, and we can utilize this to gain more information. The operating system allocates memory to processes on a per page basis, and security attributes too are set per page. However, a row is shared across pages, and thus potentially across privileges, and we can use this to spy. Typically there are parts of 4 pages in a row buffer. The first part of that number is that the row typically contains 8kb, while standard (small) pages are typically 4kb. This accounts for two pages. Further, we know the first 12 bits of a virtual address are identical to those of the corresponding physical address, and we know from Seaborn(2015) that bit 6 of a physical address is used to select which DIMM should be used. This is probably implemented this way by Intel so that memory requests can be handled in parallel by the two DIMMs. When the computer in question has two DIMMs (memory modules), each page is split across the two DIMMs and thus two rows. Hence each row contains memory from 4 different pages. If the attacker is in a position to cause behavior in the victim, he can map this behavior to row activity by timing, and eventually use this information to spy on activity not caused by him. It is obviously a far more difficult attack to launch, and I've not actually tried to.
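For illustration, under the mapping Seaborn(2015) reports for one dual channel Sandy Bridge configuration, the relevant address bits can be extracted like this. The helper names are mine, and the exact bit positions are specific to that machine, so treat this as a sketch of the idea rather than a universal decoder:

```c
#include <stdint.h>

// Decompose a physical address under the mapping Seaborn(2015)
// reverse engineered for a 2-DIMM Sandy Bridge machine.

// Bit 6 selects the channel/DIMM, so consecutive 64 byte lines
// alternate between the two modules.
static int channel_of(uint64_t paddr)
{
    return (paddr >> 6) & 1;
}

// Bits 18 and up select the row within a bank.
static uint64_t row_of(uint64_t paddr)
{
    return paddr >> 18;
}

// Bits 14-16, XORed with bits 18-20, select the bank.
static int bank_of(uint64_t paddr)
{
    return ((paddr >> 14) ^ (paddr >> 18)) & 7;
}
```

Because the channel flips every 64 bytes, a 4kb page is interleaved across both DIMMs, which is exactly the "each page is split across two rows" effect described above.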

CAT and the future of cache attacks

In Intel[3] chapter 17.15, the "cache allocation technology" (CAT) of "future" (I think this means Broadwell and beyond, but I'm not sure) Xeon processors is vaguely described. Unfortunately I don't own or have access to a "future Xeon processor", and thus everything below remains pure speculation. CAT allows the supervisor or hypervisor to partition the cache and select which processes should have access to which partitions. It is built in to give hypervisors and supervisors a way to provide important processes with better performance by giving them better access to the cache. Without partitions, processes compete for cache on equal terms; CAT is supposed to tip the balance in favour of those who the OS thinks need it. The documentation that I have states that only the L3 cache is currently supported. The way it works is undocumented (or I don't have/understand the documentation), but it is hinted that current implementations partition over ways, which sounds like an easy and sensible method. The current process is assigned a two bit mask in an MSR called a COS value. Based on this COS value, a so called capacity bitmask is fetched from another MSR. I assume that for each bit set in this MSR, the corresponding ways of each cache set are accessible to the current memory access operation. The idea is that the operating system sets up the bitmasks on boot (the default is that everything is shared) and for each process/thread/whatever sets the appropriate COS value. This could potentially be used to give 4 different users access to each their own private L3 cache, depending on how it's actually implemented. This could be a nice, flexible (leaving the performance/security trade off to the OS implementation) micro-architectural solution to cache side channel attacks on L3 caches - should it work.
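To illustrate the partition-over-ways reading: if each COS maps to a capacity bitmask where bit i grants access to way i of every cache set, then two COS values can only interfere in the L3 if their masks overlap. A toy sketch (the mask values and 16-way geometry are hypothetical, purely to show the arithmetic):

```c
#include <stdint.h>

// Number of L3 ways a capacity bitmask (CBM) grants to its COS.
static int ways_granted(uint32_t cbm)
{
    return __builtin_popcount(cbm);
}

// Two COS values can evict each other's lines only if their masks
// share at least one way.
static int partitions_isolated(uint32_t cbm_a, uint32_t cbm_b)
{
    return (cbm_a & cbm_b) == 0;
}
```

With a hypothetical 16-way L3, giving COS0 the mask 0x00FF and COS1 the mask 0xFF00 would split the cache into two isolated 8-way halves; overlapping masks would let the COSs compete for the shared ways.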

I see two likely scenarios for the actual implementation of CAT. The first and simplest is a hard partitioning of the cache. What I mean by this is that memory placed in a cache partition available only to a COS=0 process will come from physical memory if a COS=1 process should try to access it (or from COS=1's own partition). Further, a COS=1 process could not possibly evict or flush memory from COS=0's partition. In this scenario it's likely that all cross-COS attacks on the 3rd level cache are impossible when the bitmasks are set up appropriately. Obviously, even though the cache is hard partitioned and attackers cannot either "prime" or "load" across partition boundaries, instructions may still leak information about cache state and allow some attack possibilities.

The second scenario I call the "soft partition" scenario, and I find it more likely. In this scenario only the second part would hold true. Any memory access, regardless of COS, would be served from any cache partition if available. However, any eviction or flush decision would be affected only by cache events initiated with appropriate COS access. It would allow any amount of "probing", but once a cache entry has been made by the victim, the attacker could no longer flush it on his own. Thus flushing or evicting by an attacker becomes a race condition instead of a sure thing. So L3 cache attacks would lose a lot of power, but not generally be impossible. I find this scenario more likely for two reasons. The first is that in this scenario no new cache coherency between partitions would have to be enforced. Secondly, it allows synergies between COSs to be leveraged for better performance.

It is however easy to imagine other scenarios for how CAT could be implemented, and the implementation details matter very much. There is work to be done here.
It might also be valuable to look into chapter 17.14 (cache monitoring) for CSC detection techniques in the future.

Certainly CAT will have a profound impact on RBSC too. The impact pushes RBSCs in two directions: bypassing the cache for another COS becomes impossible in the first scenario and becomes a race condition in the second. However, CAT does not appear in any implementation to make RBSC infeasible, and this is likely to make it a more viable option for attackers.

Acknowledgement

Here I must acknowledge Daniel Gruss, who motivated me to keep working on this blog post when it seemed like it'd be too much effort to finish. Also, my thanks go out to Jacob Soo, whose comments improved the text. All errors are mine.


Literature

[1] Seaborn (2015): "How physical addresses map to rows and banks in DRAM". http://lackingrhoticity.blogspot.de/2015/05/how-physical-addresses-map-to-rows-and-banks.html

[2] Wikipedia: "Dynamic random-access memory". https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Operations_to_read_a_data_bit_from_a_DRAM_storage_cell

[3]Intel(2015): “Intel® 64 and IA-32 Architectures, Software Developer’s Manual,Volume 3 (3A, 3B & 3C):,System Programming Guide”

[4]Daniel Gruss, Clémentine Maurice & Klaus Wagner(2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf

Saturday, November 21, 2015

Detecting stealth mode cache attacks: Flush+Flush

Update (22 Nov. 2015):

Daniel Gruss commented on Twitter that he was a bit skeptical about the performance of my detection scheme. I think it's fair to be skeptical on this point. I've augmented the blog post with some information about the performance of the detection, and a small comment on how the CPU could be changed to allow better mitigation. The augmented material is appended.

Introduction

In my previous works I described how cache side channel attacks work, why I consider them a serious threat, and spent a lot of time on mitigation methods and detection. Recently a new paper was published that avoids the detection methods I've suggested, by avoiding the signature behavior I've looked for. Thus not only my methods, but also variations of them, fail against this new attack. In this blog post I'll shortly describe the new attack and propose a detection method for it. I'll touch on subjects which I explained in my previous cache side channel post, and I won't spend much time on the background of CSCs here. So if you're not up to date on cache side channel attacks, you may wish to read that post before proceeding.

Developments since my last post

Since my last cache side channel attacks post, 3 important papers have been released. First was Chiappetta, Savas & Yilmaz[1], about using performance counters for detecting cache side channel attacks. Unlike what Nishat Herath and I wrote about in our Black Hat slides, Chiappetta, Savas & Yilmaz[1] coupled the performance counters to machine learning methods, specifically a neural network, and showed you could detect attacks in real time using this methodology. It's an interesting upgrade to my work on the subject, and I suspect their method works well on high frequency attacks. My unpublished results from the work I did in connection with the Black Hat speech suggest that low frequency cache attacks in a noisy environment may not be detected well using this method, but I'll be happy to be shown otherwise. I'll return in part to this point later in this blog post. Nevertheless I'm really happy, because it was the first independent confirmation that the ideas Mr. Herath and I brought forth at Black Hat have bearing, and an article like that always carries more information than could be presented at a conference.
Then came a very interesting paper by Lipp et al.[2]. In this paper the authors break down the hurdles for using a shared cache on ARM devices and thus bring CSCs from Intel to other platforms. It is a development that should not be taken lightly, as the potential for information leakage is huge, simply because the installed base is really large. In my last blog on CSCs I warned about performance counters being leaked to unprivileged processes, and the irony is that exactly this is what makes the ARMageddon attack possible in the first place.
Finally, Gruss, Maurice & Wagner[3] struck again with the "Flush+Flush: A Stealthier Last-Level Cache Attack" paper. It is this paper I shall write about and respond to in this blog.

The story

A while back Daniel Gruss contacted me and essentially teased me about whether I thought it'd be easy to detect a CSC that did not cause last level cache misses. My answer was "depends on the details", which is a nice way of hedging my bets. I thought about how Mr. Gruss would be going about it, and the answer was in my notes, specifically on my to do list. While I've been playing with CSCs I'd figured that clflush would likely leak information through how long it takes to execute: actually flushing memory would likely take a different amount of time than just searching through the cache set and causing a miss. I had never actually done any work on it, but it allowed me to guess what Mr. Gruss was hinting at. In the end I ended up with a draft of the paper and sent back a few comments to Mr. Gruss & Clémentine Maurice. I'm really grateful for the gesture of trust on their part, and it's always a joy reading their papers - and early access is pretty cool too. My early access wasn't just used to "review" their paper, but also to ponder how I could detect this new evil beast, and I've come up with a solution.

More data on performance counters and CSC


Gruss, Maurice & Wagner[3] spend part of their paper documenting how a number of performance counters respond to different kinds of attacks and benign use cases. After carefully analyzing the results, they suggest a modification of Mr. Herath's and my method: using a fixed decision boundary on the ratio of cache hits to cache misses. This is indeed a good idea, and they show that the method gives fairly good results. Again, I'm somewhat skeptical about the performance of this measure under noise for low frequency attacks. My argument goes that if the attacker (or another benign process) causes lots of cache hits on unrelated sets, the performance counters are likely to show a ratio that is considered ok and thus not detect the attack. The attack would remain successful as long as the noise doesn't extend to the attacker's cache sets of interest. High frequency attacks are more difficult to "mask" in this way. Lowering the demands on the ratio will help, but may lead to false positives on other memory intensive applications. All this said, I count their results as evidence that performance counters can make a real impact on CSCs. My suggestions on how to deal with these issues can be found in the Black Hat slides and my previous blog post.
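A toy sketch of such a fixed decision boundary is below. The boundary value 2.0 is a placeholder of mine, not the value from the paper, and would need per-machine calibration:

```c
#include <stdint.h>

// Classify a sampling interval using a fixed decision boundary on the
// ratio of cache hits to cache misses. An attack that keeps flushing
// and missing drags the hit/miss ratio down; the 2.0 boundary here is
// purely illustrative.
static int looks_like_attack(uint64_t cache_hits, uint64_t cache_misses)
{
    if (cache_misses == 0)
        return 0;               // no misses, nothing suspicious
    double ratio = (double)cache_hits / (double)cache_misses;
    return ratio < 2.0;         // unusually many misses per hit
}
```

This also makes the masking concern concrete: an attacker who generates extra hits on unrelated sets inflates `cache_hits`, pushing the ratio back above any fixed boundary while the attack continues.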

The new stealthier attack

The new stealthier attack works by timing the clflush instruction. It is slower when it actually has to flush the cache, as one would expect a priori. Now, obviously clflush takes an address as a parameter, and this essentially forces an attacker to have read access to the memory he wishes to spy on. This makes the attack in most respects similar to the flush+reload attack, which I described in my blog on cache side channel attacks. The difference is that the attacker no longer needs to cause a significant amount of cache misses on his own, since the clflush instruction on its own is enough and the reload phase of the flush+reload attack entirely disappears. This fact not only makes it more difficult to pick up with performance counters, but also makes the attack much faster. There are other aspects of this attack which make it even scarier, but it's beyond the scope of this post to go into them here. The pseudo code for the attack looks like this:
1.            while (true) {
2.            mfence
3.            starttime = rdtsc
4.            mfence
5.            clflush(target_address)
6.            mfence
7.            delta = rdtsc - starttime
8.            derive information from delta (cache miss or hit)
9.            optionally wait for victim here  }
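The core of the loop above can be written almost verbatim in C with compiler intrinsics. This is my own sketch (assuming GCC or Clang on x86; the caller supplies the target address and interprets delta), not code taken from the paper:

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence */

/* Time a single clflush of the given address, fenced so the two
 * rdtsc readings bracket exactly the flush. A large delta means the
 * line was cached (e.g. a victim touched it); a small delta means it
 * was already uncached. Note the attacker needs read access to addr. */
static uint64_t time_clflush(const void *addr)
{
    _mm_mfence();
    uint64_t start = __rdtsc();
    _mm_mfence();
    _mm_clflush(addr);
    _mm_mfence();
    return __rdtsc() - start;
}
```

Comparing the returned delta against a calibrated threshold then takes the place of step 8; note how the reload phase of flush+reload is simply gone.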

An important feature of the flush+flush attack is that the time difference between hit and miss is relatively small compared to the flush+reload attack. I shall come back to this later.

The new detection

My new detection bases itself on this code. The idea is that the presence of an RDTSC instruction closely followed by a clflush is likely to be a CSC. The presence of a second RDTSC instruction strengthens the suspicion, even more so if the pattern is executed repeatedly.
Obviously I cannot just search the entire memory for code that looks like this. So I use a variation of the method Mr. Herath & I suggested for finding low frequency attacks. Below is a sketch of the most primitive variant of this detection mechanism.
1.            First I set up a PMI interrupt handler to handle the interrupts that appear when a performance counter overflows.
2.            To get notification of the use of an RDTSC instruction in user mode I set the CR4.TSD flag, which makes the RDTSC instruction (and its sibling, the RDTSCP instruction) privileged and thus causes access violations when it is called outside ring 0.
3.            If my exception handler was called for a reason other than RDTSC, I chain the previous handler. Otherwise I set up performance counting to cause interrupts on “instructions retired” and set the appropriate counter to -1 to cause an interrupt on the next instruction. I then emulate the RDTSC instruction by executing it and forwarding the right values to the user mode client. I do this step for compatibility reasons, so that legitimate use of RDTSC does not break. Thereafter the interrupt handler exits without chaining further handlers.
4.            Once I get the PMI interrupt I check whether the instruction following the R/EIP in the client is a clflush. If it is, I've detected a CSC in progress; if not, I set the counter register to -1 to cause an interrupt on the next instruction. If step 4 has been executed a sufficient number of times, I stop the performance counters and consider the RDTSC event to be unrelated to a CSC attack. Then I exit gracefully from the interrupt handler.
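The check in step 4 can be sketched as a byte-pattern test: clflush encodes as 0F AE /7, that is, 0F AE followed by a ModRM byte with reg field 7 and a memory operand, possibly preceded by prefixes. A minimal matcher (my own illustration, not the PoC's actual code, and not a full x86 decoder) might look like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rough check: does the byte sequence at p begin a clflush
 * instruction? clflush is 0F AE /7; we also skip a few common
 * prefix bytes, since segment prefixes are legal before it.
 * This is a sketch, not a complete x86 instruction decoder. */
static bool starts_with_clflush(const uint8_t *p, size_t len)
{
    size_t i = 0;
    /* Skip segment-override and operand-size prefixes (at most 4). */
    while (i < len && i < 4) {
        uint8_t b = p[i];
        if (b == 0x2E || b == 0x36 || b == 0x3E || b == 0x26 ||
            b == 0x64 || b == 0x65 || b == 0x66)
            i++;
        else
            break;
    }
    if (i + 3 > len)
        return false;
    if (p[i] != 0x0F || p[i + 1] != 0xAE)
        return false;
    uint8_t reg = (p[i + 2] >> 3) & 7;   /* /7: ModRM reg field must be 7 */
    uint8_t mod = (p[i + 2] >> 6) & 3;   /* mod == 3 encodes sfence instead */
    return reg == 7 && mod != 3;
}
```

Insisting on a memory operand matters because the same 0F AE /7 pattern with mod == 3 is sfence; the prefix skipping matters because an attacker can legally decorate clflush with redundant segment prefixes.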

The first observation is that we cannot count on the attacker not obfuscating the attack. In the code listed for the attack he could add dummy instructions such as nop, or even (conditional) jmps, to make a search very difficult. My first solution to this problem was to use an emulator to search for the clflush, but that seemed like a terrible solution: slow, prone to error, etc. I then realized that I could use the trace flag to let the processor do the work, but I scrapped that idea too, because I would need to install another interrupt handler and because the trace flag can be deleted/detected from user mode, which again would require at least a partial emulator for it to work. Eventually I figured out that performance counters with interrupts provided a neat solution for finding the needle in the haystack, and it would also line up well with detection of the other types of CSCs in a real world implementation. Now the prerequisite for all these methods to work is of course that the attacker doesn't use tons of obfuscation. He is, however, not in a position to do so. Since the clflush timing differences the attacker relies on are relatively small, and modern processors do not execute any instruction in perfectly constant time, the attacker is limited in how much obfuscation he can use, and very much in what kind. From Gruss, Maurice & Wagner [3] I gathered that clflush causes cache hit events and that dummy instructions which access memory have particularly high variance, making them unsuitable for the attacker to use as obfuscation. Thus it seemed likely that I'd get to my clflush in just one interrupt and could stop looking after just a few. To my surprise I didn't get any cache hit events on clflush on a Kentsfield processor. This led me to conclude that most likely there is a difference in the implementation of clflush/the cache manager across processors with respect to when the event is counted.
Being conservative, I thus went with the “instructions retired” event, with the downside that I need more interrupts and thus a higher performance penalty to rule out false negatives. The smart ones among you will notice that Kentsfield doesn't have a level 3 cache, and my guess is that this is the reason for not seeing the cache hit events. Thus I suspect that the cache hit event would do the trick in a real world implementation.

A note on my detection PoC

I developed my PoC on WinXP 32 on a 4 core Kentsfield system. The reason for picking this system was twofold. First and foremost, it was the only one available to me: I don't own a computer, and my wife would not find it amusing if my driver crashing ended up causing damage to the file system on hers. Using Windows XP 32 also has a significant advantage: there is no patch guard, which gives me easy, direct access to the hardware. A real world implementation would need to take patch guard into account. Now, I did not actually hook the access violation interrupt handler in the IDT. I opted for the easier solution of using vectored exception handling in user mode. My reason for doing this was that my method is based on hardware features of the Intel processor, and thus I could use the existing Windows handler without contaminating my results. From a development perspective this allows me to work in ring 3 (which is a really great thing when you're testing a driver using remote desktop over VPN), and it gives me a lot of the interrupt handling for free, like figuring out why the access violation was triggered in the first place. Despite using a Kentsfield processor, where the attack cannot actually take place, the mechanics of my detection can be tested reliably, and thus I'm confident that the general approach will extend to other architectures as well.

Performance


The first comment I need to make is about the word performance. In this context it can mean two things: first, how well does the detection work in terms of false positives and false negatives; second, what is the speed penalty for running the detection? I shall try to answer both separately.

False positives and false negatives


The detection hinges on two things: that an attacker uses RDTSC(P) twice, and that he uses a clflush. Now, if the attacker is able to make a suitable high precision timer and thus avoid using RDTSC, my detection fails. In my cache side channel blog post I argued that a thread timer may be a solution to that problem. With flush+flush the margin for success is smaller, because the timing difference used to distinguish hit from miss is much smaller than in the traditional CSCs. I don't think these timers work well enough, but I could be wrong. The second option for the attacker to break my detection is the dummy instructions I discussed above. Third, the clflush instruction itself can be obfuscated using prefixes such as ds, es, cs, ss, or even combinations of these, since Intel CPUs tend to ignore the illegality of multiple segment prefixes. Nevertheless this should not pose much of a challenge for the defender. If the defender is able to use the “cache hit” performance counter to find the clflush instruction, doing so will come with a slim chance of false positives, because we'll no longer find the beginning of the instruction but its end, and the instruction triggering the cache hit could be one that merely looks like a clflush: the 0x0F, 0xAE bytes could be present in the ModRM/SIB/displacement of, say, a mov instruction. The defender could deal with this by using the “instruction retired” method for the next round of the attack. Finally, there is the possibility that a benign program would use the rdtsc/clflush combination for some reason, and that would definitely turn up a false positive. But the quick summary is that the attacker is not left with much wiggle room to avoid detection, and false positives are unlikely.

The speed penalty of this detection scheme

The normal performance counter method for CSC detection is virtually free. Mr. Herath and I ran tests on the performance loss and found a drop in the range of 0.3% (statistically insignificant, too). What one considers an acceptable cost is of course highly subjective, so I'll just give some numbers here. This flush+flush detection, however, does carry a performance loss, with two components: the handler for the PMI interrupt, and the access violation on the RDTSC instruction. The overflow interrupt on performance counting costs in and of itself around 530 ns on the Kentsfield processor I used to write the PoC for the detection. Mr. Herath measured around 250 ns on his more modern computer as part of our presentation for Black Hat. I suspect the access violation interrupt to be in the same order of magnitude. The access violation handler isn't all that simple, because we need to figure out what caused the exception before we can pass it on or emulate the RDTSC instruction. I do not have any measurements for that, but I suspect it'll be less than a microsecond of total overhead on the RDTSC instruction (which of course is a massive amount of added inaccuracy too). The handler for the PMI is relatively fast, as we only need to search for the clflush instruction. For reference, human perception starts around 20 milliseconds.

With ballpark figures for the duration of the components of a detection attempt, we now need to turn our attention to incidence figures. Unfortunately I don't have any. The first thing we must take note of is that the RDTSC instruction is rarely used. Visual Studio 2013 and the search indexer on WinXP both crashed after I set the CR4.TSD flag without a handler, but other than that the PC continued to run stably. So we wouldn't be running one detection attempt after another. The PMI only comes into play once we've found an RDTSC. If we use the “instructions retired” variant we'll probably spend a significant amount of time before we can dismiss an incidence of rdtsc as benign, depending on just how many dummy instructions we think the attacker could add without adding too much noise. However, if we can use the “cache hit” variant we'll probably only need a few interrupts to confidently say it's not an attack, again depending on how much cache access an attacker could actually do without adding too much noise.

The last thing we should consider is that the general method can be applied with deliberation. We can set the CR4.TSD flag only for suspected applications, or white list known benign software. We can even white list dynamically: say we have another thread that statically analyzes the code around each RDTSC, and if it's sure to be benign then we wouldn't have to trigger PMIs in response the next time we encounter the same instance of RDTSC. We could also moderate our use of the PMI. Say we trigger the PMI only on every other instruction: this gives us a 50% reduction in the number of interrupts and a 50% chance of detecting an attacker per round he runs the attack (one round being one iteration of the “while (true)” loop in the pseudo code), which means our chance of detection over many rounds is extremely high in real use cases.
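To put a number on that last claim (my own arithmetic, assuming rounds are detected independently): with a per-round detection chance p, the probability of catching the attacker at least once in n rounds is 1 - (1 - p)^n.

```c
#include <assert.h>

/* Probability of detecting the attacker at least once over n rounds,
 * if each round is independently detected with probability per_round. */
static double detect_prob(double per_round, int rounds)
{
    double evade_all = 1.0;
    for (int i = 0; i < rounds; i++)
        evade_all *= 1.0 - per_round;   /* attacker must evade every round */
    return 1.0 - evade_all;
}
```

With per_round = 0.5, ten rounds already give better than 99.9% detection probability, so halving the interrupt load costs very little detection power against a repeated attack.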

To summarize, I think the detection scheme could be implemented at an acceptable speed penalty.

What could Intel do?


It has often been suggested to make clflush privileged. Adding a CR4.TSD type flag for clflush would be a more flexible solution, because it would grant the operating system the power to allow and disallow clflush where it sees fit.


Acknowledgement

I need to acknowledge Mr. Herath for his contributions to our Black Hat slides. The conversations and work I did with Mr. Herath for Black Hat have been instrumental for this post. Also, thanks go to Daniel Gruss for enlightening conversations and to Clémentine Maurice for trusting me with a draft of their excellent paper. Any errors or misunderstandings are entirely mine!


Literature

[1] Marco Chiappetta, Erkay Savas & Cemal Yilmaz (2015): “Real time detection of cache-based side-channel attacks using Hardware Performance Counters”, http://eprint.iacr.org/2015/1034.pdf
[2] Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Stefan Mangard (2015): “ARMageddon: Last-Level Cache Attacks on Mobile Devices”, http://arxiv.org/abs/1511.04897
[3] Daniel Gruss, Clémentine Maurice & Klaus Wagner (2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf