Saturday, November 21, 2015

Detecting stealth mode cache attacks: Flush+Flush

Update (22 Nov. 2015):

Daniel Gruss commented on Twitter that he was a bit skeptical about the performance of my detection scheme. I think it’s fair to be skeptical on this point. I’ve augmented the blog post with some information about the performance of the detection and a small comment on how the CPU could be changed to allow better mitigation. The augmented material is appended.

Introduction

In my previous work I described how cache side channel attacks (CSCs) work and why I consider them a serious threat, and I spent a lot of time on mitigation methods and detection. Recently a new paper was published that avoids the detection methods I’ve suggested by avoiding the signature behavior I looked for. Thus not only my methods, but also variations of them, fail against this new attack. In this blog post I’ll briefly describe the new attack and propose a detection method for it. I’ll touch on subjects which I explained in my previous cache side channel post, and I won’t spend much time explaining the background of CSCs here. So if you’re not up to date on cache side channel attacks, you may wish to read that post before proceeding.

Developments since my last post

Since my last cache side channel attacks post, three important papers have been released. First was Chiappetta, Savas & Yilmaz [1], about using performance counters for detecting cache side channel attacks. Unlike what Nishat Herath and I wrote about in our Black Hat slides, Chiappetta, Savas & Yilmaz [1] coupled the performance counters to machine learning methods, specifically a neural network, and showed that you could detect attacks in real time using this methodology. It’s an interesting upgrade to my work on the subject and I suspect their method works well on high frequency attacks. My unpublished results from the work I did in connection with the Black Hat talk suggest that low frequency cache attacks in a noisy environment may not be detected well using this method, but I’ll be happy to be shown otherwise. I’ll return in part to this point later in this blog post. Nevertheless I’m really happy, because it was the first independent confirmation that the ideas Mr. Herath and I brought forth at Black Hat have bearing, and an article like that always carries more information than could be presented at a conference.
Then came a very interesting paper by Lipp et al. [2]. In this paper the authors break down the hurdles for using a shared cache on ARM devices and thus bring CSCs from Intel to other platforms. It is a development that should not be taken lightly, as the potential for information leakage is huge simply because the installed base is really large. In my last blog post on CSCs I warned about performance counters being leaked to unprivileged processes, and the irony is that exactly this is what makes the ARMageddon attack possible in the first place.
Finally Gruss, Maurice & Wagner [3] struck again with the paper “Flush+Flush: A Stealthier Last-Level Cache Attack”. It is this paper I shall write about and respond to in this blog post.

The story

A while back Daniel Gruss contacted me and essentially teased me by asking whether I thought it’d be easy to detect a CSC that did not cause last level cache misses. My answer was “depends on the details”, which is a nice way of hedging my bets. I thought about how Mr. Gruss would go about it, and the answer was in my notes, specifically on my to-do list. While playing with CSCs I had figured it likely that clflush would leak information through how long it takes to execute. Actually flushing memory would likely take a different amount of time than just searching through the cache set and causing a miss. I had never actually done any work on it, but it allowed me to guess what Mr. Gruss was hinting at. In the end I received a draft of the paper and sent back a few comments to Mr. Gruss & Clémentine Maurice. I’m really grateful for the gesture of trust on their part. It’s always a joy reading their papers, and early access is pretty cool too. My early access wasn’t just used to “review” their paper, but also to ponder how I could detect this new evil beast, and I’ve come up with a solution.

More data on performance counters and CSC


Gruss, Maurice & Wagner [3] spend part of their paper documenting how a number of performance counters respond to different kinds of attacks and benign use cases. After carefully analyzing the results they suggest a modification of the method Mr. Herath and I proposed: using a fixed decision boundary on the ratio of cache hits to cache misses. This is indeed a good idea, and they show that the method gives fairly good results. Again I’m somewhat skeptical about the performance of this measure under noise for low frequency attacks. My argument is that if the attacker (or another benign process) causes lots of cache hits on unrelated sets, the performance counters are likely to show a ratio that is considered OK, and thus the attack goes undetected. The attack would remain successful as long as the noise doesn’t extend to the attacker’s cache sets of interest. High frequency attacks are more difficult to “mask” in this way. Lowering the demands on the ratio will help, but may lead to false positives on other memory intensive applications. All this said, I count their results as evidence that performance counters can make a real impact on CSCs. My suggestions on how to deal with these issues can be found in the Black Hat slides and my previous blog post.
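
To make the masking argument concrete, here is a minimal sketch of a fixed decision boundary on the hit-to-miss ratio. The function name, the boundary value and all counter values are made up for illustration; none of them come from the paper or from measurements. The direction of the test (a low ratio is suspicious) is an assumption that follows from the argument above.

```python
def looks_like_attack(cache_hits, cache_misses, min_hit_miss_ratio):
    """Flag a sampling window whose hit-to-miss ratio falls below a fixed boundary.

    Assumption: an attack shows up as an unusually low ratio of cache hits
    to cache misses. A window without misses is never flagged.
    """
    if cache_misses == 0:
        return False
    return cache_hits / cache_misses < min_hit_miss_ratio


HYPOTHETICAL_BOUNDARY = 2.0  # illustrative value only

# Attack on its own: 1000 hits, 900 misses in the window -> flagged.
print(looks_like_attack(1000, 900, HYPOTHETICAL_BOUNDARY))

# Same attack plus 5000 extra hits on unrelated cache sets -> not flagged.
print(looks_like_attack(1000 + 5000, 900, HYPOTHETICAL_BOUNDARY))
```

The second call shows the concern: the number of misses caused by the attack is unchanged, but the added hits push the ratio back over the boundary.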

The new stealthier attack

The new stealthier attack works by timing the clflush instruction. It is slower when it actually has to flush the cache, as one would expect a priori. Now obviously clflush takes an address as a parameter, and this essentially forces an attacker to have read access to the memory he wishes to spy on. This makes the attack in most respects similar to the flush+reload attack, which I described in my blog post on cache side channel attacks. The difference is that the attacker no longer needs to cause a significant number of cache misses on his own, since the clflush instruction on its own is enough and the reload “phase” of the flush+reload attack disappears entirely. This fact not only makes the attack more difficult to pick up with performance counters, it also makes it much faster. There are other aspects of this attack which make it even scarier, but it’s beyond the scope of this post to go into them here. The pseudo code for the attack looks like this:
while (true) {
    mfence
    starttime = rdtsc
    mfence
    clflush(target_address)
    mfence
    delta = rdtsc - starttime
    derive information from delta (cache miss or hit)
    optionally wait for victim here
}

An important feature of the flush+flush attack is that the time difference between hit and miss is relatively small compared to the flush+reload attack. I shall come back to this later.

The new detection

My new detection is based on this code. The idea is that the presence of an RDTSC instruction closely followed by a clflush likely indicates a CSC. The presence of a second RDTSC instruction only strengthens the suspicion, even more so if the sequence is executed repeatedly.
Obviously I cannot just search the entire memory for code that looks like this. So I use a variation of the method Mr. Herath & I suggested for finding low frequency attacks. Below is a sketch of the most primitive variation of this detection mechanism.
1. First I set up a PMI handler to handle the interrupts that occur when a performance counter overflows.
2. To get notified of the use of an RDTSC instruction in user mode I set the CR4.TSD flag, which makes the RDTSC instruction (and its sibling, the RDTSCP instruction) privileged and thus causes access violations when it is executed outside ring 0.
3. If my exception handler was called for reasons other than RDTSC, I chain to the previous handler. Otherwise I set up performance counting to cause interrupts on “instruction retired” and set the appropriate counter to -1 to cause an interrupt on the next instruction. Thereafter I emulate the RDTSC instruction by executing it and forwarding the right values to the user mode client. I do this step for compatibility reasons, so that legitimate use of RDTSC does not break. The handler then exits without chaining further handlers.
4. Once I get the PMI I check if the instruction following the R/EIP in the client is a clflush. If it is, I’ve detected a CSC in progress. If not, I set the counter register to -1 to cause an interrupt on the next instruction. If step 4 has been reached a sufficient number of times, I stop the performance counters and consider the RDTSC event to be unrelated to a CSC attack. Then I exit gracefully from the interrupt handler.
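
The four steps above can be sketched as a small user-space simulation. This is not driver code: the retired instruction stream is modeled as a list of mnemonics, the CR4.TSD trap and the PMI are replaced by plain checks, and the names `detect_flush_flush` and `max_steps` are hypothetical.

```python
def detect_flush_flush(retired, max_steps=8):
    """Return True if a clflush retires within max_steps instructions of an RDTSC(P).

    retired   -- iterable of instruction mnemonics in retirement order
    max_steps -- how many single-step "PMIs" to spend per RDTSC before giving up
    """
    steps_left = 0  # 0 means the "instruction retired" counter is disarmed
    for mnemonic in retired:
        if steps_left > 0:
            # Stands in for step 4: a PMI fired for this instruction.
            if mnemonic == "clflush":
                return True
            steps_left -= 1
        if mnemonic in ("rdtsc", "rdtscp"):
            # Stands in for steps 2 and 3: the RDTSC trapped, so arm the counter.
            steps_left = max_steps
    return False


attack_round = ["mfence", "rdtsc", "mfence", "clflush", "mfence", "rdtsc"]
benign_timing = ["rdtsc", "add", "mov", "imul", "rdtsc", "sub"]

print(detect_flush_flush(attack_round))
print(detect_flush_flush(benign_timing))
```

A small `max_steps` keeps the number of interrupts per RDTSC low, at the price of missing a clflush that sits behind more dummy instructions than the budget allows.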

First observation is that we cannot count on the attacker not obfuscating the attack. In the code listed for the attack he could add dummy instructions such as nop or even (conditional) jmps to make a search very difficult. My first idea was to use an emulator to search for the clflush, but that seems like a terrible solution: slow, prone to error, etc. I quickly realized that I could use the trace flag to let the processor do the work, but I scrapped that idea too. The reason was that I needed to install another interrupt handler and that the trace flag can be cleared or detected from user mode, which again would require at least a partial emulator to work. Eventually I figured that using performance counters with interrupts provided a neat solution to search for the needle in the haystack. It would also line up well with detection of the other types of CSCs in a real world implementation.
Now the prerequisite for all these methods to work is of course that the attacker doesn’t use tons of obfuscation. He is however not in a position to do so. Since the clflush timing differences the attacker relies on are relatively small, and modern processors do not execute any instruction in an exactly fixed time, the attacker is limited in how much obfuscation he can use, and also very much in what kind of obfuscation he can use. From Gruss, Maurice & Wagner [3] I gathered that clflush causes cache hits, and that dummy instructions that access memory have particularly high variance and are thus unsuitable for the attacker to use as obfuscation. Thus, with an interrupt on the cache hit event, it would be likely that I’d get to my clflush in just one interrupt and that I could stop looking after just a few. To my surprise I didn’t get any cache hits on clflush on a Kentsfield processor. This led me to conclude that most likely there is a difference in the implementation of clflush or the cache manager across processors with respect to when the event is counted.
Being conservative I thus went with the “instructions retired” event, with the downside that I need more interrupts and thus incur a higher performance penalty to rule out false negatives. The smart ones amongst you will notice that Kentsfield doesn’t have a level 3 cache, and my guess is that this is why the event did not trigger. Thus I suspect that the cache hit event would do the trick in a real world implementation.

A note on my detection PoC

I developed my PoC on 32-bit Windows XP on a 4 core Kentsfield system. The reason for picking this system was twofold. First and foremost, it was the only one available to me. I don’t own a computer, and my wife would not find it amusing if my driver crashing ended up damaging the file system on hers. Second, using 32-bit Windows XP has a significant advantage: there is no PatchGuard, which gives me easy direct access to the hardware. A real world implementation needs to consider this. Now I did not actually hook the access violation interrupt handler in the IDT. I opted for the easier solution of using vectored exception handling in user mode. My reason for doing this was that my method is based on hardware features of the Intel processor, and thus I could use the existing handler of Windows without contaminating my results. From a development perspective this allows me to work in ring 3 (which is a really great thing when you’re testing a driver using remote desktop over VPN) and it gives me a lot of the interrupt handling for free, like figuring out why the access violation was triggered in the first place. Despite my using a Kentsfield processor, where the attack cannot actually take place, the mechanism of my detection can be tested reliably, and thus I’m confident that the general approach will extend to other architectures as well.

Performance


The first comment I need to make is about the word performance. In this context it can mean two things. First, how well does the detection work in terms of false positives and false negatives? Second, what is the speed penalty for running the detection? I shall try to answer both separately.

False positives and false negatives


The detection hinges on two things: that the attacker uses RDTSC(P) twice, and that he uses a clflush. If the attacker is able to build a suitable high precision timer and thus avoid using RDTSC, my detection fails. In my cache side channel blog post I argued that a thread timer may be a solution to that problem. With flush+flush the margin for success is smaller, because the timing difference used to tell a hit from a miss is much smaller than in the traditional CSCs. I don’t think these timers work well enough, but I could be wrong. The second option for the attacker to break my detection is the dummy instructions that I discussed above. Third, the clflush instruction itself can be obfuscated using prefixes such as ds, es, cs, ss, or even combinations of these, since Intel CPUs tend to ignore the illegality of double address prefixes. Nevertheless this should not pose much of a challenge for the defender. If the defender is able to use the “cache hit” performance counter to find the clflush instruction, doing so will come with a slim chance of false positives, because we’ll no longer find the beginning of the instruction but the end, and the instruction triggering the cache hit could be one that merely looks like a clflush; that is, the 0x0F, 0xAE bytes could be present in the ModRM/SIB/displacement of, say, a mov instruction. The defender could deal with this by using the “instruction retired” method for the next round of the attack. Finally there is the possibility that a benign program would use the rdtsc/clflush combination for some reason, and that would definitely turn up as a false positive. But the quick summary is that the attacker is not left with much wiggle room to avoid detection, and false positives are unlikely.
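
A minimal sketch of the byte-level check discussed above, assuming the usual encoding of clflush as 0x0F 0xAE followed by a ModRM byte with reg field 7 and a memory operand (mod other than 3; the register form with reg 7 is sfence). The helper name `is_clflush_at` is hypothetical, and only segment-override prefixes are skipped; other prefixes are not handled.

```python
SEGMENT_PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}  # ES, CS, SS, DS, FS, GS


def is_clflush_at(code, offset=0):
    """Return True if the bytes at offset decode as a (possibly prefixed) clflush."""
    i = offset
    # Skip any run of segment-override prefixes, including repeated ones.
    while i < len(code) and code[i] in SEGMENT_PREFIXES:
        i += 1
    # Need two opcode bytes plus the ModRM byte.
    if i + 3 > len(code):
        return False
    if code[i] != 0x0F or code[i + 1] != 0xAE:
        return False
    modrm = code[i + 2]
    mod = modrm >> 6
    reg = (modrm >> 3) & 0x7
    return reg == 7 and mod != 3


plain = bytes([0x0F, 0xAE, 0x38])                # clflush [eax]
prefixed = bytes([0x3E, 0x2E, 0x0F, 0xAE, 0x38]) # ds cs clflush [eax]
mov_imm = bytes([0xB8, 0x0F, 0xAE, 0x38, 0x00])  # mov eax, 0x0038AE0F

print(is_clflush_at(plain), is_clflush_at(prefixed))
print(is_clflush_at(mov_imm, 0), is_clflush_at(mov_imm, 1))
```

The last line illustrates the false positive case: decoded from its real start the mov is not a clflush, but a scan that begins inside its immediate sees a byte sequence that looks like one.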

The speed penalty of this detection scheme

The normal performance counter method for CSC detection is virtually free. Mr. Herath and I ran tests on the performance loss and found a drop in the range of 0.3% (statistically insignificant too). In this flush+flush detection there is however a performance loss. What one would consider an acceptable cost is of course highly subjective, so I’ll just give some numbers here. This performance loss has two components: the first is the handler for the PMI and the second is the access violation on the RDTSC instruction. The overflow interrupt on performance counting costs, in and of itself, around 530 ns on the Kentsfield processor I used to write the PoC for the detection. Mr. Herath measured around 250 ns on his more modern computer as part of our presentation for Black Hat. I suspect the access violation interrupt to be of the same order of magnitude. The access violation handler isn’t all that simple, because we need to figure out what caused the exception before we can pass it on or emulate the RDTSC instruction. I do not have any measurements for that, but I suspect it’ll be less than a microsecond of total overhead on the RDTSC instruction (which of course is a massive amount of added inaccuracy too). The handler for the PMI is relatively fast, as we only need to search for the clflush instruction. For reference, human perception starts around 20 milliseconds.

With ballpark figures for the duration of the components of a detection attempt, we now need to turn our attention to incidence figures. Unfortunately I don’t have any. The first thing we must take note of is that the RDTSC instruction is rarely used. Visual Studio 2013 and the search indexer on WinXP both crashed after I set the CR4.TSD flag without a handler, but other than that the PC continued to run stably. So we wouldn’t be running one detection attempt after another. The PMI will only come into play once we have found an RDTSC. If we use the “instruction retired” variant we’ll probably spend a significant amount of time before we can dismiss an incidence of rdtsc as benign, depending on just how many dummy instructions we think the attacker could add without adding too much noise. However, if we can use the “cache hit” variant we’ll probably only need a few interrupts to confidently say it’s not an attack, again depending on how much cache access an attacker could actually do without adding too much noise.
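
The ballpark figures above can be combined into a rough cost model. This is a back-of-the-envelope sketch only: the 530 ns PMI cost is the Kentsfield figure quoted above, the 1000 ns trap cost is the unmeasured upper-bound guess, and the event counts in the example are invented because no incidence figures are available.

```python
def detection_overhead_ns(rdtsc_events, pmis_per_event, trap_ns=1000.0, pmi_ns=530.0):
    """Estimate the total added time in nanoseconds.

    rdtsc_events   -- number of user-mode RDTSC(P) executions that trap
    pmis_per_event -- single-step interrupts spent before an RDTSC is dismissed
    trap_ns        -- assumed cost of the access violation plus RDTSC emulation
    pmi_ns         -- cost of one performance counter overflow interrupt
    """
    return rdtsc_events * (trap_ns + pmis_per_event * pmi_ns)


# Hypothetical workload: 1000 RDTSC executions, 3 PMIs each before giving up.
total_ns = detection_overhead_ns(1000, 3)
print(total_ns / 1e6, "ms")

# Same workload with the 250 ns PMI figure measured on the more modern machine.
print(detection_overhead_ns(1000, 3, pmi_ns=250.0) / 1e6, "ms")
```

The model makes the trade-off visible: the per-RDTSC trap cost is fixed, while the PMI term grows linearly with how long we are willing to keep looking for a clflush.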

The last thing we should consider is that the general method can be applied with deliberation. We can set the CR4.TSD flag only for suspected applications, or whitelist known benign software. We can even whitelist dynamically: say we have another thread that statically analyzes the code around each RDTSC, and if it’s sure to be benign then we wouldn’t have to trigger PMIs the next time we encounter the same instance of RDTSC. We could also moderate our use of the PMI. Say we trigger the PMI only on every other instruction. This gives us a 50% reduction in the number of interrupts and a 50% chance of detecting an attacker per round he runs the attack (that is, per iteration of the “while (true)” loop in the pseudo code), which means our chance of detection is extremely high in real use cases.
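
The claim about sampled PMIs can be checked with a short calculation. The sketch assumes that each round of the attack is caught independently with probability 1/k when only every k-th instruction raises a PMI; that independence is an idealization, since a fixed sampling phase could in principle line up with the attacker’s loop. The function name is hypothetical.

```python
def detection_probability(rounds, sample_every=2):
    """Probability of catching at least one clflush over a number of attack rounds.

    rounds       -- iterations of the attacker's measurement loop
    sample_every -- a PMI is raised on every sample_every-th instruction
    """
    per_round = 1.0 / sample_every
    return 1.0 - (1.0 - per_round) ** rounds


for n in (1, 10, 100):
    print(n, detection_probability(n))
```

Under this assumption ten rounds already push the chance of detection above 99.9%, and an attacker typically needs far more rounds than that to extract anything useful.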

To summarize, I think the detection scheme could be implemented at an acceptable speed penalty.

What could Intel do?


It has often been suggested to make clflush privileged. Adding a CR4.TSD type flag for clflush would be a more flexible solution, because it’d grant the operating system the power to allow and disallow clflush where it sees fit.


Acknowledgement

I need to acknowledge Mr. Herath for his contributions to our Black Hat slides. The conversations and work I did with Mr. Herath for Black Hat have been instrumental for this post to work out. Thanks also go to Daniel Gruss for enlightening conversations and to Clémentine Maurice for trusting me with a draft of their excellent paper. Any errors or misunderstandings are entirely mine!


Literature

[1] Marco Chiappetta, Erkay Savas & Cemal Yilmaz (2015): “Real time detection of cache-based side-channel attacks using Hardware Performance Counters”, http://eprint.iacr.org/2015/1034.pdf
[2] Moritz Lipp, Daniel Gruss, Raphael Spreitzer & Stefan Mangard (2015): “ARMageddon: Last-Level Cache Attacks on Mobile Devices”, http://arxiv.org/abs/1511.04897
[3] Daniel Gruss, Clémentine Maurice & Klaus Wagner (2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf

2 comments:

  1. I tried to configure msr registers to generate PMI after a threshold value. However, the PMI interrupt is only generated once.
    IA32_PERF_GLOBAL_CTRL = 1
    IA32_PEBS_ENABLE = 1
    IA32_PERF_GLOBAL_OVF_CTRL = 1
    msr 0xc1 = -1000
    msr 0x186 = 0x004100C0

    Could you please let me know if I am missing anything. Any help will be highly appreciated.

    1. Sorry for my late reply; I missed your comment. You only get the PMI once, which is why you reset MSR 0xC1 in your PMI handler: OnPMI() { if (MorePMIsNeeded) SetMSR(0xC1, -1000) }. Hope this helps. Otherwise email me at anders_fogh (at) hotmail.com and I'll send you some sources...