Update (22 Nov. 2015):
Daniel Gruss commented on Twitter that he was a bit skeptical about the
performance of my detection scheme. I think it’s fair to be skeptical on this
point. I’ve augmented the blog post with some information about the performance
of the detection, and a small comment on how the CPU could be changed to allow
better mitigation. The new material is appended.
Introduction
In my previous posts I described how cache side channel (CSC) attacks work and
why I consider them a serious threat, and I spent a lot of time on mitigation
and detection methods. Recently a new paper was published that evades the
detection methods I’ve suggested by avoiding the signature behavior I looked
for. Thus not only my methods, but also variations of them, fail against this
new attack. In this blog post I’ll briefly describe the new attack and propose
a detection method for it. I’ll touch on subjects which I explained in my
previous cache side channel post, and I won’t spend much time on the background
of CSCs here. So if you’re not up to date on cache side channel attacks, you
may wish to read that post before proceeding.
Developments since my last post
Since my last cache side channel attacks post, three important papers have been
released. First was Chiappetta, Savas & Yilmaz [1], about using performance
counters for detecting cache side channel attacks. Unlike what Nishat Herath
and I wrote about in our Black Hat slides, Chiappetta, Savas & Yilmaz [1]
coupled the performance counters to machine learning methods, specifically a
neural network, and showed that attacks can be detected in real time using this
methodology. It’s an interesting upgrade to my work on the subject and I
suspect their method works well on high frequency attacks. My unpublished
results from the work I did in connection with the Black Hat talk suggest that
low frequency cache attacks in a noisy environment may not be detected well by
this method, but I’ll be happy to be shown otherwise. I’ll return in part to
this point later in this blog post. Nevertheless I’m really happy, because it
was the first independent confirmation that the ideas Mr. Herath and I brought
forth at Black Hat have bearing, and an article like that always carries more
information than could be presented at a conference.
Then came a very interesting paper by Lipp et al. [2]. In this paper the
authors break down the hurdles for using a shared cache on ARM devices and thus
bring CSCs from Intel to other platforms. It is a development that should not
be taken lightly, as the potential for information leakage is huge simply
because the installed base is really large. In my last blog post on CSCs I
warned about performance counters being leaked to unprivileged processes, and
the irony is that exactly this is what makes the ARMageddon attack possible in
the first place.
Finally Gruss, Maurice & Wagner [3] struck again with the paper “Flush+Flush: A
Stealthier Last-Level Cache Attack”. It is this paper I shall write about and
respond to in this blog post.
The story
A while back Daniel Gruss contacted me and essentially teased me by asking if I
thought it’d be easy to detect a CSC that did not cause last level cache
misses. My answer was “depends on the details”, which is a nice way of hedging
my bets. I thought about how Mr. Gruss would be going about it, and the answer
was in my notes, specifically on my to-do list. While playing with CSCs I’d
figured that clflush would likely leak information through how long it takes to
execute. Actually flushing memory would likely take a different amount of time
than just searching through the cache set and causing a miss. I had never
actually done any work on it, but it allowed me to guess what Mr. Gruss was
hinting at. In the end I received a draft of the paper and sent back a few
comments to Mr. Gruss & Clémentine Maurice. I’m really grateful for this
gesture of trust on their part, and it’s always a joy reading their papers;
early access is pretty cool too. My early access wasn’t just used to “review”
their paper, but also to ponder how I could detect this new evil beast, and
I’ve come up with a solution.
More data on performance counters and CSC
Gruss, Maurice & Wagner [3] spend part of their paper documenting how a number
of performance counters respond to different kinds of attacks and benign use
cases. After carefully analyzing the results they suggest a modification of Mr.
Herath’s and my method: using a fixed decision boundary on the ratio of cache
hits to cache misses. This is indeed a good idea. They also show that the
method gives fairly good results. Again I’m somewhat skeptical about the
performance of this measure under noise for low frequency attacks. My argument
is that if the attacker (or a benign process) causes lots of cache hits on
unrelated sets, the performance counters are likely to show a ratio that is
considered OK, and thus the attack goes undetected. The attack would remain
successful as long as the noise doesn’t extend to the attacker’s cache sets of
interest. High frequency attacks are more difficult to “mask” in this way.
Lowering the demands on the ratio will help, but may lead to false positives on
other memory intensive applications. All this said, I count their results as
evidence that performance counters can make a real impact on CSCs. My
suggestions on how to deal with these issues can be found in the Black Hat
slides and my previous blog post.
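To make the masking argument concrete, here is a minimal Python sketch of a fixed decision boundary on the hit/miss ratio. The function name, the boundary value and all counter values are made up for illustration and are not taken from [3]; the sketch also assumes that a window is flagged when the ratio falls below the boundary.

```python
def looks_like_attack(cache_hits, cache_misses, boundary):
    """Flag a sampling window whose hit/miss ratio falls below a fixed boundary.

    Assumption: a miss-heavy window (low ratio) is what gets flagged.
    """
    if cache_misses == 0:
        return False  # no misses at all, nothing miss-heavy to flag
    return cache_hits / cache_misses < boundary


BOUNDARY = 2.0            # hypothetical decision boundary
attack_hits = 1_000       # hypothetical counts for the attack on its own
attack_misses = 2_000
noise_hits = 50_000       # extra hits on unrelated cache sets

# The attack alone is miss-heavy and gets flagged.
print(looks_like_attack(attack_hits, attack_misses, BOUNDARY))

# The same attack plus unrelated hits pushes the ratio past the boundary.
print(looks_like_attack(attack_hits + noise_hits, attack_misses, BOUNDARY))
```

The number of misses is the same in both windows; only the unrelated hits differ, which is why the noise can hide a low frequency attack without disturbing it.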
The new stealthier attack
The new stealthier attack works by timing the clflush instruction. It is slower
when it actually has to flush the cache, as one would expect a priori. Now
obviously clflush takes an address as parameter, and this essentially forces an
attacker to have read access to the memory he wishes to spy on. This makes the
attack in most respects similar to the flush+reload attack which I described in
my blog post on cache side channel attacks. The difference is that the attacker
no longer needs to cause a significant number of cache misses on his own, since
the clflush instruction on its own is enough and the reload “phase” of the
flush+reload attack entirely disappears. This fact not only makes the attack
more difficult to pick up with performance counters, it also makes it much
faster. There are other aspects of this attack which make it even scarier, but
it’s beyond the scope of this post to go into them here. The pseudo code for
the attack looks like this:
    while (true) {
        mfence
        starttime = rdtsc
        mfence
        clflush(target_address)
        mfence
        delta = rdtsc - starttime
        derive information from delta (cache miss or hit)
        optionally wait for victim here
    }
An important feature of the flush+flush attack is that the time difference
between hit and miss is relatively small compared to the flush+reload attack. I
shall come back to this later.
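The “derive information from delta” step can be sketched as a simple threshold test. This is only an illustration in Python: the function names and the cycle counts are invented, and a real attacker would calibrate on the target machine. It relies on the observation above that clflush is slower when the line is actually in the cache.

```python
from statistics import mean


def calibrate_threshold(cached_times, uncached_times):
    """Midpoint between the mean clflush time of cached and uncached lines."""
    return (mean(cached_times) + mean(uncached_times)) / 2


def was_cached(delta, threshold):
    """A slow clflush suggests the line was cached, i.e. the victim touched it."""
    return delta > threshold


# Invented calibration samples (cycles); note how close the two groups are.
cached_samples = [150, 152, 149]
uncached_samples = [140, 141, 139]

threshold = calibrate_threshold(cached_samples, uncached_samples)
print(was_cached(151, threshold))  # slow flush
print(was_cached(141, threshold))  # fast flush
```

With the two groups only a handful of cycles apart in this made-up data, a few cycles of jitter is enough to flip the classification, which is the property the detection leans on later.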
The new detection
My new detection is based on this code. The idea is that an RDTSC instruction
closely followed by a clflush is likely to be a CSC. The presence of a second
RDTSC instruction only hardens the suspicion, even more so if the sequence is
executed repeatedly.
Obviously I cannot just search the entire memory for code that looks like this.
So I use a variation of the method Mr. Herath & I suggested for finding low
frequency attacks. Below is a sketch of the most primitive variation of this
detection mechanism.
1. First I set up a PMI handler to handle the interrupts that occur when a
performance counter overflows.
2. To get notified of the use of an RDTSC instruction in user mode, I set the
CR4.TSD flag, which makes the RDTSC instruction (and its brother, the RDTSCP
instruction) privileged and thus causes access violations when it is executed
outside ring 0.
3. If my exception handler was called for reasons other than RDTSC, I chain to
the previous handler. Otherwise I set up performance counting to cause
interrupts on “instructions retired”, and the appropriate counter is set to -1
to cause an interrupt on the next instruction. Thereafter I emulate the RDTSC
instruction by executing it myself and forwarding the right values to the user
mode client. I do this step for compatibility reasons, so that legitimate use
of RDTSC does not break. The handler then exits without chaining further
handlers.
4. Once I get the PMI, I check if the instruction following the R/EIP in the
client is a clflush. If it is, I’ve detected a CSC in progress; if not, I set
the counter register to -1 to cause an interrupt on the next instruction. If
step 4 has been called a sufficient number of times, I stop the performance
counters and consider the RDTSC event to be unrelated to a CSC attack. Then I
exit gracefully from the interrupt handler.
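The control flow of the four steps can be sketched as a small user mode simulation. This is not the PoC: it works on a hypothetical list of retired instruction mnemonics instead of real interrupts, and the function names and the max_steps limit are assumptions chosen for illustration.

```python
def scan_after_rdtsc(instructions, rdtsc_index, max_steps):
    """Model step 4: one simulated PMI per retired instruction after an RDTSC.

    Returns True if a clflush shows up within max_steps instructions.
    """
    for step in range(1, max_steps + 1):
        position = rdtsc_index + step
        if position >= len(instructions):
            break  # trace ended before the limit was reached
        if instructions[position] == "clflush":
            return True
    return False  # limit reached: treat this RDTSC as unrelated to a CSC


def detect(instructions, max_steps=8):
    """Model steps 2 and 3: every RDTSC(P) 'faults' and arms the scan."""
    return any(
        scan_after_rdtsc(instructions, index, max_steps)
        for index, mnemonic in enumerate(instructions)
        if mnemonic in ("rdtsc", "rdtscp")
    )


attack_trace = ["mfence", "rdtsc", "mfence", "nop", "clflush", "mfence", "rdtsc"]
benign_trace = ["rdtsc", "mov", "add", "mov", "rdtsc", "sub"]

print(detect(attack_trace))
print(detect(benign_trace))
```

The max_steps limit plays the role of the “sufficient number of times” in step 4: the larger it is, the more dummy instructions the scan tolerates, at the price of more interrupts per RDTSC.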
The first observation is that we cannot count on the attacker not obfuscating
the attack. In the attack code listed above he could add dummy instructions
such as nop, or even (conditional) jmps, to make a search very difficult. My
first solution to this problem was to use an emulator to search for the
clflush, but that seems like a terrible solution: slow, error prone, etc. I
quickly realized that I could use the trace flag to let the processor do the
work. I scrapped that idea too. The reason was that I would need to install
another interrupt handler, and that the trace flag can be cleared or detected
from user mode, which again would require at least a partial emulator for it to
work. Eventually I figured that using performance counters with interrupts
provided a neat solution to searching for the needle in the haystack. It would
also line up well with detection of the other types of CSCs in a real world
implementation.
Now the prerequisite for all these methods to work is of course that the
attacker doesn’t use tons of obfuscation. He is, however, not in a position to
do so. Since the clflush timing differences the attacker relies on are
relatively small, and modern processors don’t take an exact amount of time for
any instruction, the attacker is limited in how much obfuscation he can use,
and also very much in what kind of obfuscation he can use.
From Gruss, Maurice & Wagner [3] I gathered that clflush causes cache hits, and
that dummy instructions that access memory have particularly high variance and
are thus unsuitable for the attacker to use as obfuscation. Thus it would be
likely that I’d get to my clflush in just one interrupt, and that I could stop
looking after just a few. To my surprise I didn’t get any cache hits on clflush
on a Kentsfield processor. This led me to conclude that most likely the
implementation of clflush/the cache manager differs across processors with
respect to when events are counted. Being conservative, I thus went with the
“instructions retired” event, with the downside that I need more interrupts and
thus incur a higher performance penalty to rule out false negatives. The smart
ones amongst you will notice that Kentsfield doesn’t have a level 3 cache, and
my guess is that this is the reason for the event not being counted. Thus I
suspect that the cache hit event would do the trick in a real world
implementation.
A note on my detection PoC
I developed my PoC on 32-bit Windows XP on a 4 core Kentsfield system. The
reason for picking this system was twofold. First and foremost, it was the only
one available to me. I don’t own a computer, and my wife would not find it
amusing if my driver crashing ended up causing damage to the file system on
hers. Using 32-bit Windows XP also has a significant advantage: there is no
PatchGuard, which allows me easy direct access to the hardware. A real world
implementation needs to consider this. Now I did not actually hook the access
violation interrupt handler in the IDT. I opted for the easier solution of
using vectored exception handling in user mode. My reason for doing this was
that my method is based on hardware features of the Intel processor, and thus I
could use the existing handler of Windows without contaminating my results.
From a development perspective this allows me to work in ring 3 (which is a
really great thing when you’re testing a driver using remote desktop over VPN),
and it gives me a lot of the interrupt handling for free, like figuring out why
the access violation was triggered in the first place. Despite using a
Kentsfield processor, where the attack cannot actually take place, the
mechanism of my detection can be tested reliably, and thus I’m confident that
the general process will extend to other architectures as well.
Performance
The first comment I need to make is about the word performance. In this context
it can mean two things. First, how well does the detection work in terms of
false positives and false negatives? Second, what is the speed penalty for
running the detection? I shall try to answer both separately.
False positives and false negatives
The detection hinges on two things: that the attacker uses RDTSC(P) twice, and
that he uses a clflush. Now if the attacker is able to make a suitable high
precision timer and thus avoid using RDTSC, my detection fails. In my cache
side channel blog post I argued that a thread timer may be a solution to that
problem. With flush+flush the margin for success is smaller, because the timing
difference used to distinguish hit from miss is much smaller than in the
traditional CSCs. I don’t think these timers work well enough, but I could be
wrong. The second option for the attacker to break my detection is the dummy
instructions that I discussed above. Third, the clflush instruction itself can
be obfuscated using prefixes such as ds, es, cs, ss, or even combinations of
these, since Intel CPUs tend to ignore the illegality of double address
prefixes. Nevertheless this should not pose much of a challenge for the
defender. If the defender is able to use the “cache hit” performance counter to
find the clflush instruction, doing so will come with a slim chance of false
positives, because we’ll no longer find the beginning of the instruction but
the end, and the instruction triggering the cache hit could be one that merely
looks like a clflush: the 0x0f, 0xae opcode bytes could be present in the
ModRM/SIB/displacement of, say, a mov instruction. The defender could deal with
this by using the “instructions retired” method for the next round of the
attack. Finally there is the possibility that a benign program would use the
rdtsc/clflush combination for some reason, and that would definitely turn up a
false positive. But the quick summary is that the attacker is not left with
much wiggle room to avoid detection, and false positives are unlikely.
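Both the prefix obfuscation and the look-alike false positive can be shown with a small matcher. The sketch below is in Python for readability and is not the PoC’s code; the function name and the chosen prefix set are assumptions. It uses the fact that clflush is encoded as 0F AE /7 with a memory operand, and it deliberately does not handle the 0x66 prefix, which selects a different instruction on newer CPUs.

```python
# Prefix bytes skipped before the opcode: segment overrides (es, cs, ss, ds,
# fs, gs) and the address-size prefix. The choice of set is an assumption.
SKIPPABLE_PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65, 0x67}


def is_clflush(code: bytes) -> bool:
    """Return True if `code` starts with a (possibly prefixed) clflush."""
    i = 0
    while i < len(code) and code[i] in SKIPPABLE_PREFIXES:
        i += 1
    if len(code) - i < 3:
        return False  # too short for 0F AE plus a ModRM byte
    if code[i] != 0x0F or code[i + 1] != 0xAE:
        return False
    modrm = code[i + 2]
    mod = modrm >> 6
    reg = (modrm >> 3) & 0b111
    # /7 selects clflush only with a memory operand; mod == 3 is sfence.
    return reg == 7 and mod != 3


plain = bytes([0x0F, 0xAE, 0x38])                    # clflush [eax]
prefixed = bytes([0x3E, 0x26, 0x0F, 0xAE, 0x38])     # ds es clflush [eax]
sfence = bytes([0x0F, 0xAE, 0xF8])
mov = bytes([0x8B, 0x80, 0x0F, 0xAE, 0x38, 0x00])    # mov eax, [eax+0x38AE0F]

print(is_clflush(plain), is_clflush(prefixed), is_clflush(sfence))
print(is_clflush(mov))      # decoded from the true instruction start
print(is_clflush(mov[2:]))  # matched from inside the displacement
```

The last two calls show why knowing the start of the instruction matters: from the real start the mov is rejected, while matching from inside its displacement gives exactly the look-alike false positive described above.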
The speed penalty of this detection scheme
The normal performance counter method for CSC detection is virtually free. Mr.
Herath and I ran tests on the performance loss and found a drop in the range of
0.3% (statistically insignificant too). In this flush+flush detection there is,
however, a performance loss. What one would consider an acceptable cost is of
course highly subjective, so I’ll just give some numbers here. The performance
loss has two components: the first is the handler for the PMI, and the second
is the access violation on the RDTSC instruction. The overflow interrupt on
performance counting costs, in and of itself, around 530 ns on the Kentsfield
processor I used to write the PoC for the detection. Mr. Herath measured around
250 ns on his more modern computer as part of our presentation for Black Hat. I
suspect the access violation interrupt to be of the same order of magnitude.
The access violation handler isn’t all that simple, because we need to figure
out what caused the exception before we can pass it on or emulate the RDTSC
instruction. I do not have any measurements for that, but I suspect it’ll be
less than a microsecond of total overhead on the RDTSC instruction (which of
course is a massive amount of added inaccuracy too). The handler for the PMI is
relatively fast, as we only need to search for the clflush instruction. For
reference, human perception starts around 20 milliseconds.
With ballpark figures for the duration of the components of a detection
attempt, we now need to turn our attention to incidence figures. Unfortunately
I don’t have any. The first thing we must take note of is that the RDTSC
instruction is rarely used. Visual Studio 2013 and the search indexer on WinXP
both crashed after I set the CR4.TSD flag without a handler, but other than
that the PC continued to run stably. So we wouldn’t be running one detection
attempt after another. The PMI will only come into play once we’ve found an
RDTSC. If we use the “instructions retired” variant, we’ll probably spend a
significant amount of time before we can dismiss an incidence of rdtsc as
benign, depending on just how many dummy instructions we think the attacker
could add without adding too much noise. However, if we can use the “cache hit”
variation, we’ll probably only need a few interrupts to confidently say it’s
not an attack, again depending on how much cache access an attacker could
actually do without adding too much noise.
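For a rough feel of the numbers, the ballpark figures can be combined in a back-of-the-envelope calculation. In the Python sketch below the 530 ns PMI cost and the one microsecond upper guess for the RDTSC fault come from the text above, while the RDTSC rate and the number of PMIs per RDTSC are pure assumptions, since no incidence figures are available. The time spent inside the PMI handler itself is not modelled.

```python
def detection_overhead_fraction(rdtsc_per_second, pmis_per_rdtsc,
                                fault_ns=1000.0, pmi_ns=530.0):
    """Fraction of one core's time spent on detection.

    fault_ns: guessed upper bound for trapping and emulating one RDTSC.
    pmi_ns:   cost of one performance counter overflow interrupt
              (the Kentsfield figure quoted in the text).
    """
    per_rdtsc_ns = fault_ns + pmis_per_rdtsc * pmi_ns
    return rdtsc_per_second * per_rdtsc_ns / 1e9


# Hypothetical workload: 1000 RDTSCs per second, 10 PMIs before giving up.
fraction = detection_overhead_fraction(rdtsc_per_second=1000, pmis_per_rdtsc=10)
print(f"{fraction:.2%}")
```

Under these assumed inputs the cost stays below one percent of a core, but the result scales linearly with both the RDTSC rate and the number of PMIs spent per RDTSC, which is why the “cache hit” variant with its few interrupts is attractive.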
The last thing we should consider is that the general method can be applied
selectively. We can set the CR4.TSD flag only for suspected applications, or
whitelist known benign software. We can even whitelist dynamically: say we have
another thread that statically analyzes the code around each RDTSC, and if it’s
sure to be benign, then we wouldn’t have to trigger PMIs the next time we
encounter the same instance of RDTSC. We could also moderate our use of the
PMI. Say we trigger the PMI only on every other instruction: this gives us a
50% reduction in the number of interrupts and a 50% chance of detecting an
attacker per round he runs the attack (one round being one iteration of the
“while (true)” loop in the pseudo code), which means our chance of detection is
extremely high in real use cases.
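The “extremely high” claim follows from simple arithmetic, sketched below in Python. The function name is made up, and the calculation assumes that each round of the attack loop gives an independent chance of detection.

```python
def detection_probability(per_round_chance, rounds):
    """Chance of catching the attacker at least once in `rounds` loop iterations.

    Assumes every round is an independent trial with the same chance.
    """
    return 1.0 - (1.0 - per_round_chance) ** rounds


# Sampling every other instruction gives a 50% chance per round.
for rounds in (1, 5, 10, 20):
    print(rounds, detection_probability(0.5, rounds))
```

Since an attack typically has to run many rounds to extract anything useful, even a modest per-round chance adds up quickly; the same formula can be used to see how far the sampling could be thinned out further.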
To summarize, I think the detection scheme could be implemented at an
acceptable speed penalty.
What could Intel do?
It has often been suggested to make clflush privileged. Adding a CR4.TSD type
flag for clflush would be a more flexible solution, because it’d grant the
operating system the power to allow and disallow clflush where it sees fit.
Acknowledgement
I need to acknowledge Mr. Herath for his contributions to our Black Hat slides. The conversations and work I did with Mr. Herath for Black Hat have been instrumental for this post to work out. Thanks also go to Daniel Gruss for enlightening conversations and to Clémentine Maurice for trusting me with a draft of their excellent paper. Any errors or misunderstandings are entirely mine!
Literature
[1] Marco Chiappetta, Erkay Savas & Cemal Yilmaz (2015): “Real time detection of cache-based side-channel attacks using Hardware Performance Counters”, http://eprint.iacr.org/2015/1034.pdf
[2] Moritz Lipp, Daniel Gruss, Raphael Spreitzer & Stefan Mangard (2015): “ARMageddon: Last-Level Cache Attacks on Mobile Devices”, http://arxiv.org/abs/1511.04897
[3] Daniel Gruss, Clémentine Maurice & Klaus Wagner (2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf
Sorry for my late reply. I missed your comment. You only get the PMI once, which is why you reset MSR 0xC1 in your PMI handler: OnPMI() { if (MorePMIsNeeded) SetMSR(0xC1, -1000) }. Hope this helps. Otherwise email me on anders_fogh (at) hotmail.com and I'll send you some sources...
Was just reading through your article. You setup a vectored interrupt handler, but that could not have been system wide? How did you manage to monitor all user mode exceptions from all processes and handle them using a vectored interrupt handler?
Respected Anders Fogh Sir,
How can I flag Flush function (CLFLUSH) and Time function (RDTSC) inclusively on CentOS Linux 7 running Kernel-based Virtual Machine?
Any assistance you can provide would be greatly appreciated.
Thanking you.