Cache side channel attacks
Errata 5.11.2015:
As I expected, a few errors and less-than-accurate formulations made it into this post. There are two things I'd specifically like to address here.
My comments on using a non-inclusive cache hierarchy to defeat cache attacks are misleading at best, since the L1 and L2 caches are per core anyway. It's the shared nature of the L3 combined with the inclusive cache hierarchy that allows us to flush those caches from another core in the first place. However, these caches are relatively small, so with only a "small" amount of memory activity by the victim or other code running on the same core (this includes the attacker, if she can get herself executing on that core), rare events can still be effectively spied upon.
On using performance counters to detect cache side channel attacks, I focused only on low frequency attacks in this text. Mr. Herath's and my slides from Black Hat do not. The idea is that high frequency cache attacks are relatively easily identified as such from performance counters (either through the sheer number of cache misses or, say, a ratio of cache misses to cache hits). Alternatively, machine learning methods could be applied to such performance counter data. Low frequency attacks are much more difficult to tell apart from the noise generated by normal memory usage on a busy system. Say the attacker is looking for keyboard events. For this he polls every 0.5 seconds and has, say, 15 good addresses he polls. That gives us 15 cache misses per 0.5-second period (possibly even spread out over the 0.5 seconds), which isn't much if a video encoder is causing thousands of misses that also go up and down periodically as the encoder switches between motion search (large number of cache misses and lots of hits) and entropy encoding (few hits and misses). This is where the approach I describe here can filter the noise and force the attacker into territory where we can detect him with confidence.
Prelude
No tech stuff in this part. Skip to the introduction to read the actual blog post. If you care about my ramblings, do continue.
It's been 8 awesome months since I started my comeback into infosec with this blog. It's time to take stock. A few times I've felt misunderstood. The most common misunderstanding is that I have a different concept of research than many in the infosec community. This blog isn't research. It is play. I do it for fun and games and that's it. There is no intent on my side to do research, because that would require me to be systematic about it, and that again is outside of my time budget. I hope from time to time to present a new idea here, but that's the loftiest I wish to get with this blog. I do hope that the ambitious infosec newbie might be inspired to look into some of the things I write about, and I also hope that infosec professionals occasionally enjoy a lunch break with my musings.
The most read post so far was the nostalgia post, followed by anything row hammer, followed by anything with machine learning in the title. Ironically, the post on machine learning I like best doesn't have those words in the title and isn't being read much. The post nobody cares about is the one about H.a.r.e.s., which I think is a shame; I admit it's not the most relevant, but H.a.r.e.s. sounds like such a cool idea. The LangSec post was the biggest surprise to me: it's quite well read.
As a personal note, I find the best infosec blogs are by far http://lackingrhoticity.blogspot.com/ by Mark Seaborn and http://www.malwaretech.com/ by MalwareTech. Mark's blog is really inspiring on a technical level, and MalwareTech's joy of finding things out is contagious to me.
For now I'll go offline for a couple of weeks and hope to spend a lot of time with a printout of an article about using performance counters to reverse engineer caches.
Introduction
This blog post benefitted from conversations with some really smart people. I'd like to extend my thanks to them. Nevertheless this post is probably full of errors, and they are mine alone. I've not released source code, because the topic is mostly offensive and the source code would be more useful for offensive purposes than defensive. If you for some reason think my sources could be useful, get in contact (I'll be offline the next 3 weeks, but I will get back to you). Finally, I'd like to apologize that this post got rather rushed at the end. In fact, I'm writing this with a plane to catch. Particular damage was done to the section on different timers, where I really would've liked to add more information.
Cache side channel attacks have been around since 1996 or so, perhaps even longer. Recently they have had a renaissance in the literature because of the (re-?)emergence of a 3rd level cache. (The 3rd level cache is identical to the last level cache on systems that have 3 cache levels; this is worth noting when reading the literature.) The 3rd level cache on modern Intel processors is shared between cores, and this effectively allows a program at any privilege level to spy on programs running simultaneously on other cores, without being concerned about sandboxes or privilege levels, and to some degree even through hypervisors. The sheer size of the new 3rd level cache also allows for more accurate spying than with previous caches. Cache side channel attacks on the L3 cache have been shown to successfully track mouse cursors from JavaScript on a webpage, grab keyboard input from other users on the same system, and grab cryptographic keys on co-located virtual machines in the cloud. It is thus not an entirely academic pursuit.
There are currently three different kinds of cache side channel attacks. I shall write about them all to some extent here, so I'll introduce them briefly before going into a slightly more detailed description of how the cache works.
1. Evict + time
The attacker measures the time it takes to execute a piece of victim code. Then the attacker flushes part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache.
2. Prime + probe
The attacker accesses memory to fill part of the cache with his own data and waits for the victim code to execute (prime). Then the attacker measures the time it takes to access the memory he carefully placed in the cache before. If it's slow, it is because the victim needed the cache, and this gives us knowledge about what the victim did (probe).
3. Flush + reload
The flush+reload attack exploits the fact that processes often share memory. By flushing a shared address, then waiting for the victim, and finally measuring the time it takes to access the address, an attacker can tell whether the victim placed the address in the cache by accessing it.
A short introduction to the level 3 cache
So let's spend a bit of time considering how the level 3 cache is built. I'll use details from Sandy Bridge as an example, but it's mostly similar across Intel architectures. The details, however, can be important in implementing an attack and a defense. I use Sandy Bridge for 3 reasons: I only have access to a Sandy Bridge (details and sizes are based on my wife's computer), it's the best researched in the literature, and it's likely to be the simplest.
At some point in the past CPUs became much faster than memory, forcing CPUs to stall while waiting for memory. Because memory accesses typically concentrate on a relatively small amount of memory at any given time, adding a cache of super fast (and expensive, in terms of CPU real estate and money) memory directly in the CPU gave a large performance gain for only a small price. Thus the first level cache was added. Later, as CPUs grew faster and software more memory hungry, a slightly less expensive second level cache was added (real estate prices in the CPU drop with the distance to where the magic happens). And finally a 3rd level was added, which is a lot slower than L1 and L2, but still much faster than memory and comparatively huge.
L1 and L2 cache operations are slightly more involved than L3 because they are per core, so a messaging system exists to keep L1 and L2 coherent between cores. I'll not discuss that here, since my focus is the L3 cache; suffice to say they are otherwise built similarly. There is however one design feature of Intel's which is important: the cache hierarchy is inclusive. This means that if memory is loaded in L1 it's also loaded in L2 and L3, and if it's loaded in L2 it's also loaded in L3. This is Intel specific; the AMD K7, for instance, also has a cache hierarchy, but it's not inclusive. For an attacker an inclusive cache makes life much easier: if he can flush memory from L3, he automatically flushes it from all caches.
When a CPU vendor builds a cache they need to track which parts of memory are in the cache and which are not. This gives rise to a trade off between how much memory you spend managing the cache, how flexible the cache is and how fast you can do your lookup. The level 3 cache on Sandy Bridge balances these trade offs in the following way. Sandy Bridge's L3 cache, like that of many other CPUs, is a so-called N-way set associative cache. This means the memory is first divided into a small number of blocks, and each block can be mapped to a particular set in the cache. Each set can store any memory location within its block in N different places in the set, N being a small integer; for the Sandy Bridge L3, N is either 12 or 16 depending on the exact model. It is implemented this way as a compromise between two extremes. The first is called direct mapping, meaning that each memory location has only 1 place where it can be stored in the cache, making for quick lookup but fierce competition between different parts of memory for this magic location; it is the N=1 scenario. The second extreme is called a fully associative cache and means that any memory location can be cached anywhere in the cache, the N=cache size scenario (only 1 set). While this makes for a very flexible cache, it also requires searching the entire cache for a match. A good way to think about it is that sets are indexed and ways are searched when looking for a cache hit. Also, to further reduce the problem, the cache doesn't work on single bytes, but on 64 byte chunks called cache lines. Finally, Sandy Bridge and many subsequent CPUs have multiple cores, and each core has a part of the cache located next to it, called a slice. The slices are connected by a ring interconnect so that any core can access the slice of another core, but only at a penalty. This is a compromise between fast lookup in the slice belonging to the core and access to the entire cache for all cores. Graphically this looks something like this:
Summary example for my wife's Sandy Bridge: any given byte in memory can be cached as part of a 64 byte cache line. This cache line is cached in exactly one slice, which belongs to a core. In this slice it can be an occupant of exactly one cache set out of 2048. Within the cache set it can occupy any of 12 ways. 2 (slices/cores) * 2048 (sets) * 12 (ways) * 64 (bytes per cache line) = 3 MB cache total.
Let's take a look at it differently. A request for data is sent from core 1 down towards memory. We start where the request has arrived at the L3 cache:
1. First, part of the address is used to index the correct slice. If we are lucky it's on the current core's slice; otherwise we pay a small penalty for using the ring interconnect to access the slice of core 2.
2. Other bits of the address are now used to index the set to which the address belongs.
3. The N ways are then searched for our address. If it's not found, the cache returns a miss and the request is forwarded to memory.
4. If the requested memory is found in the cache, the cache returns the relevant bytes from the cache line, using the lowest 6 bits of the address to index into the cache line.
For the details of how the bits of the physical address of a request map to the cache, see Seaborn (2015).
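To make the indexing concrete, here is a minimal sketch of how the lower address bits split into line offset and set index, using the Sandy Bridge numbers above (64 byte lines, 2048 sets per slice). This is my own illustration; the slice selection hash uses higher bits and is more involved, see Seaborn (2015).

#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6   /* 64-byte cache lines -> lowest 6 bits are the line offset */
#define SET_BITS 11   /* 2048 sets per slice -> next 11 bits select the set */

int main(void)
{
    uint64_t phys = 0x12345678ULL; /* example physical address */
    uint64_t offset = phys & ((1ULL << LINE_BITS) - 1);
    uint64_t set = (phys >> LINE_BITS) & ((1ULL << SET_BITS) - 1);
    printf("line offset: %llu, set index: %llu\n",
           (unsigned long long)offset, (unsigned long long)set);
    return 0;
}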
Eviction
The cache is filled either through memory reads or through the prefetcher. When you read memory, very often you'll shortly thereafter either read or write the same memory or memory in the same cache line. Thus reading memory usually causes the memory you just read to be stored in the appropriate cache set. The prefetcher looks for patterns in your memory accesses and tries to move cache lines from memory to the cache before you need them. Either way, a set is going to be full at some point, and to keep the system up to date memory needs to be evicted from the cache. The method that determines which cache lines are no longer needed in the cache is called the eviction policy, and the act of dropping a cache line is called evicting, or sometimes flushing.
The first algorithm that comes to mind is to drop the least recently used cache line to make space. This is, not surprisingly, called an LRU eviction policy. The problem with such a policy is that you need to keep track of the order in which the cache lines within a set were accessed in order to evict the least recently used one. This requires roughly n*log2(n) bits (where n is the number of ways) and a search through those bits to figure out which line to evict. This can be done in different ways, but it's always computationally intensive. With bits and time being expensive in the CPU, Sandy Bridge opted for a so-called pLRU or "pseudo least recently used" eviction policy. Instead of keeping accurate track, a simple binary tree is represented with a suitable number of bits, so that each leaf corresponds to a way in the set. On access, the bits in the nodes along the path that was used are flipped, and for eviction the opposite path is followed to figure out which way is to be evicted. pLRU is not only faster algorithmically, since only the nodes on the path need updating, but also requires fewer bits. The downside is that it only approximates LRU. To read more about eviction policies you may like to read Jiménez (201x). More modern CPUs than Sandy Bridge have more complex (adaptive) eviction policies.
We can now see that if we wish to remove another program's memory from the cache, we can do so by systematically loading our own memory into each cache set. Theoretically, for Sandy Bridge we need only N addresses (12 or 16) that belong to a cache set of interest; load each of those twice and the cache set will be filled with our addresses. Real life experiments suggest that a little more than twice gives better results. The interesting thing is that it's quite a doable proposition. In fact it's so doable that even without knowing the actual addresses we can figure them out by using timing, and thus do it from JavaScript without pointers. Oren et al. (2015) do this. A detailed discussion of optimally filling cache sets is given in Gruss & Maurice (2015).
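As a rough illustration, here is a minimal sketch of such a priming loop, assuming we already have an array of addresses (an eviction set) that all map to the target cache set; finding those addresses is the hard part and is what Oren et al. (2015) and Gruss & Maurice (2015) spend much of their effort on.

#include <stddef.h>

/* Sketch: prime one cache set by touching every address in a precomputed
   eviction set a couple of times. eviction_set[] is assumed to hold n
   addresses that all map to the target set. The volatile reads keep the
   compiler from optimizing the loads away. */
static void prime_set(volatile char **eviction_set, size_t n)
{
    for (int pass = 0; pass < 2; pass++)     /* "load each of those twice" */
        for (size_t i = 0; i < n; i++)
            (void)*eviction_set[i];          /* load -> line lands in the set */
}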
Cache side channel attacks
Now that we have a partial overview of how the cache works, we can turn our attention to how the 3 aforementioned cache attacks work the cache to gain information.
Evict + time
In the beginning of this post I wrote: "The attacker measures the time it takes to execute a piece of victim code. Then the attacker evicts part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache." We can now fill in the blanks. If we know which cache sets the victim is likely to use, we can set up the cache to give us an idea of what memory the victim used. If we run the victim once, we know it's likely to have cached part of the memory it used. We can time how long it takes to run, with a high resolution timer, to give us a baseline. Now we load our own memory systematically so that a number of cache sets (those in which we are interested, in terms of the victim) become full with our data; at this point we can assume that anything the victim placed in those sets has been evicted. Then we let the victim run again and compare how long it took against our baseline. We can repeat this process for different cache sets until we've figured out which cache sets the victim used. If branches are sufficiently large (larger than the cache line size) we can say which parts of the code the victim uses. If memory structures are bigger than a cache line or spread across cache lines, we can say which memory structures he used. This information can be used to draw inferences about information private to the victim. Typical examples are attacking an AES implementation that uses tables for encryption, with the key being private to the victim, or KASLR, where actual addresses are hidden from view. We should note here that other tasks using the cache will add noise to our measurements, so repeating the attack again and again improves accuracy; on a multi core system there is always noise in anything to do with memory. Also, the prefetcher is an enemy to be considered, as it might load memory in anticipation of the victim using it without the victim actually doing so, thus giving us false information. In practice these things make attacks less reliable, but by no means infeasible. The real downside of this attack is that we need to be able to cause execution of victim code, which might be difficult in the first place. The positive side is that we can apply it from JavaScript or native code, since we need not know any addresses belonging to the victim to carry out the attack; the actual cache sets used is all the information that is required.
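A minimal sketch of one evict+time round could look like the following; run_victim() and evict_target_sets() are placeholders for triggering the victim and for the set-filling loop sketched in the eviction section, and the "noticeably above baseline" threshold would have to be established experimentally.

#include <stdint.h>
#include <x86intrin.h>

extern void run_victim(void);        /* placeholder: triggers the victim code */
extern void evict_target_sets(void); /* placeholder: fills the cache sets of interest */

/* Time one run of the victim with rdtscp. */
static uint64_t time_victim(void)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    run_victim();
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

/* Usage: baseline = time_victim(); evict_target_sets();
   if time_victim() now comes out noticeably above baseline, the victim
   touched the evicted sets. */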
Prime + probe
The prime+probe attack loses the restriction of evict+time that the attacker needs to be able to start execution of the victim. To make this happen we again make sure that interesting cache sets are filled with memory we control; this is the prime phase of the attack. Once this is the case, we wait a predefined amount of time to see whether the victim uses memory in the region covered by the cache set. If he does, he'll cause one of our addresses to be evicted. The probe phase is now to access each address we originally filled the cache set with. If this is slow, the victim has used memory within this cache set as well. Probing fast enough, we get a series of observations over time of which cache sets are being used on the computer. This allows us to draw lots of inferences about what the victim is doing at any time. Typically, to map cache set activity, the attacker runs lots of attacks on a system under his control to generate a baseline data set. He then uses a machine learning method to map the time series he got from the victim onto that baseline to figure out what the victim is actually doing. Prime+probe is in my opinion the most powerful of the cache side channel attacks. Again we don't need actual addresses, so we can make it work from JavaScript. We don't need to cause execution of the victim code, we just need to loiter around, which is one less prerequisite for the attacker. Thus this attack vector is suitable for almost any situation where we get to run our own code on the same computer as our victim. Hypervisors, privilege levels, sandboxes and other barriers are irrelevant, because we all meet up in the cache. The downside of this attack is that the data we get back tends to be very noisy. We only learn which cache sets were used (and partly how many cache lines were loaded), but we have no initial information about which core, thread or even which code caused the access. So to spy successfully we need to repeat the process a lot, and we can spy only on events that are relatively common. Oren et al. (2015) cover this kind of attack in more detail.
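Roughly, the probe phase can be sketched like this (GCC/Clang intrinsics); the eviction_set array is the one used for priming, and the cycle threshold is an assumption that has to be calibrated per machine.

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

#define THRESHOLD_CYCLES 80 /* assumed hit/miss boundary; calibrate per machine */

/* Probe: re-access every address we primed the set with and count slow loads.
   A slow load means the line was evicted, i.e. someone else used the set. */
static int probe_set(volatile char **eviction_set, size_t n)
{
    unsigned aux;
    int evictions = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = __rdtscp(&aux);
        (void)*eviction_set[i];
        uint64_t t1 = __rdtscp(&aux);
        if (t1 - t0 > THRESHOLD_CYCLES)
            evictions++;
    }
    return evictions;
}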
Flush + reload
Flush+reload overcomes evict+time's requirement of causing execution as well as prime+probe's noisy data. It does this by assuming that victim and attacker are using the same memory. This proposition at first seems weird, since even an unsafe operating system such as Windows 95 gives each process its own memory space. But it's not as weird as it may seem, because of a memory management feature of most operating systems called page deduplication. If two identical memory pages are loaded in separate processes, the operating system can save physical memory (and make more efficient use of the cache) by mapping the same physical page into each process. Most operating systems do this, even across different users and privilege levels. A typical example would be the code of kernel32.dll on Windows: it's used by most processes, but most of the DLL (especially read-only portions such as code) is mapped into physical memory only once. Should a process decide to write to such memory, most operating systems incorporate a feature called copy-on-write: essentially the OS sees to it that an exception is thrown on write access and then makes a new copy of the page for the offending process only, before retrying the write operation. Page deduplication is rampant in all modern operating systems, and even some hypervisors are capable of doing it. The attacker now uses this by flushing a specific address shared by attacker and victim. Because he needs to flush a specific address, this attack seems impractical from JavaScript. In native code it's quite easy to flush an address using the clflush instruction, but there is also no reason why he shouldn't be able to do it using the normal eviction policy of the cache. Once the address is flushed, the attacker waits an interval and then times the access to the address. If the victim used the address, it's likely to be in the cache and thus the access will be fast. Since we know exactly which address we are talking about, we have very little noise. Flush+reload can be used as a tailored attack, or by comparing what happens on a machine under the attacker's control with what happens on a victim computer using machine learning methods; the latter is called a template cache attack. The best article on flush+reload attacks is probably Gruss, Spreitzer & Mangard (2015).
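For native code, one flush+reload round might look roughly like the sketch below (GCC/Clang intrinsics). The threshold is an assumption that has to be calibrated, and the fences are there to keep the timed load from being reordered around the timer reads.

#include <stdint.h>
#include <x86intrin.h>

/* Sketch of one flush+reload round on an address shared with the victim.
   Returns nonzero if the reload was fast, i.e. the victim probably touched
   the line while we were waiting. */
static int was_accessed(volatile char *shared_addr, uint64_t threshold)
{
    unsigned aux;
    _mm_clflush((const void *)shared_addr); /* flush the shared cache line */
    /* ... wait for the victim to (maybe) touch it ... */
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    (void)*shared_addr;                     /* reload */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return (t1 - t0) < threshold;           /* fast reload -> it was in the cache */
}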
Current countermeasures to cache side channel attacks
My general opinion on security problems such as CSCs is that we should fix the root cause, and in this case the root cause is the micro architecture of the cache. The core problem, however, is the cache being shared across privilege levels, and that might be a significant hurdle for CPU designers to clear. Also, there is already plenty of existing hardware in use, so even if Intel were to fix the root problem we'd be stuck with these issues for a considerable amount of time unless we look at other options.
1. Less accurate timers
Most cache side channel attacks use the timing of a memory access to determine whether it's cached or not. This usually works through the RDTSC/RDTSCP instructions, which read the current time stamp counter of the processor, or through an operating system/library wrapper thereof. For JavaScript this can be implemented directly in the wrapper, and this has been suggested and already implemented in response to Oren et al. (2015). Obviously, if the timer is sufficiently coarse we cannot tell a cache hit and a cache miss apart. For user mode attacks, RDTSC(P) can be disabled in user mode through a bit in CR4, which will cause the instruction to raise an exception. This again can be used to emulate the instruction with any coarseness we wish, leaving compatibility with most existing software intact. Now, this approach is never going to be perfect. The reason is that if we start a timer at a random point in time and then carry out a flush or a probe, the probability of seeing the timer increment when we read it again increases with the time the flush or probe took. Thus if we can sample enough events we'll be able to draw inferences about what is happening, that is, spy. Keyboard presses are not such an event type, as they'll be more or less unique events; we can't just ask the user to type his password 10000 times. Blitting a mouse cursor, on the other hand, touches multiple cache lines at once, and in neighboring positions as it's moved, so we could probably get a good idea about where it is even with a moderately coarse timer. One approach for an attacker to avoid this would be to use performance counters to count LLC cache misses instead of relying on timing, where his privilege level allows for it. Also, on virtual machines performance counters are often not implemented. OS and virtualization implementers, however, should be aware that performance counters provide a powerful tool for reading into side channels, even when you only get your own performance data, because of the sharing of resources in the system. I'll return to timing later.
2. Non-inclusive cache hierarchy
Obviously this does nothing against code using the CLFLUSH instruction, but people trying to evict the cache by filling a cache set now need to bother with flushing all levels in the hierarchy separately. Obviously this makes it more difficult, but I don't think it's a deal breaker: since we can relatively easily evict a cache set in a 3 MB cache, evicting one in a 32 kbyte cache should be a feasible task. I've not read about cache attacks on the AMD K7, but I'm fairly pessimistic: despite the non-inclusive cache hierarchy, CSCs are probably possible on this processor too.
3. Fully associative caches
Though they may not offer the same performance, fully associative caches do make it very difficult to evict the cache sufficiently fast without the CLFLUSH instruction, as we'd essentially need an extreme number of ways in our eviction set. Again this would only be a solution on new hardware, and to really close the gap the CLFLUSH instruction should be made privileged. There are other reasons (row hammer) to follow such a path.
4. Disable cache-line sharing
Flush+reload attacks can be dealt with effectively if there is no shared memory. This ranges from disabling page deduplication at the operating system/virtual machine level to actively causing copy-on-write on particularly interesting pages. On Windows 7 a simple ReadProcessMemory() followed by WriteProcessMemory() is enough to trigger a copy-on-write and thus disable flush+reload; you don't actually need to change the memory! (A small sketch of this trick follows after this list.) There are other reasons to reconsider page deduplication: the copy-on-write takes time and opens up a side channel attack by timing write operations to pages. I believe Daniel Gruss wrote a paper on the subject, but I could be mistaken.
5. Adding noise
Accessing and flushing memory "randomly" by a victim or the operating system obviously adds lots of noise to the process. I love the irony of using cache side channel attack techniques to mitigate/detect cache side channel attacks, but despite this irony there is obviously a heavy performance penalty associated with causing cache misses to protect against attacks, and in scenarios where the attacker makes many repeated attempts it is likely to be too big.
6. Prefetching
The Intel CPU tries to load data into the cache before it's used; this is called prefetching. It's especially useful for code and for predictable reads, like memcpy(). Obviously the prefetcher continually loads cache lines even when nobody has actually accessed the memory, thus making it more difficult to get good data on memory activity for spying purposes. However, I find it unlikely that we'll see a prefetcher good enough to land a very strong blow against the possibilities of an attacker.
7. Software diversity
It has also been suggested to modify software so that each user has a version with a unique memory layout and access paths. Imagine a powerful mutation engine at work. This has been suggested as a remedy against code reuse attacks in the past and is likely to make CSCs quite a bit more difficult. It is also quite feasible, as such a mutation engine was used in a successful virus by Z0mbie back in 2005. The downside is of course that mutation engines of any kind pose a QA problem for software developers. Certainly this method has not yet gained traction in fighting code reuse attacks, so the probability of it gaining traction for fighting CSCs seems slim.
8. Performance counters
Using performance counters is also an option for detection. In Nishad Herath's and my Black Hat slides I suggested instrumenting RDTSC(P) or other high resolution timers to first make them less accurate, and second to start counting cache misses using the Intel performance counters. This forces a spy into territory where he has to cause more cache misses to get reliable information. This, combined with the presence of a high resolution timer, makes a very good heuristic for cache side channel attacks, especially since one potential response would be to make the timer even less accurate, thus only mildly impacting the system as a whole in the event of a false positive. In my experiments false positives have been rare, though low-frequency spying sometimes ends in false negatives. I imagine that continued spying, as well as tuning the coarseness of the timer, could make it a viable solution. However, let's take a look at timing from an attacker's perspective. Even if the attacker manages without a high resolution timer, it might be well worth looking into performance counters to detect cache side channel attacks, because any improvised high resolution timer is likely to leave uncommon artifacts in the performance counters.
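Coming back to point 4: a minimal sketch of the copy-on-write trick, assuming Windows and a handle with read/write access to the target process. The one-byte read/write and the function name are my own illustration of the ReadProcessMemory()/WriteProcessMemory() idea, with error handling omitted.

#include <windows.h>

/* Force a private copy of a shared page in the target process by reading a
   byte and writing the same byte back. The write triggers copy-on-write, so
   the page is no longer shared and flush+reload on it stops working. */
static void unshare_page(HANDLE hProcess, LPVOID pageAddr)
{
    unsigned char byte = 0;
    SIZE_T done = 0;
    ReadProcessMemory(hProcess, pageAddr, &byte, 1, &done);
    WriteProcessMemory(hProcess, pageAddr, &byte, 1, &done); /* same value back */
}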
Timer issues
From the above it looks like attacking the timer is the most promising defensive technique, as it deals with all three methods of cache side channel attacks. It's feasible and relatively easy to implement, at least if we don't consider PatchGuard, and though it might not give 100% peace of mind, it at least goes a long way in the right direction. However, there might be other ways to get a high resolution timer on x86. I'll suggest two such methods here; others may exist:
1. Delta timer. Using a low-res timer, wait for the moment it increments, then do your measurement and try to fill the gap with register-only instructions so that a cache miss will cause the timer to increment once more and a cache hit will not. I call this method of timing a delta timer. Probably somebody in the literature has given it another name; if so, please contact me and tell me. On old single core CPUs register-only instructions had deterministic timing. This is not the case on multi core CPUs, and that makes it a bit less reliable. The figure below shows a graph of 30 observations from my Sandy Bridge when trying to time out 8000 nanoseconds. It suggests that if we need to use a delta timer and want to be 95% correct, we need at least 10 cache misses to be sure there were any cache misses at all. This brings it into a range where the performance counter detection method would work well. However, if we accept an error rate of 1/6, the delta timer might do the job while being difficult to detect using performance counters. Now remember that this is not a real scenario, where we would have a lot more noise already. But it's unfortunately not infeasible that a delta timer would work in many real world scenarios, especially for flush+reload, since flush+reload is a lot less susceptible to noise than the other methods anyway. But who says the coarseness of the timer should be constant? With a random value (random drift?) the delta timer would be difficult to use; that again is a different scenario from what has been proposed for JavaScript.
2. Thread timer. I've dubbed this kind of timer a "thread timer", without knowing whether there is any literature on the subject. The idea is to have a running thread that starts counting on a signal through a global variable; when the global variable is reset, it stops counting. The count then serves as a high resolution timer. There are problems in implementing such a timer. Mainly, accesses to the global variable are not automatically perfectly synchronized, leaving a non-deterministic (and significant) time gap from when the timer is told to start to when it actually starts, and the same problem exists for stopping. Now, Intel does provide instructions to deal with this (mfence and the lock prefix), but I've not investigated those, as I figure they'd be difficult to generate from JavaScript (I find the JavaScript attack vector the most threatening). For native code they are likely to bring improvement, but not determinism. The first thing I noticed while experimenting was that I got absolutely random results until I started pinning the thread to a specific CPU (core or hyper thread). The behavior is very different on hyper threads and real cores. The unit for the thread timer is the number of times ecx was incremented during 1 uncached read, here measured for 10 cache misses; the curve for hits looks similar. You may notice that it's shorter than the delta timer graph. This is because I've deleted a handful of observations with the value 0, that is, where the thread timer didn't start at all! When timing only one cache miss I get only a few observations with counts. This hints that we need at least a handful of cache misses to be able to tell misses from hits with a thread timer. Obviously this is not the last word on thread timers, but it hints that they are fairly reliable on hyper threads, a little less so on real cores, and that thread timers are unsuitable for spying on events that give only a few cache misses as feedback. The thread timer code I used is listed below. In short, my suggested use of CR4 to reduce timer resolution does pose a problem for an attacker looking for rare events with only a few cache misses, say key presses.
/* Counting thread: waits for g_Start to be set, then increments ecx until
   g_Start is cleared again and publishes the count in g_Time. */
while (g_Start == 0);   // wait for the measuring side to start us
g_Time = 0;
__asm
{
WaitLooo:
    //mfence
    cmp g_Start, 0
    jz WaitLooo         // wait for the start signal
    xor ecx, ecx
again:
    inc ecx             // count while the signal is set
    cmp [g_Start], 0
    jnz again
    mov [g_Time], ecx   // publish the count
}
g_Start = 0;
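For completeness, the measuring side that drives this counting thread could look something like the sketch below; the bounded spin is just there to avoid hanging in the cases where the timer doesn't start at all.

extern volatile int g_Start; /* shared with the counting thread above */
extern volatile int g_Time;

/* Start the counter, do the access we want to classify, stop the counter
   and read out the count. The handshake is racy, which is exactly the
   source of the noise discussed above. */
static int thread_timed_access(volatile char *addr)
{
    g_Start = 1;   /* counting thread zeroes g_Time and starts counting */
    (void)*addr;   /* the memory access to classify as hit or miss */
    g_Start = 0;   /* counting thread stops and publishes its count */
    for (int spin = 0; g_Time == 0 && spin < 100000; spin++)
        ;          /* a count of 0 means the timer never started (see above) */
    return g_Time;
}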
Maybe I drink too much
In private conversations I've suggested in the past to use cache side channel attacks to detect rootkits, including SMM rootkits. My suggestion goes like this: suppose we have a rootkit type that tries to read user mode private keys in memory. The possibility exists that we can trigger this behavior and use it for an evict+time attack. Imagine an SMM rootkit with a hook on SomeOpenPrivateKeyFunction() which reads the input irrespective of validity, while the real SomeOpenPrivateKeyFunction() won't read past a syntax error in the key, and our input exceeds a cache line. The scenario sounds farfetched, but maybe not too farfetched. Then set up a fake private key the size of two cache lines with a syntax error in the first cache line. Flush the second cache line and then call SomeOpenPrivateKeyFunction(). Now time accessing the second cache line of the fake private key. If SMM touched it, chances are it used the cache, and we'll be able to see the time difference. Obviously there are plenty of things a rootkit could do to mitigate this, but at least we've started a game of hide and seek.
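A minimal sketch of that probe could look like the code below. SomeOpenPrivateKeyFunction() is the hypothetical hooked function from the scenario above, the key buffer is assumed to be cache-line aligned and two lines long, and the cycle threshold would have to be calibrated.

#include <stdint.h>
#include <x86intrin.h>

#define THRESHOLD_CYCLES 80 /* assumed hit/miss boundary; calibrate per machine */

extern void SomeOpenPrivateKeyFunction(const char *key); /* hypothetical target */

/* key: 128-byte, cache-line aligned buffer with a syntax error in the first
   64 bytes. Returns nonzero if something read past the error, i.e. the
   second line came back cached. */
static int hook_suspected(char *key)
{
    unsigned aux;
    _mm_clflush(key + 64);            /* flush the second cache line */
    _mm_mfence();
    SomeOpenPrivateKeyFunction(key);  /* the legitimate parser stops at the error */
    uint64_t t0 = __rdtscp(&aux);
    (void)*(volatile char *)(key + 64);
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) < THRESHOLD_CYCLES;
}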
Another interesting (mis-)use of cache side channel attacks would be detection of undesired runtime environments. Emulators and VMs could be detected. Emulators could be detected because it's quite difficult to emulate a cache correctly and most just plain skip it. VMs could be detected by bootkits through the fact that the cache is likely to be full under a VM, whereas on a real system it should be empty at boot. I'm sure other cache based methods would be available.
Conclusion
Cache side channel attacks certainly have a high coolness factor, which is why I bothered to write about them. On the surface of it, spying on another VM from a sandboxed JavaScript webpage, being able to exfiltrate crypto keys, key strokes, mouse movements etc., sounds like the perfect storm. Add to it that (in my opinion and limited knowledge) nobody has found a good strategy for mitigation yet.
Certainly cache side channels pose a real danger to systems. But cache side channel attacks won't be responsible for a buffer-overflow-type wave of new breaches. The value of figuring out private keys or reducing the entropy of KASLR is just so much lower than a traditional breach. In the face of the mitigation options already available, cache side channel attacks are reduced even further. Attackers can only figure out private keys if they are used often, and attackers cannot really spy on key strokes with a reduced resolution timer. JavaScript spying is made more difficult, and with its level of noise it spies only on repeated events and only until somebody closes the webpage. However, the message should be: on sensitive systems these mitigations should be implemented, especially the lower resolution timer and the deduplication measures. And that JavaScript engines are already implementing a lower timer resolution is a good thing, even in the presence of other timers and even though I would to some extent have done the details differently.
And finally, just because there are currently no good mitigations on the table certainly doesn't mean there aren't any, and even less-than-perfect mitigations raise the cost/benefit ratio for an attacker. I for one would love to spend more time with performance counters. I just have to wonder what kind of quirks thread timers and delta timers have in the micro-op world of performance counters.
Literature
Oren et al. (2015): "Spy in the Sandbox: Practical Cache Attacks in Javascript"; Yossef Oren, P. Kemerlis, Simha Sethumadhavan and Angelos D. Keromytis. arXiv:1502.07373v2
Gruss, Spreitzer & Mangard (2015): "Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches"
Jiménez (201x): "Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches"; http://taco.cse.tamu.edu/pdfs/15-jimenez.pdf
Seaborn (2015): "L3 cache mapping on Sandy Bridge CPUs"; Mark Seaborn. http://lackingrhoticity.blogspot.de/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html
Gruss & Maurice (2015): "Rowhammer.js"; Daniel Gruss, Clementine Maurice. http://arxiv.org/pdf/1507.06955v1.pdf
Comment: Great article. I do not quite follow why the delta timer requires the performance counters. Once we are able to differentiate between a cache miss and a cache hit using the register-only instructions, aren't we done?
Reply: Low frequency attacks can easily lead to false positive or false negative detections with performance counters on machines that are not idle. My solution to this problem was to cause RDTSC(P) instructions to throw access violations (through CR4) and start/stop performance counting on these. This essentially blends out all the noise that other processes create. This scheme can be broken if the attacker uses another high resolution timer, such as a delta or thread timer. This is what I wished to convey. Hope it clarifies things. Otherwise write me an email or contact me on twitter and I'll see if I can explain it better.