Update 30.11.2015
Errata regarding CAT
Yuval Yarom
(University of Adelaide) was so kind to email me regarding my speculation on
CAT changing the cache side channel game and it’s impact on row buffer side
channels. Before I proceed with what Yuval had to say I’d like to recommend a
paper Yuval resonantly co-authored called “Amplifying Side Channels Through
Performance Degradation“. You can find it here: http://eprint.iacr.org/2015/1141.pdf
Yuval corrected
me on some points and verified my thinking on others. Below is my summary of
his comments:
1) CAT is available on some Haswell
processors. E.g. Xeon 2618L v3.
2) CAT partitions on ways.
3) CAT only controls evictions. If a
process can access memory CAT will not intervene with writing, reading or flushing
from the L3 cache.
4) Thus CAT cannot protect against
attacks that rely on clflush in particular: flush+reload and flush+flush. It
could be used to protect against attacks that rely on eviction such as
prime+probe.
5) With only 4 COS values it will be of
limited use in many cloud use cases.
6) If you wish to defeat clflush based
attacks the approach should be to avoid sharing memory.
My
comments:
I did not
foresee that CAT would only control evictions and not flushes. Yuval ‘s point
on CAT not protecting against flush+reload and flush+flush extends itself to
Row Buffer Side Attacks.
Comment on concurrency
Peter
Pessl, Daniel Gruss, Clémentine Maurice and Stefan Mangard released a paper
called “Reverse Engineering Intel DRAM Addressing and Exploitation” you can
find it here: http://arxiv.org/abs/1511.08756. This paper hinges around the exact
same idea that I present in my blog post and was released almost simultaneously.
Daniel Gruss and I were discussing the flush+flush paper prior to it’s release
when I mentioned my work on RBSC attacks. He explained that he and others where
working in the same direction and we agreed to work concurrently on our own,
but synchronize our publication towards the end of November. I received a draft
of their paper hours before they published. The only relevant influence I’ve
had on their work is reporting a typo and properly my concurrent work moved up
their schedule. They attack the micro architecture problems I’ve been skirting
around in this blog head on which is pretty cool. I would’ve loved to play with
those issues if I’d had the time. Interestingly their work might be interesting
for my current “pet project”.
Introduction
Cache side
channel attacks has resonantly had a renaissance in the literature due to the
emergence of a 3rd Level cache that is shared between cores on a system.
However since the dawn of time a feature of DRAM has been available for side
channel attacks very similar to CSC: The row buffer. I’ll make the case here
that row buffer is usable for side channel attacks. I strongly recommend you
read my “Cache Side Channel Attacks” blog post before proceeding, because there
is a lot of cross over between how Cache Side Channel attacks work and how a
row buffer side channel attack could work. I will not spend much time on these
issues in this post and generally refer you to my older post. My actual PoC
code for an attack is not public. As usual I do not make public offensive
security resources.
Before
anybody gets too exited the practical importance of a row buffer side channel
attack is probably very limited. The reason for this is that the emergence of
the 3rd Level cache has in many ways made it obsolete and the presence of the
3rd level cache makes the attack impractical because the effect of the cache
will tend to be in the way. Thus the practical importance of such an attack is
thus limited to corner cases or computers prior to the introduction of big
caches. From a theoretical perspective, I’ll argue that my tiny effort has been
worthwhile. Also my guess is that noise problems will significantly hamper it’s
use. However the attack my very well scale to other platforms. This is because
what I utilize here is a feature of DRAM and not a feature of the processor.
This means potentially any device using DRAM could be vulnerable to this kind
of attack.
Background
Current
DRAM comes in modules called DIMM’s. If you buy a modern memory module for your
PC, you’re buying a DIMM. If you look at the DIMM most DIMMs will have chips on
both sides. Each side of the DIMM is a rank. Each rank again consists of a
number of banks. The banks are typically the physical individual chips you can
see. Inside a bank you’d find a two dimensional
matrix of memory cells .There are 32k rows in the matrix and 16k or 512k cells
per row. Each cell stores one bit and
consists of a transistor for control and a capacitor which stores charge to
signify bit is equal to 1 and no charge when bit is equal to 0. Thus a row
stores 8kb or 64kb of data depending on the exact kind of DRAM you you have in
front of you. When you read from DRAM an entire row is always read at
once. Reading a row means checking if
the capacitor is charged or not. Since this operation inevitably discharges the
capacitor, DRAM first reads the entire row into a buffer and then returns the
read out values to the bus from here, then the entire row is written back into
the matrix to renew the charge. However if two reads to the same row occurs
there is no need to fetch the memory from the matrix the second time and thus
the row buffer works as a cache (This is why row hammering requires reading
from 2 rows within a bank to be successful). Such a row buffer is either 8kb or
64kb – obviously the same size as the row. For a more accurate description I
refer you to [2]Wikipedia.
How does my RBSC work
Just like a
standard cache we can use this to gain information about the memory access
patterns of another process. The obvious requirement is that we need to share
row buffers with the victim. While technically we can share a row buffer with
somebody without sharing memory and in fact it’s bound to happen to some degree
since page size is smaller than the row size and the operating system allocates
memory through paging. I shall not discuss this however. The reason is that to
utilize this for an attack requires knowledge of the micro architecture of the
processor – in particular physical address mapping. Mark Seaborn[1] has done
some work on this and going further in this direction is way beyond my time
schedule – no way I can research this in less than 24 hours and I don’t write
blog posts that take longer than this. Also it’s likely that we need an
additional information leak to really leverage RBSC in this fashion. I may get
back to this in another blog post, but don’t count on it. This leaves me with
spying on memory access to memory that is shared between victim and attacker.
Unfortunately there is plenty of shared memory between processes – typically
the operating system avoids loading shared modules twice and instead just maps
a single instance of physical memory logically in all processes that use it.
Even virtual machines often share memory. Thus my attack has similarities to
the “flush+reload” cache side channel attack. Where flush+reload CSC’s work on
cache lines and the attacker is able to determine access within an 64 byte
cache line, with row buffer side channel attacks we’ll only be able to tell
about a 8/64kb row buffer. The mechanics of the attack I’ve written up is
1) First read tons of memory to
statistically make sure another row is in the row buffer. This could surely be
done in a more efficient fashion, but again that would require knowledge of
micro architecture – however if you read the original row hammer paper you’d
know that my approach has very a very good chance of working correctly. See
Seaborn & Dullien(2015) for more. To be specific I read 200 times from a
random location in a large allocated buffer to give me a good chance of having
read from another row in the bank I’m interested in. Now I theoretically could
just do a normal read operation, but I think you’d have less noise if you make
sure the memory you’re trying to read isn’t in the cache. Therefore I’ve added
a CLFlush instruction prior to. Also since we need a baseline memory access timing I use this phase
of the attack to time the memory access using the RTDSC instruction. Because
modern processors are capable of executing out of order to make the most of
waiting for the memory it’s important that timing operating is bracket in the
mfence instruction to avoid reordering. It is important that the timing for
your baseline is the same address which you wish to attack for. The reason is,
If the address we use to do our baseline with is on a different core than the
one we wish we to attack, querying the L3 cache has a different timing due to
the ring buffer .
2) Finally I time how long it takes to
read the memory I'm interested in as an attacker. A fast read and the victim used
this row – that is the data was returned from the row buffer. A slow read and
the victim did not use this row – the data got returned from the matrix. If a
read was fast or slow is measured against the benchmark from 1. Here I also use
the clflush instruction prior to make sure the memory read does not arrive from
the cache as well as using the mfence instruction to protect against
reordering.
Now there
is lots of noise in this attack. All
though there is a measurable difference between the timing of buffered and
unbuffered reads, the difference is very small compared to a traditional
flush+reload attack. And on modern multi core computers, you always have
“noise” in execution times (especially on memory) leading to frequent
misidentification Another reason is that the chance another non-victim process
uses another row is much, much larger than another core causing a cache
eviction (cache has 14/16 64 byte ways - a bank 32k rows of 8kb memory) or even uses the same cache line as the victim.
And finally another core using the same DIMM (ram module) while you are polling
will cause delays in excess of the saving from using a row buffer as cache. Another potential source of noise is the memory refresh that'll read and rewrite rowbuffers periodically. With so much noise, coarse grained attacks are unlikely to bring significant rates of success. Additionally high levels of memory activity
on other cores is poison to the attack. When running under favorable conditions
and probing I only get around 70% correct predictions on if the attacker used my row buffer or not.
I need to
mention that my way of launching this attack can be improved upon instead of
just statistically making sure that another row buffer is loaded, you could do
it exactly if you had knowledge of the micro architecture. I often observe very
slow reads (factor 3 or more) for other reasons than missing the row buffer. The
amount of time spend however clearly indicates that we’re seeing noise and should
disregard the reading both in finding the baseline and while attacking. Likely
cause for these long waiting times is that we are forced to wait for another
core’s request to access memory from the DIMM.
I said in
the introduction that this attack vector probably isn’t that important on
modern computers. The reason is that the 3rd level cache is bound to pick up on
any memory operation occurring and thus the second read that is required is
likely to never make it to DRAM. Now we can work around this using CLFLush or
equivalent means – this is what I did as described above. But why would an
attacker bypass a better side channel to use a worse and in the process leave
behind all the hallmarks of a true CSC attack? Never-the-less there might be
corner cases where RBSC attacks could provide a viable alternative to CSC
attacks.
Mitigation
I think the
solution to RBSC channel attacks has to be a (micro?) architectural one. But
since we probably aren’t going to get one we may wish to consider software mitigations.
Mostly the same methodology for mitigation would apply to row buffer attacks as
to cache attacks. First on modern computers with a L3 cache we need to flush
cache to make it work and this makes my performance counter detection a viable
solution for high frequency polling. To access the row buffer instead of the
cache, the attacker must cause a cache miss - and typically where we'd normally
expect a hit. Another good solution would be making the timer less accurate as
I’ve suggested for cache side channel attacks, which would make it virtually
impossible to distinguish between row buffer reads and normal reads.
Conclusion
From a
theoretical point of view RBSC is a lesson into how apparent beneficial design
decisions such as using the row buffer as a cache can quickly turn into a
security issue. The row buffer is required in the DRAM scheme – however it’s
cache function probably isn’t (I’m no DRAM expert), but seems like a great idea
performance wise. As soon as we share resources across trust boundaries there
will always be a potential for side channels. It also serves as a warning that
fixing cache side channel attacks in micro architecture in the processor
requires that thought is given to RBSC
attacks. That being said RBSC attacks seem unlike to be a real world threat – at least of the x86 architecture.
However it might be quite interesting for other architectures.
Ending on a speculative
note, CATs and more advanced attacks
This part
was really added long after I'd finished up the above text. So consider this a
last minute addition.
More advanced attacks
The first
modification we can do to the attack above is to look not at what is in the row
buffer, but what isn't. If we in the above attack prime the row buffer with an
attacker memory location, checking to see if the victim removed it gives us
information about his activities. But this isn't much information because we'll
only know if the victim used the same bank or not. In essence knowing that the
victim used any one of 32k 8k buffers isn't too valuable.
But we can
improve on this. The details are micro architecture specific and I'll stick to my
wife's I3 Sandy Bridge dual channel architecture here. My guess is that other architectures will be
different, but have enough similarities for the basics to apply. Sometimes the
attacker shares a row with a victim without actually sharing memory and we can utilize
this to gain more information. First the operating system allocates memory to
processes on a per page basis and
security attributes too is on a per page basis. However a row is shared across
pages and thus potentially across privileges and we can use this to spy.
Typically there is parts of 4 pages in a row buffer. The first part of that
number is that the row typically contains 8kb while standard (small pages) are
typically 4kb. This accounts for two pages. Finally we know the first 12 bits
of a virtual address is identical to the corresponding physical address. Also
we know from Seaborn(2015) that bit 6 of a physical address is used to select
which DIMM should be used. This is probably implemented so by intel so that memory
requests can be handled in parallel by the two DIMM's. When the computer in
question has two DIMM's (memory modules) each page is split across two DIMMs
and thus two rows. Hence each row contains memory from 4 different pages. If
the attacker is in a position to cause
behavior in the victim he can map this behavior to row activity by timing and
eventually use this information to spy on activity not caused by him. It is
obviously a far more difficult attack to launch and I've not actually tried to.
CAT and the future of
cache attacks
In
Intel[3] chapter 17.15 the “cache allocation technology” (CAT) of “future” (I
think this means Broadwell and beyond, but I'm not sure) Xeon processors is
vaguely described. Unfortunately I don't own or have access to a "future
Xeon processor" and thus anything below remains pure speculation. The CAT allows the supervisor or hypervisor
to partition the cache and select which processes should have access to which
partitions. It is build in to give hypervisors and supervisors a way to provide
important processes with better performance by giving them better access to
cache. Without partitions processes compete for cache on equal terms, CAT is
supposed to tip the balance in favour of those who the OS think needs it. The documentations that I have state that only
the L3 cache is currently supported. The way it works is undocumented (or I
don't have/understand the documentation), but it is hinted that current
implementations is partitioning over ways. Which sounds like a easy and
sensible method. The current process is assigned a two bit mask in an MSR
called a COS value. Based on this COS value a so called capacity bit mask is
fetched from another MSR. I assume that for each bit set in this MSR
corresponding ways of each cache set is accessible for the current memory
access operation . The idea is that the operating system sets up the bitmask on
boot (default is everything is shared) and for each process/thread/what ever
sets the appropriate COS value. This all could be potentially be used to give 4
different users access to each a private L3 cache depending on how it’s
actually implemented. This could be a
nice flexible (leaving performance/security trade off to the OS implementation)
micro-architectural solution to the cache side channel attacks on L3 caches –
should it work.
I see two
likely scenarios for the actual implementation of CAT. The first an simple is a
hard partition of the cache. What I mean by this is that memory placed in a
cache partition only available to a COS=0 process will come from physical
memory if a COS=1 process should try to access it (or from COS=1's own
partition of memory). Further a COS=1 process could not possible evict or flush
memory from the COS=0's partition. In this scenario it's likely that all cross
COS attacks on the 3rd level cache is impossible when the bit makes are set up
appropriately. Obviously even though the cache is hard partition and attackers
cannot either "prime" or "load" across partitions
boundaries, instructions may still leak information about cache state and allow
some attack possibilities.
The second
scenario I call the "soft partition scenario". I find more likely
this scenario more likely. In this scenario only the second part would hold
true. Any memory access regardless of COS would be served from any cache partition
if available. However any eviction or flush decision would only be affected only
by cache events initiated with appropriate COS access. It would allow any
amount of "probing" however once a cache entry has been made by the
victim, the attacker could no longer flush it on his own. Thus flush or
evicting by an attacker becomes a race condition instead of a sure thing. So L3
cache attacks would lose a lot of power, but not generally be impossible. I
find this scenario more likely for two reasons. The first is that in this
scenario no new cache coherency between partitions would have to be enforced. Secondly
it allows synergies between COS 's to be leveraged for better performance.
It is
however easy to imagine other scenarios on how CAT could be implemented and
implementation details matters very much. There is work to be done here.
It might
also be valuable to look into chapter 17.14 ( cache monitoring) for CSC
mitigation detection techniques in the future.
Certainly
CAT will have a profound impact on RBSC too. The impact pushes RBSC's in two
directions first the problem with by passing the cache for another COS becomes
impossible in the first scenario and becomes a race condition in the second.
However CAT does not appear in any implementation to make RBSC unfeasible and
this is likely to make it a more viable options for attackers.
Acknowledgement
In this
place I must acknowledge Daniel Gruss who motivated me to keep on working on
this blog post, when it seemed like it'd be too much effort to finish. Also my
thanks should go out to Jacob Soo whose comments improved the text. All errors are
mine.
Litterature
[1]Seaborn(2015):” How physical addresses map to rows and banks in DRAM”;[2]:Wikipedia(XXXX): Dynamic random-access memory;https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Operations_to_read_a_data_bit_from_a_DRAM_storage_cell
http://lackingrhoticity.blogspot.de/2015/05/how-physical-addresses-map-to-rows-and-banks.html
[3]Intel(2015): “Intel® 64 and IA-32 Architectures, Software Developer’s Manual,Volume 3 (3A, 3B & 3C):,System Programming Guide”
[4]Daniel
Gruss, Clémentine Maurice & Klaus Wagner(2015): “Flush+Flush: A Stealthier
Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf