Errata regarding CAT
Yuval Yarom (University of Adelaide) was kind enough to email me regarding my speculation on CAT changing the cache side channel game and its impact on row buffer side channels. Before I proceed with what Yuval had to say I'd like to recommend a paper Yuval recently co-authored called "Amplifying Side Channels Through Performance Degradation". You can find it here: http://eprint.iacr.org/2015/1141.pdf
Yuval corrected me on some points and verified my thinking on others. Below is my summary of his comments:
1) CAT is available on some Haswell processors. E.g. Xeon 2618L v3.
2) CAT partitions on ways.
3) CAT only controls evictions. If a process can access memory CAT will not intervene with writing, reading or flushing from the L3 cache.
4) Thus CAT cannot protect against attacks that rely on clflush in particular: flush+reload and flush+flush. It could be used to protect against attacks that rely on eviction such as prime+probe.
5) With only 4 COS values it will be of limited use in many cloud use cases.
6) If you wish to defeat clflush based attacks the approach should be to avoid sharing memory.
I did not foresee that CAT would only control evictions and not flushes. Yuval's point on CAT not protecting against flush+reload and flush+flush extends to row buffer side channel attacks as well.
Comment on concurrency
Peter Pessl, Daniel Gruss, Clémentine Maurice and Stefan Mangard released a paper called "Reverse Engineering Intel DRAM Addressing and Exploitation"; you can find it here: http://arxiv.org/abs/1511.08756. This paper hinges on the exact same idea that I present in my blog post and was released almost simultaneously. Daniel Gruss and I were discussing the flush+flush paper prior to its release when I mentioned my work on RBSC attacks. He explained that he and others were working in the same direction, and we agreed to work concurrently on our own but to synchronize our publication towards the end of November. I received a draft of their paper hours before they published. The only relevant influence I've had on their work is reporting a typo, and probably my concurrent work moved up their schedule. They attack head on the microarchitecture problems I've been skirting around in this blog, which is pretty cool. I would've loved to play with those issues if I'd had the time. Their work might also prove useful for my current "pet project".
Cache side channel attacks have recently had a renaissance in the literature due to the emergence of a 3rd level cache that is shared between cores on a system. However, since the dawn of time a feature of DRAM has been available for side channel attacks very similar to CSC: the row buffer. I'll make the case here that the row buffer is usable for side channel attacks. I strongly recommend you read my "Cache Side Channel Attacks" blog post before proceeding, because there is a lot of crossover between how cache side channel attacks work and how a row buffer side channel attack could work. I will not spend much time on these issues in this post and generally refer you to my older post. My actual PoC code for an attack is not public. As usual I do not make offensive security resources public.
Before anybody gets too excited: the practical importance of a row buffer side channel attack is probably very limited. The reason is that the emergence of the 3rd level cache has in many ways made it obsolete, and the presence of that cache makes the attack impractical because the effect of the cache will tend to be in the way. The practical importance of such an attack is thus limited to corner cases or computers prior to the introduction of big caches. From a theoretical perspective, I'll argue that my tiny effort has been worthwhile. Also, my guess is that noise problems will significantly hamper its use. However, the attack may very well scale to other platforms, because what I utilize here is a feature of DRAM and not a feature of the processor. This means potentially any device using DRAM could be vulnerable to this kind of attack.
Current DRAM comes in modules called DIMMs. If you buy a modern memory module for your PC, you're buying a DIMM. Most DIMMs have chips on both sides; each side of the DIMM is a rank. Each rank again consists of a number of banks. The banks are typically the physical individual chips you can see. Inside a bank you'd find a two-dimensional matrix of memory cells. There are 32k rows in the matrix and 16k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify a bit equal to 1 and no charge for a bit equal to 0. Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM you have in front of you. When you read from DRAM an entire row is always read at once. Reading a row means checking whether its capacitors are charged or not. Since this operation inevitably discharges the capacitors, DRAM first reads the entire row into a buffer, returns the read-out values to the bus from there, and then writes the entire row back into the matrix to renew the charge. However, if two reads to the same row occur, there is no need to fetch the memory from the matrix the second time, and thus the row buffer works as a cache (this is why row hammering requires reading from 2 rows within a bank to be successful). Such a row buffer is either 8kb or 64kb – obviously the same size as the row. For a more accurate description I refer you to Wikipedia.
How my RBSC attack works
Just like with a standard cache, we can use this to gain information about the memory access patterns of another process. The obvious requirement is that we need to share row buffers with the victim. Technically we can share a row buffer with somebody without sharing memory – in fact it's bound to happen to some degree, since the page size is smaller than the row size and the operating system allocates memory through paging. I shall not discuss this case, however. The reason is that utilizing it for an attack requires knowledge of the microarchitecture of the processor – in particular the physical address mapping. Mark Seaborn has done some work on this, and going further in this direction is way beyond my time schedule – there is no way I can research this in less than 24 hours, and I don't write blog posts that take longer than that. Also, it's likely that we'd need an additional information leak to really leverage RBSC in this fashion. I may get back to this in another blog post, but don't count on it. This leaves me with spying on access to memory that is shared between victim and attacker. Unfortunately there is plenty of shared memory between processes – typically the operating system avoids loading shared modules twice and instead just maps a single instance of physical memory logically into all processes that use it. Even virtual machines often share memory. Thus my attack has similarities to the "flush+reload" cache side channel attack. Where flush+reload CSCs work on cache lines and the attacker is able to determine access within a 64-byte cache line, with row buffer side channel attacks we'll only be able to tell something about an 8/64kb row buffer. The mechanics of the attack I've written up are:
1) First read tons of memory to statistically make sure another row is in the row buffer. This could surely be done in a more efficient fashion, but again that would require knowledge of the microarchitecture – however, if you read the original row hammer paper you'd know that my approach has a very good chance of working correctly. See Seaborn & Dullien(2015) for more. To be specific, I read 200 times from random locations in a large allocated buffer to give me a good chance of having read from another row in the bank I'm interested in. Theoretically I could just do normal read operations, but there is less noise if you make sure the memory you're trying to read isn't in the cache, so I've added a CLFLUSH instruction prior to each read. Also, since we need a baseline memory access timing, I use this phase of the attack to time the memory accesses using the RDTSC instruction. Because modern processors are capable of executing out of order to make the most of waiting for memory, it's important that the timing operation is bracketed by MFENCE instructions to avoid reordering. It is also important that the baseline is timed on the same address you wish to attack: if the baseline address belongs to a different L3 slice (near a different core) than the address under attack, querying the L3 cache has a different timing due to the ring bus.
2) Finally I time how long it takes to read the memory I'm interested in as an attacker. A fast read means the victim used this row – that is, the data was returned from the row buffer. A slow read means the victim did not use this row – the data was returned from the matrix. Whether a read was fast or slow is measured against the benchmark from step 1. Here too I use the CLFLUSH instruction beforehand to make sure the read is not served from the cache, as well as the MFENCE instruction to protect against reordering.
Now, there is lots of noise in this attack. Although there is a measurable difference between the timing of buffered and unbuffered reads, the difference is very small compared to a traditional flush+reload attack. And on modern multi-core computers you always have "noise" in execution times (especially on memory), leading to frequent misidentification. Another reason is that the chance of another, non-victim process using a different row is much, much larger than the chance of another core causing a cache eviction (a cache set has 14/16 64-byte ways – a bank has 32k rows of 8kb memory), let alone using the same cache line as the victim. And finally, another core using the same DIMM (RAM module) while you are polling will cause delays in excess of the saving from using the row buffer as a cache. Another potential source of noise is the memory refresh, which reads and rewrites row buffers periodically. With so much noise, coarse-grained attacks are unlikely to bring significant rates of success, and high levels of memory activity on other cores are poison to the attack. When running under favorable conditions and probing, I only get around 70% correct predictions of whether the victim used my row buffer or not.
I should mention that my way of launching this attack can be improved upon: instead of just statistically making sure that another row is loaded into the row buffer, you could do it exactly if you had knowledge of the microarchitecture. I often observe very slow reads (a factor 3 or more) for reasons other than missing the row buffer. The amount of time spent, however, clearly indicates that we're seeing noise and the reading should be disregarded, both when finding the baseline and while attacking. A likely cause for these long waits is that we are forced to wait for another core's request to access memory on the DIMM.
I said in the introduction that this attack vector probably isn't that important on modern computers. The reason is that the 3rd level cache is bound to pick up on any memory operation occurring, and thus the second read that is required is likely never to make it to DRAM. We can work around this using CLFLUSH or equivalent means – this is what I did, as described above. But why would an attacker bypass a better side channel to use a worse one, and in the process leave behind all the hallmarks of a true CSC attack? Nevertheless, there might be corner cases where RBSC attacks could provide a viable alternative to CSC attacks.
I think the solution to RBSC attacks has to be a (micro?)architectural one. But since we probably aren't going to get one, we may wish to consider software mitigations. Mostly the same mitigation methodology would apply to row buffer attacks as to cache attacks. First, on modern computers with an L3 cache the attacker needs to flush the cache to make the attack work, and this makes my performance counter detection a viable solution for high frequency polling: to access the row buffer instead of the cache, the attacker must cause a cache miss – and typically where we'd normally expect a hit. Another good solution would be making the timer less accurate, as I've suggested for cache side channel attacks, which would make it virtually impossible to distinguish between row buffer reads and normal reads.
From a theoretical point of view, RBSC is a lesson in how apparently beneficial design decisions, such as using the row buffer as a cache, can quickly turn into a security issue. The row buffer is required in the DRAM scheme – its cache function probably isn't (I'm no DRAM expert), but it seems like a great idea performance-wise. As soon as we share resources across trust boundaries there will always be a potential for side channels. It also serves as a warning that fixing cache side channel attacks in the processor's microarchitecture requires that thought be given to RBSC attacks. That being said, RBSC attacks seem unlikely to be a real world threat – at least on the x86 architecture. They might, however, be quite interesting on other architectures.
Ending on a speculative note, CATs and more advanced attacks
This part was added long after I'd finished the above text, so consider it a last minute addition.
More advanced attacks
The first modification we can make to the attack above is to look not at what is in the row buffer, but at what isn't. If we prime the row buffer with an attacker memory location, checking whether the victim removed it gives us information about his activities. But this isn't much information, because we'll only learn whether the victim used the same bank or not. In essence, knowing that the victim used any one of 32k 8kb rows isn't too valuable.
But we can improve on this. The details are microarchitecture specific, and I'll stick to my wife's i3 Sandy Bridge dual channel architecture here. My guess is that other architectures will be different, but have enough similarities for the basics to apply. Sometimes the attacker shares a row with a victim without actually sharing memory, and we can utilize this to gain more information. The operating system allocates memory to processes on a per-page basis, and security attributes are per-page too. However, a row is shared across pages, and thus potentially across privileges, and we can use this to spy. Typically parts of 4 pages end up in one row buffer. The first part of that number is that a row typically contains 8kb while standard (small) pages are typically 4kb; this accounts for two pages. Further, we know the first 12 bits of a virtual address are identical to the corresponding physical address, and we know from Seaborn(2015) that bit 6 of the physical address is used to select which DIMM should be used. This is probably implemented by Intel so that memory requests can be handled in parallel by the two DIMMs. When the computer in question has two DIMMs (memory modules), each page is split across two DIMMs and thus two rows. Hence each row contains memory from 4 different pages. If the attacker is in a position to cause behavior in the victim, he can map this behavior to row activity by timing, and eventually use this information to spy on activity not caused by him. It is obviously a far more difficult attack to launch, and I've not actually tried to.
CAT and the future of cache attacks
In chapter 17.15 of the Intel software developer's manual, the "cache allocation technology" (CAT) of "future" (I think this means Broadwell and beyond, but I'm not sure) Xeon processors is vaguely described. Unfortunately I don't own or have access to a "future Xeon processor", and thus everything below remains pure speculation. CAT allows the supervisor or hypervisor to partition the cache and select which processes should have access to which partitions. It is built in to give hypervisors and supervisors a way to provide important processes with better performance by giving them better access to cache. Without partitions, processes compete for cache on equal terms; CAT is supposed to tip the balance in favour of those the OS thinks need it. The documentation I have states that only the L3 cache is currently supported. The way it works is undocumented (or I don't have/understand the documentation), but it is hinted that current implementations partition over ways, which sounds like an easy and sensible method. The current process is assigned a two-bit value in an MSR called a COS value. Based on this COS value, a so-called capacity bitmask is fetched from another MSR. I assume that for each bit set in this mask, the corresponding ways of each cache set are accessible to the current memory access operation. The idea is that the operating system sets up the bitmasks on boot (the default is that everything is shared) and for each process/thread/whatever sets the appropriate COS value. This could potentially be used to give 4 different users each a private L3 cache, depending on how it's actually implemented. This could be a nice, flexible (leaving the performance/security trade-off to the OS implementation) microarchitectural solution to cache side channel attacks on L3 caches – should it work.
I see two likely scenarios for the actual implementation of CAT. The first and simplest is a hard partition of the cache. What I mean by this is that memory placed in a cache partition only available to a COS=0 process will come from physical memory (or from COS=1's own partition) if a COS=1 process should try to access it. Further, a COS=1 process could not possibly evict or flush memory from COS=0's partition. In this scenario it's likely that all cross-COS attacks on the 3rd level cache are impossible when the bitmasks are set up appropriately. Obviously, even though the cache is hard partitioned and attackers can neither "prime" nor "load" across partition boundaries, instructions may still leak information about cache state and allow some attack possibilities.
The second scenario I call the "soft partition" scenario, and I find it more likely. In this scenario only the second part would hold true: any memory access, regardless of COS, would be served from any cache partition if available, but any eviction or flush decision would only be affected by cache events initiated with the appropriate COS access. It would allow any amount of "probing", but once a cache entry has been made by the victim, the attacker could no longer flush it on his own. Thus flushing or evicting by an attacker becomes a race condition instead of a sure thing. L3 cache attacks would lose a lot of power, but not generally become impossible. I find this scenario more likely for two reasons. The first is that no new cache coherency between partitions would have to be enforced. The second is that it allows synergies between COSes to be leveraged for better performance.
It is however easy to imagine other scenarios for how CAT could be implemented, and the implementation details matter very much. There is work to be done here.
It might also be valuable to look into chapter 17.14 (cache monitoring) for CSC mitigation and detection techniques in the future.
Certainly CAT will have a profound impact on RBSC too. The impact pushes RBSCs in two directions: the problem of bypassing the cache for another COS becomes impossible in the first scenario and a race condition in the second. However, CAT does not appear in any implementation to make RBSC infeasible, and this is likely to make it a more viable option for attackers.
Here I must acknowledge Daniel Gruss, who motivated me to keep on working on this blog post when it seemed like it'd be too much effort to finish. My thanks also go out to Jacob Soo, whose comments improved the text. All errors are mine.
Literature
Seaborn(2015): "How physical addresses map to rows and banks in DRAM";
Wikipedia: "Dynamic random-access memory"; https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Operations_to_read_a_data_bit_from_a_DRAM_storage_cell
Intel(2015): "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B & 3C): System Programming Guide"
Daniel Gruss, Clémentine Maurice & Klaus Wagner(2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf