Introduction
The final version of the slides for our talk is here.
As some might have noticed, I had the honor of speaking at Black Hat 2015 in Las Vegas with my old friend Nishad Herath. The topic of our talk was “These Are Not Your Grand Daddy’s CPU Performance Counters”. More specifically, our talk was about using the Intel performance counters (PMCs) for defensive security purposes. Unfortunately the talk didn’t go as well as I’d wanted it to, and I certainly don’t think I conveyed what I wished to. We also made some last-minute changes to the slides in the presentation. Hence this blog post with the updated slides and my personal comments on them.
I apologize for the format of this blog post, but I really want to get it done and move on. The majority of these comments were written down as notes that I had with me on stage. A few additions have been made for this blog post and I’ve made an effort to make them readable to people other than myself. These notes were obviously heavily inspired by discussions with Mr. Herath, so I have to share any credit with him. Nevertheless they were my notes and any errors are mine alone.
The story
The story of how I got to speak at Black Hat is interesting in itself (I think) and it certainly did have an impact on the talk. If this doesn’t interest you, skip straight to the next header. Originally it was Mr. Herath’s talk and I was invited to join late in the process. It wasn’t until the 25th of June that things became official and I started really believing that it would work out. This left just 5 weeks of preparation time for me, and at that point I didn’t even know that the slides were due on the 20th of July. So the big rush started on my side to get up to date on the details of performance counters, what other people had done with performance counters, etc. It didn’t make it easier that Mr. Herath was in an entirely different time zone. Also worth mentioning is that we’d been asked to spend a significant amount of our time on row hammer. Personally I would rather have spent my time elsewhere, given that Mark Seaborn and Halvar Flake were already giving a talk on row hammer and they know much more about it anyway. I found it especially silly that, with two talks touching on row hammer, they ended up being scheduled in the same time slot.
While we were working on slides, the big speculation was that row hammer was doable in JavaScript, and we wrote up slides in line with this speculation while working frantically to figure out if/how it could actually be done. I succeeded in flushing the cache fast enough without clflush (only on Sandy Bridge) the night before Daniel Gruss and Clémentine Maurice published their (really nice) rowhammer.js paper, which obviously turned our slides over. No more speculation: row hammer in JS was fact.
We had added cache side channel attacks because I’d noticed that my CSC code lit up my row hammer detection as well. It was always meant as a stop-gap in case we didn’t use up our time with ROP, row hammer and rootkits, and I saw it as a way to get some fresh meat on the table. Just a few days before Black Hat a W3C JavaScript draft came to my attention. In this draft they wanted to make the timer less accurate (3 µs) in response to “The Spy in the Sandbox”. This rang a bell in my head and turned everything upside down in the CSC slides: from having something that only worked reasonably when polling the side channel at a very high frequency, we went to something where we could actually do a fair detection at much lower frequencies. This caused the last-minute changes to this section, and as you may notice they are pretty severe on the low-frequency stuff. Now I don’t pretend that the CSC changes are the final word on cache side channel attacks and performance counters. I think they are enough to show that PMCs can be relevant to cache side channel attacks. If you’re interested in this subject, I’m writing another blog post to elucidate my findings more accurately than the slides were ever intended to.
It is such fun to work with stuff that is still in motion, but it does throw you through some hoops when you are on a tight deadline.
The final event that really turned things upside down was that Mr. Herath’s laptop (which we would use for the presentation) gave up, and we had to spend the morning before our talk (the only half day of actually being physically in the same location before the talk) hunting down a new laptop and agreeing on the last-minute slide changes. Thanks to an anonymous benefactor who lent us a laptop to work on while we were still in pursuit of one for the presentation. Thus I went to speak at the Black Hat conference without ever having seen a conference talk, and not with the best of preparations either. I don’t think it was a disaster, but I should’ve done a lot of things better. I learned a lot and I’ll do better next time, if there will be one. As a kind soul remarked, it takes a few iterations to get right. Though I agree with this, I had certainly set the bar higher for myself. The real error, I think, was insufficient preparation on my part and too much detail and too much material.
My comments on the slides
Performance counters count different micro events in the CPU. These micro events tell a lot about the code running on the CPU, and since malicious low-level code often behaves differently than “normal” code, the performance counters can be used as a signal to look for malicious code. As an example, ROP code causes excessive “return misses”, because the “ret” instruction is not used as the pair of a call, but rather to progress along a gadget chain. Unfortunately events are plentiful, so there is a lot of noise in the signal too. However, there are ways to harness this noise and make useful inferences about the presence of malicious code from the performance counters. The talk, in my opinion, can be summarized into one statement: performance counters can be very useful for security purposes in a great many ways. My wishful thinking in this regard would be that these things made their way into security products, that Microsoft got their stuff together and made a decent interface for using PMCs under Windows, and that Intel would consider adding security related events to the PMCs instead of just stuff for optimizing their CPU or your code.
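To make the ROP example a bit more concrete, here is a minimal sketch (not code from the talk) of the kind of check one could do: sample a counter around a sensitive call and flag an unusually large delta in return misses. It assumes a ring-0 component has already programmed counter 0 for a return-misprediction event and that user-mode RDPMC is enabled; the counter index and threshold are purely illustrative.

/* Minimal sketch: flag a suspicious burst of "return misses" around a
 * sensitive call. Assumes a ring-0 component has already programmed
 * general-purpose counter 0 to count mispredicted returns, and that
 * CR4.PCE is set so RDPMC works from user mode. */
#include <intrin.h>
#include <stdio.h>

#define RET_MISS_COUNTER   0    /* which PMC to read - assumption        */
#define RET_MISS_THRESHOLD 64   /* hypothetical alarm threshold          */

static void sensitive_api_call(void)
{
    /* stand-in for e.g. a wrapper around VirtualProtect/CreateProcess */
}

int main(void)
{
    unsigned __int64 before = __readpmc(RET_MISS_COUNTER);
    sensitive_api_call();
    unsigned __int64 after = __readpmc(RET_MISS_COUNTER);

    if (after - before > RET_MISS_THRESHOLD)
        printf("suspicious: %llu return misses around call\n", after - before);
    return 0;
}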
Slide 24
I think it’s worth noting that I did the majority of my work with PMCs on Windows XP 32-bit. The reason for this was that I had such a system available to me and that it let me avoid driver signing and PatchGuard issues. Most other work has been done on Linux derivatives such as Debian, which have a much nicer interface to PMCs than Windows. There are of course ways around PatchGuard, but it wouldn’t be nice to pursue them. Also there might be undocumented HAL APIs, such as HalpSetSystemInformation() on WinXP, to hook into the PMI. For non-PMI use there is an API on newer systems, on a per-process basis, for user mode. Visual Studio 12 comes with horrible tools for using that API. I’m not aware of any documentation, but you can reverse engineer those tools; they are not too big. Sorry, I don’t have much information on this. Maybe I’ll look into it... no promises though.
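As a point of comparison, here is a minimal sketch of the much nicer Linux-side interface mentioned above, using perf_event_open() to count a hardware event from plain user mode. The event choice and error handling are kept deliberately simple.

/* Minimal sketch of the Linux perf_event_open() interface: count hardware
 * branch misses over a region of code from unprivileged user mode. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;      /* ring 3 only, cf. slide 30 */
    attr.exclude_hv = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under observation goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("branch misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}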
Slide 30
This slide is, to me, our key slide. Not only does it show how you can minimize the performance impact of using performance counters to look for malicious activity, it also gives some basic ideas on how you can manage the noise problem with performance counters. The basic idea is that we can often, from other analysis, say something about the conditions under which malicious code will run and where it won’t. The simple example is that row hammering is not very useful if the attacker is already running code in ring 0; thus we need not examine events in ring 0 to detect row hammering. Of our 4 examples of detecting/mitigating malicious activity, all of the methods use some aspect of the methodology in this slide.
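To illustrate the ring-filtering idea, here is a minimal sketch (not the code behind our detection) of how a kernel-mode component might program a general-purpose counter to count the architectural “LLC misses” event in user mode only, with the USR bit set and the OS bit left clear. MSR numbers and bit layout follow the Intel SDM; the surrounding driver plumbing and per-core setup are omitted.

/* Sketch: program IA32_PERFEVTSEL0/IA32_PMC0 to count LLC misses in ring 3
 * only. Must execute at ring 0 (e.g. inside a driver); per-core setup and
 * PMI routing are omitted. */
#include <intrin.h>

#define IA32_PMC0        0xC1
#define IA32_PERFEVTSEL0 0x186

/* Architectural event LONGEST_LAT_CACHE.MISS: event 0x2E, umask 0x41. */
#define EVT_LLC_MISS_SEL ((0x41ULL << 8) | 0x2EULL)

#define PERFEVTSEL_USR   (1ULL << 16)   /* count in ring 3              */
#define PERFEVTSEL_OS    (1ULL << 17)   /* count in ring 0 (left off)   */
#define PERFEVTSEL_EN    (1ULL << 22)   /* enable the counter           */

void EnableUserModeLlcMissCounting(void)
{
    __writemsr(IA32_PMC0, 0);  /* clear the counter */
    __writemsr(IA32_PERFEVTSEL0,
               EVT_LLC_MISS_SEL | PERFEVTSEL_USR | PERFEVTSEL_EN);
}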
Slide 35
Ret-prediction is terrible around task switches, because the CPU’s return shadow stack does not get swapped while the real stack does. The 400k bad-case scenario on slide 29 was for “for (int i = 0; i < 300; i++) Sleep(100);”, which causes lots of task switches. Thus around task switches we’d see heavy performance penalties, as well as problems with the limited depth of the LBR. kBouncer essentially gets around the limited depth of the LBR (which records only the last 16 branches) by instrumenting the code at targets of interest around API calls. This can be seen as the instrumentation method of slide 30. Interestingly, the first workaround from an attacker’s point of view was to use the still limited depth of the LBR to bypass kBouncer. However, traditional performance counters could be used with instrumentation as well, and then the depth would be limited only by acceptable performance loss and memory. The implementation would be slightly more complex than kBouncer (because we’d need instrumentation to start and stop collection), but we are well within the realm of possibility.
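For context (and not code from the talk), the kind of test kBouncer performs on each LBR entry is roughly “does this return target sit just after a call instruction?”. A much-simplified sketch of such a call-preceded check, handling only two common x86 call encodings, might look like this:

/* Simplified sketch of a kBouncer-style "call-preceded" check: a legitimate
 * ret should land right after a call instruction. Only the near relative
 * call (E8 rel32, 5 bytes) and the shortest indirect call (FF /2, 2 bytes)
 * encodings are handled; a real checker must cover far more cases and must
 * read the bytes safely. */
#include <stdint.h>

int is_call_preceded(const uint8_t *ret_target)
{
    /* E8 xx xx xx xx : call rel32 */
    if (ret_target[-5] == 0xE8)
        return 1;

    /* FF /2 : indirect call through a register (modrm reg field == 2) */
    if (ret_target[-2] == 0xFF && ((ret_target[-1] >> 3) & 7) == 2)
        return 1;

    return 0;
}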
Slide 37
We have essentially the same problem as in slide 35 for kBouncer, however the solution is different. Georg Wicherski limits himself to ring 0 and gets around the problem of task switching that way.
Slide 48
I wished to
make the point that row hammer kinds of problems are likely to increase in the
future as DRAM grows increasingly dense. It also means that the problem is
likely to see hardware fixes in the future.
Slide 49
This slide can serve as a way of sorting the following row hammer mitigation methods into categories. It should be noted that having two physical rows in the same bank is required even for single-sided hammering, because of the row buffer. The slide gives a false impression on this point.
Slide 52
Even though it is not good enough on its own, it might be a useful supplement to other methods, including ours.
Slide 53
The performance measure is on memory access only. It doesn’t map directly to system performance, because the Intel CPU is able to rearrange instructions while waiting for memory. This will tend to make the system-level loss smaller. On the other hand, memory speed is very often the bottleneck for performance, dragging things in the other direction. I think the latter effect will dominate, but I have no evidence to back this belief.
The answer to the last question is a definite no, though it might be enough for some RAM.
Slide 54
It’s my guess that power consumption will not be relevant. I found two different sources painting different pictures, but I think it would make a difference in sleep, and since we cannot row hammer in sleep, the point is moot. While the system is awake, other components of a laptop should draw far more power than RAM. So if implemented right, it should be a non-issue.
Slide 52+55+56
All three are mitigation through faster refresh as on slide 49. PARA and pTRR are more targeted refreshes, to avoid the steep penalties of refreshing the entire RAM more often. The latter two methods seem to me to be the “right” solution: essentially row hammer is a microarchitecture problem and it should be fixed in microarchitecture too, and I consider that likely to be done (see also the slide 48 comments on RAM growing more dense). However, that requires that people get new hardware.
Slide 60
I’ve not been able to do row hammer with non-temporal instructions without flushing the cache by other means first. It seems that using these instructions from JavaScript is very difficult, because JIT compilers don’t generate them. (There was a paper on this; if you’re interested, ask me and I’ll find it.) There might be other reasons why Intel should consider making CLFLUSH a privileged instruction: cache side channel attacks. Additionally, there is little legitimate use for CLFLUSH beyond optimizing cache usage to speed up code, which seems rather difficult to do anyway. Again, making CLFLUSH privileged does not solve the problem, but it doesn’t exactly hurt either.
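For reference, the reason CLFLUSH matters here is that the classic hammer loop has to push its two aggressor addresses out of the cache between accesses so that every read actually reaches DRAM. A minimal sketch of that well-published loop follows; picking two aggressors that map to different rows in the same bank is the hard part and is not shown.

/* Sketch of the classic clflush-based hammer loop: read two addresses that
 * map to different rows in the same DRAM bank, flushing them from the cache
 * each round so every access goes to DRAM. Address selection is assumed. */
#include <emmintrin.h>
#include <stdint.h>

void hammer(volatile uint8_t *addr1, volatile uint8_t *addr2, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        (void)*addr1;                      /* activate row 1                 */
        (void)*addr2;                      /* activate row 2                 */
        _mm_clflush((const void *)addr1);  /* evict so the next read is DRAM */
        _mm_clflush((const void *)addr2);
    }
}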
Slide 68
In addition to mitigation through delay, we could try to figure out the victim row and read from it to trigger a refresh (a row is automatically rewritten on read). Also interesting here is that we use a rarely triggering interrupt (say every 1000 LLC misses) to trigger the costly slowdown. It’s an example of using a rare event, as on slide 30, to trigger more costly analysis. (A totally irrelevant note that I didn’t know where else to put: at one point I used ret-miss events to alert me of process switching while playing around, instead of hooking my way into getting alerted on task switches.)
Performance cost analysis: a normal LLC miss costs around 200 ns and an interrupt around 500 ns, so triggering an interrupt every 1000 misses costs only ~2.5% performance, and 1000 is a really low number for row hammer; 100000 would probably be more appropriate.
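To make the “interrupt every N misses” mechanism concrete: the usual trick is to preload the counter with -N so that it overflows, and raises a PMI, after N events. A rough sketch continuing the slide 30 example follows; the APIC performance-counter LVT setup, overflow-status clearing and the PMI handler itself are all omitted, and some CPUs need the full-width counter alias MSRs for large preloads.

/* Sketch: arm PMC0 so a PMI fires after roughly `threshold` LLC misses, by
 * preloading the counter with -threshold and setting the interrupt bit.
 * Ring 0 only; APIC LVT programming and the PMI handler are not shown. */
#include <intrin.h>

#define IA32_PMC0        0xC1
#define IA32_PERFEVTSEL0 0x186

#define EVT_LLC_MISS_SEL ((0x41ULL << 8) | 0x2EULL) /* LONGEST_LAT_CACHE.MISS */
#define PERFEVTSEL_USR   (1ULL << 16)
#define PERFEVTSEL_INT   (1ULL << 20)   /* raise a PMI on counter overflow */
#define PERFEVTSEL_EN    (1ULL << 22)

void ArmLlcMissPmi(unsigned long long threshold)  /* e.g. 1000 or 100000 */
{
    /* Preload with -threshold so the counter overflows after `threshold`
     * events. Writes through IA32_PMC0 are 32-bit and sign-extended on
     * many parts; very large thresholds may need the full-width alias. */
    __writemsr(IA32_PMC0, (unsigned long long)(-(long long)threshold));
    __writemsr(IA32_PERFEVTSEL0,
               EVT_LLC_MISS_SEL | PERFEVTSEL_USR | PERFEVTSEL_INT | PERFEVTSEL_EN);
}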
Slide 71
Essentially the rootkit detection here is detecting code executing out of bounds, that is, outside of a known white list. Using the probabilistic element of slide 30 to keep the performance cost down, while staying in ring 0 only, makes it feasible to have a white list of known good code. I dislike using the word heuristic from slide 30 in this sense: if an interrupt triggers on this code, there is no doubt the code executed; however, nobody says that executing the code will trigger a PMI (thus probabilistic). Finally we hint that using instrumentation around particularly vulnerable code could increase the true positive rate enough to get a chance at finding traditional exploits doing their work.
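A sketch of the core check (again, not the code from the talk): when a PMI arrives, compare the interrupted instruction pointer against a sorted white list of ranges of known good ring-0 code and flag anything outside it. The ranges below are placeholders; populating the list (e.g. from the loaded module list) and the surrounding PMI handler are assumed.

/* Sketch: is the interrupted instruction pointer inside the white list of
 * known good code ranges? The ranges are placeholders; building the list
 * and the PMI handler around this check are not shown. */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t start, end; } code_range_t;

/* sorted by start address, non-overlapping */
static const code_range_t g_whitelist[] = {
    { 0xFFFFF80000000000ULL, 0xFFFFF80000400000ULL },  /* placeholder range */
    { 0xFFFFF88000000000ULL, 0xFFFFF88000200000ULL },  /* placeholder range */
};

int rip_is_whitelisted(uint64_t rip)
{
    size_t lo = 0, hi = sizeof(g_whitelist) / sizeof(g_whitelist[0]);
    while (lo < hi) {                       /* binary search over the ranges */
        size_t mid = lo + (hi - lo) / 2;
        if (rip < g_whitelist[mid].start)
            hi = mid;
        else if (rip >= g_whitelist[mid].end)
            lo = mid + 1;
        else
            return 1;                       /* inside a known good range     */
    }
    return 0;                               /* out of bounds: suspicious     */
}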
Slide 86 & Cache side channel attacks in general
I’ll just note that we’re using instrumentation (slide 30 again) of the fine-grained timer, in addition to making it less fine-grained, to force behavior that is less likely to occur naturally and thus make the attack detectable. It is in many ways a heuristic approach. I should note too that false positives are possible, and you’d have to measure your response in that respect, e.g. make the timer even more inaccurate, flush the cache, trigger copy-on-write, etc. What should also not be forgotten: this is intended to show that PMCs can be relevant for cache side channel attacks. It is not the final word, as the attacker does have options to evade, and I’ll get back to that when I finish my next blog post. On the good side, though, there are more defensive possibilities too... To be continued...