The final version of the slides of our talk is here
As some might have noticed, I had the honor of speaking at Black Hat 2015 in Las Vegas with my old friend Nishad Herath. The topic of our talk was “These Are Not Your Grand Daddy’s CPU Performance Counters”. More specifically, our talk was about using the Intel performance counters (PMCs) for defensive security purposes. Unfortunately the talk didn’t go as well as I’d wanted it to, and I certainly don’t think I conveyed what I wished to. We also made some last-minute changes to the slides in the presentation. Hence this blog post with the updated slides and my personal comments on them.
I apologize for the format of this post, but I really want to get done with it and move on. The majority of these comments were written down as notes, and I had them on stage with me. A few augmentations have been made for this blog post, and I’ve made an effort to make them readable to people other than myself. These notes were obviously heavily inspired by discussions with Mr. Herath, so I have to share any credit with him. Nevertheless they were my notes, and any errors are mine alone.
The story of how I got to speak at Black Hat is interesting in itself (I think), and it certainly had an impact on the talk. If this doesn’t interest you, skip straight to the next header. Originally it was Mr. Herath’s talk, and I was invited to join late in the process. It wasn’t until the 25th of June that things became official and I started really believing that it would work out. This left just 5 weeks of preparation time for me, and at that point I didn’t even know that the slides were due on the 20th of July. So the big rush started on my side to get up to date on the details of performance counters, what other people had done with performance counters, etc. It didn’t make it easier that Mr. Herath was in an entirely different time zone. Also worth mentioning is that we’d been asked to spend a significant amount of our time on row hammer. Personally I would rather have spent my time elsewhere, given that Mark Seaborn and Halvar Flake were already giving a talk on row hammer and they know much more about it anyway. I found it especially silly that the two talks touching on row hammer ended up being scheduled in the same time slot.
While we were working on the slides, the big speculation was that row hammer was doable in JavaScript, and we wrote up slides in line with this speculation while working frantically to figure out if and how it could actually be done. I succeeded in flushing the cache fast enough without clflush (only on Sandy Bridge) the night before Daniel Gruss and Clémentine Maurice published their (really nice) rowhammer.js paper, which obviously upended our slides. No more speculation: row hammer in JS was fact.
We had added cache side channel (CSC) attacks because I’d noticed that my CSC code lit up my row hammer detection as well. It was always meant as a stop-gap in case we didn’t use up our time with ROP, row hammer and rootkits, and I saw it as a way to get some fresh meat on the table. Just a few days before Black Hat a W3C JavaScript draft came to my attention. In this draft they wanted to make the timer less accurate (3 µs) in response to “Spy in the sandbox”. This rang a bell in my head and turned everything upside down in the CSC slides: from having something that only worked reasonably with a very high frequency of polling the side channel, we went to something where we could actually do fair detection at much lower frequencies. This caused the last-minute changes to this section; as you may notice, they are pretty severe on the low-frequency stuff. Now, I don’t pretend that the CSC changes are the final word on cache side channel attacks and performance counters. I think it’s enough to show that PMCs can be relevant to cache side channel attacks. If you’re interested in this subject, I’m writing another blog post to elucidate my findings more accurately than the slides were ever intended to.
It is such fun to work with stuff that is still in motion, but it does make you jump through some hoops when you are on a tight deadline.
The final event that really turned things upside down was that Mr. Herath’s laptop (which we were going to use for the presentation) gave up, and we had to spend the morning before our talk (the only half day we were physically in the same location before the talk) finding a new laptop and agreeing on the last-minute slide changes. Thanks go to an anonymous benefactor who lent us a laptop to make changes on while we were still in pursuit of one for the presentation. Thus I went to speak at the Black Hat conference without ever having seen a conference talk, and not with the best of preparation either. I don’t think it was a disaster, but I should’ve done a lot of things better. I learned a lot and I’ll do better next time, if there is one. As a kind soul remarked, it takes a few iterations to get right; though I agree with this, I had certainly set the bar higher for myself. The real error, I think, was insufficient preparation on my part, too much detail and too much material.
My comments on the slides
Performance counters count different micro events in the CPU. These micro events tell a lot about the code running on the CPU, and since malicious low-level code often behaves differently from “normal” code, the performance counters can be used as a signal to look for malicious code. As an example, ROP code causes excessive “return misses” because the “ret” opcode is not paired with a call but is instead used to progress along a gadget chain. Unfortunately events are plentiful, so there is a lot of noise in the signal too. However, there are ways to manage this noise and make useful inferences about the presence of malicious code from the performance counters. The talk, in my opinion, can be summarized in one statement: performance counters can be very useful for security purposes in a great many ways. My wishful thinking in this regard is that these things make their way into security products, that Microsoft gets its act together and makes a decent interface for using PMCs under Windows, and that Intel considers adding security-related events to the PMCs instead of just stuff for optimizing their CPU or your code.
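To make the “return miss” idea concrete, here is a toy model of a return-address predictor. It is only a sketch: the class, the function names and the depth of 16 are my own assumptions for illustration, and real predictors are considerably more involved. The point is just that paired call/ret code predicts perfectly while a chain of gadget returns misses on every ret.

```python
from collections import deque

RSB_DEPTH = 16  # assumed depth; real predictors differ between microarchitectures


class ReturnPredictor:
    """Toy model of a CPU return-address predictor (a small hardware stack)."""

    def __init__(self, depth=RSB_DEPTH):
        self.stack = deque(maxlen=depth)
        self.ret_misses = 0

    def call(self, return_address):
        # A call pushes the address its matching ret is expected to return to.
        self.stack.append(return_address)

    def ret(self, actual_target):
        # A ret pops the prediction; a mismatch or an empty stack counts as a miss.
        predicted = self.stack.pop() if self.stack else None
        if predicted != actual_target:
            self.ret_misses += 1


def run_normal(predictor, n):
    # "Normal" code: every ret is paired with an earlier call.
    for i in range(n):
        predictor.call(0x1000 + i)
        predictor.ret(0x1000 + i)


def run_rop_chain(predictor, gadget_addresses):
    # ROP: each gadget ends in a ret that no call ever set up.
    for gadget in gadget_addresses:
        predictor.ret(gadget)


normal = ReturnPredictor()
run_normal(normal, 100)

rop = ReturnPredictor()
run_rop_chain(rop, [0x4000 + i for i in range(100)])

print("normal ret misses:", normal.ret_misses)
print("ROP ret misses:   ", rop.ret_misses)
```

In this model the normal run produces no misses and the ROP run misses on every single return, which is the kind of skew a ret-miss counter can pick up.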
I think it’s worth noting that I did the majority of my work with PMCs on 32-bit Windows XP. The reason behind this choice was that I had such a system available to me and that it let me avoid driver signing and PatchGuard issues. Most other work has been done on Linux distributions such as Debian, which have a much nicer interface to PMCs than Windows does. There are of course ways around PatchGuard, but it wouldn’t be nice to pursue them. There might also be undocumented HAL APIs, such as HalpSetSystemInformation() on WinXP, to hook into the PMI. For non-PMI use there is a per-process API for user mode on newer systems. Visual Studio 12 comes with horrible tools for using that API. I’m not aware of any documentation, but you can reverse engineer those tools; they are not too big. Sorry, I don’t have much information on this. Maybe I’ll look into it… no promises though.
This slide is, to me, our key slide. Not only does it show how you can minimize the performance impact of using performance counters to look for malicious activity, it also gives some basic ideas on how you can manage the noise problem with performance counters. The basic idea is that other analysis can often tell us something about the conditions under which malicious code will run and where it won’t. A simple example: row hammering is not very useful if the attacker is already running code in ring 0, so we need not examine events in ring 0 to detect row hammering. All 4 of our examples of detecting or mitigating malicious activity use some aspect of the methodology on this slide.
Ret-prediction is terrible around task switches because the CPU’s shadow stack does not get swapped, while the real stack does. The 400k bad-case scenario on slide 29 came from running for (int i=0;i<300;i++) Sleep(100); which causes lots of task switches. Thus on task switches we’d see heavy performance penalties and problems with the limited depth of the LBR. kBouncer essentially gets around the limited depth of the LBR (which records only the last 16 branches) by instrumentalizing the code of targets of interest around API calls. This can be seen as the instrumentalization method of slide 30. Interestingly, the first workaround from an attacker’s point of view was to use the still limited depth of the LBR to bypass kBouncer. However, traditional performance counters could be used with instrumentalization as well, and then depth would be limited only by acceptable performance loss and memory. The implementation would be slightly more complex than kBouncer (because we’d need instrumentalization to start and stop collection), but we are well within the realm of possibility.
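The depth limitation can be illustrated with a small simulation. This is a sketch under my own assumptions: the branch labels, the `flags_rop` check and the `run` helper are hypothetical and have nothing to do with kBouncer’s actual implementation. It only shows why a 16-entry history can be flushed by an attacker, while a software log bounded only by memory cannot.

```python
from collections import deque

LBR_DEPTH = 16  # depth quoted in the text above


def flags_rop(history):
    """Hypothetical check: flag the history if any return lacks a matching call."""
    return any(kind == "unpaired_ret" for kind in history)


def run(branches, depth=None):
    """Feed branches to a recorder; depth=None models an unbounded software log."""
    history = deque(maxlen=depth)
    for kind in branches:
        history.append(kind)  # with a maxlen, the oldest entry silently drops out
    return flags_rop(history)


rop_chain = ["unpaired_ret"] * 30
padding = ["benign_branch"] * LBR_DEPTH  # attacker-chosen filler before the API call

caught_plain = run(rop_chain, LBR_DEPTH)
caught_padded = run(rop_chain + padding, LBR_DEPTH)
caught_unbounded = run(rop_chain + padding)

print("16-deep history, no padding:  ", caught_plain)
print("16-deep history, with padding:", caught_padded)
print("unbounded history, padding:   ", caught_unbounded)
```

The padded chain slips past the 16-deep history because by the time the check runs only benign branches remain in it. The unbounded log still sees the chain, at the price of memory and collection overhead.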
We have essentially the same problem as on slide 35 for kBouncer; however, the solution is different. Georg Wicherski limits himself to ring 0 and gets around the problem of task switching that way.
I wished to make the point that row hammer kinds of problems are likely to increase in the future as DRAM grows increasingly dense. It also means that the problem is likely to see hardware fixes in the future.
This slide can serve as a way of sorting the following row hammer mitigation methods into categories. It should be noted that having two physical rows in the same bank is required even for single-sided hammering because of the row buffer. The slide gives a false impression in that respect.
Even though it is not good enough on its own, it might be a useful supplement to other methods, including ours.
The performance measure is on memory access only. It doesn’t map in a well-defined way to system performance because the Intel CPU is able to rearrange instructions while waiting for memory. This will tend to make the system-level loss smaller. On the other hand, memory speed is very often the bottleneck on performance, dragging things in the other direction. I think the latter effect will dominate, but I have no evidence to back this belief.
The answer to the last question is a definite no, though it might be enough for some RAM.
It’s my guess that power consumption will not be relevant. I found two different sources painting different pictures. I think it would make a difference in sleep, but since we cannot row hammer in sleep, the point is moot. While the system is awake, other components of a laptop should outspend RAM by a large factor. So if implemented right, it should be a non-issue.
All three are mitigation through faster refresh, as on slide 49. PARA and pTRR are more targeted refresh, avoiding the steep penalty of refreshing the entire RAM more often. The latter two methods seem to me to be the “right” solution: essentially, row hammer is a microarchitecture problem and it should be fixed in microarchitecture too. I consider that likely to happen; see also the slide 48 comments on RAM growing more dense. However, that requires that people get new hardware.
I’ve not been able to do row hammer with non-temporal instructions without first flushing the cache by other means. It seems that using these instructions from JavaScript is very difficult because JIT compilers don’t generate them. (There was a paper on this; if you’re interested, ask me and I’ll find it.) There might be other reasons why Intel should consider making clflush a privileged instruction: cache side channel attacks. Additionally, there is little use for clflush other than optimizing cache usage to speed up code, which seems rather difficult. Again, making clflush privileged does not solve the problem, but it doesn’t exactly hurt either.
In addition to mitigation through delay, we could try to figure out the victim row and read from it to trigger a refresh (a row is automatically rewritten on read).
Also interesting: here we use a rarely triggering interrupt (say, every 1000 LLC misses) to trigger the costly slowdown. It’s an example of using a rare event from slide 30 to trigger more costly analysis. (Totally irrelevant note that I didn’t know where else to put: at one point, while playing around, I used ret-miss events to alert me to process switching instead of hooking my way into getting alerted on task switches.)
Performance cost analysis: a normal LLC miss costs around 200 ns and an interrupt around 500 ns, so triggering an interrupt every 1000 misses adds only ~0.25% to the time spent on those misses (500 ns on top of 200,000 ns), and 1000 is a really low number for row hammer. 100000 would probably be more appropriate.
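For anyone who wants to play with the numbers, here is the same back-of-the-envelope estimate as a few lines of code. It is only a sketch: the two latencies are the rough figures quoted above, the function name is my own, and the ratio counts nothing but interrupt time against the time the misses themselves take. Whatever the handler does afterwards (the deliberate slowdown) is not included. With these inputs the ratio is 0.25% at one interrupt per 1000 misses.

```python
LLC_MISS_NS = 200   # rough cost of one last-level cache miss (figure from the text)
INTERRUPT_NS = 500  # rough cost of taking one interrupt (figure from the text)


def interrupt_overhead(misses_per_interrupt,
                       miss_ns=LLC_MISS_NS,
                       interrupt_ns=INTERRUPT_NS):
    """Extra time from interrupts as a fraction of the time spent on the misses."""
    return interrupt_ns / (misses_per_interrupt * miss_ns)


for n in (1_000, 100_000):
    print(f"interrupt every {n} misses: {interrupt_overhead(n):.4%} overhead")
```

Raising the threshold from 1000 to 100000 misses divides the overhead by another factor of 100, which is why the higher threshold costs practically nothing.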
Essentially the rootkit detection here is detecting code executing out of bounds, that is, outside a known white list. We use the probabilistic element of slide 30 to keep the performance cost down, while staying in ring 0 only makes it feasible to have a white list of known good code. I dislike using the word heuristic from slide 30 in this sense: if an interrupt triggers on this code, there is no doubt the code executed; however, nothing guarantees that executing the code will trigger a PMI (thus probabilistic). Finally, we hint that using instrumentalization around particularly vulnerable code could increase the true positive rate enough to get a chance at finding traditional exploits doing their work.
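The white list check itself is simple. The sketch below is my own illustration: the address ranges are made up, and `is_whitelisted` and `on_pmi` are hypothetical names, not code from the talk. It assumes the PMI handler is handed the instruction pointer that was executing when the counter overflowed, and that the ranges do not overlap.

```python
import bisect

# Hypothetical, non-overlapping [start, end) ranges of known good ring 0 code.
WHITELIST = sorted([
    (0x80000000, 0x80200000),
    (0xF7000000, 0xF7040000),
])
STARTS = [start for start, _ in WHITELIST]


def is_whitelisted(ip):
    """Binary search for the last range starting at or below ip, then bounds-check."""
    i = bisect.bisect_right(STARTS, ip) - 1
    return i >= 0 and WHITELIST[i][0] <= ip < WHITELIST[i][1]


def on_pmi(sampled_ip, alerts):
    # A sample outside the white list means that code definitely executed there.
    # The reverse does not hold: unlisted code can run without ever being sampled.
    if not is_whitelisted(sampled_ip):
        alerts.append(sampled_ip)


alerts = []
for ip in (0x80001000, 0x90000000, 0xF7000010):
    on_pmi(ip, alerts)

print([hex(ip) for ip in alerts])
```

Only the sample at 0x90000000 ends up in the alert list. The check is exact for any sample it sees; the probabilistic part is whether a PMI fires while the unlisted code is running at all.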
Slide 86 & Cache side channel attacks in general
I’ll just note that we’re using the instrumentalization (slide 30 again) of the fine-grained timer, in addition to making it less fine grained, to force behavior that is less likely to occur naturally and thus make it detectable. It is in many ways a heuristic approach. I should note too that false positives are possible and you’d have to measure your response in that respect, e.g. make the timer even more inaccurate, flush the cache, trigger copy-on-write, etc. What should also not be forgotten: this is intended to show that PMCs can be relevant for cache side channel attacks. It is not the final word, as the attacker does have options to evade, and I’ll get back to that when I finish my next blog post. On the good side though, there are more defensive possibilities too…. To be continued….