Thursday, November 5, 2015

x86 assembler coding nostalgia

Foreword

This blog post was actually finished a long time ago. I've not posted it before because I'm anything but happy with how it turned out. I hope you enjoy it anyway. In case you actually read this: I've added an errata section to the cache side channel post. A friend of mine in the infosec community insists on calling me old, and he is right too. You are old when you write nostalgia blogs.


Introduction

The idea for this post came shortly after I'd posted the last nostalgia post, which remains one of the most popular on my blog. The trigger was a question by MalwareTech on Twitter. I don't recall the exact quote, but it was along these lines: "Is there a way to figure out if an instruction is privileged from its encoding?". My first thought was that it's a silly question. But thinking about it made me reconsider. The question has lots of merit and the answer is silly. Any reasonable designer would have grouped privileged instructions together, yet x86 doesn't. The reason is that x86 assembler wasn't designed, it evolved. It's layers upon layers of added designs, and I've had the mixed pleasure of following lots of this evolution first hand. This post however is not the story of x86 processors – other people could tell that story better than I could, I'm sure. This post is about me and x86 processors. I'll try to focus on the tech stuff, because I assume you're only interested in me to the extent that my story reflects yours. I hope you enjoy this trip down memory lane.

Setting the stage

Somewhere in time Intel made a new processor called the 8086. This CPU would've been long forgotten by now had IBM not made an open computer design around it. As a consequence, companies like Nixdorf of Germany, Olivetti of Italy, Commodore (and much later Apple) and many others produced their own computers based on this design, and the 8086 and its successors became the de facto business PC and remain so today. It had an instruction set inherited from older Intel processors like the 8085, to make porting software easier, and a set of 16-bit registers that by now sound awfully familiar: AX, BX, CX, DX, SP, BP, DI, SI, CS, DS, ES and SS. The first four of these could each be accessed as two 8-bit registers as well.

In the early 1980s – I don't know exactly when – my father bought his first x86 PC. It was a 4.77 MHz 8088 IBM portable (if you can call anything weighing 14 kilos portable, not to mention that it could not run unplugged). It was soon sold again, because the display did not live up to the standard our old Commodore already had. Consequently it was replaced with a Commodore PC 10 with a NEC V20 processor. It was instruction compatible with the 8088, but ran at 8 MHz and, if I recall correctly, had a better memory bus, which made it a pretty solid computer for the day. At first I didn't care much for the monster. You had to boot the OS from a floppy disc (rather than from a ROM), you had to enter the time on each boot too as no battery was on board to keep the clock running, the games sucked and I had nobody to swap games with. It took a while for me to get interested in that computer. My father, being a geek, relatively quickly acquired a 10 MB hard drive and a multifunction card, which cured the having-to-enter-the-time problem and the booting from floppy disks. However, what made the difference to me was a word processor. Once I got used to it, I found that the ease with which I could do my homework would save me time. Having a spell checker in the mid-80s, and a search function to place commas approximately right, brought down the time spent on my Danish homework significantly for the same grades (me being a freak, I actually timed it). Another cool thing I just have to mention is that my best friend bought a used IBM PC, and today it defies belief, but it actually came with a diagram of all the electronics on the motherboard, so you could repair the beast with a soldering iron should something break.

Hacking did exist in those days, but not for kids like me. Mischief obviously existed, and that usually meant borrowing the library's computer (an Olivetti) that booted into a menu from which you could select a handful of innocuous programs, and required a password to get back into DOS to do maintenance. A DOS prompt was root, since the early processors in the family – 8086, 8088, V20, 80186, 80286 etc. – came completely without security functions. These processors had no memory protection, any memory was R/W/X and the entire address space was accessible from anywhere. The in/out instructions always worked, the interrupt table was placed at a fixed location in memory, etc. Consequently, if you got into DOS you owned the computer. As a direct consequence of the processors' inability (insufficient memory, too slow, etc.) to do meaningful multitasking, many programs, such as the dominant word processor WordPerfect, allowed you to open a DOS prompt in the foreground to format floppy disks, look for files etc. and then close the DOS prompt to continue working on your document. I used that countless times to get that black hat feeling of power, back when I was 12 or younger. I don't recall actually doing anything with my power though. But I wanted to crack games and have a cool nickname in intros, and for that I had to learn assembly.

My first x86 assembly program

My first assembly program would’ve looked much like this:
.MODEL small
.STACK 100h
.DATA
HelloMessage DB 'Hello, world',13,10,'$'
.CODE
.startup
mov ax,@data          ; load the data segment...
mov ds,ax             ; ...into DS
mov ah,9              ; DOS function 09h: print $-terminated string
mov dx,OFFSET HelloMessage
int 21h
mov ah,4ch            ; DOS function 4Ch: exit process
int 21h
END

When assembled and linked you'd have a small hello world executable. This code obviously bears only remote similarity to any assembler you'd write for Windows or another modern operating system today. Back then it was just x86 assembler. With the arrival of the 80386 processor it became "real mode" as opposed to "protected mode". I'll get back to this later. The two things that seem particularly weird in the above assembler are writing to DS (which today is loaded with a selector for a descriptor, but back then it was purely a segment register – there was nothing to "describe"; it only said which segment of memory to read from) and the use of interrupts.
To understand why segment registers had to be written, the first thing we need to understand is that the early x86 processors came without paging and were 16-bit. Without changing the segment register we could only access 64 KB – not very much even by the standards of the time. Hence the changing of segment registers. The physical address was calculated as (segment register << 4) + offset, so 0000:0211 would be the same memory as 0001:0201. The upside to this scheme was that when loading an executable or allocating memory you'd only lose a maximum of 15 bytes to alignment. This seems like a reasonable compromise considering the address space was 20 bits: 1 MB – of which only 640 KB was at the user's disposal; the last 384 KB belonged to the BIOS (well, sort of – eventually things like himem.sys partially recaptured memory from this region). I like to think that history repeats itself: the scheme is a bit like AMD64, which has 64-bit addressing on top of a 48-bit address space, and the reserved 384 KB can in a weird way be thought of as an early version of SMM memory.
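The translation is simple enough to model in a few lines. A sketch in C, assuming the 20-bit wrap-around of the original 8086/8088 address bus:

```c
#include <stdint.h>

/* Real mode address translation: physical = (segment << 4) + offset,
   truncated to the 20-bit address bus of the 8086/8088. */
static uint32_t phys(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}
```

With this, phys(0x0000, 0x0211) and phys(0x0001, 0x0201) both come out to 0x00211 – many segment:offset pairs alias the same physical byte.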
Interrupts in those days were not only for hardware and exception servicing as they are today. They were also the API. Though many interrupts were used (0x19 = reboot, 0x20 = exit process), interrupts 0x21 and 0x13 were the ones you had to know. 0x21 was the DOS services – in today's terms, what is exported from kernel32.dll in Windows. The calling convention was all register based (funny how history repeats itself: after more than a decade with stack-based parameters, it's typical to use many more registers for calling conventions in x64 – actually, until WinXP, int 0x2e was used internally to forward calls from user mode to kernel mode; it was then replaced with the new syscall instruction, but programmers rarely see this nowadays). I never saw any official documentation, and I doubt most people did – remember, without internet to download stuff and no Intel or Microsoft websites, documentation was tricky. Instead, unofficial documentation, especially in the form of interrupt lists, was floating around. The most important was Ralf Brown's interrupt list. For interrupt 0x21 the ah register selected the function: 0x9 was Writeln() (that is, write a $-terminated string to the console) and 0x4c was ExitProcess(). Interrupt 0x13 was very important to system programmers, because this was where the BIOS exported its functionality. If you wanted to read sectors from disk, for example, the easiest way was to use this interrupt, which was serviced by the BIOS. All boot loaders at the time were full of int 0x13 – and copy protections too…
Another funny thing about early x86 was that lots of stuff was mapped to fixed addresses in memory. At 0000:0000 was the start of the interrupt table, which was just an array of far pointers. An interesting side effect of this was that any program could globally hook all operating system API: an offer that virus authors would not turn down. Even weirder was address b800:0000. It was the console. mov ax, 0b800h; mov ds, ax; mov byte ptr [0], 'A' would print an "A" in the top left corner of the console. The graphics adapter would read it directly from here. Consequently you could read what was in the console from this address too. Graphics was rare, so the console mode was where computers were actually used in the early x86 days. EGA graphics had a fixed address too – I think it was a000:0000 – but I really do forget these things.
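The text buffer layout is easy to model: two bytes per cell (character, then attribute), 80 cells per row, starting at physical address 0xB8000. A small C sketch of the address arithmetic (the 80x25 geometry is the standard color text mode; nothing here is specific to any particular machine):

```c
#include <stdint.h>

#define TEXT_BASE 0xB8000u  /* physical address of the color text-mode buffer */
#define COLS      80u       /* standard 80x25 text mode */

/* Physical address of the character byte of a given cell;
   the attribute byte sits right behind it. */
static uint32_t cell_addr(uint32_t row, uint32_t col)
{
    return TEXT_BASE + 2u * (row * COLS + col);
}
```

Writing an 'A' to cell_addr(0, 0) is exactly what the mov in the paragraph above did.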

Writing my first demo

A demo was the "proof" of coding skill in 1994. Doing nice motion graphics required pushing the processor pretty far, and writing a good demo in a high level language usually didn't turn out well. I wrote my first and last demo in real mode assembler. It was a sine scroller – which essentially means that a text scrolls across the screen along a sine curve. To realize this, the first thing I had to do was to define a font. Other than the standard ASCII, the operating system or hardware didn't offer anything. This was done in a graphics program, and then the output was converted into a table for each letter by hand. With only 256 colors it was a byte table. This was a lot of work, even though I drew the font only for the letters I needed. The second thing I needed was a sine table. Today you'd just calculate one as you go. Not so back then. The first problem was that early x86's did not come with floating point instructions. To get floating point instructions you'd actually have to buy a so-called co-processor (typically called 8087, 80387 or so) which you could plug into your motherboard. The second reason was that a table lookup was far more efficient. I generated my sine table in Pascal, which had a library for calculating sine with integer instructions, and then added it to my demo. Drawing the actual graphics on screen was a challenge of its own. The screen refresh was pretty slow and you'd see artifacts from it if you didn't synchronize with it. Thus updating your screen was a loop of waiting for the refresh to finish and then a complex memory move operation to the graphics card memory (a fixed address, as seen above). Waiting was achieved by using the "in" instruction directly (no clue what port it was) until a certain bit was cleared (or so I recall it). My demo sucked, and I distinctly remember that despite being able to code these kinds of things, I had a huge disconnect between what I thought it would look like and what it actually looked like.
I knew I was never going to be an artist and had already lost my heart to systems programming and computer virus stuff.


80386


The 80386 was in many ways the game changer for x86. It was introduced in 1985 according to Wikipedia. It was also the first computer I bought – that was in 1991 or 92. Imagine a processor architecture staying pretty much unchanged for 7 years – you hardly see that today. The irony is that despite this, back then a computer lasted 2 or 3 years before it was antiquated – by today's standards no time at all. The 386 was probably the biggest revolution in the entire history of the x86 platform, and for infosec people almost certainly. The most profound change was the move from the previous 16-bit architecture to 32 bits. Not only was the address space increased from 20 bits to 32 bits, the registers and operand sizes were upgraded too, and a paging system was added. This allowed 32-bit assembly language code to look like it does today. Furthermore, the security architecture with ring 0 – ring 3, which some of us have a love/hate relationship with today, was introduced here. New debugging features were added, in particular the hardware memory breakpoint. The CRx and DRx registers and the GDT, LDT and IDT were introduced with the 80386. The real irony was that this processor was so far ahead of the software that it wasn't until Windows 95, a full 10 years later, that an operating system making full use of the opportunities of the 386 processor became widespread. What certainly was a revolution in hardware became slow evolution in software. DOS extenders that switched from real to protected mode became standard for games (especially DOS4GW from Watcom, but Phar Lap also had a market share). Short of being an operating system, such an extender basically just provided a flat memory layout to C++ programs. There was DR DOS, which had true multitasking but didn't break the 640 KB boundary and thus was pretty much unusable. And we saw the horrible Windows 2.0 and 3.0 with terrible, terrible 16-bit protected mode execution and their NE format, which can best be described as a crime against good style.


Viruses and infosec in the early 1990s

I had been infected with an interest in computer viruses in the very early 90s, reading about the exploits of Dark Avenger and his anti-virus nemesis: Vesselin Bontchev. Both were heroes of mine. Thus as soon as I'd written my first program in assembler I started wanting to write anti-virus software – somehow writing virus code always seemed immoral to me. There of course isn't much anti-virus code you can write when you've only written "Hello World" before that. That did not scare me, and I quickly wrote up a number of goat programs. For those unfamiliar with goat programs, they are programs meant as scapegoats for viruses to infect. Since computer viruses at the time tried to avoid detection and crashes by carefully selecting which files they would infect, having a large collection of known executables on your computer would increase the attack surface for any virus and of course allow for detection too. On the assembly side, it was pretty much programs with different junk instructions to fill up the executables – nothing fancy. With time I moved on to writing encryptors and matching decrypters for executables. That was pretty much the same code as you'd use in a virus, but without the moral implications. The first target was the .com file format of DOS. This was the easiest executable format imaginable: it was simply the binary code and data. It was loaded into a free segment at xxxx:0100, where xxxx denotes the free segment the operating system loaded the .com at, and execution would start from this address too. The memory below 100h was used for the command line and other startup parameters. This made .com a very easy format to encrypt. Basically I replaced the first bytes with a jmp instruction to the end of the file, where a decrypter was appended. However, with a file format this simple, removing a virus or an encryptor was trivial, and thus anti-debugging had to be added. Hooking interrupts 1 and 3 could throw off most debuggers.
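The patching itself can be sketched in C. This is a hypothetical model, not my original code: a near jmp is opcode 0xE9 followed by a 16-bit relative offset, and the body is XORed with a single-byte key (real encryptors used better schemes):

```c
#include <stdint.h>
#include <string.h>

#define KEY 0x5A  /* illustrative single-byte XOR key */

/* Encrypt a .com image in place: save the original entry bytes, patch in a
   jmp to offset len (where a decrypter stub would be appended), and XOR the
   rest. The relative offset counts from the byte after the 3-byte jmp. */
void encrypt_com(uint8_t *image, size_t len, uint8_t saved[3])
{
    uint16_t rel = (uint16_t)(len - 3);

    memcpy(saved, image, 3);
    image[0] = 0xE9;                  /* jmp near rel16 */
    image[1] = (uint8_t)(rel & 0xFF);
    image[2] = (uint8_t)(rel >> 8);

    for (size_t i = 3; i < len; i++)
        image[i] ^= KEY;
}

/* What the appended stub would do at run time: undo the XOR and
   restore the original entry bytes before jumping back to them. */
void decrypt_com(uint8_t *image, size_t len, const uint8_t saved[3])
{
    for (size_t i = 3; i < len; i++)
        image[i] ^= KEY;
    memcpy(image, saved, 3);
}
```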
Running checksums on yourself would detect if somebody was stepping through your code or had inserted an int 3 into it, and using the pushf instruction to read the trace flag was popular too. Finding back-door interfaces in the most common debuggers, and writing up detection code using this new-found information, was a common enterprise for me. On the other hand, patching the backdoor interfaces in your own debugger, allowing you to decrypt the competition, was another essential trick. A real game changer was a debugger called TRW. It emulated each instruction instead of executing them and was thus immune to all these games we were playing. As a response I started checking instructions for emulation errors – lots of work, but real fun. In many ways a sign of what was to come in this area. After I'd had my fun with the errors, I reported them back to the author, who by the way was a really nice guy.
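The self-checksumming part is simple. A minimal C model, with a buffer standing in for the code being protected (the 16-bit additive sum is an illustrative choice; anything a debugger's int 3 byte would disturb works):

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit additive checksum over a code region. */
static uint16_t checksum(const uint8_t *code, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += code[i];
    return sum;
}

/* Nonzero if the code no longer matches its expected sum - for instance
   because a debugger inserted an int 3 (opcode 0xCC) somewhere in it. */
static int tampered(const uint8_t *code, size_t len, uint16_t expected)
{
    return checksum(code, len) != expected;
}
```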

With the arrival of Windows, new territory was everywhere, and I did lots of work on Windows internals, in the beginning particularly the PE executable format. Despite all things being new, the first (public) generic PE decryptor (ProcDump32 from 1998) used the same principle as had been used to unpack .com files generically in the DOS days: it set the trace flag and single-stepped everything until the original entry point, then dumped the memory. The anti-anti-debugging was inspired heavily by TRW: a driver would emulate instructions that could endanger the safe tracing, such as pushfd/rdtsc, instead of actually executing them. ProcDump32 could dump running programs too – except those wouldn't actually run afterwards, because the data wasn't correctly restored. It was, however, enough to do reverse engineering on the code in many cases.

Towards the end of the first half of my career as an infosec hobbyist, I was introduced to what had been more or less the birth of modern hacking: the stack overflow hack. My interest at the time was Windows internals, but on IRC EFnet you could not avoid being confronted with "0-dayz". Being used to assembler coding, I quickly understood the mechanics of the attack. It was occasionally useful for privilege elevation attacks on Windows NT. Windows 98 internals freaks like myself had no use for it – with the front door left open, why look for a window to break? I didn't consider privilege elevation to be hacking at the time. Besides, finding buffer overflows was too easy – in the very early days it was looking for strcpy() calls, quickly followed by attacks on sprintf-type format strings. Usually if you could find one of those where the input string came from "user" input, you had a 50% chance of having a 0-day in your hand. You could report bugs all you wanted and it wouldn't change a thing. My experience was that if you reported a privilege elevation bug to MS engineers you'd be ignored. If you reported a BSOD that had to be provoked to appear, you'd be thanked. If you reported a BSOD that could happen by accident, it would be fixed – though it would take forever until the fix was distributed. It was all reliability; nobody cared about security between 1996 and late 2000, when I put my infosec career to rest.

My last act before retiring as a malware hobbyist was writing a blog post on the press's favorite malware of the day – Back Orifice. I'd figured out how to detect Back Orifice specifically, and most other remote control malware of the day generically. Nobody cared. I tried for a while to get a job with anti-virus companies, but short of a nice email from Mikko Hyppönen, nothing ever came of it. So I got a job which quickly killed my interest in spending time behind a computer in my spare time – but my life with x86 was not yet over.

Optimization

Assembler has always played a major role in optimization. As I started out reading about optimization, it was all things we see compilers do themselves these days, e.g. using shl for multiplication by 2 or 4. The first kind of optimization I actually did was pre-computed tables. At the time it didn't take much calculation for pre-computation to be meaningful. Part of this was because processors were much slower relative to RAM than they are these days. Also, accessing hardware directly instead of through BIOS/OS channels worked wonders – say, the demo above would not have worked without directly accessing the graphics memory. In the mid/late 90s a friend of mine was tasked with optimizing an interrupt routine for an industrial barcode scanner which would be the heart of a large sorting machine for logistics. I joined the effort for the fun of it. First we spent a lot of time translating from Pascal to pure assembler, and then more time was spent replacing multiplications with combinations of "shl", "lea" and "add" instructions. Then we spent hours on avoiding push and pop and keeping everything we needed in registers, and eventually got it fast enough for the requirements.
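For the curious, the multiplication trick looks like this. A tiny C illustration of the strength reduction we did by hand (mul10 is a made-up example, not the scanner's actual arithmetic):

```c
#include <stdint.h>

/* x * 10 rewritten as x * 8 + x * 2: two shifts and an add, the same job
   a shl/lea/add combination did in the hand-written assembler. */
static uint32_t mul10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Compilers perform exactly this kind of rewrite automatically today, which is why the hand-written version stopped being worth the effort.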

In 2000 I was working on optimizing an MPEG2 decoder. By now compilers had internalized the above optimizations and it was rarely worth the effort to see if you could do better. The optimization game had moved yet again. The name of the game was now MMX – I realize that SSE was available at this time, but not in the low-end computers, and MMX optimization would do the trick at both ends of the scale.

By 2002 it was an MPEG2 transrater I was optimizing. The idea was copying a DVD9 to a DVD5 without re-encoding. I had a beta version of a Pentium 4, and Intel really wanted me to work on using hyper-threading. But threading is difficult, and it brought less than optimizing memory access did; and again there was the real-life market perspective: nobody but a few select people had a hyper-threading computer, and there was little chance that they'd be widespread within a year. Especially using the non-temporal instructions of SSE2 and aligned memory access made a huge impact. Another thing that made a big impact on my optimization work was that performance counters had become really useful.


By 2006 I was back to optimizing video – this time an H.264 codec for massive amounts of video. This time around, optimization had moved to threading: with 8 cores in a computer and the prospect of using multiple computers, splitting tasks in a meaningful way gave much more performance gain than the 50% you could get from hand optimizing the big spenders. To be fair, I did write SSE3 routines for the biggest spenders. But assembly language optimization had finally taken the back seat.

Friday, October 30, 2015

Speculation: What the next row hammer style exploit could look like



One advantage of being a blogger is that I can speculate wildly, and that's what I'll do in this post.
A few years ago I was part of a team that developed video codecs – H.264 and MPEG2, essentially. To test these we had a student aide and around 15 computers. The student aide checked that tests were running 24/7 and that the results weren't garbage. In this way the poor PCs racked up months' worth of CPU time. Now, occasionally we had to change components in these computers – often RAM. Recently I had written what was essentially a file copy routine, and for some reason it ended up being tested on one of these old computers. The result was lots of binary compare errors – always a single bit in a random byte. I spent hours trying to find the bug in my code, to no avail. The bug, it turned out, was that the DRAM modules were broken and produced enough bit errors that copying an 8 GB set of files would produce compare errors, but not enough that the system seemed to run unstably. This was less than a month after the laptop we were supposed to do my Black Hat presentation on had died during row hammer testing – probably the RAM went dead. Now, video encoding and row hammer have something in common: they produce lots of cache misses. Row hammer by design – it doesn't work without them – and video codecs simply because motion search and motion compensation routines plow through lots and lots of memory, and back then there was no 3rd level cache. And finally an article appeared about causing ageing in a SPARC processor: Karimi et al. (2015).
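Telling DRAM trouble apart from a software bug is easy once you look at the diff bit by bit. A small C sketch of the diagnostic I effectively ran by hand (the buffers stand in for the two file copies):

```c
#include <stddef.h>
#include <stdint.h>

/* Count bytes that differ from the reference by exactly one bit -
   the signature of a DRAM bit flip rather than a copy-routine bug. */
static size_t single_bit_errors(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t flips = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t diff = a[i] ^ b[i];
        if (diff != 0 && (diff & (diff - 1)) == 0) /* power of two: one bit set */
            flips++;
    }
    return flips;
}
```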

My speculation is now that memory circuits age, which seems to be true – see Schroeder, Pinheiro & Weber (2009). Further, we know that memory circuits that are used are more prone to error than idle ones. Not really surprising. What we don't know is whether using memory causes it to age. For static memory like a USB stick the effect is relatively easy to produce, but for DRAM I found nothing. I speculate that our video codec testing caused ageing in memory, and that row hammer testing possibly did the same on our presentation laptop for Black Hat. The point is that normally the cache will catch the majority of memory accesses – and there aren't many on most computers anyway, since they spend most of their life idling. However, as we know from row hammer, we can force memory accesses in an attempt to age the memory. A rough guess is that most addresses in a random computer see only one access per 6 milliseconds or so. With malicious code we can do an access around every 250 nanoseconds. In case you wonder, that's a factor of 24,000 – and DRAM ages in normal computers too. If this works we're basically doing hardware destruction – nothing new under the sun here; viruses tried to destroy hard drives and floppy drives in the 90s.

The new thing is row hammer. Seaborn and Dullien (2015) forcefully illustrated that what is normally a reliability issue can become a security issue very fast. The way that worked was simply to spray the memory with PTEs, hammer away, and wait until read/write access was given to a page owned by the attacker. Now, my file copying incident produced random bit flips during normal execution – I was definitely not hammering anything. So bit flips without hammering could be possible in aged chips.

If we are so unfortunate that I'm right about the DRAM ageing, then there is a good chance that it's exploitable: read/write memory patterns for a couple of months non-stop, then when errors start to occur, spray the memory with PTEs and wait for the local privilege escalation to take place.

Granted, this is a pretty rare scenario and its real-world importance probably shouldn't be overestimated. But it is a pretty nifty idea. I share it instead of developing it because I do not have the resources (time and hardware) to pursue this at all. And before you accuse me of selling the skin before the bear has been shot: there is much speculation in here. Lastly, if you have some old hardware you can mistreat and are willing to put up with the electricity cost of testing this, please get in contact and I'll write up a small test program in a jiffy – remember, even if my speculation is wrong, these kinds of experiments can shed some light on ageing in DRAM – an interesting endeavor altogether. Thus even if I'm wrong we might learn something.

Literature:

Karimi et al. (2015): MAGIC: Malicious Aging in Circuits/Cores: https://drive.google.com/file/d/0B9i8WqXLW451MTIyM2lqR1lpZ3M/view?pli=1

Seaborn & Dullien (2015): Exploiting the DRAM rowhammer bug to gain kernel privileges.
http://googleprojectzero.blogspot.de/2015/03/exploiting-dram-rowhammer-bug-to-gain.html


Schroeder, Pinheiro & Weber (2009): DRAM Errors in the Wild: A Large-Scale Field Study: https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

Sunday, September 27, 2015

Cache side channel attacks


Errata 5.11.2015:
As I expected, a few errors and less-than-accurate formulations got into this post. There are two things I'd specifically wish to address here.

My comments on using a non-inclusive cache hierarchy to defeat cache attacks are misleading at best. Obviously, the L1 and L2 caches are per core; it's the L3's shared nature combined with the inclusive cache hierarchy that allows us to flush the L1 and L2 caches from another core in the first place. However, these caches are relatively small, so with only a "small" amount of memory activity by the victim, or by other code running on the same core (this includes the attacker if she can get herself executing on that core), rare events can still be effectively spied upon.


On using performance counters to detect cache side channel attacks, I focused only on low frequency attacks in this text. Mr. Herath's and my slides from Black Hat do not. The idea is that high frequency cache attacks are relatively easily identified as such from performance counters (either through the simple number of cache misses, or say a ratio of cache misses to cache hits). Alternatively, machine learning methods could be applied to such performance counter data. Low frequency attacks are much more difficult to keep apart from the noise generated by normal usage of memory in a busy system. Say the attacker is looking for keyboard events. For this he is polling every 0.5 seconds and has, say, 15 good addresses he polls. That gives us 15 cache misses per 0.5-second period (possibly even spread out over the 0.5 seconds), which isn't much if a video encoder is causing thousands of misses, also going up and down periodically as the encoder switches between motion search (large number of cache misses and lots of hits) and entropy encoding (few hits and misses). This is where the approach I describe here can filter the noise and force the attacker into territory where we can detect him with confidence.

Prelude

No tech stuff in this part. Skip to the introduction to read the actual blog post. If you care about my ramblings, do continue.
It's been 8 awesome months since I started my comeback into infosec with this blog. It's time to take stock. A few times I've felt misunderstood. The most common misunderstanding is that I have a different concept of research than many in the infosec community. This blog isn't research. It is play. I do it for fun and games and that's it. There is no intent on my side to do research, because that requires me to be systematic about it, and that again is outside of my time budget. I hope from time to time to present a new idea here, but that's the loftiest I wish to get with this blog. I do hope that the ambitious infosec newbie might be inspired to look into some of the things I write about, and I also hope that infosec professionals occasionally enjoy a lunch break with my musings.
The most read post so far was the nostalgia post, followed by anything row hammer, followed by anything with machine learning in the title. Ironically, the post I like best on machine learning doesn't have those words in the title and isn't being read much. The post nobody cares about is the one on H.a.r.e.s., which I think is a shame – I admit it's not the most relevant, but H.a.r.e.s. sounds like such a cool idea. The LangSec post was the biggest surprise to me: it's quite well read.
As a personal note – I find the best infosec blogs are by far http://lackingrhoticity.blogspot.com/ by Mark Seaborn and http://www.malwaretech.com/ by MalwareTech. Mark's blog is really inspiring on a technical level, and MalwareTech's joy of finding things out is contagious to me.
For now I'll go offline for a couple of weeks and hope to spend a lot of time with a printout of an article about using performance counters to reverse engineer caches.

Introduction

This blog post benefitted from conversations with some really smart people. I'd like to extend my thanks to them. Nevertheless, this post is probably full of errors, and they are mine alone. I've not released source code because the topic is mostly offensive, and the source code would be more useful for offensive purposes than defensive ones. If you for some reason think my sources could be useful, get in contact (I'll be offline the next 3 weeks, but I will get back to you). Finally, I'd like to apologize that this post got rather rushed at the end. In fact, I'm writing this with a plane to catch. Particular damage was done to the section on different timers, where I really would've liked to add more information.


Cache side channel attacks have been around since 1996 or so, perhaps even longer. Recently they have had a renaissance in the literature because of the (re-?)emergence of a 3rd level cache. (The 3rd level cache is identical to the last level cache on systems that have 3 cache levels – this is worth noting when reading the literature.) The 3rd level cache on modern Intel processors is shared between cores, and this effectively allows a program at any privilege level to spy on programs running simultaneously on other cores without being concerned about sandboxes, privilege levels and, to some degree, even hypervisors. The sheer size of the new 3rd level cache also allows for more accurate spying than with previous caches. Cache side channel attacks on the L3 cache have been shown to successfully track mouse cursors from JavaScript on a web page, grab keyboard input from other users on the same system and grab cryptographic keys on co-located virtual computers in the cloud. It is thus not an entirely academic pursuit.
There are currently three different kinds of cache side channel attacks. I shall write about all of them to some extent here, so I’ll introduce them briefly before I go into a slightly more detailed description of how the cache works.
1.       Evict + time
The attacker measures the time it takes to execute a piece of victim code. Then the attacker flushes part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache.

2.       Prime + probe
The attacker accesses memory to fill part of the cache with his own data and waits for the victim code to execute (prime). Then the attacker measures the time it takes to access the memory he had carefully placed in the cache before. If it’s slow, it is because the victim needed the cache, and this gives us knowledge about what the victim did (probe).
3.       Flush + reload
The flush and reload attack exploits the fact that processes often share memory. By flushing a shared address, then waiting for the victim, and finally measuring the time it takes to access the address, an attacker can tell whether the victim brought the address into the cache by accessing it.

A short introduction to the level 3 cache

So let’s spend a bit of time considering how the level 3 cache is built. I’ll use details from Sandy Bridge as an example, but it’s mostly similar on other Intel architectures. The details, however, can be important when implementing an attack or a defense. I use Sandy Bridge for 3 reasons: I only have access to a Sandy Bridge (details and sizes are based on my wife’s computer), it’s the best researched in the literature and it’s likely to be the simplest.
At some point in the past CPUs became much faster than memory, forcing CPUs to stall while waiting for memory. Because memory accesses typically concentrate on a relatively small amount of memory at any given time, adding a cache of super fast (and expensive – in terms of CPU real estate and money) memory directly in the CPU gave a large performance gain for only a small price. Thus the first level cache was added. Later, as CPUs became faster and software more memory hungry, a slightly less expensive second level cache was added – real estate prices in the CPU drop with the distance to where the magic happens. And finally a 3rd level was added, which is a lot slower than L1 and L2, but still much faster than memory and comparatively huge.
L1 and L2 cache operations are slightly more advanced than L3 due to the fact that they are per core, so a messaging system exists to keep L1 and L2 coherent between cores. I’ll not discuss that here since my focus is the L3 cache; suffice it to say they are otherwise similarly built. There is however one design feature by Intel which is important: the cache hierarchy is inclusive. This means that if memory is loaded in L1, it’s also loaded in L2 and L3. If it’s loaded in L2, it’s also loaded in L3. This is Intel specific – the AMD K7 for instance also has a cache hierarchy, but it’s not inclusive. For an attacker an inclusive cache makes life much easier: if he can flush memory from L3, he automatically flushes it from all caches.
When a CPU vendor builds a cache they need to track which parts of memory are in the cache and which are not. This gives rise to a trade off between how much silicon you spend managing the cache, how flexible the cache is and how fast look up can be. The level 3 cache on Sandy Bridge balances these trade offs in the following way. Sandy Bridge’s L3 cache – like that of many other CPUs – is a so called N-way set associative cache. This means the memory is first divided into a number of blocks, and each block maps to a particular set in the cache. Each set can store any memory location within its block in N different places in the set, N being a small integer – for the Sandy Bridge L3 either 12 or 16, depending on the exact model. It is implemented this way as a compromise between two extremes. The first is called direct mapping, meaning that each memory location has only 1 location where it can be stored in the cache, making for quick look up but fierce competition between different parts of memory for this magic location – the N=1 scenario. The second extreme is called a fully associative cache and means that any memory location can be cached anywhere in the cache – the N=cache size scenario (only 1 set). While this makes for a very flexible cache, it also requires searching the entire cache for a match. A good way to think about it is that sets are indexed and ways are searched when looking for a cache hit. To further reduce the problem, the cache doesn't work on single bytes but on 64 byte chunks called cache lines. Finally, Sandy Bridge and many subsequent CPUs have multiple cores, and each core has a part of the cache located next to it, called a slice. The slices are connected by a ring bus so that any core can access the slice of another core, but only at a penalty. This is a compromise between fast lookup in the cache slice belonging to the core and access to the entire cache for all cores.
Graphically this looks something like this:



Summary example for my wife's Sandy Bridge: any given byte in memory can be cached as part of a 64 byte cache line. This cache line is cached in exactly one slice, which belongs to a core. In this slice it can be an occupant of exactly one cache set out of 2048. Within the cache set it can occupy any of 12 ways. 2 (slices/cores) * 2048 (sets) * 12 (ways) * 64 (bytes per cache line) = 3 MB cache total.
Let's look at it differently. A request for data is sent from core 1 down towards memory. We pick up where the request has arrived at the L3 cache:
1. First, part of the address is used to index the correct slice. If we are lucky it's on the current core's slice; otherwise we pay a small penalty for using the ring bus to access the slice of core 2.
2. Other bits of the address are now used to index the set to which the address belongs.
3. The N ways of the set are then searched for our address. If it's not found, the cache returns a miss and the request is forwarded to memory.
4. If the requested memory is found, the cache returns the relevant bytes from the cache line, using the lowest 6 bits of the address to index into the cache line.
For the details of how the bits of the physical address of the request maps to the cache see Seaborn(2015).

Eviction
The cache is filled either through memory reads or through the prefetcher. When you read memory, very often you’ll shortly thereafter read or write the same memory or memory in the same cache line. Thus reading memory usually causes what you just read to be stored in the appropriate cache set. The prefetcher looks for patterns in your memory accesses and tries to move cache lines from memory to the cache before you need them. Either way, a set is going to be full at some point, and to make room, memory needs to be evicted from the cache. The method by which the cache determines which cache lines are no longer needed is called the eviction policy, and the act of dropping a cache line is called evicting – or sometimes flushing. The first algorithm that comes to mind is to drop the least recently used cache line to make space. This is, not surprisingly, called an LRU eviction policy. The problem with such a policy is that you need to keep track of the order in which the cache lines were accessed in order to evict the least recently used one. This requires on the order of n*log2(n) bits (where n is the number of ways) and a search through those bits to figure out which line to evict. This can be done in different ways, but it’s always computationally intensive. With bits and time being expensive in the CPU, Sandy Bridge opted for a so called pLRU or “pseudo least recently used” eviction policy. Instead of accurately keeping track, a simple binary tree is represented with a suitable number of bits, so that each leaf is a way in the set. On access, the bits in the nodes along the path to the used way are flipped, and for eviction the opposite path is followed to figure out which way to evict. pLRU is not only algorithmically faster, since only the path nodes need updating, but also requires fewer bits. The downside is that it only approximates LRU. To read more about eviction policies you may like Jiménez (201x).
More modern CPUs than Sandy Bridge have more complex (adaptive) eviction policies. We can now see that if we wish to remove another program’s memory from the cache, we can do so by systematically loading our own memory into each cache set. Theoretically, for Sandy Bridge we need only N addresses (12 or 16) that belong to a cache set of interest; load each of them twice and the cache set will be filled with those addresses. Real life experiments suggest that a little more than twice gives better results. The interesting thing is that it’s quite a doable proposition. In fact it’s so doable that even without knowing the actual addresses we can figure them out by using timing, and thus do it from JavaScript without pointers. Oren et al. (2015) do this. A detailed discussion of optimally filling cache sets is given in Gruss & Maurice (2015).

Cache side channel attacks

Now that we have a partial overview of how the cache works, we can turn our attention to how the 3 aforementioned cache attacks use the cache to gain information.

Evict + time

In the beginning of this post I wrote: “The attacker measures the time it takes to execute a piece of victim code. Then the attacker evicts part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache.” We can now fill in the blanks. If we know which cache sets the victim is likely to use, we set up the cache so as to give us an idea of what memory the victim used. If we run the victim once, we know it’s likely to have cached part of the memory it used. Now we can time how long it takes to run, with a high resolution timer, to give us a baseline. Then we load our own memory systematically so that a number of cache sets (in which we are interested in terms of the victim) become full with our data – now we can assume that anything the victim placed in those sets has been evicted. Then we let the victim run again and compare how long it took to our baseline. We can repeat this process for different cache sets until we’ve figured out which sets the victim used. If branches are sufficiently large (larger than the cache line size), we can say which parts of the code the victim executed. If memory structures are bigger than a cache line or displaced across cache lines, we can say which memory structures it used. This information can be used to draw inferences about information private to the victim. Typical examples are attacking AES implementations that use tables for encryption, with the key being private to the victim, or KASLR, where actual addresses are hidden from view. We should note here that other tasks using the cache will add noise to our measurements, so repeating the attack again and again will improve accuracy – on a multi core system there is always noise on anything to do with memory. Also, the prefetcher is an enemy to be considered, as it might load memory in anticipation of it being used by the victim without the victim actually using it, thus giving us false information.
In practice these things make the attack less reliable, but by no means infeasible. The real downside of this attack is that we need to be able to cause execution of victim code, which might be difficult in the first place. The positive side is that we can apply it from JavaScript or native code, since we need not know any addresses of the victim to carry out the attack – the actual cache sets used are all the information that is required.

Prime + probe
The prime+probe attack removes the restriction of evict+time that the attacker needs to be able to start execution of victim code. To make this happen we again make sure that the interesting cache sets are filled with memory we control – this is the prime phase of the attack. Once this is the case, we wait a predefined amount of time to see if the victim uses memory in the region covered by the cache set. If it does, it causes one of our addresses to be evicted. The probe phase is to access each address we originally filled the cache set with. If this is slow, the victim has used memory within this cache set. Probing fast enough, we get a series of observations over time of which cache sets are being used on the computer. This allows us to draw lots of inferences about what the victim is doing at any time. Typically, to map cache set activity, the attacker first runs lots of attacks on a system under his control to generate a baseline data set, and then uses a machine learning method to map the time series from the victim onto these profiles and figure out what the victim is actually doing. Prime+probe is in my opinion the most powerful of the cache side channel attacks. Again, we don’t need actual addresses, so we can make it work in JavaScript. We don’t need to cause execution of the victim code, just loiter around – which is one less prerequisite for the attacker. Thus this attack vector is suitable for almost any situation where we get to run our own code on the same computer as our victim. Hypervisors, privilege levels, sandboxes and other barriers are irrelevant, because we all meet up in the cache. The downside of this attack is that the data we get back tends to be very noisy. We only see which cache sets were used (and partly how many cache lines were loaded), but we get no initial information about which core, thread or even which code caused the access.
To succeed we therefore need to repeat the process a lot, and we can spy only on events that are relatively common. Oren et al. (2015) cover this kind of attack in more detail.

Flush + reload
Flush+reload overcomes evict+time’s requirement of causing execution as well as prime+probe’s noisy data. It does this by assuming that victim and attacker share memory. This proposition at first seems weird, since even in unsafe operating systems such as Windows 95 each process is given its own memory space. But it’s not as weird as it may seem, because of a memory management feature of most operating systems called page deduplication. If two identical memory pages are loaded in separate processes, the operating system can save physical memory (and make more efficient use of the cache) by mapping the same physical page into each process. Most operating systems do this – even across different users and privilege levels. A typical example would be the code of kernel32.dll in Windows: it’s shared by processes, but most of the DLL (especially read-only portions such as code) is only mapped into physical memory once. Should a process decide to write to such memory, most operating systems incorporate a feature called copy-on-write: essentially, the OS sees to it that an exception is thrown on write access, and then makes a new copy of the page for the offending process alone before retrying the write operation. Page deduplication is rampant in all modern operating systems, and even some hypervisors are capable of doing it. The attacker now uses this by flushing a specific address shared by attacker and victim. Because he needs to flush a specific address, this attack seems impractical in JavaScript. In native code it’s quite easy to flush an address using the clflush instruction, but there is also no reason why he shouldn’t be able to do it using the normal eviction mechanics of the cache. Once the address is flushed, the attacker waits an interval and then times access to the address. If the victim used the address, it’s likely to be in the cache and the access will be fast. Since we know exactly which address we are talking about, we have very little noise.
Flush+reload can be used as a tailored attack, or by comparing what happens on a machine under the attacker’s control with what happens on a victim computer using machine learning methods – the latter is called a cache template attack. The best article on flush+reload attacks is probably Gruss, Spreitzer & Mangard (2015).

Current countermeasures to cache side channel attacks


My general opinion on security problems such as CSCs is that we should fix the root cause, and in this case the root cause is the micro architecture of the cache. The core problem, however, is the cache being shared across privilege levels, and that might be a significant hurdle for CPU designers to climb. Also, there is already plenty of existing hardware in use, so even if Intel were to fix the root problem we’d be stuck with these issues for a considerable amount of time unless we look at other options.
1.       Less accurate timer
Most cache side channel attacks use the timing of a memory access to determine whether it’s cached or not. This usually works through the RDTSC/RDTSCP instructions, which read the processor’s time stamp counter, or an operating system/library wrapper thereof. For JavaScript this can be handled directly in the wrapper, and this has already been suggested and implemented in response to Oren et al. (2015). Obviously, if the timer is sufficiently coarse, we cannot tell a cache hit and a cache miss apart. For user mode attacks, RDTSC(P) can be disabled in user mode through a bit in CR4, which will cause it to raise an access violation. This again can be used to emulate the instruction with any coarseness we wish, keeping compatibility with most existing software. Now, this approach is never going to be perfect. The reason is that if we start a timer at a random point in time and then carry out a flush or a probe, the probability of seeing an increment when reading the timer again increases with the time the flush or probe took. Thus, if we can sample enough events, we’ll be able to draw inference about what is happening – that is, spy. Keyboard presses are not such an event type, as they’ll be more or less unique events – we can’t just ask the user to type his password 10000 times. Whereas blitting a mouse cursor touches multiple cache lines at once, and in neighboring positions as it’s moved, so we could probably get a good idea about where it is even with a fairly coarse timer. One approach for attackers to get around coarse timers would be to use performance counters to count LLC cache misses instead of relying on timing – where their privilege level allows for it. Also, on virtual machines performance counters are often not implemented. OS and virtualization implementers should however be aware that performance counters provide a powerful tool for reading side channels, even when you only get your own performance data, because of the sharing of resources in the system.
I’ll return to timing later.
2.       Non-inclusive cache hierarchy
Obviously this does nothing against code using the CLFLUSH instruction, but people trying to evict the cache by filling a cache set now need to bother with flushing all levels in the hierarchy separately. This makes it more difficult, but I don’t think it’s a deal breaker – since we can relatively easily evict a cache set in a 3 MB cache, evicting one in a 32 kbyte cache seems like a feasible task. I’ve not read about cache attacks on the AMD K7, but I’m fairly pessimistic: despite the non-inclusive cache hierarchy, CSCs are probably possible on this processor too.
3.       Fully associative caches: though they may not deliver the same performance, they do make it very difficult to evict the cache sufficiently fast without the CLFLUSH instruction, as we’d essentially need an extreme number of ways in our eviction set. Again this would only be a solution on new hardware, and to really close the gap the CLFLUSH instruction should be made privileged. There are other reasons – row hammer – to follow such a path.

4.       Disable cache-line sharing
Flush+reload attacks can be dealt with effectively if there is no shared memory. This ranges from disabling page deduplication at the operating system/virtual machine level to actively causing copy-on-write on particularly interesting pages. On Windows 7 a simple ReadProcessMemory() followed by WriteProcessMemory() is enough to trigger a copy-on-write and thus disable flush+reload – you don’t actually need to change the memory! There are other reasons to reconsider page deduplication: the copy-on-write takes time and opens up a side channel attack of its own, by timing write operations to pages. I believe Daniel Gruss wrote a paper on the subject, but I could be mistaken.
5.       Accessing and flushing memory “randomly” by a victim or operating system obviously adds lots of noise to the process. I love the irony of using cache side channel techniques to mitigate/detect cache side channel attacks – but despite this irony, there is obviously a heavy performance penalty associated with causing cache misses to protect against attacks, and in scenarios where the attacker makes many repeated attempts it is likely to be too big.

6.       The Intel CPU tries to load data into the cache before it’s used: prefetching. It’s especially useful for code and for predictable reads such as memcpy(). Obviously the prefetcher continually causes cache hits even when nobody actually accessed the memory, thus making it more difficult to get good data on memory activity for spying purposes. However, I find it unlikely that we’ll see a prefetcher good enough to land a very strong blow on the possibilities of an attacker.

7.       It has also been suggested to modify software so that each user has a version with a unique memory layout and access paths. Imagine a powerful mutation engine at work. This has been suggested as a remedy against code reuse attacks in the past, and is likely to make CSCs quite a bit more difficult. It is also quite feasible, as such a mutation engine was used in a successful virus by Z0mbie back in 2005. The downside is of course that mutation engines of any kind pose a Q&A problem for software developers. This method has not yet gained traction in fighting code reuse attacks, so the probability of it gaining traction in fighting CSCs seems slim.

8.       Using performance counters is also an option for detection. In mine and Nishad Herath’s Black Hat slides I suggested instrumenting RDTSC(P) or other high resolution timers to first make them less accurate and second to start counting cache misses using the Intel performance counters. This forces a spy into territory where he has to cause more cache misses to get reliable information. This, combined with the presence of a high resolution timer, makes a very good heuristic for cache side channel attacks – especially since one potential response would be to make the timer even less accurate, thus only mildly impacting the system as a whole in the event of a false positive. In my experiments false positives have been rare, though low-frequency spying sometimes ends in false negatives. I imagine that further experiments, along with tuning the coarseness of the timer, could make it a viable solution. However, let’s take a look at timing from an attacker’s perspective. Even if the attacker manages without a high resolution timer, it might be well worth looking into performance counters to detect cache side channel attacks – because any high-resolution-timer substitute is likely to leave uncommon artifacts in the performance counters.

Timer issues


From the above it looks like attacking the timer is the most promising defensive technique, as it deals with all three methods of cache side channel attacks. It’s feasible and relatively easy to implement (at least if we don’t consider Patch Guard), and though it might not give 100% peace of mind it at least goes a long way in the right direction. However, there might be other ways to get a high resolution timer on the x86. I’ll suggest two such methods here; others may exist:
1.       Delta timer. Use a low-res timer, wait for the moment it increments and then do your measurement, using register-only instructions to fill the gap so that a cache miss will cause the timer to increment and a cache hit will not. I call this method of timing a delta timer. Probably somebody in the literature has assigned it another name; if so, please contact me and tell me. On old single core CPUs, register-only instructions had deterministic timing. This is not the case on multi core CPUs, which makes it a bit less reliable. The figure below shows a graph of 30 observations from my Sandy Bridge when trying to time out 8000 nanoseconds. It suggests that if we need a delta timer to be 95% correct, we need at least 10 cache misses to be sure there were cache misses at all. This brings it into a range where the performance counter detection method would work well. However, if we accept an error rate of 1/6, the delta timer might do the job while being difficult to detect using performance counters. Now remember that this is not a real scenario, where we’d have a lot more noise already. But it’s unfortunately not infeasible that a delta timer would work in many real world scenarios – especially for flush+reload, which is a lot less susceptible to noise than the other methods. But who says the coarseness of the timer should be constant? With a random value (random drift?) the delta timer would be difficult to use – though that again is a different scenario from what has been proposed for JavaScript.
2.       Thread timer. I’ve dubbed this kind of timer a “thread timer”, without knowing if there is any literature on the subject. The idea is to have a thread running that starts counting on a signal through a global variable; when the global variable is reset, it stops counting. The count then serves as a high resolution timer. There are problems in implementing such a timer. Mainly, accesses to the global variable are not automatically perfectly synchronized, leaving a non deterministic (and significant) time gap from when the timer is told to start to when it actually starts – and the same problem exists for stopping. Intel does provide instructions to deal with this – mfence and the lock prefix – but I’ve not investigated them, as I figure they’d be difficult to generate from JavaScript (I find the JavaScript attack vector the most threatening). For native code they would likely bring improvement, but not determinism. The first thing I noticed while experimenting was that I got absolutely random results until I started pinning the thread to a specific CPU (core or hyper thread). The behavior is very different on hyper threads and real cores. The unit for the thread timer is the number of times ecx was incremented, here during 1 uncached read and during 10 cache misses. The curve for hits looks similar. You may notice that it’s shorter than for the delta timer; this is because I’ve deleted a handful of observations with the value 0 – that is, where the thread timer didn’t start at all! When timing only one cache miss I get only a few observations with counts. This hints that we need at least a handful of cache misses to tell misses from hits with a thread timer. Obviously this is not the last word on thread timers, but it suggests they are fairly reliable on hyper threads, a little less so on real cores, and that thread timers are unsuitable for spying on events that give only a few cache misses as feedback. The thread timer code I used is listed below.
In short, my suggested use of CR4 to reduce timer resolution does pose a problem for an attacker looking for rare events with only a few cache misses – say key presses.
// Timer thread body: spins until the measuring party sets g_Start, then
// counts in ecx until g_Start is cleared again. The count in g_Time is
// the "elapsed time".
while (g_Start == 0);
             g_Time = 0;
             __asm
             {
             WaitLoop:
                    //mfence            // not investigated - see discussion above
                    cmp g_Start, 0
                    jz WaitLoop         // spin until g_Start is set
                    xor ecx, ecx
             again:
                    inc ecx             // count as fast as possible...
                    cmp [g_Start], 0
                    jnz again           // ...while g_Start remains set
                    mov [g_Time], ecx   // publish the count
             }

       g_Start = 0;                     // signal that g_Time is ready


Maybe I drink too much

In private conversations I’ve suggested in the past to use cache side channel attacks to detect rootkits, including SMM rootkits. My suggestion goes like this: say we have a rootkit type that tries to read user mode private keys in memory. The possibility exists that we can trigger this behavior and use it for an evict+time attack. Imagine an SMM rootkit with a hook on SomeOpenPrivateKeyFunction() which reads the input irrespective of validity, and that our input exceeds a cache line, but that SomeOpenPrivateKeyFunction() won’t read past a syntax error in the key. The scenario sounds farfetched, but maybe not too farfetched. Then set up a private key the size of two cache lines with a syntax error in the first cache line. Flush the second cache line and then call SomeOpenPrivateKeyFunction(). Now time accessing the second cache line of the fake private key. If SMM touched it, chances are it used the cache, and we’ll be able to see the time difference. Obviously there are plenty of things a rootkit could do to mitigate this, but at least we’ve started a game of hide and seek.
Another interesting (mis-)use of cache side channel techniques would be detection of undesired runtime environments. Emulators and VMs could be detected. Emulators, because it’s quite difficult to emulate a cache correctly and most just plain skip it. VMs could be detected by bootkits through the fact that the cache is likely to be full under a VM, whereas on a real system it should be empty at boot. I’m sure other cache based methods are available.

 

Conclusion

Cache side channel attacks certainly have a high coolness factor, which is why I bothered to write about them. On the surface of it, spying on another VM from a sandboxed JavaScript web page – being able to exfiltrate crypto keys, key strokes, mouse movement etc. – sounds like the perfect storm. Add to it that (in my opinion and limited knowledge) nobody has found a good strategy for mitigation yet.
Certainly cache side channels pose a real danger to systems. But cache side channel attacks won’t be responsible for a buffer-overflow-type wave of new breaches. The value of figuring out private keys or reducing the entropy of KASLR is just so much lower than a traditional breach. In the face of the mitigation options already available, the value of cache side channel attacks is reduced even further. Attackers can only figure out private keys if they are used often, and attackers cannot really spy on key strokes with a reduced resolution timer. JavaScript spying is made more difficult, and with its level of noise it spies only on repeated events, and only until somebody closes the web page. However, the message should be: on sensitive systems these mitigations should be implemented – especially the lower resolution timer and the deduplication measures. And that JavaScript engines are already implementing a lower timer resolution is a good thing, even in the presence of other timers, and even though I to some extent would’ve done the details differently.
And finally, just because there currently are no good mitigations on the table certainly doesn’t mean there aren’t any, and even less-than-perfect mitigations raise the cost/benefit ratio for an attacker. I for one would love to spend more time with performance counters. I just have to wonder what kind of quirks thread timers and delta timers have in the micro-ops world of performance counters.

Literature
Oren et al. (2015): “The Spy in the Sandbox – Practical Cache Attacks in JavaScript”; Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan and Angelos D. Keromytis. arXiv:1502.07373v2
Gruss, Spreitzer & Mangard (2015): “Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches”; Daniel Gruss, Raphael Spreitzer and Stefan Mangard.
Jiménez (201x): “Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches”; Daniel A. Jiménez. http://taco.cse.tamu.edu/pdfs/15-jimenez.pdf
Seaborn (2015): “L3 cache mapping on Sandy Bridge CPUs”; Mark Seaborn. http://lackingrhoticity.blogspot.de/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html
Gruss & Maurice (2015): “Rowhammer.js”; Daniel Gruss, Clémentine Maurice. http://arxiv.org/pdf/1507.06955v1.pdf

Sunday, August 16, 2015

Speaking at Black Hat

Introduction

 The final version of the slides of our talk is here

As some might have noticed, I had the honor of speaking at Black Hat 2015 in Las Vegas with my old friend Nishad Herath. The topic of our talk was “These Are Not Your Grand Daddys CPU Performance Counters”. More specifically, our talk was about using the Intel performance counters (PMCs) for defensive security purposes. Unfortunately the talk didn’t go as well as I’d wanted it to and I certainly don’t think I conveyed what I wished to. Also, we made some last minute changes to the slides in the presentation. Hence this blog post with the updated slides and my personal comments on them.

I apologize for the format of this blog post, but I really want to get it done and move on. The majority of these comments were written down as notes and I had them on stage with me. A few augmentations have been made for this blog post and I’ve made an effort to make them readable to people other than myself. These notes were obviously heavily inspired by discussions with Mr. Herath, so I have to share any credit with him. Nevertheless, they were my notes and any errors are mine alone.

The story

The story of how I got to speak at Black Hat is interesting in itself (I think) and it certainly did have an impact on the talk. If this doesn’t interest you, skip straight to the next header. Originally it was Mr. Herath’s talk and I was invited to join late in the process. It wasn’t until the 25th of June that things became official and I started really believing that it would work out. This left just 5 weeks of preparation time for me, and at that point I didn’t even know that the slides were due on the 20th of July. So the big rush started on my side to get up to date on the details of performance counters, what other people had done with them, etc. It didn’t make it easier that Mr. Herath was in an entirely different time zone either. Also worth mentioning is that we’d been asked to spend a significant amount of our time on row hammer. Personally I would rather have spent my time elsewhere, given that Mark Seaborn and Halvar Flake were already giving a talk on row hammer and they know much more about it anyway. I especially found it silly that, with two talks touching on row hammer, they ended up being scheduled in the same time slot.

While we were working on slides, the big speculation was that row hammer was doable in JavaScript, and we wrote up slides in line with this speculation while working frantically to figure out if/how it could actually be done. I succeeded in flushing the cache fast enough without clflush (only on Sandy Bridge) the night before Daniel Gruss and Clémentine Maurice published their (really nice) rowhammer.js paper, which obviously turned our slides over. No more speculation; row hammer in JS was fact.
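Flushing without clflush boils down to touching enough addresses that map to the same cache set until the target line falls out. A minimal sketch of that idea, with one set modeled as a small LRU list (the associativity number here is just an assumption, and real last-level caches use pseudo-LRU and complex set addressing, which is exactly what makes the real thing hard):

```python
from collections import OrderedDict

ASSOCIATIVITY = 12   # assumed ways per set; varies by CPU model

class CacheSet:
    """One cache set with true-LRU replacement (a simplification)."""
    def __init__(self, ways=ASSOCIATIVITY):
        self.ways = ways
        self.lines = OrderedDict()          # ordered oldest-first

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # refresh LRU position
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)   # evict the LRU line
            self.lines[addr] = True

    def contains(self, addr):
        return addr in self.lines

s = CacheSet()
s.access("victim")
# Touching `ways` congruent addresses forces the victim out - no clflush needed.
for i in range(ASSOCIATIVITY):
    s.access(f"evict{i}")
print(s.contains("victim"))   # -> False
```

Finding addresses that are actually congruent (same set, same slice) is the part Seaborn’s L3-mapping post and the rowhammer.js paper deal with.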

We had added cache side channel attacks because I’d noticed that my CSC code lit up my row hammer detection as well. It was always meant as a stop-gap in case we didn’t use up our time with ROP, row hammer and rootkits, and I saw it as a way to get some fresh meat on the table. Just a few days before Black Hat a W3C JavaScript draft came to my attention. In this draft they wanted to make the timer less accurate (3 µs) in response to “Spy in the Sandbox”. This rang a bell in my head and turned everything in the CSC slides upside down – from having something that only worked reasonably at a very high frequency of polling the side channel, we went to something where we could actually do a fair detection at much lower frequencies. This caused the last minute changes to this section – as you may notice they are pretty severe on the low frequency stuff. Now I don’t pretend that the CSC changes are the final word on cache side channel attacks and performance counters. I think it’s enough to show that PMCs can be relevant to cache side channel attacks. If you’re interested in this subject, I’m writing another blog post to elucidate my findings more accurately than the slides were ever intended to.
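To see why a coarser timer changes the game, consider a toy probe. With nanosecond resolution a cache hit and a miss are trivially distinguishable, but rounded to a 3 µs tick both read as zero, so the attacker is forced into repetition/amplification loops – and it is exactly that repeated, unnatural access behavior a defender can hunt for. (The hit/miss latencies below are made-up ballpark numbers, not measurements.)

```python
RESOLUTION_NS = 3000   # the W3C draft's 3 microsecond tick

def coarse(t_ns, resolution=RESOLUTION_NS):
    """Round a duration down to the coarse timer's resolution."""
    return (t_ns // resolution) * resolution

HIT_NS, MISS_NS = 60, 200   # assumed single-access latencies

# Single probe: hit and miss are indistinguishable on the coarse clock.
print(coarse(HIT_NS), coarse(MISS_NS))              # -> 0 0

# Amplified probe: time a loop of 100 accesses instead of one.
print(coarse(100 * HIT_NS), coarse(100 * MISS_NS))  # -> 6000 18000
```

The amplification loop is what drags the attack down to a much lower effective polling frequency, which is what made low-frequency detection interesting.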

It is such fun to work with stuff that is still in movement, but it does throw you through some hoops when you are on a tight deadline.

The final event that really turned things upside down was that Mr. Herath’s laptop (which we were going to use for the presentation) gave up, and we had to spend the morning before our talk (the only half day of actually being physically in the same location before the talk) summoning up a new laptop and agreeing on the last minute slide changes – thanks to an anonymous benefactor who lent us a laptop to make changes on while we were still in pursuit of one for the presentation. Thus I went to speak at the Black Hat conference without ever having seen a conference talk, and not with the best of preparations either. I don’t think it was a disaster, but I should have done a lot of things better. I learned a lot and I’ll do better next time – if there is one. As a kind soul remarked, it takes a few iterations to get right – though I agree with this, I had certainly set the bar higher for myself. The real error, I think, was insufficient preparation on my part, and too much detail and too much material.


My comments on the slides


Performance counters count different micro events in the CPU. These micro events tell a lot about the code running on the CPU, and since malicious low-level code often behaves differently than “normal” code, the performance counters can be used as a signal to look for malicious code. As an example, ROP code causes excessive “return misses”, as the “ret” opcode is not used paired with a call, but rather to progress along a gadget chain. Unfortunately events are plentiful, so there is a lot of noise in the signal too. However, there are ways to harness this noise and make useful inferences about the presence of malicious code from the performance counters. The talk, in my opinion, summarizes into one statement: performance counters can be very useful for security purposes in a great many ways. My wishful thinking in this regard would be that these things made their way into security products, that Microsoft got their act together and made a decent interface for using PMCs under Windows, and that Intel would consider adding security related events to the PMCs instead of just stuff for optimizing their CPU or your code.
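The ROP/return-miss connection can be illustrated with a toy model of the CPU’s return address stack: calls push a predicted return address, rets pop and compare, and a mismatch is the event a PMC would count. (Real predictors are more involved; the 16-entry depth is an assumption matching the LBR depth discussed further down.)

```python
class ReturnPredictor:
    """Toy return-address stack, 16 entries deep like many real CPUs."""
    DEPTH = 16

    def __init__(self):
        self.ras = []
        self.ret_misses = 0   # the event a PMC would count

    def call(self, return_addr):
        if len(self.ras) >= self.DEPTH:
            self.ras.pop(0)            # oldest prediction falls out
        self.ras.append(return_addr)

    def ret(self, actual_addr):
        predicted = self.ras.pop() if self.ras else None
        if predicted != actual_addr:
            self.ret_misses += 1       # mispredicted return

cpu = ReturnPredictor()

# Normal code: call/ret pairs match, so no misses.
for addr in range(0x1000, 0x100A):
    cpu.call(addr)
for addr in reversed(range(0x1000, 0x100A)):
    cpu.ret(addr)
print(cpu.ret_misses)   # -> 0

# ROP chain: rets hop from gadget to gadget, never pushed by a call.
for gadget in (0x41410, 0x41420, 0x41430, 0x41440):
    cpu.ret(gadget)
print(cpu.ret_misses)   # -> 4
```

Four gadgets, four return misses: the excessive miss rate is the signal, and the noise problem is everything else that also misses occasionally.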

Slide 24
I think it’s worth noting that I did the majority of my work with PMCs on Windows XP 32-bit. The reason behind this was that I had such a system available to me and that I avoided driver signing and Patch Guard issues. Most other work has been done on Linux derivatives such as Debian, which have a much nicer interface to PMCs than Windows. There are of course ways around Patch Guard, but it wouldn’t be nice to pursue them. Also there might be undocumented HAL APIs, such as HalpSetSystemInformation() on WinXP, to hook into the PMI. For non-PMI use there is an API on newer systems on a per-process basis for user mode. Visual Studio 12 comes with horrible tools for using that API – I’m not aware of any documentation, but you can reverse engineer those tools; they are not too big. Sorry – I don’t have much information on this. Maybe I’ll look into it… no promises though.

Slide 30
This slide is, to me, our key slide. Not only does it show how you can minimize the performance impact of using performance counters to look for malicious activity, it also gives some basic ideas on how to manage the noise problem with performance counters. The basic idea is that other analysis can often tell us something about the conditions under which malicious code will run and where it won’t. The simple example: row hammering is not very useful if the attacker is already running code in ring 0, thus we need not examine events in ring 0 to detect row hammering. Each of our 4 examples of detecting/mitigating malicious activity uses some aspect of the methodology in this slide.

Slide 35
Ret-prediction is terrible around task switches, because the CPU’s shadow stack does not get swapped while the real stack does. The 400k bad-case scenario on slide 29 was for (int i=0; i<300; i++) Sleep(100);, which causes lots of task switches. Thus on task switches we’d see heavy performance penalties and problems with the limited depth of the LBR. kBouncer essentially gets around the limited depth of the LBR (which records only the last 16 branches) by instrumenting the code of targets of interest around API calls. This can be seen as the instrumentation method of slide 30. Interestingly, the first workaround from an attacker’s point of view was to use the still-limited depth of the LBR to bypass kBouncer. However, traditional performance counters could be used with instrumentation as well, and then depth would be limited only by acceptable performance loss and memory. The implementation would be slightly more complex than kBouncer (because we’d need instrumentation to start and stop collection), but we are well within the realm of possibility.

Slide 37
We have essentially the same problem as in slide 35 for kBouncer; however, the solution is different. Georg Wicherski limits collection to ring 0 and gets around the problem of task switching that way.

Slide 48
I wished to make the point that row-hammer-like problems are likely to increase in the future as DRAM grows increasingly dense. That also means the problem is likely to see hardware fixes in the future.

Slide 49
This slide can serve as a way of sorting the following row hammer mitigation methods into categories. It should be noted that having two physical rows in the same bank is required even for single-sided hammering because of the row buffer; this slide gives a false impression there.

Slide 52
Even though it’s not good enough on its own, it might be a useful supplement to other methods, including ours.

Slide 53
The performance measure is on memory access only. It doesn’t map in a well-defined way to system performance, because the Intel CPU is able to rearrange instructions while waiting for memory. This will tend to make the system-wide loss smaller. On the other hand, memory speed is very often the bottleneck on performance, dragging things in the other direction. I think the latter effect will dominate, but I have no evidence backing this belief.
The answer to the last question is a definite no, though it might be enough for some RAM.
Slide 54
It’s my guess that power consumption will not be relevant. I found two different sources painting different pictures. I think it would make a difference in sleep, but since we cannot row hammer in sleep, the point is moot. While the system is awake, other components of a laptop should outspend the RAM by a large factor. So if implemented right, it should be a non-issue.
Slide 52+55+56
All three are mitigations through faster refresh as per slide 49. PARA and pTRR are more targeted refresh, to avoid the steep penalties of refreshing the entire RAM more often. The latter two methods seem to me to be the “right” solution – essentially row hammer is a micro-architecture problem and it should be fixed in micro-architecture too, and I consider that likely to be done; see also the slide 48 comments on RAM growing more dense. However, that requires people to get new hardware.

Slide 60
I’ve not been able to do row hammer with non-temporal instructions without flushing the cache by other means first. It seems that using these instructions from JavaScript is very difficult because JIT compilers don’t generate them. (There was a paper on this; if you’re interested, ask me and I’ll find it.) There might be other reasons why Intel should consider making CLFLUSH a privileged instruction: cache side channel attacks. Additionally, there is little use for CLFLUSH except for optimizing cache usage to speed up code, which seems rather difficult. Again, making CLFLUSH privileged does not solve the problem, but it doesn’t exactly hurt either.

Slide 68
In addition to mitigation through delay, we could try to figure out the victim row and read from it to trigger a refresh (a row is automatically re-written on read).
Also interesting: here we use a rarely-triggering interrupt (say, every 1000 LLC misses) to trigger the costly slowdown. It’s an example of using a rare event, as per slide 30, to trigger more costly analysis. (A totally irrelevant note that I didn’t know where else to put: at one point I used ret-miss events to alert me of process switching while playing around, instead of hooking my way into getting alerted on task switches.)
Performance cost analysis: a normal LLC miss costs around 200 ns and an interrupt around 500 ns, so triggering an interrupt every 1000 misses costs only ~0.25% performance, and 1000 is a really low number for row hammer; 100,000 would probably be more appropriate.
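The back-of-the-envelope cost above works out like this (numbers straight from the slide comment, so treat them as rough ballpark figures, not measurements):

```python
def pmi_overhead_percent(miss_ns=200, interrupt_ns=500, period=1000):
    """Overhead of taking one PMI every `period` LLC misses,
    relative to the time the misses take anyway."""
    work_ns = period * miss_ns
    return 100.0 * interrupt_ns / work_ns

print(pmi_overhead_percent())                # every 1000 misses  -> 0.25
print(pmi_overhead_percent(period=100000))   # every 100000 misses -> 0.0025
```

At a period of 100,000 misses the interrupt cost vanishes into the noise, which is why the sampling-interrupt approach of slide 30 is cheap.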

Slide 71
Essentially, the rootkit detection here is detecting code executing out of bounds, that is, outside a known whitelist. We use the probabilistic element of slide 30 to keep the performance cost down, while staying in ring 0 only makes it feasible to maintain a whitelist of known good code. I dislike using the word heuristic from slide 30 in this sense – if an interrupt triggers on this code, there is no doubt the code executed; however, nobody says that executing the code will trigger a PMI (thus: probabilistic). Finally, we hint that using instrumentation around particularly vulnerable code could increase the true positive rate enough to get a chance at catching traditional exploits doing their work.


Slide 86 & Cache side channel attacks in general
I’ll just note that we’re using instrumentation (slide 30 again) of the fine-grained timer, in addition to making it less fine-grained, to force behavior that is less likely to occur naturally and thus make the attack detectable. It is in many ways a heuristic approach. I should note too that false positives are possible and you’d have to measure your response accordingly – e.g. make the timer even more inaccurate, flush the cache, trigger copy-on-write etc. What should also not be forgotten: this is intended to show that PMCs can be relevant to cache side channel attacks. It is not the final word, as the attacker does have options to evade, and I’ll get back to that when I finish my next blog post. On the good side, there are more defensive possibilities too… To be continued…