Errata regarding CAT
Yuval Yarom (University of Adelaide) was kind enough to email me regarding my speculation on CAT changing the cache side channel game and its impact on row buffer side channels. Before I proceed with what Yuval had to say I'd like to recommend a paper Yuval recently co-authored called "Amplifying Side Channels Through Performance Degradation". You can find it here: http://eprint.iacr.org/2015/1141.pdf
Yuval corrected me on some points and verified my thinking on others. Below is my summary of his comments:
1) CAT is available on some Haswell processors. E.g. Xeon 2618L v3.
2) CAT partitions on ways.
3) CAT only controls evictions. If a process can access memory CAT will not intervene with writing, reading or flushing from the L3 cache.
4) Thus CAT cannot protect against attacks that rely on clflush in particular: flush+reload and flush+flush. It could be used to protect against attacks that rely on eviction such as prime+probe.
5) With only 4 COS values it will be of limited use in many cloud use cases.
6) If you wish to defeat clflush based attacks the approach should be to avoid sharing memory.
I did not foresee that CAT would only control evictions and not flushes. Yuval's point on CAT not protecting against flush+reload and flush+flush extends to row buffer side channel attacks as well.
Comment on concurrency
Peter Pessl, Daniel Gruss, Clémentine Maurice and Stefan Mangard released a paper called "Reverse Engineering Intel DRAM Addressing and Exploitation"; you can find it here: http://arxiv.org/abs/1511.08756. This paper hinges on the exact same idea that I present in my blog post and was released almost simultaneously. Daniel Gruss and I were discussing the flush+flush paper prior to its release when I mentioned my work on RBSC attacks. He explained that he and others were working in the same direction, and we agreed to work concurrently on our own but to synchronize our publication towards the end of November. I received a draft of their paper hours before they published. The only relevant influence I've had on their work is reporting a typo, and probably my concurrent work moved up their schedule. They attack head on the microarchitecture problems I've been skirting around in this blog, which is pretty cool. I would've loved to play with those issues if I'd had the time. Their work might also prove useful for my current "pet project".
Cache side channel attacks have recently had a renaissance in the literature due to the emergence of a 3rd level cache that is shared between cores on a system. However, since the dawn of time a feature of DRAM has been available for side channel attacks very similar to CSC: the row buffer. I'll make the case here that the row buffer is usable for side channel attacks. I strongly recommend you read my "Cache Side Channel Attacks" blog post before proceeding, because there is a lot of crossover between how cache side channel attacks work and how a row buffer side channel attack could work. I will not spend much time on these issues in this post and generally refer you to my older post. My actual PoC code for an attack is not public. As usual I do not make offensive security resources public.
Before anybody gets too excited: the practical importance of a row buffer side channel attack is probably very limited. The reason is that the emergence of the 3rd level cache has in many ways made it obsolete, and the presence of that cache makes the attack impractical because the effect of the cache will tend to be in the way. The practical importance of such an attack is thus limited to corner cases or computers prior to the introduction of big caches. From a theoretical perspective, I'll argue that my tiny effort has been worthwhile. Also, my guess is that noise problems will significantly hamper its use. However, the attack may very well scale to other platforms, because what I utilize here is a feature of DRAM and not a feature of the processor. This means potentially any device using DRAM could be vulnerable to this kind of attack.
Current DRAM comes in modules called DIMMs. If you buy a modern memory module for your PC, you're buying a DIMM. Most DIMMs have chips on both sides; each side of the DIMM is a rank. Each rank again consists of a number of banks. The banks are typically the physical individual chips you can see. Inside a bank you'd find a two-dimensional matrix of memory cells. There are 32k rows in the matrix and 16k or 512k cells per row. Each cell stores one bit and consists of a transistor for control and a capacitor which stores charge to signify a bit equal to 1 and no charge for a bit equal to 0. Thus a row stores 8kb or 64kb of data depending on the exact kind of DRAM you have in front of you. When you read from DRAM an entire row is always read at once. Reading a row means checking whether its capacitors are charged or not. Since this operation inevitably discharges the capacitors, DRAM first reads the entire row into a buffer, returns the read-out values to the bus from there, and then writes the entire row back into the matrix to renew the charge. However, if two reads to the same row occur, there is no need to fetch the memory from the matrix the second time, and thus the row buffer works as a cache (this is why row hammering requires reading from 2 rows within a bank to be successful). Such a row buffer is either 8kb or 64kb – obviously the same size as the row. For a more accurate description I refer you to Wikipedia.
How my RBSC attack works
Just like with a standard cache, we can use this to gain information about the memory access patterns of another process. The obvious requirement is that we need to share row buffers with the victim. Technically we can share a row buffer with somebody without sharing memory – in fact it's bound to happen to some degree, since the page size is smaller than the row size and the operating system allocates memory through paging. I shall not discuss this case, however. The reason is that utilizing it for an attack requires knowledge of the microarchitecture of the processor – in particular the physical address mapping. Mark Seaborn has done some work on this, and going further in this direction is way beyond my time schedule – there is no way I can research this in less than 24 hours, and I don't write blog posts that take longer than that. Also, it's likely that we'd need an additional information leak to really leverage RBSC in this fashion. I may get back to this in another blog post, but don't count on it. This leaves me with spying on access to memory that is shared between victim and attacker. Unfortunately there is plenty of shared memory between processes – typically the operating system avoids loading shared modules twice and instead just maps a single instance of physical memory logically into all processes that use it. Even virtual machines often share memory. Thus my attack has similarities to the "flush+reload" cache side channel attack. Where flush+reload CSCs work on cache lines and the attacker is able to determine access within a 64-byte cache line, with row buffer side channel attacks we'll only be able to tell something about an 8/64kb row buffer. The mechanics of the attack I've written up are:
1) First read tons of memory to statistically make sure another row is in the row buffer. This could surely be done in a more efficient fashion, but again that would require knowledge of the microarchitecture – however, if you read the original row hammer paper you'd know that my approach has a very good chance of working correctly. See Seaborn & Dullien(2015) for more. To be specific, I read 200 times from random locations in a large allocated buffer to give me a good chance of having read from another row in the bank I'm interested in. Theoretically I could just do normal read operations, but there is less noise if you make sure the memory you're trying to read isn't in the cache, so I've added a CLFLUSH instruction prior to each read. Also, since we need a baseline memory access timing, I use this phase of the attack to time the memory accesses using the RDTSC instruction. Because modern processors are capable of executing out of order to make the most of waiting for memory, it's important that the timing operation is bracketed by MFENCE instructions to avoid reordering. It is also important that the baseline is timed on the same address you wish to attack: if the baseline address belongs to a different L3 slice (near a different core) than the address under attack, querying the L3 cache has a different timing due to the ring bus.
2) Finally I time how long it takes to read the memory I'm interested in as an attacker. A fast read means the victim used this row – that is, the data was returned from the row buffer. A slow read means the victim did not use this row – the data was returned from the matrix. Whether a read was fast or slow is measured against the benchmark from step 1. Here too I use the CLFLUSH instruction beforehand to make sure the read is not served from the cache, as well as the MFENCE instruction to protect against reordering.
Now, there is lots of noise in this attack. Although there is a measurable difference between the timing of buffered and unbuffered reads, the difference is very small compared to a traditional flush+reload attack. And on modern multi-core computers you always have "noise" in execution times (especially on memory), leading to frequent misidentification. Another reason is that the chance of another, non-victim process using a different row is much, much larger than the chance of another core causing a cache eviction (a cache set has 14/16 64-byte ways – a bank has 32k rows of 8kb memory), let alone using the same cache line as the victim. And finally, another core using the same DIMM (RAM module) while you are polling will cause delays in excess of the saving from using the row buffer as a cache. Another potential source of noise is the memory refresh, which reads and rewrites row buffers periodically. With so much noise, coarse-grained attacks are unlikely to bring significant rates of success, and high levels of memory activity on other cores are poison to the attack. When running under favorable conditions and probing, I only get around 70% correct predictions of whether the victim used my row buffer or not.
I should mention that my way of launching this attack can be improved upon: instead of just statistically making sure that another row is loaded into the row buffer, you could do it exactly if you had knowledge of the microarchitecture. I often observe very slow reads (a factor 3 or more) for reasons other than missing the row buffer. The amount of time spent, however, clearly indicates that we're seeing noise and the reading should be disregarded, both when finding the baseline and while attacking. A likely cause for these long waits is that we are forced to wait for another core's request to access memory on the DIMM.
I said in the introduction that this attack vector probably isn't that important on modern computers. The reason is that the 3rd level cache is bound to pick up on any memory operation occurring, and thus the second read that is required is likely never to make it to DRAM. We can work around this using CLFLUSH or equivalent means – this is what I did, as described above. But why would an attacker bypass a better side channel to use a worse one, and in the process leave behind all the hallmarks of a true CSC attack? Nevertheless, there might be corner cases where RBSC attacks could provide a viable alternative to CSC attacks.
I think the solution to RBSC attacks has to be a (micro?)architectural one. But since we probably aren't going to get one, we may wish to consider software mitigations. Mostly the same mitigation methodology would apply to row buffer attacks as to cache attacks. First, on modern computers with an L3 cache the attacker needs to flush the cache to make the attack work, and this makes my performance counter detection a viable solution for high frequency polling: to access the row buffer instead of the cache, the attacker must cause a cache miss – and typically where we'd normally expect a hit. Another good solution would be making the timer less accurate, as I've suggested for cache side channel attacks, which would make it virtually impossible to distinguish between row buffer reads and normal reads.
From a theoretical point of view, RBSC is a lesson in how apparently beneficial design decisions, such as using the row buffer as a cache, can quickly turn into a security issue. The row buffer is required in the DRAM scheme – its cache function probably isn't (I'm no DRAM expert), but it seems like a great idea performance-wise. As soon as we share resources across trust boundaries there will always be a potential for side channels. It also serves as a warning that fixing cache side channel attacks in the processor's microarchitecture requires that thought be given to RBSC attacks. That being said, RBSC attacks seem unlikely to be a real world threat – at least on the x86 architecture. They might, however, be quite interesting on other architectures.
Ending on a speculative note, CATs and more advanced attacks
This part was added long after I'd finished the above text, so consider it a last minute addition.
More advanced attacks
The first modification we can make to the attack above is to look not at what is in the row buffer, but at what isn't. If we prime the row buffer with an attacker memory location, checking whether the victim removed it gives us information about his activities. But this isn't much information, because we'll only learn whether the victim used the same bank or not. In essence, knowing that the victim used any one of 32k 8kb rows isn't too valuable.
But we can improve on this. The details are microarchitecture specific, and I'll stick to my wife's i3 Sandy Bridge dual channel architecture here. My guess is that other architectures will be different, but have enough similarities for the basics to apply. Sometimes the attacker shares a row with a victim without actually sharing memory, and we can utilize this to gain more information. The operating system allocates memory to processes on a per-page basis, and security attributes are per-page too. However, a row is shared across pages, and thus potentially across privileges, and we can use this to spy. Typically parts of 4 pages end up in one row buffer. The first part of that number is that a row typically contains 8kb while standard (small) pages are typically 4kb; this accounts for two pages. Further, we know the first 12 bits of a virtual address are identical to the corresponding physical address, and we know from Seaborn(2015) that bit 6 of the physical address is used to select which DIMM should be used. This is probably implemented by Intel so that memory requests can be handled in parallel by the two DIMMs. When the computer in question has two DIMMs (memory modules), each page is split across two DIMMs and thus two rows. Hence each row contains memory from 4 different pages. If the attacker is in a position to cause behavior in the victim, he can map this behavior to row activity by timing, and eventually use this information to spy on activity not caused by him. It is obviously a far more difficult attack to launch, and I've not actually tried to.
CAT and the future of cache attacks
In chapter 17.15 of the Intel software developer's manual, the "cache allocation technology" (CAT) of "future" (I think this means Broadwell and beyond, but I'm not sure) Xeon processors is vaguely described. Unfortunately I don't own or have access to a "future Xeon processor", and thus everything below remains pure speculation. CAT allows the supervisor or hypervisor to partition the cache and select which processes should have access to which partitions. It is built in to give hypervisors and supervisors a way to provide important processes with better performance by giving them better access to cache. Without partitions, processes compete for cache on equal terms; CAT is supposed to tip the balance in favour of those the OS thinks need it. The documentation I have states that only the L3 cache is currently supported. The way it works is undocumented (or I don't have/understand the documentation), but it is hinted that current implementations partition over ways, which sounds like an easy and sensible method. The current process is assigned a two-bit value in an MSR called a COS value. Based on this COS value, a so-called capacity bitmask is fetched from another MSR. I assume that for each bit set in this mask, the corresponding ways of each cache set are accessible to the current memory access operation. The idea is that the operating system sets up the bitmasks on boot (the default is that everything is shared) and for each process/thread/whatever sets the appropriate COS value. This could potentially be used to give 4 different users each a private L3 cache, depending on how it's actually implemented. This could be a nice, flexible (leaving the performance/security trade-off to the OS implementation) microarchitectural solution to cache side channel attacks on L3 caches – should it work.
I see two likely scenarios for the actual implementation of CAT. The first and simplest is a hard partition of the cache. What I mean by this is that memory placed in a cache partition only available to a COS=0 process will come from physical memory (or from COS=1's own partition) if a COS=1 process should try to access it. Further, a COS=1 process could not possibly evict or flush memory from COS=0's partition. In this scenario it's likely that all cross-COS attacks on the 3rd level cache are impossible when the bitmasks are set up appropriately. Obviously, even though the cache is hard partitioned and attackers can neither "prime" nor "load" across partition boundaries, instructions may still leak information about cache state and allow some attack possibilities.
The second scenario I call the "soft partition" scenario, and I find it more likely. In this scenario only the second part would hold true: any memory access, regardless of COS, would be served from any cache partition if available, but any eviction or flush decision would only be affected by cache events initiated with the appropriate COS access. It would allow any amount of "probing", but once a cache entry has been made by the victim, the attacker could no longer flush it on his own. Thus flushing or evicting by an attacker becomes a race condition instead of a sure thing. L3 cache attacks would lose a lot of power, but not generally become impossible. I find this scenario more likely for two reasons. The first is that no new cache coherency between partitions would have to be enforced. The second is that it allows synergies between COSes to be leveraged for better performance.
It is however easy to imagine other scenarios for how CAT could be implemented, and the implementation details matter very much. There is work to be done here.
It might also be valuable to look into chapter 17.14 (cache monitoring) for CSC mitigation and detection techniques in the future.
Certainly CAT will have a profound impact on RBSC too. The impact pushes RBSCs in two directions: the problem of bypassing the cache for another COS becomes impossible in the first scenario and a race condition in the second. However, CAT does not appear in any implementation to make RBSC infeasible, and this is likely to make it a more viable option for attackers.
Here I must acknowledge Daniel Gruss, who motivated me to keep on working on this blog post when it seemed like it'd be too much effort to finish. My thanks also go out to Jacob Soo, whose comments improved the text. All errors are mine.
Literature
Seaborn(2015): "How physical addresses map to rows and banks in DRAM";
Wikipedia: "Dynamic random-access memory"; https://en.wikipedia.org/wiki/Dynamic_random-access_memory#Operations_to_read_a_data_bit_from_a_DRAM_storage_cell
Intel(2015): "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B & 3C): System Programming Guide"
Daniel Gruss, Clémentine Maurice & Klaus Wagner(2015): “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/pdf/1511.04594v1.pdf