Saturday, November 21, 2015

Detecting stealth mode cache attacks: Flush+Flush

Update (22 Nov. 2015):

Daniel Gruss commented on Twitter that he was a bit skeptical about the performance of my detection scheme. I think it's fair to be skeptical on this point, so I've augmented the blog post with some information about the performance of the detection and a small comment on how the CPU could be changed to allow better mitigation. The new material is appended below.

Introduction

In my previous posts I described how cache side channel attacks work, why I consider them a serious threat, and spent a lot of time on mitigation and detection methods. Recently a new paper was published that avoids the detection methods I suggested by avoiding the signature behavior I looked for. Thus not only my methods, but also variations of them, fail against this new attack. In this blog post I'll briefly describe the new attack and propose a detection method for it. I'll touch on subjects which I explained in my previous cache side channel post and won't spend much time on the background of CSCs here, so if you're not up to date on cache side channel attacks you may wish to read that post before proceeding.

Developments since my last post

Since my last cache side channel attacks post, three important papers have been released. First was Chiappetta, Savas & Yilmaz [1] on using performance counters for detecting cache side channel attacks. Unlike what Nishat Herath and I wrote about in our Black Hat slides, Chiappetta, Savas & Yilmaz [1] coupled the performance counters to machine learning methods – specifically a neural network – and showed that you can detect attacks in real time with this methodology. It's an interesting upgrade to my work on the subject, and I suspect their method works well on high frequency attacks. My unpublished results from the work I did in connection with the Black Hat talk suggest that low frequency cache attacks in a noisy environment may not be detected well with this method, but I'll be happy to be shown otherwise. I'll return in part to this point later in this blog post. Nevertheless I'm really happy, because it is the first independent confirmation that the ideas Mr. Herath and I brought forth at Black Hat have bearing, and an article like that always carries more information than could be presented at a conference.
Then came a very interesting paper by Lipp et al. [2]: in it the authors break down the hurdles for using a shared cache on ARM devices and thus bring CSCs from Intel to other platforms. It is a development that should not be taken lightly, as the potential for information leakage is huge simply because the installed base is really large. In my last blog on CSCs I warned about performance counters being leaked to unprivileged processes, and the irony is that exactly this is what makes the ARMageddon attack possible in the first place.
Finally Gruss, Maurice & Wagner [3] struck again with the "Flush+Flush: A Stealthier Last-Level Cache Attack" paper. It is this paper I shall write about and respond to in this blog post.

The story

A while back Daniel Gruss contacted me and essentially teased me by asking whether I thought it'd be easy to detect a CSC that did not cause last level cache misses. My answer was "depends on the details", which is a nice way of hedging my bets. I thought about how Mr. Gruss would be going about it, and the answer was in my notes, specifically on my to-do list. While playing with CSCs I had figured it was likely that clflush would leak information through how long it takes to execute: actually flushing memory would likely take a different amount of time than just searching through the cache set and causing a miss. I had never actually done any work on it, but it allowed me to guess what Mr. Gruss was hinting at. In the end I ended up with a draft of the paper and sent back a few comments to Mr. Gruss & Clémentine Maurice. I'm really grateful for this gesture of trust on their part, and it's always a joy reading their papers – early access is pretty cool too. My early access wasn't just used to "review" their paper, but also to ponder how I could detect this new evil beast, and I've come up with a solution.

More data on performance counters and CSC


Gruss, Maurice & Wagner [3] spend part of their paper documenting how a number of performance counters respond to different kinds of attacks and benign use cases. After carefully analyzing the results they suggest a modification of Mr. Herath's and my method: using a fixed decision boundary on the ratio of cache hits to cache misses. This is indeed a good idea, and they show that the method gives fairly good results. Again I'm somewhat skeptical about the performance of this measure under noise for low frequency attacks. My argument goes that if the attacker (or another benign process) causes lots of cache hits on unrelated sets, the performance counters are likely to show a ratio that is considered OK and thus the attack goes undetected. The attack remains successful as long as the noise doesn't extend to the attacker's cache sets of interest. High frequency attacks are more difficult to "mask" in this way. Lowering the demands on the ratio will help, but may lead to false positives on other memory intensive applications. All that said, I count their results as evidence that performance counters can make a real impact on CSCs. My suggestions on how to deal with these issues can be found in the Black Hat slides and my previous blog post.
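My own experiments and [3] use raw performance counters from a Windows driver, but purely as an illustration of the ratio idea, a user mode sketch on Linux via perf_event_open could look like the following. The monitored PID, the sampling interval and the decision boundary are all assumptions of mine, and error checking is omitted.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative decision boundary on the miss/hit ratio; the real value
   must be calibrated per machine and workload. */
#define SUSPICIOUS_RATIO 0.4

static int open_counter(uint64_t config, pid_t pid)
{
    struct perf_event_attr attr = {0};
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(void)
{
    pid_t target = 1234;   /* hypothetical PID of the process to monitor */
    int refs   = open_counter(PERF_COUNT_HW_CACHE_REFERENCES, target);
    int misses = open_counter(PERF_COUNT_HW_CACHE_MISSES, target);
    ioctl(refs, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_ENABLE, 0);

    for (;;) {
        uint64_t r = 0, m = 0;
        sleep(1);                              /* one sample per second */
        read(refs, &r, sizeof(r));
        read(misses, &m, sizeof(m));
        uint64_t hits = (r > m) ? r - m : 0;   /* hits ~ references - misses */
        if (hits && (double)m / (double)hits > SUSPICIOUS_RATIO)
            printf("hit/miss ratio looks like a cache attack\n");
        ioctl(refs, PERF_EVENT_IOC_RESET, 0);
        ioctl(misses, PERF_EVENT_IOC_RESET, 0);
    }
}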

The new stealthier attack

The new, stealthier attack works by timing the clflush instruction, which, as one would expect a priori, is slower when it actually has to flush the cache. Obviously clflush takes an address as a parameter, and this essentially forces an attacker to have read access to the memory he wishes to spy on. This makes the attack in most respects similar to the flush+reload attack which I described in my blog on cache side channel attacks. The difference is that the attacker no longer needs to cause a significant amount of cache misses on his own, since the clflush instruction on its own is enough and the reload "phase" of the flush+reload attack disappears entirely. This not only makes it more difficult to pick up with performance counters, it also makes the attack much faster. There are other aspects of this attack which make it even scarier, but it's beyond the scope of this post to go into them here. The pseudo code for the attack looks like this:
while (true) {
    mfence
    starttime = rdtsc
    mfence
    clflush(target_address)
    mfence
    delta = rdtsc - starttime
    derive information from delta (cache miss or hit)
    optionally wait for victim here
}
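To make the timing loop concrete, here is a minimal C sketch of the same probe using compiler intrinsics (GCC/Clang headers assumed). The threshold is a placeholder of mine and would have to be calibrated per CPU – this is an illustration of the published technique [3], not code from the paper.

#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <x86intrin.h>   /* __rdtsc */

#define FLUSH_HIT_THRESHOLD 140   /* placeholder; calibrate per CPU */

/* Returns 1 if the line holding target was cached (the victim touched it),
   0 otherwise, judged purely from how long clflush takes. */
static int flush_and_time(const void *target)
{
    _mm_mfence();
    uint64_t start = __rdtsc();
    _mm_mfence();
    _mm_clflush(target);
    _mm_mfence();
    uint64_t delta = __rdtsc() - start;
    return delta > FLUSH_HIT_THRESHOLD;
}

A victim access to the monitored line between two such probes shows up as one slow flush, which is all the information the attacker needs.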

An important feature of the flush+flush attack is that the time difference between hit and miss is relatively small compared to the flush+reload attack. I shall come back to this later.

The new detection

My new detection bases itself on this code. The idea is that the presence of an RDTSC instruction closely followed by a clflush is likely to be a CSC. The presence of a second RDTSC instruction only strengthens the suspicion, even more so if the pattern repeats.
Obviously I cannot just search the entire memory for code that looks like this, so I use a variation of the method Mr. Herath & I suggested for finding low frequency attacks. Below is a sketch of the most primitive variation of this detection mechanism.
1. First I set up a PMI interrupt handler to handle the interrupts that appear when a performance counter overflows.
2. To get notified of the use of an RDTSC instruction in user mode, I set the CR4.TSD flag, which makes the RDTSC instruction (and its sibling, the RDTSCP instruction) privileged and thus causes access violations when it is executed outside ring 0.
3. If my exception handler was called for reasons other than RDTSC, I chain the previous handler. Otherwise I set up performance counting to cause interrupts on "instructions retired", and the appropriate counter is set to -1 to cause an interrupt on the next instruction. Hereafter I emulate the RDTSC instruction by executing it and forwarding the right values to the user mode client. I do this step for compatibility reasons, so that legitimate use of RDTSC does not break. Thereafter the interrupt handler exits without chaining further handlers.
4. Once I get the PMI interrupt, I check whether the instruction following the R/EIP in the client is a clflush (see the sketch after this list). If it is, I've detected a CSC in progress; if not, I set the counter register to -1 to cause an interrupt on the next instruction. If step 4 has been called a sufficient number of times, I stop the performance counters, consider the RDTSC event to be unrelated to a CSC attack and exit gracefully from the interrupt handler.
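To make step 4 concrete, here is a minimal sketch of the check my PMI handler performs. The clflush matcher only needs to recognize the 0F AE /7 encoding, optionally behind a few legacy prefixes; interrupted_ip(), rearm_counter() and report_flush_flush_attack() are hypothetical stand-ins for the driver plumbing, and the attempt limit is an arbitrary illustration.

#include <stdint.h>

#define MAX_PMI_ATTEMPTS 16   /* give up after this many single-steps */

extern const uint8_t *interrupted_ip(void);     /* hypothetical: R/EIP from the trap frame */
extern void rearm_counter(int64_t value);       /* hypothetical: reload the PMC */
extern void report_flush_flush_attack(void);    /* hypothetical */

/* Returns nonzero if the bytes at ip decode to clflush (0F AE /7),
   allowing a few legacy/segment prefixes in front of it. */
static int is_clflush(const uint8_t *ip)
{
    for (int i = 0; i < 4; i++) {
        uint8_t b = *ip;
        if (b == 0x26 || b == 0x2E || b == 0x36 || b == 0x3E ||
            b == 0x64 || b == 0x65 || b == 0x66 || b == 0x67)
            ip++;                        /* skip segment/size prefixes */
        else
            break;
    }
    if (ip[0] != 0x0F || ip[1] != 0xAE)
        return 0;
    uint8_t modrm = ip[2];
    return ((modrm >> 3) & 7) == 7       /* /7 selects clflush */
        && (modrm >> 6) != 3;            /* must be a memory operand */
}

/* Hypothetical PMI handler body. */
void pmi_handler(void)
{
    static int attempts;

    if (is_clflush(interrupted_ip())) {
        report_flush_flush_attack();     /* rdtsc shortly followed by clflush */
        attempts = 0;
    } else if (++attempts < MAX_PMI_ATTEMPTS) {
        rearm_counter(-1);               /* interrupt again on the next instruction */
    } else {
        attempts = 0;                    /* looks benign; stop sampling */
    }
}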

The first observation is that we cannot count on the attacker not obfuscating the attack. In the code listed for the attack he could add dummy instructions, such as nops or even (conditional) jmps, to make a search very difficult. My first solution to this problem was to use an emulator to search for the clflush, but that seemed like a terrible solution: slow, prone to error, etc. I quickly realized that I could use the trace flag to let the processor do the work, but I scrapped that idea too. The reason was that I'd need to install another interrupt handler, and that the trace flag can be deleted/detected from user mode, which again would require at least a partial emulator for it to work. Eventually I figured that performance counters with interrupts provided a neat solution for finding the needle in the haystack, and it would also line up well with detection of the other types of CSCs in a real world implementation.

Now the prerequisite for all these methods to work is of course that the attacker doesn't use tons of obfuscation. He is, however, not in a position to do so. Since the clflush timing differences the attacker relies on are relatively small, and modern processors don't execute any instruction in exact constant time, the attacker is limited in how much obfuscation he can use, and very much in what kind of obfuscation he can use. From Gruss, Maurice & Wagner [3] I gathered that clflush causes cache hit events and that dummy instructions which access memory have particularly high variance, making them unsuitable for the attacker to use as obfuscation. Thus it seemed likely that I'd get to my clflush in just one interrupt and that I could stop looking after just a few. To my surprise I didn't get any cache hit events on clflush on a Kentsfield processor. This led me to conclude that there is most likely a difference in the implementation of clflush/the cache manager across processors with respect to when the event is counted. Being conservative, I thus went with the "instructions retired" event, with the downside that I need more interrupts and thus a higher performance penalty to rule out false negatives. The smart ones among you will notice that Kentsfield doesn't have a level 3 cache, and my guess is that this is the reason the event wasn't counted. I therefore suspect that the cache hit event would do the trick in a real world implementation.
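For reference, arming the counter from step 3 boils down to a couple of MSR writes. The sketch below assumes architectural performance monitoring (instructions retired = event 0xC0, programmed via IA32_PERFEVTSEL0/IA32_PMC0), ring 0 code, and a local APIC performance interrupt vector that is already set up; wrmsr() is a hypothetical helper around the WRMSR instruction.

#include <stdint.h>

#define IA32_PERFEVTSEL0  0x186
#define IA32_PMC0         0x0C1

#define EVT_INST_RETIRED  0xC0        /* architectural "instructions retired" event */
#define EVTSEL_USR        (1u << 16)  /* count in user mode only */
#define EVTSEL_INT        (1u << 20)  /* raise a PMI on overflow */
#define EVTSEL_EN         (1u << 22)

extern void wrmsr(uint32_t msr, uint64_t value);   /* hypothetical ring-0 helper */

/* Arm PMC0 so it overflows (and fires a PMI) after 'count' retired
   user-mode instructions; the hardware truncates the value to the
   counter width. */
static void arm_retired_counter(uint64_t count)
{
    wrmsr(IA32_PERFEVTSEL0, 0);                    /* stop while reprogramming */
    wrmsr(IA32_PMC0, (uint64_t)-(int64_t)count);   /* e.g. -1 for "next instruction" */
    wrmsr(IA32_PERFEVTSEL0,
          EVT_INST_RETIRED | EVTSEL_USR | EVTSEL_INT | EVTSEL_EN);
}

Writing -1 fires the PMI on the very next retired instruction; writing -N instead samples only every Nth instruction, which is the moderation trick discussed further down.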

A note on my detection PoC

I developed my PoC on 32-bit Windows XP on a 4-core Kentsfield system. The reason for picking this system was twofold. First and foremost, it was the only one available to me: I don't own a computer, and my wife would not find it amusing if my driver crashes ended up damaging the file system on hers. Using 32-bit Windows XP also has a significant advantage – there is no PatchGuard, which gives me easy access to the hardware directly; a real world implementation needs to take this into account. Now, I did not actually hook the access violation interrupt handler in the IDT. I opted for the easier solution of using vectored exception handling in user mode. My reasoning was that my method is based on hardware features of the Intel processor, so I could use the existing Windows handler without contaminating my results. From a development perspective this allows me to work in ring 3 (which is a really great thing when you're testing a driver using remote desktop over VPN) and it gives me a lot of the interrupt handling for free – like figuring out why the access violation was triggered in the first place. Despite using a Kentsfield processor, where the attack cannot actually take place, the mechanism of my detection can be tested reliably, and thus I'm confident that the general approach will extend to other architectures as well.

Performance


The first comment I need to make is about the word performance. In this context it can mean two things: first, how well does the detection work in terms of false positives and false negatives; second, what is the speed penalty for running the detection? I shall try to answer both separately.

False positives and false negatives


The detection hinges on two things: that an attacker uses RDTSC(P) twice, and a clflush. Now if the attacker is able to build a suitable high precision timer and thus avoid using RDTSC, my detection fails. In my cache side channel blog post I argued that a thread timer may be a solution to that problem. With flush+flush the margin for success is smaller, because the timing difference used to distinguish hit from miss is much smaller than in the traditional CSCs. I don't think these timers work well enough, but I could be wrong. The second option for the attacker to break my detection is the dummy instructions I discussed above. Third, the clflush instruction itself can be obfuscated using prefixes such as ds, es, cs, ss, or even combinations of these, since Intel CPUs tend to ignore the illegality of double address prefixes. Nevertheless this should not pose much of a challenge for the defender. If the defender uses the "cache hit" performance counter to find the clflush instruction, doing so comes with a slim chance of false positives, because we'll no longer find the beginning of the instruction but the end, and the instruction triggering the cache hit could be one that merely looks like a clflush – that is, the 0x0F, 0xAE bytes could be present in the ModRM/SIB/displacement of, say, a mov instruction. The defender could deal with this by using the "instructions retired" method for the next round of the attack. Finally there is the possibility that a benign program would use the rdtsc/clflush combination for some reason, and that would definitely turn up a false positive. But the quick summary is that the attacker is not left with much wiggle room to avoid detection, and false positives are unlikely.

The speed penalty of this detection scheme

The normal performance counter method for CSC detection is virtually free. Mr. Herath and I ran tests on the performance loss and found a drop in the range of 0.3% (statistically insignificant, too). What one considers an acceptable cost is of course highly subjective, so I'll just give some numbers here. For this flush+flush detection there is, however, a performance loss. It has two components: the handler for the PMI interrupt and the access violation on the RDTSC instruction. The overflow interrupt on performance counting costs in and of itself around 530 ns on the Kentsfield processor I used to write the PoC for the detection. Mr. Herath measured around 250 ns on his more modern computer as part of our presentation for Black Hat. I suspect the access violation interrupt to be in the same order of magnitude. The access violation handler isn't all that simple, because we need to figure out what caused the exception before we can pass it on or emulate the RDTSC instruction. I do not have any measurements for that, but I suspect it'll be less than a microsecond of total overhead added to the RDTSC instruction (which of course is a massive amount of added inaccuracy too). The handler for the PMI is relatively fast, as we only need to search for the clflush instruction. For reference, human perception starts around 20 milliseconds.

With ballpark figures for the duration of the components of a detection attempt, we now need to turn our attention to incidence figures. Unfortunately I don't have any. The first thing to note is that the RDTSC instruction is rarely used. Visual Studio 2013 and the search indexer on WinXP both crashed after I set the CR4.TSD flag without a handler, but other than that the PC continued to run stable. So we wouldn't be running one detection attempt after another. The PMI only comes into play once we've found an RDTSC. If we use the "instructions retired" variant, we'll probably spend a significant amount of time before we can dismiss an incidence of rdtsc as benign – depending on just how many dummy instructions we think the attacker could add without adding too much noise. However, if we can use the "cache hit" variation, we'll probably only need a few interrupts to confidently say it's not an attack – again depending on how many cache accesses an attacker could actually do without adding too much noise.
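For completeness, making RDTSC privileged is a single bit in CR4 (bit 2, TSD). A minimal sketch, assuming ring 0 and that it is executed on every logical core (GCC-style inline assembly):

#include <stdint.h>

#define CR4_TSD (1ul << 2)   /* Time Stamp Disable: RDTSC(P) faults outside ring 0 */

/* Must run in ring 0, on every logical core. */
static void enable_rdtsc_trapping(void)
{
    unsigned long cr4;
    __asm__ __volatile__("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= CR4_TSD;
    __asm__ __volatile__("mov %0, %%cr4" :: "r"(cr4) : "memory");
}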

The last thing we should consider is that the general method can be applied with deliberation. We can set the CR4.TSD flag only for suspect applications, or white-list known benign software. We can even white-list dynamically – say we have another thread that statically analyzes the code around each RDTSC, and if it's sure to be benign we don't have to trigger PMIs the next time we encounter that same instance of RDTSC. We could also moderate our use of the PMI. Say we trigger the PMI only on every other instruction – this gives us a 50% reduction in the number of interrupts and a 50% chance of detecting an attacker per round of the attack (per iteration of the "while(true)" loop in the pseudo code), which means our chance of detection is extremely high in real use cases.

To summarize, I think the detection scheme could be implemented at an acceptable speed penalty.

What could intel do?


It has often been suggested to make clflush privileged. Adding a CR4.TSD-type flag for clflush would be a more flexible solution, because it would grant the operating system the power to allow and disallow clflush where it sees fit.


Acknowledgement

I need to acknowledge Mr. Herath for his contributions to our Black Hat slides. The conversations and work I did with him for Black Hat have been instrumental in making this post work out. Thanks also go to Daniel Gruss for enlightening conversations and to Clémentine Maurice for trusting me with a draft of their excellent paper. Any errors or misunderstandings are entirely mine!


Literature

[1] Marco Chiappetta, Erkay Savas & Cemal Yilmaz (2015): "Real time detection of cache-based side-channel attacks using Hardware Performance Counters", http://eprint.iacr.org/2015/1034.pdf
[2] Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Stefan Mangard (2015): "ARMageddon: Last-Level Cache Attacks on Mobile Devices", http://arxiv.org/abs/1511.04897
[3] Daniel Gruss, Clémentine Maurice & Klaus Wagner (2015): "Flush+Flush: A Stealthier Last-Level Cache Attack", http://arxiv.org/pdf/1511.04594v1.pdf

Saturday, November 14, 2015

Reverse engineering and the security of software

Erratum: I apologize if somebody is left with the impression that the "pond model" is of my invention. It is not. I cannot trace its history, but here is a reference for it: http://leansoftwareengineering.com/2007/06/05/the-capture-recapture-code-inspection/

Introduction

Large portions of this blog post were written in April (or thereabouts). I've only updated it in some places and seriously reworked a comment on bug bounties. It was part of a larger blog post that I wanted to write and never finished – it just kept on growing. That blog post would have been called "Perverse infosec incentives" and was about the totally messed up incentive structure present in the infosec sector today. Any microeconomist would have a field day applying standard microeconomic theory – especially the industrial organization subsection of microeconomics – to analyze just how badly infosec economics works. The other sections that I've not finished, and probably never will, have the following headers – I'm sure most of you can fill in the blanks from reading just the headers.
Why exploits are weapons
Why #yourboyserge proves that anti-virus works and why I am free riding
Why anti-virus companies might be better off showing false positives
Why does publicly available information about offensive security dominate defensive
Why would it be a good idea for an infosec company to blog
Peak malware

You may ask why I'm posting this now and haven't done so earlier. Well – one reason is that I was beaten to bringing up the subject. I believe Haroon Meer was responsible for that, but I'm not sure. Anyway, I'm glad that it was brought up, whoever did it. The other reason is that I had actually halfway decided against trespassing on infosec politics terrain: I wanted to keep my blog about technical stuff. Eventually there was a bit of mission creep with the "Nostalgia" posts, and I think I'll just let it creep some more. I blog for fun, why should I care about mission creep? Finally came a tweet from Marion Marschalek that read: "Infosecs first problem in policy work is its heterogeneous community that doesnt get its plenty opinions together. Free after @halvarflake". All this made me think that even though these things have been said before, and almost certainly better than I can say them, maybe if enough people join the choir somebody will listen.

Reverse engineering and the security of software

Reverse engineering software is often disallowed in license agreements. Sometimes software manufacturers even argue that reverse engineering should be outlawed altogether to protect their intellectual property. I shall argue here, using classical arguments from the industrial organization branch of economics, that reverse engineering for information security is in fact important for all information security and for software quality in general. This blog post is not meant to be rigorous, but to introduce some useful economic concepts. Its release was motivated by a discussion on reverse engineering and the relevance of the Akerlof (1970) article for that discussion. Though I agree that this article is indeed relevant, I shall argue that while Akerlof's problem has simple solutions, those solutions do not work well for the software security problem – and that reverse engineering does provide a solution.

You would obviously be right if you immediately thought that economics doesn't say anything about security. But it does say something about quality, and a product being secure would be considered a mark of quality by most customers – and from the point of view of society as a whole it certainly should be. Thus when I talk about software quality in this text, I'm talking about how secure the software is. There are other aspects of software quality which I'll ignore here, even though the same argument extends to other measures of quality. I'll use the quality lingo to stay in tune with classic economics.

Akerlof (1970) made an economic model for the market for used cars. His model has two kinds of used cars: cherries and lemons. Consumers are assumed to be unable to tell whether a car is in great shape (a cherry) or in bad shape (a lemon – after your facial expression once you find out you bought one). The car dealer or previous owner, however, would know. Say the value of a lemon is $600 and the value of a cherry $2000. With every other car being a lemon, the average value would be $1300, and a rational, risk-neutral consumer would thus be willing to pay this for a car. Obviously no dealer would sell a cherry for this price, and the market would break down.

Akerlof's example illustrates what economics calls imperfect information. The parallel with software security is that normal people are perfectly incapable of detecting whether software is easily exploitable or not, in the same way people like me couldn't tell a cherry from a lemon. But this is where the parallels start to break down. The solution is easy in the used car market, because a dealer warranty goes a long way toward realigning incentives – nobody wants to give a warranty on a lemon. Vendors could also start selling only cherries to build reputation. No such simple solution exists for software.

Economists sometimes distinguish between three kinds of products in relation to quality and imperfect information. The first is the "search good": I don't know which laptop I'd like to buy, but if I research it long enough I'll figure it out. The second kind is the "experience good": you find out the quality of the product after you buy it – drive a car for a couple of years and you know whether it is a lemon. The last kind is the "credence good": chances are that you'll never know if your toothpaste contains the right amount of fluoride for your teeth.

I argue that software is a credence good: the software consumer is unlikely to ever know whether the software was of good quality and thus secure. The thing with software exploits is that they often leave little or no trace, so even if you do get hacked you'd often not know how it happened – even though you may very well figure out that you were hacked. This gives software vendors a perverse incentive not to invest in software security: consumers cannot identify it in any way, and it's likely to be costly, making software vendors who invest in security less competitive. We would predict only a partial market breakdown for software – everybody knows that there is no cherry software, because there is no reason to produce it. Because of the credence good nature of software – the fact that consumers never know the quality – there is no easy way out of the problem through warranties or liability. Nor can vendors build reputation by only selling cherries. In this oversimplified model we've institutionalized software insecurity.

We need to address the problem of software insecurity at its root: the imperfect information. Restaurants are reviewed; government agencies and scientists spend lots of time figuring out how much fluoride should be in toothpaste and sometimes even how much fluoride is in a given toothpaste; cars undergo technical tests, etc. Why should this be different for software? Even if only a few customers are informed, it will increase the incentives to make quality products, to the benefit of all. By being more demanding, informed customers drive up the quality of the products. Or in economic terms: "The informed customer exert a positive externality on the uninformed ones" – Tirole (1994).

The litmus test for the security of a software product is how easy it is to find insecurities in it, and the way this is done is by reverse engineering parts of the software looking for security relevant errors. Even when the results of a reverse engineer's work are not understood by the broad public, the presence of informed customers improves quality for everybody. Sharing the information only helps the process.

We should not assume that hackers with ulterior motives will abide by license terms or even laws. And while openly sharing information on the inadequate quality of software may give rise to opportunities for hackers with less noble motives, in the long run institutionalized insecurity will prove worse than occasional glitches.

Now, the model I proposed here is obviously oversimplified. Sometimes hacks can be attributed to bugs in specific software. Mikko Hypponen (2011) argues that a major security focus by Microsoft was a direct result of Code Red and its clones causing havoc on Windows systems. To me that is evidence of the mechanism actually working: Microsoft gathered that consumers would identify their lack of quality. Also, people tend not to be of the Homo Economicus species, but rather Homo Sapiens with much more diverse behavior. I do however think that my model captures an important part of reality. Fortunately some tech companies see it similarly, and companies like Microsoft, Google and F-Secure either pay for security bugs found or keep a hall of fame of people who have reported security bugs found by reverse engineering.

Reverse engineering for security purposes should be embraced by anybody who favors information security.

The not so depressing effects of bug bounties

Recently Jacob Torrey wrote a blog post on the "Depressing Effect of Bug Bounties" (Torrey, 2015). Mr. Torrey argues that the work of reverse engineers finding bugs in response to bug bounties gives companies incentives to lower their own QA efforts. He argues: "By artificially deflating the cost of finding and fixing bugs in operation/shipped product through monopolistic means, bug bounties remove the economic incentive to develop better software by integrating security-aware architects into the SDLC. Bug bounties use their monopoly on setting prices". I'll grant Mr. Torrey that such effects are bound to happen, but I still come out in favor of bug bounties.

My first quarrel with Mr. Torrey's argument is that bug bounties are not that monopolistic. Anybody wanting to claim prize money from bug bounties is likely to pick the bounty (and thus the product to reverse) where she is most likely to get the most money for the least effort. Indeed, other alternatives exist, such as selling bugs to companies that trade bugs for weaponization. There is little to support the claim that the market for "bug bounty hunters" has an inelastic supply that allows software companies to dictate market conditions. What I mean is that there is little reason to think that bug bounty hunters cannot pick and choose as they wish between different bounties.

My second contention is that bug bounties aren't new. Microsoft has had a "hall of fame" for security researchers reporting bugs since 2007 (Microsoft, xxxx). Now, this isn't a monetary reward – but people in the infosec community do collect this kind of recognition to find a well paying employer. Thus it's economically equivalent to a bug bounty – it's value for service.

However, bug bounty prices at the moment are ridiculous. F-Secure offers a mere EUR 100 – EUR 15,000 for a bug report, and they alone decide what the actual payday is going to look like (F-Secure, 2015). EUR 100 is about two hours' worth of bricklaying in Germany; you can't get a decent engineer to write a report for that amount of money, let alone research anything. It certainly is a big difference from the prices obtained on the weaponization market, but at least the white hat side of things offers up competition, and that's a good thing. I believe I read somewhere that $45,000 was paid by Hacking Team for a Flash exploit. I think these prices reflect that bug bounties are not yet taken seriously by some companies, rather than monopolistic competition.

Finally, bug bounties give companies the possibility to use them as a signal of quality. Offering a million dollars for a bug says a lot about a company's confidence in its software's security. It goes a long way toward solving the imperfect information problem and is in many ways equivalent to the warranty solution in the market for cars. That reverse engineering is allowed is always a prerequisite for bug bounties to work, though.

The fact is that bug bounties incentivize bug hunting in a white hat manner, and the depressing effect on in-house QA that Mr. Torrey argues for is likely to be small – especially if bug bounties become widespread and a fair evaluation system can be applied. Furthermore, most bug bounties only suppress information on bugs for a period of time, thus allowing information on product quality to arrive at the market. Is EUR 15,000 a strong signal of confidence in F-Secure's security? I doubt it.

Despite my "disagreement" with Mr. Torrey's article, I find it refreshing to see such blog posts about infosec. It's good work and it deserves a read.

The pond argument

It is sometimes argued that hunting for exploitable bugs is like catching frogs in a pond. If there are many frogs in the pond, removing one has a negligible effect on the frog population and thus on the effort required to catch another. If there are only three frogs in the pond, you'll eradicate a third of the population, making it much harder for other (more dubious) people to catch a frog. It's a wonderful argument and I find it has a lot of bearing on exploits. I think there are lots of exploits in the pond, and thus finding a 0-day doesn't contribute significantly to security in general.

However, the pond argument ignores the imperfect information argument for reverse engineering above. Having bugs found too often implies insecure software; Flash and font bugs have gained that reputation. In fact the many security bugs in Flash have become a real problem in the marketplace for Adobe: the number of bugs reported in Flash is often used as an argument to turn it off and replace it with something else. That is an argument I fully endorse.

However, my guesstimate remains that software authors produce more security relevant bugs than security researchers can find. Making errors is human, and programming is, for now, a very human task. I shall not deny that I'm responsible for my share of bugs. The attack surface (pond) is expanding faster than security researchers can keep up. The infosec community would be wise to look for methods of avoiding bugs in the first place (google "langsec" to see what I think is the most promising path), or at least for solutions that offer significant hurdles to utilizing bugs for security breaches.

The real problem

While bug bounties are, in my view, certainly a step forward, the core problem behind the insecure state of software development is liability – or rather the lack of it. Companies that get hacked do not carry the cost of it. Sony spent $15 million on "investigation and remediation costs" according to their Q3 2014 financial statement. In essence Sony gave everybody whose data was dropped a credit monitoring subscription, and that's it. The trouble for those whose private data actually gets abused is that they are never reimbursed in any meaningful way. More damaging information leaks, such as those that happened with the Ashley Madison hack, significantly harm those involved, but there is no compensation, and thus only a fraction of the societal cost is billed to the hacked company. In economic terms this is called moral hazard, because the decision on how much to spend on security is divorced from who carries the cost in the event of a hack. The problem of moral hazard in infosec is twofold: first, companies have little incentive to invest in security, and this lack of incentive is bound to be passed down the food chain to software developers. Second, companies have every incentive to store any and all information, making hacking a much more prosperous venture. I based this portion loosely on a wonderful article in "The Conversation". You can find it here:
http://theconversation.com/why-companies-have-little-incentive-to-invest-in-cybersecurity-37570


Tirole (1994): "The Theory of Industrial Organization".
Akerlof (1970): "The Market for Lemons: Quality Uncertainty and the Market Mechanism"; Quarterly Journal of Economics 84.
Hypponen, Mikko(xxxx): “The history and evolution of Computer Viruses”. DefCon 19. https://www.youtube.com/watch?v=Xr0ESMH1hwY
Torrey, Jacob(2015): “Depressing Effect of Bug Bounties”; http://blog.jacobtorrey.com/the-depressing-effect-of-bug-bounties
Microsoft(xxxx): “Security Researcher Acknowledgments” https://technet.microsoft.com/en-us/security/cc308575

F-Secure(2015): “VULNERABILITY REWARD PROGRAM” : https://www.f-secure.com/en/web/labs_global/vulnerability-reward-program

Thursday, November 5, 2015

x86 assembler coding nostalgia

Foreword

This blog post was actually finished a long time ago. I've not posted it before because I'm anything but happy with how it turned out. I hope you enjoy it anyway. In case you actually read this: I added an erratum to the cache side channel post. A friend of mine in the infosec community insists on calling me old, and he is right too – you are old when you write nostalgia blog posts.


Introduction

The idea for this post came shortly after I'd posted the last nostalgia post, which remains one of the most popular on my blog. The trigger was a question by MalwareTech on Twitter. I don't recall the exact quote, but it was along these lines: "Is there a way to figure out if an instruction is privileged from its encoding?". My first thought was that it's a silly question, but thinking about it made me reconsider. The question has lots of merit and the answer is silly: any reasonable designer would have grouped privileged instructions together, yet x86 doesn't. The reason is that x86 assembler wasn't designed, it evolved. It's layers upon layers of added designs, and I've had the mixed pleasure of following much of this evolution first hand. This post, however, is not the story of x86 processors – other people could tell that story better than I could, I'm sure. This post is about me and x86 processors. I'll try to focus on the tech stuff, because I assume you're only interested in me to the extent that my story reflects yours. I hope you enjoy this trip down memory lane.

Setting the stage

Somewhere in time Intel made a new processor called the 8086. This CPU would've been long forgotten by now if IBM had not made an open computer design around it. As a consequence, companies like Nixdorf of Germany, Olivetti of Italy, Commodore (and much later Apple) and many others produced their own computers based on this design, and the 8086 and its successors became the de facto business PC and remain so today. It had an instruction set inherited from older Intel processors like the 8085, to make porting software easier, and a set of 16-bit registers that by now sound awfully familiar: AX, BX, CX, DX, SP, BP, DI, SI, CS, DS, ES and SS. The first four of these could each be accessed as two 8-bit registers as well.

In the early 1980's – I don't know exactly when – my father bought his first x86 PC. It was a 4.77 MHz 8088 IBM portable (if you can call anything weighing 14 kilos portable, not to mention that it could not run unplugged). It was soon sold again, because the display did not live up to the standard our old Commodore already had. Consequently it was replaced with a Commodore PC 10 with a NEC V20 processor. It was instruction compatible with the 8088, but ran at 8 MHz and, if I recall correctly, had a better memory bus, which made it a pretty solid computer for the day. At first I didn't care much for the monster. You had to boot the OS from a floppy disc (rather than from a ROM), you had to enter the time on each boot too, as no battery was on board to keep the clock running, the games sucked and I had nobody to swap games with. It took a while for me to get interested in that computer. My father, being a geek, relatively quickly acquired a 10 MB hard drive and a multifunction card which cured the having-to-enter-the-time problem and the booting from floppy disks. However, what made the difference to me was a word processor. Once I got used to it, I found that the ease with which I could do my homework would save me time. Having a spell checker in the mid 80's and a search function to approximate commas brought down the time spent on my Danish homework significantly for the same grades (me being a freak, I actually timed it). Another cool thing I just have to mention is that my best friend bought a used IBM PC, and today it defies belief, but it actually came with a diagram of all the electronics on the motherboard, so you could repair the beast with a soldering iron should something break.

Hacking did exist in those days, but not for kids like me. Mischief obviously existed, and that usually meant borrowing the library's computer (an Olivetti) that booted into a menu from which you could select a handful of innocuous programs, and which required a password to get back into DOS to do maintenance. A DOS prompt was root, since the early processors in the family – 8086, 8088, V20, 80186, 80286, etc. – came completely without security features. These processors had no memory protection, any memory was R/W/X and the entire address space was accessible from anywhere. The in/out instructions always worked, the interrupt table was placed at a fixed location in memory, etc. Consequently, if you got into DOS you owned the computer. As a direct consequence of the processors' inability (insufficient memory, too slow, etc.) to do meaningful multitasking, many programs, such as the dominant word processor WordPerfect, allowed you to open a DOS prompt in the foreground to format floppy disks, look for files, etc., and then close the DOS prompt to continue working on your document. I used that countless times to get that black hat feeling of power, back when I was 12 or younger. I don't recall actually doing anything with my power, though. But I wanted to crack games and have a cool nickname in intros, and for that I had to learn assembly.

My first 0x86 assembly program

My first assembly program would’ve looked much like this:
.MODEL small
.STACK 100h
.DATA
HelloMessage DB 'Hello, world',13,10,'$'   ; $-terminated string for DOS function 09h
.CODE
.STARTUP
mov ax,@data                ; point DS at the data segment
mov ds,ax
mov ah,9                    ; DOS: write $-terminated string to the console
mov dx,OFFSET HelloMessage
int 21h
mov ah,4ch                  ; DOS: exit process
int 21h
END

When linked, you'd have a small hello world executable. This code obviously bears only a remote similarity to any assembler you'd write for Windows or another modern operating system today. Back then it was just 0x86 assembler; with the arrival of the 80386 processor it became "real mode" as opposed to "protected mode". I'll get back to this later. The two things that seem particularly weird in the above assembler are writing to DS (which today holds a selector for a descriptor, but back then was a plain segment register – there was nothing to "describe"; it simply said which segment of memory to read from) and the use of interrupts.
To understand why the segment registers had to be written, the first thing we need to understand is that the early x86 processors came without paging and were 16-bit. Without changing the segment register we could only access 64 KB – not very much even by the standards of the time. Hence the changing of segment registers. The physical address could be calculated as (segment register << 4) + offset, so 0000:0211 would be the same memory as 0001:0201. The upside of this scheme was that when loading an executable or allocating memory you'd only lose a maximum of 15 bytes to alignment. This seems like a reasonable compromise considering the address space was 20 bits: 1 MB, of which only 640 KB was at the user's disposal; the last 384 KB belonged to the BIOS (well, sort of – eventually things like himem.sys partially recaptured memory from this region). I like to think that history repeats itself: the scheme is a bit like AMD64, which is 64-bit addressing in a 48-bit address space, and the reserved 384 KB can in a weird way be thought of as an early version of SMM memory.
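As a small illustration (mine, not period documentation), the translation and the aliasing example above can be written as:

#include <stdint.h>
#include <assert.h>

/* Real-mode address translation: 16-byte-aligned, overlapping segments
   in a 20-bit physical address space. */
static uint32_t real_mode_physical(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFF;   /* wrap at 1 MB (no A20) */
}

int main(void)
{
    /* 0000:0211 and 0001:0201 name the same byte of physical memory. */
    assert(real_mode_physical(0x0000, 0x0211) == real_mode_physical(0x0001, 0x0201));
    return 0;
}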
Interrupts in those days were not only for hardware and exception servicing as they are today; they were also the API. Though many interrupts were used (0x19 = reboot, 0x20 = exit process), interrupts 0x21 and 0x13 were the ones you had to know. 0x21 was the DOS services – which in today's terms is what is exported from kernel32.dll in Windows. The calling convention was all register based (funny how history repeats itself: after more than a decade of stack based parameters, it's typical to use many more registers for calling conventions in x64 – and actually, until WinXP, int 0x2e was used internally to forward calls from user mode to kernel mode, before being replaced with the newer sysenter/syscall instructions, but programmers rarely see this nowadays). I never saw any official documentation, and I doubt most people did – remember, without internet to download stuff and no Intel or Microsoft websites, documentation was tricky. Instead, unofficial documentation, especially in the form of interrupt lists, was floating around. The most important was Ralf Brown's interrupt list. For interrupt 0x21 the AH register selected the function: 0x09 was essentially Writeln() (that is, write a $-terminated string to the console) and 0x4C was ExitProcess(). Interrupt 0x13 was very important to system programmers, because this was where the BIOS exported its functionality. If you wanted to read sectors from disk, for example, the easiest way was to use this interrupt, which was serviced by the BIOS. All boot loaders at the time were full of int 0x13 – and copy protections too…
Another funny thing about the early x86 was that lots of stuff was mapped to fixed addresses in memory. At 0000:0000 was the start of the interrupt table, which was just an array of far pointers. An interesting side effect of this was that any program could globally hook all operating system APIs: an offer that virus authors would not turn down. Even weirder was the address B800:0000 – it was the console. mov ax,0b800h; mov ds,ax; mov byte ptr [0],'A' would print an "A" in the top left corner of the console; the graphics adapter read it directly from there, and consequently you could also read what was on the console from this address. Graphics were rare, so text mode was where computers were actually used in the early x86 days. EGA graphics had a fixed address too – I think it was A000:0000 – but I really do forget these things.
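A tiny DOS-era C illustration of the same trick, assuming a 16-bit compiler in the Turbo/Borland mold that provides far pointers and MK_FP in dos.h:

#include <dos.h>   /* MK_FP: build a far pointer from segment and offset */

int main(void)
{
    /* Color text mode lives at segment B800h; each cell is a character
       byte followed by an attribute byte, 80x25 cells per screen. */
    unsigned char far *screen = (unsigned char far *)MK_FP(0xB800, 0);

    screen[0] = 'A';      /* character in the top left corner */
    screen[1] = 0x07;     /* attribute: light grey on black */
    return 0;
}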

Writing my first demo

A demo was the "proof" of coding skill in 1994. Doing nice motion graphics required pushing the processor pretty far, and writing a good demo in a high level language usually didn't turn out well. I wrote my first and last demo in real mode assembler. It was a sine scroller – which essentially means that a text scrolls across the screen along a sine curve. To realize this, the first thing I had to do was define a font. Other than the standard ASCII, neither the operating system nor the hardware offered anything. This was done in a graphics program, and then the output was converted into a table for each letter by hand. With only 256 colors it was a byte table. This was a lot of work, even though I drew the font only for the letters I needed. The second thing I needed was a sine table. Today you'd just calculate one as you go; not so back then. The first problem was that the early x86's did not come with floating point instructions. To get them you'd actually have to buy a so-called co-processor (typically called 8087, 80387 or so) which you could plug into your motherboard. The second reason was that a table lookup was far more efficient. I generated my sine table in Pascal, which had a library for calculating sine with integer instructions, and then added it to my demo. Drawing the actual graphics on screen was a challenge of its own. The screen refresh was pretty slow, and you'd see artifacts if you didn't synchronize with it. Thus updating the screen was a loop of waiting for the refresh to finish and then a complex memory move operation to the graphics card memory (a fixed address, as seen above). Waiting was achieved by using the "in" instruction directly (no clue what port it was) until a certain bit was cleared (or so I recall it). My demo sucked, and I distinctly remember that despite being able to code these kinds of things, I had a huge disconnect between what I thought it would look like and what it actually looked like. I knew I was never going to be an artist, and I had already lost my heart to systems programming and computer virus stuff.


80386


The 80386 was in many ways the game changer for x86. It was introduced in 1985, according to Wikipedia. It was also the first computer I bought – that was in 1991 or '92. Imagine a processor architecture staying pretty much unchanged for 7 years – you hardly see that today. The irony is that despite this, back then a computer lasted 2 or 3 years before it was antiquated; by today's standards no time at all. The 386 was probably the biggest revolution in the entire history of the 0x86 platform, and for infosec people almost certainly. The most profound change was the move from the previous 16-bit architecture to a 32-bit one. Not only were the address space (from 20 to 32 bits), the registers and the operand sizes upgraded, a paging system was added as well. This allowed 32-bit assembly language code to look like it does today. Furthermore, the security architecture with ring 0 – ring 3, which some of us have a love/hate relationship with today, was introduced here. New debugging features were added, in particular the hardware memory breakpoint. The CRx and DRx registers and the GDT, LDT and IDT were introduced with the 80386. The real irony was that this processor was so far ahead of the software that it wasn't until Windows 95, a full 10 years later, that an operating system making full use of the opportunities of the 386 processor became widespread. What certainly was a revolution in hardware became slow evolution in software. DOS extenders that switched between real and protected mode became standard for games (especially DOS4GW from Watcom, but PharLap also had a market share). Short of being an operating system, they basically just provided a flat memory layout to C++ programs. There was DR-DOS, which had true multitasking but didn't break the 640 KB boundary and thus was pretty much unusable. And we got the horrible Windows 2.0 and 3.0, with their terrible 16-bit protected mode execution and their NE format, which can best be described as a crime against good style.


Virus and infosec in the early 1990ies

I had been infected with an interest in computer viruses in the very early 90's, reading about the exploits of Dark Avenger and his anti-virus nemesis, Vesselin Bontchev. Both were heroes of mine. Thus as soon as I'd written my first program in assembler I wanted to write anti-virus software – somehow writing virus code always seemed immoral to me. There of course isn't much anti-virus code you can write when you've only written "Hello World" before that. That did not scare me, and I quickly wrote up a number of goat programs. For those unfamiliar with goat programs, they are programs meant as scapegoats for a virus to infect. Since computer viruses at the time tried to avoid detection and crashes by carefully selecting which files to infect, having a large collection of known executables on your computer would increase the attack surface for any virus and of course allow for detection too. On the assembly side, they were pretty much programs with different junk instructions to fill up the executables – nothing fancy. With time I moved on to writing encryptors and matching decryptors for executables. That was pretty much the same code as you'd use in a virus, but without the moral implications. The first target was the .com file format of DOS. This was the easiest executable format imaginable: it was simply the binary code and data. It was loaded into a free segment at xxxx:0100, where xxxx denotes the free segment the operating system loaded the .com file into, and execution started from this address too. The memory below 100h was used for the command line and other startup parameters. This made .com a very easy format to encrypt. Basically I replaced the first two bytes with a jmp instruction to the end of the file, where a decryptor was appended. However, with a file format this simple, removing a virus or an encryptor was trivial, and thus anti-debugging had to be added. Hooking interrupts 1 and 3 could throw off most debuggers. Running checksums on yourself would detect if somebody was stepping or "going to" a location using an interrupt 3 inserted into the code, and using the pushf instruction to read the trace flag was popular. Finding back-door interfaces for the most common debuggers was a common enterprise for me, as was writing up detection code for them using this new-found information. On the other hand, patching backdoor interfaces in your own debugger, allowing you to decrypt the competition, was another essential trick. A real game changer was a debugger called TRW. It emulated each instruction instead of executing it and was thus immune to all these games we were playing. As a response I started checking instructions for emulation errors – lots of work, but real fun, and in many ways a sign of what was to come in this area. After I'd had my fun with the errors, I reported them back to the author, who by the way was a really nice guy.

With the arrival of Windows, new territory was everywhere, and I did lots of work on Windows internals, in the beginning particularly the PE executable format. Despite everything being new, the first (public) generic PE decryptor (ProcDump32 from 1998) used the same principle as had been used to unpack .com files generically in the DOS days: it set the trace flag and single-stepped everything until the original entry point, and then dumped the memory. The anti-anti-debugging was inspired heavily by TRW: a driver would emulate instructions that could be used to endanger the safe tracing, such as pushfd/rdtsc, instead of actually executing them. ProcDump32 could dump running programs too – except those wouldn't actually run afterwards, because the data wasn't correctly restored. It was, however, enough to do reverse engineering on the code in many cases.

Towards the end of the first half of my career as an infosec hobbyist, I was introduced to what had been more or less the birth of modern hacking: the stack overflow hack. My interest at the time was Windows internals, but on IRC EFnet you could not avoid being confronted with "0-dayz". Being used to assembler coding, I quickly understood the mechanics of the attack. It was occasionally useful for privilege elevation attacks on Windows NT. For Windows 98, internals freaks like myself had no use for it – with the front door left open, why look for a window to break? I didn't consider privilege elevation to be hacking at the time. Besides, finding buffer overflows was too easy – in the very early days it was looking for strcpy() calls, quickly followed by attacks on sprintf-type format strings. Usually if you could find one of those where the input string came from "user" input, you had a 50% chance of having a 0-day in your hands. You could report bugs all you wanted and it wouldn't change a thing. My experience was that if you reported a privilege elevation bug to MS engineers you'd be ignored. If you reported a BSOD that had to be provoked to appear, you'd be thanked. If you reported a BSOD that could happen by accident, it would be fixed – though it would take forever until the fix was distributed. It was all reliability; nobody cared about security between 1996 and late 2000, when I put my infosec career to rest.

My last act before retiring as a malware hobbyist was writing a blog post on the press's favorite malware of the day – Back Orifice. I'd figured out how to detect Back Orifice specifically, and most other remote control malware of the day generically. Nobody cared. I tried for a while to get a job with anti-virus companies, but short of a nice email from Mikko Hypponen, nothing ever came of it. So I got a job, which quickly killed my interest in spending time behind a computer in my spare time – but my life with x86 was not yet over.

Optimization

Assembler has always played a major role in optimization. When I started out reading about optimization, it was all things we see in compilers these days, e.g. using shl for multiplication by 2 or 4. The first kind of optimization I actually did was pre-computed tables. At the time it didn't take much calculation for pre-computation to be worthwhile. Part of this was because processors were much slower relative to RAM than they are these days. Also, accessing hardware directly instead of going through BIOS/OS channels worked wonders – say, the demo above would not have worked without directly accessing the graphics memory. In the mid/late 90's a friend of mine was tasked with optimizing an interrupt routine for an industrial barcode scanner which would be the heart of a large sorting machine for logistics. I joined the effort for the fun of it. First we spent a lot of time translating from Pascal to pure assembler, and then more time was spent replacing multiplications with combinations of shl, lea and add instructions. Then we spent hours avoiding push and pop and keeping everything we needed in registers, and eventually got it fast enough for the requirements.

In 2000 I was working on optimizing an MPEG-2 decoder. By then compilers had internalized the above optimizations, and it was rarely worth the effort to see if you could do better by hand. The optimization game had moved yet again: the name of the game was now MMX. I realize that SSE was available at this time – but not in low end computers, and MMX optimization would do the trick at both ends of the scale.

By 2002 it was an MPEG-2 transrater I was optimizing. The idea was copying a DVD-9 to a DVD-5 without re-encoding. I had a beta version of a Pentium 4, and Intel really wanted me to work on using hyper-threading. But threading is difficult, and it brought less than optimizing memory access – and again there was the real life market perspective: nobody but a few select people had a hyper-threading computer, and there was little chance they'd be widespread within a year. Especially using the non-temporal instructions of SSE2 and aligned memory access made a huge impact. Another thing that made a big impact on my optimization work was that performance counters had become really useful.


By 2006 I was back to optimizing video – this time an H.264 codec for massive amounts of video. This time around, optimization moved to threading: with 8 cores in a computer and the prospect of using multiple computers, splitting tasks in a meaningful way gave much more performance gain than the 50% you could get from hand optimizing the big spenders. To be fair, I did write SSE3 routines for the biggest spenders. But assembly language optimization had finally taken the back seat.