What Stone dream about: x86 assembler coding nostalgia

Foreword

This blog was actually finished a long time ago. I've not posted it before because I'm anything but happy with how it turned out. I hope you enjoy it anyways. In case you actually read this I added an errata to the cache side channel post. A friend of mine in the infosec community insists on calling me old and he is right too. You are old when you write nostalgia blogs.

Introduction

The idea for this post came shortly after I’d posted the last nostalgia post. That post remains one of the most popular on my blog. The trigger was a question by MalwareTech on twitter. I don’t recall the exact quote, but it was along these lines: “Is there a way to figure out if an instruction is privileged from it’s encoding?”. My first thought was that it’s a silly question. But thinking about it made me reconsider. The question has lots of merit and the answer is silly. Any reasonable designer would have grouped privileged instructions together, yet x86 doesn’t. The reason is that x86 assembler wasn’t designed, it evolved. It’s layers and layers of added designs and I’ve had the mixed pleasure to follow lots of this evolution first hand. This post however is not the story of x86 processors – other people could tell that story better than I could, I’m sure. This post is about me and x86 processors. I’ll try to focus on the tech stuff, because I assume you’re only interested in me to the extend, that my story reflect yours. I hope you enjoy this trip down memory lane.

Setting the stage

Somewhere in time Intel made a new processor called 8086. This CPU would've been long forgotten by now if not IBM had made an open computer design around it. As a consequence companies like Nixdorf of Germany, Olivetti of Italy, Commodore (and much later Apple) and many others produced their own computers based on this design and the 8086 and it's successors became the de facto business PC and remains so today. It had an instruction set inherited from older Intel processors like the 8085, to make porting software easier. And a set of 16 bit of registers that by now sound awfully familiar: AX, BX, CX, DX, SP, BP, DI, SI, CS,DS,ES and SS. The first 4 of which each could be accessed as two 8-bit registers as well.

In the early 1980'ies - I don't know exactly when my father bought his first x86'er PC. It was a 4.77 MHZ 8088 IBM portable (if you can call anything weighing 14 kilos portable and not to mention it could not run unplugged). It was soon sold again, because the display did not live up to the standard our old Commodore already had. Consequently it was replaced with a Commodore PC 10 with a NEC V20 processor. It was instruction compatible with the 8088, but running 8MHZ and if I recall correctly a better memory bus which made it a pretty solid computer for the day. At first I didn't care much for the monster. First you had to boot the OS from a floppy disc (rather than from a ROM), you had to enter the time on each boot too as no battery was on board to keep the clock running, the games sucked and I had nobody to swap games with. It took a while for me to get interested in that computer. My father being a geek relatively quickly acquired a 10Mb hard drive and a multifunction card which cured the having to enter the time problem and boot from floppy disks. However what made the difference to me was a word processor. Once I got used to it, I found that the easy with which I could do my homework would save me time. Having a spell checker in the mid 80ies and search function to approximate commas was brought down time spend on my Danish homework significantly for the same grades (Me being a freak I actually timed it). Another cool thing I just have to mention is that my best friend bought a used IBM PC and today it defies belief but it actually came with a diagram of all the electronics on the mother board so you could repair the beast with a soldering iron should something break.

Hacking did exist in these days, but not for kids like me. Mischief obviously existed and that usually meant borrowing the library's computer (an Olivetti) that booted into a menu from which you could selected a handful of innocuous programs and require a password to get back into DOS to do maintenance. A dos prompt was root, since the early processors in the family 8086, 8088, V20, 80186, 80286 etc. came completely without security functions. These processors had no memory protection, any memory was R/W/X and the entire address space was accessible from anywhere. The in/out instructions always worked, the interrupt table was placed at a fixed location in memory etc. etc. Consequently if you got into dos you owned the computer. As a direct consequence of the processors inability (insufficient memory, too slow, etc.) to do meaningful multitasking, many programs, such as the dominant word processor Word Perfect, allowed you to open a dos prompt in foreground to format floppy disks, look for files etc. and then close the dos prompt to continue working on your document. I used that countless times to get that black hat feeling of power, back when I was 12 or younger. Don't recall actually doing anything with my power though. But I wanted to crack games, have a cool nick name in intros and for that I had to learn assembly.

My first 0x86 assembly program

My first assembly program would’ve looked much like this:

.MODEL small
.STACK 100h
.DATA
HelloMessage DB ‘Hello, world’,13,10,’$’
.CODE
.startup
mov ax,@data
mov ds,ax
mov ah,9
mov dx,OFFSET HelloMessage
int 21h
mov ah,4ch
int 21h
END

When linked you’d have a small hello world executable. This code obviously only has remote similarity to any assembler you’d write for windows or other modern operating system today. Back then it was just 0x86 assembler. With the arrival of the 80386 processor it became “real mode” as opposed to “protected mode”. I’ll get back to this later. The two things that seems particularly weird in the above assembler is writing to DS (which we today call a descriptor register, but back then it was a segment register – there was nothing to “describe” and thus it’s only say which segment of memory to read from) and the use of interrupts.

To understand why segment registers had to be written, the first thing we need to understand is that the early x86 processors came without paging and where 16 bit. Thus without changing the segment register we could only access 64kb – not very much even by the standard of the time. Thus the changing of the segment registers. The “Physical” address could be calculated by segment (register << 4) | address. 0000:0211 would be the same memory as 0001:0201. The upside to this scheme was that loading an executable or allocating memory you’d only loose a max of 15 bytes to alignment. This seems like a reasonable compromise considering the address space was 20 bits: 1 Mb – of which only 640kb was to the users disposal, the last 384kb belonged to the bios (well sort of – eventually things like himem.sys partially recaptured memory from this region). I like to think that history repeats itself and the scheme is a bit like Amd64 which is a 64bit addressing in a 48bit address space and the reserved 384kb can in wierd way be thought of a early version of SMM memory.

Interrupts in those days where not only hardware and exceptions service as it is today. It was also the API. Though many interrupts where used (0x19=reboot,0x20=exit process) interrupt 0x21 and 0x13 where the once you had to know. 0x21 was the dos services – which in todays terms are what is exported from kernel32.dll in windows. The calling convention was all register based (funny how history repeats itself. After more than a decade with stack based parameters, it’s typical to use much more registers for call styles in x64 – actually until WinXP int 0x2e was used internal to forward calls from user mode to kernel mode and was then replaced with the new instruction syscall, but programmers rarely see this nowadays). I never saw any official documentation, I doubt most people did – remember without internet to download stuff and no Intel or Microsoft websites documentation was tricky. Instead unofficial documentation especially in the form of interrupt list was floating around. The most important was Ralph Brown’s interrupt list. For interrupt 0x21 the ah register selected the function. 0x9 was Writeln() (that is write a $ terminated string to the console and 0x4c was ExitProcess(). The interrupt 0x13 was very important to system programmers, because this was where the bios exported it’s functionality. If you wanted to read sector based from disk for example, the easiest way was to use this interrupt which was supported by the bios. All boot loaders at the time where full of int 0x13 and copy protections too…

Another funny thing about early x86 was that lot’s of stuff was mapped to fixed addresses in memory. At 0000:0000 was the start of the interrupt table, which was just an array of far pointers. An interesting side effect of this was, that any program could globally hook all operating system API: An offer that virus authors would not turn down. Even weirder was address 0000:b800. It was the console. Mov ax, 0; mov ds,ax; mov [b800], ‘A’ would print an “A” in the top left corner of the console. The graphics adaptor would read it directly from here. Consequently you could read what was in the console from this address too. Graphics was rare, so the console mode was where computers where actually used in the early x86 days. EGA graphics had a fixed address too – I think it was 0000:e800 – but I really do forget these things.

Writing my first demo

A demo was at the time the “proof” of coding skill in 1994. Doing nice motions graphics required pushing the processor pretty far and writing a good demo in a high level language usually didn’t turn out well. I wrote my first and last demo in real mode assembler. It was a sine scroller – which essentially means that a text scrolls across the screen in a sine curve. To realize this, the first thing I had to do was to define a font. Other than the standard ascii, the operating system or hardware didn’t offer anything. This was done in a graphics program and then the output was converted into a table for each letter by hand. With only 256 colors it was a byte table. This was a lot of work, even when I drew the font only for the letters I needed. The second thing I needed was a sine table. Today you’d just calculate one as you go. Not so back then. The first problem was that early x86’s did not come with floating point instructions. To get floating point instructions you’d actually have to buy a so called co-processor (typically called 8087, 80387 or so) which you could plug into your motherboard. The second reason was that a table lookup was far more efficient. I generated my sine table in pascal that had a library for calculating sine with integer instructions and then added it to my demo. Drawing the actual graphics on screen was challenge of it own. The screen refresh was pretty slow and you’d see artifacts from it if you didn’t synchronize with it. Thus updating your screen was a loop of waiting for the refresh to finish and then a complex memory move operating to the graphics card memory (a fixed address as seen above). Waiting was achieved by using the “in” instruction directly (No clue what port it was) until a certain bit was deleted (or so I recall it). My demo sucked and I distinctively remember that despite being able to code these kinds of things, I had a huge disconnect in what I thought it would look like and what it looked like. I knew I was never going to be an artist and had already lost my heart to systems programming and computer virus stuff.

80386

80386 in many ways was the game changer for x86. It was introduced in 1985 according to Wikipedia. It was also the first computer I bought – it was 1991 or 92. Imagine a processor architecture staying pretty much unchanged for 7 years – you hardly see that today. The irony is that despite of this back then a computer lasted 2 or 3 years before it was antiquated, by today standard no time at all. The 386 was probably the biggest revolution in the entire history of the 0x86 platform and for infosec people almost certainly. The most profound change was the move to a 32 bit architecture from the previous 16 bit. Not only was the address space increased from 20 bit to 32 bit, but the registers and operand sizing was upgraded too, but also a paging system was added. This allowed 32 bit assembly language code to look like it does today. Further more the security architecture with ring 0 – ring 3 which some of us have a love/hate relationship with today was introduced here. New debugging features were added in particular the hardware memory break point. The CRx, DRx, GDT, LDT and IDT registers was introduced with the 80836. The real irony was that this processor was so far ahead of the software that it wasn’t until Windows 95 a full 10 years later operating system that made full use of the opportunities of the 386 processor became wide spread. What certainly was a revolution in hardware, became slow evolution in software. Dos extenders that swapped from real to protected mode became standard for games (especially DOS4GW from Watcom, but Pharlab also had a market share). Short of an operating system it basically just provided a flat memory layout to C++ programs. . There was a DR dos which had true multitasking but didn’t break the 640kb boundary and thus was pretty much unusable. We saw horrible Windows 2.0 and 3.0 with terrible, terrible 16 bit protected mode execution with their NE format which can best be described as a crime against good style.

Virus and infosec in the early 1990ies

I had been infected with an interest in computer viruses in the late very early 90ies reading about the exploits of Dark Avenger and his anti-virus nemesis: Vesselin Bontchev. Both were heroes of mine. Thus as soon as I’d written my first program in assembler I started wanting to write anti-virus software – some how writing virus code always seemed immoral to me. There of cause isn’t much anti-virus code you can write when you’ve only written “Hello World” before that. That did not scare me and I quickly wrote up a number of goat programs. For those unfamiliar with goat programs, they are programs meant as scapegoats for virus to infect. Since computer virus at the time tried to avoid detection and crashes by carefully selecting which files it would infect, having a large collection of known executables on your computer would increase the attack surface for any virus and of cause allow for detection too. On the assembly side, it was pretty much programs with different junk instructions to fill up the executables – nothing fancy. With time I moved on to writing encryptors and matching decrypters for executables. That was pretty much the same code as you’d use in a virus but without the moral implications. The first target was the .com file format of DOS. This was the easiest executable format imaginable. It was simple the binary code and data. It was loaded into a free segment at xxxx:0100, where xxxx denotes the free segment the operating system loaded the.com at and execution would start from this address too. The memory below 100h was used for command line and other start up parameters. This made .com a very easy format to encrypt. Basically I made a jmp instruction instead of the first two bytes to the end of the file, where a decrypter was appended. However with a file format this simple removing a virus or an encryptor was trivial and thus anti-debugging had to be made. Hooking interrupts 1 and 3 could throw off most debuggers. Running checksums on your self would detect if somebody was stepping/gotoing using interrupt 3 inserted into the code and using the pushf instruction to read the trace flag was popular. Finding back-door interfaces for the most common debuggers was a common enterprise for me and writing up detection code for them using this new found information. On the other hand patching backdoor interfaces in your own debugger allowing you to decrypt the competition was another essential trick. A real game changer was a debugger called TRW. It emulated each instruction instead of executing them and was thus independent of all these games we were playing. As a response I started checking instructions for emulation errors – lot’s of work, but real fun. In many ways a sign of what was to come in this area. After I had my fun with the errors, I reported them back to the author, who by the way was a really nice guy.

With the arrival of Windows new territory was everywhere and I did lot’s of work on the windows internals in the beginning particularly the PE executable format. Despite all things being new and all – the first (public) generic type PE decryptor (ProcDump32 from 1998) used the same principle as had been used to unpack .com files generically in the DOS days. It set the trace flag and single stepped everything until the original entry point and then dumped the memory. The anti-anti-debugging was inspired heavily from TRW: A driver would emulate instructions that could be used to endanger the safe tracing such as pushfd/rdtsc instead of actually executing the instructions. ProcDump32 could dump running programs too –except those wouldn’t actually run afterwards because the data wasn’t correctly restored. It was however enough to do reverse engineering on the code in many cases.

Towards the end of the first half of my career as an infosec hobbyist, I was introduced to what had been more or less the birth of modern hacking: The stack overflow hack. My interest at the time was Windows internals, but in IRC EFFnet you could not avoid being confronted with “0-dayz”. Being used to assembler coding I quickly understood the mechanics of the attack. It was occasionally useful for privilege elevation attacks on Windows NT. For windows 98 internals freaks like myself had no use for it – with the front door left open, why look for a window to break? I didn't consider priviledge elevation to be hacking at the time. Besides finding buffer overflows were too easy – in the very early days of it looking for strcpy() routines. It was quickly followed by attacks on sprintf type formatting stings. Usually if you could find one of those where the input string came from “user” input, you would have a 50% chance of having a 0 day in your hand. You could report bugs like you wanted to and it wouldn’t change a thing. My experience was that if you reported a privilege elevation bug to MS engineers you’d be ignored. If you reported a BSOD that had to be provoked to appear, you’d be thanked. If you reported a BSOD that could happen by accident it would be fixed. It would take forever though till it was distributed though. It was all reliability, no body cared about security in between 1996 and late 2000 where I put my infosec carrier to rest.

My last act before retiring as a malware hobbyist, was writing a blog post on the favorite malware of the press in the day – Back Orifice. I’d figured out how to detect Back Orifice in specific and most other remote control malwares of the day generically. Nobody cared. I tried for a while to get a job with anti-virus companies, but short of a nice email from Mikko Hyponen, nothing ever came of it. So I got a job which quickly killed my interest in spending time behind a computer in my spare time– but my life with x86 was not yet over .

Optimization

Assembler has always played a major role in optimization. As I started out reading about optimization it was all what we see in compilers these days. E.g. Use shl for multiplication by 2 or 4. The first kind of optimization I actually did was pre-computed tables. At the time it didn’t take much calculation for pre-computations to be meaningful. Part of this was because that the processors where much slower relative to RAM that they are these days. Also accessing hardware directly instead of through bios/OS channels worked wonders. Say the demo above would not have worked without directly accessing the graphics memory. In the mid/late 90ies a friend of mine was tasked with optimizing an interrupt routine for an industrial barcode scanner which would be the heart of a large sorting machine for logistics. I joined the effort for the fun of it. First we spend a lot of time translating from pascal to pure assembler and then more time was spend replacing multiplications with combinations of “shl”,”lea” and “add” instructions. Then we spend hours on avoiding push and pop and keeping everything we needed in registers and eventually got it fast enough for the requirements.

In 2000 I was working with optimizing an MPEG2 decoder. By now compilers had made the above optimizations internal and it was rarely worth the effort to see if you could do it better. The optimization game had moved yet again. The name of the game was now MMX – I realize that SSE3 was available at this time – but not in the low end computers and MMX optimization would’ve do the trick at both ends of the scale.

By 2002 it was an MPEG2 transrater I was optimizing on. The idea was copying a DVD9 to a DVD5 without re-encoding. I had a beta version of a Pentium 4 and Intel really wanted me to work on using hyper threading. But threading is difficult and it brought less than optimizing memory access and again real life market perspective: Nobody but a few select people had a hyper threading computer and there was little chance that they’d be wide spread within a year. Especially using the non-temporal instructions of SSE32/3 and aligned memory access made a huge impact. Another thing that made a big impact on my optimization work was that performance counters, had become really useful.

By 2006 I was back to optimizing video – this time h.264 codec for massive amounts of video. This time around optimizations now moved to threading – with 8 cores in a computer and the prospect of using multiple computers, splitting tasks in a meaningful way gave much more performance gain than the 50% you could get from hand optimizing big spenders. To be fair I did write up SSE3 routines for the biggest spenders. But assembly language optimization had finally taken the back seat.

What Stone dream about

Thursday, November 5, 2015

x86 assembler coding nostalgia