Tuesday, April 21, 2015

Anti-virus 1980ies style

This post was originally part of the “Machine Learning with malware Part 1: Data” post. But to avoid confusing I’ve made it a separate post. The reason is that it’s only marginally about machine learning and I do not wish to confuse the exposition. Unfortunately you’d probably need to read at least the last section of the previous blog post for this blog to make sense. Also if you wish to play along with me, you definitely need the data from that post.

In the late eighties viruses began to pop up in the wild. My first contact with computer virus (which I had until then known only from magazines) was when I got my hands on a version of I believe it was IBM anti virus. It had a signature data base which was a plain text file with I think some twenty entries. It looked like hex strings like those you’d see in the hex editor of choice from that time Norton Disk Edit. And by poking around (not really knowing what I was doing ) I managed to make IBM anti virus identify Norton Disk Edit as a virus by modifying a string to match what I’d extracted with NDE about itself. In those days virus was really rare there where only a few and they were all carefully examined by malware researchers. To detect these just plain strings where picked from the virus and if a file (or boot sector) contained the string it probably was a virus. Mutating code where not yet an issue. With an SQL data base full of malware we can easily make it search for likely malware strings. For this reason one of the features, were the bytes of the entrypoint of each file in my previous blog on data for malware machine learning. Since I don’t expect each file to be a unique malware, but rather multiple manifestations of a smaller number of malwares I can now just take a look at how many entry point strings that are common. Using an SQLite browser (download from SQLite.org) I can enter this command in the “Execute SQL” section after loading the database:

select EPAsText, count(EPAsText) as StrCount from MALWARE group by EPAsText order by StrCount

The most common entry point strings could be compiler generated ones. I couldn’t tell without comparing either comparing with non-malware binaries or looking at an actual disassembly. So for now I’ll just talk about those that look suspicious. The first suspicious entry is FF 25 00 02 04 00 00 00 00… This I can disassemble in my head. It would be Jmp [402000] which is a strange thing to do at an entrypoint. However you quickly discover that this is simply .Net files. The next immediately suspicious one is this: EB 01 68 60 e8 00 00 00 00 ….Of the top of my hat I recognize Jmp $+3 and call $+5. Jmp $+3 is just out right weird, unless it’s purpose is obfuscation which makes perfect sense in badly behaved code. Call $+5 is not weird at all. It’s a classic way for a virus or other piece of code of pushing EIP to the stack to find out where the loader actually put it in memory. It’s quite common in viruses and I used it myself in my PE-executable packers and I’ve never seen it in a well behaved program. In short I just found a signature string of either a virus or perhaps a packer. To minimize the chances of it being a packer we can do this:

select * from malware where EPASText like 'eb016860e8000%'

And whoopty they all have an entropy between 6 and 8 which is normal for unpacked files. Obviously there are PE-Encrypters that use silly encryption that keep entropy that high. My first PE Crypter is among them – see my Unpacking with emulator blog for more on that. So there is good evidence that we just found a string signature for a virus. Looking at the number of imports I’d say we just found a signature that matches two variants of the same virus. If you want to play more with strings try looking for the call $+5:

select count(EPAsText) as MyCount, EPAstext from malware where EPASText like 
'%e800000000%' group by EPAsText order by MyCount

 My guess is that simple query finds signatures for about 5% of the sample, but I don’t care to check it.
We could easily automate what I did above but there are better ways. Obviously if we had a database of non-malware executables we could use the simple Bayes machine learning on these entry point strings (see my Machine Learning and Bayes blog – if you look in the source codes for that blog you’ll find that I actually had planned to do exactly this without a database and not just entry point strings, but dropped the idea because of the time I’d been spending). I have no idea how big a role strings plays in modern anti-virus business. My immediate guess would be that it’s the method of choice for everything simple which remains the bulk of malware in executables.  The last 30 years however made the method more sophisticated but this core method probably remains exactly what I outlined here.

Szor, Peter: The Art of Computer VirusResearch and Defense

Machine learning with Malware Part 1: Data

Download my database here
Download source code for predator here 

Without data there is no machine learning

Without data there is no machine learning. That is because ML is about finding "structure" in data, it really is that simple. This also means that you can never get better results from ML than your data allows. Often a bad model choice will provide decent results if the data is good, but with bad data no model will ever provide good results. Thus while this blog might be a little boring it’s subject is of significant importance.
I've used dummy data for my two previous ML post which focused more about what ML methods is and what they can do and less about the data. Now I'm faced with real data and this blog post is about how I'll treat my data before I even get to think about models and methods for finding structure. Ironically these steps are rarely even considered in a generalized fashion in the literature I've read. Usually it goes straight to the methods for finding structure. I suppose this is because what you actually need to do depend heavily on what kind of data we are talking about. This means of cause that this blog post will not be well founded in the scientific literature, but rather on my experience – but trust me  I’m good with data.  If you have no prior knowledge and haven't read the two first blog posts on machine learning with malware you might wish to read those first before proceeding.
In my mind I've always seen the process of doing econometrics or in this case ML of running over the stages below. Unlike a shopping list often you'll find yourself going back and improving previous steps as you become acquainted with the data. The steps in my mind are:
·         Data acquisition: Getting Raw data
·         Feature engineering: Getting from raw data to a data set ( observations)
·         Preprocessing: Getting from data set to input data
·         Model selection: Getting from input data to information
·         Model analysis: Analyze results and become aware of problems with your model

This blog post will consider only the first three steps. Also the distinction between the steps is in real life application often much less clear than my list above. Never the less I think of the distinction as useful when working with the data. When I’m all done I’ll have a data set that you in theory could use for machine learning on malware – arguably a very crude data set, but that’s the way it’ll always be given my time constraints.

Data acquisition

When getting raw data you need to think ahead. For questionnaire type of data acquisition it's obvious that you get different results if you ask about "Obamacare" as opposed to "Death panels". It's less obvious in a malware setting, but no less important. Say you collect your raw malware data through honey pots. It'll probably influence the type of malware you collect if your honey pots are placed on servers in "Advanced weapons technology" companies as opposed to on aunt Emma's home PC. The first are far more likely to see custom build and highly sophisticated attacks than the second, and much less likely to fall prey to rapid spreading email worms. If you're trying to deduct general things about malware from this data, you'll be biased in either scenario because the data already reflect different realities. This bias is called selection bias. We’ll visit selection bias again below. Thus beware of your data, ML is data driven and it’s very much a garbage in garbage out discipline.. In my case I didn't have honey pots, but a password to a collection where I could download some malware. Thus I ponder that the chance of a malware making it into the collection is much higher for aunt Emma style malware than for sophisticated attacks.

Feature engineering

In all ML, except “deep learning”, finding the structure of the data requires us to specify the features of the data. You don't just stuff an executable into a machine and out comes if it's malicious or not. In my first blog post I stuffed into Bayes the section of the entrypoint and out came an estimate if the file was packed or not. The "Section of the entry point" is usually called a feature of the data and getting from raw data to features is what feature engineering is about. Sometimes features are obvious, unfortunately that is not so in malware research. Here we'll have to think hard about what we think is different between malware and non malware (at least if that's what we are examining with ML). The good thing is that we don't need to care about the actual relationship between the features, we just need to extract the features we think are relevant. This is because figuring out relationship in data is the job of the model or algorithm. Extracting too much is better than too little. ML has ways of dropping what is too much in data driven ways instead of the ad hoc assumption style. When we start working with the data we might come to think of a feature that wasn't part of the raw data we got and that might send us back to acquire more data. As a student I got access to a large database of socio-economic indicators over time for 1% of the Danish work force. I started analyzing unemployment and found out that I had to go back and find data on the business cycle because business cycles are fairly important in the labour market. Once you start working with the data things often become more clear and you need to be open to rework your decisions again and again. As you may recognize from the experience I just mentioned, feature extraction is a job for people who understand the subject at hand. So this is a good reason why being malware knowledgeable cannot be replaced by machine learning, however leveraging machine learning can be extremely powerful.



Usually there are significant advantages in preprocessing data before entering it into a model. When trying to use machine learning to read hand written postal addresses classic image processing - that as such has nothing to do with machine learning - is often used. For example often CCD noise are removed, the picture of each letter is made the same size (in pixels) and if there is a slight tilt in the hand writing the pictures might be rotated etc. Also some ML methods have requirements of the input in question. If you wish to do say principle component analysis on the data you at least should consider normalizing the mean to 0 and the variance to 1 of the features, where say an “ordinary least squares model” is much more forgiving. The distinction between preprocessing and feature extraction is not always clear. In the same way reading out features from malware it often makes good sense to preprocess. For example a thing we'd wish to look for when trying to determine if an executable is malware or not could be the protection characteristics of the sections in the executable. In the Portable Executable format (PE) that's just a simple 8 bit long bit field for each section. First off no simple model will make much use of such a bit field as input, it is grammar, not semantic. But the "Writeable" bit might be very interesting. Is the feature the characteristics bit-field or is it the separate bits? But further we might consider combinations of bits to be features say executable and writable might seem like an unsecure combination where as observing either bit in an executable would be considered perfectly normal. The thing is that we might take these things into account when extracting the features, but often times we'll have to go back and work the data again before we can proceed with a model. My recommendation is do obvious preprocessing early on as it'll save work. Anything preprocessing that is less than obvious make a mental note of and return once you get to know your data and a picture starts to emerge where you wish to go with the data. For instance don’t normalize your data until you know if it is an advantage in the model you wish to use – and even then analyzing the raw data before hand might yield insights even if the end goal might be a PCA. I once estimated a  fairly advanced model where I where trying to predict business cycles. In my first attempt I used something close to raw quarterly data for my variables. After hours of waiting for my model to converge I had a model that was able to perfectly predict at what time of year Christmas sales picked up. I had done my preprocessing poorly and my model picked up Christmas sales instead of business cycles. Had I preprocessed my data to correct for seasonal effects I would’ve saved myself a lot of trouble.

So much for theory

Data acquisition for me was as stated just plain downloading virus_share_0000.zip (And obtaining access - thanks to those who made this possible). This archive contains around 130.000 malware files in some 20 gigabytes. 130.000 is a big number for ad hoc management. To further complicate things the files names are only a hash with no mention of original extension, which means there is no simple identification to identify java script,c source codes, DLL’s, EXE etc. Manual analysis is out of the question because of the amount of data. Enter “Predator”. Predator is the stupid code name I had in my head and as I didn’t figure out a better name for a tool it’s now called Predator. Predator is a simple tool that’ll extract features from all files in a directory and organize them. A simple text file or a spread sheet seemed not quite up to the task for organization. A SQL database seemed the right way to go. Enter SQLite. I was up and running with SQLite library in less than 20 minutes including creating tables, adding observations and sending queries. Unfortunately the SQLite interface is just bare bones to put it politely. Initially I thought I'd just install POCO library using biicode and use poco's abstraction for SQLite. Then I thought hey I can write something up in a couple of hours and in the process learn something about SQLite and that might prove beneficial down the road. So that's how that code came to be.

The second thing I needed to figure out features to extract to get to a data set. And in step with keeping things simple I went for features of the PE format. My logic where:
A) there are going to be plenty of PE files as opposed to analyzing PDF exploits
B) I'm knowledgeable in the PE format and pe features are relatively easy to extract
C) They are very relevant in the real world malware.
Obviously I cannot extract PE features from a java script or PDF-based exploit. So I needed to throw away anything without a PE-header. As a general rule when working with feature engineering is that throwing data away is bad unless you have strong reasons to do so and if you do have these strong reasons you need to articulate them and consider the effects. This is because throwing away information will lead to selection bias - just imagine throwing out smokers in a lung cancer study. And don’t forget you’ll not be the first who have data problems, so the literature is full of methods to deal with most issues in better ways than throwing information away. However I do have strong reasons - PE files are likely to be an entirely different kind of malware than java scripts. They have different kinds of features; they work in different ways to achieve mostly different things. In fact I’d go so far to say that the only thing all malwares have in common is that they are malicious. The price I pay is that any result I get from now on no longer is general about malware but general only to PE files. We might find it reasonable to cut down further at some point (How about analyzing viruses separate from adware?). But since at this point I have no a priori (data based) reason to think they cannot be analyzed with the same features and same methodology so in line with the reasoning above I keep everything for now. As I wrote in the beginning of this blog post, ML is more of an iterative process than working a shopping list.

The original idea was to write up a small PE parsing library based on my previous works. I did not have much luck with that. The PE files in the achieve where off specs, broken and shattered. So after a few hours of trying to harden up my own code I realized there was no way I could get both sufficient flexibility and stability on a time schedule like mine (Remember I try to stay under 24 hours total time for a blog post and I’d already spend time on SQL and my own PE stuff). Thus I decided to go for a finished library, despite not being keen on using too many libraries especially in core functions, because my blog should not descend into a “using tools blogs”, but rather be about how stuff works. I considered going with Katja Hahn’s PortEx java library but dropped the idea because I’m not too strong in java and didn’t want to interface with my c++ code. Especially since I’m considering for another blog post to add features from the code utilizing my emulator (see Unpacking using emulation blog). Instead I went with PeLib by Sebastian Porst. I think I actually met mr. Porst once which ironically became the real reason for going with it. I know sloppy, but hey…  It’s a very nice library, (Unfortunate it seems to be orphaned as there has been no changes since 2005 or so) but never the less it too was not quite up to the task of working with these mangled files. In fact it asserted on the very first file I parsed with it. So instead of just linking it, I took it into my own project and started fixing bugs as I went. Thus you’ll find Mr. Porsts code in my project. Now keep in mind that I was not trying to do good fixes for the bugs, I was trying to get the parsing done. Most bugs in the code where integer overflows. I did my best to keep every PE file in the sample for reasons I’ve discussed above. I might have excluded a few files, but I am certain that those files could not possibly be executed on any NT platform. So I think the selection bias from this point of view should be minimal.

The features I’ve selected is summarized in this table and I’ve sometimes tried to explain my motivation behind choosing the feature:

Maybe I’ll wish to look at 32bit and 64 bit exes sperately at one point
Many sections can be an indicator of evil
Linker version is often used as “infected” marker by viruses




This might be used as indicator for being a DLL
The entropy of the first 1000 bytes after the entrypoint might be an indicator for polymorphic encryption of a virus.
See below
If doesn’t compare to filename probably damanged
The entropy of the entire file will indicated if file is encrypted or packaged as opposed to infected. It of cause might be both, but that is another question
This is normal in most executables. It’s not quite as common for packaged and infected files.
An entrypoint in the PE header is definitely an indicator that something evil is going on. These however will execute!
A virus might forget to set the entrypoint as executable. A Linker won’t. Exe will run on 32bit
Writeable code section is definitely a sign of evil
Exports are an indicator of DLL
An error occurred during export parsing. Unlikely for a real linker.


.Net might be an entirely different kind of analysis,
Same as export errror

I ended up with 86447 PE files in my database. They came as 84925 32bit native code, 1503 DotNet executables and 19 of other architectures. Seems like my malware collection is a bit dated as 19 is a tiny number considering that that number contains 64 bit PE’s. In selecting my features my recollection of Peter Szor’s excellent book was useful.
SQLite. http://www.sqlite.org
Porst, Sebastian: PeLib. http://www.pelib.om
Hahn, Katja – PortEx. https://github.com/katjahahn/PortEx

Szor, Peter: The Art of Computer VirusResearch and Defense

Thursday, April 2, 2015

Thoughts on H.A.R.E.S, harvard, mining and leverageing

Update 19-6-2015

I didn't explicitly write it in the original blog. But I should've. The methods below would be useful for detecting Shadow walker type root kits too.

Original post starts here:

I finally got a new hard disk and I'm now spending most of my malware hobby time trying to extract features for machine learning from 70000 PE executables and put them into a database. This unfortunately takes a while. But stay with me I'll get back to ML soon enough.

Download frequency tables here


While I'm working on other stuff here is a short note on something I found extraordinary interesting: The H.A.R.E.S. white paper. by Jacob Torrey. The white paper is pretty short, and the slides from his presentation on SyScan (which I did not attend) doesn't bring much more light to the thing.  So forgive me if I'm in accurate. Also I'm writing this on the side while running 70.000 malware executables through my feature grabbing tool and monitoring that everything is ok. So no code this time, just words.

What is H.A.R.E.S and why do I like it and fear it at once

To summarize really shortly:
Mr. Torrey wants to build a DRM style utility in which the protection is that the code segment can be executed but not read or written. To implement this he uses that the memory manager (MMU) in the modern x86 computers keeps separate caches for page tables (TLB's) for data and execution. He then manipulates the execution cache to point to real data, and the data cache to point in "nirvana". This trick is that makes PaX  on linux work or the Shadow walker root kit. To avoid the attacker from reading the code in the original file he proposes to encrypt with AES-128 for which the modern intel processors has instructions for. To protect the key and the page tables he installs everything in a "thin" virtual machine so that everything the host os (That includes the attacker) does is directly with hardware except for decryption and protection of the TLB's. Read his white paper for details  - I find it quite good.

I find this interesting because in many ways it's a step towards a new computer architecture. x86 was originally build on the von Neumann architecture. For this discussion the important part of the von Neumann architecture is that there is no distinction between code and data.  A competing architecture the Harvard architecture distinguish between them. A great many of the bad behavior that is possible in x86 architecture is possible in the von Neumann architecture but not in the Harvard (stack execution, trampoline hooks, self modifying code, many code injection techniques ...). Mr. Torrey's idea effectively imposes this Harvard feature on a von Neumann platform and that is a very interesting thing - at least from a theoretical perspective. The Gruq however noted that though at first it sounds like a blessing for InfoSec, it's really is a curse in disguise. Imagine no way to audit code for maliciousness, no signature, no code heuristics only behavior analysis to identify if something is malicious or not. Mr. Torrey suggest some weaknesses in his methods that might be useable. I'll get back to that in a moment.

I doubt that Mr. Torrey's thing will find much real world application, but it might be an important stepping stone towards an architectural change that is much needed. That alone makes it worth peaking into and not to mention it seemed like a good semi-technical topic for a fun blog post. I have no time to do code on this and since I doubt this will have real world implication I find it ok to stick with speculation.

A detour

The first problem that Mr. Torrey has is that code in real executables are often interspersed with data. Here a string, there a jump table etc. If you've disassembled enough you've seen it. Since not even the code it self can read this data we have a problem. I'd suggest that a solution to this problem is identify as much code as you can in the code section using standard methods (Performance counters based coverage analysis and static coverage analysis). You make a copy of the entire code section, destroys (overwrite with garbage) what you could positively identify as code and uses the data TLB to project it correctly for data instructions. You sacrifice protection for easy of implementation and that always seems to me to be a worthwhile sacrifice.

Attacking H.A.R.E.S

Now back to the weakness's Mr. Torrey suggested. I think the real potential is attack the implementation of the thin VM it self. This at least currently cannot be protected it self and thus reproducing the VM exactly, but leaking the key seems to me to be the most promising way to go about playing attacker. He also suggests hardware attacks (reading physical memory directly) but shipping hardware with an antivirus software is uneconomical to say the least. This leaves me to his third suggested attack vector: Side channel attacks which will be my topic for the rest of this post.
The side channel attacks I'll talk about is tracing the code and seing what happens and using performance counters to gain additional information about the code you cannot read. You could probably also use cache timing for an attack, but I'll not get into that because I don't know enough about modern caches yet and of the three it seems to me to be the least productive. Forgive me again - since my 15 years de facto infosec pause, I've been reading up only 3 months.

The problem of tracing around observing registers and memory to find out what the code looks likes seems daunting at first. With N byte code segment we have 256 to the power of N possible programs. Even if we say we just need to get close, that looks like a terrible large number. Also with the sheer amount of instructions on x86 platforms it seem near impossible to get them all correctly deduced from just looking at their behavior. Think clflush instruction. It get's worse. Imagine the code segment being full of data that essentially are instructions that crashes some hardware in your computer if run in ring0 and causes exceptions if not. Since you cannot really execute the instruction, you cannot use the trace method and the performance counters are unlikely to be of much use to you. Unfortunately such a scenario can easily be imagined - imagine a long string of outb instructions, that'll crash some attached hardware if executed. Game over?
Not quite. First of the problems mentioned above are all real, but they probably won't shut us down totally.  We can probably squeeze most of the data out of the unreadable code segment. Also we should start up be noticing that not all opcodes are equally likely. Some are used more than others. Zombie did a survey a couple of years back and just one opcode made up for 15% of all opcodes in his sample. Only 54 opcodes made up 95% of the opcodes used in his sample, the majority of these 54 opcodes indeed seems to be analyzable. Looking just at frequencies with which bytes occur in calc.exe from windows XP I found almost 10% to be 0's. Zero is "add al, reg8, r8/m8". About half of those zeroes where followed by another zero making up the instruction "add al, [eax]". You can find my tables as an excel file. This hints that the problem of squeezing enough information out of the code segment might not be quite impossible.

The amount of instructions on the x86 platforms is ironically going to help us. Since the instructions are so crammed many opcodes can contains other smaller opcodes. For example mov eax, 90h would 0xb8, 0x90. The 0x90 being a "nop". We can use this to leverage our search.
Let me define some terminology before I start. With "mining" I mean finding out the opcode and/or parameters of said without any prior knowledge of it. With "Leveraging" I mean using known bytes in the code segment to figure out more bytes by using opcodes in embedded in previously mined opcodes.
Obviously since we know nothing we need to mine as a first step. My first step mining would go for the easiest stuff first. So I'm essentially going to setup up a context where I can easily identify if somebody is using my registry values e.g. set EAX to 0xa1a2a3a4, ebx to 0xb1b2b3b4 etc. Including the memory on the stack and then set a thread to execute exactly one instruction for each bytes in the loaded target executable. I do not care where real opcodes meant to be executed start. I'm looking for anything that I can identify. Once we trace we'll either see an except (other than the expected int 1) or we'll get a new context after execution of the instruction. I'll return to execptions later. When talking about a register in the input context I'll use that register with a .0 before it similarly for the output context a .1 will be appended. Thus eip.1 is the output eip. I can for most instructions calculate a length of the instruction by simply doing eip.1-eip.0. This works for all but branching instructions. I'll refer to this imperfect measure as the opcode size (OpcSize) nevertheless. Below is a table of instructions that is particularly easy to mine (and how to mine them)
Identification criteria
0x40-0x4f Inc/dec reg
OpcSize ==1 && reg.0==reg.1 +-1

0x50-0x57 Push reg
OpcSize==1 && esp.0==esp.1+4&& @esp.0==reg

0x58-0x5f pop reg
OpcSize==1 && esp.0==esp.1-4 && reg.1 == @esp.0

0x68 /0x6aPush imm8/32
OpcSize==2/5 && esp.0==esp.1+4
This gives us 1/4 parameters bytes which we can read from the @esp
0xe8 call rel imm32
esp.0==esp.1+4, @esp.1==eip.0+5
We can only partially mine this easily as sometimes it'll cause an exception and not give us a new context
0xC2 retn imm16
Esp.0==esp.1-imm8 && abs(eip.0-eip.1)>16
We can only positively identify this instruction if the branch made was so"long" it couldn't be an instruction and the imm8 value is bigger than 4 ( to differentiate from 0xC3 retN).
Xchng reg, eax, scasb/w/d/, stosb/w/d, pushfd,popfd,push es/cs/ds/ss, int, ...

Mining criteria left as an excersise... sorry..

Say we figured perfectly mined for 30 or so opcodes and identified 1000 bytes. The complexty we are now trying to resolve is only 236 to the power of N-1000. Still a ridiculous number, but notice that our progress is exponential. Finding just one byte reduces the problem a factor of 256!!!
The above mining was the simple mining. Things we could establish by tracing just once per instruction. Tracing more than once can provide even more information. Say cli/sti instructions. Tracing this instruction twice with different preset I-flags will let you see a toggle of this flag and  and OpcSize of 1. Thus we can identify these. We can also mine conditional short jumps with more than one trace. Trace all instructions with all branching-flags off and once on. If once the OpcSize is 2 and once not two we have a conditional short jump. There goes a significant number of bytes again and since we can mine perfectly we can subtract another 14 for the number of possible choices for each opcode.

Once we've mined these we can start leveraging. For this particularly the 1/4 bytes we got from the call rel imm8 or the push imm8/32 or the conditional short jumps are interesting. Imagine that the last byte of the address in the imm was 0x2. We'd then know that the opcode is add al, imm8. Zero out al.0 execute and you'll have the next byte in al.1 Imagine this being 0x3 (opcode add eax, imm32) we can then do the trick again with eax. However not all opcodes can be leveraged because they don't leak more information than the opcode themselves by mining. So we cannot expect a cascade effect, but we will make progress. We can easily leverage add/adc/sbb/xor/sub al/eax, imm8/32 as well as all short jumps opcodes and how about mov reg, imm8/16? We could also look at two bytes leveraging say 0x8d 0x98 would be lea ebx, [eax+imm32] There are more easy leverage to be had.

Harnesing Exceptions
Up to this point I've treated exceptions as something which prevents you for getting information while mining. They can be rather interesting though. Not only does the type of exception tell you something about the type of opcode, it also give gives you information on whether it was a read or write that caused. Imagine leveraging this 0x8b (mov reg32, m32). Now we wish to know the parameters of it. Simply set as much of the memory in the process you're working with to guardpages. Chances are tracing this will now throw an exception. Tracing a couple of times with different register settings if the read address in the exception doesn't change we have a simple IMM32 with indirection . We can also find out what registers change the address and use this information to figure out more about the RM and MOD bits of the ModRM. Which again will allow us to identify the SIB byte (if present). Once this is done you could get lucky and be able to make this instruction not throw an exception and that'll give you the REG bits of ModRM leaving you with the complete instruction. Even if you don't know the instruction you can use this technique to mine for anything that looks like a ModRM byte and figure it and the additional parameters out.

Last words
I've put enough on the table in this direction now. There is plenty more to be harness in terms of trace mining and leveraging. A hint though would be to use frequency maps from old non protected executables to direct your mining operatings towards where they are most likely to give you the highest success rate for your effort. Mining for 0x00, 0x00 in calc.exe in windows XP sounds like a great idea also for other executables. You can mine for imperfect information that won't allow you to with certainty find a given byte, but rule out almost everything making an educated guess easy later. You should also look into info leaks from performance counters. Or hooking system api to get return addresses will allow you to better mine for api calls. Moving yourself into ring 0 might also bring you something. Finally we can make use of bindiff'ing methods if we've managed to get enough information from the above measures we can compare out incomplete functions with those of standard libraries. You can find more about this on the Diaphora webpage included a link to a good article by Thomas Dullien and Rolf Rolles.

Diaphoraweb page:http://joxeankoret.com/blog/2015/03/13/diaphora-a-program-diffing-plugin-for-ida-pro/
Torrey, Jacob. White paper: http://jacobtorrey.com/HARES-WP.pdf
Torrey, Jacob. SyScan slites: http://jacobtorrey.com/HARESSyscanTalk.pdf

Zombie's opcode statistics: http://z0mbie.daemonlab.org/opcodes.html