Update 19-6-2015
I didn't explicitly write it in the original blog. But I should've. The methods below would be useful for detecting Shadow walker type root kits too.
Original post starts here:
I finally got
a new hard disk and I'm now spending most of my malware hobby time trying to extract
features for machine learning from 70000 PE executables and put them into a
database. This unfortunately takes a while. But stay with me I'll get back to
ML soon enough.
Download frequency tables here
Introduction
While I'm
working on other stuff here is a short note on something I found extraordinary
interesting: The H.A.R.E.S. white paper. by Jacob Torrey. The white paper is
pretty short, and the slides from his presentation on SyScan (which I did not
attend) doesn't bring much more light to the thing. So forgive me if I'm in accurate. Also I'm writing
this on the side while running 70.000 malware executables through my feature
grabbing tool and monitoring that everything is ok. So no code this time, just
words.
What is H.A.R.E.S and why
do I like it and fear it at once
To
summarize really shortly:
Mr. Torrey
wants to build a DRM style utility in which the protection is that the code
segment can be executed but not read or written. To implement this he uses that
the memory manager (MMU) in the modern x86 computers keeps separate caches for page
tables (TLB's) for data and execution. He then manipulates the execution cache
to point to real data, and the data cache to point in "nirvana". This
trick is that makes PaX on linux work or
the Shadow walker root kit. To avoid the attacker from reading the code in the
original file he proposes to encrypt with AES-128 for which the modern intel processors
has instructions for. To protect the key and the page tables he installs
everything in a "thin" virtual machine so that everything the host os
(That includes the attacker) does is directly with hardware except for
decryption and protection of the TLB's. Read his white paper for details - I find it quite good.
I find this
interesting because in many ways it's a step towards a new computer
architecture. x86 was originally build on the von Neumann architecture. For
this discussion the important part of the von Neumann architecture is that
there is no distinction between code and data.
A competing architecture the Harvard architecture distinguish between
them. A great many of the bad behavior that is possible in x86 architecture is possible
in the von Neumann architecture but not in the Harvard (stack execution,
trampoline hooks, self modifying code, many code injection techniques ...). Mr.
Torrey's idea effectively imposes this Harvard feature on a von Neumann
platform and that is a very interesting thing - at least from a theoretical
perspective. The Gruq however noted that though at first it sounds like a
blessing for InfoSec, it's really is a curse in disguise. Imagine no way to
audit code for maliciousness, no signature, no code heuristics only behavior
analysis to identify if something is malicious or not. Mr. Torrey suggest some
weaknesses in his methods that might be useable. I'll get back to that in a
moment.
I doubt
that Mr. Torrey's thing will find much real world application, but it might be
an important stepping stone towards an architectural change that is much
needed. That alone makes it worth peaking into and not to mention it seemed
like a good semi-technical topic for a fun blog post. I have no time to do code
on this and since I doubt this will have real world implication I find it ok to
stick with speculation.
A detour
The first
problem that Mr. Torrey has is that code in real executables are often
interspersed with data. Here a string, there a jump table etc. If you've
disassembled enough you've seen it. Since not even the code it self can read
this data we have a problem. I'd suggest that a solution to this problem is
identify as much code as you can in the code section using standard methods
(Performance counters based coverage analysis and static coverage analysis). You
make a copy of the entire code section, destroys (overwrite with garbage) what
you could positively identify as code and uses the data TLB to project it
correctly for data instructions. You sacrifice protection for easy of
implementation and that always seems to me to be a worthwhile sacrifice.
Attacking H.A.R.E.S
Now back to
the weakness's Mr. Torrey suggested. I think the real potential is attack the
implementation of the thin VM it self. This at least currently cannot be
protected it self and thus reproducing the VM exactly, but leaking the key
seems to me to be the most promising way to go about playing attacker. He also
suggests hardware attacks (reading physical memory directly) but shipping
hardware with an antivirus software is uneconomical to say the least. This
leaves me to his third suggested attack vector: Side channel attacks which will
be my topic for the rest of this post.
The side
channel attacks I'll talk about is tracing the code and seing what happens and
using performance counters to gain additional information about the code you
cannot read. You could probably also use cache timing for an attack, but I'll
not get into that because I don't know enough about modern caches yet and of
the three it seems to me to be the least productive. Forgive me again - since
my 15 years de facto infosec pause, I've been reading up only 3 months.
The problem of tracing around observing registers and memory to find out what
the code looks likes seems daunting at first. With N byte code segment we have 256
to the power of N possible programs. Even if we say we just need to get close,
that looks like a terrible large number. Also with the sheer amount of
instructions on x86 platforms it seem near impossible to get them all correctly
deduced from just looking at their behavior. Think clflush instruction. It
get's worse. Imagine the code segment being full of data that essentially are
instructions that crashes some hardware in your computer if run in ring0 and
causes exceptions if not. Since you cannot really execute the instruction, you
cannot use the trace method and the performance counters are unlikely to be of
much use to you. Unfortunately such a scenario can easily be imagined - imagine
a long string of outb instructions, that'll crash some attached hardware if
executed. Game over?
Not quite.
First of the problems mentioned above are all real, but they probably won't
shut us down totally. We can probably
squeeze most of the data out of the unreadable code segment. Also we should
start up be noticing that not all opcodes are equally likely. Some are used
more than others. Zombie did a survey a couple of years back and just one
opcode made up for 15% of all opcodes in his sample. Only 54 opcodes made up
95% of the opcodes used in his sample, the majority of these 54 opcodes indeed
seems to be analyzable. Looking just at frequencies with which bytes occur in
calc.exe from windows XP I found almost 10% to be 0's. Zero is "add al,
reg8, r8/m8". About half of those zeroes where followed by another zero
making up the instruction "add al, [eax]". You can find my tables as
an excel file. This hints that the problem of squeezing enough information out
of the code segment might not be quite impossible.
Mining
The amount
of instructions on the x86 platforms is ironically going to help us. Since the instructions
are so crammed many opcodes can contains other smaller opcodes. For example mov
eax, 90h would 0xb8, 0x90. The 0x90 being a "nop". We can use this to
leverage our search.
Let me define
some terminology before I start. With "mining" I mean finding out the
opcode and/or parameters of said without any prior knowledge of it. With
"Leveraging" I mean using known bytes in the code segment to figure
out more bytes by using opcodes in embedded in previously mined opcodes.
Obviously
since we know nothing we need to mine as a first step. My first step mining
would go for the easiest stuff first. So I'm essentially going to setup up a
context where I can easily identify if somebody is using my registry values
e.g. set EAX to 0xa1a2a3a4, ebx to 0xb1b2b3b4 etc. Including the memory on the
stack and then set a thread to execute exactly one instruction for each bytes
in the loaded target executable. I do not care where real opcodes meant to be
executed start. I'm looking for anything that I can identify. Once we trace
we'll either see an except (other than the expected int 1) or we'll get a new
context after execution of the instruction. I'll return to execptions later. When
talking about a register in the input context I'll use that register with a .0
before it similarly for the output context a .1 will be appended. Thus eip.1 is
the output eip. I can for most instructions calculate a length of the
instruction by simply doing eip.1-eip.0. This works for all but branching
instructions. I'll refer to this imperfect measure as the opcode size (OpcSize)
nevertheless. Below is a table of instructions that is particularly easy to
mine (and how to mine them)
Opcode
|
Identification
criteria
|
Comment
|
0x40-0x4f Inc/dec
reg
|
OpcSize ==1 &&
reg.0==reg.1 +-1
|
|
0x50-0x57 Push reg
|
OpcSize==1
&& esp.0==esp.1+4&& @esp.0==reg
|
|
0x58-0x5f pop reg
|
OpcSize==1
&& esp.0==esp.1-4 && reg.1 == @esp.0
|
|
0x68 /0x6aPush imm8/32
|
OpcSize==2/5
&& esp.0==esp.1+4
|
This gives us 1/4 parameters
bytes which we can read from the @esp
|
0xe8 call rel imm32
|
esp.0==esp.1+4, @esp.1==eip.0+5
|
We can only
partially mine this easily as sometimes it'll cause an exception and not give
us a new context
|
0xC2 retn imm16
|
Esp.0==esp.1-imm8
&& abs(eip.0-eip.1)>16
|
We can only
positively identify this instruction if the branch made was so"long"
it couldn't be an instruction and the imm8 value is bigger than 4 ( to
differentiate from 0xC3 retN).
|
Xchng reg, eax,
scasb/w/d/, stosb/w/d, pushfd,popfd,push es/cs/ds/ss, int, ...
|
|
Mining criteria left
as an excersise... sorry..
|
Say we
figured perfectly mined for 30 or so opcodes and identified 1000 bytes. The
complexty we are now trying to resolve is only 236 to the power of N-1000. Still
a ridiculous number, but notice that our progress is exponential. Finding just one
byte reduces the problem a factor of 256!!!
The above
mining was the simple mining. Things we could establish by tracing just once
per instruction. Tracing more than once can provide even more information. Say
cli/sti instructions. Tracing this instruction twice with different preset
I-flags will let you see a toggle of this flag and and OpcSize of 1. Thus we can identify these.
We can also mine conditional short jumps with more than one trace. Trace all
instructions with all branching-flags off and once on. If once the OpcSize is 2
and once not two we have a conditional short jump. There goes a significant
number of bytes again and since we can mine perfectly we can subtract another
14 for the number of possible choices for each opcode.
Leveraging
Once we've
mined these we can start leveraging. For this particularly the 1/4 bytes we got
from the call rel imm8 or the push imm8/32 or the conditional short jumps are
interesting. Imagine that the last byte of the address in the imm was 0x2. We'd
then know that the opcode is add al, imm8. Zero out al.0 execute and you'll
have the next byte in al.1 Imagine this being 0x3 (opcode add eax, imm32) we
can then do the trick again with eax. However not all opcodes can be leveraged
because they don't leak more information than the opcode themselves by mining.
So we cannot expect a cascade effect, but we will make progress. We can easily leverage
add/adc/sbb/xor/sub al/eax, imm8/32 as well as all short jumps opcodes and how
about mov reg, imm8/16? We could also look at two bytes leveraging say 0x8d 0x98
would be lea ebx, [eax+imm32] There are more easy leverage to be had.
Harnesing Exceptions
Up to this
point I've treated exceptions as something which prevents you for getting information
while mining. They can be rather interesting though. Not only does the type of
exception tell you something about the type of opcode, it also give gives you
information on whether it was a read or write that caused. Imagine leveraging
this 0x8b (mov reg32, m32). Now we wish to know the parameters of it. Simply
set as much of the memory in the process you're working with to guardpages.
Chances are tracing this will now throw an exception. Tracing a couple of times
with different register settings if the read address in the exception doesn't
change we have a simple IMM32 with indirection . We can also find out what
registers change the address and use this information to figure out more about
the RM and MOD bits of the ModRM. Which again will allow us to identify the SIB
byte (if present). Once this is done you could get lucky and be able to make
this instruction not throw an exception and that'll give you the REG bits of
ModRM leaving you with the complete instruction. Even if you don't know the
instruction you can use this technique to mine for anything that looks like a ModRM
byte and figure it and the additional parameters out.
Last words
I've put enough on the table in this direction now. There is plenty more to be
harness in terms of trace mining and leveraging. A hint though would be to use
frequency maps from old non protected executables to direct your mining
operatings towards where they are most likely to give you the highest success
rate for your effort. Mining for 0x00, 0x00 in calc.exe in windows XP sounds
like a great idea also for other executables. You can mine for imperfect
information that won't allow you to with certainty find a given byte, but rule
out almost everything making an educated guess easy later. You should also look
into info leaks from performance counters. Or hooking system api to get return
addresses will allow you to better mine for api calls. Moving yourself into
ring 0 might also bring you something. Finally we can make use of bindiff'ing
methods if we've managed to get enough information from the above measures we
can compare out incomplete functions with those of standard libraries. You can
find more about this on the Diaphora webpage included a link to a good article
by Thomas Dullien and Rolf Rolles.
Literature
Diaphoraweb
page:http://joxeankoret.com/blog/2015/03/13/diaphora-a-program-diffing-plugin-for-ida-pro/
Torrey,
Jacob. White paper: http://jacobtorrey.com/HARES-WP.pdf
Torrey,
Jacob. SyScan slites: http://jacobtorrey.com/HARESSyscanTalk.pdf
Zombie's
opcode statistics: http://z0mbie.daemonlab.org/opcodes.html