This post was originally part of the “Machine Learning with malware Part 1:
Data” post, but to avoid confusion I’ve made it a separate post. The reason is
that it’s only marginally about machine learning and I do not wish to muddle
the exposition. Unfortunately you’d probably need to read at least the last
section of the previous blog post for this one to make sense. Also, if you wish
to play along with me, you definitely need the data from that post.
In the late eighties viruses began to pop up in the wild. My first contact with
a computer virus (which I had until then known only from magazines) was when I
got my hands on a version of what I believe was IBM anti virus. It had a
signature database, which was a plain text file with, I think, some twenty
entries. The entries looked like hex strings, like those you’d see in the hex
editor of choice from that time, Norton Disk Edit. By poking around (not really
knowing what I was doing) I managed to make IBM anti virus identify Norton Disk
Edit as a virus by modifying a signature to match a string I’d extracted with
NDE from NDE itself. In those days viruses were really rare; there were only a
few, and they were all carefully examined by malware researchers. To detect
them, plain strings were picked from the virus, and if a file (or boot sector)
contained the string it probably was a virus. Mutating code was not yet an
issue. With an SQL data
base full of malware we can easily search for likely malware strings. For this
reason the bytes at the entry point of each file were one of the features in my
previous blog post on data for malware machine learning. Since I don’t expect
each file to be a unique piece of malware, but rather multiple manifestations
of a smaller number of distinct malware programs, I can now just take a look at
which entry point strings are shared by many files. Using an SQLite browser
(download from SQLite.org) I can enter this command in the “Execute SQL”
section after loading the database:
select EPAsText, count(EPAsText) as StrCount from MALWARE group by
EPAsText order by StrCount
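For readers who prefer a script to a GUI, the same grouping can be run from
Python’s built-in sqlite3 module. This is only a sketch: the in-memory table
and its three toy rows are made up for illustration, and only the table name
MALWARE and the column name EPAsText are taken from the query above. The real
database from the previous post has more columns.

```python
import sqlite3

# Toy stand-in for the real database: only the table and column names come
# from the query in the post, the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MALWARE (EPAsText TEXT)")
conn.executemany(
    "INSERT INTO MALWARE (EPAsText) VALUES (?)",
    [("eb016860e800000000",), ("eb016860e800000000",), ("558bec83ec10",)],
)

# Same query as in the post: one row per distinct entry point string,
# least common first, most common last.
QUERY = """
    SELECT EPAsText, COUNT(EPAsText) AS StrCount
    FROM MALWARE
    GROUP BY EPAsText
    ORDER BY StrCount
"""
rows = conn.execute(QUERY).fetchall()
for ep_text, str_count in rows:
    print(str_count, ep_text)
```

To try it on the real data, replace ":memory:" with the path to the database
file and drop the CREATE TABLE and INSERT statements.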
The most common entry point strings could be compiler generated ones. I
couldn’t tell without either comparing with non-malware binaries or looking at
an actual disassembly. So for now I’ll just talk about those that look
suspicious. The first suspicious entry is FF 25 00 20 40 00 00 00 00… This I
can disassemble in my head. It would be Jmp [402000], which is a strange thing
to do at an entry point. However, you quickly discover that these are simply
.Net files.
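The “in my head” disassembly is just little-endian arithmetic. As a minimal
sketch, assuming the four operand bytes after FF 25 are 00 20 40 00 (which is
what a target of 0x402000 requires), the hypothetical helper below decodes the
indirect jump:

```python
import struct


def decode_jmp_indirect(code: bytes) -> int:
    """Return the address operand of a 32-bit x86 `jmp dword ptr [abs32]`.

    Hypothetical helper for illustration; it only understands the
    two-byte opcode FF 25 followed by a 32-bit little-endian address.
    """
    if len(code) < 6 or code[:2] != b"\xff\x25":
        raise ValueError("not an FF 25 indirect jump")
    # "<I" = little-endian unsigned 32-bit integer, read at offset 2
    (address,) = struct.unpack_from("<I", code, 2)
    return address


stub = bytes.fromhex("ff2500204000")
target = decode_jmp_indirect(stub)
print(f"jmp [{target:#x}]")
```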
The next immediately suspicious one is this: EB 01 68 60 E8 00 00 00 00… Off
the top of my head I recognize Jmp $+3 and Call $+5. Jmp $+3 is just outright
weird, unless its purpose is obfuscation, which makes perfect sense in badly
behaved code: the jump skips the 68 byte, which a linear disassembler would
otherwise read as the start of a push instruction. Call $+5 is not weird at
all. It’s a classic way for a virus or other piece of code to push EIP onto the
stack to find out where the loader actually put it in memory. It’s quite common
in viruses, I used it myself in my PE-executable packers, and I’ve never seen
it in a well behaved program. In short, I just found a signature string of
either a virus or perhaps a packer. To minimize the chances of it being a
packer we can do this:
select *
from malware where EPASText like 'eb016860e8000%'
And whoopty, they all have an entropy between 6 and 8, which is normal for
unpacked files. Obviously there are PE-Encrypters that use silly encryption
that keeps the entropy in that range. My first PE Crypter is among them; see my
Unpacking with emulator blog for more on that. So there is good evidence that
we just found a string signature for a virus. Looking at the number of imports,
I’d say we just found a signature that matches two variants of the same virus.
If you want to play more with strings, try looking for the call $+5:
select
count(EPAsText) as MyCount, EPAstext from malware where EPASText like
'%e800000000%' group by EPAsText order by MyCount
My guess is that this simple query finds signatures for about 5% of the
samples, but I don’t care to check it.
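The 5% guess can be checked with a few lines of Python. One caveat of matching
on hex text is that LIKE '%e800000000%' can also hit at an odd character
offset, that is, in the middle of a byte: the bytes 0E 80 00 00 00 00 contain
that text without containing a call $+5. The sketch below, with the
hypothetical helpers has_call_plus5 and share_with_call_plus5 and invented
sample strings, converts the text back to bytes so that only byte-aligned
matches count. It assumes EPAsText holds plain hex digits with no separators,
as the queries in this post suggest.

```python
CALL_PLUS_5 = bytes.fromhex("e800000000")  # call $+5: E8 with a zero rel32


def has_call_plus5(ep_text: str) -> bool:
    """True if the entry point bytes contain a byte-aligned call $+5."""
    return CALL_PLUS_5 in bytes.fromhex(ep_text)


def share_with_call_plus5(ep_texts: list[str]) -> float:
    """Fraction of entry point strings that contain call $+5."""
    if not ep_texts:
        return 0.0
    hits = sum(has_call_plus5(text) for text in ep_texts)
    return hits / len(ep_texts)


# Invented examples, not rows from the real database.
samples = [
    "eb016860e800000000",  # jmp $+3 ... call $+5: a real hit
    "0e8000000000c3",      # text contains 'e800000000' only across bytes
    "558bec83ec10",        # ordinary function prologue
    "ff2500204000",        # .Net style indirect jump
]
print(share_with_call_plus5(samples))
```

Even a byte-aligned hit may sit inside the operand of another instruction, so
this remains a heuristic, just a slightly stricter one than the text match.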
We could easily automate what I did above, but there are better ways.
Obviously, if we had a database of non-malware executables we could use simple
Bayes machine learning on these entry point strings (see my Machine Learning
and Bayes blog; if you look in the source code for that blog you’ll find that I
actually had planned to do exactly this without a database and not just for
entry point strings, but dropped the idea because of the time I’d been
spending). I have no idea how big a role strings play in the modern anti-virus
business. My immediate guess would be that it’s the method of choice for
everything simple, which remains the bulk of malware in executables. The last
30 years have made the method more sophisticated, but the core probably remains
exactly what I outlined here.
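To make the Bayes idea a little more concrete, here is a rough sketch of a
single-feature Bayes estimate with Laplace smoothing. It is not the code from
my Machine Learning and Bayes blog, and it assumes something this post does not
have: a list of entry point strings from non-malware executables. The function
name malware_probability, the alpha parameter and both toy lists are invented
for illustration.

```python
from collections import Counter


def malware_probability(ep_text, malware_eps, benign_eps, alpha=1.0):
    """Estimate P(malware | entry point string) with Bayes' rule.

    Counts are Laplace-smoothed with `alpha` so that a string never seen
    in one of the two sets does not force the probability to 0 or 1.
    """
    mal_counts = Counter(malware_eps)
    ben_counts = Counter(benign_eps)
    # Number of distinct strings, plus one slot for "never seen before"
    vocabulary = len(set(malware_eps) | set(benign_eps)) + 1

    # Likelihoods P(string | class)
    p_str_mal = (mal_counts[ep_text] + alpha) / (len(malware_eps) + alpha * vocabulary)
    p_str_ben = (ben_counts[ep_text] + alpha) / (len(benign_eps) + alpha * vocabulary)

    # Priors P(class), taken from the relative sizes of the two sets
    total = len(malware_eps) + len(benign_eps)
    p_mal = len(malware_eps) / total
    p_ben = len(benign_eps) / total

    evidence = p_str_mal * p_mal + p_str_ben * p_ben
    return p_str_mal * p_mal / evidence


# Invented toy data: a hypothetical malware set and a hypothetical benign set.
malware_eps = ["eb016860e800000000"] * 3 + ["558bec83ec10"]
benign_eps = ["558bec83ec10"] * 3 + ["ff2500204000"]

suspicious = malware_probability("eb016860e800000000", malware_eps, benign_eps)
prologue = malware_probability("558bec83ec10", malware_eps, benign_eps)
print(round(suspicious, 3), round(prologue, 3))
```

With these toy lists the jmp $+3 / call $+5 string scores well above one half
and the ordinary prologue well below it, which is the behaviour one would hope
for. The priors here simply mirror the sizes of the two lists, which would
rarely match the real world.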
Literature:
Szor, Peter: The Art of Computer Virus Research and Defense