This post was originally part of the “Machine Learning with malware Part 1: Data” post, but to avoid confusion I’ve made it a separate post. The reason is that it’s only marginally about machine learning and I do not wish to muddle the exposition. Unfortunately you’ll probably need to read at least the last section of the previous blog post for this one to make sense. Also, if you wish to play along with me, you definitely need the data from that post.
In the late eighties viruses began to pop up in the wild. My first contact with a computer virus (which I had until then known only from magazines) was when I got my hands on a version of what I believe was IBM anti virus. It had a signature database which was a plain text file with, I think, some twenty entries. The entries looked like hex strings, like those you’d see in the hex editor of choice from that time, Norton Disk Edit. By poking around (not really knowing what I was doing) I managed to make IBM anti virus identify Norton Disk Edit as a virus by modifying a string to match what I’d extracted with NDE about itself.
In those days viruses were really rare. There were only a few, and they were all carefully examined by malware researchers. To detect them, plain strings were picked from the virus, and if a file (or boot sector) contained the string it probably was a virus. Mutating code was not yet an issue.
With an SQL database full of malware we can easily search for likely malware strings. For this reason one of the features in my previous blog on data for malware machine learning was the bytes at the entry point of each file. Since I don’t expect each file to be a unique piece of malware, but rather multiple manifestations of a smaller number of malware families, I can now just take a look at how many entry point strings are shared between files. Using an SQLite browser (download from SQLite.org) I can enter this command in the “Execute SQL” section after loading the database:
select EPAsText, count(EPAsText) as StrCount from MALWARE group by EPAsText order by StrCount
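If you prefer a script to the SQLite browser, the same grouping query can be run from Python's built-in sqlite3 module. This is only a minimal sketch: the table and column names (MALWARE, EPAsText) are taken from the query above, but the FileName column and the four rows are made up so the example runs on its own, and I sort descending so the most common strings come first.

```python
import sqlite3

# Toy stand-in for the real database. MALWARE and EPAsText come from the
# query in this post; the FileName column and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MALWARE (FileName TEXT, EPAsText TEXT)")
rows = [
    ("a.exe", "eb016860e800000000"),
    ("b.exe", "eb016860e800000000"),
    ("c.exe", "ff250002040000000000"),
    ("d.exe", "558bec6aff68"),
]
conn.executemany("INSERT INTO MALWARE VALUES (?, ?)", rows)

# Same grouping as the query above, but with the most frequent string first.
query = (
    "SELECT EPAsText, COUNT(EPAsText) AS StrCount "
    "FROM MALWARE GROUP BY EPAsText ORDER BY StrCount DESC"
)
counts = conn.execute(query).fetchall()
for ep_text, str_count in counts:
    print(str_count, ep_text)
```

Against the real data you would replace the in-memory connection with the path to the database file from the previous post.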
The most common entry point strings could be compiler generated ones. I couldn’t tell without either comparing with non-malware binaries or looking at an actual disassembly. So for now I’ll just talk about those that look suspicious. The first suspicious entry is FF 25 00 02 04 00 00 00 00… This I can disassemble in my head. It would be an indirect Jmp through a pointer in memory, which is a strange thing to do at an entry point. However, you quickly discover that these are simply .Net files. The next immediately suspicious one is this: EB 01 68 60 e8 00 00 00 00 …. Off the top of my head I recognize Jmp $+3 and call $+5. Jmp $+3 is just outright weird, unless its purpose is obfuscation, which makes perfect sense in badly behaved code. Call $+5 is not weird at all. It’s a classic way for a virus or other piece of code to push EIP to the stack and find out where the loader actually put it in memory. It’s quite common in viruses, I used it myself in my PE-executable packers, and I’ve never seen it in a well behaved program. In short, I just found a signature string of either a virus or perhaps a packer. To minimize the chances of it being a packer we can do this:
select * from malware where EPASText like 'eb016860e8000%'
And whoopty, they all have an entropy between 6 and 8, which is normal for unpacked files. Obviously there are PE-Encrypters that use silly encryption that keeps the entropy in that range. My first PE Crypter is among them; see my Unpacking with emulator blog for more on that. So there is good evidence that we just found a string signature for a virus. Looking at the number of imports, I’d say we just found a signature that matches two variants of the same virus. If you want to play more with strings, try looking for the call $+5:
select count(EPAsText) as MyCount, EPAstext from malware where EPASText like
'%e800000000%' group by EPAsText order by MyCount
My guess is that simple query finds signatures for about 5% of the sample, but I don’t care to check it.
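For those who don’t disassemble in their head, the Jmp $+3 and call $+5 reading of the entry point bytes can be checked with a few lines of Python. This is a minimal sketch that only understands the two 32-bit x86 encodings involved here (EB rel8 for a short jump, E8 rel32 for a near call); the function name decode_rel_branch is my own invention and this is in no way a real disassembler.

```python
def decode_rel_branch(code: bytes, offset: int):
    """Decode a short JMP (EB rel8) or near CALL (E8 rel32) at offset.

    Returns (mnemonic, length, target_offset), or None for any other opcode.
    Targets are offsets into `code`, relative to its first byte.
    """
    opcode = code[offset]
    if opcode == 0xEB and offset + 2 <= len(code):
        rel = int.from_bytes(code[offset + 1:offset + 2], "little", signed=True)
        # The displacement is relative to the end of the instruction.
        return ("jmp", 2, offset + 2 + rel)
    if opcode == 0xE8 and offset + 5 <= len(code):
        rel = int.from_bytes(code[offset + 1:offset + 5], "little", signed=True)
        return ("call", 5, offset + 5 + rel)
    return None


entry = bytes.fromhex("eb016860e800000000")

# Jmp at offset 0 lands on offset 3, i.e. $+3: it hops over the 0x68 byte.
jmp = decode_rel_branch(entry, 0)

# Call at offset 4 has a zero displacement, so its target is offset 9, i.e.
# $+5: the very next instruction. The only effect is that the return address
# (the current EIP) ends up on the stack.
call = decode_rel_branch(entry, 4)

print(jmp, call)
```

The byte the jump skips, 0x68, would start a push instruction if it were executed, which is presumably there to throw off a disassembler that decodes the bytes linearly.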
We could easily automate what I did above, but there are better ways. Obviously, if we had a database of non-malware executables we could use simple Bayes machine learning on these entry point strings (see my Machine Learning and Bayes blog; if you look in the source code for that blog you’ll find that I had actually planned to do exactly this, without a database and not just with entry point strings, but dropped the idea because of the time I’d been spending). I have no idea how big a role strings play in the modern anti-virus business. My immediate guess would be that it’s the method of choice for everything simple, which remains the bulk of malware in executables. The last 30 years have made the method more sophisticated, but the core probably remains exactly what I outlined here.
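To make the core method concrete, here is a minimal sketch of the plain string matching described at the start of this post: a tiny signature table of hex strings and a scan that reports which ones occur in a blob of bytes. The signature name is a label I made up, the hex string is the entry point string found above, and real scanners are of course far more elaborate than this.

```python
# Hypothetical signature table: name -> hex string. The name is invented for
# illustration; the hex string is the entry point string found in this post.
SIGNATURES = {
    "ep-jmp3-call5": "eb016860e800000000",
}


def scan(data: bytes, signatures=SIGNATURES):
    """Return the names of all signatures whose bytes occur anywhere in data."""
    return [
        name
        for name, hex_sig in signatures.items()
        if bytes.fromhex(hex_sig) in data
    ]


# A made-up "file": two NOPs, the signature bytes, then a RET.
sample = b"\x90\x90" + bytes.fromhex("eb016860e800000000") + b"\xc3"
hits = scan(sample)
clean = scan(b"\x00" * 16)
print(hits, clean)
```

To scan a real file you would pass in the result of open(path, "rb").read(). Note that this matches the string anywhere in the file, like the old text-file signatures did, whereas the SQL queries above only look at the entry point.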
Szor, Peter: The Art of Computer Virus Research and Defense