Unrelated background information:
I wished I'd gotten a lot further a lot sooner with this. Also this blog post is only barely up to the standard I set for myself. Compressing so much statistics and machine learning into a blog post of reasonable length while avoiding excessive math is a terribly difficult thing to do. I hope that it lifts the vail a bit around what machine learning is and how it can be leveraged in a malware setting, but if you really wish to get down with it there can be no two opinions about you need a real text book with all the rigor that it requires.
Finally it took a long time because most of my family conspired to celebrate round birthdays. Also climbing season started which keeps me away from the computer. Finally I have this feeling that the "Predator" from Machine Learning Part 1. blog produced insufficient amount of features. So spend a significant amount of time embedding my code emulator (from unpacking with primitive emulation) into Predator at first thinking it would be easy. Emulating on 80000 files malformed files, where a significant amount is missing DLL's etc. turned out to be way more time consuming than I imagined.
The blog post is here:
Machine learning with malware part 2: Model Selection