Wednesday, October 12, 2011

I have a Hidden Markov Model... Now What?

I have been working on creating hidden Markov models (HMMs) for computer viruses.  Now that I have them, I'm running into an interesting complication.  Namely, what can I do with them?

With an HMM, you can get the statistical probability for any particular series of observations.  For a very simple case, consider a loaded die.  50% of the time, it will roll a 6.  Otherwise, it will roll a number between 1 and 5 (10% chance each).  Once you have your model built up, you can determine the probability of a series of rolls.

So, continuing the example, pretend that you observe 10 sixes being rolled in succession.  What are the odds that this sequence would have been rolled with the loaded die?  (1/2) ^ 10, or 1 in 1024.  Given these observations, is it likely that you are using the loaded die?

Well...  it depends.  What are the other models?  The probability for the same sequence with a fair die would be 1 in 6,0466,176.  On the other hand, if you suspect that the die might be loaded so that it rolls sixes 90% of the time, the observations fit much better with that model.

My first exposure to HMMs was in linguistics.  I built up two language models for classified advertisements -- one for Spanish and one for English.  By comparing the probabilities of any random classified ad, I could guess fairly easily whether an ad were English or Spanish.  But if it happened to be in French or Vietnamese, my tool would have failed miserably.  (On a side note, one of my friends faced with a similar problem for news stories used a simpler solution -- he counted the number of 'the's and contrasted that to the number of 'el's and 'la's.  I never heard of a single bad identification with his approach.  It goes to show, the sophisticated solution might not always be what is needed).

This raises some interesting questions for me in the context of computer viruses.  HMMs seem to be a compelling option for virus detection, but what do they compare against?  You can imagine a series of models built for different virus families, but what if the file is not a virus?  It does not seem realistic to build a model for 'all benign programs'.  Neither does it seem realistic to build a model for each type of benign program.

There is likely a clean, well-known solution.  I just don't know it yet.

Otherwise, life in Laval has been fun.  My wife has arrived, and we've started to explore the surrounding town together.  The Lavalloise seem to be a little shy about their town.  Compared to Rennes or some of the other larger towns, perhaps Laval is a little sleepy.  But somehow it is very cool to sit and have a glass of wine at the foot of an 800 year old castle.  Coming from the western United States, where 100 years seems like a long time, the history of Laval is amazing.