Wednesday, October 12, 2011

I have a Hidden Markov Model... Now What?

I have been working on creating hidden Markov models (HMMs) for computer viruses.  Now that I have them, I'm running into an interesting complication.  Namely, what can I do with them?

With an HMM, you can get the statistical probability for any particular series of observations.  For a very simple case, consider a loaded die.  50% of the time, it will roll a 6.  Otherwise, it will roll a number between 1 and 5 (10% chance each).  Once you have your model built up, you can determine the probability of a series of rolls.

So, continuing the example, pretend that you observe 10 sixes being rolled in succession.  What are the odds that this sequence would have been rolled with the loaded die?  (1/2) ^ 10, or 1 in 1024.  Given these observations, is it likely that you are using the loaded die?

Well...  it depends.  What are the other models?  The probability for the same sequence with a fair die would be (1/6) ^ 10, or 1 in 60,466,176.  On the other hand, if you suspect that the die might be loaded so that it rolls sixes 90% of the time, the observations fit much better with that model.
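The arithmetic above can be sketched in a few lines of Python.  This is only an illustration: each die is treated as a plain independent-roll model with no hidden states, and the helper names (make_die, sequence_probability) are made up for this example rather than taken from any HMM library.

```python
from fractions import Fraction

def make_die(p_six):
    """Build a die model: P(6) = p_six, the rest split evenly over faces 1-5."""
    p_six = Fraction(p_six)
    p_other = (1 - p_six) / 5
    model = {face: p_other for face in range(1, 6)}
    model[6] = p_six
    return model

def sequence_probability(model, rolls):
    """Probability of an exact sequence of independent rolls under a model."""
    prob = Fraction(1)
    for roll in rolls:
        prob *= model[roll]
    return prob

# The three candidate models discussed above.
models = {
    "fair": make_die(Fraction(1, 6)),
    "loaded 50%": make_die(Fraction(1, 2)),
    "loaded 90%": make_die(Fraction(9, 10)),
}

rolls = [6] * 10  # ten sixes in a row
likelihoods = {name: sequence_probability(m, rolls) for name, m in models.items()}
best = max(likelihoods, key=likelihoods.get)

for name, p in likelihoods.items():
    print(f"{name}: about 1 in {float(1 / p):,.0f}")
print("best fit:", best)
```

The point is that a single probability says little by itself; it only becomes meaningful when compared against the same sequence's probability under competing models.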

My first exposure to HMMs was in linguistics.  I built up two language models for classified advertisements -- one for Spanish and one for English.  By comparing the probabilities of any random classified ad, I could guess fairly easily whether an ad was English or Spanish.  But if it happened to be in French or Vietnamese, my tool would have failed miserably.  (On a side note, one of my friends faced with a similar problem for news stories used a simpler solution -- he counted the number of 'the's and contrasted that to the number of 'el's and 'la's.  I never heard of a single bad identification with his approach.  It goes to show, the sophisticated solution might not always be what is needed.)
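My friend's counting trick is easy to sketch.  I never saw his actual code, so the function name, the tokenizing regex, and the "unknown" result for ties are all my own assumptions here.

```python
import re

def guess_language(text):
    """Guess 'english' or 'spanish' by counting a few very common articles.

    Hypothetical reconstruction of the heuristic: count 'the' against
    'el' and 'la'. A tie (including zero hits) is reported as 'unknown'.
    """
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    english = sum(w == "the" for w in words)
    spanish = sum(w in ("el", "la") for w in words)
    if english == spanish:
        return "unknown"
    return "english" if english > spanish else "spanish"

print(guess_language("The cat sat on the mat"))
print(guess_language("El gato duerme en la casa"))
```

Like my two-model tool, this only chooses between the languages it was built for.  A French ad full of 'la's would be counted as Spanish.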

This raises some interesting questions for me in the context of computer viruses.  HMMs seem to be a compelling option for virus detection, but what do they compare against?  You can imagine a series of models built for different virus families, but what if the file is not a virus?  It does not seem realistic to build a model for 'all benign programs'.  Neither does it seem realistic to build a model for each type of benign program.

There is likely a clean, well-known solution.  I just don't know it yet.

Otherwise, life in Laval has been fun.  My wife has arrived, and we've started to explore the surrounding town together.  The Lavallois seem to be a little shy about their town.  Compared to Rennes or some of the other larger towns, perhaps Laval is a little sleepy.  But somehow it is very cool to sit and have a glass of wine at the foot of an 800-year-old castle.  Coming from the western United States, where 100 years seems like a long time, the history of Laval is amazing.

Friday, September 30, 2011

The Virus War

At ESIEA, I am doing research on metamorphic viruses.  It is a new area for me, so I have been reading up on lots of new material.  I am fascinated by some of the gambits and defenses in the war between virus writers and antivirus researchers.

In the past week, I have been experimenting with virus construction kits and Octave (a free alternative to MATLAB), and reading reams of papers on computer viruses, hidden Markov models, etc.  I feel like I am going in about 12 directions at once.  But as my master's thesis adviser once told me, "that's research".

A quick history of viruses...

The classic viruses were fairly easy to detect through a method known as "signature detection".  Essentially, virus scanners look for a bit pattern associated with a virus to identify a corrupted file.  This method is still the predominant one, but newer viruses are being designed to evade this method.
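As a toy illustration of the idea (not how any real scanner is implemented), signature detection boils down to searching a file's bytes for known patterns.  The signature names and byte values below are invented for this example.

```python
# Hypothetical signature database: name -> byte pattern (made-up values).
SIGNATURES = {
    "example-family-a": bytes.fromhex("deadbeef"),
    "example-family-b": b"\x90\x90\xcc\x01",
}

def scan(data: bytes, signatures=SIGNATURES):
    """Return the names of all signatures whose byte pattern occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]

clean_file = b"just some ordinary bytes"
flagged_file = b"header" + bytes.fromhex("deadbeef") + b"trailer"

print(scan(clean_file))
print(scan(flagged_file))
```

Anything that changes those exact bytes slips past a scan like this, which is what the newer techniques exploit.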

"Encrypted viruses" attempt to evade scanners by encrypting the body of the virus.  Typically, this would be done with a XOR operation, so that the same procedure can be used to both encrypt and decrypt the body of the virus.  By itself, this approach is not especially useful -- the virus scanner can still identify the signature of the encryption/decryption code.

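The reason XOR is convenient is that applying it twice with the same key gives back the original bytes.  Here is a minimal sketch on a harmless string; the function name and the one-byte key are just assumptions for the example.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the (repeating) key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

plaintext = b"hello, world"
key = b"\x5a"  # arbitrary single-byte key, chosen for illustration

ciphertext = xor_bytes(plaintext, key)   # "encrypt"
recovered = xor_bytes(ciphertext, key)   # the same call "decrypts"

print(ciphertext != plaintext, recovered == plaintext)
```

The same routine serves for both directions, so that routine stays constant from copy to copy.
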
"Polymorphic viruses" improve on encrypted viruses by mutating the decrypter function.  A simple version of the signature detection approach will then fail totally.  Except...  Modern scanners will decrypt the virus body, and then scan the virus.  (I am still a little fuzzy on how they know when to decrypt the virus body.)

But polymorphic viruses point the way to a far more interesting approach.  Rather than relying on encryption, "metamorphic viruses" mutate the body of the virus.  This strategy can evade signature detection approaches without relying on encryption.  (Interestingly, DRM systems are apparently exploring this technique to defy reverse engineering efforts).

Detecting metamorphic viruses is fairly challenging.  Fortunately, most of the metamorphic viruses today have not been particularly good.  But some are.  NGVCK (Next Generation Virus Construction Kit) was designed (apparently) as a proof of concept.  It produces harmless but hard-to-detect viruses.  (Its last release was in 2002 -- virus scanners might have caught up to it these days.)

Current research has been exploring statistical models, especially hidden Markov models (HMMs).  The results seem promising, but the battle is not over.  Some research suggests that attackers could tune the mutations to emulate benign files.  Virus scanners are then left with the unpleasant choice of rejecting benign files or accepting some malicious files (and probably some of both).

Anyway, it is an exciting new realm for me!

Sunday, September 25, 2011

An American in Laval

After finishing up a fantastic summer at Mozilla, I hopped on board a plane to France to begin my 3-month odyssey abroad.  I was still exhausted from the all-hands meeting at Mozilla.  I woke up at 4am to catch the shuttle to the airport, with a layover in Philly, followed by an hour-long shuttle from Charles de Gaulle to the train station at Montparnasse, followed by a 2-hour train ride to Laval, finally to arrive at my destination at about noon the following day.  I think I am just finally catching up on sleep now.

I have been in France for almost a week, and I've been overwhelmed by my reception.  The people here have been uniformly friendly, and have gone out of their way to make me feel welcome.  The town of Laval is lovely, and the food has been delicious.

The Saturday market in Laval was overwhelming.  In California, we have farmers' markets, but these are pitifully small compared to Laval's.  There was fresh-baked bread, giant tubs of paella, seafood so fresh that it was literally trying to escape, and produce that has to be seen to be believed.  I think the produce section alone would be equivalent to 3 or 4 farmers' markets back home.  I think I will enjoy my time here.

So far, the biggest difference that I have noticed is that there is a sharp divide between work and play.  In the States, we buy huge cups of coffee and take them to go so that we can go back to work.  Half the time, 'work' might consist of Facebook and Farmville, but the pressure to be at our desks is very strong.

In France, cups of coffee are small, and no one gets them to go.  You sit and chat with friends, and when you are finished, you go back to work.  And then you work.  I'm not sure who comes out ahead in terms of production, but I am gaining an appreciation for the French approach.

Tuesday, February 15, 2011

Counting Tiger Vim Fu, Macro Monkey Vim Fu

In my travels, I have seen many masters of their text editors, but none to match the masters of the ancient and complex art of Vim Fu.  In contrast to the arts from The Land of The Darkened Sun or the Builders, the art of Vim Fu has great subtlety and variety.  If you were to see two different masters of Vim Fu, you might not even realize that they were practicing the same art.  And so, in this article, I attempt to chronicle several of the more common types of Vim Master that you may encounter.

The Counting Tigers have a straightforward, powerful style.  With a near Rain Man-like ability to count at a glance, practitioners of this style will frequently use commands like 5dd or 2y{.  To Counting Tigers, the world may be broken up into units and numbers.

The masters who follow the Path Of One Thousand Stars (though they use an equal number of question marks, pluses, dots, and other meta characters) eschew the straightforward approach of the Counting Tigers, instead opting for the elegance of patterns.  Their forte will be commands like yt,.  And :%s/old/new/gc is deeply ingrained into their muscle memory.

The Masters of the Hidden Mark rely on mx and 'x extensively, whereas the Counting Tigers would remember the line number and type 166G.

The Visionaries rely extensively on visual mode, highlighting relevant sections and applying their commands to the whole region.  Members of this clan frequently rely on column mode editing, an art that other masters often see as being of little value.

The Way of the Macro Monkey leads practitioners to define mini programs of character sequences, constantly creating ephemeral, custom scripts to achieve their goals.  The Pure take this craft to another level, constantly redefining the basic art of Vim Fu to the point that it is not recognizable by other Vim Fu practitioners.  Their .vim/plugin/ directories may have thousands of .vim files.  They can also be readily identified by their fierce, nearly fanatical refusal to use any other tool.

In great contrast to The Pure, there are a number of Wandering Masters.  While Vim is their first weapon of choice, they will use Eclipse, NetBeans, or any other tool that makes their task easier.  They eschew the more esoteric features of Vim Fu, instead seeking to synthesize features of other arts with the core basics of the Vi Path.

Here I feel I must make a special mention of the masters of the Viper Clan.  Though not proper Vim Fu masters, they seek to join the great art with techniques gleaned from the masters from the strange, twisted lands of Emacs.  Some consider their art an abomination.  Others hope that they will bring civility to Emacs.  Yet others prophesy of a chosen one that will at last bring true peace and unity to the House of Vim and the House of Emacs.

Lastly are those of "No Style".  Borrowing from the philosophies of Bruce Lee, these rare few strive to master all aspects of Vim Fu, yet limit themselves to no one approach.  These masters can do things with their Vim Fu that seem magical even to other Vim Fu masters.  While it is not meant for all of us to truly master the art of Vim Fu to this degree, by aspiring to perfection, and constantly extending our mastery of our craft, we can become better practitioners of the Noble Art.

Friday, January 28, 2011

Switching to LaTeX

I am giving up on Word.  After many years of use, I've gotten too irritated with its many quirks.  The final straw for me was when it inserted section breaks in the middle of my novel and would not let me delete them.

So instead, I am now using Aquamacs and LaTeX.  When I started the process of converting over my novel, I was not sure if it was really a good idea.  LaTeX is great for technical papers (especially the Greek-letter-heavy ones that we are so fond of in programming languages research), but I was not sure if it would really offer much for a fiction writer.

Almost immediately I found the benefits.  Although there are some hassles, the precise control of formatting is fantastic, and that alone makes the trade very worthwhile.  Also, being able to use my vi key bindings with Emacs Viper mode has been a true delight.  And I can actually track my changes through source control.

It has also been interesting what I have not found useful.  I had expected that variables would be a useful feature.  For instance, I could make a command for a character's name, and then allow myself to change it easily if I thought of a better one.  I tried this for one of my characters briefly, but found it to be more of an irritation than anything else.

But there is one feature that I did not give much thought to, and yet it has me fully committed to a life using LaTeX for all of my non-technical documents.  And that is comments....

It seems so obvious now, but before, I had a wide variety of text files and handwritten note cards; all of these have now been collapsed into the .tex file itself.  I use simple line comments for things like 'fix up this section' with no worries about cluttering up the document for anyone I ask to review it.  For more complex notes, like outlines and lists of my characters, I make new commands.  When I set the command to do nothing, the notes are invisible.  But when I change it to output the text, I can share those details with anyone who wishes to review it.

LaTeX is not a program for everyone.  But if you are familiar with it, don't be afraid to use it for less technical documents.  Many of the same benefits will still carry over.