Learning to Detect Malicious Executables in the Wild

Learning to Detect Malicious Executables in the Wild Original Paper by Jeremy Z. Colter and Marcus A. Maloof, of Georgetown University Presented by Alvin Grissom

What is Malicious Code? Malicious code is "any code added, changed, or removed from a software system to intentional cause harm or subvert the system's intended function." (G. McGraw and G. Morisett, 2000) Examples include software that is used to... -compromise computer systems -destroy information -gather/distribute information (e.g., spyware)

Types of Malicious Software Malicious executables are generally categorized based on their respective transport mechanisms: Viruses inject malicious code into other executables, infecting them and propagating the infection. Worms are self-contained programs which spread over a network, usually by exploiting vulnerabilities. Trojan Horses masquerade as benign programs.

Anti-Virus Software Anti-virus software scans executables for known patterns and has been quite successful. However, there are substantial shortcomings inherent to this approach: - program must be obtained before patterns can be identified -simple obfuscation techniques can deceive anti-virus software

A Different Approach - Use machine learning and techniques from text mining and classification -Form as a classification problem -n-grams

Related Work - Lo, et. al. investigated using "tell-tale signs" to filter programs, but provided no empirical results -IBM's Watson Research Center have investigated using neural networks, and have incorporated it into their anti-virus software for boot sector virus protection -Shultz, et. al. used data mining methods to detect malicious code, a technique similar to the work presented here. -used extracted features, e.g., function names, hex dumps, DLL names -Naive Bayes performed best of the techniques attempted -Are these features stable? Compiler-dependent?

Data Collection - The data set consists of 1,972 benign executables and 1,651 malicious executables, in Windows's portable executable format for Windows 2000 and XP. - Some executables were obfuscated with compression and/or encryption. - The authors used the hexdump utility to convert executables to hexadecimal, before producing n-grams. -E.g. ff00ab3e12be --> {ff00ab3e, 00ab3e12, ab3e12b3} This yielded ~256 million distinct n-grams

Classification Methodology -Used n-grams by viewing each n-gram as either present (1) or absent (0) from the executable. The most relevant n-grams were then selected, by computing the information game for each. -Selected the top 500 n-grams and applied several learning methods, namely, IBk, TFIDF, Naive Bayes, SVM, and a decision tree. -Boosting was also used-Techniques were evaluated on both a large and a small collection

Experimental Results -Best performance achieved by using top 500 n-grams -Using n-grams of size n = 4 (bytes) produced best results. -The most relevant attributes were extracted by computing information gain.-Boosted J48 (decision tree) exhibited best performance -Evaluation was based on the area under the ROC curve

Results

Other Interesting Finds -One n-gram appeared in 75% of malicious executables, but was not executable code; it was a string sequence. Its purpose is unclear. :-/-No decision tree contained more than 103 nodes; the average was 90, and the height never exceeded 13.-Naive Bayes results were inferior to previous studies, possibly due to conditional dependence.

Learning to Detect Malicious Executables in the Wild