Mr. Mehedy M. Masud (PhD Student) Prof. Latifur Khan Prof. Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas Data Mining for Security Applications: Detecting Malicious Executables
Outline and Acknowledgement • Vision for Assured Information Sharing • Handling Different Trust levels • Defensive Operations between Untrustworthy Partners • Detecting Malicious Executables using Data Mining • Research Funded by Air Force Office of Scientific Research and Texas Enterprise Funds
Vision: Assured Information Sharing • [Diagram: each agency's component (Agency A, Agency B, Agency C) publishes its data/policy into a shared data/policy for the coalition] • Trustworthy partners • Semi-trustworthy partners • Untrustworthy partners • Dynamic trust
Our Approach • Integrate the Medicaid claims data and mine the data; next, enforce policies and determine how much information has been lost by enforcing policies • Prof. Khan, Dr. Awad (Postdoc) and Student Workers (MS students) • Apply game theory and probing techniques to extract information from semi-trustworthy partners • Prof. Murat Kantarcioglu and Ryan Layfield (PhD Student) • Data mining for defensive and offensive operations • E.g., malicious code detection, honeypots • Prof. Latifur Khan and Mehedy Masud • Dynamic trust levels, peer-to-peer communication • Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD student)
Introduction: Detecting Malicious Executables using Data Mining • What are malicious executables? • Programs that harm computer systems • Virus, exploit, denial of service (DoS), flooder, sniffer, spoofer, Trojan, etc. • Exploit a software vulnerability on a victim • May remotely infect other victims • Incur great losses; for example, the Code Red epidemic cost an estimated $2.6 billion • Malicious code detection: traditional approach • Signature based • Requires signatures to be generated by human experts • Therefore, not effective against “zero-day” attacks
State of the Art: Automated Detection • Automated detection approaches: • Behavioural: analyse behaviour such as source and destination addresses, attachment types, statistical anomalies, etc. • Content-based: analyse the content of the malicious executable • Autograph (H.-A. Kim, CMU): based on an automated signature generation process • N-gram analysis (Maloof, M. A. et al.): based on mining features and using machine learning
New Ideas • Content-based approaches consider only machine code (byte code). • Is it possible to consider higher-level source code for malicious code detection? • Yes: disassemble the binary executable and retrieve the assembly program • Extract important features from the assembly program • Combine with machine-code features
Feature Extraction • Binary n-gram features • Sequence of n consecutive bytes of binary executable • Assembly n-gram features • Sequence of n consecutive assembly instructions • System API call features • DLL function call information
The Hybrid Feature Retrieval Model • Collect training samples of normal and malicious executables. • Extract features • Train a Classifier and build a model • Test the model against test samples
Hybrid Feature Retrieval (HFR) • Training
Hybrid Feature Retrieval (HFR) • Testing
Feature Extraction • Binary n-gram features • Features are extracted from the byte codes in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on. • Example: • Given an 11-byte sequence: 0123456789abcdef012345, • The 2-grams (2-byte sequences) are: 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345 • The 4-grams (4-byte sequences) are: 01234567, 23456789, 456789ab, ..., ef012345, and so on • Problem: • Large dataset; too many features (millions!) • Solution: • Use secondary memory and efficient data structures • Apply feature selection
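The sliding-window idea above is simple to reproduce; the following minimal sketch (Python with the standard library only, not the authors' implementation) counts overlapping byte n-grams, using the slide's 11-byte example.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count overlapping n-byte sequences in an executable's raw bytes."""
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))

sample = bytes.fromhex("0123456789abcdef012345")  # the 11-byte example above
print(byte_ngrams(sample, 2))  # 0123 and 2345 occur twice, the rest once
```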
Feature Extraction • Assembly n-gram features • Features are extracted from the assembly programs in the form of n-grams, where n = 2,4,6,8,10 and so on. • Example: • three instructions • “push eax”; “mov eax, dword[0f34]” ; “add ecx, eax”; • 2-grams • (1) “push eax”; “mov eax, dword[0f34]”; • (2) “mov eax, dword[0f34]”; “add ecx, eax”; • Problem: • Same problem as binary • Solution: • Same solution
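The same idea carries over to the assembly level; here is a hypothetical sketch that slides a window over a list of disassembled instructions instead of raw bytes, using the three-instruction example from the slide.

```python
def asm_ngrams(instructions, n):
    """Return overlapping n-instruction sequences from a disassembly listing."""
    return [tuple(instructions[i:i + n]) for i in range(len(instructions) - n + 1)]

listing = ["push eax", "mov eax, dword[0f34]", "add ecx, eax"]
for gram in asm_ngrams(listing, 2):
    print(gram)
# ('push eax', 'mov eax, dword[0f34]')
# ('mov eax, dword[0f34]', 'add ecx, eax')
```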
Feature Selection • Select the best K features • Selection criterion: information gain • The gain of an attribute A on a collection of examples S is given by Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v), where S_v is the subset of S for which A has value v
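As a concrete illustration of the selection criterion, the sketch below computes information gain for binary (presence/absence) features; the toy feature sets and labels are invented for the example and are not the study's data.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(feature, examples):
    """examples: list of (feature_set, label); gain of splitting on `feature`."""
    with_f = [lbl for feats, lbl in examples if feature in feats]
    without_f = [lbl for feats, lbl in examples if feature not in feats]
    labels = [lbl for _, lbl in examples]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_f, without_f) if part)
    return entropy(labels) - remainder

# Rank candidate n-gram features and keep the best K.
examples = [({"0123", "89ab"}, "malicious"), ({"2345"}, "benign"),
            ({"0123"}, "malicious"), ({"2345", "89ab"}, "benign")]
features = {f for feats, _ in examples for f in feats}
best_K = sorted(features, key=lambda f: info_gain(f, examples), reverse=True)[:2]
print(best_K)
```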
Experiments • Dataset • Dataset1: 838 Malicious and 597 Benign executables • Dataset2: 1082 Malicious and 1370 Benign executables • Collected Malicious code from VX Heavens (http://vx.netlux.org) • Disassembly • Pedisassem (http://www.geocities.com/~sangcho/index.html ) • Training, Testing • Support Vector Machine (SVM) • C-Support Vector Classifiers with an RBF kernel
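For the training/testing step, a hedged sketch using scikit-learn's C-SVC with an RBF kernel is shown below; the slide only names the classifier, so the library choice and the random placeholder vectors are assumptions for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 500))   # binary vectors over the selected n-gram features
y = rng.integers(0, 2, size=200)          # 1 = malicious, 0 = benign (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```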
Results • HFS = Hybrid Feature Set • BFS = Binary Feature Set • AFS = Assembly Feature Set
Future Plans • System calls: • seem to be very useful • Need to consider the frequency of calls • Call sequence patterns (following program paths) • Actions immediately preceding or following a call • Detect malicious code by program slicing • Requires further analysis
Data Mining to Detect Buffer Overflow Attack Mohammad M. Masud, Latifur Khan, Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas
Introduction • Goal • Intrusion detection. • e.g.: worm attack, buffer overflow attack. • Main Contribution • 'Worm' code detection by data mining coupled with 'reverse engineering'. • Buffer overflow detection by combining data mining with static analysis of assembly code.
Background • What is a 'buffer overflow'? • A situation in which a fixed-size buffer is overflown by a larger input. • How does it happen? • Example: char buff[100]; gets(buff); • [Diagram: the input string is copied into buff on the stack]
Background (cont...) • Then what? • [Diagram: the oversized input overruns buff on the stack and overwrites the return address; the attacker's code sits in the overflowed buffer, and the new return address points to that memory location]
Background (cont...) • So what? • The program may crash, or • The attacker can execute arbitrary code • It can now • Execute any system function • Communicate with some host, download some 'worm' code, and install it! • Open a backdoor to take full control of the victim • How to stop it?
Background (cont...) • Stopping buffer overflow • Preventive approaches • Detection approaches • Preventive approaches • Finding bugs in source code. Problem: only works when source code is available. • Compiler extensions. Same problem. • OS/HW modification • Detection approaches • Capture run-time symptoms. Problem: may require a long running time. • Automatically generating signatures of buffer overflow attacks.
CodeBlocker (Our approach) • A detection approach • Based on the Observation: • Attack messages usually contain code while normal messages contain data. • Main Idea • Check whether message contains code • Problem to solve: • Distinguishing code from data
Severity of the problem • It is not easy to recover the actual instruction sequence from a given string of bits
Our solution • Apply data mining • Formulate the problem as a classification problem (code vs. data) • Collect a set of training examples containing both kinds of instances • Train a machine learning algorithm on the data to obtain a model • Test this model against a new message
Disassembly • We apply the SigFree tool • implemented by Xinran Wang et al. (Penn State)
Feature extraction • Features are extracted using • N-gram analysis • Control-flow analysis • N-gram analysis: What is an n-gram? A sequence of n instructions. • Traditional approach: the flow of control is ignored • The 2-grams are: 02, 24, 46, ..., CE • [Figure: an assembly program and its corresponding instruction flow graph (IFG)]
Feature extraction (cont...) • Control-flow based n-gram analysis: the proposed approach takes the flow of control into account, so n-grams also follow jump and branch edges • The 2-grams are: 02, 24, 46, ..., CE, E6 • [Figure: an assembly program and its corresponding IFG]
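One way to realise this is to enumerate instruction sequences along the edges of the instruction flow graph rather than down the linear listing; the sketch below does so on a tiny, invented graph and is only an illustration of the idea, not the paper's implementation.

```python
def cfg_ngrams(instructions, edges, n):
    """instructions: list of instruction strings; edges: dict node -> successor nodes."""
    grams = set()

    def walk(path):
        if len(path) == n:                       # a full n-gram along control-flow edges
            grams.add(tuple(instructions[i] for i in path))
            return
        for succ in edges.get(path[-1], []):
            walk(path + [succ])

    for start in range(len(instructions)):
        walk([start])
    return grams

instructions = ["push eax", "mov eax, ebx", "jz 0x10", "add ecx, eax", "ret"]
edges = {0: [1], 1: [2], 2: [3, 4], 3: [4]}      # node 2 branches: fall-through to 3, jump to 4
for gram in sorted(cfg_ngrams(instructions, edges, 2)):
    print(gram)
```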
Feature extraction (cont...) • Control-flow analysis generates the following features • Invalid Memory Reference (IMR) • Undefined Register (UR) • Invalid Jump Target (IJT) • Checking IMR • A memory location is referenced using register addressing and the register value is undefined • e.g.: mov ax, [dx + 5] • Checking UR • Check whether the register value is set properly • Checking IJT • Check whether a jump target violates an instruction boundary
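A toy sketch of the UR/IMR style of check follows: it walks the disassembled instructions in order and flags a memory reference whose base register has never been assigned. Real x86 semantics are far richer; the simplified instruction format and regular expressions here are assumptions for illustration.

```python
import re

MEM_REF = re.compile(r"\[(\w+)")           # e.g. matches "dx" in "mov ax, [dx + 5]"
ASSIGN = re.compile(r"^(mov|pop|lea)\s+(\w+),?")

def find_undefined_refs(instructions):
    defined, problems = set(), []
    for ins in instructions:
        mem = MEM_REF.search(ins)
        if mem and mem.group(1) not in defined:
            problems.append((ins, mem.group(1)))     # register used before being set
        assign = ASSIGN.match(ins)
        if assign:
            defined.add(assign.group(2))
    return problems

print(find_undefined_refs(["mov ax, [dx + 5]", "mov dx, 0", "mov bx, [dx]"]))
# [('mov ax, [dx + 5]', 'dx')] -- dx is referenced before it is ever defined
```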
Feature extraction (cont...) • Why n-gram analysis? • Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data. • Why control-flow analysis? • Intuition: in real code there should be no invalid memory references or invalid jump targets.
Putting it together • Compute all possible n-grams • Select best k of them • Compute feature vector (binary vector) for each training example • Supply these vectors to the training algorithm
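Putting the selected features into vectors is straightforward; the following minimal sketch builds the binary (presence/absence) vector for each training example, with placeholder feature names.

```python
def to_binary_vector(sample_ngrams, selected_features):
    """1 if the selected n-gram occurs in the sample, 0 otherwise."""
    return [1 if f in sample_ngrams else 0 for f in selected_features]

selected_features = ["0123", "89ab", "cdef"]          # the best k n-grams
samples = [{"0123", "cdef"}, {"89ab"}, {"ffff"}]
print([to_binary_vector(s, selected_features) for s in samples])
# [[1, 0, 1], [0, 1, 0], [0, 0, 0]]
```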
Experiments • Dataset • Real traces of normal messages • Real attack messages • Polymorphic shellcodes • Training, Testing • Support Vector Machine (SVM)
Results • CFBn: Control-Flow Based n-gram feature • CFF: Control-flow feature
Novelty / contribution • We introduce the notion of control-flow based n-grams • We combine control-flow analysis with data mining to distinguish code from data • Significant improvement over other methods (e.g., SigFree)
Advantages • Fast testing • Signature-free operation • Low overhead • Robust against many obfuscations
Limitations • Need samples of attack and normal messages. • May not be able to detect a completely new type of attack.
Future Work • Find more features • Apply dynamic analysis techniques • Semantic analysis
Reference / suggested readings • X. Wang, C. Pan, P. Liu, and S. Zhu. SigFree: A signature-free buffer overflow attack blocker. In Proceedings of the USENIX Security Symposium, July 2006. • J. Z. Kolter and M. A. Maloof. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470–478, 2004.
Email Worm Detection (behavioural approach) • [Diagram: outgoing emails → feature extraction → training data and test data → machine learning → classifier → the model labels each email as clean or infected]
Feature Extraction • Per-email features • Binary-valued features: presence of HTML; script tags/attributes; embedded images; hyperlinks; presence of binary or text attachments; MIME types of file attachments • Continuous-valued features: number of attachments; number of words/characters in the subject and body • Per-window features • Number of emails sent; number of unique email recipients; number of unique sender addresses; average number of words/characters per subject and body; average word length; variance in number of words/characters per subject and body; variance in word length • Ratio of emails with attachments
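As an illustration, the sketch below computes a few of the per-email features listed above with Python's standard email parser; the exact feature definitions used in the study are not reproduced here.

```python
from email import message_from_string

def email_features(raw: str) -> dict:
    msg = message_from_string(raw)
    body = msg.get_payload() if not msg.is_multipart() else ""
    attachments = [p for p in msg.walk()
                   if p.get_content_disposition() == "attachment"]
    return {
        "has_html": int("<html" in raw.lower()),
        "num_attachments": len(attachments),
        "num_words_subject": len((msg.get("Subject") or "").split()),
        "num_chars_body": len(body),
    }

print(email_features("Subject: hello there\n\nJust a short note."))
```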
Feature Reduction & Selection • Principal Component Analysis (PCA) • Reduces higher-dimensional data into a lower dimension • Helps reduce noise and overfitting • Decision Tree • Used to select the best features
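A hedged sketch of the two steps with scikit-learn is shown below: PCA projects the feature vectors to a lower dimension, and a decision tree's split importances serve as the feature-selection signal. The data here is random placeholder data, not the worm dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((100, 40))                  # 40 raw per-email / per-window features
y = rng.integers(0, 2, size=100)           # 1 = infected, 0 = clean (placeholder labels)

X_reduced = PCA(n_components=10).fit_transform(X)                 # dimension reduction
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
selected = np.argsort(tree.feature_importances_)[::-1][:10]       # best features by the tree
print(X_reduced.shape, selected)
```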
Experiments • Data set • Contains instances of both normal and viral emails • Six worm types: • bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f • Collected from UC Berkeley • Training, testing: • Decision tree: C4.5 algorithm (J48) in Weka • Support Vector Machine (SVM) and Naïve Bayes (NB)
Conclusion & Future Work • Three approaches have been tested • Apply the classifier directly • Apply dimension reduction (PCA) and then classify • Apply feature selection (decision tree) and then classify • The decision tree has the best performance • Future plans • Combine content-based with behavioural approaches • Offensive operations • Honeypots, information operations