Presented by: Rag Mayur Chevuri, Dept of Computer & Information Sciences, University of Delaware
Behavioural Classification, by Tony Lee and Jigar J Mody
Automatic malware classification • Human analysis is inefficient and inadequate. • Large number of new virus/spyware families. • Our focus: the classification problem. • Effective classification enables better detection, better cleaning, and better analysis solutions.
Objectives of classification methodologies • Classifies efficiently and automatically. • Incurs minimal information loss. • Is structured so results can be stored, analyzed, and referenced efficiently.
Objectives of classification methodologies (contd.) • Applies learned knowledge to automatically identify familiar patterns and similarity relations in a given target. • Is adaptable and has innate learning abilities.
Approach • An automated classification method based on runtime behavioral data and machine learning. • Represent a file by its runtime behavior. • Structure the event information and store it in a database. • Construct classifiers. • Apply the classifiers to new objects.
A “good” knowledge representation • Effectively captures knowledge of the object it represents. • Can persist in permanent storage. • Enables classifiers to efficiently and effectively correlate data across a large number of objects.
Representing behavior • The meaning of a particular action lies in its resulting state. • Construct the representation in a consistent, canonical format. The Vector Approach • Process data in vector format using statistical and probabilistic algorithms. • Problems: vector size, scalability, and factorability.
The Opaque Object Approach • Objects represent data in rich syntax. • Rich semantic representation of the actual object. • A precise distance between objects is used for clustering.
Event Representation • A sequence of events, ordered according to the time of occurrence of program actions and environment state transitions.
Event Properties • Event ID • Event object (e.g. registry, file, process, socket) • Event subject, if applicable (i.e. the process that takes the action) • Kernel function called, if applicable • Action parameters (e.g. registry value, file path, IP address) • Status of the action (e.g. file handle created, registry removed)
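As a sketch, the event properties listed above could be modeled as a record type; the class and field names below are illustrative assumptions, not the paper's actual data model:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Event:
    # All field names here are hypothetical, chosen to mirror the slide's list.
    event_id: int
    obj: str                        # event object: "registry", "file", "process", "socket", ...
    subject: Optional[str]          # the process taking the action, if applicable
    kernel_function: Optional[str]  # kernel function called, if applicable
    parameters: Tuple[str, ...]     # action parameters: registry value, file path, IP address
    status: str                     # status of the action: "file handle created", ...

# A behavior is then the time-ordered sequence of such events:
behavior = (
    Event(1, "file", "sample.exe", "NtCreateFile", (r"C:\payload.dll",), "file handle created"),
    Event(2, "registry", "sample.exe", "NtSetValueKey", ("Run", "payload.dll"), "value set"),
)
```

Freezing the dataclass keeps events hashable and immutable, which suits their role as elements of a canonical sequence.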
Which classifier? • A case-based classifier, treating the existing malware collection as a database of solutions. • Learns by case-based reasoning (CBR) with nearest-neighbor algorithms. • To make the CBR approach scalable, apply clustering.
Clustering • Unsupervised learning that organizes objects into clusters. • A cluster is a collection of objects that are “similar” to each other and “dissimilar” to the objects belonging to other clusters.
Distance Measure • Levenshtein distance: “the minimum cost required to transform one sequence of objects into another sequence by applying a set of operations.” • Operation = Op(Event) • Cost(Transformation) = Σᵢ Cost(Operationᵢ) • The cost of an operation depends on the operator as well as the operand.
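The operand-dependent cost above can be sketched with the standard dynamic-programming recurrence for edit distance; the three cost functions are hypothetical placeholders for whatever per-event costs an implementation chooses:

```python
def weighted_levenshtein(a, b, ins_cost, del_cost, sub_cost):
    """Minimum total cost to transform sequence a into sequence b.

    ins_cost(e) / del_cost(e) price inserting or deleting event e;
    sub_cost(x, y) prices substituting x with y, so the operation
    cost can depend on the operand, as the slide requires.
    """
    n, m = len(a), len(b)
    # dp[i][j] = minimum cost to transform a[:i] into b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + del_cost(a[i - 1]),      # delete an event of a
                dp[i][j - 1] + ins_cost(b[j - 1]),      # insert an event of b
                dp[i - 1][j - 1]
                + (0.0 if a[i - 1] == b[j - 1] else sub_cost(a[i - 1], b[j - 1])),
            )
    return dp[n][m]

# With uniform unit costs this reduces to the classic Levenshtein distance:
d = weighted_levenshtein("kitten", "sitting", lambda e: 1, lambda e: 1, lambda x, y: 1)
# d == 3
```

The same function works on event sequences directly, since it only needs equality and the cost callbacks.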
k-medoid partitioning clustering algorithm
1. Place K points into the space. These points represent the initial group medoids.
2. Assign each object to the group that has the closest medoid.
3. Recalculate the positions of the K medoids.
4. Repeat steps 2 and 3 until the medoids no longer move.
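The four steps above can be sketched as follows; this is a simplified k-medoid loop under the assumption of a symmetric, non-negative distance function, not the paper's implementation:

```python
import random

def k_medoids(items, dist, k, max_iter=100, seed=0):
    """Simplified k-medoid partitioning.

    Returns (medoids, assign): medoid indices into items, and for each
    item the index of the medoid it was assigned to.
    """
    rng = random.Random(seed)
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)          # step 1: pick K initial medoids
    assign = []
    for _ in range(max_iter):
        # step 2: assign each object to its closest medoid
        assign = [min(medoids, key=lambda m: d[i][m]) for i in range(n)]
        # step 3: each cluster's new medoid is the member minimizing
        # the total distance to the rest of its cluster
        new_medoids = [
            min((c for c in range(n) if assign[c] == m),
                key=lambda c: sum(d[c][i] for i in range(n) if assign[i] == m))
            for m in medoids
        ]
        # step 4: stop when the medoids no longer move
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign

# Toy usage: two obvious clusters on the number line.
pts = [0, 1, 2, 10, 11, 12]
meds, assign = k_medoids(pts, lambda a, b: abs(a - b), k=2)
```

Unlike k-means, the medoid is always an actual member of the cluster, which is what lets the method work with an arbitrary distance such as the weighted Levenshtein distance.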
Classifying a new object • Nearest-neighbor classification: compare the new object to all the medoids and assign it the family name of the closest medoid.
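The nearest-neighbor step can be sketched in a few lines; pairing each medoid with a family label is an illustrative structure assumed here, not the paper's data model:

```python
def classify(new_obj, labeled_medoids, dist):
    # labeled_medoids: list of (medoid, family_name) pairs.
    # Return the family name of the medoid closest to the new object.
    _, family = min(labeled_medoids, key=lambda pair: dist(new_obj, pair[0]))
    return family

# Toy usage on the number line: 3 is closer to medoid 1 than to medoid 11.
labeled_medoids = [(1, "FamilyA"), (11, "FamilyB")]
label = classify(3, labeled_medoids, lambda a, b: abs(a - b))  # "FamilyA"
```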
Experiment • Uses an automated, distributed replication system.
Data Analysis • Test data: Experiment 1 used 461 samples from 3 families; Experiment 2 used 760 samples from 11 families. • 10-fold cross-validation. • Experiments are varied and contrasted by adjusting two parameters: the number of clusters (K) and the maximum number of events (E). • Measured: error rate and accuracy gain.
• Error rate: ER = (number of incorrectly classified samples) / (total number of samples) • Accuracy: AC = 1 − ER • Accuracy gain of x over y: G(x, y) = |(ER(y) − ER(x)) / ER(x)|
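These definitions translate directly into code; the function names are mine:

```python
def error_rate(predicted, actual):
    # ER = number of incorrectly classified samples / total number of samples
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

def accuracy(predicted, actual):
    # AC = 1 - ER
    return 1 - error_rate(predicted, actual)

def accuracy_gain(er_x, er_y):
    # G(x, y) = |(ER(y) - ER(x)) / ER(x)|
    return abs((er_y - er_x) / er_x)

# One of four predictions is wrong, so ER = 0.25 and AC = 0.75.
er = error_rate(["A", "B", "B", "A"], ["A", "B", "A", "A"])
```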
Observations • Accuracy vs. number of clusters: the error rate falls as the number of clusters increases. • Accuracy vs. maximum number of events: the error rate falls as the event cap increases; the more events observed, the more accurately the behavior is captured, and the more likely clustering discovers the semantic similarity among variants of a family.
• Accuracy gain vs. number of events: the gain in accuracy is more substantial at lower event caps (100 vs. 500) than at higher event caps (500 vs. 1000). • Accuracy vs. number of families: the 11-family experiment outperforms the 3-family experiment in accuracy at the high event cap (1000), but the result is the opposite at the lower event cap (100).
Conclusion • Runtime behavior + machine learning lets us focus on pattern/similarity recognition in behavior semantics. • The approach lacks code-structural information; combining it with static analysis could improve classification accuracy. • “Developing automated classification process that applies classifiers with innate learning ability on near lossless knowledge representation is the key to the future of malware classification and defense.”