Behavioural Classification
Tony Lee and Jigar J Mody
Presented by: Rag Mayur Chevuri, Dept of Computer & Information Sciences, University of Delaware
Automatic malware classification
• Human analysis is inefficient and inadequate.
• Large number of new virus/spyware families.
• Our focus: the classification problem.
• Effective classification enables:
  - Better detection
  - Better cleaning
  - Better analysis solutions
Objectives of classification methodologies
• Classifies efficiently and automatically.
• Represents knowledge with minimal information loss.
• Keeps knowledge structured so it can be stored, analyzed and referenced efficiently.
Objectives of classification methodologies (contd.)
• Applies learned knowledge to automatically identify familiar patterns and similarity relations in a given target.
• Is adaptable and has innate learning abilities.
Approach
• Automated classification method based on:
  - runtime behavioral data
  - machine learning
• Represent a file by its runtime behavior
• Structure the event information
• Store it in a database
• Construct classifiers
• Apply the classifiers to new objects
A “good” knowledge representation
• Effectively captures knowledge of the object it represents.
• Can persist in permanent storage.
• Enables classifiers to efficiently and effectively correlate data across a large number of objects.
Representing behavior
• The meaning of a particular action
  - resulting state
• Construct the representation in a consistent canonical format.

Vector Approach
• Process data in vector format using statistical and probabilistic algorithms
• Problems: vector size, scalability, and factorability.
The Opaque Object Approach
• Objects represent data in rich syntax
• Rich semantic representation of the actual object
• Precise distance between objects, used for clustering
Events Representation
• Sequence of events
• Ordered according to:
  - time of occurrence of program actions
  - environment state transitions
Event Properties
• Event ID
• Event object (e.g. registry, file, process, socket, etc.)
• Event subject, if applicable (i.e. the process that takes the action)
• Kernel function called, if applicable
• Action parameters (e.g. registry value, file path, IP address)
• Status of the action (e.g. file handle created, registry removed, etc.)
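As an illustration only, the event properties listed above could be held in a small record type. The class name, field names, and sample values below are assumptions made for this sketch; the slides do not specify a concrete schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One observed runtime event (hypothetical field names)."""
    event_id: int
    event_object: str                      # e.g. "registry", "file", "process", "socket"
    subject: Optional[str] = None          # process that takes the action, if applicable
    kernel_function: Optional[str] = None  # kernel function called, if applicable
    parameters: Tuple[str, ...] = ()       # e.g. registry value, file path, IP address
    status: Optional[str] = None           # e.g. "file handle created"


# A behavior trace is a sequence of events kept in order of occurrence.
# The values here are made up for illustration.
trace = [
    Event(1, "file", subject="sample.exe", kernel_function="NtCreateFile",
          parameters=(r"C:\temp\a.tmp",), status="file handle created"),
    Event(2, "registry", subject="sample.exe",
          parameters=(r"HKLM\Software\Example",), status="registry removed"),
]
```

Making the record immutable (`frozen=True`) keeps events hashable and comparable by value, which is convenient when two traces are compared event by event.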
Which classifier?
• Case-based classifier: treat the existing malware collection as a database of solutions.
• Learn by case-based reasoning (CBR)
• Nearest Neighbor algorithms.
• To make the CBR approach scalable, apply “clustering”.
Clustering
• Unsupervised learning
• Organize objects into clusters
• A cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Distance Measure
• Levenshtein distance: “minimum cost required to transform one sequence of objects to another sequence by applying a set of operations.”
• Operation = Op(Event)
• $\mathrm{Cost}(\text{Transformation}) = \sum_i \mathrm{Cost}(\text{Operation}_i)$
• The cost of an operation depends on the operator as well as the operand
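The distance measure above can be sketched as a weighted edit distance over event sequences. This is a minimal sketch, not the authors' implementation: the operation names (`"insert"`, `"delete"`, `"replace"`), the `unit_cost` function, and the choice to charge a replacement by the event being written are all assumptions made here.

```python
from typing import Callable, Hashable, Sequence

CostFn = Callable[[str, Hashable], float]


def unit_cost(op: str, event: Hashable) -> float:
    """Assumed default: every operation costs 1, whatever the operand."""
    return 1.0


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable],
                  cost: CostFn = unit_cost) -> float:
    """Minimum total cost of the operations that transform sequence a into b."""
    # prev[j] = cost of turning the already processed prefix of a into b[:j].
    # For the empty prefix, that is the cost of inserting b[:j].
    prev = [0.0]
    for y in b:
        prev.append(prev[-1] + cost("insert", y))
    for x in a:
        cur = [prev[0] + cost("delete", x)]
        for j, y in enumerate(b, start=1):
            keep_or_replace = prev[j - 1] + (0.0 if x == y else cost("replace", y))
            cur.append(min(prev[j] + cost("delete", x),
                           cur[j - 1] + cost("insert", y),
                           keep_or_replace))
        prev = cur
    return prev[-1]
```

Passing a custom `cost` function is how the cost can depend on both the operator and the operand, for example by charging more for operations on file-write events than on reads.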
k-medoid partitioning clustering algorithm
1. Place K points into the space. These points represent the initial group medoids.
2. Assign each object to the group that has the closest medoid.
3. Recalculate the positions of the K medoids.
4. Repeat steps 2 and 3 until the medoids no longer move.
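The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: initial medoids are picked at random from the objects themselves (a medoid is always a member of the data set), a medoid is recalculated as the member with the smallest total distance to the rest of its group, and the function names, `max_iter`, and `seed` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def assign(objects: Sequence[T], medoids: List[int],
           distance: Distance) -> Dict[int, List[int]]:
    """Step 2: map each medoid index to the indices of its closest objects."""
    clusters: Dict[int, List[int]] = {m: [] for m in medoids}
    for i, obj in enumerate(objects):
        if i in clusters:
            clusters[i].append(i)  # a medoid always stays in its own group
        else:
            closest = min(medoids, key=lambda m: distance(obj, objects[m]))
            clusters[closest].append(i)
    return clusters


def k_medoids(objects: Sequence[T], k: int, distance: Distance,
              max_iter: int = 100, seed: int = 0
              ) -> Tuple[List[int], Dict[int, List[int]]]:
    # Step 1: choose K objects as the initial medoids.
    medoids = random.Random(seed).sample(range(len(objects)), k)
    for _ in range(max_iter):
        clusters = assign(objects, medoids, distance)
        # Step 3: the new medoid of a group is the member with the
        # smallest summed distance to all members of that group.
        new_medoids = [
            min(members,
                key=lambda c: sum(distance(objects[c], objects[o]) for o in members))
            for members in clusters.values()
        ]
        # Step 4: stop once the medoids no longer move.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(objects, medoids, distance)
```

Because the algorithm only ever calls `distance`, it works on opaque objects such as event sequences; any sequence distance can be passed in place of the numeric one used for testing.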
Classifying a new object
• Nearest Neighbor Classification
• Compare the new object to all the medoids.
• Assign the new object the family name of the closest medoid.
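A minimal sketch of this nearest-medoid step is shown below. It assumes each medoid is stored together with the family name of its cluster; the function name, the pair layout, and the family labels in the usage note are hypothetical.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def classify(new_object: T,
             labeled_medoids: Sequence[Tuple[T, str]],
             distance: Callable[[T, T], float]) -> str:
    """Return the family name of the medoid closest to new_object."""
    _, family = min(labeled_medoids,
                    key=lambda pair: distance(new_object, pair[0]))
    return family
```

For example, with numeric stand-ins for behavior traces, `classify(9, [(2, "FamilyA"), (11, "FamilyB")], lambda a, b: abs(a - b))` selects `"FamilyB"`. Only K distance computations are needed per new object, which is what makes the case-based approach scale.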
Experiment
• An automated distributed replication system
Data Analysis
• Test data:
  - Experiment 1: 461 samples of 3 families
  - Experiment 2: 760 samples of 11 families
• 10-fold cross validation
• We vary and contrast experiments by adjusting two parameters:
  - number of clusters (K)
  - maximum number of events (E)
• Measure error rate and accuracy gain
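For readers unfamiliar with 10-fold cross validation, the sketch below shows one common way to produce the folds: shuffle the sample indices once, cut them into ten parts, and use each part in turn as the test set. The slides do not say how the folds were drawn, so the shuffling, the seed, and the function name are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, n_folds: int = 10,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding spreads the remainder so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 461 samples of Experiment 1, this gives one test fold of 47 samples and nine of 46; each sample is used for testing exactly once, and the reported error rate would be an aggregate over the ten runs.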
• Error rate is defined as $ER = \dfrac{\text{number of incorrectly classified samples}}{\text{total number of samples}}$
• Accuracy: $AC = 1 - ER$
• Accuracy gain of $x$ over $y$: $G(x, y) = \left| \dfrac{ER(y) - ER(x)}{ER(x)} \right|$
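The three measures translate directly into code. The sketch below follows the definitions as stated; the function names are hypothetical, and raising an error when ER(x) is zero is a choice made here, since the formula is undefined in that case.

```python
from typing import Sequence


def error_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """ER = incorrectly classified samples / total samples."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """AC = 1 - ER."""
    return 1.0 - error_rate(predicted, actual)


def accuracy_gain(er_x: float, er_y: float) -> float:
    """G(x, y) = |(ER(y) - ER(x)) / ER(x)|."""
    if er_x == 0:
        raise ValueError("gain is undefined when ER(x) is zero")
    return abs((er_y - er_x) / er_x)
```

As a worked case, if configuration x has an error rate of 0.25 and configuration y has 0.5, then G(x, y) = |(0.5 - 0.25) / 0.25| = 1.0, meaning the difference in error is as large as x's own error rate.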
Observations
• Accuracy vs. number of clusters: the error rate decreases as the number of clusters increases.
• Accuracy vs. maximum number of events: the error rate decreases as the event cap increases. The more events we observe, the more accurately we capture the behavior, and the more likely the clustering is to discover the semantic similarity among variants of a family.
• Accuracy gain vs. number of events: the gain in accuracy is more substantial at lower event caps (100 vs. 500) than at higher event caps (500 vs. 1000).
• Accuracy vs. number of families: the 11-family experiment is more accurate than the 3-family experiment in high event cap tests (1000), but the result is the opposite in lower event cap tests (100).
Conclusion
• Runtime behavior plus machine learning allows us to focus on pattern/similarity recognition in behavior semantics
• Limitation: lack of code structural information
• Combine with static analysis to improve classification accuracy
• “Developing automated classification process that applies classifiers with innate learning ability on near lossless knowledge representation is the key to the future of malware classification and defense.”