
Presented by: Rag Mayur Chevuri Dept of Computer & Information Sciences University of Delaware

Presentation Transcript


  1. Behavioural Classification, by Tony Lee and Jigar J Mody. Presented by: Rag Mayur Chevuri, Dept of Computer & Information Sciences, University of Delaware

  2. Automatic malware classification • Human analysis is inefficient and inadequate • Large number of new virus/spyware families • Our focus: the classification problem • Effective classification enables • Better detection • Better cleaning • Better analysis solutions

  3. Classification Process

  4. Objectives of classification methodologies • Capture knowledge efficiently and automatically • Minimal information loss • Structured to be stored, analyzed and referenced efficiently

  5. Objectives of classification methodologies (contd.) • Applies learned knowledge to automatically identify familiar patterns and similarity relations in a given target • Adaptable, with innate learning abilities

  6. Approach • Automated classification method based on runtime behavioral data and machine learning • Represent a file by its runtime behavior • Structure the event information • Store it in a database • Construct classifiers • Apply the classifiers to new objects

  7. A “good” knowledge representation • Effectively captures knowledge of the object it represents • Can persist in permanent storage • Enables classifiers to efficiently and effectively correlate data across a large number of objects

  8. Representing behavior • The meaning of a particular action is its resulting state • Construct the representation in a consistent canonical format The Vector Approach • Process data in vector format using statistical and probabilistic algorithms • Problems: vector size, scalability, and factorability

  9. The Opaque Object Approach • Objects represent data in a rich syntax • Rich semantic representation of the actual object • A precise distance between objects is used for clustering

  10. Event Representation • A sequence of events • Ordered by • the time of occurrence of program actions • environment state transitions

  11. Event Properties • Event ID • Event object (e.g. registry, file, process, socket, etc.) • Event subject, if applicable (i.e. the process that takes the action) • Kernel function called, if applicable • Action parameters (e.g. registry value, file path, IP address) • Status of the action (e.g. file handle created, registry key removed, etc.)
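The event record listed above can be sketched as a small data structure. The field names and the sample trace below are illustrative stand-ins, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical sketch of one behavioral event record; fields mirror the
# properties on this slide (ID, object, subject, kernel function,
# parameters, status). Names are assumptions for illustration.
@dataclass
class Event:
    event_id: int
    obj: str                         # registry, file, process, socket, ...
    subject: Optional[str]           # process that takes the action, if any
    kernel_function: Optional[str]   # kernel function called, if any
    parameters: Tuple[str, ...]      # e.g. registry value, file path, IP
    status: str                      # e.g. "file handle created"

# A file's behavior is then a time-ordered sequence of such events.
trace = [
    Event(1, "registry", "sample.exe", "NtSetValueKey",
          (r"HKLM\Software\Run", "payload.exe"), "registry value set"),
    Event(2, "file", "sample.exe", "NtCreateFile",
          (r"C:\Windows\payload.exe",), "file handle created"),
]
```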

  12. An example (Register Event)

  13. Generate Classifier for Classification

  14. Which classifier? • A case-based classifier, treating the existing malware collection as a database of solutions • Learn by case-based reasoning (CBR) • Nearest-neighbor algorithms • To make the CBR approach scalable, apply clustering

  15. Clustering • Unsupervised learning • Organizes objects into clusters • A cluster is a collection of objects that are “similar” to each other and “dissimilar” to objects belonging to other clusters

  16. Distance Measure • Levenshtein Distance: “the minimum cost required to transform one sequence of objects into another sequence by applying a set of operations” • Operation = Op(Event) • Cost(Transformation) = Σᵢ Cost(Operationᵢ) • The cost of an operation depends on the operator as well as the operand

  17. Operation Cost Matrix for Similarity Measure

  18. k-medoid partitioning clustering algorithm 1. Place K points into the space; these points represent the initial group medoids. 2. Assign each object to the group with the closest medoid. 3. Recalculate the positions of the K medoids. 4. Repeat steps 2 and 3 until the medoids no longer move.
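The four steps above can be sketched as a PAM-style loop over a pairwise distance function (e.g. the event-sequence Levenshtein distance). This is an illustrative implementation, not the paper's:

```python
import random

# Minimal k-medoids sketch following the slide's four steps.
# `dist` is any pairwise distance function over the objects.
def k_medoids(objects, k, dist, max_iter=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(objects)), k)     # step 1: K initial medoids
    for _ in range(max_iter):
        # step 2: assign each object to its closest medoid
        clusters = {m: [] for m in medoids}
        for i, obj in enumerate(objects):
            closest = min(medoids, key=lambda m: dist(obj, objects[m]))
            clusters[closest].append(i)
        # step 3: recompute each medoid as the member minimizing total
        # distance to the rest of its cluster
        new_medoids = [min(members,
                           key=lambda i: sum(dist(objects[i], objects[j])
                                             for j in members))
                       for members in clusters.values()]
        # step 4: stop when the medoids no longer move
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, clusters
```

Medoids (actual cluster members) rather than means are what make this workable here: there is no meaningful "average" of two event sequences, but there is always a most-central member.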

  19. Classifying a new object • Nearest-neighbor classification • Compare the new object to all the medoids • Assign the new object the family name of the closest medoid
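The nearest-neighbor step reduces to one comparison per medoid. In this sketch, medoids are (representative, family) pairs; the scalar representatives and family names are illustrative stand-ins for event sequences labeled with their malware family:

```python
# Assign a new object the family name of its closest medoid.
# `medoids` is a list of (representative_object, family_name) pairs;
# `dist` is the same distance used during clustering.
def classify(sample, medoids, dist):
    representative, family = min(medoids, key=lambda pair: dist(sample, pair[0]))
    return family
```

Comparing only against the K medoids, instead of the whole collection, is what makes the CBR approach scale.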

  20. Experiment • An automated distributed replication system

  21. Data Analysis • Test data: Experiment 1, 461 samples from 3 families; Experiment 2, 760 samples from 11 families • 10-fold cross-validation • Experiments are varied and contrasted by adjusting two parameters: the number of clusters (K) and the maximum number of events (E) • Measures: error rate & accuracy gain

  22. Error rate: ER = (number of incorrectly classified samples) / (total number of samples) • Accuracy: AC = 1 − ER • Accuracy gain of x over y: G(x, y) = |(ER(y) − ER(x)) / ER(x)|
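The three metrics above are direct one-liners; the sample counts in the test below are made up for illustration:

```python
# Evaluation metrics from this slide, written out directly.
def error_rate(n_incorrect, n_total):
    return n_incorrect / n_total

def accuracy(n_incorrect, n_total):
    return 1.0 - error_rate(n_incorrect, n_total)

def accuracy_gain(er_x, er_y):
    # gain of configuration x over y, normalized by x's error rate
    return abs((er_y - er_x) / er_x)
```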

  23. Experiment A

  24. Observations • Accuracy vs. #Clusters: the error rate falls as the number of clusters increases • Accuracy vs. Maximum #Events: the error rate falls as the event cap increases; the more events we observe, the more accurately behavior is captured, and the more likely clustering is to discover the semantic similarity among variants of a family

  25. • Accuracy Gain vs. Number of Events: the gain in accuracy is more substantial at lower event caps (100 vs. 500) than at higher event caps (500 vs. 1000) • Accuracy vs. Number of Families: the 11-family experiment outperforms the 3-family experiment in accuracy at the high event cap (1000), but the result is the opposite at the lower event cap (100)

  26. Conclusion • Runtime behavior + machine learning let us focus on pattern/similarity recognition in behavior semantics • Lacks code structural information • Combining static analysis could improve classification accuracy • “Developing an automated classification process that applies classifiers with innate learning ability on near-lossless knowledge representation is the key to the future of malware classification and defense.”

  27. References • Kephart, J., Chess, D. and White, S. (1997). Blueprint for a Computer Immune System. • Ford, R. A. and Thompson, H. H. (2004). The Future of Proactive Virus Detection. • Wagner, M. (2004). Behavior Oriented Detection of Malicious Code at Run-time. M.Sc. thesis, Florida Institute of Technology. • Ford, R. and Michalske, J. (2004). Gatekeeper II: New Approaches to Generic Virus Prevention.
