
Automatic Malware Behavior Analysis using Machine Learning

This paper presents a framework for the automatic analysis of malware behavior using machine learning. The framework discovers novel classes of malware with similar behavior (clustering) and assigns unknown malware to these discovered classes (classification). Malware binaries are executed in a sandbox, the resulting behavior reports are embedded in a vector space, and learning algorithms are applied to the embedded reports. The process combines prototype extraction, clustering, and classification in an incremental loop, so that variants of known malware are identified automatically and only the remaining reports need further analysis. In the experimental evaluation, clustering reaches F-measures of 0.93 to 0.95 and classification 0.96 to 0.99.



Presentation Transcript


  1. Automatic Analysis of Malware Behavior using Machine Learning • Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz • Presented by: Satyajeet, Dept. of Computer & Information Sciences, University of Delaware

  2. Abstract & Introduction • Malware poses a major threat to the security of computer systems. • It is very diverse: viruses, Internet worms, Trojan horses. • It is widespread: millions of hosts are infected. • Obfuscation and polymorphism impede detection at the file level. • Dynamic analysis helps in characterizing malware and defending against it.

  3. Abstract & Introduction Contd. • A framework for the automatic analysis of malware behavior using machine learning. • Clustering: the framework automatically identifies novel classes of malware with similar behavior. • Classification: unknown malware is assigned to these discovered classes. • An incremental approach combines both for behavior-based analysis.

  4. Automatic Analysis of Malware Behavior • Framework steps and procedure: • Malware binaries are executed and monitored in a sandbox environment, which generates a report of the system calls and their arguments. • The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern. • ML techniques are then applied to the embedded reports to identify and classify malware. • Incremental analysis progresses by alternating between clustering and classification.

  5. Report representation • Sandbox reports can be textual or XML. • These formats are human readable and suitable for computing general statistics, • but they are not efficient for automatic analysis. • Hence MIST (Malware Instruction Set), • inspired by instruction sets used in processor design.

  6. MIST • Category: the category of system calls. • Operation: reflects a particular system call. • Arguments: represented as argblocks.

  7. Sandbox and MIST representation

  8. Representation • The sequential reports capture typical malware behavior, such as changing registry keys or modifying system files. • They are still not suitable for efficient analysis techniques, hence the need to embed behavior reports in a vector space using instruction q-grams. • This embedding allows the similarity of behavior to be expressed geometrically, by calculating distances between vectors.
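The q-gram embedding and the distance computation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary embedding (each distinct q-gram is one dimension) normalized to unit length and a Euclidean distance, neither of which the slides specify. The instruction names and the helper names `qgram_embedding` and `distance` are hypothetical.

```python
from math import sqrt

def qgram_embedding(instructions, q=2):
    """Embed an instruction sequence as a sparse, unit-length q-gram vector.

    Assumption: each distinct q-gram is one dimension with a binary weight,
    and the vector is scaled to length 1.
    """
    grams = {tuple(instructions[i:i + q])
             for i in range(len(instructions) - q + 1)}
    if not grams:
        return {}
    weight = 1.0 / sqrt(len(grams))
    return {gram: weight for gram in grams}

def distance(x, y):
    """Euclidean distance between two sparse vectors stored as dicts."""
    dims = set(x) | set(y)
    return sqrt(sum((x.get(d, 0.0) - y.get(d, 0.0)) ** 2 for d in dims))

# Hypothetical, simplified behavior reports (not real MIST instructions).
report_a = ["load_dll", "open_file", "write_file", "set_registry"]
report_b = ["load_dll", "open_file", "write_file", "connect"]
report_c = ["connect", "send", "recv", "send"]

vec_a = qgram_embedding(report_a)
vec_b = qgram_embedding(report_b)
vec_c = qgram_embedding(report_c)

d_ab = distance(vec_a, vec_b)  # reports sharing two of three 2-grams
d_ac = distance(vec_a, vec_c)  # reports sharing no 2-grams
```

Under these assumptions, reports that share most q-grams lie close together, while reports with no q-grams in common are at the maximum distance of √2.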

  9. Clustering and Classification • Once the reports are embedded in a vector space, they are ready for ML techniques. • Clustering of behavior identifies classes of malware with similar behavior. • Classification of behavior assigns malware to known classes of behavior. • What allows us to do this? • Malware binaries typically belong to families of similar variants with similar behavior patterns.

  10. Contd..

  11. Algorithms • Prototype extraction: • An iterative algorithm. • Extracts a small set of prototypes from the set of reports; the first one is chosen at random. • Clustering using prototypes: • At the beginning, each prototype is an individual cluster. • The algorithm repeatedly determines and merges the nearest pair of clusters. • Classification using prototypes: • Allows the system to learn to discriminate between classes of malware.
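The prototype extraction and prototype-based clustering described above can be sketched as follows. The slides only state that extraction is iterative with a random first prototype and that clustering merges the nearest clusters; this sketch assumes a farthest-first selection rule, complete-linkage merging, and two distance thresholds (`max_dist`, `merge_dist`), all of which are illustrative choices. Reports are represented as small dense tuples for brevity.

```python
import random
from math import dist  # Python 3.8+

def extract_prototypes(points, max_dist, seed=0):
    """Pick prototype indices until every point is within max_dist of one.

    Assumption: farthest-first selection, starting from a random point.
    """
    rng = random.Random(seed)
    prototypes = [rng.randrange(len(points))]
    # Distance from every point to its nearest prototype so far.
    nearest = [dist(p, points[prototypes[0]]) for p in points]
    while max(nearest) > max_dist and len(prototypes) < len(points):
        far = max(range(len(points)), key=nearest.__getitem__)
        prototypes.append(far)
        nearest = [min(d, dist(p, points[far]))
                   for d, p in zip(nearest, points)]
    return prototypes

def cluster_prototypes(points, prototypes, merge_dist):
    """Agglomerative clustering of prototypes (complete linkage, assumed)."""
    clusters = [[p] for p in prototypes]  # each prototype starts alone

    def linkage(a, b):
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > merge_dist:
            break  # nearest pair is too far apart; stop merging
        clusters[i] += clusters.pop(j)
    return clusters

# Two well-separated groups of toy points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
protos = extract_prototypes(points, max_dist=0.05)
clusters = cluster_prototypes(points, protos, merge_dist=1.0)
```

A larger `max_dist` yields fewer prototypes and therefore a smaller clustering problem, which is the point of extracting prototypes first.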

  12. Algorithms Contd. • For each report, the classification algorithm determines the nearest prototype among the clusters in the training data; if the report lies within that prototype's radius, it is assigned to the cluster. • Otherwise the report is rejected and held back for later incremental analysis. • Incremental analysis: • Reports to be analyzed are received from a source. • They are first classified using the prototypes of known clusters, • so that variants of known malware are identified for further analysis. • Prototypes are then extracted from the remaining reports, which are clustered again.
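The classification-with-rejection step of the incremental analysis can be sketched as follows. This is a simplified illustration under stated assumptions: a single global `radius` is used for all prototypes, reports are dense tuples compared with Euclidean distance, and the family labels and function names are hypothetical.

```python
from math import dist  # Python 3.8+

def classify(report, prototypes, radius):
    """Return the label of the nearest prototype, or None if rejected.

    prototypes is a list of (vector, label) pairs from known clusters.
    """
    if not prototypes:
        return None
    vector, label = min(prototypes, key=lambda p: dist(report, p[0]))
    return label if dist(report, vector) <= radius else None

def incremental_step(reports, prototypes, radius):
    """Split a batch into reports assigned to known clusters and rejected ones."""
    assigned, rejected = {}, []
    for index, report in enumerate(reports):
        label = classify(report, prototypes, radius)
        if label is None:
            rejected.append(index)  # held back for renewed clustering
        else:
            assigned[index] = label
    return assigned, rejected

# Prototypes of two known (hypothetical) clusters and a new batch of reports.
known = [((0.0, 0.0), "family_a"), ((5.0, 5.0), "family_b")]
batch = [(0.1, 0.0), (5.0, 4.9), (10.0, 10.0)]
assigned, rejected = incremental_step(batch, known, radius=0.5)
```

The rejected reports are the ones that would be passed on to prototype extraction and clustering, and the prototypes of any new clusters would be added to `known` before the next batch.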

  13. Experiments and Results

  14. Evaluating components • Prototype extraction: • Evaluated using precision, recall, and compression. • Precision: 0.99 when the corpus is compressed by 2.9% and 7%. • Clustering: • Evaluated using the F-measure. • F-measure in the experiments: MIST 1 = 0.93 and MIST 2 = 0.95, better than the 0.881 of previous related work. • Classification: • F-measure in the experiments: MIST 1 = 0.96 and MIST 2 = 0.99.
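The slides do not define the evaluation measures, so the following is a sketch under an assumed, commonly used definition for clusterings: precision counts, for each cluster, the reports belonging to its majority label; recall counts, for each label, the reports falling into its largest cluster; and the F-measure is their harmonic mean. The function name and the toy data are hypothetical.

```python
from collections import Counter

def clustering_f_measure(cluster_ids, labels):
    """Return (precision, recall, F-measure) of a clustering against labels."""
    n = len(labels)
    per_cluster, per_label = {}, {}
    for cluster, label in zip(cluster_ids, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
        per_label.setdefault(label, Counter())[cluster] += 1
    # Precision: how pure each cluster is with respect to its majority label.
    precision = sum(max(c.values()) for c in per_cluster.values()) / n
    # Recall: how much of each label ends up together in one cluster.
    recall = sum(max(c.values()) for c in per_label.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy example: six reports, three clusters, two true labels.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["a", "a", "b", "b", "b", "a"]
precision, recall, f_measure = clustering_f_measure(cluster_ids, labels)
```

In this toy example precision is 5/6 and recall is 4/6, so many small, pure clusters raise precision at the cost of recall; the F-measure balances the two.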

  15. Experiments and Results Contd..

  16. Experiments and Results Contd..

  17. Conclusion • A new framework is introduced that overcomes several deficiencies of previous approaches. • The framework is learning based. • The framework can be implemented in practice. • Steps: collect malware, monitor it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification). • The process is efficient and learns automatically after the initial setup and run.

  18. Thank you !
