
Faster, More Effective Flowgraph -based Malware Classification


Presentation Transcript


  1. Faster, More Effective Flowgraph-based Malware Classification Silvio Cesare silvio.cesare@gmail.com http://www.foocodechu.com Ph.D. Candidate, Deakin University

  2. Who am I and where did this talk come from? • Ph.D. Candidate at Deakin University. • Research • Malware detection. • Automated vulnerability discovery (check out my other talk in the main conference). • Did a Masters by research in malware • “Fast automated unpacking and classification of malware”. • Presented last year at Ruxcon 2010. • This current work extends last year’s work.

  3. Motivation • Traditional AV works well on known samples. • Doesn’t detect unknown samples. • Doesn’t detect “suspiciously similar” samples. • Uses strings as a signature or “birthmark”. • Compares birthmarks by equality.

  4. What can be done? • Birthmarks can be program structure. • More static among malware variants. • Birthmarks can be compared using “approximate similarity”. • Able to detect unknown samples that are suspiciously similar to known malware. • Vastly reduce number of required signatures.

  5. The Software Similarity Problem

  6. The Control Flow Birthmark • Control flow is more invariant than byte strings across polymorphic and metamorphic malware variants. • A directed graph representing control flow. • A control flow graph for every procedure. • One call graph per program.
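
  A minimal sketch of the data structures this birthmark implies (an adjacency-list control flow graph per procedure plus one program-wide call graph); the type and field names are illustrative, not taken from the talk's system.

      #include <map>
      #include <set>
      #include <string>
      #include <vector>

      // Illustrative sketch: one directed control flow graph per procedure
      // and a single call graph for the whole program, as described on this
      // slide.
      struct ControlFlowGraph {
          // Basic blocks are numbered 0..n-1; edges[i] holds the successors
          // of block i.
          std::vector<std::set<int>> edges;
      };

      struct ProgramBirthmark {
          // One CFG per procedure, keyed by procedure name.
          std::map<std::string, ControlFlowGraph> procedures;
          // Call graph: the set of procedures each procedure calls.
          std::map<std::string, std::set<std::string>> call_graph;
      };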

  7. Graphs

  8. Graph Equality • Known as the “Graph Isomorphism” problem. • Identifies equivalent “structure”. • Not known to be NP-complete, but no polynomial time algorithm is known either.

  9. Graph Edit Distance • The number of basic operations (node and edge insertions, deletions and relabellings) needed to transform one graph into another. • If you know the distance between two objects, you know their similarity. • NP-hard to compute exactly, so it is infeasible in practice.

  10. Decompilation

  11. Q-Grams • Input is a string. • Extract all substrings of fixed size Q. • Substrings are known as q-grams. • Let’s take q-grams of all decompiled graphs.
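
  A minimal sketch of q-gram extraction from one decompiled graph string; the function name and the example value Q = 4 are assumptions for illustration, not values from the talk.

      #include <cstddef>
      #include <string>
      #include <vector>

      // Illustrative sketch: extract every substring of fixed length q from
      // a decompiled graph string.  q = 4 is an assumed example value.
      std::vector<std::string> extract_qgrams(const std::string &s,
                                              std::size_t q = 4) {
          std::vector<std::string> grams;
          if (s.size() < q)
              return grams;
          for (std::size_t i = 0; i + q <= s.size(); ++i)
              grams.push_back(s.substr(i, q));
          return grams;
      }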

  12. Feature Vectors • An array <E1,...,En>. • A feature vector describes the number of occurrences of each feature. • Ei is the number of times feature i occurs. • Let’s use the 500 most common q-grams as the features. • We use feature vectors as birthmarks.
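
  A minimal sketch of building such a birthmark, assuming the feature set (e.g. the 500 most common q-grams across the database) has already been chosen; names are illustrative.

      #include <map>
      #include <string>
      #include <vector>

      // Illustrative sketch: count how often each chosen feature (q-gram)
      // occurs in one program's q-grams, producing its feature vector.
      std::vector<int> make_feature_vector(
              const std::vector<std::string> &features,
              const std::vector<std::string> &qgrams) {
          std::map<std::string, int> counts;
          for (const std::string &g : qgrams)
              ++counts[g];
          std::vector<int> v;
          v.reserve(features.size());
          for (const std::string &f : features) {
              auto it = counts.find(f);
              v.push_back(it == counts.end() ? 0 : it->second);
          }
          return v;
      }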

  13. Vector Distance • A vector is an n-dimensional point. • E.g. a 2-d vector is <x,y>. • Computing the distance between two vectors is fast.
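
  The slide does not name a specific distance function, so as a minimal sketch here is Manhattan (L1) distance between two feature vectors; it is one common choice and also has the metric properties relied on later for metric trees.

      #include <cstddef>
      #include <cstdlib>
      #include <vector>

      // Illustrative sketch: Manhattan (L1) distance between two feature
      // vectors.  One common choice of vector distance, not necessarily the
      // exact function used in the talk's system.
      int manhattan_distance(const std::vector<int> &a,
                             const std::vector<int> &b) {
          int d = 0;
          for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
              d += std::abs(a[i] - b[i]);
          return d;
      }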

  14. Nearest Neighbour Search • Software similarity problem extended to similarity search over a database. • Find nearest neighbours (by distance) of a query. • Or find neighbours within a distance of the query.
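
  A minimal sketch of the naive approach: a brute-force range search that scans the whole database, reusing the manhattan_distance sketch above. The metric trees on the next slides exist precisely to avoid this full scan.

      #include <string>
      #include <utility>
      #include <vector>

      int manhattan_distance(const std::vector<int> &a,
                             const std::vector<int> &b);  // sketch above

      // Illustrative sketch: return (name, distance) for every database
      // entry within `threshold` of the query birthmark.
      struct DatabaseEntry {
          std::string name;
          std::vector<int> birthmark;
      };

      std::vector<std::pair<std::string, int>>
      brute_force_range_search(const std::vector<DatabaseEntry> &db,
                               const std::vector<int> &query, int threshold) {
          std::vector<std::pair<std::string, int>> matches;
          for (const DatabaseEntry &e : db) {
              int d = manhattan_distance(e.birthmark, query);
              if (d <= threshold)
                  matches.push_back({e.name, d});
          }
          return matches;
      }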

  15. The Software Similarity Search

  16. Metric Trees • The vector distance used here is a metric: it is non-negative, symmetric, and satisfies the triangle inequality. • This means you can do a nearest neighbour search without brute forcing the entire database!
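
  The talk does not say which metric tree the system uses, so as an illustration of why the metric properties matter, here is a sketch of range search in a vantage-point tree, again reusing the manhattan_distance sketch; the triangle inequality is what lets whole subtrees be skipped.

      #include <memory>
      #include <vector>

      int manhattan_distance(const std::vector<int> &a,
                             const std::vector<int> &b);  // sketch above

      // Illustrative sketch: a vantage-point tree node.  Points closer to
      // `point` than `mu` live in the `inside` subtree, the rest in `outside`.
      struct VpNode {
          std::vector<int> point;           // this node's feature vector
          int mu = 0;                       // split radius
          std::unique_ptr<VpNode> inside;
          std::unique_ptr<VpNode> outside;
      };

      // Collect every node within `threshold` of the query, pruning subtrees
      // that the triangle inequality proves cannot contain a match.
      void vp_range_search(const VpNode *node, const std::vector<int> &query,
                           int threshold, std::vector<const VpNode *> &out) {
          if (node == nullptr)
              return;
          int d = manhattan_distance(node->point, query);
          if (d <= threshold)
              out.push_back(node);
          // A match inside the ball of radius mu requires d - threshold <= mu.
          if (d - threshold <= node->mu)
              vp_range_search(node->inside.get(), query, threshold, out);
          // A match outside that ball requires d + threshold >= mu.
          if (d + threshold >= node->mu)
              vp_range_search(node->outside.get(), query, threshold, out);
      }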

  17. Implementation • The system is about 100,000 lines of C++. • The modules for this work are under 3,000 lines of code. • The system translates x86 into an intermediate language (IL). • Analysis is performed on the architecture-independent IL. • Malware is unpacked using an application level emulator.

  18. Evaluation – False Positives • Database of 10,000 malware samples. • Scanned 1,601 benign binaries. • 10 false positives. Less than 1%. • Using an additional refinement algorithm, this was reduced to 7 false positives. • Very small binaries have small signatures and cause weak matching.

  19. Evaluation - Effectiveness • Calculated similarity between Roron malware variants. • Compared results to the Ruxcon 2010 work. • In the tables, highlighted cells indicate a positive match. • The more matches, the more effective the approach.

  20. Malware Variant Detection • Match matrices over the Roron variants compare three approaches: Exact Matching (Ruxcon 2010), Heuristic Approximate Matching (Ruxcon 2010), and Q-Grams.

  21. Evaluation - Efficiency • Faster than the Ruxcon 2010 system. • Median benign processing time is 0.06s. • Median malware processing time is 0.84s. • The slowest result may be due to memory thrashing.

  22. Conclusion • Improved effectiveness and efficiency compared to Ruxcon 2010. • Runs in real time in the expected case. • Large, functional code base and years of development time. • Happy to talk to vendors.

  23. Further Information • Full academic paper at IEEE Trustcom. • Research page http://www.foocodechu.com • Book on “Software similarity and classification” available in 2012. • Wiki on software similarity and classification http://www.foocodechu.com/wiki
