Exploiting diverse observation perspectives to get insights on the malware landscape

Corrado Leita, Symantec Research Labs; Ulrich Bayer, Technical University Vienna; Engin Kirda, Institute Eurecom @ iSecLab

Outline • Introduction • Related Work • SGNET and EPM Clustering • Results • Conclusion ADLab Meeting

Introduction

Introduction • 30,000 samples per day are submitted to the VirusTotal website • On the order of millions of samples per month • Malware writers can generate new code from existing code bases or by re-packing binaries with code obfuscation tools • e.g., the Allaple worm

Introduction • A complete picture of the complexity of the malware landscape is possible only by distinguishing polymorphic instances from new variants • Goal: get quantitative insights into the interrelations among the different families, and into the extent to which malware writers share code and produce patches to known variants

Introduction • SGNET dataset • Combine clustering techniques based on both static and behavioral characteristics of the malware samples

Related Work

Related Work • Gheorghescu, 2005 • Disassembles samples • Compares their basic blocks • Kolter and Maloof, 2006 • Compare hex dumps of code segments • Wicherski, 2009, peHash • Polymorphic binaries receive the same hash value • The hash is computed over the portions of the PE header that are not mutated

Related Work • Lee and Mody, 2006 • Based on system call traces • One of the first attempts to cluster malware according to its behavior • Bailey et al., 2007 • The first to build a clustering system that describes a sample’s behavior in more abstract terms • O(n^2)

Related Work • Anubis • http://anubis.iseclab.org/ • Dynamic analysis system for capturing a sample’s behavior • Data tainting • Tracking of sensitive compare operations

SGNET and EPM Clustering

SGNET and EPM Clustering • SGNET focuses on collecting detailed information on code injection attacks and on the sources responsible for these attacks • VirusTotal • Anubis

SGNET and EPM Clustering • SGNET • ScriptGen • Learning 0-day behavior • Argos • Program flow hijack detection • Nepenthes • Shellcode emulation • Malware download

SGNET and EPM Clustering • Sensor: ScriptGen FSM • Sample Factory: Argos • Shellcode handlers: Nepenthes

EPM Clustering

EPM Clustering • Epsilon-Gamma-Pi-Mu (EGPM) model • Exploit (ε) • Bogus control data (γ) • Payload (π) • Malware (μ) • Assumption: any randomization performed by the attacker has a limited scope • γ is not considered, due to the lack of host-based information in the SGNET dataset

EPM Clustering • Phase 1: feature definition

EPM Clustering • Pi • PUSH-based interaction • PULL-based interaction • Central repository • Mu • PE header characteristics seem to be more difficult to mutate • A change in their values is likely to be associated with a modification or recompilation of the existing codebase

EPM Clustering • Clearly, all of the features taken into account for the classification could be easily randomized by the malware writer • More complex (and costly) polymorphic approaches might appear in the future

EPM Clustering • Phase 2: invariant discovery • An invariant is a value that is not specific to a single: • Attack instance • Attacker • Destination • Threshold-based: a value must be seen in • At least 10 different attack instances • At least 3 different attackers • At least 3 honeypot IPs
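
The threshold rule on this slide can be illustrated with a minimal sketch. It assumes each attack instance is a dict with hypothetical keys `attacker` and `honeypot` plus one key per feature; these names, and the toy data, are illustrative and not taken from SGNET itself.

```python
from collections import defaultdict

# Thresholds quoted on the slide.
MIN_INSTANCES = 10
MIN_ATTACKERS = 3
MIN_HONEYPOTS = 3


def find_invariants(events, feature):
    """Return the values of `feature` that pass all three thresholds.

    `events` is an iterable of dicts (hypothetical layout): one dict per
    attack instance, with 'attacker', 'honeypot' and the feature key.
    """
    count = defaultdict(int)        # value -> number of attack instances
    attackers = defaultdict(set)    # value -> distinct attackers
    honeypots = defaultdict(set)    # value -> distinct honeypot IPs
    for ev in events:
        value = ev[feature]
        count[value] += 1
        attackers[value].add(ev["attacker"])
        honeypots[value].add(ev["honeypot"])
    return {
        v for v in count
        if count[v] >= MIN_INSTANCES
        and len(attackers[v]) >= MIN_ATTACKERS
        and len(honeypots[v]) >= MIN_HONEYPOTS
    }


# Toy data: 12 instances share one port value, one instance uses another.
events = [
    {"attacker": f"a{i % 4}", "honeypot": f"h{i % 3}", "port": 9988}
    for i in range(12)
] + [{"attacker": "a9", "honeypot": "h0", "port": 4444}]

print(find_invariants(events, "port"))  # {9988}
```

A value seen only once, or only from a single attacker, is treated as specific to that instance or attacker and is therefore not reported as an invariant.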

EPM Clustering • Phase 3: pattern discovery • T = (v1, v2, v3, …, vn)

EPM Clustering • Phase 4: pattern-based classification • Clustering • Multiple patterns could match the same instance • Each instance is always associated with the most specific pattern matching its feature values • All the instances associated with the same pattern are said to belong to the same EPM cluster
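
The "most specific pattern" rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: patterns are tuples T = (v1, …, vn), a field that is not an invariant is represented by a wildcard (`None` here), and specificity is taken to be the number of non-wildcard fields. The feature layout (download type, port, file name) is hypothetical.

```python
WILDCARD = None  # assumed marker for a "do not care" field


def matches(pattern, instance):
    """True if every non-wildcard field of `pattern` equals the instance's."""
    return len(pattern) == len(instance) and all(
        p is WILDCARD or p == v for p, v in zip(pattern, instance)
    )


def specificity(pattern):
    """Number of fixed (non-wildcard) fields; assumed measure of specificity."""
    return sum(p is not WILDCARD for p in pattern)


def classify(instance, patterns):
    """Return the most specific matching pattern, or None if nothing matches.

    Ties are broken by list order (max keeps the first maximal element).
    """
    candidates = [p for p in patterns if matches(p, instance)]
    return max(candidates, key=specificity, default=None)


def cluster(instances, patterns):
    """Group instances by the pattern they are assigned to."""
    clusters = {}
    for inst in instances:
        clusters.setdefault(classify(inst, patterns), []).append(inst)
    return clusters


# Toy patterns over (download type, port, file name).
patterns = [
    ("push", 9988, WILDCARD),
    ("push", WILDCARD, WILDCARD),
    ("push", 9988, "a.exe"),
]
instances = [
    ("push", 9988, "a.exe"),
    ("push", 9988, "b.exe"),
    ("push", 80, "c.exe"),
]
for pattern, members in cluster(instances, patterns).items():
    print(pattern, members)
```

All three toy patterns match the first instance, but it is assigned only to the fully specified one, so each instance ends up in exactly one cluster.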

EPM Clustering • E-clusters • Exploit • P-clusters • Payload • M-clusters • Malware

EPM Clustering • B-clusters • Anubis • Two samples are compared based on their behavioral profiles
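
The slide does not say how two behavioral profiles are compared. One common way to compare them, shown here purely as an assumption, is to represent each profile as a set of abstract behavior features and compute their Jaccard similarity. The feature strings and the `threshold` value below are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two behavioral profiles given as feature sets."""
    a, b = set(profile_a), set(profile_b)
    if not a and not b:
        return 1.0  # two empty profiles are treated as identical
    return len(a & b) / len(a | b)


def similar(profile_a, profile_b, threshold=0.7):
    """Hypothetical rule: profiles at or above `threshold` count as similar."""
    return jaccard(profile_a, profile_b) >= threshold


# Toy profiles; the feature strings are illustrative only.
p1 = {"file:create:x.exe", "reg:set:Run", "net:connect:6667"}
p2 = {"reg:set:Run", "net:connect:6667", "net:dns:example.org"}

print(jaccard(p1, p2))  # 0.5
print(similar(p1, p2))  # False
```

With a set-based measure like this, two samples that differ only in their bytes but perform the same actions receive a similarity of 1.0, which is the property a behavioral clustering needs when dealing with polymorphic binaries.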

Results

Results • Data: Jan 2008 to May 2009, collected by the SGNET deployment • 6353 malware samples • Only 5165 could be correctly executed in Anubis • Some malware samples could not be downloaded correctly by Nepenthes

Results • 39 E-clusters • 27 P-clusters • 260 M-clusters • 972 B-clusters

Results • #(exploit/payload combinations) is low • Most malware variants seem to be sharing few distinct exploitation routines for propagation • #(B-clusters) is lower than #(M-clusters) • Some M-clusters are likely to correspond to variations of the same codebase

Results • Clustering anomalies • 860 B-clusters are composed of a single malware sample and are associated with a single attack instance in the SGNET dataset • A small number of size-1 B-clusters have a 1-1 association with a static M-cluster • Mostly…
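
Anomalies of this kind surface when the static and behavioral clusterings are cross-referenced. The sketch below assumes a hypothetical mapping from sample ID to an (M-cluster, B-cluster) pair; the IDs and toy assignments are illustrative and are not SGNET data.

```python
from collections import Counter, defaultdict


def singleton_b_clusters(assignments):
    """Return the B-clusters that contain exactly one sample.

    `assignments` maps sample_id -> (m_cluster, b_cluster); this layout
    is an assumption made for the example.
    """
    sizes = Counter(b for _, b in assignments.values())
    return {b for b, n in sizes.items() if n == 1}


def b_clusters_per_m_cluster(assignments):
    """Map each M-cluster to the set of B-clusters its samples fall into."""
    mapping = defaultdict(set)
    for m, b in assignments.values():
        mapping[m].add(b)
    return dict(mapping)


# Toy assignments: one M-cluster is scattered over three size-1 B-clusters.
assignments = {
    "s1": (13, 1),
    "s2": (13, 2),
    "s3": (13, 3),
    "s4": (7, 4),
    "s5": (7, 4),
}

print(singleton_b_clusters(assignments))      # {1, 2, 3}
print(b_clusters_per_m_cluster(assignments))  # {13: {1, 2, 3}, 7: {4}}
```

An M-cluster whose samples spread over many size-1 B-clusters is a candidate for closer inspection, since the static view groups the samples together while the behavioral view does not.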

Results • P-pattern 45: • PUSH-based download • TCP port 9988

Results • M-cluster 13:

Results • M-cluster 13 is a polymorphic malware family associated with several different B-clusters • MD5 is not an invariant • Allaple mutates its content at each attack instance

Results • Each behavioral profile corresponds to an execution time of 4 minutes • Bot? Honeypots may help!

Results • Allaple • Worm exploiting MS04-007 • DoS attacks

Results • IRC servers

Conclusion

Conclusion • Combine different clustering techniques • Improve effectiveness in building intelligence on the threats economy
