
On Challenges in Evaluating Malware Clustering



  1. On Challenges in Evaluating Malware Clustering Peng Li University of North Carolina at Chapel Hill, NC, USA Limin Liu State Key Lab of Information Security, Graduate School of Chinese Academy of Sciences Debin Gao School of Information Systems, Singapore Management University, Singapore Mike Reiter University of North Carolina at Chapel Hill, NC, USA

  2. Malware Clustering? • Input: malware instances (executables) to be grouped by similarity • How is the distance between instances defined?

  3. Static vs. Dynamic • Static analysis vs. dynamic analysis systems: packers hinder static analysis; dynamic analysis yields traces (API, system call, etc.) • [Dullien et al., SSTIC2005] [Zhang et al., ACSAC2007] [Briones et al., VB2008] [Griffin et al., RAID2009] [Gheorghescu et al., VB2005] [Rieck et al., DIMVA2008] [Martignoni et al., RAID2008] [Bayer et al., NDSS2009]

  4. Ground-truth? Single Anti-virus Scanner [Lee et al., EICAR2006] [Rieck et al., DIMVA2008] [Hu et al., CCS2009]

  5. Ground-truth? Single Anti-virus Scanner • Inconsistency issue [Bailey et al., RAID2007]

  6. Ground-truth? Multiple Anti-virus Scanners [Bayer et al., NDSS2009] [Perdisci et al., NSDI2010] [Rieck et al., TR18-2009]

  7. Our Work • Proposed a conjecture that the "multiple-anti-virus-scanner voting" method of selecting ground-truth data biases evaluation results toward high accuracy • Designed experiments to test this conjecture, which produced conflicting signals • Revealed the effect of the cluster-size distribution on the significance of malware clustering results

  8. To Test Our Conjecture • Take a dataset "D" generated via "multiple-anti-virus-scanner voting" • Can we always get high accuracy when clustering it with a variety of techniques?

  9. Dataset "D1" • From [Bayer et al., NDSS 2009] • 2,658 malware instances, a subset of 14,212 malware instances • Ground-truth labels determined by majority voting among 6 different anti-virus programs
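A toy sketch of how such majority voting over scanner labels might look. The vote threshold, function name, and family labels below are illustrative assumptions, not the exact procedure of Bayer et al.:

```python
from collections import Counter

def majority_label(scanner_labels, min_votes=4):
    """Return the family label if at least min_votes scanners agree on it, else None."""
    label, votes = Counter(scanner_labels).most_common(1)[0]
    return label if votes >= min_votes else None

# Hypothetical family labels reported by 6 anti-virus programs for one instance.
scans = ["allaple", "allaple", "allaple", "allaple", "rbot", "virut"]
print(majority_label(scans))   # "allaple" -> instance kept with this reference label
```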

  10. A Variety of Techniques • MC1 (Malware Clustering #1) [Bayer et al., NDSS 2009]: monitors the execution of a program and creates its behavioral profile, abstracting system calls, their dependences, and network activities into a generalized representation consisting of OS objects and OS operations • PD1–PD3 are plagiarism detectors, which also attempt to detect similarity among a large number of software programs • PD1: similarity (string matching) of sequences of API calls [Tamada et al., ISFST 2004] • PD2: Jaccard similarity of short sequences of system calls [Wang et al., ACSAC 2009] • PD3: Jaccard similarity of short sequences of API calls
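A minimal sketch of the n-gram Jaccard distance underlying PD2/PD3. The n-gram length and the trace contents here are illustrative assumptions, not the paper's exact parameters:

```python
def ngrams(trace, n=4):
    """Set of length-n subsequences (n-grams) of a call trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def jaccard_distance(trace_a, trace_b, n=4):
    """1 - Jaccard similarity of the n-gram sets of two call traces."""
    a, b = ngrams(trace_a, n), ngrams(trace_b, n)
    return 0.0 if not (a or b) else 1.0 - len(a & b) / len(a | b)

# Two short, hypothetical API-call traces.
t1 = ["NtCreateFile", "NtWriteFile", "NtClose", "NtCreateKey", "NtSetValueKey"]
t2 = ["NtCreateFile", "NtWriteFile", "NtClose", "NtDeleteKey"]
print(jaccard_distance(t1, t2, n=3))   # 0.75
```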

  11. Clustering on D1 • Dynamic traces of D1 → distance matrices from MC1, PD1, PD2, PD3 → hierarchical clustering → comparison against the reference distribution
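A minimal sketch of the hierarchical-clustering step using SciPy, starting from a precomputed pairwise distance matrix. Single linkage and the cut threshold are illustrative choices, not necessarily those used in the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_distances(dist_matrix, threshold):
    """Cut a single-linkage dendrogram at a distance threshold."""
    condensed = squareform(dist_matrix, checks=False)   # linkage() wants condensed form
    tree = linkage(condensed, method="single")
    return fcluster(tree, t=threshold, criterion="distance")

# Toy distance matrix for 4 instances forming two obvious groups.
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [0.8, 0.9, 0.2, 0.0]])
print(cluster_from_distances(D, threshold=0.5))   # e.g. [1 1 2 2]
```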

  12. Precision and Recall (definitions follow [Bayer et al., NDSS 2009]) • Reference clustering R = {R1, …, Rr} (r clusters) and test clustering C = {C1, …, Cc} (c clusters) over n instances • Precision = (1/n) Σ_{j=1..c} max_i |Cj ∩ Ri| • Recall = (1/n) Σ_{i=1..r} max_j |Cj ∩ Ri| • F-measure = 2 · Precision · Recall / (Precision + Recall)
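These measures can be computed directly from per-instance labels. A minimal sketch; the function and variable names are ours, not from the authors' code:

```python
from collections import Counter

def precision_recall_f(reference, test):
    """Clustering precision, recall, and F-measure.
    reference, test: equal-length lists of per-instance cluster labels."""
    n = len(reference)
    joint = Counter(zip(test, reference))                 # co-occurrence counts
    ref_labels, test_labels = set(reference), set(test)
    # Precision: credit each test cluster with its best-matching reference cluster.
    prec = sum(max(joint[(c, r)] for r in ref_labels) for c in test_labels) / n
    # Recall: credit each reference cluster with its best-matching test cluster.
    rec = sum(max(joint[(c, r)] for c in test_labels) for r in ref_labels) / n
    return prec, rec, 2 * prec * rec / (prec + rec)

print(precision_recall_f(["A", "A", "A", "B"], ["X", "X", "Y", "Y"]))  # (0.75, 0.75, 0.75)
```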

  13. Results on D1 • Both MC1 and the PDs perform well, which supports our conjecture • Is this the case for any collection of malware to be analyzed?

  14. Dataset D2 and Ground-truth • Samples randomly chosen from the VXH selection (5,121 instances) • Run through the dynamic analysis system, yielding 1,114 instances • Ground truth selected more conservatively • Clustered with MC1, PD1, and PD3

  15. Results on D2 • Both MC1 and the PDs perform more poorly on D2 than they did on D1 • This does not support our conjecture

  16. Differences Between D1 and D2 • CDF of reference cluster sizes for datasets D1 and D2 • Dataset D1 is highly biased: two large clusters comprise 48.5% and 27% of the malware instances, respectively, and the remaining clusters each contain at most 6.7% • For dataset D2, the largest cluster comprises only 14% of the instances • Other investigations (length of API call sequences, detailed behaviors, etc.) are in the paper

  17. The Significance of the Precision and Recall • Case one: biased ground truth • Various test clusterings all achieve Prec = 7/8, Recall = 7/8

  18. The Significance of the Precision and Recall • Case two: unbiased ground truth • Comparable test clusterings achieve only Prec = 4/8, Recall = 4/8 • It is considerably "harder" to produce a clustering yielding good precision and recall in the latter case • Good precision and recall are thus much more significant in the latter case than in the former (a numeric sketch follows)
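A small numeric sketch consistent with the values on these two slides; the particular cluster assignments are our own illustrative choices. Against a biased ground truth with cluster sizes 7 and 1, a rather sloppy test clustering still scores 7/8 on both measures, whereas a comparably sloppy clustering against a balanced 4-and-4 ground truth scores only 4/8:

```python
from collections import Counter

def prf(ref, test):
    """Best-match clustering precision and recall (as defined on slide 12)."""
    n, joint = len(ref), Counter(zip(test, ref))
    p = sum(max(joint[(c, r)] for r in set(ref)) for c in set(test)) / n
    r = sum(max(joint[(c, r)] for c in set(test)) for r in set(ref)) / n
    return p, r

# Biased ground truth: reference cluster sizes 7 and 1.
biased_ref  = ["A"] * 7 + ["B"]
sloppy_test = ["X"] * 6 + ["Y"] + ["X"]                  # lumps the outlier in with the rest
print(prf(biased_ref, sloppy_test))                      # (0.875, 0.875) -> 7/8

# Unbiased ground truth: reference cluster sizes 4 and 4.
balanced_ref = ["A"] * 4 + ["B"] * 4
mixed_test   = ["X", "X", "Y", "Y", "X", "X", "Y", "Y"]  # each test cluster half-and-half
print(prf(balanced_ref, mixed_test))                     # (0.5, 0.5) -> 4/8
```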

  19. Perturbation Test • Applied to MC1 on D1 and on D2 (MC1(D1), MC1(D2))

  20. Results of Perturbation Test • The cluster-size distribution characteristic of D2 is more sensitive to perturbations in the underlying data • Other experiments showing the effect of the cluster-size distribution are in the paper
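One way such a perturbation test could be implemented is sketched below; the mechanics (add bounded random noise to the pairwise distances, re-cluster, compare against the unperturbed clustering) are an assumption, not a reproduction of the paper's exact procedure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def perturbed_clustering(dist_matrix, noise, threshold, rng):
    """Re-cluster after adding symmetric random noise to a distance matrix."""
    n = dist_matrix.shape[0]
    jitter = np.triu(rng.uniform(-noise, noise, size=(n, n)), 1)
    noisy = np.clip(dist_matrix + jitter + jitter.T, 0.0, None)  # keep symmetry, no negatives
    np.fill_diagonal(noisy, 0.0)
    tree = linkage(squareform(noisy, checks=False), method="single")
    return fcluster(tree, t=threshold, criterion="distance")

rng = np.random.default_rng(0)
# Hypothetical stand-in for MC1's pairwise distances on D1 or D2.
D = rng.uniform(0.2, 1.0, size=(10, 10)); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
baseline = perturbed_clustering(D, noise=0.0, threshold=0.5, rng=rng)
shaken   = perturbed_clustering(D, noise=0.1, threshold=0.5, rng=rng)
print(baseline); print(shaken)   # compare via precision/recall to gauge sensitivity
```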

  21. Summary • Conjectured that utilizing the concurrence of multiple anti-virus tools in classifying malware instances may bias the dataset toward easy-to-cluster instances • Our tests using plagiarism detectors on two datasets arguably leave our conjecture unresolved, but we believe highlighting this possibility is important • Examined the impact of the ground-truth cluster-size distribution on the significance of results suggesting high accuracy

  22. Thanks! pengli@cs.unc.edu
