On Challenges in Evaluating Malware Clustering

Peng Li, University of North Carolina at Chapel Hill, NC, USA
Limin Liu, State Key Lab of Information Security, Graduate School of Chinese Academy of Sciences
Debin Gao, School of Information Systems, Singapore Management University, Singapore
Mike Reiter, University of North Carolina at Chapel Hill, NC, USA
Malware Clustering
• Malware instances (executables) are grouped by pairwise distance
• How is the distance defined?
Static vs. Dynamic
• Static: analyzes the executable itself; hindered by packers [Dullien et al., SSTIC2005] [Zhang et al., ACSAC2007] [Briones et al., VB2008] [Griffin et al., RAID2009]
• Dynamic: a dynamic analysis system collects traces (API calls, system calls, etc.) [Gheorghescu et al., VB2005] [Rieck et al., DIMVA2008] [Martignoni et al., RAID2008] [Bayer et al., NDSS2009]
Ground-truth? Single Anti-virus Scanner [Lee et al., EICAR2006] [Rieck et al., DIMVA2008] [Hu et al., CCS2009]
Ground-truth? Single Anti-virus Scanner
• Inconsistency issue [Bailey et al., RAID2007]
Ground-truth? Multiple Anti-virus Scanners … [Bayer et al., NDSS2009] [Perdisci et al., NSDI2010] [Rieck et al., TR18-2009]
Our Work
• Proposed a conjecture that the "multiple-anti-virus-scanner-voting" method of selecting ground-truth data biases results toward high accuracy
• Designed experiments to test this conjecture
• Conflicting signals
• Revealed the effect of the cluster-size distribution on the significance of malware clustering results
To Test Our Conjecture
• Given a dataset "D" generated via "multiple-anti-virus-scanner-voting"
• Can we always obtain high accuracy, using a variety of clustering techniques?
Dataset "D1" [Bayer et al., NDSS 2009]
• 2,658 malware instances
• A subset of 14,212 malware instances
• Selected by majority vote among six different anti-virus programs
A Variety of Techniques
• MC1 (Malware Clustering #1) [Bayer et al., NDSS 2009]
• Monitors the execution of a program and creates its behavioral profile
• Abstracts system calls, their dependences, and network activities into a generalized representation consisting of OS objects and OS operations
• PD1–PD3 are plagiarism detectors (they also attempt to detect some degree of similarity among a large number of software programs)
• PD1: similarity (string matching) of the sequences of API calls [Tamada et al., ISFST 2004]
• PD2: Jaccard similarity of short sequences of system calls [Wang et al., ACSAC 2009] (see the sketch after this list)
• PD3: Jaccard similarity of short sequences of API calls
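To illustrate the PD2/PD3 style of distance, here is a minimal Python sketch, assuming traces are given as lists of call names; the n-gram length k is our placeholder, not a parameter taken from the paper.

```python
def ngrams(trace, k=4):
    """Set of length-k subsequences (n-grams) of a trace."""
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}

def jaccard_distance(trace_a, trace_b, k=4):
    """1 - Jaccard similarity of the two traces' k-gram sets."""
    a, b = ngrams(trace_a, k), ngrams(trace_b, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```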
Clustering on D1
Dynamic traces of D1 → {MC1, PD1, PD2, PD3} → distance matrices → hierarchical clustering → comparison against the reference distribution
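A minimal sketch of the hierarchical-clustering step, assuming a precomputed symmetric pairwise distance matrix (e.g., produced by MC1 or one of the PDs); single linkage and the cut threshold t are our assumptions, not settings from the paper.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_distances(dist_matrix, t=0.3):
    """Cut a single-linkage dendrogram at distance t; returns one label per instance."""
    condensed = squareform(dist_matrix)  # square symmetric matrix -> condensed form
    tree = linkage(condensed, method="single")
    return fcluster(tree, t=t, criterion="distance")
```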
Precision and Recall
Reference clustering $R_1, \dots, R_r$ (r clusters); test clustering $C_1, \dots, C_c$ (c clusters) over the same $n$ instances.

Precision: $\mathit{prec} = \frac{1}{n}\sum_{j=1}^{c}\max_{1 \le i \le r}|C_j \cap R_i|$

Recall: $\mathit{rec} = \frac{1}{n}\sum_{i=1}^{r}\max_{1 \le j \le c}|C_j \cap R_i|$

F-measure: $F = \frac{2 \cdot \mathit{prec} \cdot \mathit{rec}}{\mathit{prec} + \mathit{rec}}$
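A direct Python transcription of these definitions (our sketch; clusterings are represented as lists of sets over the same instances):

```python
def precision_recall(reference, test):
    """Clustering precision, recall, and F-measure.

    reference, test: lists of disjoint sets partitioning the same instances.
    """
    n = sum(len(c) for c in test)
    prec = sum(max(len(c & r) for r in reference) for c in test) / n
    rec = sum(max(len(c & r) for c in test) for r in reference) / n
    f = 2 * prec * rec / (prec + rec)
    return prec, rec, f
```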
Results on D1
• Both MC1 and the PDs perform well, which supports our conjecture
• Is this the case for any collection of malware to be analyzed?
Dataset D2 and Ground-truth
• 5,121 samples randomly chosen from the VXH collection → dynamic analysis system → 1,114 instances
• A more conservative ground-truth selection
• Clustered using MC1, PD1, and PD3
Results on D2
• Both MC1 and the PDs perform more poorly on D2 than they did on D1
• This does not support our conjecture
Differences Between D1 and D2
[Figure: CDFs of reference cluster sizes for datasets D1 and D2]
• Dataset D1 is highly biased: two large clusters comprise 48.5% and 27% of the malware instances, respectively, and the remaining clusters each hold at most 6.7%
• For dataset D2, the largest cluster comprises only 14% of the instances
• Other investigations (the length of API call sequences, detailed behaviors, etc.) are in the paper
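A tiny helper (ours, not from the paper) for summarizing this kind of bias, using the list-of-sets representation from the precision_recall sketch above:

```python
def size_distribution(clusters):
    """Fraction of all instances in each cluster, largest first."""
    sizes = sorted((len(c) for c in clusters), reverse=True)
    n = sum(sizes)
    return [s / n for s in sizes]

# A D1-like reference would start roughly [0.485, 0.27, ...],
# a D2-like one around [0.14, ...] (figures from the slide above).
```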
The Significance of the Precision and Recall
Case one: biased ground truth
[Figure: a test clustering of eight instances measured against a biased reference clustering; Prec = 7/8, Recall = 7/8]
Case two: unbiased ground truth
[Figure: the same test clustering measured against an unbiased reference clustering; Prec = 4/8, Recall = 4/8]
It is considerably "harder" to produce a clustering yielding good precision and recall in the latter case; good precision and recall are thus much more significant there than in the former.
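To make the contrast concrete, here is a small illustration using the precision_recall sketch above; the eight-instance clusterings are ours, chosen to mirror the biased/unbiased cases, not the exact ones drawn on the slides.

```python
# Eight instances, labeled 0..7.
biased_ref   = [set(range(7)), {7}]               # one dominant cluster
unbiased_ref = [set(range(4)), set(range(4, 8))]  # two balanced clusters
trivial_test = [set(range(8))]                    # lump everything together

print(precision_recall(biased_ref, trivial_test))
# -> (0.875, 1.0, 0.933...): a vacuous clustering scores well against biased ground truth
print(precision_recall(unbiased_ref, trivial_test))
# -> (0.5, 1.0, 0.666...): the same clustering is exposed against unbiased ground truth
```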
Perturbation Test
[Figure: perturbation-test setup, applied to MC1 on D1 and on D2]
Results of Perturbation Test
[Figure: perturbation-test results for D1 and D2]
• The cluster-size distribution characteristic of D2 is more sensitive to perturbations in the underlying data
• Other experiments showing the effect of the cluster-size distribution are in the paper
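One plausible way to realize such a test, as a hedged sketch: jitter the pairwise distances slightly, re-cluster with the cluster_from_distances helper from the earlier sketch, and compare cluster-size distributions before and after. The noise scale eps is our placeholder, not a value from the paper.

```python
import numpy as np

def perturb_and_recluster(dist_matrix, eps=0.01, t=0.3, seed=0):
    """Add small symmetric noise to a distance matrix and re-cluster."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-eps, eps, size=dist_matrix.shape)
    noise = (noise + noise.T) / 2                  # keep the matrix symmetric
    perturbed = np.clip(dist_matrix + noise, 0.0, None)
    np.fill_diagonal(perturbed, 0.0)               # distances to self stay zero
    return cluster_from_distances(perturbed, t=t)  # from the earlier sketch
```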
Summary
• Conjectured that utilizing the concurrence of multiple anti-virus tools in classifying malware instances may bias a dataset toward easy-to-cluster instances
• Our tests using plagiarism detectors on two datasets arguably leave our conjecture unresolved, but we believe highlighting this possibility is important
• Examined the impact of the ground-truth cluster-size distribution on the significance of results suggesting high accuracy
Thanks!
pengli@cs.unc.edu