Polonium: Tera-Scale Graph Mining for Malware Detection Patent Pending Polo Chau Machine Learning Dept Carey Nachenberg Vice President & Fellow My boss at Jeffrey Wilhelm Principal Software Engr My advisor Prof. Christos Faloutsos Computer Science Dept Adam Wright Software Engineer
Anti-Virus Software… Polo Chau. Machine Learning Department, Carnegie Mellon University
Detecting Malware. Traditional malware detection approaches rely on signatures: • Collect malware samples • Security experts generate signatures from the samples • Signatures are distributed to users' computers as updates. How to handle the increasingly common "zero-day" malware? For the many new or unknown malware, the signature-based approach does not work: no samples, hence no signatures, hence no detection.
Symantec's New Reputation-Based Approach. Symantec is the world's leading personal security software provider. The approach computes a reputation score for every application and protects users from those with poor reputation. It leverages terabytes of data anonymously contributed by the millions of participants of the worldwide Norton Community Watch program.
Symantec's New Reputation-Based Approach. Uses an ensemble of machine learning and data mining algorithms, plus many other detection modules, to compute application reputations. Polonium is a new malware detection technology: • I helped create it in Fall 2009 at Symantec as an intern • It is being incorporated into their products • Patent pending.
Related Work (briefly). Existing research has used many familiar techniques, e.g., Naïve Bayes, SVM, decision trees.
POLONIUM: Propagation Of Leverage Of Network Influence Unearths Malware.
The Data. 60+ terabytes of data anonymously contributed by participants of the worldwide Norton Community Watch program: >50 million machines, >900 million executable files. Constructed a machine-file bipartite graph (0.2 TB+): ~1 billion nodes (machines and files), ~37 billion edges.
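As a rough illustration of how such a graph can be assembled, the sketch below builds a tiny machine-file bipartite graph from (machine, file) report pairs. The input format, the function name `build_bipartite_graph`, and the toy reports are assumptions made for illustration; they are not Symantec's actual pipeline.

```python
from collections import defaultdict


def build_bipartite_graph(reports):
    """Build adjacency sets from (machine_id, file_id) pairs (hypothetical input format)."""
    machine_to_files = defaultdict(set)
    file_to_machines = defaultdict(set)
    for machine_id, file_id in reports:
        # Sets ignore duplicate reports of the same file by the same machine.
        machine_to_files[machine_id].add(file_id)
        file_to_machines[file_id].add(machine_id)
    return machine_to_files, file_to_machines


# Toy reports; the last pair repeats an earlier one and adds no new edge.
reports = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("C", 3), ("C", 4), ("A", 1)]
machine_to_files, file_to_machines = build_bipartite_graph(reports)

# Each edge links one machine to one file, so counting one side is enough.
num_edges = sum(len(files) for files in machine_to_files.values())
print(num_edges, "edges")
```

An in-memory dictionary like this only works for small graphs; at tens of billions of edges the data has to live on disk.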
Terminology
The Malware Detection Problem. Given • a billion-node machine-file bipartite graph • prior knowledge about some files' and machines' goodness. Treat each file i as a random variable X_i ∈ {x_g, x_b}, where x_g is the good label and P(x_g) is the file goodness, and x_b is the bad label and P(x_b) is the file badness. Goal: find the file goodness P(X_i = x_g) for each file i. Since goodness + badness = 1, we only consider goodness. Outline: first describe the domain knowledge to be incorporated, then the Polonium algorithm that computes file goodness.
1. Prior file reputation. Symantec maintains a ground truth database of known-good and known-bad files (the "known" files; all other files are "unknown"). Priors are set from this knowledge, e.g., a known-good file's prior is set to 0.9, and prior file reputation is correlated with file prevalence. Intuition: good files appear on many machines; bad files appear on few machines.
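A minimal sketch of assigning file priors from ground truth labels follows. Only the 0.9 prior for known-good files comes from this slide; the 0.1 prior for known-bad files and the neutral 0.5 for unknown files are assumptions that mirror the values used in the deck's worked example. The mapping from prevalence to prior is not given in the slides, so it is not modeled here, and `file_prior` is a hypothetical helper name.

```python
KNOWN_GOOD_PRIOR = 0.9  # stated on the slide
KNOWN_BAD_PRIOR = 0.1   # assumption: mirrors file 4 in the worked example
UNKNOWN_PRIOR = 0.5     # assumption: neutral prior for unknown files


def file_prior(ground_truth_label):
    """Return [P(good), P(bad)] for a label of "good", "bad", or None (unknown)."""
    if ground_truth_label == "good":
        goodness = KNOWN_GOOD_PRIOR
    elif ground_truth_label == "bad":
        goodness = KNOWN_BAD_PRIOR
    else:
        goodness = UNKNOWN_PRIOR
    # Goodness and badness sum to 1, so badness is the complement.
    return [goodness, 1.0 - goodness]


labels = [("file1", "good"), ("file2", None), ("file3", None), ("file4", "bad")]
priors = {name: file_prior(label) for name, label in labels}
print(priors)
```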
2. Prior machine reputation • Computed using Symantec’s proprietary formula; takes into account multiple anonymous aspects of machine’s usage and behavior • Machine reputation is a value between 0 and 1 • Intuitively, files associated with a machine with high reputation are more likely to be good
3. "Homophilic" machine-file relationships. Also known as "guilt-by-association": bad files are more likely to appear on low-reputation machines; good files are more likely to appear on high-reputation machines.
Recap: Incorporating Domain Knowledge. How to infer the reputation of an unknown file using its neighbors' (and their neighbors') reputations? Polonium adapts the Belief Propagation algorithm.
Details: Computing Node Reputation/Belief (same for file nodes and machine nodes). The node belief is $b_i(x_i) = k\,\phi(x_i) \prod_{j \in N(i)} m_{ji}(x_i)$, where • $b_i(x_i) \approx P(x_i)$ is the node belief • $m_{ji}(x_i)$ is the message from neighboring node j, i.e., the neighbor's opinion about the node's reputation • $\phi(x_i)$ is the prior node belief • $k$ is a normalization constant.
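The belief update described on this slide (prior node belief, multiplied by the messages from all neighboring nodes, then normalized) can be sketched as below. This follows the standard Belief Propagation belief computation; the function name `node_belief` and the example numbers are assumptions for illustration.

```python
def node_belief(prior, incoming_messages):
    """Combine a node's prior with its neighbors' messages.

    prior: [P(good), P(bad)] for the node.
    incoming_messages: list of [m(good), m(bad)], one per neighbor.
    """
    good, bad = prior
    for m_good, m_bad in incoming_messages:
        # Each neighbor's message scales the two label scores independently.
        good *= m_good
        bad *= m_bad
    total = good + bad  # dividing by this plays the role of the normalization constant
    return [good / total, bad / total]


# Hypothetical node with a 0.9 prior and two neighbors.
belief = node_belief([0.9, 0.1], [[0.5, 0.5], [0.6, 0.4]])
print(belief)
```

A neutral message of [0.5, 0.5] leaves the belief unchanged after normalization, which is why initializing all messages to that value is harmless.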
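Details: Generating the message sent from node i to node j (same for file → machine and machine → file). The message is $m_{ij}(x_j) = \sum_{x_i} \phi(x_i)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)$, where • $m_{ij}(x_j)$ is i's message to j • $\psi_{ij}(x_i, x_j)$ is the propagation function • $\phi(x_i) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)$ is approximately node i's belief. The example propagation function is parameterized by ε; we choose ε = 0.001 to preserve minute probability differences.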
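A minimal sketch of message generation under the standard Belief Propagation rule follows. The slide's example propagation function table is not in the text, so the sketch assumes a homophilic form: 0.5 + ε when the two labels match and 0.5 - ε when they differ. That form, the helper names, and the per-message normalization (a common numerical-stability practice) are assumptions, not details confirmed by the slides.

```python
EPSILON = 0.001  # value chosen on the slide


def edge_potential(label_i, label_j, eps=EPSILON):
    """Assumed homophilic propagation function: matching labels are slightly favored."""
    return 0.5 + eps if label_i == label_j else 0.5 - eps


def message(prior_i, other_incoming, eps=EPSILON):
    """Message from node i to node j.

    prior_i: [P(good), P(bad)] for node i.
    other_incoming: messages into i from all neighbors except j.
    """
    # Node i's belief, leaving out whatever j itself told i.
    pre = list(prior_i)
    for m in other_incoming:
        pre[0] *= m[0]
        pre[1] *= m[1]
    # Sum over i's labels for each candidate label of j (0 = good, 1 = bad).
    out = [
        sum(pre[label_i] * edge_potential(label_i, label_j, eps) for label_i in (0, 1))
        for label_j in (0, 1)
    ]
    total = sum(out)
    return [value / total for value in out]


# A node with prior 0.9 and no other neighbors nudges j slightly toward "good".
msg = message([0.9, 0.1], [])
print(msg)
```

With ε this small each message moves a belief only slightly, so a node's reputation shifts noticeably only when many neighbors agree.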
Example: Assigning Prior Probabilities. Machine nodes use (proprietary) machine reputations as priors; e.g., a prior of [0.6, 0.4] means the machine reputation is 0.6. Machines: A = 0.6, B = 0.45, C = 0.35. Files: 1 = 0.9, 2 = 0.5, 3 = 0.5, 4 = 0.1. All messages are initialized to [0.5, 0.5], e.g., m_A1 = [0.5, 0.5], m_1A = [0.5, 0.5].
Example: Propagate Machine → File Messages. Machines: A = 0.6, B = 0.45, C = 0.35. File reputations after this step: file 1: 0.9 → 0.92, file 2: 0.5 → 0.58, file 3: 0.5 → 0.38, file 4: 0.1 → 0.06.
Example: Propagate File → Machine Messages. Files: 1 = 0.92, 2 = 0.58, 3 = 0.38, 4 = 0.06. Machine reputations after this step: A: 0.6 → 0.87, B: 0.45 → 0.81, C: 0.35 → 0.1.
Algorithm Termination. Ideally, the algorithm stops when reputations converge, but theoretically there is NO guarantee that this will happen. Empirically, we run it for a fixed number of iterations (we used 7). Upon completion, we have reputation scores for all files and machines; we only want the file reputations.
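The fixed-iteration loop can be sketched end to end on a toy graph. The node priors below are the ones from the deck's example, but the edge list is an assumption (the original figure's edges are not in the text), so the resulting numbers are not expected to match the slides. The sketch also assumes a 0.5 ± ε propagation function and updates all messages synchronously in each iteration.

```python
def normalize(v):
    total = v[0] + v[1]
    return [v[0] / total, v[1] / total]


def run_bp(priors, edges, eps=0.001, iterations=7):
    """Run a fixed number of Belief Propagation iterations; return node beliefs."""
    neighbors = {node: [] for node in priors}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # messages[(i, j)] is the message from i to j; all start neutral.
    messages = {}
    for u, v in edges:
        messages[(u, v)] = [0.5, 0.5]
        messages[(v, u)] = [0.5, 0.5]
    for _ in range(iterations):
        new_messages = {}
        for (i, j) in messages:
            # Node i's prior times all incoming messages except the one from j.
            pre = list(priors[i])
            for k in neighbors[i]:
                if k != j:
                    pre[0] *= messages[(k, i)][0]
                    pre[1] *= messages[(k, i)][1]
            good = pre[0] * (0.5 + eps) + pre[1] * (0.5 - eps)
            bad = pre[0] * (0.5 - eps) + pre[1] * (0.5 + eps)
            new_messages[(i, j)] = normalize([good, bad])
        messages = new_messages
    beliefs = {}
    for i in priors:
        b = list(priors[i])
        for k in neighbors[i]:
            b[0] *= messages[(k, i)][0]
            b[1] *= messages[(k, i)][1]
        beliefs[i] = normalize(b)
    return beliefs


priors = {
    "A": [0.6, 0.4], "B": [0.45, 0.55], "C": [0.35, 0.65],
    "f1": [0.9, 0.1], "f2": [0.5, 0.5], "f3": [0.5, 0.5], "f4": [0.1, 0.9],
}
# Assumed edges: each machine reports two files.
edges = [("A", "f1"), ("A", "f2"), ("B", "f2"), ("B", "f3"), ("C", "f3"), ("C", "f4")]
beliefs = run_bp(priors, edges)
# Only file reputations are of interest at the end.
file_goodness = {n: b[0] for n, b in beliefs.items() if n.startswith("f")}
print(file_goodness)
```

A convergence check (for example, stopping once no message changes by more than a tolerance) could replace the fixed count, but since convergence is not guaranteed an iteration cap is still needed.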
Experiments. Evaluated with the full machine-file bipartite graph: ~1 billion nodes (>900M files, >50M machines) and ~37 billion edges, the largest file-submission graph constructed and analyzed. Evaluated with 1/10 of the ground truth files; the other 9/10 were used for setting file priors. Run on 64-bit Red Hat Linux with 4 quad-core processors and 256 GB RAM.
One-Iteration Results for files reported by four or more machines: 84.9% True Positive Rate (% of malware correctly identified) at 1% False Positive Rate (% of non-malware wrongly labeled as malware). In the computer security industry, high TPR is important. Low FPR is critical!
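To make the two metrics concrete, the sketch below computes TPR and FPR from goodness scores and ground truth labels at a given threshold. The scores, labels, and threshold are made-up toy values, and `rates_at_threshold` is a hypothetical helper; none of this is the evaluation data behind the reported figures.

```python
def rates_at_threshold(scores, is_malware, threshold):
    """Flag files whose goodness score is below the threshold; return (TPR, FPR)."""
    tp = fp = fn = tn = 0
    for score, bad in zip(scores, is_malware):
        flagged = score < threshold
        if bad and flagged:
            tp += 1  # malware correctly identified
        elif bad:
            fn += 1  # malware missed
        elif flagged:
            fp += 1  # non-malware wrongly labeled as malware
        else:
            tn += 1  # non-malware correctly left alone
    return tp / (tp + fn), fp / (fp + tn)


# Toy data: four malware files followed by four legitimate files.
scores = [0.05, 0.10, 0.40, 0.70, 0.20, 0.80, 0.90, 0.95]
is_malware = [True, True, True, True, False, False, False, False]
tpr, fpr = rates_at_threshold(scores, is_malware, threshold=0.5)
print(tpr, fpr)
```

In practice the threshold would be lowered until the FPR on held-out known-good files reaches the target (1% here), and the TPR is then read off at that threshold.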
Multi-Iteration Results for files reported by four or more machines (iterations 1 to 7): 2.2% increase in TPR at the same 1% FPR, with diminishing returns.
Scalability: Running Time Per Iteration. 3 hours for the full data with 37 billion edges.
Optimization #1: Doubles speed by computing half of the messages. File → Machine messages depend ONLY on Machine → File messages from the previous iteration.
Optimization #2: Externalize "Edge File". Observation: random access to graph edges or edge messages is NOT necessary; sequential access is sufficient. Use an adjacency list layout to store messages, e.g., [FM0] [FM0] [FM1] [FM1] [FM2] [FM2]…
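One way to picture the sequential-access idea is a flat binary file in which each node's incoming messages are stored contiguously, so a single front-to-back pass is enough to combine them. The record format (one float64 per message), the separate degree list, and the function names below are assumptions for illustration; the slides do not specify the actual on-disk layout.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<d")  # one little-endian float64 per message (assumed format)


def write_edge_file(path, messages_by_node):
    """Write each node's messages contiguously; return the per-node message counts."""
    degrees = []
    with open(path, "wb") as f:
        for messages in messages_by_node:
            degrees.append(len(messages))
            for m in messages:
                f.write(RECORD.pack(m))
    return degrees


def stream_products(path, degrees):
    """Yield each node's product of messages using one sequential pass over the file."""
    with open(path, "rb") as f:
        for degree in degrees:
            product = 1.0
            for _ in range(degree):
                (m,) = RECORD.unpack(f.read(RECORD.size))
                product *= m
            yield product


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    # Three hypothetical nodes with 2, 1, and 3 incoming messages.
    degrees = write_edge_file(path, [[0.5, 0.6], [0.4], [0.5, 0.5, 0.8]])
    products = list(stream_products(path, degrees))

print(products)
```

Because the reader never seeks, only a small buffer needs to be in memory at any time, which is what makes a graph larger than RAM workable.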
Scaling-up Computation Further • Belief Propagation (and hence Polonium) can be implemented as matrix-vector multiplication, which leverages research on parallel computation, architecture, etc. • "Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining", Xintian Yang • "Inference of Beliefs on Billion-Scale Graphs", U Kang
Conclusions. Polonium is a new and effective reputation-based malware detection technology adapting the Belief Propagation algorithm: 87% TPR at 1% FPR. Evaluated on a 37-billion-edge machine-file bipartite graph, the largest file submissions dataset ever published: 60 TB of raw data, 0.2 TB for the derived graph. Scalable and fast: optimization doubles speed and reduces storage.
Thanks
Data Statistics: Machine-Submission Distribution
Data Statistics: File-Prevalence Distribution
The "Right" Algorithm • Easy to incorporate domain knowledge • Must be effective: high TPR at low FPR • Easy to understand (a "whitebox" method)
Domain Knowledge to Incorporate • Prior file reputation • Prior machine reputation • “Homophilic” machine-file relationships
The Polonium Algorithm: An adaptation of the Belief Propagation algorithm. Given • a billion-node machine-file bipartite graph • prior knowledge about some files' and machines' goodness • the intuition of "guilt-by-association". Treat each node i as a two-state random variable X_i ∈ {x_g, x_b}, where x_g is the good label and P(x_g) is the node goodness, and x_b is the bad label and P(x_b) is the node badness. Goal: find the file goodness P(X_i = x_g) for each file i; machine reputations are not the end goal.
Symantec. World's leading security software provider. Released 1.8 million signatures in 2008, resulting in 200 million detections. Estimated that the release rate of malicious or unwanted software would exceed that of legitimate software (2008 Symantec Security Threat Report).