Mining Large Graphs: Spectral Methods, Tensors and Influence propagation Christos Faloutsos CMU
Thanks • Alex Smola • Jia Yu (Tim) Pan C. Faloutsos (CMU)
Roadmap • Graph problems: • G1: Fraud detection – BP • G2: Botnet detection – spectral • G3: Beyond graphs: tensors and “NELL” • Influence propagation and spike modeling • C1: spikeM model • Conclusions
eBay fraud detection • with Polo Chau & Shashank Pandit, CMU [WWW ’07]
eBay fraud detection: NetProbe
eBay fraud detection: NetProbe (details) • Compatibility matrix • Heterophily
Background 1: Belief Propagation • Equations for the beliefs b_i(x_i) • [Pearl ’82] [Yedidia+ ’02] … [Pandit+ ’07] [Gonzalez+ ’09] [Chechetka+ ’10]
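For reference, the belief propagation updates are usually written as below. This is the standard textbook form, not a recovery of the slide's own figure. The notation is assumed: \phi_i is the prior (node potential) of node i, \psi_{ij} is the edge compatibility, m_{ij} is the message from node i to node j, N(i) is the set of neighbors of i, and \eta is a normalizing constant.

```latex
m_{ij}(x_j) \leftarrow \sum_{x_i} \phi_i(x_i)\,\psi_{ij}(x_i, x_j)
    \prod_{n \in N(i)\setminus j} m_{ni}(x_i)

b_i(x_i) = \eta\,\phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)
```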
Popular press • And less desirable attention: • E-mail from ‘Belgium police’ (‘copy of your code?’)
Roadmap • Graph problems: • G1: Fraud detection – BP • eBay • Symantec • Unification • G2: Botnet detection – spectral • G3: Beyond graphs: tensors and “NELL” • Influence propagation and spike modeling • Conclusions
Polonium: Tera-Scale Graph Mining and Inference for Malware Detection (SDM 2011, Mesa, Arizona; patent pending) • Polo Chau, Machine Learning Dept • Carey Nachenberg, Vice President & Fellow • Jeffrey Wilhelm, Principal Software Engineer • Adam Wright, Software Engineer • Prof. Christos Faloutsos, Computer Science Dept
Polonium: The Data • 60+ terabytes of data, anonymously contributed by participants of the worldwide Norton Community Watch program • 50+ million machines • 900+ million executable files • Constructed a machine-file bipartite graph (0.2 TB+) • 1 billion nodes (machines and files) • 37 billion edges
Polonium: Key Ideas • Use “guilt-by-association” (i.e., homophily) • E.g., files that appear on machines with many bad files are more likely to be bad • Scalability: handles a 37-billion-edge graph
Polonium: One-Interaction Results • 84.9% True Positive Rate at 1% False Positive Rate • True Positive Rate: % of malware correctly identified • False Positive Rate: % of non-malware wrongly labeled as malware • [Figure: True Positive Rate vs. False Positive Rate, with the “Ideal” point marked]
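The two rates defined above can be computed from labeled predictions as in this minimal sketch. The function name `rates` and the toy labels are made up for illustration; they are not Polonium data.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate).

    y_true, y_pred: sequences of booleans, where True means "malware".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # malware, flagged
    fn = sum(t and not p for t, p in pairs)          # malware, missed
    fp = sum((not t) and p for t, p in pairs)        # benign, wrongly flagged
    tn = sum((not t) and (not p) for t, p in pairs)  # benign, left alone
    return tp / (tp + fn), fp / (fp + tn)

# Toy example (hypothetical): 4 malware files followed by 4 benign files.
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False]
tpr, fpr = rates(y_true, y_pred)
print(f"TPR = {tpr:.0%}, FPR = {fpr:.0%}")
```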
Roadmap • Graph problems: • G1: Fraud detection – BP • eBay • Symantec • Unification • G2: Botnet detection – spectral • G3: Beyond graphs: tensors and “NELL” • Influence propagation and spike modeling • Conclusions
Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms • Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos • ECML PKDD, 5-9 September 2011, Athens, Greece
Problem Definition: GBA techniques • Given: a graph and a few labeled nodes • Find: labels of the rest (assuming network effects)
Homophily and Heterophily • All methods handle homophily • NOT all methods handle heterophily, BUT the proposed method does!
Are they related? • RWR (Random Walk with Restarts) • Google’s PageRank (‘if my friends are important, I’m important, too’) • SSL (Semi-supervised learning) • minimize the differences among neighbors • BP (Belief propagation) • send messages to neighbors, on what you believe about them
Are they related? YES! • RWR, SSL and BP are related
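To make the first of the three methods concrete, here is a minimal random-walk-with-restart sketch using power iteration. The function name `rwr`, the restart probability of 0.15 and the 4-node path graph are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def rwr(A, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Random-walk-with-restart scores of all nodes relative to `seed`."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Column-normalize: W[i, j] is the probability of stepping from j to i.
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # guard against isolated nodes
    W = A / col_sums
    e = np.zeros(n)
    e[seed] = 1.0                      # restart distribution
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1 - restart) * (W @ r) + restart * e
        delta = np.abs(r_next - r).sum()
        r = r_next
        if delta < tol:
            break
    return r

# Path graph 0 - 1 - 2 - 3, restarting at node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
scores = rwr(A, seed=0)
print(scores)
```

Nodes closer to the seed tend to score higher, although a node's degree also matters, which is why node 1 can outrank the seed itself on this graph.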
Correspondence of Methods • [Figure: matrix equation linking the final labels/beliefs (unknown, “?”) to the prior labels/beliefs through the adjacency matrix (example: 0 1 0 / 1 0 1 / 0 1 0) and the node degrees d1, d2, d3] • We know when it converges!
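The matrix form can be sketched as one linear solve. The slide only names the ingredients (prior beliefs, final beliefs, adjacency matrix, degrees); the specific system (I + aD - c'A) b = prior and the coefficients a and c' below follow the published FaBP formulation as recalled here, so treat them as assumptions. The function name `fabp`, the homophily factor `h = 0.1` and the prior vector are illustrative.

```python
import numpy as np

def fabp(A, prior, h):
    """Solve (I + a*D - c*A) b = prior for the final beliefs b.

    A:     symmetric 0/1 adjacency matrix
    prior: prior beliefs, centered around 0 (0 means "unknown")
    h:     small homophily factor (assumed parameterization)
    """
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))            # degree matrix diag(d1, d2, ...)
    a = 4 * h**2 / (1 - 4 * h**2)         # assumed coefficient on D
    c = 2 * h / (1 - 4 * h**2)            # assumed coefficient on A
    M = np.eye(A.shape[0]) + a * D - c * A
    return np.linalg.solve(M, np.asarray(prior, dtype=float))

# The 3-node path graph from the slide's adjacency matrix.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
prior = [0.1, 0.0, 0.0]    # hypothetical: only node 0 carries a label
b = fabp(A, prior, h=0.1)
print(b)
```

In this toy run the belief decays with distance from the labeled node, which is the homophily behavior the guilt-by-association methods share.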
Results: Scalability • FaBP is linear in the number of edges • [Figure: runtime (min) vs. # of edges (Kronecker graphs)]
Results: Parallelism • FaBP is ~2x faster and wins/ties on accuracy • [Figure: % accuracy vs. runtime (min)]
Conclusions for BP • ‘NetProbe’, ‘Polonium’, and belief propagation: exploit network effects • FaBP: fast & accurate (and yields convergence conditions)
Roadmap • Graph problems: • G1: Fraud detection – BP • eBay • Symantec • Unification • G2: Botnet detection – spectral • G3: Beyond graphs: tensors and “NELL” • Influence propagation and spike modeling • Conclusions
EigenSpokes • B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.
EigenSpokes • Eigenvectors of adjacency matrix • equivalent to singular vectors (symmetric, undirected graph)
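The equivalence can be checked numerically on a small undirected graph. This is a sketch on a made-up 4-node graph (a triangle with one pendant node); for a symmetric matrix the singular values are the absolute values of the eigenvalues, and the leading singular vector matches the leading eigenvector up to sign.

```python
import numpy as np

# Hypothetical undirected graph: triangle 0-1-2 plus node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

eigvals, eigvecs = np.linalg.eigh(A)   # eigen-decomposition (symmetric input)
U, s, Vt = np.linalg.svd(A)            # singular value decomposition

# Singular values equal the absolute eigenvalues, in decreasing order.
abs_sorted = np.sort(np.abs(eigvals))[::-1]
same_values = np.allclose(s, abs_sorted)

# The leading singular vector equals the leading eigenvector up to sign.
lead = eigvecs[:, np.argmax(np.abs(eigvals))]
same_vector = np.isclose(abs(U[:, 0] @ lead), 1.0)

print(same_values, same_vector)
```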
EigenSpokes • EE plot: • Scatter plot of scores of u1 vs u2 • One would expect • Many points at the origin • A few scattered ~randomly • [Figure: EE plot of u1 (1st principal component) vs u2 (2nd principal component), with a 90° angle marked]
EigenSpokes - pervasiveness • Present in mobile social graph • across time and space • Patent citation graph
EigenSpokes - explanation • Near-cliques, or near-bipartite-cores, loosely connected
EigenSpokes - explanation • Near-cliques, or near-bipartite-cores, loosely connected • So what? • Extract nodes with high scores • high connectivity • Good “communities” • [Figure: spy plot of top 20 nodes]
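The explanation can be reproduced on a synthetic graph. The sketch below is an assumption-laden toy, not the authors' pipeline: it plants two exact cliques (the slide says near-cliques) in a sparse path, connects them loosely, and extracts the nodes with the highest absolute scores on the top two eigenvectors. All sizes and node ids are made up.

```python
import numpy as np

def clique_edges(nodes):
    """All pairs of nodes, i.e. the edges of a clique."""
    return [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]

n = 30
edges = clique_edges(list(range(0, 8)))        # community 1: nodes 0..7
edges += clique_edges(list(range(8, 14)))      # community 2: nodes 8..13
edges += [(i, i + 1) for i in range(14, 29)]   # sparse background path
edges += [(7, 14), (13, 29)]                   # loose connections

A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# eigh returns eigenvalues in ascending order; take the two largest.
eigvals, eigvecs = np.linalg.eigh(A)
u1, u2 = eigvecs[:, -1], eigvecs[:, -2]

def top_nodes(u, k):
    """Indices of the k nodes with the largest absolute score in u."""
    return set(np.argsort(-np.abs(u))[:k].tolist())

community1 = top_nodes(u1, 8)
community2 = top_nodes(u2, 6)
print(sorted(community1), sorted(community2))
```

Plotting `u1` against `u2` for this graph gives the EE plot: each planted community lies along its own axis, and the background nodes sit near the origin.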
Bipartite Communities! • patents from same inventor(s) • ‘cut-and-paste’ bibliography! • [Figure: magnified bipartite community]
(maybe, botnets?) • Victim IPs? • Botnet members? • Exploring it with Dr. Eric Mao (III-Taiwan)
Roadmap • Graph problems: • G1: Fraud detection – BP • G2: Botnet detection – spectral • G3: Beyond graphs: tensors and “NELL” • Influence propagation and spike modeling • Conclusions
GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries • U Kang, Abhay Harpale, Evangelos Papalexakis, Christos Faloutsos • KDD ’12
Background: Tensors • Tensors (= multi-dimensional arrays) are everywhere • Hyperlinks & anchor text [Kolda+ ’05] • [Figure: URL × URL × anchor-text tensor, with anchor-text slices such as “C#”, “C++” and “Java”]
Background: Tensors • Tensors (= multi-dimensional arrays) are everywhere • Sensor stream (time, location, type) • Predicates (subject, verb, object) in knowledge base • e.g., “Eric Clapton plays guitar”, “Barack Obama is president of U.S.” • NELL (Never Ending Language Learner) data: mode sizes 26M, 26M and 48M; nonzeros = 144M
Background: Tensors • Tensors (= multi-dimensional arrays) are everywhere • Sensor stream (time, location, type) • Predicates (subject, verb, object) in knowledge base • Anomaly detection in computer networks (time-stamp, IP-source, IP-destination)
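A (subject, verb, object) tensor of this kind is extremely sparse, so only the nonzero cells are stored. The sketch below builds such a tensor in coordinate form from a few triples; the helper `index_of`, the dictionary layout and the repeated triple are illustrative assumptions, not the NELL data format.

```python
from collections import defaultdict

# Two triples from the slide, plus one repeat added for illustration.
triples = [
    ("Eric Clapton", "plays", "guitar"),
    ("Barack Obama", "is president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),
]

def index_of(table, key):
    """Map a string to a dense integer id, assigning a new id if unseen."""
    if key not in table:
        table[key] = len(table)
    return table[key]

subjects, verbs, objects = {}, {}, {}
tensor = defaultdict(int)      # (i, j, k) -> count; zeros are never stored
for s, v, o in triples:
    i = index_of(subjects, s)
    j = index_of(verbs, v)
    k = index_of(objects, o)
    tensor[(i, j, k)] += 1

shape = (len(subjects), len(verbs), len(objects))
nnz = len(tensor)
print(shape, nnz, dict(tensor))
```

The same layout fits the other examples on the slide: replace the three modes by (time, location, type) or (time-stamp, IP-source, IP-destination).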
Problem Definition • How to decompose a billion-scale tensor? • Corresponds to SVD in 2D case
Problem Definition • How to decompose a billion-scale tensor? • Corresponds to SVD in 2D case • [Figure: example components labeled ‘Artists’ and ‘Politicians’]
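As a rough illustration of what "decompose" means here, the sketch below fits a CP/PARAFAC model (a sum of rank-one terms, the tensor analogue of the SVD's sum of outer products) with plain alternating least squares. It is a dense, in-memory toy and does not attempt any of GigaTensor's scalability techniques. The names `khatri_rao` and `cp_als`, the iteration count and the small test tensor are assumptions.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    J, R = B.shape
    K = C.shape[0]
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def cp_als(X, rank, n_iter=50, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode-n unfoldings, ordered to match khatri_rao above.
    X1 = X.reshape(I, J * K)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        # Each update is the least-squares solution with the other two factors fixed.
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Hypothetical tensor built from two planted groups (two rank-one components).
A0 = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
B0 = np.array([[1., 0.], [0., 1.], [1., 1.]])
C0 = np.array([[2., 0.], [0., 1.], [1., 0.], [0., 3.]])
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

A, B, C = cp_als(X, rank=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_err)
```

Each column of the first factor matrix scores how strongly a row entity belongs to one component, which is how groups such as ‘Artists’ or ‘Politicians’ would surface in a subject-verb-object tensor.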