630 likes | 822 Views
Overview. Goal: scalable algorithms to find patterns and anomalies on graphs Mining Large Graphs: Algorithms, Inference, and Discoveries Spectral Analysis of Billion-Scale Graphs : Discoveries and Implementation Patterns on the Connected Components of Terabyte-Scale Graphs
E N D
Overview Goal: scalable algorithms to find patterns and anomalies on graphs Mining Large Graphs: Algorithms, Inference, and Discoveries Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation Patterns on the Connected Components of Terabyte-Scale Graphs PI: Christos Faloutsos (CMU) Students: Leman Akoglu, Polo Chau, U Kang 1
Mining Large Graphs: Algorithms, Inference, and Discoveries U Kang Duen Horng Chau Christos Faloutsos School of Computer Science Carnegie Mellon University
Outline • Problem Definition • Proposed Method • Experiment • Conclusion
Motivation • Inference on graph: “guilt by association” • Adult sites tend to be connected to adult sites, while edu. sites are connected to educational ones • Given labels(adult or edu) on a subset of the nodes, infer the labels of other unlabeled nodes on graph • Tool: Belief Propagation(BP) blue nodes connected to blue nodes red nodes connected to red nodes 4
Belief Propagation Belief computation Node belief Prior prob Messages from neighbors Message computation Propagation matrix Messsage from node i to node j Prior prob ~Messages from neighbors 5
A Challenge in BP • Scalability! • Existing works assume that all the nodes (and/or edges) of the input graph fit in memory • Problem: what if the graph is too large to fit in memory? • Challenge: Scaling up the inference algorithm for very large graphs whose nodes do not fit in memory 6
Problem Definition • How can we scale up the BP algorithm to very large graphs? • Goal • Scalability: to billions of nodes and edges • Efficiency: fast algorithm 7
Outline • Problem Definition • Proposed Method • Experiment • Conclusion
Main Idea • Our approach • Use Hadoop to scale-up BP • Challenge • How can we formulate BP using a simple, efficient operation supported by Hadoop? 9
m01 m12 m24 m10 m21 m42 m13 m31 Main Idea • Key observation • BP message update equation = local message exchange A message is updated from its neighboring messages. For example, m12 is updated from m01 and m31 10
Main Idea • BP message update can be expressed by a generalized matrix-vector multiplication on a line graphL(G) induced from the original graph G • Nodes in L(G) are edges in G • Two nodes in L(G) are connected if they are adjacent in G 11
Proposed: HA-LFP algorithm • BP message update can be expressed by a generalized matrix-vector multiplication on a line graphL(G) induced from the original graph G Line graph of G Generalized m-v multiplication Multiply repeatedly until convergence New message vector Old message vector 12
Complexity One Iteration of HA-LFP on L(G) One Matrix Vector Multiplication on G = Time : O((V+E) / M) Space: O(V + E) V : # of nodes E : # of nodes M : # of machines 13
Outline • Problem Definition • Proposed Method • Experiment • Conclusion
Questions Q1: How fast is HA-LFP? Q2: How does HA-LFP scale-up? Q3: How can we find `good’ and `bad’ sites in a web graph?
Running Time Q1: How fast is HA-LFP? [10 iteration] 16
Scale Up Q2: How does HA-LFP scale-up? Linear on the number of machines, edges 17
Advantage of HA-LFP • Scalability • The only solution when the node information cannot fit in memory. • Near-linear scale up • Running Time • Faster than the single-machine, for large graphs • Fault Tolerance 18
Analysis of Web Graph Q3: How can we find `good’ and `bad’ sites in a web graph? Pages whose goodness scores < 0.9 are likely to be adult pages 19
Outline • Problem Definition • Proposed Method • Experiment • Conclusion
Conclusion HA-LFP Belief Propgation for billion-scale graphs on Hadoop Near-linear scalability on # of machines, edges Many applications Finding `good’ and `bad’ web sites Fraud detection … 21
Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation U Kang Brendan Meeder Christos Faloutsos School of Computer Science Carnegie Mellon University 22
Outline Problem Definition Proposed Method Experiment Conclusion 23
Problem Definition • Eigensolver • Computes top-k eigenvalues and eigenvectors • Application: • SVD, triangle counting, spectral clustering, … • Existing eigensolver • Can handle up to millions of nodes • How can we scale up eigensolvers to billion-scale graphs? 24
Outline Problem Definition Proposed Method Experiment Conclusion 25
Main Idea • HEigen algorithm (Hadoop Eigen-solver) • Selective parallelize ‘Lanczos’ algorithm • Expensive operation: on Hadoop for scalability • Inexpensive operation: on a single-machine for accuracy • Block encoding • Block encoding, and then do matrix-vector multiplication • Exploiting skewness in matrix-matrix mult. • In matrix-matrix multiplication when a matrix is very large and the other is very small 26
Application of HEigen • Triangle Counting • Real social networks have a lot of triangles • Friends of friends are friends • But: triangles are expensive to compute • (3-way join; several approx. algos) • Q: Can we do that quickly? • A: Yes! • #triangles = 1/6 Sum ( λi3 ) • (and, because of skewness in eigenvalues, • we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]
Outline Problem Definition Proposed Method Experiment Conclusion 28
Questions Q1: How does HEigen scale-up? Q2: Which Matrix-Matrix multiplication algorithm runs the fastest? Q3: How can we find anomalous sites in a web graph? 29
Running Time Q1: How does HEigen scale-up? Heigen-BLOCK is faster than PLAIN ver. Linear on the number of machines, edges
Scale Up Q2: Which Matrix-Matrix multiplication algorithm runs the fastest? Cache-based MM runs the fastest!
Results Q3: How can we find anomalous sites in a web graph? • Triangle counting on Twitter social network [Twitter 2009; ~ 3 billion edges] • U.S. politicians: moderate number of triangles vs. degree • Adult sites: very large number of triangles vs. degree 32
Outline Problem Definition Proposed Method Experiment Conclusion 33
Conclusion HEigen Eigensolver for billion-scale graphs on Hadoop Near-linear scalability on # of machines, edges Cache-based Matrix-Matrix multiplication: fastest! Anomalies in triangle counts Many applications Triangle counting SVD … 34
Patterns on the Connected Components of Terabyte-Scale Graphs U Kang* Mary McGlohon*† Leman Akoglu* Christos Faloutsos* (*) SCS, Carnegie Mellon University (†) Google 35
Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion 36
A large graph is composed of many connected components Problem Definition YahooWeb graph |V| = 1.4 billion |E| = 6.7 billion 120 GBytes Count Size Q1: static patterns? Q2: evolution patterns? Q3: model? 37
Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion 38
Q1: Static Patterns • What are the regularities in the connected components of a static graph? • How do they look like? • Do the GCC and the other connected components look similar? Chain? Clique? Idea: use ‘density’ and ‘radius’ to find patterns 39
Density of Connected Component • What is a good metric for the density of a connected component? • A candidate: |E| / |V| (“average degree”) • Problem: it increases over time Number of Edges Number of Nodes 40
Density of Connected Component • We want a metric that can measure the ‘intrinsic’ density of a component • Proposed: Graph Fractal Dimension(GFD) • log |E| / log |V| Number of Edges Number of Edges [Leskovec+ KDD05] Number of Nodes Number of Nodes 41
Density of Connected Component • Graph Fractal Dimension(GFD) • log |E| / log |V| Chain: GFD ~1 Star: GFD ~1 Bipartite Core: 1 < GFD < 2 Clique: GFD ~2 42
Density of Connected Component What are the GFDs of connected components in a large, real graph? 43
Density of Connected Component • GFDs of CCs in YahooWeb graph Number of Edges Number of Edges Slope= 1.08 Number of Nodes Number of Nodes GFDs of CCs are constant on average GFDs of CCs are slightly denser than the tree 44
Radius of Connected Component Q1.1: What does the GCC look like? Q1.2: What do the rest CC’s look like? ( What are the GFDs?) 45
Slope= 1.38 Radius of Connected Component • What are the patterns of radii in connected components? Avg. Max. Radius Max. Core Chain Average Radius A1.1: GCC looks like a ‘kite’ A1.2: Chain-like disconnected components 46
Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion 47
Q2: Evolution Patterns • How do the connected components evolve? • Do largest connected components grow with the same rate? • How often does a newcomer join the disconnected components? newcomer ? ? 48
Gelling Point • Gelling Point [McGlohon+ KDD08] • Diameter starts to shrink 49
Growth of Connected Component • GFDs of Top 3 CC’s over time Before “gelling point”: GFDs of Top 3 CC’s stay constant, “tree” like. After “deviation point”: GFD of GCC takes off, becomes denser. 50