Overview

Overview Goal: scalable algorithms to find patterns and anomalies on graphs • Mining Large Graphs: Algorithms, Inference, and Discoveries • Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation • Patterns on the Connected Components of Terabyte-Scale Graphs PI: Christos Faloutsos (CMU) Students: Leman Akoglu, Polo Chau, U Kang

Mining Large Graphs: Algorithms, Inference, and Discoveries U Kang Duen Horng Chau Christos Faloutsos School of Computer Science Carnegie Mellon University

Outline • Problem Definition • Proposed Method • Experiment • Conclusion

Motivation • Inference on graphs: “guilt by association” • Adult sites tend to be connected to adult sites, while educational sites tend to be connected to educational ones • Given labels (adult or educational) on a subset of the nodes, infer the labels of the unlabeled nodes in the graph • Tool: Belief Propagation (BP) [Figure: blue nodes connected to blue nodes; red nodes connected to red nodes]

Belief Propagation • Belief computation [equation labels: node belief; prior prob.; messages from neighbors] • Message computation [equation labels: propagation matrix; message from node i to node j; prior prob.; messages from neighbors]
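For reference, the two computations named on this slide are usually written in the standard sum-product form below. The notation is an assumption and is not taken from the slide: $\phi_i$ is the prior of node $i$, $\psi_{ij}$ is the propagation matrix, $N(i)$ is the set of neighbors of $i$, $m_{ij}$ is the message from node $i$ to node $j$, and $k$ is a normalizing constant.

```latex
b_i(x_i) = k \, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)

m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i) \, \psi_{ij}(x_i, x_j)
              \prod_{n \in N(i) \setminus \{j\}} m_{ni}(x_i)
```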

A Challenge in BP • Scalability! • Existing work assumes that all the nodes (and/or edges) of the input graph fit in memory • Problem: what if the graph is too large to fit in memory? • Challenge: scaling up the inference algorithm for very large graphs whose nodes do not fit in memory

Problem Definition • How can we scale up the BP algorithm to very large graphs? • Goal • Scalability: to billions of nodes and edges • Efficiency: fast algorithm

Main Idea • Our approach • Use Hadoop to scale up BP • Challenge • How can we formulate BP using a simple, efficient operation supported by Hadoop?

Main Idea • Key observation • BP message update equation = local message exchange • A message is updated from its neighboring messages. For example, m12 is updated from m01 and m31. [Figure: example graph with messages m01, m10, m12, m21, m13, m31, m24, m42]

Main Idea • BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G • Nodes in L(G) are edges in G • Two nodes in L(G) are connected if the corresponding edges are adjacent in G
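A minimal Python sketch of the line-graph construction, assuming messages are directed and that the message from i to j is fed by the messages from k to i with k ≠ j (this matches the earlier example in which m12 is updated from m01 and m31). The function name `directed_line_graph` and the toy edge list are illustrative and do not come from the slides.

```python
from collections import defaultdict

def directed_line_graph(undirected_edges):
    """Map each directed edge (i, j) of G to the directed edges (k, i), k != j,
    whose messages feed into it. Each key is one node of the line graph."""
    neighbors = defaultdict(set)
    for u, v in undirected_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    incoming = {}
    for i in neighbors:
        for j in neighbors[i]:
            incoming[(i, j)] = sorted((k, i) for k in neighbors[i] if k != j)
    return incoming

# Toy graph consistent with the message labels m01, m12, m13, m24.
edges = [(0, 1), (1, 2), (1, 3), (2, 4)]
line_graph = directed_line_graph(edges)
print(line_graph[(1, 2)])  # messages that feed m12
```

Every undirected edge of G yields two nodes of the line graph, one per message direction, so a graph with E edges gives 2E line-graph nodes in this sketch.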

Proposed: HA-LFP algorithm • BP message update can be expressed by a generalized matrix-vector multiplication on a line graph L(G) induced from the original graph G • New message vector = (line graph of G) × (old message vector), where × is the generalized matrix-vector multiplication • Multiply repeatedly until convergence
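The repeated update can be illustrated with a small single-machine Python sketch. It is not the Hadoop implementation: it assumes two states per node, the standard sum-product update, and synchronous updates of all messages. The function name `run_bp`, the priors, and the propagation matrix `psi` are made-up values for illustration.

```python
from collections import defaultdict

STATES = (0, 1)  # illustrative labels, e.g. 0 = "good", 1 = "bad"

def run_bp(edges, priors, psi, max_iters=50, tol=1e-9):
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # One message per directed edge, i.e. per node of the line graph.
    messages = {(i, j): [0.5, 0.5] for i in neighbors for j in neighbors[i]}
    for _ in range(max_iters):
        new_messages = {}
        for (i, j) in messages:
            out = []
            for s in STATES:
                total = 0.0
                for t in STATES:
                    prod = priors[i][t] * psi[t][s]
                    # Combine incoming messages (k, i), k != j, by product.
                    for k in neighbors[i]:
                        if k != j:
                            prod *= messages[(k, i)][t]
                    total += prod
                out.append(total)
            z = sum(out)
            new_messages[(i, j)] = [x / z for x in out]
        delta = max(abs(a - b) for e in messages
                    for a, b in zip(messages[e], new_messages[e]))
        messages = new_messages
        if delta < tol:  # converged: messages stopped changing
            break
    beliefs = {}
    for i in neighbors:
        b = [priors[i][s] for s in STATES]
        for k in neighbors[i]:
            b = [b[s] * messages[(k, i)][s] for s in STATES]
        z = sum(b)
        beliefs[i] = [x / z for x in b]
    return beliefs

edges = [(0, 1), (1, 2), (1, 3), (2, 4)]
priors = {n: [0.5, 0.5] for n in range(5)}
priors[0] = [0.9, 0.1]            # only node 0 carries a label
psi = [[0.9, 0.1], [0.1, 0.9]]    # neighbors tend to share a state
beliefs = run_bp(edges, priors, psi)
for node in sorted(beliefs):
    print(node, [round(p, 3) for p in beliefs[node]])
```

In this toy run the belief in state 0 weakens with distance from the labeled node, which is the “guilt by association” effect in miniature.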

Complexity • One iteration of HA-LFP on L(G) = one matrix-vector multiplication on G • Time: O((V+E) / M) • Space: O(V + E) • V: # of nodes, E: # of edges, M: # of machines

Questions • Q1: How fast is HA-LFP? • Q2: How does HA-LFP scale up? • Q3: How can we find ‘good’ and ‘bad’ sites in a web graph?

Running Time Q1: How fast is HA-LFP? [10 iterations]

Scale Up Q2: How does HA-LFP scale up? Linear in the number of machines and edges

Advantage of HA-LFP • Scalability • The only solution when the node information cannot fit in memory • Near-linear scale-up • Running Time • Faster than the single-machine version for large graphs • Fault Tolerance

Analysis of Web Graph Q3: How can we find ‘good’ and ‘bad’ sites in a web graph? Pages with goodness scores < 0.9 are likely to be adult pages

Conclusion • HA-LFP: Belief Propagation for billion-scale graphs on Hadoop • Near-linear scalability on # of machines, edges • Many applications • Finding ‘good’ and ‘bad’ web sites • Fraud detection • …

Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation U Kang Brendan Meeder Christos Faloutsos School of Computer Science Carnegie Mellon University

Outline • Problem Definition • Proposed Method • Experiment • Conclusion

Problem Definition • Eigensolver • Computes top-k eigenvalues and eigenvectors • Applications: SVD, triangle counting, spectral clustering, … • Existing eigensolvers • Can handle up to millions of nodes • How can we scale up eigensolvers to billion-scale graphs?

Main Idea • HEigen algorithm (Hadoop Eigen-solver) • Selectively parallelize the ‘Lanczos’ algorithm • Expensive operations: on Hadoop for scalability • Inexpensive operations: on a single machine for accuracy • Block encoding • Block-encode, and then do matrix-vector multiplication • Exploiting skewness in matrix-matrix mult. • Applies to matrix-matrix multiplication when one matrix is very large and the other is very small

Application of HEigen • Triangle Counting • Real social networks have a lot of triangles • Friends of friends are friends • But: triangles are expensive to compute (3-way join; several approx. algos) • Q: Can we do that quickly? • A: Yes! • $\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$ (and, because of skewness in eigenvalues, we only need the top few eigenvalues!) [Tsourakakis ICDM 2008]
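The eigenvalue formula rests on the identity that the sum of the cubed eigenvalues of the adjacency matrix A equals trace(A^3), the number of closed walks of length 3, and each triangle is counted 6 times (3 starting nodes × 2 directions). The Python sketch below checks the trace form against a brute-force count on a tiny graph. It does not compute eigenvalues and is not the HEigen method; all function names are illustrative.

```python
from itertools import combinations

def matmul(a, b):
    """Dense matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangles_via_trace(adj):
    """trace(A^3) / 6, which equals (1/6) * sum of cubed eigenvalues."""
    a3 = matmul(matmul(adj, adj), adj)
    return sum(a3[i][i] for i in range(len(adj))) // 6

def triangles_brute_force(adj):
    """Count node triples that are pairwise connected."""
    n = len(adj)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if adj[i][j] and adj[j][k] and adj[i][k])

def complete_graph(n):
    return [[1 if i != j else 0 for j in range(n)] for i in range(n)]

# K4 has eigenvalues 3, -1, -1, -1, so (27 - 3) / 6 = 4 triangles.
k4 = complete_graph(4)
count_trace = triangles_via_trace(k4)
count_brute = triangles_brute_force(k4)
print(count_trace, count_brute)
```

On a real graph the full trace is too expensive to compute this way, which is why truncating the sum to the top few eigenvalues is attractive.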

Questions • Q1: How does HEigen scale up? • Q2: Which matrix-matrix multiplication algorithm runs the fastest? • Q3: How can we find anomalous sites in a web graph?

Running Time Q1: How does HEigen scale up? HEigen-BLOCK is faster than the PLAIN version. Linear in the number of machines and edges

Scale Up Q2: Which Matrix-Matrix multiplication algorithm runs the fastest? Cache-based MM runs the fastest!

Results Q3: How can we find anomalous sites in a web graph? • Triangle counting on Twitter social network [Twitter 2009; ~ 3 billion edges] • U.S. politicians: moderate number of triangles vs. degree • Adult sites: very large number of triangles vs. degree

Conclusion • HEigen: Eigensolver for billion-scale graphs on Hadoop • Near-linear scalability on # of machines, edges • Cache-based matrix-matrix multiplication: fastest! • Anomalies in triangle counts • Many applications • Triangle counting • SVD • …

Patterns on the Connected Components of Terabyte-Scale Graphs U Kang* Mary McGlohon*† Leman Akoglu* Christos Faloutsos* (*) SCS, Carnegie Mellon University (†) Google

Outline • Problem Definition • Static Patterns • Evolution Patterns • Model • Conclusion

Problem Definition • A large graph is composed of many connected components • YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes [Figure: count vs. size of connected components] • Q1: static patterns? • Q2: evolution patterns? • Q3: model?
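To make the count-versus-size view concrete, here is a small single-machine Python sketch that finds connected components with union-find and tabulates how many components there are of each size. It only illustrates the quantity being plotted and is not the distributed method needed for a 120 GByte graph. The function name and the toy graph are hypothetical.

```python
from collections import Counter

def component_sizes(num_nodes, edges):
    """Return a Counter mapping each component's root node to its size."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return Counter(find(x) for x in range(num_nodes))

# Toy graph: components {0, 1, 2}, {3, 4}, {5}, {6}.
sizes = component_sizes(7, [(0, 1), (1, 2), (3, 4)])
size_histogram = Counter(sizes.values())  # component size -> count
for size in sorted(size_histogram):
    print(size, size_histogram[size])
```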

Q1: Static Patterns • What are the regularities in the connected components of a static graph? • What do they look like? • Do the GCC and the other connected components look similar? Chain? Clique? • Idea: use ‘density’ and ‘radius’ to find patterns

Density of Connected Component • What is a good metric for the density of a connected component? • A candidate: |E| / |V| (“average degree”) • Problem: it increases over time [Figure: number of edges vs. number of nodes]

Density of Connected Component • We want a metric that can measure the ‘intrinsic’ density of a component • Proposed: Graph Fractal Dimension (GFD) • log |E| / log |V| [Figure: two panels, number of edges vs. number of nodes] [Leskovec+ KDD05]

Density of Connected Component • Graph Fractal Dimension (GFD) • log |E| / log |V| • Chain: GFD ~1 • Star: GFD ~1 • Bipartite Core: 1 < GFD < 2 • Clique: GFD ~2
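The listed values can be checked numerically with a short Python sketch. The node counts and the particular bipartite core (10 nodes fully connected to 990 nodes) are arbitrary choices for illustration, and the function name `gfd` is not from the slides.

```python
import math

def gfd(num_nodes, num_edges):
    """Graph Fractal Dimension: log |E| / log |V|."""
    return math.log(num_edges) / math.log(num_nodes)

n = 1000
chain = gfd(n, n - 1)              # path on n nodes has n - 1 edges
star = gfd(n, n - 1)               # hub plus n - 1 leaves
bipartite_core = gfd(n, 10 * 990)  # complete bipartite, sides 10 and 990
clique = gfd(n, n * (n - 1) // 2)  # every pair connected

for name, value in [("chain", chain), ("star", star),
                    ("bipartite core", bipartite_core), ("clique", clique)]:
    print(f"{name}: {value:.3f}")
```

For finite n the clique value sits a little below 2, because |E| = n(n-1)/2 rather than n^2; it approaches 2 as n grows.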

Density of Connected Component What are the GFDs of connected components in a large, real graph?

Density of Connected Component • GFDs of CCs in YahooWeb graph [Figure: two panels, number of edges vs. number of nodes; slope = 1.08] • GFDs of CCs are constant on average • CCs are slightly denser than a tree

Radius of Connected Component Q1.1: What does the GCC look like? Q1.2: What do the rest of the CCs look like? (What are the GFDs?)

Radius of Connected Component • What are the patterns of radii in connected components? [Figure labels: Slope = 1.38, Avg., Max., Radius, Max., Core, Chain, Average Radius] • A1.1: GCC looks like a ‘kite’ • A1.2: Chain-like disconnected components

Q2: Evolution Patterns • How do the connected components evolve? • Do the largest connected components grow at the same rate? • How often does a newcomer join the disconnected components? [Figure: a newcomer node and candidate components]

Gelling Point • Gelling Point [McGlohon+ KDD08] • Diameter starts to shrink

Growth of Connected Component • GFDs of Top 3 CC’s over time • Before “gelling point”: GFDs of Top 3 CC’s stay constant, “tree” like. • After “deviation point”: GFD of GCC takes off, becomes denser.
