Peta-Graph Mining

Christos Faloutsos
Appel, Ana; Chau, Polo; Leskovec, Jure; Kang, U; Prakash, Aditya; Shringarpure, Suyash; Tsourakakis, Charalampos
Yahoo/Hadoop, 2008

Our goal: one-stop solution for mining huge graphs

Outline
• Datasets:
  • Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
  • Netflix (20K movies, ~500K users, 100M edges)

Degree Distributions - Netflix
[Plot: count vs. movie in-degree; annotation: ‘Theoretically expected’]
100 machines - 8 min

Degree Distributions - Netflix
[Plot: count vs. user out-degree; annotations: ‘Theoretically expected’, ‘Sharp drop below 100 ratings’]
100 machines - 8 min

Degree Distributions - Kronecker
[Plot: count vs. degree]
Nodes: 259M - Edges: 1B
100 machines - 6 h

Degree Distributions - timings
[Plot: time (sec) vs. edge file size (MB); series: 1 task, 24 tasks, 48 tasks]
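
The degree distributions on the slides above come down to two counting passes, each of which fits the map/reduce pattern: first count edges per node, then count nodes per degree value. The following is a minimal single-machine sketch of that idea, not the Hadoop job used in the talk. The function name `degree_distribution` and the toy `ratings` list are hypothetical and chosen only for illustration.

```python
from collections import Counter

def degree_distribution(edges, direction="in"):
    """Return {degree: number of nodes with that degree}.

    edges is an iterable of (source, target) pairs, e.g. (user, movie).
    direction="in" counts edges per target, "out" counts edges per source.
    """
    # Pass 1: key each edge by one endpoint and count edges per node.
    idx = 1 if direction == "in" else 0
    degree = Counter(edge[idx] for edge in edges)
    # Pass 2: key each node by its degree and count nodes per degree.
    return dict(Counter(degree.values()))

# Hypothetical toy data: (user, movie) rating pairs.
ratings = [
    ("u1", "m1"), ("u2", "m1"), ("u3", "m1"),
    ("u1", "m2"), ("u2", "m2"),
    ("u3", "m3"),
]

movie_in = degree_distribution(ratings, "in")    # movie in-degree histogram
user_out = degree_distribution(ratings, "out")   # user out-degree histogram
print(movie_in, user_out)
```

Both passes only group by a key and sum, which is why the computation distributes naturally across many machines.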

Diameter
• Diameter of a graph: the maximum shortest-path length over all pairs of nodes
• Exact computation normally costs more than O(N**2)
• ANF: ‘Approximate Neighborhood Function’ [Palmer+02]: O(E)
• Goal: calculate the neighborhood function
• Neighborhood N(h): number of pairs of nodes within distance h
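
To make the definition of N(h) concrete, here is a small sketch that computes the neighborhood function exactly, with one breadth-first search per node. This is the brute-force baseline, not ANF: ANF approximates the same quantity far more cheaply. The convention of counting ordered pairs, including each node paired with itself, is an assumption made for this example, and the function name and toy graph are hypothetical.

```python
from collections import deque

def neighborhood_function(adj):
    """Return [N(0), N(1), ...], where N(h) is the number of ordered
    node pairs (u, v) with shortest-path distance <= h.

    adj maps each node to a list of its neighbours.
    """
    exact = []  # exact[d] = number of ordered pairs at distance exactly d
    for source in adj:
        # Breadth-first search gives shortest-path distances from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            while len(exact) <= d:
                exact.append(0)
            exact[d] += 1
    # Running totals turn "exactly d" counts into "within h" counts.
    total, result = 0, []
    for count in exact:
        total += count
        result.append(total)
    return result

# Toy undirected path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
N = neighborhood_function(path)
diameter = len(N) - 1  # largest h at which N(h) still grows
print(N, diameter)
```

The diameter can be read off as the hop count at which N(h) stops growing, which is how a hop plot is interpreted.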

Diameter
[Plot: time (min) vs. edge file size (MB); series: 1 node, 28 nodes, 48 nodes]
• For large jobs, parallelization helps
• Unstable results due to shared machines

Diameter / Hop Plot (Netflix)
[Plot: # of reachable pairs within <= h hops vs. h (# of hops)]
Diameter: 3

Community detection
Cross associations [Chakrabarti+ ’04]


Triangles
• ‘friends of friends are friends’
• Naïve algo: 3-way join (slow)
• [Tsourakakis’08]: # triangles ~ sum of cubes of eigenvalues
• Thus, super-fast computation of # triangles (100x - 25,000x faster than naïve; >95% accuracy)
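
The relation between triangles and eigenvalues rests on an exact identity for an undirected simple graph with adjacency matrix A: trace(A^3) equals the sum of the cubed eigenvalues, and also equals 6 times the number of triangles, since each triangle yields six closed walks of length 3. The sketch below checks this on a complete graph K_n, whose eigenvalues are known in closed form (n-1 once and -1 repeated n-1 times), so no eigensolver is needed. It illustrates the identity only. It is not the method of [Tsourakakis’08], and all function names are hypothetical.

```python
from itertools import combinations

def complete_graph(n):
    """Adjacency matrix of K_n as a list of lists of 0/1."""
    return [[1 if i != j else 0 for j in range(n)] for i in range(n)]

def triangles_naive(a):
    """Count triangles by testing every unordered node triple."""
    n = len(a)
    return sum(
        1
        for i, j, k in combinations(range(n), 3)
        if a[i][j] and a[j][k] and a[i][k]
    )

def trace_of_cube(a):
    """trace(A^3): the number of closed walks of length 3."""
    n = len(a)
    return sum(
        a[i][j] * a[j][k] * a[k][i]
        for i in range(n) for j in range(n) for k in range(n)
    )

n = 4
a = complete_graph(n)
# Known spectrum of K_n: n-1 once, and -1 with multiplicity n-1.
eigenvalues = [n - 1] + [-1] * (n - 1)

from_naive = triangles_naive(a)
from_trace = trace_of_cube(a) // 6
from_spectrum = sum(lam ** 3 for lam in eigenvalues) // 6
print(from_naive, from_trace, from_spectrum)
```

Because the cubes of the largest-magnitude eigenvalues dominate the sum, a handful of them can already give a good estimate on many real graphs, which is consistent with the speed and accuracy figures quoted on the slide.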

Triangles
• Easy to implement on Hadoop: it only needs eigenvalues (to do, with Lanczos)

Visualization
Principled visualization of large graphs (show the few most ‘important’ edges)

Summary
Goal: one-stop solution for mining huge graphs
