Peta-Graph Mining
Christos Faloutsos
Appel, Ana; Chau, Polo; Leskovec, Jure; Kang, U; Prakash, Aditya; Shringarpure, Suyash; Tsourakakis, Charalampos
Yahoo/Hadoop, 2008
Our goal: One-stop solution for mining huge graphs
Outline • Datasets: • Synthetic (‘Kronecker’, ~300M nodes, 1B edges) • NetFlix (20K movies, ~500K users, 100M edges)
Degree Distributions - NetFlix [Figure: count vs. movie in-degree, annotated ‘Theoretically expected’] 100 machines - 8 min
Degree Distributions - NetFlix [Figure: count vs. user out-degree, annotated ‘Theoretically expected’ and ‘Sharp drop below 100 ratings’] 100 machines - 8 min
Degree Distributions - Kronecker [Figure: count vs. degree] Nodes: 259M - Edges: 1B - 100 machines - 6 h
Degree Distributions - timings [Figure: time (sec) vs. edge file size (MB), for 1 task, 24 tasks and 48 tasks]
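The slides report degree-distribution results but not the job itself. As a rough illustration only, the computation can be sketched as two counting passes in the map/reduce style: first count edges per node, then count nodes per degree. The function name `degree_distribution`, the `use_source` flag and the toy edge list are assumptions made for this sketch, not taken from the talk.

```python
from collections import Counter

def degree_distribution(edges, use_source=True):
    """Return {degree: number of nodes with that degree} for an edge list.

    Nodes that never appear on the chosen side of an edge (degree 0)
    are not counted.
    """
    # Pass 1: "map" emits (node, 1) per edge, "reduce" sums -> degree per node.
    idx = 0 if use_source else 1
    degree_per_node = Counter(edge[idx] for edge in edges)
    # Pass 2: "map" emits (degree, 1) per node, "reduce" sums -> nodes per degree.
    return dict(Counter(degree_per_node.values()))

# Hypothetical toy (user, movie) rating edges.
edges = [("u1", "m1"), ("u1", "m2"), ("u2", "m1"), ("u3", "m1")]
user_out = degree_distribution(edges, use_source=True)
movie_in = degree_distribution(edges, use_source=False)
print(user_out, movie_in)
```

Both passes are plain group-and-sum steps, which is why this kind of job splits naturally across many tasks.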
Diameter • Diameter of a graph: the maximum shortest-path length over all pairs of nodes • Normally, more than O(N**2) to compute • ANF: ‘Approximate Neighborhood Function’ [Palmer+02]: O(E) • Goal: calculate the neighborhood function • Neighborhood N(h): number of pairs of nodes within distance h
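To make the definition of N(h) concrete, here is a minimal sketch that computes the neighborhood function exactly with one breadth-first search per node. This is the slow baseline that ANF approximates, not ANF itself. It assumes an unweighted graph given as an adjacency dict in which every node is a key, and it counts ordered pairs including (u, u); the function names are made up for this sketch.

```python
from collections import deque

def neighborhood_function(adj):
    """Exact N(h) for h = 0, 1, ...: ordered pairs (u, v) with dist(u, v) <= h."""
    pairs_at = []  # pairs_at[d] = number of ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain BFS from this source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            while len(pairs_at) <= d:
                pairs_at.append(0)
            pairs_at[d] += 1
    # Running totals turn "exactly d" counts into "within h" counts.
    result, total = [], 0
    for count in pairs_at:
        total += count
        result.append(total)
    return result

def diameter(adj):
    """Largest finite shortest-path length: the h at which N(h) stops growing."""
    return len(neighborhood_function(adj)) - 1

# Hypothetical toy graph: the undirected path 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
print(neighborhood_function(path), diameter(path))
```

Plotting N(h) against h gives the hop plot, and the diameter is the point where the curve flattens.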
Diameter [Figure: time (min) vs. edge file size (MB), for 1 node, 28 nodes and 48 nodes] • For large jobs, parallelization helps • Unstable results due to shared machines
Diameter / Hop Plot (Netflix) [Figure: # of reachable pairs within <= h hops vs. h (# of hops)] Diameter: 3
Community detection • Cross associations [Chakrabarti+ ’04]
Triangles • ‘friends of friends are friends’ • Naïve algo: 3-way join (slow) • [Tsourakakis’08]: # triangles = (1/6) x sum of cubes of the adjacency-matrix eigenvalues; the top few eigenvalues give a close approximation • Thus, super-fast computation of # triangles (100x - 25,000x faster than naïve; >95% accuracy)
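The eigenvalue identity can be checked on a small graph. The sketch below is not the Hadoop implementation from the talk: it assumes a small dense symmetric 0/1 adjacency matrix and uses a simple Jacobi-rotation eigenvalue routine in place of Lanczos. All function names and the `top_k` parameter are hypothetical.

```python
import math
from itertools import combinations

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]

def count_triangles(adj, top_k=None):
    """(1/6) * sum of cubed eigenvalues; top_k keeps only the largest in magnitude."""
    eig = sorted(symmetric_eigenvalues(adj), key=abs, reverse=True)
    return sum(lam ** 3 for lam in eig[:top_k]) / 6.0

def count_triangles_naive(adj):
    """Brute-force check over all node triples."""
    n = len(adj)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if adj[i][j] and adj[j][k] and adj[i][k])

# Hypothetical toy graph: the complete graph on 4 nodes.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(count_triangles(k4), count_triangles_naive(k4))
```

With all eigenvalues the count is exact up to floating-point error. Passing a small `top_k` trades accuracy for needing only a few eigenvalues, which is the source of the speed-up.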
Triangles • Easy to implement on Hadoop: it needs only eigenvalues (to do, with Lanczos)
Visualization • Principled visualization of large graphs (show the few most ‘important’ edges)
Summary • Goal: one-stop solution for mining huge graphs