Counting Triangles and the Curse of the Last Reducer. Siddharth Suri, Sergei Vassilvitskii, Yahoo! Research. Presentation: Nikos Stasinopoulos. The Social Network. Using Clustering Coefficient to identify cliques.
Using Clustering Coefficient to identify cliques
• In social networks, nodes tend to cluster together (Holland and Leinhardt, 1971; Watts and Strogatz, 1998).
• Assume an undirected graph G = (V, E), where Γ(v) is v's neighborhood and d_v = |Γ(v)| is its degree.
• The clustering coefficient cc(v) is the fraction of pairs of v's neighbors that are themselves neighbors: $cc(v) = |\{(u, w) \in E : u, w \in \Gamma(v)\}| / \binom{d_v}{2}$.
Calculating CC by counting edges: cc(A) = 1, cc(B) = 1, cc(C) = 1/3, cc(D) = N/A
Calculating CC by counting triangles: cc(A) = 1, cc(B) = 1, cc(C) = 1/3, cc(D) = N/A. Again,…
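The values on these two slides can be reproduced with a few lines of Python. This is a minimal sketch: the edge list A-B, A-C, B-C, C-D is an assumption inferred from the listed coefficients (the slide's figure is not available), and `clustering_coefficient` is an illustrative name.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves adjacent."""
    neighbors = adj[v]
    d = len(neighbors)
    if d < 2:
        return None  # undefined (N/A): v has no pair of neighbors
    closed = sum(1 for u, w in combinations(sorted(neighbors), 2) if w in adj[u])
    return closed / (d * (d - 1) / 2)

# Assumed example graph; the original figure is not part of this transcript.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = {}
for u, w in edges:
    adj.setdefault(u, set()).add(w)
    adj.setdefault(w, set()).add(u)

cc = {v: clustering_coefficient(adj, v) for v in sorted(adj)}
```

Counting closed neighbor pairs (edges) and counting triangles through v give the same numerator, which is why both slides list identical values.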
How to count triangles – Naive approach
A sequential algorithm, NodeIterator:
• Pivot around each node
• Examine every pair of its neighbors
• Each triangle is counted 6 times
• Quadratic in the degree, even for a single high-degree node
• Running time: $O(\sum_{v \in V} d_v^2)$
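The naive pivot-and-check loop can be sketched as follows. The helper names `build_adjacency` and `node_iterator` are illustrative, and the example graph (edges A-B, A-C, B-C, C-D) is an assumption.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, w) edges."""
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    return adj

def node_iterator(adj):
    """Naive count: pivot on every node and test every ordered neighbor pair."""
    hits = 0
    for v in adj:
        for u in adj[v]:
            for w in adj[v]:
                if u != w and w in adj[u]:
                    hits += 1  # ordered pair (u, w) closes a triangle at pivot v
    # Each triangle is hit 6 times: 3 pivots x 2 orderings of the pair.
    return hits // 6

example = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
triangles = node_iterator(example)
```

The two inner loops run over all d_v² ordered neighbor pairs of each pivot, which is where the quadratic cost for a single high-degree node comes from.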
Improve upon the NodeIterator
The improved version, NodeIterator++:
• Pivot around low-degree nodes: each triangle is counted at its lowest-degree node
• Result: each triangle is counted only once and, more importantly, far fewer 2-paths are considered
• Running time: $O(m^{3/2})$, which is optimal [Schank, T., 2007]
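A minimal sketch of the improved pivot rule, assuming ties between equal-degree nodes are broken by node id (so ids must be mutually comparable). The names `build_adjacency`, `rank` and `node_iterator_pp` are illustrative.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, w) edges."""
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    return adj

def node_iterator_pp(adj):
    """Count each triangle once, at its lowest-ranked node."""
    def rank(x):
        return (len(adj[x]), x)  # degree first, node id as tie-breaker

    triangles = 0
    for v in adj:
        # Only neighbors ranked above v are paired, so v must be the
        # lowest-ranked corner of any triangle it reports.
        higher = [u for u in adj[v] if rank(u) > rank(v)]
        for i, u in enumerate(higher):
            for w in higher[i + 1:]:
                if w in adj[u]:
                    triangles += 1
    return triangles

example = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
triangles = node_iterator_pp(example)
```

On a star graph the hub has no higher-ranked neighbor, so it generates no 2-paths at all, whereas the naive version would examine every pair of its leaves.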
Why parallelize algorithms?
• The graph data structure doesn't fit in the memory of a single machine.
• A sample Twitter graph has 42 million nodes and 2.4 billion edges, about 4.5 GB of compressed data.
• Inside the algorithm, materializing all 2-paths pushes the memory demand into the petabytes.
Advantages of MapReduce
• Runs on commodity hardware
• Machine failures are non-critical
• Widely used at Yahoo!, Google, Facebook, Microsoft (12/10)
• Provided by cloud services such as Amazon Web Services
• Open source
MR-NodeIterator
• Round 1: Generate all possible 2-paths through each pivot node
• Round 2: Check whether each 2-path, together with its pivot node, forms a triangle
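MR-NodeIterator
• Round 1 (split the input across reducers and formulate 2-paths):
  • Map 1: for each edge (u, v), emit ⟨u; v⟩ to reducer 1
  • Reduce 1: input ⟨v; S ⊆ Γ(v)⟩
  • Output: all possible 2-paths ⟨(u, w); v⟩ where u, w ∈ S
  • Example: (A,B);C - (A,D);C - (B,D);C
• Round 2 (the symbol $ denotes the existence of the edge between the two endpoints):
  • Map 2: send ⟨(u, w); v⟩ and ⟨(u, w); $⟩ to reducer 2
  • Reduce 2: input ⟨(u, w); S ⊆ V ∪ {$}⟩
  • Output: if $ exists in S, then count the triangles
  • Example: (A,B);C,$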
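The two rounds can be simulated in memory to see the key/value flow. This is a sketch under assumptions, not a Hadoop job: `shuffle` stands in for the framework's group-by-key step, `EDGE_MARKER` is an illustrative stand-in for the edge-existence symbol, node ids are assumed comparable and never equal to the marker, and each undirected edge is assumed to appear once in the input.

```python
from collections import defaultdict
from itertools import combinations

EDGE_MARKER = "$"  # assumed marker meaning "this key is an actual edge"

def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def mr_node_iterator(edges):
    # Round 1 map: send every edge to both of its endpoints.
    map1 = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    # Round 1 reduce: each pivot emits all 2-paths through it, keyed by the endpoints.
    two_paths = []
    for pivot, neighbors in shuffle(map1).items():
        for u, w in combinations(sorted(neighbors), 2):
            two_paths.append(((u, w), pivot))
    # Round 2 map: 2-paths pass through unchanged; every edge is tagged with the marker.
    map2 = two_paths + [(tuple(sorted(e)), EDGE_MARKER) for e in edges]
    # Round 2 reduce: a 2-path closes into a triangle only if its endpoints share an edge.
    closed = 0
    for endpoints, values in shuffle(map2).items():
        if EDGE_MARKER in values:
            closed += sum(1 for x in values if x != EDGE_MARKER)
    # With unordered neighbor pairs, each triangle is seen once per pivot, i.e. 3 times.
    return closed // 3

triangles = mr_node_iterator([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
```

For this assumed edge list, the round 1 reducer for C emits exactly the three 2-paths of the slide's example, and only (A,B);C is matched by an edge marker in round 2.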
MR-NodeIterator++
• Pivot around the node with the lower degree
• Input to Reducer 1 is $O(\sqrt{m})$
• Output is $O(m^{3/2})$
• Reducer 2 input contains the entire edge list and is $O(m^{3/2})$
• Result: each triangle is counted only once
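To illustrate how pivoting on the lower-degree endpoint shrinks the round 1 reducer inputs, here is a small sketch of the modified map step. The function name `round1_inputs_pp` and the tie-break by node id are assumptions for illustration.

```python
from collections import Counter, defaultdict

def round1_inputs_pp(edges):
    """Route each edge only to its lower-ranked endpoint (degree, then id)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(x):
        return (degree[x], x)

    inputs = defaultdict(list)
    for u, v in edges:
        low, high = sorted((u, v), key=rank)
        inputs[low].append(high)  # the reducer for `low` receives `high`
    return inputs

# A star graph: one hub of degree 1000, the worst case for the naive version.
star = [("hub", f"leaf{i}") for i in range(1000)]
inputs = round1_inputs_pp(star)
largest = max(len(values) for values in inputs.values())
```

In the naive version the hub's reducer would receive all 1000 neighbors and emit roughly half a million 2-paths; here the hub receives nothing and every leaf reducer receives a single value.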
Data Skew – The Curse
• Real graphs contain nodes with a very high degree.
• The reducer handling node @BarackObama (~10M followers) has to check about 100 trillion 2-paths using the naive approach.
• Natural graphs commonly follow a power-law degree distribution!
• The curse of "99% complete": nearly all reducers finish quickly while the job waits on the last one.
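Lifting the Curse
Split the nodes into low-degree (L, $d_v \le \sqrt{m}$) and high-degree (H, $d_v > \sqrt{m}$), i.e. NodeIterator++:
• |L| is at most n and each low-degree node generates at most $d_v^2 \le \sqrt{m}\, d_v$ paths
• |H| is at most $2\sqrt{m}$ and each high-degree node generates $O(m)$ paths
• Finally, total work is $O(m^{3/2})$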
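The bound can be worked out in a few lines. This is a sketch of the standard argument, assuming the threshold $\sqrt{m}$ separates L from H and that a pivot only pairs neighbors of higher degree:

$$
\begin{aligned}
\sum_{v \in L} d_v^2 &\le \sqrt{m} \sum_{v \in L} d_v \le \sqrt{m} \cdot 2m = 2m^{3/2}, \\
|H| &< \frac{2m}{\sqrt{m}} = 2\sqrt{m}, \\
\sum_{v \in H} \binom{2\sqrt{m}}{2} &\le 2\sqrt{m} \cdot 2m = 4m^{3/2}.
\end{aligned}
$$

The first line uses $\sum_v d_v = 2m$. The last line holds because the higher-degree neighbors of a high-degree node all lie in H, so each such pivot pairs at most $2\sqrt{m}$ neighbors. Adding the two parts gives $O(m^{3/2})$ 2-paths in total.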
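Tackling the Curse – Graph Partitioning
• The authors suggest partitioning the graph: split V into ρ equal-sized subsets $V_1, \dots, V_\rho$.
• For each triple i < j < k, $G_{ijk}$ is the subgraph induced by $V_i \cup V_j \cup V_k$.
• Each $G_{ijk}$ contains a 3/ρ fraction of the vertices and, in expectation, $O(m/\rho^2)$ edges.
• Every triangle appears in at least one subgraph, possibly in more, so weights are introduced to scale the count.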
In how many subgraphs does a triangle appear?
• Assume G is divided into ρ = 4 subsets
• Case 1: The triangle's vertices lie in distinct subsets: it appears once
• Case 2: Two vertices lie in the same subset: it appears ρ − 2 times (2 for ρ = 4)
• Case 3: All three vertices lie in one subset: it appears $\binom{\rho-1}{2}$ times (3 for ρ = 4)
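The three cases can be checked against a brute-force enumeration of all subset triples. Both function names below are illustrative; `parts` holds the subset indices of the triangle's three vertices.

```python
from itertools import combinations
from math import comb

def appearances(parts, rho):
    """Closed-form number of subgraphs G_ijk that contain the triangle."""
    distinct = len(set(parts))
    if distinct == 3:
        return 1                  # Case 1: the triple is fully determined
    if distinct == 2:
        return rho - 2            # Case 2: one free choice for the third subset
    return comb(rho - 1, 2)       # Case 3: two free choices among the other subsets

def appearances_brute_force(parts, rho):
    """Enumerate every triple i < j < k and test whether it covers the triangle."""
    return sum(
        1 for triple in combinations(range(rho), 3) if set(parts) <= set(triple)
    )

rho = 4
counts = [appearances(p, rho) for p in [(0, 1, 2), (0, 0, 1), (0, 0, 0)]]
# counts == [1, 2, 3]
```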
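MR-GraphPartition (annotations on the pseudocode slide)
• A hash function distributes vertices to buckets
• Total size of Map output is $O(\rho m)$
• Input size for each Reducer is $O(m/\rho^2)$
• The Reduce step handles Case 1, Case 2 and Case 3 and scales each count by the number of appearances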
The partitioning (ρ) parameter
• Total size of Map output is $O(\rho m)$
• Input size for each Reducer is $O(m/\rho^2)$ in expectation
• This calls for a tradeoff: increasing ρ raises the total disk space needed for the Map output but greatly decreases the RAM required by each Reducer.
• Again, total work is $O(m^{3/2})$, distributed across the Reducers.
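The whole partition scheme can be simulated in memory to confirm that the weighted count is exact. This is a sketch under assumptions rather than the authors' implementation: `bucket` is a hypothetical hash function mapping a vertex to one of ρ subsets, ρ is assumed to be at least 3, and each undirected edge is assumed to appear once in the input.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations
from math import comb

def count_triangles_partitioned(edges, rho, bucket):
    # Map: send each edge to every triple (i, j, k) covering both endpoint buckets.
    reducers = defaultdict(list)
    for u, v in edges:
        needed = {bucket(u), bucket(v)}
        for triple in combinations(range(rho), 3):
            if needed <= set(triple):
                reducers[triple].append((u, v))

    # Reduce: count triangles inside each subgraph, weighted by 1 / appearances.
    total = Fraction(0)
    for sub_edges in reducers.values():
        adj = defaultdict(set)
        for u, v in sub_edges:
            adj[u].add(v)
            adj[v].add(u)
        for u, v in sub_edges:
            for w in adj[u] & adj[v]:
                distinct = len({bucket(u), bucket(v), bucket(w)})
                copies = {3: 1, 2: rho - 2, 1: comb(rho - 1, 2)}[distinct]
                # Within one subgraph a triangle is found once per edge, hence the 3.
                total += Fraction(1, 3 * copies)
    return total

rho = 4
k5 = [(a, b) for a in range(5) for b in range(a + 1, 5)]
triangles = count_triangles_partitioned(k5, rho, bucket=lambda v: v % rho)
```

Using `Fraction` keeps the weighted sum exact, so the result equals the true triangle count regardless of how the vertices are hashed. Raising `rho` creates more, smaller reducer inputs at the cost of replicating each edge into more triples.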
Results • Completion time distributes “normally” across the runtime spectrum
Contributions
• Applies MapReduce to triangle counting, including the naive approach
• Provides the Graph Partition MR algorithm, extendable to subgraphs other than triangles
• Implements some of Schank's work in the context of social networks
• Explores challenges in real-world data (data skew)
• Results are exact, not approximations