Graph Data Mining with Map-Reduce Nima Sarshar, Ph.D. INTUIT Inc, Nima_sarshar@intuit.com
Intuit, Graphs and Me • Me: • Large-scale graph data processing, complex networks analysis, graph algorithms … • Intuit: • QuickBooks, TurboTax, Mint.com, GoPayment, … • Graphs @ Intuit: • Commercial Graph is the business “social network”
My Goals for this Talk • You leave with your inner computer scientist tantalized: • There is more to writing efficient Map-Reduce algorithms than counting words and merging logs • You get a general sense of the state of the research • I convince you of the need for a real graph processing package for Hadoop • You know a bit about our work at Intuit
Plan • Jump right to it with an example (enumerating triangles) • Define the performance metrics (what are we optimizing for?) • Give a classification of known “recipes” • The triangle example with a new trick • Personalized PageRank, connected components • A list of other algorithms
Finding Triangles with Map-Reduce • Step 1: Key edges by both end nodes • Step 2: Emit potential triangles: every pair of edges keyed under the same node • 5 potential triangles to consider in the example • Another round of Map-Reduce jobs will check for the existence of the “closing” edge • [Figure: example graph on nodes 1-4, its edges keyed by end node, and the resulting potential triangles]
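The two rounds can be sketched in plain Python, simulating the shuffle with a dictionary. This is only a sketch: the function name `naive_triangles` is made up for illustration, and the edge list is an assumption read off the node labels in the figure (edges 1-3, 2-3, 3-4, 2-4).

```python
from collections import defaultdict
from itertools import combinations

def naive_triangles(edges):
    """Simulate the naive two-round algorithm on an undirected edge list."""
    # Round 1, map: key every edge under both of its end nodes.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Round 1, reduce: every pair of neighbors of a node is a potential triangle.
    potential = [(a, b, center)
                 for center, nbrs in neighbors.items()
                 for a, b in combinations(sorted(nbrs), 2)]
    # Round 2: keep a potential triangle only if its closing edge (a, b) exists.
    edge_set = {frozenset(e) for e in edges}
    found = [tuple(sorted(t)) for t in potential
             if frozenset(t[:2]) in edge_set]
    return potential, found

# Assumed example graph (inferred from the figure labels, not stated in the text).
edges = [(1, 3), (2, 3), (3, 4), (2, 4)]
potential, found = naive_triangles(edges)
print(len(potential), found)
```

On this assumed graph the sketch emits 5 potential triangles and reports the single triangle (2, 3, 4) three times, once under each of its vertices.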
Problems with this Approach • Each triangle will be detected 3 times, once under each of its 3 vertices • Too many “potential” triangles are created in the first reduce step • For a node with degree $d$: $\binom{d}{2} = O(d^2)$ potential triangles • Total # of records: $\sum_v \binom{d_v}{2} = O\left(\sum_v d_v^2\right)$
Modified Algorithm [Cohen ‘08] • Step 1: Key each edge only under its smaller node • Step 2: Emit potential triangles • For each triangle exactly one potential triangle is created (under the lowest value node) • [Figure: the same example graph on nodes 1-4, with each edge keyed under its smaller node only]
The quadratic problem still persists • This is neat. At least we are not triple counting • But the quadratic problem still exists. The number of records is still $O(N\langle k^2\rangle)$, where $\langle k^2\rangle$ is the mean squared degree • We want to avoid binning edges under high degree nodes • The ordering of nodes is arbitrary! Let the degree of a node define its order • Bin an edge under its LOW DEGREE node • Break ties arbitrarily, but consistently • [Figure: example graph on nodes 1-5 with edges binned under their low-degree nodes]
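A minimal sketch of the degree-ordering trick, again simulating the shuffle in memory. The name `ordered_triangles`, the tie-break by node id, and the example edge list are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def ordered_triangles(edges):
    """Bin each edge under its lower-degree end node, then close triangles."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(v):
        # Order nodes by degree; break ties by node id (arbitrary but consistent).
        return (degree[v], v)

    # Step 1: key each edge only under its lower-ranked end node.
    bins = defaultdict(list)
    for u, v in edges:
        low, high = sorted((u, v), key=rank)
        bins[low].append(high)
    # Step 2: pairs of edges in the same bin are the potential triangles.
    potential = [(low, a, b)
                 for low, highs in bins.items()
                 for a, b in combinations(highs, 2)]
    # Next round: check for the closing edge (a, b).
    edge_set = {frozenset(e) for e in edges}
    triangles = [tuple(sorted(t)) for t in potential
                 if frozenset(t[1:]) in edge_set]
    return potential, triangles

# Assumed example graph: edges 1-3, 2-3, 3-4, 2-4.
potential, triangles = ordered_triangles([(1, 3), (2, 3), (3, 4), (2, 4)])
print(len(potential), triangles)
```

On the assumed graph, node 3 has the highest degree and receives no edges at all, so only one potential triangle is created and the triangle (2, 3, 4) is reported once.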
The performance • Worst case: records vs. • The same as the best serial algorithm [Suri ‘11] • The gain for “real” graphs is fairly substantial. If a graph is reasonably random, it cuts down to: vs. • For a heavy-tailed social graph (like our Commercial Graph), this can be fairly huge
Enumerating Rectangles • Triangles will tell you the friends you have in common with another friend • “People You May Know”: find another node, not connected to you, that has many friends in common with you. That node is a good candidate for “friendship” • If the graph is bipartite, this is the basis of user-based or content-based collaborative filtering
Generalization to Rectangles • Ordering triangle nodes has a unique equivalence class • There are 4 classes for a rectangle: requires a bit more work • [Figure: an ordered triangle on nodes 1-3 and the ordered rectangles A, B, C on nodes 1-4]
Performance Metrics • Computation: • Total computation in all mappers and reducers • Communication: • How many bits are shuffled from the mapper to the reducer • Number of map-reduce steps: • You can work it into the above • The overhead of running jobs
“Recipes” for Graph MR Algorithms Roughly two classes of algorithms: • Partition-Compute then Merge • Create smaller sub-graphs that each fit into the memory of a single machine • Do computation on the small graphs • Construct the final answer from the answers to the small sub-problems • Compute-in-Parallel then Merge
Finding Triangles By Partitioning [Suri ‘11] • Partition the nodes into b sets: $V_1, \dots, V_b$ • For every 3 sets $V_i, V_j, V_k$ create a reducer $V_{i,j,k}$ • Send an edge to $V_{i,j,k}$ iff both its ends are in $V_i \cup V_j \cup V_k$ • Detect triangles using a serial algorithm within each reducer
Example: b=4, V1={1}, V2={2}, V3={3}, V4={4} • Reducers shown: V1,2,3 • V1,3,4 • V2,3,4 • [Figure: the example graph on nodes 1-4 and the sub-graph received by each reducer]
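The partitioning recipe can be sketched as follows. This is an illustrative simulation, not the paper's implementation: `partition_triangles` and the `part` callback (node to partition index) are hypothetical names, and the edge list is the assumed example graph with edges 1-3, 2-3, 3-4, 2-4.

```python
from collections import defaultdict
from itertools import combinations

def partition_triangles(edges, b, part):
    """part(v) returns the partition index of node v, in range(b)."""
    # Map: send an edge to every triple of partitions that covers both ends.
    reducers = defaultdict(list)
    for u, v in edges:
        for triple in combinations(range(b), 3):
            if part(u) in triple and part(v) in triple:
                reducers[triple].append((u, v))
    # Reduce: run a simple serial triangle finder inside each reducer.
    found = []
    for sub_edges in reducers.values():
        adj = defaultdict(set)
        for u, v in sub_edges:
            adj[u].add(v)
            adj[v].add(u)
        local = {tuple(sorted((u, v, w)))
                 for u, v in sub_edges
                 for w in adj[u] & adj[v]}
        found.extend(sorted(local))
    return reducers, found

edges = [(1, 3), (2, 3), (3, 4), (2, 4)]
# One node per partition, as in the b=4 example.
reducers, found = partition_triangles(edges, 4, lambda v: v - 1)
print(found)
# All nodes in one partition: the same triangle shows up in several reducers.
_, repeated = partition_triangles(edges, 4, lambda v: 0)
print(repeated)
```

With one node per partition the triangle (2, 3, 4) is found once; when every node lands in the same partition it is found in each of the 3 reducers that include that partition, which is the duplication the deduplication trick has to remove.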
Analysis • Every triangle is detected: its 3 vertices are guaranteed to fall together in at least one reducer • Average # edges in each reducer is $O(M/b^2)$ • Use an optimal serial triangle finder at each reducer. The total amount of work at all reducers is: $O\left(b^3 (M/b^2)^{3/2}\right) = O(M^{3/2})$ • # of edges sent from the mappers to reducers (communication cost) is $O(bM)$
One Problem • Each triangle may be detected multiple times. If all three vertices are mapped to the same partition, it will be detected $\binom{b-1}{2}$ times, once in every reducer that includes that partition • This can be fixed with a similar ordering-of-nodes trick [Afrati ‘12] • Can be generalized to detect other small graph structures efficiently [Afrati ‘12]
Minimum Weight Spanning Tree • Partition the nodes into b sets • For every pair of sets create a reducer • Send all edges that have both their ends in one pair to the corresponding reducer • Compute the minimum spanning tree (forest) for the graph in each reducer. An edge left out there is the heaviest edge on some cycle, so it cannot be in the global MST: remove these edges to sparsify the graph • Compute the MST for the sparsified graph
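The steps above can be sketched with Kruskal's algorithm as the serial MST routine. This is a sketch under stated assumptions: edge weights are distinct, `part` is a hypothetical node-to-partition function, and the function names and toy graph are made up for illustration.

```python
from itertools import combinations

def spanning_forest(edges):
    """Kruskal's algorithm; edges are (weight, u, v) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def filtered_mst(edges, b, part):
    """One reducer per pair of partitions, then an MST of the surviving edges."""
    kept = set()
    for i, j in combinations(range(b), 2):
        # Edges with both ends inside partitions i and j go to this reducer.
        sub = [e for e in edges if {part(e[1]), part(e[2])} <= {i, j}]
        kept.update(spanning_forest(sub))
    return spanning_forest(kept)

# Toy weighted graph on nodes 0..5 (assumed, distinct weights).
edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (5, 3, 4),
         (6, 4, 5), (7, 3, 5), (8, 0, 5), (9, 1, 4)]
mst = filtered_mst(edges, 3, lambda v: v % 3)
print(mst)
```

In this toy run the sparsified graph keeps 8 of the 9 edges, and the final MST matches the one computed directly on the full graph.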
Personalized PageRank • Like the global PageRank, but the random walker jumps back to where it started with probability d • For every node v you will have a personalized PageRank vector of length N • We usually keep only a limited number of top personalized PageRanks for each node • It finds the influential nodes in the proximity of a given node
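One common way to write this down, given here as an assumption because the slide's own formula is not in the text: let $P$ be the row-stochastic transition matrix of the graph, $e_v$ the indicator (row) vector of node $v$, and $d$ the return probability. The personalized PageRank vector $\pi_v$ of node $v$ is then the stationary distribution of the walk that restarts at $v$:

```latex
\pi_v = d\, e_v + (1 - d)\, \pi_v P
```

Under the same notation, the global PageRank is obtained by replacing $e_v$ with the uniform vector $\frac{1}{N}\mathbf{1}$.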
Monte Carlo Approximation • Simulate many random walks from every single node. For each walk: • A walk starting from node v is identified by v • Keep track of <v, $U_{v,t}$>, where $U_{v,t}$ is the current end point at step t for the walk starting at node v • In each Map-Reduce step advance the walk by 1 step: pick a random neighbor of $U_{v,t}$ • Count the frequency of visits to each node
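A minimal in-memory sketch of the simulation, where each pass of the outer loop stands in for one Map-Reduce step. The function name, its parameters, the handling of nodes without neighbors, and the example graph are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def monte_carlo_ppr(adj, walks_per_node, steps, d, seed=0):
    """adj maps each node to a list of neighbors; d is the return probability."""
    rng = random.Random(seed)
    # One record <v, U_{v,t}> per walk; every walk starts at its own node v.
    walks = [(v, v) for v in adj for _ in range(walks_per_node)]
    visits = defaultdict(Counter)
    for _ in range(steps):  # one Map-Reduce step per iteration
        advanced = []
        for start, current in walks:
            if rng.random() < d or not adj[current]:
                nxt = start                      # jump back to the start node
            else:
                nxt = rng.choice(adj[current])   # move to a random neighbor
            visits[start][nxt] += 1
            advanced.append((start, nxt))
        walks = advanced
    total = walks_per_node * steps
    # Visit frequencies approximate the personalized PageRank vector of each node.
    return {v: {u: c / total for u, c in counts.items()}
            for v, counts in visits.items()}

adj = {1: [3], 2: [3, 4], 3: [1, 2, 4], 4: [2, 3]}
ppr = monte_carlo_ppr(adj, walks_per_node=50, steps=20, d=0.15)
print(sorted(ppr[1].items(), key=lambda kv: -kv[1])[:3])
```

The estimates are random, so only their structure is fixed: one vector per start node, each summing to 1.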
One can do better [Das Sarma ‘08] • This takes T steps for a walk of length T • We can cut it down to $T^{1/2}$ by a simple “stitching” idea • Do T/J random walks (segments) of length J from every node, for some J • To form a walk of length T, pick one of the T/J segments at random and jump to the end of the segment • Pick another random segment, etc. • If you arrive at a node twice, do not use the same segment (that’s why you need T/J segments) • Total iterations: J + T/J, minimized when $J = T^{1/2}$, giving $O(T^{1/2})$
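The choice of J follows from a one-line minimization, treating J as a continuous variable:

```latex
f(J) = J + \frac{T}{J}, \qquad
f'(J) = 1 - \frac{T}{J^2} = 0 \;\Longrightarrow\; J = \sqrt{T}, \qquad
f\left(\sqrt{T}\right) = 2\sqrt{T} = O\left(T^{1/2}\right)
```

For example, a walk of length T = 10,000 needs about 100 iterations to build the segments and 100 more to stitch them, instead of 10,000.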
Exponential speed up [Bahmani ‘11] • The stitching was done somewhat serially (at each step, one segment was stitched to another) • Idea: Stitch recursively, which will result in exponentially expanding the walk/segment ratio • It takes a few more tricks to make it work, but you can bring it down to O(log T)
Labeling Connected Components • Assign the same ID to all nodes inside the same component • [Figure: example graph on nodes 1-6]
How do we do it on one machine? • 1. i=1 • 2. Pick a random node you have not picked before, assign it id=i and put it in a stack • 3. Pop a node from the stack, push all its neighbors we have not seen before into the stack. Assign them id=i • 4. If the stack is not empty go to 3, otherwise i ← i+1 and go to 2 • Time and memory complexity O(M) • [Figure: example graph on nodes 1-6]
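The single-machine procedure translates almost line by line into Python. In this sketch the seed nodes are taken in dictionary order rather than at random, and the adjacency-list input and example graph are assumptions.

```python
def label_components(adj):
    """adj maps each node to a list of neighbors; returns node -> component id."""
    component = {}
    i = 0
    for seed in adj:
        if seed in component:
            continue                      # already reached from an earlier seed
        i += 1
        component[seed] = i
        stack = [seed]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in component:  # unseen neighbor: label and push it
                    component[nbr] = i
                    stack.append(nbr)
    return component

# Assumed example: a path 1-2-3, an edge 4-5, and an isolated node 6.
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(label_components(adj))
```

Each node is pushed once and each adjacency list is scanned once, so the running time is linear in the size of the graph.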
In Map-Reduce: More Parallelism • Instead of growing a frontier zone from a single seed, start growing it from all nodes. When two zones meet, merge them • Example: path graph 1-2-3-4 • Edge File: <v1,v2> <v2,v3> <v3,v4> • Zone File: <v1,z1> <v2,z2> <v3,z3> <v4,z4>
Game Plan (one round, shown on the path graph 1-2-3-4)
• Bin Zone and Edge by Node: v1: <v1,v2> <v1,z1> | v2: <v2,v1> <v2,v3> <v2,z2> | v3: <v3,v2> <v3,v4> <v3,z3> | v4: <v4,v3> <v4,z4>
• Bin edge to zone map: <[v1,v2],z1> | <[v1,v2],z2> <[v2,v3],z2> | <[v2,v3],z3> <[v3,v4],z3> | <[v3,v4],z4>
• Collect over edges, a zone to zone map: [v1,v2]: z1, z2 gives <z2,z1> | [v2,v3]: z2, z3 gives <z3,z2> | [v3,v4]: z3, z4 gives <z4,z3>
• Reconcile zones: <z2,z1> <z3,z2> <z4,z3>
• Reassign zones to nodes: <z2,v2> with <z2,z1> | <z3,v3> with <z3,z2> | <z4,v4> with <z4,z3>
• New Zone File: <v1,z1> <v2,z1> <v3,z2> <v4,z3>
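One round of this game plan can be sketched in memory as below. This is a simplified illustration, not the GraphEdge implementation: the function names are hypothetical, zones are represented by integer ids, and each zone is merged into the smallest zone it touches.

```python
from collections import defaultdict

def merge_round(edges, zone):
    """One round: zone maps node -> zone id; returns the new zone file."""
    # Bin edge to zone map: collect the zones seen at the two ends of each edge.
    by_edge = defaultdict(set)
    for u, v in edges:
        by_edge[(u, v)].update((zone[u], zone[v]))
    # Collect over edges and reconcile: each zone points to the smallest
    # zone it touches through some edge.
    target = {}
    for zones in by_edge.values():
        low = min(zones)
        for z in zones:
            target[z] = min(low, target.get(z, z))
    # Reassign zones to nodes.
    return {v: target.get(z, z) for v, z in zone.items()}

def label_components(nodes, edges):
    """Repeat rounds until the zone file stops changing."""
    zone = {v: v for v in nodes}  # every node starts in its own zone
    rounds = 0
    while True:
        new_zone = merge_round(edges, zone)
        if new_zone == zone:
            return zone, rounds
        zone, rounds = new_zone, rounds + 1

path = [(1, 2), (2, 3), (3, 4)]
print(merge_round(path, {1: 1, 2: 2, 3: 3, 4: 4}))
print(label_components([1, 2, 3, 4], path))
```

On the path graph the first round reproduces the new zone file shown on the slide (v2 joins z1, v3 joins z2, v4 joins z3), and the labels converge after 3 merging rounds, matching the diameter of the path.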
Analysis • Communication: O(M+N) • Number of rounds: O(d), where d is the diameter of the graph. Most real graphs have small diameters • Random graph: d=O(log N) • The worst case is a “path graph”, where d=N-1 • An algorithm with O(M+N) communication and O(log N) rounds exists for all graphs [Rastogi ‘12] • Uses an idea similar to MinHash
Intuit’s GraphEdge • A (hopefully soon to be open sourced) graph processing package for Hadoop built on Cascading • Efficient support of many core graph processing algorithms: • State of the art algorithms • Industry-grade test for scalability • Will take a few more months to release. • Would love to gauge your interest
Intuit’s Commercial Graph • Think of a graph in which a node is a business, or a consumer • An edge is a transaction between these entities • The entities are either direct clients of Intuit’s many offerings, or are business partners of Intuit’s clients • We experiment with a “toy” version of this graph: about 200M nodes and 10B edges.
References • Cohen, Jonathan. "Graph twiddling in a MapReduce world." Computing in Science & Engineering 11.4 (2009): 29-41. • Suri, Siddharth, and Sergei Vassilvitskii. "Counting triangles and the curse of the last reducer." Proceedings of the 20th international conference on World wide web. ACM, 2011. • Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." Proceedings of the 37th SIGMOD international conference on Management of data. 2011. • Das Sarma, A., S. Gollapudi, and R. Panigrahy. "Estimating PageRank on graph streams." In PODS, pages 69–78, 2008. • Afrati, Foto N., Dimitris Fotakis, and Jeffrey D. Ullman. "Enumerating Subgraph Instances Using Map-Reduce." http://arxiv.org/abs/1208.0615, 2012. • Lattanzi, Silvio, et al. "Filtering: a method for solving graph problems in mapreduce." 2011.