630 likes | 778 Views
Graph Algorithms with MapReduce Chapter 5. Thanks to Jimmy Lin slides. Topics. Introduction to graph algorithms and graph representations Single Source Shortest Path (SSSP) problem Refresher: Dijkstra’s algorithm Breadth-First Search with MapReduce PageRank. What’s a graph?.
E N D
Graph Algorithms with MapReduce Chapter 5 Thanks to Jimmy Lin slides
Topics • Introduction to graph algorithms and graph representations • Single Source Shortest Path (SSSP) problem • Refresher: Dijkstra’s algorithm • Breadth-First Search with MapReduce • PageRank
What’s a graph? • G = (V,E), where • V represents the set of vertices (nodes) • E represents the set of edges (links) • Both vertices and edges may contain additional information • Different types of graphs: • Directed vs. undirected edges • Presence or absence of cycles • Graphs are everywhere: • Hyperlink structure of the Web • Physical structure of computers on the Internet • Interstate highway system • Social networks
Some Graph Problems • Finding shortest paths • Routing Internet traffic and UPS trucks • Finding minimum spanning trees • Telco laying down fiber • Finding Max Flow • Airline scheduling • Identify “special” nodes and communities • Breaking up terrorist cells, spread of avian flu • Bipartite matching • Monster.com, Match.com • And of course... PageRank
Graphs and MapReduce • Graph algorithms typically involve: • Performing computation at each node • Processing node-specific data, edge-specific data, and link structure • Traversing the graph in some manner • Key questions: • How do you represent graph data in MapReduce? • How do you traverse a graph in MapReduce?
Representing Graphs • G = (V, E) • A poor representation for computational purposes • Two common representations • Adjacency matrix • Adjacency list
Adjacency Matrices Represent a graph as an n x n square matrix M • n = |V| • Mij = 1 means a link from node i to j 2 1 3 4
Adjacency Matrices: Critique • Advantages: • Naturally encapsulates iteration over nodes • Rows and columns correspond to inlinks and outlinks • Disadvantages: • Lots of zeros for sparse matrices • Lots of wasted space
Adjacency Lists Take adjacency matrices… and throw away all the zeros 1: 2, 4 2: 1, 3, 4 3: 1 4: 1, 3
Adjacency Lists: Critique • Advantages: • Much more compact representation • Easy to compute over outlinks • Graph structure can be broken up and distributed • Disadvantages: • Much more difficult to compute over inlinks
Single Source Shortest Path • Problem: find shortest path from a source node to one or more target nodes • “Graph search algorithm that solves the single-source shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree” Wikipedia • First, a refresher: Dijkstra’s algorithm • Single machine
Dijkstra’s Algorithm Example 1 10 0 9 2 3 4 6 5 7 2 Example from CLR
Dijkstra’s Algorithm Example n3 n1 1 10 0 n0 9 2 3 4 6 5 7 n2 n4 2 Example from CLR
Dijkstra’s Algorithm Example 10 n3 n1 1 10 0 n0 9 2 3 4 6 5 7 5 n2 n4 2 Example from CLR
Dijkstra’s Algorithm Example 8 14 n3 n1 1 10 0 n0 9 2 3 4 6 5 7 5 7 n2 n4 2 Example from CLR
Dijkstra’s Algorithm Example 8 13 n3 n1 1 10 0 n0 9 2 3 4 6 5 7 5 7 n2 n4 2 Example from CLR
Dijkstra’s Algorithm Example 8 9 n3 n1 1 10 0 n0 9 2 3 4 6 5 7 5 7 n2 n4 2 Example from CLR
Dijkstra’s Algorithm Example 8 9 n3 n1 1 10 0 n0 9 2 3 4 6 5 7 5 7 n2 n4 2 Example from CLR
Single Source Shortest Path • Problem: find shortest path from a source node to one or more target nodes • Single processor machine: Dijkstra’s Algorithm • MapReduce: parallel Breadth-First Search (BFS) • How to do it? First simplify the problem!!
Finding the Shortest Path • First, consider equal edge weights • Solution to the problem can be defined inductively • Here’s the intuition: • DistanceTo(startNode) = 0 • For all nodes n directly reachable from startNode, DistanceTo(n) = 1 • For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m S)
Finding the Shortest Path • This strategy advances the “known frontier” by one hop • Subsequent iterations include more reachable nodes as frontier advances • Multiple iterations are needed to explore entire graph
Visualizing Parallel BFS 3 1 2 2 2 3 3 3 4 4
Termination • Does the algorithm ever terminate? • Eventually, all nodes will be discovered, all edges will be considered (in a connected graph) • When do we stop? • When distances at every node no longer change at next frontier
Next Step to Solving • Next – • No longer assume distance to each node is 1
Weighted Edges • Now add positive weights to the edges • Simple change: points-to list in map task includes a weight w for each pointed-to node • emit (p, D+wp) instead of (p, D+1) for each node p
Dijkstra’s Algorithm Example n3 n1 1 10 0 n0 9 2 3 4 6 5 7 n2 n4 2 Example from CLR
Multiple Iterations Needed • This MapReduce task advances the “known frontier” by one hop • Subsequent iterations include more reachable nodes as frontier advances • Multiple iterations are needed to explore entire graph • Each iteration a MapReduce task • Final output is input to next iteration - MapReduce task • Feed output back into the same MapReduce task
From Intuition to Algorithm • What info does the map task require? • A map task receives (k,v) • Key: • node n • Value: • D (distance from start) • points-to (adjacency list of nodes reachable from n) • What does the map task do? • Computes distances • Emit (p, D+wp) p points-to: Makes sure current distance is carried into the reducer • Emits graph structure of node n (n, struct) which contains the current shortest distance to node n
From Intuition to Algorithm • What info does the reduce task require? • The reduce task gathers possible distances to a given p • What does the reduce task do? • selects the minimum one
Algorithm • Assume adjacency list has information about edges and distances!!
class Mapper method MAP(nid n, node N) D ← N.Distance Emit(nid n, N) // Pass along graph structure for all nodeid m € N.AdjacencyList do Emit(nid m, d+w) // Emit distances to reachable nodes class Reducer method REDUCE (nid m, [d1, d2, ...]) dmin ← ∞ M ← Φ for all d € counts [d1, d2, ...] do if IsNode(d) then M ← d // Recover graph structure else if d < dmin then // Look for shorter distance dmin ← d if M.Distance > dmin // update shortest distance M.Distance ← dmin Increment counter for driver Emit(nid m, node M)
Map Algorithm • Line 2. N is an adjacency list and current distance (shortest) • Line 4. Emits (k,v) in k which is current node info , but only one of these for a node because assume each node assigned to one mapper • Line 6. Emits different type of (k,v) which only has distance to neighbor not adjacency list • Shuffles (k,v) with same k to same reducers
Reduce Algorithm • Line 2. Will have different types of (k,v) as input • Line 5. Determine what type of (k,v) if adjacency list • Line 6. If v is not adjacency list (Node structure) then it is a distance, find shortest • Only 1 IsNode as far as I can tell • Line 9. Determine if new shortest • Line 10. Update current shortest, increment a counter to determine if should stop
Shortest path – one more thing • Only finds shortest distances, not the shortest path • Is this true? • Do we have to use backpointers to find shortest path to retrace • NO -- • Emit paths along with distances, each node has shortest path accessible at all times • Most paths relatively short, uses little space
Weighted edgesFinds Minimum? • Discover node r • Discovered shortest D to p and shortest D to r goes through p • Maybe path through q to r that is shorter, but path lies outside current search frontier • Not true if D = 1 since shortest path cannot lie outside search frontier, since would be longer path • Have found shortest path within frontier • Will discover shortest path as frontier expands • With sufficient iterations, eventually discover shortest Distance
Dijkstra’s Algorithm Example n3 n1 1 10 0 n0 9 2 3 4 6 5 7 n2 n4 2 Example from CLR
Termination • Does this ever terminate? • Yes! Eventually, no better distances will be found. When distance is the same, we stop • Checking of termination must occur outside of MapReduce • Driver program submits MR job to iterate algorithm, see if termination condition met • Hadoop provides Counters (drivers) outside MapReduce • Drivers determine after reducers if done • In shortest path reducers count each change to min distance, passes count to driver
Iterations • How many iterations needed to compute shortest distance to all nodes? • Diameter of graph or greatest distance between any pair of nodes • Small for many real-world problems – 6 degrees of separation • For global social network – 6 MapReduce iterations
Fig. 5.6 needs how many iterations for n1-n6? • Worst case? • need (#nodes – 1)
Comparison to Dijkstra • Dijkstra’s algorithm is more efficient • At any step it only pursues edges from the minimum-cost path inside the frontier • MapReduce explores all paths in parallel • Brute force – wastes time • Divide and conquer • Except at search frontier, within frontier repeating same computations • Throw more hardware at the problem
General Approach • MapReduce is adept at manipulating graphs • Store graphs as adjacency lists • Graph algorithms with MapReduce: • Each map task receives a node and its outlinks • Map task compute some function of the link structure, emits value with target as the key • Reduce task collects keys (target nodes) and aggregates • Iterate multiple MapReduce cycles until some termination condition • Remember to “pass” graph structure from one iteration to next
Another example –Random Walks Over the Web • Model: • User starts at a random Web page • User randomly clicks on links, surfing from page to page (may also teleport to completely diff page • How frequently will a page be encountered during this surfing? • This is PageRank • Probability distribution over nodes in a graph representing likelihood random walk over a graph will arrive at a particular node
PageRank: Defined Given page n with in-bound links L(n), where • C(m) is the out-degree ofm • P(m) is the page rank of m • is probability of random jump • |G| is the total number of nodes in the graph m1 n mn … mn
Computing PageRank • Properties of PageRank • Can be computed iteratively • Effects at each iteration is local • Sketch of algorithm: • Start with seed (Pi ) values • Each page distributes (Pi ) “credit” to all pages it links to • Each target page adds up “credit” from multiple in-bound links to compute (Pi+1) • Iterate until values converge
Computing PageRank • What does map do? • What does reduce do?
PageRank MapReduce • Fig. 5.7 • Begins with 5 nodes splitting 1.0 -> 0.2 each • Each node must split their 0.2 to outgoing nodes (map) • Then add up all incoming values (reduce) • Each iteration is one MapReduce job
PageRank in MapReduce Map: distribute PageRank “credit” to link targets Reduce: gather up PageRank “credit” from multiple sources to compute new PageRank value Iterate until convergence ...