Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski Google, Inc. SIGMOD ’10 15 Mar 2013 Dong Chang
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Introduction • Many practical computing problems concern large graphs • Large graph data: web graphs, transportation routes, citation relationships, social networks • Graph algorithms: PageRank, shortest paths, connected components, clustering techniques • MapReduce is ill-suited for graph processing • Many iterations are needed for parallel graph processing • Materialization of intermediate results at every MapReduce iteration harms performance
MapReduce Execution • Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. • The input splits can be processed in parallel by different machines • Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a hash function: hash(key) mod R • R and the partitioning function are specified by the programmer.
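As a rough illustration of the reduce-side partitioning described above (a sketch, not Google's implementation; the function name is invented), in C++:

#include <cstdint>
#include <functional>
#include <string>

// Picks the reduce partition for an intermediate key: hash(key) mod R,
// where R is the number of reduce tasks chosen by the programmer.
uint32_t ReducePartition(const std::string& key, uint32_t R) {
  return static_cast<uint32_t>(std::hash<std::string>{}(key)) % R;
}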
Data Flow • Input and final output are stored on a distributed file system • The scheduler tries to schedule map tasks “close” to the physical storage location of their input data • Intermediate results are stored on the local file systems of the map and reduce workers • Output can serve as the input to another MapReduce job
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Computation Model (1/3) • Input → supersteps (a sequence of iterations) → Output
Computation Model (2/3) • “Think like a vertex” • Inspired by Valiant’s Bulk Synchronous Parallel model (1990) • Source: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Computation Model (3/3) • Superstep: the vertices compute in parallel • Each vertex • Receives messages sent in the previous superstep • Executes the same user-defined function • Modifies its value or that of its outgoing edges • Sends messages to other vertices (to be received in the next superstep) • Mutates the topology of the graph • Votes to halt if it has no further work to do • Termination condition • All vertices are simultaneously inactive • There are no messages in transit
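A minimal, self-contained sketch of these semantics: a toy single-process simulation (not the real distributed system) in which the graph, the values, and the update rule, here simply propagating the maximum value, are invented purely for illustration.

#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Toy single-process simulation of Pregel's superstep semantics.
struct ToyVertex {
  int value;
  bool active;
  std::vector<int> out_edges;  // destination vertex ids
};

int main() {
  // A tiny cycle 0 -> 1 -> 2 -> 0; each vertex tries to learn the maximum value.
  std::map<int, ToyVertex> graph;
  graph[0] = {3, true, {1}};
  graph[1] = {6, true, {2}};
  graph[2] = {2, true, {0}};

  std::map<int, std::vector<int>> inbox;  // messages visible in the current superstep
  int superstep = 0;

  while (true) {
    std::map<int, std::vector<int>> outbox;  // delivered at the next superstep
    bool anyone_ran = false;
    for (auto& [id, v] : graph) {
      auto it = inbox.find(id);
      if (!v.active && it == inbox.end()) continue;  // halted and no mail: skip
      anyone_ran = true;
      bool changed = (superstep == 0);               // everyone announces itself first
      if (it != inbox.end())
        for (int m : it->second)
          if (m > v.value) { v.value = m; changed = true; }
      if (changed)
        for (int dst : v.out_edges) outbox[dst].push_back(v.value);
      v.active = false;  // vote to halt; an arriving message reactivates the vertex
    }
    if (!anyone_ran) break;      // all vertices inactive and no messages in transit
    inbox = std::move(outbox);   // barrier: messages become visible next superstep
    ++superstep;
  }

  for (const auto& [id, v] : graph)
    std::printf("vertex %d converged to %d\n", id, v.value);
}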
Example: SSSP – Parallel BFS in Pregel • [Figure sequence: a small weighted graph animated over successive supersteps. The source vertex starts at distance 0 and all others at ∞; in each superstep, vertices take the minimum of the candidate distances they received, update their value, and send (distance + edge weight) to their neighbors, until no distances change and every vertex votes to halt. A code sketch follows.]
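The vertex program behind this example follows the shortest-path code presented in the Pregel paper; the sketch below keeps its method names (Compute, GetValue, MutableValue, SendMessageTo, VoteToHalt), but the Vertex base class, MessageIterator, INF, and IsSource are not defined here, so treat it as illustrative rather than compilable.

// Sketch of the SSSP vertex program, following the paper's C++ API.
class ShortestPathVertex : public Vertex<int, int, int> {
 public:
  void Compute(MessageIterator* msgs) override {
    // Non-source vertices start at "infinity"; the source starts at 0.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = std::min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      // Offer a shorter distance to every out-neighbor.
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();  // reactivated automatically if a shorter distance arrives later
  }
};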
Differences from MapReduce • Graph algorithms can be written as a series of chained MapReduce invocations • Pregel • Keeps vertices & edges on the machine that performs computation • Uses network transfers only for messages • MapReduce • Passes the entire state of the graph from one stage to the next • Needs to coordinate the steps of a chained MapReduce
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
C++ API • Writing a Pregel program means subclassing the predefined Vertex class and overriding its virtual Compute() method • Compute() consumes the messages sent to the vertex in the previous superstep (incoming messages) and can send messages to other vertices (outgoing messages) • A sketch of the Vertex base class follows
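The predefined Vertex class is presented in the paper roughly as follows; this is a simplified reconstruction (details abbreviated), shown only to indicate what a Pregel program subclasses and which method it overrides.

// Simplified shape of the Vertex base class, parameterized by the types of the
// vertex value, edge value, and message value.
template <typename VertexValue, typename EdgeValue, typename MessageValue>
class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;  // override this

  const string& vertex_id() const;
  int64 superstep() const;

  const VertexValue& GetValue();                    // read/modify the vertex value
  VertexValue* MutableValue();
  OutEdgeIterator GetOutEdgeIterator();             // iterate over outgoing edges

  void SendMessageTo(const string& dest_vertex,     // out-message, received next superstep
                     const MessageValue& message);
  void VoteToHalt();
};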
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
MapReduce Coordination • Master data structures • Task status: (idle, in-progress, completed) • Idle tasks get scheduled as workers become available • When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer • Master pushes this info to reducers • Master pings workers periodically to detect failures
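A toy sketch of the master's per-task bookkeeping described above (names and fields invented for illustration):

#include <cstddef>
#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

// Hypothetical master-side record for one map or reduce task.
struct TaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;                           // machine currently running the task
  std::vector<std::string> intermediate_files;  // R file locations reported by a map task
  std::vector<size_t> file_sizes;               // sizes the master pushes to the reducers
};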
MapReduce Failures • Map worker failure • Map tasks completed or in-progress at the worker are reset to idle • Reduce workers are notified when a task is rescheduled on another worker • Reduce worker failure • Only in-progress tasks are reset to idle • Master failure • The MapReduce task is aborted and the client is notified
System Architecture • The Pregel system also uses the master/worker model • Master • Coordinates the workers and assigns graph partitions to them • Recovers from worker failures • Provides a Web-UI monitoring tool for job progress • Worker • Processes its task • Communicates with the other workers • Persistent data is stored as files on a distributed storage system (such as GFS or BigTable) • Temporary data is stored on local disk
Execution of a Pregel Program • Many copies of the program begin executing on a cluster of machines • The master assigns a partition of the input to each worker • Each worker loads the vertices and marks them as active • The master instructs each worker to perform a superstep • Each worker loops through its active vertices & computes for each vertex • Messages are sent asynchronously, but are delivered before the end of the superstep • This step is repeated as long as any vertices are active, or any messages are in transit • After the computation halts, the master may instruct each worker to save its portion of the graph
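Schematically, the flow above can be written as follows; Master and Worker are assumed interfaces standing in for the real system, so this is an outline of the control flow rather than working code.

// Purely schematic outline of a Pregel job, with invented Master/Worker interfaces.
void RunJob(Master& master, std::vector<Worker*>& workers) {
  master.AssignGraphPartitions(workers);              // one or more partitions per worker
  for (Worker* w : workers) w->LoadVerticesAndMarkActive();

  // Supersteps continue while any vertex is active or any message is in transit.
  while (master.AnyVertexActive() || master.AnyMessageInTransit()) {
    for (Worker* w : workers) w->RunSuperstep();      // compute + send messages
    master.WaitForSuperstepBarrier();                 // all messages delivered before next step
  }

  for (Worker* w : workers) w->SaveGraphPartition();  // optionally write results out
}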
Fault Tolerance • Checkpointing • The master periodically instructs the workers to save the state of their partitions to persistent storage • e.g., Vertex values, edge values, incoming messages • Failure detection • Using regular “ping” messages • Recovery • The master reassigns graph partitions to the currently available workers • The workers all reload their partition state from most recent available checkpoint
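As a rough picture of what one partition's checkpoint might contain (all type and field names are invented; the slide only says that vertex values, edge values, and incoming messages are saved):

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-partition checkpoint record written to persistent storage.
struct PartitionCheckpoint {
  int64_t superstep;                              // superstep at which the state was saved
  std::map<std::string, double> vertex_values;    // vertex id -> current value
  std::map<std::string, double> edge_values;      // "src->dst" -> edge value
  std::vector<std::pair<std::string, double>> incoming_messages;  // (dest vertex, message)
};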
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Experiments • Environment • H/W: A cluster of 300 multicore commodity PCs • Data: binary trees, log-normal random graphs (general graphs) • Naïve SSSP implementation • The weight of all edges = 1 • No checkpointing
Experiments • SSSP – 1 billion vertex binary tree: varying # of worker tasks
Experiments • SSSP – binary trees: varying graph sizes on 800 worker tasks
Experiments • SSSP – Random graphs: varying graph sizes on 800 worker tasks
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Conclusion & Future Work • Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms • Future work • Relaxing the synchronicity of the model • Not to wait for slower workers at inter-superstep barriers • Assigning vertices to machines to minimize inter-machine communication • Caring dense graphs in which most vertices send messages to most other vertices