Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski Google, Inc. SIGMOD ’10 15 Mar 2013 Dong Chang
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Introduction • Many practical computing problems concern large graphs • Large graph data: web graphs, transportation routes, citation relationships, social networks • Graph algorithms: PageRank, shortest paths, connected components, clustering techniques • MapReduce is ill-suited for graph processing • Many iterations are needed for parallel graph processing • Materialization of intermediate results at every MapReduce iteration harms performance
MapReduce Execution • Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. • The input splits can be processed in parallel by different machines • Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a hash function: hash(key) mod R • R and the partitioning function are specified by the programmer.
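As a rough illustration of the reduce-side partitioning described above (a sketch, not Google's implementation; the function name is invented), in C++:

#include <cstdint>
#include <functional>
#include <string>

// Picks the reduce partition for an intermediate key: hash(key) mod R,
// where R is the number of reduce tasks chosen by the programmer.
uint32_t ReducePartition(const std::string& key, uint32_t R) {
  return static_cast<uint32_t>(std::hash<std::string>{}(key)) % R;
}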
Data Flow • Input and final output are stored on a distributed file system • The scheduler tries to schedule map tasks “close” to the physical storage location of their input data • Intermediate results are stored on the local file systems of the map and reduce workers • Output can serve as the input to another MapReduce job
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Computation Model (1/3) • Input → supersteps (a sequence of iterations) → Output
Computation Model (2/3) • “Think like a vertex” • Inspired by Valiant’s Bulk Synchronous Parallel model (1990) • Source: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Computation Model (3/3) • Superstep: the vertices compute in parallel • Each vertex • Receives messages sent in the previous superstep • Executes the same user-defined function • Modifies its value or that of its outgoing edges • Sends messages to other vertices (to be received in the next superstep) • Mutates the topology of the graph • Votes to halt if it has no further work to do • Termination condition • All vertices are simultaneously inactive • There are no messages in transit
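A minimal, self-contained sketch of these semantics: a toy single-process simulation (not the real distributed system) in which the graph, the values, and the update rule, here simply propagating the maximum value, are invented purely for illustration.

#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Toy single-process simulation of Pregel's superstep semantics.
struct ToyVertex {
  int value;
  bool active;
  std::vector<int> out_edges;  // destination vertex ids
};

int main() {
  // A tiny cycle 0 -> 1 -> 2 -> 0; each vertex tries to learn the maximum value.
  std::map<int, ToyVertex> graph;
  graph[0] = {3, true, {1}};
  graph[1] = {6, true, {2}};
  graph[2] = {2, true, {0}};

  std::map<int, std::vector<int>> inbox;  // messages visible in the current superstep
  int superstep = 0;

  while (true) {
    std::map<int, std::vector<int>> outbox;  // delivered at the next superstep
    bool anyone_ran = false;
    for (auto& [id, v] : graph) {
      auto it = inbox.find(id);
      if (!v.active && it == inbox.end()) continue;  // halted and no mail: skip
      anyone_ran = true;
      bool changed = (superstep == 0);               // everyone announces itself first
      if (it != inbox.end())
        for (int m : it->second)
          if (m > v.value) { v.value = m; changed = true; }
      if (changed)
        for (int dst : v.out_edges) outbox[dst].push_back(v.value);
      v.active = false;  // vote to halt; an arriving message reactivates the vertex
    }
    if (!anyone_ran) break;      // all vertices inactive and no messages in transit
    inbox = std::move(outbox);   // barrier: messages become visible next superstep
    ++superstep;
  }

  for (const auto& [id, v] : graph)
    std::printf("vertex %d converged to %d\n", id, v.value);
}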
Example: SSSP – Parallel BFS in Pregel • [Figure sequence: a small weighted graph animated over successive supersteps. The source vertex starts at distance 0 and all others at ∞; in each superstep, vertices take the minimum of the candidate distances they received, update their value, and send (distance + edge weight) to their neighbors, until no distances change and every vertex votes to halt. A code sketch follows.]
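The vertex program behind this example follows the shortest-path code presented in the Pregel paper; the sketch below keeps its method names (Compute, GetValue, MutableValue, SendMessageTo, VoteToHalt), but the Vertex base class, MessageIterator, INF, and IsSource are not defined here, so treat it as illustrative rather than compilable.

// Sketch of the SSSP vertex program, following the paper's C++ API.
class ShortestPathVertex : public Vertex<int, int, int> {
 public:
  void Compute(MessageIterator* msgs) override {
    // Non-source vertices start at "infinity"; the source starts at 0.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = std::min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      // Offer a shorter distance to every out-neighbor.
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();  // reactivated automatically if a shorter distance arrives later
  }
};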
Differences from MapReduce • Graph algorithms can be written as a series of chained MapReduce invocations • Pregel • Keeps vertices & edges on the machine that performs computation • Uses network transfers only for messages • MapReduce • Passes the entire state of the graph from one stage to the next • Needs to coordinate the steps of a chained MapReduce
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
C++ API • Writing a Pregel program means subclassing the predefined Vertex class and overriding its virtual Compute() method • Compute() consumes the messages sent to the vertex in the previous superstep (incoming messages) and can send messages to other vertices (outgoing messages) • A sketch of the Vertex base class follows
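The predefined Vertex class is presented in the paper roughly as follows; this is a simplified reconstruction (details abbreviated), shown only to indicate what a Pregel program subclasses and which method it overrides.

// Simplified shape of the Vertex base class, parameterized by the types of the
// vertex value, edge value, and message value.
template <typename VertexValue, typename EdgeValue, typename MessageValue>
class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;  // override this

  const string& vertex_id() const;
  int64 superstep() const;

  const VertexValue& GetValue();                    // read/modify the vertex value
  VertexValue* MutableValue();
  OutEdgeIterator GetOutEdgeIterator();             // iterate over outgoing edges

  void SendMessageTo(const string& dest_vertex,     // out-message, received next superstep
                     const MessageValue& message);
  void VoteToHalt();
};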
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
MapReduce Coordination • Master data structures • Task status: (idle, in-progress, completed) • Idle tasks get scheduled as workers become available • When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer • Master pushes this info to reducers • Master pings workers periodically to detect failures
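A toy sketch of the master's per-task bookkeeping described above (names and fields invented for illustration):

#include <cstddef>
#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

// Hypothetical master-side record for one map or reduce task.
struct TaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;                           // machine currently running the task
  std::vector<std::string> intermediate_files;  // R file locations reported by a map task
  std::vector<size_t> file_sizes;               // sizes the master pushes to the reducers
};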
MapReduce Failures • Map worker failure • Map tasks completed or in-progress at the worker are reset to idle • Reduce workers are notified when a task is rescheduled on another worker • Reduce worker failure • Only in-progress tasks are reset to idle • Master failure • The MapReduce task is aborted and the client is notified
System Architecture • The Pregel system also uses the master/worker model • Master • Coordinates the workers and assigns graph partitions to them • Recovers from worker failures • Provides a Web-UI monitoring tool for job progress • Worker • Processes its task • Communicates with the other workers • Persistent data is stored as files on a distributed storage system (such as GFS or BigTable) • Temporary data is stored on local disk
Execution of a Pregel Program • Many copies of the program begin executing on a cluster of machines • The master assigns a partition of the input to each worker • Each worker loads the vertices and marks them as active • The master instructs each worker to perform a superstep • Each worker loops through its active vertices & computes for each vertex • Messages are sent asynchronously, but are delivered before the end of the superstep • This step is repeated as long as any vertices are active, or any messages are in transit • After the computation halts, the master may instruct each worker to save its portion of the graph
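Schematically, the flow above can be written as follows; Master and Worker are assumed interfaces standing in for the real system, so this is an outline of the control flow rather than working code.

// Purely schematic outline of a Pregel job, with invented Master/Worker interfaces.
void RunJob(Master& master, std::vector<Worker*>& workers) {
  master.AssignGraphPartitions(workers);              // one or more partitions per worker
  for (Worker* w : workers) w->LoadVerticesAndMarkActive();

  // Supersteps continue while any vertex is active or any message is in transit.
  while (master.AnyVertexActive() || master.AnyMessageInTransit()) {
    for (Worker* w : workers) w->RunSuperstep();      // compute + send messages
    master.WaitForSuperstepBarrier();                 // all messages delivered before next step
  }

  for (Worker* w : workers) w->SaveGraphPartition();  // optionally write results out
}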
Fault Tolerance • Checkpointing • The master periodically instructs the workers to save the state of their partitions to persistent storage • e.g., Vertex values, edge values, incoming messages • Failure detection • Using regular “ping” messages • Recovery • The master reassigns graph partitions to the currently available workers • The workers all reload their partition state from most recent available checkpoint
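As a rough picture of what one partition's checkpoint might contain (all type and field names are invented; the slide only says that vertex values, edge values, and incoming messages are saved):

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-partition checkpoint record written to persistent storage.
struct PartitionCheckpoint {
  int64_t superstep;                              // superstep at which the state was saved
  std::map<std::string, double> vertex_values;    // vertex id -> current value
  std::map<std::string, double> edge_values;      // "src->dst" -> edge value
  std::vector<std::pair<std::string, double>> incoming_messages;  // (dest vertex, message)
};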
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Experiments • Environment • H/W: A cluster of 300 multicore commodity PCs • Data: binary trees, log-normal random graphs (general graphs) • Naïve SSSP implementation • The weight of all edges = 1 • No checkpointing
Experiments • SSSP – 1 billion vertex binary tree: varying # of worker tasks
Experiments • SSSP – binary trees: varying graph sizes on 800 worker tasks
Experiments • SSSP – Random graphs: varying graph sizes on 800 worker tasks
Outline • Introduction • Computation Model • Writing a Pregel Program • System Implementation • Experiments • Conclusion & Future Work
Conclusion & Future Work • Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms • Future work • Relaxing the synchronicity of the model • Not to wait for slower workers at inter-superstep barriers • Assigning vertices to machines to minimize inter-machine communication • Caring dense graphs in which most vertices send messages to most other vertices