Map-Reduce and Its Children • Distributed File Systems • Map-Reduce and Hadoop • Dataflow Systems • Extensions for Recursion
Distributed File Systems • Chunking • Replication • Distribution on Racks
Distributed File Systems • Files are very large and are read or appended to, not updated in place. • They are divided into chunks. • Typically 64MB per chunk. • Chunks are replicated at several compute-nodes. • A master (possibly replicated) keeps track of all locations of all chunks.
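To make the replication policy concrete, here is a minimal Python sketch, assuming a toy cluster map from rack names to node names; the `place_replicas` helper and the choice of three copies are illustrative only, and production masters (e.g., in GFS or HDFS) use more elaborate placement policies.

```python
import random

def place_replicas(chunk_id, racks, copies=3):
    # Pick `copies` nodes for one chunk, each on a different rack,
    # so a single rack failure loses at most one copy.
    chosen = random.sample(list(racks), copies)
    return [(rack, random.choice(racks[rack])) for rack in chosen]

# Hypothetical cluster: rack name -> list of compute nodes.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("chunk-0", racks))   # e.g. [('rack2', 'n3'), ('rack1', 'n2'), ('rack3', 'n5')]
```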
Compute Nodes • Organized into racks. • Intra-rack connection typically gigabit speed. • Inter-rack connection faster by a small factor.
[Figure: racks of compute nodes holding file chunks, with 3-way replication of files and copies on different racks.]
Implementations • GFS (Google File System – proprietary). • HDFS (Hadoop Distributed File System – open source). • CloudStore (formerly Kosmos File System – open source).
Above the DFS • Map-Reduce • Key-Value Stores • SQL Implementations
The New Stack (top to bottom): • SQL implementations, e.g., PIG (relational algebra), Hive • Object store (key-value store), e.g., BigTable, HBase, Cassandra • Map-reduce, e.g., Hadoop • Distributed file system
Map-Reduce Systems • Map-reduce (Google) and its open-source (Apache) equivalent, Hadoop. • Important specialized parallel computing tool. • Copes with compute-node failures. • Avoids restart of the entire job.
Key-Value Stores • BigTable (Google), HBase, Cassandra (Apache), Dynamo (Amazon). • Each row is a key plus values over a flexible set of columns. • Each column component can be a set of values.
SQL-Like Systems • PIG – Yahoo! implementation of relational algebra. • Translates to a sequence of map-reduce operations, using Hadoop. • Hive – open-source (Apache) implementation of a restricted SQL, called QL, over Hadoop.
SQL-Like Systems – (2) • Sawzall – Google implementation of parallel select + aggregation. • Scope – Microsoft implementation of restricted SQL.
Map-Reduce • Formal Definition • Fault-Tolerance • Example: Join
Map-Reduce • You write two functions, Map and Reduce. • They each have a special form to be explained. • System (e.g., Hadoop) creates a large number of tasks for each function. • Work is divided among tasks in a precise way.
[Figure: the map-reduce pattern. Map tasks read input from the DFS and emit “key”-value pairs; Reduce tasks consume those pairs and write output to the DFS.]
Map-Reduce Algorithms • Map tasks convert inputs to key-value pairs. • “keys” are not necessarily unique. • Outputs of Map tasks are sorted by key, and each key is assigned to one Reduce task. • Reduce tasks combine values associated with a key.
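The following minimal Python sketch simulates this pattern on a single machine with the usual word-count example; the names `map_fn` and `reduce_fn` are my own, not part of any framework API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map: convert input to key-value pairs; keys need not be unique.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: combine all values associated with one key.
    yield (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    # Simulate the framework: run Map, sort by key, then run Reduce.
    pairs = sorted(kv for doc in inputs for kv in map_fn(doc))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reduce_fn(key, (v for _, v in group))

print(dict(map_reduce(["a b a", "b c"], map_fn, reduce_fn)))
# {'a': 2, 'b': 2, 'c': 1}
```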
Coping With Failures • Map-reduce is designed to deal with compute-nodes failing to execute a task. • Re-executes failed tasks, not whole jobs. • Failure modes: • Compute node failure (e.g., disk crash). • Rack communication failure. • Software failures, e.g., a task requires Java n; node has Java n-1.
Things Map-Reduce is Good At • Matrix-Matrix and Matrix-vector multiplication. • One step of the PageRank iteration was the original application. • Relational algebra operations. • We’ll do an example of the join. • Many other “embarrassingly parallel” operations.
Joining by Map-Reduce • Suppose we want to compute R(A,B) JOIN S(B,C), using k Reduce tasks. • I.e., find tuples with matching B-values. • R and S are each stored in a chunked file.
Joining by Map-Reduce – (2) • Use a hash function h from B-values to k buckets. • Bucket = Reduce task. • The Map tasks take chunks from R and S, and send: • Tuple R(a,b) to Reduce task h(b). • Key = b; value = R(a,b). • Tuple S(b,c) to Reduce task h(b). • Key = b; value = S(b,c).
[Figure: Joining by Map-Reduce – (3). Map tasks send R(a,b) to Reduce task i if h(b) = i, and S(b,c) to Reduce task i if h(b) = i; Reduce task i outputs all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.]
Joining by Map-Reduce – (4) • Key point: If R(a,b) joins with S(b,c), then both tuples are sent to Reduce task h(b). • Thus, their join (a,b,c) will be produced there and shipped to the output file.
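Putting the last three slides together, here is a single-machine sketch of the join; the `hash_join` function and Python's built-in `hash` stand in for the framework's routing, and the `('R', ...)`/`('S', ...)` tags mark which relation a tuple came from.

```python
def hash_join(R, S, k=4):
    # Sketch of R(A,B) JOIN S(B,C) with a hash function h from
    # B-values to k Reduce tasks (buckets).
    h = lambda b: hash(b) % k
    buckets = [[] for _ in range(k)]             # one list per Reduce task

    # Map phase: route each tuple to Reduce task h(b), keyed by b.
    for a, b in R:
        buckets[h(b)].append((b, ('R', a)))
    for b, c in S:
        buckets[h(b)].append((b, ('S', c)))

    # Reduce phase: within one task, join tuples sharing a B-value.
    out = []
    for bucket in buckets:
        r_side, s_side = {}, {}
        for b, (rel, x) in bucket:
            (r_side if rel == 'R' else s_side).setdefault(b, []).append(x)
        for b, a_list in r_side.items():
            for a in a_list:
                for c in s_side.get(b, []):
                    out.append((a, b, c))
    return out

R = [(1, 'x'), (2, 'y')]
S = [('x', 10), ('x', 11), ('z', 12)]
print(hash_join(R, S))   # [(1, 'x', 10), (1, 'x', 11)] (order may vary)
```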
Dataflow Systems • Arbitrary Acyclic Flow Among Tasks • Preserving Fault Tolerance • The Blocking Property
Generalization of Map-Reduce • Map-reduce uses only two functions (Map and Reduce). • Each is implemented by a rank of tasks. • Data flows from Map tasks to Reduce tasks only.
Generalization – (2) • Natural generalization is to allow any number of functions, connected in an acyclic network. • Each function implemented by tasks that feed tasks of successor function(s). • Key fault-tolerance (blocking) property: tasks produce all their output at the end.
Many Implementations • Clustera – University of Wisconsin. • Hyracks – Univ. of California/Irvine. • Dryad/DryadLINQ – Microsoft. • Nephele/PACT – T. U. Berlin. • BOOM – Berkeley. • epiC – N. U. Singapore.
Example: Join + Aggregation • Relations D(emp,dept) and S(emp,sal). • Compute the sum of the salaries for each department. • D JOIN S computed by map-reduce. • But each Reduce task can also group its emp-dept-sal tuples by dept and sum the salaries.
Example: Continued • A third function is needed to take the dept-SUM(sal) pairs from each Reduce task, organize them by dept, and compute the final sum for each department.
[Figure: 3-layer dataflow. Map tasks read D and S and hash by emp to Join + Group tasks; those tasks hash by dept to a Final Group + Aggregate layer.]
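A minimal sketch of this three-layer flow, assuming toy data and my own function names (`join_group_task`, `final_aggregate`); each second-layer task both joins on emp and pre-aggregates by dept, and the third layer merges the partial sums.

```python
from collections import defaultdict

def join_group_task(d_tuples, s_tuples):
    # Second layer: join D(emp, dept) with S(emp, sal) on emp,
    # then pre-aggregate salaries by dept within this task.
    dept_of = dict(d_tuples)                  # emp -> dept
    partial = defaultdict(int)
    for emp, sal in s_tuples:
        if emp in dept_of:
            partial[dept_of[emp]] += sal
    return partial

def final_aggregate(partials):
    # Third layer: combine per-task partial sums by dept.
    total = defaultdict(int)
    for partial in partials:
        for dept, subtotal in partial.items():
            total[dept] += subtotal
    return dict(total)

# Hypothetical data, already hashed by emp into two join tasks.
task1 = join_group_task([("ann", "hw")], [("ann", 100)])
task2 = join_group_task([("bob", "hw"), ("cal", "sw")],
                        [("bob", 120), ("cal", 90)])
print(final_aggregate([task1, task2]))   # {'hw': 220, 'sw': 90}
```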
Recursion • Transitive-Closure Example • Fault-Tolerance Problem • Endgame Problem • Some Systems and Approaches
Recent research ideas contributed by F. Afrati, V. Borkar, M. Carey, N. Polyzotis
Applications Requiring Recursion • PageRank, the original map-reduce application, is really a recursion implemented by many rounds of map-reduce. • Analysis of Web structure. • Analysis of social networks. • PDEs.
Transitive Closure • Many recursive applications involving large data are similar to transitive closure:
Nonlinear: Path(X,Y) :- Arc(X,Y)   Path(X,Y) :- Path(X,Z) & Path(Z,Y)
(Right) Linear: Path(X,Y) :- Arc(X,Y)   Path(X,Y) :- Arc(X,Z) & Path(Z,Y)
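As a single-machine illustration of the right-linear rules, here is a sketch that extends Path by one Arc per round until a fixpoint; `linear_tc` is an illustrative name, and the round counter shows why a long chain needs a number of rounds proportional to its length.

```python
def linear_tc(arcs):
    # Right-linear TC: Path = Arc, then repeatedly add Arc ∘ Path.
    # Each round extends paths by one arc on the left, so an
    # n-node chain needs on the order of n rounds.
    path = set(arcs)
    rounds = 0
    while True:
        new = {(x, y) for (x, z) in arcs for (w, y) in path if w == z}
        new -= path
        if not new:
            return path, rounds
        path |= new
        rounds += 1

paths, rounds = linear_tc({(1, 2), (2, 3), (3, 4)})
print(len(paths), rounds)   # 6 path facts after 2 extension rounds
```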
Implementing TC on a Cluster • Use k tasks. • Hash function h sends each node of the graph to one of the k tasks. • Task i receives and stores Path(a,b) if either h(a) = i or h(b) = i, or both. • Task i must join Path(a,c) with Path(c,b) if h(c) = i.
TC on a Cluster – Basis • Data is stored as relation Arc(a,b). • Map tasks read chunks of the Arc relation and send each tuple Arc(a,b) to recursive tasks h(a) and h(b). • Treated as if it were tuple Path(a,b). • If h(a) = h(b), only one task receives.
[Figure: TC on a Cluster – Recursive Tasks. When Task i receives Path(a,b), it stores the fact if new and otherwise ignores it; it looks up Path(b,c) and/or Path(d,a) for any c and d, then sends each resulting Path(a,c) to tasks h(a) and h(c) and each Path(d,b) to tasks h(d) and h(b).]
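Here is a single-process simulation of these recursive tasks; the queue stands in for inter-task messages, and `cluster_tc` and the use of Python's `hash` are my own choices. Duplicate Path facts are simply ignored, which is the idempotence discussed below.

```python
from collections import deque

def cluster_tc(arcs, k=3):
    # Hash-partitioned TC: task i stores Path(a,b) when h(a) = i or
    # h(b) = i; the join of Path(a,c) with Path(c,b) happens at task
    # h(c).  A single queue simulates the messages between tasks.
    h = lambda node: hash(node) % k
    store = [set() for _ in range(k)]    # Path facts held by each task
    queue = deque((i, arc) for arc in arcs for i in {h(arc[0]), h(arc[1])})
    while queue:
        i, (a, b) = queue.popleft()
        if (a, b) in store[i]:
            continue                     # duplicate fact: ignore (idempotent)
        store[i].add((a, b))
        new = set()
        if h(b) == i:                    # task i owns b: extend to the right
            new |= {(a, c) for (x, c) in store[i] if x == b}
        if h(a) == i:                    # task i owns a: extend to the left
            new |= {(d, b) for (d, y) in store[i] if y == a}
        for path in new:                 # ship each new fact to both owners
            for j in {h(path[0]), h(path[1])}:
                queue.append((j, path))
    return set().union(*store)

print(sorted(cluster_tc({(1, 2), (2, 3), (3, 4)})))
# all 6 reachable pairs: [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```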
Big Problem: Managing Failure • Map-reduce depends on the blocking property. • Only then can you restart a failed task without restarting the whole job. • But any recursive task has to deliver some output and later get more input.
HaLoop (U. Washington) • Iterates Hadoop, once for each round of the recursion. • Like iterative PageRank. • Similar idea: Twister (U. Indiana). • The clever piece is the way HaLoop tries to run each task in round i at a compute node where it can find the output it needs from round i – 1.
Pregel (Google) • Views all computation as a recursion on some graph. • Nodes send messages to one another. • Messages bunched into supersteps. • Checkpoint all compute nodes after some fixed number of supersteps. • On failure, rolls all tasks back to the previous checkpoint.
[Figure: Example – Shortest Paths Via Pregel. Node N keeps a table of shortest paths to N. On receiving “I found a path from node M to you of length L,” N checks “Is this the shortest path from M I know about?” and, if so, sends “I found a path from node M to you of length L+5” (likewise L+6 and L+3) along its outgoing edges of weights 5, 6, and 3.]
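A minimal superstep-style sketch of this example, assuming a toy weighted graph; `pregel_sssp` is not Pregel's real API, just an illustration of nodes keeping their best-known distance and messaging neighbors only on improvement.

```python
INF = float("inf")

def pregel_sssp(edges, source):
    # Superstep-style single-source shortest paths: each node keeps
    # its best-known distance from `source` and messages its
    # neighbors only when that distance improves.
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    out = {u: [] for u in nodes}
    for u, v, w in edges:
        out[u].append((v, w))
    dist = {u: 0 if u == source else INF for u in nodes}
    active = {source}
    while active:                        # one loop iteration = one superstep
        msgs = {}
        for u in active:                 # send dist(u) + edge weight onward
            for v, w in out[u]:
                msgs[v] = min(msgs.get(v, INF), dist[u] + w)
        # "Is this the shortest path from the source I know about?"
        active = {v for v, d in msgs.items() if d < dist[v]}
        for v in active:
            dist[v] = msgs[v]
    return dist

edges = [("M", "a", 5), ("M", "b", 6), ("a", "b", 3)]
print(sorted(pregel_sssp(edges, "M").items()))   # [('M', 0), ('a', 5), ('b', 6)]
```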
Using Idempotence • Some recursive applications allow restart of tasks even if they have produced some output. • Example: TC is idempotent; you can send a task a duplicate Path fact without altering the result. • But if you were counting paths, the answer would be wrong.
Big Problem: The Endgame • Some recursions, like TC, take a large number of rounds, but the number of new discoveries in later rounds drops. • T. Vassilakis (Google): searches forward on the Web graph can take hundreds of rounds. • Problem: in a cluster, transmitting small files carries much overhead.
Approach: Merge Tasks • Decide when to migrate tasks to fewer compute nodes. • Data for several tasks at the same node are combined into a single file and distributed at the receiving end. • Downside: old tasks have a lot of state to move. • Example: “paths seen so far.”
Approach: Modify Algorithms • Nonlinear recursions can terminate in many fewer steps than equivalent linear recursions. • Example: TC. • O(n) rounds on n-node graph for linear. • O(log n) rounds for nonlinear.
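A sketch contrasting with the linear version shown earlier: joining Path with Path doubles the covered path length each round, so the same kind of chain closes in logarithmically many rounds; `nonlinear_tc` is again an illustrative name.

```python
def nonlinear_tc(arcs):
    # Nonlinear TC: join Path with Path each round, so covered
    # path lengths roughly double and a chain of n nodes closes
    # in O(log n) rounds instead of O(n).
    path = set(arcs)
    rounds = 0
    while True:
        new = {(x, y) for (x, z) in path for (w, y) in path if w == z}
        new -= path
        if not new:
            return path, rounds
        path |= new
        rounds += 1

chain = {(i, i + 1) for i in range(8)}        # a 9-node chain
paths, rounds = nonlinear_tc(chain)
print(len(paths), rounds)   # 36 facts in 3 rounds (linear needs 7)
```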
Advantage of Linear TC • The data-volume cost (= sum of input sizes of all tasks) for executing linear TC is generally lower than that for nonlinear TC. • Why? Each path is discovered only once. • Note: distinct paths between the same endpoints may each be discovered separately.
Smart TC • (Valduriez-Boral, Ioannidis) Construct a path from two paths: • The first has a length that is a power of 2. • The second is no longer than the first.
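A hedged sketch of the Smart TC idea over pair relations: Q holds pairs whose shortest connecting path has length exactly 2^i, P holds pairs with shorter paths, and each round doubles the prefix length; `smart_tc` and the set representation are my own simplifications of the Valduriez-Boral/Ioannidis algorithms.

```python
def smart_tc(arcs):
    # P: pairs with a path of length < 2^i; Q: pairs whose shortest
    # path has length exactly 2^i.  A path is assembled from a
    # power-of-2 prefix and a suffix that is no longer than it.
    compose = lambda A, B: {(x, y) for (x, z) in A for (w, y) in B if w == z}
    Q, P = set(arcs), set()
    while Q:
        P = P | Q | compose(Q, P)    # all pairs with paths < 2^(i+1)
        Q = compose(Q, Q) - P        # shortest length exactly 2^(i+1)
    return P

chain = {(i, i + 1) for i in range(8)}
print(len(smart_tc(chain)))   # 36, reached in ~log2(8) doubling rounds
```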
Other Nonlinear TC Algorithms • You can have the unique-decomposition property with many variants of nonlinear TC. • Example: Balance constructs paths from two equal-length paths. • Favor the first path when the length is odd.