Cluster Computing and Datalog Recursion Via Map-Reduce Seminaïve Evaluation Re-engineering Map-Reduce for Recursion
Acknowledgements • Joint work with Foto Afrati • Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.
Implementing Datalog via Map-Reduce • Joins are straightforward to implement as a round of map-reduce. • Likewise, union/duplicate-elimination is a round of map-reduce. • But implementing a recursion this way can take many rounds of map-reduce.
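As an illustration of the first point, here is a minimal single-process sketch (names are illustrative) of how one join rule, q(X,Z) :- r(X,Y) & s(Y,Z), becomes a map phase keyed on the join attribute followed by a reduce phase:

```python
from collections import defaultdict

def map_reduce_join(r, s):
    """One map-reduce round computing q(X,Z) :- r(X,Y) & s(Y,Z).

    Map: key each tuple by its join attribute Y, tagged by source relation.
    Reduce: for each Y, pair every r-tuple with every s-tuple.
    """
    groups = defaultdict(lambda: ([], []))   # Y -> (X's from r, Z's from s)
    for x, y in r:
        groups[y][0].append(x)
    for y, z in s:
        groups[y][1].append(z)
    # Reduce: emit the join, eliminating duplicates (set semantics).
    return {(x, z) for xs, zs in groups.values() for x in xs for z in zs}
```

A recursion would have to repeat such rounds until no new tuples appear, which is the communication cost the rest of the talk tries to reduce.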
Seminaïve Evaluation • Specific combination of joins and unions. • Example: chain rule q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) • Let r, s, t = “old” relations; r’, s’, t’ = incremental relations. • Simplification: assume |r’| = a|r|, etc.
A 3-Way Join Using Map-Reduce q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) • Use k compute nodes. • Give X and Y shares to determine the reduce-task that gets each tuple. • Optimum strategy replicates r and t, not s, using communication |s| + 2√(k|r||t|).
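A sketch of the share computation, assuming the standard Lagrangean analysis of the chain join: r(W,X) lacks Y and must be replicated to the y reduce-tasks sharing Y, t(Y,Z) lacks X and goes to the x tasks sharing X, while s(X,Y) is sent exactly once. Function and variable names are illustrative, and integrality of the shares is ignored:

```python
from math import sqrt

def chain_join_shares(k, r_size, s_size, t_size):
    """Optimal shares for q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) on k reducers.

    Attribute X gets share x and Y gets share y, with x * y = k.
    Minimizing |r|*y + |t|*x subject to x*y = k (Lagrange multipliers)
    gives the shares below and total communication |s| + 2*sqrt(k|r||t|).
    """
    x = sqrt(k * r_size / t_size)
    y = sqrt(k * t_size / r_size)
    cost = s_size + r_size * y + t_size * x   # = |s| + 2*sqrt(k*|r|*|t|)
    return x, y, cost
```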
Seminaïve Evaluation – (2) • Need to compute the sum (union) of seven terms (joins): rst’ + rs’t + r’st + rs’t’ + r’st’ + r’s’t + r’s’t’ • Obvious method for computing a round of seminaïve evaluation: • Replicate r and r’; replicate t and t’; do not replicate s or s’. • Communication = (1+a)(|s| + 2√(k|r||t|))
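The seven-term union can be sketched over Python sets as follows; `join3` is an illustrative naive evaluator of the chain rule, standing in for the map-reduce joins:

```python
def join3(r, s, t):
    """Naive evaluation of q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z) over sets."""
    return {(w, z)
            for (w, x) in r
            for (x2, y) in s if x2 == x
            for (y2, z) in t if y2 == y}

def seminaive_delta(r, dr, s, ds, t, dt):
    """Union of the seven terms that use at least one incremental relation.

    Together with join3(r, s, t), these cover join3(r|dr, s|ds, t|dt),
    since (r+r')(s+s')(t+t') expands into eight terms of which rst is old.
    """
    return (join3(r, s, dt) | join3(r, ds, t) | join3(dr, s, t) |
            join3(r, ds, dt) | join3(dr, s, dt) | join3(dr, ds, t) |
            join3(dr, ds, dt))
```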
Seminaïve Evaluation – (3) • There are many other ways we might use k nodes to do the same task. • Example: one group of nodes does (r+r’)s’(t+t’); a second group does r’s(t+t’); the third group does rst’. • Theorem: no grouping does better than the obvious method for this example.
Networks of Processes for Recursions • Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost? • Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.
Example: Very Simple Recursion p(X,Y) :- e(X,Z) & p(Z,Y); p(X,Y) :- p0(X,Y); • Use k compute nodes. • Hash Y-values to one of k buckets h(Y). • Each node gets a complete copy of e. • p0 is distributed among the k nodes, with p0(x,y) going to node h(y).
Example – Continued p(X,Y) :- e(X,Z) & p(Z,Y) • Each node applies the recursive rule and generates new tuples p(x,y). • Key point: since new tuples have a Y-value that hashes to the same node, no communication is necessary. • Duplicates are eliminated locally.
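A single-process simulation of this scheme (names are illustrative): each node holds all of e plus the p-tuples whose Y-value hashes to it, and iterates to a fixpoint with no inter-node traffic:

```python
def right_linear_tc(e, p0, k, h=lambda v, k: hash(v) % k):
    """p(X,Y) :- p0(X,Y);  p(X,Y) :- e(X,Z) & p(Z,Y).

    Every node holds a full copy of e; node n holds the p-tuples with
    h(Y) = n.  The rule preserves Y, so every new tuple p(x,y) belongs
    to the node that produced it: no communication is needed.
    """
    # Partition p0 by h(Y); e is (conceptually) replicated to every node.
    nodes = [{t for t in p0 if h(t[1], k) == n} for n in range(k)]
    for n in range(k):                    # each node iterates independently
        p = nodes[n]
        while True:
            new = {(x, y) for (x, z) in e for (z2, y) in p if z == z2} - p
            if not new:
                break
            p |= new                      # duplicates eliminated locally
        nodes[n] = p
    return set().union(*nodes)            # only to inspect the global result
```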
Harder Case of Recursion • Consider a recursive rule p(X,Y) :- p(X,Z) & p(Z,Y) • Responsibility divided among compute nodes by hashing Z-values. • Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.
[Diagram: the compute node for h(Z) = n holds p(a,b) if h(a) = n or h(b) = n, remembers all received tuples (eliminating duplicates), searches them for matches, and sends each produced tuple p(c,d) to the nodes for h(c) and h(d).]
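A message-passing simulation of this network of tasks, as a sketch: the deque stands in for inter-node communication, and h is an arbitrary hash. Node n stores a tuple only once, so duplicate deliveries are simply dropped:

```python
from collections import deque

def nonlinear_tc(p0, k, h=lambda v, k: hash(v) % k):
    """p(X,Y) :- p0(X,Y);  p(X,Y) :- p(X,Z) & p(Z,Y), hashed on Z.

    Node n stores p(a,b) if h(a) = n or h(b) = n.  A pair p(a,b), p(b,d)
    meets at node h(b); whichever tuple arrives second triggers the join.
    Each new tuple p(c,d) is shipped to the nodes for h(c) and h(d).
    """
    stored = [set() for _ in range(k)]            # per-node state
    queue = deque((t, n) for t in p0 for n in {h(t[0], k), h(t[1], k)})
    while queue:
        (a, b), n = queue.popleft()
        if (a, b) in stored[n]:
            continue                              # duplicate: drop it
        stored[n].add((a, b))
        # Join as left operand (b plays the role of Z) if h(b) = n ...
        new = {(a, d) for (b2, d) in stored[n] if b2 == b and h(b, k) == n}
        # ... and as right operand (a plays the role of Z) if h(a) = n.
        new |= {(c, b) for (c, a2) in stored[n] if a2 == a and h(a, k) == n}
        for t in new:
            for m in {h(t[0], k), h(t[1], k)}:
                queue.append((t, m))              # ship to responsible nodes
    return set().union(*stored)
```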
Comparison with Iteration • Advantage: Lets us avoid some communication of data that would be needed in iterated map-reduce rounds. • Disadvantage: Tasks run longer and are therefore more likely to fail.
Node Failures • To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes. • But recursions can’t work that way. • What happens if a node fails after some of its output has been consumed?
Node Failures – (2) • Actually, there is no problem! • We restart the tasks of the failed node at another node. • The replacement task will send some data that the failed task also sent. • But each node remembers tuples to eliminate duplicates anyway.
Node Failures – (3) • But the “no problem” conclusion depends heavily on the Datalog assumption that we are computing sets. • The argument would fail if we were computing bags or aggregations of the tuples produced. • Similar problems arise for other recursions, e.g., PDEs.
Extension of Map-Reduce Architecture for Recursion • Necessarily, all tasks need to operate in rounds. • The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.
Extension – (2) • Suppose some task S fails, and it never supplies the round-(i+1) input to T. • A replacement S’ for S is restarted at some other node. • The master knows that T has received up to round i from S, so it ignores the first i output files from S’.
Extension – (3) • Master knows where all the inputs ever received by S are from, so it can provide those to S’.
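A sketch of the master’s per-pair bookkeeping, with illustrative names and API: it records the last round each consumer has received from each producer, and suppresses the replayed rounds of a restarted producer.

```python
class Master:
    """Sketch of the master controller's recovery bookkeeping.

    For each (producer S, consumer T) pair, remember the highest round
    of S's output that T has already received.  When S fails and is
    replayed as S', forward to T only output files beyond that round.
    """
    def __init__(self):
        self.received = {}               # (producer, consumer) -> last round

    def record_delivery(self, producer, consumer, round_no):
        self.received[(producer, consumer)] = round_no

    def should_forward(self, producer, consumer, round_no):
        # Ignore replayed output files up to and including the last round.
        return round_no > self.received.get((producer, consumer), 0)
```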
Checkpointing and State • Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere. • Tasks take input + state. • Initially, state is empty. • Master can restart a task from some state and feed it only inputs received after that state was written.
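A minimal sketch of such a task, assuming (as in the Datalog setting) that its state is just the set of tuples received so far; class and method names are illustrative:

```python
import copy

class CheckpointedTask:
    """A task that can checkpoint its state and be restarted from it.

    process() folds one batch of input into the state and returns the
    truly new tuples; checkpoint() returns a snapshot for the master to
    replicate.  A replacement task is built from the latest snapshot and
    fed only the inputs received after that snapshot was taken.
    """
    def __init__(self, state=None):
        self.state = set(state) if state else set()   # tuples seen so far

    def process(self, input_tuples):
        new = set(input_tuples) - self.state          # drop duplicates
        self.state |= new
        return new                                    # this round's output

    def checkpoint(self):
        return copy.deepcopy(self.state)              # snapshot to replicate
```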
Example: Checkpointing p(X,Y) :- p(X,Z) & p(Z,Y) • Two groups of tasks: • Join tasks: hash on Z, using h(Z). • Like tasks from previous example. • Eliminate-duplicates tasks: hash on X and Y, using h’(X,Y). • Receives tuples from join tasks. • Distributes truly new tuples to join tasks.
Example – (2) [Diagram: join tasks (state holds p(x,y) if h(x) or h(y) is right) send each p(a,b) to the dup-elim task h’(a,b); dup-elim tasks (state holds p(x,y) if h’(x,y) is right) send p(a,b) back to the join tasks for h(a) and h(b) only if it is new.]
Example – Details • Each task writes “buffer” files locally, one for each of the tasks in the other rank. • The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.
Example – Details – (2) • Periodically, each task writes its state (the tuples received so far) incrementally and lets the master controller replicate it. • Problem: the controller can’t be too eager to pass output files along as input, or the files become tiny.
Future Research • There is work to be done on optimization, using map-reduce or similar facilities, for restricted SQL such as Datalog, Datalog–, Datalog + aggregation. • Check out Hive, PIG, as well as work on multiway join optimization.
Future Research – (2) • Almost everything is open about recursive Datalog implementation under map-reduce or similar systems. • Seminaïve evaluation in the general case. • Architectures for managing failures. • Clustera and Hyrax are interesting examples of (nonrecursive) extensions of map-reduce. • When can we avoid communication, as with p(X,Y) :- e(X,Z) & p(Z,Y)?