Introduction to Search Engines Technology
CS 236375, Technion, Winter 2013
Map-Reduce
Amit Gross
Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo! Labs, Haifa
Problem Example
• Given a huge set of documents, count how many times each token appears in the corpus.
• The data is of petabyte scale.
• Some sort of distributed computing is required – but we don't want to reinvent the wheel for each new task.
Solution Paradigm
• Describe the problem as a set of Map-Reduce tasks, from the functional programming paradigm.
• Map: data -> (key,value)*
  • Document -> (token, '1')*
• Reduce: (key,List<value>) -> (key,value')
  • (token,List<1>) -> (token,#repeats)
Word-count - example
Input:
• D1 = The good the bad and the ugly
• D2 = As good as it gets and more
• D3 = Is it ugly and bad? It is, and more!
Map: Text -> (term,'1') (terms are lowercased so that identical words share a key):
(the,1); (good,1); (the,1); (bad,1); (and,1); (the,1); (ugly,1); (as,1); (good,1); (as,1); (it,1); (gets,1); (and,1); (more,1); (is,1); (it,1); (ugly,1); (and,1); (bad,1); (it,1); (is,1); (and,1); (more,1)
Word-count - example
Shuffle (reducer input):
(the,[1,1,1]); (good,[1,1]); (bad,[1,1]); (ugly,[1,1]); (and,[1,1,1,1]); (as,[1,1]); (it,[1,1,1]); (gets,[1]); (more,[1,1]); (is,[1,1])
Reduce: (term,list<1>) -> (term,#occurrences)
(the,3); (good,2); (bad,2); (ugly,2); (and,4); (as,2); (it,3); (gets,1); (more,2); (is,2)
Word-count – pseudo-code:
Map(Document):
    terms[] <- parse(Document)
    for each t in terms:
        emit(t, '1')
Reduce(term, list<count>):
    emit(term, sum(list))
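A minimal runnable sketch of the same word count in Python, simulating the map, shuffle and reduce phases in a single process (the names and the crude tokenizer are illustrative, not part of any particular framework):

    from collections import defaultdict

    def map_phase(document):
        # emit (term, 1) for every token; lowercase and strip punctuation first
        cleaned = document.lower()
        for ch in '?,!':
            cleaned = cleaned.replace(ch, ' ')
        for term in cleaned.split():
            yield (term, 1)

    def shuffle(pairs):
        # group values by key, as the framework does between map and reduce
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(term, counts):
        # sum the per-occurrence counts for one term
        return (term, sum(counts))

    documents = [
        "The good the bad and the ugly",
        "As good as it gets and more",
        "Is it ugly and bad? It is, and more!",
    ]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(t, v) for t, v in shuffle(pairs).items())
    print(counts)  # {'the': 3, 'good': 2, 'bad': 2, ..., 'and': 4, 'it': 3, 'is': 2}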
Other examples:
grep(Text, regex):
• Map(Text, regex) -> (line#, 1) for each matching line
• Reduce(line#, [1]) -> line# (identity)
Inverted Index:
• Map(docId, Text) -> (term, docId) [for each term]
• Reduce(term, list<docId>) -> (term, sorted(list<docId>))
Reverse Web-Link Graph:
• Map(Webpages) -> (target, source) [for each link]
• Reduce(target, list<source>) -> (target, list<source>)
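A sketch of the inverted-index job in the same single-process style (the document set and names are illustrative):

    from collections import defaultdict

    def map_index(doc_id, text):
        # emit (term, docId) for every distinct term in the document
        for term in set(text.lower().split()):
            yield (term, doc_id)

    def reduce_index(term, doc_ids):
        # the posting list: a sorted list of documents containing the term
        return (term, sorted(doc_ids))

    docs = {1: "the good the bad and the ugly", 2: "as good as it gets and more"}
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_index(doc_id, text):
            groups[term].append(d)
    index = dict(reduce_index(t, ids) for t, ids in groups.items())
    print(index["good"])  # [1, 2]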
Data-flow
Input (text) -> Mapper -> (key, value) pairs -> [framework: shuffle keys by hash value, sort pairs by key, create a list per key] -> (key, list<value>) -> Reducer -> output
The Mapper and Reducer are user-supplied; the shuffle/sort stage belongs to the framework.
Example: MR job on 2 machines
[Diagram: input splits on DFS feed four mappers M1–M4; the shuffle routes their output to two reducers R1 and R2, which write the final output back to DFS.]
Synchronous execution: every R starts computing only after all M's have completed.
Storage
• Job input and output are stored on DFS
  • Replicated, reliable storage
• Intermediate files reside on local disks
  • Non-reliable
• Data is transferred from Mappers to Reducers over the network, as files, which is time consuming.
Combiners
• Often, what the reducer does is simple aggregation
  • Sum, average, min/max, …
  • Commutative and associative functions
• We can do some of the aggregation on the mapper side
  • … and eliminate a lot of network traffic!
• Where can we use it in an example we have already seen?
  • Word Count – the combiner is identical to the reducer (see the sketch below)
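For word count, mapper-side combining just runs the reducer's summation on the mapper's local output before anything crosses the network; a minimal sketch with illustrative names:

    from collections import defaultdict

    def combine(pairs):
        # local aggregation on the mapper's machine; valid because
        # sum() is commutative and associative
        local = defaultdict(int)
        for term, count in pairs:
            local[term] += count
        return list(local.items())

    mapper_output = [("the", 1), ("good", 1), ("the", 1), ("bad", 1), ("the", 1)]
    print(combine(mapper_output))  # [('the', 3), ('good', 1), ('bad', 1)]
    # 3 pairs cross the network instead of 5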
Data-flow with combiner
Input (text) -> Mapper -> (key, value) -> Combiner -> (key, value') -> [framework: shuffle keys by hash value, sort pairs by key, create a list per key] -> (key, list<value'>) -> Reducer -> output
The Mapper, Combiner and Reducer are user-supplied; the Combiner runs on the same machine as the Mapper.
Fault tolerance
• If the probability for a single machine to fail is p, the probability that some machine in a cluster of n machines fails is 1 - (1-p)^n
• For p = 0.0005, n = 2000 we get 1 - 0.9995^2000 ≈ 0.63
• We cannot assume no faults in large-scale clusters.
• The framework takes care of fault tolerance.
• There is a master that controls the data flow.
• If a mapper/reducer fails – just resend the task to a different machine.
• If the master fails – some other machine becomes the new master.
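A one-line check of the arithmetic above (assuming independent machine failures):

    p, n = 0.0005, 2000
    print(1 - (1 - p) ** n)  # ~0.632: some machine fails in roughly 63% of runs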
Straggler Tasks
The slowest task (straggler) determines the job latency.
[Diagram: the same 4-mapper / 2-reducer job from input splits on DFS to output on DFS; one slow task holds up the entire job.]
Speculative Execution
• Schedule a backup task if the original task takes too long to complete
  • Same input(s), different output(s)
• Failed tasks and stragglers get the same treatment
• Let the fastest win
  • After one task completes, kill all the clones
• Challenge: how can we tell a task is late?
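There is no single answer to that challenge; one simple heuristic, shown here as an illustrative sketch rather than the framework's actual rule, is to flag tasks whose progress falls well behind the average of their peers:

    def find_stragglers(progress, threshold=0.2):
        # progress: dict of task name -> fraction completed (0.0 .. 1.0)
        avg = sum(progress.values()) / len(progress)
        return [task for task, p in progress.items() if p < avg - threshold]

    print(find_stragglers({"M1": 0.9, "M2": 0.85, "M3": 0.4, "M4": 0.95}))  # ['M3']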
Summary
• A simple paradigm for batch processing
  • Data- and computation-intensive jobs
• Simplicity is key for scalability
• No silver bullet
  • E.g., MPI is better for iterative computation-intensive workloads (e.g., scientific simulations)