Introduction to Search Engines Technology
CS 236375, Technion, Winter 2013
Map-Reduce
Amit Gross
Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo! Labs, Haifa
Problem Example
• Given a huge set of documents, count how many times each token appears in the corpus.
• The data is of petabyte scale.
• Some sort of distributed computing is required – but we don't want to reinvent the wheel for each new task.
Solution Paradigm
• Describe the problem as a set of Map-Reduce tasks, from the functional programming paradigm.
• Map: data -> (key,value)*
  • Document -> (token, '1')*
• Reduce: (key,List<value>) -> (key,value')
  • (token,List<1>) -> (token,#repeats)
Word-count - example
Input:
• D1 = The good the bad and the ugly
• D2 = As good as it gets and more
• D3 = Is it ugly and bad? It is, and more!
Map: Text -> (term,'1') (terms are lowercased so that identical words share a key):
(the,1); (good,1); (the,1); (bad,1); (and,1); (the,1); (ugly,1); (as,1); (good,1); (as,1); (it,1); (gets,1); (and,1); (more,1); (is,1); (it,1); (ugly,1); (and,1); (bad,1); (it,1); (is,1); (and,1); (more,1)
Word-count - example
Shuffle (reducer input):
(the,[1,1,1]); (good,[1,1]); (bad,[1,1]); (ugly,[1,1]); (and,[1,1,1,1]); (as,[1,1]); (it,[1,1,1]); (gets,[1]); (more,[1,1]); (is,[1,1])
Reduce: (term,list<1>) -> (term,#occurrences)
(the,3); (good,2); (bad,2); (ugly,2); (and,4); (as,2); (it,3); (gets,1); (more,2); (is,2)
Word-count – pseudo-code:
Map(Document):
    terms[] <- parse(Document)
    for each t in terms:
        emit(t, '1')
Reduce(term, list<count>):
    emit(term, sum(list))
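A minimal runnable sketch of the same word count in Python, simulating the map, shuffle and reduce phases in a single process (the names and the crude tokenizer are illustrative, not part of any particular framework):

    from collections import defaultdict

    def map_phase(document):
        # emit (term, 1) for every token; lowercase and strip punctuation first
        cleaned = document.lower()
        for ch in '?,!':
            cleaned = cleaned.replace(ch, ' ')
        for term in cleaned.split():
            yield (term, 1)

    def shuffle(pairs):
        # group values by key, as the framework does between map and reduce
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(term, counts):
        # sum the per-occurrence counts for one term
        return (term, sum(counts))

    documents = [
        "The good the bad and the ugly",
        "As good as it gets and more",
        "Is it ugly and bad? It is, and more!",
    ]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(t, v) for t, v in shuffle(pairs).items())
    print(counts)  # {'the': 3, 'good': 2, 'bad': 2, ..., 'and': 4, 'it': 3, 'is': 2}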
Other examples:
grep(Text, regex):
• Map(Text, regex) -> (line#, 1) for each matching line
• Reduce(line#, [1]) -> line# (identity)
Inverted Index:
• Map(docId, Text) -> (term, docId) [for each term]
• Reduce(term, list<docId>) -> (term, sorted(list<docId>))
Reverse Web-Link Graph:
• Map(Webpages) -> (target, source) [for each link]
• Reduce(target, list<source>) -> (target, list<source>)
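A sketch of the inverted-index job in the same single-process style (the document set and names are illustrative):

    from collections import defaultdict

    def map_index(doc_id, text):
        # emit (term, docId) for every distinct term in the document
        for term in set(text.lower().split()):
            yield (term, doc_id)

    def reduce_index(term, doc_ids):
        # the posting list: a sorted list of documents containing the term
        return (term, sorted(doc_ids))

    docs = {1: "the good the bad and the ugly", 2: "as good as it gets and more"}
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_index(doc_id, text):
            groups[term].append(d)
    index = dict(reduce_index(t, ids) for t, ids in groups.items())
    print(index["good"])  # [1, 2]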
Data-flow
Input (text) -> Mapper -> (key, value) pairs -> [framework: shuffle keys by hash value, sort pairs by key, create a list per key] -> (key, list<value>) -> Reducer -> output
The Mapper and Reducer are user-supplied; the shuffle/sort stage belongs to the framework.
Example: MR job on 2 machines
[Diagram: input splits on DFS feed four mappers M1–M4; the shuffle routes their output to two reducers R1 and R2, which write the final output back to DFS.]
Synchronous execution: every R starts computing only after all M's have completed.
Storage
• Job input and output are stored on DFS
  • Replicated, reliable storage
• Intermediate files reside on local disks
  • Non-reliable
• Data is transferred from Mappers to Reducers over the network, as files, which is time consuming.
Combiners
• Often, what the reducer does is simple aggregation
  • Sum, average, min/max, …
  • Commutative and associative functions
• We can do some of the aggregation on the mapper side
  • … and eliminate a lot of network traffic!
• Where can we use it in an example we have already seen?
  • Word Count – the combiner is identical to the reducer (see the sketch below)
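For word count, mapper-side combining just runs the reducer's summation on the mapper's local output before anything crosses the network; a minimal sketch with illustrative names:

    from collections import defaultdict

    def combine(pairs):
        # local aggregation on the mapper's machine; valid because
        # sum() is commutative and associative
        local = defaultdict(int)
        for term, count in pairs:
            local[term] += count
        return list(local.items())

    mapper_output = [("the", 1), ("good", 1), ("the", 1), ("bad", 1), ("the", 1)]
    print(combine(mapper_output))  # [('the', 3), ('good', 1), ('bad', 1)]
    # 3 pairs cross the network instead of 5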
Data-flow with combiner
Input (text) -> Mapper -> (key, value) -> Combiner -> (key, value') -> [framework: shuffle keys by hash value, sort pairs by key, create a list per key] -> (key, list<value'>) -> Reducer -> output
The Mapper, Combiner and Reducer are user-supplied; the Combiner runs on the same machine as the Mapper.
Fault tolerance
• If the probability for a single machine to fail is p, the probability that some machine in a cluster of n machines fails is 1 - (1-p)^n
• For p = 0.0005, n = 2000 we get 1 - 0.9995^2000 ≈ 0.63
• We cannot assume no faults in large-scale clusters.
• The framework takes care of fault tolerance.
• There is a master that controls the data flow.
• If a mapper/reducer fails – just resend the task to a different machine.
• If the master fails – some other machine becomes the new master.
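A one-line check of the arithmetic above (assuming independent machine failures):

    p, n = 0.0005, 2000
    print(1 - (1 - p) ** n)  # ~0.632: some machine fails in roughly 63% of runs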
Straggler Tasks
The slowest task (straggler) determines the job latency.
[Diagram: the same 4-mapper / 2-reducer job from input splits on DFS to output on DFS; one slow task holds up the entire job.]
Speculative Execution
• Schedule a backup task if the original task takes too long to complete
  • Same input(s), different output(s)
• Failed tasks and stragglers get the same treatment
• Let the fastest win
  • After one task completes, kill all the clones
• Challenge: how can we tell a task is late?
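There is no single answer to that challenge; one simple heuristic, shown here as an illustrative sketch rather than the framework's actual rule, is to flag tasks whose progress falls well behind the average of their peers:

    def find_stragglers(progress, threshold=0.2):
        # progress: dict of task name -> fraction completed (0.0 .. 1.0)
        avg = sum(progress.values()) / len(progress)
        return [task for task, p in progress.items() if p < avg - threshold]

    print(find_stragglers({"M1": 0.9, "M2": 0.85, "M3": 0.4, "M4": 0.95}))  # ['M3']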
Summary
• A simple paradigm for batch processing
  • Data- and computation-intensive jobs
• Simplicity is key for scalability
• No silver bullet
  • E.g., MPI is better for iterative computation-intensive workloads (e.g., scientific simulations)