MapReduce
Web data sets can be very large • Tens to hundreds of terabytes • Cannot mine on a single server • Standard architecture emerging: • Cluster of commodity Linux nodes • Gigabit Ethernet interconnect • How to organize computations on commodity clusters?
MapReduce • Map-reduce is a high-level programming system that allows processes to be written simply. • The user writes code for two functions, map and reduce. • A master controller divides the input data into chunks, and assigns different processors to execute the map function on each chunk. • Other processors, perhaps the same ones, are then assigned to perform the reduce function on pieces of the output from the map function.
Data Organization • Data is assumed to be stored in files. • Typically, the files are very large compared with the files found in conventional systems. • For example, one file might be all the tuples of a very large relation. • Or, the file might be a terabyte of "market baskets." • Or, the file might be the "transition matrix of the Web," which is a representation of the graph with all Web pages as nodes and hyperlinks as edges. • Files are divided into chunks, which might be complete cylinders of a disk, and are typically many megabytes.
The Map Function • Input is a set of key-value records. • Executed by one or more processes, located at any number of processors. • Each map process is given a chunk of the entire input data on which to work. • Output is a list of key-value pairs. • The types of keys and values for the output of the map function need not be the same as the types of input keys and values. • The "keys" that are output from the map function are not true keys in the database sense. • That is, there can be many pairs with the same key value.
The Reduce Function • The second user-defined function, reduce, is also executed by one or more processes, located at any number of processors. • Input is a key from the intermediate result, together with the list of all values that appear with that key in the intermediate result. • The reduce function combines the list of values associated with a given key k.
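The control flow just described (run map over the input chunks, group the intermediate pairs by key, then run reduce once per key) can be simulated in a few lines. This Python sketch is purely illustrative: run_mapreduce and its signature are not part of any real MapReduce library, and a real system distributes the phases across many processors.

```python
from collections import defaultdict
from itertools import chain

def run_mapreduce(map_fn, reduce_fn, chunks):
    """Single-process simulation of the map-reduce control flow."""
    # Map phase: each chunk yields a list of (key, value) pairs.
    intermediate = chain.from_iterable(map_fn(c) for c in chunks)
    # Shuffle phase: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

For example, word counting is `run_mapreduce(lambda c: [(w, 1) for w in c.split()], lambda k, vs: sum(vs), chunks)`.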
A Simple Example • Counting words in a large set of documents:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Example for MapReduce • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good.
Map output • Worker 1: • (the 1), (weather 1), (is 1), (good 1). • Worker 2: • (today 1), (is 1), (good 1). • Worker 3: • (good 1), (weather 1), (is 1), (good 1).
Reduce Input • Worker 1: • (the 1) • Worker 2: • (is 1), (is 1), (is 1) • Worker 3: • (weather 1), (weather 1) • Worker 4: • (today 1) • Worker 5: • (good 1), (good 1), (good 1), (good 1)
Reduce Output • Worker 1: • (the 1) • Worker 2: • (is 3) • Worker 3: • (weather 2) • Worker 4: • (today 1) • Worker 5: • (good 4)
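The worked example above can be reproduced with a small single-process sketch (in Python; the variable names are illustrative, and the shuffle step stands in for what the framework does between the map and reduce workers):

```python
from collections import defaultdict

pages = [
    "the weather is good",   # Page 1
    "today is good",         # Page 2
    "good weather is good",  # Page 3
]

# Map: emit (word, 1) for every word occurrence on a page.
intermediate = [(w, 1) for page in pages for w in page.split()]

# Shuffle: group the counts by word, as the framework would.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce: sum the list of counts for each word.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```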
MapReduce Example: Constructing an Inverted Index • Input is a collection of documents. • Final output (not the output of map) is, for each word, a list of the documents that contain that word at least once. Map Function • Input is a set of (i, d) pairs • i is a document ID • d is the corresponding document. • The map function scans d and, for each word w it finds, emits the pair (w, i). • Notice that in the output, the word is the key and the document ID is the associated value. • Output of map is a list of word-ID pairs. • It is not necessary to catch duplicate words in a document; duplicates can be eliminated later, in the reduce phase. • The intermediate result is the collection of all word-ID pairs created from all the documents in the input database.
Note. The output of a map-reduce algorithm is always a set of key-value pairs, which makes it possible, in some applications, to compose two or more map-reduce operations.
MapReduce Example: Constructing an Inverted Index • Input is a collection of documents. • Final output (not the output of map) is, for each word, a list of the documents that contain that word at least once. Reduce Function • The intermediate result consists of pairs of the form (w, [i1, i2, …, in]) • where the i's are the list of document IDs, one for each occurrence of word w. • The reduce function takes this list of IDs, eliminates duplicates, and sorts the list of unique IDs.
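The inverted-index job can be sketched end to end in a single process (Python; the sample documents and variable names are made up for illustration, and the grouping step again stands in for the framework's shuffle):

```python
from collections import defaultdict

# Input: (document ID, document text) pairs.
docs = [(1, "good weather"), (2, "good food"), (3, "weather report")]

# Map: for each word w found in document i, emit (w, i); duplicates allowed.
pairs = [(w, i) for i, d in docs for w in d.split()]

# Shuffle: collect the document-ID list for each word.
index = defaultdict(list)
for word, doc_id in pairs:
    index[word].append(doc_id)

# Reduce: eliminate duplicates and sort each word's ID list.
inverted = {word: sorted(set(ids)) for word, ids in index.items()}
print(inverted)  # {'good': [1, 2], 'weather': [1, 3], 'food': [2], 'report': [3]}
```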
Parallelism • This organization of the computation makes excellent use of whatever parallelism is available. • The map function works on a single document, so we could have as many processes and processors as there are documents in the database. • The reduce function works on a single word, so we could have as many processes and processors as there are words in the database. • Of course, it is unlikely that we would use so many processors in practice.
Some Applications • Distributed Grep: • Map - Emits a line if it matches the supplied pattern • Reduce - Copies the intermediate data to output • Count of URL Access Frequency: • Map - Processes a web log and outputs <URL, 1> • Reduce - Emits <URL, total count> • Reverse Web-Link Graph: • Map - Processes web pages and outputs <target, source> for each link • Reduce - Emits <target, list(source)>
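As an illustration, the distributed-grep pair can be sketched in Python (grep_map and grep_reduce are illustrative names; a real deployment would run them across many workers over file chunks rather than an in-memory list):

```python
import re

def grep_map(pattern, lines):
    # Map: emit (line number, line) for every line matching the pattern.
    return [(n, line) for n, line in enumerate(lines) if re.search(pattern, line)]

def grep_reduce(key, values):
    # Reduce: identity; copy the matching line through to the output.
    return values[0]

log = ["GET /index", "ERROR disk full", "GET /home", "ERROR timeout"]
matches = {k: grep_reduce(k, [v]) for k, v in grep_map(r"ERROR", log)}
```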
Refinement • Fault tolerance • Different partitioning functions. • Combiner function. • Different input/output types. • Skipping bad records. • Local execution. • Status info. • Counters.
Fault Tolerance • Worker failure: • Detect failure via periodic heartbeats • Re-execute completed and in-progress map tasks • Re-execute in-progress reduce tasks • Task completion committed through the master • Master failure: • Could handle, but don't yet (master failure unlikely)
Fault Tolerance • Reactive way • Worker failure • Heartbeat: workers are periodically pinged by the master • No response = failed worker • If the processor of a worker fails, the tasks of that worker are reassigned to another worker. • Master failure • The master writes periodic checkpoints • Another master can be started from the last checkpointed state • If the master dies and is not restarted, the job is aborted
Fault Tolerance • Proactive way (Redundant Execution) • The problem of "stragglers" (slow workers): • Other jobs consuming resources on the machine • Bad disks with soft errors transfer data very slowly • Weird things: processor caches disabled (!!) • When the computation is almost done, schedule backup executions of the in-progress tasks • Whenever either the primary or the backup execution finishes, mark the task as completed
Fault Tolerance • Input errors: bad records • Map/Reduce functions sometimes fail for particular inputs • The best solution is to debug & fix, but that is not always possible • On a segmentation fault: • Send a UDP packet to the master from the signal handler • Include the sequence number of the record being processed • Skipping bad records: • If the master sees two failures for the same record, the next worker is told to skip that record
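The skipping behavior can be sketched in a single process (Python; map_with_skip is a hypothetical helper — in the real system the failure count is tracked by the master across worker re-executions, not inside one loop):

```python
def map_with_skip(map_fn, records, max_failures=2):
    """Apply map_fn to each record; after max_failures failed attempts
    on the same record, give up and skip it (the two-failure rule)."""
    output = []
    for record in records:
        for attempt in range(1, max_failures + 1):
            try:
                output.extend(map_fn(record))
                break  # record processed successfully
            except Exception:
                if attempt == max_failures:
                    pass  # second failure on this record: skip it for good
    return output
```

Records that always raise are silently dropped, so the job finishes even when a handful of inputs are unparseable.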
Locality Issue • Master scheduling policy: • Asks GFS for the locations of the replicas of the input file blocks • Input is typically split into 64 MB chunks (== GFS block size) • Map tasks are scheduled so that a replica of their input block is on the same machine or the same rack • Effect: • Thousands of machines read input at local-disk speed • Without this, rack switches limit the read rate
Refinements • Task granularity • Minimizes time for fault recovery • Load balancing • Local execution for debugging/testing • Compression of intermediate data • Better shuffle-sort
Points to Emphasize • No reduce can begin until the map phase is complete • The master must communicate the locations of intermediate files • Tasks are scheduled based on the location of data • If a map worker fails at any time before the reduce finishes, its tasks must be completely rerun • The MapReduce library does most of the hard work for us!