Google’s MapReduce Connor Poske Florida State University
Outline • Part I: • History • MapReduce architecture and features • How it works • Part II: • MapReduce programming model and example
Initial History • There is a growing demand for large-scale data processing. • Engineers at Google noticed common themes when processing very large inputs: • Multiple machines are needed • There are usually two basic operations on the input data: 1) Map 2) Reduce
Map • Similar to the Lisp map primitive • Applies a single function to each of many inputs • In the MapReduce model, the map function takes a pair of the form (input_key, input_value) and produces a set of INTERMEDIATE key/value tuples: Map(input_key, input_value) -> (intermediate_key, intermediate_value) list
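For example (previewing the word count on the next slide), mapping over one small document might produce:

Map("doc1", "the cat saw the dog") -> (the, 1), (cat, 1), (saw, 1), (the, 1), (dog, 1)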
Reduce • Accepts the set of intermediate key/value tuples as input • Applies a reduce operation to all values that share the same key Reduce(intermediate_key, intermediate_value list) -> output list
Quick example • Pseudo-code counts the number of occurrences of each word in a large collection of documents

Map(String fileName, String fileContents)
  // fileName is the input key, fileContents is the input value
  for each word w in fileContents
    EmitIntermediate(w, "1")

Reduce(String word, Iterator values)
  // word: intermediate key, values: a list of counts
  int count = 0
  for each v in values
    count += ParseInt(v)
  Emit(AsString(count))
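To make the flow concrete, here is a minimal single-machine sketch of the same word count in plain C++ (just the Map -> group-by-key -> Reduce pattern; the grouping that the MapReduce library performs across machines is done here with an in-memory map):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: emit an intermediate (word, 1) pair for every word in the document.
void Map(const std::string &contents,
         std::vector<std::pair<std::string, int>> *intermediate) {
  std::istringstream in(contents);
  std::string word;
  while (in >> word)
    intermediate->push_back({word, 1});
}

// Reduce: sum all counts that share one key.
int Reduce(const std::vector<int> &values) {
  int count = 0;
  for (int v : values) count += v;
  return count;
}

int main() {
  std::vector<std::pair<std::string, int>> intermediate;
  Map("the cat saw the dog the end", &intermediate);

  // Group intermediate values by key (the library's shuffle/sort step).
  std::map<std::string, std::vector<int>> grouped;
  for (const auto &kv : intermediate)
    grouped[kv.first].push_back(kv.second);

  for (const auto &kv : grouped)
    std::cout << kv.first << ": " << Reduce(kv.second) << "\n";
  return 0;
}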
The idea sounds good, but… • We can’t forget about the problems arising from large-scale, multi-machine data processing • How do we parallelize everything? • How do we balance the input load? • Handle failures? Enter the MapReduce model…
MapReduce • The MapReduce implementation is an abstraction that hides these complexities from the programmer • The User defines the Map and Reduce functions • The MapReduce implementation automatically distributes the data, then applies the user-defined functions on the data • Actual code slightly more complex than previous example
MapReduce Architecture • User program with Map and Reduce functions • Cluster of commodity PCs • Upon execution, the cluster is divided into: • Master worker • Map workers • Reduce workers
Execution Overview • Split up the input data and start the program on all machines • The master assigns M Map tasks and R Reduce tasks to idle worker machines • The Map function is executed and its results are buffered in memory • Periodically, buffered data is written to local disk, and the on-disk locations are forwarded to the master (Map phase complete) • Each Reduce worker uses RPCs to read intermediate data from the Map machines, then sorts it by key • The Reduce worker iterates over the sorted data and passes each unique key, along with its associated values, to the Reduce function • The master wakes up the user program, and the MapReduce call returns
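One detail the overview glosses over is how intermediate data gets routed to the R Reduce tasks: each Map worker partitions its output into R regions, one per Reduce task. Per the MapReduce paper, the default partitioning function is simply a hash of the key; a minimal C++ sketch:

#include <functional>
#include <string>

// Route an intermediate key to one of the R reduce tasks;
// the paper's default partitioning function is hash(key) mod R.
int Partition(const std::string &key, int R) {
  return static_cast<int>(std::hash<std::string>{}(key) % R);
}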
Master worker • Stores state information about every Map and Reduce task • Idle, in-progress, or completed • Stores the locations and sizes on disk of intermediate file regions on the Map machines • Pushes this information incrementally to workers with in-progress Reduce tasks • Displays the status of the entire operation via HTTP • Runs an internal HTTP server • Shows progress, e.g. bytes of intermediate data, bytes of output, processing rates, etc.
Parallelization • Map() runs in parallel, creating different intermediate output from different input keys and values • Reduce() runs in parallel, each worker handling a different set of keys • All data is processed independently by different worker machines • The Reduce phase cannot begin until the Map phase is completely finished: a reduce for a given key needs every intermediate value for that key, and any still-running map task might produce one!
Load Balancing • The user defines a MapReduce “spec” object • MapReduceSpecification spec • spec.set_machines(2000) • spec.set_map_megabytes(100) • spec.set_reduce_megabytes(100) That’s it! The library will automatically take care of the rest.
Fault Tolerance • The master pings each worker periodically:

switch (ping response)
  case idle:        assign a task if possible
  case in-progress: do nothing
  case completed:   reset to idle
  case no response: reassign the task
Fault Tolerance • What if a map task completes but the machine fails before the intermediate data is retrieved via RPC? • Re-execute the map task on an idle machine • What if the intermediate data is partially read, but the machine fails before all reduce operations can complete? • Re-execute the map task as well; Reduce workers that have not yet read its data are notified of the new location • What if the master fails…? PWNED (the implementation simply aborts the computation; with a single master, failure is unlikely)
Fault Tolerance • Skipping bad records • An optional parameter changes the mode of execution • When enabled, the MapReduce library detects records that cause deterministic crashes and skips them • Bottom line: MapReduce is very robust in its ability to recover from failures and handle errors
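How does the library detect a bad record? Per the MapReduce paper: each worker installs a signal handler and records the sequence number of the record currently being passed to user code; if the user code crashes, the handler sends that number to the master in a “last gasp” UDP packet, and after seeing repeated failures on the same record the master tells the next re-execution to skip it. A rough C++ sketch of the worker-side pieces (the master communication is omitted):

#include <csignal>
#include <cstdlib>

// Sequence number of the record currently being processed;
// the library sets this before each call into user Map()/Reduce().
static volatile std::sig_atomic_t current_record = 0;

static void LastGasp(int /*signum*/) {
  // A real worker would send current_record to the master via UDP here,
  // so the master can mark the record as skippable after repeated failures.
  std::_Exit(1);
}

// Installed once at worker startup, e.g.:
//   std::signal(SIGSEGV, LastGasp);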
Part II: Programming Model • MapReduce library is extremely easy to use • Involves setting up only a few parameters, and defining the map() and reduce() functions • Define map() and reduce() • Define and set parameters for MapReduceInput object • Define and set parameters for MapReduceOutput object • Main program
Map()

class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput &input) {
    // Parse each word and emit (word, "1")
    const string &text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);
Reduce()

class Adder : public Reducer {
  virtual void Reduce(ReduceInput *input) {
    // Iterate over all entries with the same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit the sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);
Main()

int main(int argc, char **argv) {
  MapReduceSpecification spec;
  MapReduceInput *input;

  // Store the list of input files into "spec"
  // (argv[0] is the program name, so start at 1)
  for (int i = 1; i < argc; ++i) {
    input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }
Main()

  // Specify the output files
  MapReduceOutput *out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);  // freq-00000-of-00100
                            // freq-00001-of-00100 ...
  out->set_format("text");
  out->set_reducer_class("Adder");
Main()

  // Tuning parameters and the actual MapReduce call
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  return 0;
} // end main
Other possible uses • Distributed grep (see the sketch below) • Map emits a line if it matches a supplied pattern • Reduce simply copies intermediate data to output • Count URL access frequency • Map processes logs of web page requests and emits (URL, 1) • Reduce adds all values for each URL and emits (URL, count) • Inverted Index • Map parses each document and emits a sequence of (word, document ID) pairs • Reduce accepts all pairs for a given word, sorts the list by document ID, and emits (word, list(document ID)) • Many more…
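As an illustration of the first item, distributed grep needs only a Mapper in the same style as the WordCounter above (a hypothetical sketch assuming the same MapInput/Emit interface; the search pattern is hard-coded for brevity):

class GrepMapper : public Mapper {
 public:
  virtual void Map(const MapInput &input) {
    // Emit the line only if it contains the pattern;
    // an identity Reducer then copies the matches to the output.
    const string &line = input.value();
    if (line.find("searchterm") != string::npos)
      Emit(line, "");
  }
};
REGISTER_MAPPER(GrepMapper);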
Conclusion • MapReduce provides an easy-to-use, clean abstraction for large-scale data processing • Very robust in fault tolerance and error handling • Can be used in many scenarios • Restricting the programming model to the Map and Reduce paradigms makes it easy to parallelize computations and make them fault-tolerant