
Presentation Transcript


  1. Google’s MapReduce Connor Poske Florida State University

  2. Outline • Part I: • History • MapReduce architecture and features • How it works • Part II: • MapReduce programming model and example

  3. Initial History • There is a demand for large-scale data processing • The engineers at Google noticed common themes when processing very large inputs: multiple machines are needed, and there are usually two basic operations on the input data: 1) Map 2) Reduce

  4. Map • Similar to the Lisp primitive • Applies a single function to multiple inputs • In the MapReduce model, the map function is applied to each pair of the form (input_key, input_value) and produces a set of INTERMEDIATE key/value tuples:
    Map(input_key, input_value) -> list of (output_key, intermediate_value)
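
  To make the Lisp analogy concrete, here is a minimal, self-contained C++ sketch (not part of the original slides) of applying one function across a list of inputs, which is all the map primitive does:

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
      std::vector<int> inputs = {1, 2, 3, 4};
      std::vector<int> outputs(inputs.size());
      // Apply a single function (here: squaring) to every element,
      // just as Lisp's map applies a function across a list.
      std::transform(inputs.begin(), inputs.end(), outputs.begin(),
                     [](int x) { return x * x; });
      for (int y : outputs) std::cout << y << ' ';  // prints: 1 4 9 16
      return 0;
    }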

  5. Reduce • Accepts the set of intermediate key/value tuples as input • Applies a reduce operation to all values that share the same key:
    Reduce(output_key, list of intermediate_values) -> output list

  6. Quick example • Pseudo-code that counts the number of occurrences of each word in a large collection of documents:
    Map(String fileName, String fileContents)
      // fileName is the input key, fileContents is the input value
      for each word w in fileContents
        EmitIntermediate(w, "1")

    Reduce(String word, Iterator values)
      // word: the input key; values: a list of counts
      int count = 0
      for each v in values
        count += ParseInt(v)
      Emit(AsString(count))
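
  To see the data flow, here is a hand trace of this pseudo-code on a tiny made-up input (the document and its contents are illustrative, not from the original slides):

    Input:        ("doc1", "the cat sat on the mat")
    Map emits:    ("the","1") ("cat","1") ("sat","1") ("on","1") ("the","1") ("mat","1")
    Grouped:      ("cat",["1"]) ("mat",["1"]) ("on",["1"]) ("sat",["1"]) ("the",["1","1"])
    Reduce emits: 1 for "cat", "mat", "on", and "sat"; 2 for "the"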

  7. The idea sounds good, but… • We can’t forget about the problems arising from large-scale, multi-machine data processing • How do we parallelize everything? • How do we balance the input load? • How do we handle failures? Enter the MapReduce model…

  8. MapReduce • The MapReduce implementation is an abstraction that hides these complexities from the programmer • The user defines the Map and Reduce functions • The MapReduce implementation automatically distributes the data, then applies the user-defined functions to it • The actual code is slightly more complex than the previous example

  9. MapReduce Architecture • A user program with Map and Reduce functions • A cluster of commodity PCs • Upon execution, the cluster is divided into: • A master worker • Map workers • Reduce workers

  10. Execution Overview • The library splits the input data and starts up the program on all machines • The master machine assigns the M map tasks and R reduce tasks to idle worker machines • The Map function is executed and its results are buffered in local memory • Periodically, the buffered data is written to local disk, and the on-disk locations of the data are forwarded to the master --Map phase complete-- • Each reduce worker uses RPCs to read the intermediate data from the map machines, then sorts it by key • The reduce worker iterates over the sorted data and passes each unique key, along with its associated values, to the Reduce function • The master wakes up the user program, and the MapReduce call returns
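
  This works because intermediate keys are routed to reduce tasks by a deterministic partitioning function; the MapReduce paper’s default is hash(key) mod R. A minimal C++ sketch, where std::hash stands in for Google’s internal hash function (an assumption):

    #include <functional>
    #include <string>

    // Returns which of the R reduce tasks receives a given intermediate key.
    // Every occurrence of a key hashes to the same partition, so all values
    // for one key end up at a single reduce worker.
    int PartitionForKey(const std::string& key, int num_reduce_tasks /* R */) {
      return static_cast<int>(std::hash<std::string>{}(key) % num_reduce_tasks);
    }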

  11. Execution Overview (diagram of the execution flow; image not reproduced in this transcript)

  12. Master worker • Stores state information about every map and reduce task • Idle, in-progress, or completed • Stores the locations and sizes on disk of the intermediate file regions on the map machines • Pushes this information incrementally to workers with in-progress reduce tasks • Displays the status of the entire operation via HTTP • Runs an internal HTTP server • Displays progress, e.g. bytes of intermediate data, bytes of output, processing rates, etc.
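
  A minimal sketch of the bookkeeping this implies; the type and field names are assumptions for illustration, not Google’s actual code:

    #include <cstdint>
    #include <string>

    enum class TaskState { kIdle, kInProgress, kCompleted };

    // Per-task record kept by the master for every map and reduce task.
    struct TaskInfo {
      TaskState state = TaskState::kIdle;
      std::string worker;  // identity of the machine assigned, if any
    };

    // For each completed map task, the master records where its intermediate
    // file regions live on that machine's local disk and how large they are,
    // so the locations can be pushed to in-progress reduce tasks.
    struct IntermediateRegion {
      std::string location;    // path on the map worker's local disk
      uint64_t size_bytes = 0;
    };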

  13. Parallelization • Map() runs in parallel, creating different intermediate output from different input keys and values • Reduce() runs in parallel, each working on a different key • All data is processed independently by different worker machines • Reduce phase cannot begin until Map phase is completely finished!

  14. Load Balancing • The user defines a MapReduce “spec” object:
    MapReduceSpecification spec;
    spec.set_machines(2000);
    spec.set_map_megabytes(100);
    spec.set_reduce_megabytes(100);
  That’s it! The library will automatically take care of the rest.

  15. Fault Tolerance • The master pings every worker periodically:
    switch (ping response)
      case idle:        assign a task if possible
      case in-progress: do nothing
      case completed:   reset to idle
      case no response: reassign the task

  16. Fault Tolerance • What if a map task completes but the machine fails before the intermediate data is retrieved via RPC? • Re-execute the map task on an idle machine • What if the intermediate data is partially read, but the machine fails before all reduce operations can complete? • Re-execute the map task; reduce workers that have not yet read its data are notified of the new location • What if the master fails…? PWNED: since there is only a single master, its failure is unlikely, and the implementation simply aborts the entire computation

  17. Fault Tolerance • Skipping bad records • An optional parameter changes the mode of execution • When enabled, the MapReduce library detects records that cause deterministic crashes and skips them (a crashing worker reports the offending record to the master, which marks the record to be skipped on re-execution) • Bottom line: MapReduce is very robust in its ability to recover from failures and handle errors

  18. Part II: Programming Model • MapReduce library is extremely easy to use • Involves setting up only a few parameters, and defining the map() and reduce() functions • Define map() and reduce() • Define and set parameters for MapReduceInput object • Define and set parameters for MapReduceOutput object • Main program

  19. Map()
    class WordCounter : public Mapper {
     public:
      virtual void Map(const MapInput& input) {
        // Parse each word in the input value and emit (word, "1")
        const string& text = input.value();
        const int n = text.size();
        for (int i = 0; i < n; ) {
          while (i < n && isspace(text[i])) i++;   // skip leading whitespace
          int start = i;
          while (i < n && !isspace(text[i])) i++;  // find word end
          if (start < i) Emit(text.substr(start, i - start), "1");
        }
      }
    };
    REGISTER_MAPPER(WordCounter);

  20. Reduce()
    class Adder : public Reducer {
      virtual void Reduce(ReduceInput* input) {
        // Iterate over all entries with the same key and add the values
        int64 value = 0;
        while (!input->done()) {
          value += StringToInt(input->value());
          input->NextValue();
        }
        // Emit sum for input->key()
        Emit(IntToString(value));
      }
    };
    REGISTER_REDUCER(Adder);

  21. Main()
    int main(int argc, char** argv) {
      MapReduceSpecification spec;

      // Store list of input files into "spec"
      // (argv[0] is the program name, so start at index 1)
      for (int i = 1; i < argc; i++) {
        MapReduceInput* input = spec.add_input();
        input->set_format("text");
        input->set_filepattern(argv[i]);
        input->set_mapper_class("WordCounter");
      }

  22. Main()
      // Specify the output files:
      //   /gfs/test/freq-00000-of-00100
      //   /gfs/test/freq-00001-of-00100
      //   ...
      MapReduceOutput* out = spec.output();
      out->set_filebase("/gfs/test/freq");
      out->set_num_tasks(100);
      out->set_format("text");
      out->set_reducer_class("Adder");

  23. Main()
      // Tuning parameters and the actual MapReduce call
      spec.set_machines(2000);
      spec.set_map_megabytes(100);
      spec.set_reduce_megabytes(100);

      MapReduceResult result;
      if (!MapReduce(spec, &result)) abort();

      return 0;
    } // end main

  24. Other possible uses • Distributed grep • Map emits a line if it matches a supplied pattern • Reduce simply copies intermediate data to output • Count URL access frequency • Map processes logs of web page requests and emits (URL, 1) • Reduce adds all values for each URL and emits (URL, count) • Inverted Index • Map parses each document and emits a sequence of (word, document ID) pairs. • Reduce accepts all pairs for a given word, sorts the list based on Document ID, and emits (word, list(document ID)) • Many more…
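
  As an illustration, here is a hedged sketch of distributed grep’s Map in the style of the earlier word-count code; the class name and hard-coded pattern are assumptions (in practice the pattern would be supplied by the user):

    // Hypothetical fixed pattern; a real program would read it from a flag.
    static const char kPattern[] = "MapReduce";

    class GrepMapper : public Mapper {
     public:
      virtual void Map(const MapInput& input) {
        const string& line = input.value();
        // Emit the line unchanged if it matches the pattern; the Reduce
        // function is the identity and just copies intermediate data to output.
        if (line.find(kPattern) != string::npos) {
          Emit(line, "");
        }
      }
    };
    REGISTER_MAPPER(GrepMapper);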

  25. Conclusion • MapReduce provides an easy-to-use, clean abstraction for large-scale data processing • Very robust in fault tolerance and error handling • Can be used in multiple scenarios • Restricting the programming model to the Map and Reduce paradigms makes it easy to parallelize computations and make them fault-tolerant
