MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat (Google), OSDI 2004
Shimin Chen, DISC Reading Group
Motivation: Large Scale Data Processing • Process lots of data to produce other derived data • Input: crawled documents, web request logs, etc. • Output: inverted indices, web page graph structure, top queries in a day, etc. • Want to use hundreds or thousands of CPUs • but want to focus only on the functionality • MapReduce hides the messy details in a library: • Parallelization • Data distribution • Fault tolerance • Load balancing
Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion
Programming model Input & Output: each a set of key/value pairs Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) Processes an input key/value pair to generate a set of intermediate key/value pairs reduce (out_key, list(intermediate_value)) -> list(out_value) Given all intermediate values for a particular key, produces a set of merged output values (usually just one) Inspired by similar primitives in LISP and other functional languages
Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Looking at Actual Code (Appendix A)

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);
// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);
int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info about counters, time
  // taken, number of machines used, etc.
  return 0;
}
More Examples • Inverted index: word → documents (see the sketch below) • Map: parses each document, emits <word, document ID> • Reduce: emits <word, list of document IDs> • Distributed grep • Map: emits a line if it matches a given pattern • Reduce: identity function • Distributed sort • Map: extracts the key from each record, emits a <key, record> pair • Reduce: emits all pairs unchanged • Relies on the partitioning function and ordering guarantees (later in the talk)
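As a concrete illustration, here is a minimal sketch of the inverted-index example, written against the same Mapper/Reducer interfaces as the Appendix A code above. It assumes input.key() returns the document name (mirroring the pseudocode comments earlier) and uses a comma-separated output format; neither detail comes from the paper.

// Sketch: inverted index in the style of Appendix A.
class IndexMapper : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& doc = input.key();      // assumed: document name
    const string& text = input.value();   // document contents
    const int n = text.size();
    for (int i = 0; i < n; ) {
      while ((i < n) && isspace(text[i])) i++;
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i - start), doc);  // <word, document id>
    }
  }
};
REGISTER_MAPPER(IndexMapper);

// Reduce: concatenate all document ids seen for a word.
class IndexReducer : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    string docs;
    while (!input->done()) {
      if (!docs.empty()) docs += ",";
      docs += input->value();
      input->NextValue();
    }
    Emit(docs);  // key is implicit, as in Adder
  }
};
REGISTER_REDUCER(IndexReducer);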
Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion
Implementation Overview Execution environment: Google cluster • 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory • 100 Mbps or 1 Gbps Ethernet, but limited (average) bisection bandwidth • Storage on local IDE disks • GFS: distributed file system that manages the data • Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
Parallelization • Map • Divide the input into M equal-sized splits • Each split is 16-64 MB • Reduce • Partition the intermediate key space into R pieces • hash(intermediate_key) mod R (see the sketch below) • Typical setting: • 2,000 machines • M = 200,000 • R = 5,000
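A minimal, self-contained sketch of such a partitioning function; std::hash here is only a stand-in for whatever hash the library actually uses.

#include <cstddef>
#include <functional>
#include <string>

// Default-style partitioning: map an intermediate key to one of R reduce tasks.
int Partition(const std::string& intermediate_key, int R) {
  std::size_t h = std::hash<std::string>{}(intermediate_key);  // stand-in hash
  return static_cast<int>(h % static_cast<std::size_t>(R));
}

The paper also notes that users can supply their own partitioning function, e.g. hashing only the hostname portion of a URL key so that all pages from the same host land in the same output file.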
Execution Overview [figure from the paper: execution flow] • (0) The user program calls mapreduce(spec, &result) • The input is split into M pieces of 16-64 MB each • Map workers write intermediate data partitioned into R regions by the partitioning function hash(intermediate_key) mod R • Each reduce worker reads all intermediate data for its region and sorts it by intermediate key
More Details • Master data structures (sketched below): • Map task: state (idle/in-progress/completed), locations of its R intermediate files, worker machine • Reduce task: state (idle/in-progress/completed), worker machine • O(M+R) scheduling decisions, O(M·R) state • Locality-preserving scheduling • Schedule a map task close to its input's location • Prefer fine-grained tasks (M and R much larger than the number of machines) • Dynamic load balancing • Speeds up recovery
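A hypothetical C++ sketch of the per-task bookkeeping listed above; the type and field names are illustrative, not taken from the paper's code.

#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

// Per-map-task state kept by the master (illustrative).
struct MapTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;                     // machine the task is/was assigned to
  std::vector<std::string> region_files;  // locations of its R intermediate files
};

// Per-reduce-task state kept by the master (illustrative).
struct ReduceTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;
};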
Fault Tolerance via Re-Execution On worker failure: • Detect failure via periodic heartbeats (see the sketch below) • Re-execute completed and in-progress map tasks (their output lives on the failed machine's local disk) • Fine-grained tasks let completed map tasks be re-executed across many machines quickly • Re-execute only in-progress reduce tasks (completed reduce output is already in the global file system) • Task completion is committed through the master Master failure: • Could handle, but don't yet (master failure is unlikely)
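A small illustrative sketch of the timeout-based failure check implied by the heartbeat bullet; the 60-second timeout and the names are assumptions, not values from the paper.

#include <chrono>

using Clock = std::chrono::steady_clock;

// The master pings workers periodically; a worker that has not responded
// within the timeout is treated as failed, and its map tasks (plus any
// in-progress reduce task) are reset to idle for rescheduling.
struct WorkerStatus {
  Clock::time_point last_response;
};

bool WorkerFailed(const WorkerStatus& w,
                  std::chrono::seconds timeout = std::chrono::seconds(60)) {
  return Clock::now() - w.last_response > timeout;
}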
Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion
Backup Tasks • Problem: “stragglers” • A machine takes an unusually long time to complete one of the last few map or reduce tasks • e.g. a bad disk with frequent correctable errors, other jobs competing for the machine, machine configuration problems, etc. • Near the end of a phase, the master schedules backup executions of the remaining in-progress tasks (see the sketch below) • Whichever copy finishes first "wins"
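A hypothetical sketch of the backup-scheduling idea; the 5% "almost done" threshold and all names are assumptions made for illustration.

#include <cstddef>
#include <vector>

struct TaskStatus {
  int id;
  bool in_progress;
};

// When only a small fraction of a phase's tasks remain in progress,
// return their ids so the master can schedule duplicate ("backup")
// executions; whichever copy finishes first wins.
std::vector<int> BackupCandidates(const std::vector<TaskStatus>& tasks,
                                  double remaining_fraction = 0.05) {
  std::size_t remaining = 0;
  for (const TaskStatus& t : tasks)
    if (t.in_progress) ++remaining;

  std::vector<int> backups;
  if (remaining > 0 && remaining <= tasks.size() * remaining_fraction) {
    for (const TaskStatus& t : tasks)
      if (t.in_progress) backups.push_back(t.id);
  }
  return backups;
}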
Combiner Function • Purpose: reduce the data sent over the network • Combiner function: performs partial merging of intermediate data at the map worker • Typically, the combiner function == the reduce function • Requires the function to be commutative and associative • E.g. word count: the Appendix A driver reuses Adder via out->set_combiner_class("Adder")
Skipping Bad Records • Map/Reduce functions sometimes fail for particular inputs • The best solution is to debug & fix, but that is not always possible • On a segfault: • Send a UDP packet to the master from the signal handler • Include the sequence number of the record being processed • If the master sees more than one failure for the same record: • The next worker is told to skip the record (see the sketch below) • Effect: can work around bugs in third-party libraries
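A minimal sketch (not the library's actual code) of the "last gasp" mechanism described above: the worker notes which record it is processing, and the SIGSEGV handler reports that sequence number to the master before exiting; the actual UDP send is omitted here.

#include <csignal>
#include <cstdlib>

// Sequence number of the record currently being processed by user code.
static volatile std::sig_atomic_t g_current_record_seqno = -1;

void SetCurrentRecord(int seqno) { g_current_record_seqno = seqno; }

extern "C" void LastGaspHandler(int /*signum*/) {
  // Real system: send a UDP packet containing g_current_record_seqno
  // to the master. Omitted in this sketch.
  std::_Exit(1);
}

void InstallLastGaspHandler() {
  std::signal(SIGSEGV, LastGaspHandler);
}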
Other Refinements • Extensible input and output types • Local execution for debugging • Status web page • User-defined counters • Counter values returned to user code • Displayed on status web page
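As a sketch of the user-defined counter facility, here is the Appendix A word-count mapper extended to also count capitalized words. GetCounter and Counter::Increment follow the counter example in the paper, but the exact signatures and the isupper-based capitalization check are assumptions of this sketch.

// WordCounter extended with a user-defined counter (sketch).
class CountingWordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    Counter* uppercase = GetCounter("uppercase");
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      while ((i < n) && isspace(text[i])) i++;
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i) {
        string word = text.substr(start, i - start);
        if (isupper(word[0]))        // simplified "is capitalized" check
          uppercase->Increment();
        Emit(word, "1");
      }
    }
  }
};
REGISTER_MAPPER(CountingWordCounter);

Per the bullets above, the master aggregates these counts, and the final values come back in the MapReduceResult structure and appear on the status web page.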
Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion
Setup Tests run on a cluster of 1800 machines: • 4 GB of memory • Dual-processor 2 GHz Xeons with Hyperthreading • Dual 160 GB IDE disks • Gigabit Ethernet per machine • Bisection bandwidth approximately 100 Gbps
Benchmarks Two benchmarks: Grep Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records) • M = 15,000 (input split size about 64 MB) • R = 1 Sort Sort 10^10 100-byte records (modeled after the TeraSort benchmark) • M = 15,000 (input split size about 64 MB) • R = 4,000
Grep Locality optimization helps: • 1800 machines read 1 TB of data at a peak of ~31 GB/s • Without it, the rack switches would limit the aggregate read rate to about 10 GB/s Startup overhead is significant for short jobs • Total time is about 150 seconds, including roughly 1 minute of startup
Sort [figure from the paper: data transfer rates over time] • Normal execution is the baseline • With backup tasks disabled, the job takes 44% longer • With 200 tasks killed mid-run, the job takes only 5% longer
Conclusion • MapReduce has proven to be a useful abstraction • Greatly simplifies large-scale computations at Google • Fun to use: focus on the problem, let the library deal with the messy details