C-Store: MapReduce
Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
May 22, 2009
Motivation: Large-Scale Data Processing
• Many tasks: process lots of data to produce other data.
• The problems:
  • How to parallelize the computation?
  • How to distribute the data?
  • How to handle failures?
• MapReduce is a programming model for solving these problems.
Programming Model
• Input & output: each a set of key/value pairs.
• The programmer specifies two functions:
  • map(in_key, in_value) -> list(out_key, intermediate_value)
    • Processes an input key/value pair.
    • Produces a set of intermediate key/value pairs.
  • reduce(out_key, list(intermediate_value)) -> list(out_value)
    • Combines all intermediate values for a particular key.
    • Produces a set of merged output values (usually just one).
Example: Count Word Occurrences

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
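The same job can be run end to end in plain Python. The mapreduce driver below is a hypothetical single-process stand-in for the distributed runtime (the names map_fn, reduce_fn, and mapreduce are invented for this sketch); it exists only to make the map/reduce contract above concrete.

  from collections import defaultdict

  def map_fn(doc_name, doc_contents):
      # map: emit (word, "1") for every word occurrence in the document
      for word in doc_contents.split():
          yield (word, "1")

  def reduce_fn(word, counts):
      # reduce: sum all the counts emitted for this word
      yield str(sum(int(c) for c in counts))

  def mapreduce(inputs, mapper, reducer):
      # "shuffle": group intermediate values by key, then reduce each group
      groups = defaultdict(list)
      for in_key, in_value in inputs:
          for out_key, out_value in mapper(in_key, in_value):
              groups[out_key].append(out_value)
      return {k: list(reducer(k, vs)) for k, vs in sorted(groups.items())}

  print(mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn))
  # {'be': ['2'], 'not': ['1'], 'or': ['1'], 'to': ['2']}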
Implementation Environment at Google
• Large clusters of commodity PCs connected by switched Ethernet:
  • 100s to 1000s of dual-CPU x86 machines with 2-4 GB of memory each.
  • Commodity networking hardware: 100 Mbps or 1 Gbps per machine.
  • Storage on local IDE disks.
• GFS: a distributed file system that manages the data (SOSP '03).
• A job scheduling system: jobs are made up of tasks, and the scheduler assigns tasks to machines.
• The implementation is a C++ library linked into user programs.
Typical Hadoop Cluster (Hadoop is developed mainly by Yahoo!)
Distributed Execution Overview
[Figure: the user program forks a master and worker processes; the master assigns map and reduce tasks to idle workers; map workers read input splits 0, 1, 2 and write intermediate data to local disk; reduce workers perform remote reads and sorts, then write Output File 0 and Output File 1.]
Data Flow
• Input and final output are stored on a distributed file system.
• The scheduler tries to schedule map tasks "close" to the physical storage location of their input data.
• Intermediate results are stored on the local file systems of the map and reduce workers.
• The output is often the input to another MapReduce task.
Coordination
• Master data structures:
  • Task status: idle, in-progress, or completed.
  • Idle tasks get scheduled as workers become available.
• When a map task completes, it sends the master the locations and sizes of its R intermediate files, one per reducer.
• The master pushes this information to the reducers.
• The master pings workers periodically to detect failures.
Failures
• Map worker failure:
  • Map tasks completed or in-progress at the worker are reset to idle (completed map output lives on the failed worker's local disk, so it must be redone).
  • Reduce workers are notified when a task is rescheduled on another worker.
• Reduce worker failure:
  • Only in-progress tasks are reset to idle (completed reduce output is already in the distributed file system).
• Master failure:
  • The MapReduce task is aborted and the client is notified.
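The bookkeeping on these two slides fits in a small sketch. The Master class below is a toy in-memory model; the task ids, worker ids, and method names are invented for illustration, and the real master also tracks intermediate file locations and machine identities.

  IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

  class Master:
      def __init__(self, map_tasks, reduce_tasks):
          self.map_tasks = set(map_tasks)
          self.status = {t: IDLE for t in list(map_tasks) + list(reduce_tasks)}
          self.owner = {}  # task -> worker it was assigned to

      def assign(self, task, worker):
          # idle tasks get scheduled as workers become available
          self.status[task] = IN_PROGRESS
          self.owner[task] = worker

      def on_worker_failure(self, worker):
          # called when periodic pings to a worker time out
          for task, w in self.owner.items():
              if w != worker:
                  continue
              if task in self.map_tasks:
                  # completed map output lives on the dead worker's local
                  # disk, so even completed map tasks are reset to idle
                  self.status[task] = IDLE
              elif self.status[task] == IN_PROGRESS:
                  # completed reduce output is already in the DFS;
                  # only in-progress reduce tasks are redone
                  self.status[task] = IDLE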
How Many Map and Reduce Tasks?
• M map tasks, R reduce tasks.
• Rule of thumb:
  • Make M and R much larger than the number of nodes in the cluster.
  • One DFS chunk per map task is common.
  • This improves dynamic load balancing and speeds up recovery from worker failure.
• Usually R is smaller than M, because the output is spread across R files.
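As a concrete data point, the MapReduce paper reports typical runs with M = 200,000 and R = 5,000 on 2,000 worker machines, i.e., about 100 map tasks but only 2-3 reduce tasks per machine.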
Combiners
• Often a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k.
  • E.g., popular words in word count.
• Network time can be saved by pre-aggregating at the mapper:
  • combine(k1, list(v1)) -> v2
  • Usually the combiner is the same as the reduce function.
• This works only if the reduce function is commutative and associative.
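Continuing the hypothetical Python sketch above, a word-count combiner is just the reduce function applied map-side before anything crosses the network:

  def combine_fn(word, counts):
      # pre-aggregate locally; safe because integer addition is
      # commutative and associative
      yield str(sum(int(c) for c in counts))

  # A map worker that emitted ("the", "1") 1,000 times now sends the
  # single pair ("the", "1000") to the reducer instead.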
Partition Function
• Inputs to map tasks are created by contiguous splits of the input file.
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker.
• The system uses a default partition function, e.g., hash(key) mod R.
• It is sometimes useful to override the default:
  • E.g., hash(hostname(URL)) mod R ensures that URLs from the same host end up in the same output file.
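A sketch of both partition functions in the same hypothetical Python setting (crc32 stands in for the system's hash function, since Python's built-in hash is salted per process and would differ across workers):

  import zlib
  from urllib.parse import urlparse

  def default_partition(key, R):
      # system default: hash(key) mod R
      return zlib.crc32(key.encode()) % R

  def url_host_partition(url, R):
      # override: hash(hostname(URL)) mod R, so every URL from one host
      # lands in the same reduce partition, i.e., the same output file
      return default_partition(urlparse(url).hostname, R)

  assert url_host_partition("http://a.com/x", 10) == url_host_partition("http://a.com/y", 10)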
Implementations
• Google's implementation:
  • Not available outside Google.
• Hadoop:
  • An open-source implementation in Java.
  • Uses HDFS for stable storage.
  • Download: http://lucene.apache.org/hadoop/
References
• Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. http://labs.google.com/papers/mapreduce.html
• The OSDI 2004 HTML slides. http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
• Matei Zaharia. Introduction to MapReduce and Hadoop. www.cs.berkeley.edu/~demmel/cs267_Spr09/Lectures/Cloud_MapReduce_Zaharia.ppt
• Stanford CS345A slides on MapReduce. http://www.stanford.edu/class/cs345a/mapreduce.ppt