MapReduce, the Big Data Workhorse Vyassa Baratham, Stony Brook University April 20, 2013, 1:05-2:05pm cSplash 2013
Distributed Computing • Use several computers to process large amounts of data • Often significant distribution overhead (if math helps, see the formula below) • How do you deal with dependencies between data elements? • e.g., counting word occurrences: what if occurrences of the same word get sent to two different computers?
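The formula this slide pointed to did not survive conversion; a plausible reconstruction of the trade-off it describes (the symbols are mine, not the slide's): with N machines,

    time_distributed ≈ time_serial / N + time_overhead

so adding machines only helps while the time saved by dividing the work outweighs the coordination overhead added by distributing it.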
History of MapReduce • Developed at Google 1999-2000, published by Google in 2004 • Used to build and maintain Google's WWW index • Open-source implementation by the Apache Software Foundation: Hadoop • "Spinoffs", e.g., HBase (used by Facebook) • Amazon's Elastic MapReduce (EMR) service • Uses the Hadoop implementation of MapReduce • Various wrapper libraries, e.g., mrjob
MapReduce, Conceptually • Split data for distributed processing • But some data may depend on other data being processed together to get correct results • MapReduce maps which data need to be processed together • Then reduces (processes) the data
The MapReduce Framework • Input is split into different chunks [Diagram: Inputs 1-9]
The MapReduce Framework • Each chunk is sent to one of several computers running the same map() function [Diagram: Inputs 1-9 distributed across Mappers 1-3]
The MapReduce Framework • Each map() function outputs several (key, value) pairs [Diagram: each mapper emits pairs such as (k1, v1), (k3, v2), (k2, v4)]
The MapReduce Framework • The map() outputs are collected and sorted by key [Diagram: a master node collects all pairs and sorts them into k1, k2, and k3 groups]
The MapReduce Framework • Several computers running the same reduce() function receive the (key, value) pairs [Diagram: sorted pairs routed from the master node to Reducers 1-3]
The MapReduce Framework • All the records for a given key will be sent to the same reducer; this is why we sort [Diagram: each key group (k1, k2, k3) converges on a single reducer]
The MapReduce Framework • Each reducer outputs a final value (maybe with a key) [Diagram: Reducers 1-3 emit Outputs 1-3]
The MapReduce Framework • The reducer outputs are aggregated and become the final output [Diagram: the full pipeline from Inputs 1-9 through mappers, the master-node sort, and reducers to Outputs 1-3]
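The whole pipeline above fits in a few lines of Python. This is a single-process toy for intuition, not Hadoop; map_reduce, map_fn, and reduce_fn are names invented for this sketch:

from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: every input chunk yields (key, value) pairs
    pairs = []
    for chunk in inputs:
        pairs.extend(map_fn(chunk))
    # Shuffle phase: sorting by key puts all records for a key next to
    # each other, which is what lets each key go to exactly one reducer
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per key, over all of that key's values
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reduce_fn(key, [value for _, value in group]))
    return results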
Example – Word Count • Problem: given a large body of text, count how many times each word occurs • How can we parallelize? • Mapper key = word • Mapper value = # occurrences in this mapper's input • Reducer key = word • Reducer value = sum of # occurrences over all mappers
Example – Word Count

def map(input):
    # input: an iterable of words from this mapper's chunk;
    # count occurrences locally, then emit one (word, count) pair per word
    counts = {}
    for word in input:
        counts[word] = counts.get(word, 0) + 1
    for word in counts:
        yield (word, counts[word])
Example – Word Count

def reduce(key, values):
    # key: a word; values: its per-mapper counts
    total = 0
    for val in values:
        total += val
    yield (key, total)
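Plugging both functions into the toy map_reduce() driver sketched earlier shows the flow end to end (the sample chunks are invented):

def map_fn(chunk):
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    for word in counts:
        yield (word, counts[word])

def reduce_fn(key, values):
    yield (key, sum(values))

chunks = ["the cat sat", "the dog sat", "the cat ran"]
print(map_reduce(chunks, map_fn, reduce_fn))
# [('cat', 2), ('dog', 1), ('ran', 1), ('sat', 2), ('the', 3)]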
Now Let’s Do It • I need 3 volunteer slave nodes • I’ll be the master node
Considerations • Hadoop takes care of distribution, but only as efficiently as you allow • Input must be split evenly • Values should be spread evenly over keys • If not, the reduce() step will not be well distributed: imagine all values getting mapped to the same key - then the reduce() step is not parallelized at all! (see the sketch after this list) • Several keys should be used • If you have few keys, then few computers can be used as reducers • By the same token, more/smaller input chunks are good • You need to know the data you're processing!
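A toy illustration of the skew problem (the three-reducer setup and the routing function are inventions for this sketch, though Hadoop's default partitioner hashes keys similarly):

from collections import Counter

def reducer_for(key, num_reducers=3):
    # toy deterministic partitioner: route each key to one reducer
    return sum(ord(ch) for ch in key) % num_reducers

skewed = [("the", 1)] * 999                    # every record shares one key
spread = [("a", 1), ("b", 1), ("c", 1)] * 333  # records spread over three keys

print(Counter(reducer_for(k) for k, v in skewed))  # Counter({0: 999}): one reducer does all the work
print(Counter(reducer_for(k) for k, v in spread))  # all three reducers get 333 records each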
Practical Hadoop Concerns • I/O is often the bottleneck, so use compression! • But beware: some compression formats are not splittable • With those, entire (large!) input files will each be sent to a single mapper, destroying hopes of distribution • Consider using a combiner (a "pre-reducer"; see the sketch below) • EMR considerations: • Input from S3 is fast • Nodes are virtual machines
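A combiner runs the reducer's logic on each mapper's local output before anything crosses the network, shrinking what the shuffle must move. A minimal sketch for word count (the name combine is mine):

def combine(mapper_output):
    # same aggregation as the reducer, applied to one mapper's local pairs
    local = {}
    for word, count in mapper_output:
        local[word] = local.get(word, 0) + count
    for word, total in local.items():
        yield (word, total)

This works for word count because addition is associative and commutative; not every reduce() can safely double as a combiner.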
Hadoop Streaming • Hadoop in its original form uses Java • Hadoop Streaming lets programmers avoid direct interaction with Java: mappers and reducers instead read from Unix STDIN and write to STDOUT • Requires serialization of keys and values • Potential problem: pairs are serialized as "<key>\t<value>", but what if a serialized key or value itself contains a "\t"? • Beware of stray "print" statements • Safer to print debug output to STDERR (see the sketch after the diagram below)
Hadoop Streaming [Diagram: your process receives serialized input from Java Hadoop on STDIN and returns serialized output on STDOUT]
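What the streaming pair might look like for word count, as a minimal sketch (the file names mapper.py and reducer.py are illustrative; Hadoop Streaming sorts the mapper output by key before the reducer sees it):

# mapper.py: reads raw text on STDIN, emits "<word>\t1" lines on STDOUT
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%s" % (word, 1))
sys.stderr.write("debug output belongs on STDERR, never STDOUT\n")

# reducer.py: lines arrive sorted by key, so each word's records are consecutive
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, total))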
Thank You! • Thanks for your attention • Please provide feedback, comments, questions, etc: vyassa.baratham@stonybrook.edu • Interested in physics? Want to learn about Monte Carlo Simulation?