
MapReduce, the Big Data Workhorse


Presentation Transcript


  1. MapReduce, the Big Data Workhorse Vyassa Baratham, Stony Brook University April 20, 2013, 1:05-2:05pm cSplash 2013

  2. Distributed Computing • Use several computers to process large amounts of data • Often significant distribution overhead • If math helps: • How do you deal with dependencies between data elements? • e.g. counting word occurrences: what if a word gets sent to two computers?
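
The "If math helps" bullet has no formula in this transcript; a plausible reconstruction (my assumption, not recovered from the slide) is the usual run-time estimate that makes the "distribution overhead" point concrete. With N machines,

    T_\mathrm{distributed} \;\approx\; \frac{T_\mathrm{serial}}{N} + T_\mathrm{overhead},
    \qquad
    \mathrm{speedup} \;=\; \frac{T_\mathrm{serial}}{T_\mathrm{distributed}} \;<\; N

so the overhead term is what keeps the speedup below the ideal factor of N.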

  3. History of MapReduce • Developed at Google in the early 2000s, published by Google in 2004 • Used to make/maintain Google's WWW index • Open-source implementation by the Apache Software Foundation: Hadoop • "Spinoffs", e.g. HBase (used by Facebook) • Amazon's Elastic MapReduce (EMR) service • Uses the Hadoop implementation of MapReduce • Various wrapper libraries, e.g. mrjob

  4. MapReduce, Conceptually • Split data for distributed processing • But some data may depend on other data to be processed correctly • MapReduce maps which data need to be processed together • Then reduces (processes) the data

  5. The MapReduce Framework • Input is split into different chunks [diagram: Input 1 through Input 9]

  6. The MapReduce Framework • Each chunk is sent to one of several computers running the same map() function [diagram: Inputs 1-9 assigned across Mappers 1-3]

  7. The MapReduce Framework • Each map() function outputs several (key, value) pairs [diagram: each mapper emits pairs such as (k1, v1), (k3, v2), ...]

  8. The MapReduce Framework • The map() outputs are collected and sorted by key [diagram: Master Node: collect & sort gathers all pairs and orders them by key]

  9. The MapReduce Framework • Several computers running the same reduce() function receive the (key, value) pairs [diagram: sorted pairs routed to Reducers 1-3]

  10. The MapReduce Framework • All the records for a given key will be sent to the same reducer; this is why we sort [same diagram as the previous slide]

  11. The MapReduce Framework • Each reducer outputs a final value (maybe with a key) [diagram: Reducers 1-3 produce Outputs 1-3]

  12. The MapReduce Framework • The reducer outputs are aggregated and become the final output [same diagram as the previous slide]
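
The flow in slides 5 through 12 fits in a few lines of single-machine Python. This is only a conceptual sketch (run_mapreduce and its argument names are my own invention, not Hadoop code); a real framework runs the map and reduce phases on separate machines and moves the (key, value) pairs over the network:

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: every chunk goes through the same map function,
        # which yields (key, value) pairs (slides 6-7).
        pairs = []
        for chunk in inputs:
            pairs.extend(map_fn(chunk))

        # Collect & sort: group values by key, standing in for the master
        # node's shuffle/sort so each key forms exactly one group (slides 8-10).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)

        # Reduce phase: one reduce call per key; the reduce outputs are
        # concatenated into the final output (slides 11-12).
        output = []
        for key in sorted(groups):
            output.extend(reduce_fn(key, groups[key]))
        return output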

  13. Example – Word Count • Problem: given a large body of text, count how many times each word occurs • How can we parallelize? • Mapper key = word • Mapper value = # occurrences in this mapper's input • Reducer key = word • Reducer value = sum of # occurrences over all mappers

  14. Example – Word Count

      def map(input):
          counts = {}                               # word -> count within this mapper's chunk
          for word in input:
              counts[word] = counts.get(word, 0) + 1
          for word in counts:
              yield (word, counts[word])

  15. Example – Word Count

      def reduce(key, values):
          total = 0                                 # running sum of this word's counts from all mappers
          for val in values:
              total += val
          yield (key, total)
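
Wiring the map() and reduce() above into the run_mapreduce() sketch after slide 12 gives a quick local check. The three input chunks are invented for illustration, and the slide's function names shadow Python's built-in map/reduce, which is harmless in this small demo:

    chunks = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat"],
    ]
    print(run_mapreduce(chunks, map, reduce))
    # [('cat', 2), ('dog', 1), ('sat', 2), ('the', 3)]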

  16. Now Let’s Do It • I need 3 volunteer slave nodes • I’ll be the master node

  17. Considerations • Hadoop takes care of distribution, but only as efficiently as you allow • Input must be split evenly • Values should be spread evenly over keys • If not, the reduce() step will not be well distributed: imagine all values getting mapped to the same key; then the reduce() step is not parallelized at all! • Several keys should be used • If you have few keys, then only a few computers can be used as reducers • By the same token, more/smaller input chunks are good • You need to know the data you're processing!
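
One way to picture the "few keys means few reducers" point: the framework assigns a key to a reducer by hashing it, so every record with a given key lands on the same reducer, and the number of distinct keys caps how many reducers can do useful work. A rough sketch of that assignment (my own illustration; Hadoop's default HashPartitioner does the analogous thing with Java's hashCode()):

    NUM_REDUCERS = 3   # hypothetical cluster size for the example

    def partition(key):
        # Within one run, the same key always hashes to the same reducer index.
        return hash(key) % NUM_REDUCERS

    for key in ["the", "cat", "sat", "the", "the"]:
        print(key, "-> reducer", partition(key))
    # Every "the" goes to the same reducer; if all records shared one key,
    # only one reducer would have any work at all.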

  18. Practical Hadoop Concerns • I/O is often the bottleneck, so use compression! • But some compression formats are not splittable: an entire (large!) input file then goes to a single mapper, destroying hopes of distribution • Consider using a combiner ("pre-reducer") • EMR considerations: • Input from S3 is fast • Nodes are virtual machines
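
To make the combiner bullet concrete: a combiner applies reduce-style aggregation to a mapper's own output before it crosses the network, so for word count the mapper ships one (word, partial_sum) record per distinct word instead of one record per occurrence (for word count, the reducer's logic itself can serve as the combiner). A sketch with made-up data:

    from collections import Counter

    # One mapper's raw output: one (word, 1) record per occurrence.
    mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]

    # Combiner: sum locally, before the shuffle.
    combined = Counter()
    for word, count in mapper_output:
        combined[word] += count

    print(list(combined.items()))   # [('the', 3), ('cat', 1)] - fewer records to ship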

  19. Hadoop Streaming • Hadoop in its original form uses Java • Hadoop Streaming lets programmers avoid direct interaction with Java by using Unix STDIN/STDOUT instead • Requires serialization of keys and values • Potential problem: records are written as "<key>\t<value>", but what if a serialized key or value itself contains a "\t"? • Beware of stray "print" statements • Safer to print debugging output to STDERR
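
As a concrete (hypothetical) example of the Streaming contract for the word-count job: the mapper reads raw text lines from STDIN and writes "<key>\t<value>" records to STDOUT, the framework sorts those records by key, and the reducer reads the sorted records from STDIN. The file names mapper.py and reducer.py are my choices:

    #!/usr/bin/env python
    # mapper.py - emits one "word<TAB>1" record per occurrence.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

Because the records arrive at the reducer sorted by key, all counts for a word are contiguous and can be totaled in a single pass; debugging output goes to STDERR so it is not mistaken for a record:

    #!/usr/bin/env python
    # reducer.py - input lines are "word<TAB>count", sorted by word.
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word and current_word is not None:
            print("%s\t%d" % (current_word, total))   # flush the previous word
            total = 0
        current_word = word
        total += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, total))       # flush the last word

The job would then be launched with something like hadoop jar .../hadoop-streaming*.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path depends on the installation).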

  20. Hadoop Streaming [diagram: the Java Hadoop framework sends serialized input to the streaming script over STDIN and reads serialized output back from STDOUT]

  21. Thank You! • Thanks for your attention • Please provide feedback, comments, questions, etc: vyassa.baratham@stonybrook.edu • Interested in physics? Want to learn about Monte Carlo Simulation?
