
Tutorial for MapReduce (Hadoop) & Large Scale Processing

Tutorial for MapReduce (Hadoop) & Large Scale Processing. Le Zhao (LTI, SCS, CMU) Database Seminar & Large Scale Seminar 2010-Feb-15 Some slides adapted from IR course lectures by Jamie Callan. Outline. Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking


Presentation Transcript


  1. Tutorial for MapReduce (Hadoop) & Large Scale Processing Le Zhao (LTI, SCS, CMU) Database Seminar & Large Scale Seminar 2010-Feb-15 Some slides adapted from IR course lectures by Jamie Callan © 2010, Le Zhao

  2. Outline
     • Why MapReduce (Hadoop)
     • MapReduce basics
     • The MapReduce way of thinking
     • Manipulating large data

  3. Outline
     • Why MapReduce (Hadoop)
        • Why go large scale
        • Compared to other parallel computing models
        • Hadoop related tools
     • MapReduce basics
     • The MapReduce way of thinking
     • Manipulating large data

  4. Why NOT to do parallel computing
     • Concerns: a parallel system needs to provide:
        • Data distribution
        • Computation distribution
        • Fault tolerance
        • Job scheduling

  5. Why MapReduce (Hadoop)
     • Previous parallel computation models
        • 1) scp + ssh
           • Everything is manual
        • 2) Network cross-mounted disks + condor/torque
           • No data distribution; disk access is the bottleneck
           • Can only partition totally distributive computation
           • No fault tolerance
           • Prioritized job scheduling

  6. Hadoop
     • Parallel batch computation
     • Data distribution
        • Hadoop Distributed File System (HDFS)
        • Like a Linux file system, but with automatic data replication
     • Computation distribution
        • Automatic; the user only needs to specify #input_splits
        • Can distribute aggregation computations as well
     • Fault tolerance
        • Automatic recovery from failure
        • Speculative execution (a backup task)
     • Job scheduling
        • OK, but still relies on the politeness of users

  7. How you can use Hadoop
     • Hadoop Streaming
        • Quick hacking, much like shell scripting
        • Uses STDIN & STDOUT to carry data
        • cat file | mapper | sort | reducer > output
        • Makes it easy to reuse legacy code; works with any programming language
     • Hadoop Java API
        • Build large systems
        • More data types
        • More control over Hadoop’s behavior
        • Easier debugging with Java’s error stack trace display
        • NetBeans plugin for Hadoop provides easy programming
           • http://hadoopstudio.org/docs.html
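
The `cat file | mapper | sort | reducer` pipeline from slide 7 can be imitated inside a single Python process. The sketch below is only an illustration: word count is an assumed task (the slide names none), and `mapper`, `reducer`, and `run_pipeline` are hypothetical names. It follows the Hadoop Streaming convention of one tab-separated key/value pair per line.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print to STDOUT."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of each run of identical keys; input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_pipeline(lines):
    """In-process stand-in for: cat file | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))

# Made-up sample input, not taken from the slides.
output = run_pipeline(["long ago and far away", "long long ago"])
```

In a real streaming job the two functions would live in separate scripts that read `sys.stdin` and print to `sys.stdout`; the `sorted` call stands in for the shuffle and sort that Hadoop performs between them.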

  8. Outline
     • Why MapReduce (Hadoop)
     • MapReduce basics
     • The MapReduce way of thinking
     • Manipulating large data

  9. Map and Reduce
     MapReduce is a new use of an old idea in Computer Science
     • Map: Apply a function to every object in a list
        • Each object is independent
        • Order is unimportant
        • Maps can be done in parallel
        • The function produces a result
     • Reduce: Combine the results to produce a final result
     You may have seen this in a Lisp or functional programming course
     © 2009, Jamie Callan

  10. MapReduce
     • Input reader
        • Divide input into splits, assign each split to a Map processor
     • Map
        • Apply the Map function to each record in the split
        • Each Map function returns a list of (key, value) pairs
     • Shuffle/Partition and Sort
        • Shuffle distributes sorting & aggregation to many reducers
        • All records for key k are directed to the same reduce processor
        • Sort groups the same keys together, and prepares for aggregation
     • Reduce
        • Apply the Reduce function to each key
        • The result of the Reduce function is a list of (key, value) pairs
     © 2010, Jamie Callan
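
The four stages on slide 10 can be sketched as a tiny in-memory engine. This is a toy model under stated assumptions, not how Hadoop is implemented: `run_mapreduce` and its `num_reducers` parameter are hypothetical, and partitions are plain dictionaries instead of files on separate machines.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    """Toy MapReduce: map every record, partition by key, sort keys, reduce each key."""
    # Map: each record yields a list of (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle/Partition: all pairs with the same key land in the same partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        partitions[hash(key) % num_reducers][key].append(value)

    # Sort + Reduce: each "reducer" visits its keys in sorted order.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reduce_fn(key, partition[key]))
    return output

# Word count as an assumed example task.
counts = run_mapreduce(
    ["a b a", "b c"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: [(key, sum(values))],
)
```

The order of `counts` depends on which partition each key hashes to, so callers that need a global order should sort the result.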

  11. MapReduce in One Picture
     [Figure from Tom White, Hadoop: The Definitive Guide]

  12. Outline
     • Why MapReduce (Hadoop)
     • MapReduce basics
     • The MapReduce way of thinking
        • Two simple use cases
        • Two more advanced & useful MapReduce tricks
        • Two MapReduce applications
     • Manipulating large data

  13. MapReduce Use Case (1) – Map Only
     Data distributive tasks – Map Only
     • E.g. classify individual documents
     • Map does everything
        • Input: (docno, doc_content), …
        • Output: (docno, [class, class, …]), …
     • No reduce

  14. MapReduce Use Case (2) – Filtering and Accumulation
     Filtering & Accumulation – Map and Reduce
     • E.g. Counting total enrollments of two given classes
     • Map selects records and outputs initial counts
        • In: (Jamie, 11741), (Tom, 11493), …
        • Out: (11741, 1), (11493, 1), …
     • Shuffle/Partition by class_id
     • Sort
        • In: (11741, 1), (11493, 1), (11741, 1), …
        • Out: (11493, 1), …, (11741, 1), (11741, 1), …
     • Reduce accumulates counts
        • In: (11493, [1, 1, …]), (11741, [1, 1, …])
        • Sum and Output: (11493, 16), (11741, 35)
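
The filter-then-count flow of slide 14 can be sketched as follows. The students "Ann" and "Bob" and class 15213 are made-up sample data, and the function names are hypothetical; only the (Jamie, 11741) and (Tom, 11493) records come from the slide.

```python
from itertools import groupby
from operator import itemgetter

TARGET_CLASSES = {11741, 11493}  # the two classes being counted

def map_enrollment(record):
    """Filter: emit (class_id, 1) only for records of the target classes."""
    _student, class_id = record
    if class_id in TARGET_CLASSES:
        yield (class_id, 1)

def reduce_counts(class_id, counts):
    """Accumulate: sum the initial counts for one class."""
    return (class_id, sum(counts))

enrollments = [("Jamie", 11741), ("Tom", 11493), ("Ann", 11741), ("Bob", 15213)]

mapped = [pair for record in enrollments for pair in map_enrollment(record)]
shuffled = sorted(mapped, key=itemgetter(0))  # stands in for shuffle + sort by class_id
totals = [
    reduce_counts(class_id, [count for _, count in group])
    for class_id, group in groupby(shuffled, key=itemgetter(0))
]
```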

  15. MapReduce Use Case (3) – Database Join
     Problem: Massive lookups
     • Given two large lists: (URL, ID) and (URL, doc_content) pairs
     • Produce (ID, doc_content)
     Solution: Database join
     • Input stream: both (URL, ID) and (URL, doc_content) lists
        • (http://del.icio.us/post, 0), (http://digg.com/submit, 1), …
        • (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …
     • Map simply passes input along
     • Shuffle and Sort on URL (group ID & doc_content for the same URL together)
        • Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …
     • Reduce outputs result stream of (ID, doc_content) pairs
        • In: (http://del.icio.us/post, [0, <html0>]), (http://digg.com/submit, [<html1>, 1]), …
        • Out: (0, <html0>), (1, <html1>), …
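
A reduce-side join along the lines of slide 15 might look like the sketch below. One detail is an assumption not stated on the slide: because values for a URL can arrive in either order, the map step here tags each value with its source list instead of passing it along unchanged. All function names are hypothetical.

```python
from collections import defaultdict

url_to_id = [("http://del.icio.us/post", 0), ("http://digg.com/submit", 1)]
url_to_doc = [("http://del.icio.us/post", "<html0>"), ("http://digg.com/submit", "<html1>")]

def map_tagged(records, tag):
    """Pass records along, tagging each value with the list it came from (assumed step)."""
    for url, value in records:
        yield (url, (tag, value))

def reduce_join(url, tagged_values):
    """Pair every ID with every doc_content seen for the same URL."""
    ids = [value for tag, value in tagged_values if tag == "id"]
    docs = [value for tag, value in tagged_values if tag == "doc"]
    for doc_id in ids:
        for doc in docs:
            yield (doc_id, doc)

# Shuffle: group the tagged values of both lists by URL.
groups = defaultdict(list)
for url, tagged in list(map_tagged(url_to_doc, "doc")) + list(map_tagged(url_to_id, "id")):
    groups[url].append(tagged)

joined = sorted(pair for url in sorted(groups) for pair in reduce_join(url, groups[url]))
```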

  16. MapReduce Use Case (4) – Secondary Sort
     [Figure: a 3-node graph and the same graph with its edges reversed]
     Problem: Sorting on values
     • E.g. Reverse graph edge directions & output in node order
     • Input: adjacency list of graph (3 nodes and 4 edges)
        • (3, [1, 2]), (1, [2, 3]) → (1, [3]), (2, [1, 3]), (3, [1])
     • Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys!
     Solution: Secondary sort
     • Map
        • In: (3, [1, 2]), (1, [2, 3])
        • Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction)
        • Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
        • Copy node_ids from value to key

  17. MapReduce Use Case (4) – Secondary Sort
     Secondary Sort (ctd.)
     • Shuffle on Key.field1, and Sort on whole Key (both fields)
        • In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])
        • Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])
     • Grouping comparator
        • Merge according to part of the key
        • Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1]); this will be the reducer’s input
     • Reduce
        • Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
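
The secondary sort of slides 16 and 17 can be traced in a few lines of Python. This is a single-process sketch: sorting on the whole composite key plays the role of Hadoop's sort, and `groupby` on the first key field plays the role of the grouping comparator. The name `map_reverse` is hypothetical, and values are kept as plain node ids instead of one-element lists.

```python
from itertools import groupby

def map_reverse(node, out_links):
    """Reverse each edge and copy the source node into the key: <target, source>."""
    for target in out_links:
        yield ((target, node), node)

adjacency = [(3, [1, 2]), (1, [2, 3])]  # input from slide 16

mapped = [pair for node, outs in adjacency for pair in map_reverse(node, outs)]

# Sort on the whole composite key (both fields).
sorted_pairs = sorted(mapped, key=lambda kv: kv[0])

# Grouping comparator: merge records that share the first key field only.
reversed_graph = [
    (target, [source for _, source in group])
    for target, group in groupby(sorted_pairs, key=lambda kv: kv[0][0])
]
```

Because the values already arrive in key order, the reducer never has to sort them itself.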

  18. Using MapReduce to Construct Indexes: Preliminaries
     Construction of binary inverted lists
     • Input: documents: (docid, [term, term, ..]), (docid, [term, ..]), ..
     • Output: (term, [docid, docid, …])
        • E.g., (apple, [1, 23, 49, 127, …])
     • Binary inverted lists fit on a slide more easily
     • Everything also applies to frequency and positional inverted lists
     A document id is an internal document id, e.g., a unique integer
     • Not an external document id such as a URL
     MapReduce elements
     • Combiner, Secondary Sort, complex keys, Sorting on keys’ fields
     © 2010, Jamie Callan

  19. Using MapReduce to Construct Indexes: A Simple Approach
     A simple approach to creating binary inverted lists
     • Each Map task is a document parser
        • Input: A stream of documents
        • Output: A stream of (term, docid) tuples
           • (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
     • Shuffle sorts tuples by key and routes tuples to Reducers
     • Reducers convert streams of keys into streams of inverted lists
        • Input: (long, 1) (long, 127) (long, 49) (long, 23) …
        • The reducer sorts the values for a key and builds an inverted list
           • Longest inverted list must fit in memory
        • Output: (long, [df:492, docids:1, 23, 49, 127, …])
     © 2010, Jamie Callan

  20. Using MapReduce to Construct Indexes: A Simple Approach
     A more succinct representation of the previous algorithm
     • Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
     • Shuffle by t
     • Sort by t
        • (t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
     • Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
     docid: a unique integer
     t: a term, e.g., “apple”
     ilist: a complete inverted list
     Drawbacks: a) it is inefficient, b) docids are sorted in the reducers, and c) it assumes the ilist of a word fits in memory
     © 2010, Jamie Callan
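
The simple indexing approach of slides 19 and 20 can be sketched as below. The three documents are made-up sample data, `map_parse` and `reduce_build` are hypothetical names, and whitespace splitting is an assumed stand-in for a real document parser.

```python
from collections import defaultdict

def map_parse(docid, content):
    """Document parser: emit one (term, docid) tuple per distinct term (binary lists)."""
    for term in set(content.split()):
        yield (term, docid)

def reduce_build(term, docids):
    """Sort the docids in memory and build the inverted list for one term."""
    postings = sorted(docids)  # this in-memory sort is drawback (b)/(c) of the slide
    return (term, {"df": len(postings), "docids": postings})

docs = [(3, "a long time"), (1, "long ago and long before"), (2, "once upon a time")]

# Shuffle: route every tuple for a term to the same place.
grouped = defaultdict(list)
for docid, content in docs:
    for term, emitted_docid in map_parse(docid, content):
        grouped[term].append(emitted_docid)

index = dict(reduce_build(term, docids) for term, docids in grouped.items())
```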

  21. Using MapReduce to Construct Indexes: Using Combine
     • Map: (docid1, content1) → (t1, ilist_{1,1}) (t2, ilist_{2,1}) (t3, ilist_{3,1}) …
        • Each output inverted list covers just one document
     • Combine
        • Sort by t
        • Combine: (t1, [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → (t1, ilist_{1,27})
        • Each output inverted list covers a sequence of documents
     • Shuffle by t
     • Sort by t
        • (t4, ilist_{4,1}) (t5, ilist_{5,3}) … → (t4, ilist_{4,2}) (t4, ilist_{4,4}) (t4, ilist_{4,1}) …
     • Reduce: (t7, [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → (t7, ilist_final)
     ilist_{i,j}: the j-th inverted list fragment for term i
     © 2010, Jamie Callan
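
The effect of the combiner on slide 21 can be sketched as follows: each map task merges its one-document fragments into longer fragments before the shuffle, so fewer records cross the network. The function names and the sample map outputs are hypothetical.

```python
import heapq
from collections import defaultdict

def combine(map_output):
    """Merge one map task's single-document fragments into one sorted fragment per term."""
    merged = defaultdict(list)
    for term, fragment in map_output:
        merged[term].extend(fragment)
    return [(term, sorted(docids)) for term, docids in merged.items()]

def reduce_merge(term, fragments):
    """Merge the already-sorted fragments of a term into the final inverted list."""
    return (term, list(heapq.merge(*fragments)))

# Output of two map tasks; each inverted list covers just one document.
task_outputs = [
    [("long", [1]), ("ago", [1]), ("long", [2])],
    [("long", [3]), ("ago", [4])],
]

# Combine per task, then shuffle the combined fragments by term.
shuffled = defaultdict(list)
for map_output in task_outputs:
    for term, fragment in combine(map_output):
        shuffled[term].append(fragment)

index = dict(reduce_merge(term, fragments) for term, fragments in shuffled.items())
```

Here five map records shrink to four shuffled fragments; on real collections the reduction is much larger, since a term usually occurs in many documents of the same split.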

  22. Using MapReduce to Construct Indexes
     [Figure: Documents → Parser / Indexer processors (Map/Combine) → inverted list fragments (Shuffle/Sort) → Merger processors for term ranges A-F, G-P, Q-Z (Reduce) → inverted lists]
     © 2010, Jamie Callan

  23. Using MapReduce to Construct Partitioned Indexes
     • Map: (docid1, content1) → ([p, t1], ilist_{1,1})
     • Combine to sort and group values
        • ([p, t1], [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → ([p, t1], ilist_{1,27})
     • Shuffle by p
     • Sort values by [p, t]
     • Reduce: ([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
     p: partition (shard) id
     © 2010, Jamie Callan

  24. Using MapReduce to Construct Indexes: Secondary Sort
     So far, we have assumed that Reduce can sort values in memory
     …but what if there are too many to fit in memory?
     • Map: (docid1, content1) → ([t1, fd_{1,1}], ilist_{1,1})
     • Combine to sort and group values
     • Shuffle by t
     • Sort by [t, fd], then Group by t (Secondary Sort)
        • ([t7, fd_{7,2}], ilist_{7,2}), ([t7, fd_{7,1}], ilist_{7,1}) … → (t7, [ilist_{7,1}, ilist_{7,2}, …])
     • Reduce: (t7, [ilist_{7,1}, ilist_{7,2}, …]) → (t7, ilist_final)
        • Values arrive in order, so Reduce can stream its output
     fd_{i,j} is the first docid in ilist_{i,j}
     © 2010, Jamie Callan
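
The streaming reducer of slide 24 can be sketched as below. It assumes, beyond what the slide states, that the fragments of a term cover non-overlapping docid ranges, so ordering them by first docid and concatenating yields a sorted list. The fragments are made-up sample data and `reduce_stream` is a hypothetical name.

```python
from itertools import groupby

# Combined fragments keyed by [t, fd], where fd is the first docid of the fragment.
fragments = [
    (("long", 49), [49, 127]),
    (("ago", 1), [1]),
    (("long", 1), [1, 23]),
]

def reduce_stream(term, ordered_fragments):
    """Emit docids fragment by fragment; nothing is buffered or sorted in the reducer."""
    for fragment in ordered_fragments:
        yield from fragment

# Sort by the whole key [t, fd], then group by t only (secondary sort).
ordered = sorted(fragments, key=lambda kv: kv[0])
index = {
    term: list(reduce_stream(term, (fragment for _, fragment in group)))
    for term, group in groupby(ordered, key=lambda kv: kv[0][0])
}
```

The `list(...)` call is only there to make the result easy to inspect; a real reducer would write each docid out as it arrives.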

  25. Using MapReduce to Construct Indexes: Putting it All Together
     • Map: (docid1, content1) → ([p, t1, fd_{1,1}], ilist_{1,1})
     • Combine to sort and group values
        • ([p, t1, fd_{1,1}], [ilist_{1,2} ilist_{1,3} ilist_{1,1} …]) → ([p, t1, fd_{1,27}], ilist_{1,27})
     • Shuffle by p
     • Secondary Sort by [(p, t), fd]
        • ([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …])
     • Reduce: ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
     © 2010, Jamie Callan
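
Slide 25 combines a partition id, a term, and a first docid in one key. The sketch below shows one way the three roles could be separated: partition on p, sort on the whole key, group on (p, t). How p is assigned is not stated in the slides, so the shard ids, the `partition` function, and `NUM_SHARDS` are all assumptions for illustration.

```python
from itertools import groupby

NUM_SHARDS = 2  # assumed number of index partitions (one reducer per shard)

# Combined fragments keyed by [p, t, fd]; sample data, not from the slides.
fragments = [
    ((0, "long", 49), [49, 127]),
    ((1, "upon", 2), [2]),
    ((0, "ago", 1), [1]),
    ((0, "long", 1), [1, 23]),
]

def partition(key):
    """Shuffle by p: route a record to a reducer using only the shard id."""
    shard, _term, _first_docid = key
    return shard % NUM_SHARDS

reducers = [[] for _ in range(NUM_SHARDS)]
for key, ilist in fragments:
    reducers[partition(key)].append((key, ilist))

shards = []
for reducer_input in reducers:
    reducer_input.sort(key=lambda kv: kv[0])  # sort by [(p, t), fd]
    shard_index = {}
    # Group by (p, t); fragments of a term then arrive ordered by first docid.
    for (_shard, term), group in groupby(reducer_input, key=lambda kv: kv[0][:2]):
        shard_index[term] = [docid for _, ilist in group for docid in ilist]
    shards.append(shard_index)
```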

  26. Using MapReduce to Construct Indexes
     [Figure: Documents → Parser / Indexer processors (Map/Combine) → inverted list fragments (Shuffle/Sort) → one Merger processor per shard (Reduce) → inverted lists]
     © 2010, Jamie Callan

  27. PageRank Calculation: Preliminaries
     One PageRank iteration:
     • Input:
        • (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
     • Output:
        • (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
     MapReduce elements
     • Score distribution and accumulation
     • Database join
     • Side-effect files
     © 2010, Jamie Callan

  28. PageRank: Score Distribution and Accumulation
     • Map
        • In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..
        • Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..
     • Shuffle & Sort by node_id
        • In: (id2, score1), (id1, score2), (id1, score1), ..
        • Out: (id1, score1), (id1, score2), .., (id2, score1), ..
     • Reduce
        • In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..
        • Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
     © 2010, Jamie Callan
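
One round of the score distribution and accumulation on slide 28 can be sketched as below. The three-node graph and its scores are made-up sample data, the function names are hypothetical, and, as on the slide, no damping factor is applied in this step.

```python
from collections import defaultdict

def map_distribute(node_id, score, out_links):
    """Split a node's score evenly over its n outlinks."""
    share = score / len(out_links)
    for target in out_links:
        yield (target, share)

def reduce_accumulate(node_id, shares):
    """Sum the shares a node received from its inlinks."""
    return (node_id, sum(shares))

# node_id -> (score at iteration t, outlinks); every node here has at least one outlink.
graph = {1: (0.5, [2, 3]), 2: (0.25, [3]), 3: (0.25, [1])}

# Shuffle: collect the shares by receiving node.
incoming = defaultdict(list)
for node_id, (score, out_links) in graph.items():
    for target, share in map_distribute(node_id, score, out_links):
        incoming[target].append(share)

new_scores = dict(reduce_accumulate(n, shares) for n, shares in sorted(incoming.items()))
```

Note that the reducer output no longer carries the outlinks, which is why the next step joins the new scores back to the graph structure.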

  29. PageRank: Database Join to associate outlinks with score
     • Map
        • In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]) ..
     • Shuffle & Sort by node_id
        • Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), ..
     • Reduce
        • In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..
        • Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..
     © 2010, Jamie Callan
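
The join on slide 29 has to cope with the score and the outlink list arriving in either order. The sketch below tells them apart by Python type (float versus list), which is an assumption made for brevity; a real job would more likely tag the records. The data and the name `reduce_attach` are hypothetical.

```python
from collections import defaultdict

scores = [(1, 0.25), (2, 0.25), (3, 0.5)]      # (node_id, score at t+1)
out_links = [(1, [2, 3]), (2, [3]), (3, [1])]  # (node_id, outlinks)

def reduce_attach(node_id, values):
    """Put the score first and the outlinks after it, whatever the arrival order."""
    score = next(v for v in values if isinstance(v, float))
    links = next(v for v in values if isinstance(v, list))
    return (node_id, [score] + links)

# Map passes both lists along unchanged; shuffle groups them by node_id.
grouped = defaultdict(list)
for node_id, value in out_links + scores:
    grouped[node_id].append(value)

joined = [reduce_attach(node_id, values) for node_id, values in sorted(grouped.items())]
```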

  30. PageRank: Side Effect Files for dangling nodes
     • Dangling Nodes
        • Nodes with no outlinks (observed but not crawled URLs)
        • Score has no outlet
           • Need to distribute it to all graph nodes evenly
     • Map for dangling nodes:
        • In: .., (id3, [score3]), ..
        • Out: .., ("*", 0.85×score3), ..
     • Reduce
        • In: .., ("*", [score1, score2, ..]), ..
        • Out: .., everything else, ..
        • Output to side-effect: ("*", score), fed to Mapper of next iteration
     © 2010, Jamie Callan
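
The dangling-node handling of slide 30 can be sketched as below. A plain dictionary stands in for the side-effect file, the graph is made-up sample data, and the function names are hypothetical. Following the slides, the 0.85 factor is applied only to the dangling mass; how the next iteration's mapper uses the per-node share is left out.

```python
from collections import defaultdict

DAMPING = 0.85
DANGLING_KEY = "*"  # special key that collects score with no outlet

def map_node(node_id, score, out_links):
    """Dangling nodes send their damped score to '*'; other nodes distribute as usual."""
    if not out_links:
        yield (DANGLING_KEY, DAMPING * score)
    else:
        for target in out_links:
            yield (target, score / len(out_links))

def reduce_node(key, values, side_effect):
    """Sum the values; the '*' total goes to the side-effect store, not the main output."""
    total = sum(values)
    if key == DANGLING_KEY:
        side_effect[DANGLING_KEY] = total
        return None
    return (key, total)

graph = [(1, 0.5, [2]), (2, 0.25, []), (3, 0.25, [])]  # nodes 2 and 3 are dangling

grouped = defaultdict(list)
for node_id, score, out_links in graph:
    for key, value in map_node(node_id, score, out_links):
        grouped[key].append(value)

side_effect = {}
results = []
for key, values in grouped.items():
    reduced = reduce_node(key, values, side_effect)
    if reduced is not None:
        results.append(reduced)

# Share each node would receive when the next iteration spreads the dangling mass evenly.
per_node_share = side_effect[DANGLING_KEY] / len(graph)
```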

  31. Outline
     • Why MapReduce (Hadoop)
     • MapReduce basics
     • The MapReduce way of thinking
     • Manipulating large data

  32. Manipulating Large Data
     • Do everything in Hadoop (and HDFS)
        • Make sure every step is parallelized!
        • Any serial step breaks your design
     • E.g. storing the URL list for a Web graph
        • Each node in the Web graph has an id
        • [URL1, URL2, …], using the line number as the id: a bottleneck
        • [(id1, URL1), (id2, URL2), …], with an explicit id

  33. Hadoop based Tools
     • For developing in Java, NetBeans plugin
        • http://www.hadoopstudio.org/docs.html
     • Pig Latin, a SQL-like high-level data processing script language
     • Hive, Data warehouse, SQL
     • Cascading, Data processing
     • Mahout, Machine Learning algorithms on Hadoop
     • HBase, Distributed data store as a large table
     • More
        • http://hadoop.apache.org/
        • http://en.wikipedia.org/wiki/Hadoop
        • Many other toolkits: Nutch, Cloud9, Ivory

  34. Get Your Hands Dirty
     • Hadoop Virtual Machine
        • http://www.cloudera.com/developers/downloads/virtual-machine/
           • This runs Hadoop 0.20
        • An earlier Hadoop 0.18.0 version is here: http://code.google.com/edu/parallel/tools/hadoopvm/index.html
     • Amazon EC2
     • Various other Hadoop clusters around
     • The NetBeans plugin simulates Hadoop
        • The workflow view works on Windows
        • Local running & debugging works on MacOS and Linux
        • http://www.hadoopstudio.org/docs.html

  35. Conclusions
     • Why large scale
     • MapReduce advantages
     • Hadoop uses
     • Use cases
        • Map only: for totally distributive computation
        • Map+Reduce: for filtering & aggregation
        • Database join: for massive dictionary lookups
        • Secondary sort: for sorting on values
        • Inverted indexing: combiner, complex keys
        • PageRank: side effect files
     • Large data
     © 2010, Jamie Callan

  36. For More Information
     • L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.
     • J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
     • S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.
     • I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.
     • J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38 (2). 2006.
     • “Map/Reduce Tutorial”. http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. Fetched January 21, 2010.
     • Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.
     • J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010.
     © 2010, Jamie Callan
