Tutorial for MapReduce (Hadoop) & Large Scale Processing

Le Zhao (LTI, SCS, CMU)
Database Seminar & Large Scale Seminar, 2010-Feb-15
Some slides adapted from IR course lectures by Jamie Callan
© 2010, Le Zhao

Outline • Why MapReduce (Hadoop) • Why go large scale • Compared to other parallel computing models • Hadoop related tools • MapReduce basics • The MapReduce way of thinking • Manipulating large data © 2010, Le Zhao

Why MapReduce (Hadoop) • Previous parallel computation models • 1) scp + ssh • Manual everything • 2) network cross-mounted disks + condor/torque • No data distribution, disk access is the bottleneck • Can only partition totally distributed computation • No fault tolerance • Prioritized job scheduling © 2010, Le Zhao

Hadoop • Parallel batch computation • Data distribution • Hadoop Distributed File System (HDFS) • Like Linux FS, but with automatic data replication • Computation distribution • Automatic, the user only needs to specify #input_splits • Can distribute aggregation computations as well • Fault tolerance • Automatic recovery from failure • Speculative execution (a backup task) • Job scheduling • Ok, but still relies on the politeness of users © 2010, Le Zhao

How you can use Hadoop • Hadoop Streaming • Quick hacking – much like shell scripting • Uses STDIN & STDOUT to carry data • cat file | mapper | sort | reducer > output • Easier to use legacy code, all programming languages • Hadoop Java API • Build large systems • More data types • More control over Hadoop’s behavior • Easier debugging with Java’s error stacktrace display • NetBeans plugin for Hadoop provides easy programming • http://hadoopstudio.org/docs.html © 2010, Le Zhao
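The streaming pipeline `cat file | mapper | sort | reducer > output` can be imitated locally. The sketch below is only an illustration under assumptions: the function names `mapper`, `reducer`, and `run_pipeline` are hypothetical, word counting is chosen as the task, and lines are tab-separated key/value pairs as is conventional for Hadoop Streaming.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print to STDOUT."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each run of identical keys; the input must already be sorted."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_pipeline(lines):
    """Local stand-in for: cat file | mapper | sort | reducer"""
    return list(reducer(sorted(mapper(lines))))
```

In a real streaming job the mapper and reducer would be separate scripts that read `sys.stdin` and print to STDOUT; here they are plain generators so the pipeline can be tried in one process.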

Map and Reduce MapReduce is a new use of an old idea in Computer Science Map: Apply a function to every object in a list Each object is independent Order is unimportant Maps can be done in parallel The function produces a result Reduce: Combine the results to produce a final result You may have seen this in a Lisp or functional programming course © 2009, Jamie Callan
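The functional-programming idea behind the name can be shown with Python's built-in `map` and `functools.reduce`. This is a minimal sketch; the `square` function and the sum-of-squares task are illustrative assumptions, not taken from the slides.

```python
from functools import reduce


def square(x):
    """The Map function: applied to each element independently."""
    return x * x


def total_of_squares(numbers):
    """Map every number to its square, then Reduce the results to a single sum."""
    mapped = map(square, numbers)  # each call is independent, so it could run in parallel
    return reduce(lambda acc, value: acc + value, mapped, 0)
```

Because no call to `square` depends on another, the map step could be split across machines; only the reduce step needs to see all the results.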

MapReduce • Input reader • Divide input into splits, assign each split to a Map processor • Map • Apply the Map function to each record in the split • Each Map function returns a list of (key, value) pairs • Shuffle/Partition and Sort • Shuffle distributes sorting & aggregation to many reducers • All records for key k are directed to the same reduce processor • Sort groups the same keys together, and prepares for aggregation • Reduce • Apply the Reduce function to each key • The result of the Reduce function is a list of (key, value) pairs © 2010, Jamie Callan
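The stages above can be sketched as a small in-memory driver. It is a toy under stated assumptions: `map_reduce`, `word_count_map`, and `word_count_reduce` are hypothetical names, partitioning uses Python's `hash`, and everything runs in one process, so it shows only the data flow, not Hadoop's distribution or fault tolerance.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn, num_reducers=2):
    """Run Map, Shuffle/Partition, Sort, and Reduce over an in-memory list of records."""
    # Map + Shuffle: route every (key, value) pair to the partition that owns the key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[hash(key) % num_reducers][key].append(value)
    # Sort + Reduce: each "reducer" sees its keys in sorted order with all their values.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reduce_fn(key, partition[key]))
    return output


def word_count_map(line):
    """Example Map function: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    """Example Reduce function: a list holding one (word, total) pair."""
    return [(word, sum(counts))]
```

Output order across partitions depends on the hash function, so callers that need a global order should sort the result.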

Outline • Why MapReduce (Hadoop) • MapReduce basics • The MapReduce way of thinking • Two simple use cases • Two more advanced & useful MapReduce tricks • Two MapReduce applications • Manipulating large data © 2010, Le Zhao

MapReduce Use Case (1) – Map Only Data distributive tasks – Map Only • E.g. classify individual documents • Map does everything • Input: (docno, doc_content), … • Output: (docno, [class, class, …]), … • No reduce © 2010, Le Zhao

MapReduce Use Case (2) – Filtering and Accumulation Filtering & Accumulation – Map and Reduce • E.g. Counting total enrollments of two given classes • Map selects records and outputs initial counts • In: (Jamie, 11741), (Tom, 11493), … • Out: (11741, 1), (11493, 1), … • Shuffle/Partition by class_id • Sort • In: (11741, 1), (11493, 1), (11741, 1), … • Out: (11493, 1), …, (11741, 1), (11741, 1), … • Reduce accumulates counts • In: (11493, [1, 1, …]), (11741, [1, 1, …]) • Sum and Output: (11493, 16), (11741, 35) © 2010, Le Zhao
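A minimal sketch of this filter-and-accumulate pattern, assuming enrollment records are `(student, class_id)` tuples with string class ids. The names `TARGET_CLASSES`, `map_enrollment`, `reduce_count`, and `count_enrollments` are hypothetical, and a dictionary stands in for shuffle and sort.

```python
from collections import defaultdict

# The two classes of interest (assumed to be given as strings).
TARGET_CLASSES = {"11741", "11493"}


def map_enrollment(record):
    """Map: keep only records for the target classes and emit (class_id, 1)."""
    _student, class_id = record
    if class_id in TARGET_CLASSES:
        yield class_id, 1


def reduce_count(class_id, counts):
    """Reduce: sum the initial counts for one class."""
    return class_id, sum(counts)


def count_enrollments(records):
    """Group map output by class_id (shuffle + sort), then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for class_id, count in map_enrollment(record):
            grouped[class_id].append(count)
    return [reduce_count(class_id, grouped[class_id]) for class_id in sorted(grouped)]
```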

MapReduce Use Case (3) – Database Join Problem: Massive lookups • Given two large lists: (URL, ID) and (URL, doc_content) pairs • Produce (ID, doc_content) Solution: Database join • Input stream: both (URL, ID) and (URL, doc_content) lists • (http://del.icio.us/post, 0), (http://digg.com/submit, 1), … • (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), … • Map simply passes input along, • Shuffle and Sort on URL (group ID & doc_content for the same URL together) • Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), (http://digg.com/submit, 1), … • Reduce outputs result stream of (ID, doc_content) pairs • In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), … • Out: (0, <html0>), (1, <html1>), … © 2010, Le Zhao
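The join can be sketched in memory as follows. One assumption goes beyond the slide: each value is tagged with its source list (`"id"` or `"doc"`) so the reducer can tell an ID from document content. The function name `join_on_url` is hypothetical.

```python
from collections import defaultdict


def join_on_url(url_ids, url_docs):
    """Join (URL, ID) pairs with (URL, doc_content) pairs into (ID, doc_content) pairs."""
    # Map + Shuffle: pass every record along, grouped by URL, tagged with its source.
    grouped = defaultdict(list)
    for url, doc_id in url_ids:
        grouped[url].append(("id", doc_id))
    for url, content in url_docs:
        grouped[url].append(("doc", content))
    # Reduce: within one URL group, pair every ID with every document content.
    joined = []
    for url in sorted(grouped):
        ids = [value for tag, value in grouped[url] if tag == "id"]
        docs = [value for tag, value in grouped[url] if tag == "doc"]
        joined.extend((doc_id, content) for doc_id in ids for content in docs)
    return joined
```

URLs that appear in only one of the two lists produce no output, which corresponds to an inner join.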

MapReduce Use Case (4) – Secondary Sort
[Figure: the 3-node graph before and after reversing its edges]
Problem: Sorting on values
• E.g. Reverse graph edge directions & output in node order
• Input: adjacency list of graph (3 nodes and 4 edges)
  (3, [1, 2]), (1, [2, 3]) → (1, [3]), (2, [1, 3]), (3, [1])
• Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys!
Solution: Secondary sort
• Map
  • In: (3, [1, 2]), (1, [2, 3]).
  • Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]). (reverse edge direction)
  • Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1]).
  • Copy node_ids from value to key.
© 2010, Le Zhao

MapReduce Use Case (4) – Secondary Sort Secondary Sort (ctd.) • Shuffle on Key.field1, and Sort on whole Key (both fields) • In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1]) • Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1]) • Grouping comparator • Merge according to part of the key • Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1]); this will be the reducer’s input • Reduce • Merge & output: (1, [3]), (2, [1, 3]), (3, [1]) © 2010, Le Zhao
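The secondary-sort steps for the edge-reversal example can be sketched as below. This is an in-memory illustration: `reverse_edges` is a hypothetical name, a list sort stands in for Hadoop's sort on the whole composite key, and `itertools.groupby` on the first key field stands in for the grouping comparator.

```python
from itertools import groupby


def reverse_edges(adjacency):
    """Reverse every edge and return (node, sorted_sources) pairs in node order."""
    # Map: reverse each edge and copy the source node_id from the value into the key.
    pairs = []
    for source, targets in adjacency:
        for target in targets:
            pairs.append(((target, source), source))
    # Sort on the whole composite key <target, source>.
    pairs.sort(key=lambda kv: kv[0])
    # Grouping comparator: merge records that agree on the first key field only.
    result = []
    for node, group in groupby(pairs, key=lambda kv: kv[0][0]):
        result.append((node, [value for _, value in group]))
    return result
```

Because the values reach each group already ordered by the second key field, no per-key sort of the values is needed.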

Using MapReduce to Construct Indexes: Preliminaries Construction of binary inverted lists • Input: documents: (docid, [term, term..]), (docid, [term, ..]), .. • Output: (term, [docid, docid, …]) • E.g., (apple, [1, 23, 49, 127, …]) • Binary inverted lists fit on a slide more easily • Everything also applies to frequency and positional inverted lists A document id is an internal document id, e.g., a unique integer • Not an external document id such as a url MapReduce elements • Combiner, Secondary Sort, complex keys, Sorting on keys’ fields © 2010, Jamie Callan

Using MapReduce to Construct Indexes: A Simple Approach A simple approach to creating binary inverted lists • Each Map task is a document parser • Input: A stream of documents • Output: A stream of (term, docid) tuples • (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) … • Shuffle sorts tuples by key and routes tuples to Reducers • Reducers convert streams of keys into streams of inverted lists • Input: (long, 1) (long, 127) (long, 49) (long, 23) … • The reducer sorts the values for a key and builds an inverted list • Longest inverted list must fit in memory • Output: (long, [df:492, docids:1, 23, 49, 127, …]) © 2010, Jamie Callan

Using MapReduce to Construct Indexes: A Simple Approach
A more succinct representation of the previous algorithm
• Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
• Shuffle by t
• Sort by t: (t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
• Reduce: (t4, [docid3 docid1 …]) → (t, ilist)
docid: a unique integer
t: a term, e.g., “apple”
ilist: a complete inverted list
but a) inefficient, b) docids are sorted in reducers, and c) assumes ilist of a word fits in memory
© 2010, Jamie Callan
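A minimal in-memory sketch of this simple approach, assuming documents arrive as `(docid, [term, ...])` pairs. The name `build_inverted_index` is hypothetical, and a dictionary stands in for shuffle and sort. Note that the reducer step sorts each list in memory, which is exactly limitation c).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Return {term: sorted list of docids} for binary inverted lists."""
    postings = defaultdict(list)
    # Map: emit one (term, docid) tuple per distinct term in each document.
    for docid, terms in documents:
        for term in set(terms):
            postings[term].append(docid)
    # Reduce: sort the docids of each term in memory to build its inverted list.
    return {term: sorted(docids) for term, docids in postings.items()}
```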

Using MapReduce to Construct Indexes: Using Combine
• Map: (docid1, content1) → (t1, ilist_{1,1}) (t2, ilist_{2,1}) (t3, ilist_{3,1}) …
  • Each output inverted list covers just one document
• Combine
  • Sort by t
  • Combine: (t1, [ilist_{1,2}, ilist_{1,3}, ilist_{1,1}, …]) → (t1, ilist_{1,27})
  • Each output inverted list covers a sequence of documents
• Shuffle by t
• Sort by t: (t4, ilist_{4,1}) (t5, ilist_{5,3}) … → (t4, ilist_{4,2}) (t4, ilist_{4,4}) (t4, ilist_{4,1}) …
• Reduce: (t7, [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → (t7, ilist_final)
ilist_{i,j}: the j’th inverted list fragment for term i
© 2010, Jamie Callan
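The effect of the combiner can be sketched as follows: each map task merges its per-document fragments into one fragment per term before anything is shuffled, so the reducer receives far fewer, longer fragments. The names `map_task` and `reduce_fragments` are hypothetical, and each input split is assumed to be a list of `(docid, [term, ...])` pairs.

```python
from collections import defaultdict


def map_task(documents):
    """Map + Combine for one input split: one sorted inverted list fragment per term."""
    fragments = defaultdict(list)
    for docid, terms in documents:
        for term in set(terms):
            fragments[term].append(docid)
    return {term: sorted(docids) for term, docids in fragments.items()}


def reduce_fragments(task_outputs):
    """Reduce: merge the fragments from all map tasks into final inverted lists."""
    merged = defaultdict(list)
    for fragments in task_outputs:
        for term, docids in fragments.items():
            merged[term].extend(docids)
    return {term: sorted(docids) for term, docids in merged.items()}
```

The reducer here still sorts in memory; only the number of records crossing the network has been reduced.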

Using MapReduce to Construct Indexes
[Figure: Documents → Parser / Indexer processors (Map/Combine) → inverted list fragments (Shuffle/Sort) → Merger processors for term ranges A-F, G-P, Q-Z (Reduce) → inverted lists]
© 2010, Jamie Callan

Using MapReduce to Construct Partitioned Indexes
Map: (docid1, content1) → ([p, t1], ilist_{1,1})
Combine to sort and group values
([p, t1], [ilist_{1,2}, ilist_{1,3}, ilist_{1,1}, …]) → ([p, t1], ilist_{1,27})
Shuffle by p
Sort values by [p, t]
Reduce: ([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
p: partition (shard) id
© 2010, Jamie Callan

Using MapReduce to Construct Indexes: Secondary Sort
So far, we have assumed that Reduce can sort values in memory
…but what if there are too many to fit in memory?
Map: (docid1, content1) → ([t1, fd_{1,1}], ilist_{1,1})
Combine to sort and group values
Shuffle by t
Sort by [t, fd], then Group by t (Secondary Sort)
([t7, fd_{7,2}], ilist_{7,2}), ([t7, fd_{7,1}], ilist_{7,1}) … → (t7, [ilist_{7,1}, ilist_{7,2}, …])
Reduce: (t7, [ilist_{7,1}, ilist_{7,2}, …]) → (t7, ilist_final)
Values arrive in order, so Reduce can stream its output
fd_{i,j} is the first docid in ilist_{i,j}
© 2010, Jamie Callan
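A sketch of why ordered fragments let the reducer stream its output. Assumptions beyond the slide: fragments are `(term, sorted_docids)` pairs whose docid ranges do not interleave, `stream_postings` is a hypothetical name, and the in-memory `sort` call stands in for the framework's sort on the composite key [t, fd].

```python
def stream_postings(fragments):
    """Yield (term, docid) pairs in final index order without sorting docids per term."""
    # Composite key: (term, first docid of the fragment), as in the secondary sort.
    keyed = [((term, docids[0]), docids) for term, docids in fragments if docids]
    keyed.sort(key=lambda kv: kv[0])  # stand-in for the framework's sort
    # Reduce: fragments for a term arrive in docid order, so emit them as they come.
    for (term, _first_docid), docids in keyed:
        for docid in docids:
            yield term, docid
```

The reducer holds at most one fragment at a time, so the complete inverted list for a term never has to fit in memory.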

Using MapReduce to Construct Indexes: Putting it All Together
Map: (docid1, content1) → ([p, t1, fd_{1,1}], ilist_{1,1})
Combine to sort and group values
([p, t1, fd_{1,1}], [ilist_{1,2}, ilist_{1,3}, ilist_{1,1}, …]) → ([p, t1, fd_{1,27}], ilist_{1,27})
Shuffle by p
Secondary Sort by [(p, t), fd]
([p, t7], [ilist_{7,2}, ilist_{7,1}, ilist_{7,4}, …]) → ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …])
Reduce: ([p, t7], [ilist_{7,1}, ilist_{7,2}, ilist_{7,4}, …]) → ([p, t7], ilist_final)
© 2010, Jamie Callan

Using MapReduce to Construct Indexes
[Figure: Documents → Parser / Indexer processors (Map/Combine) → inverted list fragments (Shuffle/Sort) → one Merger processor per shard (Reduce) → inverted lists]
© 2010, Jamie Callan

PageRank Calculation: Preliminaries One PageRank iteration: • Input: • (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) .. • Output: • (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) .. MapReduce elements • Score distribution and accumulation • Database join • Side-effect files © 2010, Jamie Callan

PageRank: Score Distribution and Accumulation • Map • In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) .. • Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), .. • Shuffle & Sort by node_id • In: (id2, score1), (id1, score2), (id1, score1), .. • Out: (id1, score1), (id1, score2), .., (id2, score1), .. • Reduce • In: (id1, [score1, score2, ..]), (id2, [score1, ..]), .. • Out: (id1, score1(t+1)), (id2, score2(t+1)), .. © 2010, Jamie Callan
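The distribution and accumulation step can be sketched in memory as below. Assumptions: the graph is a dictionary `{node_id: (score, [outlinks])}`, every outlink target is itself a node in the graph, and, as on this slide, no damping factor or dangling-node handling is applied. The function name `distribute_and_accumulate` is hypothetical.

```python
from collections import defaultdict


def distribute_and_accumulate(graph):
    """One score-passing step: each node splits its score evenly among its outlinks."""
    incoming = defaultdict(list)
    # Map: emit (outlink, score / number_of_outlinks) for every outlink.
    for _node_id, (score, outlinks) in graph.items():
        for target in outlinks:
            incoming[target].append(score / len(outlinks))
    # Reduce: sum the contributions received by each node.
    return {node_id: sum(incoming.get(node_id, [])) for node_id in graph}
```

The result holds only scores; the outlink lists have to be joined back in before the next iteration can run.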

PageRank: Database Join to associate outlinks with score • Map • In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]) .. • Shuffle & Sort by node_id • Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), .. • Reduce • In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), .. • Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) .. © 2010, Jamie Callan

PageRank: Side Effect Files for dangling nodes • Dangling Nodes • Nodes with no outlinks (observed but not crawled URLs) • Score has no outlet • need to distribute to all graph nodes evenly • Map for dangling nodes: • In: .., (id3, [score3]), .. • Out: .., ("*", 0.85×score3), .. • Reduce • In: .., ("*", [score1, score2, ..]), .. • Out: .., everything else, .. • Output to side-effect: ("*", score), fed to Mapper of next iteration © 2010, Jamie Callan
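A sketch of the dangling-node handling, with the side-effect file replaced by a plain return value. Assumptions: the graph is `{node_id: (score, [outlinks])}`, the factor 0.85 from the slide is passed as a `damping` parameter, and the collected mass is spread evenly over all nodes in the next iteration. The names `collect_dangling_mass` and `spread_dangling_mass` are hypothetical.

```python
def collect_dangling_mass(graph, damping=0.85):
    """Sum damping * score over nodes with no outlinks (the records keyed by "*")."""
    return sum(damping * score for score, outlinks in graph.values() if not outlinks)


def spread_dangling_mass(scores, dangling_mass):
    """Next iteration: give every node an equal share of the collected dangling mass."""
    share = dangling_mass / len(scores)
    return {node_id: score + share for node_id, score in scores.items()}
```

In Hadoop the single `("*", score)` total would be written to a side file by the reducer and read by every mapper of the following iteration; here the two functions are simply called in sequence.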

Manipulating Large Data • Do everything in Hadoop (and HDFS) • Make sure every step is parallelized! • Any serial step breaks your design • E.g. storing the URL list for a Web graph • Each node in Web graph has an id • [URL1, URL2, …], use line number as id – bottleneck • [(id1, URL1), (id2, URL2), …], explicit id © 2010, Le Zhao

Hadoop based Tools • For Developing in Java, NetBeans plugin • http://www.hadoopstudio.org/docs.html • Pig Latin, a SQL-like high level data processing script language • Hive, Data warehouse, SQL • Cascading, Data processing • Mahout, Machine Learning algorithms on Hadoop • HBase, Distributed data store as a large table • More • http://hadoop.apache.org/ • http://en.wikipedia.org/wiki/Hadoop • Many other toolkits, Nutch, Cloud9, Ivory © 2010, Le Zhao

Get Your Hands Dirty • Hadoop Virtual Machine • http://www.cloudera.com/developers/downloads/virtual-machine/ • This runs Hadoop 0.20 • An earlier Hadoop 0.18.0 version is here http://code.google.com/edu/parallel/tools/hadoopvm/index.html • Amazon EC2 • Various other Hadoop clusters around • The NetBeans plugin simulates Hadoop • The workflow view works on Windows • Local running & debugging works on MacOS and Linux • http://www.hadoopstudio.org/docs.html © 2010, Le Zhao

Conclusions • Why large scale • MapReduce advantages • Hadoop uses • Use cases • Map only: for totally distributive computation • Map+Reduce: for filtering & aggregation • Database join: for massive dictionary lookups • Secondary sort: for sorting on values • Inverted indexing: combiner, complex keys • PageRank: side effect files • Large data © 2010, Jamie Callan

For More Information • L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003. • J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004. • S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003. • I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999. • J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38 (2). 2006. • http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. “Map/Reduce Tutorial”. Fetched January 21, 2010. • Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009 • J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010. © 2010, Jamie Callan
