Data-Intensive Text Processing with MapReduce, J. Lin & C. Dyer
MapReduce • Programming model for distributed computations on massive amounts of data • Execution framework for large-scale data processing on clusters of commodity servers • Developed by Google – built on old, well-understood principles of parallel and distributed processing • Hadoop – open-source implementation, adopted by Yahoo (now an Apache project)
Big Data • Big data – an issue every organization must grapple with • Web-scale processing is synonymous with data-intensive processing • Vast public and private data repositories • Behavioral data is especially important for business intelligence (BI)
4th paradigm • Manipulating, exploring, and mining massive data is the 4th paradigm of science (after theory, experiments, and simulations) • In CS, systems must be able to scale • Increases in storage capacity have outpaced improvements in bandwidth
MapReduce (MR) • MapReduce provides a level of abstraction and a beneficial division of labor • Programming model – a powerful abstraction that separates the what from the how of data-intensive processing
Big Ideas behind MapReduce • Scale out, not up • Purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (100s) and large shared memory (GBs) is not cost effective • Why? A machine with 2x the processors costs more than 2x as much • Barroso & Hölzle analysis using TPC benchmarks • SMP – communication an order of magnitude faster • A cluster of low-end servers is about 4x more cost effective than the high-end approach • However, even low-end servers run at only 10-50% utilization – not energy efficient
Big Ideas behind MapReduce • Assume failures are common • Assume each cluster machine has a mean time between failures of 1,000 days • In a 10,000-server cluster, that means roughly 10 failures a day • MR copes with failure • Move processing to the data • MR assumes an architecture where processors and storage are co-located • Run code on the processor attached to the data
Big Ideas behind MapReduce • Process data sequentially, not randomly • Consider a 1 TB database with 10^10 100-byte records • Updating 1% of the records with random access takes about a month • Reading the entire DB sequentially and rewriting all records with the updates takes less than one work day on a single machine (see the back-of-the-envelope sketch below) • Solid state won't help • MR – designed for batch processing, trades latency for throughput
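To see where those numbers come from, here is a rough back-of-the-envelope sketch in Python. The disk parameters (10 ms per seek, 100 MB/s sequential throughput) are assumptions for illustration, not figures from the book.

```python
# Back-of-the-envelope comparison: random updates vs. sequential rewrite.
# Assumed (hypothetical) disk parameters; real numbers vary by hardware.
SEEK_TIME_S = 0.010          # ~10 ms per random seek
SEQ_THROUGHPUT_BPS = 100e6   # ~100 MB/s sequential read/write

NUM_RECORDS = 10**10         # 10^10 records
RECORD_SIZE_B = 100          # 100 bytes each -> ~1 TB total
DB_SIZE_B = NUM_RECORDS * RECORD_SIZE_B

# Strategy 1: random-access update of 1% of the records
# (roughly one seek to read plus one seek to write back per record).
updates = 0.01 * NUM_RECORDS
random_time_s = updates * 2 * SEEK_TIME_S
print(f"random updates: {random_time_s / 86400:.1f} days")        # ~23 days

# Strategy 2: sequentially read the whole DB and rewrite it with the updates.
sequential_time_s = 2 * DB_SIZE_B / SEQ_THROUGHPUT_BPS
print(f"sequential rewrite: {sequential_time_s / 3600:.1f} hours")  # ~5.6 hours
```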
Big Ideas behind MapReduce • Hide system-level details from the application developer • Writing distributed programs is difficult • Details must be managed across threads, processes, and machines • Code that runs concurrently is unpredictable • Deadlocks, race conditions, etc. • MR isolates the developer from system-level details • No locking, starvation, etc. • Well-defined interfaces • Separates the what (programmer) from the how (responsibility of the execution framework) • Framework designed once and verified for correctness
Big Ideas behind MapReduce • Seamless scalability • Given 2x the data, algorithms should take at most 2x as long to run • Given a cluster 2x as large, they should take half the time • The above is unobtainable for most algorithms • 9 women can't have a baby in 1 month • E.g. doubling the degree of parallelization increases communication, and a program may actually take longer • MR is a small step toward attaining this ideal • The algorithm stays fixed; the framework executes it at whatever scale is available • If 10 machines take 10 hours, 100 machines should take 1 hour
Motivation for MapReduce • We have long been waiting for parallel processing to replace sequential processing • With the progress of Moore's law, most problems could be solved by a single computer, so parallelism was largely ignored • Around 2005, this was no longer true • The semiconductor industry ran out of opportunities to improve performance through faster clocks, deeper pipelines, and superscalar architectures • Then came multi-core • Not matched by advances in software
Motivation • Parallel processing is the only way forward • MapReduce to the rescue • Anyone can download the open-source Hadoop implementation of MapReduce • Rent a cluster from a utility cloud • Process terabytes within the week • Multiple cores in a chip, multiple machines in a cluster
Motivation • MapReduce: an effective data analysis tool • First widely-adopted step away from the von Neumann model • Can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machine images that communicate over a network • Wrong abstraction • MR organizes computations not over individual machines, but over entire clusters • The datacenter is the computer
Motivation • Models of parallel computation • PRAM (parallel random-access machine) • An arbitrary number of processors sharing an unboundedly large memory, operating synchronously on a shared input • MR is the most successful abstraction for large-scale computational resources • Manages complexity, hides details, presents well-defined behavior • Makes certain tasks easier, others harder • MapReduce is the first in a new class of programming models
Basics • Divide and conquer • Partition a large problem into smaller subproblems • Workers work on the subproblems in parallel • Threads in a core, cores in a multi-core processor, multiple processors in a machine, machines in a cluster • Combine the intermediate results from the workers into the final result
Basics • MR – an abstraction that hides system-level details from the programmer • Move code to the data • Spread data across the disks • A distributed file system (DFS) manages storage • Based on functional programming
Functional Programming Roots • MapReduce = functional programming plus distributed processing on steroids • Not a new idea… dates back to the 50’s (or even 30’s) • What is functional programming? • Computation as application of functions • Computation is evaluation of mathematical functions • Avoids state and mutable data • Emphasizes application of functions instead of changes in state
Functional Programming Roots • How is it different? • Traditional notions of “data” and “instructions” are not applicable • Data flows are implicit in the program • Different orders of execution are possible • Theoretical foundation provided by lambda calculus • a formal system for function definition • Exemplified by LISP, Scheme
Overview of Lisp • Functions are written in prefix notation:
(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15
Functional Programming Roots • Two important concepts in functional programming • Map: do something to everything in a list • Fold: combine results of a list in some way
Functional Programming Map • Higher-order functions – functions that accept other functions as arguments • Map • Takes a function f and its argument, which is a list • Applies f to all elements in the list • Lists are primitive data types • [1 2 3 4 5] • [[a 1] [b 2] [c 3]] • Returns a list as the result • Simple map example:
(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]
Functional Programming Reduce • Fold • Takes a function g (which has 2 arguments), an initial value, and a list • g is applied to the initial value and the 1st item in the list • The result is stored in an intermediate variable • The intermediate variable and the next item in the list are the arguments to the 2nd application of g, etc. • Fold returns the final value of the intermediate variable
Map/Fold in Action • Simple map example:
(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]
• Fold examples:
(fold + 0 [1 2 3 4 5]) → 15
(fold * 1 [1 2 3 4 5]) → 120
• Sum of squares:
(define (sum-of-squares v)   ; where v is a list
  (fold + 0 (map (lambda (x) (* x x)) v)))
(sum-of-squares [1 2 3 4 5]) → 55
Functional Programming Roots • Use map and fold in combination • Map – a transformation of the dataset • Fold – an aggregation operation • Map can be applied in parallel • Fold has more restrictions: elements must be brought together • Many applications do not require g to be applied to all elements of the list at once, so those fold aggregations can also run in parallel (see the sketch below)
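A minimal sketch of that idea, assuming Python's standard multiprocessing module (the chunk size and helper names are illustrative): map is applied element-wise in parallel, and because + is associative, the fold can also be computed per chunk and then combined.

```python
from functools import reduce
from multiprocessing import Pool
from operator import add

def square_chunk(chunk):
    # "Map": apply the per-element function to everything in the chunk.
    return [x * x for x in chunk]

def sum_chunk(chunk):
    # "Fold" of one chunk; safe to do per chunk because + is associative.
    return reduce(add, chunk, 0)

if __name__ == "__main__":
    data = list(range(1, 1_000_001))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with Pool() as pool:
        squared = pool.map(square_chunk, chunks)   # map phase, in parallel
        partials = pool.map(sum_chunk, squared)    # per-chunk fold, in parallel

    print(reduce(add, partials, 0))  # combine partial sums: sum of squares
```

MapReduce generalizes exactly this pattern across machines instead of processes.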
MapReduce • Map in MapReduce is the same as in functional programming • Reduce corresponds to fold • 2 stages: • A user-specified computation is applied over all input records; it can occur in parallel and returns intermediate output • That output is aggregated by another user-specified computation
Mappers/Reducers • Key-value pair (k,v) – the basic data structure in MR • Keys, values – ints, strings, etc., user defined • e.g. keys – URLs, values – HTML content • e.g. keys – node ids, values – adjacency lists of nodes
Map: (k1, v1) → [(k2, v2)]
Reduce: (k2, [v2]) → [(k3, v3)]
where […] denotes a list
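Expressed as Python type hints (the names are purely illustrative, not part of any MapReduce API), the two signatures look like this:

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")   # input key/value types
K2 = TypeVar("K2"); V2 = TypeVar("V2")   # intermediate key/value types
K3 = TypeVar("K3"); V3 = TypeVar("V3")   # output key/value types

# Map:    (k1, v1)   -> list of (k2, v2)
Mapper = Callable[[K1, V1], List[Tuple[K2, V2]]]
# Reduce: (k2, [v2]) -> list of (k3, v3)
Reducer = Callable[[K2, Iterable[V2]], List[Tuple[K3, V3]]]
```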
General Flow • Map: apply the mapper to every input key-value pair stored in the DFS; generate an arbitrary number of intermediate (k,v) pairs • Shuffle: a distributed group-by operation on the intermediate keys; sort intermediate results by key (within each reducer, not across reducers) • Reduce: aggregate the intermediate results; generate the final output to the DFS – one file per reducer
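A toy, single-process sketch of this flow, assuming nothing beyond the Python standard library (it simulates the idea; it is not how Hadoop is implemented):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Toy simulation of the map -> shuffle/sort -> reduce flow."""
    # Map phase: apply the mapper to every input (k, v) pair.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(mapper(k, v))

    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Sort by key (a single "reducer" here), then reduce each group.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output
```

Any mapper/reducer pair, such as the word count on the next slide, can be plugged into this skeleton.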
Example: unigram counts (word count) • Input: (docid, doc) pairs on the DFS, where doc is the document text • Mapper tokenizes the doc and emits a (word, 1) pair for every word • The execution framework brings all pairs with the same key together at a reducer • Reducer sums all the counts (of 1) for each word • Each reducer writes to one output file • Words within a file are sorted; each file holds roughly the same number of words • The output can be used as input to another MR job
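As a concrete illustration, the word count can be written in the style of Hadoop Streaming as two small Python scripts. The file names and whitespace tokenization are illustrative choices, not prescribed by the book.

```python
# mapper.py -- emit ("word", 1) for every token in the input documents
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

The matching reducer relies on its input arriving sorted by key, so all counts for a word are adjacent:

```python
# reducer.py -- sum the counts for each word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally, the pipeline can be approximated with: cat docs.txt | python mapper.py | sort | python reducer.py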
Execution Framework • Scheduling • A job is divided into tasks (each covering a certain block of (k,v) pairs) • There can be 1000s of tasks that need to be assigned • This may exceed the number that can run concurrently • Tasks wait in a task queue • Requires coordination among tasks from different jobs
Execution Framework • Speculative execution • The map phase is only as fast as? • the slowest map task • Problem: stragglers, flaky hardware • Solution: speculative execution • Run an exact copy of the same task on a different machine • Use the result of whichever copy finishes first • Better for map or reduce? • Can improve running time by 44% (Google) • Doesn't help if the distribution of values is skewed
Execution Framework • Data/code co-location • Execute tasks near the data • If that is not possible, the data must be streamed to the task • Try to keep it within the same rack
Execution Framework • Synchronization • Concurrently running processes must join up • Intermediate (k,v) pairs are grouped by key; intermediate data is copied over the network and shuffled/sorted • Number of copy operations? Worst case: • M x R copy operations (M mappers, R reducers) • Each mapper may send intermediate results to every reducer • The reduce computation cannot start until all mappers have finished and the (k,v) pairs have been shuffled/sorted • This differs from functional programming • However, intermediate (k,v) pairs can be copied over the network to a reducer as soon as a mapper finishes
Hadoop • Be careful using external resources (e.g. a bottleneck from querying a SQL DB) • Mappers can emit an arbitrary number of intermediate (k,v) pairs, which can be of a different type than the input • Reducers can emit an arbitrary number of final (k,v) pairs, which can be of a different type than the intermediate pairs • Different from functional programming: mappers and reducers can have side effects (internal state changes may cause problems; external side effects may write to files) • A MapReduce job can have no reduce phase, but it must have a mapper • Can just pass the identity function as the reducer (sketched below) • A job may not even have any input (e.g. computing pi)
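For the identity-reducer point, a minimal framework-agnostic sketch (illustrative only): the reducer passes every grouped value through unchanged, so the job amounts to the map phase plus the framework's group-by-key.

```python
def identity_reducer(key, values):
    # Emit every value unchanged under its key; no aggregation is performed.
    return [(key, v) for v in values]
```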
Other Sources • Other stores can serve as the source/destination for MapReduce data • Google – BigTable • HBase – an open-source BigTable clone • Hadoop – integrates relational databases with parallel processing; jobs can read from and write to DB tables
Google File System (the basis for Hadoop's HDFS) • Divides files into large chunks – 64 MB • Master and chunk servers • Data replicated 3 times • Shadow master
CAP Theorem • Consistency, availability, partition tolerance • Cannot satisfy all 3 at once • Partitioning is unavoidable in large data systems, so availability and consistency must be traded off • If the master fails, the system is unavailable, so it stays consistent! • If there are multiple masters, the system is more available, but can become inconsistent • Workaround to the single namenode • A warm standby namenode • The Hadoop community has been working on it