Data-Intensive Text Processing with MapReduce, J. Lin & C. Dyer, Chapter 1
MapReduce • Programming model for distributed computations on massive amounts of data • Execution framework for large-scale data processing on clusters of commodity servers • Developed by Google – built on older principles of parallel and distributed processing • Hadoop – open-source implementation, adopted by Yahoo! (now an Apache project)
Big Data • Big data – an issue every organization must grapple with • Web-scale processing is synonymous with data-intensive processing • Vast public and private data repositories • Behavioral data is important for business intelligence (BI)
4th paradigm • Manipulating, exploring, and mining massive data – the 4th paradigm of science (after theory, experiments, and simulations) • In CS, systems must be able to scale • Increases in storage capacity are outpacing improvements in bandwidth
Problems/Solutions • NLP and IR • Data-driven algorithmic approach to capture statistical regularities • Data – corpora (NLP), collections (IR) • Representations of data – features (superficial, deep) • Method – algorithms • Examples • Is an email spam or not? Is this word part of an address or a location?
Problems/Solutions • Who shot Lincoln? • NLP approach – sophisticated linguistics: syntactic and semantic analysis • Since ~2001 – look at the words appearing to the left of "shot Lincoln" across the web and tally them up: a redundancy-based approach • Probability distribution over sequences of words • Training, smoothing • Markov assumption • N-gram language model: the conditional probability of a word given the n-1 previous words (see the formula below)
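Stated as a formula (standard n-gram notation; the formula itself is not on the slide): under the Markov assumption, the probability of a word sequence factors into conditional probabilities over short histories:

$$P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

A bigram model (n = 2) thus conditions each word only on its immediate predecessor.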
MapReduce (MR) • MapReduce provides • A level of abstraction and a beneficial division of labor • Programming model – a powerful abstraction that separates the what from the how of data-intensive processing (a minimal sketch follows below)
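A minimal word-count sketch in plain Python – not the Hadoop API. The programmer supplies only the mapper and reducer (the what); the toy driver stands in for the real execution framework (the how: distribution, shuffling, fault tolerance):

```python
from itertools import groupby
from operator import itemgetter

# Mapper: takes one input record (a line of text) and emits
# intermediate (key, value) pairs -- here, (word, 1).
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reducer: takes a key and all values grouped under that key,
# and emits final (key, value) pairs -- here, (word, total count).
def reducer(word, counts):
    yield (word, sum(counts))

# Toy driver standing in for the execution framework: it applies
# the mapper, sorts/groups intermediate pairs by key ("shuffle"),
# and feeds each key's values to the reducer.
def run(lines):
    intermediate = sorted(kv for line in lines for kv in mapper(line))
    for word, group in groupby(intermediate, key=itemgetter(0)):
        yield from reducer(word, (count for _, count in group))

print(dict(run(["the quick brown fox", "the lazy dog jumps over the fox"])))
# -> {'brown': 1, 'dog': 1, 'fox': 2, 'jumps': 1, 'lazy': 1,
#     'over': 1, 'quick': 1, 'the': 3}
```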
Big Ideas behind MapReduce • Scale out, not up • Purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (100s) and large shared memory (GBs) is not cost effective • Why? A machine with 2x the processors costs more than 2x as much • Barroso & Hölzle analyzed this using TPC benchmarks • SMP – communication is an order of magnitude faster • A cluster of low-end machines is about 4x more cost effective than the high-end approach • However, even low-end machines run at only 10-50% utilization – not energy efficient
Big Ideas behind MapReduce • Assume failures are common • Assume cluster machines have a mean time between failures of 1000 days • A 10,000-server cluster then sees about 10 failures a day (arithmetic below) • MR copes with failure • Move processing to the data • MR assumes an architecture where processors and storage are co-located • Run code on the processor attached to the data
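The failure rate follows directly from those assumptions:

$$\frac{10{,}000 \text{ machines}}{1000 \text{ days MTBF per machine}} = 10 \text{ failures per day}$$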
Big Ideas behind MapReduce • Process data sequentially, not randomly • Consider a 1 TB database of 10^10 100-byte records • Updating 1% of the records via random access takes about 1 month • Reading the entire DB and sequentially rewriting all records with the updates takes less than 1 work day on a single machine (back-of-the-envelope arithmetic below) • Solid state won't fundamentally change this • MR – designed for batch processing: trade latency for throughput
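Rough arithmetic behind those claims, assuming typical magnetic-disk figures (about 30 ms per random update including seek time, and about 100 MB/s sequential throughput; these device numbers are assumptions, not from the book):

$$1\% \times 10^{10} = 10^{8} \text{ random updates} \times 30\,\text{ms} \approx 3 \times 10^{6}\,\text{s} \approx 35 \text{ days}$$

$$\frac{2 \times 1\,\text{TB (read + rewrite)}}{100\,\text{MB/s}} = 2 \times 10^{4}\,\text{s} \approx 5.6 \text{ hours}$$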
Big Ideas behind MapReduce • Hide system-level details from the application developer • Writing distributed programs is difficult • Details span threads, processes, and machines • Code running concurrently is unpredictable • Deadlocks, race conditions, etc. • MR isolates the developer from system-level details • No locking, starvation, etc. • Well-defined interfaces (signatures below) • Separates the what (programmer) from the how (responsibility of the execution framework) • The framework is designed once and verified for correctness
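Concretely, the well-defined interface amounts to a pair of type signatures (roughly the notation the book uses in later chapters); everything else – scheduling, shuffling, fault tolerance – is the framework's responsibility:

```
map:    (k1, v1)         -> list of (k2, v2)
reduce: (k2, list of v2) -> list of (k3, v3)
```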
Big Ideas behind MapReduce • Seamless scalability • Given 2x the data, the algorithm should take at most 2x as long to run • Given a cluster 2x as large, it should take at most half as long • The above is generally unobtainable for algorithms • 9 women can't have a baby in 1 month • E.g., doubling the machines can make a program take longer, because a higher degree of parallelization increases communication (formalized below) • MR is a small step toward attaining this ideal • The algorithm stays fixed; the framework scales its execution • Use 10 machines for 10 hours, or 100 machines for 1 hour
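The communication caveat is the intuition behind Amdahl's law (the standard formalization, not cited on the slide): if a fraction s of an algorithm is inherently serial, then on n machines

$$\text{speedup}(n) = \frac{1}{s + (1 - s)/n} \le \frac{1}{s}$$

so with s = 0.1, even an unlimited number of machines yields at most a 10x speedup.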
Motivation for MapReduce • We are still waiting for parallel processing to replace sequential processing • While Moore's law held, most problems could be solved by a single computer, so parallelism was largely ignored • Around 2005, this was no longer true • The semiconductor industry ran out of opportunities to improve: faster clocks, deeper pipelines, superscalar architectures • Then came multi-core • Not matched by advances in software
Motivation • Parallel processing is the only way forward • MapReduce to the rescue • Anyone can download the open-source Hadoop implementation of MapReduce • Rent a cluster from a utility cloud • Process TBs within the week • Multiple cores in a chip, multiple machines in a cluster
Motivation • MapReduce: an effective data analysis tool • The first widely-adopted step away from the von Neumann model • We can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machine images communicating over a network • That's the wrong abstraction • MR organizes computations not over individual machines, but over clusters • The datacenter is the computer
Motivation • Previous models of parallel computation • PRAM • An arbitrary number of processors sharing an unboundedly large memory, operating synchronously on a shared input • LogP, BSP • MR is the most successful abstraction for large-scale computing resources • It manages complexity, hides details, and presents well-defined behavior • Makes certain tasks easier, others harder • MapReduce is the first in a new class of programming models