Mastering MapReduce: Big Data Processing Paradigm

A BigData Tour – HDFS, Ceph and MapReduce • These slides are possible thanks to these sources: Jonathan Dursi (SciNet Toronto), Hadoop Tutorial; Amir Payberah (SICS), Course in Data Intensive Computing; Yahoo! Developer Network, MapReduce Tutorial

MAP-REDUCE

What is it? Amir Payberah https://www.sics.se/~amir/dic.htm

MapReduce Basics • A programming model and its associated implementation for parallel processing of large data sets • It was developed within Google as a mechanism for processing large amounts of raw data, e.g. crawled documents or web request logs • Capable of efficiently distributing the processing of terabytes of data across thousands of processing nodes • This distribution implies parallel computing, since the same computations are performed on each CPU but on a different dataset (or a different segment of a large dataset) • The implementation's run-time system library takes care of parallelism, fault tolerance, data distribution, load balancing, etc. • Complementary to an RDBMS, but differs in many ways (data size, access, update, structure, integrity and scale) • Features: fault tolerance, locality, task granularity, backup tasks, skipping bad records and so on

MapReduce Simple Dataflow Amir Payberah https://www.sics.se/~amir/dic.htm

MapReduce – Functional Programming Concepts • MapReduce programs are designed to process large volumes of data in a parallel fashion • This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily • The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale • Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated • If a mapping task changes an input (key, value) pair, the change is not reflected back in the input files; communication occurs only by generating new output (key, value) pairs, which the Hadoop system then forwards into the next phase of execution https://developer.yahoo.com/hadoop/tutorial

MapReduce – List Processing • Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements • A MapReduce program will do this twice, using two different list processing idioms: map, and reduce • These terms are taken from several list processing languages such as LISP, Scheme, or ML https://developer.yahoo.com/hadoop/tutorial

MapReduce – Mapping Lists • The first phase of a MapReduce program is called mapping • A list of data elements is provided, one element at a time, to a function called the Mapper, which transforms each element individually into an output data element • Say there is a toUpper(str) function, which returns an uppercase version of the input string. The Map would then turn a list of input strings into a list of uppercase strings • Note: the input has not been modified. A new string has been returned https://developer.yahoo.com/hadoop/tutorial
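The toUpper example above can be sketched in a few lines of Python. This is only an illustration of the list-mapping idiom on one machine, not Hadoop code; the name `to_upper` and the sample strings are made up for the example.

```python
def to_upper(s):
    """Return an uppercase copy of s; the input string itself is not changed."""
    return s.upper()


inputs = ["hadoop", "hdfs", "ceph"]

# map() applies the function to each element independently,
# producing a new list and leaving the input list untouched.
outputs = list(map(to_upper, inputs))

print(outputs)  # ['HADOOP', 'HDFS', 'CEPH']
print(inputs)   # ['hadoop', 'hdfs', 'ceph']
```

Because each element is transformed independently, the calls could run on different machines without any coordination.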

MapReduce – Reducing Lists • Reducing lets you aggregate values together • A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. • Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself. For example, "+" can be used as a reducing function, to return the sum of a list of input values. https://developer.yahoo.com/hadoop/tutorial
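As a small sketch of the reducing idiom, the snippet below uses "+" as the reducing function over an iterator of values. The numbers are arbitrary sample values chosen for the example.

```python
from functools import reduce
from operator import add

# The reducer receives an iterator over its input values.
values = iter([65, 50, 40, 25])

# Fold the values together with "+", starting from 0.
total = reduce(add, values, 0)

print(total)  # 180
```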

Putting Map and Reduce together • A MapReduce program has two components: one that implements the mapper, and another that implements the reducer • Keys and values: in MapReduce, no value stands on its own. Every value has a key associated with it, and keys identify related values. E.g. the list below is for flight departures and the number of passengers that failed to board • The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow.
EK123   65, 12:00pm
BA789   50, 12:02pm
EK123   40, 12:05pm
QF456   25, 12:15pm
...
https://developer.yahoo.com/hadoop/tutorial

MapReduce - Keys • In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs • Keys divide the reduce space: a reducing function turns a large list of values into one (or a few) output values • In MapReduce, the output values are usually not all reduced together • Instead, all of the values with the same key are presented to a single reducer together • This is performed independently of any reduce operations occurring on other lists of values, with different keys attached • Figure: different colors represent different keys. All values with the same key are presented to a single reduce task. https://developer.yahoo.com/hadoop/tutorial
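To make "keys divide the reduce space" concrete, here is a minimal single-machine sketch that groups values by key and reduces each group independently. It reuses the flight numbers and passenger counts from the earlier example list; the departure times are dropped for brevity, and the choice of summing per flight is an assumption made for illustration.

```python
from collections import defaultdict

# (key, value) pairs: flight number -> passengers that failed to board.
records = [("EK123", 65), ("BA789", 50), ("EK123", 40), ("QF456", 25)]

# Group all values that share a key, as the framework does before reducing.
grouped = defaultdict(list)
for flight, passengers in records:
    grouped[flight].append(passengers)

# Each key's list is reduced on its own, independently of the other keys.
totals = {flight: sum(values) for flight, values in grouped.items()}

print(totals)  # {'EK123': 105, 'BA789': 50, 'QF456': 25}
```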

Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

High-Level Structure of a MR Program – 1/2

mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)

https://developer.yahoo.com/hadoop/tutorial

High-Level Structure of a MR Program – 2/2 • Several instances of the mapper function are created on the different machines in a Hadoop cluster • Each instance receives a different input file (it is assumed that there are many such files) • The mappers output (word, 1) pairs which are then forwarded to the reducers • Several instances of the reducer method are also instantiated on the different machines • Each reducer is responsible for processing the list of values associated with a different word • The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word. The reducer then emits the final (word, count) output which is written to an output file. https://developer.yahoo.com/hadoop/tutorial
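The word-count pseudocode can be turned into a runnable single-process sketch. This is not the Hadoop API: `run_job` is a hypothetical stand-in for the framework that groups mapper output by key, and the file names and contents are invented sample data.

```python
from collections import defaultdict


def mapper(filename, contents):
    """Emit a (word, 1) pair for every word in the file."""
    for word in contents.split():
        yield (word, 1)


def reducer(word, values):
    """Sum the list of 1's for a single word."""
    return (word, sum(values))


def run_job(files):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    intermediate = defaultdict(list)
    for name, text in files.items():
        for word, one in mapper(name, text):
            intermediate[word].append(one)
    # Keys are processed in sorted order, one reducer call per key.
    return dict(reducer(w, vs) for w, vs in sorted(intermediate.items()))


files = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = run_job(files)
print(counts["the"])  # 3
```

In a real cluster the mapper calls run on different machines and the grouping step is the shuffle; only the two user functions would carry over.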

Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

MapReduce Data Flow – 1/4 https://developer.yahoo.com/hadoop/tutorial

MapReduce Data Flow – 2/4 • MapReduce inputs typically come from input files loaded onto the Hadoop cluster's distributed file system (HDFS) • These files are evenly distributed across all nodes • Running a MapReduce program involves running mapping tasks on many or all of the nodes in the Hadoop cluster • Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them • Thus, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them https://developer.yahoo.com/hadoop/tutorial

MapReduce Data Flow – 3/4 • When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines • All values with the same key are sent to a single reducer https://developer.yahoo.com/hadoop/tutorial

MapReduce Data Flow – 4/4 • The reduce tasks are spread across the same nodes in the cluster as the mappers. Exchanging the intermediate (key, value) pairs between them is the only communication step in MapReduce • The user never explicitly marshals information from one machine to another • All data transfer is handled by the Hadoop MapReduce platform runtime, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability • If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g. communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully https://developer.yahoo.com/hadoop/tutorial

In Depth View of Map-Reduce

https://developer.yahoo.com/hadoop/tutorial

InputFormat – 1/2 • Input files reside on HDFS and can be of an arbitrary format • How these files are split up and read is decided by the InputFormat class. It: • Selects the files or other objects that should be used for input • Defines the InputSplits that break a file into tasks • Provides a factory for RecordReader objects that read the file https://developer.yahoo.com/hadoop/tutorial

InputFormat – 2/2 • An InputSplit is the unit of work that makes up a single map task • By default this is a 64 MB chunk, matching the default HDFS block size in Hadoop 1.x (later releases default to 128 MB). Since a file is made up of several blocks, Map tasks can be run in parallel on these chunks https://developer.yahoo.com/hadoop/tutorial
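A rough way to see how split size drives parallelism is to count the splits for a given file size. The sketch below is deliberately simplified: it assumes one split per 64 MB chunk and ignores the other parameters the real Hadoop split computation takes into account. `num_splits` is a hypothetical helper, not a Hadoop function.

```python
SPLIT_SIZE = 64 * 1024**2  # 64 MB, the default chunk size quoted above


def num_splits(file_size_bytes, split_size=SPLIT_SIZE):
    """Number of chunks (and so map tasks) needed to cover the file."""
    # Integer ceiling division: the last split may be smaller than split_size.
    return -(-file_size_bytes // split_size)


print(num_splits(1024**3))        # 1 GB file   -> 16
print(num_splits(200 * 1024**2))  # 200 MB file -> 4
```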

RecordReader and Mapper • An InputSplit defines a unit of work • The RecordReader class defines how to load the data and convert it into (key, value) pairs that the Map phase can use • The Mapper function does the Map, emitting (key, value) pairs for use by the Reduce phase https://developer.yahoo.com/hadoop/tutorial

Partition and Shuffle • On completion of the first batch of Map tasks, nodes begin sending their Map outputs to the Reducers; this is called the Shuffle phase • Each reducer is given a different subset of the key space (called a partition) by the Partitioner class • These (key, value) pairs are then the inputs for the Reduce phase https://developer.yahoo.com/hadoop/tutorial

Sort, Reduce and Output • Intermediate (key, value) pairs from the Shuffle process are then sorted as input to the Reducer • The Reducers iterate over all their values and produce an output • The outputs are then written back to HDFS https://developer.yahoo.com/hadoop/tutorial

Handling Failure • Worker failure • To detect failure, the master pings every worker periodically • If no response is received from a worker within a certain amount of time, the master marks the worker as failed • Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers • Completed map tasks are re-executed because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible • Completed reduce tasks do not need to be re-executed since their output is stored in a global file system • Master failure • Periodic checkpoints are written to handle master failure • If the master task dies, a new copy can be started from the last checkpoint state
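The worker-failure rules can be sketched as a small piece of master-side bookkeeping. Everything here is hypothetical (the `Task` record, the `TIMEOUT` value, the `handle_failures` helper); it only illustrates which tasks go back to idle. Resetting in-progress tasks on a failed worker is not stated on the slide and is added here as an assumption, following the original MapReduce design.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT = 10.0  # hypothetical: seconds without a ping response before failure


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in_progress" or "completed"
    worker: Optional[str] = None  # worker the task was assigned to


def handle_failures(last_seen, tasks, now):
    """Mark silent workers as failed and reset the tasks that must be redone."""
    failed = {w for w, t in last_seen.items() if now - t > TIMEOUT}
    for task in tasks:
        if task.worker not in failed:
            continue
        lost_map_output = task.kind == "map" and task.state == "completed"
        # Completed map output lives on the failed worker's local disk, so it
        # is lost; completed reduce output is already in the global file system.
        if lost_map_output or task.state == "in_progress":
            task.state, task.worker = "idle", None
    return failed


tasks = [
    Task("map", "completed", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "in_progress", "w2"),
    Task("reduce", "in_progress", "w1"),
]
failed = handle_failures({"w1": 0.0, "w2": 95.0}, tasks, now=100.0)
print(failed, [t.state for t in tasks])
```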

Data Locality • Network bandwidth is a valuable, scarce resource and it should be consumed wisely • The distributed file system replicates data across different nodes • The Master takes these locations into account when scheduling Map tasks, trying to place them on a machine that holds the data • Failing that, Map tasks are scheduled "near" a replica of the data (e.g., on a worker machine that is on the same network switch) • When running large MapReduce operations, most input data is read locally and consumes no network bandwidth • Data locality works well with a Hadoop-specific distributed file system • Integrating a Cloud-based file system incurs extra cost and a loss of data locality

Task Granularity • Fine-grained tasks: many more map tasks than machines • Better dynamic load balancing • Minimizes time for fault recovery • Can pipeline the shuffling/grouping while maps are still running • Typically 200k Map tasks and 5k Reduce tasks for 2k hosts • For M map tasks and R reduce tasks, the master makes O(M+R) scheduling decisions and keeps O(M*R) pieces of state
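Plugging the typical figures into the O(M+R) and O(M*R) bounds gives a feel for the master's workload. The arithmetic below is only a back-of-the-envelope sketch; treating each (map task, reduce task) pair as one unit of state is an assumption made for the example.

```python
M = 200_000    # map tasks
R = 5_000      # reduce tasks
HOSTS = 2_000  # worker machines

scheduling_decisions = M + R   # one decision per task
state_entries = M * R          # one entry per (map task, reduce task) pair
map_tasks_per_host = M // HOSTS

print(scheduling_decisions)  # 205000
print(state_entries)         # 1000000000
print(map_tasks_per_host)    # 100
```

With about 100 map tasks per host, a failed or slow machine only delays a small slice of the work, which is what makes the dynamic load balancing and fast recovery possible.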

Load Balancing • Built-in dynamic load balancing • One other problem that can slow calculations is the existence of stragglers: machines that run slowly because of hardware defects, contention for resources with other applications, etc. • When an overall MapReduce operation passes some point deemed to be "nearly complete," the Master schedules backup tasks for all of the currently in-progress tasks • When a particular task is completed, whether by the "original" or the backup execution, its result is used • This strategy costs little more overall, but can result in big performance gains
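The effect of backup tasks on a straggler can be illustrated with a toy timing model. All numbers are invented, and `finish_time` is a hypothetical helper: it assumes a backup copy is launched at a fixed time for any task still running and that whichever copy finishes first wins.

```python
def finish_time(original, backup_start, backup_duration):
    """A task is done when either the original or its backup completes."""
    return min(original, backup_start + backup_duration)


# Running times of four tasks; the last one sits on a straggler machine.
durations = [10, 11, 12, 60]

# Without backups the job waits for the slowest task.
without_backup = max(durations)

# With backups launched at t=12 that each take 10 time units.
with_backup = max(
    finish_time(d, backup_start=12, backup_duration=10) for d in durations
)

print(without_backup)  # 60
print(with_backup)     # 22
```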

Refinements • Partitioning function • MapReduce users specify the number of reduce tasks/output files (R) • Data gets partitioned across these tasks using a partitioning function on the intermediate key • The default is "hash(key) mod R", which results in well-balanced partitions • A special partitioning function can also be supplied, such as "hash(Hostname(urlkey)) mod R", so that all URLs (output keys) from the same host end up in the same output file • Ordering guarantees • Within a given partition, the intermediate key/value pairs are processed in increasing key order • This ordering guarantee makes it easy to generate a sorted output file per partition • Allows users to have sorted output and efficient random-access lookups by key
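Both partitioning functions can be sketched in Python. The function names and sample URLs are hypothetical, and `zlib.crc32` is used as the hash only because Python's built-in `hash()` for strings changes between runs; any stable hash would do.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key, r):
    """hash(key) mod R: spreads keys roughly evenly over R reduce tasks."""
    return zlib.crc32(key.encode("utf-8")) % r


def host_partition(url_key, r):
    """hash(Hostname(urlkey)) mod R: all URLs of one host share a partition."""
    host = urlparse(url_key).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % r


R = 4
urls = [
    "http://example.com/b",
    "http://example.org/x",
    "http://example.com/a",
]

# Bucket keys by partition, then sort inside each partition to mirror the
# increasing-key-order guarantee.
partitions = {}
for url in urls:
    partitions.setdefault(host_partition(url, R), []).append(url)
for keys in partitions.values():
    keys.sort()

print(partitions)
```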

Refinements (Cont’d) • Combiner function • Useful when there is significant repetition in the intermediate keys produced by each map task and the reduce function is commutative and associative • While one reduce task can perform the whole aggregation, a combiner function running on the map node can partially merge the Map output locally before it is sent over the network • The combiner function is executed on each machine that performs a map task • The program logic for the combiner function and the reduce function is typically the same; the difference is how the output is handled, i.e. written to an intermediate file rather than to the final output file • Input/Output types • Multiple input/output formats are supported • Users can also add support for a new input/output type by providing an implementation of the reader/writer interface
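A combiner for word count can be sketched as a local pre-aggregation step over one map task's output. The names `map_words` and `combine` and the sample sentence are hypothetical; the point is only that fewer pairs cross the network.

```python
from collections import Counter


def map_words(text):
    """Map step: one (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: partially sum counts locally, before the shuffle."""
    partial = Counter()
    for word, n in pairs:
        partial[word] += n
    return sorted(partial.items())


map_output = map_words("to be or not to be")
combined = combine(map_output)

print(len(map_output), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

This works because addition is commutative and associative: the reducer gets the same final sums whether it receives six 1's or four partial counts.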

Refinements (Cont’d) • Skipping bad records • MapReduce provides a mode for skipping records that are diagnosed to cause Map() crashes • Each worker process installs a signal handler that catches segmentation violations and bus errors, which are tracked by the master • When the master notices more than one failure on a particular record, it indicates that the record should be skipped during re-execution • Local execution/debugging • Debugging is not straightforward because of the distributed nature of MapReduce computation • An alternative implementation of the MapReduce library executes all the work sequentially on one node (the local machine) • Users can then use any debugging or testing tools they find useful

Refinements (Cont’d) • Status information • The Master runs an internal HTTP server that produces status pages with information on how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, and processing rates • The status page contains links to the standard error and standard output files generated by each task • A user can monitor progress, predict computation time and accelerate it by adding more hosts • Counters • A facility to count occurrences of various events • To use this facility, user code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function
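The counter facility can be mimicked on one machine with a shared `Counter` object. This is a sketch of the idea only, not Hadoop's counter API; the counter name `uppercase_words` and the sample line are made up for the example.

```python
from collections import Counter

# Named counters; in a real job the framework aggregates these across tasks.
counters = Counter()


def mapper(line):
    """Emit (word, 1) pairs and count fully uppercase words as a side statistic."""
    for word in line.split():
        if word.isupper():
            counters["uppercase_words"] += 1
        yield (word.lower(), 1)


pairs = list(mapper("Hello WORLD from HDFS"))

print(counters["uppercase_words"])  # 2
```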

MapReduce Applications • Applications • Text tokenization (alert system), indexing, and search • Data mining, statistical modeling, and machine learning • Healthcare – parse, clean and reconcile extremely large amounts of data • Biosciences – drug discovery, meta-genomics, bioassay activities • Cost-effective mash-ups – retrieving and analyzing biomedical knowledge • Computational biology – parallelize bioinformatics algorithms for SNP discovery, genotyping and personal genomics, e.g. CloudBurst • Emergency response – real-time monitoring/forecasting for operational decision support • and so on (Check: http://wiki.apache.org/hadoop/PoweredBy) • MapReduce inapplicability • Database management – does not provide traditional DBMS features • Database implementation – lack of schema, low data integrity • Normalization poses problems for MapReduce, due to non-local reading • Applications that need to read and write the same data many times are a poor fit

How Hadoop Runs a MapReduce job • The client submits the MapReduce job • The JobTracker coordinates the job run • TaskTrackers run the tasks that the job has been split into • HDFS is used for file storage Hadoop: The Definitive Guide, O’Reilly

Streaming and Pipes • Hadoop Streaming is an API to MapReduce for writing map and reduce functions in languages other than Java • Hadoop and the user program communicate using standard I/O streams • Hadoop Pipes is the C++ interface to MapReduce • It uses a socket as the channel to communicate with the process running the C++ Map or Reduce function Hadoop: The Definitive Guide, O’Reilly
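Since Streaming talks to the user program through standard I/O, a mapper and a reducer are just programs that read lines and write tab-separated key/value lines. The sketch below shows word count in that style; to keep it testable it takes any iterable of lines, whereas in a real Streaming job the mapper and reducer would be separate scripts reading `sys.stdin` and printing to standard output. The function names are hypothetical.

```python
from itertools import groupby


def map_lines(lines):
    """Streaming-style mapper: one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Streaming-style reducer: input arrives sorted, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


mapped = list(map_lines(["the quick fox\n", "the end\n"]))
# sorted() stands in for the framework's sort between the two phases.
reduced = list(reduce_lines(sorted(mapped)))

print(reduced)  # ['end\t1', 'fox\t1', 'quick\t1', 'the\t2']
```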

Progress and Status Updates • Operations constituting progress • Reading an input record • Writing an output record • Setting the status description • Incrementing a counter • Calling the progress() method Hadoop: The Definitive Guide, O’Reilly

Hadoop Failures • Task failure • A map or reduce task throws a runtime exception • For streaming tasks, a streaming process that exits with a non-zero exit code is considered failed • Tasks can also be killed and re-scheduled • Tasktracker failure • A crash or slow execution can cause a tasktracker to send heartbeats to the jobtracker infrequently, or to stop sending them altogether • A tasktracker can also be blacklisted by the jobtracker if it fails a significant number of tasks, i.e. its task failure rate is higher than the average • Jobtracker failure • Single point of failure, with no mechanism to deal with it • One solution is to run multiple jobtrackers or to keep a backup jobtracker

Checkpointing in Hadoop Hadoop: The Definitive Guide, O’Reilly

Job Scheduling in Hadoop • Started with FIFO scheduling and now comes with a choice of schedulers • The fair scheduler • Aims to give every user a fair share of the cluster capacity over time • Jobs are placed in pools, and by default each user gets their own pool • Supports preemption: capacity is taken from pools running over their share and given to pools running under it • The capacity scheduler • Takes a slightly different approach to multi-user scheduling • A cluster is made up of a number of queues, which may be hierarchical, and each queue has an allocated capacity • Within each queue, jobs are scheduled using FIFO scheduling, with priorities

MapReduce and HDFS Amir Payberah https://www.sics.se/~amir/dic.htm

Fault Tolerance Amir Payberah https://www.sics.se/~amir/dic.htm
