Distributed and Parallel Processing Technology Chapter 2. MapReduce Sun Jo
Introduction • MapReduce is a programming model for data processing. • Hadoop can run MapReduce programs written in various languages. • We shall look at the same program expressed in Java, Ruby, Python, and C++.
A Weather Dataset • Program that mines weather data • Weather sensors collect data every hour at many locations across the globe • They gather a large volume of log data, which is a good candidate for analysis with MapReduce • Data Format • Data from the National Climatic Data Center (NCDC) • Stored using a line-oriented ASCII format, in which each line is a record
A Weather Dataset • Data Format • Data files are organized by date and weather station. • There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. • The whole dataset is made up of a large number of relatively small files, since there are tens of thousands of weather stations. • The data was preprocessed so that each year’s readings were concatenated into a single file.
Analyzing the Data with Unix Tools • What’s the highest recorded global temperature for each year in the dataset? • A Unix shell script using awk, the classic tool for processing line-oriented data • Beginning of a run • The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large Instance. • The script loops through the compressed year files, printing the year and then processing each file using awk. • Awk extracts the air temperature and the quality code from the data. • A temperature value of 9999 signifies a missing value in the NCDC dataset. • The maximum temperature for 1901 is 31.7℃.
Analyzing the Data with Unix Tools • To speed up the processing, run parts of the program in parallel • Problems for parallel processing • Dividing the work into equal-size pieces isn’t always easy or obvious. • The file size for different years varies • If each process handles one year, the whole run is dominated by the longest file • A better approach is to split the input into fixed-size chunks and assign each chunk to a process • Combining the results from independent processes may need further processing. • A single machine still limits the processing capacity, and using multiple machines means handling coordination and reliability. • It’s feasible to parallelize the processing, but it’s messy in practice.
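The chunk-and-combine approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are already parsed into (year, temperature) pairs, the names `max_by_year`, `chunked`, and `parallel_max_by_year` are hypothetical, and threads are used purely to show the structure (in CPython they would not give real CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def max_by_year(records):
    """Return {year: max temperature} for an iterable of (year, temp) pairs."""
    best = {}
    for year, temp in records:
        if year not in best or temp > best[year]:
            best[year] = temp
    return best


def chunked(records, chunk_size):
    """Split a list into consecutive fixed-size chunks (the last may be shorter)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def parallel_max_by_year(records, chunk_size=2, workers=4):
    """Process each chunk independently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(max_by_year, chunked(records, chunk_size)))
    # A year can be spread over several chunks, so the partial maxima
    # need one more pass to be merged into the final answer.
    merged = {}
    for partial in partials:
        for year, temp in partial.items():
            merged[year] = max(temp, merged.get(year, temp))
    return merged
```

The final merging loop is the "further processing" the slide mentions: fixed-size chunks balance the load, but they no longer line up with years, so per-chunk results cannot simply be concatenated.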
Analyzing the Data with Hadoop – Map and Reduce • Map and Reduce • MapReduce works by breaking the processing into 2 phases: the map phase and the reduce phase. • Both phases have key-value pairs as input and output. • Programmers have to specify two functions: a map function and a reduce function. • The input to the map phase is the raw NCDC data. • Here, the key is the offset of the beginning of the line within the file and the value is the line itself. • The map function pulls out the year and the air temperature from each input value. • The reduce function takes <year, temperature> pairs as input and produces the maximum temperature for each year as the result.
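As a rough sketch of the two functions, here is how they might look in Python. The column positions, the `+9999` missing-value marker with its sign, and the accepted quality codes are assumptions about the fixed-width NCDC layout that these slides do not spell out; `map_fn` and `reduce_fn` are hypothetical names.

```python
import re

MISSING = "+9999"  # assumed marker for a missing reading, sign included


def map_fn(offset, line):
    """Yield (year, temperature) for a valid record; yield nothing otherwise.

    The offset key is ignored. Column positions are assumptions about the
    fixed-width record layout.
    """
    year = line[15:19]
    temp = line[87:92]        # sign plus four digits
    quality = line[92:93]
    # Assumed set of quality codes that mark a trustworthy reading.
    if temp != MISSING and re.match("[01459]", quality):
        yield year, int(temp)


def reduce_fn(year, temperatures):
    """Yield the maximum temperature seen for one year."""
    yield year, max(temperatures)
```

Both functions are generators because a map call may emit zero pairs (a missing or suspect reading) and, in general, either function may emit several.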
Analyzing the Data with Hadoop – Map and Reduce • Original NCDC Format • Input file for the map function, stored in HDFS • Output of the map function, running in parallel for each block • Input for the reduce function & Output of the reduce function
Analyzing the Data with Hadoop – Map and Reduce • The whole data flow • [Figure: Input File → Map() → Shuffling → Reduce()] • Map() emits <year, temperature> pairs, e.g. <1950, 0>, <1950, 22>, <1950, -11>, <1949, 111>, <1949, 78> • Shuffling groups the values by year, e.g. <1949, [111, 78]>, <1950, [0, 22, -11]> • Reduce() outputs the maximum for each year, e.g. <1949, 111>, <1950, 22>
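The data flow can be imitated in memory to make the shuffling step concrete. This is a minimal sketch, not how Hadoop implements it: `shuffle` and `run_job` are hypothetical names, and the sample pairs are the 1949 and 1950 values from the figure.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group map output values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def run_job(map_output):
    """Apply a max-reduce to every (year, [temperatures]) group."""
    return [(year, max(temps)) for year, temps in shuffle(map_output)]


map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]
print(shuffle(map_output))  # [(1949, [111, 78]), (1950, [0, 22, -11])]
print(run_job(map_output))  # [(1949, 111), (1950, 22)]
```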
Analyzing the Data with Hadoop – Java MapReduce • Having run through how the MapReduce program works, express it in code • A map function, a reduce function, and some code to run the job are needed. • Map function
Analyzing the Data with Hadoop – Java MapReduce • Reduce function
Analyzing the Data with Hadoop – Java MapReduce • Main function for running the MapReduce job
Analyzing the Data with Hadoop – Java MapReduce • A test run • The output is written to the output directory, which contains one output file per reducer
Analyzing the Data with Hadoop – Java MapReduce • The new Java MapReduce API • The new API, sometimes referred to as “Context Objects”, is type-incompatible with the old one, so applications need to be rewritten to take advantage of it. • Notable differences • Favors abstract classes over interfaces. Mapper and Reducer, which were interfaces in the old API, are abstract classes in the new one. • The new API is in the org.apache.hadoop.mapreduce package and subpackages. • The old API can still be found in org.apache.hadoop.mapred • Makes extensive use of context objects that allow the user code to communicate with the MapReduce system • e.g., the MapContext unifies the roles of the JobConf, the OutputCollector, and the Reporter • Supports both a ‘push’ and a ‘pull’ style of iteration • In both APIs, key-value record pairs are pushed to the mapper; in addition, the new API allows a mapper to pull records from within the map() method. • The same goes for the reducer • Configuration has been unified. • The old API has a JobConf object for job configuration, which is an extension of Hadoop’s vanilla Configuration object. • In the new API, job configuration is done through a Configuration. • Job control is performed through the Job class rather than JobClient. • Output files are named slightly differently • part-m-nnnnn for map outputs, part-r-nnnnn for reduce outputs • (nnnnn is an integer designating the part number, starting from 0)
Analyzing the Data with Hadoop – Java MapReduce • The new Java MapReduce API • Example 2-6 shows the MaxTemperature application rewritten to use the new API.
Scaling Out • To scale out, we need to store the data in a distributed filesystem, HDFS. • Hadoop moves the MapReduce computation to each machine hosting a part of the data. • Data Flow • A MapReduce job consists of the input data, the MapReduce program, and configuration information. • Hadoop runs the job by dividing it into 2 types of tasks, map tasks and reduce tasks. • Two types of nodes: 1 jobtracker and several tasktrackers • Jobtracker: coordinates the job and schedules tasks to run on tasktrackers. • Tasktrackers: run tasks and send progress reports to the jobtracker. • Hadoop divides the input into fixed-size pieces, called input splits, or just splits. • Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. • The quality of the load balancing increases as the splits become more fine-grained. • Default size: 1 HDFS block, 64 MB • Map tasks write their output to the local disk, not to HDFS. • If the node running a map task fails, Hadoop will automatically rerun the map task on another node to re-create the map output.
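To get a feel for how many map tasks a file produces, the split arithmetic can be sketched as below. This is a simplified model that assumes a plain cut every 64 MB; Hadoop's actual split computation has additional rules, and `input_splits` is a hypothetical helper.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default split size quoted above


def input_splits(file_size, split_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]


MB = 1024 * 1024
for offset, length in input_splits(150 * MB):
    print(offset // MB, length // MB)  # 0 64, then 64 64, then 128 22
```

Under this model a 150 MB file yields three splits and therefore three map tasks, the last one processing only 22 MB.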
Scaling Out • Data Flow – single reduce task • Reduce tasks don’t have the advantage of data locality: the input to a single reduce task is normally the output from all mappers. • The map outputs are transferred across the network to the node running the reduce task, where they are merged and passed to the user-defined reduce function. • The output of the reduce is normally stored in HDFS.
Scaling Out • Data Flow – multiple reduce tasks • The number of reduce tasks is specified independently; it is not governed by the size of the input. • The map tasks partition their output by key, each creating one partition for each reduce task. • There can be many keys and their associated values in each partition, but the records for any given key are all in a single partition.
Scaling Out • Data Flow – zero reduce tasks
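The partitioning rule can be illustrated with a hash taken modulo the number of reduce tasks. The sketch below uses CRC32 only as a deterministic stand-in hash; it is not Hadoop's own partitioner, and `partition` and `partition_map_output` are hypothetical names.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in range(num_reducers), deterministically."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def partition_map_output(pairs, num_reducers):
    """Split one map task's output into one bucket per reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the bucket index depends only on the key, every map task sends a given year to the same reduce task, which is what guarantees that all records for one key end up in a single partition.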
Scaling Out • Combiner Functions • Many MapReduce jobs are limited by the bandwidth available on the cluster. • It pays to minimize the data transferred between map and reduce tasks. • Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function’s output forms the input to the reduce function. • The contract for the combiner function constrains the type of function that may be used. • Example without a combiner function • First Map() output: <1950, 0>, <1950, 20>, <1950, 10> • Second Map() output: <1950, 25>, <1950, 15> • After shuffling: <1950, [0, 20, 10, 25, 15]> • Reduce() output: <1950, 25> • Example with a combiner function, finding the maximum temperature in each map’s output • First Map() output: <1950, 0>, <1950, 20>, <1950, 10>, combined to <1950, 20> • Second Map() output: <1950, 25>, <1950, 15>, combined to <1950, 25> • After shuffling: <1950, [20, 25]> • Reduce() output: <1950, 25>
Scaling Out • Combiner Functions • The function calls on the temperature values can be expressed as follows: • max(0, 20, 10, 25, 15) = max( max(0, 20, 10), max(25, 15) ) = max(20, 25) = 25 • Not every function has this property: when calculating mean temperatures, the mean cannot be used as the combiner function • mean(0, 20, 10, 25, 15) = 14 • mean( mean(0, 20, 10), mean(25, 15) ) = mean(10, 20) = 15 • The combiner function doesn’t replace the reduce function. • It can help cut down the amount of data shuffled between the maps and the reduces
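The arithmetic above is easy to check directly. The snippet below is just that check, using the slide's two groups of readings; `mean` is a small helper defined here rather than a library call.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


map1 = [0, 20, 10]   # 1950 readings emitted by the first map
map2 = [25, 15]      # 1950 readings emitted by the second map

# max can be applied per map and then again to the partial results.
assert max(map1 + map2) == max(max(map1), max(map2)) == 25

# mean cannot: the mean of the per-map means weights both maps equally,
# even though they hold different numbers of readings.
print(mean(map1 + map2))               # 14.0
print(mean([mean(map1), mean(map2)]))  # 15.0
```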
Scaling Out • Combiner Functions • Specifying a combiner function • The combiner function is defined using the Reducer interface • It is the same implementation as the reducer function in MaxTemperatureReducer. • The only change is to set the combiner class on the JobConf.
Hadoop Streaming • Hadoop provides an API to MapReduce that allows you to write the map and reduce functions in languages other than Java. • Any language that can read standard input and write to standard output can be used to write a MapReduce program. • Hadoop Streaming • Map input data is passed over standard input to your map function. • The map function processes the data line by line and writes lines to standard output. • A map output key-value pair is written as a single tab-delimited line. • The reduce function reads lines from standard input (sorted by key) and writes its results to standard output.
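The line protocol above shapes how a Streaming reducer is written: because the input arrives sorted by key, it only needs to remember the current key and its running result. The following is a hedged sketch of that pattern for the maximum-temperature job, with hypothetical names (`stream_reduce`, `simulate_sort_and_reduce`); a real script would iterate over `sys.stdin` and print each result line.

```python
def stream_reduce(lines):
    """Read key-sorted 'key<TAB>value' lines and yield 'key<TAB>max' lines."""
    last_key, best = None, None
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != last_key:
            # A new key starts: flush the result for the previous one.
            if last_key is not None:
                yield f"{last_key}\t{best}"
            last_key, best = key, int(value)
        else:
            best = max(best, int(value))
    if last_key is not None:
        yield f"{last_key}\t{best}"


def simulate_sort_and_reduce(map_output_lines):
    """Stand in for the sort between map and reduce, then run the reducer."""
    by_key = sorted(map_output_lines, key=lambda line: line.split("\t")[0])
    return list(stream_reduce(by_key))
```

Sorting the map output by key before calling the reducer plays the role of the shuffle, so the same function can be tested locally without a cluster.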
Hadoop Streaming • Ruby • The map function can be expressed in Ruby. • Simulating the map function in Ruby with a Unix pipeline • The reduce function for maximum temperature in Ruby
Hadoop Streaming • Ruby • Simulating the whole MapReduce pipeline with a Unix pipeline • Hadoop command to run the whole MapReduce job • Using a combiner, which can be coded in any Streaming language
Hadoop Streaming • Python • Streaming supports any programming language that can read from standard input and write to standard output. • The map and reduce script in Python • Test the programs and run the job in the same way we did in Ruby.
Hadoop Pipes • Hadoop Pipes • The name of the C++ interface to Hadoop MapReduce. • Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. • The source code for the map and reduce functions in C++
Hadoop Pipes • The source code for the map and reduce functions in C++
Hadoop Pipes • Compiling and Running • Makefile for the C++ MapReduce program • PLATFORM must be defined; it specifies the operating system, architecture, and data model (e.g., 32- or 64-bit). • To run a Pipes job, we need to run Hadoop (the daemons) in pseudo-distributed mode. • The next step is to copy the executable (program) to HDFS. • Next, the sample data is copied from the local filesystem to HDFS.
Hadoop Pipes • Compiling and Running • Now, we can run the job. For this, we use the hadoop pipes command, passing the URI of the executable in HDFS using the -program argument: