This summary provides an overview of MapReduce and the Hadoop API, including implementation details and an example word count program. It explains the map and reduce functions, their implementation, and the concept of fault tolerance. The Hadoop framework and its use in conjunction with the Google File System are also discussed.
A Summary of: MapReduce & Hadoop API Google MapReduce Framework Slides prepared by Peter Erickson (poe9514@rit.edu)
What is MapReduce? Massively parallel processing of very large data sets (larger than 1 TB). Parallelize the computations over hundreds or thousands of CPUs. Ensure fault tolerance. Do this all through an easy-to-use, abstract and reusable framework.
What is MapReduce? A simple algorithm that can be easily parallelized and efficiently handles large data sets: MAP – organize the data into (key, value) pairs, then REDUCE – perform a reduction across the n maps (nodes). A simple, clean abstraction for programmers.
Implementation Programmers need only implement two functions:
    map( Object key, Object value ) -> Map<Object, Object>
    reduce( Object key, List<Object> values ) -> Map<Object, Object>
Let's look at an example to better understand these functions.
Word Count Example Program Given a document of words, count the occurrences of each word, e.g. (the:42), (is:10), (a:23), (computer:2), etc. How would you perform this task sequentially? A Map<Word, Integer> maps each word to its number of occurrences in the document (see the sketch below). How would you perform this task in parallel? Split the document amongst n processors, map the words in the same way on each, then reduce across all nodes.
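For illustration only (this sketch is not from the original slides), a sequential word count in plain Java might look like this:

    import java.util.HashMap;
    import java.util.Map;

    public class SequentialWordCount {
        public static Map<String, Integer> count( String document ) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            // split the document on whitespace and tally each word
            for ( String word : document.split( "\\s+" ) ) {
                Integer current = counts.get( word );
                counts.put( word, current == null ? 1 : current + 1 );
            }
            return counts;
        }
    }

The parallel version described on the following slides splits this same tallying work across nodes and merges the partial counts in a reduce step.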
Word Count: Map function Map is performed on each node, using a subset of the overall data.
// input_key: document name
// input_value: document contents
Map<String, Integer> map( String input_key, String input_value ) {
    // iterate over all words (.split( "\\s+" ) splits the string on whitespace)
    for ( String word : input_value.split( "\\s+" ) ) {
        // insert a 1 into the map for this word (there will be collisions)
        output.put( word, 1 );
    }
}
Word Count: Map function
// input_key: document name
// input_value: document contents
map( String input_key, String input_value ) {
    // iterate over all words (.split( "\\s+" ) splits the string on whitespace)
    for ( String word : input_value.split( "\\s+" ) ) {
        // insert a 1 into the map for this word (there will be collisions)
        output.put( word, 1 );
    }
}
- Collisions in the map are IMPORTANT.
- Values with the same key are passed to the Reduce Function.
- The Reduce Function decides how to merge these collisions.
Reduction Phase Values in the map are merged in some way using the Reduce Function. The Reduce Function can add, subtract, multiply, divide, take the average, ignore all data, or do anything else the programmer chooses. reduce() is passed a key and a list of values, all of which share that same key. reduce() is expected to return a new Map with the reduced values.
Reduction Phase For example, if reduce() is passed: Key (word): "the", Values: { 1, 1, 1, 1, 1 }, it would probably add a new summed (key, value) pair to the output map: Key (word): "the", Value: 5
Reduction Phase Reduction can happen on any number of nodes before forming a final map:
    Node 1: ["the", {1, 1, 1}]    Node 2: ["the", {1, 1, 1, 1}]    Node 3: ["the", {1}]
    -- REDUCE (Nodes 1 and 2) --> ["the", {7}]                     ["the", {1}]
    -- REDUCE ------------------> Final Result: ["the", {8}]
Word Count: Reduce Function
// key: a word
// values: a list of collided values for this word
Map<String, Integer> reduce( String key, List<Integer> values ) {
    // sum the counts for this word
    int total = 0;
    // iterate over all integers in the collided value list
    for ( int count : values ) {
        total += count;
    }
    output.put( key, total );
}
Hadoop: Java MapReduce Implementation Hadoop is an open source project run by the Apache Software Foundation. It provides an API to write MapReduce programs easily and efficiently in Java. Installed on the Thug (Paranoia) cluster at RIT. Used in conjunction with a distributed file system modeled on the Google File System (HDFS). Hadoop has several frameworks for different parts of the MapReduce paradigm.
Hadoop API A Summary of: Input Formats, Record Readers, Mapping Function, Reducing Function, Output Formats, General Program Layout
Hadoop Input Formats Most Hadoop programs read their input from a file. The data from a Hadoop input file must be parsed into (key, value) pairs, even before the map() function. The key and value types from the input file are separate from the key and value types used for map and reduce. A very simple input format: TextInputFormat Key: the position (byte offset) of the line in the file Value: a line of text from the input file
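As an illustrative example (not from the original slides), a small input file containing the two lines "hello world" and "foo bar" would be handed to the mapper as pairs keyed by the byte offset of each line:

    ( 0,  "hello world" )
    ( 12, "foo bar" )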
Hadoop Data Types Special data types are used in Hadoop for the purpose of serialization between nodes and data comparison: IntWritable – integer type, LongWritable – long integer type, Text – string type, and many, many more. The WordCount program used the TextInputFormat, as we saw in the map() function:
public void map( LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter ) throws IOException {
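A quick illustration (not from the original slides) of how these wrapper types are used; the constructors and get()/set() accessors shown here are standard Hadoop API calls from org.apache.hadoop.io:

    IntWritable count = new IntWritable( 5 );   // wraps an int
    int n = count.get();                        // unwrap: n == 5
    count.set( 6 );                             // Writables are mutable and reusable
    Text word = new Text( "hello" );            // wraps a string
    String s = word.toString();                 // unwrap: "hello"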
Word Count Example Program
public void map( LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter ) throws IOException {
key = position (byte offset) of the line in the document
value = text from the document at that position
output = OutputCollector, a special Map-like object that allows collisions; output is written to the output collector
reporter = special Hadoop object for reporting progress; doesn't need to be used – for more advanced programs
(a sketch of a possible map() body follows below)
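A minimal sketch of what the body of this map() method might look like; it mirrors the standard Hadoop WordCount example and assumes the surrounding Mapper class plus imports of java.util.StringTokenizer and org.apache.hadoop.io.*:

    private final static IntWritable one = new IntWritable( 1 );
    private Text word = new Text();

    public void map( LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter ) throws IOException {
        // split the line into words and emit (word, 1) for each occurrence
        StringTokenizer itr = new StringTokenizer( value.toString() );
        while ( itr.hasMoreTokens() ) {
            word.set( itr.nextToken() );
            output.collect( word, one );
        }
    }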
Hadoop Input Format The InputFormat object of the program reads data on a single node from a file or another source. The InputFormat is then asked to partition the data into smaller sets for each node to process. Let's look at a new sample program that does not read any input from a file: sum all prime numbers between 1 and 1000.
Prime Number Example We could make a file that lists all numbers from 1 to 1000, but this is unnecessary: we can write our own InputFormat class to generate these numbers and divide them amongst the nodes: PrimeInputFormat. The InputFormat should generate the numbers 1 to 1000, then split them into n groups. InputFormats generate or read (key, value) pairs, so what should we use? There is no need for a key, we only have values (numbers), so we use a dummy key.
InputFormat Interface
// interface to be used with objects of type K and V
public interface InputFormat<K, V> {
    // validates the input for the specified job (can be ignored)
    public void validateInput( JobConf job ) throws IOException;
    // returns an array of "InputSplit" objects that are sent to each node;
    // the number of splits to be made is represented by numSplits
    public InputSplit[] getSplits( JobConf job, int numSplits ) throws IOException;
    // returns an "iterator" of sorts for a node to extract (key, value) pairs
    // from an InputSplit
    public RecordReader<K, V> getRecordReader( InputSplit split, JobConf job, Reporter reporter ) throws IOException;
}
InputSplit Interface
public interface InputSplit extends Writable {
    // get total bytes of data in this input split
    public long getLength() throws IOException;
    // get the hostnames of where the splits are (can be ignored)
    public String[] getLocations() throws IOException;
}

// represents an object that can be serialized/deserialized
public interface Writable {
    // read in the fields of this object from the DataInput object
    public void readFields( DataInput in ) throws IOException;
    // write the fields of this object to the DataOutput object
    public void write( DataOutput out ) throws IOException;
}
Prime Number InputSplit: RangeSplit An InputSplit for our program need only hold the range of numbers for each node (min, max).
public class RangeSplit implements InputSplit {
    int min, max;
    public RangeSplit() { super(); }
    public long getLength() { return ( long )( max - min ); }
    public String[] getLocations() { return new String[]{}; }
    public void write( DataOutput out ) throws IOException {
        out.writeInt( min );
        out.writeInt( max );
    }
    public void readFields( DataInput in ) throws IOException {
        min = in.readInt();
        max = in.readInt();
    }
}
PrimeInputFormat.getSplits() Create RangeSplit objects for our program:
public InputSplit[] getSplits( JobConf job, int numSplits ) throws IOException {
    RangeSplit[] splits = new RangeSplit[ numSplits ];
    // for simplicity's sake, we're going to assume 1000 is evenly divisible
    // by numSplits, but this may not always be the case
    int len = 1000 / numSplits;
    for ( int i = 0; i < numSplits; i++ ) {
        splits[ i ] = new RangeSplit();
        splits[ i ].min = ( i * len ) + 1;
        splits[ i ].max = ( i + 1 ) * len;
    } // for
    return splits;
} // getSplits
Record Reader An InputSplit for our program holds the range of numbers for each node (min, max), i.e. for 4 nodes: (1, 250), (251, 500), (501, 750), (751, 1000). A RecordReader is then responsible for generating (key, value) pairs from an InputSplit. Our RecordReader will iterate from min to max on each node. One RecordReader is used per Mapper.
RecordReader Interface
// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public interface RecordReader<K, V> {
    // creates a key/value for the record reader (for generics)
    public K createKey();
    public V createValue();
    // get the position of the iterator
    public long getPos();
    // get the progress of the iterator (0.0 to 1.0)
    public float getProgress();
    // populate key and value with the next tuple from the InputSplit;
    // returns true if a tuple was read, false when the split is exhausted
    public boolean next( K key, V value );
    // marks the end of use of the RecordReader
    public void close();
} // RecordReader
PrimeInputFormat.getRecordReader()
// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public RecordReader<Text, IntWritable> getRecordReader( InputSplit split, JobConf conf, Reporter reporter ) throws IOException {
    final RangeSplit range = ( RangeSplit )split;
    // return a new anonymous inner class
    return new RecordReader<Text, IntWritable>() {
        int pos = 0;
        public Text createKey() { return new Text(); }
        public IntWritable createValue() { return new IntWritable(); }
        public long getPos() { return pos; }
        public float getProgress() { return ( float )pos / ( float )( range.max - range.min + 1 ); }
        // continued ...
PrimeInputFormat.getRecordReader()
        // populate the next (key, value) pair and increment the position;
        // return false once the whole range has been read
        public boolean next( Text key, IntWritable value ) throws IOException {
            // get the number at this position
            int val = range.min + pos;
            if ( val > range.max ) {
                return false;
            }
            // dummy key value
            key.set( "key" );
            // set the number for the value
            value.set( val );
            // increment the position
            pos++;
            return true;
        }
        // close the RecordReader
        public void close() { }
    };
}
Prime Number Mapper Our program's Mapper now reads in a dummy key and a number. What should our new map output data types be? BooleanWritable = prime/not prime, IntWritable = the number. The Reducer can then add together all values with a "true" boolean key, and ignore all "false" values.
public static class PrimeMapper extends MapReduceBase implements Mapper<Text, IntWritable, BooleanWritable, IntWritable>
//                                                                ^ mapper input ^        ^ mapper output ^
Prime Sum Mapper
public static class PrimeMapper extends MapReduceBase implements Mapper<Text, IntWritable, BooleanWritable, IntWritable> {
    public void map( Text key, IntWritable value, OutputCollector<BooleanWritable, IntWritable> output, Reporter reporter ) throws IOException {
        // check if the number is prime – choose your favorite prime number
        // testing algorithm
        if ( isPrime( value.get() ) ) {
            output.collect( new BooleanWritable( true ), value );
        } else {
            output.collect( new BooleanWritable( false ), value );
        }
    }
} // PrimeMapper
Prime Number Reducer The Reducer will take multiple (boolean, int) pairs and reduce them to a single (text, int) pair: "Sum", int
public static class PrimeReducer extends MapReduceBase implements Reducer<BooleanWritable, IntWritable, Text, IntWritable>
//                                                                 ^ reducer input ^       ^ reducer output ^
Prime Sum Reducer
public static class PrimeReducer extends MapReduceBase implements Reducer<BooleanWritable, IntWritable, Text, IntWritable> {
    public void reduce( BooleanWritable key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter ) throws IOException {
        // ignore the "false" values
        if ( key.get() ) {
            // sum the values and write to the output collector
            int sum = 0;
            // Iterator is not Iterable, so step through it explicitly
            while ( values.hasNext() ) {
                sum += values.next().get();
            }
            output.collect( new Text( "Sum" ), new IntWritable( sum ) );
        }
    }
} // PrimeReducer
Output Formats Hadoop uses an OutputFormat just like an InputFormat. Easiest to use: TextOutputFormat Prints one key and value per line to an output file (separated by a tab by default). The file is written to the distributed file system (HDFS, Hadoop's counterpart to the Google File System).
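For illustration (reusing the counts from the earlier word count example and assuming the default tab separator), a TextOutputFormat output file would contain lines such as:

    a          23
    computer   2
    is         10
    the        42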
Program Layout Our example Prime Sum program:
public static void main( String[] args ) {
    JobConf job = new JobConf( PrimeSum.class );
    job.setJobName( "primesum" );
    job.setOutputPath( new Path( "primesum-output" ) );
    // the job's output types describe the reducer output (Text, IntWritable);
    // the mapper output types differ, so they are declared separately
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    job.setMapOutputKeyClass( BooleanWritable.class );
    job.setMapOutputValueClass( IntWritable.class );
    job.setMapperClass( PrimeMapper.class );
    job.setReducerClass( PrimeReducer.class );
    job.setInputFormat( PrimeInputFormat.class );
    job.setOutputFormat( TextOutputFormat.class );
Program Layout Last but not least:
    // run the job
    JobClient.runJob( job );
} // main
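Assuming the program is packaged into a jar named primesum.jar with PrimeSum as its main class (both names chosen here purely for illustration), the job could then be launched from the command line with:

    hadoop jar primesum.jar PrimeSum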