http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com
Excel Online Classes offers the following services:
• Online Training
• Development
• Testing
• Job support
• Technical Guidance
• Job Consultancy
• Any needs of IT Sector
MapReduce Anatomy
Nagarjuna K
AGENDA
• Anatomy of MapReduce
• MR work flow
• Hadoop data types
• Mapper
• Reducer
• Partitioner
• Combiner
• Input Split vs Block Size
Anatomy of MR
[Diagram: input data is read by Map tasks running on several nodes; each Map task writes interim data, which is partitioned and shuffled to the Reduce tasks; each Reduce task writes to a node that stores the output]
Hadoop data types
• MR defines its own key and value types so that they can be moved across the cluster
• Values: Writable
• Keys: WritableComparable<T>
• WritableComparable<T> = Writable + Comparable<T>
Custom Writable
• For any class to be a value, it has to implement org.apache.hadoop.io.Writable
  • write(DataOutput out)
  • readFields(DataInput in)
Custom key
• For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T>
  • The Writable methods, plus
  • compareTo(T o)
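The three methods above can be sketched as follows. This is a minimal sketch, not a definitive implementation: YearStationKey and its fields are hypothetical, and the "implements WritableComparable" clause is left out on purpose so that the class compiles without Hadoop on the classpath. Only java.io types are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type. In a real job it would declare
// "implements org.apache.hadoop.io.WritableComparable<YearStationKey>".
class YearStationKey implements Comparable<YearStationKey> {
    int year;
    String station = "";

    YearStationKey() { } // a no-arg constructor lets the framework create empty instances

    YearStationKey(int year, String station) {
        this.year = year;
        this.station = station;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeUTF(station);
    }

    // Read the fields back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        station = in.readUTF();
    }

    // Sort order of keys: by year, then by station.
    @Override
    public int compareTo(YearStationKey other) {
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : station.compareTo(other.station);
    }

    // Demo helper: serialize a key to bytes and read it into a fresh instance.
    static YearStationKey roundTrip(YearStationKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearStationKey copy = new YearStationKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearStationKey copy = roundTrip(new YearStationKey(1950, "A1"));
        System.out.println(copy.year + " " + copy.station);
    }
}
```

A real key class would normally also override hashCode() and equals(), since the default partitioner relies on hashCode().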
Checkout Writables
• Check out a few of the Writables and WritableComparables
• Time to write your own Writables
MapReduce libraries
• Two libraries in Hadoop
  • org.apache.hadoop.mapred.* (the older API)
  • org.apache.hadoop.mapreduce.* (the newer API)
Mapper
• Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2>
• void configure(JobConf job)
  • All parameters specified in the XML configuration files are available here
  • Any parameters set explicitly on the job are also available
  • Called once, before data processing starts
• void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
  • Data processing happens here, one call per record
• void close()
  • Should close any files, DB connections, etc.
• Reporter passes extra information about the mapper (progress, status, counters) to the TaskTracker (TT)
Reducer
• Should implement org.apache.hadoop.mapred.Reducer<K2,V2,K3,V3>
• The framework sorts the incoming data by key and groups together all the values for a key
• The reduce function is called once for every key, in sorted order
• void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
• Reporter passes extra information about the reducer to the TaskTracker (TT)
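The sort-and-group step can be imitated in plain Java. This is only a sketch of the idea, assuming String keys and Integer values; SortAndGroupSketch and sumByKey are hypothetical names, and the real framework does this work across machines during the shuffle rather than in one in-memory map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortAndGroupSketch {
    // Groups (key, value) pairs by key in sorted key order, then "reduces"
    // each group by summing its values: one reduce step per distinct key.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            result.put(group.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<String, Integer>("b", 1));
        pairs.add(new AbstractMap.SimpleEntry<String, Integer>("a", 1));
        pairs.add(new AbstractMap.SimpleEntry<String, Integer>("a", 2));
        System.out.println(sumByKey(pairs)); // keys come out in sorted order
    }
}
```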
Partitioner
• Implements Partitioner<K,V>
• configure()
• int getPartition( … )
• 0 <= return value < number of reducers
• Generally, implement Partitioner so that identical keys go to the same reducer
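A common way to satisfy the range rule is hash partitioning, which is also what Hadoop's default HashPartitioner does. The sketch below is standalone Java; HashPartitionSketch is a hypothetical name, and the signature is simplified (the real method also receives the value).

```java
class HashPartitionSketch {
    // Mask off the sign bit so the hash is never negative, then take the
    // remainder: the result is always in the range [0, numReducers).
    static int getPartition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("hadoop", 4));
        System.out.println(getPartition("hadoop", 4)); // same key, same partition
    }
}
```

Because the result depends only on the key's hashCode(), equal keys always land on the same reducer.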
Reading and Writing
• Generally two kinds of files in Hadoop
  • Text (plain, XML, HTML, …)
  • Binary (SequenceFile)
    • A Hadoop-specific binary file format that supports compression
    • Optimized for passing the output of one MR job to the next MR job
    • We can customize
Input Format
• HDFS block size
• Input splits
Blocks in HDFS
• A big file is divided into multiple blocks, which are stored in HDFS
• This is a physical division of the data
• dfs.block.size (default 64 MB)
[Diagram: a large file divided into Block 1, Block 2, Block 3, Block 4]
Input Splits and Records
• Logical division
• Input split
  • A chunk of data processed by a single mapper
  • Further divided into records
  • The map function processes these records
  • Record = key + value
• How this correlates to a DB table
  • Group of rows = split
  • Row = record
InputSplit
    public interface InputSplit extends Writable {
        long getLength() throws IOException;
        String[] getLocations() throws IOException;
    }
• It doesn’t contain the data
• It holds only the locations where the data is present
• Helps the JobTracker place tasks on TaskTrackers close to the data (data locality)
• getLength(): splits with greater length are executed first
InputFormat
• Determines how the data gets to the mapper
• Creating the input splits and dividing them into records is taken care of by the InputFormat
    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
    }
InputFormat
• Mapper
  • getRecordReader() is called to get the RecordReader
  • Once the record reader is obtained, the map method is called repeatedly, once per record, until the end of the split
RecordReader
    K key = reader.createKey();
    V value = reader.createValue();
    // next() fills key and value with the next record and returns false at the end of the split
    while (reader.next(key, value)) {
        mapper.map(key, value, output, reporter);
    }
Job Submission: retrospection
• The JobClient runs the job
• Gets the input splits by calling getSplits() on the InputFormat
• Determines the data locations for the splits
• Sends these locations to the JobTracker
• The JobTracker assigns mappers appropriately (data locality)
FileInputFormat
• Base class for all InputFormat implementations that use files as input
• Defines
  • Which files to include in the job
  • The implementation for generating splits
FileInputFormat
• Converts a set of files into a number of splits
• Splits only large files. How large?
  • Larger than the block size
• Can we control it?
Calculating Split Size
• An application may impose a minimum split size greater than the block size
• There is usually no good reason to do that
• Data locality is lost, because a split then spans blocks that may be stored on different nodes
FileInputFormat
• Minimum split size
  • We might set it larger than the block size
  • But data locality may be lost to some extent
• Split size is calculated by the formula
  • max(minimumSize, min(maximumSize, blockSize))
• By default
  • minimumSize < blockSize < maximumSize
  • So the split size equals the block size
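The formula is easy to check with a few numbers. The sketch below is plain Java; SplitSizeSketch and the sample sizes are assumptions chosen for illustration, not Hadoop defaults (apart from the 64 MB block size mentioned earlier).

```java
class SplitSizeSketch {
    // max(minimumSize, min(maximumSize, blockSize)), all sizes in bytes.
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // minimumSize < blockSize < maximumSize: the block size wins.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 64 * mb) / mb);
        // Minimum raised above the block size: splits span two blocks.
        System.out.println(computeSplitSize(128 * mb, Long.MAX_VALUE, 64 * mb) / mb);
        // Maximum lowered below the block size: splits smaller than a block.
        System.out.println(computeSplitSize(1, 32 * mb, 64 * mb) / mb);
    }
}
```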
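File Information in the mapper
• configure(JobConf job)
  • Details of the file being processed can be read from the JobConf here, e.g. the map.input.file property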
TextInputFormat
• Default InputFormat
• Each line is a value
• The byte offset of the line within the file is the key
• Example
  • Run the identity mapper program
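To make the key/value layout concrete, the sketch below turns a small text into (byte offset, line) records. It is a plain-Java imitation, not the Hadoop reader: LineRecordsSketch is a hypothetical name, and it assumes UTF-8 text with '\n' line endings held entirely in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineRecordsSketch {
    // Returns records in file order: key = byte offset of the line start,
    // value = the line without its trailing newline.
    static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.put(offset, line);
            // Advance by the line's byte length plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(toRecords("ab\ncde\nf"));
    }
}
```

Note that the keys are offsets, not line numbers: the second line starts at byte 3, not at 1.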
Input Splits and HDFS Blocks
• Logical records defined by FileInputFormat don’t usually fit neatly into HDFS blocks
• Every file is written as a sequence of bytes
• 64 MB reached? Then a new block is started
• When 64 MB is reached, a logical record may be only half written
• So the other half of the logical record goes into the next HDFS block
Input Splits and HDFS Blocks
• So even with data locality some remote reading is done: a slight overhead
• Splits give logical record boundaries
• Blocks give physical boundaries (size)
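A rough way to see how often this happens is to count the records that cross a block boundary. The sketch below is a simplified model, not Hadoop code: BoundarySketch is a hypothetical name, records are lines given by their byte lengths (newline included, so each length is at least 1), and blocks are fixed-size byte ranges.

```java
class BoundarySketch {
    // Counts lines whose bytes span two blocks, i.e. the line starts in one
    // block and ends in a later one. Each such line needs a read from the
    // next block, which may live on another node.
    static int countStraddlingLines(int[] lineLengths, long blockSize) {
        int straddling = 0;
        long start = 0;
        for (int length : lineLengths) {
            long lastByte = start + length - 1;
            if (start / blockSize != lastByte / blockSize) {
                straddling++;
            }
            start += length;
        }
        return straddling;
    }

    public static void main(String[] args) {
        // Three 40-byte lines with 64-byte blocks: the second line
        // occupies bytes 40..79 and therefore crosses the boundary at 64.
        System.out.println(countStraddlingLines(new int[] { 40, 40, 40 }, 64));
    }
}
```

At most one record can straddle each boundary, which is why the overhead stays slight.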
Small Files
• Very small files are inefficient in the map phase
• Imagine 1 GB of data
  • As 64 MB files: 16 files, 16 mappers
  • As 100 KB files: about 10,000 files, about 10,000 mappers
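The arithmetic behind these numbers is a ceiling division. The sketch assumes one mapper per file (each file fits in one split) and binary units (1 GB = 2^30 bytes, 100 KB = 102,400 bytes); MapperCountSketch is a hypothetical name.

```java
class MapperCountSketch {
    // Number of files (and so mappers) needed to hold totalBytes when
    // every file is fileBytes long: ceiling of totalBytes / fileBytes.
    static long filesNeeded(long totalBytes, long fileBytes) {
        return (totalBytes + fileBytes - 1) / fileBytes;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        System.out.println(filesNeeded(gb, 64L << 20));   // 64 MB files
        System.out.println(filesNeeded(gb, 100L * 1024)); // 100 KB files
    }
}
```

With these units 1 GB of 100 KB files comes to 10,486 files, so roughly ten thousand map tasks instead of sixteen.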
CombineFileInputFormat
• Packs many files into a single split
• Data locality is taken into consideration
• MR performs best when it operates at the disk transfer rate, not at the seek rate
• This also helps when processing large files
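The packing idea can be illustrated with a greedy loop. This is a sketch of the general technique only, not the algorithm CombineFileInputFormat actually uses: PackFilesSketch is a hypothetical name, and the sketch ignores data locality, which the real class takes into account.

```java
import java.util.ArrayList;
import java.util.List;

class PackFilesSketch {
    // Greedily appends files to the current split until adding the next
    // file would exceed maxSplitBytes. Returns the file sizes in each split.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five 30-unit files packed into splits of at most 64 units.
        System.out.println(pack(new long[] { 30, 30, 30, 30, 30 }, 64));
    }
}
```

Five files become three splits here, so three mappers run instead of five.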
NLineInputFormat
• Keys and values are the same as in TextInputFormat
• Each split is guaranteed to have N lines (the last split may have fewer)
• Controlled by the property mapred.line.input.format.linespermap
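The effect of N on the number of splits can be sketched with an in-memory list of lines. NLineSketch and splitByLines are hypothetical names; the real input format works on file offsets rather than on a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NLineSketch {
    // Cuts the lines into consecutive groups of n; the final group holds
    // whatever is left over and may therefore be shorter.
    static List<List<String>> splitByLines(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            splits.add(new ArrayList<>(lines.subList(i, Math.min(i + n, lines.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(splitByLines(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```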
KeyValueTextInputFormat
• Each line in the text file is a record
• The first separator character divides the key from the value
• The default separator is '\t'
• Controlling property
  • key.value.separator.in.input.line
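Splitting at the first separator only means that later separators stay inside the value. A minimal sketch in plain Java, with KeyValueLineSketch as a hypothetical name; the handling of a line without any separator (whole line becomes the key, value is empty) is an assumption about the intended behaviour.

```java
class KeyValueLineSketch {
    // Splits a line at the FIRST occurrence of the separator and returns
    // {key, value}. Assumption: without a separator the whole line is the
    // key and the value is empty.
    static String[] parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("user42\tlogin\t10:15", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```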
SequenceFileInputFormat<K,V>
• InputFormat for reading sequence files
• User-defined key K
• User-defined value V
• They are splittable files
• Well suited for MR
• They support compression
• They can store arbitrary types
TextOutputFormat
• Keys and values are stored separated by \t by default
• mapred.textoutputformat.separator: the parameter that sets the separator
• Counterpart of KeyValueTextInputFormat
• Can suppress the key or the value by using NullWritable
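How one output line is assembled can be sketched as follows. This is a plain-Java imitation, not the Hadoop class: TextOutputLineSketch is a hypothetical name, and null stands in for NullWritable so that the sketch compiles without Hadoop.

```java
class TextOutputLineSketch {
    // Builds one output line (without the trailing newline).
    // null plays the role of NullWritable: a suppressed key or value is
    // left out together with the separator.
    static String format(Object key, Object value, String separator) {
        if (key == null && value == null) {
            return "";
        }
        if (key == null) {
            return value.toString();
        }
        if (value == null) {
            return key.toString();
        }
        return key + separator + value;
    }

    public static void main(String[] args) {
        System.out.println(format("hadoop", 3, "\t"));
        System.out.println(format(null, "value only", "\t"));
    }
}
```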