http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com
Excel Online Classes offers the following services: • Online Training • Development • Testing • Job Support • Technical Guidance • Job Consultancy • Any needs of the IT sector
Nagarjuna K MapReduce Anatomy
AGENDA • Anatomy of MapReduce • MR workflow • Hadoop data types • Mapper • Reducer • Partitioner • Combiner • Input Split vs Block Size
Anatomy of MR • [Diagram: input data is distributed across nodes; each node runs a Map task over its portion and produces interim data; the interim data is partitioned and shuffled to Reduce tasks, whose output is stored on output nodes]
Hadoop data types • MR defines the types that keys and values must have so they can be serialized and moved across the cluster • Values implement Writable • Keys implement WritableComparable<T> • WritableComparable = Writable + Comparable<T>
Custom Writable • For any class to be a value, it has to implement org.apache.hadoop.io.Writable • write(DataOutput out) • readFields(DataInput in)
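A minimal sketch of a custom value type (the PersonWritable class and its fields are illustrative, not from the slides):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom value: write() serializes the fields,
// readFields() must read them back in the same order.
public class PersonWritable implements Writable {
  private String name;
  private int age;

  public void write(DataOutput out) throws IOException {
    out.writeUTF(name);
    out.writeInt(age);
  }

  public void readFields(DataInput in) throws IOException {
    name = in.readUTF();
    age = in.readInt();
  }
}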
Custom key • For any class to be a key, it has to implement org.apache.hadoop.io.WritableComparable<T> • i.e., the Writable methods plus • compareTo(T o)
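Extending the same idea to a key, a sketch of a custom WritableComparable (PersonKey is likewise illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: the Writable methods plus compareTo(),
// which the framework uses to sort keys before the reduce phase.
public class PersonKey implements WritableComparable<PersonKey> {
  private String name;
  private int age;

  public void write(DataOutput out) throws IOException {
    out.writeUTF(name);
    out.writeInt(age);
  }

  public void readFields(DataInput in) throws IOException {
    name = in.readUTF();
    age = in.readInt();
  }

  public int compareTo(PersonKey o) {
    int c = name.compareTo(o.name);
    return (c != 0) ? c : Integer.compare(age, o.age);
  }
}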
Check out Writables • Check out a few of the built-in Writables and WritableComparables • Time to write your own Writables
MapReduce libraries • Two APIs in Hadoop • org.apache.hadoop.mapred.* (the old API, used in these slides) • org.apache.hadoop.mapreduce.* (the new API)
Mapper • Should implement org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2> • void configure(JobConf job) • All the parameters specified in the config XMLs are available here • Any parameters set explicitly are also available • Called before data processing starts • void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) • Where data processing happens • void close() • Should close any open files, DB connections, etc. • Reporter passes extra information (progress, status) from the mapper to the TaskTracker (TT)
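As a sketch of these three methods in use, a word-count-style mapper on the old mapred API (class and field names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every token in the input line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void configure(JobConf job) {
    // XML-config parameters (and any set explicitly) are readable here,
    // e.g. job.get("mapred.job.name")
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      output.collect(word, one);
    }
  }

  public void close() throws IOException {
    // close files, DB connections, etc.
  }
}

Extending MapReduceBase is the idiomatic way to get default configure()/close() implementations; they are overridden here only to show where they run.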
Reducer • Should implement org.apache.hadoop.mapred.Reducer • The framework sorts the incoming data by key and groups together all the values for each key • The reduce function is called once for every key, in sorted order • void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter) • Reporter passes extra information from the reducer to the TT
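A matching sketch of the reducer, summing the counts grouped under each key:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Called once per key, in sorted key order, with all values for that key.
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}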
Partitioner • implements Partitioner<K,V> • configure() • int getPartition( … ) • returns a value in the range 0 <= return value < number of reducers • The partitioner ensures all records with the same key go to the same reducer (see the sketch below)
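A sketch of a custom partitioner (the first-letter scheme is illustrative; the default HashPartitioner already sends equal keys to the same reducer):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes keys by their first character, so all keys starting with
// the same letter land on the same reducer.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // read any needed parameters here
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char c = s.isEmpty() ? 0 : s.charAt(0);
    return c % numPartitions;  // always in [0, numPartitions)
  }
}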
Reading and Writing • Generally two kinds of files in Hadoop • Text (plain, XML, HTML, …) • Binary (Sequence files) • A Hadoop-specific compressed binary file format • Optimized for passing output from one MR job to another • We can customize it
Input Format • HDFS block size • Input splits
Blocks in HDFS • A big file is divided into multiple blocks and stored in HDFS • This is a physical division of data • dfs.block.size (64 MB default) • [Diagram: LARGE FILE divided into BLOCK 1 … BLOCK 4]
Input Splits and Records (logical division) • Input split • A chunk of data processed by one mapper • Further divided into records • Map processes these records one at a time • Record = key + value • DB-table analogy • Group of rows ≈ split • Row ≈ record
InputSplit
public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}
• It doesn't contain the data itself • Only the locations where the data is present • This helps the JobTracker assign TaskTrackers for data locality • getLength() is used to order the splits so that those with the greater length are executed first
InputFormat • How the data gets to the mapper • The InputFormat takes care of generating the input splits and of dividing each split into records
public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}
InputFormat • Mapper • getRecordReader() is called to obtain a RecordReader • Once the record reader is obtained • The map method is called repeatedly until the end of the split
RecordReader
K key = reader.createKey();
V value = reader.createValue();
// one map() call per record, until the split is exhausted
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}
Job Submission – retrospection • The JobClient, when running the job • Gets input splits by calling getSplits() on the InputFormat • Determines the data locations for the splits • Sends these locations to the JobTracker • The JobTracker assigns mappers appropriately • Data locality
FileInputFormat • Base class for all implementations of InputFormat that use files as input • Defines • Which files to include for the job • An implementation for generating splits
FileInputFormat • A set of files is converted into a number of splits • Only large files are split… HOW LARGE? • Larger than the block size • Can we control it?
Calculating Split Size • An application may impose a minimum split size greater than the block size • There is rarely a good reason to do so • Data locality is lost
FileInputFormat • Min split size • We might set it larger than the block size • But data locality may then be lost to some extent • Split size is calculated by the formula • max(minimumSize, min(maximumSize, blockSize)) • By default • minimumSize < blockSize < maximumSize
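A minimal sketch of that formula with the default values plugged in:

// Split size = max(minimumSize, min(maximumSize, blockSize)).
// With the defaults (minimumSize = 1, maximumSize = Long.MAX_VALUE),
// the result is simply the block size.
public class SplitSizeSketch {
  static long splitSize(long minimumSize, long maximumSize, long blockSize) {
    return Math.max(minimumSize, Math.min(maximumSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;
    System.out.println(splitSize(1, Long.MAX_VALUE, blockSize)); // 67108864 (64 MB)
    // Forcing a larger minimum overrides the block size (and loses locality):
    System.out.println(splitSize(128L * 1024 * 1024, Long.MAX_VALUE, blockSize));
  }
}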
File information in the mapper • Available in configure(JobConf job), e.g. via the map.input.file property, which holds the path of the file the mapper is reading
TextInputFormat • The default FileInputFormat • Each line is a value • The byte offset of the line is the key • Example • Run the identity mapper program (see the sketch below)
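A sketch of such an identity-mapper job on the old API (the paths and the class name IdentityJob are placeholders); each output record is the (byte offset, line) pair that TextInputFormat produced:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class IdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityJob.class);
    conf.setJobName("identity");
    conf.setInputFormat(TextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);                  // map-only job
    conf.setOutputKeyClass(LongWritable.class); // byte offset
    conf.setOutputValueClass(Text.class);       // the line itself
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}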
Input Splits and HDFS Blocks • The logical records defined by FileInputFormat don't usually fit neatly into HDFS blocks • Every file is written as a sequence of bytes • When 64 MB is reached, a new block starts • At that point a logical record may be only half written • So the other half of the logical record goes into the next HDFS block
Input Splits and HDFS Blocks • So even with data locality, some remote reading is done… a slight overhead • Splits follow logical record boundaries • Blocks are physical boundaries (fixed size)
Small Files • Files that are very small are inefficient in the mapper phase • Imagine 1 GB of input • 64 MB files – 16 files – 16 mappers • 100 KB files – ~10,000 files – ~10,000 mappers
CombineFileInputFormat • Packs many files into a single split • Data locality is taken into consideration • MR performs best when operating at disk-transfer rate, not at seek rate • This also helps when processing large files
NLineInputFormat • Same as TextInputFormat • But each split is guaranteed to have exactly N lines • mapred.line.input.format.linespermap
KeyValueTextInputFormat • Each line in the text file is a record • The first separator character divides key and value • Default is '\t' • Controlling property • key.value.separator.in.input.line (see the sketch below)
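A small configuration sketch (the ',' separator and sample line are illustrative): with separator ',' the line "apple,42" reaches the mapper as key "apple" and value "42", both Text:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KvInputSketch {
  public static void configure(JobConf conf) {
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // default separator is '\t'; override it like this:
    conf.set("key.value.separator.in.input.line", ",");
  }
}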
SequenceFileInputFormat<K,V> • InputFormat for reading sequence files • User-defined key K • User-defined value V • They are splittable files • Well suited for MR • They support compression • They can store arbitrary types
TextOutputFormat • Keys and values are written tab-separated by default • mapred.textoutputformat.separator – the parameter controlling the separator; TextOutputFormat is the counterpart of KeyValueTextInputFormat • Can suppress the key or value by using NullWritable (see the sketch below)
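A sketch of both knobs (the '|' separator is illustrative): change the separator, or emit NullWritable keys so only the values appear in the output file:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class TextOutputSketch {
  public static void configure(JobConf conf) {
    conf.set("mapred.textoutputformat.separator", "|"); // default is '\t'
    conf.setOutputKeyClass(NullWritable.class);         // suppress the key:
    conf.setOutputValueClass(Text.class);               // only values are written
    // in the reducer: output.collect(NullWritable.get(), value);
  }
}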