http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com
Excel Online Classes offers the following services: • Online Training • Development • Testing • Job Support • Technical Guidance • Job Consultancy • Any IT-sector needs
Nagarjuna K MapReduce Anatomy
AGENDA • Anatomy of MapReduce • MR work flow • Hadoop data types • Mapper • Reducer • Partitioner • Combiner • Input Split vs Block Size
Anatomy of MR [Diagram: input data is spread across nodes; each node runs a Map task producing interim data, which is partitioned and shuffled to Reduce tasks, and each reduce writes its output to a storage node]
Hadoop data types • MapReduce defines the types that keys and values must have so they can be serialized and moved across the cluster • Values implement Writable • Keys implement WritableComparable&lt;T&gt; • WritableComparable = Writable + Comparable&lt;T&gt;
Custom Writable • For a class to be used as a value, it has to implement org.apache.hadoop.io.Writable • write(DataOutput out) • readFields(DataInput in)
Custom key • For a class to be used as a key, it has to implement org.apache.hadoop.io.WritableComparable&lt;T&gt;, i.e. Writable plus • compareTo(T o)
Checkout Writables • Check out a few of the built-in Writables and WritableComparables • Time to write your own Writable; a sketch follows below
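Below is a minimal sketch of a custom key type. The class name WordYearKey and its fields are made up for illustration; any class following this pattern can act as a key (and, without compareTo, as a value).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordYearKey implements WritableComparable<WordYearKey> {
    private String word = "";
    private int year;

    // Serialize the fields in a fixed order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(year);
    }

    // Deserialize the fields in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        year = in.readInt();
    }

    // Ordering used by the MapReduce sort phase: by word, then by year.
    @Override
    public int compareTo(WordYearKey other) {
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : Integer.compare(year, other.year);
    }

    @Override
    public int hashCode() {           // used by the default HashPartitioner
        return word.hashCode() * 31 + year;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordYearKey)) return false;
        WordYearKey k = (WordYearKey) o;
        return word.equals(k.word) && year == k.year;
    }
}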
MapReduce libraries • Hadoop has two API packages • org.apache.hadoop.mapred.* (the old API) • org.apache.hadoop.mapreduce.* (the new API)
Mapper • Should implement org.apache.hadoop.mapred.Mapper&lt;K1,V1,K2,V2&gt; • void configure(JobConf job) • All the parameters specified in the configuration XMLs are available here • Any parameters set explicitly on the job are also available • Called before data processing starts • void map(K1 key, V1 value, OutputCollector&lt;K2,V2&gt; output, Reporter reporter) • Where the data processing happens • void close() • Should close any open files, DB connections, etc. • Reporter passes extra information from the mapper to the TaskTracker • A sketch follows
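A minimal word-count style mapper sketch using the old mapred API; the word-count logic and class name are illustrative, not taken from the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void configure(JobConf job) {
        // Called once before any records are processed;
        // parameters from the XML config or set on the JobConf are readable here.
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);   // emit (word, 1)
        }
    }

    @Override
    public void close() throws IOException {
        // Release any open files, DB connections, etc.
    }
}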
Reducer • Should implement org.apache.hadoop.mapred.Reducer • The framework sorts the incoming data by key and groups together all the values for each key • The reduce function is called once per key, in sorted key order • void reduce(K2 key, Iterator&lt;V2&gt; values, OutputCollector&lt;K3,V3&gt; output, Reporter reporter) • Reporter passes extra information from the reducer to the TaskTracker • A matching sketch follows
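A matching word-count reducer sketch (old mapred API), again illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        // The framework has already grouped all values for this key.
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}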
Partitioner • Implements Partitioner&lt;K,V&gt; • configure(JobConf job) • int getPartition(...) • The return value must be in the range 0 to number-of-reducers minus 1 • Generally, implement a Partitioner so that all records with the same key go to the same reducer • A sketch follows
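A sketch of a custom partitioner under the old mapred API; the routing rule (first letter of the key) is an assumption chosen only to show the contract.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // Read any job parameters the partitioning rule needs.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Must return a value in [0, numPartitions).
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : s.charAt(0);
        return first % numPartitions;   // same first letter -> same reducer
    }
}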
Reading and Writing • Generally two kinds of files in Hadoop • Text (plain, XML, HTML, ...) • Binary (SequenceFile) • A Hadoop-specific binary file format that supports compression • Optimized for passing output from one MapReduce job to another • Can be customized • A writer sketch follows
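A minimal sketch of writing a SequenceFile with SequenceFile.createWriter(FileSystem, Configuration, Path, keyClass, valueClass); the output path, key/value types, and loop contents are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/demo.seq");   // hypothetical output path

        SequenceFile.Writer writer = null;
        try {
            // Binary key/value file, optionally compressed.
            writer = SequenceFile.createWriter(fs, conf, path,
                                               Text.class, IntWritable.class);
            for (int i = 0; i < 100; i++) {
                writer.append(new Text("record-" + i), new IntWritable(i));
            }
        } finally {
            if (writer != null) writer.close();
        }
    }
}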
Input Format • HDFS block size • Input splits
Blocks in HDFS • A big file is divided into multiple blocks and stored in HDFS • This is a physical division of data • dfs.block.size (64 MB default) [Diagram: a large file divided into blocks 1 through 4]
Input Splits and Records • An input split is a logical division of data: the chunk processed by one mapper • A split is further divided into records • The map function processes these records one at a time • Record = key + value • Database-table analogy: a group of rows corresponds to a split; a single row corresponds to a record
InputSplit
public interface InputSplit extends Writable {
  long getLength() throws IOException;
  String[] getLocations() throws IOException;
}
• It doesn't contain the data itself • Only the locations where the data is present • This helps the JobTracker place tasks on TaskTrackers close to the data (data locality) • getLength lets the splits be ordered by size, so the larger splits are executed first
InputFormat • How we get the data to the mapper • The InputFormat takes care of creating the input splits and of dividing each split into records
public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}
RecordReader
K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}
FileInputFormat • Base class for all implementations of InputFormat that use files as input • Defines • Which files to include in the job • The implementation for generating splits
FileInputFormat • Converts a set of input files into a number of splits • Splits only large files... how large? • Larger than the block size • Can we control this?
FileInputFormat • Minimum split size • We might set it larger than the block size • But some data locality is then lost, since a split spans more than one block • Split size is calculated by the formula • max(minimumSize, min(maximumSize, blockSize)) • By default • minimumSize &lt; blockSize &lt; maximumSize, so the split size equals the block size • A sketch follows
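A small sketch of how the formula plays out when the minimum split size is raised above the block size; mapred.min.split.size is the old-API property, and the sizes are illustrative.

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SplitSizeDemo.class);

        long blockSize   = 64L * 1024 * 1024;    // 64 MB HDFS block
        long minimumSize = 128L * 1024 * 1024;   // force larger splits
        long maximumSize = Long.MAX_VALUE;       // effectively unlimited by default

        conf.setLong("mapred.min.split.size", minimumSize);

        // splitSize = max(minimumSize, min(maximumSize, blockSize))
        long splitSize = Math.max(minimumSize, Math.min(maximumSize, blockSize));
        System.out.println("split size = " + splitSize);
        // 128 MB here, so each mapper reads two blocks
        // and some data locality is lost.
    }
}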
File Information in the mapper • configure(JobConf job) • The JobConf passed to configure() carries information about the file being processed (for example, the map.input.file property in the old API)
TextInputFormat • The default FileInputFormat • Each line of the file is a value • The byte offset of the line within the file is the key • Example • Run an identity-mapper program (sketch below)
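A possible driver for that exercise, using the old mapred API with TextInputFormat and the built-in IdentityMapper; the input and output paths are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class IdentityDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentityDriver.class);
        conf.setJobName("identity");

        conf.setInputFormat(TextInputFormat.class);   // the default anyway
        conf.setMapperClass(IdentityMapper.class);    // emits (offset, line) unchanged
        conf.setNumReduceTasks(0);                    // map-only job

        conf.setOutputKeyClass(LongWritable.class);   // byte offset of the line
        conf.setOutputValueClass(Text.class);         // line contents

        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

        JobClient.runJob(conf);
    }
}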
Input Splits and HDFS Blocks • The logical records defined by FileInputFormat do not usually fit neatly into HDFS blocks • Every file is written as a sequence of bytes • When 64 MB is reached, a new block is started • At that point a logical record may be only half written • So the other half of the logical record goes into the next HDFS block
Input Splits and HDFS Blocks • So even with data locality, some remote reading is done: a slight overhead • Splits define logical record boundaries • Blocks define physical boundaries (size)