MapReduce Powering Hadoop
Overview • Overview • What is MapReduce • How Does It Divide Work • Example • Conclusion • References
What Is MapReduce • Originally created by Google • Used to query large data sets • Extracts relations from unstructured data • Can draw from many disparate data sources
How It Divides Work http://docs.basho.com/riak/1.3.0/tutorials/querying/MapReduce/
4 Refinements • The general algorithm fits most needs • User-defined tweaks to the Map and Reduce functions handle special problems
4.1 Partitioning Function • Users can define the number of reduce tasks to run (R) • A partitioning function assigns each intermediate key to one of the R tasks • The default function is hash(key) mod R • Sometimes we may want to group output together, such as grouping web data by host name • We can redefine the partitioning function to use hash(Hostname(urlkey)) mod R
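The two partitioning rules on this slide can be sketched in a few lines of Python. This is only an illustration: `stable_hash`, `default_partition`, `host_partition`, and the sample URLs are hypothetical names, and CRC32 stands in for whatever hash a real framework uses (Python's built-in `hash()` is salted per process for strings, so it would not give repeatable partitions).

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reduce tasks, chosen by the user (assumed value)

def stable_hash(key: str) -> int:
    # CRC32 is an assumption; it gives the same value on every run and machine.
    return zlib.crc32(key.encode("utf-8"))

def default_partition(key: str, r: int = R) -> int:
    # Default rule from the slide: hash(key) mod R
    return stable_hash(key) % r

def host_partition(url_key: str, r: int = R) -> int:
    # Custom rule from the slide: hash(Hostname(urlkey)) mod R
    host = urlparse(url_key).hostname or ""
    return stable_hash(host) % r

urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/index",
]

for url in urls:
    print(url, default_partition(url), host_partition(url))
```

With the host-based rule, both `example.com` URLs always land in the same partition, so they end up in the same output file. The default rule makes no such promise.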
4.2 Ordering Guarantees • Within each partition, intermediate key/value pairs are always processed in increasing key order • This makes it easy to produce a sorted output file per partition, which supports efficient random-access lookups by key
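A minimal sketch of why the ordering guarantee helps, assuming a single partition held in memory. The names `reduce_partition` and `lookup` are hypothetical; a real system would sort on disk and index the output file rather than a Python list.

```python
from bisect import bisect_left
from itertools import groupby
from operator import itemgetter

def reduce_partition(pairs, reduce_fn):
    # Sort by key so keys are handled in increasing order, then reduce each group.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

def lookup(sorted_keys, output, key):
    # Binary search works only because the output is already sorted by key.
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return output[i][1]
    return None

partition = [("pear", 1), ("apple", 1), ("pear", 1), ("fig", 1)]
output = reduce_partition(partition, lambda key, values: sum(values))
sorted_keys = [key for key, _ in output]

print(output)
print(lookup(sorted_keys, output, "pear"))
```

Because the keys come out sorted, a lookup needs only a binary search instead of a scan of the whole partition.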
4.3 Combiner Function • There is sometimes significant repetition in the intermediate keys • This is usually handled in the Reduce function, but sometimes we want to partially combine the values on the machine that ran the Map task, before they are sent over the network • The combiner function does this for us, and when the Reduce operation is commutative and associative it can grant significant performance gains
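Word counting is the usual illustration of a combiner. The sketch below assumes two input splits and uses hypothetical helper names (`map_words`, `combine`, `reduce_counts`). It shows that pre-summing on the map side shrinks the intermediate data without changing the final totals, which holds here because addition is commutative and associative.

```python
from collections import Counter, defaultdict

def map_words(text):
    # Map: emit (word, 1) for every word in the split.
    for word in text.split():
        yield word.lower(), 1

def combine(pairs):
    # Combiner: partially sum counts on the map side, one pair per distinct word.
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def reduce_counts(all_pairs):
    # Reduce: sum every count received for each word.
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

splits = ["the cat and the hat", "the end"]
raw = [list(map_words(s)) for s in splits]
combined = [combine(pairs) for pairs in raw]

raw_count = sum(len(pairs) for pairs in raw)
combined_count = sum(len(pairs) for pairs in combined)

without_combiner = reduce_counts(p for pairs in raw for p in pairs)
with_combiner = reduce_counts(p for pairs in combined for p in pairs)
print(raw_count, combined_count, with_combiner)
```

Here the combiner only merges the repeated "the" in the first split. On real text with heavily skewed word frequencies the saving in network traffic is much larger.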
4.4 Input and Output Types • MapReduce can take data from a number of formats • The way the data is organized for input greatly affects the output • Adding support for a new data type only requires users to implement the reader interface
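One way to picture the reader interface: a reader is anything that turns an input stream into (key, value) records, and the Map step never needs to know which format it came from. The sketch below is an assumption about how such an interface could look in Python, not the API of any particular framework. `line_reader`, `csv_reader`, `run_map`, and `emit_length` are all hypothetical names.

```python
import csv
import io
from typing import Callable, Iterator, TextIO, Tuple

Record = Tuple[str, str]
Reader = Callable[[TextIO], Iterator[Record]]

def line_reader(stream: TextIO) -> Iterator[Record]:
    # Text input: key is the character offset of the line, value is its contents.
    offset = 0
    for line in stream:
        yield str(offset), line.rstrip("\n")
        offset += len(line)

def csv_reader(stream: TextIO) -> Iterator[Record]:
    # A new input type: first column is the key, remaining columns are the value.
    for row in csv.reader(stream):
        if row:
            yield row[0], ",".join(row[1:])

def run_map(reader: Reader, stream: TextIO, map_fn):
    # The map driver depends only on the reader producing (key, value) records.
    out = []
    for key, value in reader(stream):
        out.extend(map_fn(key, value))
    return out

def emit_length(key, value):
    return [(key, len(value))]

print(run_map(line_reader, io.StringIO("ab\ncde\n"), emit_length))
print(run_map(csv_reader, io.StringIO("k1,x,y\nk2,z\n"), emit_length))
```

Supporting the CSV format meant writing `csv_reader` only. `run_map` and `emit_length` were left untouched.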
4.5 Side-effects • Sometimes we want to output additional files from the Map or Reduce functions • Users are responsible for making these side effects atomic and idempotent • Tasks that write several files that must stay consistent with each other should be deterministic
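A common way to keep such an extra output file safe when a task is retried is to write it to a temporary file and rename it into place once it is complete. The sketch below assumes a local filesystem, and `write_side_file` is a hypothetical helper, not part of any MapReduce library.

```python
import os
import tempfile

def write_side_file(path: str, data: str) -> None:
    # Write to a temporary file in the same directory, then rename it over the
    # target. Readers see either the old file or the complete new one.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the partial temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the task is deterministic, a re-executed attempt writes the same bytes and simply replaces the file, so running it twice leaves the same result as running it once.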