MapReduce: Simplified Data Processing on Large Clusters
Appendix A: Word Frequency
Alex Newton, Billy Coss
Contents
• Abstract
• Introduction
• MapReduce
• Word Frequency Analysis Sample Code
Abstract
• MapReduce is a programming model for processing large amounts of data
• Map emits intermediate key/value pairs, one for every occurrence, even when the same key has already been emitted
• Reduce takes the key/value pairs created by the Map function and merges all values that share a key, so each distinct key ends up with a single combined result
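The Map, group, and Reduce flow described above can be sketched in a few lines of single-process Python. This is only an illustration of the model, not Google's implementation: the names `map_reduce`, `emit_ones`, and `total` are hypothetical, and the real system distributes each phase across many machines.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map, group by key, then reduce, all in one process (hypothetical helper)."""
    # Map phase: each input record may emit any number of (key, value) pairs,
    # including repeated keys.
    intermediate = [pair for record in inputs for pair in map_fn(record)]

    # Group phase: collect every value that shares the same key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one call per distinct key condenses its values.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def emit_ones(record):
    # Example map function: emit (token, 1) for each whitespace-separated token.
    for token in record.split():
        yield token, 1


def total(key, values):
    # Example reduce function: merge the values for one key by summing them.
    return sum(values)


result = map_reduce(["a b a", "b a"], emit_ones, total)
print(result)
```

Here the map phase emits five pairs, three of them with the key `a`, and the reduce phase leaves one entry per distinct key.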
Introduction
• Data analysts at Google frequently work on extremely large sets of raw data
• Parallel computing is required to process these datasets in a reasonable amount of time
• MapReduce was created as an abstraction that hides the details of parallelization, fault tolerance, data distribution, and load balancing
MapReduce
[Image taken from the OSDI ’04 presentation by Jeff Dean and Sanjay Ghemawat.]
Word Frequency Analysis Example Code
• Code is divided into three parts
  • main
  • WordCounter
  • Adder
• WordCounter is used for the Map function
  • Skips any leading whitespace and then parses words out of the text
  • The word itself is the key; the value is 1
• Adder is used for the Reduce function
  • Iterates through all entries that share a key and adds their values together
  • Since each value is 1, this has the effect of incrementing a counter for the number of times a word is used
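As a rough illustration of the three parts listed above, here is a minimal Python sketch. It is not the code from the paper's appendix, which is written in C++ against Google's MapReduce library. The snake_case names mirror the slide's `WordCounter`, `Adder`, and `main`, and the grouping done inside `main` is an assumed stand-in for work the framework would normally perform.

```python
from collections import defaultdict


def word_counter(text):
    """Map step: yield (word, 1) for every whitespace-separated word."""
    i, n = 0, len(text)
    while i < n:
        # Skip past leading whitespace.
        while i < n and text[i].isspace():
            i += 1
        # Scan forward to the end of the word.
        start = i
        while i < n and not text[i].isspace():
            i += 1
        if start < i:
            # The word is the key; the value is always 1.
            yield text[start:i], 1


def adder(values):
    """Reduce step: add up all values emitted for one key."""
    count = 0
    for value in values:
        count += value
    return count


def main(documents):
    # Stand-in for the framework: group mapped values by key, then reduce.
    grouped = defaultdict(list)
    for text in documents:
        for word, one in word_counter(text):
            grouped[word].append(one)
    return {word: adder(values) for word, values in grouped.items()}


counts = main(["  the quick fox", "the lazy dog jumps over the fox "])
print(counts["the"], counts["fox"])
```

Because every emitted value is 1, `adder` returns exactly the number of times each word appeared across all input documents.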
Sources
J. Dean & S. Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI ’04: 6th Symposium on Operating Systems Design and Implementation. pp. 137, 149. http://research.google.com/archive/mapreduce.html