MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters B97902029 葉彥廷 B97902083 林廷韋 B97902085 王頃恩

Outline • Why we chose this topic • Introduction • Programming Model • Example • Implementation • Conclusion

Why we chose this topic • Trend Micro cloud computing programming contest (趨勢騰雲駕霧程式競賽, 2010) • A miserable memory from last summer vacation. • In the end, we did not manage to design a working distributed system. • So we want to learn more about the ideas behind cloud computing.

Introduction(1) • How long would you put up with searching for the answers to your automata homework? • A week? • A day? • Or would you ask Google for instant answers?

Introduction(2) • But how can Google do it so fast? • Is Google good at automata? • No, it's MapReduce! • And what can MapReduce do?

Introduction(3) • MapReduce can: • Simplify the processing of large amounts of data. • Split the work into independent tasks, which can be computed in parallel across a distributed cluster. • The programmer only needs to implement the Map and Reduce interfaces, which takes little effort. • But how does it work?

Programming Model(1) • Map function: • Takes one input pair: KEY/VALUE. • Splits the VALUE into several intermediate key/value pairs using user-defined logic (which may or may not use KEY). • The intermediate pairs are grouped by key and passed to the Reduce functions.

Programming Model(2) • Reduce function: • Receives an intermediate key and the values for that key produced by the Map functions. • Merges these values to form a possibly smaller set of values for that key. • The output from all worker machines is collected and presented to the user.
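The Map and Reduce roles described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's implementation: `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical names, the sample inputs are made up, and the grouping step stands in for the work a real MapReduce library does between the two phases across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, group-by-key, and reduce sequentially in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase: each input pair yields zero or more intermediate pairs.
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            # Grouping step: collect all values that share an intermediate key.
            groups[out_key].append(out_value)
    # Reduce phase: merge the values of each key into one result.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}


def word_count_map(doc_name: str, text: str) -> Iterable[Tuple[str, int]]:
    # The input key (doc_name) is not used here; only the value is split.
    for word in text.split():
        yield word, 1


def word_count_reduce(word: str, counts: List[int]) -> int:
    return sum(counts)


result = run_mapreduce(
    [("a.txt", "to be or not to be"), ("b.txt", "to do")],
    word_count_map,
    word_count_reduce,
)
```

Because each call to the map function depends only on its own input pair, and each call to the reduce function only on one key's values, both loops could in principle be spread over many machines without changing the user's code.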

Example • Assume we have a log file of web page requests, along with its name. • We want to know which web pages appear in the log file and how often each one is requested. • Map function • Input: <log file name, web page requests> • Output: <URL, 1> • Reduce function • Input: <URL, list of 1s> • Output: <URL, total count>
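A minimal Python sketch of this example follows. The slide does not specify a log format, so the sketch assumes one request per line with the URL as the first whitespace-separated field; that layout, the sample data, and the names `map_requests` and `reduce_counts` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_requests(log_name: str, log_contents: str) -> Iterator[Tuple[str, int]]:
    """Map: emit <URL, 1> for every request line in the log."""
    for line in log_contents.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[0], 1  # assumed format: URL is the first field


def reduce_counts(url: str, ones: List[int]) -> Tuple[str, int]:
    """Reduce: sum the 1s emitted for one URL into <URL, total count>."""
    return url, sum(ones)


# Hypothetical sample log, made up for illustration.
sample_log = "/index.html 200\n/about.html 200\n/index.html 304\n"

# Stand-in for the library's grouping of intermediate pairs by key.
grouped: Dict[str, List[int]] = defaultdict(list)
for url, one in map_requests("access.log", sample_log):
    grouped[url].append(one)

url_counts = dict(reduce_counts(url, ones) for url, ones in grouped.items())
```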

Implementation(1)

Implementation(2) • Master Data Structure • For each map task and reduce task, the master stores the state (idle, in-progress, or completed) and the identity of the worker machine. • Fault Tolerance • Worker Failure • Master Failure
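The master's bookkeeping can be pictured with a small Python sketch. It assumes the three task states named in the MapReduce paper (idle, in-progress, completed) and the paper's rule for worker failure: in-progress tasks and completed map tasks on the failed worker go back to idle, while completed reduce tasks are kept. The class and function names (`TaskState`, `Task`, `handle_worker_failure`) and the worker identifiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                          # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None       # identity of the worker, if assigned


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> None:
    """Reset the tasks that must be re-executed after a worker is lost."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        in_progress = task.state is TaskState.IN_PROGRESS
        # Completed map output sits on the failed machine's local disk and
        # is lost; completed reduce output is already in the global file
        # system, so that task is left alone.
        lost_map_output = (
            task.kind == "map" and task.state is TaskState.COMPLETED
        )
        if in_progress or lost_map_output:
            task.state = TaskState.IDLE
            task.worker = None


tasks = [
    Task("map", TaskState.COMPLETED, "worker-1"),
    Task("map", TaskState.IN_PROGRESS, "worker-2"),
    Task("reduce", TaskState.COMPLETED, "worker-1"),
    Task("reduce", TaskState.IN_PROGRESS, "worker-1"),
]
handle_worker_failure(tasks, "worker-1")
states = [t.state for t in tasks]
```

After the call, the tasks reset to idle are eligible to be scheduled on another worker; the task on worker-2 is untouched.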

Implementation(3) • Locality • Read the input from local disk where possible, so that little network bandwidth is used. • Task Granularity • Backup Tasks

Conclusion • Please DO NOT assign papers without informing us at the beginning of the semester. • Please stop FLIRTING with the students from CHINA. • Please PREPARE the course content instead of just discussing it for 5 minutes. • Please, OK?
