MapReduce: Simplified Data Processing on Large Clusters B97902029 葉彥廷 B97902083 林廷韋 B97902085 王頃恩
Outline • Why we chose this topic • Introduction • Programming Model • Example • Implementation • Conclusion
Why we chose this topic • Trend Micro cloud programming contest 趨勢騰雲駕霧程式競賽 (2010) • A miserable memory from last summer vacation. • In the end, we did not manage to design a working distributed system. • So we want to learn more about the ideas behind cloud computing.
Introduction(1) • How long could you stand searching for the answer to your automata homework? • A week? • A day? • Or would you ask Google for an instant answer?
Introduction(2) • But how can Google answer so fast? • Is Google good at automata? • It’s MapReduce! • So what can MapReduce do?
Introduction(3) • MapReduce can: • Simplify the processing of large amounts of data. • Split the work into independent tasks that can be computed on the machines of a distributed cluster. • The programmer only needs to implement the Map and Reduce interfaces; the rest is handled by the framework. • But how does it work?
Programming Model(1) • Map function: • Takes one input pair: KEY/VALUE. • Turns the VALUE into a set of intermediate key/value pairs using a user-defined implementation (which may or may not use KEY). • The intermediate pairs are grouped by key and passed to the Reduce function.
Programming Model(2) • Reduce function: • Receives an intermediate key and the set of values that the Map functions emitted for that key. • Merges these values to form a possibly smaller set of values for that key. • The output of all reduce tasks is collected and presented to the user.
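The Map and Reduce contract described on these two slides can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not the distributed implementation: `run_mapreduce`, `word_map`, and `word_reduce` are hypothetical names, and word counting is assumed here as the sample job.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map_fn over every input pair, group by key, then run reduce_fn."""
    # Map phase: each input (key, value) pair yields intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)

    # Reduce phase: each intermediate key is reduced together with all its values.
    output = {}
    for ikey in sorted(intermediate):
        output[ikey] = list(reduce_fn(ikey, intermediate[ikey]))
    return output


def word_map(doc_name, text):
    # The input key (doc_name) is not used here, which the model allows.
    for word in text.split():
        yield word, 1


def word_reduce(word, counts):
    # Merge the values for one key into a smaller set (a single total).
    yield sum(counts)


result = run_mapreduce(
    [("a.txt", "x y x"), ("b.txt", "y z")], word_map, word_reduce
)
print(result)
```

In a real cluster the grouping step is done by the framework across machines; here a dictionary stands in for it.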
Example • Assume we have a log file of web page requests, identified by its name. • We want to know which web pages appear in the log file and how often each one is requested. • Map function • Input: <log file name, web page requests> • Output: <URL, 1> • Reduce function • Input: <URL, list of 1s> • Output: <URL, total count>
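The URL frequency example might look like the following sketch. The log format is an assumption (one request per line, with the URL as the first whitespace-separated field), and `map_fn`, `reduce_fn`, and `count_urls` are hypothetical names; the grouping step that a MapReduce library would perform is simulated with a dictionary.

```python
from collections import defaultdict


def map_fn(log_name, log_text):
    # Assumed format: one request per line, URL in the first field.
    for line in log_text.splitlines():
        fields = line.split()
        if fields:
            yield fields[0], 1  # emit <URL, 1>


def reduce_fn(url, counts):
    # Input: <URL, list of 1s>; output: <URL, total count>.
    return url, sum(counts)


def count_urls(logs):
    # Stand-in for the framework: group intermediate pairs by URL.
    grouped = defaultdict(list)
    for name, text in logs.items():
        for url, one in map_fn(name, text):
            grouped[url].append(one)
    return dict(reduce_fn(url, counts) for url, counts in grouped.items())


logs = {
    "access-1.log": "/index.html\n/about.html\n/index.html\n",
    "access-2.log": "/index.html\n",
}
url_counts = count_urls(logs)
print(url_counts)
```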
Implementation(2) • Master Data Structure • For each map and reduce task, the master stores the state (idle, in-progress, or completed) and the identity of the worker machine. • Fault Tolerance • Worker Failure • Master Failure
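A rough sketch of the master's bookkeeping and its reaction to a worker failure is given below. The class and method names (`Master`, `Task`, `handle_worker_failure`) are hypothetical. The rescheduling rule follows the MapReduce paper: in-progress tasks on a failed worker are reset to idle, and completed map tasks are reset as well because their output lives on the failed machine's local disk, while completed reduce output is already in the global file system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str  # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None  # identity of the worker machine, if any


class Master:
    def __init__(self, n_map: int, n_reduce: int) -> None:
        self.tasks: Dict[str, Task] = {}
        for i in range(n_map):
            self.tasks[f"map-{i}"] = Task("map")
        for i in range(n_reduce):
            self.tasks[f"reduce-{i}"] = Task("reduce")

    def assign(self, task_id: str, worker: str) -> None:
        task = self.tasks[task_id]
        task.state = TaskState.IN_PROGRESS
        task.worker = worker

    def complete(self, task_id: str) -> None:
        self.tasks[task_id].state = TaskState.COMPLETED

    def handle_worker_failure(self, worker: str) -> List[str]:
        """Reset the tasks that must be re-executed; return their ids."""
        reset = []
        for task_id, task in self.tasks.items():
            if task.worker != worker:
                continue
            lost_map_output = (
                task.kind == "map" and task.state is TaskState.COMPLETED
            )
            if task.state is TaskState.IN_PROGRESS or lost_map_output:
                task.state = TaskState.IDLE
                task.worker = None
                reset.append(task_id)
        return reset


master = Master(n_map=2, n_reduce=1)
master.assign("map-0", "w1")
master.complete("map-0")
master.assign("map-1", "w2")
master.assign("reduce-0", "w1")
rescheduled = master.handle_worker_failure("w1")
print(rescheduled)
```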
Implementation(3) • Locality • Schedule map tasks near their input data so that most input is read locally, with little use of the network. • Task Granularity • Backup Tasks
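Backup tasks address stragglers: when the job is close to completion, the master schedules duplicate executions of the tasks that are still in progress, and a task counts as done when either copy finishes. The selection step could be sketched as follows. The function name `pick_backup_tasks` and the 90% threshold are assumptions made for illustration; the slides do not say how "close to completion" is decided.

```python
def pick_backup_tasks(states, threshold=0.9):
    """Return ids of in-progress tasks to duplicate near the end of a job.

    `states` maps a task id to "idle", "in-progress", or "completed".
    `threshold` is an assumed cutoff for "close to completion".
    """
    total = len(states)
    if total == 0:
        return []
    done = sum(1 for state in states.values() if state == "completed")
    if done / total < threshold:
        return []  # too early: duplicating work now would waste resources
    return [tid for tid, state in states.items() if state == "in-progress"]


# 19 of 20 tasks are done, so the one straggler gets a backup execution.
late_job = {f"t{i}": "completed" for i in range(19)}
late_job["t19"] = "in-progress"
backups = pick_backup_tasks(late_job)

# Only half of the tasks are done, so no backups are scheduled yet.
early_job = {f"t{i}": "completed" for i in range(10)}
early_job.update({f"t{i}": "in-progress" for i in range(10, 20)})
no_backups = pick_backup_tasks(early_job)

print(backups, no_backups)
```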
Conclusion • Please DO NOT assign papers without informing us at the beginning of the semester. • Please stop FLIRTING with the students from CHINA. • Please PREPARE the course content instead of discussing it for only 5 minutes. • Please, OK?