MapReduce

MapReduce: Simplified Data Processing on Large Clusters • Jeffrey Dean and Sanjay Ghemawat

Outline • Introduction • Programming Model • Implementation • Refinements • Performance • Related Work • Conclusions

Introduction • What is the purpose? • The abstraction: Input Data → Map → Intermediate Key/Value → Reduce → Output File

Programming Model • Map • Reduce • Example
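A minimal sketch of the Map/Reduce pair, assuming the slide's example is the word-count example from the paper. The single-process driver `run_mapreduce` and the sample documents are hypothetical and stand in for the distributed runtime.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    # Reduce: merge all values that share the same intermediate key.
    return sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    # Hypothetical in-memory driver: group intermediate values by key,
    # then call the reducer once per key.
    groups = defaultdict(list)
    for key, value in inputs.items():
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}


docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; grouping by key is the library's job.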

Programming Model • Real example: building an index
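One way to read "building an index" is as an inverted index: Map emits (word, document ID) pairs and Reduce collects the document IDs for each word. The sketch below assumes that reading; `build_index` and the sample documents are illustrative names, not part of the original slides.

```python
from collections import defaultdict


def index_map(doc_id, text):
    # Emit each distinct word once per document.
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    # Produce a sorted, de-duplicated posting list for the word.
    return sorted(set(doc_ids))


def build_index(docs):
    # Hypothetical single-process driver standing in for the shuffle phase.
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in index_map(doc_id, text):
            postings[word].append(d)
    return {word: index_reduce(word, ids) for word, ids in postings.items()}


docs = {"d1": "map reduce model", "d2": "reduce tasks", "d3": "map tasks"}
index = build_index(docs)
```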

Programming Model • More examples • Distributed grep • Count of URL access frequency • Reverse web-link graph • Term vector per host • Inverted index • Distributed sort
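Two of the listed examples can be sketched with the same tiny driver. For distributed grep, Map emits a line if it matches the pattern and Reduce is the identity; for the reverse web-link graph, Map emits (target, source) for each link and Reduce gathers the sources. The driver `run`, the input records, and the assumption that links are already extracted are all illustrative.

```python
import re
from collections import defaultdict


def grep_map(filename, line, pattern):
    # Emit the line only when it matches the pattern.
    if re.search(pattern, line):
        yield line, filename


def grep_reduce(line, filenames):
    # Identity reduce: pass the matching line through unchanged.
    return line


def links_map(source, targets):
    # Emit (target, source) for every outgoing link on the source page.
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    # Collect every page that links to the target.
    return sorted(sources)


def run(records, mapper, reducer):
    # Hypothetical in-memory driver: group by intermediate key, then reduce.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in groups.items()}


lines = [("log.txt", "error: disk full"), ("log.txt", "ok"),
         ("app.txt", "error: timeout")]
matches = run(lines, lambda f, l: grep_map(f, l, r"error"), grep_reduce)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
reverse = run(pages, links_map, links_reduce)
```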

Implementation • Execution overview
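The execution flow can be approximated in a single process: split the input into M pieces, run a map task per piece that writes R partitioned regions, then run R reduce tasks that each sort their region by key and reduce it. This is only a sketch; `execute`, the use of `zlib.crc32` as the hash, and the word-count functions are assumptions made for illustration.

```python
import zlib
from itertools import groupby


def partition(key, R):
    # Stand-in for hash(key) mod R; crc32 is chosen here only for determinism.
    return zlib.crc32(key.encode("utf-8")) % R


def execute(records, mapper, reducer, M, R):
    # 1. Split the input into M pieces, one per map task.
    splits = [records[i::M] for i in range(M)]
    # 2. Each map task buffers its output in R regions (one per reduce task).
    regions = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in split:
            for k2, v2 in mapper(key, value):
                regions[m][partition(k2, R)].append((k2, v2))
    # 3. Each reduce task reads its region from every map task,
    #    sorts by key, groups equal keys, and produces one output.
    outputs = []
    for r in range(R):
        pairs = [p for m in range(M) for p in regions[m][r]]
        pairs.sort(key=lambda kv: kv[0])
        out = {k: reducer(k, [v for _, v in grp])
               for k, grp in groupby(pairs, key=lambda kv: kv[0])}
        outputs.append(out)
    return outputs


def wc_map(name, text):
    for word in text.split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


records = [("l1", "a b a"), ("l2", "b c"), ("l3", "a c c")]
outputs = execute(records, wc_map, wc_reduce, M=2, R=3)
merged = {k: v for out in outputs for k, v in out.items()}
```

Each key lands in exactly one of the R outputs, which is why the final result is R separate files rather than one.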

Implementation • Master data structures • Fault tolerance • Worker failure • Master failure • Semantics in the presence of failures • Locality • Task granularity • Backup tasks
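A rough sketch of the master's per-task bookkeeping and the worker-failure rule described in the paper: tasks are idle, in-progress, or completed; when a worker fails, its in-progress tasks and its completed map tasks go back to idle (map output lives on the failed worker's local disk), while completed reduce tasks are kept (their output is in the global file system). The class and function names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # worker identity for non-idle tasks


def handle_worker_failure(tasks, failed_worker):
    # Return the indices of tasks that must be rescheduled.
    rescheduled = []
    for i, task in enumerate(tasks):
        if task.worker != failed_worker:
            continue
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if task.state is State.IN_PROGRESS or lost_map_output:
            task.state = State.IDLE
            task.worker = None
            rescheduled.append(i)
    return rescheduled


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w2"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
]
rescheduled = handle_worker_failure(tasks, "w1")
```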

Refinements • Partitioning Function • Ordering Guarantees • Combiner Function • Input and Output Types • Side Effects • Skipping Bad Records • Local Execution • Status Information • Counters
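The first and third refinements lend themselves to a short sketch. The default partitioner is hash(key) mod R, and the paper's custom example partitions URLs by hostname so that all pages from one host end up in the same output file. A combiner does partial merging on the map side before data crosses the network. The function names and the choice of `zlib.crc32` as the hash are assumptions.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def default_partition(key, R):
    # hash(key) mod R; crc32 is a deterministic stand-in for the hash.
    return zlib.crc32(key.encode("utf-8")) % R


def host_partition(url, R):
    # hash(Hostname(url)) mod R: URLs from one host share a partition.
    host = urlparse(url).hostname or ""
    return default_partition(host, R)


def combine(pairs):
    # Combiner for word count: merge (word, count) pairs locally on the
    # map worker so fewer pairs are sent to the reduce tasks.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


same_host = (host_partition("http://example.com/a", 4)
             == host_partition("http://example.com/b/c", 4))
combined = combine([("the", 1), ("a", 1), ("the", 1)])
```

A combiner is only safe when the reduce operation is commutative and associative, as summation is here.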

Performance • Cluster Configuration • 1800 machines • Each with two 2 GHz Intel Xeon processors • 4 GB memory • 2 × 160 GB IDE disks • 1 Gbps Ethernet • Arranged in a two-level tree-shaped switched network

Performance • Grep • Scans through 10^10 100-byte records • Searches for a relatively rare three-character pattern (occurring in 92,337 records) • Data transfer rate over time: peaks at over 30 GB/s with 1764 workers assigned • The entire computation takes approximately 150 s
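The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and treats "over 30 GB/s" as exactly 30 GB/s, so the per-worker figure is a lower bound.

```python
RECORDS = 10**10
RECORD_BYTES = 100
RUNTIME_S = 150          # approximate end-to-end time from the slide
PEAK_GB_PER_S = 30       # "over 30 GB/s", so treated as a lower bound
WORKERS_AT_PEAK = 1764

total_bytes = RECORDS * RECORD_BYTES                 # input size in bytes
total_gb = total_bytes / 10**9                       # about 1 TB of input
avg_gb_per_s = total_gb / RUNTIME_S                  # average over the whole run
per_worker_mb_per_s = PEAK_GB_PER_S * 1000 / WORKERS_AT_PEAK

print(f"input: {total_gb:.0f} GB")
print(f"average rate: {avg_gb_per_s:.2f} GB/s")
print(f"per-worker rate at peak: {per_worker_mb_per_s:.1f} MB/s")
```

The average rate is well below the peak because the scan rate ramps up as workers are assigned and falls off as map tasks finish.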

Performance • Sort • Sorts 10^10 100-byte records • Modeled after the TeraSort benchmark • Map extracts a 10-byte sorting key • Figure panels: normal execution, 200 tasks killed, no backup tasks
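A minimal sketch of the sort program: Map extracts the 10-byte key and emits (key, record), and Reduce is the identity, so the ordering comes from the library sorting intermediate pairs by key. The sketch assumes the key is the first 10 bytes of each record and uses a single partition; `mapreduce_sort` is a hypothetical in-memory driver.

```python
from itertools import groupby

KEY_BYTES = 10  # assumption: the key is the first 10 bytes of the record


def sort_map(offset, record):
    # Emit the extracted key with the original record as the value.
    yield record[:KEY_BYTES], record


def sort_reduce(key, records):
    # Identity reduce: pass the records through unchanged.
    return list(records)


def mapreduce_sort(records):
    pairs = [kv for i, rec in enumerate(records) for kv in sort_map(i, rec)]
    # The runtime's ordering guarantee: pairs are processed in key order.
    pairs.sort(key=lambda kv: kv[0])
    out = []
    for key, grp in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(sort_reduce(key, [rec for _, rec in grp]))
    return out


records = [
    b"0000000003" + b"c" * 90,
    b"0000000001" + b"a" * 90,
    b"0000000002" + b"b" * 90,
]
sorted_records = mapreduce_sort(records)
```

With more than one reduce task, a range-based partitioning function would be needed so that the concatenated outputs are globally ordered.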

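Performance • Sort • Input rate is lower than for grep (map tasks also write intermediate output to local disk) • There is a delay between the end of shuffling and the start of output writes while machines sort the intermediate data • Rates: input > shuffle > output • Effect of backup tasks • Machine failures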

Related Work • Restricted programming models • Parallel processing compared to • Bulk Synchronous Programming & MPI primitives • Backup task mechanism compared to • Charlotte System • Sorting facility compared to • NOW-Sort

Related Work • Sending data over distributed queues compared to • River • Programming model compared to • BAD-FS

Conclusion • What is the reason for the success of MapReduce? • Easy to use • Problems are easily expressible • Scales to large clusters • Lessons learned from this work • Restricting the programming model • Network bandwidth is a scarce resource • Redundant execution
