MapReduce
Dustin Beaupre, Thuy Nguyen
Relation to our course: Chapter 8: Physical Data Model
- 8.3.2: Hash Tables & Files
- 8.6.3: Parallel Processing
Sources:
1. Wikipedia entry (en.wikipedia.org/wiki/MapReduce)
2. Apache MapReduce Tutorial (hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html)
Why is MapReduce useful?
• Processes large amounts of data using parallel processing.
• Divides the workload across a large number of machines.
• Caveat: if the data is updated, you have to re-map.
• Useful for data mining.
• Has fault tolerance: if one machine stops working, its task is reassigned to another machine.
What does map do?
The workload is distributed to multiple machines, and each one runs map() on its share of the input.
map() performs filtering and sorting.
What does reduce do?
Combines the output from the mapping into a single output.
reduce() performs a summary operation.
Logical View
• Data is represented as (key, value) pairs.
• Map(): takes one pair of data in one domain and returns a list of pairs in a different domain.
• Map(k1, v1) -> list(k2, v2)
• All pairs that share the same key k2 are then grouped together.
• Reduce(): applied in parallel to each group to produce a collection of values in the same domain.
• Reduce(k2, list(v2)) -> list(v3)
• Overall, this transforms a list of (key, value) pairs into a single list of values.
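The two signatures in the logical view can be sketched in Python. This is a minimal single-process illustration, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, the input documents are made up, and word count is assumed as the example job.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# All names below are hypothetical; they only mirror the slide's signatures.

def map_fn(doc_name: str, text: str) -> List[Tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the text."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word: str, counts: List[int]) -> List[int]:
    """Reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word."""
    return [sum(counts)]

def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Apply the mapper to every input pair, group by k2, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # grouping step between Map and Reduce
    return {k2: reducer(k2, v2s) for k2, v2s in groups.items()}

# Made-up sample documents: (k1, v1) = (document name, document text).
docs = [("doc1", "the cat sat"), ("doc2", "the dog sat")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
```

In a real deployment the loop over `inputs` and the per-group reduce calls would run on different machines; here they run one after another so the data flow is easy to follow.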
Execution Trace for Wordcount
[Figure not reproduced: Mapper A and Mapper B feeding a Reducer]
SQL
SELECT eyeColor, COUNT(*)
FROM worldPopulation
GROUP BY eyeColor
• Suppose everyone in the world were in this database: all ~7,222,157,690 people.
• The sequential response time is too large!
• MapReduce may help!
Execution Trace for EyeColorCount
[Figure not reproduced: Mapper A and Mapper B feeding a Reducer]
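A trace of this shape can be imitated in Python with two mappers and one reducer. This is only a sketch: the six sample rows, the `partition_a`/`partition_b` split, and the function names are assumptions made for illustration, standing in for the worldPopulation table.

```python
from collections import Counter

# Hypothetical sample rows standing in for worldPopulation: (name, eyeColor).
partition_a = [("Ana", "brown"), ("Bo", "blue"), ("Cy", "brown")]   # Mapper A's share
partition_b = [("Di", "green"), ("Ed", "brown"), ("Flo", "blue")]   # Mapper B's share

def mapper(rows):
    """Emit (eyeColor, 1) for each row in this mapper's partition."""
    return [(eye_color, 1) for _name, eye_color in rows]

def reducer(pairs):
    """Sum the 1s per eye color, like COUNT(*) with GROUP BY eyeColor."""
    totals = Counter()
    for eye_color, one in pairs:
        totals[eye_color] += one
    return dict(totals)

out_a = mapper(partition_a)   # [('brown', 1), ('blue', 1), ('brown', 1)]
out_b = mapper(partition_b)   # [('green', 1), ('brown', 1), ('blue', 1)]
counts = reducer(out_a + out_b)
print(counts)                 # {'brown': 3, 'blue': 2, 'green': 1}
```

Each mapper only ever sees its own partition, so the two `mapper` calls could run on separate machines at the same time; only the small list of (eyeColor, 1) pairs has to travel to the reducer.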
MapReduce steps
• Prepare the Map() input
• Run the user-provided Map() code
• "Shuffle" the Map output to the Reduce processors
• Run the user-provided Reduce() code
• Produce the final output
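The five steps can be laid out in one short Python sketch. Everything here is an assumption made for illustration: threads stand in for separate machines, the worker counts are arbitrary, word count is the sample job, and `zlib.crc32` is used as a stable hash to decide which reducer receives each key. It is not how Hadoop itself is implemented.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_MAPPERS = 2    # assumed worker counts, chosen only for this sketch
NUM_REDUCERS = 2

def user_map(line):
    """User-provided Map(): emit (word, 1) for each word on the line."""
    return [(word, 1) for word in line.split()]

def user_reduce(key, values):
    """User-provided Reduce(): total the values collected for one key."""
    return key, sum(values)

def map_task(split):
    """Run user_map over every record in one mapper's split."""
    pairs = []
    for line in split:
        pairs.extend(user_map(line))
    return pairs

def reduce_task(bucket):
    """Run user_reduce over every key assigned to one reducer."""
    return [user_reduce(key, values) for key, values in sorted(bucket.items())]

def mapreduce(lines):
    # 1. Prepare the Map() input: deal the records out to the mappers.
    splits = [lines[i::NUM_MAPPERS] for i in range(NUM_MAPPERS)]
    with ThreadPoolExecutor() as pool:
        # 2. Run the user-provided Map() code on each split.
        map_outputs = list(pool.map(map_task, splits))
        # 3. Shuffle: hash each key so all of its values reach the same reducer.
        buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
        for pairs in map_outputs:
            for key, value in pairs:
                r = zlib.crc32(key.encode()) % NUM_REDUCERS
                buckets[r][key].append(value)
        # 4. Run the user-provided Reduce() code on each reducer's bucket.
        reduce_outputs = list(pool.map(reduce_task, buckets))
    # 5. Produce the final output: merge the reducers' results.
    return dict(pair for part in reduce_outputs for pair in part)

final = mapreduce(["a b a", "b c", "a c c"])
print(final)
```

Hashing the key in step 3 is what guarantees that every value for a given key ends up at a single reducer, so each reducer can finish its keys without talking to the others.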
Overall, the goal of MapReduce is to produce correct output from large data sets in the smallest amount of time. Any questions?