Lecture 2 – MapReduce: Theory and Implementation
CSE 490h – Introduction to Distributed Computing, Winter 2008
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
Last Class
• How do I process lots of data?
• Distribute the work
• Can I distribute the work?
• Maybe… if it’s not dependent on other tasks
• Example: Fibonacci, where each term depends on the previous two
Last Class
• What problems can occur?
• Large tasks
• Unpredictable bugs
• Machine failure
• How do we solve / avoid these?
• Break up into small chunks?
• Restart tasks?
• Use known working solutions
MapReduce
• Concept from functional programming
• Implemented by Google
• Applied to a large number of problems
Functional Programming Review
Java:
    int fooA(String[] list) {
      return bar1(list) + bar2(list);
    }
    int fooB(String[] list) {
      return bar2(list) + bar1(list);
    }
Do they give the same result?
Functional Programming Review
Functional Programming:
    fun fooA(l: int list) = bar1(l) + bar2(l)
    fun fooB(l: int list) = bar2(l) + bar1(l)
Do they give the same result?
Functional Programming Review
• Operations do not modify data structures: they always create new ones
• Original data still exists in unmodified form
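A minimal Python sketch of why the "same result?" question has different answers in the two settings. The slides do not define bar1 and bar2, so the bodies below and the shared `counter` are hypothetical stand-ins. With hidden mutable state, the call order changes the sum; with pure functions it cannot.

```python
# Hypothetical impure helpers: both read and update shared state.
counter = {"n": 0}

def bar1_impure(items):
    counter["n"] += 1                  # side effect
    return len(items) * counter["n"]

def bar2_impure(items):
    counter["n"] *= 2                  # side effect
    return len(items) + counter["n"]

def foo_a_impure(items):
    return bar1_impure(items) + bar2_impure(items)

def foo_b_impure(items):
    return bar2_impure(items) + bar1_impure(items)

# Hypothetical pure helpers: the result depends only on the argument.
def bar1(items):
    return len(items)

def bar2(items):
    return sum(items)

def foo_a(items):
    return bar1(items) + bar2(items)

def foo_b(items):
    return bar2(items) + bar1(items)

data = [1, 2, 3]

counter["n"] = 0
impure_a = foo_a_impure(data)   # bar1 runs first
counter["n"] = 0
impure_b = foo_b_impure(data)   # bar2 runs first

print(impure_a, impure_b)          # the two orders disagree
print(foo_a(data), foo_b(data))    # the pure versions agree
```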
Functional Updates Do Not Modify Structures
    fun foo(x, lst) =
      let val lst' = rev lst
      in rev (x :: lst') end
    foo : 'a * 'a list -> 'a list
The foo() function above reverses a list, adds a new element to the front, and returns all of that, reversed, which appends an item. But it never modifies lst!
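The same idea as a rough Python sketch. The name `append_functional` is made up for illustration; each step builds a new list, so the caller's list is left untouched.

```python
def append_functional(x, lst):
    """Return a new list with x at the end; lst itself is not modified."""
    reversed_copy = lst[::-1]               # new list: the input reversed
    with_x_in_front = [x] + reversed_copy   # new list: x added to the front
    return with_x_in_front[::-1]            # reversed again, so x ends up last

original = [1, 2, 3]
result = append_functional(4, original)

print(result)     # the appended copy
print(original)   # still the original three elements
```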
Functions Can Be Used As Arguments
    fun DoDouble(f, x) = f (f x)
It does not matter what f does to its argument; DoDouble() will do it twice.
What is the type of this function?
    x : 'a
    f : 'a -> 'a
    DoDouble : ('a -> 'a) * 'a -> 'a
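A small Python sketch of the same higher-order function. `do_double` mirrors DoDouble; the sample functions passed to it are illustrative only. Because f is applied to its own output, f must return the same kind of value it accepts.

```python
def do_double(f, x):
    """Apply f to x, then apply f again to the result."""
    return f(f(x))

def add_three(n):
    return n + 3

def exclaim(s):
    return s + "!"

print(do_double(add_three, 10))   # add_three applied twice
print(do_double(exclaim, "hi"))   # exclaim applied twice
```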
map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
    map f lst : ('a -> 'b) -> ('a list) -> ('b list)
map Implementation
    fun map f [] = []
      | map f (x::xs) = (f x) :: (map f xs)
• This implementation moves left-to-right across the list, mapping elements one at a time
• … But does it need to?
Implicit Parallelism In map
• In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
• If the order in which f is applied to the elements does not affect the result, we can reorder or parallelize execution
• This is the “secret” that MapReduce exploits
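A rough Python sketch of this point, assuming a side-effect-free function (`square` is just an example). The standard library's `ThreadPoolExecutor.map` may run the calls concurrently but still yields results in input order, so the output matches the sequential version. Threads are used here only to show that ordering does not matter, not to claim a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """A pure function: no shared state is read or written."""
    return n * n

def sequential_map(f, items):
    """Apply f left to right, one element at a time."""
    return [f(x) for x in items]

def parallel_map(f, items, workers=4):
    """Apply f on a pool of worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, items))

numbers = list(range(10))
print(sequential_map(square, numbers) == parallel_map(square, numbers))
```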
Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.
    fold f x0 lst : ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
fold left vs. fold right
• Order of list elements can be significant
• Fold left moves left-to-right across the list
• Fold right moves from right-to-left
SML Implementation:
    fun foldl f a [] = a
      | foldl f a (x::xs) = foldl f (f(x, a)) xs
    fun foldr f a [] = a
      | foldr f a (x::xs) = f(x, (foldr f a xs))
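To make the direction difference concrete, here is an iterative Python sketch of the two folds, keeping the SML argument order f(x, a). The `cons` helper is an illustrative choice: an order-sensitive f gives different answers, while addition gives the same answer either way.

```python
def foldl(f, acc, items):
    """Fold left: combine elements with the accumulator from left to right."""
    for x in items:
        acc = f(x, acc)
    return acc

def foldr(f, acc, items):
    """Fold right: combine elements with the accumulator from right to left."""
    for x in reversed(items):
        acc = f(x, acc)
    return acc

def cons(x, acc):
    """Put x in front of the accumulated list (order-sensitive)."""
    return [x] + acc

def add(x, acc):
    return x + acc

values = [1, 2, 3]
print(foldl(cons, [], values))   # elements come out reversed
print(foldr(cons, [], values))   # elements keep their order
print(foldl(add, 0, values), foldr(add, 0, values))
```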
Example
    fun foo(l: int list) = sum(l) + mul(l) + length(l)
How can we implement this?
Example (Solved)
    fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
    fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
    fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst
    fun foo(l: int list) = sum(l) + mul(l) + length(l)
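The same three folds as a Python sketch using `functools.reduce`. Note that `reduce` passes the accumulator first and the element second, the opposite of the (x, a) order in the SML code. The names `total`, `product`, and `count` are chosen here to avoid shadowing Python built-ins.

```python
from functools import reduce

def total(lst):
    """Sum of the elements, starting from 0."""
    return reduce(lambda acc, x: acc + x, lst, 0)

def product(lst):
    """Product of the elements, starting from 1."""
    return reduce(lambda acc, x: acc * x, lst, 1)

def count(lst):
    """Number of elements: add 1 per element, ignoring its value."""
    return reduce(lambda acc, _x: acc + 1, lst, 0)

def foo(lst):
    return total(lst) + product(lst) + count(lst)

print(foo([1, 2, 3, 4]))
```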
Google MapReduce
• Input Handling
• Map Function
• Partition Function
• Compare Function
• Reduce Function
• Output Writer
Input Handling
• Divides up data into bite-size chunks
• Starts up tasks
• Assigns tasks to idle workers
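A toy Python sketch of these steps. The function names, the chunk size, and the worker names are all hypothetical, and the scheduling is heavily simplified: a worker is treated as idle again the moment it is handed a task, so assignment reduces to round-robin. A real system would wait for workers to report completion.

```python
from collections import deque

def split_input(records, chunk_size):
    """Divide the input into chunks of at most chunk_size records."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]

def assign_tasks(chunks, workers):
    """Pair each chunk with the next idle worker (simplified scheduling)."""
    idle = deque(workers)
    assignments = []
    for chunk in chunks:
        worker = idle.popleft()              # take an idle worker
        assignments.append((worker, chunk))
        idle.append(worker)                  # pretend the task finished instantly
    return assignments

chunks = split_input(list(range(7)), chunk_size=3)
assignments = assign_tasks(chunks, ["worker-1", "worker-2"])
for worker, chunk in assignments:
    print(worker, chunk)
```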
Map
• Input: a (key, value) pair
• Output: (key, value) pairs
• Example: Annual Rainfall Per City
Map (Example)
• Example: Annual Rainfall Per City
    map(String key, String value):
      // key: date
      // value: weather info
      foreach (City c in value)
        EmitIntermediate(c, c.rainfall)
Partition Function
• Assigns each map output to a particular reduce task
• Input: key, number of reduce tasks
• Output: index of the desired reduce task
• Typical: hash(key) % numberOfReduces
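A minimal Python sketch of the typical hash-based partitioner. The choice of `zlib.crc32` is an assumption made for this example: it gives the same value on every run, whereas Python's built-in `hash()` for strings is randomized per process by default. Any stable hash would serve.

```python
import zlib

def partition(key, num_reduces):
    """Map a key to a reduce index in the range 0 .. num_reduces - 1."""
    stable_hash = zlib.crc32(key.encode("utf-8"))
    return stable_hash % num_reduces

cities = ["Seattle", "Phoenix", "Boston", "Denver"]
indices = {city: partition(city, 3) for city in cities}
print(indices)
```

The property that matters is that every occurrence of a key lands on the same reduce task, so all values for one city end up together.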
Comparison
• Sorts input for each reduce
• Example: Annual rainfall per city
• Sorts rainfall data for each city
• Seattle: {0, 0, 0, 1, 4, 7, 10, …}
Reduce
• Input: key, sorted list of values
• Output: single value
• Example: Annual rainfall per city
Reduce (Example)
• Example: Annual rainfall per city
    reduce(String key, Iterator values):
      // key: city
      // values: rainfall readings
      sum = 0, count = 0
      for each (v in values)
        sum += v
        count = count + 1
      // mean rainfall per reading; emit sum alone for the annual total
      Emit(sum / count)
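The stages can be tied together in a single-process Python sketch. This is a toy simulation under stated assumptions, not Google's implementation: the function names, the CRC32 partitioner, and the sample readings are all made up for illustration. The reduce step emits the mean of the readings, following the pseudocode above.

```python
import zlib
from collections import defaultdict

def map_fn(date, readings):
    """Emit one (city, rainfall) pair per reading; the date key is unused."""
    for city, rainfall in readings:
        yield city, rainfall

def partition(key, num_reduces):
    """Choose a reduce index with a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces

def reduce_fn(city, sorted_values):
    """Return the mean of the rainfall readings for one city."""
    return sum(sorted_values) / len(sorted_values)

def run_mapreduce(records, num_reduces=2):
    # Map phase: group intermediate values by key inside each partition.
    buckets = [defaultdict(list) for _ in range(num_reduces)]
    for date, readings in records:
        for key, value in map_fn(date, readings):
            buckets[partition(key, num_reduces)][key].append(value)
    # Compare and reduce phases: sort each key's values, then reduce them.
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            output[key] = reduce_fn(key, sorted(bucket[key]))
    return output

# Made-up sample data: (date, [(city, rainfall), ...])
records = [
    ("2008-01-01", [("Seattle", 4), ("Phoenix", 0)]),
    ("2008-01-02", [("Seattle", 10), ("Phoenix", 0)]),
    ("2008-01-03", [("Seattle", 7), ("Phoenix", 3)]),
]
result = run_mapreduce(records)
print(result)
```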
Output
• Writes the output to storage (GFS, etc.)
MapReduce for Google Local
• Intersections
• Rendering Tiles
• Finding nearest gas stations