Lecture #4: Introduction to Data Parallelism and MapReduce
CS492 Special Topics in Computer Science: Distributed Algorithms and Systems
Today’s Topics to Cover
• Short quiz on programming in OCaml
How to parallelize (I)
• Run-length encoding
• Fibonacci function
• Calculation of π
• Word count
• Inverted index
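As one illustration of the examples listed above, the calculation of π can be split into independent chunks. The following is a minimal sketch, not the lecture's own solution: it assumes the midpoint rule applied to the integral of 4/(1+x²) over [0, 1], and the names `partial_sum`, `estimate_pi`, and the `mapper` parameter are hypothetical. Each chunk is computed without reference to any other, so the built-in `map` could be swapped for a parallel one such as `concurrent.futures.Executor.map`.

```python
import math

def partial_sum(args):
    """Midpoint-rule contribution of steps [start, stop) out of n_steps."""
    start, stop, n_steps = args
    h = 1.0 / n_steps  # width of one integration step
    return h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(start, stop))

def estimate_pi(n_steps=100_000, n_chunks=8, mapper=map):
    """Split the integration range into n_chunks independent pieces.

    `mapper` is an assumption for illustration: any function with the
    signature of the built-in map (e.g. an executor's map) can be passed.
    """
    bounds = [
        (k * n_steps // n_chunks, (k + 1) * n_steps // n_chunks, n_steps)
        for k in range(n_chunks)
    ]
    # "Map": compute each chunk independently. "Reduce": add the results.
    return sum(mapper(partial_sum, bounds))

if __name__ == "__main__":
    print(abs(estimate_pi() - math.pi) < 1e-6)
```

The only communication between chunks is the final sum, which is what makes this problem easy to parallelize.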
How to parallelize (II)
• SIMD
• MIMD via shared memory
• MIMD via message passing
• Distributed computing
MapReduce
• The functional programming “Map / Reduce” way of thinking about problem solving
• Google’s runtime library supporting the MapReduce (MR) paradigm at a very large scale
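The “Map / Reduce” way of thinking can be sketched on a single machine with word count. This is an illustrative sketch only, not Google's library or its API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the grouping step stands in for the work a distributed runtime would do between the two phases.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key (done by the runtime in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)

def word_count(documents):
    # Each document can be mapped independently of the others.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    # Each key can be reduced independently of the others.
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog"]))
```

Because the map calls share no state and each key is reduced on its own, both phases can be spread across many workers without changing the user-written functions.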
MapReduce Execution Overview (CS492, Fall 2008)
How popular is MapReduce?
• In September 2007, Google used 11,081 “machine-years” (roughly, CPU-years) on MapReduce jobs alone
• Assume all machines were 100% busy and ran only MapReduce for the 30 days of that month: 11,081 × 365 / 30 ≈ 134,818 machines (counting each machine as one CPU)
• If a rack holds 176 CPUs (88 1U dual-processor servers): 134,818 / 176 ≈ 766 racks
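The back-of-the-envelope estimate above can be reproduced directly. This sketch uses only the figures on the slide; the function names are hypothetical, and integer division is used to match the slide's rounded-down values.

```python
def machines_needed(machine_years, days_per_year=365, days_in_month=30):
    """Machines that must run full-time for one month to deliver the
    given number of machine-years of work."""
    return machine_years * days_per_year // days_in_month

def racks_needed(machines, cpus_per_rack=176):
    """Racks required, treating one machine as one CPU as the slide does."""
    return machines // cpus_per_rack

machines = machines_needed(11_081)  # 11,081 machine-years in September 2007
racks = racks_needed(machines)      # 88 dual-processor 1U servers per rack

if __name__ == "__main__":
    print(machines, racks)  # 134818 766
```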
Reading material
• J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, Vol. 51, No. 1, Jan. 2008
• J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” USENIX OSDI 2004