170 likes | 436 Views
Introduction to MapReduce and Hadoop. IT 332 Distributed Systems. What is MapReduce ?. Data-parallel programming model for clusters of commodity machines Pioneered by Google Processes 20 PB of data per day Popularized by open-source Hadoop project Used by Yahoo!, Facebook, Amazon, ….
E N D
Introduction to MapReduce and Hadoop IT 332 Distributed Systems
What is MapReduce? • Data-parallel programming model for clusters of commodity machines • Pioneered by Google • Processes 20 PB of data per day • Popularized by open-source Hadoop project • Used by Yahoo!, Facebook, Amazon, …
What is MapReduce used for? • At Google: • Index building for Google Search • Article clustering for Google News • Statistical machine translation • At Yahoo!: • Index building for Yahoo! Search • Spam detection for Yahoo! Mail • At Facebook: • Data mining • Ad optimization • Spam detection
What is MapReduce used for? • In research: • Analyzing Wikipedia conflicts (PARC) • Natural language processing (CMU) • Bioinformatics (Maryland) • Astronomical image analysis (Washington) • Ocean climate simulation (Washington) • <Your application here>
Outline • MapReduce architecture • Fault tolerance in MapReduce • Sample applications • Getting started with Hadoop • Higher-level languages on top of Hadoop: Pig and Hive
MapReduce Design Goals • Scalability to large data volumes: • Scan 100 TB on 1 node @ 50 MB/s = 23 days • Scan on 1000-node cluster = 33 minutes • Cost-efficiency: • Commodity nodes (cheap, but unreliable) • Commodity network • Automatic fault-tolerance (fewer admins) • Easy to use (fewer programmers)
Typical Hadoop Cluster Aggregation switch Rack switch • 40 nodes/rack, 1000-4000 nodes in cluster • 1 GBps bandwidth in rack, 8 GBps out of rack • Node specs (Yahoo terasort):8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
Typical Hadoop Cluster Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Challenges • Cheap nodes fail, especially if you have many • Mean time between failures for 1 node = 3 years • MTBF for 1000 nodes = 1 day • Solution: Build fault-tolerance into system • Commodity network = low bandwidth • Solution: Push computation to the data • Programming distributed systems is hard • Solution: Data-parallel programming model: users write “map” and “reduce” functions, system handles work distribution and fault tolerance
Hadoop Components • Distributed file system (HDFS) • Single namespace for entire cluster • Replicates data 3x for fault-tolerance • MapReduce implementation • Executes user jobs specified as “map” and “reduce” functions • Manages work distribution & fault-tolerance
Hadoop Distributed File System • Files split into 128MB blocks • Blocks replicated across several datanodes (usually 3) • Single namenode stores metadata (file names, block locations, etc) • Optimized for large files, sequential reads • Files are append-only Namenode File1 1 2 3 4 1 2 1 3 2 1 4 2 4 3 3 4 Datanodes
MapReduce Programming Model • Data type: key-value records • Map function: (Kin, Vin) list(Kinter, Vinter) • Reduce function: (Kinter, list(Vinter)) list(Kout, Vout)
Example: Word Count defmapper(line): foreach word in line.split(): output(word, 1) defreducer(key, values): output(key, sum(values))
Word Count Execution Input Map Shuffle & Sort Reduce Output the, 1 brown, 1 fox, 1 the quick brown fox brown, 2 fox, 2 how, 1 now, 1 the, 3 Map Reduce the, 1 fox, 1 the, 1 the fox ate the mouse Map quick, 1 how, 1 now, 1 brown, 1 ate, 1 cow, 1 mouse, 1 quick, 1 ate, 1 mouse, 1 Reduce how now brown cow Map cow, 1
MapReduce Execution Details • Single master controls job execution on multiple slaves as well as user scheduling • Mappers preferentially placed on same node or same rack as their input block • Push computation to data, minimize network use • Mappers save outputs to local disk rather than pushing directly to reducers • Allows having more reducers than nodes • Allows recovery if a reducer crashes
An Optimization: The Combiner • A combiner is a local aggregation function for repeated keys produced by same map • For associative ops. like sum, count, max • Decreases size of intermediate data • Example: local counting for Word Count: defcombiner(key, values): output(key, sum(values))
Word Count with Combiner Input Map & Combine Shuffle & Sort Reduce Output the, 1 brown, 1 fox, 1 the quick brown fox brown, 2 fox, 2 how, 1 now, 1 the, 3 Map Reduce the, 2 fox, 1 the fox ate the mouse Map quick, 1 how, 1 now, 1 brown, 1 ate, 1 cow, 1 mouse, 1 quick, 1 ate, 1 mouse, 1 Reduce how now brown cow Map cow, 1