260 likes | 411 Views
MapReduce and Hadoop. Frankie Pike. Why care?. 2010: 1.2 zettabytes 1.2 trillion gigabytes DVDs past the moon 2-way = 6 newspapers everyday ~58% growth per year. Why care?. Why care?. Google’s capacity = 1 exabyte 24 hours of Youtube > Internet in 2000
E N D
MapReduce and Hadoop Frankie Pike
Why care? • 2010: 1.2 zettabytes • 1.2 trillion gigabytes • DVDs past the moon • 2-way = 6 newspapers everyday • ~58% growth per year
Why care? • Google’s capacity = 1 exabyte • 24 hours of Youtube > Internet in 2000 • 4 years of video / day on Youtube • 100 trillion words online
Common Architecture http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg
Common Architecture • Single point of failure • Space-constraints • Multi-tenancy difficulties • Re-writing of programs or changes to network config
The Promise • High reliability • any node can go down • High scalability • easy to add nodes • Multi-tenancy • Cost Reduction • “Cloud-friendly” • Java, C++, C#, Python, R • Transparent Parallelization
The Kryptonite • Data set needs to be “big enough” • Consistency mid-processing
Two Steps in MapReduce • Map • Reduce
Mapping • Input K/V pairs -> Intermediate K/V Pairs • Input and Intermediate can be different • (Server Key, Blog Data) -> (Blog Key, Post Count) • Sorted and Partitioned for reduction • Number of maps depends on task and cluster • 10TB data with blocksize 128MB = 82,000 maps • 10-100 maps per node ideal
Reducing • Intermediate K/V -> Intermediate K/V (smaller) • Matching keys consolidated • (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6) • Number of Reductions >= 0 • Hopefully smaller dataset at each iteration • Reduce as much as needed
An Example { "type": "post", "name": "Raven's Map/Reduce functionality", "blog_id": 1342, "post_id": 29293921, "tags": ["raven", "nosql"], "post_content": "<p>...</p>", "comments": [ { "source_ip": '124.2.21.2', "author": "martin", "text": "..."} ] } Want count of comments for blog http://ayende.com/blog/4435/map-reduce-a-visual-explanation
Step 1: Map to final format http://ayende.com/blog/4435/map-reduce-a-visual-explanation
Step 2: Reduce (Partition) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
Step 3: Reduce (more) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
Step 4: Reduce (most) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
Single Node http://bc.tech.coop/blog/070520.html
Dual Node http://map-reduce.wikispaces.asu.edu/
N-Nodes http://www.inventoland.net/img/blog/mapReduce.png
Dealing with Failure • Workers • Occasional check-in pings by masters • Masters • Data structures get periodic auto-saves and consistency checks. Can restart from periodic saves • Bandwidth • Tasks attempt to pair with local storage
Has it worked? • Patented • Regenerated index
Apache Hadoop • “open source software for reliable, scalable, distributed computing” • Hadoop Distributed File System (HDFS) • HadoopMapReduce • Cassandra (multi-master database) • HBase (scalable, distributed, structured database) • Mahout (data mining and machine learning libs) • ZooKeeper (coordination service)
Sources • Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu • AyendeRahien, Map/Reduce – A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual-explanation • http://hadoop.apache.org/ • http://en.wikipedia.org/wiki/MapReduce/