CSC313: Advanced Programming Topics Map-Reduce: Win -or- Epic Win
Brief History of Google • BackRub: 1996 • 4 disk drives • 24 GB total storage
Brief History of Google • Google: 1998 • 44 disk drives • 366 GB total storage
Traditional Design Principles • If the problem is big enough, a supercomputer handles the work • Uses desktop CPUs, just a lot more of them • But it also provides huge bandwidth to memory • Equivalent to many machines’ bandwidth at once • But supercomputers are VERY, VERY expensive • Maintenance is also expensive once the machine is bought • But you do get something: high quality == low downtime • Safe, expensive solution to very large problems
How Was Search Performed? • Browser requests http://www.yahoo.com/search?p=pager • DNS resolves www.yahoo.com to 209.191.122.70 • Request goes to http://209.191.122.70/search?p=pager
Google’s Big Insight • Performing search is “embarrassingly parallel” • No need for a supercomputer and all that expense • Can instead do this using lots & lots of desktops • Identical effective bandwidth & performance • But the problem is that desktop machines are unreliable • Budget for 2 replacements, since machines are cheap • Just expect failure; software provides the quality
Brief History of Google • Google: 2012 • ?0,000 total servers • ??? PB total storage
How Is Search Performed Now? • Browser requests http://209.85.148.100/search?q=android • Query is fanned out to the Spell Checker, Ad Server, Document Servers (TB), and Index Servers (TB)
Google’s Processing Model • Buy cheap machines & prepare for the worst • Machines are going to fail, but this is still the cheaper approach • Important steps keep the whole system reliable • Replicate data so that information losses are limited • Move data freely so loads can always be rebalanced • These decisions lead to many other benefits • Scalability is helped by the focus on balancing • Search speed improved; performance much better • Resources are fully utilized, since search demand varies
Heterogeneous Processing • By buying the cheapest computers, variances are high • Programs must handle homogeneous & heterogeneous systems • Centralized work queue helps with different speeds (see the sketch below) • This process also leads to a few small downsides • Space • Power consumption • Cooling costs
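The work-queue idea can be illustrated on a single machine with a thread pool standing in for a cluster: workers that finish early just pull the next task from the shared queue, so uneven task sizes (or machine speeds) balance out on their own. This is only an illustrative sketch with made-up names and timings, not Google's scheduler.

import java.util.concurrent.*;

// Illustrative sketch (not Google's scheduler): a shared work queue feeding
// a pool of workers. Tasks take uneven amounts of time; a worker that
// finishes early simply pulls the next task, so the load balances itself.
public class WorkQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int task = 0; task < 20; task++) {
            final int id = task;
            workers.submit(() -> {
                try {
                    Thread.sleep(50L + (id % 5) * 100);  // uneven task sizes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println(Thread.currentThread().getName()
                                   + " finished task " + id);
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
    }
}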
Complexity at Google • Avoid this nightmare by using abstractions
Google Abstractions • Google File System • Handles replication to provide scalability & durability • BigTable • Manages large relational data sets • Chubby • Gonna skip past that joke; distributed locking service • MapReduce • If the job fits, easy parallelism possible without much work
MapReduce Overview • Programming model provides a good Façade • Automatic parallelization & load balancing • Network and disk I/O optimization • Robust performance even if machines fail • Idea came from 2 Lisp (functional) primitives (see the sketch below) • Map: process each entry in a list using some function • Reduce: recombine the data using a given function
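To make the two primitives concrete, here is a minimal sketch using Java streams on a small in-memory list (plain Java, not Google's API): map applies a function to every entry, and reduce recombines the results into one value.

import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of the two functional primitives, using Java streams
// (not Google's API): map transforms every entry, reduce recombines them.
public class MapReducePrimitives {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");

        // Map: process each entry in the list with some function
        List<Integer> lengths = words.stream()
                                     .map(String::length)
                                     .collect(Collectors.toList());

        // Reduce: recombine the mapped data with a given function
        int totalLength = lengths.stream().reduce(0, Integer::sum);

        System.out.println(lengths);      // [2, 2, 2, 3, 2, 2]
        System.out.println(totalLength);  // 13
    }
}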
Typical MapReduce problem • Read lots and lots of data (e.g., TBs) • Map • Extract important data from each entry in the input • Combine Map outputs and sort entries by key • Reduce • Process each key’s entries to get the result for that key • Output the final result & watch the money roll in • Template method is always the same; just the map & reduce hook methods change (sketched below)
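As a rough illustration of that template-method structure, here is a single-machine sketch (invented class and method names, not Google's implementation): the fixed outline lives in one final run() method, and a concrete job only supplies the map and reduce hooks.

import java.util.*;

// Single-machine sketch of the MapReduce "template method" (invented names,
// not Google's API): run() is the fixed outline -- map every input entry,
// shuffle/sort by key, then reduce each key's values. Subclasses only
// override the two hook methods.
abstract class SimpleMapReduce<K, V, K2 extends Comparable<K2>, V2, R> {
    // Hook methods: the only parts that change from job to job
    protected abstract List<Map.Entry<K2, V2>> map(K key, V value);
    protected abstract R reduce(K2 key, List<V2> values);

    // Template method: the outline that never changes
    public final Map<K2, R> run(Map<K, V> input) {
        Map<K2, List<V2>> shuffled = new TreeMap<>();       // sorted by key
        for (Map.Entry<K, V> in : input.entrySet()) {
            for (Map.Entry<K2, V2> out : map(in.getKey(), in.getValue())) {
                shuffled.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                        .add(out.getValue());
            }
        }
        Map<K2, R> result = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}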
Ex: Count Word Frequencies • Map processes each file separately & counts word frequencies in it • Input: Key=URL, Value=text on page • Output: one (Key’=word, Value’=count) pair per word
Ex: Count Word Frequencies • In the shuffle step, Map outputs are combined & entries sorted by key, e.g. (“be”,“1”) (“be”,“1”) (“not”,“1”) (“or”,“1”) (“to”,“1”) (“to”,“1”) • Reduce combines each key’s entries to compute the final output, e.g. (“be”,“2”) (“not”,“1”) (“or”,“1”) (“to”,“2”)
Word Frequency Pseudo-code
Map(String input_key, String input_values) {
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, "1");
  }
}
Reduce(String key, Iterator intermediate_values) {
  int result = 0;
  foreach v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(result);
}
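For a runnable single-machine version of that pseudo-code, the sketch below inlines the shuffle with a sorted map; the sample pages are made up, and there is no distribution, just the map/shuffle/reduce logic.

import java.util.*;

// Single-machine sketch of the word-frequency job: the Map phase emits
// (word, 1) pairs, the shuffle groups them by word (sorted), and the
// Reduce phase sums each word's counts. Sample input is made up.
public class WordCount {
    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
            "http://example.com/a", "to be or not to be",
            "http://example.com/b", "to do is to be");

        // Map + shuffle: emit (word, 1) for every word, grouped & sorted by word
        Map<String, List<Integer>> intermediate = new TreeMap<>();
        for (String text : pages.values()) {
            for (String w : text.split(" ")) {
                intermediate.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce: sum the counts emitted for each word
        for (Map.Entry<String, List<Integer>> e : intermediate.entrySet()) {
            int result = 0;
            for (int v : e.getValue()) result += v;
            System.out.println(e.getKey() + " " + result);
        }
    }
}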
Ex: Build Search Index • Map processes each file separately & records the words found on it • Input: Key=URL, Value=text on page • Output: one (Key’=word, Value’=URL) pair per word • To get the search index, Reduce combines each key’s results • Output: Key=word, Value=URLs containing the word
Search Index Pseudo-code
Map(String input_key, String input_values) {
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, input_key);
  }
}
Reduce(String key, Iterator intermediate_values) {
  List result = new ArrayList();
  foreach v in intermediate_values {
    result.add(v);
  }
  Emit(result);
}
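And a matching single-machine sketch of the index builder (same made-up pages as before; a real job would also de-duplicate each URL list):

import java.util.*;

// Single-machine sketch of the index-building job: the Map phase emits
// (word, URL) pairs and the Reduce phase collects each word's URL list.
public class InvertedIndex {
    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
            "http://example.com/a", "to be or not to be",
            "http://example.com/b", "to do is to be");

        // Map + shuffle: emit (word, url) for every word, grouped & sorted by word
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String w : page.getValue().split(" ")) {
                index.computeIfAbsent(w, k -> new ArrayList<>())
                     .add(page.getKey());
            }
        }

        // Reduce: each word's value list is the URLs containing it
        index.forEach((word, urls) -> System.out.println(word + " -> " + urls));
    }
}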
Ex: Page Rank Computation • Google’s algorithm for ranking pages’ relevance
Ex: Page Rank Computation • Map • Input: Key=<URL, rank>, Value=links on page • For each of a page’s N links, emit Key’=link, Value’=<URL, rank/N> • Reduce • Sum the rank contributions arriving at each URL to compute its new rank for the next iteration
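A rough sketch of one such iteration on a tiny made-up link graph (simplified: no damping factor, dangling pages ignored; not Google's actual implementation):

import java.util.*;

// One simplified PageRank iteration: the Map phase splits each page's rank
// evenly among its N out-links, emitting (target, rank/N); the Reduce phase
// (merged into the same loop here) sums the shares each page receives.
public class PageRankStep {
    public static void main(String[] args) {
        // Made-up link graph: page -> pages it links to
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> rank = Map.of("A", 1.0 / 3, "B", 1.0 / 3, "C", 1.0 / 3);

        Map<String, Double> newRank = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = rank.get(e.getKey()) / e.getValue().size();  // rank/N
            for (String target : e.getValue()) {
                newRank.merge(target, share, Double::sum);              // reduce: sum shares
            }
        }
        System.out.println(newRank);  // ranks after one iteration: {A=0.33.., B=0.16.., C=0.5}
    }
}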