Virtual Clusters Supporting MapReduce in the Cloud Jonathan Klinginsmith jklingin@indiana.edu School of Informatics and Computing Indiana University Bloomington
Let’s Break this Title Down Virtual Clusters Supporting MapReduce in the Cloud
Let’s Start with MapReduce
• An example to get us warmed up…
Map
    line = "hello world goodbye world"
    words = line.split()
    # ["hello", "world", "goodbye", "world"]
    # list() makes the result a sortable list on Python 3 as well
    map_results = list(map(lambda x: (x, 1), words))
    # [('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)]
Can’t have “MapReduce” without the “Reduce”
Reduce
    from functools import reduce  # reduce is not a builtin on Python 3
    from itertools import groupby
    from operator import itemgetter
    # groupby only groups adjacent items, so sort by key first
    map_results = sorted(map_results)
    # [('goodbye', 1), ('hello', 1), ('world', 1), ('world', 1)]
    for word, group in groupby(map_results, itemgetter(0)):
        counts = [count for (_, count) in group]
        total = reduce(lambda x, y: x + y, counts)
        print("{0} {1}".format(word, total))
Output:
    goodbye 1
    hello 1
    world 2
What Did We Just Do?
“hello world goodbye world”
Split: “hello”, “world”, “goodbye”, “world”
Map: ('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)
Sort: ('goodbye', 1), ('hello', 1), ('world', 1), ('world', 1)
Reduce: ('goodbye', 1), ('hello', 1), ('world', 2)
The “Value” of Knowing the “Key” Pieces*
Map – creates (key, value) pairs: ('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)
Sort by the key: ('goodbye', 1), ('hello', 1), ('world', 1), ('world', 1)
Reduce operation performed on the values of each key: ('goodbye', 1), ('hello', 1), ('world', 2)
* = Pun intended
In General then… Split: Map: Sort: Reduce:
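The four steps can be written once and reused for any job. The following is a minimal single-machine sketch, not the implementation of any MapReduce framework; the names `map_reduce`, `mapper`, and `reducer` are hypothetical and do not come from the slides.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Run already-split records through map, sort, and reduce in memory."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Sort: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: fold the values of each key into a single result.
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append((key, reduce(reducer, values)))
    return results


# Split: one record per input line.
lines = ["hello world", "goodbye world"]
word_counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda x, y: x + y,
)
print(word_counts)  # [('goodbye', 1), ('hello', 1), ('world', 2)]
```

Swapping in a different `mapper` and `reducer` changes the job without touching the sort and group logic, which is the point of separating the “key” pieces from the “value” operation.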
Check “MapReduce” off the List Virtual Clusters Supporting MapReduce in the Cloud
Compute Cluster • Set of computers • Proximity • Networking • Storage • Resource Manager
Breaking Down Large Problems Many compute patterns have emerged; one such pattern is… Scatter/Gather:
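Scatter/gather can be sketched on a single machine. In this illustration threads stand in for cluster nodes, which is an assumption made for brevity; the function names `scatter` and `scatter_gather` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(items, n_chunks):
    """Scatter: split items into n_chunks roughly equal slices."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def scatter_gather(items, work, combine, n_workers=4):
    """Scatter items to workers, run work on each chunk, gather with combine."""
    chunks = scatter(items, n_workers)
    # Threads play the role of worker nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, chunks))
    # Gather: merge the partial results into one answer.
    return combine(partials)


total = scatter_gather(list(range(1, 101)), work=sum, combine=sum)
print(total)  # 5050
```

Each worker only sees its own chunk, so the pattern works when partial results can be combined afterwards, as sums and counts can.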
What if There Is a Lot of Data? Network Bottleneck?
What about Local Node Storage? • Distribute the data across the nodes (scatter/split) • Replicate the data to prevent data loss • Have the file system keep track of where the chunks (blocks) are stored • The resource manager schedules jobs on the nodes that store the data
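The bullets above can be illustrated with a toy model. This is a sketch under stated assumptions, not how any particular distributed file system or scheduler works: blocks are placed round-robin, `replication` must not exceed the number of nodes, and the names `place_blocks`, `schedule`, and the node labels are hypothetical.

```python
def place_blocks(n_blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def schedule(placement):
    """Send each block's task to the least-loaded node holding a replica."""
    load = {}
    assignment = {}
    for block, holders in placement.items():
        # Only nodes that already store the block are candidates,
        # so no block has to cross the network before processing.
        node = min(holders, key=lambda n: load.get(n, 0))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment


nodes = ["node-a", "node-b", "node-c"]
placement = place_blocks(n_blocks=6, nodes=nodes, replication=2)
assignment = schedule(placement)
for block, node in assignment.items():
    print(block, placement[block], "->", node)
```

The `placement` dictionary plays the role of the file system's block map, and `schedule` only ever picks a node from that map, which is what moving the computation to the data means here.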
MapReduce on the Cluster Data distributed across the nodes (scatter/split) when loaded into the file system
Check “Clusters” off the List Virtual Clusters Supporting MapReduce in the Cloud
Virtual…and…the Cloud Let’s start with Virtual... • A Virtual Machine (VM) • A “guest” virtual computer running on a “host” physical computer • A machine image (MI) is instantiated into a running VM • MI = snapshot of operating system (OS) and any software
Virtual…and…the Cloud The Cloud... • Virtualization + Internet Introduction of the Cloud • Scalability • Elasticity • Utility computing – not a capital expenditure • Three levels of service • Software (SaaS) – e.g., Salesforce.com, Web-based email • Platform (PaaS) – e.g., Google App Engine • Infrastructure (IaaS) – e.g., Amazon EC2
Why is the Cloud Interesting? In Industry • Scalability – get scale not present in internal data centers • Elasticity – change scale as capacity demands • Utility computing – no capital investment Example use cases: • High Performance/Throughput Computing • Online game development • Scalable web development
Why is the Cloud Interesting? In Academia • Reproducibility – reuse MIs between researchers • Educational Opportunities • Virtual environment Variety of uses and configurations • Learn about foundational system components • Collaborate within the same environment
Covered “Virtual” and “the Cloud” Virtual Clusters Supporting MapReduce in the Cloud Let’s put it all together...
MapReduce Virtual Clusters in the Cloud • Create virtual clusters running MapReduce • Test algorithms • Test infrastructure and other system attributes
MapReduce Virtual Clusters in the Cloud • Research Areas • Bioinformatics – e.g., Genomic Alignments • Data/Text Mining and Processing • Large-scale Graph Algorithms
From Virtual Clusters to a Local Sandbox • Use a local sandbox to cover MapReduce topics