Cloud Computing

Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung’s Lecture Notes at Yonsei University

Cloud Computing • Cloud Introduction • Cloud Service Model • Big Data • Hadoop • MapReduce • HDFS (Hadoop Distributed File System)

MapReduce

MapReduce • Hadoop • Hadoop is a Reliable Shared Storage and Analysis System • Hadoop = HDFS + MapReduce + α • HDFS provides Data Storage • HDFS: Hadoop Distributed File System • MapReduce provides Data Analysis • MapReduce = Map Function + Reduce Function
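The idea that MapReduce = Map Function + Reduce Function can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's API; the names map_function, reduce_function, and run_job are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_function(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_function(word, counts):
    """Reduce: combine all values that share a key into one result."""
    return word, sum(counts)


def run_job(lines):
    # Map phase: apply the map function to every record.
    intermediate = []
    for line in lines:
        intermediate.extend(map_function(line))

    # Group values by key (a real framework does this between the phases).
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one call per distinct key.
    return dict(reduce_function(key, values) for key, values in grouped.items())
```

Calling run_job(["big data", "Big cloud"]) counts "big" twice and the other words once. On a cluster the map calls would run in parallel on different machines, which this sketch does not attempt to model.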

MapReduce • Scaling Out • Scaling out is done by the DFS (Distributed File System), where the data is divided and stored in distributed computers & servers • Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines, each of which processes the part of the divided data assigned to it

MapReduce • Jobs • A MapReduce job is a unit of work that needs to be executed • A job consists of: Data input, MapReduce program, Configuration Information, etc. • A job is executed by dividing it into two types of tasks • Map Task • Reduce Task

MapReduce • Node types for Job execution • Job execution is controlled by 2 types of nodes • Jobtracker • Tasktracker • Jobtracker coordinates all jobs • Jobtracker schedules all tasks and assigns the tasks to tasktrackers

MapReduce • Tasktracker will execute its assigned task • Tasktracker will send progress reports to the Jobtracker • Jobtracker will keep a record of the progress of all jobs executed
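The division of labor between the jobtracker and the tasktrackers can be sketched as a small bookkeeping class. This is an illustrative model only, not Hadoop code: the class name JobTracker, its methods, and the first-come-first-served assignment order are all assumptions.

```python
class JobTracker:
    """Toy jobtracker: hands out tasks and records reported progress."""

    def __init__(self, task_ids):
        self.pending = list(task_ids)      # tasks not yet assigned
        self.assigned_to = {}              # task id -> tasktracker name
        self.progress = {task_id: 0.0 for task_id in task_ids}

    def assign_task(self, tracker_name):
        """Give the next pending task to a tasktracker (None if no work is left)."""
        if not self.pending:
            return None
        task_id = self.pending.pop(0)
        self.assigned_to[task_id] = tracker_name
        return task_id

    def report_progress(self, task_id, fraction):
        """Called by a tasktracker; fraction is between 0.0 and 1.0."""
        self.progress[task_id] = fraction

    def job_progress(self):
        """Average progress over all tasks of the job."""
        if not self.progress:
            return 0.0
        return sum(self.progress.values()) / len(self.progress)
```

A tasktracker in this model would call assign_task to obtain work and report_progress as the task advances, while job_progress gives the jobtracker's overall view.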

MapReduce • Dataflow • Hadoop divides the input into input splits (or splits) suitable for the MapReduce job • A split has a fixed size • Split size is commonly matched to the size of an HDFS block (64 MB) for maximum processing efficiency

MapReduce • Dataflow • A Map Task is created for each split • The Map Task executes the map function for all records within the split • Hadoop commonly executes the Map Task on the node where the input data resides

MapReduce Dataflow • Data-Local Map Task • Data locality optimization • Does not need to use the cluster network • The data-local flow shows why the optimal split size = the HDFS block size (64 MB)
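The link between split size and block size can be checked with a little arithmetic. The sketch below is a simplified model, assuming the 64 MB block size from the slides and splits laid out back to back from offset 0; plan_splits and blocks_touched are hypothetical helper names, not Hadoop functions.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # block size used in the slides


def plan_splits(file_size, split_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover the file; the last split may be shorter."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


def blocks_touched(offset, length, block_size=BLOCK_SIZE):
    """Number of HDFS blocks that a split starting at offset overlaps."""
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return last_block - first_block + 1
```

With a 200 MB file, plan_splits produces four splits and each one overlaps exactly one block, so a map task can read all of its input from a single node. A 100 MB split starting at offset 0 would overlap two blocks, and those two blocks are not guaranteed to sit on the same node, so part of the input may have to cross the network.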

MapReduce Dataflow • Rack-Local Map Task • A node hosting the HDFS block replicas for a map task’s input split could be running other map tasks • The Job Scheduler will then look for a free map slot on a node in the same rack as one of the blocks • (Figure labels: Node, Rack, Data Center, Map Task, HDFS Block)

MapReduce Dataflow • Off-Rack Map Task • Needed when the Job Scheduler cannot perform data-local or rack-local map tasks • Uses inter-rack network transfer
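The three placement cases (data-local, rack-local, off-rack) amount to an order of preference. The following sketch shows that order under simplifying assumptions: each node has at most one free map slot, and the function name choose_node and its arguments are hypothetical rather than part of any Hadoop scheduler.

```python
def choose_node(replica_nodes, free_nodes, rack_of):
    """Pick a node for a map task and say which locality level was achieved.

    replica_nodes: nodes holding a replica of the split's HDFS block
    free_nodes:    nodes that currently have a free map slot (in scan order)
    rack_of:       dict mapping node name -> rack name
    """
    # 1. Data-local: a free slot on a node that already stores the block.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "data-local"

    # 2. Rack-local: a free slot in the same rack as one of the replicas.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Off-rack: any free slot; the block must cross the inter-rack network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no free slot"
```

For example, with rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"} and a block stored only on n1, a free slot on n2 is preferred over one on n3 because n2 shares a rack with the replica.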

MapReduce • Map • Map task will write its output to the local disk • Map task output is not the final output, it is only the intermediate output • Reduce • Map task output is processed by Reduce Tasks to produce the final output • Reduce Task output is stored in HDFS • For a completed job, the Map Task output can be discarded

MapReduce Single Reduce Task • Node includes Split, Map, Sort, and Output unit • Light blue arrows show data transfers in a node • Black arrows show data transfers between nodes

MapReduce Single Reduce Task • Number of reduce tasks is specified independently, and is not based on the size of the input

MapReduce • Combiner Function • User specified function to run on the Map output • Forms the input to the Reduce function • Specifically designed to minimize the data transferred between Map Tasks and Reduce Tasks • Solves the problem of limited network speed on the cluster and helps to reduce the time in completing MapReduce jobs
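The effect of a combiner can be seen by counting how many records leave a map task with and without it. This is a word-count sketch in plain Python under the assumption that the reduce operation is a sum; map_task, combine, and reduce_task are hypothetical names, not Hadoop classes.

```python
from collections import Counter


def map_task(lines):
    """Emit (word, 1) for every word in the split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate the map output per key on the map side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_task(pairs):
    """Reducer: sum all values per key; works on raw or combined input."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

For the split ["a a a b"], the map task emits four records but the combiner forwards only two, and the reducer produces the same totals either way. That equivalence holds here because addition is associative and commutative; an operation such as a mean could not be used as its own combiner in this way.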

MapReduce • Multiple Reducers • Map tasks partition their output, each creating one partition for each reduce task • Each partition may contain many keys and their associated values • All records for a key are kept in a single partition
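One common way to guarantee that all records for a key land in a single partition is to hash the key modulo the number of reduce tasks. The sketch below uses CRC32 as a stand-in hash so that results are repeatable across Python runs; that choice, and the names partition_for and partition_output, are assumptions made for illustration.

```python
import zlib


def partition_for(key, num_reducers):
    """Map a key to a partition index in the range 0 .. num_reducers - 1."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Split one map task's output into one partition per reduce task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions
```

Because the partition index depends only on the key, two records with the same key always go to the same partition, no matter which map task produced them.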

MapReduce Multiple Reducers Shuffle • The shuffle process is used in the dataflow between the Map tasks and Reduce tasks
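The shuffle can be pictured as each reduce task collecting its partition from every map task, then sorting and grouping by key. The sketch below assumes the map output has already been partitioned; the function name shuffle and the nested-list layout map_partitions[m][r] are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(map_partitions, num_reducers):
    """Build each reducer's input from the partitioned map outputs.

    map_partitions[m][r] is the list of (key, value) pairs that map task m
    produced for reduce task r. The result has one entry per reduce task:
    a key-sorted list of (key, [values]) groups.
    """
    reducer_inputs = []
    for r in range(num_reducers):
        # Fetch partition r from every map task (the inter-node transfer).
        merged = []
        for partitions in map_partitions:
            merged.extend(partitions[r])
        # Sort by key, then group the values that share a key.
        merged.sort(key=itemgetter(0))
        groups = [(key, [value for _, value in group])
                  for key, group in groupby(merged, key=itemgetter(0))]
        reducer_inputs.append(groups)
    return reducer_inputs
```

With two map tasks and two reduce tasks, the pairs for key "a" from both map tasks end up in one group on a single reducer, which is what allows the reduce function to see every value for that key.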

MapReduce Zero Reducers • A job with zero reducers uses no shuffle process • Applied when all of the processing can be carried out in parallel Map tasks

HDFS

HDFS • Hadoop • Hadoop is a Reliable Shared Storage and Analysis System • Hadoop = HDFS + MapReduce + α • HDFS provides Data Storage • HDFS: Hadoop Distributed File System • MapReduce provides Data Analysis • MapReduce = Map Function + Reduce Function

HDFS • HDFS: Hadoop Distributed File System • A DFS (Distributed File System) is designed for storage management of a network of computers • HDFS is optimized to store large (terabyte-size) files with streaming data access patterns

HDFS • HDFS: Hadoop Distributed File System • HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern • HDFS is designed to run on clusters of general computers & servers from multiple vendors

HDFS • HDFS Characteristics • HDFS is optimized for large scale and high throughput data processing • HDFS does not perform well in supporting applications that require low-latency access (e.g., in the tens of milliseconds range)

HDFS • Blocks • Files in HDFS are divided into block size chunks • 64 Megabyte default block size • A block is the minimum amount of data that can be read or written • Blocks simplify the storage and replication process • Provides fault tolerance & processing speed enhancement for larger files
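How blocks and replication together give fault tolerance can be shown with a toy placement model. The round-robin placement below is an illustrative assumption and is not HDFS's actual rack-aware policy; the replication factor of 3 and the names block_count, place_blocks, and readable_after_failure are also assumptions of this sketch.

```python
import math

MB = 1024 * 1024


def block_count(file_size, block_size=64 * MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size / block_size)


def place_blocks(num_blocks, datanodes, replication=3):
    """Assign every block to `replication` distinct datanodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a healthy node."""
    return all(
        any(node not in failed_nodes for node in nodes)
        for nodes in placement.values()
    )
```

A 200 MB file needs four 64 MB blocks. Placed on four datanodes with three replicas each, the file stays readable when any two datanodes fail, but not necessarily when three fail, because one block may then have lost all of its replicas.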

HDFS • HDFS • HDFS clusters use 2 types of nodes • Namenode (master node) • Datanode (worker node)

HDFS • Namenode • Manages the filesystem namespace • Namenode keeps track of the datanodes that have blocks of a distributed file assigned • Maintains the filesystem tree and the metadata for all the files and directories in the tree • Stores this information on the local disk in 2 files • Namespace Image • Edit Log

HDFS • Namenode • Namenode holds the filesystem metadata in its memory • Namenode’s memory size determines the limit to the number of files in a filesystem • But then, what is Metadata?

HDFS • Metadata • Traditional concept of the library card catalogs • Categorizes and describes the contents and context of the data files • Maximizes the usefulness of the original data file by making it easy to find and use

HDFS • Metadata Types • Structural Metadata • Focuses on the data structure’s design and specification • Descriptive Metadata • Focuses on the individual instances of application data or the data content

HDFS • Datanodes • Workhorses of the filesystem • Store and retrieve blocks when requested by the client or the namenode • Periodically report back to the namenode with lists of the blocks that they are storing
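The periodic block reports are what let the namenode answer the question "which datanodes hold this block?". The sketch below models only that bookkeeping; the class name BlockMap and its methods are hypothetical and do not correspond to HDFS's internal classes.

```python
from collections import defaultdict


class BlockMap:
    """Toy namenode-side table: block id -> set of datanodes holding it."""

    def __init__(self):
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        """Replace everything previously known about this datanode."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, block_id):
        """Datanodes that a client could read the block from."""
        return sorted(self.block_locations.get(block_id, ()))
```

If dn1 first reports blocks b1 and b2 and later reports only b2, the table stops listing dn1 as a location for b1, so the table follows what the datanodes actually store rather than what was originally assigned.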

HDFS • Client Access • Client can access the filesystem (on behalf of the user) by communicating with the namenode and datanodes • Client can use a filesystem interface similar to POSIX (Portable Operating System Interface), so the user code does not need to know about the namenode and datanodes to function properly

HDFS • Namenode Failure • Namenode keeps track of the datanodes that have blocks of a distributed file assigned • Without the namenode, the filesystem cannot be used • If the computer running the namenode malfunctions, then reconstruction of the files (from the blocks on the datanodes) would not be possible • Files on the filesystem would be lost

HDFS • Namenode Failure Resilience • Namenode failure prevention schemes • Namenode File Backup • Secondary Namenode

HDFS • Namenode File Backup • Back up the namenode files that form the persistent state of the filesystem’s metadata • Configure the namenode to write its persistent state to multiple filesystems • Synchronous and atomic backup • Common backup configuration • Copy to Local Disk and Remote File System

HDFS • Secondary Namenode • Secondary namenode does not act the same way as the namenode • Secondary namenode periodically merges the namespace image with the edit log to prevent the edit log from becoming too large • Secondary namenode usually runs on a separate computer to perform the merge process because this requires significant processing capability and memory
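The merge of the namespace image with the edit log can be sketched as replaying logged operations onto a snapshot. The on-disk formats of the real files are not modeled here; the dictionary image, the tuple-based edit records, and the names apply_edit and checkpoint are all assumptions made for illustration.

```python
def apply_edit(namespace, edit):
    """Replay one logged operation onto an in-memory namespace dict."""
    operation, path = edit[0], edit[1]
    if operation == "create":
        namespace[path] = edit[2]              # edit[2]: list of block ids
    elif operation == "delete":
        namespace.pop(path, None)
    elif operation == "rename":
        namespace[edit[2]] = namespace.pop(path)  # edit[2]: new path
    else:
        raise ValueError("unknown operation: " + operation)


def checkpoint(image, edit_log):
    """Merge the edit log into the image; return (new image, emptied log)."""
    new_image = dict(image)       # leave the old image untouched
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []
```

After a checkpoint the new image already reflects every logged change, so the log can start again from empty. This is why periodic merging keeps the edit log from growing without bound.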

HDFS • Hadoop 2.x Release Series HDFS Reliability Enhancements • HDFS Federation • HDFS HA (High-Availability)

HDFS • HDFS Federation • Allows a cluster to scale by adding namenodes • Each namenode manages a namespace volume and a block pool • Namespace volume is made up of the metadata for the namespace • Block pool contains all the blocks for the files in the namespace

HDFS • HDFS Federation • Namespace volumes are all independent • Namenodes do not communicate with each other • Failure of a namenode is also independent of other namenodes • A namenode failure does not influence the availability of another namenode’s namespace
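The independence of namespace volumes can be illustrated with a table that maps path prefixes to namenodes. The slides do not say how clients find the right namenode, so the mount-table approach below is an assumption of this sketch, as are the names resolve_namenode and is_available.

```python
def resolve_namenode(path, mount_table):
    """Return the namenode responsible for a path (longest matching prefix)."""
    best = None
    for prefix in mount_table:
        # Match the prefix itself or anything below it, but not "/userdata"
        # for the prefix "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(path)
    return mount_table[best]


def is_available(path, mount_table, failed_namenodes):
    """A path is usable as long as its own namenode is up."""
    return resolve_namenode(path, mount_table) not in failed_namenodes
```

With mount_table = {"/user": "nn1", "/data": "nn2"}, a failure of nn1 makes paths under /user unavailable while paths under /data remain usable, matching the independence described on the slide.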

HDFS • HDFS High-Availability • A pair of namenodes (Primary & Standby) is set up in an Active-Standby configuration • Standby namenode stores the latest edit log entries and an up-to-date block mapping • When the primary namenode fails, the standby namenode takes over serving client requests

HDFS • HDFS High-Availability • Although the standby namenode can take over operation quickly (e.g., in a few tens of seconds), standby namenode activation is executed only after a sufficient observation period (e.g., approximately a minute or a few minutes) to avoid unnecessary namenode switching
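The role of the observation period can be sketched as a timeout on the active namenode's heartbeats. This is a simplified model and not how HDFS's failover mechanism is implemented; the function name failover_time, the heartbeat list, and the 60-second figure used in the example are assumptions.

```python
def failover_time(heartbeats, observation_period, end_time):
    """Return the time at which the standby would activate, or None.

    heartbeats:         sorted times (seconds) at which the active namenode
                        was last heard from
    observation_period: silence (seconds) required before switching
    end_time:           time at which the observation stops
    """
    if not heartbeats:
        return None
    # A gap between two heartbeats longer than the period triggers a switch.
    for earlier, later in zip(heartbeats, heartbeats[1:]):
        if later - earlier > observation_period:
            return earlier + observation_period
    # Otherwise check the silence after the final heartbeat.
    last = heartbeats[-1]
    if end_time - last >= observation_period:
        return last + observation_period
    return None
```

With a 60-second observation period, a 45-second pause in heartbeats does not trigger a switch, whereas heartbeats that stop at t = 30 lead to activation at t = 90. A shorter period would react faster but would also switch on brief pauses, which is the trade-off the slide describes.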

