
Cloud Computing

Explore the powerful capabilities of Hadoop and MapReduce in cloud computing for efficient storage and analysis of big data. This reference highlights key concepts and features of Hadoop Distributed File System (HDFS) and how MapReduce helps in data processing. Strengthen your knowledge with Professor Jong-Moon Chung’s lecture notes at Yonsei University.

gborrelli




Presentation Transcript


  1. Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung’s Lecture Notes at Yonsei University

  2. Cloud Computing • Cloud Introduction • Cloud Service Model • Big Data • Hadoop • MapReduce • HDFS (Hadoop Distributed File System)

  3. MapReduce

  4. MapReduce • Hadoop • Hadoop is a reliable shared storage and analysis system • Hadoop = HDFS + MapReduce + α • HDFS provides data storage • HDFS: Hadoop Distributed File System • MapReduce provides data analysis • MapReduce = Map function + Reduce function
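
To make "MapReduce = Map function + Reduce function" concrete, here is a minimal in-memory sketch of a word count. It is not Hadoop code: the names `map_function`, `reduce_function`, and `run_job` are illustrative assumptions, and the grouping step stands in for work the framework would do.

```python
from collections import defaultdict

def map_function(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_function(word, counts):
    """Reduce: combine all values collected for one key."""
    return word, sum(counts)

def run_job(records):
    # Group intermediate pairs by key (a stand-in for the framework's work).
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_function(record):
            grouped[key].append(value)
    return dict(reduce_function(k, v) for k, v in sorted(grouped.items()))

result = run_job(["big data big cluster", "data node"])
print(result)
```

The user supplies only the two functions; splitting, grouping, and distribution are left to the framework.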

  5. MapReduce • Scaling out • Scaling out is done by the DFS (Distributed File System), where the data is divided and stored on distributed computers and servers • Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines, each of which processes its assigned part of the divided data

  6. MapReduce • Jobs • A MapReduce job is a unit of work that needs to be executed • Job components: input data, MapReduce program, configuration information, etc. • A job is executed by dividing it into two types of tasks • Map task • Reduce task

  7. MapReduce • Node types for job execution • Job execution is controlled by 2 types of nodes • Jobtracker • Tasktracker • The jobtracker coordinates all jobs • The jobtracker schedules all tasks and assigns them to tasktrackers

  8. MapReduce • Each tasktracker executes its assigned tasks • Each tasktracker sends progress reports to the jobtracker • The jobtracker keeps a record of the progress of all jobs executed
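
The division of labor between the two node types can be sketched as follows. This is a toy model, not the Hadoop implementation: the class names, the round-robin assignment policy, and the report format are all assumptions made for illustration.

```python
class TaskTracker:
    def __init__(self, name):
        self.name = name

    def run(self, task):
        # Pretend to execute the task and report it as fully complete.
        return {"tracker": self.name, "task": task, "progress": 1.0}

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers
        self.progress = {}  # task -> last reported progress

    def run_job(self, tasks):
        # Assign tasks to tasktrackers in round-robin order (an assumed policy).
        for index, task in enumerate(tasks):
            tracker = self.trackers[index % len(self.trackers)]
            report = tracker.run(task)
            self.progress[report["task"]] = report["progress"]
        return all(p == 1.0 for p in self.progress.values())

jobtracker = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
done = jobtracker.run_job(["map-0", "map-1", "reduce-0"])
print(done, jobtracker.progress)
```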

  9. MapReduce • Dataflow • Hadoop divides the input into input splits (or splits) suitable for the MapReduce job • Each split has a fixed size • Split size is commonly matched to the size of an HDFS block (64 MB) for maximum processing efficiency
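
As a rough illustration of fixed-size splits, the sketch below cuts a file of a given size into block-sized ranges. The function name `input_splits` and the 200 MB file are hypothetical, and the real framework applies further rules that this simplification ignores.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide

def input_splits(file_size, split_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

splits = input_splits(200 * 1024 * 1024)  # a hypothetical 200 MB file
print(len(splits))    # one map task would be created per split
print(splits[-1][1])  # the last split may be shorter than the rest
```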

  10. MapReduce • Dataflow • A map task is created for each split • The map task executes the map function for all records within the split • Hadoop commonly executes the map task on the node where the input data resides

  11. MapReduce Dataflow • Data-local map task • Data locality optimization • Does not need to use the cluster network • The data-local flow shows why the optimal split size = HDFS block size (64 MB)

  12. MapReduce Dataflow • Rack-local map task • A node hosting the HDFS block replicas for a map task's input split could be running other map tasks • The job scheduler will then look for a free map slot on a node in the same rack as one of the blocks • (Figure labels: node, rack, data center, map task, HDFS block)

  13. MapReduce Dataflow • Off-rack map task • Needed when the job scheduler cannot perform data-local or rack-local map tasks • Uses inter-rack network transfer
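
The three placement cases (data-local, rack-local, off-rack) can be expressed as a preference order. The sketch below is an assumption-laden toy: the `locality` and `pick_node` helpers and the node and rack names are invented for illustration and do not mirror the actual scheduler.

```python
def locality(task_node, task_rack, replica_locations):
    """Classify a map task placement relative to its split's block replicas.

    replica_locations is a list of (node, rack) pairs holding the block.
    """
    if any(node == task_node for node, _ in replica_locations):
        return "data-local"
    if any(rack == task_rack for _, rack in replica_locations):
        return "rack-local"
    return "off-rack"

def pick_node(free_nodes, replica_locations):
    """Choose the free node with the best locality; free_nodes maps node -> rack."""
    order = {"data-local": 0, "rack-local": 1, "off-rack": 2}
    return min(
        free_nodes.items(),
        key=lambda item: order[locality(item[0], item[1], replica_locations)],
    )[0]

replicas = [("n1", "rackA"), ("n5", "rackB")]
print(locality("n1", "rackA", replicas))
print(locality("n2", "rackA", replicas))
print(locality("n9", "rackC", replicas))
print(pick_node({"n9": "rackC", "n2": "rackA"}, replicas))
```

Only the off-rack case forces the input data across the inter-rack network, which is why it is the last resort.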

  14. MapReduce • Map • A map task writes its output to the local disk • Map task output is not the final output; it is only the intermediate output • Reduce • Map task output is processed by reduce tasks to produce the final output • Reduce task output is stored in HDFS • For a completed job, the map task output can be discarded

  15. MapReduce Single Reduce Task • Each node includes split, map, sort, and output units • Light blue arrows show data transfers within a node • Black arrows show data transfers between nodes

  16. MapReduce Single Reduce Task • The number of reduce tasks is specified independently and is not based on the size of the input

  17. MapReduce • Combiner function • User-specified function that runs on the map output • Its output forms the input to the reduce function • Specifically designed to minimize the data transferred between map tasks and reduce tasks • Mitigates the problem of limited network speed on the cluster and helps reduce the time needed to complete MapReduce jobs
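
A small sketch shows how a combiner shrinks the data that leaves a map task. It assumes a word-count job, where summing is safe to apply early because addition is associative and commutative; the function names are illustrative.

```python
from collections import Counter

def map_output(line):
    """Emit one (word, 1) pair per word, as a word-count map task would."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one map task's output before it leaves the node."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

pairs = map_output("a b a a b c")
combined = combine(pairs)
print(len(pairs), "records without a combiner")
print(len(combined), "records with a combiner")
```

Not every reduce function can be reused as a combiner; computing a mean, for instance, would need the combiner to carry sums and counts rather than partial means.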

  18. MapReduce • Multiple reducers • Map tasks partition their output, each creating one partition for each reduce task • Each partition may contain many keys and their associated values • All records for a key are kept in a single partition
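
One common way to guarantee that all records for a key land in a single partition is to hash the key. The sketch below assumes a CRC32-based partitioner chosen only because it is deterministic across Python runs; it is not a claim about which hash Hadoop uses.

```python
import zlib

def partition(key, num_reducers):
    """Deterministic hash partitioner: the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(pairs, num_reducers):
    """Distribute one map task's (key, value) pairs into one list per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions

pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1), ("pear", 1)]
parts = partition_map_output(pairs, num_reducers=2)
for index, part in enumerate(parts):
    print(index, part)
```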

  19. MapReduce Multiple Reducers • Shuffle • The shuffle process is used in the dataflow between the map tasks and reduce tasks
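
From a single reducer's point of view, the shuffle can be pictured as merging the sorted partitions it fetches from every map task and grouping the values by key. The following is a simplified in-memory sketch with hypothetical inputs, not the framework's actual transfer mechanism.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def shuffle(map_outputs_for_reducer):
    """Merge sorted partitions from each map task and group values by key."""
    merged = heapq.merge(*map_outputs_for_reducer, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]

# Hypothetical sorted partitions that two map tasks produced for one reducer.
from_map_1 = [("apple", 1), ("fig", 2)]
from_map_2 = [("apple", 3), ("pear", 1)]
reducer_input = shuffle([from_map_1, from_map_2])
print(reducer_input)
```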

  20. MapReduce Zero Reducers • With zero reducers, no shuffle process is used • Applied when all of the processing can be carried out in parallel map tasks

  21. HDFS

  22. HDFS • Hadoop • Hadoop is a reliable shared storage and analysis system • Hadoop = HDFS + MapReduce + α • HDFS provides data storage • HDFS: Hadoop Distributed File System • MapReduce provides data analysis • MapReduce = Map function + Reduce function

  23. HDFS • HDFS: Hadoop Distributed File System • A DFS (Distributed File System) is designed for storage management across a network of computers • HDFS is optimized to store large, terabyte-size files with streaming data access patterns

  24. HDFS • HDFS: Hadoop Distributed File System • HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern • HDFS is designed to run on clusters of general-purpose computers and servers from multiple vendors

  25. HDFS • HDFS characteristics • HDFS is optimized for large-scale, high-throughput data processing • HDFS does not perform well in supporting applications that require minimum delay (e.g., in the tens-of-milliseconds range)

  26. HDFS • Blocks • Files in HDFS are divided into block-size chunks • 64 MB default block size (Hadoop 1.x; the default is 128 MB from Hadoop 2.x onward) • A block is the minimum amount of data that HDFS can read or write • Blocks simplify the storage and replication process • Blocks provide fault tolerance and processing speed enhancement for larger files
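
The block arithmetic can be sketched in a few lines. The helper names below are hypothetical, and the 64 MB constant is simply the value quoted on the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, as quoted on the slide
MB = 1024 * 1024

def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Indices of the blocks covering bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

print(block_count(1024 * MB))              # blocks in a 1 GB file
print(blocks_for_range(60 * MB, 10 * MB))  # a read that crosses a block boundary
```

Because each block is an independent unit, a read that spans two blocks may be served by two different datanodes.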

  27. HDFS • HDFS • HDFS clusters use 2 types of nodes • Namenode (master node) • Datanode (worker node)

  28. HDFS • Namenode • Manages the file system namespace • The namenode keeps track of the datanodes that hold the blocks of each distributed file • Maintains the file system tree and the metadata for all the files and directories in the tree • Stores this information on the local disk in 2 file forms • Namespace image • Edit log

  29. HDFS • Namenode • The namenode holds the file system metadata in its memory • The namenode's memory size determines the limit on the number of files in a file system • But then, what is metadata?
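
A back-of-the-envelope estimate shows why namenode memory caps the number of files. The figure of roughly 150 bytes per file or block entry is a commonly quoted rule of thumb, not something stated on the slide, so treat `BYTES_PER_OBJECT` as an assumption.

```python
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb cost of one file or block entry

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate namenode memory for num_files files of blocks_per_file blocks each."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT

estimate = namenode_memory_bytes(1_000_000)
print(estimate / 1e6, "MB for one million single-block files")
```

Under this assumption, many small files cost far more namenode memory than a few large ones holding the same data.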

  30. HDFS • Metadata • Traditional concept: library card catalogs • Categorizes and describes the contents and context of the data files • Maximizes the usefulness of the original data file by making it easy to find and use

  31. HDFS • Metadata types • Structural metadata • Focuses on the data structure's design and specification • Descriptive metadata • Focuses on the individual instances of application data or the data content

  32. HDFS • Datanodes • Workhorses of the file system • Store and retrieve blocks when requested by the client or the namenode • Periodically report back to the namenode with lists of the blocks they are storing
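
The periodic block reports let the namenode maintain a map from each block to the datanodes that hold it. Below is a toy model of that bookkeeping; the `NameNode` class, its method names, and the block identifiers are illustrative assumptions.

```python
from collections import defaultdict

class NameNode:
    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> datanode names

    def receive_block_report(self, datanode, block_ids):
        # A new report replaces whatever this datanode reported previously.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, block_id):
        """Return the datanodes currently known to hold block_id."""
        return sorted(self.block_locations.get(block_id, ()))

namenode = NameNode()
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_2"])
before = namenode.locate("blk_2")
namenode.receive_block_report("dn1", ["blk_1"])  # dn1 no longer holds blk_2
after = namenode.locate("blk_2")
print(before, after)
```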

  33. HDFS • Client access • A client can access the file system (on behalf of the user) by communicating with the namenode and datanodes • A client can use a file system interface (similar to POSIX, the Portable Operating System Interface) so the user code does not need to know about the namenode and datanodes to function properly

  34. HDFS • Namenode failure • The namenode keeps track of the datanodes that hold the blocks of each distributed file • Without the namenode, the file system cannot be used • If the computer running the namenode malfunctions, reconstruction of the files (from the blocks on the datanodes) would not be possible • Files on the file system would be lost

  35. HDFS • Namenode failure resilience • Namenode failure prevention schemes • Namenode file backup • Secondary namenode

  36. HDFS • Namenode file backup • Back up the namenode files that form the persistent state of the file system's metadata • Configure the namenode to write its persistent state to multiple file systems • Synchronous and atomic backup • Common backup configuration • Copy to local disk and remote file system
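
The idea of writing the same persistent state to several locations, synchronously and with atomic replacement of each copy, can be sketched with ordinary file operations. This is only an analogy under stated assumptions: the `write_state` helper and the file name are hypothetical, two temporary directories stand in for the local disk and the remote file system, and atomicity here holds per copy rather than across all copies.

```python
import os
import tempfile

def write_state(directories, filename, data):
    """Write data under every directory; return only after all copies are on disk."""
    for directory in directories:
        fd, temp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
                handle.flush()
                os.fsync(handle.fileno())
            # Rename is atomic within one file system, so no partial file is visible.
            os.replace(temp_path, os.path.join(directory, filename))
        except BaseException:
            if os.path.exists(temp_path):
                os.unlink(temp_path)
            raise

def read_state(directory, filename):
    with open(os.path.join(directory, filename), "rb") as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as local_dir, tempfile.TemporaryDirectory() as remote_dir:
    write_state([local_dir, remote_dir], "namespace_image", b"metadata snapshot")
    copies = [read_state(d, "namespace_image") for d in (local_dir, remote_dir)]
print(copies)
```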

  37. HDFS • Secondary namenode • The secondary namenode does not act the same way as the namenode • It periodically merges the namespace image with the edit log to prevent the edit log from becoming too large • It usually runs on a separate computer to perform the merge process because this requires significant processing capability and memory
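
The merge performed by the secondary namenode amounts to replaying the edit log on top of the last namespace image and then starting a fresh log. The sketch below models that with plain dictionaries; the operation names (`create`, `add_block`, `delete`) and data layout are invented for illustration and are not the on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> metadata)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")

def checkpoint(namespace_image, edit_log):
    """Merge the edit log into a copy of the image and return an empty log."""
    merged = {
        path: {"blocks": list(meta["blocks"])}
        for path, meta in namespace_image.items()
    }
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []

image = {"/a.txt": {"blocks": ["blk_1"]}}
log = [("create", "/b.txt"), ("add_block", "/b.txt", "blk_2"), ("delete", "/a.txt")]
new_image, new_log = checkpoint(image, log)
print(new_image, new_log)
```

Without periodic checkpoints, the log would grow without bound and a restart would have to replay every edit since the image was written.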

  38. HDFS • Hadoop 2.x release series HDFS reliability enhancements • HDFS Federation • HDFS HA (High Availability)

  39. HDFS • HDFS Federation • Allows a cluster to scale by adding namenodes • Each namenode manages a namespace volume and a block pool • A namespace volume is made up of the metadata for the namespace • A block pool contains all the blocks for the files in the namespace

  40. HDFS • HDFS Federation • Namespace volumes are all independent • Namenodes do not communicate with each other • Failure of a namenode is also independent of the other namenodes • A namenode failure does not influence the availability of another namenode's namespace
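
The independence of namespace volumes can be illustrated with a small routing table that maps path prefixes to namenodes. The table, the class name, and the node names below are assumptions for the sketch; the slides do not describe how clients locate the right namenode.

```python
class FederatedNamespace:
    def __init__(self, mounts):
        # mounts maps a path prefix to the namenode responsible for it.
        self.mounts = mounts
        self.failed = set()

    def namenode_for(self, path):
        """Return the namenode whose prefix is the longest match for path."""
        matches = [
            prefix for prefix in self.mounts
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(f"no namespace volume for {path}")
        return self.mounts[max(matches, key=len)]

    def is_available(self, path):
        return self.namenode_for(path) not in self.failed

cluster = FederatedNamespace({"/user": "nn1", "/logs": "nn2"})
cluster.failed.add("nn1")  # simulate the failure of one namenode
print(cluster.is_available("/user/alice/data.csv"))
print(cluster.is_available("/logs/2015/app.log"))
```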

  41. HDFS • HDFS High Availability • A pair of namenodes (primary and standby) is set up in an active-standby configuration • The standby namenode stores the latest edit log entries and an up-to-date block mapping • When the primary namenode fails, the standby namenode takes over serving client requests

  42. HDFS • HDFS High Availability • Although the standby namenode can take over operation quickly (e.g., in a few tens of seconds), standby namenode activation is executed only after a sufficient observation period (e.g., approximately a minute or a few minutes) to avoid unnecessary namenode switching
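
The observation period can be modeled as a timeout on the primary's heartbeat. The sketch below is a hypothetical controller, not the actual Hadoop failover mechanism; the class name, the heartbeat model, and the 60-second value are assumptions chosen to match the "approximately a minute" example.

```python
class FailoverController:
    def __init__(self, observation_period):
        self.observation_period = observation_period  # seconds of allowed silence
        self.last_heartbeat = 0.0
        self.active = "primary"

    def heartbeat(self, now):
        """Record that the primary namenode was seen alive at time now."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote the standby only after the primary has been silent long enough."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for >= self.observation_period:
            self.active = "standby"
        return self.active

controller = FailoverController(observation_period=60.0)
controller.heartbeat(now=100.0)
first = controller.check(now=130.0)   # short pause: no switch yet
second = controller.check(now=160.0)  # silent for the full period: switch
print(first, second)
```

Waiting out the period trades a slower recovery for protection against switching on a brief network hiccup.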

