Explore the powerful capabilities of Hadoop and MapReduce in cloud computing for efficient storage and analysis of big data. This reference highlights key concepts and features of Hadoop Distributed File System (HDFS) and how MapReduce helps in data processing. Strengthen your knowledge with Professor Jong-Moon Chung’s lecture notes at Yonsei University.
Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung’s Lecture Notes at Yonsei University
Cloud Computing • Cloud Introduction • Cloud Service Model • Big Data • Hadoop • MapReduce • HDFS (Hadoop Distributed File System)
MapReduce • Hadoop • Hadoop is a Reliable Shared Storage and Analysis System • Hadoop = HDFS + MapReduce + α • HDFS provides Data Storage • HDFS: Hadoop Distributed File System • MapReduce provides Data Analysis • MapReduce = Map Function + Reduce Function
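To make "MapReduce = Map Function + Reduce Function" concrete, here is a minimal single-process sketch of a word count. It is not Hadoop's API; the names `map_function`, `reduce_function`, and `run_job` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_function(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_function(word, counts):
    """Combine all values that share a key into a single result."""
    return word, sum(counts)


def run_job(lines):
    """Run map, group by key, then reduce, all in one process."""
    # Map phase: apply the map function to every record.
    intermediate = []
    for line in lines:
        intermediate.extend(map_function(line))
    # Grouping by key; in a real cluster the framework does this between map and reduce.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_function(key, values) for key, values in grouped.items())


result = run_job(["big data", "big cluster"])
```

In a real job the map calls run in parallel on different machines; the sketch only shows how the two functions fit together.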
MapReduce • Scaling Out • Scaling out is done by the DFS (Distributed File System), where the data is divided and stored across distributed computers & servers • Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines, each of which processes its assigned part of the divided data
MapReduce • Jobs • A MapReduce job is a unit of work that needs to be executed • A job consists of: input data, the MapReduce program, configuration information, etc. • A job is executed by dividing it into two types of tasks • Map Task • Reduce Task
MapReduce • Node types for Job execution • Job execution is controlled by 2 types of nodes • Jobtracker • Tasktracker • Jobtracker coordinates all jobs • Jobtracker schedules all tasks and assigns the tasks to tasktrackers
MapReduce • Tasktracker will execute its assigned task • Tasktracker will send progress reports to the Jobtracker • Jobtracker will keep a record of the progress of all jobs executed
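The division of labour between the two node types can be sketched as follows. This is a toy model under stated assumptions: the class `JobTracker`, its methods, and the round-robin assignment are hypothetical and do not reflect Hadoop's actual scheduler.

```python
class JobTracker:
    """Toy jobtracker: assigns tasks to tasktrackers and records their progress."""

    def __init__(self, tasks, tasktrackers):
        self.pending = list(tasks)
        self.tasktrackers = list(tasktrackers)
        self.assignment = {}  # task id -> tasktracker id
        self.progress = {}    # task id -> fraction completed (0.0 to 1.0)

    def schedule(self):
        # Assumption: simple round-robin assignment, for illustration only.
        for i, task in enumerate(self.pending):
            self.assignment[task] = self.tasktrackers[i % len(self.tasktrackers)]
            self.progress[task] = 0.0
        self.pending = []

    def report_progress(self, tracker, task, fraction):
        # A tasktracker may only report on a task it was assigned.
        if self.assignment.get(task) != tracker:
            raise ValueError(f"{task} is not assigned to {tracker}")
        self.progress[task] = fraction

    def job_progress(self):
        # Overall progress is the mean progress of all scheduled tasks.
        if not self.progress:
            return 0.0
        return sum(self.progress.values()) / len(self.progress)


jt = JobTracker(["map-0", "map-1", "reduce-0"], ["tt1", "tt2"])
jt.schedule()
jt.report_progress("tt1", "map-0", 1.0)
```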
MapReduce • Dataflow • Hadoop divides the input into input splits (or splits) suitable for the MapReduce job • A split has a fixed size • Split size is commonly matched to the size of an HDFS block (64 MB) for maximum processing efficiency
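As a rough worked example of fixed-size splits, the sketch below counts how many splits (and therefore map tasks) a file needs when the split size equals the 64 MB block size quoted above. The function name `count_splits` is hypothetical, and the calculation is a simplification that ignores any finer rules a real Hadoop release applies to the last split.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted in the slides


def count_splits(file_size_bytes, split_size=BLOCK_SIZE):
    """Number of fixed-size splits needed to cover a file (ceiling division)."""
    if file_size_bytes <= 0:
        return 0
    return (file_size_bytes + split_size - 1) // split_size


one_gb = 1024 * 1024 * 1024
splits_for_one_gb = count_splits(one_gb)
```

One map task is created per split, so a 1 GB input yields 16 map tasks in this simplified model.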
MapReduce • Dataflow • A Map Task is created for each split • The Map Task executes the map function for all records within the split • Hadoop commonly executes the Map Task on the node where the input data resides
MapReduce Dataflow • Data-Local Map Task • Data locality optimization does not need to use the cluster network • Data-local flow process shows why the Optimal Split Size = 64 MB HDFS Block Size
MapReduce Dataflow • Rack-Local Map Task • A node hosting the HDFS block replicas for a map task's input split could be running other map tasks • Job Scheduler will look for a free map slot on a node in the same rack as one of the blocks • (Figure legend: Node, Rack, Data Center, Map Task, HDFS Block)
MapReduce Dataflow • Off-Rack Map Task • Needed when the Job Scheduler cannot perform data-local or rack-local map tasks • Uses inter-rack network transfer
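The three placement cases (data-local, rack-local, off-rack) can be summarised as a preference order. The sketch below assumes a hypothetical `rack_of` mapping from node name to rack name; `choose_locality` and `pick_node` are illustrative names, not Hadoop functions.

```python
def choose_locality(node, replica_nodes, rack_of):
    """Classify a map task placed on `node`, given which nodes hold the block replicas."""
    if node in replica_nodes:
        return "data-local"  # the input block is on this node; no network needed
    replica_racks = {rack_of[n] for n in replica_nodes}
    if rack_of[node] in replica_racks:
        return "rack-local"  # the block is on another node in the same rack
    return "off-rack"        # the block must cross the inter-rack network


def pick_node(free_nodes, replica_nodes, rack_of):
    """Prefer data-local, then rack-local, then off-rack among nodes with a free slot."""
    order = {"data-local": 0, "rack-local": 1, "off-rack": 2}
    return min(free_nodes, key=lambda n: order[choose_locality(n, replica_nodes, rack_of)])


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
replicas = {"n1"}
chosen = pick_node(["n3", "n2"], replicas, rack_of)
```

Here `n1` holds the block but has no free slot, so the scheduler falls back to `n2`, which shares a rack with `n1`.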
MapReduce • Map • Map task will write its output to the local disk • Map task output is not the final output, it is only the intermediate output • Reduce • Map task output is processed by Reduce Tasks to produce the final output • Reduce Task output is stored in HDFS • For a completed job, the Map Task output can be discarded
MapReduce Single Reduce Task • Node includes Split, Map, Sort, and Output unit • Light blue arrows show data transfers in a node • Black arrows show data transfers between nodes
MapReduce Single Reduce Task • Number of reduce tasks is specified independently, and is not based on the size of the input
MapReduce • Combiner Function • User-specified function to run on the Map output • Forms the input to the Reduce function • Specifically designed to minimize the data transferred between Map Tasks and Reduce Tasks • Solves the problem of limited network speed on the cluster and helps to reduce the time in completing MapReduce jobs
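A small sketch of what a combiner buys: it aggregates the map output on the map side so fewer records cross the network. The names `map_words` and `combine` are hypothetical; the example assumes a word-count job, where summing is safe to apply early because addition is associative and commutative.

```python
from collections import Counter


def map_words(split_lines):
    """Map side: emit (word, 1) for every word in the split."""
    return [(word, 1) for line in split_lines for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per key before anything is sent to the reducers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words(["a b a", "a b"])
combined = combine(map_output)
```

Without the combiner this map task would send five records to the reducers; with it, only one record per distinct key is sent.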
MapReduce • Multiple Reducers • Map tasks partition their output, each creating one partition for each reduce task • Each partition may contain many keys and their associated values • All records for a key are kept in a single partition
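The partitioning rule (all records for a key land in a single partition) can be sketched with a hash of the key modulo the number of reduce tasks. The function names are hypothetical, and `zlib.crc32` is used here only as a deterministic stand-in for whatever hash function a real partitioner applies.

```python
import zlib


def partition_for(key, num_reducers):
    """Map a key to a partition index; the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Split one map task's output into one partition per reduce task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


parts = partition_output([("a", 1), ("b", 1), ("a", 1), ("c", 1)], 2)
```

A partition can hold several different keys, but no key is ever spread over two partitions, which is what lets each reduce task see every value for the keys it owns.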
MapReduce Multiple Reducers Shuffle • Shuffle process is used in the dataflow between the Map tasks and Reduce tasks
MapReduce Zero Reducer • Zero reducer uses no shuffle process • Applied when all of the processing can be carried out in parallel Map tasks
HDFS • Hadoop • Hadoop is a Reliable Shared Storage and Analysis System • Hadoop = HDFS + MapReduce + α • HDFS provides Data Storage • HDFS: Hadoop Distributed File System • MapReduce provides Data Analysis • MapReduce = Map Function + Reduce Function
HDFS • HDFS: Hadoop Distributed File System • DFS (Distributed File System) is designed for storage management of a network of computers • HDFS is optimized to store large (terabyte-size) files with streaming data access patterns
HDFS • HDFS: Hadoop Distributed File System • HDFS was designed to be optimal in performance for a WORM (Write Once, Read Many times) pattern • HDFS is designed to run on clusters of general computers & servers from multiple vendors
HDFS • HDFS Characteristics • HDFS is optimized for large-scale and high-throughput data processing • HDFS does not perform well in supporting applications that require minimum delay (e.g., tens of milliseconds range)
HDFS • Blocks • Files in HDFS are divided into block-size chunks • 64 Megabyte default block size • A block is the minimum amount of data that can be read or written • Blocks simplify the storage and replication process • Provides fault tolerance & processing speed enhancement for larger files
HDFS • HDFS • HDFS clusters use 2 types of nodes • Namenode (master node) • Datanode (worker node)
HDFS • Namenode • Manages the file system namespace • Namenode keeps track of the datanodes that have blocks of a distributed file assigned • Maintains the file system tree and the metadata for all the files and directories in the tree • Stores this information on the local disk in 2 file forms • Namespace Image • Edit Log
HDFS • Namenode • Namenode holds the file system metadata in its memory • Namenode's memory size determines the limit to the number of files in a file system • But then, what is Metadata?
HDFS • Metadata • Traditional concept of the library card catalogs • Categorizes and describes the contents and context of the data files • Maximizes the usefulness of the original data file by making it easy to find and use
HDFS • Metadata Types • Structural Metadata • Focuses on the data structure's design and specification • Descriptive Metadata • Focuses on the individual instances of application data or the data content
HDFS • Datanodes • Workhorses of the file system • Store and retrieve blocks when requested by the client or the namenode • Periodically report back to the namenode with lists of the blocks that they store
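How periodic block reports let the namenode know where each block lives can be sketched as below. The class `NamenodeBlockMap` and its methods are hypothetical names for illustration; the real namenode keeps far richer state.

```python
from collections import defaultdict


class NamenodeBlockMap:
    """Toy view of the namenode's block-to-datanode mapping."""

    def __init__(self):
        self.locations = defaultdict(set)  # block id -> set of datanode ids

    def receive_block_report(self, datanode_id, block_ids):
        # A report replaces what was previously known about this datanode.
        for holders in self.locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.locations[block_id].add(datanode_id)

    def datanodes_for(self, block_id):
        """Datanodes currently known to hold the given block."""
        return sorted(self.locations.get(block_id, set()))


nn = NamenodeBlockMap()
nn.receive_block_report("dn1", ["b1", "b2"])
nn.receive_block_report("dn2", ["b1"])
```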
HDFS • Client Access • Client can access the file system (on behalf of the user) by communicating with the namenode and datanodes • Client can use a file system interface (similar to POSIX (Portable Operating System Interface)) so the user code does not need to know about the namenode and datanodes to function properly
HDFS • Namenode Failure • Namenode keeps track of the datanodes that have blocks of a distributed file assigned • Without the namenode, the file system cannot be used • If the computer running the namenode malfunctions, then reconstruction of the files (from the blocks on the datanodes) would not be possible • Files on the file system would be lost
HDFS • Namenode Failure Resilience • Namenode failure prevention schemes • Namenode File Backup • Secondary Namenode
HDFS • Namenode File Backup • Back up the namenode files that form the persistent state of the file system's metadata • Configure the namenode to write its persistent state to multiple file systems • Synchronous and atomic backup • Common backup configuration • Copy to Local Disk and Remote File System
HDFS • Secondary Namenode • Secondary namenode does not act the same way as the namenode • Secondary namenode periodically merges the namespace image with the edit log to prevent the edit log from becoming too large • Secondary namenode usually runs on a separate computer to perform the merge process because this requires significant processing capability and memory
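The merge of the namespace image with the edit log can be pictured as replaying logged operations onto a snapshot. The sketch below assumes a hypothetical edit format of tuples such as `("create", path, metadata)`; it does not reflect the on-disk formats HDFS actually uses.

```python
def apply_edit(namespace, edit):
    """Replay one logged operation onto the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(namespace_image, edit_log):
    """Merge the edit log into a copy of the image; return the new image and an empty log."""
    merged = dict(namespace_image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/a": {"blocks": 1}}
log = [("create", "/b", {"blocks": 2}), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(image, log)
```

After the merge the log starts again from empty, which is what keeps it from growing without bound.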
HDFS • Hadoop 2.x Release Series HDFS Reliability Enhancements • HDFS Federation • HDFS HA (High-Availability)
HDFS • HDFS Federation • Allows a cluster to scale by adding namenodes • Each namenode manages a namespace volume and a block pool • Namespace volume is made up of the metadata for the namespace • Block pool contains all the blocks for the files in the namespace
HDFS • HDFS Federation • Namespace volumes are all independent • Namenodes do not communicate with each other • Failure of a namenode is also independent of other namenodes • A namenode failure does not influence the availability of another namenode's namespace
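The independence of namespace volumes can be illustrated with a toy federated cluster in which each path prefix is served by its own namenode. The classes, the prefix table, and the `alive` flag are all hypothetical and serve only to show that one namenode failing leaves the other namespaces reachable.

```python
class Namenode:
    """One federated namenode with its own namespace volume and block pool."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.namespace = {}      # namespace volume: path -> list of block ids
        self.block_pool = set()  # every block belonging to this namespace

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)
        self.block_pool.update(block_ids)


class FederatedCluster:
    """Routes each path to the namenode that owns its prefix."""

    def __init__(self, mounts):
        self.mounts = dict(mounts)  # path prefix -> Namenode

    def resolve(self, path):
        matches = [prefix for prefix in self.mounts if path.startswith(prefix)]
        if not matches:
            raise KeyError(path)
        return self.mounts[max(matches, key=len)]  # longest matching prefix wins

    def lookup(self, path):
        namenode = self.resolve(path)
        if not namenode.alive:
            raise RuntimeError(f"{namenode.name} is down")
        return namenode.namespace[path]


nn_user, nn_logs = Namenode("nn-user"), Namenode("nn-logs")
cluster = FederatedCluster({"/user/": nn_user, "/logs/": nn_logs})
nn_user.add_file("/user/a.txt", ["blk_1"])
nn_logs.add_file("/logs/x.log", ["blk_2"])
nn_logs.alive = False  # simulate the failure of one namenode
```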
HDFS • HDFS High-Availability • Pair of namenodes (Primary & Standby) are set to be in Active-Standby configuration • Standby namenode stores the latest edit log entries and an up-to-date block mapping • When the primary namenode fails, the standby namenode takes over serving client requests
HDFS • HDFS High-Availability • Although the standby namenode can take over operation quickly (e.g., a few tens of seconds), to avoid unnecessary namenode switching, standby namenode activation will be executed after a sufficient observation period (e.g., approximately a minute or a few minutes)
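The idea of waiting out an observation period before switching can be sketched as a small state machine. The class `FailoverController`, its `observe` method, and the 60-second default are assumptions for illustration, not Hadoop's failover implementation or its configuration values.

```python
class FailoverController:
    """Activates the standby only after the primary has looked unhealthy for a full period."""

    def __init__(self, observation_period_s=60.0):
        self.observation_period_s = observation_period_s
        self.active = "primary"
        self.first_missed_at = None  # time of the first failed health check in a row

    def observe(self, now_s, primary_healthy):
        if self.active != "primary":
            return self.active  # already failed over
        if primary_healthy:
            self.first_missed_at = None  # a brief glitch resets the observation window
        elif self.first_missed_at is None:
            self.first_missed_at = now_s
        elif now_s - self.first_missed_at >= self.observation_period_s:
            self.active = "standby"
        return self.active


fc = FailoverController(observation_period_s=60.0)
states = [
    fc.observe(0, False),    # first missed check: keep waiting
    fc.observe(30, True),    # primary recovered: window resets
    fc.observe(40, False),   # new window starts at t=40
    fc.observe(90, False),   # only 50 s unhealthy: still primary
    fc.observe(100, False),  # 60 s unhealthy: switch to standby
]
```

Resetting the window on any healthy check is what prevents a short network hiccup from triggering an unnecessary switch.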