Big Data infrastructure and technology have been implemented in industries such as banking, retail, insurance, healthcare and media. Existing database systems and technologies cannot handle Big Data management functions like storage, sorting, processing and analysis at such colossal volumes. This is where frameworks come into the picture. Frameworks are toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing; they help provide insights, incorporate metadata and aid decision making aligned to business needs.
BIG DATA FRAMEWORKS
Presented by Cuelogic Technologies
Introduction
Three V's are vital for classifying data as Big Data: Volume, Velocity and Variety.
Volume: Data volumes are measured in terabytes, petabytes and so on.
Velocity: Velocity has to do with the high speed of data movement, such as real-time data streaming at a rapid rate, in microseconds.
Variety: Variety involves the handling approach for both structured and unstructured data.
There are many frameworks presently existing in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm.
Some, like Presto, score high on the utility index, while frameworks like Flink have great potential.
Still others deserve a mention, such as Samza, Impala and Apache Pig.
Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting.
This open-source framework provides batch data processing as well as data storage services across a group of hardware machines arranged in clusters.
Hadoop consists of multiple layers, such as HDFS and YARN, that work together to carry out data processing.
HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage activities across the nodes of a cluster. In the event of a cluster node failure, the data can still be made available for processing from its replicas.
YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling.
MapReduce is the software layer that functions as the batch processing engine.
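The batch processing model behind the MapReduce layer can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration, and each input string stands in for one input split.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all input splits."""
    mapped = []
    for doc in documents:  # each document stands in for one input split
        mapped.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(mapped))
```

Calling `word_count(["big data needs frameworks", "frameworks process big data"])` counts each word across both splits. In a real cluster the map and reduce steps would run on different machines, with the shuffle moving data between them.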
Pros: Include cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, high availability through excellent failure handling mechanism.
Cons: Include vulnerability to security breaches, does not perform in-memory computation hence suffers processing overheads, not suited for stream processing and real-time processing, issues in processing small files in large numbers.
Apache Spark
It is a batch processing framework with enhanced data stream processing.
With full in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.
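The benefit of in-memory computation can be sketched with a toy comparison. This does not use Spark itself; `CountingSource` is a hypothetical stand-in for a disk-backed dataset that simply counts how many times it is read, so the two strategies can be compared on I/O operations rather than on wall-clock time.

```python
class CountingSource:
    """Hypothetical stand-in for a disk-backed dataset; counts its reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def run_passes_from_disk(source, passes):
    """Re-read the source on every pass, as a disk-based batch job would."""
    return [sum(x * x for x in source.read()) for _ in range(passes)]


def run_passes_in_memory(source, passes):
    """Read the source once, keep it in memory, and reuse it for every pass."""
    cached = source.read()
    return [sum(x * x for x in cached) for _ in range(passes)]
```

Both functions return the same results, but the in-memory variant touches the source once regardless of the number of passes. Iterative workloads benefit most from this, since they revisit the same data many times.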
The Spark framework is composed of five layers.
HDFS and HBase: They form the first layer of data storage systems.
YARN and Mesos: They form the resource management layer.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX for processing graph data, SparkR utilities and MLlib for machine learning algorithms.
The fifth layer contains application programming interfaces for languages such as Java or Scala.
Pros: Include scalability, lightning processing speeds through reduced number of I/O operations to disk, fault tolerance, supports advanced analytics applications with superior AI implementation and seamless integration with Hadoop.
Cons: Include complexity of setup and implementation, language support limitation, not a genuine streaming engine.
Storm
Storm is a platform-independent application development framework that can be used with any programming language and guarantees delivery of data with the least latency.
In the Storm architecture, there are two types of nodes: the Master Node and the Worker/Supervisor Node.
The master node monitors machine failures and is responsible for task allocation. If a machine in the cluster fails, its task is reassigned to another one.
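The master node's role, allocating tasks and reassigning them after a failure, can be sketched as follows. This is a toy model and not Storm's API: the `Master` class, its methods and the least-loaded placement rule are all assumptions made for illustration.

```python
class Master:
    """Toy master node: tracks which worker currently runs which task."""

    def __init__(self, workers):
        self.alive = set(workers)
        self.assignments = {}  # task name -> worker name

    def _least_loaded(self):
        if not self.alive:
            raise RuntimeError("no live workers available")
        load = {worker: 0 for worker in self.alive}
        for worker in self.assignments.values():
            if worker in load:
                load[worker] += 1
        # Sort names first so ties are broken deterministically.
        return min(sorted(load), key=lambda w: load[w])

    def submit(self, task):
        """Allocate a task to the live worker with the fewest tasks."""
        self.assignments[task] = self._least_loaded()
        return self.assignments[task]

    def report_failure(self, worker):
        """Mark a worker as failed and move its tasks to live workers."""
        self.alive.discard(worker)
        orphaned = [t for t, w in self.assignments.items() if w == worker]
        for task in orphaned:
            del self.assignments[task]
        for task in orphaned:
            self.submit(task)
        return orphaned
```

After `report_failure("w1")`, every task that was running on `w1` appears under one of the remaining workers. A real system would also need failure detection, for example heartbeats, which this sketch leaves out.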
Pros: Include ease in setup and operation, high scalability, good speed, fault tolerance, support for a wide range of languages.
Cons: Include complex implementation, debugging issues and not very learner-friendly.
Apache Flink
Apache Flink, an open-source framework, is equally good for both batch and stream data processing.
It is suited for cluster environments. It is based on the concept of transformations applied to streams.
It is also described as the 4G of Big Data and is claimed to be up to 100 times faster than Hadoop MapReduce.
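The idea of chaining transformations over a stream, processing one entry at a time, can be sketched with Python generators. This is not Flink's API; the functions `parse`, `keep_above`, `running_max` and `pipeline`, and the `"sensor,value"` line format, are hypothetical. The same pipeline works on a bounded list (batch) and on an unbounded generator (stream).

```python
import itertools


def parse(lines):
    """Transformation 1: turn each raw line into a (sensor, value) pair."""
    for line in lines:
        sensor, value = line.split(",")
        yield sensor, float(value)


def keep_above(readings, threshold):
    """Transformation 2: keep only readings above a threshold."""
    for sensor, value in readings:
        if value > threshold:
            yield sensor, value


def running_max(readings):
    """Transformation 3: emit the running maximum per sensor after each entry."""
    best = {}
    for sensor, value in readings:
        best[sensor] = max(value, best.get(sensor, value))
        yield sensor, best[sensor]


def pipeline(lines, threshold=10.0):
    """Chain the transformations; nothing runs until results are consumed."""
    return running_max(keep_above(parse(lines), threshold))


# Bounded input (batch): consume the whole result.
batch_result = list(pipeline(["a,5", "a,12", "b,20", "a,11", "b,25"]))

# Unbounded input (stream): take only the first three results.
endless = (f"s,{n}" for n in itertools.count(11))
stream_result = list(itertools.islice(pipeline(endless), 3))
```

Because each stage is a generator, an entry flows through all three transformations before the next one is read, which is what lets the unbounded case produce results without waiting for the input to end.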
The Flink system contains multiple layers:
Deploy Layer
Runtime Layer
Library Layer
Pros: Include low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, compatibility with Hadoop.
Cons: Include few scalability issues.
Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.
The key components of the Hive Architecture include:
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests to MapReduce task chains. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them.
Optimizer: It goes through the sorted requests and optimises them.
Executor: It sends tasks to the MapReduce framework.
Presto
Presto is an open-source distributed SQL tool most suited for smaller datasets of up to 3 TB. The Presto engine includes a coordinator and multiple workers.
When a client submits queries, the coordinator parses and analyses them, plans their execution and distributes the work among the workers for processing.
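The coordinator and worker split can be sketched with a simple distributed average. This is not Presto's implementation: the functions `worker_partial_sum` and `coordinator_average`, the round-robin split and the use of threads in place of separate machines are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial_sum(rows):
    """Worker: compute a partial (sum, count) over its share of the rows."""
    return sum(rows), len(rows)


def coordinator_average(rows, num_workers=3):
    """Coordinator: split the work, fan it out, and merge partial results."""
    if not rows:
        raise ValueError("no rows to aggregate")
    # Plan: round-robin split so every worker gets a similar share.
    splits = [rows[i::num_workers] for i in range(num_workers)]
    # Distribute: threads stand in for worker machines here.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    # Merge: combine partial sums and counts into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Workers return partial sums and counts rather than partial averages, because averages of unevenly sized splits cannot be merged correctly by averaging them again.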
Pros: Include least query degradation even in the event of increased concurrent query workload. It has a query execution rate that is three times faster than Hive. Ease in adding images and embedding links. Highly user-friendly.
Cons: Include reliability issues.
Impala
Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems in a Hadoop cluster.
It is written in C++ and Java.
It is not coupled with its storage engine. It includes three main components:
Impala Daemon (impalad): It is executed on every node where Impala is installed.
Impala StateStore
Impala MetaStore
Impala has its own SQL-like query language.
Pros: Include in-memory computation, which lets it access data directly from Hadoop nodes without data movement, smooth integration with BI tools like Tableau, ZoomData, etc., support for a wide range of file formats.
Cons: Include no support for serialisation and deserialisation of data, inability to read custom binary files, table refresh needed for every record addition.
Contact Us
+1 347 3748437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St, New York, NY 10001
Content Source: Cuelogic Blog