Big Data Frameworks

BIG DATA FRAMEWORKS
Presented by Cuelogic Technologies

Introduction
There are 3 V's that are vital for classifying data as Big Data: Volume, Velocity and Variety.
Volume: Data volumes are measured in terabytes, petabytes and beyond.
Velocity: Velocity has to do with the high speed of data movement, such as real-time data streaming where records arrive within microseconds of each other.
Variety: Variety concerns how both structured and unstructured data are handled.

Implementation of Big Data infrastructure and technology can be seen in various industries like banking, retail, insurance, healthcare, media, etc.
Big Data management functions like storage, sorting, processing and analysis for such colossal volumes cannot be handled by the existing database systems or technologies.

There are many frameworks in this space. Some of the popular ones are Spark, Hadoop, Hive and Storm.
Some, like Presto, score high on the utility index, while frameworks like Flink have great potential.
Others that deserve a mention include Samza, Impala and Apache Pig.
Some of these frameworks are briefly discussed below.

Apache Hadoop
Hadoop is a Java-based platform founded by Mike Cafarella and Doug Cutting.
This open-source framework provides batch data processing as well as data storage services across a group of hardware machines arranged in clusters.
Hadoop consists of multiple layers, like HDFS and YARN, that work together to carry out data processing.

HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage activities across the nodes of a cluster. In the event of a cluster node failure, the data can still be made available for processing from its replicas.
YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling.
MapReduce is the software layer that functions as the batch processing engine.
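The batch processing model behind MapReduce can be illustrated with a small word count. This is a minimal sketch in plain Python, not the Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence (hypothetical single-process driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data frameworks", "big data tools"])
```

In a real cluster the map and reduce calls would run on different machines, with the shuffle moving data between them; here they are ordinary function calls.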

Pros: cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, high availability through an excellent failure handling mechanism.
Cons: vulnerability to security breaches; no in-memory computation, hence processing overheads; not suited for stream processing and real-time processing; issues in processing small files in large numbers.

Apache Spark
It is a batch processing framework with enhanced data stream processing.
With full in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.

The Spark framework is composed of five layers.
HDFS and HBase: They form the first layer, the data storage systems.
YARN and Mesos: They form the resource management layer.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX for processing graph data, SparkR for R-language access and MLlib for machine learning algorithms.
The fifth layer contains an application programming interface for languages such as Java or Scala.

Pros: scalability, lightning processing speeds through a reduced number of I/O operations to disk, fault tolerance, support for advanced analytics applications with superior AI implementation, seamless integration with Hadoop.
Cons: complexity of setup and implementation, language support limitation, not a genuine streaming engine.
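The benefit of keeping data in memory across repeated passes can be shown with a toy counter. This is a sketch under stated assumptions, not Spark code: DiskDataset and run_passes are hypothetical names, and "disk" is simulated by a counter that increments on every read.

```python
class DiskDataset:
    """Hypothetical stand-in for data stored on disk; counts every read."""

    def __init__(self, records):
        self._records = list(records)
        self.disk_reads = 0

    def read(self):
        # Each call represents one full scan of the data from disk.
        self.disk_reads += 1
        return list(self._records)


def run_passes(dataset, passes, cache):
    """Sum the dataset `passes` times, optionally keeping it in memory."""
    cached = dataset.read() if cache else None
    totals = []
    for _ in range(passes):
        records = cached if cache else dataset.read()
        totals.append(sum(records))
    return totals


uncached = DiskDataset([1, 2, 3])
uncached_totals = run_passes(uncached, passes=5, cache=False)

cached = DiskDataset([1, 2, 3])
cached_totals = run_passes(cached, passes=5, cache=True)
```

Both runs produce the same totals, but the cached run touches the simulated disk once while the uncached run touches it on every pass. Iterative workloads such as machine learning repeat passes many times, which is where the saving matters.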

Storm
Storm is a platform-independent application development framework that can be used with any programming language and guarantees delivery of data with the least latency.
In the Storm architecture there are 2 kinds of nodes: the Master Node and the Worker/Supervisor Node. The master node monitors the machines for failures and is responsible for task allocation. If a worker node fails, its tasks are reassigned to another node.
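The master node's two duties, task allocation and reassignment after a failure, can be sketched as follows. This is a hypothetical illustration in plain Python, not Storm's API: the MasterNode class, its methods and the least-loaded placement rule are all assumptions made for the example.

```python
class MasterNode:
    """Hypothetical master that tracks which worker runs which task."""

    def __init__(self, workers):
        self.assignments = {worker: [] for worker in workers}

    def _least_loaded(self):
        # Assumed placement rule: pick the worker with the fewest tasks.
        return min(self.assignments, key=lambda w: len(self.assignments[w]))

    def assign(self, task):
        """Allocate one task to a worker and return that worker's name."""
        worker = self._least_loaded()
        self.assignments[worker].append(task)
        return worker

    def handle_failure(self, failed_worker):
        """Remove a failed worker and hand its tasks to the survivors."""
        orphaned = self.assignments.pop(failed_worker)
        if not self.assignments:
            raise RuntimeError("no workers left to take over the tasks")
        for task in orphaned:
            self.assign(task)
        return orphaned


master = MasterNode(["worker-1", "worker-2", "worker-3"])
for task in ["t1", "t2", "t3", "t4"]:
    master.assign(task)

moved = master.handle_failure("worker-1")
```

After the failure, every task is still assigned to some surviving worker, which is the property the master node is meant to preserve.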

Pros: ease in setup and operation, high scalability, good speed, fault tolerance, support for a wide range of languages.
Cons: complex implementation, debugging issues, not very learner-friendly.

Apache Flink
Apache Flink, an open-source framework, is equally good for both batch and stream data processing.
It is suited for cluster environments. It is based on the concept of streams and transformations.
It is also described as the 4G of Big Data, and is claimed to be up to 100 times faster than Hadoop MapReduce.

The Flink system contains multiple layers:
Deploy Layer
Runtime Layer
Library Layer

Pros: low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, compatibility with Hadoop.
Cons: a few scalability issues.
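Entry-by-entry processing over a chain of transformations can be imitated with Python generators. This is a minimal sketch, not the Flink API: the function names (source, parse, running_max) and the "sensor,value" record format are assumptions chosen for the example.

```python
def source(events):
    """Yield events one at a time, as an unbounded stream would."""
    for event in events:
        yield event


def parse(stream):
    """Transformation: turn 'sensor,value' strings into (sensor, float) pairs."""
    for raw in stream:
        sensor, value = raw.split(",")
        yield sensor, float(value)


def running_max(stream):
    """Stateful transformation: emit the running maximum per sensor."""
    best = {}
    for sensor, value in stream:
        best[sensor] = max(value, best.get(sensor, value))
        yield sensor, best[sensor]


# Each entry flows through every transformation before the next one is read.
pipeline = running_max(parse(source(["a,1.0", "b,4.0", "a,3.5", "a,2.0"])))
results = list(pipeline)
```

A result is produced after every single entry rather than after a whole batch has been collected. Feeding the same pipeline a finite list, as done here, treats a batch as a bounded stream, which is why one model can serve both batch and stream workloads.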

Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.

The key components of the Hive architecture include:
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests to MapReduce task chains. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them.
Optimizer: It goes through the sorted requests and optimises them.
Executor: It sends tasks to the MapReduce framework.


Presto
Presto is an open-source distributed SQL tool most suited for smaller datasets of up to 3 TB.
The Presto engine includes a coordinator and multiple workers. When a client submits queries, the coordinator parses and analyses them, plans their execution and distributes them among the workers for processing.
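The coordinator and worker split can be sketched with a thread pool standing in for the worker machines. This is an illustration under stated assumptions, not Presto code: worker_sum, coordinator_sum and the round-robin split of rows are hypothetical, and the "query" is reduced to summing a single column.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_sum(rows, column):
    """Worker: compute a partial aggregate over its share of the rows."""
    return sum(row[column] for row in rows)


def coordinator_sum(rows, column, num_workers=3):
    """Coordinator: plan the splits, distribute them, merge partial results."""
    # Assumed plan: deal rows out round-robin, one split per worker.
    splits = [rows[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda split: worker_sum(split, column), splits))
    return sum(partials)


orders = [{"amount": n} for n in range(1, 11)]
total = coordinator_sum(orders, "amount")
```

The coordinator never touches individual rows itself; it only decides how work is divided and combines what the workers return.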

Pros: least query degradation even in the event of increased concurrent query workload; a query execution rate that is three times faster than Hive; ease in adding images and embedding links; highly user-friendly.
Cons: reliability issues.

Impala
Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems in a Hadoop cluster.
It is written in C++ and Java.

It is not coupled with its storage engine. It includes 3 main components:
Impala Daemon (Impalad): It is executed on every node where Impala is installed.
Impala StateStore
Impala MetaStore
Impala has its own SQL-like query language.

Pros: supports in-memory computation, so data is accessed directly from Hadoop nodes without being moved; smooth integration with BI tools like Tableau, ZoomData, etc.; supports a wide range of file formats.
Cons: no support for serialisation and deserialisation of data, inability to read custom binary files, table refresh needed for every record addition.

Contact Us
+1 347 3748437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St, New York, NY 10001
Content Source: Cuelogic Blog
