A framework for easy development of Big Data applications Rubén Casado ruben.casado@treelogic.com @ruben_casado
Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions
About me :-) • Academics: PhD in Software Engineering • MSc in Computer Science • BSc in Computer Science • Work experience: Treelogic
Treelogic is an R&D intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life
Research lines: Computer Vision, Big Data, Terahertz technology, Data Science, Social Media Analysis, Semantics • Solutions: Security & Safety, Justice, Health, Transport, Financial services, ICT tailored solutions, R&D
7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them • 3 ongoing Eurostars projects, coordinating all of them
Research & Innovation • More than 300 partners in the last 3 years • More than 40 projects with a budget over 120 MEUR • 7 years' experience in R&D projects • Overall participation in 11 European projects • Project coordinator in 7 European projects
Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions
What is Big Data? A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques
How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -
3 problems Volume Variety Velocity
3 solutions Batch processing Real-time processing NoSQL
Batch processing • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Volume
Real-time processing • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Velocity
Hybrid computation model • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results Volume Velocity
Hybrid computation model: All data → Batch processing → batch results • New data → Stream → Real-time processing → real-time results • Batch results + real-time results → Combination → Final results
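To make the combination step concrete, here is a minimal sketch of merging a batch average with a real-time increment (all class and method names are illustrative, not part of any framework):
// Illustrative sketch of the combination step (all names hypothetical)
public class AverageCombiner {
    // Batch view: complete but stale. Real-time view: fresh but partial.
    public double combine(long batchCount, double batchAvg, long rtCount, double rtAvg) {
        long total = batchCount + rtCount;
        if (total == 0) return 0.0;  // nothing processed yet
        // Weighted merge: batch result plus the real-time increment
        return (batchAvg * batchCount + rtAvg * rtCount) / total;
    }
}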
Processing Paradigms • 2003: Inception • 2006, 1st Generation: Batch processing (large amounts of static data, scalable solution) for Volume • 2010, 2nd Generation: Real-time processing (computing streaming data, low latency) for Velocity • 2014, 3rd Generation: Hybrid computation (Lambda Architecture) for Volume + Velocity
Processing Pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions
What is Lambdoop? • Open source framework • Software abstraction layer over Open Source technologies • Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis • Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process • Same single API for the three processing paradigms • Batch processing similar to Pig / Cascading • Real-time processing using built-in functions, easier than Trident • Hybrid computation model transparent for the developer
Why Lambdoop? • Building a batch processing application requires • MapReduce development • Using other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog …) • Storage systems (HBase, MongoDB, HDFS, Cassandra…) • Real-time processing requires • Streaming computing (S4, Storm, Samza) • Unbounded input (Flume, Scribe) • Temporal data stores (in-memory, Kafka, Kestrel)
Why Lambdoop? • Building a hybrid computation system (Lambda Architecture) requires • Application logic to be defined in two different systems using different frameworks • Data to be serialized consistently and kept in sync between the two systems • The developer to read, write and manage two data storage systems, perform the final combination and serve the updated results
Why Lambdoop? • "One of the most interesting areas of future work is high level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture." Nathan Marz • "Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop (…) are there but there is a shortage of people with the expertise to leverage them." Rajat Jain
Lambdoop data flow: Static data and Streaming data enter as Data objects, Operations process them inside a Workflow, and the Workflow produces new Data
Lambdoop covers the three processing paradigms: Batch, Real-Time and Hybrid
Data Input • Information is represented as Data objects • Types: • StaticData • StreamingData • Every Data object has a Schema describing the Data fields (types, nullables, keys…) • A Data object is composed of Datasets.
Data Input • Dataset • A Data object is formed by one or more Datasets • All Datasets of a Data object share the same Schema • Datasets are formed by Register objects • A Register is composed of RegisterFields (see the sketch below)
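As a minimal sketch of walking this hierarchy (accessor names such as getDatasets, getRegisters and getValue are assumptions, not necessarily Lambdoop's exact API):
// Hypothetical traversal of the Data -> Dataset -> Register -> RegisterField hierarchy
for (Dataset dataset : data.getDatasets()) {  // all Datasets share the Data's Schema
    for (Register register : dataset.getRegisters()) {
        // A Register holds one value per RegisterField defined in the Schema
        Object so2 = register.getValue(new RegisterField("SO2"));
    }
}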
Data Input • Schema • Very similar to Avro definition schemas • Allows defining the input data's structure: fields, types, nullables… • JSON format
{
  "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}
Data Input • Importing data into Lambdoop • Loaders: import information from multiple sources and store it in HDFS as Data objects • Producers: get streaming data and represent it as Data objects • Heterogeneous sources • Serialize information into Avro format
Data Input • Static Data example: importing an Air Quality dataset from local logs to HDFS • Loader • Schema's path is files/csv/Air_quality_schema
// Read schema from a file
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Data Input • Streaming Data example: reading streaming sensor data from a TCP port • Producer • Weather stations emit messages to port 8080 • Schema's path is files/csv/Air_quality_schema
int port = 8080;
// Read schema
String schema = readSchemaFile(schema_file);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);
Data Input • Extensibility • Users can implement their own data loaders/producers • Extend the Loader/Producer interface • Read data from the original source • Get and serialize the information (Avro format) according to the Schemas (a sketch follows)
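A minimal sketch of a custom loader under those assumptions (the base-class constructor and the load extension point are guesses based on the description above):
// Hypothetical custom loader for a JDBC source; the Loader contract is assumed
public class JdbcLoader extends Loader {
    private final String jdbcUri;
    public JdbcLoader(String output, String jdbcUri, String schema) {
        super(output, schema);  // assumed base constructor (output path + schema)
        this.jdbcUri = jdbcUri;
    }
    @Override
    public void load() {        // assumed extension point
        // 1. Read rows from the original source (a JDBC query)
        // 2. Serialize each row to Avro according to the Schema
        // 3. Write the Avro records to HDFS as a Data object
    }
}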
Operations • Unitary actions to process data • An Operation takes Data as input, processes the Data and produces another Data as output • Types of operations: • Aggregation: produces a single value per DataSet • Filter: output data has the same schema as input data • Group: produces several DataSets, grouping registers together • Projection: changes the Data schema, but preserves the records and their values • Join: combines different Data objects (see the sketch after this list)
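For instance, operations other than the Filter and Avg used in the later examples would follow the same constructor style (a sketch; the Group and Projection signatures shown here are assumptions):
// Sketch: other operation types follow the same pattern (signatures assumed)
Group byStation = new Group(new RegisterField("Station"));        // one DataSet per station
Projection projection = new Projection(new RegisterField("Station"),
                                       new RegisterField("SO2")); // keep only these fields
wf.addOperation(byStation);
wf.addOperation(projection);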
Operations • Extensibility (User Defined Operations): new operations can be defined by implementing a set of interfaces: • OperationFactory: factory used by the framework to get the batch, streaming and hybrid operation implementations when needed • BatchOperation: provides the MapReduce logic to process the input Data • StreamingOperation: provides Storm/Trident-based functions to process streaming registers • HybridOperation: provides the merging logic between streaming and batch results
Operations • User Defined Operation interfaces (a sketch follows)
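A minimal sketch of a user-defined operation (the method names on these interfaces are assumptions inferred from the descriptions above):
// Hypothetical user-defined Median operation; all interface methods are assumed
public class MedianFactory implements OperationFactory {
    public BatchOperation getBatchOperation() { return new MedianBatch(); }
    public StreamingOperation getStreamingOperation() { return new MedianStreaming(); }
    public HybridOperation getHybridOperation() { return new MedianHybrid(); }
}
class MedianBatch implements BatchOperation { /* MapReduce logic: map emits the field value, reduce computes the median */ }
class MedianStreaming implements StreamingOperation { /* Trident-based rolling median over incoming registers */ }
class MedianHybrid implements HybridOperation { /* merges the batch median with the streaming estimate */ }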
Workflows • Sequence of connected Operations. Manages tasks and resources (check-points) to produce an output from input data and a set of Operations • BatchWorkflow: runs a set of operations on a StaticData input and produces a new StaticData as output • StreamingWorkflow: operates on a StreamingData to produce another StreamingData • HybridWorkflow: combines Static and Streaming data to produce complete and updated results (StreamingData) • Workflow connections: the output Data of one Workflow can feed another, so Workflows can be chained and branched
Workflows
// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
Data output = wf.getResults();
Workflows
// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}
Workflows
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);
// Calculate SO2 average on filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
}
Results exploitation • Built-in operations: Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join … • Outputs: VISUALIZATION • EXPORT (CSV, JSON, …) • ALARM SYSTEM (a sketch of the export step follows)
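For example, an export step could look like this (a sketch; the CSVExporter class and its methods are hypothetical, not a documented Lambdoop API):
// Sketch: exporting workflow results to CSV (CSVExporter is hypothetical)
Data results = wf.getResults();
CSVExporter exporter = new CSVExporter("air_quality_avg.csv");
exporter.export(results);  // write each Register as a CSV row, using the Schema for the header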
Results exploitation • Visualization
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to workflow */
wf.addOperation(new Count());
…
/* Get results from workflow */
Data results = wf.getResults();
/* Show results. Set dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));