A framework for easy development of Big Data applications Rubén Casado ruben.casado@treelogic.com @ruben_casado
Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions
About me • Academics: PhD in Software Engineering, MSc in Computer Science, BSc in Computer Science • Work Experience
Treelogic is an R&D-intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life
R&D • Research Lines: Computer Vision, Big Data, Terahertz technology, Data science, Social Media Analysis, Semantics • Solutions: Security & Safety, Justice, Health, Transport, Financial services, ICT tailored solutions
7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them • 3 ongoing Eurostars projects, coordinating all of them
Research & Innovation • More than 300 partners in the last 3 years • More than 40 projects with a budget over 120 MEUR • 7 years' experience in R&D projects • Overall participation in 11 European projects • Project coordinator in 7 European projects
Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions
What is Big Data? A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques
How is Big Data defined? “Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (Gartner IT Glossary)
3 problems Volume Variety Velocity
3 solutions Batch processing Real-time processing NoSQL
Batch processing • Scalable • Large amount of static data • Distributed • Parallel • Fault-tolerant • High latency • Volume
Real-time processing • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Velocity
Hybrid computation model • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results • Volume + Velocity
Hybrid computation model • All data → Batch → Batch processing results • New data → Stream → Real-time processing results • Batch results + Real-time results → Combination → Final results
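As a minimal sketch of the combination step, assume the batch layer and the real-time layer each expose a partial aggregate (sum and count) instead of a finished average; the final result is then obtained by merging the two. The PartialAvg class and its methods are hypothetical names chosen for illustration and are not part of Lambdoop.

```java
// Hypothetical illustration of the combination step; not Lambdoop code.
final class PartialAvg {
    final double sum;
    final long count;

    PartialAvg(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Fold raw values into a partial aggregate (done separately by each layer).
    static PartialAvg of(double... values) {
        double s = 0;
        for (double v : values) {
            s += v;
        }
        return new PartialAvg(s, values.length);
    }

    // Combination: merge the batch result with the real-time result.
    PartialAvg merge(PartialAvg other) {
        return new PartialAvg(sum + other.sum, count + other.count);
    }

    // Final result served to the user.
    double value() {
        if (count == 0) {
            throw new IllegalStateException("no data to average");
        }
        return sum / count;
    }

    public static void main(String[] args) {
        PartialAvg batch = PartialAvg.of(10, 20, 30); // all historical data
        PartialAvg realTime = PartialAvg.of(40);      // data that arrived since the last batch run
        System.out.println(batch.merge(realTime).value());
    }
}
```

Keeping sum and count separate matters because averaging the two averages directly would weight the small real-time window as heavily as the whole batch dataset.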
Processing Paradigms • 2003: Inception • 2006, 1st generation: Batch processing (large amount of static data, scalable solution, Volume) • 2010, 2nd generation: Real-time processing (computing streaming data, low latency, Velocity) • 2014, 3rd generation: Hybrid computation (Lambda Architecture, Volume + Velocity)
Processing Pipeline • Data acquisition → Data storage → Data analysis → Results
Agenda • Big Data processing • Lambdoop framework • Lambdoop ecosystem • Case studies • Conclusions
What is Lambdoop? • Open source framework • Software abstraction layer over Open Source technologies • Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis • Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like process • Same single API for the three processing paradigms • Batch processing similar to Pig / Cascading • Real-time processing using built-in functions, easier than Trident • Hybrid computation model transparent for the developer
Why Lambdoop? • Building a batch processing application requires • MapReduce development • Use of other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog…) • Storage systems (HBase, MongoDB, HDFS, Cassandra…) • Real-time processing requires • Streaming computing (S4, Storm, Samza) • Unbounded input (Flume, Scribe) • Temporal data stores (In-memory, Kafka, Kestrel)
Why Lambdoop? • Building a hybrid computation system (Lambda Architecture) requires • Application logic has to be defined in two different systems using different frameworks • Data must be serialized consistently and kept in sync between each system • Developer is responsible for reading, writing and managing two data storage systems, performing a final combination and serving the final updated results
Why Lambdoop? “One of the most interesting areas of future work is high level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture.” Nathan Marz • “Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop (…) are there but there is a shortage of people with the expertise to leverage them.” Rajat Jain
Lambdoop • [Diagram: Static data and Streaming data enter as Data, pass through a Workflow of Operations, and leave as Data]
Lambdoop Batch Hybrid Real-Time
Data Input • Information is represented as Data objects • Types: • StaticData • StreamingData • Every Data object has a Schema that describes the Data fields (types, nullables, keys…) • A Data object is composed of Datasets.
Data Input • Dataset • A Data object is formed by one or more Datasets. • All Datasets of a Data object share the same Schema. • Datasets are formed by Register objects. • A Register is composed of RegisterFields.
Data Input • Schema • Very similar to Avro schema definitions. • Defines the input data's structure: fields, types, nullables… • JSON format
{
  "type": "csv",
  "name": "AirQuality records",
  "fieldSeparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    …
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}
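To make the role of the schema concrete, here is a rough sketch of how field definitions such as name, type, index and nullable could drive the parsing of one separated line of text. It is an assumption about how such a schema might be applied, not Lambdoop's loader code; SchemaSketch, Field and parseLine are hypothetical names, and the fields are built in code instead of being read from JSON to keep the example free of external libraries.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration; these classes are not part of Lambdoop.
final class SchemaSketch {

    // Mirrors one entry of the schema's "fields" array.
    static final class Field {
        final String name;
        final String type;
        final int index;
        final boolean nullable;

        Field(String name, String type, int index, boolean nullable) {
            this.name = name;
            this.type = type;
            this.index = index;
            this.nullable = nullable;
        }
    }

    // Converts one text line into a name -> value map, enforcing types and nullability.
    static Map<String, Object> parseLine(String line, String separator, List<Field> fields) {
        // limit -1 keeps trailing empty cells so that missing values are detected
        String[] cells = line.split(Pattern.quote(separator), -1);
        Map<String, Object> register = new LinkedHashMap<>();
        for (Field f : fields) {
            String raw = f.index < cells.length ? cells[f.index].trim() : "";
            if (raw.isEmpty()) {
                if (!f.nullable) {
                    throw new IllegalArgumentException("Missing value for field " + f.name);
                }
                register.put(f.name, null);
            } else if ("double".equals(f.type)) {
                register.put(f.name, Double.parseDouble(raw));
            } else {
                register.put(f.name, raw); // treated as string
            }
        }
        return register;
    }
}
```

For example, with a non-nullable Station field at index 0 and a nullable Lat. field at index 2, the line "ST01;street 45;;" yields Station = "ST01" and Lat. = null, while a line with an empty first cell is rejected.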
Data Input • Importing data into Lambdoop • Loaders: Import information from multiple sources and store it in HDFS as Data objects • Producers: Get streaming data and represent it as Data objects • Heterogeneous sources. • Serialize information into Avro format
Data Input • Static Data example: importing an Air Quality dataset from local logs to HDFS • Loader • Schema's path is files/csv/Air_quality_schema
// Read the schema from a file
String schema = readSchemaFile(schema_file);
// Create the loader and wrap it in a StaticData object
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Data Input • Streaming Data example: reading streaming sensor data from a TCP port • Producer • Weather stations emit messages to port 8080 • Schema's path is files/csv/Air_quality_schema
int port = 8080;
// Read the schema
String schema = readSchemaFile(schema_file);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create the Data object
Data data = new StreamingData(producer);
Data Input • Extensibility • Users can implement their own data loaders/producers • Extend Loader/Producer interface • Read data from original source • Get and serialize information (Avro format) considering Schemas
Operations • Unitary actions to process data • An Operation takes Data as input, processes the Data and produces another Data as output • Types of operations: • Aggregation: Produces a single value per DataSet • Filter: Output data has the same schema as input data • Group: Produces several DataSets, grouping registers together • Projection: Changes the Data schema, but preserves the records and their values • Join: Combines different Data objects
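The operation types listed above can be sketched with plain Java collections, treating a register as a map from field name to value. This is only an illustration of what each type does to the data and its schema; OperationSketch and its methods are hypothetical and do not reflect how Lambdoop implements its operations (Join is left out for brevity).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical illustration; not Lambdoop code.
final class OperationSketch {

    // Filter: keeps matching registers, schema unchanged.
    static List<Map<String, Object>> filter(List<Map<String, Object>> data, String field, Object value) {
        return data.stream()
                .filter(r -> Objects.equals(r.get(field), value))
                .collect(Collectors.toList());
    }

    // Aggregation: a single value for the whole dataset.
    static double avg(List<Map<String, Object>> data, String field) {
        return data.stream()
                .mapToDouble(r -> ((Number) r.get(field)).doubleValue())
                .average()
                .orElseThrow(() -> new IllegalArgumentException("empty dataset"));
    }

    // Group: several datasets, one per distinct value (the grouping field must be non-null).
    static Map<Object, List<Map<String, Object>>> group(List<Map<String, Object>> data, String field) {
        return data.stream().collect(Collectors.groupingBy(r -> r.get(field)));
    }

    // Projection: same registers, reduced schema.
    static List<Map<String, Object>> project(List<Map<String, Object>> data, List<String> fields) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r : data) {
            Map<String, Object> projected = new LinkedHashMap<>();
            for (String f : fields) {
                projected.put(f, r.get(f));
            }
            out.add(projected);
        }
        return out;
    }

    // Helper that builds a two-field register for the examples.
    static Map<String, Object> register(String title, double so2) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("Title", title);
        r.put("SO2", so2);
        return r;
    }
}
```

Chaining filter(data, "Title", "street 45") and then avg(filtered, "SO2") reproduces, in miniature, the filter-then-average pattern used in the workflow examples.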
Operations • Extensibility (User Defined Operations): New operations can be defined by implementing a set of interfaces: • OperationFactory: Factory used by the framework in order to get batch, streaming and hybrid operation implementations when needed • BatchOperation: Provides MapReduce logic to process the input Data • StreamingOperation: Provides Storm/Trident based functions to process streaming registers • HybridOperation: Provides merging logic between streaming and batch results
Operations • User Defined Operation interfaces
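The split of one user-defined operation into batch, streaming and hybrid parts can be illustrated with a toy "maximum" operation. The interfaces below are simplified stand-ins with assumed signatures (BatchOp, StreamingOp, HybridOp, OpFactory); they mirror the roles described for OperationFactory, BatchOperation, StreamingOperation and HybridOperation but are not the real Lambdoop interfaces, which rely on MapReduce and Storm/Trident.

```java
import java.util.List;

// Hypothetical stand-ins for the four roles; signatures are assumptions.
interface BatchOp<R> {
    R run(List<Double> allData);            // processes the complete static input
}

interface StreamingOp<R> {
    R update(R current, double newValue);   // processes one streaming register at a time
}

interface HybridOp<R> {
    R merge(R batchResult, R streamingResult); // combines both partial results
}

interface OpFactory<R> {
    BatchOp<R> batch();
    StreamingOp<R> streaming();
    HybridOp<R> hybrid();
}

// A user-defined "maximum" operation expressed once per paradigm.
final class MaxFactory implements OpFactory<Double> {
    @Override
    public BatchOp<Double> batch() {
        return data -> data.stream()
                .mapToDouble(Double::doubleValue)
                .max()
                .orElse(Double.NEGATIVE_INFINITY);
    }

    @Override
    public StreamingOp<Double> streaming() {
        return (current, value) -> Math.max(current, value);
    }

    @Override
    public HybridOp<Double> hybrid() {
        return (batchResult, streamingResult) -> Math.max(batchResult, streamingResult);
    }
}
```

A framework holding such a factory can pick the batch part for a batch workflow, the streaming part for a streaming workflow, and all three for a hybrid one, which is what lets a single operation definition serve the three paradigms.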
Workflows • Sequence of connected Operations. Manages tasks and resources (check-points) in order to produce an output using input data and a set of Operations • BatchWorkflow: Runs a set of operations on StaticData input and produces a new StaticData as output • StreamingWorkflow: Operates on a StreamingData to produce another StreamingData • HybridWorkflow: Combines Static and Streaming data to produce completed and updated results (StreamingData) • Workflow connections [Diagram: Data objects chained between Workflows]
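The idea of a workflow as a sequence of connected operations, each taking Data and producing Data, can be sketched as an ordered list of functions applied one after another. WorkflowSketch is a hypothetical class written for illustration; it ignores the task, resource and check-point management that the real workflows are described as handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration; D stands for whatever type plays the role of Data.
final class WorkflowSketch<D> {
    private final List<UnaryOperator<D>> operations = new ArrayList<>();

    // Operations run in the order in which they are added.
    WorkflowSketch<D> addOperation(UnaryOperator<D> operation) {
        operations.add(operation);
        return this;
    }

    // The output of each operation becomes the input of the next one.
    D run(D input) {
        D current = input;
        for (UnaryOperator<D> operation : operations) {
            current = operation.apply(current);
        }
        return current;
    }
}
```

Because the output type equals the input type, the result of one workflow can be passed directly to another, which is one plausible reading of the workflow connections mentioned above.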
Workflows
// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
Data output = wf.getResults();
Workflows
// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results as they are updated
while (!stop) {
    Data output = wf.getResults();
    …
}
Workflows
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results as they are updated
while (!stop) {
    Data output = wf.getResults();
}
Results exploitation • Operations: Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join, … • Visualization • Export (CSV, JSON, …) • Alarm system
Results exploitation • Visualization
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
…
/* Get results from the workflow */
Data results = wf.getResults();
/* Show results. Set dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));
Results exploitation • Visualization