DataLines: a framework for building streaming data applications
Mike Haberman, Senior Software/Network Engineer, mikeh@ncsa.edu
The Problem
• Data deluge: routers, switches, IDS, servers (web, mail, logs, etc.), software (tcpdump, web100, SNMP, tarpit, etc.), sensors, taps, … (help me)
The Problem (continued)
• Disparate data formats
• Software (sometimes) to manage each
• Tweaking to get what you want (custom software)
• Correlating data (more custom software)
DataLines
• Can we build a framework that removes all (or at least most) of the tedium of working with these disparate data formats?
DataLines Framework
• Designed to manage and build streaming data processing applications
DataLines Framework: designed to manage and build streaming data processing applications
• Manage: we would like one tool to handle all of these different data sources
DataLines Framework: designed to manage and build streaming data processing applications
• Build: a uniform way of creating a data processing application
DataLines Framework: designed to manage and build streaming data processing applications
• Streaming data:
  • A never-ending stream of "manageable" chunks of data
  • No random access, no blocking operators
  • One look at each item; linear or sub-linear algorithms/data operations
  • Each data item (a tuple in DataLines) is an independent entity
  • Many tools were not designed for streaming data
DataLines Framework: designed to manage and build streaming data processing applications
• Processing: something you want to do to the data (e.g. reading, writing, parsing, event generation, filtering, statistics, reports, data synopsis, …); see the sketch below
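To make "one look, sub-linear" concrete, here is a minimal sketch in plain Java of a statistics operator that sees each value exactly once and keeps only constant-size state. The class and method names are illustrative assumptions, not the actual DataLines API.

// Hypothetical sketch of a one-pass streaming statistic.
// Each value is seen exactly once and state stays O(1), so it
// fits the streaming constraints above. Names are illustrative.
public class RunningMean {
    private long count = 0;
    private double sum = 0.0;

    // Called once per arriving value; no buffering, no random access.
    public void process(double value) {
        count++;
        sum += value;
    }

    public double mean() {
        return count == 0 ? 0.0 : sum / count;
    }
}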
DataLines
• Creating a DataLines application: an XML file is "compiled" into a DataLines application

  XML file --("compile")--> DataLines Application
DataLines
• The XML file defines 3 major components:
  • Data Processors: what one does with the data
  • Processing Order: the order in which the processors operate on the data
  • Event Management: what to do when a processor generates an event
DataLines Processors
• Data Processors are the heart of DataLines:
  • I/O: socket, file
  • Filters: inline, dispatch
  • Collectors: binning, windowing (with operators)
  • GUI: charts, picture taking
  • Converters: binary to tuple
  • Misc: printers, counters, iterators, timers, data generators, gates, delays
• Processors can generate events
• Processors can drop, mutate, or mutilate the tuple being processed, or generate new tuples (see the sketch after this list)
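As one concrete illustration, here is a hedged sketch of an inline filter processor. Only DlTuple's getFieldNames()/getValue() come from this talk; DlValue.asInt(), the PortFilter class, and the null-means-drop convention are assumptions made for illustration.

// Hypothetical sketch of an inline filter processor. Only the
// DlTuple methods are taken from this talk; everything else here
// is an assumption for illustration.
interface DlValue { int asInt(); }

interface DlTuple {
    String[] getFieldNames();
    DlValue getValue(String fieldName);
}

class PortFilter {
    private final String field;
    private final int port;

    PortFilter(String field, int port) {
        this.field = field;
        this.port = port;
    }

    // Pass the tuple through if the field matches, drop it otherwise.
    DlTuple process(DlTuple tuple) {
        return tuple.getValue(field).asInt() == port ? tuple : null;
    }
}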
DataLines Pipelines
• Control tuple movement among processors
• Can connect either processors or other pipelines
• Two paths within a pipeline: binary and tuple
Event Management
• Allow processors to signal an event (timers, open/close, client connects, etc.)
• Allow the user to tie in domain logic
• Allow the user to call a processor-specific API (see the sketch below)
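A hedged sketch of the event side: a processor signals a named event and registered handlers react to it. In DataLines this wiring is done declaratively in the <eventManagement> XML shown later; all names in this sketch are illustrative assumptions, not the real API.

// Hypothetical sketch: a processor fires named events to handlers.
// All names here are assumptions for illustration only.
import java.util.ArrayList;
import java.util.List;

interface DlEventHandler {
    void onEvent(String eventName, Object source);
}

class EventSource {
    private final List<DlEventHandler> handlers = new ArrayList<>();

    void addHandler(DlEventHandler handler) { handlers.add(handler); }

    // A reader might fire "alert" when it sees something interesting.
    void fireEvent(String name) {
        for (DlEventHandler handler : handlers) {
            handler.onEvent(name, this);
        }
    }
}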
DataLines Data
• The generalization of data is a DlTuple
• A tuple is just a set of values
• DlTuple is the interface processors use (see the sketch below):
  • String[] <-- getFieldNames()
  • DlValue <-- getValue(fieldName)
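Using only the two methods named on this slide, any processor can inspect an arbitrary tuple without knowing its schema in advance. A small sketch, reusing the DlTuple/DlValue stubs from the filter sketch earlier; the printing logic is illustrative only, since the talk does not define DlValue's contents.

// Sketch: dump any tuple using only getFieldNames()/getValue().
// Reuses the DlTuple/DlValue stubs from the earlier filter sketch.
class TupleDumper {
    static void dump(DlTuple tuple) {
        for (String field : tuple.getFieldNames()) {
            DlValue value = tuple.getValue(field);
            System.out.println(field + " = " + value);
        }
    }
}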
DataLines Data
• Tuples can have virtual fields: calculated values, static values
• Tuples can have composite fields
• The creation of the tuple is left to the processor in charge of conversion
XML Syntax … run away!

<application>
  <dataline name="dl">

    <processor name="reader" type="FileReader">
      <configInfo> </configInfo>
    </processor>

    <pipeline name="p1">
      <pipe from="reader" to="parser"/>
      <pipe from="parser" to="printer"/>
    </pipeline>

    <eventManagement>
      <event name="start">
        <call method="start" target="reader"/>
      </event>
      <event name="alert" from="reader">
        <call method="stop" target="parser"/>
      </event>
    </eventManagement>

  </dataline>
</application>
Data Example

<arg name="tupleField">
  <map name="name" value="Src Ip"/>
  <map name="peer" value="IpV4AddressPeer"/>
  <map name="length" value="4"/>
</arg>
Data Example

<arg name="tupleField">
  <map name="name" value="A"/>
  <map name="peer" value="IntegerPeer"/>
  <map name="length" value="4"/>
</arg>

<arg name="tupleField">
  <map name="name" value="B"/>
  <map name="peer" value="IntegerPeer"/>
  <map name="length" value="4"/>
</arg>

<arg name="tupleField">
  <map name="name" value="C"/>
  <map name="peer" value="JepPeer"/>
  <data name="expression"> ${A} + ${B} </data>
</arg>
DataLines Tutorial
• Fast forward past a painful 3-hour tutorial covering each of those sections in detail (tuples, processors, pipelines, event management, configurations)
• You have seen all the XML, though!
DataLines Distilled
• A library of data processors that operate on tuples; one of the processors takes the raw data and creates the tuple
• An XML compiler that takes the XML file and the library and creates an application (see the sketch below)
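A self-contained sketch of that last step; DataLinesCompiler, DataLinesApplication, and their methods are assumptions standing in for whatever the real XML compiler exposes, since the talk only states that XML plus the library yields a runnable application.

// Hypothetical, self-contained sketch of the "compile" step:
// XML file + processor library -> runnable application.
// These class names are assumptions, not the real DataLines API.
interface DataLinesApplication {
    void start();
}

class DataLinesCompiler {
    // Stub standing in for the real XML compiler described above.
    DataLinesApplication compile(String xmlPath) {
        return () -> System.out.println("running pipeline from " + xmlPath);
    }
}

public class Main {
    public static void main(String[] args) {
        DataLinesApplication app = new DataLinesCompiler().compile("myapp.xml");
        app.start();   // would fire the "start" event wired in the XML
    }
}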
DataLines in use
• DataLines does make it easier to hit the ground running: much of the tedious work is taken care of for you
• For highly specific needs you still need to write code, but that code then becomes part of the DataLines library that others can build on
Balance Sheet
• Positive:
  • Flexible (vendor neutral, data, debugging)
  • Reusable (pipelines, processors)
  • Fast development time
  • "Easy" to change the client (CLI, desktop, web page)
• Negative:
  • May need to write domain-specific code
  • Learning curve: processor configuration, data expectations, events
DataLines in Action
• Network Engineering group
  • Monitors routers, the tar pit, IDS, packet sampling, L2/L3 mappings
• Security group
  • Network forensics
• Intergroup wiring
  • Uses DataLines to share data between groups/projects
DataLines in Action
• Network Research group
  • Monitors cluster network activity from the MPI layer
• Data Mining
• Misc. NSF data-oriented projects
Future
• Open source
• More info: mikeh@ncsa.edu
• http://datalines.ncsa.uiuc.edu (a work in progress)