DataLines: a framework for building streaming data applications
Mike Haberman, Senior Software/Network Engineer, mikeh@ncsa.edu
The Problem
• Data deluge: routers, switches, IDS, servers (web, mail, logs, etc.), software (tcpdump, web100, SNMP, tarpit, etc.), sensors, taps, … (help me)
The Problem (continued)
• Disparate data formats
• Software (sometimes) to manage each
• Tweaking to get what you want (custom software)
• Correlating data (more custom software)
DataLines
• Can we build a framework that removes all (or at least most) of the tedium of working with these disparate data formats?
DataLines Framework
• Designed to manage and build streaming data processing applications
DataLines Framework: designed to manage and build streaming data processing applications
• Manage: we would like one tool to handle all of these different data sources
DataLines Framework: designed to manage and build streaming data processing applications
• Build: a uniform way of creating a data processing application
DataLines Framework: designed to manage and build streaming data processing applications
• Streaming data:
  • A never-ending stream of "manageable" chunks of data
  • No random access, no blocking operators
  • One look at each item; linear or sub-linear algorithms/data operations
  • Each data item (a tuple in DataLines) is an independent entity
  • Many tools were not designed for streaming data
DataLines Framework: designed to manage and build streaming data processing applications
• Processing: something you want to do to the data (e.g. reading, writing, parsing, event generation, filtering, statistics, reports, data synopsis, …); see the sketch below
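To make "one look, sub-linear" concrete, here is a minimal sketch in plain Java of a statistics operator that sees each value exactly once and keeps only constant-size state. The class and method names are illustrative assumptions, not the actual DataLines API.

// Hypothetical sketch of a one-pass streaming statistic.
// Each value is seen exactly once and state stays O(1), so it
// fits the streaming constraints above. Names are illustrative.
public class RunningMean {
    private long count = 0;
    private double sum = 0.0;

    // Called once per arriving value; no buffering, no random access.
    public void process(double value) {
        count++;
        sum += value;
    }

    public double mean() {
        return count == 0 ? 0.0 : sum / count;
    }
}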
DataLines
• Creating a DataLines application: an XML file is "compiled" into a DataLines application

  XML file --("compile")--> DataLines Application
DataLines
• The XML file defines 3 major components:
  • Data Processors: what one does with the data
  • Processing Order: the order in which the processors operate on the data
  • Event Management: what to do when a processor generates an event
DataLines Processors
• Data Processors are the heart of DataLines:
  • I/O: socket, file
  • Filters: inline, dispatch
  • Collectors: binning, windowing (with operators)
  • GUI: charts, picture taking
  • Converters: binary to tuple
  • Misc: printers, counters, iterators, timers, data generators, gates, delays
• Processors can generate events
• Processors can drop, mutate, or mutilate the tuple being processed, or generate new tuples (see the sketch after this list)
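As one concrete illustration, here is a hedged sketch of an inline filter processor. Only DlTuple's getFieldNames()/getValue() come from this talk; DlValue.asInt(), the PortFilter class, and the null-means-drop convention are assumptions made for illustration.

// Hypothetical sketch of an inline filter processor. Only the
// DlTuple methods are taken from this talk; everything else here
// is an assumption for illustration.
interface DlValue { int asInt(); }

interface DlTuple {
    String[] getFieldNames();
    DlValue getValue(String fieldName);
}

class PortFilter {
    private final String field;
    private final int port;

    PortFilter(String field, int port) {
        this.field = field;
        this.port = port;
    }

    // Pass the tuple through if the field matches, drop it otherwise.
    DlTuple process(DlTuple tuple) {
        return tuple.getValue(field).asInt() == port ? tuple : null;
    }
}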
DataLines Pipelines
• Control tuple movement among processors
• Can connect either processors or other pipelines
• Two paths within a pipeline: binary and tuple
Event Management
• Allow processors to signal an event (timers, open/close, client connects, etc.)
• Allow the user to tie in domain logic
• Allow the user to call a processor-specific API (see the sketch below)
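A hedged sketch of the event side: a processor signals a named event and registered handlers react to it. In DataLines this wiring is done declaratively in the <eventManagement> XML shown later; all names in this sketch are illustrative assumptions, not the real API.

// Hypothetical sketch: a processor fires named events to handlers.
// All names here are assumptions for illustration only.
import java.util.ArrayList;
import java.util.List;

interface DlEventHandler {
    void onEvent(String eventName, Object source);
}

class EventSource {
    private final List<DlEventHandler> handlers = new ArrayList<>();

    void addHandler(DlEventHandler handler) { handlers.add(handler); }

    // A reader might fire "alert" when it sees something interesting.
    void fireEvent(String name) {
        for (DlEventHandler handler : handlers) {
            handler.onEvent(name, this);
        }
    }
}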
DataLines Data
• The generalization of data is a DlTuple
• A tuple is just a set of values
• DlTuple is the interface processors use (see the sketch below):
  • String[] <-- getFieldNames()
  • DlValue <-- getValue(fieldName)
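Using only the two methods named on this slide, any processor can inspect an arbitrary tuple without knowing its schema in advance. A small sketch, reusing the DlTuple/DlValue stubs from the filter sketch earlier; the printing logic is illustrative only, since the talk does not define DlValue's contents.

// Sketch: dump any tuple using only getFieldNames()/getValue().
// Reuses the DlTuple/DlValue stubs from the earlier filter sketch.
class TupleDumper {
    static void dump(DlTuple tuple) {
        for (String field : tuple.getFieldNames()) {
            DlValue value = tuple.getValue(field);
            System.out.println(field + " = " + value);
        }
    }
}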
DataLines Data
• Tuples can have virtual fields: calculated values, static values
• Tuples can have composite fields
• The creation of the tuple is left to the processor in charge of conversion
XML Syntax … run away!

<application>
  <dataline name="dl">

    <processor name="reader" type="FileReader">
      <configInfo> </configInfo>
    </processor>

    <pipeline name="p1">
      <pipe from="reader" to="parser"/>
      <pipe from="parser" to="printer"/>
    </pipeline>

    <eventManagement>
      <event name="start">
        <call method="start" target="reader"/>
      </event>
      <event name="alert" from="reader">
        <call method="stop" target="parser"/>
      </event>
    </eventManagement>

  </dataline>
</application>
Data Example

<arg name="tupleField">
  <map name="name" value="Src Ip"/>
  <map name="peer" value="IpV4AddressPeer"/>
  <map name="length" value="4"/>
</arg>
Data Example

<arg name="tupleField">
  <map name="name" value="A"/>
  <map name="peer" value="IntegerPeer"/>
  <map name="length" value="4"/>
</arg>

<arg name="tupleField">
  <map name="name" value="B"/>
  <map name="peer" value="IntegerPeer"/>
  <map name="length" value="4"/>
</arg>

<arg name="tupleField">
  <map name="name" value="C"/>
  <map name="peer" value="JepPeer"/>
  <data name="expression"> ${A} + ${B} </data>
</arg>
DataLines Tutorial
• Fast forward past a painful 3-hour tutorial covering each of those sections in detail (tuples, processors, pipelines, event management, configurations)
• You have seen all the XML, though!
DataLines Distilled
• A library of data processors that operate on tuples; one of the processors takes the raw data and creates the tuple
• An XML compiler that takes the XML file and the library and creates an application (see the sketch below)
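A self-contained sketch of that last step; DataLinesCompiler, DataLinesApplication, and their methods are assumptions standing in for whatever the real XML compiler exposes, since the talk only states that XML plus the library yields a runnable application.

// Hypothetical, self-contained sketch of the "compile" step:
// XML file + processor library -> runnable application.
// These class names are assumptions, not the real DataLines API.
interface DataLinesApplication {
    void start();
}

class DataLinesCompiler {
    // Stub standing in for the real XML compiler described above.
    DataLinesApplication compile(String xmlPath) {
        return () -> System.out.println("running pipeline from " + xmlPath);
    }
}

public class Main {
    public static void main(String[] args) {
        DataLinesApplication app = new DataLinesCompiler().compile("myapp.xml");
        app.start();   // would fire the "start" event wired in the XML
    }
}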
DataLines in use
• DataLines does make it easier to hit the ground running: much of the tedious work is taken care of for you
• For highly specific needs you still need to write code, but that code then becomes part of the DataLines library that others can build on
Balance Sheet
• Positive:
  • Flexible (vendor neutral, data, debugging)
  • Reusable (pipelines, processors)
  • Fast development time
  • "Easy" to change the client (CLI, desktop, web page)
• Negative:
  • May need to write domain-specific code
  • Learning curve: processor configuration, data expectations, events
DataLines in Action
• Network Engineering group
  • Monitors routers, the tar pit, IDS, packet sampling, L2/L3 mappings
• Security group
  • Network forensics
• Intergroup wiring
  • Uses DataLines to share data between groups/projects
DataLines in Action
• Network Research group
  • Monitors cluster network activity from the MPI layer
• Data Mining
• Misc. NSF data-oriented projects
Future
• Open source
• More info: mikeh@ncsa.edu
• http://datalines.ncsa.uiuc.edu (a work in progress)