Telegraph: Continuously Adaptive Dataflow • Joe Hellerstein
Scenarios • Ubiquitous computing: more than clients • sensors and their data feeds are key • smart dust, biomedical (MEMS sensors) • each consumer good records (mis)use • disposable computing • video from surveillance cameras, broadcasts, etc. • Global Data Federation • all the data is online – what are we waiting for? • The plumbing is coming • XML/HTTP, etc. give lowest-common-denominator (LCD) communication • but how do you flow, summarize, query and analyze data robustly over many sources in the wide area?
Dataflow in Volatile Environments • Federated query processors a reality • Cohera, IBM DataJoiner • No control over stats, performance, administration • Large Cluster Systems “Scaling Out” • No control over “system balance” • User “CONTROL” of running dataflows • Long-running dataflow apps are interactive • No control over user interaction • Sensor Nets: the next killer app • E.g. “Smart Dust” • No control over anything! • Telegraph • Dataflow Engine for these environments
Data Flood: Main Features • What does it look like? • Never ends: interactivity required • Online, controllable algorithms for all tasks! • Big: data reduction/aggregation is key • Volatile: this scale of devices and nets will not behave nicely
The Telegraph Dataflow Engine • Key technologies • Interactive Control • interactivity with early answers and examples • online aggregation for data reduction • Dataflow programming via paths/iterators • Elevate query processing frameworks out of DBMSs • Long tradition of static optimization here • Suggestive, but not sufficient for volatile environments • Continuously adaptive flow optimization • massively parallel, adaptive dataflow via Rivers and Eddies
CONTROL: Continuous Output and Navigation Technology with Refinement On Line • Data-intensive jobs are long-running. How to give early answers and interactivity? • online interactivity over feeds • pipelining “online” operators, data “juggle” • online data correlation algorithms: ripple joins, online mining and aggregation • statistical estimators, and their performance implications • Deliver data to satisfy statistical goals • Appreciate interplay of massive data processing, statistics, and HCI • “Of all men's miseries, the bitterest is this: to know so much and have control over nothing” (Herodotus)
Performance Regime for CONTROL • New “Greedy” Performance Regime • Maximize 1st derivative of the user-happiness function • [Figure: curves labeled CONTROL and Traditional, rising toward 100% over Time]
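To make the "early answers with refinement" idea concrete, here is a minimal sketch of online aggregation for a running average. It is an illustration only, not the CONTROL implementation: the function name `online_mean`, the reporting interval, and the normal-approximation confidence interval are all assumptions, and the interval is only meaningful if rows arrive in random order.

```python
import math
import random

def online_mean(stream, z=1.96, report_every=1000):
    """Yield (rows_seen, running_mean, half_width) while data streams in.

    Hypothetical helper: uses Welford's update for the running variance
    and a normal approximation for the confidence interval half-width.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            variance = m2 / (n - 1)
            yield n, mean, z * math.sqrt(variance / n)

random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(10000)]
estimates = list(online_mean(data))
for n, mean, half_width in estimates:
    print(f"{n:6d} rows: mean ~ {mean:.2f} +/- {half_width:.2f}")
```

The user sees a usable estimate after the first thousand rows and can stop as soon as the interval is tight enough, instead of waiting for the full scan.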
River • We built the world’s fastest sorting machine • On the “NOW”: 100 Sun workstations + SAN • But it only beat the record under ideal conditions! • River: performance adaptivity for data flows on clusters • simplifies management and programming • perfect for sensor-based streams
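A toy model can show why a cluster dataflow tuned for ideal conditions suffers when one node slows down. The sketch below is not River's mechanism; it only contrasts a fixed, equal partitioning of records with a shared queue that each consumer drains at its own rate. The function names and the rates are hypothetical.

```python
import math

def static_partition_time(rates, n_records):
    """Time steps to finish when every consumer gets an equal, fixed share."""
    share = n_records / len(rates)
    # The slowest consumer determines the finishing time.
    return max(math.ceil(share / rate) for rate in rates)

def shared_queue_time(rates, n_records):
    """Time steps to finish when consumers pull from one queue at their own rate."""
    remaining = n_records
    steps = 0
    while remaining > 0:
        for rate in rates:
            remaining -= min(rate, remaining)
        steps += 1
    return steps

rates = [10, 10, 10, 2]  # records per time step; the last node is degraded
print(static_partition_time(rates, 3200))  # 400
print(shared_queue_time(rates, 3200))      # 100
```

With four equally fast nodes both schemes finish in the same time; with one slow node the fixed partitioning is held back by the straggler while the adaptive scheme degrades only in proportion to the lost capacity.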
Declarative Dataflow: NOT new • Database Systems have been doing this for years • Translate declarative queries into an efficient dataflow plan • “query optimization” considers: • Alternate data sources (“access methods”) • Alternate implementations of operators • Multiple orders of operators • A space of alternatives defined by transformation rules • Estimate costs and “data rates”, then search space • But in a very static way! • Gather statistics once a week • Optimize query at submission time • Run a fixed plan for the life of the query • And these ideas are ripe to elevate out of DBMSs • And outside of DBMSs, the world is very volatile • There are surely going to be lessons “outside the box”
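The "estimate costs, then search the space" step can be sketched for the simplest case, ordering a set of filters. This is a minimal illustration under assumed statistics, not any particular optimizer: the cost model, the filter names, and the numbers are hypothetical. It also shows the weakness of the static approach: a plan chosen from stale statistics stays fixed even when the real selectivities differ.

```python
from itertools import permutations

def plan_cost(order, stats):
    """Expected cost per input tuple of applying the filters in this order.

    stats maps a filter name to (per_tuple_cost, selectivity), where
    selectivity is the fraction of tuples that pass the filter.
    """
    cost = 0.0
    surviving = 1.0
    for name in order:
        per_tuple_cost, selectivity = stats[name]
        cost += surviving * per_tuple_cost
        surviving *= selectivity
    return cost

def optimize(stats):
    """Exhaustively search all filter orders and return the cheapest."""
    return min(permutations(stats), key=lambda order: plan_cost(order, stats))

# Statistics gathered "once a week" (assumed values).
stats = {
    "cheap_selective": (1.0, 0.1),
    "cheap_unselective": (1.0, 0.9),
    "expensive_selective": (10.0, 0.2),
}
best = optimize(stats)

# What the data actually looks like at run time (assumed drift).
actual = dict(stats, cheap_selective=(1.0, 0.95))
print(best, plan_cost(best, actual), plan_cost(optimize(actual), actual))
```

The fixed plan remains correct but is no longer the cheapest once the statistics drift, which is the gap that run-time adaptivity is meant to close.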
Static Query Plans • Volatile environments like sensors need to adapt at a much finer grain
Continuous Adaptivity: Eddies • How to order and reorder operators over time • based on performance, economic/admin feedback • Vs. River: • River optimizes each operator “horizontally” • Eddies optimize a pipeline “vertically”
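The idea of reordering operators per tuple can be sketched with a small router over selection filters. This is a simplified illustration, not the Telegraph code: the slides do not state a routing policy, so the ticket-based lottery used here is an assumption of this sketch, the class and filter names are hypothetical, and joins are not handled.

```python
import random

class Eddy:
    """Routes each tuple through all filters, choosing the order at run time."""

    def __init__(self, filters, seed=0):
        self.filters = filters  # name -> predicate
        self.tickets = {name: 1 for name in filters}
        self.rng = random.Random(seed)

    def _pick(self, pending):
        # Lottery: filters holding more tickets are more likely to go first.
        weights = [self.tickets[name] for name in pending]
        return self.rng.choices(pending, weights=weights, k=1)[0]

    def process(self, item):
        pending = list(self.filters)
        while pending:
            name = self._pick(pending)
            if self.filters[name](item):
                pending.remove(name)  # passed: still owed to the other filters
            else:
                self.tickets[name] += 1  # reward a filter that removes work
                return False
        return True

eddy = Eddy({
    "rarely_passes": lambda x: x % 10 == 0,
    "usually_passes": lambda x: x % 10 != 9,
})
passed = [x for x in range(5000) if eddy.process(x)]
print(len(passed), eddy.tickets)
```

The output is the same for any routing order; only the amount of work changes. If the data shifts so that a different filter starts dropping most tuples, the ticket counts shift with it, without re-optimizing or restarting the query.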
Competitive Eddies • [Figure: an Eddy connected to inputs labeled R1, R2, R3 and S1, S2, S3, and to operators labeled block, index1, hash, index2]
Telegraph: Putting it Together • Scalable, adaptive dataflow infrastructure. Apps include… • sensor nets • massively parallel and wide-area query engines • net appliances: chaining transformation/aggregation/compression/etc. in proxies • any volatile dataflow scenario • Technology: a marriage of… • CONTROL, Rivers & Eddies • Many research questions here • E.g. how to combine River and Eddy adaptivity • E.g. how to tune Eddies for statistical performance goals • Combinations of browse/query/mine at UI • Storage management to handle new hardware realities • Look for a live service this summer!
Integration with Endeavour • Give • Be data-intensive backbone to diverse clients • Be replication/delivery dataflow engine for OceanStore • Telegraph Storage Manager provides storage (transactional or otherwise) for OceanStore • Provide platform for data-intensive “tacit info mining” • Take • Leverage OceanStore to manage distributed metadata, security • Leverage protocols out of TinyOS for sensors
Connectivity & Heterogeneity • Lots of folks working on data format translation, parsing • we will borrow, not build • currently using JDBC & Cohera Net Query • commercial tool, donated by Cohera Corp. • gateways XML/HTML (via HTTP) to ODBC/JDBC • we may write “Teletalk” gateways from sensors • Heterogeneity • never a simple problem • CONTROL project developed interactive, online data transformation tool: ABC
More Info • Collaborators: • Mike Franklin, Eric Brewer, Christos Papadimitriou • Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah • Me: jmh@cs.berkeley.edu • Web: • http://db.cs.berkeley.edu/telegraph • http://control.cs.berkeley.edu