
Adaptive Dataflow: A Database/Networking Cosmic Convergence

Joe Hellerstein, UC Berkeley


Presentation Transcript


  1. Adaptive Dataflow: A Database/Networking Cosmic Convergence • Joe Hellerstein, UC Berkeley

  2. Road Map • How I got started on this • CONTROL project • Eddies • Tie-ins to Networking Research • Telegraph & ongoing adaptive dataflow research • New arenas: • Sensor networks • P2P networks

  3. Background: CONTROL project • Online/Interactive query processing • Online aggregation • Scalable spreadsheets & refining visualizations • Online data cleaning (Potter’s Wheel) • Pipelining operators (ripple joins, online reordering) over streaming samples

  4. Example: Online Aggregation

  5. Online Data Visualization • CLOUDS

  6. Potter’s Wheel

  7. Goals for Online Processing • Performance metric: • Statistical (e.g. conf. intervals) • User-driven (e.g. weighted by widgets) • New “greedy” performance regime • Maximize 1st derivative of the “mirth index” • Mirth defined on-the-fly • Therefore need FEEDBACK and CONTROL [Figure: Online vs. Traditional processing; axis labels: 100%, Time]
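
The statistical metric on this slide (confidence intervals that tighten as samples stream in) can be illustrated with a running average. This is a minimal sketch, not the CONTROL implementation: the class name `OnlineAverage`, the use of Welford's update, and the CLT-based 95% interval (z = 1.96) are all assumptions made for illustration.

```python
import math


class OnlineAverage:
    """Running AVG with a CLT-based confidence interval (hypothetical helper)."""

    def __init__(self, z=1.96):  # z = 1.96 gives roughly a 95% interval
        self.z = z
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Welford's numerically stable update of mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimate(self):
        """Return (mean, half_width); half_width is inf until n >= 2."""
        if self.n < 2:
            return self.mean, math.inf
        variance = self.m2 / (self.n - 1)  # sample variance
        return self.mean, self.z * math.sqrt(variance / self.n)
```

A caller would invoke `add` for each sampled tuple and redraw `estimate()` after every batch, so the user can stop the query as soon as the interval is tight enough.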

  8. CONTROL → Volatility • Goals and data may change over time • User feedback, sample variance • Goals and data may be different in different “regions” • Group-by, scrollbar position • [An aside: dependencies in selectivity estimation] • Q: Query optimization in this world? • Or in any pipelining, volatile environment?? • Where else do we see volatility?

  9. Continuous Adaptivity: Eddies • A little more state per tuple • Ready/done bits (extensible a la Volcano/Starburst) • Query processing = dataflow routing!! • We'll come back to this! [Figure: Eddy]

  10. Eddies: Two Key Observations • Break the set-oriented boundary • Usual DB model: algebra expressions: (R ⋈ S) ⋈ T • Usual DB implementation: pipelining operators! • Subexpressions never materialized • Typical implementation is more flexible than algebra • We can reorder in-flight operators • Other gains possible by breaking the set-oriented boundary… • Don’t rewrite graph. Impose a router • Graph edge = absence of routing constraint • Observe operator consumption/production rates • Consumption: cost • Production: cost*selectivity
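
The routing idea on this slide can be sketched as a loop that sends each tuple to whichever remaining operator currently looks most selective, using only observed consumption and production counts. This is a simplified illustration, not the eddy routing policy from the talk: the greedy choice, the smoothed pass-rate estimate, and the names `Op` and `eddy` are assumptions, and operator cost is ignored.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Op:
    """A pipelined filter plus the counts the router observes (hypothetical)."""
    name: str
    predicate: Callable[[int], bool]
    seen: int = 0     # tuples consumed
    passed: int = 0   # tuples produced

    def pass_rate(self):
        # Smoothed estimate so operators with no history start at 0.5.
        return (self.passed + 1) / (self.seen + 2)


def eddy(tuples, ops):
    """Route every tuple through all operators, lowest pass rate first."""
    out = []
    for t in tuples:
        done = set()   # per-tuple "done bits"
        alive = True
        while alive and len(done) < len(ops):
            ready = [op for op in ops if op.name not in done]
            op = min(ready, key=Op.pass_rate)  # greedy routing decision
            op.seen += 1
            if op.predicate(t):
                op.passed += 1
                done.add(op.name)
            else:
                alive = False  # tuple leaves the dataflow
        if alive:
            out.append(t)
    return out
```

No plan is ever rewritten here: the operator order is decided per tuple, so the effective order drifts toward the more selective filter as the counts accumulate.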

  11. Road Map • How I got started on this • CONTROL project • Eddies • Tie-ins to Networking Research • Telegraph & ongoing adaptive dataflow research • New arenas: • Sensor networks • P2P networks

  12. Coincidence: Eddie Comes to Berkeley • CLICK: a NW router is a query plan! • “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ‘99

  13. Also Scout • Paths the key to comm-centric OS • “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson. OSDI ‘96. [Figure 3: Example Router Graph]

  14. More Interaction: CS262 Experiment w/ Eric Brewer • Merge OS & DBMS grad class, over a year • Eric/Joe, point/counterpoint • Some tie-ins were obvious: • memory mgmt, storage, scheduling, concurrency • Surprising: QP and networks go well side by side • E.g. eddies and TCP Congestion Control • Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment • Eddies close to the n-armed bandit problem
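
For DB readers, the "simple Control Theory" in TCP congestion control is commonly summarized as additive increase, multiplicative decrease (AIMD). The sketch below shows only that feedback rule; the function names, the one-unit increase, the halving factor, and the floor of 1.0 are illustrative assumptions, not a model of any particular TCP variant.

```python
def aimd(window, loss, increase=1.0, decrease=0.5, floor=1.0):
    """One AIMD step: grow additively, back off multiplicatively on loss."""
    if loss:
        return max(floor, window * decrease)
    return window + increase


def run(loss_signals, start=1.0):
    """Apply AIMD across a sequence of per-round loss signals."""
    window = start
    trace = []
    for loss in loss_signals:
        window = aimd(window, loss)
        trace.append(window)
    return trace
```

The parallel to eddies is that both adjust a rate purely from observed feedback, with no prior model of the environment.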

  15. Networking Overview for DB People Like Me • Core function of protocols: data xfer • Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation) • Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing) -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90 • Basic Internet assumption: • “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

  16. C & T’s Wacky Ideas • Thesis: nets are good at xfer control, not so good at data manipulation • Some C&T wacky ideas for better data manipulation • Xfer semantic units, not packets (ALF) • Auto-rewrite layers to flatten them (ILP) • Minimize cross-layer ordering constraints • Control delivery in parallel via packet content [Slide callouts: Data Modeling! Query Opt! Exchange!]

  17. What if… • We had unbounded data producers and consumers (“streams” … “continuous queries”) • We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”) • We couldn’t predict user behavior? (“control”) • We couldn’t predict behavior of components in the dataflow? (“networked services”) • We had partial failure as a given? (oops, have we ignored this?) • Yes … networking people have been here! • Remember Van Jacobson’s quote? [Slide label: Wacky New Ideas in QP]

  18. The Cosmic Convergence • Data Models, Query Opt, Data Scalability • DATABASE RESEARCH • Adaptive Query Processing • Continuous Queries • Approximate/Interactive QP • Sensor Databases • Content-Based Routing • Router Toolkits • Content Addressable Networks • Directed Diffusion • NETWORKING RESEARCH • Adaptivity, Federated Control, Geo Scalability

  19. The Cosmic Convergence • Data Models, Query Opt, Data Scalability • DATABASE RESEARCH • Adaptive Query Processing • Continuous Queries • Approximate/Interactive QP • Sensor Databases • Telegraph • Content-Based Routing • Router Toolkits • Content Addressable Networks • Directed Diffusion • NETWORKING RESEARCH • Adaptivity, Federated Control, Geo Scalability

  20. Road Map • How I got started on this • CONTROL project • Eddies • Tie-ins to Networking Research • Telegraph & ongoing adaptive dataflow research • New arenas: • Sensor networks • P2P networks

  21. What’s in the Sweet Spot? • Scenarios with: • Structured Content • Volatility • Rich Queries • Clearly: • Long-running data analysis a la CONTROL • Continuous queries • Queries over Internet sources and services • Two emerging scenarios: • Sensor networks • P2P query processing

  22. Telegraph: Engineering the Sweet Spot • An adaptive dataflow system • Dataflow programming model • A la Volcano, CLICK: push and pull. “Fjords”, ICDE02 • Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter) • SQL parser for convenience (looking at XQuery) • Adaptivity operators • Eddies • + Extensible rules for routing constraints, Competition • SteMs (state modules) • FLuX (Fault-tolerant Load-balancing eXchange) • Bounded and continuous: • Data sources • Queries

  23. State Modules (SteMs) • Goal: Further adaptivity through competition • Multiple mirrored sources • Handle rate changes, failures, parallelism • Multiple alternate operators • Join = Routing + State • SteM operator manages tradeoffs • State Module: unifies caches, rendezvous buffers, join state • Competitive sources/operators share building/probing SteMs • Join algorithm hybridization! • Vijayshankar Raman [Figure panels: static dataflow, eddy, eddy + stems]
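
"Join = Routing + State" can be made concrete with two state modules and a fixed build-then-probe route, which yields a symmetric hash join. This is a sketch under assumptions: the `SteM` class with `build` and `probe` methods and the `symmetric_join` driver are hypothetical names, and a real eddy would choose the route adaptively rather than hard-coding it.

```python
from collections import defaultdict


class SteM:
    """State module: a keyed store with build (insert) and probe (lookup)."""

    def __init__(self):
        self.index = defaultdict(list)

    def build(self, key, tup):
        self.index[key].append(tup)

    def probe(self, key):
        # .get avoids creating empty entries for keys that were never built.
        return list(self.index.get(key, []))


def symmetric_join(stream, key_of):
    """Join interleaved ("R", tuple) / ("S", tuple) arrivals via two SteMs."""
    stems = {"R": SteM(), "S": SteM()}
    results = []
    for source, tup in stream:
        other = "S" if source == "R" else "R"
        key = key_of(tup)
        stems[source].build(key, tup)          # route 1: build into own SteM
        for match in stems[other].probe(key):  # route 2: probe the other SteM
            pair = (tup, match) if source == "R" else (match, tup)
            results.append(pair)
    return results
```

Because the join state lives in the SteMs and not inside a join operator, changing the route (for example, probing a cache or a mirrored source first) changes the join algorithm without moving any state.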

  24. Fault Tolerance, Load Balancing • Continuous/long-running flows need high availability • Big flows need parallelism • Adaptive Load-Balancing req’d • FLuX operator: Exchange plus… • Adaptive flow partitioning (River) • Transient state replication & migration • RAID for SteMs • Needs to be extensible to different ops: • Content-sensitivity • History-sensitivity • Dataflow semantics • Optimize based on edge semantics • Networking tie-in again: • At-least-once delivery? • Exactly-once delivery? • In/Out of order? • Migration policy: the ski rental analogy • Mehul Shah [Slide label: FLuX: Routing Across Cluster]

  25. Continuously Adaptive Continuous Queries (CACQ) • Continuous Queries clearly need all this stuff! Address adaptivity 1st. • 4 Ideas in CACQ: • Use eddies to allow reordering of ops. • But one eddy will serve for all queries • Explicit tuple lineage • Mark each tuple with per-op ready/done bits • Mark each tuple with per-query completed bits • Queries are data: join with Grouped Filter • Much like XFilter, but for relational queries • Joins via SteMs, shared across all queries • Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions! • Delete a tuple from flow only if it matches no query • Next: F.T. CACQ via FLuXen • Sam Madden, Mehul Shah, Vijayshankar Raman
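
"Queries are data" can be illustrated by indexing many queries' predicates so that one probe per tuple answers all of them. The sketch below is an assumption-heavy simplification, not the CACQ design: it handles only predicates of the form `attr > c`, and `GroupedFilter`, `add_query`, `matching_queries`, and `route` are hypothetical names.

```python
import bisect


class GroupedFilter:
    """Indexes many queries' `attr > c` predicates behind a single probe."""

    def __init__(self):
        self.thresholds = []  # sorted constants c
        self.query_ids = []   # query id aligned with each threshold

    def add_query(self, query_id, threshold):
        i = bisect.bisect_left(self.thresholds, threshold)
        self.thresholds.insert(i, threshold)
        self.query_ids.insert(i, query_id)

    def matching_queries(self, value):
        # Every threshold strictly below `value` satisfies value > c.
        i = bisect.bisect_left(self.thresholds, value)
        return set(self.query_ids[:i])


def route(values, grouped_filter):
    """Yield (value, matching query ids); drop values matching no query."""
    for v in values:
        matches = grouped_filter.matching_queries(v)
        if matches:
            yield v, matches
```

The returned set plays the role of the per-query bits in the tuple's lineage, and a tuple is deleted from the flow only when that set is empty.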

  26. Road Map • How I got started on this • CONTROL project • Eddies • Tie-ins to Networking Research • Telegraph & ongoing adaptive dataflow research • New arenas: • Sensor networks • P2P networks

  27. Sensor Nets • “Smart Dust” + TinyOS • Thousands of “motes” • Expensive communication • Power constraints • Query workload: • Aggregation & approximation • Queries and Continuous Queries • Challenges: • Push the processing into the network • Deal with volatility & failure • CONTROL issues: data variance, user desires • Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab) • Simple example: Aggregation query
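
Pushing an aggregation query into the network can be sketched as each mote combining its own reading with its children's partial results and forwarding one small record up a routing tree. This is an illustration under assumptions: the tree is given as a `children` mapping, AVG is carried as a (sum, count) partial state, and failures and radio behavior are ignored.

```python
def aggregate(node, children, readings):
    """Return the partial state (sum, count) for the subtree rooted at node."""
    total, count = readings[node], 1
    for child in children.get(node, []):
        child_total, child_count = aggregate(child, children, readings)
        total += child_total   # merge the child's partial state
        count += child_count
    return total, count


def network_average(root, children, readings):
    """Finish the AVG at the root from the merged partial state."""
    total, count = aggregate(root, children, readings)
    return total / count
```

Each link carries one (sum, count) pair regardless of subtree size, which is the point when communication is the expensive resource.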

  28. P2P QP • Starting point: P2P as grassroots phenomenon • Outrageous filesharing volume (1.8Gfiles in October 2001) • No business case to date • Challenge: scale DDBMS QP ideas to P2P • Motivate why • Pick the right parts of DBMS research to focus on • Storage: no! QP: yes. • Make it work: • Scalability well beyond our usual target • Admin constraints • Unknown data distributions, load • Heterogeneous comm/processing • Partial failure • Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

  29. A Grassroots Example: TeleNap

  30. Themes Throughout • Adaptivity • Requires clever system design • The Exchange model: encapsulate in ops? • Interesting adaptive policy problems • E.g. eddy routing, flux migration • Control Theory, Machine Learning • Encompasses another CS goal? • “No-knobs”, “Autonomic”, etc. • New performance regimes • Decent performance in the common case • Mean/Variance more important than MAX • Interactive Metrics • Time to completion often unimportant/irrelevant

  31. More Themes • Set-valued thinking as albatross? • E.g. eddies vs. Kabra/DeWitt or Tukwila • E.g. SteMs vs. Materialized Views • E.g. CACQ vs. NiagaraCQ • Some clean theory here would be nice • Current routing correctness proofs are inelegant • Extensibility • Model/language of choice is not clear • SEQ? Relational? XQuery? • Extensible operators, edge semantics • [A whine about VLDB’s absurd “Specificity Factor”]

  32. Conclusions? • Too early for technical conclusions • Of this I’m sure: • The CS262 experiment is a success • Our students are getting a bigger picture than before • I’m learning, finding new connections • May morph to OS/Nets, Nets/DB • Eventually rethink the systems software curriculum at the undergraduate level too • Nets folks are coming our way • Doing relevant work, eager to collaborate • DB community needs to branch out • Outbound: Better proselytizing in CS • Inbound: Need new ideas

  33. Conclusions, cont. • Sabbatical is a good invention • Hasn’t even started, I’m already grateful!
