
The Stanford Data Streams Research Project

The Stanford Data Streams Research Project. Profs. Rajeev Motwani & Jennifer Widom, and a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma. stanfordstreamdatamanager


Presentation Transcript


  1. The Stanford Data Streams Research Project • Profs. Rajeev Motwani & Jennifer Widom • And a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma • stanfordstreamdatamanager

  2. Data Streams • Traditional DBMS -- data stored in finite, persistent data sets • New applications -- data as multiple, continuous, rapid, time-varying data streams • Network monitoring and traffic engineering • Security applications • Telecom call records • Financial applications • Web logs and click-streams • Sensor networks • Manufacturing processes

  3. Challenges • Multiple, continuous, rapid, time-varying streams of data • Queries may be continuous (not just one-time) • Evaluated continuously as stream data arrives • Answer updated over time • Queries may be complex • Beyond element-at-a-time processing • Beyond stream-at-a-time processing

  4. Using Traditional Database (diagram) • Labels: User/Application, Query, Result, Loader

  5. New Approach for Data Streams (diagram) • Labels: User/Application, Register Query, Result, Stream Query Processor

  6. New Approach for Data Streams (diagram) • Labels: User/Application, Register Query, Result, Stream Query Processor, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)

  7. DBMS versus DSMS

  8. DBMS versus DSMS • DBMS: persistent relations • DSMS: transient streams (and persistent relations)

  9. DBMS versus DSMS • DBMS: persistent relations; one-time queries • DSMS: transient streams (and persistent relations); continuous queries

  10. DBMS versus DSMS • DBMS: persistent relations; one-time queries; random access • DSMS: transient streams (and persistent relations); continuous queries; sequential access

  11. DBMS versus DSMS • DBMS: persistent relations; one-time queries; random access; access plan determined by query processor and physical DB design • DSMS: transient streams (and persistent relations); continuous queries; sequential access; unpredictable data arrival and characteristics

  12. DBMS versus DSMS • DBMS: persistent relations; one-time queries; random access; access plan determined by query processor and physical DB design; “unbounded” disk store • DSMS: transient streams (and persistent relations); continuous queries; sequential access; unpredictable data arrival and characteristics; bounded main memory

  13. Sample Applications • Network management and traffic engineering (e.g., Sprint) • Streams of measurements and packet traces • Queries: detect anomalies, adjust routing • Telecom call data (e.g., AT&T) • Streams of call records • Queries: fraud detection, customer call patterns, billing

  14. Sample Applications (cont’d) • Network security (e.g., iPolicy, NetForensics/Cisco, Netscreen) • Network packet streams, user session information • Queries: URL filtering, detecting intrusions & DOS attacks & viruses • Financial applications (e.g., Traderbot) • Streams of trading data, stock tickers, news feeds • Queries: arbitrage opportunities, analytics, patterns

  15. Sample Applications (cont’d) • Web tracking and personalization (e.g., Yahoo, Google, Akamai) • Clickstreams, user query streams, log records • Queries: monitoring, analysis, personalization • Truly massive databases (e.g., Astronomy Archives) • Stream the data past once (or over and over) • Queries do the best they can

  16. Making Things Concrete • Database = two streams of mobile call records • Outgoing(connectionID, caller, start, end) • Incoming(connectionID, callee, start, end) • Query language = SQL • FROM clauses can refer to streams and/or relations

  17. Query Example 1 • Find all outgoing calls longer than 2 minutes (relational selection): SELECT O.connectionID, O.caller FROM Outgoing O WHERE O.end - O.start > 2 • Result requires unbounded storage • Can provide result as data stream
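The selection in Query Example 1 can be sketched as a generator that inspects each call record once and emits matches immediately, so the operator itself holds no state. This is an illustrative sketch only: the `Outgoing` tuple type, the `long_calls` name, and the assumption that `start` and `end` are measured in minutes are not from the slides.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Outgoing(NamedTuple):
    """One record of the Outgoing stream (field units are an assumption: minutes)."""
    connection_id: int
    caller: str
    start: float
    end: float


def long_calls(stream: Iterable[Outgoing],
               min_minutes: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Emit (connectionID, caller) for every call longer than min_minutes."""
    for call in stream:
        # Each tuple is tested on its own; nothing is remembered between tuples.
        if call.end - call.start > min_minutes:
            yield (call.connection_id, call.caller)


calls = [
    Outgoing(1, "alice", 0.0, 5.0),
    Outgoing(2, "bob", 1.0, 2.5),
    Outgoing(3, "carol", 2.0, 4.5),
]
result = list(long_calls(calls))  # [(1, "alice"), (3, "carol")]
```

Materializing the result with `list` is only for the demonstration; on an endless stream the output is itself a stream, which is why storing it would need unbounded space.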

  18. Query Example 2 • Pair up callers and callees (relational join): SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.connectionID = I.connectionID • Can still provide result as data stream • Requires unbounded temporary storage (without additional assumptions)
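One common way to evaluate such a join over two streams is a symmetric hash join, sketched below. It is not necessarily the technique used in the STREAM system; the event encoding (`"O"`/`"I"` tags) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def symmetric_hash_join(events):
    """Join Outgoing and Incoming records on connectionID as they arrive.

    `events` is an iterable of (side, connection_id, party) in arrival order,
    where side is "O" (party = caller) or "I" (party = callee).
    """
    outgoing = defaultdict(list)  # connection_id -> callers seen so far
    incoming = defaultdict(list)  # connection_id -> callees seen so far
    for side, conn_id, party in events:
        if side == "O":
            outgoing[conn_id].append(party)
            # Probe the other side's table for partners that arrived earlier.
            for callee in incoming.get(conn_id, ()):
                yield (party, callee)
        else:
            incoming[conn_id].append(party)
            for caller in outgoing.get(conn_id, ()):
                yield (caller, party)


events = [("O", 1, "alice"), ("I", 2, "dave"), ("I", 1, "bob"), ("O", 2, "carol")]
pairs = list(symmetric_hash_join(events))  # [("alice", "bob"), ("carol", "dave")]
```

Both hash tables only ever grow, which is the unbounded temporary storage the slide refers to. An extra assumption, for example that each connectionID occurs exactly once per stream, would allow an entry to be dropped as soon as it has been matched.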

  19. Query Example 3 • Find total connection time for each caller (relational grouping and aggregation): SELECT O.caller, sum(O.end - O.start) FROM Outgoing O GROUP BY O.caller • Cannot provide result in (append-only) stream
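A small sketch makes the append-only problem visible: a running aggregate has to revise totals it has already reported. The tuple layout `(caller, start, end)` and the function name are assumptions for illustration.

```python
from collections import defaultdict


def running_totals(stream):
    """After each call record, emit the caller's updated total connection time."""
    totals = defaultdict(float)
    for caller, start, end in stream:
        totals[caller] += end - start
        # This output supersedes any earlier total emitted for the same caller.
        yield (caller, totals[caller])


updates = list(running_totals([
    ("alice", 0.0, 3.0),
    ("bob", 1.0, 2.0),
    ("alice", 5.0, 9.0),
]))
# [("alice", 3.0), ("bob", 1.0), ("alice", 7.0)]
```

The third output replaces the first, so a consumer needs update semantics (or a stored, continuously maintained answer); an append-only stream has no way to retract `("alice", 3.0)`.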

  20. Project Goal • Reconsider all aspects of data management and processing in presence of data streams

  21. Remainder of Talk • Data stream model • Queries over data streams • Language, semantics, evaluation & optimization • DSMS query processing architecture and system internals • Results to date • Ongoing work • Related work

  22. Data Model • Database: relations + data streams • Stream characteristics • Type of data (schema) • Data distribution • Flow rate • Stability of distribution and flow • Ordering and other constraints • Synchronization of multiple streams • Distributed streams

  23. Data Stream Queries -- Basic Issues • Answer availability • One-time • Multiple-time • Continuous (“standing”), stored or streamed • Registration time • Predefined • Ad-hoc • Stream access • Arbitrary • Sliding window (special case: size = 1)
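Sliding-window stream access can be illustrated with a count-based window that keeps only the most recent tuples. This is a minimal sketch; the function name and the choice of a sum as the windowed query are assumptions.

```python
from collections import deque


def sliding_sum(stream, window_size):
    """Emit the sum of the most recent `window_size` values after each arrival."""
    window = deque()
    total = 0
    for value in stream:
        window.append(value)
        total += value
        if len(window) > window_size:
            # Evict the oldest value so memory stays bounded by window_size.
            total -= window.popleft()
        yield total


sums = list(sliding_sum([1, 2, 3, 4, 5], window_size=3))  # [1, 3, 6, 9, 12]
```

With `window_size=1` the output equals the input, which corresponds to the special case on the slide: the query sees only the current element.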

  25. Query Language & Semantics • Specifying queries over streams • SQL-like versus dataflow network of operators • Sliding windows as first-class query construct • Semantic issues • Blocking operators, e.g., aggregation, order-by • Streams as sets versus lists • Timestamping

  26. Query Evaluation -- Approximation • Why approximate? • Streams are coming too fast • Exact answer requires unbounded storage or significant computational resources • Ad hoc queries reference history • Issues in approximation • Sliding windows, sampling, synopses, … • How is approximation controlled? • How is it understood by user? • Accuracy-efficiency-storage tradeoff
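Sampling is one of the approximation techniques listed above. A standard way to keep a fixed-size uniform sample of a stream of unknown length is reservoir sampling, sketched here; it is shown as a general technique, not as the project's implementation, and the function name is an assumption.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items using O(k) memory."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

Ad hoc queries that reference history could then be answered approximately from the sample, trading accuracy for the fixed memory footprint.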

  27. Query Evaluation -- Adaptivity • Why adaptivity? • Queries are long-running • Fluctuating stream arrival & data characteristics • Evolving query loads • Issues in adaptivity • Adaptive resource allocation (memory, computation) • Adaptive query execution plans

  28. Query Evaluation -- Multiple Queries • Possibly large number of continuous queries • Long-running • Shared resources • Multi-query optimization

  29. Query Evaluation -- Distributed Streams • Many physical streams but one logical stream • E.g., maintain top 100 visited pages at Yahoo • Correlate streams at distributed servers • E.g., network monitoring • Many streams controlled by a few servers • E.g., sensor networks • Issues • Move processing to streams, not streams to processor • Approximation-bandwidth tradeoff

  30. Query Processing Architecture (diagram) • Input Data Streams feed Query Plans, which produce an Output Stream • Plan components shown: operators (Running Op, Ready Op, Waiting Op) and Synopses • Applications register continuous queries • Users issue continuous and ad-hoc queries • Administrator can monitor query execution and adjust run-time parameters

  31. DSMS Internals • Query plans: operators, synopses, queues • Memory management • Dynamic allocation to buffers, queues, synopses • Accuracy vs. memory tradeoff • Operators adapt gracefully to memory reallocation • Scheduler • Handles variable-rate input streams • Handles varying operator and query requirements

  32. Some Results to Date • Algorithms on data streams • Online clustering [FOCS 2000, ICDE 2002] • Online quantiles [SIGMOD 98, SIGMOD 99] • Statistics over sliding windows [SODA 2002] • Online frequency counting • Theory of stream query processing • Memory requirements of stream queries [PODS 2002] • System design • STREAM: stanfordstreamdatamanager
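To give a flavor of online frequency counting in bounded memory, the sketch below uses the Misra-Gries summary. This is a well-known textbook algorithm chosen for brevity; the slides do not say which algorithm the project's result is based on, so treat it as a stand-in.

```python
def misra_gries(stream, k):
    """Summarize item frequencies with at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to remain in the summary; each reported count undercounts
    the true frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


summary = misra_gries(["a"] * 6 + ["b", "c", "d"], k=2)  # {"a": 3}
```

Here `"a"` occurs 6 times out of 9, more than 9 / 2, so it survives with an undercount of 3, within the n / k bound.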

  33. STREAM System Implementation • Comprehensive DSMS query processor • Broad suite of operators and synopses • Sophisticated “developer’s workbench” interface • Submit queries in extended SQL or algebra • Submit or edit query plans in XML or GUI • Query plan execution visualizer • On-the-fly modification of memory allocation, scheduling policies, etc.

  34. Ongoing Work • Algebra for streams • Synopses and algorithmic issues • Memory management issues • Exploiting constraints on streams • Approximation in query processing • Distributed stream processing • System development

  36. Ongoing Work -- Constraints • Exploiting constraints on streams in query processing • Foreign-key joins, referential integrity, clustering, ordering • Need not be exact (e.g., k-clustered) • Reduce memory requirements • Unblock blocking operators
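As an illustration of how a constraint can unblock a blocking operator, suppose the Outgoing stream is exactly clustered by caller, meaning all records for one caller arrive contiguously. The per-caller total can then be emitted as a final answer the moment the caller changes, with constant state. The clustering assumption, the tuple layout, and the function name are hypothetical; the slide notes that real constraints need not be exact (e.g., k-clustered), which this sketch does not handle.

```python
from itertools import groupby
from operator import itemgetter


def totals_for_clustered_stream(stream):
    """Emit one final (caller, total_time) per caller.

    Assumes records are (caller, start, end) and that each caller's records
    are contiguous in the stream.
    """
    # groupby is lazy and only groups adjacent records with equal keys,
    # so it never buffers more than the current group's running sum.
    for caller, calls in groupby(stream, key=itemgetter(0)):
        yield (caller, sum(end - start for _, start, end in calls))


final_totals = list(totals_for_clustered_stream([
    ("alice", 0, 3),
    ("alice", 5, 9),
    ("bob", 1, 2),
]))
# [("alice", 7), ("bob", 1)]
```

Unlike an unconstrained running aggregate, every output here is final, so the result can be delivered as an append-only stream.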

  37. Ongoing Work -- Approximation in Query Processing • Understanding behavior of approximate operators when composed • Memory allocation to operators in a plan, given per-operator memory-accuracy curve • Best query plan, assuming best memory allocation • Multiple (weighted) queries sharing resources

  38. Related Work • Triggers, alerters, materialized views, continuous queries on conventional DBs, pub/sub, sequence & temporal databases, … • Telegraph project at UC Berkeley • Niagara project at Wisconsin/OGI • Amazon project at Cornell • Aurora project at Brown/MIT • And others

  39. For Papers and General Info • http://www-db.stanford.edu/stream • stanfordstreamdatamanager
