
Continuously Adaptive Continuous Queries (CACQ) over Streams

CACQ introduces continuous adaptivity to handle long-running continuous queries over data streams efficiently. It integrates dynamic operator ordering and enables sharing of work and storage. This presentation covers the background and motivation, the concepts of continuous queries and Eddies, and the CACQ contributions, with example-driven explanations and experimental results.

Presentation Transcript


  1. Continuously Adaptive Continuous Queries (CACQ) over Streams Samuel Madden, Mehul Shah, Joseph Hellerstein, and Vijayshankar Raman Presented by: Bhuvan Urgaonkar

  2. CACQ Introduction • Proposed continuous query (CQ) systems are based on static plans • But, CQs are long running • Initially valid assumptions less so over time • Static optimizers at their worst! • CACQ insight: apply continuous adaptivity of eddies to continuous queries • Dynamic operator ordering avoids static optimizer danger • Process multiple queries simultaneously • Interestingly, enables sharing of work & storage

  3. Outline • Background • Motivation • Continuous Queries • Eddies • CACQ • Contributions • Example driven explanation • Results & Experiments

  4. Outline • Background • Motivation • Continuous Queries • Eddies • CACQ • Contributions - Example driven explanation • Results & Experiments

  5. Motivating Applications • “Monitoring” queries look for recent events in data streams • Sensor data processing • Stock analysis • Router, web, or phone events • In CACQ, we confine our view to queries over ‘recent-history’ • Only tuples currently entering the system • Stored in in-memory data tables for time-windowed joins between streams

  6. Continuous Queries • Long running, “standing queries”, similar to trigger systems • Installed; continuously produce results until removed • Lots of queries, over the same data sources • Opportunity for work sharing! • Idea: adaptive heuristics

  7. Eddies & Adaptivity • Eddies (Avnur & Hellerstein, SIGMOD 2000): Continuous Adaptivity • No static ordering of operators • Policy dynamically orders operators on a per tuple basis • done and ready bits encode where tuple has been, where it can go

  8. Outline • Background • Motivation • Continuous Queries • Eddies • CACQ • Contributions - Example driven explanation • Results & Experiments

  9. CACQ Contributions • Adaptivity • Policies for continuous queries • Single eddy for multiple queries • Tuple Lineage • In addition to ready and done, encode each tuple's output history in its queriesCompleted bits • Enables flexible sharing of operators between queries • Grouped Filter • Efficiently compute selections over multiple queries • Join sharing through State Modules (SteMs)

  10. Explication By Example • First, example with just one query and only selections • Then, add multiple queries • Then, (briefly) discuss joins

  11. Eddies & CACQ: Single Query, Single Source • SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Use ready bits to track what to do next • All 1's for a single source • Use done bits to track what has been done • Tuple can be output when all done bits are set • Routing policy dynamically orders tuples • [Figure: an eddy routing tuples R1 and R2 from stream R through the selections σ(R.a > 10) and σ(R.b < 15), with per-tuple Ready and Done bits over attributes a and b]
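A minimal sketch (in Java, the language of the implementation described later) of the per-tuple routing loop for this single-query, single-source example. The class, the tuple layout, and the trivial "pick the first ready operator" choice are illustrative assumptions; a real eddy consults an adaptive routing policy instead.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the eddy loop for SELECT * FROM R WHERE R.a > 10 AND R.b < 15.
// Ready bits mark operators the tuple may still visit; done bits mark operators it has passed.
class SingleQueryEddy {
    record RTuple(int a, int b) {}

    static void process(RTuple t, List<Predicate<RTuple>> ops) {
        int n = ops.size();
        BitSet ready = new BitSet(n);
        BitSet done = new BitSet(n);
        ready.set(0, n);                        // single source: every operator starts out ready

        while (done.cardinality() < n) {
            int next = ready.nextSetBit(0);     // an adaptive policy would choose here
            ready.clear(next);
            done.set(next);
            if (!ops.get(next).test(t)) return; // tuple rejected by a selection: drop it
        }
        System.out.println("output: " + t);     // all done bits set: output the tuple
    }

    public static void main(String[] args) {
        List<Predicate<RTuple>> ops = List.of(t -> t.a() > 10, t -> t.b() < 15);
        process(new RTuple(12, 7), ops);        // passes both selections, so it is output
        process(new RTuple(5, 7), ops);         // rejected by R.a > 10
    }
}
```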

  12. Multiple Queries • Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Q2: SELECT * FROM R WHERE R.a > 20 AND R.b = 25 • Q3: SELECT * FROM R WHERE R.a = 0 AND R.b <> 50 • [Figure: a single eddy serving all three queries, with a grouped filter over R.a (R.a > 10, R.a > 20, R.a = 0) and a grouped filter over R.b (R.b < 15, R.b = 25, R.b <> 50); each R tuple carries Done bits per operator and QueriesCompleted bits for Q1, Q2, Q3]

  13. Multiple Queries (continued) • Same three queries and grouped filters as on the previous slide • Reorder operators! The routing policy can send a later tuple (R2) through the grouped filters in a different order than the first tuple • [Figure: tuple R2 routed through the R.a and R.b grouped filters in a different order, with its Done and QueriesCompleted bits updated along the way]

  14. Outputting Tuples • Query 1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15 • Query 2: SELECT * FROM R WHERE R.b < 15 AND R.c <> 5 AND R.d = 10 • Store a completionMask bitmap for each query, one bit per operator, set if that operator appears in the query (here, over operators on a, b, c, d: Q1 = 1100, Q2 = 0111) • Every time a tuple t returns from an operator, to determine whether t can be output to query q, the eddy ANDs q's completionMask with t's done bits; t is output only if the result equals the mask and q's bit is not yet set in t's queriesCompleted bits (e.g. for Q1: completionMask & Done == 1100 && QueriesCompleted[0] == 0) • [Figure: a tuple with attributes a, b, c, d, its Done bits, and its QueriesCompleted bits for Q1 and Q2]
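A hedged sketch of the output check this slide spells out. The BitSet encoding, the operator numbering, and the helper names are illustrative choices, not the paper's actual data structures.

```java
import java.util.BitSet;

// Sketch of the per-query output check: a tuple may be output to query q once every
// operator belonging to q has processed it and it has not already been output to q.
class OutputCheck {

    static boolean canOutput(BitSet completionMask, BitSet done, BitSet queriesCompleted, int q) {
        BitSet pending = (BitSet) completionMask.clone();
        pending.andNot(done);                        // operators of q the tuple has not yet passed
        return pending.isEmpty() && !queriesCompleted.get(q);
    }

    public static void main(String[] args) {
        // Operators: 0 = (R.a > 10), 1 = (R.b < 15), 2 = (R.c <> 5), 3 = (R.d = 10)
        BitSet q1Mask = BitSet.valueOf(new long[]{0b0011}); // Q1 uses operators 0 and 1
        BitSet q2Mask = BitSet.valueOf(new long[]{0b1110}); // Q2 uses operators 1, 2 and 3

        BitSet done = BitSet.valueOf(new long[]{0b0011});   // tuple has passed operators 0 and 1
        BitSet queriesCompleted = new BitSet();              // not yet output to any query

        System.out.println(canOutput(q1Mask, done, queriesCompleted, 0)); // true: Q1 satisfied
        System.out.println(canOutput(q2Mask, done, queriesCompleted, 1)); // false: 2 and 3 pending
    }
}
```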

  15. Grouped Filter • Use binary trees to efficiently index range predicates: two trees (less-than and greater-than) per attribute, with each query's constant inserted into the appropriate tree • When a tuple arrives, scan everything to the right (in the greater-than tree) or to the left (in the less-than tree) of the tuple's attribute value; those are the queries the tuple does not pass • Hash tables index equality and inequality predicates • [Figure: greater-than tree over S.a indexing the predicates S.a > 1 (Q1), S.a > 7 (Q2), and S.a > 11 (Q3)]
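A rough sketch of one half of a grouped filter, using a sorted map in place of the greater-than tree; the class and method names are illustrative. A full grouped filter would also keep a less-than tree and hash tables for equality and inequality predicates, as the slide notes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Grouped filter for greater-than predicates on a single attribute (e.g. S.a > c).
// Each query's constant is inserted once; every arriving tuple is checked against all
// registered predicates with one range scan instead of one comparison per query.
class GreaterThanGroupedFilter {
    // Maps a predicate constant c to the ids of queries asking "attribute > c".
    private final TreeMap<Double, List<Integer>> tree = new TreeMap<>();

    void addPredicate(double constant, int queryId) {
        tree.computeIfAbsent(constant, k -> new ArrayList<>()).add(queryId);
    }

    // Queries whose constant lies strictly below the tuple's value are satisfied;
    // everything at or above the value fails, matching the slide's description.
    List<Integer> matchingQueries(double tupleValue) {
        List<Integer> result = new ArrayList<>();
        tree.headMap(tupleValue, false).values().forEach(result::addAll);
        return result;
    }

    public static void main(String[] args) {
        GreaterThanGroupedFilter f = new GreaterThanGroupedFilter();
        f.addPredicate(1, 1);   // Q1: S.a > 1
        f.addPredicate(7, 2);   // Q2: S.a > 7
        f.addPredicate(11, 3);  // Q3: S.a > 11
        System.out.println(f.matchingQueries(8)); // [1, 2]: Q1 and Q2 pass, Q3 fails
    }
}
```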

  16. CACQ Adaptivity: Shared Subexpressions • Work sharing via tuple lineage • Q1: SELECT * FROM S WHERE A AND B AND C • Q2: SELECT * FROM S WHERE A AND B AND D • Conventional shared plans either constrain ordering (the shared subexpression AB must be applied first!) or duplicate work (tuples in the intersection of C and D go through AB an extra time!) • In CACQ, lineage (the queriesCompleted bits) enables any ordering while still sharing the work of A and B • [Figure: conventional query plans for Query 1 and Query 2 over data stream S versus the single adaptive CACQ plan]

  17. Tradeoff: Overhead vs. Shared Work • Overhead comes from the additional bits carried per tuple • Experiments studying performance and tuple size are in the paper; the bit per query per tuple is the most significant cost • Trading accounting overhead for work sharing: 100 bits per tuple allows a tuple to be processed once instead of 100 times • Reduce overhead by not keeping state about operators a tuple will never pass through

  18. Joins in CACQ • Use symmetric hash join to avoid blocking • Use State Modules (SteMs) to share storage between joins with a common base relation • Details of the effect on the implementation and the resulting benefit are in the paper • See Raman, UC Berkeley Ph.D. Thesis, 2002.
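A compact sketch of a symmetric (pipelined) hash join of the kind the slide refers to. The two in-memory hash tables correspond roughly to what SteMs store for the two base relations, but the class and its methods are illustrative rather than Telegraph's actual interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Symmetric hash join on a single key: each arriving tuple is inserted into its own
// side's table and immediately probes the other side's table, so results stream out
// without blocking on either input.
class SymmetricHashJoin {
    private final Map<Object, List<Object[]>> leftTable = new HashMap<>();
    private final Map<Object, List<Object[]>> rightTable = new HashMap<>();

    List<Object[]> onLeftTuple(Object key, Object[] tuple) {
        leftTable.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple); // build own side
        return probe(rightTable, key, tuple);                              // probe other side
    }

    List<Object[]> onRightTuple(Object key, Object[] tuple) {
        rightTable.computeIfAbsent(key, k -> new ArrayList<>()).add(tuple);
        return probe(leftTable, key, tuple);
    }

    private static List<Object[]> probe(Map<Object, List<Object[]>> other, Object key, Object[] tuple) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] match : other.getOrDefault(key, List.of())) {
            Object[] joined = new Object[tuple.length + match.length];     // concatenate the pair
            System.arraycopy(tuple, 0, joined, 0, tuple.length);
            System.arraycopy(match, 0, joined, tuple.length, match.length);
            out.add(joined);
        }
        return out;
    }
}
```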

  19. Routing Policies • The mechanism described so far provides correctness; the routing policy is responsible for performance • The policy is consulted to choose where to route every tuple that enters the system or returns from an operator • Basic ticket policy: give operators tickets for consuming tuples and take tickets away when they produce them; to choose the next operator, run a lottery, so more selective operators tend to be scheduled earlier • Modification for CACQ: give more tickets to operators shared by multiple queries (e.g. grouped filters), and charge a shared operator multiple tickets when it outputs a tuple • Intuition: cardinality-reducing shared operators reduce global work more than unshared operators • Not optimizing for the throughput of a single query!
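An illustrative sketch of the basic ticket-and-lottery policy described above. The starting ticket count, the floor of one ticket, and the uniform lottery draw are assumptions, and the CACQ-specific weighting of shared operators is omitted for brevity.

```java
import java.util.Arrays;
import java.util.Random;

// Ticket-based lottery routing: operators earn a ticket when they consume a tuple and
// lose one when they produce a tuple, so selective operators accumulate tickets and
// tend to win the lottery (i.e. get scheduled earlier).
class LotteryRoutingPolicy {
    private final long[] tickets;             // one ticket count per operator
    private final Random rng = new Random();

    LotteryRoutingPolicy(int numOperators) {
        tickets = new long[numOperators];
        Arrays.fill(tickets, 1);              // start every operator with at least one ticket
    }

    void onTupleRoutedTo(int op)   { tickets[op]++; }                               // consumed
    void onTupleReturnedBy(int op) { tickets[op] = Math.max(1, tickets[op] - 1); }  // produced

    // Run a lottery among the operators this tuple is still ready for.
    int choose(boolean[] ready) {
        long total = 0;
        for (int i = 0; i < tickets.length; i++) if (ready[i]) total += tickets[i];
        long draw = (long) (rng.nextDouble() * total);
        for (int i = 0; i < tickets.length; i++) {
            if (!ready[i]) continue;
            if (draw < tickets[i]) return i;
            draw -= tickets[i];
        }
        throw new IllegalStateException("no ready operator to route to");
    }
}
```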

  20. Outline • Background • Motivation • Continuous Queries • Eddies • CACQ • Contributions - Example driven explanation • Results & Experiments

  21. Evaluation • Real Java implementation on top of Telegraph QP • 4,000 new lines of code in 75,000 line codebase • Server Platform • Linux 2.4.10 • Pentium III 733, 756 MB RAM • Queries posed from separate workstation • Output suppressed • Lots of experiments in paper, just a few here

  22. Results: Routing Policy • All attributes uniformly distributed over [0,100] • [Figure: routing-policy results for Queries 1 through 5]

  23. CACQ vs. NiagaraCQ* • Performance is competitive on the workload from the NiagaraCQ paper • A different workload where CACQ outperforms NiagaraCQ: an expensive UDF and |result| > |stocks| • SELECT stocks.sym, articles.text FROM stocks, articles WHERE stocks.sym = articles.sym AND UDF(stocks) • [Figure: three query plans, each joining Stocks and Articles on sym with a selection applying the expensive UDF] • *See Chen et al., SIGMOD 2000, ICDE 2002

  24. CACQ vs. NiagaraCQ #2 • Three queries, each with its own expensive UDF (UDF1, UDF2, UDF3) over the join Stocks.sym = Articles.sym, with |result| > |stocks| • NiagaraCQ (Option #1 and Option #2): no shared subexpressions, so no shared work! • CACQ: lineage allows the join to be applied just once • [Figure: the two NiagaraCQ plan options versus the single shared CACQ plan]

  25. CACQ vs. NiagaraCQ • [Graph: performance comparison of CACQ and NiagaraCQ]

  26. Conclusion • CACQ: sharing and adaptivity for high performance monitoring queries over data streams • Features • Adaptivity • Adapt to changing query workload without costly multi-query reoptimization • Work sharing via tuple lineage • Without constraining the available plans • Computation sharing via grouped filter • Storage sharing via SteMs • Future Work • More sophisticated routing policies • Batching & query grouping • Better integration with historical results (Chandrasekaran, VLDB 2002)
