400 likes | 560 Views
Adaptive Query Processing in the Looking Glass. Shivnath Babu (Stanford Univ.) Pedro Bizarro (Univ. of Wisconsin, Madison). Adaptive Query Processing (AQP) Systems: Publication Timeline. STREAM. Tukwila. POP. Re-Opt. River. Eddies. Query Scrambling. CAPE. NiagaraCQ. Parametric opt.
E N D
Adaptive Query Processing in the Looking Glass Shivnath Babu (Stanford Univ.) Pedro Bizarro (Univ. of Wisconsin, Madison)
Adaptive Query Processing (AQP) Systems:Publication Timeline STREAM Tukwila POP Re-Opt River Eddies Query Scrambling CAPE NiagaraCQ Parametric opt. 1976 1977 … 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Pipeline sch. Ingres DEC-Rdb DQE Corrective processing RedBrick Conquest Expected cost opt. Memory adap. Introduction
Motivation • Plenty of recent work on Adaptive Query Processing (AQP) in different contexts • Conventional DBMS query processing, data integration, continuous queries in stream systems • No exhaustive, in-depth categorization and comparison of AQP systems to date • Difficult to answer questions like: • Will techniques from one system work on another? • What are the shortcomings of each system? • Which system is best for a new application domain? Introduction
Our Contributions • Detailed study of current AQP systems • Classification of AQP systems into 3 families • Comparison across families in terms of AQP tasks • Identification of shortcomings & new approaches to address them Introduction
Roadmap • Introduction to AQP • The three AQP system families • Comparison across families in terms of AQP tasks • Summary of what we learned
Uses stats to cost plans Chosen plan Catalog (table sizes, histograms) Executor: Runs chosen plan Runstats Statistics Tracker: Creates/updates stats Primer on Traditional Query Processing Query Optimizer: Chooses best plan Introduction
Need for Adaptive Query Processing Detect plan suboptimality, re-optimize Errors in stats estimates, optimizer mistakes Correlated & skewed data distributions Stats & system conditions may change while query is running Monitor for changes, re-optimize Continuous queries, long-running queries AQP is integral to the current CS-wide push towards autonomic computing Introduction
Our Focus: AQP for a Single Query • AQP System: • A system that interleaves the optimization and execution aspects of query processing, possibly multiple times, during the processing of a single query Introduction
Roadmap • Introduction to AQP • The three AQP system families • Comparison across families in terms of AQP tasks • Summary of what we learned
AQP System Families • Plan-based AQP systems • AQP for traditional plan-based DBMSs • Continuous-Query-based (CQ-based) AQP systems • AQP for long-running continuous queries over data streams • Routing-based AQP systems • AQP for DBMSs and continuous queries based on adaptive tuple routing AQP Families
Uses stats to cost plans + Extra operators Catalog (table sizes, histograms) Collected stats AQP in Plan-based Systems Query Optimizer: Chooses best plan Chosen plan Executor: Runs chosen plan Runstats Statistics Tracker: Creates/updates stats AQP Families
Uses stats to cost plans + Re-optimize Extra operators Catalog (Original + observed stats) Collected stats AQP in Plan-based Systems Query Optimizer: Chooses best plan Chosen plan Executor: Runs chosen plan Runstats Statistics Tracker: Creates/updates stats AQP Families
Example Plan-based AQP Systems STREAM Tukwila POP Re-Opt River Eddies Query Scrambling CAPE NiagaraCQ Parametric opt. 1976 1977 … 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Pipeline sch. Ingres DEC-Rdb DQE Corrective processing RedBrick Conquest Expected cost opt. Memory adap. AQP Families
Primer on Continuous Query Processing Chosen packets • Continuous Queries (CQs) are long-running queries usually over data streams • Example CQ: Filtering packet streams • Stream properties or system conditions may change while query is running best plan may change σ3 σ2 σ1 Packets AQP Families
Uses stats to cost plans Chosen plan Catalog (table sizes, histograms) Executor: Runs chosen plan AQP in CQ-based Systems Query Optimizer: Chooses best plan Statistics Tracker: Creates/updates stats Runstats AQP Families
Uses stats to cost plans Chosen plan Executor: Runs chosen plan AQP in CQ-based Systems Continuous Query Optimizer: Chooses best plan Catalog (stream rates, data distr.) Statistics Tracker: Monitors stream stats and system conditions AQP Families
Chosen plan Executor: Runs chosen plan AQP in CQ-based Systems Continuous Query Optimizer: Ensures that plan is best for current stats Uses stats to cost plans Catalog (stream rates, data distr.) Statistics Tracker: Monitors stream stats and system conditions AQP Families
Chosen plan Executor: Runs chosen plan Combined in-part for efficiency AQP in CQ-based Systems Continuous Query Optimizer: Ensures that plan is best for current stats Uses stats to cost plans Re-optimize Catalog (stream rates, data distr.) Stats to track Statistics Tracker: Monitors stream stats and system conditions AQP Families
Example CQ-based AQP Systems STREAM Tukwila POP Re-Opt River Eddies Query Scrambling CAPE NiagaraCQ Parametric opt. 1976 1977 … 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Pipeline sch. Ingres DEC-Rdb DQE Corrective processing RedBrick Conquest Expected cost opt. Memory adap. AQP Families
Chosen packets Chosen packets σ3 σ2 σ3 σ2 σ1 Tuple Router σ1 Packets Packets Using tuple routing Using a plan Primer on Routing-based Processing • Non-plan-based architecture where tuples are routed individually through operators • No optimizer • Exemplified by Eddies [AH00] AQP Families
Uses stats to cost plans Chosen plan Catalog (table sizes, histograms) Executor: Runs chosen plan Runstats Statistics Tracker: Creates/updates stats AQP in Routing-based Systems Query Optimizer: Chooses best plan AQP Families
Uses stats to choose efficient routes Selective routing of tuples Chosen plan In-memory catalog (operator costs, selectivities, etc.) Executor: Pool of operators Executor: Runs chosen plan AQP in Routing-based Systems Query or Continuous Query Tuple Router: Integrated Optimizer & Stats Tracker AQP Families
Example Routing-based AQP Systems STREAM Tukwila POP Re-Opt River Query Scrambling Eddies CAPE NiagaraCQ Parametric opt. 1976 1977 … 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Pipeline sch. Ingres DEC-Rdb DQE Corrective processing RedBrick Conquest Expected cost opt. Memory adap. AQP Families
Roadmap • Introduction to AQP • The three AQP system families • Comparison across families in terms of AQP tasks • Summary of what we learned
Comparison Across AQP System Families • Goal: To bring out AQP algorithms and features, not performance numbers • Models, assumptions, and approach • Techniques for tracking statistics • Re-optimization subtasks • When and how to re-optimize • Switching between plans • Pros & cons of using a conventional optimizer • Performance issues • Quality of re-optimization • Run-time overhead & thrashing • Scalability Comparison
Comparison Across AQP System Families • Goal: To bring out AQP algorithms and features, not performance numbers • Models, assumptions, and approach • Techniques for tracking statistics • Re-optimization subtasks • When and how to re-optimize • Switching between plans • Pros & cons of using a conventional optimizer • Performance issues • Quality of re-optimization • Run-time overhead & thrashing • Scalability Comparison
Techniques for Tracking Statistics • Observation • Mostly in Plan-based systems • Competition • Mostly in Plan-based systems • Profiling • Mostly in CQ-based systems • Exploration • In Routing-based systems Comparison
Chosen packets σ3 σ2 Selectivity of s1 on input stream can be observed here σ1 Packets Tracking Statistics: Observation [KD98] • Collect statistics on operator behavior or intermediate subexpressions in a plan Comparison
Chosen packets σ3 σ2 Selectivity of s1 on input stream Selectivity of s2 on input stream σ1 σ2 Packets Tracking Statistics: Competition [A93] • Extra processing to collect statistics Comparison
Tracking Statistics: Profiling [BMM+04] • Extra processing on a fraction of the input tuples (e.g., a random sample) to collect statistics • Builds a “statistical profile” that can be used to estimate many individual statistics σ3 σ2 σ1 Profiled tuples Comparison
Tracking Statistics: Exploration [AH00] Chosen packets • A fraction of tuples are routed along routes different from the current best route to track statistics along those routes • No redundant processing σ2 σ3 σ1 Tuple Router Packets Comparison
Comparing Statistics-Tracking Techniques: Extra Overhead Introduced • Observation • Exploration (inefficient routes for some tuples) Increasing overhead • Profiling (extra processing on some tuples) • Competition (lots of extra work) Comparison
Comparing Statistics-Tracking Techniques: Coverage of Different Statistics • Observation & Competition (limited by plan) Increasing coverage • Exploration (limited by large number of routes) • Profiling (highest since it builds statistics profile) Comparison
Comparing Statistics-Tracking Techniques: Accuracy of Estimation Increasing accuracy • Exploration (but, susceptible to routing bias) • Profiling (depends on sampling fraction) • Observation & Competition Comparison
Roadmap • Introduction to AQP • The three AQP system families • Comparison across families in terms of AQP tasks • Summary of what we learned
⋈ s INLJ S Unclustered index R Example Query: sp1and p2 (R) S ⋈ What have we learned? (1) • Many similarities in internals of different AQP families • Can re-use many current (and new) AQP techniques across families • Ex: Profiling from CQ-based systems • Enables, e.g., faster detection of plan suboptimality in Plan-based systems • Generates more accurate statistics at lower cost in Routing-based systems New Ideas
⋈ ⋈ Example Query: sp1and p2 (R) S ⋈ s s Hash Join INLJ S INLJ S Cost Unclustered index R Hash Join R |σ(R)| What have we learned? (2) • Current AQP systems are reactive • E.g., do not consider sensitivity to errors/changes in stats Proactive Re-optimization New Ideas
What have we learned? (3) • Challenging meta problems in AQP for continuous queries need to be addressed • Larger and more complex plan spaces higher costs for statistics tracking and re-optimization • Tracking “Return-of-Investment” on AQP • Avoiding thrashing, e.g., on bursty changes in statistics Proposal: Plan Logging for Continuous Queries New Ideas
P1 P2 (R,S) Rate(R) Plan Logging for Continuous Queries • Log the statistics and re-optimization history • Query is long-running • Example view over log for R S T ⋈ ⋈ time Plans lying in a high-dimensional space of statistics New Ideas
Summary • AQP is becoming important: • New data and application trends • CS-wide push towards Autonomic Computing • Significant amount of work on AQP in recent years • Our contributions: • In-depth categorization and comparison of AQP systems and techniques • Identified current shortcomings and new approaches to AQP Conclusions