320 likes | 460 Views
Semantic Query Optimization Techniques. November 16, 2005. By : Mladen Kovacevic. Background. 1980's, semantic information stored in dbs as integrity constraints could be used for query optimization semantic : “of or relating to meaning or the study of meaning”
E N D
Semantic Query Optimization Techniques November 16, 2005 By : Mladen Kovacevic
Background • 1980's, semantic information stored in dbs as integrity constraints could be used for query optimization • semantic: “of or relating to meaning or the study of meaning” (http://wordnet.princeton.edu) • integrity : preserve data consistency when changes made in db. • no extensive implementation existing today (1999)
Introduction • Key factor in relational database system’s improvement in query execution time, is query optimization. • Query execution can be improved by: • Analyzing integrity information, and rewriting queries exploiting this information (JE & PI) • Avoid expensive sorting costs (Order Optimization) • Exploiting uniqueness by knowing rows will be unique, thus, avoiding extra sorts. (EU)
Presentation Overview • Semantic Query Optimization techniques • Join Elimination (JE) • Predicate Introduction (PI) • Order Optimization (OO) • Exploiting Uniqueness (EU)
Some Motivation • Describing two techniques in SQO, demonstrated in DB2 UDB. • Predicate Introduction • Join Elimination • Reasons: • rewriting queries by hand showed that these two provided consistent optimization. • practical to implement • extendible to other DBMS’s. • Data sets used : TPC-D and APB-1 OLAP benchmarks onlyREFERENTIAL INTEGRITY constraints and CHECK CONSTRAINTS used!
Semantic Query Optimization (SQO) Techniques • Join Elimination: Some joins need NOT be evaluated since the result may be known apriori (more on this later) • Join Introduction: Adding a join can help if relation is small compared to original relations and highly selective. • Predicate Elimination : If predicate known to be always true, can be eliminated from query (DISTINCT clause on Primary Key – Uniqueness exploitation!) • Predicate Introduction: New predicates on indexed attributes can result in a much faster access plan. • Detecting the Empty Answer Set : If query predicates inconsistent with integrity constraints, the query does not have answer.
Why SQO implementations not used? • Deductive Databases : Many cases SQO techniques were designed for deductive databases, thus not appearing to be useful in relational database context. • CPU & I/O Speeds similar : When being developed, CPU & I/O speeds were not as dramatically different • (savings in I/O not worth the CPU time added) • Lack of Integrity Constraints : Thought that many integrity constraints are needed for SQO to be useful
Two-stage Optimizer • Examples of SQO techniques always designed for a two-stage optimizer • Stage 1 : logically equivalent queries created (DB2’s query rewrite optimization) • Stage 2 : generate plans of all these queries, choosing the one with lowest estimated cost. (DB2’s query plan optimization) • Join order, join methods, join site in a distributed database, method for accessing input table, etc.
Join Elimination • Simple : Eliminate relation where join is over tables related through referential integrity constraint, and primary key table referenced only in the join VIEW DEFINITION CREATE VIEW Supplier_Info (n, a, c) as SELECT s_name, s_address, n_name FROM tpcd.supplier, tpcd.nation WHERE s_nationkey = n_nationkey QUERY SELECT s_n, s_a FROM Supplier_Info
Join Elimination (con’t) • Query can be rewritten internally as: SELECT s_n, s_a FROM tpcd.supplier • Why do such a simple rewrite? • User may not have access to the supplier table, • and/or may only know about the view. • Sometimes GUI managers create these “dumb” • queries so need to optimize • Non-programmers write queries often, and may • not even think about this. • Algorithm for generic redundant join removal • provided in paper.
Example – Join Elimination SELECT p_name, p_retailprice, s_name, s_address FROM tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier WHERE p_partkey = ps_partkey and s_suppkey = ps_suppkey and ps_partkey = l_partkey and ps_suppkey = l_suppkey and l_shipdate between '1994-01-01' and '1996-06-30' and l_discount >= 0.1 GROUP BY p_name, p_retailprice, s_name, s_address ORDER BY p_name, s_name PART PARTKEY PARTSUPP LINEITEM PARTKEY PARTKEY SUPPKEY SUPPKEY SUPPLIER SUPPKEY 1 – many relationship
Example : Join Elimination • Any immediate improvements that can be seen here? p_partkey = ps_partkey and s_suppkey = ps_suppkey and ps_partkey = l_partkey and ps_suppkey = l_suppkey PS_PARTKEY = L_PARTKEY P_PARTKEY PS_PARTKEY L_PARTKEY PS_PARTKEY = L_PARTKEY P_PARTKEY = PS_PARTKEY S_SUPPKEY PS_SUPPKEY L_SUPPKEY PS_SUPPKEY = L_SUPPKEY S_SUPPKEY = PS_SUPPKEY S_SUPPKEY = L_SUPPKEY
Results • 100 MB db size • Execution Time : 58.5 sec -> 38.25 sec (35 % improvement) • I/O Cost: 4631 -> 1498 page reads (67 % improvement)
Results – OLAP Environment • In OLAP (online analytical processing) servers, using a star schema (one fact table, with several dimension tables) improvements ranged from 2% to 96 %. • In these cases, much improvement came from CPU cost instead of I/O, because dimension tables were small enough to fit into memory...
Predicate Introduction • Techniques discussed : • Index Introduction : add new predicate on attribute if index exists on that attribute. • Assumption : index retrieval is better than table scan, is this always good? • Scan Reduction : reduce number of tuples that qualify for a join. • Problem : Not very common; unlikely that there will be any check constraints or predicates with inequalities about join columns • Detecting empty query answer set (not shown as query execution time essentially 0)
Example - Predicate Introduction SELECT sum(l_extendedprice * l_discount) as revenue FROM tpcd.lineitem WHERE l_shipdate >= date(‘1994-01-01’) and l_shipdate < date(‘1994-01-01’)+ 1 year and l_discount between .06 – 0.01 and .06 + 0.01 and l_quantity < 24; Check Constraint : l_shipdate <= l_receiptdate Index : l_receiptdate • Maintaining semantics, we can add : • l_receiptdate >= date(‘1994-01-01’)
Example - Predicate Introduction SELECT sum(l_extendedprice * l_discount) as revenue FROM tpcd.lineitem WHERE l_shipdate >= date(‘1994-01-01’) and l_shipdate < date(‘1994-01-01’)+ 1 year and l_receiptdate >= date(‘1994-01-01’) and l_discount between .06 – 0.01 and .06 + 0.01 and l_quantity < 24; Check Constraint : l_shipdate <= l_receiptdate Index : l_receiptdate • Maintaining semantics, we can add : • l_receiptdate >= date(‘1994-01-01’) • Why would we want to do this? In order to have optimizer choose a plan using the index. Is this always good? • NO! What if most of the rows in the table need to be returned? We should use a tablescan instead.
Predicate Introduction - Algorithm • Input : set of all check constraints defined for a database and the set of all predicates in query • Output: set of all non-redundant formulas derivable from the input set. This answer set can then be added to the query, but only a few are potentially useful. • The goal in the paper was to choose additions that would guarantee improvement. • Conditions in paper: Conservative approach of introducing predicates that will have the plan optimizer use an index. Insist on only one index available with the query predicate.
Why Longer Execution for P3/P5? • P2 and P3 are the same except for the following P2 : SELECT ... FROM ... WHERE l_shipdate >= date ('1998-09-01') and l_shipdate < date ('1998-09-01') + 1 month P3 : SELECT ... FROM ... WHERE l_shipdate >= date ('1995-09-01') and l_shipdate < date ('1995-09-01') + 1 month • Difference in table shows that P2 has 2 % of the tuples falling in the range while P3 has 48 % of the tuples fall in the category : BOTH plans will choose index scan! P3 is so large that tablescan is better in this case. • Cost model underestimates cost of locking/unlocking index pages • Estimated number of tuples goes down because of the reduction factor problem (multiply in the new predicate added)
Adjustments for Reduction Factor Problem • Add new predicate only when it contains a major column of an index and a scan of that index is sufficient to answer the query (thus, no table scan necessary) • Original Index : <receiptdate, discount, quantity, extendedprice> • New Index : <receiptdate, discount, quantity, extendedprice, shipdate, partkey, suppkey, orderkey>
Order Optimization Techniques • Access plan strategies exploit the physical orderings provided either by indexes or sorting • GOAL: optimize the sorting strategy • Techniques • Pushing down sorts in joins • Minimizing the number of sorting columns • Detecting when sorting can be avoided because of predicates, keys or indexes • Order Optimization : detecting when indexes provide an interesting order, so that sorting can be either avoided, and used as sparingly as possible. • Interesting Orders : when the side effect of a join produces rows in sorted order, which can be taken advantage of later (if another join needed, ORDER BY, GROUP BY, DISTINCT)
Fundamental Operators • Order optimization requires the following operations • Reduce Order • Test Order • Cover Order • Homogenize Order
Exploiting Uniqueness • Checking to see if query contains unnecessary DISTINCT clauses • How does this make improvements? • Removing duplicates is performed by SORTING, a costly operation. • Example is removing DISTINCT keyword from query if it is applied onto the primary key itself (since primary keys are, by definition, distinct)
How to exploit uniqueness? • Using knowledge about: • Keys • Table Constraints • Query Predicates • Cannot always be tested efficiently, so we look for a sufficient solution.
Summary • Important Outcome : experimental evidence showing SQO can provide effective enhancement to the traditional query optimization. • Join Elimination : geared towards OLAP environment (where very useful) • Independent on existence of complex integrity constraint – semantic reasoning used about referential integrity constraints • Easy to implement and execute • Predicate Introduction : guaranteeing improvements more difficult, needing rather severe restrictions imposed (limits the applicability of this approach) • Order Optimization : utilizing functional dependencies and table information, we use it in creating a “smart” access plan, avoiding or optimizing sort operations. • Exploiting Uniqueness : uniqueness is powerful when it reduces the number of expensive sorts. Discovering true ways of exploiting this technique are quite tricky and specific.
References • Qi Cheng, Jarek Gryz, Fred Koo, et al: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. Proceedings of the 25th VLDB Conference, Edinburg, Scotland,1999. • David E. Simmen, Eugene J. Shekita, Timothy Malkemus: FundamentalTechniques for Order Optimization. SIGMOD Conference 1996: 57-67 • G. N. Paulley, Per-ke Larson: Exploiting Uniqueness in Query Optimization. ICDE 1994: 68-79