490 likes | 590 Views
Data Mining for Query Optimization. Jarek Gryz. Outline. Semantic Query Optimization Soft Constraints Query Optimization via Soft Constraints Selectivity Estimation via Soft Constraints. Use integrity constraints associated with a database to rewrite
E N D
Data Mining for Query Optimization Jarek Gryz
Outline • Semantic Query Optimization • Soft Constraints • Query Optimization via Soft Constraints • Selectivity Estimation via Soft Constraints
Use integrity constraints associated with a database to rewrite a query into a form that may be evaluated more efficiently Semantic Query Optimization Some Techniques: Join Elimination Predicate Elimination Join Introduction Predicate Introduction Detecting an Empty Answer Set
Few (if any!) Commercial implementations of SQO Early Experiences: • Could not spend too much time on optimization • Few integrity constraints are ever defined • Association with deductive databases
Join elimination: example • select p_name, p_retailprice, s_name, s_address • from tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier • where p_partkey = ps_partkey and s_suppkey = ps_suppkey and • ps_partkey = l_partkey and ps_suppkey = l_suppkey; RI constraints: part-partsupp (on partkey) supplier-partsupp (on partkey) partsupp-lineitem (on partkey and suppkey) select p_name, p_retailprice, s_name, s_address from tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier where p_partkey = l_partkey and s_suppkey = l_suppkey;
Algorithm for join elimination • 1. Derive column transitivity classes from the join predicates in the query • 2. Divide the relations in the query that are related through RI constraints into removable and non-removable • 3. Eliminate all removable relations from the query • 4. Add is not null predicate to foreign key columns of all tables whose RI parents were removed
S.S PS.S PS.S O.C C.C O.C O.C Algorithm for join elimination: example S.S PS.S C.C
Predicate Introduction: Example • select sum(l_extendedprice * l_discount) as revenue • from tpcd.lineitem • where shipdate >date('1994-01-01'); Check constraint: receiptdate >= shipdate Clustered Index on receiptdate select sum(l_extendedprice * l_discount) as revenue from tpcd.lineitem where shipdate >date('1994-01-01') and receiptdate >= date('1994-01-01');
N - set of predicates derivable from the query and check constraints • If N is inconsistent, stop. • Else, for each predicate A op B in N, add it to the query if: • A or B is a join column • B is a major column of an index • no other index on B’s table can be used in the plan for the original query Algorithm for Predicate Introduction
Queries • select 100.00 * sum • (case • when p_type like 'PROMO%' • then l_extendedprice * (1 - l_discount) • else 0 • end) • / sum(l_extendedprice * (1 - l_discount)) as promo_revenue • from tpcd.lineitem, tpcd.part • where l_partkey = p_partkey and • l_shipdate >= date('1998-09-01') and • l_shipdate < date('1998-09-01') + 1 month; Given the check constraint l_receiptdate >= l_shipdate we may add a new predicate to the query: l_receiptdate >= date(‘1998-09-01’)
The Culprit New query plan uses an index, but the original table scan is still better! Why did this happen: • incorrect estimate of the filter factor • underestimation of the CPU cost of locking index pages
Soft Constraints • Traditional (“hard”) integrity constraints are defined to prevent incorrect updates. A soft constraint is a statement that is true about the current state of the database, but does not verify updates. In fact, a soft constraint can be invalidated by an update.
Soft Constraints (cont.) • Absolute soft constraints – no violation in the current state of the database • Absolute soft constraints can be used for optimization in exactly the same way traditional constraints are. Statistical soft constraints – can have some (small) degree of violation • Statistical soft constraints can be used for improved selectivity estimation
Implementation of Soft Constraints • In Oracle the standard integrity constraints are marked with a rely option, so that they are not verified on updates. • In DB2 soft constraints are called informational constraints.
Informational Check Constraint • Example 1: Create an employee table where a minimum salary of $25,000 is guaranteed by the application • CREATE TABLE emp(empno INTEGER NOT NULL PRIMARY KEY, • name VARCHAR(20), • firstname VARCHAR(20), • salary INTEGER CONSTRAINT minsalary • CHECK (salary >= 25000) • NOT ENFORCED • ENABLE QUERY OPTIMIZATION);
Enforcing Validation • Example 2: Alter the employee table to start enforcing the minimum wage of $25,000 using DB2. DB2 will also verify existing data right away. • ALTER TABLE emp ALTER CONSTRAINT minsalary ENFORCED
Informational RI Constraint • Example 3: Create a department table where the application ensures the existence of departments to which the employees belong. • CREATE TABLE dept(deptno INTEGER NOT NULL PRIMARY KEY, • deptName VARCHAR(20), • budget INTEGER); • ALTER TABLE emp ADD COLUMN dept INTEGER NOT NULL • CONSTRAINT dept_exist • REFERENCES dept • NOT ENFORCED • ENABLE QUERY OPTIMIZATION);
select Model from Tickets T, Registration R where T.RegNum = R.RegNum and T.date > “1990-01-01” and R.Model LIKE “BMW Z3%” Example First BMW Z3 series cars were made in 1997. select Model from Tickets T, Registration R where T.RegNum = R.RegNum and T.date > “1997-01-01” and R.Model LIKE “BMW Z3%”
Matrix representation of empty joins A,B(R S)
Jarek Gryz: Staircase data structure
Properties of the algorithm • Time Complexity O(nm) • requires a single scan of the sorted data • Space Complexity O(min(n,m)) • only two rows of the matrix need be kept in memory • Scalable with respect to: • number of tuples in the join result • number of discovered empty rectangles • size of the domain of one of the attributes
How many empty rectangles are there? Tests done on 4 pairs of attributes with numerical domain present in typical joins in a real-world workload of a health insurance company.
Query rewrite: simple case select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and... select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<60 and...
Query rewrite: complex case select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and... select … from R, S,... where R.C=S.C and (… and …) or (… and …) or (… and …) or ...
Query optimization experiments real-world workload of 26 queries 5 of the queries “qualified” for the rewrite only simple rewrites were considered all rewrites led to improved performance
Query Cardinality Estimate via Empty Joins (SIEQE) • Cardinality estimates crucial for designing good query evaluation plans • Uniform data distribution (UDA): standard assumption in database systems • Histograms effective in single dimensions: too expensive to build and maintain otherwise
The Strategy • With UDA, the “density”: 1 tuple/sq unit • Empty joins cover 20% of the area • Adjusted density: 1.25 tuples/sq unit Q1 Q2
ExperimentsNumber of queries for which the error is less than a given limit
Discovery of Check Constraints and Their Application in DB2 We discover two types of (rules) check constraints: • correlations between attributes over ordered domains • partitioning of attributes
Correlations between attributes over ordered domains Rules have the form: Y = bX + a + [emin, emax] Algorithm for all tables in the database for all comparable variable pairs (X and Y) in the table apply OLS estimation to get the function of the form: Y = a + bX calculate the max and min error (or residual) emax and emin endfor endfor
Partitioning • Rules have the form: If X = a, then Y [emin, emax] • Algorithm • for all tables in the database • for any qualifying variable pair (X and Y) in the table • calculate partitions by using GROUP BY X statements • find the max and min value of Y for each partition • endfor • endfor
Experiments in TPC-H TPC-H contains the following check constraint: L_RECEIPTDATE > L_SHIPDATE Our algorithm discovered the following rule: L_RECEIPTDATE = L_SHIPDATE + (1, 30), m = 0.0114. • Rules discovered through partitioning: • If L_LINESTATUS=F, then L_SHIPDATE=(01/04/1992, 06/17/1995), m = 0.50 • If L_LINESTATUS=O, then L_SHIPDATE=(06/19/1995, 12/25/1998), m = 0.50
Applications • DBA Wizard • Semantic Query Optimization • Improved Filter Factor Estimates
Example Consider a query issued against a hotel database, that requests the number of guests staying in the hotel on a given date. • ARRIVAL DATE <= ‘1999-06-15’ AND DEPARTURE_DATE >= ‘1999-06-15’ • Thefilter factor estimate for the query would be: • ff = ff1 * ff2 If ‘1999-06-15’ was approximately midway in the date ranges, we would estimate a quarter of all the guests that came in over the number of years would be in the answer of the query!
Example (cont.) • Assume that the following check constraint was discovered: • DEPARTURE_DATE >= ARRIVAL_DATE + (1 DAY, 5 DAYS) • The original condition in the query predicate can then be changed to: • ARRIVAL_DATE <= ‘1999-06-15’ AND ARRIVAL_DATE >= ‘1999-06-18’ • or • ARRIVAL_DATE BETWEEN ‘1999-06-15’ AND ‘1999-06-18’ The filter factor is now estimated to: ff = (ff1 + ff2 –1)
Other Research on the Use of Soft Constraints in Query Optimization
Query-driven Approach • Built multidimensional histograms based on query results (Microsoft) • Improve cardinality estimates by looking at the intermediate query results (IBM) • Both techniques generate statistical soft constraints
Data-driven Approach • Lots of methods using Bayesian networks to infer statistical soft constraint • Lots of methods to discover functional dependencies in data (absolute soft constraints) • Most recently, BHUNT and CORDS use sampling to discover soft constraints (IBM)
References • Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu, X. Qian, B. Schiefer: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. VLDB 1999. • J. Edmonds, J. Gryz, D. Liang, R. Miller: Mining for Empty Rectangles in Large Data Sets. ICDT 2001. • J. Gryz, B. Schiefer, J. Zheng, C. Zuzarte: Discovery and Application of Check Constraints in DB2. ICDE 2001. • P. Godfrey, J. Gryz, C. Zuzarte: Exploiting Constraint-Like Data Characterizations in Query Optimization. SIGMOD 2001. • J. Gryz, D. Liang: Query Optimization via Empty Joins. DEXA 2002. • J. Gryz, D. Liang: Query Cardinality Estimation via Data Mining. IIS 2004.