330 likes | 345 Views
This paper explores a similarity-based optimization technique for complex queries, utilizing the structural characteristics of sub-queries to improve query execution plans. Experimental results, advantages, and limitations are discussed.
E N D
COSC6421 Advanced Database Systems Instructor Jarek Gryz Exploiting Similarity of Subqueries for Complex Query Optimization[1] Presented by Qiong Wang York University
Outline • Introduction-Why exploiting similarity • Description of Technique • Experimental Result • Advantages and Limitation Discussion • Related work • Summary
Background Introduction • User queries become more and more complex, such as increasing complexity of the operations and the query structure. • Query optimizers are seeking an efficient query execution plan( QEP) in DBMS. But, there are some limitations • Limitations of two types of current algorithms determining a optimal join orders • Dynamic programming -> Optimal Solution but long optimization time • Heuristics Based Algorithm-> Good time complexity but suboptimal solution • Limitations of some other studies which can find a good plan • May not make use of some special characteristics of the underlying queries • Only exploiting the common sub-queries in a query
Technique Introduction • A new similarity-based optimization technique. • Takes the structural characteristics of a complex query into consideration • The key idea • Identify groups of similar sub-queries that often appear in a complex query • Share the optimization result within each group in the query
Technique Description • Preliminary • Exploiting Similarity of Sub-Query • Optimizing Query
Preliminary A query Q • T={R1,R2,R3,……,Rm} The set of Base Tables referenced in Q • P ={p1,p2,p3,……pn} The set of predicates referenced in Q • Table Instance - each table reference in Q
Preliminary (1) A Query Graph G • Vthe set of all vertices inG <=Table Instances For x ∈ V, R ∈ T,mappingδ(x) = R • E the set of all edges in G <= Predicates For e ∈ E, c ∈ 2P , e is labeled with ϕ(e) = c • If there is at least one predicate inP involving vertices x and y, then query graph G has an edge e between x and y • The set of all predicates involving x and y labeled on the edge e
Constructing Query graph e (w, w) e (x,y) x e (y,w) e (y,z) y w z An example query graph G (V,E,T,P,δφ) • Sizeof(x) Size of the table represented by x • Sel (e) Selectivity of ϕ(e), i.e., the selectivity of the conjunction of all the simple predicates in ϕ(e)
Preliminary • A data structure which represents a query graph • Every vertex x in query graph G has a node x • Node x has a set of adjacent nodes y1, y2, ..., yn • A ring : x → y1 → y2 → ... → yn → x • to represent such an adjacency relationship. • Node x is called the owner (node) of the ring, • Nodes y1, y2, ..., yn are called the members of the ring. (2) Ring NetWork
Preliminary Suppose we have two sub-query graphs : G’(V’ ,E|V’ , T|V’ , P|V’ , δ|V’ , ϕ|V’ ) G”(V”, E|V”, T|V”, P|V”, δ|V”, ϕ|V”) Rt Error bound for table size Rs Error bound for condition selectivities (3) Similar Sub-query Graphes
Constructing Query graph If G’ and G” satisfy the following conditions, they are regarded as a pair of similar sub-query graphs with respect to error bounds rt and rs, G’ ≈ |(rt,rs)G” Similar Sub-query Graphes (cont’d)
Exploiting Similarity of Sub-Queries Algorithm Design Idea • Identify pairs of similar sub-queries with respect to the given error bounds rt and rs • Optimize one sub-query in a similar pair, and map and apply the resulting execution plan to the other sub-query • Replace similar sub-queries with their (estimated) result tables in the query graph and optimize the resulting modified query.
R2(10K) R6(40K) R3(25K) R2(10K) 10 8 1 5 9 11 6 4 3 2 7 12 R3(25K) R1(80K) R2(10K) R7(70K) R5(60K) R4(12K) R8(20K) Exploiting Similarity of Sub-Queries Algorithm Explanation Assume: all self-loops are removed 1. Choosing Starting Nodes: (1) Construct ring network and similarity starting lists
Exploiting Similarity of Sub-Queries Similarity Starting List OL2 Each list is for one base table and has a header containing : -Base Table Name Ri, -Two sublists OLi –all its instances (nodes) SLi --other table instances whose sizes are within error bound rt with respect to the size of Ri SL2 Base Table R2
Choosing Starting Nodes (cont’d) (2) In order to find a pair of as large as possible similar subquery graphs, we should choose a pair of nodes with the maximum number of adjacent node pairs. • Because the larger the similar subquery graphs in a pair, the more the optimization work can be shared.
How to Choose Starting Nodes For any pair of potential starting nodes x and y selected from a similarity starting list, we have a formula: *m be the size of set T for query Q choosing a pair of starting nodes is to choose the pair that maximizes the value of formula (1).
How to Choose Starting Nodes Two indicator arrays O (occurrence) and S (similarity) For each table instance The lengths of arrays O and S are the size of T • Ox[i] indicates current adjacent nodes representing base table Ri • Sx[i] indicates current adjacent nodes whose table sizes are within the given error tolerance with respect to Ri excluding Ox[i].
Searching for Similar Sub-query Graphs 2. For all the unselected nodes (x1 and y1) in the rings of x, y • Check if sizeof(x1) and sizeof(y1) are within error bound rt • Check if adding x1 and y1 into the current sub-query pair graph will violate the similarity of sub-queries or not • Add x1, y1 into sub-query pair graphs
Searching for Similar Sub-query Graphs (Cont’d) 3. Use two threshold values c1 and c2 to determine whether to accept, reject, or hold a pair of similar sub-query graphs. • Let n be the number of nodes in a similar subquery graph • (1) if n ≥ c2, then the new pair of similar subquery graphs is accepted; • (2) if c1 ≤ n < c2, then we put this pair on hold; • (3) if n < c1, then the new pair is rejected.
4. Remove all the nodes in the accepted graphs and choose another starting nodes and repeated the procedure until no pair of nodes can be expanded.
5. Optimizing Query • Apply an optimization algorithm to optimize one of the sub-query graphs in each pair. • Map the execution plan for one sub-query to the one for its partner • Example, G1 and G2 Nodes: x1 ↔ y1, x2 ↔ y2, x3 ↔ y3, By optimizing G1, we get such a plan: ((x1×x2) × x3). The mapped plan for G2 is ((y1×y2) ×y3).
Experimental Result Comparison of I/O costs for execution plans generated by two techniques For a set of randomly-generated test queries.
Effect of changing error bounds on I/O costs • Very small error bounds cannot yield good performance, since the smaller the error bounds, the smaller the similar sub-queries; • Moderate error bounds (0.15 - 0.40) yield the best performance, since similar sub-queries with reasonable sizes can be found; • Large error bounds lead to poor performance, since sub-queries are less similar in such cases, which makes that sharing execution plans between them may not be appropriate; • Very large error bound (close to 1), the performance of the similarity-based technique stays at the same level. • The sizes of similar sub queries may reach their maximums • when the error bounds are beyond a certain limit.
Advantages and Limitation Discussion • Advantages: • Take the features of original structures of complex query into consideration and design a query graph • Exploiting the similarity of Sub-queries beyond common sub-query, reduce the optimization time by sharing the same mapped plan • Reduce the query complexity before apply any optimization algorithm. • Limitation: • According to the experimental result, the selectivity of similarity sub-queries depends on the error bound. • 2. The paper didn’t mention the application of this technique in multiple queries.
Related Work (1) The author also introduced a technique to optimize a complex query by exploiting common sub-queries recently. This technique is shown to be more effective than a pure randomization based method since it takes the structural characteristics of a complex query into consideration. (2) Some other Algorithms used to optimize large query: Iterative improvement (II) Simulated Annealing (SA) Tabu Search (TS) AB algorithm (AB) Generic Algorithm (GA)
Related Work (2) Related topic-Multiple-Query • A single query given to a system may result in multiple queries • Queries are given to the system from various users. Then batching all users’ requests is possibly required. • Some parts of the query are automatically generated by programs
Multiple Query Optimization • Major issue in multiple-query processing is the redundancy due to accessing the same data multiple times in different queries.[2] • Recognize the possibilities of sharing • Exploiting common sub-expressions and reduce evaluation cost • Reusing Materialized or intermediate results from other computation • Example: Volcano-SH and RU [3] • Modify the optimizer search strategy and find a globally optimal plan. Example: IE (Interleave Execution) and HA (Heuristic Algorithm) [2]
Summary • Described a new query optimization technique for complex query, “exploiting similarity of Sub-queries” • Algorithm Design Idea • Query Graph, Ring Network, Similar Sub-query Graph • Discussed experimental result, advantages and limitation • Related work Discussion
Reference [1] Yingying Tao, Qiang Zhu and Calisto Zuzarte. Exploiting Similarity of Subqueries for Complex Query Optimization, LNCS 2736, Jan 2003, Pages 747 - 759 [2] TIMOS K. SELLIS and C. Lin. Multiple Query Optimization, ACM TODS13(1) p23-52 (1988) [3] Prasan Roy, S. Seshadri, S. Sudarshan, Siddhesh Bhobe. Efficient and Extensible Algorithms for Multi Query Optimization, ACM 2000 1-58113-218-2/00/05 [4] Jamal R.Alsabbagh, Vijay V. Raghavan. Analysis of Common Subexpression Exploitation Models in Multiple-Query Processing