310 likes | 379 Views
Reducing Order Enforcement Cost in Complex Query Plans. Ravindra Guravannavar and S. Sudarshan (To appear in ICDE 2007). Background. Sort-based query processing algorithms Sort-merge Join (also Union/Intersection) Sort-based grouping and duplicate elimination Explicit “order by”
E N D
Reducing Order Enforcement Cost in Complex Query Plans Ravindra Guravannavar and S. Sudarshan (To appear in ICDE 2007)
Background • Sort-based query processing algorithms • Sort-merge Join (also Union/Intersection) • Sort-based grouping and duplicate elimination • Explicit “order by” • Notion of “Interesting Sort Orders” (System-R) • Find and remember the best plan for each sort order that may be useful • Optimization goal in Volcano : (expr, sort-order)
Group By {a2,a4,a5,…} R.a1=S.a1 and R.a2=S.a2 … R.an=S.an G S R The Problem • Interesting orders can be too many! • Factorial in number of attributes involved • Plan cost can vary substantially with the choice of interesting order • Clustering and covering indices • Other operators in the input sub-expressions • Possibility of partial sorting
Motivation • Joins in data integration and decision support involve large number of attributes • Increasing use of covering indices • Several alternative sort orders • Partial sorting • Query patterns • Attributes common to multiple operators • Known techniques • Work only for unary operators like group-by
Outline of the Talk • Partial sorting • Changes to external sort • Optimizer changes to handle partial sort orders • Interesting orders for a join tree : A special case • Problem is NP-Hard • A 2-approximation for the special case • The general problem • Notion of favorable orders • Plan generation using favorable orders • Post-optimization phase • Experimental results
Exploiting Partial Sort Orders • Sort on (a1, a2) given (a1) • Standard external-sort • Cost is independent of input sort order • Replacement-selection • Produces single run but incurs I/O • Both methods break the pipeline – first o/p tuple after reading all i/p R.a1=S.a1 and R.a2=S.a2 (a1) (a1,a2) () (a1,a2) R S C. Index on (R.a1)
A Minor Change to External Sorting • Multiple “partial sort segments” • Hold only one segment at any given time • When a new segment starts • Sort the current segment and output • No run generation I/O if each segment fits in memory • Early output (good for Top-K) • Reduced comparisons • O(n log n/k) Vs. O(n log n), k = # segments
Optimizer Changes to Handle Partial Sort Orders • Cost Model for Partial Sort: • Let the input order be o1 • Required (output) order be o2 • Let os=Longest common prefix between o1 and o2 • Let or=o2 – os (i.e, os + or = o2) • A(o) = Attribute set of order o • Є : Empty (no) sort order • coe(e, o1,o2) = D(e, A(os)) X coe(e’, Є, or), • where e’=sp(e) and p equates A(os) to a constant.
Optimizer Changes to Handle Partial Sort Orders • Cost Model for Partial Sort: • coe(e, o1,o2) = D(e, A(os)) X coe(e’, Є, or), • where e’=sp(e) and p equates A(os) to a constant. o2=(a,c) o1=(a,b) e os=(a), or=(c), e’=s(a=k)(e)
Flexible Order Requirements • Most operators have interest in any order on the attributes involved • Merge-Join, Merge-Union, Group By, Duplicate Elimination • Binary operators demand the same order from inputs G {a1, a2} {a1,a2,a3,a4} {a3,a5,a6} {a4,a7}
A special case: All relations/intermediate results of the same size All attribute cardinalities same We try to maximize the length of common prefixes Maximize SLCP(pi, pj) Reduction from graph layout problem SUM-CUT Optimal algorithm for paths and 2-approximation for binary trees Finding Optimal is NP-Hard
s2 Sn-1 s1 s3 sn A 2-Approximation Algorithm • Optimal algorithm for paths • OPT(i,j) = max {OPT(i,k) + OPT(k+1,j) + c(i,j)}, i ≤ k < j • 2-Approximation for binary trees Even levels Odd levels - OPT ≤ OPT-EVEN + OPT-ODD - Take the one with higher benefit
General Case • Logical plan space for inputs not expanded (i.e, Join order not fixed) • Varying sizes of relations and intermediate results • All orders on base relations do not have the same cost (due to clustering and covering indices)
Overview of the Approach • Identify a small set of favorable orders • Orders that are relatively inexpensive • Should not require expanding the input plan space • Plan generation (Phase-1) • Deduce the interesting orders from the favorable orders • Try each of the interesting order, retain the best • Plan refinement (Phase-2) • Use the 2-approximation algorithm and refine the sort orders further
Favorable Orders • Benefit of an order: benefit(o, e) = cbp(e, Є) + coe (e, Є, o) – cpb(e,o) Positive benefit The order can be obtained at cost less than the full sort of unordered result (e.g., the clustering order) • Favorable orders: ford(e)={ o : benefit(o,e) > 0 } • Can be a huge set • E.g., Every order having the clustering order as its prefix is a favorable order.
E E Minimal Favorable Orders • A favorable order o that satisfies: • o’ ≤ o s.t. cbp(e, o’) + coe(e, o’, o) = cbp(e,o) • o” s.t. o ≤ o” and cbp(e, o”) = cbp(e,o) E.g., Relation R with clustering index on (a1,a2) • (a1,a2) is a minimal favorable order • (a1 ), (a1,a2,a3) are not • ford-min(e): Set of all minimal favorable orders for expression e • For base relations size of ford-min limited to the number of covering indices
Computing Favorable Orders: Issues • Defined in terms of cost of best plan • Need them before optimizing input sub-expressions • Even ford-min can get prohibitively large for join, group-by expressions ford-min contains every permutation of the join attributes J1 J2 S R
U U U Heuristics for Computing ford-min • e=R : {o: o is clustering or covering index order} • e=sp(e1) : {o: o ford-min(e1)} • e=PL(e1) :{o: o’ ford-min(e1) and o=o’ ^ L} • Pa,b(e1), ford-min(e1)={(a,c,b)} ford-min(e)={(a)} • e=e1 e2 : • Let T=ford-min(e1) U ford-min(e2) • T U {o: o’ T and o=((o’ ^ S) permute(S – A(o’ ^ S)))
Heuristics for Computing ford-min S={a,b,c,d} T = {(a,b,e), (b), (a)} ford-min={(a,b,e),(b)} ford-min={(a)}
Plan Generation (Phase-1) • Form the set I of interesting orders to try • Collect input favorable orders and rqd. o/p order • Take LCP with the set of join attributes • Extend the orders (arbitrarily) to include remaining attributes • For each order o in I, generate optimization sub-goals for input sub-expressions
Plan Refinement (Phase-2) • Identify the suffix that can be freely reordered • Use the 2-approximation algorithm to reorder the suffix {a,e,h} (a,b,c,h) (a,h,b,c) {a,e,h} {a,d,h} (a,h,e) (a,e,h) (a,d,h) (a,h,d) R4 (a) R3 (a) R2 (a) R1 (a)
Experiments • Benefits of exploiting partial sort orders • Evaluate the plans produced by our optimizer extensions
Experiment 1 SELECT suppkey, partkey FROM lineitem ORDER BY suppkey, partkey; (suppkey) (suppkey, partkey)
Experiment 2 R(c1,c2,c3), 10 M records, (c1)(c1,c2), card(c1)=10,000
SELECT ps_suppkey, ps_partkey, ps_availqty, sum(l_quantity) AS total_required FROM partsupp, lineitem WHERE ps_suppkey=l_suppkey AND ps_partkey=l_partkey AND l_linestatus='O' GROUP BY ps_partkey, ps_suppkey, ps_availqty, HAVING sum(l_quantity) > ps_availqty ORDER BY ps_partkey; Parts running out of stock: Experiment 4
Experiment 4 - Plans Merge-Join Plan on SYS1 and SYS2 Plan Generated by PYRO-O
Experiments with Variants of PYRO PYRO : Baseline PYRO PYRO-O-: No partial sort PYRO-P : Postgres Heuristic PYRO-O : Our Approach PYRO-E : Exhaustive