Explore consistency and bias challenges in estimating predicate selectivity, using maximum entropy, iterative scaling, and performance analysis. Learn how overlapping information impacts accuracy and bias in cardinality estimation.
Consistently Estimating the Selectivity of Conjuncts of Predicates
Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava
Agenda • Consistency and Bias Problems in Cardinality Estimation • The Maximum Entropy Solution • Iterative Scaling • Performance Analysis • Related Work • Conclusions
What is the problem?
• Consider the following three attributes, which are correlated with each other: Make, Model, Color
How to estimate the cardinality of the predicate Make = 'Mazda' AND Model = '323' AND Color = 'red'?
• Known single-attribute cardinalities: Make = 'Mazda': 100,000; Model = '323': 200,000; Color = 'red': 200,000
• Real cardinality of the conjunct: 49,000
Without any additional knowledge
• Base cardinality: 1,000,000 (legend: s(p) denotes the selectivity of predicate p)
• Independence assumption: Selectivity(Make = 'Mazda' AND Model = '323' AND Color = 'red') ≈ s(Make = 'Mazda') * s(Model = '323') * s(Color = 'red') = (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000) = 0.004
• Estimated cardinality: 0.004 * 1,000,000 = 4,000
Additional knowledge given (1):
• Additional knowledge: card(Make = 'Mazda' AND Model = '323') = 50,000
• case 1: s(Make AND Model) * s(Color) = (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.01, estimated cardinality: 10,000
Additional knowledge given (2):
• Additional knowledge: card(Make = 'Mazda' AND Model = '323') = 50,000, card(Make = 'Mazda' AND Color = 'red') = 90,000
• case 1: s(Make AND Model) * s(Color) = 0.01, estimated cardinality: 10,000
• case 2: s(Make AND Color) * s(Model) = (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.018, estimated cardinality: 18,000
Additional knowledge given (3):
• Additional knowledge: card(Make = 'Mazda' AND Model = '323') = 50,000, card(Make = 'Mazda' AND Color = 'red') = 90,000, card(Model = '323' AND Color = 'red') = 150,000
• case 1: s(Make AND Model) * s(Color) = 0.01, estimated cardinality: 10,000
• case 2: s(Make AND Color) * s(Model) = 0.018, estimated cardinality: 18,000
• case 3: s(Model AND Color) * s(Make) = (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.015, estimated cardinality: 15,000
Why is this a problem?
• The optimizer compares access plans that produce the same intermediate result, but estimates a different cardinality for each:
  • case 0 (index intersection over Make, Model, Color): s(Make) * s(Model) * s(Color) = 0.004, estimated cardinality: 4,000
  • case 2 (index scan on (Make, Color), then FETCH to apply the Model predicate): s(Make AND Color) * s(Model) = 0.018, estimated cardinality: 18,000
  • case 3 (index scan on (Model, Color), then FETCH to apply the Make predicate): s(Model AND Color) * s(Make) = 0.015, estimated cardinality: 15,000
• Cardinality bias: the plan that ignores the multivariate statistics gets the lowest estimate and wins, "fleeing from knowledge to ignorance"
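To make the inconsistency concrete, here is a minimal Python sketch (illustrative only, not part of the talk) that recomputes the four competing estimates for the same conjunct from the statistics above:

```python
# Statistics from the example; base table of 1,000,000 rows.
N = 1_000_000
card = {
    "make": 100_000,              # Make = 'Mazda'
    "model": 200_000,             # Model = '323'
    "color": 200_000,             # Color = 'red'
    ("make", "model"): 50_000,    # Mazda AND 323
    ("make", "color"): 90_000,    # Mazda AND red
    ("model", "color"): 150_000,  # 323 AND red
}
s = {k: v / N for k, v in card.items()}   # selectivities

# Four ways to estimate the very same conjunct Make AND Model AND Color:
estimates = {
    "case 0 (independence)":    s["make"] * s["model"] * s["color"],
    "case 1 (use Make,Model)":  s[("make", "model")] * s["color"],
    "case 2 (use Make,Color)":  s[("make", "color")] * s["model"],
    "case 3 (use Model,Color)": s[("model", "color")] * s["make"],
}
for name, sel in estimates.items():
    print(f"{name}: selectivity {sel:.3f}, cardinality {sel * N:,.0f}")
# Same predicate, four different answers (4,000 / 10,000 / 18,000 / 15,000):
# whichever statistic the optimizer happens to apply determines the winning plan.
```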
What has happened?
• Inconsistent model
  • different estimates for the same intermediate result
  • due to multivariate statistics with overlapping information
• Bias during plan selection
  • results in the selection of sub-optimal plans
• Bias avoidance means keeping the model consistent
  • State of the art is to do bookkeeping of the first multivariate statistic used, and to ignore further overlapping multivariate statistics
  • This does not solve the problem, as ignoring knowledge also means bias
  • The bias is arbitrary: it depends on which statistics are used first during optimization
• The only possible solution is to exploit all knowledge consistently
Problem: Only partial knowledge of the DNF atoms
• The three predicates partition the table into 2^3 = 8 DNF atoms (the regions of the Venn diagram): Mazda & 323 & red, Mazda & 323 & ¬red, Mazda & ¬323 & red, Mazda & ¬323 & ¬red, ¬Mazda & 323 & red, ¬Mazda & 323 & ¬red, ¬Mazda & ¬323 & red, ¬Mazda & ¬323 & ¬red
• The additional knowledge only fixes sums of atoms: card(Mazda AND 323) = 50,000, card(Mazda AND red) = 90,000, card(323 AND red) = 150,000
• Legend: DNF = disjunctive normal form; ¬X denotes NOT X
How to compute the missing values of the distribution?
• We want Probability(Make = 'Mazda' AND Model = '323' AND Color = 'red'), i.e., the DNF atom Mazda & 323 & red, but the individual atom probabilities are not determined by the known sums
Solution: Information Entropy
• H(X) = −∑i xi log(xi)
• Entropy is a measure of the "uninformedness" of a probability distribution X = (x1, …, xm) with x1 + … + xm = 1
• The less is known about a probability distribution, the larger its entropy
• Maximizing information entropy for the unknown selectivities, using the known selectivities as constraints, avoids bias
  • Knowing nothing ⇒ uniformity: s(X = ?) = 1/m
  • Knowing only the marginals ⇒ independence: s(X = ? AND Y = ?) = s(X = ?) * s(Y = ?)
• Thus the principle of maximum entropy generalizes the uniformity and independence assumptions used in today's query optimizers
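As a quick illustration (not part of the slides), the sketch below computes the entropy of a few distributions over four outcomes; the uniform distribution is the least informed and has the largest entropy:

```python
import math

def entropy(xs):
    """Shannon entropy H(X) = -sum x_i * log(x_i); terms with x_i = 0 contribute 0."""
    return -sum(x * math.log(x) for x in xs if x > 0)

for dist in ([0.25, 0.25, 0.25, 0.25],   # uniform: maximal entropy
             [0.40, 0.30, 0.20, 0.10],   # mildly skewed: smaller entropy
             [0.97, 0.01, 0.01, 0.01]):  # highly concentrated: close to fully informed
    print(dist, round(entropy(dist), 4))
```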
Entropy Maximization for Cardinality Estimation
• Given some selectivities (single and conjunctive) over a space of n predicates p1, …, pn, choose a model which is consistent with this knowledge but otherwise as uniform as possible
• Maximize the entropy of the probability distribution X = (xb | b ∈ {0,1}^n)
  • xb is the selectivity of the DNF atom b
  • bi = 0 means that predicate pi is negated in the DNF atom; bi = 1 means that predicate pi occurs as a positive term
• Legend: {0,1}^n denotes the n-fold cross product of the set {0,1}, i.e., {0,1} × … × {0,1} (n times); for a predicate p, p^1 = p and p^0 = NOT p
Maximum Entropy Principle – Example
• Knowledge sY, Y ∈ T: s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2, s1,2 = s(Mazda & 323) = 0.05, s1,3 = s(Mazda & red) = 0.09, s2,3 = s(323 & red) = 0.15
• T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅}
• Each atom is labeled by a bit vector b = b1 b2 b3 for (Mazda, 323, red), e.g., 110 = Mazda & 323 & ¬red
• First constraint: s1 = x100 + x101 + x110 + x111
Maximum Entropy Principle – Example
• Knowledge sY, Y ∈ T: s1 = s(Mazda) = 0.1, s2 = s(323) = 0.2, s3 = s(red) = 0.2, s1,2 = s(Mazda & 323) = 0.05, s1,3 = s(Mazda & red) = 0.09, s2,3 = s(323 & red) = 0.15
• T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅}
• Constraints:
  • 0.10 = s1 = x100 + x101 + x110 + x111
  • 0.20 = s2 = x010 + x011 + x110 + x111
  • 0.20 = s3 = x001 + x011 + x101 + x111
  • 0.05 = s1,2 = x110 + x111
  • 0.09 = s1,3 = x101 + x111
  • 0.15 = s2,3 = x011 + x111
  • 1.00 = s∅ = x000 + x001 + x010 + x011 + x100 + x101 + x110 + x111
• Objective function: maximize H(X) = −∑b xb log(xb)
Solving the Constrained Optimization Problem
• Minimize the objective function ∑b xb log(xb) (equivalently, maximize the entropy H(X))
• satisfying the |T| ≤ |2^{1,…,n}| constraints: for each Y ∈ T, ∑_{b ∈ C(Y)} xb = sY
• Legend: 2^{1,…,n} denotes the powerset of {1,…,n}; C(Y) denotes all DNF atoms that contribute to sY, i.e., formally, C(Y) := {b ∈ {0,1}^n | ∀ i ∈ Y: bi = 1} and C(∅) := {0,1}^n
• General solution: Iterative Scaling
Maximum Entropy and Lagrange Multipliers
• We build a Lagrangian function by associating a multiplier λY with each constraint and subtracting the constraints from the objective function
• The objective function is convex
• Differentiating with respect to xb and equating to zero yields the conditions for the minimum
• Exponentiating the Lagrange multipliers in the derivatives (zY := e^λY) yields a product form: xb = e^−1 ∏_{Y ∈ P(b,T)} zY
• Replacing xb in each constraint yields a condition in the exponentiated Lagrange multipliers zY
• Legend: P(b,T) ⊆ T denotes the indexes Y of all known selectivities sY to which DNF atom b contributes its value xb: P(b,T) = {Y ∈ T | ∀ i ∈ Y: bi = 1} ∪ {∅}
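One way to write out the derivation this slide summarizes (a sketch using natural logarithms; the e^{-1} factor comes from differentiating xb log xb):

```latex
\begin{align*}
\mathcal{L}(x,\lambda)
  &= \sum_{b \in \{0,1\}^n} x_b \log x_b
     - \sum_{Y \in T} \lambda_Y \Bigl( \sum_{b \in C(Y)} x_b - s_Y \Bigr) \\
\frac{\partial \mathcal{L}}{\partial x_b}
  &= \log x_b + 1 - \sum_{Y \in P(b,T)} \lambda_Y = 0 \\
\Rightarrow \quad
x_b &= e^{-1} \prod_{Y \in P(b,T)} e^{\lambda_Y}
     = e^{-1} \prod_{Y \in P(b,T)} z_Y ,
     \qquad z_Y := e^{\lambda_Y}
\end{align*}
```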
Iterative Scaling
• We can now isolate zY for a particular Y ∈ T: zY = sY / ( ∑_{b ∈ C(Y)} e^−1 ∏_{W ∈ P(b,T)\{Y}} zW )
• and thus iteratively compute zY from all zW, W ∈ T\{Y}
• This algorithm is called Iterative Scaling (Darroch and Ratcliff, 1972) and converges to a stable set of Lagrange multipliers zY, Y ∈ T
• This stable point minimizes the objective function and satisfies all constraints
• We can compute all DNF atoms xb from these stable multipliers using xb = e^−1 ∏_{Y ∈ P(b,T)} zY
• and can in turn compute all missing selectivities
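The following is a minimal, self-contained Python sketch of iterative scaling for the running example (illustrative only, not the implementation from the paper; the iteration cap and convergence threshold are assumptions):

```python
import math
from itertools import product

# Known selectivities s_Y, keyed by the predicate-index set Y (as frozensets);
# the empty set carries the normalization constraint s_emptyset = 1.
# The numbers are the Mazda / 323 / red example from the slides.
s = {
    frozenset(): 1.0,
    frozenset({1}): 0.10,        # s1   = s(Make = 'Mazda')
    frozenset({2}): 0.20,        # s2   = s(Model = '323')
    frozenset({3}): 0.20,        # s3   = s(Color = 'red')
    frozenset({1, 2}): 0.05,     # s1,2 = s(Mazda & 323)
    frozenset({1, 3}): 0.09,     # s1,3 = s(Mazda & red)
    frozenset({2, 3}): 0.15,     # s2,3 = s(323 & red)
}
n = 3
atoms = list(product((0, 1), repeat=n))     # all DNF atoms b in {0,1}^n

def contributes(b, Y):
    """b is in C(Y): the atom has b_i = 1 for every predicate index i in Y."""
    return all(b[i - 1] == 1 for i in Y)

def x(b, z):
    """Product form of the ME solution: x_b = e^-1 * product of z_Y over Y in P(b,T)."""
    val = math.exp(-1)
    for Y, zY in z.items():
        if contributes(b, Y):
            val *= zY
    return val

# Iterative scaling (Darroch & Ratcliff 1972): repeatedly re-solve each
# constraint for its own multiplier z_Y while holding the other multipliers fixed.
z = {Y: 1.0 for Y in s}
for _ in range(1000):                       # iteration cap is an arbitrary choice
    delta = 0.0
    for Y, sY in s.items():
        denom = sum(x(b, z) / z[Y] for b in atoms if contributes(b, Y))
        new_z = sY / denom
        delta = max(delta, abs(new_z - z[Y]))
        z[Y] = new_z
    if delta < 1e-9:                        # convergence threshold is an assumption
        break

# The previously unknown selectivity s1,2,3 is the DNF atom x_111.
print(round(x((1, 1, 1), z), 6))            # ~0.0499, vs. 0.004 under independence
```

Running this converges to the same value x111 ≈ 0.049918 that the result slide below reports for the example.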
Maximum Entropy Solution of the Example
• Knowledge: s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2, s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(323 & red) = s2,3 = 0.15
• Wanted: Selectivity(Make = 'Mazda' AND Model = '323' AND Color = 'red') = s1,2,3 = x111 = ???
Iterative Scaling – 1st iteration, step 1: scale z1 to satisfy s1 (atoms x100, x101, x110, x111)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• With all multipliers initialized to 1, every atom starts at xb = e^−1 ≈ 0.368; the constraint for s1 covers four atoms, so z1 = 0.1 * e / 4 ≈ 0.067957
• Multipliers: z1 = 0.067957, z2 = 1, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1
• Resulting selectivities: s1 = 0.1, s2 = 0.785759, s1,2 = 0.05, s3 = 0.785759, s1,3 = 0.05, s2,3 = 0.392879, s∅ = 1.571518
Iterative Scaling – 1st iteration, step 2: scale z2 to satisfy s2 (atoms x010, x011, x110, x111)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1
• Resulting selectivities: s1 = 0.062727, s2 = 0.2, s1,2 = 0.012727, s3 = 0.492879, s1,3 = 0.031363, s2,3 = 0.1, s∅ = 0.985759
Iterative Scaling – 1st iteration, step 3: scale z1,2 to satisfy s1,2 (atoms x110, x111)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1
• Resulting selectivities: s1 = 0.1, s2 = 0.237273, s1,2 = 0.05, s3 = 0.511516, s1,3 = 0.05, s2,3 = 0.118637, s∅ = 1.023032
Iterative Scaling – 1st iteration, step 4: scale z3 to satisfy s3 (atoms x001, x011, x101, x111)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 1, z2,3 = 1, z∅ = 1
• Resulting selectivities: s1 = 0.069550, s2 = 0.165023, s1,2 = 0.034775, s3 = 0.2, s1,3 = 0.019550, s2,3 = 0.046386, s∅ = 0.711516
Iterative Scaling – 1st iteration, step 5: scale z1,3 to satisfy s1,3 (atoms x101, x111)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1, z∅ = 1
• Resulting selectivities: s1 = 0.14, s2 = 0.200248, s1,2 = 0.07, s3 = 0.27045, s1,3 = 0.09, s2,3 = 0.081611, s∅ = 0.781966
Iterative Scaling – 1st iteration, step 6: scale z2,3 to satisfy s2,3 (atoms x011, x111)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, z∅ = 1
• Resulting selectivities: s1 = 0.177709, s2 = 0.268637, s1,2 = 0.107709, s3 = 0.338839, s1,3 = 0.127709, s2,3 = 0.15, s∅ = 0.850355
Iterative Scaling – 1st iteration, step 7: scale z∅ to satisfy s∅ = 1 (all atoms)
• Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1
• Multipliers: z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, z∅ = 1.175979
• Resulting selectivities: s1 = 0.208982, s2 = 0.315911, s1,2 = 0.126664, s3 = 0.398468, s1,3 = 0.150183, s2,3 = 0.176397, s∅ = 1
• Atom values after the first full pass: x000 = 0.432619, x001 = 0.169152, x010 = 0.110115, x011 = 0.079133, x100 = 0.029399, x101 = 0.052919, x110 = 0.029399, x111 = 0.097264
Maximum Entropy Solution of the Example
• Knowledge: s(Mazda) = s1 = 0.1, s(323) = s2 = 0.2, s(red) = s3 = 0.2, s(Mazda & 323) = s1,2 = 0.05, s(Mazda & red) = s1,3 = 0.09, s(323 & red) = s2,3 = 0.15
• Converged atom values (after 241 iterations): x000 = 0.740082, x001 = 0.009918, x010 = 0.049918, x011 = 0.100082, x100 = 0.009918, x101 = 0.040082, x110 = 0.000082, x111 = 0.049918
• Selectivity(Make = 'Mazda' AND Model = '323' AND Color = 'red') = s1,2,3 = x111 = 0.049918
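As a sanity check (again an illustrative sketch, not from the talk), the converged atoms can be verified against all known constraints:

```python
# Converged DNF-atom selectivities from the result slide, keyed by (b1, b2, b3).
x = {
    (0, 0, 0): 0.740082, (0, 0, 1): 0.009918,
    (0, 1, 0): 0.049918, (0, 1, 1): 0.100082,
    (1, 0, 0): 0.009918, (1, 0, 1): 0.040082,
    (1, 1, 0): 0.000082, (1, 1, 1): 0.049918,
}

def s(*preds):
    """Selectivity of a conjunct: sum of all atoms whose bits are 1 for the given predicates."""
    return sum(v for b, v in x.items() if all(b[p - 1] == 1 for p in preds))

assert abs(s(1) - 0.10) < 1e-4 and abs(s(2) - 0.20) < 1e-4 and abs(s(3) - 0.20) < 1e-4
assert abs(s(1, 2) - 0.05) < 1e-4 and abs(s(1, 3) - 0.09) < 1e-4 and abs(s(2, 3) - 0.15) < 1e-4
print("s_123 =", s(1, 2, 3))   # 0.049918, consistent with every known statistic
```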
Let's compare: Selectivity(Make = 'Mazda' AND Model = '323' AND Color = 'red')
• Real: s(Make AND Model AND Color) = 0.049, actual cardinality: 49,000
• case 0: s(Make) * s(Model) * s(Color) = 0.004, estimated cardinality: 4,000, error: 10x
• case 1: s(Make AND Model) * s(Color) = 0.010, estimated cardinality: 10,000, error: 5x
• case 2: s(Make AND Color) * s(Model) = 0.018, estimated cardinality: 18,000, error: 2.5x
• case 3: s(Model AND Color) * s(Make) = 0.015, estimated cardinality: 15,000, error: 3x
• ME: maximum entropy estimate of s(Make AND Model AND Color) = 0.049918, estimated cardinality: 49,918, almost no error
Forward Estimation: Predicting s1,2,3, given different subsets of known selectivities
• [Box plot over 200 queries: quartiles and median of the estimation error when s1,2,3 is predicted from {s1,2, s1,3, s2,3}, {s1, s2, s3}, {s1,3, s2,3}, {s1,2, s1,3}, {s1,2, s2,3}, {s1,3}, {s2,3}, or {s1,2}]
Comparing DB2 and ME: Predicting s1,2,3, given …
• [Box plot over 200 queries: quartiles, median, and mean of the estimation error for DB2 vs. ME when s1,2,3 is predicted from {s1,3, s2,3}, {s1,2, s1,3}, {s1,2, s2,3}, and {s1,2, s1,3, s2,3}]
Backward Estimation: Given s1,2,3, predicting …
• [Box plot over 200 queries: quartiles, median, and mean of the estimation error for DB2 vs. ME when predicting s1,2, s1,3, and s2,3]
Related Work
• Selectivity Estimation
  • SAC+79 P. G. Selinger et al.: Access Path Selection in a Relational DBMS. SIGMOD 1979
  • Chr83 S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983)
  • Lyn88 C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251
  • PC84 G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD 1984: 256-276
  • PIH+96 V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996
• Recommending, Constructing, and Maintaining Multivariate Statistics
  • AC99 A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192
  • BCG01 N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001
  • BC02 N. Bruno, S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002
  • BC03 N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003
  • BC04 N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322
  • SLM+01 M. Stillger, G. Lohman, V. Markl, M. Kandil: LEO – DB2's Learning Optimizer. VLDB 2001
  • IMH+04 I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. SIGMOD 2004
  • CN00 S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348
  • DGR01 A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001
  • GJW+03 C. Galindo-Legaria, M. Joshi, F. Waas, et al.: Statistics on Views. VLDB 2003: 952-962
  • GTK01 L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001
  • PI97 V. Poosala, Y. Ioannidis: Selectivity Estimation without Value Independence. VLDB 1997
• Entropy and Maximum Entropy
  • Sha48 C. E. Shannon: A Mathematical Theory of Communication. Bell System Technical Journal 27: 379-423 and 623-656, 1948
  • DR72 J. N. Darroch, D. Ratcliff: Generalized Iterative Scaling for Log-Linear Models. The Annals of Mathematical Statistics 43, 1972: 1470-1480
  • GP00 W. Greiff, J. Ponte: The Maximum Entropy Approach and Probabilistic IR Models. ACM TOIS 18(3): 246-287, 2000
  • GS85 S. Guiasu, A. Shenitzer: The Principle of Maximum Entropy. The Mathematical Intelligencer 7(1), 1985
Conclusions
• Problem: inconsistent cardinality model and bias in today's query optimizers
  • due to overlapping multivariate statistics (MD histograms, etc.)
  • To reduce bias, today's optimizers only use a consistent subset of the available multivariate statistics
  • Cardinality estimates remain suboptimal despite better information
  • Bias towards plans without proper statistics ("fleeing from knowledge to ignorance")
• Solution: maximizing information entropy
  • Generalizes the concepts of uniformity and independence used in today's query optimizers
  • All statistics are utilized ⇒ cardinality estimates improve, some by orders of magnitude
  • Cardinality model is consistent ⇒ no bias towards particular plans
  • Consistent estimates are computed in subsecond time for up to 10 predicates per table; however, the algorithm is exponential in the number of predicates
• Not covered in the talk (see paper):
  • Reducing algorithm complexity through pre-processing
  • Impact on query performance ⇒ speedup, sometimes by orders of magnitude
• Future work:
  • Extension to join estimates