290 likes | 544 Views
Selectivity Estimation using Probabilistic Models. Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng. Outline. Introduction Estimation for single tables SRM: Statistical Relational Models Selectivity Estimation using SRMs Learning SRM. Introduction.
E N D
Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng
Outline • Introduction • Estimation for single tables • SRM: Statistical Relational Models • Selectivity Estimation using SRMs • Learning SRM
Introduction • Accurate estimates of the result size of queries are crucial to several query processing components of DBMS. • Cost-based query optimizers: choose the optimal query execution plan. • Query profilers: predicting resource consumption and distribution of query results. • Answer counting queries
Introduction • How to estimate the size of a selection query over multiple attributes for a single table? The result size is determined by the joint frequency distribution of the values of these attributes. Query: select * from R where R.A1 = a1 and … and R.Ak = ak Size(Q) = |R| PD(a1,…,ak)
Introduction • But…exponential in # of attributes vn, representing all combination of attribute values is infeasible. • Attribute value independence assumption: joint distribution is product of single attribute distributions • Problem: overestimate or underestimate the query size. → Bayesian Network
Estimation for single table • Example: Given a simple relation R with three attributes: Education (high-school, college,degree) 3 values Income (low, medium, high) 3 values Home-Owner (false, true) 2 values Joint distribution need 18 numbers. • Observation: Some of the correlations between attributes might be indirect ones, mediated by other attributes. → Conditional Independence
Estimation for single table • P(H=h|E=e,I=i)=P(H=h|I=i) Home-owner is conditionally independent of Education given Income. Education Income Home-owner → Compact form of the joint distribution P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I)
Estimation for single table Figure (b) can encode precisely the same joint distribution as in Figure (a)
Bayesian Networks Pneumonia Tuberculosis Lung Infiltrates X-Ray Sputum Smear Nodes = random variables Edges = direct probabilistic influence Network structure encodes independence assumptions: X-Ray conditionally independent of Pneumonia given Infiltrates
Bayesian Networks P(I |P, T ) P T Pneumonia Tuberculosis p t 0.8 0.2 p t 0.6 0.4 Lung Infiltrates p t 0.2 0.8 t p 0.01 0.99 X-Ray Sputum Smear • Associated with each node Xi there is a conditional probability distribution P(Xi|Pai:) —distribution over Xi for each assignment to parents
BN Semantics P T I X S Compact & natural representation: nodes have k parents 2k n vs. 2n params conditional independencies in BN structure local probability models full joint distribution over domain + =
BNs for Query Estimation • Query: select * from R where R.A1 = a1 and … and R.Ak = ak • P(a1,a2,…an)= P(ai|parents(ai)) Use Bayesian inference algorithm to compute PD(a1,…,ak) • Algorithm complexity depends on BN connectivity; efficient in practice Size(Q) = |R| PD(a1,…,ak)
Join Selectivity Estimation Size(Purchase Person) = | Purchase | Naïve Approach Uniform Join Assumption Purchase Person Assuming referential integrity
Join Selectivity Estimation Example query Q: “person.income=high and purchase.type=luxury” p = P (person.income=high) q = P (purchase.type=luxury) SizeQ = |Purchase|*p*q Problems: Joining Two Tables
Correlated Attributes Type = luxury Income = high Income = low Type = necessity Purchase Person The attributes of the two different tables are often correlated
Skewed Join Type = luxury Income = high Income = low Type = necessity Purchase Person The probability that two tuples join with each other can also be correlated with various attributes.
Join Indicator R S Query: select * from R, S where R.F = S.K and R.A = a and S.B = b P(JF) = prob. randomly chosen tuple from R joins with a randomly chosen tuple from S size(Q) = | R | | S | P(JF, a, b)
Statistical Relational Models • Model distribution of attributes across multiple tables • Allow attribute values to depend on attributes in the same table (like a BN) • Allow attribute values to depend on attributes in other tables along a foreign key join • Can model the join probability of two tuples using join indicator variable
Statistical Relational Model • A SRM for a relational database is a pair(S,θ),which specifies a local probabilistic model for each of the following variables • A variable R.A for each table R and each attribute A R.* • A boolean join indicator variable R.JF • For each variable R.X • S specifies a set of parents Pa(R.X) • Θ specifies a CPD P(R.X|Pa(R.X))
Example SRM School Person Prestige Age Purchase Bought-by Attended Type Income Jschool Jperson Type=necessity false true Income = high 0.999, 0.001 false true 0.9998, 0.0002 0.99, 0.01
Universal Foreign Key Closure t s s1 s2 • Schema: R, S, T.R.F refers to S, S.F refers to Tstratification: T < S < R s.F2 = t.K r.F1 = s.K r • Schema: R, SR.F1 refers to S, R.F2 refers to Sstratification: S < R r.F1 = s1.K r.F2 = s2.K r
Universal Foreign Key Closure • Minimal extension Q+ to a query Q: Let Q be a keyjoin query over r1,r2,…rk For each r, if there is an attribute R.A with parent R.F.B where R.F points to S, then there is a unique tuple variable s representing the join r.F=s.K • Proposition: Let Q be a query and let Q+ be its minimal extension. Then sizeQ[D]=sizeQ+[D]
Answering Queries Using SRMs Prestige Age Person Purchase Type Income Jschool Jperson • Construct Query Evaluation BN for Query: select * from Person, Purchase where Person.id = Purchase.buyer-id and Person.Income = high and Purchase.Type=luxury School Compute upward closure of query attributes by including all parents as well
SRM Learning Database Strain Patient Contact • Learn parameters & qualitative dependency structure • Extend known techniques for learning Bayesian networks from data.
Structure selection • Define scoring function: log-likelihood function l( θ,S | D)=log P(D | S, θ) Finding the model that has maximum log-likelihood given data. • Do the greedy local structure search
Parameter Estimation • The model contains a parameter Θa|x for each value a of A and each assignment of values x to X. Θa|x = P(R.A=a |X=x) Θa|x = FD(R.A=a,X=x)/FD(X=x)
System Architecture Query Q Selectivity Estimator Size(Q) execution time Database Model Constructor offline
Conclusions • SRM is unique in its ability to handle select and join operators • Estimates the high-dimensional joint distribution using a set of lower-dimensional conditional distributions To do: • Incremental maintenance of the SRM as the database changes • Joins over non-key attributes
Selected Publications • “Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D. Koller and B. Taskar, JMLR 2002. • “Probabilistic Models of Text and Link Structure for Hypertext Classification”, L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond Classification’, 2001. • “Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D. Koller, SIGMOD-01. • “Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N. Lavrac, 2001. • see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI-99. • “Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman, D. Koller, and B. Taskar, ICML-01. • “From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D. Koller and N. Friedman, ICML Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries, 2000. • Notes from AAAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2000. • Notes from IJCAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2003. See http://www.cs.umd.edu/~getoor