General Database Statistics Using Maximum Entropy
Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin-Madison), and Dan Suciu (University of Washington)
Study Cardinality Estimation
1. Model: the information that the optimizer knows
2. Prediction: use the model to estimate the cardinality of future queries
Propose a declarative language with statistical assertions, e.g. “We estimate that distinct # of Employees is 10”
Contribution: a principled, declarative approach to cardinality estimation based on entropy maximization.
Motivating Applications
1. Incorporate query feedback records
   - Underutilized: no general-purpose mechanism
2. Optimizers for new domains (DB Kit 2.0)
   • Cloud computing, information extraction
3. Data generation and description
Outline
• Statistical programs and desiderata
• Semantics of statistical programs
• Two examples
• Conclusions
Statistical Assertions
An assertion is a conjunctive query (CQ) view plus a sharp (#) statement: V(x) :- R(x,y), … #V = 10^6
V1(x) :- R(x,-) #V1 = 20: “The number of values in the output of V1 is 20”
V2(y) :- R(-,y), S(y) #V2 = 50: “The number of values in the output of V2 is 50”
A program is a set of assertions.
Model as a Probabilistic Database
V(x) :- R(x,y), … #V = 10^6
Intuitively, # is “expected value”.
V1(x) :- R(x,-) #V1 = 20: “The number of values in the output of V1 is 20”
A model is a probabilistic database such that the expected number of tuples in V1 is 20.
OK, but which pdb?
Desiderata for our solution
V(x) :- R(x,y), … #V = 10^6
Two desiderata for the distribution:
(D1): Should agree with the provided statistics
(D2): Should assume nothing else
Approach: maximize entropy subject to (D1)
Challenge: compute the parameters of the MaxEnt distribution
Technical desideratum: we want the parameters analytically
Notation for Probabilistic Databases
Consider a domain D of size n.
Fix a schema R = R1, R2, …
Let Inst(n) = all instances over R on D.
An element I of Inst(n) is called a world.
A probabilistic database is a pair (Inst(n), p): essentially, any discrete probability distribution on relations.
Achieving (D1): Stats must agree
The semantics of #: # means “expected value”
V1(x) :- R(x,-) #V1 = 20: “The number of values in the output of V1 is 20”, i.e. $\mathbf{E}_p[|V_1|] = \sum_{I \in Inst(n)} p(I)\,|V_1(I)| = 20$
NB: In truth, we let n tend to infinity and settle for asymptotic equality.
Achieving (D1): Stats must agree
Multiple views: given V1, V2, … with #Vi = di for i = 1, …, t, require $\mathbf{E}_p[|V_i|] = d_i$ for every i.
If p satisfies these equations, we’ve achieved (D1): should agree with the provided statistics.
Many such distributions exist. How do we pick one?
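The expected-value reading of # can be checked by brute force on a tiny domain. The sketch below assumes a schema with a single binary relation R, a domain of size 2, and the uniform distribution as a stand-in for p; the function names are hypothetical.

```python
from itertools import combinations, product


def all_worlds(n):
    """Inst(n) for one binary relation R over the domain {0, ..., n-1}."""
    tuples = list(product(range(n), repeat=2))
    return [frozenset(c)
            for k in range(len(tuples) + 1)
            for c in combinations(tuples, k)]


def v1(world):
    """V1(x) :- R(x,-)"""
    return {x for (x, _) in world}


def expected_size(worlds, p, view):
    """E_p[|V|] = sum over worlds I of p(I) * |V(I)|."""
    return sum(p[w] * len(view(w)) for w in worlds)


worlds = all_worlds(2)                              # 2^(2*2) = 16 worlds
uniform = {w: 1 / len(worlds) for w in worlds}      # assumed stand-in for p
e_v1 = expected_size(worlds, uniform, v1)

# An assertion "#V1 = d" holds for p exactly when expected_size(...) equals d.
satisfies_d1 = abs(e_v1 - 1.5) < 1e-12
```

Under the uniform distribution each value x appears in V1 with probability 3/4, so this particular p satisfies the assertion #V1 = 1.5 and no other value of d.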
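Achieving (D2): No ad-hoc assumptions
Selecting the best one: maximize the entropy $H(p) = -\sum_{I \in Inst(n)} p(I) \log p(I)$ subject to the constraints $\mathbf{E}_p[|V_i|] = d_i$ for $i = 1, \dots, t$.
One can show that p has the following form: $p(I) = \frac{1}{Z} \prod_{i=1}^{t} a_i^{|V_i(I)|}$
where Z is a normalizing constant and $a_i$ is a positive parameter for $i = 1, \dots, t$.
NB: p is only a function of the stats, and so we have achieved (D2).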
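As an illustration of the product form, the following sketch builds the distribution p(I) proportional to the product of a_i raised to |V_i(I)| by enumerating every world of a tiny schema. It assumes one binary relation over a domain of size 2 and hand-picked parameters; it does not solve for the a_i from target statistics, which is the hard part the talk addresses.

```python
from itertools import combinations, product
from math import prod


def all_worlds(n):
    """All instances of one binary relation R over {0, ..., n-1}."""
    tuples = list(product(range(n), repeat=2))
    return [frozenset(c)
            for k in range(len(tuples) + 1)
            for c in combinations(tuples, k)]


def v_edges(world):
    """V(x,y) :- R(x,y)"""
    return set(world)


def v_first(world):
    """V1(x) :- R(x,-)"""
    return {x for (x, _) in world}


def maxent_distribution(worlds, views, params):
    """p(I) = (1/Z) * prod_i a_i ** |V_i(I)| for given positive a_i."""
    weights = {w: prod(a ** len(v(w)) for a, v in zip(params, views))
               for w in worlds}
    z = sum(weights.values())            # normalizing constant Z
    return {w: weight / z for w, weight in weights.items()}


def expected_size(p, view):
    return sum(prob * len(view(w)) for w, prob in p.items())


worlds = all_worlds(2)
p_uniform = maxent_distribution(worlds, [v_edges, v_first], [1.0, 1.0])
p_edges = maxent_distribution(worlds, [v_edges], [3.0])
e_edges = expected_size(p_edges, v_edges)
```

With every parameter equal to 1 the distribution is uniform; raising a parameter above 1 shifts mass toward worlds where the corresponding view is larger. Here a = 3 on the edge view gives each of the 4 tuples probability 3/4, hence an expected size of 3.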
Benefits of MaxEnt
Every (consistent) statistical program induces a well-defined distribution:
• Every query has a well-defined cardinality estimate
• Statistics are treated as a whole, not as individual stats
• New statistics can be added freely
Technical challenge: compute the $a_i$ analytically
Two Quick Examples
I: A material random graph
• Even simple EM (entropy maximization) solutions have interesting theory
II: Intersection models
• Generating function
• Different, analytic technique
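Example I: Random Graphs are EM
Random graph: add each edge independently at random, with probability x.
V(x,y) :- R(x,y) #V = d
By linearity, $\mathbf{E}[|V|] = x n^2 = d$.
This is MaxEnt; write: $p(I) = x^{|I|} (1-x)^{n^2 - |I|} = \frac{1}{Z} a^{|I|}$, with $a = \frac{x}{1-x}$ and $Z = (1+a)^{n^2}$.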
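The claim that the independent-edge random graph is a MaxEnt distribution can be sanity-checked numerically. This sketch assumes n = 2 and d = 1 (arbitrary illustrative values), sets x = d / n^2 and a = x / (1 - x), and compares the independent-edge probability of every world with the one-parameter MaxEnt form a^|I| / Z.

```python
from itertools import combinations, product

n, d = 2, 1.0                 # illustrative domain size and target statistic
x = d / n**2                  # edge probability from E[|V|] = x * n^2 = d
a = x / (1 - x)               # MaxEnt parameter
Z = (1 + a) ** (n**2)         # normalizing constant

edges = list(product(range(n), repeat=2))
worlds = [frozenset(c)
          for k in range(len(edges) + 1)
          for c in combinations(edges, k)]


def p_independent(world):
    """Each of the n^2 possible edges is present independently with prob x."""
    return x ** len(world) * (1 - x) ** (n**2 - len(world))


def p_maxent(world):
    """One-parameter MaxEnt form: a^|I| / Z."""
    return a ** len(world) / Z


max_gap = max(abs(p_independent(w) - p_maxent(w)) for w in worlds)
expected_edges = sum(p_maxent(w) * len(w) for w in worlds)
```

The two probability functions agree on all 16 worlds up to floating-point error, and the expected number of edges equals d.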
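Example II: An intersection model
V(x) :- R1(x), R2(x) #R1 = d1, #R2 = d2, #V = d3
Generating function: $Z(x_1, x_2, x_3) = (1 + x_1 + x_2 + x_1 x_2 x_3)^n$
Read: each element is in neither relation, in R1 only, in R2 only, or in all three (R1, R2, and V).
E.g., a term with $x_1^k$ is an instance with k distinct values in R1.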
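The generating-function reading of the intersection model can be checked by enumeration on a small domain. The sketch below assumes the per-element factor 1 + x1 + x2 + x1*x2*x3 (an element is in neither relation, in R1 only, in R2 only, or in both and therefore in V), a domain of size 3, and arbitrary positive parameter values; none of these numbers come from the talk.

```python
from itertools import combinations

n = 3                             # illustrative domain size
x1, x2, x3 = 0.5, 2.0, 1.5        # arbitrary positive parameters


def subsets(domain):
    items = list(domain)
    return [frozenset(c)
            for k in range(len(items) + 1)
            for c in combinations(items, k)]


# A world is a pair of unary relations (R1, R2) over the domain.
worlds = [(r1, r2) for r1 in subsets(range(n)) for r2 in subsets(range(n))]


def weight(r1, r2):
    """x1^|R1| * x2^|R2| * x3^|V| with V = R1 intersect R2."""
    return x1 ** len(r1) * x2 ** len(r2) * x3 ** len(r1 & r2)


z_brute = sum(weight(r1, r2) for r1, r2 in worlds)
per_element = 1 + x1 + x2 + x1 * x2 * x3
z_closed = per_element ** n

# Expected sizes: brute force versus x_i * d/dx_i log Z.
e_r1_brute = sum(weight(r1, r2) * len(r1) for r1, r2 in worlds) / z_brute
e_r1_closed = n * (x1 + x1 * x2 * x3) / per_element
e_v_brute = sum(weight(r1, r2) * len(r1 & r2) for r1, r2 in worlds) / z_brute
e_v_closed = n * (x1 * x2 * x3) / per_element
```

Because Z factors per element, each expected size is n times a simple ratio, which is what makes it plausible to solve for the parameters from d1, d2, d3 analytically.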
Results in the paper
• Normal form for statistical programs
• Syntactic classes that we can solve analytically
  • “Project-Semijoin” queries (previous slide)
• A general technique, conditioning:
  • Start with a tuple-independent prior, and condition
  • Introduces inclusion constraints
• Extensions to handle histograms
Conclusion
• Showed a principled, general model for database statistics based on MaxEnt
• Analytically solved syntactic classes of statistics
• Applications: query feedback and the cloud