General Database Statistics Using Maximum Entropy

Raghav Kaushik (1), Christopher Ré (2), and Dan Suciu (3)
(1) Microsoft Research, (2) University of Wisconsin-Madison, (3) University of Washington

Study Cardinality Estimation
1. Model: information that the optimizer knows
2. Prediction: use the model to estimate the cardinality of future queries
Propose a declarative language with statistical assertions:
“We estimate that the distinct # of Employees is 10”
Contribution: a principled, declarative approach to cardinality estimation based on entropy maximization.

Motivating Applications
1. Incorporate query feedback records
   - Underutilized: no general-purpose mechanism
2. Optimizers for new domains (DB Kit 2.0)
   • Cloud computing, information extraction
3. Data generation and description

Outline
• Statistical programs and desiderata
• Semantics of statistical programs
• Two examples
• Conclusions

Statistical Assertions
An assertion is a conjunctive query (CQ) view plus a sharp (#) statement:
V(x) :- R(x,y), ….   #V = 106
V1(x) :- R(x,-)   #V1 = 20   “The number of values in the output of V1 is 20”
V2(y) :- R(-,y), S(y)   #V2 = 50   “The number of values in the output of V2 is 50”
A program is a set of assertions.
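To make the notation concrete, here is a minimal sketch that evaluates the two views on one small, made-up instance. The contents of R and S and the function names are assumptions chosen for illustration; they do not come from the slides.

```python
# Hypothetical toy instance: R is a set of (x, y) pairs, S is a set of y values.
R = {(1, "a"), (1, "b"), (2, "b"), (3, "c")}
S = {"b", "c", "d"}

def v1(R):
    """V1(x) :- R(x,-): the distinct first components of R."""
    return {x for (x, _) in R}

def v2(R, S):
    """V2(y) :- R(-,y), S(y): second components of R that also occur in S."""
    return {y for (_, y) in R if y in S}

# The quantity an assertion constrains is the number of tuples in the view output.
count_v1 = len(v1(R))
count_v2 = len(v2(R, S))
print(count_v1, count_v2)
```

On a single instance these counts are exact numbers; a statistical assertion such as #V1 = 20 instead constrains the count over the unknown database.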

Model as a Probabilistic Database
V(x) :- R(x,y), ….   #V = 106
Intuitively, # is “expected value”.
V1(x) :- R(x,-)   #V1 = 20   “The number of values in the output of V1 is 20”
A model is a probabilistic database such that the expected number of tuples in V1 is 20.
OK, but which probabilistic database?

Desiderata for our solution
V(x) :- R(x,y), ….   #V = 106
Two desiderata for the distribution:
(D1): Should agree with provided statistics
(D2): Should assume nothing else
Approach: maximize entropy subject to D1
Challenge: compute the parameters of the MaxEnt distribution
Technical desideratum: want the parameters analytically


Notation for Probabilistic Databases
• Consider a domain D of size n.
• Fix a schema R = R1, R2, …
• Let Inst(n) = all instances over R on D
• An element I of Inst(n) is called a world
• A probabilistic database is a pair (Inst(n), p): essentially, any discrete probability distribution on relations
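The notation can be illustrated by brute force for a tiny case. This sketch assumes a schema with a single binary relation R and the domain {0, …, n-1}; the helper name `worlds` and the uniform distribution are illustrative choices, not something the slides specify.

```python
from itertools import combinations, product

def worlds(n):
    """Yield every instance (world) of one binary relation R over the domain {0..n-1}."""
    tuples = list(product(range(n), repeat=2))  # all n^2 possible tuples
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

# Inst(2) for this schema: every subset of the 4 possible tuples.
inst2 = list(worlds(2))

# One possible probabilistic database (Inst(2), p): the uniform distribution.
uniform = {I: 1 / len(inst2) for I in inst2}
print(len(inst2))
```

Any other assignment of non-negative weights summing to 1 over `inst2` would be an equally valid probabilistic database, which is why a selection principle is needed.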

Achieving (D1): Stats must agree
The semantics of #
• # means “expected value”
• V1(x) :- R(x,-)   #V1 = 20   “The number of values in the output of V1 is 20”
• NB: In truth, we let n tend to infinity, and settle for asymptotic equality.

Achieving (D1): Stats must agree
Multiple Views
• Given V1, V2, … with #Vi = di for i=1,…,t
• If p satisfies #Vi = di for every i, we have achieved (D1): should agree with provided statistics
• Many such distributions exist. How do we pick one?
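Written out, the reading of # as an expected value gives one constraint per view. The following is a reconstruction from the definitions on the surrounding slides (the pair (Inst(n), p) and #Vi = di), not a copy of the slide's own formula:

```latex
\[
  \mathbf{E}_p\bigl[\,|V_i|\,\bigr]
  \;=\; \sum_{I \in \mathrm{Inst}(n)} p(I)\,\lvert V_i(I)\rvert
  \;=\; d_i ,
  \qquad i = 1,\dots,t .
\]
```

Here |Vi(I)| denotes the number of tuples that view Vi returns on world I.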


Achieving (D2): No ad-hoc assumptions
Selecting the best one
• Maximize entropy subject to the constraints from (D1).
• One can show that p has a closed form in which Z is a normalizing constant and ai is a positive parameter for i=1,..,t.
• NB: p is only a function of the stats, and so we have achieved (D2).
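For reference, the optimization problem and the shape of its solution can be sketched as follows. This is the standard exponential-family form for entropy maximization under expectation constraints, written with the slide's symbols Z and ai (shown as α_i); treat it as a reconstruction rather than the slide's exact formula.

```latex
\[
  \max_{p}\; H(p) = -\sum_{I \in \mathrm{Inst}(n)} p(I)\,\log p(I)
  \quad\text{subject to}\quad
  \sum_{I} p(I) = 1,
  \qquad
  \sum_{I} p(I)\,\lvert V_i(I)\rvert = d_i \;\; (i = 1,\dots,t).
\]
\[
  p(I) \;=\; \frac{1}{Z}\,\prod_{i=1}^{t} \alpha_i^{\lvert V_i(I)\rvert},
  \qquad
  Z \;=\; \sum_{I \in \mathrm{Inst}(n)} \prod_{i=1}^{t} \alpha_i^{\lvert V_i(I)\rvert}.
\]
```

The world I enters only through the view counts |Vi(I)|, which is the sense in which p depends on nothing but the statistics.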

Benefits of MaxEnt
• Every (consistent) statistical program induces a well-defined distribution
• Every query has a well-defined cardinality estimate
• Statistics are treated as a whole, not as individual stats
• Can add new statistics to our heart’s content
Technical challenge: compute the parameters ai analytically

Two quick Examples
I: A material random graph
• Even simple EM solutions have interesting theory
II: Intersection models
• Generating function, and
• Different, analytic technique


Example I: Random Graphs are EM
• Random graph: add edges independently at random
• V(x,y) :- R(x,y)   #V = d
• By linearity, E[V] = x n^2 = d, where x is the probability of each of the n^2 possible edges
• This is MaxEnt… write:
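The claim can be checked by brute force on a tiny domain. The sketch below assumes the MaxEnt distribution has the form p(I) = a^|I| / Z with a single parameter a, and that a equals the odds x / (1 - x) of an edge; the values n = 2 and d = 1 and all variable names are illustrative.

```python
from itertools import combinations, product
from math import isclose

def worlds(n):
    """Yield every directed graph (set of edges) over the domain {0..n-1}."""
    tuples = list(product(range(n), repeat=2))
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

n, d = 2, 1.0            # hypothetical: domain of size 2, expected number of edges 1
x = d / n**2             # edge probability, from E[V] = x * n^2 = d
alpha = x / (1 - x)      # assumed MaxEnt parameter: the odds of an edge

all_worlds = list(worlds(n))

# MaxEnt form: each world is weighted by alpha^(number of edges).
Z = sum(alpha ** len(I) for I in all_worlds)
p_maxent = {I: alpha ** len(I) / Z for I in all_worlds}

# Random graph: every edge is present independently with probability x.
p_indep = {I: x ** len(I) * (1 - x) ** (n**2 - len(I)) for I in all_worlds}

expected_edges = sum(p * len(I) for I, p in p_maxent.items())
same = all(isclose(p_maxent[I], p_indep[I]) for I in all_worlds)
print(same, expected_edges)
```

The normalizing constant factorizes as Z = (1 + alpha)^(n^2), one factor per possible edge, which is why the two distributions coincide.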

Example II: an intersection model
• V(x) :- R1(x), R2(x)
• #R1 = d1, #R2 = d2, #V = d3
• Read: each element is either in R1, R2, or all three
• e.g., the term with x1^k is an instance where k distinct values are in R1
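A small numeric sketch of this example, under stated assumptions: the generating function is taken to be Z = (1 + a1 + a2 + a1*a2*a3)^n, one factor per domain element (in neither relation, in R1 only, in R2 only, or in both and hence in V), and the parameters are solved from per-element probabilities. The targets d1, d2, d3, the domain size, and all names are made up for illustration, and the slide's own formula may be written differently.

```python
from itertools import product
from math import isclose

n = 3
d1, d2, d3 = 1.5, 1.2, 0.9   # hypothetical targets for #R1, #R2, #V

# Per-element probabilities implied by the targets.
q_both = d3 / n              # element in R1 and R2 (so also in V)
q1 = (d1 - d3) / n           # element in R1 only
q2 = (d2 - d3) / n           # element in R2 only
q0 = 1 - q1 - q2 - q_both    # element in neither relation

# Parameters of the assumed MaxEnt form a1^|R1| * a2^|R2| * a3^|V| / Z.
a1 = q1 / q0
a2 = q2 / q0
a3 = q_both * q0 / (q1 * q2)

# Generating function: one factor (1 + a1 + a2 + a1*a2*a3) per element.
Z = (1 + a1 + a2 + a1 * a2 * a3) ** n

# Brute-force check over all worlds: each element is a pair (in R1?, in R2?).
states = [(0, 0), (1, 0), (0, 1), (1, 1)]
e1 = e2 = e3 = z_check = 0.0
for world in product(states, repeat=n):
    r1 = sum(s[0] for s in world)
    r2 = sum(s[1] for s in world)
    v = sum(s[0] * s[1] for s in world)
    w = a1 ** r1 * a2 ** r2 * a3 ** v   # unnormalized weight of this world
    z_check += w
    e1 += w * r1
    e2 += w * r2
    e3 += w * v
e1, e2, e3 = e1 / z_check, e2 / z_check, e3 / z_check
print(e1, e2, e3)
```

In this symmetric case the expectations match the targets exactly for finite n; the slides only require asymptotic agreement as n grows.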

Results in the paper
• Normal form for statistical programs
• Syntactic classes that we can solve analytically
  • “Project-Semijoin” queries (previous slide)
• A general technique, conditioning:
  • Start with a tuple-independent prior, and condition
  • Introduces inclusion constraints
• Extensions to handle histograms

Conclusion
• Showed a principled, general model for database statistics based on MaxEnt
• Analytically solved syntactic classes of statistics
• Applications: query feedback and the cloud
