260 likes | 407 Views
Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes. Xiaohui Yu 1 , Calisto Zuzarte 2 , Ken Sevcik 1 1 University of Toronto 2 IBM Toronto Lab xhyu@cs.toronto.edu. Distinct value combinations. 1. 2. 3. 3 distinct value combinations.
E N D
Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes Xiaohui Yu1, Calisto Zuzarte2, Ken Sevcik1 1University of Toronto 2IBM Toronto Lab xhyu@cs.toronto.edu
Distinct value combinations 1 2 3 3 distinct value combinations COLSCARD (COlumn Set CARDinality) = 3 The problem: estimating COLSCARD for a given set of attributes CIKM 2005
Motivation • Cardinality estimation for query optimization, e.g., • Estimating the size of • Estimating the size of the aggregation • Approximate query answering, e.g., COUNT queries • SELECT sales_date, sales_person, • SUM(sales_quantity) AS unit_sold • FROM sales • GROUP BY sales_date, sales_person CIKM 2005
Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions CIKM 2005
Related work • Previous work has focused on the case of single attribute. • [HÖT88],[HÖT89],[HNSS’95],[HS’98],[CCMN’00] • Sampling approach is used. • Estimation through sampling is difficult [CCMN’00] • No existing statistical information is exploited. CIKM 2005
Our solution • Considering multiple-attributes • Utilizing existing statistics on individual attributes • Readily available in most database systems • Does not require access to the data • Granularity of statistics • Exact marginal frequency distributions • Approximate distributions: histograms etc. CIKM 2005
Estimation with known marginals • Number of distinct values in attribute Ai, • frequency vector CIKM 2005
The naïve estimator COLSCARD = Number of possible value combinations di: the number of distinct values in attribute Ai Sanity bound: COLSCARD cannot be greater than the table size The problem: Some value combinations with low occurrence probabilities may not appear in the table! CIKM 2005
Upper/Lower bounds • Trivial bounds • Upper bound: (the naïve estimator) • Lower bound: • Tighter bounds? • In the case of two attributes, tighter bounds are available. CIKM 2005
Tighter bounds A2 A1 N = 10 value freq value freq d a 1 b e 1 f c [2, 3] Naïve bounds: 3, 9 Lower bound = 2+1+1 = 4 Upper bound = 3+1+1 = 5 CIKM 2005
Expected number of combinations • Assumptions • The data distributions of individual columns are independent • The occurrence of each combination in the table is independent • Each element of f represents the frequency of a specific value combination. • An estimate of the probability of occurrence CIKM 2005
Estimator • The probability of the i-th combination not appearing in a particular tupleis • The probability of the i-th combination not appearing in the table (of size N) is • The expected number of value combinations is CIKM 2005
Example revisited • Estimate the COLSCARD for attribute set (A1, A2, A3), given Naïve estimate: 3*2*2 = 12 New estimate: 5.94 CIKM 2005
Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions CIKM 2005
Estimation with histograms • Histograms exist on individual attributes • Two classes of histograms • Partition-based • End-biased • Marginals can be (approximately) reconstructed from histograms • Optimal histograms in each class? CIKM 2005
Optimal histograms • Minimizing the error incurred by histograms • ERR = |ESThist– ESTexact| • Partition-based histograms • A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used. CIKM 2005
Optimal end-biased histograms • An end-biased histogram with B buckets stores • The exact frequencies of B-1 attribute values • The average of the remaining values • Which B-1 values to store exactly? • Most widely used end-biased histograms store the frequencies of the most frequent values • Not always optimal for COLSCARD estimation!! CIKM 2005
Example N=10 Attributes (A1, A2) Choose 1 frequency to store exactly Error table CIKM 2005
Optimal end-biased histograms • Exhaustive search takes time proportional to • We prove that the optimal choices can be one of the following • Most frequent values • Least frequent values • A combination of most frequent and least frequent values • Only need to search both ends • Cost is linear in B, independent of dj! CIKM 2005
Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions CIKM 2005
Experiments – Data sets • Synthetic data • Skew: Zipfian parameter z=0 (uniform) to 4 (highly skewed) • Number of tuples: 10K to 1M • Real data • Cover Type: 581,012 tuples, 10 attributes • Census Income: 32,561 tuples, 14 attributes • Error measure: ratio error • ERR = max{true/est-1, est/true-1} CIKM 2005
Effect of data skew N=100K di=1k CIKM 2005
Effect of number of tuples CIKM 2005
Results on real data 45 pairs 91 pairs CIKM 2005
Accuracy of end-biased histograms Results on the “capital-gain” attribute of Census Income data set CIKM 2005
Conclusions • Utilizing existing knowledge maintained in database systems • Proposed upper/lower bounds as well as an estimator • Considered two cases • exact marginal frequencies • Histograms: optimal histograms • Experimental results show the effectiveness of the proposed method CIKM 2005