1 / 26

Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes

Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes. Xiaohui Yu 1 , Calisto Zuzarte 2 , Ken Sevcik 1 1 University of Toronto 2 IBM Toronto Lab xhyu@cs.toronto.edu. Distinct value combinations. 1. 2. 3. 3 distinct value combinations.

porter
Download Presentation

Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes Xiaohui Yu1, Calisto Zuzarte2, Ken Sevcik1 1University of Toronto 2IBM Toronto Lab xhyu@cs.toronto.edu

  2. Distinct value combinations 1 2 3 3 distinct value combinations COLSCARD (COlumn Set CARDinality) = 3 The problem: estimating COLSCARD for a given set of attributes CIKM 2005

  3. Motivation • Cardinality estimation for query optimization, e.g., • Estimating the size of • Estimating the size of the aggregation • Approximate query answering, e.g., COUNT queries • SELECT sales_date, sales_person, • SUM(sales_quantity) AS unit_sold • FROM sales • GROUP BY sales_date, sales_person CIKM 2005

  4. Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions CIKM 2005

  5. Related work • Previous work has focused on the case of single attribute. • [HÖT88],[HÖT89],[HNSS’95],[HS’98],[CCMN’00] • Sampling approach is used. • Estimation through sampling is difficult [CCMN’00] • No existing statistical information is exploited. CIKM 2005

  6. Our solution • Considering multiple-attributes • Utilizing existing statistics on individual attributes • Readily available in most database systems • Does not require access to the data • Granularity of statistics • Exact marginal frequency distributions • Approximate distributions: histograms etc. CIKM 2005

  7. Estimation with known marginals • Number of distinct values in attribute Ai, • frequency vector CIKM 2005

  8. The naïve estimator COLSCARD = Number of possible value combinations di: the number of distinct values in attribute Ai Sanity bound: COLSCARD cannot be greater than the table size The problem: Some value combinations with low occurrence probabilities may not appear in the table! CIKM 2005

  9. Upper/Lower bounds • Trivial bounds • Upper bound: (the naïve estimator) • Lower bound: • Tighter bounds? • In the case of two attributes, tighter bounds are available. CIKM 2005

  10. Tighter bounds A2 A1 N = 10 value freq value freq d a 1 b e 1 f c [2, 3] Naïve bounds: 3, 9 Lower bound = 2+1+1 = 4 Upper bound = 3+1+1 = 5 CIKM 2005

  11. Expected number of combinations • Assumptions • The data distributions of individual columns are independent • The occurrence of each combination in the table is independent • Each element of f represents the frequency of a specific value combination. • An estimate of the probability of occurrence CIKM 2005

  12. Estimator • The probability of the i-th combination not appearing in a particular tupleis • The probability of the i-th combination not appearing in the table (of size N) is • The expected number of value combinations is CIKM 2005

  13. Example revisited • Estimate the COLSCARD for attribute set (A1, A2, A3), given Naïve estimate: 3*2*2 = 12 New estimate: 5.94 CIKM 2005

  14. Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions CIKM 2005

  15. Estimation with histograms • Histograms exist on individual attributes • Two classes of histograms • Partition-based • End-biased • Marginals can be (approximately) reconstructed from histograms • Optimal histograms in each class? CIKM 2005

  16. Optimal histograms • Minimizing the error incurred by histograms • ERR = |ESThist– ESTexact| • Partition-based histograms • A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used. CIKM 2005

  17. Optimal end-biased histograms • An end-biased histogram with B buckets stores • The exact frequencies of B-1 attribute values • The average of the remaining values • Which B-1 values to store exactly? • Most widely used end-biased histograms store the frequencies of the most frequent values • Not always optimal for COLSCARD estimation!! CIKM 2005

  18. Example N=10 Attributes (A1, A2) Choose 1 frequency to store exactly Error table CIKM 2005

  19. Optimal end-biased histograms • Exhaustive search takes time proportional to • We prove that the optimal choices can be one of the following • Most frequent values • Least frequent values • A combination of most frequent and least frequent values • Only need to search both ends • Cost is linear in B, independent of dj! CIKM 2005

  20. Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions CIKM 2005

  21. Experiments – Data sets • Synthetic data • Skew: Zipfian parameter z=0 (uniform) to 4 (highly skewed) • Number of tuples: 10K to 1M • Real data • Cover Type: 581,012 tuples, 10 attributes • Census Income: 32,561 tuples, 14 attributes • Error measure: ratio error • ERR = max{true/est-1, est/true-1} CIKM 2005

  22. Effect of data skew N=100K di=1k CIKM 2005

  23. Effect of number of tuples CIKM 2005

  24. Results on real data 45 pairs 91 pairs CIKM 2005

  25. Accuracy of end-biased histograms Results on the “capital-gain” attribute of Census Income data set CIKM 2005

  26. Conclusions • Utilizing existing knowledge maintained in database systems • Proposed upper/lower bounds as well as an estimator • Considered two cases • exact marginal frequencies • Histograms: optimal histograms • Experimental results show the effectiveness of the proposed method CIKM 2005

More Related