Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes

Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1) • (1) University of Toronto • (2) IBM Toronto Lab • xhyu@cs.toronto.edu

Distinct value combinations • [Figure: example table containing 3 distinct value combinations] • COLSCARD (COlumn Set CARDinality) = 3 • The problem: estimating COLSCARD for a given set of attributes CIKM 2005

Motivation • Cardinality estimation for query optimization, e.g., • Estimating the size of • Estimating the size of the aggregation • Approximate query answering, e.g., COUNT queries • SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold FROM sales GROUP BY sales_date, sales_person

Roadmap • Related work • Estimation with known marginal distributions • Upper/lower bounds • An estimator • Estimation with histograms • Experimental results • Conclusions

Related work • Previous work has focused on the case of a single attribute. • [HÖT88], [HÖT89], [HNSS’95], [HS’98], [CCMN’00] • A sampling approach is used. • Estimation through sampling is difficult [CCMN’00] • No existing statistical information is exploited.

Our solution • Considering multiple attributes • Utilizing existing statistics on individual attributes • Readily available in most database systems • Does not require access to the data • Granularity of statistics • Exact marginal frequency distributions • Approximate distributions: histograms etc.

Estimation with known marginals • Number of distinct values in attribute $A_i$: $d_i$ • Frequency vector of $A_i$ (one frequency per distinct value)

The naïve estimator • COLSCARD = number of possible value combinations = $\prod_i d_i$, where $d_i$ is the number of distinct values in attribute $A_i$ • Sanity bound: COLSCARD cannot be greater than the table size $N$ • The problem: some value combinations with low occurrence probabilities may not appear in the table!
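
The naïve estimate with its sanity bound can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `naive_colscard`, the dictionary layout (value to frequency), and the sample frequencies are all assumptions made for the example; only the 3 × 2 × 2 shape follows the slides.

```python
from math import prod

def naive_colscard(marginals, table_size):
    """Product of per-attribute distinct counts, capped at the table size."""
    distinct_counts = [len(freqs) for freqs in marginals]
    return min(prod(distinct_counts), table_size)

# Hypothetical marginals (value -> frequency) for three attributes, N = 20.
marginals = [
    {"a": 17, "b": 2, "c": 1},
    {"x": 19, "y": 1},
    {"p": 18, "q": 2},
]
estimate = naive_colscard(marginals, table_size=20)
print(estimate)  # 12
```

With marginals this skewed, most of the 12 possible combinations are unlikely to occur in a 20-row table, which is the weakness the slide points out.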

Upper/Lower bounds • Trivial bounds • Upper bound: $\prod_i d_i$ (the naïve estimator) • Lower bound: $\max_i d_i$ • Tighter bounds? • In the case of two attributes, tighter bounds are available.
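
As a quick check of the trivial bounds, the sketch below computes them from a small hypothetical table and compares them with the true COLSCARD. The table contents and the helper name `trivial_bounds` are assumptions for illustration; the upper bound here also applies the table-size sanity bound.

```python
from math import prod

def trivial_bounds(rows):
    """Return (lower, upper) bounds on the number of distinct row combinations."""
    columns = list(zip(*rows))
    distinct_counts = [len(set(col)) for col in columns]
    # Every distinct value of an attribute occurs in at least one combination.
    lower = max(distinct_counts)
    # No more combinations than the cross product, and no more than the rows.
    upper = min(prod(distinct_counts), len(rows))
    return lower, upper

# Hypothetical two-attribute table with 5 rows.
rows = [("a", "d"), ("a", "e"), ("b", "d"), ("c", "f"), ("c", "f")]
colscard = len(set(rows))
lower, upper = trivial_bounds(rows)
print(lower, colscard, upper)  # 3 4 5
```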

Tighter bounds • [Figure: value/frequency tables for attributes A1 and A2 over the values a to f, N = 10] • Naïve bounds: lower 3, upper 9 • Tighter bounds: lower bound = 2+1+1 = 4, upper bound = 3+1+1 = 5

Expected number of combinations • Assumptions • The data distributions of individual columns are independent • The occurrence of each combination in the table is independent • Each element of f represents the frequency of a specific value combination. • An estimate of the probability of occurrence of the i-th combination: $p_i = f_i / N$, where $f_i$ is the i-th element of f
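
Under the column-independence assumption, the probability of a combination follows directly from the marginals. The notation below is an assumption introduced for this note, not taken from the slides: $m_j(v)$ is the frequency of value $v$ in attribute $A_j$, and $n$ is the number of attributes.

```latex
f(v_1,\dots,v_n) = N \prod_{j=1}^{n} \frac{m_j(v_j)}{N},
\qquad
p(v_1,\dots,v_n) = \frac{f(v_1,\dots,v_n)}{N} = \prod_{j=1}^{n} \frac{m_j(v_j)}{N}
```

Because each marginal sums to $N$, these probabilities sum to 1 over all $\prod_j d_j$ combinations.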

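Estimator • The probability of the i-th combination not appearing in a particular tuple is $1 - p_i$, where $p_i$ is the estimated occurrence probability of the i-th combination • The probability of the i-th combination not appearing in the table (of size N) is $(1 - p_i)^N$ • The expected number of value combinations is $\sum_i \left[ 1 - (1 - p_i)^N \right]$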
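
The estimator can be sketched as a direct enumeration over all value combinations. This is a minimal illustration under the two independence assumptions stated on the previous slide; the function name `expected_colscard`, the list-of-frequency-lists input, and the sample marginals are hypothetical.

```python
from itertools import product
from math import prod

def expected_colscard(marginals, table_size):
    """Expected number of distinct combinations.

    marginals: one list of value frequencies per attribute,
               each list summing to table_size.
    """
    expected = 0.0
    for freqs in product(*marginals):
        # Probability that a given tuple holds this combination.
        p = prod(f / table_size for f in freqs)
        # Probability that the combination appears at least once in N tuples.
        expected += 1.0 - (1.0 - p) ** table_size
    return expected

# Hypothetical marginals for two attributes, N = 10.
estimate = expected_colscard([[8, 1, 1], [9, 1]], table_size=10)
print(round(estimate, 2))
```

The loop visits every one of the $\prod_i d_i$ combinations, so this direct form is only practical when the product of the distinct counts is small.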

Example revisited • Estimate the COLSCARD for attribute set (A1, A2, A3), given [Figure: marginal frequency tables] • Naïve estimate: 3 × 2 × 2 = 12 • New estimate: 5.94

Estimation with histograms • Histograms exist on individual attributes • Two classes of histograms • Partition-based • End-biased • Marginals can be (approximately) reconstructed from histograms • Optimal histograms in each class?

Optimal histograms • Minimizing the error incurred by histograms • $\mathrm{ERR} = |\mathrm{EST}_{hist} - \mathrm{EST}_{exact}|$ • Partition-based histograms • A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.
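
For orientation, here is a sketch of the standard V-optimal dynamic program, which minimizes the within-bucket sum of squared errors (SSE) over contiguous buckets. It is not the authors' variant: their algorithm is only described as similar and targets the COLSCARD estimation error instead of SSE. The function name `v_optimal` and the sample frequencies are assumptions.

```python
def v_optimal(freqs, buckets):
    """Split freqs into `buckets` contiguous buckets with minimum total SSE.

    Returns (min_sse, [(start, end), ...]) with end exclusive.
    Assumes 1 <= buckets <= len(freqs).
    """
    n = len(freqs)
    # Prefix sums of values and squares give each bucket's SSE in O(1).
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        s[i + 1] = s[i] + f
        q[i + 1] = q[i] + f * f

    def sse(i, j):
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[b][j]: best SSE covering freqs[:j] with b buckets.
    cost = [[inf] * (n + 1) for _ in range(buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j] = c
                    cut[b][j] = i

    # Walk the recorded cut points backwards to recover the buckets.
    bounds = []
    j = n
    for b in range(buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    return cost[buckets][n], bounds[::-1]

print(v_optimal([1, 1, 5, 5], 2))  # (0.0, [(0, 2), (2, 4)])
```

Replacing `sse` with a cost tied to the estimation error is where an adaptation to COLSCARD would differ; the slides do not give that cost function, so it is left out here.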

Optimal end-biased histograms • An end-biased histogram with B buckets stores • The exact frequencies of B-1 attribute values • The average frequency of the remaining values • Which B-1 values to store exactly? • Most widely used end-biased histograms store the frequencies of the most frequent values • Not always optimal for COLSCARD estimation!!
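
The end-biased layout described above, and the way an approximate marginal is rebuilt from it, can be sketched as follows. The function names and the tuple representation of the histogram are assumptions made for this example.

```python
def build_end_biased(freqs, exact_idx):
    """Keep the frequencies at exact_idx; summarize the rest by their average."""
    exact = {i: freqs[i] for i in exact_idx}
    rest = [f for i, f in enumerate(freqs) if i not in exact]
    avg = sum(rest) / len(rest) if rest else 0.0
    return exact, avg, len(freqs)

def reconstruct(hist):
    """Rebuild an approximate frequency vector from the histogram."""
    exact, avg, num_values = hist
    return [exact.get(i, avg) for i in range(num_values)]

# Hypothetical marginal with 4 values; B = 2 buckets stores 1 exact frequency.
freqs = [6, 2, 1, 1]
approx = reconstruct(build_end_biased(freqs, exact_idx=[0]))
print(approx)
```

Averaging preserves the total frequency, so the reconstructed marginal still sums to N.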

Example • N = 10, attributes (A1, A2) • Choose 1 frequency to store exactly • [Table: estimation error for each choice]

Optimal end-biased histograms • Exhaustive search over all choices of B-1 values out of $d_j$ takes time proportional to $\binom{d_j}{B-1}$ • We prove that the optimal choice is one of the following • Most frequent values • Least frequent values • A combination of most frequent and least frequent values • Only need to search both ends • Cost is linear in B, independent of $d_j$!
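
The "search both ends" idea can be sketched by enumerating the B candidate splits (k most frequent plus B-1-k least frequent values) and keeping the one with the smallest ERR. This is an illustration, not the authors' procedure: all function names and sample data are hypothetical, and each candidate is scored here by re-running the full estimator, so only the number of candidates is linear in B.

```python
from itertools import product
from math import prod

def expected_colscard(marginals, n):
    """Expected number of distinct combinations under independence."""
    return sum(1.0 - (1.0 - prod(f / n for f in combo)) ** n
               for combo in product(*marginals))

def end_biased(freqs, exact_idx):
    """Approximate marginal: exact at exact_idx, average elsewhere."""
    exact_idx = set(exact_idx)
    rest = [f for i, f in enumerate(freqs) if i not in exact_idx]
    avg = sum(rest) / len(rest) if rest else 0.0
    return [f if i in exact_idx else avg for i, f in enumerate(freqs)]

def best_end_biased(freqs, other_marginals, n, buckets):
    """Try the k most frequent plus (B-1-k) least frequent values, k = 0..B-1.

    Assumes buckets - 1 <= len(freqs). Returns (error, chosen indices).
    """
    order = sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)
    exact_est = expected_colscard([freqs] + other_marginals, n)
    keep = buckets - 1
    best = None
    for k in range(keep + 1):
        tail = order[len(order) - (keep - k):] if keep - k else []
        chosen = order[:k] + tail
        approx = end_biased(freqs, chosen)
        err = abs(expected_colscard([approx] + other_marginals, n) - exact_est)
        if best is None or err < best[0]:
            best = (err, sorted(chosen))
    return best

# Hypothetical data: N = 10, two attributes, B = 2 buckets on the first one.
err, chosen = best_end_biased([6, 2, 1, 1], [[9, 1]], n=10, buckets=2)
print(err, chosen)
```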

Experiments – Data sets • Synthetic data • Skew: Zipfian parameter z = 0 (uniform) to 4 (highly skewed) • Number of tuples: 10K to 1M • Real data • Cover Type: 581,012 tuples, 10 attributes • Census Income: 32,561 tuples, 14 attributes • Error measure: ratio error • $\mathrm{ERR} = \max\{\mathrm{true}/\mathrm{est} - 1,\ \mathrm{est}/\mathrm{true} - 1\}$
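
The ratio error is symmetric in over- and underestimation, which a short sketch makes concrete. The function name `ratio_error` and the sample numbers are assumptions for illustration, not results from the experiments.

```python
def ratio_error(true_value, estimate):
    """max{true/est - 1, est/true - 1}; 0 means a perfect estimate."""
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)

# Overestimating by 2x and underestimating by 2x give the same error.
print(ratio_error(6, 12))  # 1.0
print(ratio_error(12, 6))  # 1.0
print(ratio_error(5, 5))   # 0.0
```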

Effect of data skew • N = 100K, $d_i$ = 1K

Effect of number of tuples

Results on real data • Cover Type: 45 attribute pairs • Census Income: 91 attribute pairs

Accuracy of end-biased histograms • Results on the “capital-gain” attribute of the Census Income data set

Conclusions • Utilizing existing knowledge maintained in database systems • Proposed upper/lower bounds as well as an estimator • Considered two cases • Exact marginal frequencies • Histograms: optimal histograms • Experimental results show the effectiveness of the proposed method
