New Sampling-Based Estimators for OLAP Queries

New Sampling-Based Estimators for OLAP Queries Ruoming Jin, Kent State University Leo Glimcher, The Ohio State University Chris Jermaine, University of Florida Gagan Agrawal, The Ohio State University

Approximate Query Processing • AQP is an active area of DM research • The goal is to provide accurate estimation of queries without access the entire databases • Especially useful and important for data warehouse and OLAP • Consider you have a total of 10,000 disks, each with 200GB (2PB) • Takes 1 hour to scan • Answering a single, simple aggregate query may need an hour • Unacceptable to analysts/end-users • If each disk cost $1000 year to maintain • One simple query can cost • $1572=10,000  $1000/ (365  24) • inhibitive cost

OLAP Queries • Querying the Large Relational Tables composed of • Dimensional Attributes • Categorical data (Most) • Sex, Country, State, City, Product Code, Department, Color, … • Measure Attributes • Numerical data • Salary, Sales, Price, Number of Complaints, … • Aggregate Queries • Most AQP tailored to numerical data • Wavelets, kernels, histograms • Problematic for categorical data and high-dimensionality • Random Sampling • Well studied in statistical theory • Can handle high-dimension category data • Provide estimates of the query results as well as the estimate accuracy

Confidence Interval • The measure for accuracy • COMPLAINTS(PROF, SEMESTER, NUM_COMPLAINTS) • SELECT SUM (NUM_COMPLAINTS) FROM COMPLAINS WHERE PROF = ‘Smith’ AND SEMESTER = ‘Fa03’ • A Confidence Bound: • With a probability of .95, Prof. Smith received 27 to 29 complaints in the Fall of 2003 Accuracy level Interval width =2

How to estimate the confidence interval? • Uniform Sampling • Central limit theorem (CLT) • Delta Methods Assuming the distribution of an estimator ŷ of an aggregate query result y is approximately normal with mean E(ŷ), and variance V(ŷ) for a large sample, an approximate 95% confidence interval for the estimator is given by [ŷ-1.96SE(ŷ), ŷ+1.96SE(ŷ)] where 1.96 is the 0.975th percentile of the standard normal distribution, and SE(ŷ) is the standard error (the square root of the variance V(ŷ) ). Accuracy level Interval width = 3.92SE(ŷ)

How to (cont’d) • Unequal Probability Sampling • Stratified Sampling • Separate Samples for Each Measure (Numerical) Attribute • Re-Sampling • Bootstrapping • Computational Intensive • Distribution-free • Chebyshev and Hoeffding bound • Loose bound

Problem studied in this presentation • How to provide an accurate confidence interval together with an estimation? • Boosting the accuracy level • Reducing the interval width • Key idea: Ensemble Estimates • Find multiple (unbiased) estimators for each OLAP query • Linearly combine the individual estimators and derive the optimal coefficients to reduce the global variance • Handle the correlation among the individual estimators

Example • Database describing student complaints

Example • We sample the database…

Example • And ask: How many complaints for Smith? Est: (21+7+8)/8×16=72; Answer: 121

Why So Bad? • We missed two important records Oops!

How we know something went wrong? • What if we know the total complaints of the entire table: SUM(NUM_COMPLAINTS) • Compare with the estimated total complaints of the entire table Est: (2+21+1+7+8+4+3+0)/8 × 16 = 92, Answer: 148 • One of the key ideas in the APA approach • Pre-aggregation of the low-dimensional aggregates • 0-dimensional fact: SUM(NUM_COMPLAINTS) =148 • 1-dimensional fact, for example, on SEMESTER SELECT SUM(NUM_COMPLAINTS) FROM COMPLAINTS GROUP-BY SEMESTER • Or higher, depending on the cost of such pre-aggregation • In our example, assuming only the 0-dimensional fact is know!

How we can pull ourselves out? • APA use Maximal Likelihood Estimation (MLE) • Break data space based on relational selection predicates; 2m Quadrants • Compute aggregate for each quadrant • Characterize the error of the estimates using normal PDF (justification: CLT) • Pretend estimates are independent • Adjust the means to max likelihood • Subject to known facts about the data • Shows to be very accurate in various datasets, significantly better than plain sampling and stratified sampling • In our example, the New Estimation is 136.3 (answer was 121, the original estimation is 72) • However, loss of analytic guarantees on accuracy!

Let us go back to the plain sampling • For the query: How many complaints for Smith? Est: (21+7+8)/8×16=72 (Answer: 121); The standard error (SE) is 68. 2 [ŷ-1.96SE(ŷ), ŷ+1.96SE(ŷ)]

New Estimator: The Negative One • To answer the query: How many complaints for Smith? (Answer:121) • We first ask: How many complaints NOT for Smith? Est: (2+1+4+3+0)/8×16=20, The Negative Estimator: 148-20=128, Standard Error (SE) = 13.4

How two is always better than one: The Ensemble Estimator • Linearly combining the direct (positive) estimator and the negative estimator • Estnew = αEstdirect + (1- α ) Estnegative (0 α 1) • Note since both the direct estimators and negative estimators are unbiased estimators, the ensemble estimator is also unbiased. • Choose the parameter α to minimize the variance the ensemble estimator • The ensemble estimator always is always more accurate • If the individual estimators are independent, the optimal value of the parameter α is V(Estdirect)/(V(Estdirect)+V(Estnegative )) • In our example, α=0.0373, Estnew=125.95, Standard Error (SE) = 13.1

What if we have higher-dimensional facts? • Image we have the relational table EMPLOYEE(NAME, SEX,DEPARTMENT,JOB_TYPE, SALARY) • Query: SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ • Pre-Aggregation • 1-dimesional facts

More negative estimators SEX ‘M’ AND DEPARTMENT ‘ACCOUNT’ AND JOB_TYPE ‘SUPERVISOR’ b1^b2^b3, or b1^b2^b3 b1^b2^b3 b1^b2^b3 DEPARTMENT SEX SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ b1 b2 b3 JOB_TYPE

More negative estimators SEX ‘M’ AND DEPARTMENT ‘ACCOUNT’ AND JOB_TYPE ‘SUPERVISOR’ b1^b2^b3, or b1^b2^b3 b1^b2^b3 b1^b2^b3 b1^b2^b3 b1^b2^b3 DEPARTMENT SEX SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ b1 b2 b3 JOB_TYPE

More negative estimators (cont’d) SEX ‘M’ AND DEPARTMENT ‘ACCOUNT’ AND JOB_TYPE ‘SUPERVISOR’ b1^b2^b3, or b1^b2^b3 b1^b2^b3 b1^b2^b3 b1^b2^b3 b1^b2^b3 DEPARTMENT SEX SELECT SUM (SALARY) FROM EMPLOYEE WHERE SEX=‘M’ AND DEPARTMENT=‘ACCOUNT’ AND JOB_TYPE=‘SUPERVISOR’ b1 b2 b3 JOB_TYPE

Combining Positive and Negative Estimators in APA1+ • We will have multiple negative estimators • Estnew = α0 Estdirect + α1 Estnegative1 +α2 Estnegative2 +… 0 αi  1, α0+ α1+ α2+… = 1 • Decompose the negative estimators into the cell representations • Each cell in the cube correspond to a direct estimation • The variance of the cell can be estimated • We can use Lagrange multipliers to optimize all the parameters (αi) • We assume the direct estimations for each cell is independent • This procedure usually involve a linear solver for a linear equation

Actually, the estimators are correlated • Fortunately, we are able to capture such correlation analytically • If each individual estimator is approximately normal, and they are independent, the combined estimator is also approximately normal • However, the correction effect results in a slightly different distribution • Analytically very close to the spherically symmetric distribution, of which normal distribution is a special case. • Empirically, it shows strong tendency to normal • We use normal distribution to derive the confidence interval

Empirical Distribution of the Ensemble Estimators Empirical distribution of APA0+ Empirical distribution of APA1+

Experimental Evaluation • Four datasets • Forest Cover data (from UCI KDD archive) • River Flow data • William Shakespeare data • Image Feature vector • Approximation techniques • Simple Random Sampling • Stratified Sampling • APA0+ • APA1+ • Queries • 2000 queries for each dataset

Measure the estimated confidence interval • We generate 95% confidence intervals of all estimation techniques for each query • Accuracy level • What are the real chances the correct answers actually fall in the confidence intervals? • Interval width • How tight are the bounds of the confidence intervals?

How good are the new estimators? • Accuracy of the confidence intervals (Expected: 95%) • APA1+ average around 90%, which was 23.2% higher than simple random sampling (the next best alternative in terms of accuracy) • The accuracy of APA0+, random sampling, and stratified sample are comparable, all less than 70% in average • Confidence interval width • The width of the confidence interval produced by APA1+ is only 1/2 the size of one from random sampling • Compared with stratified sampling, APA1+ is at least 20% smaller • The width of the confidence interval produced by APA0+ is around 15% smaller than random sampling

Discussion • Overall, the new estimators work pretty well! • It’s very simple! • Significantly better than the random sampling • Significantly better than the stratified sampling • APA1+ is the only estimator which provides the confidence interval close to the theoretically expected accuracy and with much smaller width! • Suitable for both categorical, numerical data • APA0+, and APA1 unaffected by high dimensions! • Future work • How to apply this idea to work with more complicated aggregation functions?

Thanks!!

Roadmap • Approximate Query Processing and Confidence Interval • Motivating Example • Generalization and Handling Correlation • Experimental Results • Conclusions • Inspired by • Chris’s original APA approach (how to find multiple estimators) • Ensemble Classifiers in Statistical Learning

New Sampling-Based Estimators for OLAP Queries

New Sampling-Based Estimators for OLAP Queries

Presentation Transcript

Overcoming Limitations of Sampling for Aggregation Queries

Preference Queries from OLAP and Data Mining Perspective

New Sampling-Based Summary Statistics for Improving Approximate Query Answers

Sampling-Based Planners

Overcoming Limitations of Sampling for Agrregation Queries

Insight gaining from OLAP queries via data movies

Section 6-4 Sampling Distributions and Estimators

Observers/Estimators

Aggregate Queries in Peer-to-Peer OLAP

Evaluating XML-Extended OLAP Queries Based on a Physical Algebra

NEW ESTIMATORS FOR THE POPULATION MEDIAN IN SIMPLE RANDOM SAMPLING

Location-based Spatial Queries

Section 6-4 Sampling Distributions and Estimators

Distance-Variable Estimators for Sampling and Change Measurement

Sampling Distributions of Estimators of Parameters

A New OLAP Aggregation Based on the AHC Technique

Sampling for Part Based Object Models

Preference Queries from OLAP and Data Mining Perspective

OLAP Queries