A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : SurajidChaudhuri Gautam Das VivekNarasayya Presented by :SayedMuchallil September 21st, 2010

CONTENTS • INTRODUCTION • ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING • FIXED WORKLOAD • STRATIFIED SAMPLING • SOLUTION • SUMMARY

Pre-computed samples • Can give approximate answer very efficiently. • Workload are used to make sure that errors are acceptable.

Previous Studies • Solution is difficult to evaluate theoretically. • Do not formally deal with uncertainty in the expected workload. • Ignoring the variance in the data distribution.

Sample • Only 50% of R records can be used as sample • Query : “SELECT SUM(Revenue) FROM R” • The answer for is 1030 Table R

Sample (cont.) • The answer for the query for table S1 is 40. • The answer for the query for table S2is 2020. • How to get these answer? Sample Table S1 Sample Table S2

Sample (cont.) • large variance in the aggregate column can lead to large relative errors. • Relative error =|y- y’| / y • Relative error for S1 = |1030 – 40| / 1030 • Relative error for S2= |1030 – 2020| / 1030

What’s New ? • The goal is to pick sample that minimize error. • If actual workload is identical to the given workload (fixed), error will be smaller. • Can work for identical and similar query to the given workload.

Sampling • Two ways for selecting samples • Randomized • Deterministic • A Workload W is a set of pairs of queries and their weight. • W = {<Q1, w1>,<Q2, w2>,…<Qq, wq>} • Σiwi = 1.

Architecture for Approximate Query Processing

Architecture (cont.) • Offline Component • Selects sample or records from relation R • Online Component • Rewrites an incoming query to use the sample. What is “rewrites” means? • Reports answer with an estimate error

Architecture (cont.) • New method for automatically lifting a given workload. • It is unrealistic to assume that the incoming queries will be identical to the given workload. • The key : the ability to compute a probability distribution Pw.

Error Metrics • Relative Error : |y - y’| / y • Squared Error : SE(Q) = (|y - y’| / y)² • Squared Error for GROUP BY query SE(Q) = (1/g) Σi((yi – yi’)/ yi)² • a probability distribution of queries pw • Mean squared error for the distribution: MSE(pw) =ΣQpw(Q)*SE(Q) • Root mean squared error: RMSE(pw) = √MSE(pw)

Fixed Workload • Special case ? • A given workload are “identical” to the incoming queries. • Problem: FIXEDSAMP Input: R, W, k Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

Fundamental Regions • Relation R contains 9 records • W consists of 2 queries • Q1 = select records with C values between 10 -50 • Q2 = select records with C values between 40 -70 • These queries divide Relation R into 4 fundamental regions.

Fundamental Regions (cont.)

Fundamental Regions (cont.) • partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none. • Total number fundamental regions =? Min(2|W|, n)

FIXEDSAMP Solution • Step 1. Identify Fundamental Regions in R • r <= k • r > k • Step 2 Pick Sample Records • Step 3 Assign values to additional columns

LIFTING WORKLOAD TO QUERY DISTRIBUTION • Query Q’ is not identical, Pw(Q’) is high if Q’ is similar to queries in the workload, and Low if not. • Q’ and Q are similar if selected records have significant overlap.

LIFTED WORKLOAD • P{Q}(R’)is the probability of occurrence of any query that selects exactly the set of records R’. • For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record

LIFTED WORKLOAD (Cont.)

LIFTED WORKLOAD (Cont.) δ → 1 and γ → 0: implies that incoming queries are identical to workload queries. δ → 1 and γ → ½: implies that incoming queries are supersets of workload queries. δ → ½ and γ → 0: implies that incoming queries are subsets of workload queries. δ → ½ and γ → ½: implies that incoming queries are unrestricted.

RATIONALE FOR STRATIFIED SAMPLING • A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.

STRATIFIED SAMPLING • a stratified sampling scheme partitions R into r strata containing n1, ., nr records (where Σnj = n), with k1, …, kr records uniformly sampled from each stratum (where Σkj = k). • Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN(3,4); • POPQ is population of query Q • POPQ1 = {0,0,1,1} = non-zero variance • Divided into two strata {0,0} and {1,1}

SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION • Stratification • How many strata • How many records for each stratum • Allocation • Determines how to divide k • Sampling • Forms the final sample of k record

SOLUTION FOR COUNT AGGREGATE • Stratification (lemma 1) • r is not known, divide R into fundamental regions and treat them as strata. • Allocation (lemma 2) • MSE(pW) = Σiwi MSE(p{Q}) • MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload

SOLUTION FOR COUNT AGGREGATE (Cont.) • For any Q εW, we express MSE(p{Q}) as a function of the kj’s Lemma 3 : ApproxMSE(p{Q}) = Then,

SOLUTION FOR COUNT AGGREGATE (Cont.) • Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the kj’svariables. Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a function of n1,…,nr, δ, and γ. αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj. • Now we can minimize MSE(pw).

SOLUTION FOR COUNT AGGREGATE (Cont.) Lemma 4: Σj(αj / kj) is minimized subject to Σjkj = k if kj = k * ( sqrt(αj) / Σisqrt(αi) ) • This provides a closed-form and computationally inexpensive solution to the allocation problem since αj depends only on δ, γ and the number of tuples in each fundamental region

SOLUTION FOR SUM AGGREGATE • Stratification • Bucketing technique • Divide fundamental regions with large variance into a set of finer regions. • Treat each region as strata • Allocation • Yj is average (sum) of the aggregate column values of all records in region Rj

SOLUTION FOR SUM AGGREGATE (Cont.) • Each value in the region can be approximated as yj • An approximate formula for MSE(P{Q}) for SUM query Q in W

Pragmatic Issues • Identifying Fundamental Regions • Handling Large Number of Fundamental Regions • Obtaining Integer Solution • Obtaining unbiased error

STRAT ALGORITHM

IMPLEMENTATION AND EXPERIMENTAL RESULT • This experiment compares the STRAT method to other methods. • USAMP – uniform random sampling • WSAMP– weighted sampling • OTLIDX– outlier indexing combined with weighted sampling • CONG– Congressional sampling

COUNT AGGREGATE

SUM AGGREGATE

COUNT AGGREGATE

THANK YOU

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries