APPROXIMATE QUERY PROCESSING IN DATABASES

APPROXIMATE QUERY PROCESSING IN DATABASES By: Jatinder Paul

Introduction User • DecisionSupport Systems(DSS) SQL Query Exact Answer Long Response Times! • In recent years, advances in data collection and management technologies • have led to proliferation of very large databases. • Decision support applications such as Online Analytical Processing (OLAP) • and data mining tools are gaining popularity. Executing such applications on • large volumes of data can be resource intensive. • For very large databases (multi-gigabyte ) processing even a single query • involves accessing enormous amounts of data, leading to very high running • time.

What is Approximate Query Processing ? AQP • Keeping query processing time small is very important in data mining and decision support system (DSS). • In aggregate queries (involving SUM,COUNT,AVG etc.) precision to “last decimal” is not needed • e.g., “What is averages sale of a product in all the states of US ?” • Exactness of result is not very important . Result can be given as approximate answer with some error guarantee. • e.g. Average salary $59,000 +/- $500 (with 95% confidence) in 10 seconds vs. $59,152.25 in 10 minutes • Approximate Query Processing (AQP) techniques sacrifice accuracy to improve the running time

Different AQP techniques • All AQP techniques involves building some kind of synopses (summary) of the database. • AQP techniques differ in the way synopses is build. • Sampling : Most preferred and practically used technique. * Randomized sampling : Sampling uniformly at random. * Deterministic sampling : Selecting the best sample that minimizes the total error in the approximate answers. • Histogram : Involves creating histograms which represent the database. • Wavelets : Still a research area ! • Concentrating on various sampling techniques.

Basic Sampling Based AQP Architecture. • Pre-processing phase (offline): Sample is build . • Run time phase (online): Queries are rewritten to run against the sample. The result is then scaled to given the approximate answer.

Limitations of Uniform Sampling for Aggregation Queries • We discuss two problems that adversely affect the accuracy of sampling based estimations: • presence of skewed data in aggregate values , • the effect of low selectivity in selection queries and , • the presence of small groups in group-by queries. • Discussions of these problems in details.

Effect of data skew • A skewed database is characterized by the existence of certain tuples that are deviant from the rest in terms of their contribution the aggregate value. • These tuples are called outliers. These are the values that contribute heavily to error in uniform sampling. • Outliers in data results large variance . Large variance => Large standard deviation and , Large standard deviation =>Large error in results.

Effect of data skew • Example: Consider a relation R containing Two columns <ProductId, Revenue> and four records {<1, 10>, <2, 10>, <3,10>, <4, 1000>}. and aggregate query Q: SELECT SUM(Revenue) FROM R. Using a sample S of two records from R to answer the query. Query is answered by : • Running it against S and • Scaling the result by a factor of two (since we are using a 50% sample). • Consider a sample S1 = {<1, 10>, <3, 10>}. The estimated answer for Q using S1 is 40, which is a severe underestimate of the actual answer (1030). • Now consider another sample S2 = {<1,10>, <4,1000>}. The estimated answer for Q using S2 is 2020, which is a significant overestimate. Thus, large variance in the aggregate column can lead to large relative errors

Handling data skew :Outlier -indexes • Main idea is deal with outliers separately ,and sample from rest of the relation. • Partition the relation R into two sub relation:- • Ro(outliers) and • RNO (non-outliers). • The query Q can now be considered as the UNION of two sub queries :- • Q applied to Ro and • Q applied to RNO (non-outliers).

Handling data skew :Outlier -indexes • Preprocessing steps 1.Determine outliers — specify the sub-relation RO of the relation R deemed to be the set of outliers. 2. Sample non-outliers— select a uniform random sample T of the relation RNO. • Query processing steps. 1.Aggregate outliers— apply the query to the outliers in RO accessed via the outlier-index. 2. Aggregate non-outliers— Apply the query to the sample T and extrapolate to obtain an estimate of the query result for RNO. 3. Combine aggregates— combine the approximate result for RNO with the exact result for RO to obtain an approximate result forR.

Outlier –indexes : Other Issues. • Since the database content changes over time, this requires selection of outlier indexes and samples to be refreshed appropriately. The samples need to be refreshed periodically as pre-computed samples can become stale with use. • An alternative is to do the sampling completely online i.e., make it a part of query processing. • For the case of the COUNT aggregate ,outlier-indexing is not beneficial since there is no variance among the data values. • Also , outlier indexing is not useful for aggregate that depend on the rank order of tuples rather than their actual values. (e.g. Aggregates such as min ,max and median ) Overcoming limitation of sampling in AQp

Effect of low selectivity • Most queries involve selection conditions and/or group-by’s . • If the selectivity of a query is low (i.e. it results in few records selection) then it adversely impacts the accuracy of sampling based estimation. • A selection query partitions the relation into two sub-relations: • tuples that satisfy the condition (relevant sub-relation) and • tuples that do not. • If we sample uniformly from the relation, the number of tuples that are sampled from the relevant sub-relations will be proportional to its size. If this relevant samplesize is small due to low selectivity of the query, it may lead to large error.

Effect of small groups. • The effect of small groups is same as effect of low selectivity . • The group-by queries also partition the relation into numerous sub-relations (tuples that belong to specific groups). • Thus for uniform sampling to perform well, the relevant sub-relation should be large in size, which is not the case in general

Using Workload Information: Solution to low selectivity and small groups • Approach to this problem is to use weighted sampling(instead of uniform sampling) of data by exploiting workload information while drawing the sample. • The essential idea behind weighted sampling scheme is to sample more from subsets of data that are small in size but are important, i.e. have high usage. This results in low error. • This approach is based on the fact that the usage of a database is typically characterized by considerable locality in the access pattern, i.e., queries against the database access certain parts of the data more than others.

Using Workload Information: Solution to low selectivity and small groups • The use of workload information involves the following steps: 1.Workload Collection: Obtain a workload consisting of representative queries against the database. Modern database systems provide tools to log queries posed against the server (e.g. the Profiler component of Microsoft SQL Sever). 2.Trace Query Patterns:The workload can be analyzed to obtain parsed information, e.g., the set of selection conditions that are posed. 3. Trace Tuple Usage:The execution of the workload reveals additional information on usage of specific tuples, e.g., frequency of access to each tuple , the number of queries in the workload for which it passes the selection condition of the query. 4. Weighted Sampling:Perform sampling by taking into account weights of tuples (from Step 3).

Weighted Sampling • Assuming that a tuple ti has weight wi, if the tuple ti is required to answer wi of the queries in the workload. • The normalized weight be wi‘defined as . • Tuple is accepted in the sample with the probability : pi = n.wi’ , where n is the sample size. • Thus the probability with which each tuple is accepted in the sample varies from tuple to tuple. • The inverse of this probability is the multiplication factor associated with tuple used while answering the query. Each aggregate computed over this tuple gets multiplied by this multiplication factor.

Problem with Group-By Queries • Decision support queries routinely segment the data into groups. • For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However ,there can be a huge discrepancy in the sizes of different groups, e.g., the state of California has nearly 70 times the population of North Carolina. • As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups, which leads to poor accuracy for answers on those groups because accuracy is highly dependent on the number of sample tuples that belong to that group. • Standard error is inversely proportional to √n for uniform sample. n is the uniform sample random size.

Solution: Congressional Samples • Congressional samples are hybrid union of uniform and biased samples. • The strategy adopted is to divide the available sample space X equally among the g groups , and take a uniform random sample within each group. • Consider US Congress which is hybrid of House and Senate. House has representative from each from each state in proportion to its population. Senate has equal number of representative from each state. • We now apply House and Senate scenario for representing different groups. House sample:Uniform random sampling from each group . Senate sample: Sample an equal number of tuples from each group.

Congressional Samples • We define a strategy S1 as following : • Divide the available sample space X equally among the g groups , and take a uniform random sample within each group • Congressional approach : In this approach we consider the entire set of possible group by queries over a relation R. • Let be the set of non-empty groups under the grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h on any other grouping T G is the union of one or more groups g from . • Constructing Congress, 1. Apply S1 on each TG. 2. Let be the set of non-empty groups under the grouping T, and let the number of such groups. 3. By S1, each of the non-empty groups in T should get a uniform random sample of X/mT tuples from the group.

Congressional Samples • Constructing Congress, 4. Thus for each subgroup g in of a group h in T, the expected space allocated to g is simply 5. Then, for each group g , take the maximum over all T of Sg,T, as the sample size for g, and scale it down to limit the space used to X. The final formula is: Sample Size (g) = 6. For each group g in , select a uniform random sample of size Sample Size(g). Thus we have a stratified, biased sample in which each group at the finest partitioning is its own strata. Thus Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples. where ng and nh are the number of tuples in g and h respectively.

Rewriting for biased samples • Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate. • For each tuple, let its scale factor ScaleFactor be the inverse sampling rate for its strata. • All the sample tuples belonging to a group will have the same ScaleFactor. Thus key step in scaling is efficiently associate each tuple with its corresponding ScaleFactor. • There are two approaches to doing this: a) store the ScaleFactor(SF) with each tuple in sample relation and b) use a separate table to store the ScaleFactors for the groups. • Each approach has its pros and cons.

Dynamic Sample Selection for AQP • All previous strategies for AQP uses single sample with fixed bias and do not take advantage of extra disk space. • Increasing the size of a sample stored on disk increases the running time of a query executing against that sample. • Dynamic Sampling gets around with this problem by : 1.Creating a large sample containing a family of differently biased sub samples during pre-processing phase. 2. Using only portion of the sample to answer each query at runtime. • Because there are many different sub samples with different biases available to choose from a runtime, the chances increase that one of them will be a “good fit” for any particular query that is issued. • And since only a small portion of the overall sample is used to answer query response time is low.

Generic Dynamic Sample Selection Architecture • Pre-Processing Phase • Consists of two steps : • The data distribution and query distribution ( e.g. workload information) is examined to identify a set of biased samples to be created. • Samples are created and stored in database along with the metadata. Figure 2. Pre-processing phase Division of database Identifies the characteristics of each sample

Generic Dynamic Sample Selection Architecture • Runtime Phase • Two steps : • Rewrites the query to run against sample tables. • Appropriate sample table(s) to use for a given query Q is determined by comparing Q with the metadata annotations for the sample. Figure 3. Runtime phase • We need algorithm to decide : • Which samples are to be built during pre-processing . • Which samples are to be used for query answering during runtime phase. • It should be possible to quickly determine which of various samples to use to • answer that query.

Small Group Sampling :A dynamic sample selection technique • Small group sampling targets most common type of analysis queries , aggregation queries with “groups-bys”. • Basic idea: Uniform sampling does performs well for larger group-by query but for small groups uniform sampling performs poorly since do not have enough representative from small groups. The small group sampling approach uses a combination of • a uniform random sample (overall sample) over large groups. • One or more small group tables that contain only rows from small groups. The small group tables are not downscaled.

Small Group Sampling Algorithm Description Pre-processing phase The pre-processing algorithm takes two input parameter: • Base sampling rate, r, which determines the size of the uniform random sample that is created (i.e. overall sample). • Small group fraction t, which determines the maximum size of each small group sample table. Let, N = number of rows in the database. C= the set of columns in the database. The pre-processing algorithm can be implemented efficiently by making just two scans: 1) First scan identifies the frequently occurring values for each column an their approximate frequencies. 2) Second scan, the small group tables for each column in S is constructed along with the overall sample.

Small Group Sampling Algorithm Description Pre-processing phase Algorithm Final pass of algorithm 1. Initially , the set S is initialized to C. 2. Count the number of occurrences of each distinct value in each column of the database and determine the common values for each column. ( can be done using separate hash table). 3. If the number of distinct values for a column exceeds a threshold u, then remove the that column from S and stop maintaining its count. L(C) is defined as the minimum set of values from C whose frequencies sum to at least N(1 − t). Second pass of algorithm 1. Determine the set of common values L(C) for each column C. 2. Rows with values from the set L(C) will not be included in the small group table for C, but rows with all other values will be; there are at most Nt such rows. 3. If column C has no small groups, remove it from S. 4. After computing L(C) for every C € S, the algorithm creates a metadata table which contains a mapping from each column name to an unique index between 0 and |S| − 1.

Pre-Processing Phase Algorithm • The final step in pre-processing is to make a second scan of the database to construct the sample tables. • Each row containing an uncommon value for one or more columns (i.e. a value not in the set L(C)) is added to the small group sample table for the appropriate columns. • At the same time as the small group tables are being constructed, the preprocessing algorithm also creates the overall sample, using reservoir sampling to maintain a uniform random sample of rN tuples. • Each row that is added to either a small group table or the overall sample is tagged with an extra bit-mask field (of length |S|) indicating the set of small group tables to which that row was added. This field is used during runtime query processing to avoid double-counting rows that are assigned to multiple sample tables.

Small Group Sampling Algorithm Description Runtime phase • When a query arrives at runtime, it is re-written to run against the sample tables instead of the base fact table. • Each query is executed against the overall sample, scaling the aggregate values by the inverse of the sampling rate r. • In addition, for each column CЄS in the query’s group-by list, the query is executed against that column’s small group table. The aggregate values are unscaled when executing against the small group sample tables. • Finally, the results from the various sample queries are aggregated together into a single approximate query answer. Since a row can be included in multiple sample tables, the re-written queries include filters that avoid double-counting rows

Small group sampling v/s Congressional sampling

Problem with Joins • Approximate query strategies discussed so far uses some sort of sample ( called base sample) using uniform random sampling. • The use of base samples to estimate the output of a join of two or more relations, however, can produce a poor quality approximation. This is for the following two reasons: 1. Non-Uniform Result Sample: In general, the join of two uniform random base samples is not a uniform random sample of the output of the join. In most cases, this non-uniformity significantly degrades the accuracy of the answer and the confidence bounds. 2. Small Join Result Sizes: The join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds since they critically depend on the query result size. • Thus, it is in general impossible to produce good quality approximate answers using samples on the base relations alone.

Solution :Join Synopses • We need to pre-compute samples of join results for making quality answer possible. • First , a naïve solution is to pre-compute such samples is to execute all possible join queries of interest and collect samples of their results. • But it is not feasible since it is too expensive to compute and maintain. • One strategy is to compute samples of the results of a small set of distinguished joins , which can be used to obtain random samples of all possible joins in the schema. The samples of these distinguished joins as join synopses. • Join synopses is a solution for queries with only foreign key joins.

Join Synopses • Lemma J.1The sub-graph of G on the k nodes in any k-way foreign key join must be a connected sub-graph with a single root node. We denote the relation corresponding to the root node as the source relation for the k-way foreign key join. • Lemma J.2 There is a 1-1 correspondence between tuples in a relation r1 and tuples in any k-way foreign key join with source relation r1. • For each relation r, there is some maximum foreign key join (i.e., having the largest number of relations) with r as the source relation. For example., in Figure, is the maximum foreign key join with source relation C.

Join Synopses • Definition of Join synopses: For each node u in G, corresponding to a relation r1. Let ( u) to be the output of the maximum foreign key join with source r1. Let be a uniform random sample of r1. A join synopsis, , to be the output of The join synopses of a schema consists of for all u in G. • Join synopses are also referred as join samples.

Join Synopses • A single join synopses can be used for a large number of distinct joins. • Lemma J.3 From a single join synopsis for a node whose maximum foreign key join has relations, we can extract a uniform random sample of the output of between and distinct foreign key joins. • In other words, a join synopsis of relation r can be used to provide a random samples of any join involving r and one or more of its descendants. • Since Lemma J.2 fails to apply in general for any relation other than the source relation, the joining tuples in any relation r other than the source relation will not in general be a uniform random sample of r. Thus distinct join synopses are needed for each node/relation. • Since tuples in join synopses are the results of multi-way joins, a possible concern is that they will be too large because they have many columns. To reduce the columns stored for tuples in join synopses, we can eliminate redundant columns (for example, join columns) and only store columns of interest. Small relations can be stored in their entirety, rather than as part of join synopses.

References • Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, Vivek Narasayya: Overcoming Limitations of Sampling for Aggregation Queries. ICDE 2001. • Surajit Chaudhuri, Gautam Das, Vivek Narasayya: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001. • Brian Babcock, Surajit Chaudhuri, Gautam Das: Dynamic Sample Selection for Approximate Query Processing. SIGMOD Conference 2003. • S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy.Join synopses for approximate query answering. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 275-286, June 1999. • Gautam Das: Survey of Approximate Query Processing Techniques.(Invited Tutorial) SSDBM 2003 • P. B. Gibbons, S. Acharya, and V. PoosaIa. Aqua Approximate Query Answering System • Yong Yao and Johannes Gehrke, Query Processing for sensor network

THANK YOU

APPROXIMATE QUERY PROCESSING IN DATABASES