Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath Poosala Presented By: Muhammed Z. Miah CS6392 - DB Exploration
Introduction • Limitations of Uniform Sampling • Presence of skewed data in aggregate values • Effect of low selectivity in selection queries • Presence of small groups in group-by queries • Biased Sampling for Group-By Queries • (Precomputed) Biased sampling – hybrid union of biased and uniform sampling
Aqua System (Architecture) • [Figure: Aqua system architecture]
Problems with Group-By Queries • Decision support queries routinely segment the data into groups. • For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However, there can be a huge discrepancy in the sizes of different groups; e.g., the state of California has nearly 70 times the population of Wyoming. • As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups. This leads to poor accuracy for answers on those groups, because accuracy depends heavily on the number of sample tuples that belong to the group. • For a uniform sample, the standard error is inversely proportional to √n, where n is the number of sample tuples.
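To make the √n argument above concrete, here is a minimal Python sketch. The group sizes are made-up numbers that only mimic the roughly 70:1 California/Wyoming ratio from the slide, and `expected_group_counts` is a hypothetical helper, not something from the paper.

```python
import math

def expected_group_counts(group_sizes, sample_size):
    """Expected number of sample tuples per group under uniform sampling.

    A uniform random sample of the whole relation is expected to contain
    tuples from each group in proportion to the group's size.
    """
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}

# Hypothetical group sizes with a 70:1 ratio (not real census figures).
group_sizes = {"CA": 70_000, "WY": 1_000}
expected = expected_group_counts(group_sizes, sample_size=710)

# Standard error scales as 1/sqrt(n), so the small group's error is
# larger by a factor of sqrt(n_large / n_small).
error_ratio = math.sqrt(expected["CA"] / expected["WY"])
print(expected, error_ratio)
```

With a 70:1 size ratio, the small group gets about 1/70 of the sample tuples, so its standard error is larger by a factor of about √70.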
Solution (Congressional Sampling) • Congressional samples are a hybrid union of uniform and biased samples. • One building block is to divide the available sample space X equally among the groups and take a uniform random sample within each group. • Consider the US Congress, which is a hybrid of the House and the Senate. The House has representatives from each state in proportion to its population. The Senate has an equal number of representatives from each state. • Apply the House and Senate scenario to represent different groups. House sample: a uniform random sample of the relation, so each group is represented in proportion to its size. Senate sample: an equal number of tuples sampled from each group.
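The House and Senate allocations, and a hybrid that takes the larger of the two per group and then scales back down to X, could be sketched as follows. This is an illustrative sketch for a single grouping only; the function names and group sizes are assumptions, and it ignores the case where a group has fewer tuples than its allocation.

```python
def house_allocation(group_sizes, X):
    """House: expected share of a uniform sample, proportional to group size."""
    total = sum(group_sizes.values())
    return {g: X * n / total for g, n in group_sizes.items()}

def senate_allocation(group_sizes, X):
    """Senate: the sample space X divided equally among the groups."""
    m = len(group_sizes)
    return {g: X / m for g in group_sizes}

def hybrid_allocation(group_sizes, X):
    """Per group, take the larger of House and Senate, then rescale to X."""
    house = house_allocation(group_sizes, X)
    senate = senate_allocation(group_sizes, X)
    raw = {g: max(house[g], senate[g]) for g in group_sizes}
    scale = X / sum(raw.values())
    return {g: scale * v for g, v in raw.items()}

# Hypothetical sizes with a 70:1 ratio.
sizes = {"CA": 70_000, "WY": 1_000}
print(house_allocation(sizes, 710))
print(senate_allocation(sizes, 710))
print(hybrid_allocation(sizes, 710))
```

In this example the hybrid gives the small group far more tuples than the House allocation does, while the large group still keeps more than its Senate share.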
Solution (Congressional Sampling) • Define a strategy S1 as follows: divide the available sample space X equally among the groups, and take a uniform random sample within each group. • Congressional approach: consider the entire set of possible group-by queries over a relation R. • Let $\mathcal{G}$ be the set of non-empty groups under the grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h in any other grouping $T \subseteq G$ is the union of one or more groups g from $\mathcal{G}$. • Constructing Congress: 1. Apply S1 to each $T \subseteq G$. 2. Let $\mathcal{T}$ be the set of non-empty groups under the grouping T, and let $m_T = |\mathcal{T}|$ be the number of such groups. 3. By S1, each of the non-empty groups in $\mathcal{T}$ should get a uniform random sample of $X/m_T$ tuples from the group.
Solution (Congressional Sampling) • Constructing Congress (continued): 4. Thus, for each subgroup g in $\mathcal{G}$ of a group h in $\mathcal{T}$, the expected space allocated to g is simply $S_{g,T} = \frac{X}{m_T} \cdot \frac{n_g}{n_h}$, where $n_g$ and $n_h$ are the number of tuples in g and h respectively. 5. Then, for each group g, take the maximum of $S_{g,T}$ over all T as the sample size for g, and scale it down to limit the space used to X. The final formula is: $\mathrm{SampleSize}(g) = X \cdot \frac{\max_{T \subseteq G} S_{g,T}}{\sum_{j \in \mathcal{G}} \max_{T \subseteq G} S_{j,T}}$. 6. For each group g in $\mathcal{G}$, select a uniform random sample of size SampleSize(g). • Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples.
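Steps 1 to 6 can be sketched in Python as below. This is a minimal illustration under stated assumptions: each finest-level group g is represented as a tuple of grouping-attribute values, `congress_allocation` and the example counts are hypothetical, and the result is a fractional allocation (rounding and groups smaller than their allocation are not handled).

```python
from collections import defaultdict
from itertools import combinations

def congress_allocation(group_counts, X):
    """Allocate sample space X across the finest-level groups.

    group_counts maps each finest group g (a tuple of attribute values)
    to n_g, its number of tuples. Every subset T of the attribute
    positions is treated as one grouping.
    """
    num_attrs = len(next(iter(group_counts)))
    best = {g: 0.0 for g in group_counts}
    for r in range(num_attrs + 1):
        for T in combinations(range(num_attrs), r):
            # n_h: size of each group h under grouping T.
            n_h = defaultdict(int)
            for g, n_g in group_counts.items():
                n_h[tuple(g[i] for i in T)] += n_g
            m_T = len(n_h)  # number of non-empty groups under T
            for g, n_g in group_counts.items():
                h = tuple(g[i] for i in T)
                s_gT = (X / m_T) * (n_g / n_h[h])  # S_{g,T}
                best[g] = max(best[g], s_gT)
    # Scale the per-group maxima down so the total space used is X.
    total = sum(best.values())
    return {g: X * v / total for g, v in best.items()}

# Hypothetical relation with two grouping attributes.
counts = {("A", "x"): 3000, ("A", "y"): 3000, ("B", "x"): 1500, ("B", "y"): 2500}
print(congress_allocation(counts, X=100))
```

The empty grouping (r = 0) puts all tuples in one group and reproduces the House allocation, while the full set of attributes reproduces the Senate allocation over the finest groups. Enumerating all subsets is exponential in the number of grouping attributes, which is acceptable only for a small illustration like this one.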
Rewriting • Query rewriting involves two key steps: a) scaling up the aggregate expressions and b) deriving error bounds on the estimate. • For each tuple, let its scale factor, ScaleFactor, be the inverse of the sampling rate for its stratum. • All the sample tuples belonging to a group will have the same ScaleFactor. Thus the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor. • There are two approaches to doing this: a) store the ScaleFactor (SF) with each tuple in the sample relation - Integrated b) use a separate table to store the ScaleFactors for the groups - Normalized, Key-normalized, Nested-integrated • Each approach has its pros and cons.
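As an illustration of the scaling step when the ScaleFactor is stored with each sample tuple, the following Python sketch scales SUM, COUNT, and AVG per group. The row layout `(group, value, scale_factor)` and the function name are assumptions made for this example; error bounds are not computed here.

```python
from collections import defaultdict

def estimate_group_aggregates(sample_rows):
    """Scale up per-group aggregates from sample rows.

    sample_rows yields (group, value, scale_factor), where scale_factor
    is the inverse sampling rate of the tuple's stratum.
    """
    sums = defaultdict(float)
    counts = defaultdict(float)
    for group, value, scale_factor in sample_rows:
        sums[group] += value * scale_factor   # each tuple stands for SF tuples
        counts[group] += scale_factor
    return {
        g: {"sum": sums[g], "count": counts[g], "avg": sums[g] / counts[g]}
        for g in sums
    }

# Hypothetical sample: CA sampled at 1/100, WY at 1/4.
rows = [("CA", 10.0, 100.0), ("CA", 20.0, 100.0), ("WY", 5.0, 4.0)]
print(estimate_group_aggregates(rows))
```

Computing AVG as scaled SUM over scaled COUNT keeps the estimate consistent when a reported group spans strata with different scale factors.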
Computation and Maintenance • One-Pass Algorithm • [AGP99b] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. Technical report, Bell Laboratories, Murray Hill, New Jersey, November 1999.
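The slide only names a one-pass algorithm and defers the details to [AGP99b]. As a general illustration of how a fixed-size uniform sample per group can be maintained in a single pass, here is standard reservoir sampling applied per group. This is an assumed stand-in technique, not a reproduction of the report's algorithm, and all names are hypothetical.

```python
import random

def per_group_reservoirs(rows, capacity, seed=0):
    """Maintain a uniform sample of up to `capacity` rows per group in one pass.

    rows yields (group, row) pairs. Returns the reservoirs and the number
    of tuples seen per group, which gives each group's sampling rate.
    """
    rng = random.Random(seed)
    reservoirs = {}
    seen = {}
    for group, row in rows:
        reservoir = reservoirs.setdefault(group, [])
        seen[group] = seen.get(group, 0) + 1
        if len(reservoir) < capacity:
            reservoir.append(row)
        else:
            # Keep the new row with probability capacity / seen[group].
            j = rng.randrange(seen[group])
            if j < capacity:
                reservoir[j] = row
    return reservoirs, seen

stream = [("CA", i) for i in range(1000)] + [("WY", i) for i in range(3)]
reservoirs, seen = per_group_reservoirs(stream, capacity=5)
print({g: len(r) for g, r in reservoirs.items()}, seen)
```

The per-group counts in `seen` are what a scale factor would be derived from: a group with `seen[g]` tuples and `len(reservoirs[g])` sampled tuples has inverse sampling rate `seen[g] / len(reservoirs[g])`.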
Experiments • Testbed • On Aqua, with Oracle (v7) • Accuracy of Sample Allocation Strategies • Performance for Different Query Sets • Queries with no group-bys, three group-bys, two group-bys • Effect of Sample Size • Error drops as more space is allocated to store the samples • Congress: error drops rapidly with increasing sample size, and accuracy is high even for arbitrary group-bys • Performance of Rewriting Strategies
Extensions • Generalization to Multiple Criteria • Generalization to Other Queries
Related Work • Online Aggregation • Histograms • Wavelets • Biased Sampling (Stratified Sampling)
Conclusions • Congressional samples are effective for group-by queries with arbitrary group-bys (including none) • The new strategies were validated experimentally, both for their ability to produce accurate estimates for group-by queries and for their execution efficiency
THANK YOU Happy Valentines