1 / 26

Presented By Srikanth Vadada Fall 2010 - CSE 6339 23 rd Sep 2010

Dynamic Sample Selection for Approximate Query Processing Brian Babcock, Surajit Chaudhuri, Gautam Das ACM SIGMOD 2003. Presented By Srikanth Vadada Fall 2010 - CSE 6339 23 rd Sep 2010. 3 Key Terms for the Topic. Approximate Query Processing (AQP). Biased Sample. Dynamic Selection.

merle
Download Presentation

Presented By Srikanth Vadada Fall 2010 - CSE 6339 23 rd Sep 2010

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Dynamic Sample Selection for Approximate Query Processing Brian Babcock, Surajit Chaudhuri, Gautam Das ACM SIGMOD 2003 Presented By Srikanth Vadada Fall 2010 - CSE 6339 23rd Sep 2010

  2. 3 Key Terms for the Topic Approximate Query Processing (AQP) Biased Sample Dynamic Selection Goal : Dynamically construct an appropriate Biased Sample for Approximate Query Processing

  3. Why Approximate Query Processing (AQP) ? • Rapid strides in Data Collection & Management Technologies • Resulting in very large Databases • Effective Data Analysis methods – Ongoing Research • Analysis Queries require aggregation or summarizations • Expensive Running Times • Requirements of Analysis Systems (Decision Support Systems) • Short Query Response Time • Exactness of Query Results less important AQP Techniques are the Solution

  4. Why Sampling Techniques ? • Data Analysis Queries • OLAP – Materialized Views of Data Cubes • Building Indexes • Physical Data Design – Use Preprocessing Time & Space • Effective when Query Workload is known in advance • Expensive to build indexes for all possible queries • AQP & Physical Database Design Methods Complementary

  5. Approximate Query Processing (Related Work) • Online Aggregation • Hellerstein J., Haas P., Wang H. Online Aggregation, CM SIGMOD 1997 • Approximate answers are produced during early stages • Gradual refinement until data is processed • Advantages: • No pre-processing required • Allows progressive refinement of answers at runtime • Disadvantages : • Require random disk access (slow). • Requires query processor code change.

  6. Approximate Query Processing (Related Work) • Join Synopses - Sampling based method • Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999 • Join Queries – Primary Key Joins • Pre Computation - Join of Fact Tables with Dimension Tables • Disadvantages: • Does not extend to queries that involve non-foreign key joins

  7. Approximate Query Processing (Related Work) • Icicles - Weighted Sampling method • Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB, 2000. • Frequency of Tuple Access by Queries in Workload • Disadvantages: • Addresses only low selectivity problem

  8. Dynamic Sample Selection • “Appropriate Biased” Samples • Give accurate approximate results for most queries • Appropriate Biased Sample varies from Query to Query • Previous Sampling Methods vs Dynamic Sample Selection • Previous sampling - Single Sample with fixed bias • Individual Tailored Sample for each query • Creation of subsamples is done offline - Preprocessing • Assembly into an overall sample is done online - Runtime

  9. Effectiveness of Biased Sampling - Example • Database consists of 100 Product Tuples • Product = “Stereo” - 90 Tuples • Product = “TV” – 10 Tuples • Sampling 10 tuples in 2 ways • 10% of the tuples uniformly each with weight 10 • 0% of “Stereo” tuples and 100% “TV” tuples with weight 1 • Query – Count of “TV” Tuples • Which gives a correct answer always? 2nd Sample - Always gives the exact answer 1st Sample - Only if exactly 1 of the TV tuples is chosen

  10. Dynamic Sampling SAMPLE DATA DATA SAMPLE SAMPLE Query Query SAMPLE SAMPLE Static vs Dynamic Sampling Selection StandardSampling

  11. Dynamic Sample Selection Architecture ( 1 / 3) • Pre- Processing Phase • Extra Disk space not taken advantage by Standard Sampling Methods • DSS uses this effectively by creating a large sample containing a family of differently biased subsamples • Step 1 - Examine Data Distribution for creating a set of biased samples – Results into Overlapping Strata • Step 2 – Samples are created with potentially different sampling rates for each stratum. Generate Metadata – Characteristics of each sample Query Workload Sample Data Select Strata Build Sample Data Meta- Data Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

  12. Dynamic Sample Selection Architecture ( 2 / 3) • Runtime Phase • When Queries are issued at Runtime - DSS re-writes the queries to run against sample tables • Appropriate Sample Tables to use are determined by comparing Query with the Meta data • Algorithms for choosing which samples to build in pre processing and samples to use for Query Processing are not described Query Sample Data Choose Samples Rewrite Query Meta- Data Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

  13. Dynamic Sample Selection Architecture ( 3 / 3) • Policies for Sample Selection • Choice of Sample guided by incoming Query Syntax • Examples • Separate sample for each table and choose sample based on the FROMclause • Separate sample for each pre-specified aggregate expression and choose sample based on the SELECTclause

  14. Motivation - Small Group Sampling (1 / 2) • Sampling for answering aggregation queries with “group-bys” • Ex : SELECT DEPTID , COUNT(*) • FROM EMP_DEPT • GROUP BY DEPTID • Uniform Sampling give weight to each GROUP proportional to number of tuples falling in the GROUP • Skewness in data distribution results in Group Oversampling • Heuristic Method for samples satisfying most group by queries • Uniform Sampling - Good providing estimates for larger groups

  15. Motivation - Small Group Sampling ( 2 / 2) • Small Groups are the problem. • Since the Groups are small – Take Advantage • Scan all tuples contributing to small groups • Need to identify the small groups • Small Group Sampling method • Overall sampling using Uniform Sampling- Overall Sample • Rows from small groups – Small group tables • No downsampling for small group tables -100% rows taken Large Groups Small Groups

  16. Small Group Sampling - Example • Example : Aggregation queries with “group-bys” select Age, Income, count(*) from Employee_Tbl group by Age, Income

  17. Small Group Sampling – Illustration (1/3) s_overall • sample – perform uniform sampling on large groups. • Small group tables - one or more sample tables for smaller groups. Pre-Processing Phase: • Create a overall sample s_overall • 2. “Age” Histogram ( Column Index: 0) • r : Base Sampling rate, determines the size of Overall Sample (eg, 30%) • t : small group fraction, max size of each small group table (eg, 20%) s_age Small group table

  18. Small Group Sampling – Illustration (2/3) s_overall Pre-Processing Phase: “Income” Histogram (Column Index: 1) s_income 011 011

  19. Small Group Sampling – Illustration (3/3) Runtime Phase: s_age (Column Index: 0 - 001) s_overall Query Issued - SELECT Age, Income, count(*) FROM Employee_tbl GROUP BY Age, Income SELECT Age, Income, count(*) FROM s_age GROUP BY Age, Income UNION ALL SELECT Age, Income, count(*) FROM s_income GROUP BY Age, Income WHERE Bitmask & 1 = 0 /* ie, 001 . (eg, 010 & 001 = 000 ; 011 & 001 = 1)*/ UNION ALL SELECT Age, Income, count(*) * (100/30) FROM s_overall GROUP BY Age, Income WHERE Bitmask & 3 = 0 /* 3 = 20 + 21ie, 011 (eg, 001 & 011 = 1; 011 & 011 = 1; 010 & 011 = 1)*/ s_income (Column Index: 1 - 010)

  20. Accuracy Metrics (1 / 2) • As many possible groups to be preserved in approximate answer • Error in the aggregate value for each group should be small • Q = Aggregation Query • Let G = {g1, g2, g3, … gn} be the set of n groups in the answer to Q • xi = aggregate value for group gi. • A= Approximate Answer to Q • G’ = {gi1, gi2, gi3, … gim} be the set of m groups in A • x’i1 = aggregate value for group gij.

  21. Accuracy Metrics (2 / 2) Percentage of Groups from Q missed by A Average relative error on Q of A Average squared relative error on Q of A

  22. Experimental Results ( 1 / 2) TPC – H Database , Count & Sum queries , Number of Columns in all Tables =245 RelErr, PctGroups increased for Uniform Sampling & Small Group Sampling Increase was more pronounced for Uniform Sampling Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

  23. Experimental Results ( 2 / 2) Uniform Sampling outperforms Small group sampling at low skews Small group sampling does better at moderate to high skew Speedup decreases as the number of grouping columns increase Fig Source : Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.

  24. Conclusions • Dynamic Sample Selection improves on previous AQP Methods • Productively utilizes additional Disk Space • Small Group Sampling targets aggregate queries with group bys • Small Group sampling outperforms other techniques

  25. References • Babcock B., Chaudhuri C. and Das G. Dynamic Sample Selection for Approximate Query Processing • http://crystal.uta.edu/~cse6339/Fall08DBIR.htm • http://crystal.uta.edu/~cse6339/Fall09DBIR.htm

  26. Q & Questions ? A

More Related