
Online aggregation


Presentation Transcript


  1. Online aggregation Joseph M. Hellerstein, University of California, Berkeley Peter J. Haas, IBM Research Division Helen J. Wang, University of California, Berkeley Presented By: Arjav Dave

  2. Overview • Introduction • Goals • Implementation • Performance Evaluation • Future Work

  3. Problems with Aggregation • Aggregation is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and only then is the final answer returned. • Users are forced to wait without feedback while the query is being processed. • Aggregate queries are generally used to get a rough picture of a large body of data, yet they are computed with painstaking precision.

  4. What is Online Aggregation and Why? • Permits users to observe the progress of their aggregation queries. • Lets users control the execution on the fly.

  5. Example • Consider the query: select avg(final_grade) from grades where course_name=‘cse186’ • Without an index, the query must scan every record before returning an answer.

  6. Example using online aggregation • The system displays a running estimate with a confidence interval as tuples are processed; the user can stop query processing as soon as the estimate is within the desired precision.
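To make this stopping rule concrete, here is a minimal Python sketch of the consumer side: it maintains a running average over a stream of grade values and stops as soon as a large-sample (CLT-based) 95% interval is narrower than a user-chosen tolerance. The iterator `scan_cse186_grades` and the tolerance value are hypothetical, and the paper's actual estimators are richer than this.

```python
import math

Z_95 = 1.96  # standard normal quantile for a ~95% interval

def running_avg_until_precise(values, tolerance):
    """Maintain a running average over a stream of values and stop as
    soon as the large-sample 95% half-width drops below `tolerance`."""
    n, total, sum_sq = 0, 0.0, 0.0
    for v in values:
        n += 1
        total += v
        sum_sq += v * v
        if n >= 30:  # rough large-sample threshold
            mean = total / n
            variance = max(sum_sq - n * mean * mean, 0.0) / (n - 1)
            half_width = Z_95 * math.sqrt(variance / n)
            if half_width <= tolerance:
                return mean, half_width, n      # user hits "Stop" here
    mean = total / n if n else float("nan")
    return mean, 0.0, n                         # scan finished: exact answer

# hypothetical usage over a random-order scan of the grades table:
# estimate, half_width, seen = running_avg_until_precise(scan_cse186_grades(), 0.5)
```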

  7. Interface with groups • Consider a query with a GROUP BY clause whose output contains six groups. • The interface shows six stop signs, one per group; each can be used to stop processing for that group. • Such an interface is easy for non-statistical users to work with.

  8. Usability Goals • Continuous Observation: statistical, graphical, and other intuitive interfaces; interfaces must be extensible to each aggregate function. • Control of Time/Precision: the user should be able to terminate processing at any time, controlling the trade-off between time and precision. • Control of Fairness/Partiality: users can control the relative rates at which different running aggregates are updated.

  9. Performance Goals • Minimum time to accuracy: minimize the time required to produce a useful estimate of the final answer. • Minimum time to completion: minimize the time required to produce the final answer. • Pacing: running aggregates should be updated at a regular rate, without being so frequent that they overburden the user or the user interface.

  10. Building a System for Online Aggregation • Two approaches: • Naïve Approach • Modifying a DBMS

  11. Naive Approach • Consider the query: select running_avg(final_grade), running_confidence(final_grade), running_interval(final_grade) from grades; • POSTGRES supports user-defined functions, which suffice for simple running aggregates; for more complex aggregates, performance and functional issues arise.

  12. Modifying a DBMS • Modify the database engine itself to support online aggregation. • Random access to data: • Necessary to obtain statistically meaningful estimates of the precision of running aggregates. • Access Methods: • Heap Scan • Index Scan • Sampling from Indices

  13. Types of Access Methods • Heap Scan: • Tuples are generally stored in random order, so they are easy to fetch. • In clustered files there may be some logical order, so choose another access method for queries over those attributes. • Index Scan: • Returns tuples ordered on, or grouped by, an indexed attribute. • Sampling from Indices: • Ideal for producing meaningful confidence intervals. • Less efficient than the other two.

  14. Fair, Non-Blocking GROUP BY • The traditional technique sorts the input and then groups it, but sorting blocks. • Instead, hash the input relation on its grouping columns; performance degrades, however, as the number of groups grows. • Hybrid hashing, or its optimized version Hybrid Cache, can be used. • The same technique applies to DISTINCT queries.
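A minimal sketch of the hashing idea above (not Hybrid Cache itself): one running (count, sum) pair per group in a hash table, refreshed as tuples stream in, so no sort step blocks the output. The helper callables `group_of` and `value_of` are assumptions for illustration.

```python
from collections import defaultdict

def hash_group_running_avg(tuples, group_of, value_of):
    """Hash-based, non-blocking GROUP BY sketch: one running (count, sum)
    per group, so per-group averages can be refreshed while tuples stream
    in instead of waiting for a blocking sort."""
    state = defaultdict(lambda: [0, 0.0])     # group key -> [count, sum]
    for t in tuples:
        entry = state[group_of(t)]
        entry[0] += 1
        entry[1] += value_of(t)
        # after every tuple (or every k-th tuple) the current estimates are visible
        yield {g: s / c for g, (c, s) in state.items()}
```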

  15. Index Striding • With a plain scan, updates for a group with few members arrive very infrequently. • Index Striding: fetches tuples from the different groups in round-robin fashion so that each group is treated fairly. • Advantages: • The output is updated according to default or user-specified settings. • Delivery of tuples for a group can be stopped, after which tuples for the remaining groups arrive much faster.
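A round-robin sketch of the striding idea, assuming one index cursor per group (the `group_cursors` iterators are hypothetical); real index striding also supports weighted rates and the per-group stop/speed-up controls mentioned later.

```python
def index_stride(group_cursors):
    """Round-robin over per-group cursors so each group's running
    aggregate is refreshed at a comparable rate; a cursor drops out of
    the cycle when its group is exhausted (or stopped by the user)."""
    active = list(group_cursors)
    while active:
        survivors = []
        for cursor in active:
            try:
                yield next(cursor)        # one tuple from this group
                survivors.append(cursor)
            except StopIteration:
                pass                      # group done; skip it from now on
        active = survivors
```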

  16. Non-Blocking Join Algorithms • Sort-Merge Join: sorting blocks. • Merge Join: relies on access methods that fetch tuples in sorted order, which is not acceptable for running estimates. • Hybrid Hash Join: acceptable only if the inner relation is small, since it blocks while hashing the inner relation. • Pipeline Hash: less efficient than hybrid hash join, but preferable when both relations are large. • Nested-Loops Join: not useful when the inner relation is large and unindexed.
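For the pipelined case, here is a sketch of a symmetric ("pipeline") hash join: both inputs are hashed incrementally and matches stream out immediately, so neither side blocks while the other is fully hashed. This is a generic illustration, not the exact algorithm evaluated in the paper.

```python
from collections import defaultdict

def pipelined_hash_join(left, right, left_key, right_key):
    """Symmetric ('pipeline') hash join sketch: hash both inputs
    incrementally and emit matches as soon as they appear."""
    left_tbl, right_tbl = defaultdict(list), defaultdict(list)
    left_it, right_it = iter(left), iter(right)
    done = 0
    while done < 2:
        done = 0
        for it, own, other, key, from_right in (
            (left_it, left_tbl, right_tbl, left_key, False),
            (right_it, right_tbl, left_tbl, right_key, True),
        ):
            try:
                row = next(it)            # draw one tuple from this input
            except StopIteration:
                done += 1
                continue
            k = key(row)
            own[k].append(row)            # remember it for future matches
            for match in other.get(k, []):
                # always emit results as (left_row, right_row)
                yield (match, row) if from_right else (row, match)
```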

  17. Optimization • Avoid sorting. • The cost function should charge blocking sub-operations (e.g., hashing the inner relation in a hybrid hash join) a penalty that grows exponentially with their processing time. • Maximize user control. • The trade-off between output rate and time to completion needs to be evaluated.

  18. Aggregate Functions • New aggregate functions must be defined that return running confidence intervals. • The query executor must be modified to provide running aggregate values for display. • An API must be provided to control the rates, e.g. stopGroup, speedupGroup, slowDownGroup, setSkipFactor.
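The API names on the slide suggest a per-query control surface roughly like the following sketch. Only the method names come from the slide; the semantics here (relative per-group fetch weights and a display skip factor) are an assumed interpretation.

```python
class OnlineAggControl:
    """Sketch of a per-query control API; weight semantics are assumed."""

    def __init__(self, groups, skip_factor=1):
        self.weights = {g: 1.0 for g in groups}   # relative fetch rate per group
        self.skip_factor = skip_factor            # tuples processed per UI refresh

    def stopGroup(self, group):
        self.weights[group] = 0.0                 # stop fetching this group

    def speedupGroup(self, group):
        self.weights[group] *= 2.0                # fetch this group more often

    def slowDownGroup(self, group):
        self.weights[group] /= 2.0                # fetch this group less often

    def setSkipFactor(self, k):
        self.skip_factor = k                      # refresh the display every k tuples
```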

  19. Running Confidence Intervals • The precision of a running aggregate is described by a running confidence interval. • Three types: • Conservative Confidence Intervals • Large-Sample Intervals • Deterministic Confidence Intervals
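As a rough illustration of how the three interval types differ, here are standard textbook stand-ins for an average: a Hoeffding-style conservative bound, a CLT-based large-sample bound, and deterministic worst-case bounds. The paper's actual formulas account for finite populations and other details not shown here.

```python
import math
from statistics import NormalDist

def conservative_half_width(n, lo, hi, delta=0.05):
    """Hoeffding-style conservative interval: valid for any distribution of
    values bounded in [lo, hi], so it is typically the widest of the three."""
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def large_sample_half_width(n, sample_variance, delta=0.05):
    """CLT-based large-sample interval using the observed sample variance;
    usually much tighter once enough tuples have been seen."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return z * math.sqrt(sample_variance / n)

def deterministic_interval(running_sum, n_seen, n_total, lo, hi):
    """Deterministic bounds on the final average: each unseen tuple can
    contribute at most hi and at least lo, so no probability is involved."""
    unseen = n_total - n_seen
    return ((running_sum + unseen * lo) / n_total,
            (running_sum + unseen * hi) / n_total)
```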

  20. Future Work • Enhancing User Interface • Support for Nested Queries • Checkpointing and Continuation • Tracking online queries

  21. QUESTIONS?
