260 likes | 461 Views
Online aggregation. Joseph M. Hellerstein University of California, Berkley Peter J. Haas IBM Research Division Helen J. Wang University of California, Berkley. Presented By: Arjav Dave. Overview. Introduction Goals Implementation
E N D
Online aggregation Joseph M. HellersteinUniversity of California, Berkley Peter J. Haas IBM Research Division Helen J. Wang University of California, Berkley Presented By: Arjav Dave
Overview • Introduction • Goals • Implementation • Performance Evaluation • Future Work
Problems with Aggregation • Aggregation is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time and then the final answer is returned. • Users are forced to wait without feedback while the query is being processed. • Aggregate queries are generally used to get a rough picture of a large body, yet they are computed with painstaking precision.
What is Online Aggregation and Why? • Permits users to observe the progress of their aggregation. • Controls the execution on the fly.
Example • Consider the query: select avg(final_grade) from grades where course_name=‘cse186’ • Without index the query will scan all records before returning the answer.
Example using online aggregation • The user can stop the query processing if the result is within specified confidence interval.
Interface with groups • Consider a query with group by clause having 6 groups in output • There are 6 stop signs each for one group. They can be used to stop the group processing. • Such an interface is easy for the non-statistical users.
Usability Goals • Continuous Observation: Statistical, Graphical and other intuitive interfaces. Interfaces must be extensible for each aggregate function. • Control of Time/Precision: User should be able to terminate processing at any time controlling trade off between time and precision. • Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.
Performance Goals • Minimum time to Accuracy: Minimize time required to produce useful estimate of the final answer. • Minimize time to completion: Minimize the time required to produce the final answer. • Pacing: The running aggregates should be updated at regular rates, without being so frequent that they overburden the user or user interface.
Building a system for Online Aggregation Two Approaches: • Naïve Approach • Modifying a DBMS
Naive Approach • Consider the query: select running_avg(final_grade), running_confidence(final_grade), running_interval(final_grade) from grades; • POSTGRES supports user defined function which can be defined for simple aggregates. For complex aggregates performance and functional issues arise.
Modifying a DBMS • Modify the database engine to support online aggregation. • Random access to Data: • Necessary to get the statistically meaningful estimates of the precision of running aggregates. • Access Methods: • Heap Scan • Index Scan • Sampling from Indices
Types of Access Methods • Heap Scan: • Generally stored in random order, so easy to fetch • In clustered files there may be some logical order, so choose other access method for queries over that attributes. • Index Scan: • Returns tuples based on some attribute or in groups based on some attribute. • Sampling from Indices: • Ideal for producing meaningful confidence intervals. • Less efficient than other two.
Fair, Non-Blocking Group By • Traditional technique does sorting and then grouping. But sorting blocks • Instead hash the input relation on its grouping columns. But does not perform better as the number of groups increases. • Hybrid hashing or its optimized version Hybrid Cache can be used. • For DISTINCT queries same technique can be used.
Index Striding • Update for the group with few members will be very infrequent. • Index Striding: It uses Round-Robin technique to fetch tuples from different groups fairly. • Advantages: • Output is updated according to default or user settings. • Delivery of tuples of a group can be stopped, so delivery of tuples for other groups is much faster.
Non-Blocking Join Algorithms • Sort-Merge Join: Sorting blocks. • Merge Join: Not acceptable for access methods that fetch tuples in sorted order. • Hybrid Hash Join: Acceptable if inner relation is small, as it blocks to hash the inner relation. • Pipeline Hash: Less Efficient than hybrid hash join. But efficient if both the relations are large. • Nested-Loops join: Not useful for joins with large inner relation non-indexed.
Optimization • Avoid Sorting • Function to calculate cost for Blocking sub-operations (e.g. Hashing inner relation in hybrid hash join) according to processing time should be exponential. • Maximize user control. • Trade off need to be evaluated between the output rate and the time to completion.
Aggregate Functions • New aggregate functions must be defined to return running confidence intervals. • Query Executor must be modified to provide running aggregate values for display. • An API must be provided to control the rate e.g. stopGroup, speedupGroup, slowDownGroup, setSkipFactor.
Running Confidence Intervals • Precision of running aggregates is given by running confidence intervals • Three types: • Conservative Confidence Interval • Large Sample Intervals • Deterministic Confidence Interval
Future Work • Enhancing User Interface • Support for Nested Queries • Checkpointing and Continuation • Tracking online queries