220 likes | 244 Views
Online Aggregation. By:. Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539)
E N D
Online Aggregation By: Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539) Deepak Anand (1000603813)
AGENDA • Motivation • Online Aggregation • Basic Approach • Goals • Building an Online Aggregation system • Optimization • Running confidence intervals • Conclusion • Future work
Motivation • Aggregation in traditional databases • Long delay in query execution and user is forced to wait without feedback till query completes execution. • Users want to see the aggregation information right away. • Aggregation queries are typically used to get a ‘rough picture” but they are computed with painstaking precision. • This paper suggests the following changes: • Perform aggregation online so that: • Progress can be observed. • execution of the queries can be controlled on the fly.
An Example Consider the following example: SELECT AVG(final_grade) FROM grades WHERE course_name = ‘CS186’ If there is no index on the course_name attribute, then this query scans the entire grades table before returning the result. AVG -------------- | 2.631046 | --------------
An alternative approach Running aggregate An estimate of the final result based on the records retrieved so far Progress Bar Running confidence interval 2.6336 +/- 0.0652539 with 95% probability
Online Aggregation Interface with Groups • If the records are retrieved in the random order, a good approximate result can be obtained • We can stop sampling once the length of the confidence interval becomes sufficiently small. Stop Button • Consider a GROUP BY query with 6 groups in the output • The user is presented with 6 outputs and 6 “Stops-sign” buttons • Stopping condition can be set on the fly • Easy to understand for non-statistical user
Usability goals • Continuous observation: Users can observe the processing in the GUI and get a sense of the current level of precision. • Control of Time/Precision: Users can terminate processing at any time at a fine granularity(trade-off between time and precision) • Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.
Performance goals • Minimum time to accuracy: Minimize time required to produce a useful estimate of the final answer. • Minimum time to completion: Minimize time required to produce the final answer. • Pacing: The running aggregates are updated at a regular rate, to guarantee a smooth and continuously improving display.
Building an Online Aggregation System There are two approaches that can be taken: • A Naive approach: • Trivial implementation without modification to POSTGRES. • User defined functions can be written in C. • Cannot be used with GROUP BY clause. • Modifying the DBMS: • Difficult to implement online aggregation as user level addition. • Modifying the database engine to support Online Aggregation. SELECT running_avg(final_grade) running_confidence(final_grade) running_interval(final_grade) FROM grades
Random Access to Data Estimates of the running aggregates is accurate when records are retrieved randomly. • Heap Scans • Simple heap scans can be effective in traditional heap file access methods where records are stored in unspecified order. • Need to choose different method for the aggregate attributes, which are correlated to the logical order of formation of heap. • Index Scans • Can be used if aggregate attributes are not used for indexing. • Sampling from Indices • Techniques for pseudo random sampling from various index structures can be used. [Olken’s work]
Non-blocking/Fair access GROUP BY and DISTINCT • Groups should receive updates in fair manner Solution: Sorting ?? No, because sorting blocks • Must use hash based techniques • Pros: Non-blocking • Cons: Does not perform well as the number of groups grow. • Solution: Hybrid hashing. • Optimized version: Hybrid cache • For DISTINCT columns, a similar hashing technique can be used.
Index Striding • Updates for the groups with few members will be very infrequent. • For fair group by • Read tuples in round robin fashion (a tuple from group 1, a tuple from group 2, …) • Supported by technique index striding • What is Index Striding ? • Additional advantages • Group updating rate can be controlled • Particular group processing can be stopped
POSTGRES with index striding Speed control
Non-blocking Join Algorithms • For interactive display of online aggregation, avoid algorithms that block. • Sort-merge join • Unacceptable as sorting is blocking operation • Merge Join • OK but produces sorted output • Hybrid hash join • Not good if inner relation is large • Nested loops join isalways good, • In case of large un-indexed inner relation its too slow • An optimizer must be used to choose between these strategies.
Optimization • Avoid sorting unless explicitly requested by the user. • Blocking sub-operations have costs and appropriate costs should be considered. • Cost function = f(to) + g(td ) • There are 2 components in cost function: • dead time (td ): time spent doing “invisible” work • output time (to ): time spent producing output • Preferences to the plans that maximize user control (index striding)
Extended aggregate functions • Standard set of aggregate functions must be extended • Aggregate functions must be written that provides running estimates • Running computation • SUM, COUNT, AVG – straight forward • VAR, STD DEV – can be implemented using algorithms • Aggregate functions returning running confidence must be defined.
API Current API uses built-in methods • e.g., StopGroup(cursor,groupval) speedUpGroup(cursor,groupval) slowDownGroup(cursor,groupval) setSkipFactor(cursor_name,integer) Skip Factor
Statistical Issues • Running confidence interval • Given an estimate, probability p that we’re within eof the right answer Mu • A large value of e means that the records seen so far may not be sufficiently representative of the entire database and the current estimate of the result may be far from the final result. • Types of running confidence interval s: • Conservative confidence interval • For n (no of tuples retrieved) >= 1 • Answer guaranteed to be >= probability p [based on Hoeffding’s inequality] • Large-sample confidence intervals • Deterministic confidence intervals • Running confidence interval can be dynamically adjusted depending on the value of n.
Conclusion • An interactive, intuitive and user-controllable approach to aggregation is needed. • This can be achieved by significant extensions to the database engine. • These extensions satisfy the usability and performance goals. • Ability to produce statistical confidence intervals for running aggregates.
Future work • Better UI • Nested Queries • Control without Indices • Checkpointing / Continuation