Enhancing Data Aggregation in Databases: A Real-Time Approach

Online Aggregation By: Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539) Deepak Anand (1000603813)

AGENDA • Motivation • Online Aggregation • Basic Approach • Goals • Building an Online Aggregation system • Optimization • Running confidence intervals • Conclusion • Future work

Motivation • Aggregation in traditional databases • Long delay in query execution and user is forced to wait without feedback till query completes execution. • Users want to see the aggregation information right away. • Aggregation queries are typically used to get a ‘rough picture” but they are computed with painstaking precision. • This paper suggests the following changes: • Perform aggregation online so that: • Progress can be observed. • execution of the queries can be controlled on the fly.

An Example Consider the following example: SELECT AVG(final_grade) FROM grades WHERE course_name = ‘CS186’ If there is no index on the course_name attribute, then this query scans the entire grades table before returning the result. AVG -------------- | 2.631046 | --------------

An alternative approach Running aggregate An estimate of the final result based on the records retrieved so far Progress Bar Running confidence interval 2.6336 +/- 0.0652539 with 95% probability

Online Aggregation Interface with Groups • If the records are retrieved in the random order, a good approximate result can be obtained • We can stop sampling once the length of the confidence interval becomes sufficiently small. Stop Button • Consider a GROUP BY query with 6 groups in the output • The user is presented with 6 outputs and 6 “Stops-sign” buttons • Stopping condition can be set on the fly • Easy to understand for non-statistical user

Usability goals • Continuous observation: Users can observe the processing in the GUI and get a sense of the current level of precision. • Control of Time/Precision: Users can terminate processing at any time at a fine granularity(trade-off between time and precision) • Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.

Performance goals • Minimum time to accuracy: Minimize time required to produce a useful estimate of the final answer. • Minimum time to completion: Minimize time required to produce the final answer. • Pacing: The running aggregates are updated at a regular rate, to guarantee a smooth and continuously improving display.

Building an Online Aggregation System There are two approaches that can be taken: • A Naive approach: • Trivial implementation without modification to POSTGRES. • User defined functions can be written in C. • Cannot be used with GROUP BY clause. • Modifying the DBMS: • Difficult to implement online aggregation as user level addition. • Modifying the database engine to support Online Aggregation. SELECT running_avg(final_grade) running_confidence(final_grade) running_interval(final_grade) FROM grades

Random Access to Data Estimates of the running aggregates is accurate when records are retrieved randomly. • Heap Scans • Simple heap scans can be effective in traditional heap file access methods where records are stored in unspecified order. • Need to choose different method for the aggregate attributes, which are correlated to the logical order of formation of heap. • Index Scans • Can be used if aggregate attributes are not used for indexing. • Sampling from Indices • Techniques for pseudo random sampling from various index structures can be used. [Olken’s work]

Non-blocking/Fair access GROUP BY and DISTINCT • Groups should receive updates in fair manner Solution: Sorting ?? No, because sorting blocks • Must use hash based techniques • Pros: Non-blocking • Cons: Does not perform well as the number of groups grow. • Solution: Hybrid hashing. • Optimized version: Hybrid cache • For DISTINCT columns, a similar hashing technique can be used.

Index Striding • Updates for the groups with few members will be very infrequent. • For fair group by • Read tuples in round robin fashion (a tuple from group 1, a tuple from group 2, …) • Supported by technique index striding • What is Index Striding ? • Additional advantages • Group updating rate can be controlled • Particular group processing can be stopped

POSTGRES with index striding Speed control

Non-blocking Join Algorithms • For interactive display of online aggregation, avoid algorithms that block. • Sort-merge join • Unacceptable as sorting is blocking operation • Merge Join • OK but produces sorted output • Hybrid hash join • Not good if inner relation is large • Nested loops join isalways good, • In case of large un-indexed inner relation its too slow • An optimizer must be used to choose between these strategies.

Optimization • Avoid sorting unless explicitly requested by the user. • Blocking sub-operations have costs and appropriate costs should be considered. • Cost function = f(to) + g(td ) • There are 2 components in cost function: • dead time (td ): time spent doing “invisible” work • output time (to ): time spent producing output • Preferences to the plans that maximize user control (index striding)

Extended aggregate functions • Standard set of aggregate functions must be extended • Aggregate functions must be written that provides running estimates • Running computation • SUM, COUNT, AVG – straight forward • VAR, STD DEV – can be implemented using algorithms • Aggregate functions returning running confidence must be defined.

API Current API uses built-in methods • e.g., StopGroup(cursor,groupval) speedUpGroup(cursor,groupval) slowDownGroup(cursor,groupval) setSkipFactor(cursor_name,integer) Skip Factor

Statistical Issues • Running confidence interval • Given an estimate, probability p that we’re within eof the right answer Mu • A large value of e means that the records seen so far may not be sufficiently representative of the entire database and the current estimate of the result may be far from the final result. • Types of running confidence interval s: • Conservative confidence interval • For n (no of tuples retrieved) >= 1 • Answer guaranteed to be >= probability p [based on Hoeffding’s inequality] • Large-sample confidence intervals • Deterministic confidence intervals • Running confidence interval can be dynamically adjusted depending on the value of n.

Performance Issues

Conclusion • An interactive, intuitive and user-controllable approach to aggregation is needed. • This can be achieved by significant extensions to the database engine. • These extensions satisfy the usability and performance goals. • Ability to produce statistical confidence intervals for running aggregates.

Future work • Better UI • Nested Queries • Control without Indices • Checkpointing / Continuation

TIME TO ASK QUESTIONS

Enhancing Data Aggregation in Databases: A Real-Time Approach

Enhancing Data Aggregation in Databases: A Real-Time Approach

Presentation Transcript

Lead Aggregation Online

Colloidal Aggregation

FX Aggregation

Aggregation cultures

Aggregation

Online aggregation

Ripple Joins for Online Aggregation

Online Aggregation

Aggregation

AGGREGATION

FIB Aggregation

Data Aggregation

Online Aggregation

Basin Aggregation

IPFIX Aggregation

IPFIX Aggregation

Aggregation and Secure Aggregation

Aggregation

Aggregation