Online Aggregation Joseph M. Hellerstein Peter J.Haas Helen J.Wang

Online AggregationJoseph M. HellersteinPeter J.HaasHelen J.Wang Presented by Archana Vijayalakshmanan

Contents • Introduction • Example • Advantages • Requirements • Approaches to building a system • System issues • Conclusion

+ AVG Query Results 3.262574342 Online Aggregation: Motivation Select AVG(grade) from ENROLL; • A “fancy” interface:

A Better Approach • Don’t process in batch! Online aggregation:

Example Select AVG(grade) from ENROLL GROUP BY major;

Advantages • stopping condition set on the fly! • statistical techniques are more sophisticated • can handle GROUP BY w/o a priori knowledge

Usability Continuous output non-blocking query plans time/precision control fairness/partiality Performance time to accuracy time to completion pacing Requirements

A Naive Approach SELECT running_avg(final_grade), running_confidence(final_grade), running_interval(final_grade) FROM grades; • No grouping • Can’t meet performance & usability needs: • no guarantee of continuous output • no guarantee of fairness (or control over partiality) • no control over pacing

Random Access to Data • Heap Scan • OK if clustering uncorrelated to agg & grouping attrs • Index Scan • can scan an index on attrs uncorrelated to agg or grouping • Sampling from indices • could introduce new sampling access methods (e.g. Olken’s work)

Group By & Distinct • Can’t sort! • sorting blocks • sorting is unfair • Must use hash-based techniques • non-blocking approach but do not scale gracefully. • Hybrid Hashing. • “Hybrid Cache” even better.

Index Striding • For fair Group By: • read tuples in round-robin fashion. (want random tuple from Group 1, random tuple from Group 2, ...) • each group is updated at appropriate rate. • gives info/speed match!

Join Algorithms • Non-Blocking Joins • no sorting! • merge join OK, but watch for the sorted output • hybrid hash not great • symmetric pipeline hash • nested loops always good, can be too slow

Query Optimization • Avoid sorting • Blocking sub-operations 2 components in cost function: • dead time (td ): time spent doing “invisible” work -- tax this at a high rate! • output time (to ): time spent producing output • Preference to plans that maximize user control e.g., index striding

Extended Aggregate Functions • Basically,aggregate functions must provide running estimates SUM,COUNT-straight forward VAR,STD DEV-algorithms • return confidence intervals

API Current API uses built-in methods • e.g., StopGroup(cursor,groupval) speedUpGroup(cursor,groupval) slowDownGroup(cursor,groupval) setSkipFactor(cursor name,integer)

Future Work • Better UI -online data visualization (Tioga DataSplash) • data viz = “graphical” aggregate - “drill down” and roll up” facilities • Nested Queries • Control w/o Indices • Checkpointing/continuation • Tracking online queries • Extensions of statistical results

References control.cs.berkeley.edu/online/olamd/olamd.PPT

Online Aggregation Joseph M. Hellerstein Peter J.Haas Helen J.Wang