
Online Aggregation



Presentation Transcript


  1. Online Aggregation Joe Hellerstein UC Berkeley

  2. Online Aggregation: Motivation Select AVG(grade) from ENROLL; • A “fancy” interface: [figure: the query results feed an AVG aggregate, displayed as 3.262574342]

  3. A Better Approach • Don’t process in batch! Online aggregation:

  4. Isn’t This Just Sampling? • Yes • we need values to arrive in random order • “confidence intervals” similar to sampling work • … and No! • stopping condition set on the fly! • statistical techniques are more sophisticated • can handle GROUP BY w/o a priori knowledge...

  5. Grouping Select AVG(grade) from ENROLL GROUP BY major;

  6. Requirements • Usability: continuous output, non-blocking query plans, time/precision control, fairness/partiality • Performance: time to accuracy, time to completion, pacing

  7. Fairness/Partiality • Speed controls!

  8. Applications • Large-scale Data Analysis • Online drill-down/roll-up/CUBE • Visualization tools • Data mining • Distributed requests • Generally: • Any long-running activity should be visualized and controllable online. • Accuracy rarely critical for non-lookups

  9. A Naïve Approach Select onln_avg(grade) from ENROLL; • Can do it in Illustra/Informix today! • But... • No grouping • Can’t meet performance & usability needs: • no guarantee of continuous output • optimized for completion time! • no guarantee of fairness (or control over partiality) • no control over pacing

  10. Random Access to Data • Heap Scan • OK if clustering uncorrelated to agg & grouping attrs • Index Scan • can scan an index on attrs uncorrelated to agg or grouping (or index on random()) • Sampling: • could introduce new sampling access methods (e.g. Olken’s work)
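The point of these access methods is that tuples must arrive in effectively random order for a running estimate to converge. A minimal Python sketch (not from the talk; the synthetic grade data is illustrative) contrasting a correlated heap scan with a randomized one:

```python
import random

# Why scan order matters: a running average over a clustered scan is
# badly biased early on; the same data visited in random order
# (e.g., via an index on random()) converges quickly.

def running_avg(values):
    total = 0.0
    for i, v in enumerate(values, 1):
        total += v
        yield total / i

# Synthetic grades clustered by value, as a correlated heap scan returns them.
clustered = [1.0] * 500 + [4.0] * 500
shuffled = clustered[:]
random.seed(0)
random.shuffle(shuffled)

true_mean = sum(clustered) / len(clustered)       # 2.5
est_clustered = list(running_avg(clustered))[99]  # after 100 tuples: exactly 1.0
est_shuffled = list(running_avg(shuffled))[99]    # after 100 tuples: near 2.5
```

After 100 tuples the clustered scan still reports the mean of the first cluster only, while the shuffled scan is already close to the true answer.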

  11. Group By & Distinct • Fair, Non-Blocking Group By/Distinct • Can’t sort! Must use hash-based techniques • sorting blocks • sorting is unfair • Hybrid hashing! • especially for duplicate elimination. • “Hybrid Cache” even better.
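The reason hashing is non-blocking can be seen in a minimal Python sketch (an illustration, not the Postgres implementation): every incoming tuple updates its group's running sum and count, so a current estimate for each group is available at any moment, whereas sort-based grouping emits nothing until the sort finishes.

```python
from collections import defaultdict

def online_group_avg(tuples):
    """Hash-based grouping with a running per-group average."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for major, grade in tuples:
        sums[major] += grade
        counts[major] += 1
        # Non-blocking: a current estimate for every group seen so far.
        yield {g: sums[g] / counts[g] for g in sums}

stream = [("CS", 4.0), ("EE", 3.0), ("CS", 2.0)]
snapshots = list(online_group_avg(stream))
# snapshots[0] == {"CS": 4.0}
# snapshots[-1] == {"CS": 3.0, "EE": 3.0}
```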

  12. Index Striding • For fair Group By: • want random tuple from Group 1, random tuple from Group 2, ... • Idea: Index gives tuples from a single group. • Sol’n: one access method opens many cursors in index, one per group. Fetch round-robin. • Can control speed by weighting the schedule • Gives fairness/partiality, info/speed match! (Next step: “heap stride”)
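The round-robin fetch over per-group cursors can be sketched in a few lines of Python (an illustration of the scheduling idea, not the actual access-method code; per-group weights model the speed controls):

```python
def index_stride(groups, weights):
    """One cursor per group; fetch round-robin, `weights[g]` tuples per turn."""
    cursors = {g: iter(ts) for g, ts in groups.items()}
    while cursors:
        for g in list(cursors):           # round-robin over open cursors
            for _ in range(weights.get(g, 1)):
                try:
                    yield g, next(cursors[g])
                except StopIteration:     # group exhausted: close its cursor
                    del cursors[g]
                    break

groups = {"L": range(4), "S": range(2)}   # a large and a small group
out = list(index_stride(groups, {"L": 1, "S": 1}))
# [("L", 0), ("S", 0), ("L", 1), ("S", 1), ("L", 2), ("L", 3)]
```

With equal weights the small group gets as many early fetches as the large one (fairness); raising one group's weight skews the schedule toward it (partiality).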

  13. Join Algorithms • Non-Blocking Joins • no sorting! • merge join OK, but watch “interesting orders”! • hybrid hash not great • symmetric “pipeline” hash [Wilschut/Apers 91] • nested loops always good, can be too slow • optimization...?
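The symmetric pipeline hash join of [Wilschut/Apers 91] builds and probes on both inputs as tuples arrive, so matches stream out with no blocking build phase. A minimal Python sketch (illustrative, with an interleaved input stream assumed for simplicity):

```python
from collections import defaultdict

def symmetric_hash_join(stream):
    """stream yields (side, key, payload) with side in {"L", "R"}."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, payload in stream:
        tables[side][key].append(payload)      # build into own hash table
        other = "R" if side == "L" else "L"
        for match in tables[other][key]:       # probe the other side's table
            yield (key, payload, match) if side == "L" else (key, match, payload)

stream = [("L", 1, "a"), ("R", 1, "x"), ("R", 2, "y"), ("L", 2, "b")]
results = list(symmetric_hash_join(stream))
# [(1, "a", "x"), (2, "b", "y")]
```

Each match is emitted the instant its second partner arrives, which is exactly the non-blocking behavior online aggregation needs.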

  14. Query Optimization • 2 components in cost function: • dead time (td): time spent doing “invisible” work -- tax this at a high rate! • output time (to): time spent producing output • this will do a lot automagically! • User control vs. performance? • can use competition (e.g., a la Rdb) • “interesting” (i.e. bad!) ordering

  15. Extended Aggregate Functions • Aggs need to return tuples in the middle of processing • add a 4th “current estimate” function • open, iterate, estimate, close • sometimes like close function (AVG), sometimes not (COUNT) • aggregation code needs to spit this out to the front-end
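The open/iterate/estimate/close interface can be illustrated with COUNT, the case where estimate differs from close. A hypothetical Python sketch (class and method names are illustrative; it assumes the total table cardinality is known so the running count can be scaled up):

```python
class OnlineCount:
    """COUNT(col) with a 4th "current estimate" function."""

    def open(self, total_tuples):
        self.seen, self.count, self.total = 0, 0, total_tuples

    def iterate(self, tup):
        self.seen += 1
        if tup is not None:       # COUNT ignores NULLs
            self.count += 1

    def estimate(self):
        # Unlike AVG, the running value must be scaled by the
        # fraction of the table scanned so far.
        return self.count * self.total / self.seen

    def close(self):
        return self.count

agg = OnlineCount()
agg.open(total_tuples=1000)
for t in [1, None, 1, 1]:         # 4 of 1000 tuples scanned, 3 non-NULL
    agg.iterate(t)
# agg.estimate() == 750.0, agg.close() == 3
```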

  16. API • Current API uses built-in methods • button-press = query invocation • e.g., select StopGroup(val); • High overhead, need to know internals. • Very flexible. Easy to code! • Estimates returned as tuples • no distinction between estimates and final answer • OK? • A general API for running status?

  17. Pacing • Inter-tuple speed is critical!!

  18. Statistical Issues • Confidence Intervals for SQL aggs • given an estimate, probability p that we’re within ε of the right answer • 3 types of estimates • Conservative (Hoeffding’s inequality) • Large-Sample (Central Limit Theorems) • Deterministic • Previous work + new results from Peter Haas
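The conservative case is simple enough to sketch. By Hoeffding's inequality, for n samples of a value bounded in [a, b], the running mean is within ε of the true mean with probability at least p when ε = (b − a)·√(ln(2/(1−p)) / (2n)). A small Python illustration (the grade range [0, 4] is assumed):

```python
import math

def hoeffding_halfwidth(n, a, b, p=0.99):
    """Conservative confidence-interval half-width for a running AVG
    over n samples bounded in [a, b], at confidence level p."""
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))

# After 10,000 grades in [0, 4], a 99% interval is about +/- 0.065:
eps = hoeffding_halfwidth(10_000, 0.0, 4.0, p=0.99)
```

Being distribution-free, this bound is much wider than the large-sample (CLT) interval for the same n, which is the usual price of a conservative estimate.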

  19. Initial Implementation • prototype in PostgreSQL (a/k/a Postgres95, a/k/a PG-Lite) • aggs with running output • hash-based grouping, dup elimination • index striding • optimizer tweaks • “API” hacks and a simple UI

  20. Pacing Study Select AVG(grade), Interval(0.99) From ENROLL; [chart: output pacing under the normal plan]

  21. Access Methods, Big Group Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college; • College L: 925K of 1.5M tuples

  22. Access Methods, Small Group Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college; • College S: 15K of 1.5M tuples • Cost of API call? Surprise!!!

  23. Future Work • Better UI • simple example: histograms w/error bars • online data visualization (Tioga DataSplash) • data viz = “graphical” aggregate • sampled points, or sampled wavelet coefficients? • great for panning, zooming • Nested Queries • Tie-ins with AI’s “anytime algorithms”

  24. Future Work II • Control w/o Indices • Heap Striding = do multiple scans • optimization 1: fast scans can pass hints to slow • optimization 2: “piggyback” scans via buffering • big payoff for complex queries • Checkpointing/continuation • also continuous data streams

  25. Future Work III • Sample Tracking • important for financial auditing, statistical quality control • cached views, RID-lists, logical views • More stats: • non-standard aggs • simultaneous confidence intervals • allow for slop in cardinality estimation

  26. Summary • DSS/OLAP users very demanding! • “speed of thought” performance required • only choices: precomputation or estimation • for estimation, need to provide control and steady, useful feedback • puts user in the driver’s seat • takes some performance burden away from system • Requires significant backend support • but well worth the effort • runs the “impossible” queries
