1 / 50

Scalable Approximate Query Processing

Scalable Approximate Query Processing. Florin Rusu. Data Explosion. Data storage advancements Price / capacity ($70 / 1 TB) Human generated Web 2.0 & social networking User data Communication Network & web logs (eBay – 50 TB / day) Call Detail Records (CDRs) Scientific experiments

mckile
Download Presentation

Scalable Approximate Query Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Scalable Approximate Query Processing Florin Rusu

  2. Data Explosion • Data storage advancements • Price / capacity ($70 / 1 TB) • Human generated • Web 2.0 & social networking • User data • Communication • Network & web logs (eBay – 50 TB / day) • Call Detail Records (CDRs) • Scientific experiments • LHC (Large Hadron Collider) • SKA (Square Kilometer Array) – 1 EB (1018) / day • Sensor networks

  3. Large-Scale Data Analytics • Traditional DB (OLTP) • Multi-user transaction processing • Optimized for specific workloads (views, indexes, …) • Analytic processing (OLAP) • Data cubes • Aggregate at different hierarchical levels • Pre-defined aggregates, not flexible • Shared-nothing architectures (MPP) • Startups: Netezza, Greenplum, AsterData, Vertica, … • Parallel databases on clusters of computers • Storage layer (row store, column store, hybrid) • Compression

  4. Interactive Data Analysis & Exploration • Ad-hoc queries • Compute statistical aggregates over all data • Example: web log analysis • Documents (URL, Content) • UserVisits (IP, URL, Date, Duration) • “How much time did users spend searching for cars during the period May – July 2009?” SELECTSUM(UV.Duration) FROM Documents D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09]

  5. Roadmap • Database query execution • System design & implementation • DataBaseOnline (DBO) • Approximation methods (theoretical analysis & practical implementation) • Sampling • Sketches • Sketches over samples

  6. Query Execution Σ ⋈ σ σ SELECTSUM(UV.Duration) FROM Documents D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D • Selections push down • Sort-Merge Join • Aggregate

  7. Selection Σ • Storage manager • One thread for each table scan • Project unused columns ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

  8. Selection Σ • Tuples are pipelined into join ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

  9. Sort-Merge Join – Sort Phase Σ • Sort tuples on join attribute • Write sorted runs to disk • Buffer space: UV(8) ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D Run 1 Run 2

  10. Sort-Merge Join – Merge Phase Σ ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D Run 1 Run 2 Run

  11. Sort-Merge Join – Merge Phase Σ ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D Run 1 Run 2 Run

  12. Aggregation Σ • Update the sum as tuples are produced ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D

  13. Final Result Σ ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D

  14. Roadmap • Database query execution • System design & implementation • DataBaseOnline (DBO) • Approximation methods (theoretical analysis & practical implementation) • Sampling • Sketches • Sketches over samples

  15. What is the problem? • TPC-H benchmark results (price / performance) • 10 TB scale • 928 hard-disks (90 TB total storage capacity) • 16 × quad-core processors • 512 GB RAM • $1.5 million • Load time: 55 hours • Q1: linear scan over one table with aggregates on top • 1 query: 19 minutes • 9 queries: 3 hours (linear scaling)

  16. Approximate Query Processing SELECTSUMf(r1•r2• … •rn) FROM R1 as r1, R2 as r2, …, Rn as rn Result estimate Traditional query processing Query result Confidence bounds Time

  17. DBO System Architecture[Rusu et al. 2008] In-Memory Join Query Result 4 ⋈ UV' D' Σ Levelwise Step Controller 3 2 ⋈ 5 1 σ σ 7 6 Estimation Module UV D Confidence bounds Result DB Engine Approximate answer

  18. Roadmap • Database query execution • System design & implementation • DataBaseOnline (DBO) • Approximation methods (theoretical analysis & practical implementation) • Sampling • Sketches • Sketches over samples

  19. Sampling[Dobra, Jermaine, Rusu & Xu 2009] Σ • Control, coordinate & schedule data flow between operators • Embed randomness in each operator ⋈ σ σ SELECTSUM(UV.Duration) FROM Documents D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

  20. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

  21. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

  22. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

  23. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ 50% input: 360; [-328, 1048] 95% probability • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

  24. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ Exceed In-Memory Join capacity (10 tuples)! Eliminate tuples such that variance is minimized. σ σ UV D

  25. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ 74% input: 258; [-293, 808] 95% probability • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

  26. Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ All input: 448; [3, 892] 95% probability • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

  27. Sampling Estimation – Intermediate Levels • Query result estimator & variance estimator computed from result tuples found by In-Memory Join • Confidence bounds derived with Central Limit Theorem • Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join

  28. Sampling – Join (Sort) SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] Σ • Sort tuples on random function of join attribute ⋈ σ σ UV D Run 1 Run 2 Run 1 Run 2

  29. Sampling – Join (Merge) In-Memory Join Σ ⋈ σ σ UV D SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] Run 1 Run 2 Run 1 Run 2

  30. Sampling – Join (Merge) In-Memory Join 50% input: 468; [194, 741] 95% probability Σ ⋈ σ σ UV D Run 1 Run 2 SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] Run 1 Run 2

  31. Sampling – Join (Merge) In-Memory Join Σ ⋈ σ σ Run 1 Run 2 UV D Run 1 Run 2 SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09]

  32. Sampling Estimation – Upper Level • Bernoulli sampling with probability given by domain fraction seen so far • Consolidate tuples generated by same join key • Solve optimization problem to minimize variance across levels • Keep confidence bounds stable

  33. Contributions • Design & implement DBO, first online analytical processing engine • Provide estimates & confidence bounds throughout entire query execution • SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations • Design & analyze fastest convergent estimation method for online aggregation • Statistics & optimization techniques

  34. Roadmap • Database query execution • System design & implementation • DataBaseOnline (DBO) • Approximation methods (theoretical analysis & practical implementation) • Sampling • Sketches • Sketches over samples

  35. Sketches Σ • Build sketches on join attribute while data is read from disk • Use attributes in aggregate ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

  36. Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09]

  37. Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09]

  38. Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09]

  39. Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] 230; [-416, 876] 95% probability

  40. Sketches Estimation • Two random processes • Bucket selection • Sign • Sketch update • Estimator • Confidence bounds • Multiple independent sketches • Chebyshev & Chernoff inequalities (worst-case) • Median Central Limit Theorem, Student-t distribution (statistics)

  41. Pseudo-Random Number Generators[Rusu & Dobra 2006, 2007b] • Detailed comparison of generating schemes • Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields) • Degree of independence as function of seed size • Fast range-summable • Empirical evaluation • Generating time is few processor cycles • Identify EH3 as generator for sketches • Lowest possible degree of independence • 7.3 ns to generate single number

  42. Statistical Analysis[Rusu & Dobra 2007a, 2008] • Detailed comparison of sketch estimators • Same accuracy (worst-case analysis) • Statistical analysis • Distribution (probability density function) • Higher frequency moments (kurtosis) • Confidence bounds • Empirical evaluation • Data skew, correlation, memory usage, update time • Identify Fast-AGMS as most reliable scheme • Accurate over entire range of data • Small memory footprint, fast update time

  43. Roadmap • Database query execution • System design & implementation • DataBaseOnline (DBO) • Approximation methods (theoretical analysis & practical implementation) • Sampling • Sketches • Sketches over samples

  44. Sketches over Samples[Rusu & Dobra 2009] Σ • Data is random on disk • Build sketches on join attribute while data is read from disk • Use attributes in aggregate • Provide estimates at any point ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

  45. Sketches over Samples SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] 50% input: 100; [-2382, 2582] 95% probability

  46. Sketches over Samples – Estimation • Define estimator over two completely different random processes & analyze statistically • Sampling – random partition, tuple domain • Sketches – random projection, frequency domain • Consider correlation between multiple sketches that share same sample • Moment generating functions • Generic analysis independent of sampling process • Bernoulli sampling • Sampling without replacement • Sampling with replacement

  47. Sketches over Samples – Analysis Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]

  48. Conclusions • Data explosion • Cheap, high-capacity storage • Current processing technology is too expensive for performance it provides • Framework for online analytical processing • DBO system architecture • Embed randomization into data processing • Provide estimates and bounds at any time • Approximation methods • Sampling – most flexible • Sketches – single pass • Sketches over samples – fastest

  49. Future Work • Short term • Define & design query optimization for DBO • Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …) • Generalize sketches to multiple relations • Find optimal amount of data to sketch • Fully integrate sketches into DBO system • Medium term • Develop data aggregation & approximation techniques for other types of architectures • Multicore processors, GPUs • Distributed processing (Map-Reduce, Hadoop, …) • Long term • Design & build scalable analytic processing system • Aggregation & approximation

  50. Publications • A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009. • F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009. • F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo). • F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008. • F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007. • F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007. • F. Rusuand A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.

More Related