Scalable Approximate Query Processing

Scalable Approximate Query Processing Florin Rusu

Data Explosion • Data storage advancements • Price / capacity ($70 / 1 TB) • Human generated • Web 2.0 & social networking • User data • Communication • Network & web logs (eBay – 50 TB / day) • Call Detail Records (CDRs) • Scientific experiments • LHC (Large Hadron Collider) • SKA (Square Kilometer Array) – 1 EB (1018) / day • Sensor networks

Large-Scale Data Analytics • Traditional DB (OLTP) • Multi-user transaction processing • Optimized for specific workloads (views, indexes, …) • Analytic processing (OLAP) • Data cubes • Aggregate at different hierarchical levels • Pre-defined aggregates, not flexible • Shared-nothing architectures (MPP) • Startups: Netezza, Greenplum, AsterData, Vertica, … • Parallel databases on clusters of computers • Storage layer (row store, column store, hybrid) • Compression

Interactive Data Analysis & Exploration • Ad-hoc queries • Compute statistical aggregates over all data • Example: web log analysis • Documents (URL, Content) • UserVisits (IP, URL, Date, Duration) • “How much time did users spend searching for cars during the period May – July 2009?” SELECTSUM(UV.Duration) FROM Documents D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09]

Roadmap • Database query execution • System design & implementation • DataBaseOnline (DBO) • Approximation methods (theoretical analysis & practical implementation) • Sampling • Sketches • Sketches over samples

Query Execution Σ ⋈ σ σ SELECTSUM(UV.Duration) FROM Documents D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D • Selections push down • Sort-Merge Join • Aggregate

Selection Σ • Storage manager • One thread for each table scan • Project unused columns ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

Selection Σ • Tuples are pipelined into join ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

Sort-Merge Join – Sort Phase Σ • Sort tuples on join attribute • Write sorted runs to disk • Buffer space: UV(8) ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D Run 1 Run 2

Sort-Merge Join – Merge Phase Σ ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D Run 1 Run 2 Run

Aggregation Σ • Update the sum as tuples are produced ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D

Final Result Σ ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] UV D

What is the problem? • TPC-H benchmark results (price / performance) • 10 TB scale • 928 hard-disks (90 TB total storage capacity) • 16 × quad-core processors • 512 GB RAM • $1.5 million • Load time: 55 hours • Q1: linear scan over one table with aggregates on top • 1 query: 19 minutes • 9 queries: 3 hours (linear scaling)

Approximate Query Processing SELECTSUMf(r1•r2• … •rn) FROM R1 as r1, R2 as r2, …, Rn as rn Result estimate Traditional query processing Query result Confidence bounds Time

DBO System Architecture[Rusu et al. 2008] In-Memory Join Query Result 4 ⋈ UV' D' Σ Levelwise Step Controller 3 2 ⋈ 5 1 σ σ 7 6 Estimation Module UV D Confidence bounds Result DB Engine Approximate answer

Sampling[Dobra, Jermaine, Rusu & Xu 2009] Σ • Control, coordinate & schedule data flow between operators • Embed randomness in each operator ⋈ σ σ SELECTSUM(UV.Duration) FROM Documents D, UserVisits UV WHERE D.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ 50% input: 360; [-328, 1048] 95% probability • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ Exceed In-Memory Join capacity (10 tuples)! Eliminate tuples such that variance is minimized. σ σ UV D

Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ 74% input: 258; [-293, 808] 95% probability • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

Sampling – Selection SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] In-Memory Join Σ All input: 448; [3, 892] 95% probability • Data in random order • Assign random timestamp to tuples • Controller schedules data flow between operators ⋈ σ σ UV D

Sampling Estimation – Intermediate Levels • Query result estimator & variance estimator computed from result tuples found by In-Memory Join • Confidence bounds derived with Central Limit Theorem • Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join

Sampling – Join (Sort) SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] Σ • Sort tuples on random function of join attribute ⋈ σ σ UV D Run 1 Run 2 Run 1 Run 2

Sampling – Join (Merge) In-Memory Join Σ ⋈ σ σ UV D SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] Run 1 Run 2 Run 1 Run 2

Sampling – Join (Merge) In-Memory Join 50% input: 468; [194, 741] 95% probability Σ ⋈ σ σ UV D Run 1 Run 2 SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] Run 1 Run 2

Sampling – Join (Merge) In-Memory Join Σ ⋈ σ σ Run 1 Run 2 UV D Run 1 Run 2 SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09]

Sampling Estimation – Upper Level • Bernoulli sampling with probability given by domain fraction seen so far • Consolidate tuples generated by same join key • Solve optimization problem to minimize variance across levels • Keep confidence bounds stable

Contributions • Design & implement DBO, first online analytical processing engine • Provide estimates & confidence bounds throughout entire query execution • SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations • Design & analyze fastest convergent estimation method for online aggregation • Statistics & optimization techniques

Sketches Σ • Build sketches on join attribute while data is read from disk • Use attributes in aggregate ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09]

Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09]

Sketches SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’AND UV.Datebetween [05-01-09, 07-31-09] 230; [-416, 876] 95% probability

Sketches Estimation • Two random processes • Bucket selection • Sign • Sketch update • Estimator • Confidence bounds • Multiple independent sketches • Chebyshev & Chernoff inequalities (worst-case) • Median Central Limit Theorem, Student-t distribution (statistics)

Pseudo-Random Number Generators[Rusu & Dobra 2006, 2007b] • Detailed comparison of generating schemes • Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields) • Degree of independence as function of seed size • Fast range-summable • Empirical evaluation • Generating time is few processor cycles • Identify EH3 as generator for sketches • Lowest possible degree of independence • 7.3 ns to generate single number

Statistical Analysis[Rusu & Dobra 2007a, 2008] • Detailed comparison of sketch estimators • Same accuracy (worst-case analysis) • Statistical analysis • Distribution (probability density function) • Higher frequency moments (kurtosis) • Confidence bounds • Empirical evaluation • Data skew, correlation, memory usage, update time • Identify Fast-AGMS as most reliable scheme • Accurate over entire range of data • Small memory footprint, fast update time

Sketches over Samples[Rusu & Dobra 2009] Σ • Data is random on disk • Build sketches on join attribute while data is read from disk • Use attributes in aggregate • Provide estimates at any point ⋈ σ σ SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] UV D

Sketches over Samples SELECTSUM(UV.Duration) FROMDocuments D, UserVisits UV WHERED.URL = UV.DocURLAND D.Contentcontains ‘car’ AND UV.Datebetween [05-01-09, 07-31-09] 50% input: 100; [-2382, 2582] 95% probability

Sketches over Samples – Estimation • Define estimator over two completely different random processes & analyze statistically • Sampling – random partition, tuple domain • Sketches – random projection, frequency domain • Consider correlation between multiple sketches that share same sample • Moment generating functions • Generic analysis independent of sampling process • Bernoulli sampling • Sampling without replacement • Sampling with replacement

Sketches over Samples – Analysis Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]

Conclusions • Data explosion • Cheap, high-capacity storage • Current processing technology is too expensive for performance it provides • Framework for online analytical processing • DBO system architecture • Embed randomization into data processing • Provide estimates and bounds at any time • Approximation methods • Sampling – most flexible • Sketches – single pass • Sketches over samples – fastest

Future Work • Short term • Define & design query optimization for DBO • Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …) • Generalize sketches to multiple relations • Find optimal amount of data to sketch • Fully integrate sketches into DBO system • Medium term • Develop data aggregation & approximation techniques for other types of architectures • Multicore processors, GPUs • Distributed processing (Map-Reduce, Hadoop, …) • Long term • Design & build scalable analytic processing system • Aggregation & approximation

Publications • A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009. • F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009. • F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo). • F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008. • F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007. • F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007. • F. Rusuand A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.

Scalable Approximate Query Processing