340 likes | 490 Views
Optimal Workload-Based Weighted Wavelet Synopsis. Yossi Matias Daniel Urieli. School of Computer Science Tel Aviv University. Outline. Motivation Background & Contributions Wavelet synopses Optimal WB weighted wavelet synopses. Outline. Motivation Background & Contributions
E N D
Optimal Workload-Based Weighted Wavelet Synopsis Yossi Matias Daniel Urieli School of Computer Science Tel Aviv University
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Compact Data Synopses “Transformed” Query Approximate Answer KB/MB FAST!! Approximate Query Processing OperationalDatabase SQL Query Long Response Times! Exact Answer GB/TB
Goals • Develop data synopses • Most accurate answers • Using a small amount of memory • Massive data sets efficient construction • Time • I/O
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Data synopses • Samples: random samples, stratified samples, congressional samples, reservoir-sampling, backing samples, join synopses, sketches • [Olken-Rotem, Vitter, Alon-Matias-Szegedy, Gibbons-Matias-Poosala, Acharia et al…] • Used in commercial DB systems • Histograms: equi-depth, compressed, v-optimal, spline, multi-dimensional, dynamic, Max-diff, MHIST • [Poosala-Ionnidis, etc.] • Used in commercial DB systems • Wavelets synopses: basic, multi-dim, probabilistic, dynamic, extended • Adapts to nature of data effectively • [Matias-Vitter-Wang, Garafolakis-Gibbons, Chakrabarti et al, Rousopoulous-Kiotidis…] • Workload-based wavelet synopses [Matias, Portman]
Workload-based synopses • Future queries correlated to past queries • Can be thought of as taken from a probability distribution roughly determined by the workload • Workload based synopses: optimized for a given query workload • “Standard” synopses assume uniform workload
Workload-based synopses – prior work • Workload-based sampling • Overcoming limitations of sampling for aggregation queries [Chaudhuri, Das, Datar, Motwani, and Narasayya] • Icicles: Self-tuning samples for approximate query answering [Ganti, Lee, Ramakrishnan] • Workload-based histograms • Self-tuning histograms [Aboulnaga and Chaudhuri] • ST-holes [ Bruno et al. ] • Hierarchical range histogram [Guha-Koudas-Srivastava-02] • Workload-based wavelets • By Yossi Matias and Leon Portman
Workload-Based Wavelet synopses [MP03] • Adapts effectively to a given query workload (not only to data) • Reduces the mean-squared-absolute / relative error over a workload of queries • Order magnitude improvement over prior wavelet synopses • Not necessarily optimal
Contributions • Optimal Workload-based Weighted Wavelet (WWW) synopses • WB-MSE (Workload-Based Mean Squared Error) • WB-MRE (Workload-Based Mean-squared Relative Error) • Equivalently, minimize the expected squared, absolute or relative error over a point query • First to minimize the MRE over the data • WB-MRE with uniform distribution • Both WWW synopses are optimal enhanced wavelet synopses • A generalized definition which allows coefficients with arbitrary values • Optimal cost construction • Linear construction time • I/O optimal
Techniques • Problem definition in terms of • Weighted norm • Weighted-inner-product • Weighted-inner-product-space • Weighted wavelets for building data synopses
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Resolution Averages Detail Coefficients 3 [2, 2, 0, 2, 3, 5, 4, 4] ---- 2 [2, 1, 4, 4] [0, -1, -1, 0] 1 [1.5, 4] [0.5, 0] 0 [2.75] [-1.25] [2.75, -1.25, 0.5, 0, 0, -1, -1, 0] Haar wavelet decomposition • Wavelets: mathematical tool for hierarchical decomposition of functions/signals • Haar wavelets: simplest wavelet basis, easy to understand and implement • Recursive pair wise averaging and differencing at different resolutions. • A linear time algorithm.
+ 2.75 + - -1.25 0.5 0 0 -1 0 -1 + - + - + - + - + - + - 2 2 0 2 3 5 4 4 Wavelet error tree [MVW98] Original data
+ + - + - + - + - + - + - + - 1 -1 0 1 The Haar Basis
+ 2.75 + - -1.25 0.5 0 0 -1 0 -1 + - + - + - + - + - + - 2 2 0 2 3 5 4 4 Wavelet error tree [MVW98] How should we choose which coefficients to retain? 1 1 Original data
Parseval-based optimal thresholding • Given a vector with respect to some orthonormal basis • Goal: approximate the vector using only M << N basis coefficients • Then, choosing the largest M coefficients is optimal • Minimizes the L2 norm of the error vector
Haar Wavelet Synopses - summary • Compute Haar wavelet decomposition of D • Coefficient thresholding: only M<<|D| = N coefficients can be kept • Parseval-based thresholding • optimal w.r.t the MSE • Several other greedy heuristics exists
Outline • Motivation • Background & Contributions • Wavelet synopses • Optimal WB weighted wavelet synopses
Given a synopsis S 3.5 3.5 -0.5 -1 0 -2 -1 0 0 -0.5 standard thresholding -0.707 0 0 -1 -0.5 0 WL2(S) 0.498 standard: 4 4 2 2 2 6 4 4 Importance: 0.001 0.001 0.001 0.001 0.249 0.249 0.249 0.249 Workload Example 2 2 2 6 3 5 4 4
0 3.5 -0.5 -1 0 0 -2 -1 WL2(S) 0.008 Importance: 0.001 0.001 0.001 0.001 0.249 0.249 0.249 0.249 Workload Example 3.5 Workload- based thresholding -0.5 -0.707 0 0 -1 -0.5 0 2 2 2 6 3 5 4 4 standard: 4 4 2 2 2 6 4 4 Workload based 5 4 2 2 4 4 3 4
Error definition • D = (d1,…,dN) - our data. • - the point query • - the approximated answer • abs-error: rel-error: • The purpose: reduce a norm of • For example:
Workload-based Error • A workload: (c1,…,cN), where ci is the probability that qi appears. • Given a workload W = (c1,…,cN) we define the Weighted L2 Norm: • When ci = 1/N: WL2(E) = MSE
Our goal • Minimizing the WL2 norm of the errors vector E • For given data set D and query workloads W • Equivalently: minimizing the expected squared error over a point query taken from a given distribution
Regular Haar transform Given a data set D = (d0,…,dN-1) D Haar Transform (HT) HT(D) standard thresholding wavelet synopsis
Parseval’s formula, the WL2 norm, the weighted inner product, and the algorithm for computing the WH basis from the workload Overview Given a data set D = (d0,…,dN-1) and a workload vector W = (c0,…,cN-1) W D WHB(W) Weighted Haar Basis (WHB( Weighted Haar Transform (WHT) WHT(D) standard thresholding WB – wavelet synopsis
x -y 0 1 The weighted Haar basis • The Weighted Haar Basis would also look like but
0 1 c0,c1,… , cN-1 Compute theWeighted Haar Basis • Meaning it would look more like: Recall the weight coefficients (the workload) W = (c0,…,cN-1) for D = (d0,…,dN-1)
Experimental results WB-MSE VS. STANDARD
Experimental results WB-MRE, ADAPTIVE, STANDARD
Experimental results WB-MRE, ADAPTIVE