Constructing Optimal Wavelet Synopses

Dimitris Sacharidis dsachar@dblab.ntua.gr
Timos Sellis timos@dblab.ntua.gr

outline
• introduction
• background
• wavelet basics
• example
• wavelet synopses
• example
• error metrics
• optimal synopses
• interesting issues
• data streams
• models
• streaming wavelet synopses
• epilogue

introduction
• analyzing massive multi-dimensional datasets
• complex aggregate queries over large parts of the data
• exploratory nature
• promptness over accuracy, but with guarantees
• resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
• numerous data management applications need to continuously generate, process and analyze data on-line
• the data streaming paradigm
• summarize in real time, using small space and in one pass
• provide approximate query answers with quality guarantees
• provide useful data summarization
• need to measure inaccuracy, which is application dependent

wavelet basics
• wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
• applications in signal/image processing
• used extensively as a data reduction tool in database scenarios:
  • selectivity estimation for large aggregate queries
  • fast approximate query answers
  • general purpose streaming synopsis
• features
  • efficient: runs in linear time and space (vs. ~N^2 for histograms)
  • high compression ratio, small-B property
  • generalizes to multiple dimensions

example
assume a data vector d of 8 values
iteratively perform pair-wise averaging and semi-differencing
every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
intermediate averages are not needed: only the overall average and the semi-differences are kept
the result is the wavelet tree (a.k.a. error tree)
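The averaging and semi-differencing steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `haar_decompose` and the 8-value vector are assumptions chosen for the example (the vector on the original slide is not recoverable from the text).

```python
def haar_decompose(data):
    """Return the unnormalized Haar coefficients in error-tree order:
    [overall average, coarsest detail, ..., finest details]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(x) for x in data]
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # semi-difference: positive sign for the left value, negative for the right
        level_details = [(left - right) / 2 for left, right in pairs]
        averages = [(left + right) / 2 for left, right in pairs]
        # coarser levels go in front of the finer ones collected so far
        details = level_details + details
    return averages + details


d = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical data, not the vector from the slide
coeffs = haar_decompose(d)
print(coeffs)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Index 0 holds the overall average; index i >= 1 holds the detail coefficient of error-tree node i, whose children are nodes 2i and 2i+1.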

wavelet synopses
• any set of B coefficients constitutes a B-term wavelet synopsis
• stored as <index,value> pairs
• implicitly, all non-stored coefficients are set to zero
• introduces a reconstruction error per point estimate: e_i = |d_i - d̂_i|, where d̂_i is the value reconstructed from the synopsis
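A sketch of how point estimates and their errors follow from a synopsis, assuming unnormalized Haar coefficients in error-tree order (index 0 is the overall average, node i has children 2i and 2i+1). The function names, the data vector, and the choice of retained coefficients are all hypothetical.

```python
def haar_reconstruct(coeffs):
    """Invert the unnormalized Haar transform (coefficients in error-tree order)."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        # each average splits into a left child (+detail) and a right child (-detail)
        values = [v for avg, det in zip(values, details)
                  for v in (avg + det, avg - det)]
        pos += len(details)
    return values


def reconstruct_from_synopsis(synopsis, n):
    """synopsis maps index -> value; every non-stored coefficient is taken as zero."""
    coeffs = [0.0] * n
    for index, value in synopsis.items():
        coeffs[index] = value
    return haar_reconstruct(coeffs)


d = [2, 2, 0, 2, 3, 5, 4, 4]             # hypothetical data vector
synopsis = {0: 2.75, 1: -1.25, 5: -1.0}  # a B = 3 synopsis of d (arbitrary choice)
d_hat = reconstruct_from_synopsis(synopsis, len(d))
errors = [abs(x - y) for x, y in zip(d, d_hat)]
print(d_hat)   # [1.5, 1.5, 0.5, 2.5, 4.0, 4.0, 4.0, 4.0]
print(errors)  # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 0.0, 0.0]
```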

measuring accuracy
use some norm to aggregate the individual errors
• L2 norm: Σ_i e_i^2 is the sum squared error (sse)
  • sse = 224
• L∞ norm: max_i e_i is the maximum absolute error
  • max-abs-error = 10
• generalized to any weighted Lp norm: Σ_i w_i e_i^p
  • e.g. max-rel-error = max_i (1/d_i) e_i = 10/4 = 250%
(e: vector of point errors; d: vector of data values)
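The metrics above can be computed directly from the error vector e and the data vector d. The sketch below uses hypothetical vectors (the slide's own figures, sse = 224 and max-abs-error = 10, come from a vector that is not recoverable here) and hypothetical helper names; it also assumes no data value is zero, since the relative error divides by d_i.

```python
def weighted_lp_error(errors, weights, p):
    """Sum of w_i * e_i**p; with unit weights and p = 2 this is the sse."""
    return sum(w * e ** p for w, e in zip(weights, errors))


def max_weighted_error(errors, weights):
    """Max of w_i * e_i; unit weights give max-abs-error, w_i = 1/|d_i| gives max-rel-error."""
    return max(w * e for w, e in zip(weights, errors))


d = [4, 8, 10, 16]  # hypothetical data values (all nonzero)
e = [1, 2, 0, 4]    # hypothetical point errors
unit = [1] * len(e)

sse = weighted_lp_error(e, unit, 2)
max_abs = max_weighted_error(e, unit)
max_rel = max_weighted_error(e, [1 / abs(x) for x in d])
print(sse, max_abs, max_rel)  # 21 4 0.25
```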

optimal synopses
a B-term wavelet synopsis can be optimized for any error metric
• sse-optimal synopses are straightforward
  • the wavelet transformation is orthonormal (after normalization) ⇒ by Parseval's theorem the L2 norm is preserved
  • choose the B coefficients with the highest absolute (normalized) value
• other (weighted or unweighted) Lp-norm optimal synopses require superlinear (quadratic) time in N
  • dynamic programming over the wavelet tree
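A sketch of the sse-optimal selection, assuming unnormalized Haar coefficients (pair-wise average and semi-difference) stored in error-tree order. Under that assumption, multiplying a coefficient by the square root of the number of leaves below it gives its orthonormal value, so the sse of a synopsis equals the sum of squared normalized values of the dropped coefficients. Function names and the coefficient vector are hypothetical.

```python
import math


def support_size(index, n):
    """Number of data values (leaves) influenced by coefficient `index`."""
    if index == 0:
        return n                    # the overall average covers every value
    level = index.bit_length() - 1  # floor(log2(index)): depth in the error tree
    return n >> level


def sse_optimal_synopsis(coeffs, budget):
    """Keep the `budget` coefficients with the largest normalized magnitude."""
    n = len(coeffs)
    ranked = sorted(
        range(n),
        key=lambda i: abs(coeffs[i]) * math.sqrt(support_size(i, n)),
        reverse=True,
    )
    return {i: coeffs[i] for i in sorted(ranked[:budget])}


def dropped_energy(coeffs, synopsis):
    """By Parseval, the sse equals the squared normalized magnitude that was dropped."""
    n = len(coeffs)
    return sum(c * c * support_size(i, n)
               for i, c in enumerate(coeffs) if i not in synopsis)


# coefficients of the hypothetical vector [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
synopsis = sse_optimal_synopsis(coeffs, budget=4)
print(synopsis)                          # {0: 2.75, 1: -1.25, 5: -1.0, 6: -1.0}
print(dropped_energy(coeffs, synopsis))  # 1.0
```

Sorting costs O(N log N); a selection algorithm would bring this down to linear time, which is omitted here for brevity.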

interesting issues
• I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi '05]
  • during transformation try to minimize I/Os
  • efficient maintenance as new data are appended (requires more than just some updating)
• how about optimizing for workloads of range-sum queries?
  • no known results (without using the prefix-sum array)
  • ranges overlap arbitrarily ⇒ no easy dynamic programming formulation exists

working over data streams
• main challenges when data are streaming:
  • stream items are only seen once
  • require small working space
  • process stream items quickly
  • provide an answer quickly with quality guarantees
two models, depending on how a data vector a is rendered
• turnstile model: stream elements are updates of type (i, ±u), which imply a[i] ← a[i] ± u and, further, do not appear ordered in i
• time series model: stream elements are vector values of type (i, a[i]) and appear ordered in i (e.g., time)
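To make the turnstile model concrete, the sketch below applies an update (i, u) directly to a full array of unnormalized Haar coefficients in error-tree order. It only touches the overall average and the detail coefficients on the path from leaf i to the root. The function name is hypothetical, and keeping all N coefficients in memory is an assumption made for illustration; it is exactly what a small-space streaming algorithm cannot afford.

```python
def apply_turnstile_update(coeffs, i, u):
    """Apply a[i] <- a[i] + u to the Haar coefficients of a (u may be negative).
    Returns the number of coefficients touched."""
    n = len(coeffs)
    coeffs[0] += u / n   # the overall average shifts by u / N
    touched = 1
    node = n + i         # position of leaf i in a heap-ordered error tree
    support = 1
    while node > 1:
        parent = node // 2
        support *= 2     # number of leaves below `parent`
        sign = 1 if node % 2 == 0 else -1  # left child: +, right child: -
        coeffs[parent] += sign * u / support
        touched += 1
        node = parent
    return touched


# build the coefficients of a hypothetical vector from updates in arbitrary order
a = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = [0.0] * len(a)
for i in [5, 0, 7, 2, 1, 6, 3, 4]:
    touched = apply_turnstile_update(coeffs, i, a[i])
print(coeffs)   # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(touched)  # 4, i.e. log2(8) detail coefficients plus the overall average
```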

streaming wavelet synopses
• time series model
  • each new item affects only about logN coefficients
  • a large number of coefficients already have their final value
  • can perform bottom-up dynamic programming (but the space required is prohibitive)
  • greedy techniques should be deployed instead
• turnstile model
  • even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis '06]
  • other error metrics have not been studied
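In the time series model, the finalization of coefficients can be sketched as a one-pass transform that keeps only one pending average per level and emits each detail coefficient as soon as its right subtree is complete. This illustrates the transform only, not a synopsis construction algorithm from the talk. The generator name is hypothetical, and the sketch assumes that N is a power of two known in advance.

```python
def stream_haar(values, n):
    """Yield (index, value) pairs of unnormalized Haar coefficients as they
    become final, holding at most one pending average per level."""
    pending = {}  # level -> average of a completed left block awaiting its sibling
    for position, value in enumerate(values):
        avg, level, block = float(value), 0, position
        # a pending entry at this level means the current block is a right sibling
        while level in pending:
            left = pending.pop(level)
            detail = (left - avg) / 2
            avg = (left + avg) / 2
            level += 1
            block //= 2
            yield (n >> level) + block, detail  # error-tree index of the parent
        pending[level] = avg
    # after all n values, only the overall average is left
    yield 0, pending[n.bit_length() - 1]


a = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical time series
emitted = list(stream_haar(a, len(a)))
print(emitted[0])     # (4, 0.0): final right after the second value arrives
print(dict(emitted))
```

A synopsis builder would consume these pairs and decide greedily which ones to retain within the space budget.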

epilogue
wavelet synopses are a highly successful data summarization technique
yet, several problems remain open:
• optimize for range query workloads
• greedy (time-series) streaming algorithms
• other metrics for general (turnstile) streaming data

thank you! http://www.dblab.ntua.gr/

unrestricted wavelet synopses
• the retained coefficients can assume any value, not restricted to their decomposed value (an even harder optimization problem!)
• quick example: optimize for max-abs-error, d = {2, 10, 12, 8} and B = 1
  • restricted synopsis: keep the overall average 8 ⇒ m.a.e. = 6
  • unrestricted synopsis: keep the overall average but change its value to 7 ⇒ m.a.e. = 5
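The numbers in the quick example can be checked with a few lines of Python. With B = 1 and only the overall-average coefficient retained, every point is reconstructed as the same constant, so the maximum absolute error is the largest distance from that constant. The helper name is hypothetical; the midrange (min + max) / 2 is used here because it minimizes the maximum absolute error of a single constant.

```python
def max_abs_error(data, constant):
    """Max-abs-error when every value is reconstructed as `constant`."""
    return max(abs(x - constant) for x in data)


d = [2, 10, 12, 8]
restricted = sum(d) / len(d)          # decomposed value of the overall average: 8.0
unrestricted = (min(d) + max(d)) / 2  # midrange: 7.0

print(max_abs_error(d, restricted))    # 6.0
print(max_abs_error(d, unrestricted))  # 5.0
```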
