Constructing Optimal Wavelet Synopses Dimitris Sacharidis dsachar@dblab.ntua.gr Timos Sellis timos@dblab.ntua.gr
outline • introduction • background • wavelet basics • example • wavelet synopses • example • error metrics • optimal synopses • interesting issues • data streams • models • streaming wavelet synopses • epilogue
introduction
• analyzing massive multi-dimensional datasets
• complex aggregate queries over large parts of the data
• exploratory nature
• promptness over accuracy, but with guarantees
• resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
• numerous data management applications need to continuously generate, process and analyze data on-line
• the data streaming paradigm
• summarize in real time, using small space and in one pass
• provide approximate query answers with quality guarantees
• provide useful data summarization
• need to measure inaccuracy; the right measure is application dependent
wavelet basics
• wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
• applications in signal/image processing
• used extensively as a data reduction tool in db scenarios:
• selectivity estimation for large aggregate queries
• fast approximate query answers
• general purpose streaming synopsis
• features
• efficient: runs in linear time and space (vs. ~N^2 for histograms)
• high compression ratio, small-B property
• generalizes to multiple dimensions
example
• assume a data vector d of 8 values
• iteratively perform pair-wise averaging and semi-differencing
• every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
• the intermediate averages are not needed: the overall average and the detail coefficients suffice
• [figure: wavelet tree (a.k.a. error tree)]
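The averaging and semi-differencing procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `haar_decompose` and `haar_reconstruct` and the 8-value input vector are assumptions (the vector shown on the original slide is not available), and coefficients are returned unnormalized in error-tree order (overall average first, then details from coarsest to finest).

```python
def haar_decompose(data):
    """Unnormalized Haar decomposition by pair-wise averaging and semi-differencing.

    Returns [overall average, detail coefficients...] in error-tree order.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(x) for x in data]
    details = []  # details of coarser levels are placed in front
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(left - right) / 2 for left, right in pairs]
        averages = [(left + right) / 2 for left, right in pairs]
        details = level_details + details
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: each detail adds to its left half, subtracts from its right half."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        expanded = []
        for avg, det in zip(values, level):
            expanded.extend([avg + det, avg - det])
        pos += len(values)
        values = expanded
    return values


# Hypothetical data vector of 8 values (not the one from the slide's figure).
d = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(d)
```

Reconstructing from the full coefficient vector returns the original data, which is why only the overall average and the detail coefficients need to be kept.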
wavelet synopses
• any set of B coefficients constitutes a B-term wavelet synopsis
• stored as <index, value> pairs
• implicitly, all non-stored coefficients are set to zero
• introduces a reconstruction error per point estimate: e_i = |d_i - d̂_i|, where d̂_i is the value reconstructed from the synopsis
measuring accuracy
• use some norm to aggregate the individual errors
• L2 norm: Σ e_i^2 is the sum squared error (sse)
• sse = 224
• L∞ norm: max_i e_i is the maximum absolute error
• max-abs-error = 10
• generalized to any weighted Lp norm: Σ w_i e_i^p
• e.g. max-rel-error = max_i (1/d_i) e_i = 10/4 = 250%
• [figure: vector of data values d, vector of point errors e]
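The error metrics above can be written down directly. The sketch below is illustrative only: the function names and the small data/estimate vectors are assumptions (they are not the vectors behind the slide's sse = 224 and max-abs-error = 10 figures), and the relative error assumes no data value is zero.

```python
def point_errors(data, estimate):
    """Per-point absolute reconstruction errors e_i = |d_i - estimate_i|."""
    return [abs(d - e) for d, e in zip(data, estimate)]


def sse(errors):
    """Sum squared error (the L2 metric on the slide)."""
    return sum(e * e for e in errors)


def max_abs_error(errors):
    """Maximum absolute error (the L-infinity metric)."""
    return max(errors)


def weighted_lp(errors, weights, p):
    """Weighted Lp aggregate: sum of w_i * e_i^p."""
    return sum(w * e ** p for w, e in zip(weights, errors))


def max_rel_error(data, errors):
    """Maximum relative error; assumes every data value is nonzero."""
    return max(e / abs(d) for d, e in zip(data, errors))


# Hypothetical vectors: data values and a constant approximation of them.
data = [2, 10, 12, 8]
estimate = [8, 8, 8, 8]
errors = point_errors(data, estimate)  # [6, 2, 4, 0]
```

With unit weights and p = 2, `weighted_lp` reduces to `sse`, which shows how the weighted Lp form generalizes the first metric.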
optimal synopses
• a B-term wavelet synopsis can be optimized for any error metric
• sse optimal synopses are straightforward
• the wavelet transform is orthonormal (after normalization), so by Parseval's theorem the L2 norm is preserved
• keep the B coefficients with the largest absolute (normalized) values
• synopses that are optimal for other Lp norms (weighted or not) require superlinear (quadratic) time in N
• dynamic programming over the wavelet tree
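A minimal sketch of the sse-optimal selection, under stated assumptions: coefficients are unnormalized (produced by averaging and semi-differencing) and listed in error-tree order, so a coefficient whose support covers s data points is normalized by multiplying it by sqrt(s). The function names and the example coefficient vector are hypothetical.

```python
import math


def support_size(index, n):
    """Number of data points a coefficient influences (error-tree order, n a power of two)."""
    if index == 0:
        return n  # the overall average covers everything
    level = index.bit_length() - 1  # floor(log2(index))
    return n >> level


def sse_optimal_synopsis(coeffs, budget):
    """Keep the `budget` coefficients with the largest normalized absolute value."""
    n = len(coeffs)
    ranked = sorted(
        range(n),
        key=lambda j: abs(coeffs[j]) * math.sqrt(support_size(j, n)),
        reverse=True,
    )
    return {j: coeffs[j] for j in ranked[:budget]}


def synopsis_sse(coeffs, synopsis):
    """SSE of a synopsis: each dropped coefficient c with support s costs s * c^2."""
    n = len(coeffs)
    return sum(
        support_size(j, n) * c * c
        for j, c in enumerate(coeffs)
        if j not in synopsis
    )


# Hypothetical unnormalized coefficients of the vector [2, 2, 0, 2, 3, 5, 4, 4].
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
synopsis = sse_optimal_synopsis(coeffs, 2)
```

Because the normalized basis is orthonormal, the sse is just the sum of the squared normalized values of the dropped coefficients, so a single sort suffices and no dynamic programming is needed for this metric.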
interesting issues
• I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi ‘05]
• during transformation, try to minimize I/Os
• efficient maintenance as new data are appended (requires more than just some updating)
• how about optimizing for workloads of range-sum queries?
• no known results (without using the prefix-sum array)
• ranges overlap arbitrarily, so no easy dynamic programming formulation exists
working over data streams
• main challenges when data are streaming:
• stream items are only seen once
• require small working space
• process stream items quickly
• provide an answer quickly, with quality guarantees
• two models, depending on how a data vector a is rendered:
• turnstile model: stream elements are updates of type (i, ±u), which imply a[i] ← a[i] ± u, and do not appear ordered in i
• time series model: stream elements are vector values of type (i, a[i]) and appear ordered in i (e.g., time)
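The difference between the two models can be illustrated with a small sketch. The function names are hypothetical, and the code materializes the full vector a only to make the semantics visible; an actual streaming algorithm would keep a small synopsis instead.

```python
def apply_turnstile(n, updates):
    """Turnstile model: items are (i, delta) updates in arbitrary order of i."""
    a = [0] * n
    for i, delta in updates:
        a[i] += delta  # a[i] <- a[i] +/- u
    return a


def apply_time_series(items):
    """Time series model: items are (i, a[i]) values arriving in increasing i."""
    a = []
    for i, value in items:
        if i != len(a):
            raise ValueError("time series items must arrive in index order")
        a.append(value)
    return a


# The same position may be updated several times, in any order.
turnstile_vector = apply_turnstile(4, [(2, 5), (0, 3), (2, -1)])
# Each position is seen exactly once, in order.
series_vector = apply_time_series([(0, 7), (1, 2), (2, 9)])
```

In the turnstile case no entry of a is ever known to be final, whereas in the time series case every entry is final as soon as it arrives.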
streaming wavelet synopses
• time series model
• each new item affects at most log N coefficients
• a large number of coefficients have final values
• bottom-up dynamic programming is possible, but the space required is prohibitive
• greedy techniques should be deployed instead
• turnstile model
• even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis ‘06]
• other error metrics have not been studied
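For the time series model, the coefficients touched by a newly arrived value are exactly those on the path from its leaf to the root of the error tree. The sketch below assumes coefficients are indexed in error-tree order (index 0 is the overall average, index 1 the root detail, and node j has children 2j and 2j+1); the function name is hypothetical. It returns the log N detail coefficients on the path plus the overall average.

```python
def affected_coefficients(i, n):
    """Indices of coefficients that depend on data position i (n a power of two, n >= 2)."""
    path = []
    j = (n + i) // 2  # finest-level detail coefficient covering position i
    while j >= 1:
        path.append(j)
        j //= 2       # move to the parent in the error tree
    path.append(0)    # the overall average depends on every position
    return path


# For N = 8, position 5 touches three detail coefficients and the average.
touched = affected_coefficients(5, 8)
```

Every coefficient that is not on the path of the current or any future position can no longer change, which is why many coefficients are already final while the stream is still arriving.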
epilogue
• wavelet synopses are a highly successful data summarization technique
• yet, several problems remain open:
• optimize for range query workloads
• greedy (time-series) streaming algorithms
• other metrics for general (turnstile) streaming data
thank you! http://www.dblab.ntua.gr/
unrestricted wavelet synopses
• the retained coefficients can assume any value, not restricted to their decomposed value (an even harder optimization problem!)
• quick example: optimize for max-abs-error, with d = {2, 10, 12, 8} and B = 1
• restricted synopsis: keep the overall average, 8, giving m.a.e. = 6
• unrestricted synopsis: keep the overall average but change its value to 7, giving m.a.e. = 5
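The quick example can be checked numerically. In this sketch the function name is hypothetical; it relies on the fact that a synopsis holding only the overall-average coefficient reconstructs every point as that single value, and that the constant minimizing the maximum absolute error is the midpoint of the smallest and largest data values.

```python
def max_abs_error_constant(data, value):
    """Max absolute error when every point is reconstructed as `value`."""
    return max(abs(d - value) for d in data)


d = [2, 10, 12, 8]

# Restricted: the retained coefficient keeps its decomposed value (the mean).
restricted_value = sum(d) / len(d)
restricted_mae = max_abs_error_constant(d, restricted_value)

# Unrestricted: the retained coefficient may take any value; the midrange is best.
unrestricted_value = (min(d) + max(d)) / 2
unrestricted_mae = max_abs_error_constant(d, unrestricted_value)
```

The values computed here match the slide: the mean 8 yields a maximum absolute error of 6, while the adjusted value 7 yields 5.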