
Constructing Optimal Wavelet Synopses


Presentation Transcript


  1. Constructing Optimal Wavelet Synopses Dimitris Sacharidis dsachar@dblab.ntua.gr Timos Sellis timos@dblab.ntua.gr

  2. outline • introduction • background • wavelet basics • example • wavelet synopses • example • error metrics • optimal synopses • interesting issues • data streams • models • streaming wavelet synopses • epilogue

  3. introduction • analyzing massive multi-dimensional datasets • complex aggregate queries over large parts of the data • exploratory nature • promptness over accuracy, but with guarantees • resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets) • numerous data management applications need to continuously generate, process and analyze data on-line • the data streaming paradigm • summarize in real time, using small space and in one pass • provide approximate query answers with quality guarantees • provide useful data summarization • need to measure inaccuracy, which is application dependent

  4. outline • introduction • background • wavelet basics • example • wavelet synopses • example • error metrics • optimal synopses • interesting issues • data streams • models • streaming wavelet synopses • epilogue

  5. wavelet basics • wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions • applications in signal/image processing • used extensively as a data reduction tool in db scenarios: • selectivity estimation for large aggregate queries • fast approximate query answers • general purpose streaming synopsis • features • efficient: runs in linear time and space (vs. ~$N^2$ for histograms) • high compression ratio, small-B property • generalizes to multiple dimensions

  6. example • assume a data vector d of 8 values • iteratively perform pair-wise averaging and semi-differencing • the intermediate averages are not needed • wavelet tree (a.k.a. error tree): every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
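The averaging and semi-differencing step can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function names and the 8-value sample vector are assumptions (the slide's own data vector did not survive in the transcript), and the coefficients are left unnormalized.

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients...] in error-tree order.

    Assumes len(data) is a power of two. Coefficients are unnormalized.
    """
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    averages = list(data)
    details = []  # coarser levels are prepended, so the root detail ends up first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # semi-differences
        averages = [(a + b) / 2 for a, b in pairs]       # pair-wise averages
        details = level_details + details
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: left child = avg + detail, right child = avg - detail."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        values = [v for avg, det in zip(values, level)
                  for v in (avg + det, avg - det)]
        pos += len(level)
    return values


# Hypothetical 8-value data vector (not the one from the slide's figure).
sample = [2, 2, 0, 2, 3, 5, 4, 4]
coefficients = haar_decompose(sample)
```

The reconstruction mirrors the error-tree rule stated on the slide: each detail coefficient is added for the left subtree and subtracted for the right one, so only the overall average and the details need to be stored.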

  7. outline • introduction • background • wavelet basics • example • wavelet synopses • example • error metrics • optimal synopses • interesting issues • data streams • models • streaming wavelet synopses • epilogue

  8. wavelet synopses • any set of B coefficients constitutes a B-term wavelet synopsis • stored as <index,value> pairs • implicitly all non-stored coefficients are set to zero • introduces a reconstruction error per point estimate: $e_i = |d_i - \hat{d}_i|$

  9. measuring accuracy • use some norm to aggregate the individual errors (e: vector of point errors, d: vector of data values) • L2 norm: $\sum_i e_i^2$ is the sum squared error (sse) • sse = 224 • L∞ norm: $\max_i e_i$ is the maximum absolute error • max-abs-error = 10 • generalized to any weighted Lp norm: $\sum_i w_i e_i^p$ • e.g. max-rel-error = $\max_i (1/d_i)\, e_i$ = 10/4 = 250%
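The metrics on this slide can be computed directly from the data vector and its approximation. The sketch below uses assumed helper names and the 4-value vector from the backup slide (d = {2, 10, 12, 8}, approximated by its overall average 8), since the data behind the figures sse = 224 and max-abs-error = 10 is not in the transcript. The relative error assumes all data values are non-zero.

```python
def point_errors(data, approx):
    """Absolute reconstruction error per data point."""
    return [abs(x - y) for x, y in zip(data, approx)]


def sse(errors):
    """Sum squared error (the L2 aggregate used on the slide)."""
    return sum(e * e for e in errors)


def max_abs_error(errors):
    """Maximum absolute error (L-infinity norm)."""
    return max(errors)


def weighted_lp(errors, weights, p):
    """Weighted Lp aggregate: sum of w_i * e_i^p."""
    return sum(w * e ** p for w, e in zip(weights, errors))


def max_rel_error(data, errors):
    """Maximum relative error: weighted L-infinity with w_i = 1/|d_i| (d_i != 0)."""
    return max(e / abs(x) for x, e in zip(data, errors))


d = [2, 10, 12, 8]
d_hat = [8, 8, 8, 8]            # synopsis that keeps only the overall average
errors = point_errors(d, d_hat)  # [6, 2, 4, 0]
```

For this vector the sse is 56, the maximum absolute error is 6 and the maximum relative error is 6/2 = 300%, which shows how much the choice of metric changes which point dominates.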

  10. optimal synopses • a B-term wavelet synopsis can be optimized for any error metric • sse optimal synopses are straightforward • wavelet transformation is orthonormal (after normalization) ⇒ by Parseval's theorem the L2 norm is preserved • choose the B coefficients with the highest absolute (normalized) value • other (weighted or unweighted) Lp norm optimal synopses require superlinear (quadratic) time in N • dynamic programming over the wavelet tree
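The sse-optimal selection can be sketched as follows. The assumptions here are mine: coefficients are unnormalized and stored in error-tree order (index 0 is the overall average, index j has children 2j and 2j+1), and the function names are hypothetical. Under that layout, dropping a coefficient c whose subtree covers s data points adds s·c² to the sse, so ranking by s·c² is equivalent to ranking by normalized absolute value.

```python
def support(j, n):
    """Number of data points under coefficient j in an error tree with n leaves."""
    if j == 0:
        return n  # the overall average covers every point
    return n >> (j.bit_length() - 1)  # n / 2^level, with level = floor(log2(j))


def sse_optimal_synopsis(coeffs, b):
    """Keep the b coefficients with the largest normalized energy s * c^2."""
    n = len(coeffs)
    ranked = sorted(range(n),
                    key=lambda j: support(j, n) * coeffs[j] ** 2,
                    reverse=True)
    return {j: coeffs[j] for j in sorted(ranked[:b])}


def synopsis_sse(coeffs, synopsis):
    """By Parseval, the sse equals the total energy of the dropped coefficients."""
    n = len(coeffs)
    return sum(support(j, n) * coeffs[j] ** 2
               for j in range(n) if j not in synopsis)


# Unnormalized coefficients of d = [2, 10, 12, 8] (the backup slide's vector).
coeffs = [8, -2, -4, 2]
best = sse_optimal_synopsis(coeffs, 2)
```

With B = 2 this keeps the overall average and the detail coefficient -4, for an sse of 24. Sorting costs O(N log N); a linear-time selection would also work, and other Lp metrics need the dynamic program mentioned on the slide instead.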

  11. interesting issues • I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi '05] • during transformation try to minimize I/Os • efficient maintenance as new data are appended (requires more than just some updating) • how about optimizing for workloads of range-sum queries? • no known results (without using the prefix-sum array) • ranges overlap arbitrarily ⇒ no easy dynamic programming formulation exists

  12. outline • introduction • background • wavelet basics • example • wavelet synopses • example • error metrics • optimal synopses • interesting issues • data streams • models • streaming wavelet synopses • epilogue

  13. working over data streams • main challenges when data are streaming: • stream items are only seen once • require small working space • process stream items quickly • provide an answer quickly with quality guarantees • two models depending on how a data vector a is rendered • turnstile model: stream elements are updates of type (i, ±u), which implies a[i] ← a[i] ± u, and do not appear ordered in i • time series model: stream elements are vector values of type (i, a[i]) and appear ordered in i (e.g., time)
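To make the turnstile model concrete, the sketch below applies an update (i, ±u) directly to a full array of unnormalized Haar coefficients instead of to the data vector. It is only an illustration under assumed conventions (error-tree order, index 0 holding the overall average, a hypothetical function name); it keeps all N coefficients and so does not address the small-space requirement.

```python
def apply_turnstile_update(coeffs, i, u):
    """Apply a[i] <- a[i] + u to the coefficients, in place (u may be negative).

    Only the overall average and the detail coefficients on the path from
    leaf i to the root change.
    """
    n = len(coeffs)
    coeffs[0] += u / n
    node = n + i  # position of leaf i in the implicit binary tree
    while node > 1:
        parent = node // 2
        sign = 1 if node % 2 == 0 else -1  # left child adds, right child subtracts
        support = n >> (parent.bit_length() - 1)
        coeffs[parent] += sign * u / support
        node = parent


# Build d = [2, 10, 12, 8] from updates that arrive in no particular order.
coeffs = [0.0, 0.0, 0.0, 0.0]
for i, u in [(2, 12), (0, 2), (3, 8), (1, 10)]:
    apply_turnstile_update(coeffs, i, u)
```

After the four updates the array holds the same coefficients as a direct decomposition of the vector. In the time series model the same routine applies, but items arrive in increasing i.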

  14. streaming wavelet synopses • time series model • each new item affects at most log N coefficients • a large number of coefficients have a finalized value • can perform bottom-up dynamic programming (space required is prohibitive) • greedy techniques should be deployed instead • turnstile model • even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis '06] • other error metrics have not been studied
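The time series observation can be checked with a small sketch: after the first t items have arrived in order, a detail coefficient is final once every data point under it has been seen. The indexing (error-tree order, detail indices 1..N-1) and the function names are assumptions made for this illustration.

```python
def leaf_range(j, n):
    """Half-open range of data positions under detail coefficient j (j >= 1)."""
    level = j.bit_length() - 1
    support = n >> level
    offset = j - (1 << level)
    return offset * support, (offset + 1) * support


def classify_details(n, t):
    """Split detail indices into finalized and still-open after t ordered items."""
    finalized, still_open = [], []
    for j in range(1, n):
        lo, hi = leaf_range(j, n)
        if hi <= t:
            finalized.append(j)    # all points under j have arrived
        elif lo < t:
            still_open.append(j)   # partially seen, may still change
        # otherwise no point under j has arrived yet
    return finalized, still_open


finalized, still_open = classify_details(8, 3)
```

For N = 8 and t = 3 only coefficient 4 is final, while coefficients 1, 2 and 5 are still open. The open set always lies on one root-to-leaf path, so it never holds more than log N detail coefficients.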

  15. outline • introduction • background • wavelet basics • example • wavelet synopses • example • error metrics • optimal synopses • interesting issues • data streams • models • streaming wavelet synopses • epilogue

  16. epilogue • wavelet synopses are a highly successful data summarization technique • yet, several problems remain open: • optimize for range query workloads • greedy (time-series) streaming algorithms • other metrics for general (turnstile) streaming data

  17. thank you! http://www.dblab.ntua.gr/

  18. unrestricted wavelet synopses • the retained coefficients can assume any value, not restricted to their decomposed value (even harder optimization problem!) • quick example: optimize for max-abs-error, d = {2, 10, 12, 8} and B = 1 • restricted synopsis: keep the overall average 8 ⇒ m.a.e. = 6 • unrestricted synopsis: keep the overall average but change its value to 7 ⇒ m.a.e. = 5
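The quick example can be verified numerically. In this sketch (function name assumed) a one-term synopsis that keeps only the overall-average coefficient reconstructs every point as that single value, so the maximum absolute error is simply the largest distance from it.

```python
def max_abs_error(data, estimate):
    """Max absolute error when every point is reconstructed as `estimate`."""
    return max(abs(x - estimate) for x in data)


d = [2, 10, 12, 8]

restricted_value = sum(d) / len(d)           # the decomposed overall average, 8
unrestricted_value = (min(d) + max(d)) / 2   # midrange, 7

restricted_mae = max_abs_error(d, restricted_value)      # 6
unrestricted_mae = max_abs_error(d, unrestricted_value)  # 5
```

The value 7 is the midrange of the data, which is the single value that minimizes the maximum absolute deviation. That is why the unrestricted synopsis does better here than the decomposed average.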
