
One-Pass Wavelet Synopses for Maximum-Error Metrics

One-Pass Wavelet Synopses for Maximum-Error Metrics. Panagiotis Karras. Trondheim, August 31st, 2005. Research at HKU with Nikos Mamoulis. Outline: Preliminaries & Motivation; Usefulness of Synopses; Haar wavelet decomposition, conventional wavelet synopses; The maximum error guarantee problem.


Presentation Transcript


  1. One-Pass Wavelet Synopses for Maximum-Error Metrics • Panagiotis Karras • Trondheim, August 31st, 2005 • Research at HKU with Nikos Mamoulis

  2. Outline • Preliminaries & Motivation • Usefulness of Synopses • Haar wavelet decomposition, conventional wavelet synopses • The maximum error guarantee problem • Earlier Approach: Wavelet Synopses with Optimal Error Guarantees • Impracticability of this approach • Solution: Practicable Wavelet Synopses for Maximum Error Metrics • Low-Complexity Algorithms that provide near-optimal error results • Extension to Data Streams • One-Pass adaptations of the proposed algorithms • Conclusions & Future Directions

  3. Compact Data Synopses useful in: • Approximate Query Processing (exact answers not always required) • Learning, Classification, Event Detection • Data Mining, Selectivity Estimation • Situations where massive data arrives in a stream

  4. Haar Wavelet Decomposition • Wavelet decomposition: orthogonal transform for the hierarchical representation of functions and signals • Haar wavelets: simplest wavelet system, easy to understand and implement • Extensible to many dimensions • Error tree: structure for the visualization of decomposition and value reconstructions • Reconstructions require logarithmically many terms, along appropriate error tree paths • [Figure: error tree for the data vector 34 16 2 20 20 0 36 16, with pairwise averages 25 11 10 26 and 18 18, overall average 18, and detail coefficients 0, 7, -8, 9, -9, 10, 10]
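As an illustration of the decomposition and of reconstruction along an error-tree path, here is a minimal Python sketch. It assumes the unnormalized Haar convention visible in the figure (average = (a + b) / 2, detail = (a - b) / 2) and a power-of-two input length; the function names `haar_decompose` and `reconstruct_value` are hypothetical and not taken from the slides.

```python
def haar_decompose(data):
    """Return Haar coefficients in error-tree order.

    Index 0 holds the overall average; index k >= 1 holds the detail
    coefficient of error-tree node k (children of k are 2k and 2k + 1).
    Assumes len(data) is a power of two.
    """
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    coeffs = [0.0] * n
    averages = list(data)
    while len(averages) > 1:
        half = len(averages) // 2
        pairs = [(averages[2 * i], averages[2 * i + 1]) for i in range(half)]
        # Details of the current resolution occupy positions half .. 2*half - 1.
        for i, (a, b) in enumerate(pairs):
            coeffs[half + i] = (a - b) / 2
        averages = [(a + b) / 2 for a, b in pairs]
    coeffs[0] = averages[0]
    return coeffs


def reconstruct_value(coeffs, j):
    """Rebuild data value j from the log2(n) + 1 coefficients on its path."""
    n = len(coeffs)
    value = coeffs[0]
    node, lo, hi = 1, 0, n          # node covers the leaf range [lo, hi)
    while node < n:
        mid = (lo + hi) // 2
        if j < mid:                 # left subtree: add the detail
            value += coeffs[node]
            node, hi = 2 * node, mid
        else:                       # right subtree: subtract the detail
            value -= coeffs[node]
            node, lo = 2 * node + 1, mid
    return value
```

For the data vector in the figure, `haar_decompose([34, 16, 2, 20, 20, 0, 36, 16])` yields the coefficients 18, 0, 7, -8, 9, -9, 10, 10, and each value is recovered from four coefficients.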

  5. Wavelet Synopses • Compute Haar wavelet decomposition of D • Coefficient thresholding: retain B coefficients, B << |D| • Approximate query engine can operate over such compact synopses • [MVW, SIGMOD’98]; [VW, SIGMOD’99]; [CGRS, VLDB’00] • Conventional approach: retain the B largest coefficients in absolute normalized value • Normalized Haar basis: divide coefficients at resolution j by √(2^j) • Minimizes the Total Squared (L2) Error • However…
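The conventional thresholding rule can be sketched as follows. This is an illustrative sketch, not code from the talk: it assumes coefficients are stored in error-tree order (index 0 is the overall average, index k sits at resolution floor(log2 k)), that the normalization divides by the square root of 2 to the resolution level, and that ties are broken by index. The name `conventional_threshold` is hypothetical.

```python
import math


def conventional_threshold(coeffs, budget):
    """Keep the `budget` coefficients largest in absolute normalized value.

    Returns a list of the same length with discarded coefficients set to 0.
    """
    def resolution(k):
        # Index 0 (overall average) and index 1 share the coarsest level 0.
        return 0 if k == 0 else k.bit_length() - 1

    def normalized(k):
        return abs(coeffs[k]) / math.sqrt(2 ** resolution(k))

    # Sort by decreasing normalized magnitude; ties broken by smaller index.
    ranked = sorted(range(len(coeffs)), key=lambda k: (-normalized(k), k))
    keep = set(ranked[:budget])
    return [c if k in keep else 0 for k, c in enumerate(coeffs)]
```

With the coefficients 18, 0, 7, -8, 9, -9, 10, 10 and B = 4, this keeps 18, -8, 10 and 10.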

  6. The Problem with Conventional Synopses • Example data vector and synopsis (|D|=8, B=4) • Original Data: 34 16 2 20 20 0 36 16 • Reconstruction: 18 18 18 18 20 0 36 16 • Large variation in answer quality • Root cause • Aggregate error measure may be optimal, but error is distributed unevenly among individual values • [Figure: error tree with coefficients 18, 0, 7, -8, 9, -9, 10, 10 over the original data]
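To make the uneven error distribution concrete, the sketch below rebuilds the data from a four-coefficient synopsis and lists the per-value errors. It assumes the same unnormalized Haar convention as the example (left child = parent + detail, right child = parent - detail) and a synopsis that retains 18, -8, 10 and 10; `haar_reconstruct` is a hypothetical helper name.

```python
def haar_reconstruct(coeffs):
    """Invert the Haar decomposition for coefficients in error-tree order."""
    values = [coeffs[0]]
    width = 1
    while width < len(coeffs):
        details = coeffs[width:2 * width]
        # Each current average splits into (average + detail, average - detail).
        values = [v for a, d in zip(values, details) for v in (a + d, a - d)]
        width *= 2
    return values


original = [34, 16, 2, 20, 20, 0, 36, 16]
synopsis = [18, 0, 0, -8, 0, 0, 10, 10]      # discarded coefficients set to 0
approx = haar_reconstruct(synopsis)
abs_errors = [abs(o - a) for o, a in zip(original, approx)]
```

Here the right half of the vector is reproduced exactly while the left half carries errors of 16 and 2: the total squared error is as small as four coefficients allow, yet individual values can be far off.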

  7. Solution: Thresholding for Maximum-Error Metrics • Error Metrics providing tight error guarantees for all reconstructed values: • Maximum Absolute Error • Maximum Relative Error with Sanity Bound (to avoid domination by small data values) • Aim at minimization of these metrics
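The two metrics can be written down directly. This is a small sketch under the usual definitions: the relative error of a value d with approximation d' is |d - d'| / max(|d|, S), where S > 0 is the sanity bound. The function names are hypothetical.

```python
def max_abs_error(original, approx):
    """Largest absolute deviation over all reconstructed values."""
    return max(abs(d - a) for d, a in zip(original, approx))


def max_rel_error(original, approx, sanity_bound):
    """Largest relative deviation; small |d| are clamped to the sanity bound."""
    return max(abs(d - a) / max(abs(d), sanity_bound)
               for d, a in zip(original, approx))
```

For the original data 34 16 2 20 20 0 36 16 and the reconstruction 18 18 18 18 20 0 36 16, the maximum absolute error is 16. The maximum relative error is 8 with S = 1 (dominated by the small value 2) and drops to 1.6 with S = 10, which shows why the sanity bound matters.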

  8. Former Approach: Optimal Thresholding for Maximum-Error Metrics [GK, PODS’04] • Based on a Dynamic-Programming Formulation • Relies on a recursive function that computes the minimum maximum error for a coefficient’s sub-tree given an allocated storage space • Optimally distributes the allocated space b between a node’s two child sub-trees and decides whether to retain the coefficient on this node • Approximation schemes for multiple dimensions, also applicable in one dimension
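The recursion described above can be sketched for small inputs as follows. This is only an illustration in the spirit of the dynamic program, not the algorithm of [GK, PODS’04] itself: it memoizes on the value contributed by retained ancestors rather than on an explicit ancestor subset, handles absolute error only, and its names (`optimal_max_abs_error`, `solve`) are hypothetical. It is exponential in the worst case and meant for toy vectors.

```python
from functools import lru_cache


def haar_decompose(data):
    """Unnormalized Haar coefficients in error-tree order (power-of-two length)."""
    n = len(data)
    coeffs, averages = [0.0] * n, list(data)
    while len(averages) > 1:
        half = len(averages) // 2
        for i in range(half):
            coeffs[half + i] = (averages[2 * i] - averages[2 * i + 1]) / 2
        averages = [(averages[2 * i] + averages[2 * i + 1]) / 2 for i in range(half)]
    coeffs[0] = averages[0]
    return coeffs


def optimal_max_abs_error(data, budget):
    """Minimum achievable maximum absolute error with `budget` coefficients."""
    n = len(data)
    coeffs = haar_decompose(data)

    @lru_cache(maxsize=None)
    def solve(node, b, incoming):
        # `incoming` is the value reconstructed so far from retained ancestors.
        if node >= n:                                   # leaf: data value node - n
            return abs(data[node - n] - incoming)
        best = float("inf")
        for keep in (0, 1):                             # discard or retain this coefficient
            if keep > b:
                continue
            rest = b - keep
            left_in = incoming + keep * coeffs[node]
            right_in = incoming - keep * coeffs[node]
            for b_left in range(rest + 1):              # split the space between children
                err = max(solve(2 * node, b_left, left_in),
                          solve(2 * node + 1, rest - b_left, right_in))
                best = min(best, err)
        return best

    options = [solve(1, budget, 0.0)]                   # overall average discarded
    if budget >= 1:
        options.append(solve(1, budget - 1, coeffs[0])) # overall average retained
    return min(options)
```

On the vector 34 16 2 20 20 0 36 16 with B = 4, this search finds a maximum absolute error of 9, compared with 16 for the conventional synopsis.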

  9. However: • Complexity: • time […] (reducible to […]) • space […] (reducible to […]) • 1-D Approximation Schemes • Impractical for the purpose it is meant for • All inapplicable in Streaming Environments • Challenge: • Design efficient, low-complexity thresholding schemes that achieve results competitive with the optimal solution and are extensible to streaming data

  10. Solution: Greedy Thresholding for Maximum-Error Metrics • Key Idea: Greedy solution that makes the best choice of next coefficient to discard at each step • Each error-tree node stores the Maximum Potential Error that would be incurred if the coefficient on it were discarded: • For Absolute Error: […] • For Relative Error: […] • Global Heap structure returns the node of Least Maximum Potential Error • For Absolute Error: • Max and Min values of Accumulated Error below are maintained on nodes • For Relative Error: • Accumulated Error on data level stored on leaf nodes • Heaps returning the leaf of Maximum Potential Error augmented on nodes

  11. Solution: Greedy Thresholding for Maximum-Error Metrics • After each discarding operation: • Changes in Accumulated Error values propagated up and down the tree • On each affected node: • For Absolute Error: • Max, Min Accumulated Error updated • New Maximum Potential Absolute Error calculated as: […] • For Relative Error: • Descendants’ Heap updated • New Maximum Potential Relative Error returned from Heap • Update node’s position in Global Heap

  12. An Example (absolute error) • Data vector: 11 -1 -6 8 -2 6 6 10 • First drop coefficient -1 • Error accumulates on leaf nodes: 1 1 1 1 -1 -1 -1 -1 • Next drop coefficient 2 of maximum potential error 3 • And so on… • [Figure: error tree over the data vector, with overall average 4 and top detail coefficients -1, 2, -3]
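The greedy procedure for absolute error can be sketched in a deliberately naive form. The assumptions are mine: the maximum potential error of coefficient k is taken as the largest |err_j - s_jk * c_k| over the leaves j below it, where err_j is the signed accumulated error (approximation minus actual) and s_jk is +1 in the left subtree and -1 in the right; ties go to the smaller index. The sketch rescans all leaves instead of using the heaps and min/max bookkeeping described in the slides, so it does not reach their complexity. All names are hypothetical.

```python
def leaf_signs(k, n):
    """(leaf index, sign) pairs for coefficient k in an error tree over n leaves."""
    if k == 0:                                  # overall average touches every leaf
        return [(j, 1) for j in range(n)]
    level = k.bit_length() - 1
    width = n >> level                          # number of leaves below node k
    start = (k - (1 << level)) * width
    half = width // 2
    return [(j, 1 if j < start + half else -1)
            for j in range(start, start + width)]


def greedy_abs(coeffs, budget):
    """Discard coefficients one by one, always the one of least potential error."""
    n = len(coeffs)
    err = [0.0] * n                             # signed accumulated error per leaf
    retained = set(range(n))
    dropped = []

    def potential(k):
        return max(abs(err[j] - s * coeffs[k]) for j, s in leaf_signs(k, n))

    while len(retained) > budget:
        k = min(retained, key=lambda i: (potential(i), i))
        for j, s in leaf_signs(k, n):           # dropping k removes s * c_k from leaf j
            err[j] -= s * coeffs[k]
        retained.remove(k)
        dropped.append(k)
    return sorted(retained), dropped, max(abs(e) for e in err)
```

Applied to the Haar coefficients of the example vector 11 -1 -6 8 -2 6 6 10, namely 4, -1, 2, -3, 6, -7, -4, -2 in error-tree order, the first two discards are the coefficient -1 and then the coefficient 2, after which the maximum absolute error is 3.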

  13. Complexity Analysis • Absolute Error Algorithm: Time: O(N log² N), Space: O(N) • Relative Error Algorithm: Time: O(N log³ N), Space: O(N log N)

  14. Extension to Data Streams • Major application area • Existing methods inapplicable • Assumption: O(B) available memory budget • Further Problem: • Extend proposed methods to streams • One-pass overall process • Construct and truncate error-tree on-the-fly

  15. Solution for Absolute Error • After the first B data values, a pair of coefficients is discarded for every arriving data pair • Scope limited to the error-tree constructed so far • A higher tree level is added each time the number of data values reaches a higher power of 2 • Frontline structure storing: • Hanging coefficient nodes • Temporary average of data in hanging subtree • Error information from deleted orphan nodes • Error propagation similar to static case, with some elaboration in upward propagation due to tree sparseness
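The on-the-fly construction of the error tree can be illustrated with a sketch of the frontline alone. It covers only the one-pass computation of detail coefficients and hanging averages; the discarding step and the error bookkeeping for orphan nodes are omitted. The representation (a list indexed by subtree height) and the name `one_pass_haar` are assumptions for illustration.

```python
def one_pass_haar(stream):
    """Consume values one at a time, keeping only a logarithmic-size frontline.

    Returns (details, hanging):
      details: (height, coefficient) pairs in the order they are completed
      hanging: (height, average) pairs for subtrees still waiting for a sibling
    """
    frontline = []      # frontline[h]: average of a complete subtree of 2**h values
    details = []
    for value in stream:
        carry, h = value, 0
        # Merge upward while a left sibling of the same height is waiting.
        while h < len(frontline) and frontline[h] is not None:
            left, frontline[h] = frontline[h], None
            details.append((h + 1, (left - carry) / 2))
            carry = (left + carry) / 2
            h += 1
        if h == len(frontline):     # the tree grows by one level
            frontline.append(None)
        frontline[h] = carry
    hanging = [(h, avg) for h, avg in enumerate(frontline) if avg is not None]
    return details, hanging
```

After the first six stream values 9 3 9 -5 5 13 of the example, this yields the details 3, 7, 2, -4 and the hanging averages 9 and 4, the same six values that the B = 6 example holds after 6 values.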

  16. Example: Classic Error-Tree • Data stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 7 3 … • [Figure: frontline and classic error tree over the stream]

  17. Example: Sibling Error-Tree • Data stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 7 3 … • [Figure: frontline and sibling error tree over the stream]

  18. Example: B = 6, after 6 values • Data stream so far: 9 3 9 -5 5 13 … • [Figure: frontline and error tree holding the values 2, 3, 7, -4, 9, 4]

  19. Example: B = 6, after 8 values • Data stream so far: 9 3 9 -5 5 13 13 17 … • [Figure: frontline and error tree]

  20. Example: B = 6, after 10 values • Data stream so far: 9 3 9 -5 5 13 13 17 14 -2 … • [Figure: frontline and error tree]

  21. Example: B = 6, after 12 values • Data stream so far: 9 3 9 -5 5 13 13 17 14 -2 9 7 … • [Figure: frontline and error tree]

  22. Example: B = 6, after 14 values • Data stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 7 3 • [Figure: frontline and error tree]

  23. Example: B = 6, after padding • Reconstruction: 4 4 11 -3 12 12 12 12 15 -1 7 7 5 5 • [Figure: final error tree holding the values -4, 7, 8, 1, 1, 7]

  24. Solution for Relative Error • Analogous Extension not feasible • Solution: Heuristic Techniques • Estimate of MRk calculated based on: • 4 quantities as in Absolute Error (with denominators) • Minimum Absolute values in each subtree (with errors) • A sample value (with error) for each subtree, initialized as Minimum Absolute value beneath, changed by error propagation process when a sample below involves larger relative error • Heuristic Estimate set as Maximum Relative Error among these 8 positions

  25. Experimental Setting • Experiments with Real Data: • Frequency counts in US Forest Service Database • Photon counts by Voyager 2 stellar occultation experiments • Temperature measures from equatorial Pacific • Comparison of both Static and Stream Algorithms with the Optimal Solution and the Conventional Method • Streaming Algorithm can produce window-based synopses by discarding those retained coefficients whose scope falls entirely outside the window of interest • We present results for fixed data sets arriving in stream in order to preserve comparability with those of the non-streaming algorithms • We present results for the relative error heuristic in the static case as well
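The window-based filtering mentioned above can be sketched as a scope test. The sketch assumes a fixed error tree over n values with coefficients indexed in error-tree order, a synopsis stored as a dictionary from index to value, and a half-open window [window_start, window_end); these choices and the function names are hypothetical.

```python
def coefficient_scope(k, n):
    """Half-open leaf range [start, end) covered by coefficient k."""
    if k == 0:                          # the overall average covers everything
        return 0, n
    level = k.bit_length() - 1
    width = n >> level
    start = (k - (1 << level)) * width
    return start, start + width


def window_synopsis(retained, n, window_start, window_end):
    """Drop retained coefficients whose scope lies entirely outside the window."""
    kept = {}
    for k, c in retained.items():
        start, end = coefficient_scope(k, n)
        if start < window_end and end > window_start:   # scopes overlap
            kept[k] = c
    return kept
```

For example, with the retained coefficients {0: 18, 3: -8, 6: 10, 7: 10} over 8 values, a window over the last two values keeps indices 0, 3 and 7, while a window over the first four values keeps only the overall average.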

  26. Experimental Results • Run-time, B = N / 16, Relative Error

  27. Experimental Results • Quality, Absolute Error, Real Data (frequency counts), N = 360

  28. Experimental Results • Quality, Relative Error, Real Data (frequency counts), N = 360

  29. Experimental Results • Scalability, Absolute Error, Real Data (photon counts), N = 16K

  30. Experimental Results • Scalability, Relative Error, Real Data (photon counts), N = 16K

  31. Experimental Results • Scalability, Absolute Error, Real Data (temperature measures), B = N / 16

  32. Experimental Results • Scalability, Relative Error, Real Data (temperature measures), B = N / 16

  33. Conclusions & Future Directions • Feasibility of Wavelet Synopses with near-optimal Error Guarantees at near-linear cost for both Static and Streaming Data • Extension to Multidimensional Wavelets? • Alternative Relative Error Heuristics? • Variable Coefficients? • Theoretical Worst-case Guarantee?

  34. Related Work • Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD 1998. • J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD 1999. • K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDB Journal 2001. • A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. VLDB 2001. • M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004. • S. Guha and B. Harb. Wavelet synopses for data streams: Minimizing non-Euclidean error. KDD 2005.

  35. Thank you! Questions?
