The Impact of Duality on Data Synopsis Problems

The Impact of Duality on Data Synopsis Problems. Panagiotis Karras, KDD, San Jose, August 13th, 2007. Work with Dimitris Sacharidis and Nikos Mamoulis. Introduction. Data synopsis problems require the optimization of error under a bound on space.


Presentation Transcript


  1. The Impact of Duality on Data Synopsis Problems. Panagiotis Karras, KDD, San Jose, August 13th, 2007. Work with Dimitris Sacharidis and Nikos Mamoulis.

  2. Introduction • Data synopsis problems require the optimization of error under a bound on space. • Classical approaches treat them in a direct manner, producing complicated solutions, and sometimes resorting to heuristics. • The parameters involved (space and error) have a monotonic relationship. • Hence, an alternative approach is possible, based on the dual, error-bounded problems.

  3. Outline • Histograms. • Restricted Haar Wavelet Synopses. • Unrestricted Haar and Haar+ Synopses. • Experiments. • Conclusions.

  4. Histograms • Approximate a data set [d1, d2, …, dn] with B buckets, si = [bi, ei, vi], so that a maximum-error metric is minimized. • Classical solutions: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005. • Recent solutions: Buragohain et al. ICDE 2007; Guha and Shim TKDE 19(7) 2007. For weighted error: Linear for:

  5. Histograms • Solve the error-bounded problem. Example with maximum absolute error bound ε = 2: data 4 5 6 2 15 17 3 6 9 12 …; buckets [ 4 ] [ 16 ] [ 4.5 ] [… • Generalized to any weighted maximum-error metric. Each value di defines a tolerance interval. A bucket is closed when the running intersection of the intervals becomes empty. Complexity:
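The greedy pass described on this slide can be sketched as follows. This is a minimal illustration for the maximum absolute error case only, not the authors' implementation: the function name `error_bounded_histogram` and the choice of the interval midpoint as the bucket value are assumptions made for the example. Each value d contributes the tolerance interval [d - ε, d + ε], and a bucket is closed as soon as the running intersection becomes empty.

```python
def error_bounded_histogram(data, eps):
    """Greedy single pass: buckets (begin, end, value) with |d - value| <= eps.

    Hypothetical helper for illustration; the bucket value is taken as the
    midpoint of the running intersection (an assumption, any point in the
    intersection would satisfy the bound).
    """
    if not data:
        return []
    buckets = []
    begin = 0
    lo, hi = data[0] - eps, data[0] + eps      # running intersection
    for i in range(1, len(data)):
        new_lo = max(lo, data[i] - eps)
        new_hi = min(hi, data[i] + eps)
        if new_lo > new_hi:                    # intersection is empty: close bucket
            buckets.append((begin, i - 1, (lo + hi) / 2))
            begin = i
            lo, hi = data[i] - eps, data[i] + eps
        else:
            lo, hi = new_lo, new_hi
    buckets.append((begin, len(data) - 1, (lo + hi) / 2))
    return buckets


# The slide's example: eps = 2 on the first ten values shown.
example = error_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2)
print([value for _, _, value in example])      # [4.0, 16.0, 4.5, 10.5]
```

The first three bucket values match the [ 4 ] [ 16 ] [ 4.5 ] shown on the slide. The pass touches each value once, so this sketch runs in time linear in the number of values.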

  6. Histograms • Apply to the space-bounded problem. Perform binary search in the domain of the error bound ε. For error values requiring space , with actual error , run an optimality test: Error-bounded algorithm running under constraint instead of If requires space, then the optimal solution has been reached. Complexity: Independent of the number of buckets B
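The symbols of the optimality test did not survive in this transcript, so the following is only one plausible reading of the idea, sketched for the maximum absolute error metric. The names `greedy` and `space_bounded_error` are hypothetical. The sketch binary-searches the error bound; whenever a bound fits in B buckets, it records the error actually incurred and then reruns the error-bounded pass with a strict bound (error strictly below that value). If the strict run needs more than B buckets, no better B-bucket histogram exists.

```python
def greedy(data, eps, strict=False):
    """Error-bounded pass for max absolute error (illustrative helper).

    Returns (bucket_count, actual_error). A bucket's error is half its
    value range; with strict=True that error must be < eps instead of <= eps.
    """
    count, actual = 1, 0.0
    lo = hi = data[0]
    for d in data[1:]:
        new_lo, new_hi = min(lo, d), max(hi, d)
        half = (new_hi - new_lo) / 2
        too_wide = half >= eps if strict else half > eps
        if too_wide:                 # close the bucket, start a new one at d
            count += 1
            lo = hi = d
        else:
            lo, hi = new_lo, new_hi
            actual = max(actual, half)
    return count, actual


def space_bounded_error(data, B):
    """Smallest max absolute error reachable with at most B buckets (B >= 1)."""
    best = (max(data) - min(data)) / 2   # one bucket always achieves this
    low = 0.0                            # largest bound found too tight so far
    # Optimality test: if even "error < best" needs more than B buckets, stop.
    while best > 0 and greedy(data, best, strict=True)[0] <= B:
        mid = (low + best) / 2
        count, actual = greedy(data, mid)
        if count <= B:
            best = actual                # feasible: keep the error actually incurred
        else:
            low = mid                    # infeasible: bound was too tight
    return best


data = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print(space_bounded_error(data, 4))      # 2.0
print(space_bounded_error(data, 1))      # 7.5
```

Because the search jumps to the actual error of each feasible run rather than narrowing a numeric interval to a fixed precision, it returns an exact optimum, and the number of buckets B enters only through the comparison `count <= B`.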

  7. Restricted Haar Wavelet Synopses • Select a subset of Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized. • Classical solutions: Garofalakis and Kumar PODS 2004; Guha VLDB 2005. [Figure: Haar decomposition tree of the data 34 16 2 20 20 0 36 16, with pairwise averages 25 11 10 26 and 18 18, overall average 18, and detail coefficients 0, 7 -8, 9 -9 10 10.]
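The numbers on this slide are consistent with the Haar decomposition of the eight data values 34, 16, 2, 20, 20, 0, 36, 16. A minimal sketch of that decomposition follows, assuming the common convention in which each detail coefficient is half the difference (left minus right) of a pair of averages, and assuming the input length is a power of two. The function name `haar_decompose` is illustrative.

```python
def haar_decompose(data):
    """Return [overall average] + detail coefficients, coarsest level first.

    Assumes len(data) is a power of two and detail = (left - right) / 2.
    """
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level = [(a - b) / 2 for a, b in pairs]       # details at this level
        averages = [(a + b) / 2 for a, b in pairs]    # averages passed upward
        details = level + details                     # coarser levels go first
    return averages + details


print(haar_decompose([34, 16, 2, 20, 20, 0, 36, 16]))
# [18.0, 0.0, 7.0, -8.0, 9.0, -9.0, 10.0, 10.0]
```

A restricted synopsis keeps only a subset of these eight coefficients and treats the rest as zero; the problem on this slide is choosing the subset that minimizes the maximum reconstruction error.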

  8. Restricted Haar Wavelet Synopses • Solve the error-bounded problem. Muthukrishnan FSTTCS 2005 Local search within each of subtrees in bottom Haar tree levels Complexity: • Apply to the space-bounded problem. Complexity: no significant advantage

  9. Unrestricted Haar and Haar+ Synopses • Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized. • Classical solutions: Guha and Harb KDD 2005, SODA 2006; Karras and Mamoulis ICDE 2007. time space [Figure: tree over data values d0 to d3 with coefficients c0 to c9 and nodes C1 to C3.]

  10. Unrestricted Haar and Haar+ Synopses unrestricted Haar • Solve the error-bounded problem. Haar+ Complexity: time space • Apply to the space-bounded problem. Complexity: significant time & space advantage

  11. Experiments: Histograms, Time vs. n

  12. Experiments: Histograms, Time vs. B

  13. Experiments: Haar Wavelets, Time vs. n

  14. Experiments: Haar Wavelets, Time vs. B

  15. Experiments: Haar+, Time vs. n

  16. Experiments: Haar+, Time vs. B

  17. Conclusions • Offline space-bounded data synopsis problems are more easily solvable through their error-bounded counterparts. • Complexities lower & independent of synopsis space. • Dual-problem-based algorithms are simpler, more scalable, more general, more elegant, and more memory-parsimonious than the direct ones. • Future: application on other data representation models, multi-measure, multi-dimensional data.

  18. Related Work • H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998. • S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004. • M. Garofalakis and A. Kumar. Wavelet synopses for general error metrics. TODS, 30(4):888–928, 2005 (also PODS 2004). • S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005. • S. Guha and B. Harb. Wavelet synopses for data streams: minimizing non-Euclidean error. KDD 2005. • S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006. • S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005. • P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007.

  19. Thank you! Questions? More discussion at Board 17 this evening

  20. Compact Hierarchical Histograms • Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized. • Heuristic solutions: Reiss et al. VLDB 2006. time space The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node. [Reiss et al. VLDB 2006] [Figure: tree over data values d0 to d3 with coefficients c0 to c6.]

  21. Compact Hierarchical Histograms • Solve the error-bounded problem. Next-to-bottom level case. [Figure: node ci with children c2i and c2i+1.]

  22. Compact Hierarchical Histograms • Solve the error-bounded problem. General, recursive case time Complexity: space • Apply to the space-bounded problem. Complexity: Polynomially Tractable
