The Impact of Duality on Data Synopsis Problems. Panagiotis Karras. KDD, San Jose, August 13th, 2007. Work with Dimitris Sacharidis and Nikos Mamoulis. Introduction. Data synopsis problems require the optimization of error under a bound on space.
The Impact of Duality on Data Synopsis Problems. Panagiotis Karras. KDD, San Jose, August 13th, 2007. Work with Dimitris Sacharidis and Nikos Mamoulis.
Introduction • Data synopsis problems require the optimization of error under a bound on space. • Classical approaches treat them in a direct manner, producing complicated solutions, and sometimes resorting to heuristics. • Parameters involved have a monotonic relationship. • Hence, an alternative approach is possible, based on the dual, error-bounded problems.
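The duality behind the last two bullets can be written compactly. The notation here is introduced only for illustration and does not come from the slides: S is a synopsis, |S| its size, E(S) its maximum error, B the space budget, and ε the error bound.

```latex
\begin{align*}
\text{Space-bounded:}\quad & \min_{S}\; E(S) \quad \text{subject to } |S| \le B,\\
\text{Error-bounded:}\quad & \min_{S}\; |S| \quad \text{subject to } E(S) \le \varepsilon.
\end{align*}
% Monotonicity: the minimum size B^*(\varepsilon) needed for error bound
% \varepsilon is non-increasing in \varepsilon, so the smallest \varepsilon
% with B^*(\varepsilon) \le B solves the space-bounded problem.
```

Under this reading, a space-bounded problem can be approached by searching over ε and solving the error-bounded problem at each step.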
Outline • Histograms. • Restricted Haar Wavelet Synopses. • Unrestricted Haar and Haar+ Synopses. • Experiments. • Conclusions.
Histograms • Approximate a data set [d1, d2, …, dn] with B buckets, si = [bi, ei, vi], so that a maximum-error metric is minimized. • Classical solutions: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005. • Recent solutions: Buragohain et al. ICDE 2007; Guha and Shim TKDE 19(7) 2007. For weighted error: Linear for:
Histograms • Solve the error-bounded problem. Example with maximum absolute error bound ε = 2: the data 4 5 6 2 | 15 17 | 3 6 | 9 12 … yield bucket values [ 4 ] [ 16 ] [ 4.5 ] [… • Generalized to any weighted maximum-error metric: each value di defines a tolerance interval, and a bucket is closed when the running intersection of these intervals becomes empty. Complexity:
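The greedy scan on this slide can be sketched in a few lines of Python for the unweighted maximum absolute error case. The function name `error_bounded_histogram` and the tuple layout `(begin, end, value)` are illustrative assumptions, and the input is assumed to be non-empty. The weighted generalization would only change how each tolerance interval is computed.

```python
def error_bounded_histogram(data, eps):
    """Greedy histogram under a maximum absolute error bound eps.

    Each value d defines the tolerance interval [d - eps, d + eps]; a bucket
    is closed as soon as the running intersection of intervals is empty.
    Returns a list of (begin_index, end_index, bucket_value) tuples.
    Assumes data is a non-empty sequence of numbers (illustrative sketch).
    """
    buckets = []
    start = 0
    lo, hi = data[0] - eps, data[0] + eps      # running intersection
    for i in range(1, len(data)):
        new_lo = max(lo, data[i] - eps)
        new_hi = min(hi, data[i] + eps)
        if new_lo > new_hi:
            # Intersection became empty: close the current bucket and
            # represent it by the midpoint of the last non-empty intersection.
            buckets.append((start, i - 1, (lo + hi) / 2))
            start = i
            lo, hi = data[i] - eps, data[i] + eps
        else:
            lo, hi = new_lo, new_hi
    buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets


if __name__ == "__main__":
    # The slide's example with eps = 2; the first three bucket values
    # are 4, 16 and 4.5, as on the slide.
    print(error_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2))
```

Any value inside the final intersection would satisfy the bound; the midpoint is chosen here because it also minimizes the bucket's own maximum error.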
Histograms • Apply to the space-bounded problem. Perform binary search in the domain of the error bound ε. For an error value whose solution fits within the space budget, take the actual error that solution achieves and run an optimality test: rerun the error-bounded algorithm under a constraint strictly tighter than that actual error. If this run requires more space than the budget, the optimal solution has been reached. Complexity: independent of the number of buckets B
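A minimal sketch of this binary search with the optimality test, again for unweighted maximum absolute error. It is one reading of the slide, not the authors' implementation: the helper names, the `strict` flag, and the iteration cap `max_iters` are assumptions made for illustration. For this metric a bucket's best achievable error is half its value range, so the greedy scan tracks the running minimum and maximum.

```python
def greedy_buckets(data, eps, strict=False):
    """Greedy error-bounded histogram; bucket error is half its value range.

    With strict=True every bucket must have error strictly below eps.
    Returns (begin_index, end_index, value) tuples. Illustrative sketch.
    """
    buckets = []
    start, mn, mx = 0, data[0], data[0]
    for i in range(1, len(data)):
        new_mn, new_mx = min(mn, data[i]), max(mx, data[i])
        half = (new_mx - new_mn) / 2
        if half > eps or (strict and half >= eps):
            buckets.append((start, i - 1, (mn + mx) / 2))
            start, mn, mx = i, data[i], data[i]
        else:
            mn, mx = new_mn, new_mx
    buckets.append((start, len(data) - 1, (mn + mx) / 2))
    return buckets


def actual_error(data, buckets):
    """Maximum absolute error actually achieved by a histogram."""
    return max(abs(data[i] - v) for b, e, v in buckets for i in range(b, e + 1))


def space_bounded_histogram(data, B, max_iters=64):
    """Binary search on the error bound, using at most B >= 1 buckets."""
    lo, hi = 0.0, (max(data) - min(data)) / 2
    best = greedy_buckets(data, hi)            # a single bucket fits at hi
    for _ in range(max_iters):
        err = actual_error(data, best)
        # Optimality test: if beating err strictly needs more than B
        # buckets, no B-bucket histogram has a smaller maximum error.
        if err == 0 or len(greedy_buckets(data, err, strict=True)) > B:
            return best, err
        mid = (lo + hi) / 2
        cand = greedy_buckets(data, mid)
        if len(cand) <= B:
            best, hi = cand, min(mid, actual_error(data, cand))
        else:
            lo = mid
    return best, actual_error(data, best)


if __name__ == "__main__":
    print(space_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4))
```

Shrinking the upper end of the search range to the actual error of each feasible candidate lets the search land on an achievable error value, which is what makes the optimality test able to terminate it early.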
Restricted Haar Wavelet Synopses • Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized. • Classical solutions: Garofalakis and Kumar PODS 2004; Guha VLDB 2005. [Figure: Haar decomposition tree for the data 34 16 2 20 20 0 36 16, with pairwise averages 25 11 10 26, then 18 18, then 18, and coefficients 18, 0, 7, -8, 9, -9, 10, 10]
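To make the setting concrete, here is a small Python sketch of the unnormalized Haar decomposition (pairwise averages and half-differences) and of the error incurred by keeping only a subset of coefficients. It uses the eight data values 34 16 2 20 20 0 36 16 from the slide's example tree. The function names and the coefficient ordering (overall average first, then details from coarse to fine) are assumptions for illustration, and the input length is assumed to be a power of two.

```python
def haar_decompose(data):
    """Unnormalized Haar transform: [overall average, details coarse to fine].

    Assumes len(data) is a power of two (illustrative sketch).
    """
    coeffs = []
    level = list(data)
    while len(level) > 1:
        pairs = list(zip(level[0::2], level[1::2]))
        details = [(a - b) / 2 for a, b in pairs]
        level = [(a + b) / 2 for a, b in pairs]
        coeffs = details + coeffs              # coarser details go in front
    return level + coeffs


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    level = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(level)]
        # Left child adds the detail coefficient, right child subtracts it.
        level = [v for a, d in zip(level, details) for v in (a + d, a - d)]
        pos += len(details)
    return level


def max_error_of_subset(data, keep):
    """Max absolute error when only coefficient indices in `keep` are
    retained and all other coefficients are treated as zero."""
    coeffs = haar_decompose(data)
    kept = [c if i in keep else 0 for i, c in enumerate(coeffs)]
    approx = haar_reconstruct(kept)
    return max(abs(d - a) for d, a in zip(data, approx))


if __name__ == "__main__":
    data = [34, 16, 2, 20, 20, 0, 36, 16]
    print(haar_decompose(data))
    print(max_error_of_subset(data, {0, 2, 3}))
```

A restricted synopsis chooses which indices go into `keep`; the space-bounded problem asks for the best such set of a given size, while the error-bounded problem asks for the smallest set meeting a given error.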
Restricted Haar Wavelet Synopses • Solve the error-bounded problem (Muthukrishnan FSTTCS 2005): local search within each of the subtrees in the bottom Haar tree levels. Complexity: • Apply to the space-bounded problem. Complexity: no significant advantage
Unrestricted Haar and Haar+ Synopses • Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized. • Classical solutions: Guha and Harb KDD 2005, SODA 2006; Karras and Mamoulis ICDE 2007. [Figure: Haar+ tree over data d0 … d3, with coefficients c0 … c9 grouped under nodes C1, C2, C3; time and space complexity formulas not recoverable]
Unrestricted Haar and Haar+ Synopses • Solve the error-bounded problem, for unrestricted Haar and for Haar+. Complexity (time, space): • Apply to the space-bounded problem. Complexity: significant time & space advantage
Conclusions • Offline space-bounded data synopsis problems are more easily solvable through their error-bounded counterparts. • Complexities are lower & independent of synopsis space. • Dual-problem-based algorithms are simpler, more scalable, more general, more elegant, and more memory-parsimonious than the direct ones. • Future: application to other data representation models and to multi-measure, multi-dimensional data.
Related Work • H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998. • S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004. • M. Garofalakis and A. Kumar. Wavelet synopses for general error metrics. TODS, 30(4):888–928, 2005 (also PODS 2004). • S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005. • S. Guha and B. Harb. Wavelet synopses for data streams: minimizing non-Euclidean error. KDD 2005. • S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006. • S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005. • P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007.
Thank you! Questions? More discussion at Board 17 this evening
Compact Hierarchical Histograms • Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized. • Heuristic solutions: Reiss et al. VLDB 2006. [Figure: CHH tree over data d0 … d3, with coefficients c0 … c6; time and space complexity formulas not recoverable] "The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al. VLDB 2006]
Compact Hierarchical Histograms • Solve the error-bounded problem. Next-to-bottom level case. [Figure: node ci with children c2i and c2i+1]
Compact Hierarchical Histograms • Solve the error-bounded problem. General, recursive case. Complexity (time, space): • Apply to the space-bounded problem. Complexity: polynomially tractable