Data-Streams and Histograms Sudipto Guha, Nick Koudas & Kyuseok Shim
Background • Histogram • Captures distribution statistics in an efficient manner • Applications • Query optimization • Approximate query answering • Data mining (time series in particular) • Piecewise transmission of data • EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT
Background • Data Stream • An ordered sequence of points that can be read only once or a small number of times • Applications • Mission critical network components • Dynamic traffic configuration, fault identification, troubleshooting • Performance of algorithm measured by number of passes algorithm must make over the stream
Motivation • Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?
Motivation • Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time • Revised algorithm uses little space, hence suitable for data stream model • Assumes cost of interval is monotonic under inclusion
Problem Statement • Given: • non-negative integers $v_1, \ldots, v_n$ • $k$ intervals (buckets) to partition the index range $1..n$ • Objective: • Minimize $\sum_{j=1}^{k} \mathrm{VAR}_j$, where $\mathrm{VAR}_j$ is the variance of the values in the $j$th bucket • Dynamic Programming solution: • $\mathrm{OPT}[k, n] = \min_{x<n} \{\mathrm{OPT}[k-1, x] + \mathrm{VAR}[(x+1)..n]\}$ • Runs in $O(n^2 k)$ time with $O(n)$ space
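The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `vopt_exact` is hypothetical, and the bucket cost is assumed to be the sum of squared deviations from the bucket mean (variance times bucket size), which is the usual V-OPT cost. Prefix sums make each bucket cost an O(1) lookup, and only two DP rows are kept, matching the O(n) space bound.

```python
from itertools import accumulate


def vopt_exact(values, k):
    """Minimum total squared error of a k-bucket histogram (assumes 1 <= k <= len(values))."""
    n = len(values)
    # Prefix sums of the values and of their squares.
    s = [0] + list(accumulate(values))
    sq = [0] + list(accumulate(v * v for v in values))

    def sse(i, j):
        # Squared error of values[i:j] (0-based, half-open) around its mean.
        total = s[j] - s[i]
        return (sq[j] - sq[i]) - total * total / (j - i)

    inf = float("inf")
    # prev[j]: best error for the first j values using p - 1 buckets.
    prev = [0.0] + [inf] * n
    for p in range(1, k + 1):
        cur = [inf] * (n + 1)
        for j in range(p, n + 1):
            # Try every position x where the last bucket can start.
            cur[j] = min(prev[x] + sse(x, j) for x in range(p - 1, j))
        prev = cur
    return prev[n]


print(vopt_exact([1, 2, 3, 4], 2))
```

The inner `min` scans all earlier positions for every `j`, which is where the quadratic dependence on n comes from.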
Intuition of Improvement • For $a \le x \le b$, • $\mathrm{VAR}[a..n] \ge \mathrm{VAR}[x..n] \ge \mathrm{VAR}[b..n]$ (1) • $\mathrm{OPT}[a..n] \ge \mathrm{OPT}[x..n] \ge \mathrm{OPT}[b..n]$ (2) • Use this monotonicity property to reduce the search space by settling for an approximation • Instead of storing the whole OPT function, approximate it by a histogram!
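Monotonicity under inclusion can be checked numerically on a small example. The sketch below assumes the interval cost is the sum of squared deviations (SSE); the helper names are hypothetical. It also evaluates the per-element variance (SSE divided by interval length) on the same data, since that quantity is not monotone under inclusion in general.

```python
def sse(values):
    # Sum of squared deviations from the mean.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def variance(values):
    # Per-element variance: SSE divided by the interval length.
    return sse(values) / len(values)


def suffix_costs(values, cost):
    # Cost of every suffix values[a:], for a = 0 .. n-1.
    return [cost(values[a:]) for a in range(len(values))]


def non_increasing(xs, tol=1e-9):
    # True if the sequence never goes up (within a small tolerance).
    return all(x + tol >= y for x, y in zip(xs, xs[1:]))


data = [5, 5, 5, 5, 0, 10]
print(non_increasing(suffix_costs(data, sse)))
print(non_increasing(suffix_costs(data, variance)))
```

On this data the SSE of the suffixes never increases as the suffix shrinks, while the per-element variance does increase, which is why the sketches here take SSE as the cost that satisfies property (1).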
Intuition of Improvement • For all $1 \le p \le k$, maintain intervals $(a_1, b_1), \ldots, (a_l, b_l)$ • The value at $b_i$ is at most $(1+\delta)$ times the value at $a_i$, for a small slack $\delta$ • The number of intervals $l$ depends on $p$ • The value for each interval substitutes for each value in the interval, reducing space and time complexity
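A one-pass version of this idea might look like the following sketch. It is an illustration under stated assumptions rather than the authors' implementation: the name `vopt_stream` is hypothetical, the bucket cost is taken to be the sum of squared deviations, and the per-level slack is set to `delta = eps / (2 * k)`. For each bucket count p, only the right endpoint of every interval is stored, together with the prefix statistics needed to price a final bucket that starts just after it.

```python
def vopt_stream(stream, k, eps):
    """One-pass approximate V-OPT error (sketch; assumes k >= 1 and 0 < eps <= 1)."""
    delta = eps / (2 * k)  # assumed per-level slack
    # levels[p] holds the intervals for p-bucket solutions, p = 1 .. k-1.
    # Each interval is [error_at_start, count, sum, sum_sq, error_at_end],
    # where count/sum/sum_sq are prefix statistics at the right endpoint.
    levels = [[] for _ in range(k)]
    n = s = sq = 0
    best = 0.0
    for v in stream:
        n += 1
        s += v
        sq += v * v
        apx = [0.0] * (k + 1)
        apx[1] = max(0.0, sq - s * s / n)  # a single bucket over the whole prefix
        for p in range(2, k + 1):
            cand = apx[p - 1]  # using fewer buckets is always allowed
            for _, c, cs, csq, err in levels[p - 1]:
                m = n - c  # length of the final bucket
                tail = max(0.0, (sq - csq) - (s - cs) ** 2 / m)
                cand = min(cand, err + tail)
            apx[p] = cand
        # Extend the last interval while the error stays within (1 + delta)
        # of the error at its start; otherwise open a new interval.
        for p in range(1, k):
            ivs = levels[p]
            if ivs and apx[p] <= (1 + delta) * ivs[-1][0]:
                ivs[-1][1:] = [n, s, sq, apx[p]]
            else:
                ivs.append([apx[p], n, s, sq, apx[p]])
        best = apx[k]
    return best


print(vopt_stream([1, 2, 3, 4], 2, 0.1))
```

The inner loop visits stored interval endpoints instead of every earlier position, so the work per arriving value depends on the number of intervals rather than on n. A larger `eps` lets intervals grow longer, trading accuracy for space and time.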
Results • Theorem: A $(1+\epsilon)$-approximation for V-OPT can be computed in $O((k^2/\epsilon) \log n)$ space and $O((n k^2/\epsilon) \log n)$ time in the data stream model
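To get a rough feel for these bounds, the sketch below plugs sample numbers into the two asymptotic expressions. It ignores constant factors and assumes a base-2 logarithm, so the outputs are order-of-magnitude estimates only; the function names and the sample values of n, k and eps are illustrative.

```python
import math


def exact_dp_steps(n, k):
    # Growth term of the exact dynamic program: n^2 * k.
    return n * n * k


def approx_steps(n, k, eps):
    # Growth term of the approximation: (n * k^2 / eps) * log n (base 2 assumed).
    return n * k * k / eps * math.log2(n)


n, k, eps = 10**6, 10, 0.1
print(exact_dp_steps(n, k))
print(approx_steps(n, k, eps))
```

For a stream of a million values the quadratic term dominates by several orders of magnitude, which is the practical argument for accepting a small loss in accuracy.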
Advantages and Disadvantages • Accuracy/runtime tradeoff can be controlled by the parameter $\epsilon$ • For data-stream model, alternatives abound: • Random sampling (simple, assumption of distribution) • Other histogram techniques (faster, less optimal) • Wavelet (flexibility) • Sliding Windows (later paper)
Conclusion • The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements • Proposed solution can be applied to data-stream model because values are not referred to unless they are stored