This paper explores the use of histograms to efficiently capture distribution statistics in data streams. It discusses applications in query optimization, approximate query answering, data mining, and piecewise transmission of data. The authors propose an algorithm for approximating histograms that runs fast and requires minimal space, making it suitable for data stream models.
Data-Streams and Histograms Sudipto Guha, Nick Koudas & Kyuseok Shim
Background • Histogram • Captures distribution statistics in an efficient manner • Applications • Query optimization • Approximate query answering • Data mining (time series in particular) • Piecewise transmission of data • EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT
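To make the difference between the two simplest histogram types named above concrete, here is a minimal sketch of how bucket boundaries might be chosen. The function names `equi_width_edges` and `equi_depth_edges` are hypothetical, and the rank-based boundary rule for equi-depth is one simple convention among several.

```python
def equi_width_edges(values, k):
    """Boundaries of k buckets that each span the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def equi_depth_edges(values, k):
    """Boundaries of k buckets that each hold roughly n/k values."""
    ordered = sorted(values)
    n = len(ordered)
    # Boundary i sits at the value whose rank is i*n/k (clamped to the last index).
    return [ordered[min(n - 1, i * n // k)] for i in range(k + 1)]


data = [1, 2, 3, 4, 100, 101, 102, 103]
print(equi_width_edges(data, 2))  # boundaries split the value range in half
print(equi_depth_edges(data, 2))  # boundaries split the sorted data in half
```

On skewed data such as this, equi-width puts its boundary in the empty middle of the range, while equi-depth places it where the data actually changes.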
Background • Data Stream • An ordered sequence of points that can be read only once or a small number of times • Applications • Mission-critical network components • Dynamic traffic configuration, fault identification, troubleshooting • Performance of an algorithm is measured by the number of passes it must make over the stream
Motivation • Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?
Motivation • Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time • Revised algorithm uses little space, hence suitable for data stream model • Assumes cost of interval is monotonic under inclusion
Problem Statement • Given: • non-negative integers $v_1, \ldots, v_n$ • $k$ intervals or buckets to partition the index range $1..n$ • Objective: • Minimize $\sum_{j=1}^{k} \mathrm{VAR}_j$, where $\mathrm{VAR}_j$ is the variance of the values in the $j$-th bucket • Dynamic Programming solution: • $\mathrm{OPT}[k, n] = \min_{x<n} \{\mathrm{OPT}[k-1, x] + \mathrm{VAR}[(x+1)..n]\}$ • Runs in $O(n^2 k)$ time with $O(n)$ space
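The recurrence above can be sketched directly in code. This is a minimal illustration, not the authors' implementation: it assumes the bucket cost VAR is the sum of squared deviations from the bucket mean (the usual V-optimal cost), the name `vopt_error` is hypothetical, and for readability it keeps the full table rather than only the previous row.

```python
from itertools import accumulate


def vopt_error(values, k):
    """Least total bucket cost for splitting values into k contiguous buckets.

    Direct O(n^2 k) evaluation of OPT[k, n] = min_{x<n} OPT[k-1, x] + VAR[(x+1)..n].
    Assumes 1 <= k <= len(values).
    """
    n = len(values)
    # Prefix sums of values and of squares give any bucket's cost in O(1).
    s = [0] + list(accumulate(values))
    sq = [0] + list(accumulate(v * v for v in values))

    def cost(i, j):
        """Sum of squared deviations from the mean over values[i:j], j > i."""
        total = s[j] - s[i]
        return (sq[j] - sq[i]) - total * total / (j - i)

    inf = float("inf")
    # opt[p][j]: least total cost of covering the first j values with exactly p buckets.
    opt = [[inf] * (n + 1) for _ in range(k + 1)]
    opt[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            opt[p][j] = min(opt[p - 1][x] + cost(x, j) for x in range(p - 1, j))
    return opt[k][n]


print(vopt_error([1, 1, 5, 5], 2))  # two flat runs fit two buckets exactly
```

The inner `min` over all split points `x` is the quadratic factor that the approximation on the following slides avoids.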
Intuition of Improvement • For $a \le x \le b$: • $\mathrm{VAR}[a..n] \ge \mathrm{VAR}[x..n] \ge \mathrm{VAR}[b..n]$ (1) • $\mathrm{OPT}[k-1, a] \le \mathrm{OPT}[k-1, x] \le \mathrm{OPT}[k-1, b]$ (2) • Use this monotonicity property to reduce the search space by settling for an approximation • Instead of storing the whole OPT function, approximate it by a histogram!
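Intuition of Improvement • For all $1 \le p \le k$, maintain intervals $(a_1, b_1), \ldots, (a_l, b_l)$ • $\mathrm{OPT}[p, b_i] \le (1+\delta)\,\mathrm{OPT}[p, a_i]$, where $\delta$ is a small tolerance set from $\epsilon$ and $k$ • The number of intervals $l$ depends on $p$ • A single stored value per interval stands in for every value in that interval, reducing space and time complexity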
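The interval idea can be sketched as a one-pass procedure. This is a rough illustration under stated assumptions, not the authors' exact algorithm: the bucket cost is taken to be the sum of squared deviations from the bucket mean, the per-level tolerance is assumed to be `delta = eps / (2 * k)`, only the error value (not the bucket boundaries) is returned, and the name `approx_vopt_error` is hypothetical.

```python
def approx_vopt_error(stream, k, eps):
    """One-pass estimate of the least total cost of a histogram with at most k buckets."""
    delta = eps / (2 * k)  # assumed per-level tolerance
    # levels[p] is the interval list for p buckets (index 0 unused). Each entry is
    # [err_at_start, end_index, err_at_end, sum_at_end, sqsum_at_end].
    levels = [[] for _ in range(k)]
    n, s, sq = 0, 0.0, 0.0
    errs = [0.0] * (k + 1)
    for v in stream:
        n += 1
        s += v
        sq += v * v
        errs = [0.0] * (k + 1)
        errs[1] = max(0.0, sq - s * s / n)  # one bucket over everything seen so far
        for p in range(2, k + 1):
            best = errs[p - 1]  # using fewer buckets is always allowed
            # Only stored interval endpoints are tried as the last split point.
            for _, b, err_b, s_b, sq_b in levels[p - 1]:
                ds = s - s_b
                tail = max(0.0, (sq - sq_b) - ds * ds / (n - b))
                best = min(best, err_b + tail)
            errs[p] = best
        for p in range(1, k):
            ivals = levels[p]
            if ivals and errs[p] <= (1 + delta) * ivals[-1][0]:
                # Still within (1 + delta) of the interval's start: extend it.
                ivals[-1][1:] = [n, errs[p], s, sq]
            else:
                # Error grew too much: open a new interval at this position.
                ivals.append([errs[p], n, errs[p], s, sq])
    return errs[k]


print(approx_vopt_error([1, 1, 5, 5], k=2, eps=0.5))
```

Each value is read once and then discarded; only running sums and the interval endpoints are kept, which is what makes the approach fit the data stream model.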
Results • Theorem: A $(1+\epsilon)$-approximation for V-OPT runs in $O((k^2/\epsilon)\log n)$ space and $O((nk^2/\epsilon)\log n)$ time in the data stream model
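A rough sketch of where the bounds in the theorem come from. The assumptions here are illustrative rather than the authors' exact constants: a per-level tolerance $\delta = \epsilon/(2k)$ with $0 < \epsilon \le 1$, input values bounded by a polynomial in $n$, and $l$ denoting the number of intervals kept per level.

```latex
\begin{align*}
\text{error compounded over } k \text{ levels:}\quad
  & (1+\delta)^{k} \le e^{k\delta} = e^{\epsilon/2} \le 1+\epsilon \\
\text{intervals per level:}\quad
  & l = O\!\left(\log_{1+\delta}\frac{\text{max error}}{\text{min nonzero error}}\right)
      = O\!\left(\tfrac{1}{\delta}\log n\right)
      = O\!\left(\tfrac{k}{\epsilon}\log n\right) \\
\text{space over } k \text{ levels:}\quad
  & k \cdot l = O\!\left(\tfrac{k^{2}}{\epsilon}\log n\right) \\
\text{time over } n \text{ items:}\quad
  & n \cdot k \cdot l = O\!\left(\tfrac{n k^{2}}{\epsilon}\log n\right)
\end{align*}
```

The interval count is logarithmic because each new interval starts only after the stored error has grown by a factor of more than $1+\delta$.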
Advantages and Disadvantages • Accuracy/runtime tradeoff can be controlled by the parameter $\epsilon$ • For data-stream model, alternatives abound: • Random sampling (simple, assumption of distribution) • Other histogram techniques (faster, less optimal) • Wavelet (flexibility) • Sliding Windows (later paper)
Conclusion • The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements • Proposed solution can be applied to data-stream model because values are not referred to unless they are stored