Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan
Review of Data Streams • Motivation: huge data stream that needs to be mined for info “efficiently.” • Applications: monitoring IP traffic, mining email and text message streams, etc.
The Mathematical Model • Sequence of integers $A = a_1, \ldots, a_m$, where each $a_i \in N = \{1, \ldots, n\}$. • For each $v \in N$, the frequency $m_v$ of $v$ is the number of occurrences of $v$ in $A$. • Statistics to be estimated are functions on $A$, but usually just on the $m_v$'s (e.g. frequency moments).
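To make the model concrete, here is a minimal Python sketch that computes the frequencies of a toy stream exactly. The stream `A` and the helper name `frequencies` are illustrative assumptions, not part of the talk. It stores one counter per distinct item, so it uses up to O(n) space and serves only as a baseline, not as a streaming algorithm.

```python
from collections import Counter

def frequencies(stream):
    """Return {v: m_v}, the number of occurrences of each item v in the stream."""
    return Counter(stream)

A = [3, 1, 3, 2, 3, 1]                   # toy stream: m = 6 items from N = {1, 2, 3}
m_v = frequencies(A)
m = sum(m_v.values())                    # stream length m
F2 = sum(c * c for c in m_v.values())    # second frequency moment, a function of the m_v's only
```

Here the second frequency moment is one example of a statistic that depends on the stream only through the frequencies.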
What is Entropy? • In physics: measure of disorder in a system. • In math: measure of randomness (or uniformity) of a probability distribution. • Formula: $H = -\sum_i p_i \log p_i$, for a distribution with probabilities $p_i$.
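Entropy on Data Streams • For big $m$, $m_v/m \to \Pr[v]$. So formula becomes: $H \approx -\sum_v \frac{m_v}{m} \log \frac{m_v}{m} = \log m - \frac{1}{m} \sum_v m_v \log m_v$. • Suffices to compute $m$ (easy) and $\mu = \sum_v m_v \log m_v$.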
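As a sanity check on this reduction, the sketch below computes the empirical entropy of a toy stream in two ways: directly from the ratios m_v/m, and as log m minus μ/m. It assumes that μ denotes the sum over v of m_v log m_v (consistent with the estimator X defined later) and that logarithms are base 2; the function names are hypothetical. Both functions use exact frequencies, so this is not a small-space algorithm.

```python
import math
from collections import Counter

def entropy_direct(stream):
    """H = -sum_v (m_v/m) * log2(m_v/m), computed from exact frequencies."""
    counts = Counter(stream)
    m = len(stream)
    return -sum((c / m) * math.log2(c / m) for c in counts.values())

def entropy_via_mu(stream):
    """H = log2(m) - mu/m, where mu = sum_v m_v * log2(m_v) (assumed definition)."""
    counts = Counter(stream)
    m = len(stream)
    mu = sum(c * math.log2(c) for c in counts.values())
    return math.log2(m) - mu / m

A = [1, 1, 2, 2, 3, 3, 4, 4]   # uniform over 4 items, so the entropy is 2 bits
h1 = entropy_direct(A)
h2 = entropy_via_mu(A)
```

The two values agree, which is why estimating μ (together with the easily maintained m) is enough to estimate the entropy.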
The Goal • Approximation algorithm to estimate $\mu$. • Approximate means to output a number $Y$ such that $\Pr[|Y - \mu| > \lambda\mu] \le \varepsilon$, for any user-specified $\lambda, \varepsilon > 0$. • Restrictions: $o(n)$, preferably $\tilde{O}(1)$, space, and only 1 pass over data.
The Algorithm • We want $Y$ to have $E[Y] = \mu$ and very small variance, so find a computable random variable $X$ with $E[X] = \mu$ and small variance, and compute it several times. • $Y$ is the median of $s_2$ RVs $Y_i$, each of which is the mean of $s_1$ RVs $X_{ij}$, which are independent, identically computed copies of $X$.
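The median-of-means combination described above can be sketched in a few lines of Python. The function name `median_of_means` and the callable `sample_x` (which should return one fresh, independent draw of X per call) are assumptions made for illustration. The toy input below is a fixed sequence rather than real draws of X; it only shows how the median discards a group whose mean is an outlier.

```python
import statistics

def median_of_means(sample_x, s1, s2):
    """Median of s2 group means, each the mean of s1 draws returned by sample_x()."""
    means = []
    for _ in range(s2):
        group = [sample_x() for _ in range(s1)]
        means.append(statistics.fmean(group))
    return statistics.median(means)

# Toy demonstration with a fixed sequence of "draws": the middle group is an outlier.
draws = iter([1, 2, 3, 100, 100, 100, 4, 5, 6])
Y = median_of_means(lambda: next(draws), s1=3, s2=3)   # group means are 2, 100, 5
```

Averaging within a group reduces the variance, while taking the median across groups drives down the probability that the final answer is far from μ.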
Computing X • Choose $p \in \{1, \ldots, m\}$ uniformly at random. • Let $r = \#\{q \ge p \mid a_q = a_p\}$ (so $r \ge 1$). • $X = m[r \log r - (r - 1) \log (r - 1)]$.
The Analysis • Easy: $E[Y] = E[X] = \mu$. • Hard: $\mathrm{Var}[Y]$ is very small. • Turns out $s_1 = O(\log n)$, $s_2 = O(1)$ works. • Each $X$ maintained in $O(\log n + \log m)$ space. • Total: $O(s_1 s_2 (\log n + \log m)) = O(\log n \log m)$, taking $\log n = O(\log m)$ (e.g. $m \ge n$).
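For a rough sense of scale, the space bound can be turned into a back-of-the-envelope calculation. Every concrete number below is an assumption chosen for illustration: the constants hidden in the O-notation are taken to be 1 for s1 and 5 for s2, and each copy of X is assumed to store only one item ID and one counter.

```python
import math

def estimator_bits(n, m, s1, s2):
    """Bits for s1*s2 copies of X, each storing one item ID and one counter r."""
    per_copy = math.ceil(math.log2(n)) + math.ceil(math.log2(m))
    return s1 * s2 * per_copy

n = 2**32      # assumed universe size
m = 2**40      # assumed stream length
s1 = 32        # assumed: log2(n) with hidden constant 1
s2 = 5         # assumed constant
bits = estimator_bits(n, m, s1, s2)   # 32 * 5 * (32 + 40) = 11520 bits, i.e. 1440 bytes
```

Under these assumptions the whole estimator fits in a couple of kilobytes, compared with the O(n) counters an exact computation would need.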
Future Directions • Extension to insert/delete streams, with applications in: • DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries. • Monitoring open flows through internet routers. • Lower-bound proof showing the algorithm is optimal, or an improved algorithm.