Estimating Entropy for Data Streams

Khanh Do Ba, Dartmouth College
Advisor: S. Muthu Muthukrishnan

Review of Data Streams
• Motivation: a huge data stream must be mined for information “efficiently.”
• Applications: monitoring IP traffic, mining email and text message streams, etc.

The Mathematical Model
• Sequence of integers A = a_1, …, a_m, where each a_i ∈ N = {1, …, n}.
• For each v ∈ N, the frequency m_v of v is the number of occurrences of v in A.
• Statistics to be estimated are functions on A, but usually just on the m_v’s (e.g. frequency moments).
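The model above can be illustrated with a small exact computation. This is only a sketch: the helper names `frequencies` and `frequency_moment` are hypothetical, and it stores every m_v explicitly, which takes space proportional to the number of distinct values and is therefore the baseline a streaming algorithm tries to avoid.

```python
from collections import Counter

def frequencies(stream):
    """Return a mapping v -> m_v, the number of occurrences of v in the stream."""
    return Counter(stream)

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum over v of m_v ** k."""
    return sum(count ** k for count in frequencies(stream).values())

A = [1, 3, 1, 2, 1, 3]             # m = 6, with m_1 = 3, m_2 = 1, m_3 = 2
print(frequency_moment(A, 0))      # number of distinct values: 3
print(frequency_moment(A, 1))      # stream length m: 6
print(frequency_moment(A, 2))      # 3**2 + 1**2 + 2**2 = 14
```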

What is Entropy?
• In physics: measure of disorder in a system.
• In math: measure of randomness (or uniformity) of a probability distribution.
• Formula: H = −Σ_{v ∈ N} Pr[v] log Pr[v].

Entropy on Data Streams
• For big m, m_v/m → Pr[v]. So the formula becomes: H = −Σ_v (m_v/m) log (m_v/m) = log m − (1/m) Σ_v m_v log m_v.
• Suffices to compute m (easy) and μ = Σ_v m_v log m_v.
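The reduction above can be checked numerically. The sketch below assumes μ = Σ_v m_v log m_v, the quantity consistent with E[X] = μ for the X defined on the "Computing X" slide, and assumes base-2 logarithms, since the slides do not fix a base. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy_direct(stream):
    """Empirical entropy: -sum over v of (m_v/m) * log2(m_v/m)."""
    m = len(stream)
    return -sum((c / m) * math.log2(c / m) for c in Counter(stream).values())

def mu(stream):
    """mu = sum over v of m_v * log2(m_v) (assumed definition, see lead-in)."""
    return sum(c * math.log2(c) for c in Counter(stream).values())

def entropy_from_mu(stream):
    """Same entropy, rewritten as log2(m) - mu/m."""
    m = len(stream)
    return math.log2(m) - mu(stream) / m

A = [1, 1, 2, 1, 3, 1, 2, 2]
print(entropy_direct(A), entropy_from_mu(A))  # the two values agree up to rounding
```

Because m is just a counter, any estimate of μ translates directly into an estimate of the entropy.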

The Goal
• Approximation algorithm to estimate μ.
• Approximate means to output a number Y such that Pr[|Y − μ| > λμ] ≤ ε, for any user-specified λ, ε > 0.
• Restrictions: o(n), preferably Õ(1), space, and only 1 pass over the data.

The Algorithm
• We want Y to have E[Y] = μ and very small variance, so find a computable random variable X with E[X] = μ and small variance, and compute it several times.
• Y is the median of s_2 random variables Y_i, each of which is the mean of s_1 random variables X_ij, computed independently and identically to X.
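The median-of-means combination described above might be sketched as follows. The function `median_of_means` and the callback `draw_x` are hypothetical names, and the uniform random variable is only a stand-in for X with a known mean, not the estimator from the slides.

```python
import random
import statistics

def median_of_means(draw_x, s1, s2):
    """Return Y: the median of s2 group means, each over s1 independent draws of X."""
    group_means = [
        statistics.fmean([draw_x() for _ in range(s1)])
        for _ in range(s2)
    ]
    return statistics.median(group_means)

# Stand-in X: uniform on [0, 10], so E[X] = 5 (illustration only).
rng = random.Random(0)
y = median_of_means(lambda: rng.uniform(0.0, 10.0), s1=200, s2=9)
print(y)  # typically close to 5
```

Averaging s_1 copies shrinks the variance, and taking the median of s_2 such averages makes it unlikely that a few bad groups move the output far from the mean.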

Computing X
• Choose p ∈ {1, …, m} uniformly at random.
• Let r = #{q ≥ p | a_q = a_p} (≥ 1).
• X = m[r log r − (r − 1) log (r − 1)].
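One way to compute X in a single pass, without knowing m in advance, is sketched below. It rests on several assumptions not stated in the slides: the random position p is chosen by reservoir sampling, logarithms are base 2, and 0 log 0 is taken to be 0 (needed when r = 1). The names `sample_x` and `xlogx` are hypothetical.

```python
import math
import random

def xlogx(r):
    """r * log2(r), with the convention that the value is 0 when r <= 0."""
    return r * math.log2(r) if r > 0 else 0.0

def sample_x(stream, rng=random):
    """One pass over the stream; returns X = m * (r log r - (r-1) log(r-1))."""
    token, r, m = None, 0, 0
    for a in stream:
        m += 1
        if rng.random() < 1.0 / m:
            # Reservoir step: position m becomes the sampled p with probability 1/m.
            token, r = a, 1
        elif a == token:
            # Another occurrence of a_p at a position q > p.
            r += 1
    return m * (xlogx(r) - xlogx(r - 1))

A = [1, 1, 2, 1, 3, 1, 2, 2]
rng = random.Random(0)
average = sum(sample_x(A, rng) for _ in range(20000)) / 20000
print(average)  # should be near mu = 4*log2(4) + 3*log2(3), about 12.75
```

Only the sampled token, the counter r, and the length m are kept, which matches the small per-copy state the slides aim for.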

The Analysis
• Easy: E[Y] = E[X] = μ.
• Hard: Var[Y] is very small.
• Turns out s_1 = O(log n), s_2 = O(1) works.
• Each X maintained in O(log n + log m) space.
• Total: O(s_1 s_2 (log n + log m)) = O(log n log m).

Future Directions
• Extension to insert/delete streams. Applications in:
  • DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries.
  • Monitoring open flows through internet routers.
• Lower-bound proof showing the algorithm is optimal, or an improved algorithm.
