Estimating Entropy for Data Streams


Presentation Transcript


  1. Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College. Advisor: S. Muthu Muthukrishnan

  2. Review of Data Streams • Motivation: huge data stream that needs to be mined for info “efficiently.” • Applications: monitoring IP traffic, mining email and text message streams, etc.

  3. The Mathematical Model • Sequence of integers A = a1, …, am, where each ai ∈ N = {1, …, n}. • For each v ∈ N, the frequency mv of v is the number of occurrences of v in A. • Statistics to be estimated are functions on A, but usually just on the mv's (e.g., frequency moments).
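
For a concrete (illustrative, not from the slides) example: the stream A = 1, 2, 1, 3, 1, 2 has m = 6 and frequencies m1 = 3, m2 = 2, m3 = 1.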

  4. What is Entropy? • In physics: measure of disorder in a system. • In math: measure of randomness (or uniformity) of a probability distribution. • Formula: H = –Σv Pr[v] log Pr[v].
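
As a quick illustrative check of the definition: for the uniform distribution, Pr[v] = 1/n for every v, so H = log n (the maximum), while a point mass on a single value gives H = 0 (no randomness).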

  5. Entropy on Data Streams • For big m, mv/m → Pr[v]. So the formula becomes: H = Σv (mv/m) log(m/mv) = log m – (1/m)·Σv mv log mv. • Suffices to compute m (easy) and μ = Σv mv log mv.
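
A minimal Python sketch (not from the slides) of the identity above, with an illustrative stream and helper names: it computes H directly from the frequencies mv, and again as log m – μ/m, using the natural log (any fixed base works the same way).

    from collections import Counter
    from math import log

    def exact_entropy(stream):
        # H = sum_v (m_v/m) * log(m/m_v), computed directly from the frequencies.
        m = len(stream)
        return sum((mv / m) * log(m / mv) for mv in Counter(stream).values())

    def entropy_via_mu(stream):
        # Same quantity via the reduction: H = log m - mu/m, with mu = sum_v m_v log m_v.
        m = len(stream)
        mu = sum(mv * log(mv) for mv in Counter(stream).values())
        return log(m) - mu / m

    A = [1, 2, 1, 3, 1, 2]                     # illustrative stream, m = 6
    assert abs(exact_entropy(A) - entropy_via_mu(A)) < 1e-9

Of course, this exact computation keeps the full frequency table (Ω(n) space); the streaming algorithm on the following slides avoids that by estimating μ.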

  6. The Goal • Approximation algorithm to estimate μ. • “Approximate” means outputting a number Y such that Pr[|Y – μ| ≥ λμ] ≤ ε, for any user-specified λ, ε > 0. • Restrictions: o(n) (preferably Õ(1)) space, and only 1 pass over the data.
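
For instance (illustrative numbers): with λ = 0.1 and ε = 0.05, the output Y is within 10% of μ with probability at least 95%.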

  7. The Algorithm • We want Y to have E[Y] = μ and very small variance, so find a computable random variable X with E[X] = μ and small variance, and compute it several times. • Y is the median of s2 RVs Yi, each of which is the mean of s1 independent, identically distributed copies Xij of X.
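
A minimal Python sketch of this median-of-means step, assuming a callable sample_X() that returns one fresh, independent copy of the basic estimator X from the next slide (the name and signature are illustrative):

    import statistics

    def median_of_means(sample_X, s1, s2):
        # Y_i = mean of s1 independent copies of X; output Y = median of the s2 Y_i's.
        # Averaging shrinks the variance of each Y_i; taking the median then drives
        # the failure probability of the (lambda, epsilon) guarantee down to epsilon.
        Ys = []
        for _ in range(s2):
            Ys.append(sum(sample_X() for _ in range(s1)) / s1)
        return statistics.median(Ys)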

  8. Computing X • Choose p ∈ {1, …, m} uniformly at random. • Let r = #{q ≥ p | aq = ap} (≥ 1). • X = m·[r log r – (r – 1) log (r – 1)].
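
A minimal Python sketch of one copy of X computed in a single pass: it uses reservoir sampling to pick the random position p on the fly and then counts later occurrences of ap, so that r = #{q ≥ p | aq = ap} as above (names are illustrative, and a nonempty stream is assumed).

    import random
    from math import log

    def sample_X(stream):
        # Reservoir sampling: after i elements, the kept position is uniform in {1, ..., i}.
        m = 0
        value, r = None, 0
        for a in stream:
            m += 1
            if random.randrange(m) == 0:   # with probability 1/m, move p to this position
                value, r = a, 1            # r counts a_p itself plus later matches
            elif a == value:
                r += 1
        # X = m * [r log r - (r - 1) log(r - 1)], with 1*log(1) and 0*log(0) read as 0.
        prev = (r - 1) * log(r - 1) if r > 1 else 0.0
        return m * (r * log(r) - prev)

Combining the two sketches, median_of_means(lambda: sample_X(A), s1, s2) estimates μ, and log(len(A)) – Y/len(A) then estimates H; in a true one-pass implementation, all s1·s2 copies of X are maintained in parallel over the same pass rather than by re-reading the stream.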

  9. The Analysis • Easy: E[Y] = E[X] = μ. • Hard: showing Var[Y] is very small. • Turns out s1 = O(log n), s2 = O(1) works. • Each X maintained in O(log n + log m) space. • Total: O(s1s2(log n + log m)) = O(log n log m).

  10. Future Directions • Extension to insert/delete streams, with applications in: • DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries. • Monitoring open flows through internet routers. • A lower-bound proof showing the algorithm is optimal, or an improved algorithm.
