
Efficient Computation of Frequent and Top-k Elements in Data Streams


Presentation Transcript


  1. Efficient Computation of Frequent and Top-k Elements in Data Streams

  2. Motivation • Motivated by Internet advertising commissioners. • Before rendering an advertisement for a user, query the click stream to decide which advertisements to display. • If the user's profile is not that of a frequent “clicker”, then s/he will probably not click any displayed advertisement. • Show Pay-Per-Impression advertisements. • If the user's profile is that of a frequent “clicker”, then s/he may click a displayed advertisement. • Show Pay-Per-Click advertisements. • Retrieve the top advertisements to choose what to display.

  3. Problem Definition • Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN. • The top-k elements are the k elements with the highest frequencies. • Both problems: • Are closely related, yet no integrated solution has been proposed. • Require O(min(N, |A|)) space for an exact solution. • Hence approximate variations are considered.
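For reference, the exact versions of both queries can be sketched in a few lines of Python. This is only an illustrative baseline, not part of the presentation: the function names `exact_frequent` and `exact_top_k` are hypothetical, and the sample stream is the one used later in the Space-Saving example. It keeps one counter per distinct element, which is where the O(min(N, |A|)) space cost comes from.

```python
from collections import Counter

def exact_frequent(stream, phi):
    """Return {element: frequency} for elements whose frequency exceeds phi * N."""
    counts = Counter(stream)   # one counter per distinct element
    n = len(stream)
    return {e: f for e, f in counts.items() if f > phi * n}

def exact_top_k(stream, k):
    """Return the k most frequent elements with their exact frequencies."""
    return Counter(stream).most_common(k)

stream = list("ABBACABBDDBEC")             # N = 13
frequent = exact_frequent(stream, 0.2)     # support is 0.2 * 13 = 2.6 hits
top2 = exact_top_k(stream, 2)
```

With this stream, only A (3 hits) and B (5 hits) exceed the support, and they are also the top-2 elements.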

  4. φN (φ - ) N Practical Frequent Elements • -Deficient Frequent Elements [Manku ‘02]: • All frequent elements output should have F > (φ - )N, where  is the user-defined error.

  5. F4 (1 - ) F4 Practical Top-k • FindApproxTop(S, k, ) [Charikar ‘02]: • Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ) Fk, where Ek is the kthranked element.

  6. The Space-Saving Algorithm • Space-Saving is counter-based • Monitor only m elements • Only over-estimation errors • Frequency estimation is more accurate for significant elements • Keep track of max. possible errors

  7. Space-Saving By Example • Stream: A B B A C A B B D D B E C • Space-Saving Algorithm: • For every element in the stream S: • If a monitored element is observed, increment its Count. • If a non-monitored element is observed: • Replace the element with the minimum hits, min. • Set the Count of the new element to min + 1. • The maximum possible over-estimation is kept as its error.
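The update rule on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the name `space_saving` is hypothetical, m = 3 is an assumption (the slide does not state m, but m = 3 is consistent with min = 4 on the next slide), and the new element's error is set to the evicted minimum, which is one reading of the slide's last bullet. The linear scan for the minimum is for brevity only; it does not give the constant-time updates that the Stream-Summary structure is designed for.

```python
def space_saving(stream, m):
    """Return {element: [count, error]} after processing stream with m counters."""
    summary = {}
    for e in stream:
        if e in summary:
            summary[e][0] += 1                        # monitored: increment its Count
        elif len(summary) < m:
            summary[e] = [1, 0]                       # free counter: the count is exact
        else:
            # Not monitored and no free counter: evict an element with minimum hits.
            victim = min(summary, key=lambda x: summary[x][0])
            min_count = summary.pop(victim)[0]
            summary[e] = [min_count + 1, min_count]   # Count = min + 1, error = min
    return summary

summary = space_saving("ABBACABBDDBEC", m=3)          # m = 3 is an assumption
counts = sorted(count for count, _ in summary.values())
```

On this stream the counts end up as 4, 4, and 5: they sum to N = 13, and B is monitored with an exact count of 5. Which elements hold the two counters at 4 depends on how ties for the minimum are broken.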

  8. Space-Saving Observations • S = ABBACABBDDBEC, N = 13 • Observations: • The sum of the Counts is N. • The minimum number of hits, min, is at most N/m. • In this example, min = 4. • The minimum number of hits, min, is an upper bound on the error of any element.

  9. Space-Saving Proved Properties • S = ABBACABBDDBEC, N = 13 • If an element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F1 = 5, min = 4. • The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith-ranked element. Example: F(A) = F2 = 3, Count2 = 4.

  10. Space-Saving Data Structure • We need a data structure that • Increments counters in constant time • Keeps elements sorted by their counters • We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]

  11. Frequent Elements Queries • Traverse Stream-Summary, and report all elements that satisfy the user support • Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element
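The query on this slide amounts to one pass over the monitored elements. The sketch below assumes the summary is available as a list of (element, count, error) tuples; that representation, the function name `frequent_elements`, and the sample values are illustrative assumptions, not taken from the presentation.

```python
def frequent_elements(summary, phi, n):
    """Split monitored elements into candidate and guaranteed frequent elements.

    summary is a list of (element, count, error) tuples.
    """
    threshold = phi * n
    # Candidates: the (possibly over-estimated) Count exceeds the support.
    candidates = [e for e, count, _ in summary if count > threshold]
    # Guaranteed: even after subtracting the maximum error, the support is exceeded.
    guaranteed = [e for e, count, error in summary if count - error > threshold]
    return candidates, guaranteed

# Hypothetical summary with three counters over a stream of N = 13 elements.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, phi=0.3, n=13)
```

Here the support is 0.3 × 13 = 3.9 hits, so all three elements are candidates, but only B is guaranteed: the guaranteed hits of E and C are 4 − 3 = 1.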

  12. Frequent Elements Example • For N = 73, m = 8, φ = 0.15: • Frequent elements should have a support of 11 hits (φN = 10.95). • Candidate frequent elements are B, D, and G. • Guaranteed frequent elements are B and D, since their guaranteed hits > 11.

  13. Top-k Elements Queries • Traverse the Stream-Summary and report the top-k elements. • From Property 2, we assert: • Guaranteed top-k elements: • Any element whose guaranteed hits = (Count − error) ≥ Count_{k+1} is guaranteed to be in the top-k. • Guaranteed top-k′ (where k′ ≈ k): • The top-k′ elements reported are guaranteed to be the correct top-k′ iff, for every element in the top-k′, guaranteed hits = (Count − error) ≥ Count_{k′+1}.
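The two guarantees on this slide can be checked as sketched below. The summary is assumed to be a list of (element, count, error) tuples sorted by count in descending order; the function name `guaranteed_top_k`, the restriction of the check to the first k reported elements, and the sample values are illustrative assumptions.

```python
def guaranteed_top_k(summary, k):
    """Check the top-k guarantees against a summary sorted by count, descending.

    Returns (elements among the first k that are guaranteed to be in the top-k,
             True if the first k are guaranteed to be the correct top-k).
    """
    if not 0 < k < len(summary):
        raise ValueError("k must be positive and smaller than the number of counters")
    count_next = summary[k][1]        # Count of the (k+1)th element (0-based index k)
    top = summary[:k]
    guaranteed = [e for e, count, error in top if count - error >= count_next]
    return guaranteed, len(guaranteed) == k

# Hypothetical summary with three counters.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
in_top_1, exact_1 = guaranteed_top_k(summary, 1)
in_top_2, exact_2 = guaranteed_top_k(summary, 2)
```

With this summary, B is guaranteed to be the top-1 element (5 − 0 ≥ 4). For k = 2 only B is guaranteed, because the guaranteed hits of E are 4 − 3 = 1, so the reported top-2 cannot be certified; k′ = 1 would be the certifiable alternative.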

  14. Top-k Elements Example • For k = 3, m = 8: • B, D, and G are the top-3 candidates. • B and D are guaranteed to be in the top-3. • B, D, G, and A are guaranteed to be the top-4. Here k′ = 4. • B and D are guaranteed to be the top-2. Another k′ = 2.

  15. Frequent Elements Precision

  16. Frequent Elements Run Time

  17. Frequent Elements Space Used

  18. Max freq. element in stream • Can we promise to find it with less than m buckets?
