Efficient Computation of Frequent & Top-k Elements in Data Streams

Efficient Computation of Frequent and Top-k Elements in Data Streams • Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi • Department of Computer Science, University of California, Santa Barbara

Motivation • Motivated by Internet advertising commissioners • Before rendering an advertisement for a user, query the clicks stream to decide which advertisements to display. • If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement. • Show Pay-Per-Impression advertisements. • If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement. • Show Pay-Per-Click advertisements. • Retrieve top advertisements to choose what to display.

Problem Definition • Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN • Top-k elements are the k elements with the highest frequency • Both problems: • Closely related, though no integrated solution has been proposed • Exact solution requires O(min(N, |A|)) space • Hence approximate variations are considered
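
To make the two definitions concrete, here is a minimal exact baseline that keeps one counter per distinct element, which is the O(min(N, |A|)) space cost the slide mentions. The function names and the value φ = 0.3 are illustrative assumptions; the stream is the one used later in the deck.

```python
from collections import Counter

def exact_frequent(stream, phi):
    """Elements whose exact frequency F exceeds the support phi * N."""
    counts = Counter(stream)              # one counter per distinct element
    threshold = phi * len(stream)
    return {e for e, f in counts.items() if f > threshold}

def exact_top_k(stream, k):
    """The k elements with the highest exact frequency."""
    return [e for e, _ in Counter(stream).most_common(k)]

stream = "ABBACABBDDBEC"                      # N = 13
frequent = exact_frequent(stream, phi=0.3)    # hypothetical phi; support is 3.9
top2 = exact_top_k(stream, k=2)
```

The approximate algorithms in the rest of the deck aim to answer the same two queries without storing a counter for every distinct element.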

Practical Frequent Elements • ε-Deficient Frequent Elements [Manku ‘02]: • All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.

Practical Top-k • FindApproxTop(S, k, ε) [Charikar ‘02]: • Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ε) Fk, where Ek is the kth-ranked element.

Related Work • Algorithms Classification • Counter-Based techniques • Keep an individual counter for each element • If the observed ID is monitored, its counter is updated • If the observed ID is not monitored, algorithm dependent action • Sketch-Based techniques • Estimate frequency for all elements using bit-maps of counters • Each element is hashed into the counters’ space using a family of hash functions. • Hashed-to counters are queried for the frequencies
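
As a rough illustration of the sketch-based family, the following is a small Count-Min-style sketch. It is chosen here only as a familiar example of hashing elements into shared counters; it is not claimed to be any of the specific algorithms compared in these slides. The class name, the SHA-256-based hash family, and the depth and width values are all assumptions.

```python
import hashlib

class CountMinSketch:
    """depth rows of width counters; each row uses its own hash function."""
    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += 1

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the tightest estimate.
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.depth))

sketch = CountMinSketch(depth=4, width=64)    # hypothetical sizes
for symbol in "ABBACABBDDBEC":
    sketch.add(symbol)
estimate_b = sketch.estimate("B")             # never below the true frequency of 5
```

Unlike a counter-based technique, every element of the alphabet can be queried, but no element's identity is stored in the sketch itself.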

Recent Work (Comparison)

Outline • Problem Definition • Space-Saving: Summarizing the Data Stream • Answering Frequent Elements Queries • Answering Top-k Queries • Experimental Results • Conclusion

The Space-Saving Algorithm • Space-Saving is counter-based • Monitor only m elements • Only over-estimation errors • Frequency estimation is more accurate for significant elements • Keep track of max. possible errors

Space-Saving By Example • S = A B B A C A B B D D B E C • Space-Saving Algorithm • For every element in the stream S • If a monitored element is observed • Increment its Count • If a non-monitored element is observed, • Replace the element with minimum hits, min • Increment the minimum Count to min + 1 • maximum possible over-estimation is error
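
The update rule above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes m = 3 counters (the slides do not state m for this example), stores the error alongside each count as min at the time of replacement, fills free counters first, and finds the minimum by a linear scan for brevity.

```python
def space_saving(stream, m):
    """Return {element: [count, error]} after one pass with at most m counters."""
    summary = {}
    for item in stream:
        if item in summary:
            summary[item][0] += 1                 # monitored: increment its Count
        elif len(summary) < m:
            summary[item] = [1, 0]                # free counter: exact so far
        else:
            # Not monitored: replace an element that has the minimum Count.
            victim = min(summary, key=lambda e: summary[e][0])
            min_count = summary.pop(victim)[0]
            # New Count is min + 1; it may over-estimate by at most min.
            summary[item] = [min_count + 1, min_count]
    return summary

summary = space_saving("ABBACABBDDBEC", m=3)      # m = 3 is an assumption
```

With these assumptions the run ends with B at Count 5 (error 0) and E and C at Count 4 (error 3 each), which does not depend on how ties for the minimum are broken.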

Space-Saving Observations • S = ABBACABBDDBEC, N = 13 • Observations: • The summation of the Counts is N • Minimum number of hits, min ≤ N/m • In this example, min = 4 • The minimum number of hits, min, is an upper bound on the error of any element
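
The bound on min follows from the first observation in one step. The derivation below is a short worked version; the numeric instance assumes m = 3 counters holding Counts 5, 4, and 4, which is consistent with min = 4 and N = 13 but is not stated on the slide.

```latex
\sum_{i=1}^{m} \mathrm{Count}_i = N
\quad\text{and}\quad
\mathrm{Count}_i \ge \mathit{min}\ \text{for all } i
\;\Longrightarrow\;
m \cdot \mathit{min} \le N
\;\Longrightarrow\;
\mathit{min} \le \frac{N}{m}.

% Assumed instance: m = 3, Counts 5, 4, 4
5 + 4 + 4 = 13 = N, \qquad \mathit{min} = 4 \le \tfrac{13}{3} \approx 4.33 .
```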

Space-Saving Proved Properties • S = ABBACABBDDBEC, N = 13 • If element E has frequency F > min, then E must be in Stream-Summary. • F(B) = F1 = 5, min = 4. • The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith ranked element. • F(A) = F2 = 3, Count2 = 4.

Space-Saving Data Structure • We need a data structure that • Increments counters in constant time • Keeps elements sorted by their counters • We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
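
One way to meet both requirements is sketched below. The slides do not show the layout of Stream-Summary, so this sketch assumes a doubly linked list of buckets, one per distinct Count value, kept in increasing order, with a hash table from element to bucket. The class and method names are hypothetical, and m is assumed to be at least 1.

```python
class Bucket:
    """All monitored elements that currently share one Count value."""
    def __init__(self, count):
        self.count = count
        self.items = {}          # element -> error, in insertion order
        self.prev = None
        self.next = None


class StreamSummary:
    def __init__(self, m):
        self.m = m               # maximum number of monitored elements
        self.head = None         # bucket holding the minimum Count
        self.where = {}          # element -> bucket it lives in

    def _place(self, item, error, count, after):
        """Store item in the bucket for `count` that follows `after` (None = front)."""
        nxt = after.next if after else self.head
        if nxt is None or nxt.count != count:
            new = Bucket(count)                   # splice in a new bucket
            new.prev, new.next = after, nxt
            if nxt:
                nxt.prev = new
            if after:
                after.next = new
            else:
                self.head = new
            nxt = new
        nxt.items[item] = error
        self.where[item] = nxt

    def _drop_if_empty(self, bucket):
        if bucket.items:
            return
        if bucket.prev:
            bucket.prev.next = bucket.next
        else:
            self.head = bucket.next
        if bucket.next:
            bucket.next.prev = bucket.prev

    def offer(self, item):
        if item in self.where:                    # monitored: move up one bucket
            bucket = self.where[item]
            error = bucket.items.pop(item)
            self._place(item, error, bucket.count + 1, bucket)
            self._drop_if_empty(bucket)
        elif len(self.where) < self.m:            # free counter: start at Count 1
            self._place(item, 0, 1, None)
        else:                                     # evict one minimum element
            bucket = self.head
            victim = next(iter(bucket.items))
            del bucket.items[victim]
            del self.where[victim]
            self._place(item, bucket.count, bucket.count + 1, bucket)
            self._drop_if_empty(bucket)

    def entries(self):
        """Return (element, Count, error) triples, smallest Count first."""
        out, bucket = [], self.head
        while bucket:
            out.extend((e, bucket.count, err) for e, err in bucket.items.items())
            bucket = bucket.next
        return out


summary = StreamSummary(m=3)                      # m = 3 is an assumption
for symbol in "ABBACABBDDBEC":
    summary.offer(symbol)
result = summary.entries()
```

Because a Count only ever grows by one, an element moves to the immediately following bucket, so each update touches a constant number of buckets and the list stays sorted without any explicit sorting step.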

Frequent Elements Queries • Traverse Stream-Summary, and report all elements that satisfy the user support • Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element
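
The query can be sketched as a single pass over the summary. The function name and the triple layout (element, Count, error) are assumptions, as is the sample summary: it is what a three-counter Space-Saving run over S = ABBACABBDDBEC would hold, and φ = 0.3 is picked only for illustration.

```python
def frequent_elements(summary, phi, n):
    """Split monitored elements into candidates and guaranteed frequent elements.

    summary is a list of (element, count, error) triples.
    """
    threshold = phi * n
    # Candidates: the (possibly over-estimated) Count satisfies the support.
    candidates = [e for e, count, error in summary if count > threshold]
    # Guaranteed: even after subtracting the maximum error, the support holds.
    guaranteed = [e for e, count, error in summary if count - error > threshold]
    return candidates, guaranteed

# Hypothetical summary for S = ABBACABBDDBEC with m = 3 counters.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, phi=0.3, n=13)
```

Here the support is 3.9 hits, so all three monitored elements are candidates, but only B is guaranteed, which matches the true frequencies (B occurs 5 times, E once, C twice).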

Frequent Elements Example • For N = 73, m = 8, φ = 0.15: • Frequent Elements should have support of 11 hits, since φN = 10.95. • Candidate Frequent Elements are B, D, and G. • Guaranteed Frequent Elements are B and D, since their guaranteed hits > 11.

Frequent Elements Space Bounds

Top-k Elements Queries • Traverse the Stream-Summary, and report top-k elements. • From Property 2, we assert: • Guaranteed top-k elements: • Any element whose guaranteed hits = (Count – error) ≥ Count_{k+1} is guaranteed to be in the top-k. • Guaranteed top-k’ (where k’ ≈ k): • The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_{k’+1}.
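
The two guarantees can be checked as follows. This is a sketch under assumptions: the function name and the (element, Count, error) triple layout are hypothetical, Count_{k+1} is taken as 0 when fewer than k + 1 elements are monitored, and the sample summary is what a three-counter run over S = ABBACABBDDBEC would hold.

```python
def top_k(summary, k):
    """Return (candidates, guaranteed, all_guaranteed) for a top-k query.

    summary is a list of (element, count, error) triples.
    """
    ranked = sorted(summary, key=lambda entry: entry[1], reverse=True)
    top = ranked[:k]
    # Count of the (k+1)-th entry; assumed 0 if there is no such entry.
    next_count = ranked[k][1] if len(ranked) > k else 0
    candidates = [e for e, _, _ in top]
    # An element is guaranteed if its guaranteed hits reach Count_{k+1}.
    guaranteed = [e for e, count, error in top if count - error >= next_count]
    # The reported set is the correct top-k iff every member is guaranteed.
    all_guaranteed = len(guaranteed) == len(top)
    return candidates, guaranteed, all_guaranteed

# Hypothetical summary for S = ABBACABBDDBEC with m = 3 counters.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
top1 = top_k(summary, k=1)
top2 = top_k(summary, k=2)
```

For k = 1 the answer B is fully guaranteed. For k = 2 only B is guaranteed, because E has just 1 guaranteed hit against Count_3 = 4, so a caller looking for a guaranteed top-k’ would fall back to k’ = 1.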

Top-k Elements Example • For k = 3, m = 8: • B, D, and G are the top-3 candidates. • B and D are guaranteed to be in the top-3. • B, D, G, and A are guaranteed to be the top-4. Here k’ = 4. • B and D are guaranteed to be the top-2. Another k’ = 2.

Top-k Elements Space Bounds

Outline • Problem Definition • Space-Saving: Summarizing the Data Stream • Answering Frequent Elements Queries • Answering Top-k Queries • Experimental Results • Conclusion

Experimental Results - Setup • Synthetic data: • Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0 • N = 10^7 hits. • Real Data (ValueClick, Inc.): Similar results • Precision: • number of correct elements found / entire output • Recall: • number of correct elements found / number of actual correct elements • Run time: • Processing Stream + Query Time • Space used: • Including hash table
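
The two quality metrics defined above can be computed as in the following sketch. The function name is hypothetical, and returning 1.0 for an empty output or an empty correct set is a convention assumed here, not something the slides specify. The sample values are illustrative, not experimental results.

```python
def precision_recall(output, correct):
    """Precision and recall of a reported element set against the true answer."""
    output, correct = set(output), set(correct)
    hits = len(output & correct)                  # correct elements found
    # Assumed convention: an empty denominator yields 1.0.
    precision = hits / len(output) if output else 1.0
    recall = hits / len(correct) if correct else 1.0
    return precision, recall

# Illustrative only: three elements reported, of which one is truly frequent.
precision, recall = precision_recall(output=["B", "E", "C"], correct=["B"])
```

In this toy case recall is 1 because the one correct element is present, while precision is 1/3 because two reported elements are false positives.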

Frequent Elements Results • Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2 • We compared with • GroupTest and Frequent • All algorithms had a recall of 1. • That is, every correct element appeared in each algorithm's output. • Space-Saving was able to guarantee all its output to be correct

Frequent Elements Precision

Frequent Elements Run Time

Frequent Elements Space Used

Top-k Elements Results • Query: k = 100, ε = 10^-4, and δ = 10^-2 • We compared with • CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality. • Probabilistic-InPlace: was allowed the same number of counters as Space-Saving • Space-Saving was able to guarantee all its output to be correct

Top-k Elements Precision

Top-k Elements Recall

Top-k Elements Run Time

Top-k Elements Space Used

Conclusion • Contributions: • An integrated approach to solve an interesting family of problems • Strict error bounds using little space • Guarantees on results • Special attention was given to Zipfian data • Experimental validation • Future Work: • Incremental frequent and top-k elements reporting
