How to find frequent items continuously in data streams

Speaker: 陳弘軒 Adviser: 王家祥

A naïve approach to find frequent items
• Method:
  • Maintain an array of counters, one per distinct item
  • Increment the corresponding counter by one whenever a new item arrives
• Problems:
  • The available array size M << n (the number of distinct items)
  • Not suitable for continuous queries
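The naïve method above can be sketched as follows. This is a minimal illustration, not the speaker's code: the function name `naive_frequent_items` and the `threshold` parameter are assumptions made for the example. It shows why memory grows with the number of distinct items.

```python
from collections import Counter


def naive_frequent_items(stream, threshold):
    """Count every distinct item, then report those above a frequency threshold.

    `threshold` is a fraction in (0, 1); it is a hypothetical parameter
    introduced for this sketch.
    """
    counts = Counter()
    total = 0
    for item in stream:
        counts[item] += 1  # one counter per distinct item: memory grows with n
        total += 1
    return {item for item, c in counts.items() if c > threshold * total}
```

Calling `naive_frequent_items("1222321", 0.5)` keeps three counters (for 1, 2 and 3) even though only one item can be a majority.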

Applications
• Statistical properties of sensor monitoring data
• Statistical properties of the Internet packets passing through a router
• Statistical properties of the keywords submitted to a search engine

Basic idea: MJRTY (majority voting) (*)
• Use one counter to find the majority item of a group
• Number of comparisons: n-1
• Example: 1222321

  item read:     1  2  2  2  3  2  1
  element_name:  1  ψ  2  2  2  2  2
  value:         1  0  1  2  1  2  1

*: Tech. Report ICSCA-CMP-32, Robert S. Boyer and J Strother Moore, 1982
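A minimal sketch of the one-counter vote described above. The function name `mjrty` and the use of `None` for the empty element_name (ψ on the slide) are choices made for this example only.

```python
def mjrty(items):
    """One-counter majority vote: returns (element_name, value).

    The returned element_name is the majority item only if a majority
    exists; a second pass over the data is needed to confirm that.
    """
    candidate = None  # empty element_name (ψ on the slide)
    value = 0
    for item in items:
        if value == 0:
            # Counter is free: adopt the new item.
            candidate, value = item, 1
        elif item == candidate:
            value += 1
        else:
            # A different item cancels one vote for the candidate.
            value -= 1
            if value == 0:
                candidate = None
    return candidate, value
```

On the slide's input, `mjrty("1222321")` ends with element_name `"2"` and value 1.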

Why does MJRTY work?
• Assume a majority item α exists in group G, and we delete any two different items from G:
  • If neither of the two items is α, α is still the majority after they are deleted
  • If one of the two items is α, α is still the majority, since the counts of α and of its adversaries are both decremented by one

Apply MJRTY to a distributed environment
• Merge two nodes with the same element_name
  • Add the values directly
  • Example: (d, 9) and (d, 3) merge into (d, 12)
• Merge two nodes with different element_names
  • Set value to the absolute value of the difference between the two values
  • Set element_name to the one with the larger value
  • Example: (d, 9) and (c, 3) merge into (d, 6)
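The two merge rules can be sketched as a single function. The name `merge_nodes` and the tuple layout `(element_name, value)` are assumptions for illustration. The slide does not say what happens when two different items tie, so this sketch treats a tie as an empty counter.

```python
def merge_nodes(a, b):
    """Merge two MJRTY counters, each given as (element_name, value)."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        # Same element_name: add the values directly.
        return name_a, val_a + val_b
    if val_a == val_b:
        # Assumption: equal votes for different items cancel out completely.
        return None, 0
    # Different element_names: keep the stronger one with the difference.
    if val_a > val_b:
        return name_a, val_a - val_b
    return name_b, val_b - val_a
```

For example, `merge_nodes(("d", 9), ("c", 3))` gives `("d", 6)`, matching the second rule.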

Apply MJRTY to data stream (basic)
• Required space ∝ window size
• Ex: number of available counters = 9

  time = t
  time:          -8  -7  -6  -5  -4  -3  -2  -1   0
  element_name:   c   d   d   c   b   d   b   a   d
  value:          3   9   7   3  10   2  22   8  15
  (the oldest counter, at time -8, is going to be recycled…)

  time = t+1
  time:          -8  -7  -6  -5  -4  -3  -2  -1   0
  element_name:   d   d   c   b   d   b   a   d   b
  value:          9   7   3  10   2  22   8  15   5
  (the newest counter, at time 0, uses the recycled counter element)
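A sketch of the basic scheme, with one counter per time unit and the oldest counter recycled when a new time unit starts. The class name `WindowedMjrty`, its methods, and the idea of answering a query by merging all counters in the window are assumptions for this example; the slide only shows the recycling step.

```python
from collections import deque


def merge_nodes(a, b):
    """Merge two (element_name, value) counters; ties become empty (assumption)."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b
    if val_a == val_b:
        return None, 0
    return (name_a, val_a - val_b) if val_a > val_b else (name_b, val_b - val_a)


class WindowedMjrty:
    def __init__(self, window_size):
        # A deque with maxlen drops its oldest entry on append,
        # which models recycling the oldest counter.
        self.slots = deque(maxlen=window_size)

    def push(self, element_name, value):
        """Store the finished counter of the newest time unit."""
        self.slots.append((element_name, value))

    def query(self):
        """Merge every counter in the window into one (element_name, value)."""
        result = (None, 0)
        for slot in self.slots:
            result = merge_nodes(result, slot)
        return result
```

Replaying the slide's example: filling nine slots with the counters at time t and then pushing `("b", 5)` drops `("c", 3)` and leaves the nine counters shown for time t+1.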

Apply MJRTY to data stream (improved)
• Required space ∝ log(window size)
• Each counter records: time, managed time unit (length 1, 2, 4, 8, 16 or 32), element_name, value
• "Three" counters are responsible for time units with length 1 → merge
• "Three" counters are responsible for time units with length 2 → merge

[Figure: counters along a time axis with boundaries at -66, -34, -18, -10, -6, -4, -2, -1, 0; labels: "Use the recycled counter element", "Going to be recycled…"]
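One way to read the merging rule above is sketched below. Several details are assumptions, since the slide does not state them: the class name `LogWindowMjrty`, merging the two oldest of three equal-length counters into one counter of double length, and recycling the oldest counter once it lies entirely outside the window.

```python
def merge_nodes(a, b):
    """Merge two (element_name, value) counters; ties become empty (assumption)."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b
    if val_a == val_b:
        return None, 0
    return (name_a, val_a - val_b) if val_a > val_b else (name_b, val_b - val_a)


class LogWindowMjrty:
    def __init__(self, window_size):
        self.window_size = window_size
        self.buckets = []  # oldest first; each entry is [length, element_name, value]

    def push(self, element_name, value):
        """Add the counter of the newest time unit (length 1)."""
        self.buckets.append([1, element_name, value])
        length = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[0] == length]
            if len(same) < 3:
                break
            # Three counters share this length: merge the two oldest of them.
            i, j = same[0], same[1]
            name, val = merge_nodes(self.buckets[i][1:], self.buckets[j][1:])
            self.buckets[i:j + 1] = [[2 * length, name, val]]
            length *= 2
        # Recycle the oldest counter once it is entirely outside the window.
        while sum(b[0] for b in self.buckets) - self.buckets[0][0] >= self.window_size:
            self.buckets.pop(0)

    def query(self):
        """Merge all stored counters into one (element_name, value)."""
        result = (None, 0)
        for _, name, val in self.buckets:
            result = merge_nodes(result, (name, val))
        return result
```

With this reading, at most two counters of each length survive a push, so the number of counters grows with the logarithm of the window size rather than with the window size itself.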

Extend MJRTY to HI-FRQCY (high-frequency)
• Frequent item: frequency > 1/(n+1)
• Use n counters to get the frequent items
• Ex: when n=2, input 11233

  item read:          1   1   2   3   3
  counter 1 name:     1   1   1   1   1
  counter 1 value:    1   2   2   1   1
  counter 2 name:     Φ   Φ   2   Φ   3
  counter 2 value:    0   0   1   0   1
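A minimal sketch of the n-counter extension. The function name `hi_frqcy` and the dictionary representation are choices made for this example. An absent key stands for an empty counter (Φ on the slide).

```python
def hi_frqcy(items, n):
    """Track candidate frequent items with at most n counters.

    Returns a dict {element_name: value}. Every item whose frequency
    exceeds 1/(n+1) is among the keys, but a key is not guaranteed to
    be frequent; a second pass is needed to confirm.
    """
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < n:
            # A free counter is available: start counting this item.
            counters[item] = 1
        else:
            # No match and no free counter: decrement every counter,
            # freeing those that reach zero. The new item is discarded.
            for name in list(counters):
                counters[name] -= 1
                if counters[name] == 0:
                    del counters[name]
    return counters
```

On the slide's input, `hi_frqcy("11233", 2)` ends with counters for items 1 and 3, each with value 1.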

Why does HI-FRQCY work?
• At most n items can have a frequency greater than 1/(n+1), so n counters are enough to record all frequent items
• If frequent items exist in group G, deleting n+1 different items from G does not affect the status of the frequent items

Apply HI-FRQCY to distributed and continuous environments
• Merge two nodes
  • If counters in the two nodes record the same item, merge them by adding their values
  • Sort the counters by value
  • Keep the n counters with the largest values as the result
• Can be applied to distributed systems
• Can be applied to continuous query environments
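The merge steps above can be sketched as follows. The function name `merge_summaries` and the dictionary representation `{element_name: value}` are assumptions for illustration. The sketch follows the slide literally and keeps the n largest counters unchanged.

```python
def merge_summaries(a, b, n):
    """Merge two HI-FRQCY nodes, each a dict {element_name: value}."""
    merged = dict(a)
    for name, value in b.items():
        # Counters that record the same item are added together.
        merged[name] = merged.get(name, 0) + value
    # Sort by value and keep the n largest counters.
    largest = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(largest)
```

For example, merging `{"d": 9, "c": 3}` with `{"d": 3, "b": 5}` for n = 2 keeps d with value 12 and b with value 5.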
