How to find frequent items continuously in data streams
Speaker: 陳弘軒
Adviser: 王家祥
A naïve approach to find frequent items
• Method:
  • Maintain an array of counters, one per distinct item
  • Increment the corresponding counter by one whenever a new item arrives
• Problem:
  • The available array size M << n (the number of distinct items)
  • Unsuitable for continuous queries
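The naïve method can be sketched in a few lines of Python. This is only an illustration: the function name `naive_frequent_items` and the `threshold` parameter are assumptions, not names from the slides. It keeps one counter per distinct item, which is exactly the memory cost the slide identifies as the problem.

```python
from collections import Counter

def naive_frequent_items(stream, threshold):
    """Count every distinct item, then report items whose relative
    frequency exceeds `threshold` (hypothetical parameter)."""
    counts = Counter()
    for item in stream:
        counts[item] += 1          # one counter per distinct item
    total = sum(counts.values())
    if total == 0:
        return set()
    return {item for item, c in counts.items() if c / total > threshold}

frequent = naive_frequent_items([1, 1, 2, 3, 3], 1 / 3)
```

Memory grows with the number of distinct items, so this sketch does not fit when only M << n counters are available.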
Applications
• Statistical properties of sensor monitoring data
• Statistical properties of the Internet packets passing through a router
• Statistical properties of the keywords submitted to a search engine
Basic idea: MJRTY (majority voting) (*)
• Use one counter to find the majority of a group
• Number of comparisons: n-1
• Example: 1222321
  item read:     -  1  2  2  2  3  2  1
  element_name:  ψ  1  ψ  2  2  2  2  2
  value:         0  1  0  1  2  1  2  1
*: Tech. Report ICSCA-CMP-32, Robert S. Boyer and J Strother Moore, 1982
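A minimal Python sketch of the single-counter idea follows. The function name `mjrty` and the use of `None` for the empty element_name ψ are assumptions made for illustration. Note that the returned candidate is only guaranteed to be the majority when a majority actually exists, so a second counting pass is needed to confirm it.

```python
def mjrty(stream):
    """One-counter majority vote: returns (element_name, value)."""
    name, value = None, 0          # None stands in for the empty name ψ
    for item in stream:
        if value == 0:
            name, value = item, 1  # adopt the new item as candidate
        elif item == name:
            value += 1             # same item: strengthen the candidate
        else:
            value -= 1             # different item: cancel one vote
            if value == 0:
                name = None
    return name, value

items = [1, 2, 2, 2, 3, 2, 1]
candidate, votes = mjrty(items)
# Second pass: confirm the candidate really is a majority.
is_majority = items.count(candidate) > len(items) / 2
```

On the slide's example 1222321 this ends with candidate 2 and value 1, and the second pass confirms that 2 occurs in 4 of the 7 items.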
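Why does MJRTY work?
• Assume a majority item α exists in group G (α makes up more than half of G), and we delete any 2 different items from G:
  • If neither of the two items is α, α's count is unchanged while G shrinks by two, so α is still the majority
  • If one of the two items is α, α is still the majority, since α and its adversary are both decremented by one: a count above |G|/2 stays above (|G|-2)/2 after losing one
• Every counter decrement in MJRTY is such a deletion of 2 different items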
Apply MJRTY to distributed environment …
• Merge two nodes with the same element_name
  • Add the values directly
  • Example: (d, 9) and (d, 3) merge into (d, 12)
• Merge two nodes with different element_names
  • Set value to the absolute value of the difference between the two values
  • Set element_name to the one with the larger value
  • Example: (d, 9) and (c, 3) merge into (d, 6)
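The two merge rules can be written as one small function. This is a sketch: the name `merge_nodes` and the (element_name, value) tuple layout are assumptions, and the tie case (different names, equal values) is not covered by the slide, so it is treated here as an empty counter.

```python
def merge_nodes(a, b):
    """Merge two MJRTY counters given as (element_name, value) tuples."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return name_a, value_a + value_b        # same name: add values
    if value_a == value_b:
        return None, 0                          # assumption: a tie cancels out
    winner = name_a if value_a > value_b else name_b
    return winner, abs(value_a - value_b)       # larger name, |difference|

same = merge_nodes(("d", 9), ("d", 3))
different = merge_nodes(("d", 9), ("c", 3))
```

The two calls reproduce the examples on the slide: (d, 12) for equal names and (d, 6) for different names.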
Apply MJRTY to data stream (basic)
• Required space: one counter per time unit, i.e. the window size
• Ex: number of available counters = 9
  time = t
    time:          -8  -7  -6  -5  -4  -3  -2  -1   0
    element_name:   c   d   d   c   b   d   b   a   d
    value:          3   9   7   3  10   2  22   8  15
    (the counter at time -8 is going to be recycled)
  time = t+1
    time:          -8  -7  -6  -5  -4  -3  -2  -1   0
    element_name:   d   d   c   b   d   b   a   d   b
    value:          9   7   3  10   2  22   8  15   5
    (the counter at time 0 uses the recycled counter element)
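The recycling scheme can be sketched with a fixed-length queue. The class name `WindowedMjrty` and its methods are hypothetical. Answering a query by folding the counters together with the distributed merge rule is an assumption here, since the slide only shows how counters are recycled.

```python
from collections import deque

def merge_nodes(a, b):
    """Merge two (element_name, value) counters; ties become empty (assumption)."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return name_a, value_a + value_b
    if value_a == value_b:
        return None, 0
    winner = name_a if value_a > value_b else name_b
    return winner, abs(value_a - value_b)

class WindowedMjrty:
    def __init__(self, num_counters):
        # deque(maxlen=...) drops the oldest counter when a new one arrives,
        # which plays the role of recycling the oldest counter element.
        self.counters = deque(maxlen=num_counters)

    def push(self, element_name, value):
        """Store the counter computed for the newest time unit."""
        self.counters.append((element_name, value))

    def query(self):
        """Fold all counters in the window, oldest first."""
        result = (None, 0)
        for node in self.counters:
            result = merge_nodes(result, node)
        return result

window = WindowedMjrty(9)
for node in [("c", 3), ("d", 9), ("d", 7), ("c", 3), ("b", 10),
             ("d", 2), ("b", 22), ("a", 8), ("d", 15)]:
    window.push(*node)           # state at time = t
window.push("b", 5)              # time = t+1: (c, 3) is recycled
```

After the last call the window holds the nine counters shown for time = t+1 on the slide, with (b, 5) in the newest position.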
Apply MJRTY to data stream (improved)
• Required space: log(window size)
• [Figure: counters laid out along a time axis marked -66, -34, -18, -10, -6, -4, -2, -1, 0. Each counter holds an element_name, a value, and the time unit it manages, of length 1, 2, 4, 8, 16 or 32. The oldest counter is going to be recycled; the newest time unit uses the recycled counter element.]
• “Three” counters are responsible for time units with length 1; counters of this length are merged into longer time units
• “Three” counters are responsible for time units with length 2; counters of this length are merged in the same way
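One way to read the figure is as a cascade of buckets whose lengths double. The sketch below assumes that a fourth counter of a given length triggers a merge of the two oldest counters of that length into one counter of twice the length; the slide does not spell out this rule, and the names `LogWindowMjrty` and `per_length` are hypothetical. Expiry of the oldest counter is omitted for brevity.

```python
def merge_nodes(a, b):
    """Merge two (element_name, value) counters; ties become empty (assumption)."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return name_a, value_a + value_b
    if value_a == value_b:
        return None, 0
    winner = name_a if value_a > value_b else name_b
    return winner, abs(value_a - value_b)

class LogWindowMjrty:
    def __init__(self, per_length=3):
        self.per_length = per_length
        # Oldest bucket first; each bucket is (length, element_name, value).
        self.buckets = []

    def add_time_unit(self, element_name, value):
        self.buckets.append((1, element_name, value))
        length = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[0] == length]
            if len(same) <= self.per_length:
                break
            # Buckets of equal length are adjacent, so same[1] == same[0] + 1.
            i, j = same[0], same[1]
            merged = merge_nodes(self.buckets[i][1:], self.buckets[j][1:])
            self.buckets[i:j + 1] = [(2 * length, *merged)]
            length *= 2            # the merge may overflow the next length

sketch = LogWindowMjrty()
for _ in range(10):
    sketch.add_time_unit("d", 1)
lengths = [b[0] for b in sketch.buckets]
```

With at most three buckets per length, the number of buckets grows with the logarithm of the covered time span, which is the space bound the slide states.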
Extend MJRTY to HI-FRQCY (high-frequency)
• Frequent item: frequency > 1/(n+1)
• Use n counters to get frequent items
• Ex: when n=2, stream 11233
  item read:               -    1    1    2    3    3
  counter 1 (name:value):  Φ:0  1:1  1:2  1:2  1:1  1:1
  counter 2 (name:value):  Φ:0  Φ:0  Φ:0  2:1  Φ:0  3:1
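A minimal sketch of the n-counter extension, assuming the usual reading of the scheme: when an item matches no counter and no counter is free, every counter is decremented and the arriving item is discarded. The function name `hi_frqcy` and the dictionary layout are assumptions. As with MJRTY, the result is a set of candidates that a second pass would have to confirm.

```python
def hi_frqcy(stream, n):
    """Keep at most n counters; returns {element_name: value}."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # item already tracked
        elif len(counters) < n:
            counters[item] = 1             # a free counter is available
        else:
            # No match and no free counter: decrement every counter and
            # drop the arriving item; empty counters are released.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

result = hi_frqcy([1, 1, 2, 3, 3], 2)
```

For the stream 11233 with n=2 the sketch ends with counters for items 1 and 3, the two items whose frequency 2/5 exceeds 1/3. With n=1 it behaves like MJRTY.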
Why does HI-FRQCY work?
• At most n items can have a frequency larger than 1/(n+1), so n counters are enough to record all frequent items
• If frequent items exist in group G, deleting n+1 different items from G does not affect the status of the frequent items (for n = 1 this is the MJRTY argument with 2 different items)
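Both claims follow from a short counting argument, written out here as a sketch. The symbols N (the size of G) and c (the count of a frequent item) are introduced only for this derivation. The deletion step removes one copy of each of n+1 different items, as happens when all n counters are decremented and the arriving item is discarded.

```latex
% At most n frequent items: if n+1 items each had count c_i > N/(n+1),
% their counts alone would exceed the size of the group.
\[
\sum_{i=1}^{n+1} c_i \;>\; (n+1)\cdot\frac{N}{n+1} \;=\; N
\quad\text{(contradiction, since } \textstyle\sum_i c_i \le N\text{)}
\]

% Deleting n+1 different items removes at most one copy of a frequent item
% and shrinks the group from N to N-(n+1):
\[
c > \frac{N}{n+1}
\;\Longrightarrow\;
c - 1 \;>\; \frac{N}{n+1} - 1 \;=\; \frac{N-(n+1)}{n+1}
\]
```

The second line shows that a frequent item stays above the 1/(n+1) threshold of the reduced group even in the worst case where one of its copies is deleted.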
Apply HI-FRQCY to distributed and continuous environment
• Merge two nodes
  • If any counters in the two nodes record the same item, merge them by adding their values
  • Sort the counters by value
  • Choose the n largest counters as the result
• Can be applied to distributed systems
• Can be applied to continuous query environments
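The merge steps listed above can be sketched as follows. The function name `merge_summaries` and the dictionary representation of a node are assumptions. The sketch follows the slide literally by keeping the n largest counters without adjusting their values.

```python
def merge_summaries(a, b, n):
    """Merge two HI-FRQCY nodes given as {element_name: value} dicts."""
    combined = dict(a)
    for item, value in b.items():
        # Counters that record the same item are added together.
        combined[item] = combined.get(item, 0) + value
    # Sort by value, largest first, and keep the n largest counters.
    largest = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(largest)

merged = merge_summaries({"d": 9, "c": 3}, {"d": 3, "b": 5}, 2)
```

Because the output has the same shape as the inputs, the same function can combine nodes of a distributed system or adjacent time units of a continuous query.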