How to find frequent items continuously in data streams

Presentation Transcript


  1. How to find frequent items continuously in data streams • Speaker: 陳弘軒 • Adviser: 王家祥

  2. A naïve approach to find frequent items • Method: • Maintain an array of counters, one per distinct item • Increment the corresponding counter by one whenever a new item arrives • Problems: • The available array size M is much smaller than n, the number of distinct items (M << n) • Not suited to continuous queries
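
As a baseline for the later slides, the naïve method can be sketched as below. This is a minimal illustration, not the speaker's code: the function name `naive_frequent_items` and the `threshold` parameter are assumptions, and a `Counter` stands in for the array of counters, so its memory use grows with the number of distinct items.

```python
from collections import Counter


def naive_frequent_items(stream, threshold):
    """Count every distinct item, then report those above the threshold.

    `threshold` is a fraction of the stream length (hypothetical parameter).
    """
    counts = Counter()
    total = 0
    for item in stream:
        counts[item] += 1  # one counter per distinct item
        total += 1
    return {item for item, count in counts.items() if count > threshold * total}


majority = naive_frequent_items([1, 2, 2, 2, 3, 2, 1], 0.5)
```

The answer is only available after the whole stream has been counted, and the dictionary needs one entry per distinct item, which is exactly the problem the slide names.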

  3. Applications • Statistical properties of sensor monitoring data • Statistical properties of the Internet packets passing through a router • Statistical properties of the keywords submitted to a search engine

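  4. Basic idea: MJRTY (majority voting) (*) • Use one counter, holding an element_name and a value, to find the majority item of a group • Number of comparisons: n-1 • Example: for the stream 1 2 2 2 3 2 1, the counter (element_name, value) goes (ψ, 0) → (1, 1) → (ψ, 0) → (2, 1) → (2, 2) → (2, 1) → (2, 2) → (2, 1), so the candidate is 2 • *: Robert S. Boyer and J Strother Moore, Tech. Report ICSCA-CMP-32, 1982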
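
The single-counter idea on this slide can be sketched as follows. The function name `mjrty` and the use of `None` for the empty element_name ψ are assumptions made for illustration; the logic follows the usual Boyer-Moore majority vote.

```python
def mjrty(stream):
    """Return the (element_name, value) counter after one pass over stream."""
    element_name = None  # plays the role of the empty name ψ
    value = 0
    for item in stream:
        if value == 0:
            # The counter is free: adopt the new item as the candidate.
            element_name, value = item, 1
        elif item == element_name:
            value += 1
        else:
            # A different item cancels one occurrence of the candidate.
            value -= 1
            if value == 0:
                element_name = None
    return element_name, value


candidate = mjrty([1, 2, 2, 2, 3, 2, 1])
```

The returned name is only a candidate: it is the majority item if a majority exists, so a stream without a guaranteed majority needs a second counting pass to confirm it.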

  5. Why does MJRTY work? • Assume a majority item α exists in group G, and delete any two different items from G: • If neither of the two items is α, α naturally remains the majority after they are deleted • If one of the two items is α, α still remains the majority, since the count of α and the count of its adversaries are both decremented by one

  6. Apply MJRTY to a distributed environment • Merging two nodes with the same element_name: • Add the values directly • Merging two nodes with different element_names: • Set value to the absolute value of the difference between the two values • Set element_name to the one with the larger value • Examples: (d, 9) and (d, 3) merge into (d, 12); (d, 9) and (c, 3) merge into (d, 6)
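
The two merge rules can be written as one small function. This is a sketch under assumptions: counters are modelled as `(element_name, value)` tuples, the name `merge_counters` is hypothetical, and on a tie between different names the first name is kept with value 0, a case the slide does not cover.

```python
def merge_counters(a, b):
    """Merge two MJRTY counters given as (element_name, value) tuples."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        # Same element_name: add the values directly.
        return name_a, value_a + value_b
    # Different element_names: keep the name with the larger value and
    # the absolute difference of the two values.
    if value_a >= value_b:
        return name_a, value_a - value_b
    return name_b, value_b - value_a


same_name = merge_counters(("d", 9), ("d", 3))
different_names = merge_counters(("d", 9), ("c", 3))
```

The two calls reproduce the node values that appear in the slide's figure, 12 and 6 for element d.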

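  7. Apply MJRTY to data stream (basic) • Required space: on the order of the window size • Ex: number of available counters = 9 • The counter of the oldest time unit ("Going to be recycled…") is recycled and reused for the newest time unit ("Use the recycled counter element")

     time = t
       time           -8   -7   -6   -5   -4   -3   -2   -1    0 (now)
       element_name    c    d    d    c    b    d    b    a    d
       value           3    9    7    3   10    2   22    8   15

     time = t+1
       time           -8   -7   -6   -5   -4   -3   -2   -1    0 (now)
       element_name    d    d    c    b    d    b    a    d    b
       value           9    7    3   10    2   22    8   15    5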
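
A minimal sketch of the basic scheme, assuming one `(element_name, value)` counter per time unit and a query that merges every counter in the window. The class name `BasicWindow` and its methods are hypothetical; the counter values are the ones listed on the slide.

```python
from collections import deque
from functools import reduce


def merge(a, b):
    """Merge two (element_name, value) counters with the MJRTY merge rules."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return name_a, value_a + value_b
    if value_a >= value_b:
        return name_a, value_a - value_b
    return name_b, value_b - value_a


class BasicWindow:
    """Sliding window holding one MJRTY counter per time unit."""

    def __init__(self, num_counters):
        # A full deque drops its oldest entry on append, which plays the
        # role of recycling the oldest counter.
        self.counters = deque(maxlen=num_counters)

    def push(self, counter):
        """Store the counter of the newest time unit."""
        self.counters.append(counter)

    def query(self):
        """Merge all counters, oldest first, into one candidate."""
        return reduce(merge, self.counters, (None, 0))


window = BasicWindow(9)
for counter in [("c", 3), ("d", 9), ("d", 7), ("c", 3), ("b", 10),
                ("d", 2), ("b", 22), ("a", 8), ("d", 15)]:
    window.push(counter)
at_t = window.query()
window.push(("b", 5))  # time = t+1: the counter (c, 3) is recycled
at_t_plus_1 = window.query()
```

Storage is one counter per time unit, so it grows linearly with the window, which is the cost the improved version targets.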

  8. Apply MJRTY to data stream (improved) • Required space: on the order of log(window size) • Each counter manages a time unit of length 1, 2, 4, 8, 16 or 32; the time axis is marked at -66, -34, -18, -10, -6, -4, -2, -1 and 0 (now) • When three counters are responsible for time units of length 1, merge • When three counters are responsible for time units of length 2, merge • The counter that is going to be recycled is reused for the newest time unit [Figure: counters (time unit, element_name, value) laid out along the time axis]
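
One possible reading of the improved scheme is sketched below. Several points are assumptions rather than statements from the slide: that the two oldest of three equal-length counters are the ones merged, that the merged counter covers a time unit of double length, and that counters lying entirely outside the window are dropped. The names `LogWindow`, `push`, `query` and `lengths` are hypothetical.

```python
from functools import reduce


def merge(a, b):
    """Merge two (element_name, value) counters with the MJRTY merge rules."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return name_a, value_a + value_b
    if value_a >= value_b:
        return name_a, value_a - value_b
    return name_b, value_b - value_a


class LogWindow:
    """Window whose counters cover time units of length 1, 2, 4, ..."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.buckets = []  # newest first; each entry is (length, counter)

    def push(self, counter):
        """Add the counter of the newest time unit (length 1)."""
        self.buckets.insert(0, (1, counter))
        length = 1
        while True:
            same = [i for i, (size, _) in enumerate(self.buckets) if size == length]
            if len(same) < 3:
                break
            # Assumption: merge the two oldest counters of this length.
            older, oldest = same[-2], same[-1]
            merged = merge(self.buckets[older][1], self.buckets[oldest][1])
            self.buckets[older:oldest + 1] = [(2 * length, merged)]
            length *= 2
        # Drop counters that lie entirely outside the window.
        kept, covered = [], 0
        for bucket in self.buckets:
            if covered >= self.window_size:
                break
            kept.append(bucket)
            covered += bucket[0]
        self.buckets = kept

    def lengths(self):
        return [size for size, _ in self.buckets]

    def query(self):
        return reduce(merge, (counter for _, counter in self.buckets), (None, 0))


window = LogWindow(100)
for _ in range(10):
    window.push(("d", 1))
```

Under these assumptions at most two counters of each length survive a push, so the number of stored counters grows with the logarithm of the window size, at the price of a coarser boundary at the old end of the window.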

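  9. Extend MJRTY to HI-FRQCY (high-frequency) • Frequent item: frequency > 1/(n+1) • Use n counters to get the frequent items • Ex: when n=2 and the stream is 1 1 2 3 3, the two counters (element_name, value) go [(Φ, 0), (Φ, 0)] → [(1, 1), (Φ, 0)] → [(1, 2), (Φ, 0)] → [(1, 2), (2, 1)] → [(1, 1), (Φ, 0)] → [(1, 1), (3, 1)]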
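
The n-counter generalisation can be sketched as follows. The function name `hi_frqcy` and the dictionary representation (a missing key stands for an empty counter Φ) are assumptions; the update rule is the common frequent-items scheme in which an unmatched item arriving at full counters decrements every counter.

```python
def hi_frqcy(stream, n):
    """Return a dict element_name -> value using at most n counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < n:
            # A free counter exists: start tracking the new item.
            counters[item] = 1
        else:
            # All n counters are taken by other items: the new item
            # cancels one occurrence of each tracked item.
            for name in list(counters):
                counters[name] -= 1
                if counters[name] == 0:
                    del counters[name]
    return counters


result = hi_frqcy([1, 1, 2, 3, 3], 2)
```

For the slide's stream with n=2 the surviving names are 1 and 3, the two items whose frequency 2/5 exceeds 1/3. As with MJRTY, the survivors are candidates, and exact frequencies need a second pass.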

  10. Why does HI-FRQCY work? • At most n items can have a frequency larger than 1/(n+1), so n counters are enough to record all frequent items • If frequent items exist in group G, deleting n+1 different items from G does not affect their status as frequent items: each frequent item loses at most one occurrence while G shrinks by n+1 items
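
The deletion argument can be checked with one inequality. The notation is an assumption introduced here: N is the size of G, c is the number of occurrences of a frequent item, and a deletion round is taken to remove n+1 pairwise different items (the n recorded items plus the newly arrived one).

```latex
c > \frac{N}{n+1}
\quad\Longrightarrow\quad
c - 1 \;>\; \frac{N}{n+1} - 1 \;=\; \frac{N - (n+1)}{n+1}
```

The item loses at most one occurrence in the round, so its remaining count still exceeds a 1/(n+1) share of the remaining N-(n+1) items. With n=1 this is the MJRTY argument about deleting two different items.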

  11. Apply HI-FRQCY to distributed and continuous environments • Merging two nodes: • If counters in the two nodes record the same item, merge them by adding their values • Sort the counters • Keep the n largest counters as the result • Can be applied to distributed systems • Can be applied to continuous-query environments
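
The three merge steps map directly onto Python's `collections.Counter`. This is a sketch under assumptions: each node's counters are a dict from element_name to value, the name `merge_hi_frqcy` is hypothetical, and counters for the same item are combined by adding their values, as in the MJRTY merge.

```python
from collections import Counter


def merge_hi_frqcy(node_a, node_b, n):
    """Merge two nodes' counters and keep the n largest."""
    combined = Counter(node_a)
    combined.update(node_b)  # counters recording the same item are added
    # most_common(n) sorts by value, largest first, and keeps n entries.
    return dict(combined.most_common(n))


merged = merge_hi_frqcy({"d": 9, "c": 3}, {"d": 3, "b": 5}, 2)
```

The same function serves both uses named on the slide: merging summaries from different machines, and merging the summaries of adjacent time units in a sliding window.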
