1 / 46

COMP5331

COMP5331. Data Stream. Prepared by Raymond Wong Presented by Raymond Wong raywong@cse. Data Mining over Static Data. Association Clustering Classification. Output (Data Mining Results). Static Data. Data Mining over Data Streams. Association Clustering Classification. Output

wilma-allen
Download Presentation

COMP5331

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. COMP5331 Data Stream Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

  2. Data Mining over Static Data • Association • Clustering • Classification Output (Data Mining Results) Static Data

  3. Data Mining over Data Streams • Association • Clustering • Classification Output (Data Mining Results) … Unbounded Data Real-time Processing

  4. 1 2 … Less recent More recent Data Streams Each point: a transaction

  5. Data Streams

  6. 1 2 … Less recent More recent Entire Data Streams Each point: a transaction Obtain the data mining results from all data points read so far

  7. 1 2 … Less recent More recent Entire Data Streams Each point: a transaction Obtain the data mining results over a sliding window

  8. Data Streams • Entire Data Streams • Data Streams with Sliding Window

  9. Entire Data Streams Frequent pattern/item • Association • Clustering • Classification

  10. 1 2 … Less recent More recent Frequent Item over Data Streams • Let N be the length of the data streams • Let s be the support threshold (in fraction) (e.g., 20%) • Problem: We want to find all items with frequency >= sN Each point: a transaction

  11. Data Streams

  12. Data Streams • Frequent item • I1 • Infrequent item • I2 • I3 Output (Data Mining Results) Static Data Output (Data Mining Results) … Unbounded Data • Frequent item • I1 • I3 • Infrequent item • I2

  13. False Positive/Negative • False Positive • The item is classified as frequent item • In fact, the item is infrequent • E.g. • Expected Output • Frequent item • I1 • Infrequent item • I2 • I3 • Algorithm Output • Frequent item • I1 • I3 • Infrequent item • I2 Which item is one of the false positives? I3 More? No. No. of false positives = 1 If we say:The algorithm has no false positives. All true infrequent items are classified as infrequent items in the algorithm output.

  14. False Positive/Negative • False Negative • The item is classified as infrequent item • In fact, the item is frequent • E.g. • Expected Output • Frequent item • I1 • I3 • Infrequent item • I2 • Algorithm Output • Frequent item • I1 • Infrequent item • I2 • I3 Which item is one of the false negatives? I3 More? No. No. of false negatives = 1 No. of false positives = 0 If we say:The algorithm has no false negatives. All true frequent items are classified as frequent items in the algorithm output.

  15. Data Streams We need to introduce an input error parameter 

  16. Data Streams • Frequent item • I1 • Infrequent item • I2 • I3 Output (Data Mining Results) Static Data Output (Data Mining Results) … Unbounded Data • Frequent item • I1 • I3 • Infrequent item • I2

  17. N: total no. of occurrences of items Data Streams • Store the statistics of all items • I1: 10 • I2: 8 • I3: 12 N = 20 = 0.2 N = 4 Output (Data Mining Results) Static Data D <= N ? Diff. D 0 Yes 4 Yes 2 Yes Output (Data Mining Results) … Unbounded Data • Estimate the statistics of all items • I1: 10 • I2: 4 • I3: 10

  18. -deficient synopsis • Let N be the current length of the stream(or total no. of occurrences of items) • Let  be an input parameter (a real number from 0 to 1) • An algorithm maintains an -deficient synopsis if its output satisfies the following properties • Condition 1: There is no false negative. All true frequent items are classified as frequent items in the algorithm output. • Condition 2: The difference between the estimated frequency and the true frequency is at most N. • Condition 3: All items whose true frequencies less than (s-)N are classified as infrequent items in the algorithm output

  19. Frequent Pattern Mining over Entire Data Streams • Algorithm • Sticky Sampling Algorithm • Lossy Counting Algorithm • Space-Saving Algorithm

  20. s Statistics of items  Output  … Unbounded Data Sticky Sampling Algorithm Support threshold Stored in the memory Sticky Sampling Error parameter Confidence parameter Frequent items Infrequent items

  21. Sticky Sampling Algorithm • The sampling rate r varies over the lifetime of a stream • Confidence parameter  (a small real number) • Let t = 1/ ln(s-1-1)

  22. Sticky Sampling Algorithm • e.g. s = 0.02 = 0.01 • = 0.1 t = 622 • The sampling rate r varies over the lifetime of a stream • Confidence parameter  (a small real number) • Let t = 1/ ln(s-1-1) 1~1244 1 ~ 2*622 1245~2488 2*622+1 ~ 4*622 4*622+1 ~ 8*622 2489~4976

  23. Sticky Sampling Algorithm • e.g. s = 0.5 = 0.35 • = 0.5 t = 4 • The sampling rate r varies over the lifetime of a stream • Confidence parameter  (a small real number) • Let t = 1/ ln(s-1-1) 1~8 1 ~ 2*4 9~16 2*4+1 ~ 4*4 4*4+1 ~ 8*4 17~32

  24. Sticky Sampling Algorithm element Estimated frequency • S: empty list will contain (e, f) • When data e arrives, • if e exists in S, increment f in (e, f) • if e does not exist in S, add entry (e, 1) with prob. 1/r (where r: sampling rate) • Just after r changes, • For each entry (e, f), • Repeatedly toss a coin with P(head) = 1/r until the outcome of the coin toss is head • If the outcome of the toss is tail, • Decrement f in (e, f) • If f = 0, delete the entry (e, f) • [Output] Get a list of items where f + N >= sN

  25. Analysis • -deficient synopsis • Sticky Sampling computes an -deficient synopsis with probability at least 1- • Memory Consumption • Sticky Sampling occupies at most 2/ ln(s-1-1)  entries on average

  26. Frequent Pattern Mining over Entire Data Streams • Algorithm • Sticky Sampling Algorithm • Lossy Counting Algorithm • Space-Saving Algorithm

  27. s Statistics of items  Output … Unbounded Data Lossy Counting Algorithm Support threshold Stored in the memory Lossy Counting Error parameter Frequent items Infrequent items

  28. 1 2 … Bucket bcurrent Bucket 3 Bucket 2 Bucket 1 Width w = bcurrent = Less recent More recent Lossy Counting Algorithm N: current length of stream Each point: a transaction …

  29. Lossy Counting Algorithm element Frequency of element since this entry was inserted into D • D: Empty set • Will contain (e, f, ) Max. possible error in f • When data e arrives, • If e exists in D, • Increment f in (e, f, ) • If e does not exist in D, • Add entry (e, 1, bcurrent-1) Remove some entries in D whenever N  0 mod w(i.e., whenever it reaches the bucket boundary)The rule of deletion is: (e, f, ) is deleted if f +  <= bcurrent [Output] Get a list of items wheref + N >= sN

  30. Lossy Counting Algorithm • -deficient synopsis • Lossy Counting computes an -deficient synopsis • Memory Consumption • Lossy Counting occupies at most 1/ log(N)  entries.

  31. Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1000 Memory = 1243 Memory = 231

  32. Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1,000,000 Memory = 1243 Memory = 922

  33. Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1,000,000,000 Memory = 1243 Memory = 1612

  34. Frequent Pattern Mining over Entire Data Streams • Algorithm • Sticky Sampling Algorithm • Lossy Counting Algorithm • Space-Saving Algorithm

  35. s Statistics of items  Output  … Unbounded Data Sticky Sampling Algorithm Support threshold Stored in the memory Sticky Sampling Error parameter Confidence parameter Frequent items Infrequent items

  36. s Statistics of items  Output … Unbounded Data Lossy Counting Algorithm Support threshold Stored in the memory Lossy Counting Error parameter Frequent items Infrequent items

  37. s Statistics of items M Output … Unbounded Data Space-Saving Algorithm Support threshold Stored in the memory Space-Saving Memory parameter Frequent items Infrequent items

  38. Space-Saving • M: the greatest number of possible entries stored in the memory

  39. Space-Saving Frequency of element since this entry was inserted into D element • D: Empty set • Will contain (e, f, ) Max. possible error in f pe = 0 • When data e arrives, • If e exists in D, • Increment f in (e, f, ) • If e does not exist in D, • If the size of D = M • pe mineD {f + } • Remove all entries e where f +   pe • Add entry (e, 1, pe) [Output] Get a list of items where f +  >= sN

  40. Space-Saving • Greatest Error • Let E be the greatest error in any estimated frequency. E  1/M • -deficient synopsis • Space-Saving computes an -deficient synopsis if E  

  41. Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1,000,000,000 Memory = 1243 Memory = 1612 Memory can be very large (e.g., 4,000,000) Since E <= 1/M  the error is very small

  42. Data Streams • Entire Data Streams • Data Streams with Sliding Window

  43. Data Streams with Sliding Window Frequent pattern/itemset • Association • Clustering • Classification

  44. t1 t2 … Sliding window Sliding Window • Mining Frequent Itemsets in a sliding window • E.g. t1: I1 I2 t2: I1 I3 I4 … • To find frequent itemsets in a sliding window

  45. B3 B4 B1 B2 Sliding window Last 4 batches Sliding Window Storage Storage Storage Storage

  46. B5 Sliding window Last 4 batches Sliding Window B3 B4 B1 B2 Storage Storage Storage Storage Storage Remove the whole batch

More Related