COMP5331

COMP5331 Data Stream Prepared by Raymond Wong Presented by Raymond Wong raywong@cse

Data Mining over Static Data • Association • Clustering • Classification Output (Data Mining Results) Static Data

Data Mining over Data Streams • Association • Clustering • Classification Output (Data Mining Results) … Unbounded Data Real-time Processing

1 2 … Less recent More recent Data Streams Each point: a transaction

Data Streams

1 2 … Less recent More recent Entire Data Streams Each point: a transaction Obtain the data mining results from all data points read so far

1 2 … Less recent More recent Entire Data Streams Each point: a transaction Obtain the data mining results over a sliding window

Data Streams • Entire Data Streams • Data Streams with Sliding Window

Entire Data Streams Frequent pattern/item • Association • Clustering • Classification

1 2 … Less recent More recent Frequent Item over Data Streams • Let N be the length of the data streams • Let s be the support threshold (in fraction) (e.g., 20%) • Problem: We want to find all items with frequency >= sN Each point: a transaction

Data Streams

Data Streams • Frequent item • I1 • Infrequent item • I2 • I3 Output (Data Mining Results) Static Data Output (Data Mining Results) … Unbounded Data • Frequent item • I1 • I3 • Infrequent item • I2

False Positive/Negative • False Positive • The item is classified as frequent item • In fact, the item is infrequent • E.g. • Expected Output • Frequent item • I1 • Infrequent item • I2 • I3 • Algorithm Output • Frequent item • I1 • I3 • Infrequent item • I2 Which item is one of the false positives? I3 More? No. No. of false positives = 1 If we say:The algorithm has no false positives. All true infrequent items are classified as infrequent items in the algorithm output.

False Positive/Negative • False Negative • The item is classified as infrequent item • In fact, the item is frequent • E.g. • Expected Output • Frequent item • I1 • I3 • Infrequent item • I2 • Algorithm Output • Frequent item • I1 • Infrequent item • I2 • I3 Which item is one of the false negatives? I3 More? No. No. of false negatives = 1 No. of false positives = 0 If we say:The algorithm has no false negatives. All true frequent items are classified as frequent items in the algorithm output.

Data Streams We need to introduce an input error parameter 

Data Streams • Frequent item • I1 • Infrequent item • I2 • I3 Output (Data Mining Results) Static Data Output (Data Mining Results) … Unbounded Data • Frequent item • I1 • I3 • Infrequent item • I2

N: total no. of occurrences of items Data Streams • Store the statistics of all items • I1: 10 • I2: 8 • I3: 12 N = 20 = 0.2 N = 4 Output (Data Mining Results) Static Data D <= N ? Diff. D 0 Yes 4 Yes 2 Yes Output (Data Mining Results) … Unbounded Data • Estimate the statistics of all items • I1: 10 • I2: 4 • I3: 10

-deficient synopsis • Let N be the current length of the stream(or total no. of occurrences of items) • Let  be an input parameter (a real number from 0 to 1) • An algorithm maintains an -deficient synopsis if its output satisfies the following properties • Condition 1: There is no false negative. All true frequent items are classified as frequent items in the algorithm output. • Condition 2: The difference between the estimated frequency and the true frequency is at most N. • Condition 3: All items whose true frequencies less than (s-)N are classified as infrequent items in the algorithm output

Frequent Pattern Mining over Entire Data Streams • Algorithm • Sticky Sampling Algorithm • Lossy Counting Algorithm • Space-Saving Algorithm

s Statistics of items  Output  … Unbounded Data Sticky Sampling Algorithm Support threshold Stored in the memory Sticky Sampling Error parameter Confidence parameter Frequent items Infrequent items

Sticky Sampling Algorithm • The sampling rate r varies over the lifetime of a stream • Confidence parameter  (a small real number) • Let t = 1/ ln(s-1-1)

Sticky Sampling Algorithm • e.g. s = 0.02 = 0.01 • = 0.1 t = 622 • The sampling rate r varies over the lifetime of a stream • Confidence parameter  (a small real number) • Let t = 1/ ln(s-1-1) 1~1244 1 ~ 2*622 1245~2488 2*622+1 ~ 4*622 4*622+1 ~ 8*622 2489~4976

Sticky Sampling Algorithm • e.g. s = 0.5 = 0.35 • = 0.5 t = 4 • The sampling rate r varies over the lifetime of a stream • Confidence parameter  (a small real number) • Let t = 1/ ln(s-1-1) 1~8 1 ~ 2*4 9~16 2*4+1 ~ 4*4 4*4+1 ~ 8*4 17~32

Sticky Sampling Algorithm element Estimated frequency • S: empty list will contain (e, f) • When data e arrives, • if e exists in S, increment f in (e, f) • if e does not exist in S, add entry (e, 1) with prob. 1/r (where r: sampling rate) • Just after r changes, • For each entry (e, f), • Repeatedly toss a coin with P(head) = 1/r until the outcome of the coin toss is head • If the outcome of the toss is tail, • Decrement f in (e, f) • If f = 0, delete the entry (e, f) • [Output] Get a list of items where f + N >= sN

Analysis • -deficient synopsis • Sticky Sampling computes an -deficient synopsis with probability at least 1- • Memory Consumption • Sticky Sampling occupies at most 2/ ln(s-1-1)  entries on average

s Statistics of items  Output … Unbounded Data Lossy Counting Algorithm Support threshold Stored in the memory Lossy Counting Error parameter Frequent items Infrequent items

1 2 … Bucket bcurrent Bucket 3 Bucket 2 Bucket 1 Width w = bcurrent = Less recent More recent Lossy Counting Algorithm N: current length of stream Each point: a transaction …

Lossy Counting Algorithm element Frequency of element since this entry was inserted into D • D: Empty set • Will contain (e, f, ) Max. possible error in f • When data e arrives, • If e exists in D, • Increment f in (e, f, ) • If e does not exist in D, • Add entry (e, 1, bcurrent-1) Remove some entries in D whenever N  0 mod w(i.e., whenever it reaches the bucket boundary)The rule of deletion is: (e, f, ) is deleted if f +  <= bcurrent [Output] Get a list of items wheref + N >= sN

Lossy Counting Algorithm • -deficient synopsis • Lossy Counting computes an -deficient synopsis • Memory Consumption • Lossy Counting occupies at most 1/ log(N)  entries.

Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1000 Memory = 1243 Memory = 231

Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1,000,000 Memory = 1243 Memory = 922

Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1,000,000,000 Memory = 1243 Memory = 1612

s Statistics of items  Output  … Unbounded Data Sticky Sampling Algorithm Support threshold Stored in the memory Sticky Sampling Error parameter Confidence parameter Frequent items Infrequent items

s Statistics of items  Output … Unbounded Data Lossy Counting Algorithm Support threshold Stored in the memory Lossy Counting Error parameter Frequent items Infrequent items

s Statistics of items M Output … Unbounded Data Space-Saving Algorithm Support threshold Stored in the memory Space-Saving Memory parameter Frequent items Infrequent items

Space-Saving • M: the greatest number of possible entries stored in the memory

Space-Saving Frequency of element since this entry was inserted into D element • D: Empty set • Will contain (e, f, ) Max. possible error in f pe = 0 • When data e arrives, • If e exists in D, • Increment f in (e, f, ) • If e does not exist in D, • If the size of D = M • pe mineD {f + } • Remove all entries e where f +   pe • Add entry (e, 1, pe) [Output] Get a list of items where f +  >= sN

Space-Saving • Greatest Error • Let E be the greatest error in any estimated frequency. E  1/M • -deficient synopsis • Space-Saving computes an -deficient synopsis if E  

Comparison e.g. s = 0.02 = 0.01 = 0.1 N = 1,000,000,000 Memory = 1243 Memory = 1612 Memory can be very large (e.g., 4,000,000) Since E <= 1/M  the error is very small

Data Streams • Entire Data Streams • Data Streams with Sliding Window

Data Streams with Sliding Window Frequent pattern/itemset • Association • Clustering • Classification

t1 t2 … Sliding window Sliding Window • Mining Frequent Itemsets in a sliding window • E.g. t1: I1 I2 t2: I1 I3 I4 … • To find frequent itemsets in a sliding window

B3 B4 B1 B2 Sliding window Last 4 batches Sliding Window Storage Storage Storage Storage

B5 Sliding window Last 4 batches Sliding Window B3 B4 B1 B2 Storage Storage Storage Storage Storage Remove the whole batch

COMP5331

COMP5331

Presentation Transcript

COMP5331

COMP5331

COMP5331

COMP5331

COMP5331: Knowledge Discovery and Data Mining

COMP5331

COMP5331