Mining Frequent Patterns in Data Streams at Multiple Time Granularities

CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and Salah Ahmed Mining Frequent Patterns in Data Streams at Multiple Time Granularities Authors: Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu

Part 1 • Introduction • Problem definition and analysis • FP-Stream

Introduction • Frequent pattern mining has been widely studied and used on static transaction data sets, but it is challenging to extend it to data streams. • Why is it difficult to mine frequent patterns in data streams? — Mining frequent itemsets is essentially a set of join operations.

Problem definition and analysis • Our task is to find the complete set of frequent patterns in a data stream. • Apriori algorithm: count only those itemsets whose every proper subset is frequent. • Problems with using an Apriori-like algorithm on streams — Join is a blocking operator. — Infrequent items can become frequent later on and hence cannot be ignored.
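
The Apriori rule on this slide can be illustrated with a minimal sketch. The function name `passes_apriori_check` and the set-of-frozensets representation are assumptions made for illustration, not part of the paper. Checking only the subsets one item smaller is enough when the frequent set was itself built level by level.

```python
from itertools import combinations
from typing import FrozenSet, Set


def passes_apriori_check(candidate: FrozenSet[str],
                         frequent: Set[FrozenSet[str]]) -> bool:
    """Return True if every (k-1)-item subset of the candidate is frequent."""
    k = len(candidate)
    if k <= 1:
        # Single items have no non-empty proper subset to check.
        return True
    return all(frozenset(sub) in frequent
               for sub in combinations(sorted(candidate), k - 1))


# Hypothetical frequent itemsets found so far: {b, c} is missing.
frequent_so_far = {
    frozenset({"a"}), frozenset({"b"}), frozenset({"c"}),
    frozenset({"a", "b"}), frozenset({"a", "c"}),
}
print(passes_apriori_check(frozenset({"a", "b"}), frequent_so_far))
print(passes_apriori_check(frozenset({"a", "b", "c"}), frequent_so_far))
```

On a stream this check is exactly what becomes problematic: an itemset rejected now may turn frequent in a later batch.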

Definition • The frequency of an itemset I over a time period T is the number of transactions in T in which I occurs. The support of I is its frequency divided by the total number of transactions observed in T. • I is frequent if its support is no less than the minimum support σ. • I is sub-frequent if its support is less than σ but no less than the maximum support error ε. • Otherwise, I is infrequent.
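
The three categories can be written down as a small sketch. The function name `classify` and the example thresholds (σ = 5%, ε = 1%) are hypothetical values chosen for illustration only.

```python
def classify(frequency: int, total: int, sigma: float, epsilon: float) -> str:
    """Label an itemset by its support = frequency / total transactions."""
    if total <= 0:
        raise ValueError("total must be positive")
    support = frequency / total
    if support >= sigma:
        return "frequent"
    if support >= epsilon:
        return "sub-frequent"
    return "infrequent"


# Hypothetical period T with 1000 transactions, sigma = 0.05, epsilon = 0.01.
for freq in (80, 20, 5):
    print(freq, classify(freq, 1000, sigma=0.05, epsilon=0.01))
```

Sub-frequent itemsets are the ones worth tracking even though they are not reported, since they are the candidates most likely to become frequent later.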

FP-Stream • This paper proposes a time-sensitive streaming model, FP-Stream, which includes two major components: • A global frequent pattern tree held in main memory. • Tilted-time windows embedded in this pattern tree.

Part 2 • Mining Time-Sensitive Frequent Patterns in Data Streams • Maintaining Tilted-Time Windows

Natural tilted-time window • People are often interested in recent changes. • Recent changes are depicted at a fine granularity, but long-term changes at a coarse granularity.

Frequent patterns for tilted-time windows • To mine a variety of frequent patterns associated with time more flexibly, a frequent pattern set can be maintained.

Pattern tree • For each tilted-time window, one can register a window-based count for each frequent pattern. • Each node represents a pattern, and its frequency is recorded in the node.

FP-Stream • Frequent patterns usually do not change dramatically over time. • The pattern trees for different tilted-time windows are therefore likely to overlap. • To save space, embed the tilted-time window structure into each node.
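
A minimal sketch of a pattern-tree node with an embedded window table follows. The class name `PatternNode`, its fields, and the helper `insert_pattern` are assumptions for illustration. The table here is a plain list with the newest window first; merging windows into coarser granularities is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PatternNode:
    item: str
    # window_counts[0] is the most recent tilted-time window.
    window_counts: List[int] = field(default_factory=list)
    children: Dict[str, "PatternNode"] = field(default_factory=dict)

    def child(self, item: str) -> "PatternNode":
        """Return the child for `item`, creating it if needed."""
        if item not in self.children:
            self.children[item] = PatternNode(item)
        return self.children[item]


def insert_pattern(root: PatternNode, pattern: List[str], count: int) -> PatternNode:
    """Walk the path for `pattern` and record `count` as its newest window."""
    node = root
    for item in pattern:
        node = node.child(item)
    node.window_counts.insert(0, count)
    return node


root = PatternNode("root")
insert_pattern(root, ["a", "b"], 3)   # older window for pattern {a, b}
insert_pattern(root, ["a"], 5)
insert_pattern(root, ["a", "b"], 4)   # newer window for pattern {a, b}
print(root.children["a"].children["b"].window_counts)
```

One tree with a table per node avoids storing the shared prefix structure once per time window.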

Maintaining Tilted-Time Windows • As new data arrives, the tilted-time window table in each node keeps growing. • To keep the table compact, a tilted-time window maintenance mechanism is needed.

Logarithmic Tilted-time Window • In the natural tilted-time window, at most 59 (4+24+31) tilted windows need to be maintained for a period of one month. • We can reduce the number of tilted-time windows by using a logarithmic tilted-time window schema. • According to the logarithmic tilted-time window model, with one year of data and the finest precision at a quarter of an hour, it needs only about 17 units of time (log2(365 × 24 × 4) + 1 ≈ 16.1) instead of 365 × 24 × 4 = 35,040 units.
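
The window counts on this slide can be checked with a few lines of arithmetic. This is a sketch under the assumption that "quarter" means a quarter of an hour and that a year has 365 days; the variable names are illustrative.

```python
import math

# Natural tilted-time window for one month: 4 quarters + 24 hours + 31 days.
natural_windows = 4 + 24 + 31

# Number of quarter-hour units in a 365-day year.
quarters_per_year = 365 * 24 * 4

# Logarithmic scheme with ratio 2: about log2(units) + 1 windows, rounded up.
log_window_count = math.ceil(math.log2(quarters_per_year) + 1)

print(natural_windows, quarters_per_year, log_window_count)
```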

Logarithmic Tilted-time Window • Break the stream of transactions into fixed-size batches B1, B2, B3, …, Bn, … • Bn is the most current batch; B1 is the oldest. • For i ≥ j, let B(i, j) denote the union of batches Bj through Bi, i.e. $\bigcup_{k=j}^{i} B_k$. • Let fI(i, j) denote the frequency of I in B(i, j). • Frequencies kept for itemset I with ratio 2 (the growth rate of window size): f(n, n); f(n−1, n−1); f(n−2, n−3); f(n−4, n−7); … • Intermediate buffer windows are also maintained.
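
To make the ratio-2 layout concrete, the sketch below lists which batches each window covers for a given n. The function name `log_window_ranges` is hypothetical, and the sketch shows only the idealized layout: it ignores the intermediate buffer windows, so it matches the real table exactly only at moments when all buffers are empty.

```python
from typing import List, Tuple


def log_window_ranges(n: int) -> List[Tuple[int, int]]:
    """Return (i, j) pairs, newest first, where each window covers batches j..i."""
    if n < 1:
        return []
    windows = [(n, n)]          # level 0: the current batch
    size, newest = 1, n - 1     # level 1 also has size 1, then sizes double
    while newest >= 1:
        oldest = max(newest - size + 1, 1)
        windows.append((newest, oldest))
        newest = oldest - 1
        size *= 2
    return windows


print(log_window_ranges(8))
```

For n = 8 the ranges are B(8, 8), B(7, 7), B(6, 5) and B(4, 1), so eight batches are summarized by four frequencies.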

Logarithmic Tilted-time Window Updating • Given a new batch of transactions B: • Replace f(n, n) at level 0 with f(B). • Shift f(n, n) back to the next finest level of time (level 1), where it replaces f(n-1, n-1). • Check the status of the intermediate window for level 1: • Not full: place f(n-1, n-1) in the intermediate window and stop the algorithm. • Full: f(n-1, n-1) + f(intermediate window) is shifted back to level 2. • Continue this process until shifting stops.
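
The update steps above can be sketched for a single itemset as follows. The class name `LogTiltedWindow` and its two lists (`main` for the per-level frequencies, `buffer` for the intermediate windows) are assumptions for illustration; level 0 is treated as having no intermediate window, and a new level is opened when a value is shifted past the last one.

```python
from typing import List, Optional


class LogTiltedWindow:
    """Per-itemset frequency table with one intermediate buffer per level."""

    def __init__(self) -> None:
        self.main: List[int] = []                  # main[0] is the newest batch
        self.buffer: List[Optional[int]] = []      # intermediate windows

    def add_batch(self, freq: int) -> None:
        carry, level = freq, 0
        while True:
            if level == len(self.main):
                # No window at this level yet: open one and stop.
                self.main.append(carry)
                self.buffer.append(None)
                return
            displaced = self.main[level]
            self.main[level] = carry
            if level == 0:
                # Level 0 always shifts its old value back to level 1.
                carry = displaced
            elif self.buffer[level] is None:
                # Intermediate window not full: park the value and stop.
                self.buffer[level] = displaced
                return
            else:
                # Intermediate window full: merge and shift to the next level.
                carry = displaced + self.buffer[level]
                self.buffer[level] = None
            level += 1


table = LogTiltedWindow()
for batch_frequency in range(1, 9):   # hypothetical frequencies 1..8
    table.add_batch(batch_frequency)
print(table.main, table.buffer)
```

Because each level absorbs two values before passing one on, most updates touch only the first one or two levels, which keeps the amortized cost per batch small.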

Logarithmic Tilted-time Window Updating…Example

Part 3 • Tail Pruning • Type I Pruning • Type II Pruning • Algorithm

Tail Pruning • Let t0, …, tn be the tilted-time windows, where tn is the oldest. • wi is the window size of ti. • Drop tail sequences when the following condition holds,

Type I and Type II Pruning • Type I Pruning: • If I is found in B but is not in the FP-stream structure, no superset of I is in the structure. • Hence, if fI(B) < ε|B|, then none of the supersets need to be examined. • Type II Pruning: • If all of I's tilted-time window table entries are pruned (and I is dropped), then any superset will also be dropped.

An Algorithm • FP-streaming: incremental update of the FP-stream structure with incoming stream data • 1. Initialize the FP-tree to empty. • 2. Sort each incoming transaction t according to f_list, and then insert it into the FP-tree without pruning any items. • 3. When all the transactions in Bi are accumulated, update the FP-stream as follows: • Mine itemsets out of the FP-tree using the FP-growth algorithm. • Scan the FP-stream structure.
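
The per-batch update in step 3 can be sketched in a heavily simplified form. Several things here are assumptions and not the paper's method: brute-force enumeration of small itemsets stands in for FP-growth on an FP-tree, a flat dictionary stands in for the pattern tree, the window table is a plain list with one entry per batch (no logarithmic merging), tail pruning is omitted, and the admission threshold ε|B| for new patterns is chosen to match the definition of infrequent itemsets. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence

Pattern = FrozenSet[str]


def count_itemsets(batch: Sequence[Sequence[str]], max_len: int = 3) -> Counter:
    """Brute-force itemset counts for one batch (stand-in for FP-growth)."""
    counts: Counter = Counter()
    for transaction in batch:
        items = sorted(set(transaction))
        for k in range(1, min(max_len, len(items)) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return counts


def update_structure(structure: Dict[Pattern, List[int]],
                     batch: Sequence[Sequence[str]],
                     epsilon: float) -> None:
    """Fold one batch into the structure, one table entry per batch."""
    counts = count_itemsets(batch)
    threshold = epsilon * len(batch)
    for pattern, freq in counts.items():
        if pattern in structure:
            structure[pattern].append(freq)
        elif freq >= threshold:
            # New pattern that is at least sub-frequent in this batch.
            structure[pattern] = [freq]
    # Scan the structure: tracked patterns absent from this batch get a zero.
    for pattern, table in structure.items():
        if pattern not in counts:
            table.append(0)


structure: Dict[Pattern, List[int]] = {}
update_structure(structure, [["a", "b"], ["a", "b"], ["a", "c"], ["b"]], epsilon=0.5)
update_structure(structure, [["c"], ["c"], ["a"], ["c", "a"]], epsilon=0.5)
print({tuple(sorted(p)): t for p, t in structure.items()})
```

In this toy run the pattern {a, c} never reaches the threshold in any single batch and is never admitted, which is the effect Type I pruning relies on.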

Part 4 • Experimental Set-Up • Experimental Results • Discussion

Experimental Set-Up • Experiments are performed using: • Sun UltraSPARC-IIi processors, 512 MB RAM • Dataset generation: • 3 million transactions • 1k distinct items • Streams are broken into batches of 50k transactions • Every 5 batches, 200 random permutations are applied to the items

FP-stream time requirements • Item permutations cause the running time to jump every 5 batches. • Stability is regained quickly. • Required time increases as the average itemset length increases.

FP-stream space requirements • The overall space requirements are very attractive in all cases: less than 3 MB.

FP-stream average itemset length • The average itemset length does not increase with the increase of average transaction length • This result was also verified by Apriori running on 50k transactions.

FP-stream total number of itemsets • The total number of itemsets increases with the increase of average transaction length. • This result was also verified by Apriori running on 50k transactions.

Discussion • Further compression is possible. • If the support is stable for many entries, the table can be compressed. • If the tilted-time windows of a parent node and its child node are the same, only one tilted-time window needs to be maintained. • Mining time-sensitive frequent patterns is a very nice idea. • Mining and maintaining frequent patterns becomes realistic even with limited main memory.

Feedback Comments and Questions

Thank You
