Online Mining (Recently) Maximal Frequent Itemsets over Data Streams
Hua-Fu Li, Suh-Yin Lee, Man Kwan Shan
RIDE-SDMA '05
Speaker: 董原賓
Advisor: 柯佳伶
Introduction
• Difficulties of data stream mining
  • Huge
  • High speed
  • Continuous
• Solution: one-pass algorithm
  • Summary data structure
  • Mines the maximal frequent itemsets
Definition
[Figure: timeline of basic windows, W1 = {abc, bcd, acd}, W2 = {cd, abd, bc}, …, WN = {a, b, cd}]
• Ψ = {i1, i2, …, in}: a set of items
• Wi: basic window i
• Data stream = [W1, W2, …, WN): an infinite sequence of basic windows
• N: the window identifier of the latest basic window
• Current length of data stream (CL) = |W1| + |W2| + … + |WN|
• In the figure, every basic window holds 3 transactions, so CL = 3 × N
Definition
• X.tsup: true support of itemset X
• X.esup: estimated support of itemset X, 1 ≤ X.esup ≤ X.tsup
• X.CL = |Wj| + |Wj+1| + … + |WN|
• Wj: the first window containing X in the summary data structure
• s: minimum support
• ε: maximum support error threshold
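The two length definitions can be checked with a small sketch. It assumes each basic window is held as a plain Python list of transactions; the function names `current_length` and `itemset_length` are hypothetical and not from the paper. The window contents are W1 and W2 from the definition figure.

```python
def current_length(windows):
    """CL = |W1| + |W2| + ... + |WN| for a list of basic windows."""
    return sum(len(w) for w in windows)


def itemset_length(windows, first_window_id):
    """X.CL = |Wj| + ... + |WN|, where j (1-based) is the first window
    in which X was recorded in the summary data structure."""
    return sum(len(w) for w in windows[first_window_id - 1:])


windows = [
    ["abc", "bcd", "acd"],  # W1
    ["cd", "abd", "bc"],    # W2
]

cl = current_length(windows)        # 3 + 3 = 6
x_cl = itemset_length(windows, 2)   # X first recorded in W2, so only |W2| = 3
```

An itemset that enters the summary late has a smaller X.CL than the stream's CL, which is why the support thresholds are stated relative to X.CL.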
Data Stream Mining for Maximal Frequent Itemsets (DSM-MFI)
• Step 1: reads a window of transactions
• Step 2: constructs and maintains the summary data structure
• Step 3: prunes the infrequent information
• Step 4: searches for the maximal frequent itemsets
Summary Frequent Itemsets forest (SFI-forest)
• Composed of an FI-list and a set of SFI-trees
• SFI-tree node fields
  • item-id: the item identifier
  • esup: the number of transactions reaching the node with the item-id
  • window-id: the identifier of the current basic window, assigned when the node is created
  • node-link: links to the next node with the same item-id in the same SFI-tree
Summary Frequent Itemsets forest (SFI-forest)
• FI-list entry fields
  • item-id: the item identifier
  • esup: the number of transactions containing the item
  • window-id: the identifier of the current basic window, assigned when the entry is created
  • head link: links to the root node of the item-id.SFI-tree
Summary Frequent Itemsets forest (SFI-forest)
• Each SFI-tree has a specific opposite frequent item list (OFI-list)
• OFI-list
  • Entry fields: (item-id, esup, window-id, head link)
  • The head link points to the first node carrying the item-id in the SFI-tree
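The record layouts listed on these slides could be sketched as Python dataclasses. This is only an illustration of the fields: the class names `SFINode`, `OFIEntry`, and `FIEntry`, the `children` map, and the choice to hang each OFI-list off its FI-list entry are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SFINode:
    """One node of an SFI-tree."""
    item_id: str
    esup: int                      # transactions reaching this node
    window_id: int                 # basic window in which the node was created
    children: Dict[str, "SFINode"] = field(default_factory=dict)  # assumed child map
    node_link: Optional["SFINode"] = None  # next node with the same item-id


@dataclass
class OFIEntry:
    """One entry of an OFI-list: (item-id, esup, window-id, head link)."""
    item_id: str
    esup: int
    window_id: int
    head_link: Optional[SFINode] = None  # first node carrying item_id in the tree


@dataclass
class FIEntry:
    """One entry of the FI-list, with its SFI-tree and OFI-list attached."""
    item_id: str
    esup: int
    window_id: int
    tree_root: Optional[SFINode] = None  # head link to item-id.SFI-tree
    ofi_list: Dict[str, OFIEntry] = field(default_factory=dict)


# Hypothetical usage: item a seen once in window 1, followed by item b.
fi_list: Dict[str, FIEntry] = {}
root = SFINode(item_id="a", esup=1, window_id=1)
root.children["b"] = SFINode(item_id="b", esup=1, window_id=1)
fi_list["a"] = FIEntry("a", 1, 1, tree_root=root)
fi_list["a"].ofi_list["b"] = OFIEntry("b", 1, 1, head_link=root.children["b"])
```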
Example: W1 = {abc, bcd, acd}, T = abc
• Transaction projection (T): abc, bc, c
• SFI-tree-maintenance (abc), SFI-tree-maintenance (bc), SFI-tree-maintenance (c)
[Figure: FI-list entries (item-id, esup, window-id, node link) for X = a, b, c, with the a.SFI-tree, b.SFI-tree, and c.SFI-tree and their OFI-lists after inserting T = abc]
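The projection step in this example can be sketched as follows. The sketch assumes items are ordered lexicographically and stores each tree as nested dictionaries; node-links and OFI-lists are left out, so it is a simplification of SFI-tree maintenance rather than the authors' procedure. The names `project_transaction` and `insert_projection` are hypothetical.

```python
def project_transaction(transaction):
    """Return the item-suffix projections of a transaction, e.g. abc -> abc, bc, c."""
    items = sorted(transaction)
    return ["".join(items[i:]) for i in range(len(items))]


def insert_projection(forest, projection, window_id):
    """Insert one projection into the tree rooted at its first item,
    incrementing esup on every node along the path."""
    level = forest
    for item in projection:
        node = level.setdefault(
            item, {"esup": 0, "window_id": window_id, "children": {}}
        )
        node["esup"] += 1
        level = node["children"]


forest = {}
for t in ["abc", "bcd", "acd"]:          # basic window W1 from the slide
    for p in project_transaction(t):
        insert_projection(forest, p, window_id=1)

# Root esup values after W1: a = 2, b = 2, c = 3, d = 2
```

Each root's esup equals the number of transactions in W1 that contain that item, because every transaction contributes exactly one projection starting at each of its items.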
Example: W1 = {abc, bcd, acd}, T = bcd
• Transaction projection (T): bcd, cd, d
• SFI-tree-maintenance (bcd), SFI-tree-maintenance (cd), SFI-tree-maintenance (d)
[Figure: FI-list entries (item-id, esup, window-id, node link) for X = b, c, d, with the a.SFI-tree, b.SFI-tree, c.SFI-tree, and d.SFI-tree and their OFI-lists after inserting T = bcd]
Example: W1 = {abc, bcd, acd}, T = acd
• Transaction projection (T): acd, cd, d
• SFI-tree-maintenance (acd)
[Figure: FI-list entries (item-id, esup, window-id, node link) for X = a, c, d, with the a.SFI-tree, b.SFI-tree, c.SFI-tree, and d.SFI-tree and the a.OFI-list after inserting T = acd]
Pruning infrequent items from SFI-forest
• X: a 1-itemset in the FI-list
• If X.esup < X.CL × ε, then X and its supersets are deleted from the SFI-forest
• Steps
  • 1. Delete the item-id.OFI-list, the item-id.SFI-tree, and the entry with item-id from the FI-list
  • 2. Remove the infrequent item from the other OFI-lists by traversing the FI-list
Pruning infrequent items from SFI-forest
• Steps (continued)
  • 3. Delete the infrequent item from the other SFI-trees
  • 4. Reconstruct the SFI-trees by reinserting the modified item-suffix transactions or by joining the remaining subtrees into the SFI-tree
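The pruning rule and its four steps can be illustrated on a flattened representation. This sketch assumes the forest is kept as a dictionary from each head item to its list of item-suffix transactions, takes esup to be the number of those transactions, and uses a single CL for all items; the function name `prune_infrequent` and the sample data are hypothetical.

```python
def prune_infrequent(projections, cl, epsilon):
    """Apply the rule esup < CL * epsilon on a flattened forest.

    projections maps a head item to the item-suffix transactions that
    start with it; esup of the head item is the length of that list.
    """
    pruned = {
        head for head, plist in projections.items()
        if len(plist) < cl * epsilon
    }
    kept = {}
    for head, plist in projections.items():
        if head in pruned:
            continue  # step 1: drop the item's own tree, OFI-list, FI-list entry
        # steps 2-4: remove pruned items elsewhere and keep the modified
        # item-suffix transactions for reinsertion
        kept[head] = ["".join(i for i in p if i not in pruned) for p in plist]
    return pruned, kept


# Illustrative data: esup is a = 3, b = 2, c = 3, d = 3; threshold 12 * 0.2 = 2.4
projections = {
    "a": ["abc", "acd", "ad"],
    "b": ["bc", "bcd"],
    "c": ["c", "cd", "cd"],
    "d": ["d", "d", "d"],
}
pruned, kept = prune_infrequent(projections, cl=12, epsilon=0.2)
```

With these numbers only item b falls below the threshold, so its tree disappears and b is stripped from the transactions under a.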
Example: a.CL = b.CL = c.CL = d.CL = 12, s = 0.3, ε = 0.2
• Pruning threshold: X.CL × ε = 12 × 0.2 = 2.4
[Figure: FI-list entries (1,1,3), (2,1,2), (3,1,3), (4,1,3) with the a.SFI-tree, b.SFI-tree, c.SFI-tree, and d.SFI-tree and the a, b, c, and d OFI-lists]
Determining maximal frequent itemsets
• There are k frequent 1-itemsets, e1, e2, …, ek, in the FI-list
• o1, o2, …, oj: the items in the ei.OFI-list
• Generates a candidate maximal frequent (j+1)-itemset, E = (ei, o1, o2, …, oj)
  • Starts from the frequent item with the smallest estimated support
  • Traverses the path via the node link to count E's estimated support
Determining maximal frequent itemsets
• If E.esup ≥ s × ei.CL, then E is an MFI
• Otherwise, E is enumerated into its sub-itemsets of size |E| − 1
• This repeats until the set of all maximal frequent itemsets with respect to entry ei is found
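A top-down search of this kind can be sketched as below. The sketch counts support by scanning a plain transaction list instead of following node links in an SFI-tree, and it walks candidate sizes from largest to smallest while skipping subsets of itemsets already found, which gives the same maximal sets for one entry. The names `support` and `maximal_frequent_for_entry` are hypothetical, and the transactions are W1 and W2 from the definition figure.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))


def maximal_frequent_for_entry(e, others, transactions, min_count):
    """Top-down search for maximal frequent itemsets containing item e.

    others plays the role of the items in e's OFI-list.
    """
    found = []
    for size in range(len(others), -1, -1):
        for combo in combinations(others, size):
            candidate = frozenset((e,) + combo)
            if any(candidate < m for m in found):
                continue  # proper subset of a found MFI, so not maximal
            if support(candidate, transactions) >= min_count:
                found.append(candidate)
    return found


transactions = ["abc", "bcd", "acd", "cd", "abd", "bc"]
min_count = 0.3 * len(transactions)   # s = 0.3, CL = 6
mfis = maximal_frequent_for_entry("b", ["c", "d"], transactions, min_count)
```

Here the full candidate bcd has support 1 and fails the threshold, so the search drops to size 2 and accepts bc (support 3) and bd (support 2); the single item b is skipped because it is covered by both.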
Example: a.CL = b.CL = c.CL = d.CL = 5, s = 0.3, ε = 0.2
• Minimum support count: s × X.CL = 5 × 0.3 = 1.5
• Calculate support (bcd) = 1
• Calculate support (bc)
[Figure: FI-list entries (1,1,3), (2,1,2), (3,1,3), (4,1,3) with the a.SFI-tree, b.SFI-tree, c.SFI-tree, and d.SFI-tree and the a, b, c, and d OFI-lists]
Sliding Window Mining over Data Streams
• Modifications
  • Uses the DSM-MFI algorithm to construct an SFI-forest i for each basic window Wi
  • Finds the local maximal frequent itemsets (local MFIi); all local MFIs are stored in a queue
  • The global MFI-list stores all local MFIs from W1 to WN
Sliding Window Mining over Data Streams
• When basic window N+1 arrives
  • Removes the local MFI1 from the queue
  • Subtracts the support of the local MFI1 from the global MFI-list
  • Uses the DSM-MFI algorithm to mine all local maximal frequent itemsets of WN+1
  • Increases the support in the global MFI-list or inserts the local MFIN+1 into it
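The queue-and-global-list bookkeeping described above might look like the following sketch. It assumes the local MFIs of each basic window are already mined and passed in as a mapping from itemset to support; the class name `SlidingMFI`, the `max_windows` parameter, and the choice to drop itemsets whose support reaches zero are assumptions for illustration.

```python
from collections import Counter, deque


class SlidingMFI:
    """Keeps a queue of local MFI results and a global MFI-list of supports."""

    def __init__(self, max_windows):
        self.max_windows = max_windows
        self.queue = deque()          # local MFI results, oldest first
        self.global_mfi = Counter()   # itemset -> accumulated support

    def add_window(self, local_mfi):
        """local_mfi maps each local maximal frequent itemset to its support."""
        if len(self.queue) == self.max_windows:
            oldest = self.queue.popleft()        # remove the oldest local MFI
            self.global_mfi.subtract(oldest)     # subtract its support
            for itemset in [k for k, v in self.global_mfi.items() if v <= 0]:
                del self.global_mfi[itemset]     # assumed cleanup of empty entries
        self.queue.append(dict(local_mfi))
        self.global_mfi.update(local_mfi)        # increase support or insert


bc, cd = frozenset("bc"), frozenset("cd")
sw = SlidingMFI(max_windows=2)
sw.add_window({bc: 2})
sw.add_window({bc: 1, cd: 2})
sw.add_window({cd: 1})   # the first window's {bc: 2} slides out
```

After the third call the window covers only the last two results, so bc keeps support 1 and cd accumulates support 3.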
Experiment
• 1 GHz IBM x24, 384 MB, Visual C++ 6.0
• s = 0.1%, ε = 0.01%
• IBM synthetic datasets
  • T10.I5.D1000K
  • T30.I20.D1000K
• The data is broken into 20 basic windows to simulate the streaming data