CanTree: a tree structure for efficient incremental mining of frequent patterns. Carson Kai-Sang Leung, Quamrul I. Khan, Tariqul Hoque. ICDM'05. Presenter: 林靜怡, 2006/11/15
Introduction • Many existing incremental mining algorithms are Apriori-based • They are not easily adaptable to FP-tree-based frequent-pattern mining
Related Work • The FELINE Algorithm with the CATS Tree • The AFPIM Algorithm
The FELINE Algorithm with the CATS Tree • CATS tree (Compressed and Arranged Transaction Sequences tree) • Allows frequent-pattern mining without the generation of candidate itemsets • Requires only one database scan to build the tree
CATS Tree • New transactions are added at the root level • At each level, the items of the new transaction are compared with the children (or descendant) nodes • If the same items exist in both: (1) the transaction is merged with the node at the highest frequency level; (2) the remainder of the transaction is then added to the merged nodes • This is repeated recursively until all common items are found
CATS Tree • Any remaining items of the transaction are added as a new branch to the last merged node • The frequency of a node is lower than or equal to the frequencies of its ancestors • If the frequency of a node becomes higher than that of its ancestors, the node has to be swapped with those ancestors
Weaknesses • Tree construction can be computationally expensive • It checks existing tree paths one by one until a mergeable one is found • Extra cost is incurred for swapping and merging nodes
The AFPIM Algorithm • AFPIM: Adjusting FP-tree for Incremental Mining • All the "frequent" items are arranged in descending order of their global frequency • When the ordering changes, items in the tree need to be adjusted • When a previously infrequent item becomes "frequent" in the updated database, the algorithm needs to rescan the database and build a new FP-tree
Example • preMinsup: 35% • minsup: 55% • 4 × 0.35 = 1.4
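The threshold arithmetic in this example can be worked through in a short sketch. It assumes that the 4 is the number of transactions in the example database and that AFPIM treats items between preMinsup and minsup as "pre-frequent"; the item counts and the function name `classify_items` are hypothetical and only serve the illustration.

```python
def classify_items(counts, n_transactions, minsup_pct, pre_minsup_pct):
    """Label each item by comparing its support against both thresholds.

    Percentages are kept as integers so the comparison is exact
    (count / n >= pct / 100  <=>  count * 100 >= n * pct).
    """
    labels = {}
    for item, count in counts.items():
        if count * 100 >= n_transactions * minsup_pct:
            labels[item] = "frequent"
        elif count * 100 >= n_transactions * pre_minsup_pct:
            labels[item] = "pre-frequent"
        else:
            labels[item] = "infrequent"
    return labels

# Hypothetical item counts over 4 transactions (not taken from the slides).
# 4 x 0.35 = 1.4, so a count of 2 passes preMinsup;
# 4 x 0.55 = 2.2, so a count of 3 is needed to pass minsup.
counts = {"a": 3, "b": 2, "c": 1}
classes = classify_items(counts, n_transactions=4, minsup_pct=55, pre_minsup_pct=35)
print(classes)  # {'a': 'frequent', 'b': 'pre-frequent', 'c': 'infrequent'}
```

Under this reading, item "b" is tracked even though it is not yet frequent, which is what makes the choice of preMinsup matter.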
Weaknesses • A large amount of computation is spent on swapping, merging, and splitting tree nodes • An additional mining parameter, preMinsup, is required • Finding an appropriate value for this parameter is not easy
Weaknesses • When the database is updated, item frequencies may change, which changes the item ordering • Both the FELINE and AFPIM algorithms then need a lot of swapping, merging, and splitting of tree nodes
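A small sketch of why a frequency-based ordering is unstable under updates. The transactions and the helper names `frequency_order` and `canonical_order` are made up for this illustration and do not come from the paper.

```python
from collections import Counter

def frequency_order(transactions):
    """Items in descending order of global frequency (ties broken alphabetically)."""
    counts = Counter(item for t in transactions for item in t)
    return sorted(counts, key=lambda item: (-counts[item], item))

def canonical_order(transactions):
    """Items in lexicographic order, independent of their frequencies."""
    return sorted({item for t in transactions for item in t})

original = [["a", "b"], ["a", "c"], ["a", "b"]]   # a=3, b=2, c=1
update = [["c", "d"], ["c"], ["c", "d"]]          # adds c=3, d=2

before = frequency_order(original)
after = frequency_order(original + update)
print(before)  # ['a', 'b', 'c']
print(after)   # ['c', 'a', 'b', 'd']  -> 'c' jumps to the front

# The lexicographic order only gains the new item; existing items keep their positions.
print(canonical_order(original))           # ['a', 'b', 'c']
print(canonical_order(original + update))  # ['a', 'b', 'c', 'd']
```

In a tree whose paths follow the frequency order, the move of "c" to the front is what forces nodes to be swapped or restructured.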
Canonical-Order Tree (CanTree) • Requires only one database scan • Items are arranged according to some canonical order, for example: • lexicographic or alphabetical order • some specific order depending on the item properties
Properties • Property 1: The ordering of items is unaffected by the changes in frequency caused by incremental updates • Property 2: The frequency of a node in the CanTree is at least as high as the sum of the frequencies of its children
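A minimal sketch of a CanTree with lexicographic ordering, assuming each transaction is a set of string items. The class and function names (`Node`, `insert`, `paths`, `satisfies_property_2`) and the example transactions are illustrative choices, not the authors' implementation.

```python
class Node:
    def __init__(self, item=None):
        self.item = item        # None marks the root
        self.count = 0
        self.children = {}      # item -> Node

def insert(root, transaction):
    """Add one transaction by walking a single path in canonical order."""
    node = root
    for item in sorted(set(transaction)):   # canonical order: lexicographic
        if item not in node.children:
            node.children[item] = Node(item)
        node = node.children[item]
        node.count += 1

def paths(node, prefix=()):
    """Yield (path, count) for every node, in canonical order."""
    for item in sorted(node.children):
        child = node.children[item]
        yield prefix + (item,), child.count
        yield from paths(child, prefix + (item,))

def satisfies_property_2(node):
    """Check: each node's count is >= the sum of its children's counts."""
    child_sum = sum(c.count for c in node.children.values())
    if node.item is not None and node.count < child_sum:
        return False
    return all(satisfies_property_2(c) for c in node.children.values())

original = [{"a", "c", "d"}, {"b", "c"}]
update = [{"a", "c"}, {"a", "b", "d"}]

# Incremental build: original database first, then the update.
tree = Node()
for t in original:
    insert(tree, t)
for t in update:
    insert(tree, t)

# Batch build over the whole updated database, for comparison.
batch = Node()
for t in original + update:
    insert(batch, t)

print(list(paths(tree)))
# [(('a',), 3), (('a', 'b'), 1), (('a', 'b', 'd'), 1), (('a', 'c'), 2),
#  (('a', 'c', 'd'), 1), (('b',), 1), (('b', 'c'), 1)]
print(list(paths(tree)) == list(paths(batch)))  # True
print(satisfies_property_2(tree))               # True
```

Because the item order never depends on frequencies, the incremental build touches only the path of each new transaction and ends up identical to the batch build.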
CanTree • Transactions can be easily added to the CanTree without any extensive searches for mergeable paths • Frequent patterns can be mined from the tree in a fashion similar to FP-growth (a divide-and-conquer approach)
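The divide-and-conquer mining step might be sketched as follows. This is a simplification under stated assumptions: the tree is a nested dict, and the projected databases are plain weighted lists rather than the conditional trees that FP-growth builds. The function names and the example database are hypothetical.

```python
from collections import defaultdict

def build(transactions):
    """Nested-dict CanTree: item -> [count, children], items in lexicographic order."""
    root = {}
    for t in transactions:
        level = root
        for item in sorted(set(t)):
            entry = level.setdefault(item, [0, {}])
            entry[0] += 1
            level = entry[1]
    return root

def weighted_paths(level, prefix=()):
    """Yield (path, n), where n transactions end exactly at the last node of path."""
    for item, (count, children) in level.items():
        path = prefix + (item,)
        ending_here = count - sum(c for c, _ in children.values())
        if ending_here > 0:
            yield path, ending_here
        yield from weighted_paths(children, path)

def mine(db, min_count, suffix=()):
    """Pattern growth over db, a list of (ordered item tuple, weight) pairs."""
    support = defaultdict(int)
    for items, weight in db:
        for item in items:
            support[item] += weight
    patterns = {}
    for item, s in support.items():
        if s < min_count:
            continue
        pattern = (item,) + suffix
        patterns[pattern] = s
        # Projected database: the items that precede `item` in canonical order.
        projected = [(items[:items.index(item)], w) for items, w in db if item in items]
        projected = [(p, w) for p, w in projected if p]
        patterns.update(mine(projected, min_count, pattern))
    return patterns

tree = build([{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}])
frequent = mine(list(weighted_paths(tree)), min_count=2)
print(frequent)
# {('a',): 3, ('b',): 3, ('a', 'b'): 2, ('c',): 3, ('a', 'c'): 2, ('b', 'c'): 2}
```

Each frequent item defines its own smaller subproblem (the prefixes of the paths that contain it), which is the divide-and-conquer idea; infrequent items are filtered at mining time, since the tree itself keeps every item.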
g: eg, deg, cdeg, bcdeg, abcdeg • e: de, cde, bcde, abcde, ce, bce, abce, de, bde, abde • f: ef, def, bdef, abdef • d: cd, bcd, abcd, bd, abd • c: bc, abc • b: ab
Discussion • CanTrees can be used for incremental constrained mining • Efficiency and Memory Issues • On the surface, it appears that the CanTree may take a large amount of memory • The CanTree may not be as compact as the CATS tree, but it significantly reduces computation and time • The approach assumes that enough main memory space is available
Experiment • Database: generated by the program developed at IBM Almaden Research Center • Consists of 1M records with an average transaction length of 10 items and a domain of 1000 items • Run in a time-sharing environment on a 1 GHz machine
Conclusion • CanTree provides the user with a simple but powerful tree structure for efficient FP-tree-based incremental mining • CanTree can be easily maintained • It can be used for efficient incremental constrained mining