COBRA: Closed Sequential Pattern Mining Using Bi-phase Reduction Approach
Kuo-Yu Hung, Chia-Hui Chang, Jiun-Hung Tung, Cheng-Tao Ho
DaWaK 2006
Outline
• Introduction
• Problem Definition
• COBRA algorithm
• Pruning strategies
• Design and implementation
• Experimental result
• Conclusion
Introduction
• CloSpan and BIDE adopt the framework of PrefixSpan, growing patterns by itemset extension and sequence extension: the last transaction of the current sequence is extended with a frequent item in the same transaction or in a different transaction
• Drawbacks: duplicate item extensions and expensive matching cost
Problem Definition
• Absorb: if α is a super-sequence of β and their supports are the same, then α absorbs β
• Closed sequential pattern: a sequential pattern β is closed if there exists no proper super-sequence α that absorbs β
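The two definitions above can be sketched in a few lines of Python. This is only an illustration: the helper names (`seq`, `is_subsequence`, `absorbs`, `closed_patterns`) and the toy supports are assumptions, not part of the slides.

```python
from typing import Dict, FrozenSet, Tuple

Seq = Tuple[FrozenSet[str], ...]  # a sequence is an ordered tuple of itemsets


def seq(*itemsets: str) -> Seq:
    """Hypothetical helper: seq("A", "BC") builds the sequence <(A)(B,C)>."""
    return tuple(frozenset(s) for s in itemsets)


def is_subsequence(beta: Seq, alpha: Seq) -> bool:
    """True if each itemset of beta is contained, in order, in a distinct itemset of alpha."""
    pos = 0
    for itemset in beta:
        # Greedily advance to the first itemset of alpha that contains this one.
        while pos < len(alpha) and not itemset <= alpha[pos]:
            pos += 1
        if pos == len(alpha):
            return False
        pos += 1
    return True


def absorbs(alpha: Seq, sup_alpha: int, beta: Seq, sup_beta: int) -> bool:
    """alpha absorbs beta: proper super-sequence with the same support."""
    return alpha != beta and sup_alpha == sup_beta and is_subsequence(beta, alpha)


def closed_patterns(patterns: Dict[Seq, int]) -> Dict[Seq, int]:
    """Keep only the patterns that no other pattern absorbs."""
    return {
        b: sb
        for b, sb in patterns.items()
        if not any(absorbs(a, sa, b, sb) for a, sa in patterns.items())
    }


# Toy supports, invented for illustration only.
patterns = {seq("A"): 3, seq("A", "B"): 3, seq("B"): 4}
closed = closed_patterns(patterns)
```

Here `<(A)>` is absorbed by `<(A)(B)>` because both have support 3, while `<(B)>` survives because its support differs.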
Problem Definition (cont.)
• minsup = 3
• Sequence support: all subsets of {A,B,C} have sequence support 4
• Transaction support: itemset {B} = 8 = itemset {B,C}; itemsets {A,B}, {A,C} = 5 = itemset {A,B,C}
• {A}, {C}, {B,C}, {A,B,C} are the frequent closed itemsets
• A closed sequential pattern is composed of only closed itemsets
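A brute-force sketch of transaction support and closed frequent itemsets follows. The database table from the slide is not in the text, so the 14 transactions below are hypothetical, chosen only so that the supports quoted above ({B} = {B,C} = 8, {A,B} = {A,C} = {A,B,C} = 5) come out the same; the supports of {A} and {C} are therefore invented. The enumeration is exhaustive for clarity and is not the CHARM algorithm.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def transaction_support(itemset: Itemset, transactions: List[Itemset]) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions: List[Itemset], minsup: int) -> Dict[Itemset, int]:
    """Enumerate all frequent itemsets, then keep those with no equal-support proper superset."""
    items = sorted(set().union(*transactions))
    frequent: Dict[Itemset, int] = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            sup = transaction_support(candidate, transactions)
            if sup >= minsup:
                frequent[candidate] = sup
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < t and sup == tsup for t, tsup in frequent.items())
    }


# Hypothetical transactions (assumption): 5 x {A,B,C}, 3 x {B,C}, 3 x {C}, 3 x {A}.
transactions = (
    [frozenset("ABC")] * 5 + [frozenset("BC")] * 3
    + [frozenset("C")] * 3 + [frozenset("A")] * 3
)
cfi = closed_frequent_itemsets(transactions, minsup=3)
```

With this toy data, `cfi` contains exactly {A}, {C}, {B,C} and {A,B,C}: {B} is dropped because {B,C} has the same support, and {A,B} and {A,C} are dropped because {A,B,C} does.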
3 major phases of the COBRA algorithm
• Phase 1: mining closed frequent itemsets using CHARM
• Phase 2: database encoding (vertical-based and horizontal-based)
• Phase 3: mining closed sequential patterns
The COBRA algorithm
• C.F.I: Closed Frequent Itemset
• Follows the idea of PrefixSpan: the locally frequent (extendable) codes in the projected database of a prefix sequence are the frequent C.F.Is (closed frequent codes)
• FML: First Matched Transaction List
• [Figure: vertical-based LocationList and FML over transactions 1 to 14]
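One way to picture the vertical encoding is sketched below. It assumes that transactions are numbered globally across sequences, that a LocationList holds every transaction containing a code, and that the FML holds the first such transaction in each sequence; the slides do not spell these details out. The encoded database is hypothetical, built so that the FMLs of codes #2 and #4 match the values used on the pruning slide.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def encode_vertical(
    sequences: List[List[Set[str]]],
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Return (LocationList, FML) per code, numbering transactions 1..N across all sequences."""
    location: Dict[str, List[int]] = defaultdict(list)
    fml: Dict[str, List[int]] = defaultdict(list)
    tid = 0
    for sequence in sequences:
        seen: Set[str] = set()  # codes already matched in this sequence
        for codes in sequence:
            tid += 1
            for code in sorted(codes):
                location[code].append(tid)
                if code not in seen:  # first match of this code in the sequence
                    seen.add(code)
                    fml[code].append(tid)
    return dict(location), dict(fml)


# Hypothetical encoded database: 4 sequences covering transactions 1-4, 5-7, 8-11, 12-14.
database = [
    [{"#4"}, {"#2"}, {"#4"}, {"#2"}],
    [{"#2", "#4"}, {"#4"}, {"#2"}],
    [{"#2", "#4"}, {"#2"}, {"#4"}, {"#2"}],
    [{"#4"}, {"#4"}, {"#2"}],
]
location_list, first_matched = encode_vertical(database)
```

Under these assumptions the length of an FML equals the sequence support of its code, since each supporting sequence contributes exactly one entry.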
Pruning strategies
• Layer pruning: prune non-closed sequences during the sequence extension step of a prefix sequence
• #1.FML: {2,6,10,14}
• #2.FML: {2,5,8,14}
• #3.FML: {2,6,9,13}
• #4.FML: {1,5,8,12}
• #1.FML >L #4.FML and #3.FML >L #4.FML, so skip prefixes #1 and #3
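The layer-pruning example can be reproduced with a short sketch. The slides do not define the `>L` comparison, so the code assumes it means "same length and strictly greater at every position", which is the reading consistent with the four FMLs above (#2 is kept because it ties with #4 at 5 and 8). The function names are hypothetical.

```python
from typing import Dict, List


def later_everywhere(a: List[int], b: List[int]) -> bool:
    """Assumed meaning of a >L b: same length, and every entry of a is strictly greater."""
    return len(a) == len(b) and all(x > y for x, y in zip(a, b))


def layer_prune(fmls: Dict[str, List[int]]) -> List[str]:
    """Return the prefixes to skip: those whose FML is >L the FML of some other code."""
    return sorted(
        p
        for p, a in fmls.items()
        if any(q != p and later_everywhere(a, b) for q, b in fmls.items())
    )


# FMLs taken from the slide.
fmls = {
    "#1": [2, 6, 10, 14],
    "#2": [2, 5, 8, 14],
    "#3": [2, 6, 9, 13],
    "#4": [1, 5, 8, 12],
}
skipped = layer_prune(fmls)
```

The intuition, stated here as an interpretation rather than the authors' wording: if #4 first occurs before #1 in every supporting sequence, any pattern starting with #1 can be prefixed with #4 without losing support, so it cannot be closed.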
• Reduce the cost of comparing any two FMLs (a total of O(|C.F.I|²) comparisons): only C.F.Is that are hashed to the same bucket are compared to each other
EL: Extended List
• #2.EL = {3,6,9}
• #4.EL = {2,6,9,13}
• The number of transactions in the EL of α is the largest support an extended sequence of α can have
• [Figure: EL over transactions 1 to 14]
PDB (Projected Database)
• #2.PDB = {3,4,6,7,9,10,11}
• #4.PDB = {2,3,4,6,7,9,10,11,13,14}
• [Figure: PDB over transactions 1 to 14]
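The EL and PDB values for #2 and #4 can be derived from their FMLs with the sketch below. It assumes the four sequences end at transactions 4, 7, 11 and 14; these boundaries are not stated in the text and are inferred because they are consistent with every EL and PDB value quoted on the slides. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import List

# Assumed last transaction id of each sequence (inferred, not given in the slides).
SEQ_ENDS = [4, 7, 11, 14]


def sequence_end(tid: int, seq_ends: List[int]) -> int:
    """Last transaction id of the sequence that contains transaction tid."""
    return seq_ends[bisect_left(seq_ends, tid)]


def extended_list(fml: List[int], seq_ends: List[int]) -> List[int]:
    """Transaction right after each first match, when one exists in the same sequence."""
    return [t + 1 for t in fml if t < sequence_end(t, seq_ends)]


def projected_database(fml: List[int], seq_ends: List[int]) -> List[int]:
    """All transactions after each first match, up to the end of its sequence."""
    return [u for t in fml for u in range(t + 1, sequence_end(t, seq_ends) + 1)]


el_2 = extended_list([2, 5, 8, 14], SEQ_ENDS)
el_4 = extended_list([1, 5, 8, 12], SEQ_ENDS)
pdb_2 = projected_database([2, 5, 8, 14], SEQ_ENDS)
pdb_4 = projected_database([1, 5, 8, 12], SEQ_ENDS)
```

For #2 the first match in the last sequence is transaction 14, which is also the last transaction of that sequence, so it contributes nothing to the EL. That is why #2.EL has three entries and any sequence extension of #2 can have support at most 3.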
Phase 3: ext-pruning
• No extendable codes with the same support as α
• The supports of all super-sequences of α are less than the support of α
• No super-sequence of α can be generated as a frequent pattern
ExtPruning: for two sequential patterns α and β, the ExtPruning rules state that
1. If α.FML =L β.FML and α is a super-sequence of β, then remove β, and vice versa
2. If Sup(α) = Sup(β) and α is a super-sequence of β, then β is not a closed pattern, and vice versa
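A minimal sketch of how the two rules might be applied to a set of candidate patterns. Several simplifications are assumptions: each pattern is a tuple of single codes compared by equality, `=L` is treated as plain equality of FMLs, support is taken to be the length of the FML, and all codes and FML values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pattern:
    codes: Tuple[str, ...]  # sequence of closed-itemset codes
    fml: Tuple[int, ...]    # first matched transaction per supporting sequence

    @property
    def support(self) -> int:
        # Assumption: one FML entry per supporting sequence.
        return len(self.fml)


def is_super_sequence(alpha: Tuple[str, ...], beta: Tuple[str, ...]) -> bool:
    """True if alpha is strictly longer and contains beta's codes in order."""
    it = iter(alpha)
    return len(alpha) > len(beta) and all(code in it for code in beta)


def ext_prune(patterns: List[Pattern]) -> Tuple[List[Pattern], List[Pattern]]:
    """Return (removed by rule 1, marked non-closed by rule 2)."""
    removed: List[Pattern] = []
    non_closed: List[Pattern] = []
    for beta in patterns:
        for alpha in patterns:
            if not is_super_sequence(alpha.codes, beta.codes):
                continue
            if alpha.fml == beta.fml:            # rule 1: identical FML
                removed.append(beta)
                break
            if alpha.support == beta.support:    # rule 2: identical support
                non_closed.append(beta)
                break
    return removed, non_closed


# Hypothetical patterns and FMLs.
b1 = Pattern(("#2",), (2, 6, 9, 13))
a1 = Pattern(("#4", "#2"), (2, 6, 9, 13))
b2 = Pattern(("#3",), (1, 5, 8, 12))
a2 = Pattern(("#3", "#5"), (3, 6, 10, 14))
c = Pattern(("#6",), (1, 5, 8))
removed, non_closed = ext_prune([b1, a1, b2, a2, c])
```

In this toy run `b1` is removed by rule 1 because `a1` has the same FML, `b2` is only marked non-closed by rule 2 because `a2` has the same support but a different FML, and `c` is untouched.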
Conclusion
• COBRA uses more memory but takes less time