FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs. Maged EL-Sayed, Carolina Ruiz, and Elke A. Rundensteiner. 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), pp. 128-135, 2004. November 12-13, 2004, Washington, DC, USA
FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs. Maged EL-Sayed, Carolina Ruiz, and Elke A. Rundensteiner. 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), pp. 128-135, 2004. November 12-13, 2004, Washington, DC, USA. Advisor: Professor Hsin-Hsi Chen. Reporter: Clarence Min-Chi Hsieh. Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU. 2005/10/11
Outline • Introduction • FS-Tree Construction • Mining the FS-Tree • Maintaining the FS-Tree Incrementally • Mining the FS-Tree Incrementally • Interactive Mining • Experimental Evaluation • Conclusions
Introduction • Path Traversal Pattern • FS, SS • ABC, BCD… • Web Traversal Pattern • IPA, MFTP • ABDCA, CACADB…
Introduction (Cont.) • Backward traversals are considered • Patterns are subsequences whose pages must be contiguous • MSuppRlink: system-defined • MSuppRseq: user-defined • MSuppClink, MSuppCseq: the corresponding count thresholds
FS-Tree Construction • Total # of links = 50 • MSuppRlink = 4% (system-defined) • MSuppRseq = 6% (user-defined) • MSuppClink = 2 • MSuppCseq = 3
SID  InSeq       SID  InSeq
1    dgi         10   bdeh
2    dg          11   cdebfabc
3    cdehi       12   cdefabc
4    cde         13   aic
5    cbcdg       14   die
6    cb          15   igdba
7    abcdgi      16   efa
8    abcd        17   ef
9    bdehi       18   efab
FS-Tree Construction (Cont.) • Link counts
Link  Count      Link  Count
d-g   4          f-a   2
g-i   2          b-f   1
c-d   7          e-f   1
d-e   6          a-i   1
e-h   3          i-c   1
h-i   2          d-i   1
c-b   2          i-e   1
b-c   5          i-g   1
a-b   4          g-d   1
b-d   2          d-b   1
e-b   1          b-a   1
FS-Tree Construction (Cont.) • When FS-Tree Built
Header Table (HT)
Link  Count  ListH
d-g   4
g-i   2
c-d   7
d-e   6
e-h   3
h-i   2
c-b   2
b-c   5
a-b   4
b-d   2
f-a   2
Non-Frequent Links Table (NFLT)
SID  Link  Count
11   e-b   1
11   b-f   1
12   e-f   1
13   a-i   1
13   i-c   1
14   d-i   1
14   i-e   1
15   i-g   1
15   g-d   1
15   d-b   1
15   b-a   1
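The link counts and the HT/NFLT split above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the names `links`, `header_table`, and `nflt` are hypothetical, and rounding the relative thresholds up to counts is an assumption (the slides only give 4% → 2 and 6% → 3 for 50 links).

```python
from collections import Counter
from math import ceil

# Sessions 1-15 from the example table (16-18 are added later as an update).
SESSIONS = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}

def links(seq):
    """Consecutive page pairs of a session, e.g. 'dgi' -> ['dg', 'gi']."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

# First scan: count every link in the input.
link_count = Counter(l for seq in SESSIONS.values() for l in links(seq))
total_links = sum(link_count.values())

# Assumption: relative supports become counts by rounding up.
msupp_c_link = ceil(total_links * 4 / 100)   # MSuppRlink = 4%
msupp_c_seq = ceil(total_links * 6 / 100)    # MSuppRseq = 6%

# Links at or above MSuppClink go to the Header Table, the rest to the NFLT.
header_table = {l: c for l, c in link_count.items() if c >= msupp_c_link}
nflt = [(sid, l) for sid, seq in SESSIONS.items()
        for l in links(seq) if link_count[l] < msupp_c_link]
```

The NFLT keeps the session ID next to each non-frequent link so that the session can be fetched again if the link later becomes frequent.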
FS-Tree Construction (Cont.) • Header Table (HT) as above on every slide • [Figures: FS-Tree after inserting each session; node:SID marks the node where a session ends]
SID 1, dgi: path Root-d-g-i, end i:1
SID 2, dg: path Root-d-g, end g:2
SID 3, cdehi: path Root-c-d-e-h-i, end i:3
SID 4, cde: path Root-c-d-e, end e:4
SID 5, cbcdg: path Root-c-b-c-d-g, end g:5
SID 6, cb: path Root-c-b, end b:6
SID 7, abcdgi: path Root-a-b-c-d-g-i, end i:7
SID 8, abcd: path Root-a-b-c-d, end d:8
SID 9, bdehi: path Root-b-d-e-h-i, end i:9
SID 10, bdeh: path Root-b-d-e-h, end h:10
SID 11, cdebfabc: paths Root-c-d-e and Root-f-a-b-c (links e-b and b-f are in the NFLT)
SID 12, cdefabc: paths Root-c-d-e and Root-f-a-b-c (link e-f is in the NFLT)
SID 13, aic: tree unchanged
SID 14, die: tree unchanged
SID 15, igdba: tree unchanged
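The insertion walk-through can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's data structure: `Node`, `split_on_infrequent`, `build_tree`, and `edge_count` are hypothetical names, each node stores the count of the link from its parent, and the Header Table's ListH pointers and the sequence-end markers are left out. Sessions are cut at every non-frequent link, and pieces with a single page are dropped.

```python
from collections import Counter

SESSIONS = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}
MSUPP_C_LINK = 2

class Node:
    """Tree node; `count` is the number of times the link parent->page was inserted."""
    def __init__(self, page):
        self.page = page
        self.count = 0
        self.children = {}

def split_on_infrequent(seq, frequent):
    """Cut a session at every non-frequent link; keep pieces with at least one link."""
    segments, current = [], seq[0]
    for a, b in zip(seq, seq[1:]):
        if a + b in frequent:
            current += b
        else:
            segments.append(current)
            current = b
    segments.append(current)
    return [s for s in segments if len(s) > 1]

def build_tree(sessions, min_link_count):
    counts = Counter(s[i:i + 2] for s in sessions.values() for i in range(len(s) - 1))
    frequent = {l for l, c in counts.items() if c >= min_link_count}
    root = Node(None)
    # Second scan: insert every frequent piece as a path starting at the root.
    for seq in sessions.values():
        for segment in split_on_infrequent(seq, frequent):
            node = root
            for page in segment:
                if page not in node.children:
                    node.children[page] = Node(page)
                node = node.children[page]
                node.count += 1
    return root

def edge_count(root, path):
    """Count stored on the last node of `path`, walking down from the root."""
    node = root
    for page in path:
        node = node.children[page]
    return node.count

root = build_tree(SESSIONS, MSUPP_C_LINK)
```

For example, `edge_count(root, "cd")` gives the count of link c-d on the branch that starts at c, which is the (c-d:4) entry that appears later in the derived paths.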
Mining the FS-Tree • Step 1: Extracting Derived Paths • Step 2: Constructing Conditional Sequence Base • Step 3: Constructing Conditional FS-Tree • Step 4: Extracting Frequent Sequences
Mining the FS-Tree (Cont.) • Step 1 • [Figure: the completed FS-Tree with its Header Table (HT)]
Mining the FS-Tree (Cont.) • Step 2 • Conditional Sequence base: (c-d:1, d-e:1), (b-d:2, d-e:2) • Step 3 • Conditional FS-Tree: [Figure]
Mining the FS-Tree (Cont.) • Step 4 • Depth first traversal • Output <deh : 3>
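Steps 1 and 2 for link e-h can be sketched as below. This is an illustration under assumptions, not the authors' code: all function names are hypothetical, the tree is the simplified one with link counts stored on child nodes, and the tree is searched directly instead of following ListH pointers. The last function is a deliberate stand-in for Steps 3 and 4: it only tallies the link right before the target, so it recovers 3-page patterns and nothing longer.

```python
from collections import Counter

SESSIONS = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}
MSUPP_C_LINK, MSUPP_C_SEQ = 2, 3

class Node:
    def __init__(self, page, parent=None):
        self.page, self.parent = page, parent
        self.count, self.children = 0, {}

def build_tree(sessions, min_link_count):
    counts = Counter(s[i:i + 2] for s in sessions.values() for i in range(len(s) - 1))
    root = Node(None)
    for seq in sessions.values():
        # Cut the session at every non-frequent link.
        segments, current = [], seq[0]
        for a, b in zip(seq, seq[1:]):
            if counts[a + b] >= min_link_count:
                current += b
            else:
                segments.append(current)
                current = b
        segments.append(current)
        for segment in segments:
            if len(segment) < 2:
                continue  # a single page carries no link
            node = root
            for page in segment:
                if page not in node.children:
                    node.children[page] = Node(page, node)
                node = node.children[page]
                node.count += 1
    return root

def derived_paths(root, link):
    """Step 1: root-to-edge paths, as (link, count) pairs, for every edge equal to `link`."""
    src, dst = link
    paths, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node.page == dst and node.parent is not None and node.parent.page == src:
            path, cur = [], node
            while cur.parent.page is not None:  # stop below the root
                path.append((cur.parent.page + cur.page, cur.count))
                cur = cur.parent
            paths.append(path[::-1])
    return paths

def conditional_base(paths):
    """Step 2: drop the target link and give every remaining link the target's count."""
    base = []
    for path in paths:
        support = path[-1][1]
        prefix = [(l, support) for l, _ in path[:-1]]
        if prefix:
            base.append(prefix)
    return base

def three_page_patterns(base, link, min_seq_count):
    """Simplified Steps 3-4: tally only the link immediately before the target."""
    totals = Counter()
    for path in base:
        prev_link, count = path[-1]
        totals[prev_link[0] + link] += count
    return {p: c for p, c in totals.items() if c >= min_seq_count}

root = build_tree(SESSIONS, MSUPP_C_LINK)
paths = derived_paths(root, "eh")
base = conditional_base(paths)
patterns = three_page_patterns(base, "eh", MSUPP_C_SEQ)
```

With the example data, `base` holds the two conditional sequences listed under Step 2 and `patterns` contains the single pattern deh with support 3.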
Mining the FS-Tree (Cont.) • The Answers
Link d-g
  Derived Paths: (d-g:2) (c-b:2, b-c:1, c-d:1, d-g:1) (a-b:2, b-c:2, c-d:2, d-g:1)
  Conditional Sequence bases: (c-b:1, b-c:1, c-d:1) (a-b:1, b-c:1, c-d:1)
  Conditional FS-Trees: (none)
  Frequent Sequences: (none)
Link c-d
  Derived Paths: (c-d:4) (c-b:2, b-c:1, c-d:1) (a-b:2, b-c:2, c-d:2)
  Conditional Sequence bases: (c-b:1, b-c:1) (a-b:2, b-c:2)
  Conditional FS-Trees: (b-c:3)
  Frequent Sequences: <bcd : 3>
Mining the FS-Tree (Cont.) • The Answers
Link e-h
  Derived Paths: (c-d:4, d-e:4, e-h:1) (b-d:2, d-e:2, e-h:2)
  Conditional Sequence bases: (c-d:1, d-e:1) (b-d:2, d-e:2)
  Conditional FS-Trees: (d-e:3)
  Frequent Sequences: <deh : 3>
Link d-e
  Derived Paths: (c-d:4, d-e:4) (b-d:2, d-e:2)
  Conditional Sequence bases: (c-d:4) (b-d:2)
  Conditional FS-Trees: (c-d:4)
  Frequent Sequences: <cde : 4>
Mining the FS-Tree (Cont.) • The Answers
Link b-c
  Derived Paths: (c-b:2, b-c:1) (a-b:2, b-c:2) (f-a:2, a-b:2, b-c:2)
  Conditional Sequence bases: (c-b:1) (a-b:2) (f-a:2, a-b:2)
  Conditional FS-Trees: (a-b:4)
  Frequent Sequences: <abc : 4>
Link a-b
  Derived Paths: (a-b:2) (f-a:2, a-b:2)
  Conditional Sequence bases: (f-a:2)
  Conditional FS-Trees: (none)
  Frequent Sequences: (none)
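The four answers can be cross-checked against the raw sessions. The sketch below is a brute-force count of contiguous subsequences, not the FS-Miner algorithm; it assumes support is counted per occurrence and that a pattern needs at least three pages (two links). The helper name `contiguous_subsequences` is hypothetical.

```python
from collections import Counter

SESSIONS = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}
MSUPP_C_SEQ = 3

def contiguous_subsequences(seq, min_pages=3):
    """Yield every run of at least `min_pages` consecutive pages in a session."""
    for length in range(min_pages, len(seq) + 1):
        for start in range(len(seq) - length + 1):
            yield seq[start:start + length]

# Count every occurrence over all sessions, then keep those reaching MSuppCseq.
support = Counter(sub for seq in SESSIONS.values()
                  for sub in contiguous_subsequences(seq))
frequent = {sub: count for sub, count in support.items() if count >= MSUPP_C_SEQ}
```

On sessions 1-15 this yields exactly bcd:3, cde:4, deh:3, and abc:4, matching the frequent sequences listed above. No pattern of four or more pages reaches the threshold.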
Maintaining the FS-Tree Incrementally • New sessions: SID 16 efa, SID 17 ef, SID 18 efab • Link counts added by the new sessions: e-f:3, f-a:2, a-b:1 • MSuppClink=2, MSuppCseq=3 • e-f (NFLT record: SID 12, count 1) becomes frequent and moves to the Header Table (HT) • Retrieve the sequence from the original DB: SID 12, cdefabc • Delete this record from the NFLT (move to HT)
Maintaining the FS-Tree Incrementally (Cont.) • Updated Header Table (HT), shown on every slide
Link  Count  ListH
d-g   4
g-i   2
c-d   7
d-e   6
e-h   3
h-i   2
c-b   2
b-c   5
a-b   5
b-d   2
f-a   4
e-f   4
Maintaining the FS-Tree Incrementally (Cont.) • SID 12, cdefabc • Delete • [Figure: FS-Tree, with sequence end c:12]
Maintaining the FS-Tree Incrementally (Cont.) • SID 16, efa • [Figure: FS-Tree with new branch Root-e-f-a, end a:16]
Maintaining the FS-Tree Incrementally (Cont.) • SID 17, ef • [Figure: FS-Tree, end f:17]
Maintaining the FS-Tree Incrementally (Cont.) • SID 18, efab • [Figure: FS-Tree with branch Root-e-f-a-b, end b:18]
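The bookkeeping behind this update can be sketched as follows. It is a minimal illustration with hypothetical names (`delta`, `promoted`, `to_reinsert`) that covers only the link counts and the NFLT lookup, not the tree surgery itself. Following the slides, MSuppClink is kept at 2 after the update.

```python
from collections import Counter

OLD_SESSIONS = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}
NEW_SESSIONS = {16: "efa", 17: "ef", 18: "efab"}
MSUPP_C_LINK = 2  # kept unchanged, as on the slide

def links(seq):
    """Consecutive page pairs of a session."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

old_counts = Counter(l for s in OLD_SESSIONS.values() for l in links(s))
# Only the new sessions are scanned to obtain the increments.
delta = Counter(l for s in NEW_SESSIONS.values() for l in links(s))
new_counts = old_counts + delta

# Links that cross the threshold move from the NFLT to the Header Table.
promoted = {l for l in delta if old_counts[l] < MSUPP_C_LINK <= new_counts[l]}

# Old sessions recorded in the NFLT under a promoted link must be fetched
# from the original database and inserted again.
to_reinsert = sorted(sid for sid, seq in OLD_SESSIONS.items()
                     if promoted & set(links(seq)))
```

Here `delta` reproduces the increments e-f:3, f-a:2, a-b:1, only e-f is promoted, and session 12 is the one old session that has to be re-read.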
Mining the FS-Tree Incrementally • [Figure: nine numbered transition types (1-9) among Frequent Links in the Header Table (HT), Potentially Frequent Links in the Non-Frequent Links Table (NFLT), and Non-Frequent Links] • Type 1: Mine for those links if they are affected • Type 2 and 4: Mine for these links • Type 3 and 5: Delete previously discovered patterns that include these links • Type 6, 7, 8, and 9: Do nothing
Mining the FS-Tree Incrementally (Cont.) • The Answers • Header Table (HT): d-g 4, g-i 2, c-d 7, d-e 6, e-h 3, h-i 2, c-b 2, b-c 5, a-b 5, b-d 2, f-a 4, e-f 4
Link a-b
  Derived Paths: (c-d:4, d-e:4, e-f:1, f-a:1, a-b:1) (a-b:2) (f-a:1, a-b:1) (e-f:3, f-a:2, a-b:1)
  Conditional Sequence bases: (c-d:1, d-e:1, e-f:1, f-a:1) (f-a:1) (e-f:1, f-a:1)
  Conditional FS-Trees: (f-a:3)
  Frequent Sequences: <fab : 3>
Mining the FS-Tree Incrementally (Cont.) • The Answers
Link f-a
  Derived Paths: (c-d:4, d-e:4, e-f:1, f-a:1) (f-a:1) (e-f:3, f-a:2)
  Conditional Sequence bases: (c-d:1, d-e:1, e-f:1) (e-f:2)
  Conditional FS-Trees: (e-f:3)
  Frequent Sequences: <efa : 3>
Link e-f
  Derived Paths: (c-d:4, d-e:4, e-f:1) (e-f:3)
  Conditional Sequence bases: (c-d:1, d-e:1)
  Conditional FS-Trees: (none)
  Frequent Sequences: (none)
Interactive Mining • Set MSuppClink to a small enough value • The FS-Tree then holds enough information • No need to reference the original database
Experimental Evaluation • MS Data Set • Microsoft Anonymous Web Data Set • 32,711 sessions • 1 up to 35 page references • 294 distinct pages • MSNBC Data Set • MSNBC Anonymous Web Data Set • 989,818 sessions • 1 up to several thousands of page references • 17 distinct pages • http://kdd.ics.uci.edu
Experimental Evaluation (Cont.) • Scalability with the Number of Input Sessions • MS Data Set • (No MSuppRseq stated?)
Experimental Evaluation (Cont.) • Scalability with the Number of Input Sessions • MSNBC Data Set • (No MSuppRseq stated?)
Experimental Evaluation (Cont.) • Scalability with Support Threshold • MS Data Set
Experimental Evaluation (Cont.) • Scalability with Support Threshold • MSNBC Data Set
Experimental Evaluation (Cont.) • Incremental Mining • MS Data Set
Experimental Evaluation (Cont.) • Incremental Mining • MSNBC Data Set
Conclusions • Two Scans of the Input Database • Allows for Incremental Discovery of Frequent Sequences when the Input Database is Updated • Allows Interactive Response to Changes to the Minimum Support