
FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs

FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs. Maged EL-Sayed, Carolina Ruiz, and Elke A. Rundensteiner. 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), pp. 128-135, November 12-13, 2004, Washington, DC, USA


Presentation Transcript


  1. FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs. Maged EL-Sayed, Carolina Ruiz, and Elke A. Rundensteiner. 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), pp. 128-135, November 12-13, 2004, Washington, DC, USA. Advisor: Professor Hsin-Hsi Chen. Reporter: Clarence Min-Chi Hsieh. Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU. 2005/10/11

  2. Outline • Introduction • FS-Tree Construction • Mining the FS-Tree • Maintaining the FS-Tree Incrementally • Mining the FS-Tree Incrementally • Interactive Mining • Experimental Evaluation • Conclusions

  3. Introduction • Path Traversal Pattern • FS, SS • ABC, BCD… • Web Traversal Pattern • IPA, MFTP • ABDCA, CACADB…

  4. Introduction (Cont.) • Considers Backward Traversal • Subsequences • Must Be Contiguous • MSuppRlink (System-Defined) • MSuppRseq (User-Defined) • MSuppClink • MSuppCseq

  5. FS-Tree Construction
     SID  InSeq      SID  InSeq
     1    dgi        10   bdeh
     2    dg         11   cdebfabc
     3    cdehi      12   cdefabc
     4    cde        13   aic
     5    cbcdg      14   die
     6    cb         15   igdba
     7    abcdgi     16   efa
     8    abcd       17   ef
     9    bdehi      18   efab
     Total # of links = 50
     MSuppRlink = 4% (System-Defined)    MSuppClink = 2
     MSuppRseq = 6% (User-Defined)       MSuppCseq = 3

  6. FS-Tree Construction (Cont.) (input sequences as on slide 5)
     Link  Count      Link  Count
     d-g   4          b-f   1
     g-i   2          e-f   1
     c-d   7          a-i   1
     d-e   6          i-c   1
     e-h   3          d-i   1
     h-i   2          i-e   1
     c-b   2          i-g   1
     b-c   5          g-d   1
     a-b   4          d-b   1
     b-d   2          b-a   1
     f-a   2          e-b   1
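
The link counts and count thresholds on slides 5 and 6 can be reproduced with a few lines of Python. This is only a sketch: the names `SEQUENCES` and `links_of` are made up for illustration, and converting the relative thresholds to counts by rounding up is an assumption that happens to match the slide values.

```python
from collections import Counter
import math

# Initial database from slide 5 (SIDs 1-15); SIDs 16-18 are used later as the increment.
SEQUENCES = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}

def links_of(seq):
    """Return the consecutive page pairs of one session, e.g. 'dgi' -> ['d-g', 'g-i']."""
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

# One pass over the database: count every link occurrence.
link_counts = Counter(link for seq in SEQUENCES.values() for link in links_of(seq))
total_links = sum(link_counts.values())

# Assumption: count threshold = relative threshold (in percent) * total links, rounded up.
msupp_c_link = math.ceil(total_links * 4 / 100)  # MSuppRlink = 4%
msupp_c_seq = math.ceil(total_links * 6 / 100)   # MSuppRseq = 6%

print(total_links, msupp_c_link, msupp_c_seq)
```

With the 15 initial sessions this gives 50 links in total and the count thresholds 2 and 3 shown on slide 5.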

  7. FS-Tree Construction (Cont.) When FS-Tree Built
     Header Table (HT)           Non-Frequent Links Table (NFLT)
     Link  Count  ListH          SID  Link  Count
     d-g   4                     11   e-b   1
     g-i   2                     11   b-f   1
     c-d   7                     12   e-f   1
     d-e   6                     13   a-i   1
     e-h   3                     13   i-c   1
     h-i   2                     14   d-i   1
     c-b   2                     14   i-e   1
     b-c   5                     15   i-g   1
     a-b   4                     15   g-d   1
     b-d   2                     15   d-b   1
     f-a   2                     15   b-a   1
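
As a rough illustration of how the two tables on this slide could be filled, the sketch below splits the link counts at MSuppClink. The variable names (`header_table`, `nflt`) are illustrative, and the NFLT is modeled simply as a list of (SID, link, count) records, which is an assumption about its layout based on the columns shown above.

```python
from collections import defaultdict

SEQUENCES = {
    1: "dgi", 2: "dg", 3: "cdehi", 4: "cde", 5: "cbcdg",
    6: "cb", 7: "abcdgi", 8: "abcd", 9: "bdehi", 10: "bdeh",
    11: "cdebfabc", 12: "cdefabc", 13: "aic", 14: "die", 15: "igdba",
}
MSUPP_C_LINK = 2  # count threshold for links, from slide 5

# Count each link and remember which sessions contain it.
counts = defaultdict(int)
sids = defaultdict(list)
for sid, seq in SEQUENCES.items():
    for a, b in zip(seq, seq[1:]):
        link = f"{a}-{b}"
        counts[link] += 1
        sids[link].append(sid)

# Links at or above the threshold go to the header table ...
header_table = {link: c for link, c in counts.items() if c >= MSUPP_C_LINK}
# ... all others are recorded with their session IDs in the NFLT.
nflt = sorted(
    (sid, link, c)
    for link, c in counts.items() if c < MSUPP_C_LINK
    for sid in sids[link]
)

for record in nflt:
    print(record)
```

Keeping the SID next to each non-frequent link is what later allows the affected session to be fetched again when such a link becomes frequent.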

  8. FS-Tree Construction (Cont.) Insert SID 1, InSeq dgi. Header Table (HT), Link Count ListH: d-g 4, g-i 2, c-d 7, d-e 6, e-h 3, h-i 2, c-b 2, b-c 5, a-b 4, b-d 2, f-a 2 (the same HT is shown on slides 9-22). [Figure: FS-Tree after this insertion]

  9. FS-Tree Construction (Cont.) Insert SID 2, InSeq dg. [Figure: FS-Tree after this insertion]

  10. FS-Tree Construction (Cont.) Insert SID 3, InSeq cdehi. [Figure: FS-Tree after this insertion]

  11. FS-Tree Construction (Cont.) Insert SID 4, InSeq cde. [Figure: FS-Tree after this insertion]

  12. FS-Tree Construction (Cont.) Insert SID 5, InSeq cbcdg. [Figure: FS-Tree after this insertion]

  13. FS-Tree Construction (Cont.) Insert SID 6, InSeq cb. [Figure: FS-Tree after this insertion]

  14. FS-Tree Construction (Cont.) Insert SID 7, InSeq abcdgi. [Figure: FS-Tree after this insertion]

  15. FS-Tree Construction (Cont.) Insert SID 8, InSeq abcd. [Figure: FS-Tree after this insertion]

  16. FS-Tree Construction (Cont.) Insert SID 9, InSeq bdehi. [Figure: FS-Tree after this insertion]

  17. FS-Tree Construction (Cont.) Insert SID 10, InSeq bdeh. [Figure: FS-Tree after this insertion]

  18. FS-Tree Construction (Cont.) Insert SID 11, InSeq cdebfabc. [Figure: FS-Tree after this insertion]

  19. FS-Tree Construction (Cont.) Insert SID 12, InSeq cdefabc. [Figure: FS-Tree after this insertion]

  20. FS-Tree Construction (Cont.) SID 13, InSeq aic. [Figure: FS-Tree, unchanged from slide 19]

  21. FS-Tree Construction (Cont.) SID 14, InSeq die. [Figure: FS-Tree, unchanged from slide 19]

  22. FS-Tree Construction (Cont.) SID 15, InSeq igdba. [Figure: FS-Tree, unchanged from slide 19]
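
The construction walked through on slides 8-22 can be approximated by the sketch below. It assumes that a session is cut at every link that is not in the header table and that each remaining piece is inserted into a prefix tree whose nodes count traversals of the incoming link. The class `Node` and the helpers `split_segments`, `insert`, and `link_totals` are hypothetical names, and the sequence-end markers seen in the figures (such as g:2) as well as the ListH pointers are left out.

```python
from collections import Counter

SEQUENCES = ["dgi", "dg", "cdehi", "cde", "cbcdg", "cb", "abcdgi", "abcd",
             "bdehi", "bdeh", "cdebfabc", "cdefabc", "aic", "die", "igdba"]
# Links listed in the header table on slide 7.
FREQUENT = {"d-g", "g-i", "c-d", "d-e", "e-h", "h-i", "c-b", "b-c", "a-b", "b-d", "f-a"}

class Node:
    def __init__(self):
        self.children = {}  # page -> Node
        self.count = 0      # how often the link parent -> this node was traversed

def split_segments(seq):
    """Cut a session at every non-frequent link; keep pieces with at least one link."""
    segments, start = [], 0
    for i in range(len(seq) - 1):
        if f"{seq[i]}-{seq[i + 1]}" not in FREQUENT:
            segments.append(seq[start:i + 1])
            start = i + 1
    segments.append(seq[start:])
    return [s for s in segments if len(s) >= 2]

def insert(root, segment):
    """Insert one segment, sharing existing prefixes and incrementing counts."""
    node = root
    for page in segment:
        if page not in node.children:
            node.children[page] = Node()
        node = node.children[page]
        node.count += 1

def link_totals(node, parent_page=None, totals=None):
    """Sum the counts of every link over all tree edges (root edges carry no link)."""
    totals = Counter() if totals is None else totals
    for page, child in node.children.items():
        if parent_page is not None:
            totals[f"{parent_page}-{page}"] += child.count
        link_totals(child, page, totals)
    return totals

root = Node()
for seq in SEQUENCES:
    for segment in split_segments(seq):
        insert(root, segment)

print(list(root.children))      # first-level pages below the root
print(dict(link_totals(root)))  # per-link totals stored in the tree
```

Under these assumptions the first-level nodes come out as d, c, a, b, f, the order in which they appear under Root in the figures, and the per-link totals summed over the tree equal the header table counts. Sessions 13-15 contain no frequent link and therefore leave the tree unchanged.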

  23. Mining the FS-Tree Step 1: Extracting Derived Paths Step 2: Constructing Conditional Sequence Base Step 3: Constructing Conditional FS-Tree Step 4: Extracting Frequent Sequences

  24. Mining the FS-Tree (Cont.) Step 1. [Figure: Header Table (HT) and the complete FS-Tree, as on slide 22]

  25. Mining the FS-Tree (Cont.) Step 2 Conditional Sequence Base: (c-d:1, d-e:1), (b-d:2, d-e:2). Step 3 Conditional FS-Tree. [Figure: derived paths and the resulting conditional FS-Tree]

  26. Mining the FS-Tree (Cont.) Step 4 Depth-First Traversal. Output: <deh : 3>. [Figure: conditional FS-Tree]
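
Steps 2 to 4 can be sketched for a single link as follows. This is a simplified reading of the slides, not necessarily the exact procedure of the paper: the conditional base is assumed to be each derived path without its last link, weighted by that last link's count; the conditional tree is assumed to be a trie over the reversed prefixes, pruned at MSuppCseq. The function names are invented for illustration, and the derived paths for link e-h are the ones listed later in the deck.

```python
def conditional_base(derived_paths):
    """Step 2: drop the target link and give each prefix the target link's count."""
    base = []
    for path in derived_paths:
        *prefix, (_, target_count) = path
        if prefix:
            base.append(([link for link, _ in prefix], target_count))
    return base

def conditional_tree(base, min_count):
    """Step 3: insert each prefix in reverse order into a trie, then prune weak nodes."""
    tree = {}
    for links, count in base:
        level = tree
        for link in reversed(links):
            node = level.setdefault(link, {"count": 0, "children": {}})
            node["count"] += count
            level = node["children"]

    def prune(level):
        return {link: {"count": n["count"], "children": prune(n["children"])}
                for link, n in level.items() if n["count"] >= min_count}

    return prune(tree)

def frequent_sequences(tree, target_link):
    """Step 4: depth-first traversal; every surviving node yields one sequence."""
    results = {}

    def walk(level, suffix):
        for link, node in level.items():
            seq = link.split("-")[0] + suffix  # prepend the link's source page
            results[seq] = node["count"]
            walk(node["children"], seq)

    walk(tree, target_link.replace("-", ""))
    return results

# Derived paths for link e-h, as given on the slides.
derived_e_h = [
    [("c-d", 4), ("d-e", 4), ("e-h", 1)],
    [("b-d", 3), ("d-e", 2), ("e-h", 2)],
]
base = conditional_base(derived_e_h)
tree = conditional_tree(base, min_count=3)  # MSuppCseq = 3
print(frequent_sequences(tree, "e-h"))
```

For link e-h only the node d-e reaches a count of 3 in the conditional tree, so the traversal yields the single sequence deh with support 3.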

  27. Mining the FS-Tree (Cont.) • The Answers
      Link: d-g
        Derived Paths: (d-g:2), (c-b:2, b-c:1, c-d:1, d-g:1), (a-b:2, b-c:2, c-d:2, d-g:1)
        Conditional Sequence Bases: (c-b:1, b-c:1, c-d:1), (a-b:1, b-c:1, c-d:1)
        Conditional FS-Trees: ∅
        Frequent Sequences: ∅
      Link: c-d
        Derived Paths: (c-d:4), (c-b:2, b-c:1, c-d:1), (a-b:2, b-c:2, c-d:2)
        Conditional Sequence Bases: (c-b:1, b-c:1), (a-b:2, b-c:2)
        Conditional FS-Trees: (b-c:3)
        Frequent Sequences: <bcd : 3>

  28. Mining the FS-Tree (Cont.) • The Answers
      Link: e-h
        Derived Paths: (c-d:4, d-e:4, e-h:1), (b-d:3, d-e:2, e-h:2)
        Conditional Sequence Bases: (c-d:1, d-e:1), (b-d:2, d-e:2)
        Conditional FS-Trees: (d-e:3)
        Frequent Sequences: <deh : 3>
      Link: d-e
        Derived Paths: (c-d:4, d-e:4), (b-d:3, d-e:2)
        Conditional Sequence Bases: (c-d:4), (b-d:2)
        Conditional FS-Trees: (c-d:4)
        Frequent Sequences: <cde : 4>

  29. Mining the FS-Tree (Cont.) • The Answers
      Link: b-c
        Derived Paths: (c-b:2, b-c:1), (a-b:2, b-c:2), (f-a:2, a-b:2, b-c:2)
        Conditional Sequence Bases: (c-b:1), (a-b:2), (f-a:2, a-b:2)
        Conditional FS-Trees: (a-b:4)
        Frequent Sequences: <abc : 4>
      Link: a-b
        Derived Paths: (a-b:2), (f-a:2, a-b:2)
        Conditional Sequence Bases: (f-a:2)
        Conditional FS-Trees: ∅
        Frequent Sequences: ∅
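
The four answers above (bcd, cde, deh, abc) can be cross-checked without any tree by brute force. The sketch below is not the FS-Miner algorithm: it simply enumerates every contiguous subsequence of at least three pages in the 15 initial sessions and counts occurrences, assuming that support is counted per occurrence. The helper name `contiguous_subsequences` is illustrative.

```python
from collections import Counter

SEQUENCES = ["dgi", "dg", "cdehi", "cde", "cbcdg", "cb", "abcdgi", "abcd",
             "bdehi", "bdeh", "cdebfabc", "cdefabc", "aic", "die", "igdba"]
MSUPP_C_SEQ = 3  # count threshold for sequences, from slide 5

def contiguous_subsequences(seq, min_pages=3):
    """Yield every contiguous run of at least min_pages pages (i.e. two or more links)."""
    for length in range(min_pages, len(seq) + 1):
        for start in range(len(seq) - length + 1):
            yield seq[start:start + length]

# Count every occurrence of every candidate, then keep those meeting the threshold.
support = Counter(sub for seq in SEQUENCES for sub in contiguous_subsequences(seq))
frequent = {sub: n for sub, n in support.items() if n >= MSUPP_C_SEQ}

print(sorted(frequent.items()))
# -> [('abc', 4), ('bcd', 3), ('cde', 4), ('deh', 3)]
```

The brute-force result agrees with the sequences reported on slides 27-29, at the cost of enumerating every candidate rather than only the paths stored in the tree.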

  30. Maintaining the FS-Tree Incrementally
      New sequences (MSuppClink = 2, MSuppCseq = 3):
      SID  InSeq
      16   efa
      17   ef
      18   efab
      Link counts in the new sequences: e-f:3, f-a:2, a-b:1
      e-f in NFLT Becomes Frequent, Move to Table HT
      Non-Frequent Links Table (NFLT): SID 12, Link e-f, Count 1 • Delete this Record from NFLT (Move to HT)
      SID 12, InSeq cdefabc • Retrieve the Sequence from Original DB
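
The bookkeeping on this slide might look roughly like the sketch below, which only updates the two tables and reports which old sessions must be read again; re-inserting those sessions into the FS-Tree is not shown. Table layouts and names (`header_table`, `nflt`, `promoted`, `sids_to_reread`) are assumptions for illustration, and new links that stay below the threshold are not handled because the example contains none.

```python
from collections import Counter

MSUPP_C_LINK = 2

# State after the initial build (slide 7).
header_table = {"d-g": 4, "g-i": 2, "c-d": 7, "d-e": 6, "e-h": 3, "h-i": 2,
                "c-b": 2, "b-c": 5, "a-b": 4, "b-d": 2, "f-a": 2}
nflt = [(11, "e-b", 1), (11, "b-f", 1), (12, "e-f", 1), (13, "a-i", 1),
        (13, "i-c", 1), (14, "d-i", 1), (14, "i-e", 1), (15, "i-g", 1),
        (15, "g-d", 1), (15, "d-b", 1), (15, "b-a", 1)]

# Newly arrived sessions.
increment = {16: "efa", 17: "ef", 18: "efab"}

# Count the links contributed by the increment alone.
new_counts = Counter(f"{a}-{b}" for seq in increment.values() for a, b in zip(seq, seq[1:]))

old_nflt_counts = {link: count for _, link, count in nflt}
promoted = {}
for link, added in new_counts.items():
    if link in header_table:
        header_table[link] += added  # already frequent: just raise the count
    else:
        total = old_nflt_counts.get(link, 0) + added
        if total >= MSUPP_C_LINK:
            promoted[link] = total   # non-frequent link that became frequent
        # (links that stay below the threshold would get new NFLT records; omitted)

# Old sessions containing a promoted link must be fetched from the original database.
sids_to_reread = sorted({sid for sid, link, _ in nflt if link in promoted})
nflt = [record for record in nflt if record[1] not in promoted]
header_table.update(promoted)

print(dict(new_counts), promoted, sids_to_reread)
```

In this example e-f rises from 1 to 4 and moves to the header table, session 12 is flagged for retrieval, and the counts of f-a and a-b become 4 and 5.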

  31. Maintaining the FS-Tree Incrementally (Cont.) SID 12, InSeq cdefabc (Delete). Header Table (HT), Link Count ListH: d-g 4, g-i 2, c-d 7, d-e 6, e-h 3, h-i 2, c-b 2, b-c 5, a-b 5, b-d 2, f-a 4, e-f 4 (the same HT is shown on slides 32-34). [Figure: FS-Tree at this step]

  32. Maintaining the FS-Tree Incrementally (Cont.) Insert SID 16, InSeq efa. [Figure: FS-Tree after this insertion]

  33. Maintaining the FS-Tree Incrementally (Cont.) Insert SID 17, InSeq ef. [Figure: FS-Tree after this insertion]

  34. Maintaining the FS-Tree Incrementally (Cont.) Insert SID 18, InSeq efab. [Figure: FS-Tree after this insertion]

  35. Mining the FS-Tree Incrementally [Figure: links grouped into nine types (1-9) across the Header Table (HT) and the Non-Frequent Links Table (NFLT): Frequent Links, Potentially Frequent Links, Non-Frequent Links] • Type 1: • Mine for those Links if they are Affected • Type 2 and 4: • Mine for these Links • Type 3 and 5: • Delete Previously Discovered Patterns that Include these Links • Type 6, 7, 8, and 9: • Do Nothing

  36. Mining the FS-Tree Incrementally (Cont.) • The Answers
      Header Table (HT), Link Count ListH: d-g 4, g-i 2, c-d 7, d-e 6, e-h 3, h-i 2, c-b 2, b-c 5, a-b 5, b-d 2, f-a 4, e-f 4
      Link: a-b
        Derived Paths: (c-d:4, d-e:4, e-f:1, f-a:1, a-b:1), (a-b:2), (f-a:1, a-b:1), (e-f:3, f-a:2, a-b:1)
        Conditional Sequence Bases: (c-d:1, d-e:1, e-f:1, f-a:1), (f-a:1), (e-f:1, f-a:1)
        Conditional FS-Trees: (f-a:3)
        Frequent Sequences: <fab : 3>

  37. Mining the FS-Tree Incrementally (Cont.) • The Answers
      Link: f-a
        Derived Paths: (c-d:4, d-e:4, e-f:1, f-a:1), (f-a:1), (e-f:3, f-a:2)
        Conditional Sequence Bases: (c-d:1, d-e:1, e-f:1), (e-f:2)
        Conditional FS-Trees: (e-f:3)
        Frequent Sequences: <efa : 3>
      Link: e-f
        Derived Paths: (c-d:4, d-e:4, e-f:1), (e-f:3)
        Conditional Sequence Bases: (c-d:1, d-e:1)
        Conditional FS-Trees: ∅
        Frequent Sequences: ∅

  38. Interactive Mining • Setting the MSuppClink to a Small Enough Value • Enough Information in the FS-Tree • Without Referencing the Original Database

  39. Experimental Evaluation • MS Data Set • Microsoft Anonymous Web Data Set • 32,711 Sessions • 1 up to 35 page references • 294 distinct pages • MSNBC Data Set • MSNBC Anonymous Web Data Set • 989,818 Sessions • 1 up to several thousands of page references • 17 distinct pages • http://kdd.ics.uci.edu

  40. Experimental Evaluation (Cont.) • Scalability with the Number of Input Sessions • MS Data Set No MSuppRseq??

  41. Experimental Evaluation (Cont.) • Scalability with the Number of Input Sessions • MSNBC Data Set No MSuppRseq??

  42. Experimental Evaluation (Cont.) • Scalability with Support Threshold • MS Data Set

  43. Experimental Evaluation (Cont.) • Scalability with Support Threshold • MSNBC Data Set

  44. Experimental Evaluation (Cont.) • Incremental Mining • MS Data Set

  45. Experimental Evaluation (Cont.) • Incremental Mining • MSNBC Data Set

  46. Conclusions • Two Scans of the Input Database • Allows for Incremental Discovery of Frequent Sequences when the Input Database is Updated • Allows Interactive Response to Changes to the Minimum Support
