Discovery of Significant Usage Patterns from Clickstream Data

Margaret H. Dunham, Lin Lu. CSE Department, Southern Methodist University, Dallas, Texas 75275. mhd@engr.smu.edu. This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741.



  1. Discovery of Significant Usage Patterns from Clickstream Data Margaret H. Dunham, Lin Lu CSE Department Southern Methodist University Dallas, Texas 75275 mhd@engr.smu.edu This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741

  2. OUTLINE • Web Usage Mining Overview • Our Work: Significant Usage Patterns • Ongoing/Future Research

  3. Web Usage Mining Applications • Personalization • Improve structure of a site’s Web pages • Aid in caching and prediction of future page references • Improve design of individual pages • Improve effectiveness of e-commerce (sales and advertising)

  4. Web Usage Mining Activities • Preprocessing Web log: cleanse; remove extraneous information; sessionize (session: sequence of pages referenced by one user at a sitting) • Pattern Discovery: count patterns that occur in sessions; a pattern is a sequence of pages referenced in a session • Pattern Analysis
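The sessionize step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the slides do not say how "one sitting" is delimited, so the sketch assumes the common inactivity-timeout heuristic (30 minutes here), and the `(user_id, unix_seconds, page)` record layout and the name `sessionize` are hypothetical.

```python
from collections import defaultdict

def sessionize(hits, timeout=1800):
    """Group page hits into sessions.

    hits: iterable of (user_id, unix_seconds, page) tuples (assumed layout).
    timeout: assumed inactivity gap, in seconds, that ends a session.
    Returns a list of (user_id, [page, ...]) in first-seen user order.
    """
    by_user = defaultdict(list)
    for user, ts, page in hits:
        by_user[user].append((ts, page))

    sessions = []
    for user, events in by_user.items():
        events.sort()  # order each user's hits by timestamp
        current, last_ts = [], None
        for ts, page in events:
            # A gap longer than the timeout closes the current session.
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(page)
            last_ts = ts
        if current:
            sessions.append((user, current))
    return sessions
```

Because exact user identification is not possible (see the issues slide), `user_id` here stands for whatever proxy the log offers, such as an IP address or cookie.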

  5. Pattern Types • Association Rules: none of the properties hold • Episodes: only ordering holds • Sequential Patterns: ordered and maximal • Forward Sequences: ordered, consecutive, and maximal • Maximal Frequent Sequences: all properties hold • User Preferred Navigation Trail: not a true pattern, but representative of many

  6. Web Usage Mining Issues • Identification of exact user not possible. • Exact sequence of pages referenced by a user not possible due to caching. • Session not well defined • Security, privacy, and legal issues

  7. The BIG PICTURE: CAN’T SEE THE FOREST FOR THE TREES [Slide background: raw fixed-width clickstream log records dated 2003-10-05, one of which ends with the search term "corduroy+coats"] • S-P1-P2-P3-P4-P5-P6-C1-C2-E • S-P1-P2-P3-P4-P5-C4-I6-I7-I8-E

  8. Solution • Clustering • Abstraction • User Preferred Navigation Trails SIGNIFICANT USAGE PATTERNS

  9. [Process overview diagram; labels: Interests…; Motivations…; Web Server; WebLog; Preprocess Web Data (Cleanse, Sessionize, …); URL Abstraction; Cluster Web Sessions; Markov Model; Markov Model per Cluster; User-defined beginning/ending Web pages; Normalized Probability; User Preferred Navigation Trail; Significant Usage Pattern]

  10. Significant Usage Pattern (SUP): • A SUP is a path extracted from a Markov model, with user-defined starting and ending states, whose normalized product of probabilities along the path satisfies a given threshold. • Differences from previous research: - SUPs are extracted from clusters of user sessions - the user sessions are abstracted sessions - SUPs start and end at specific Web pages of interest to the user • A SUP need not be an exact pattern found in any session; rather, it is representative of the patterns found.
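The SUP definition can be sketched as a path search over a transition table. This is an illustration under stated assumptions, not the authors' algorithm: the slide does not say how the product of probabilities is normalized, so the sketch assumes the geometric mean (product raised to 1/number of transitions); the function name `find_sups`, the `max_len` cap, the restriction to loop-free paths, and the toy probabilities are all hypothetical.

```python
def find_sups(trans, start, end, threshold, max_len=10):
    """Enumerate loop-free paths from start to end whose normalized
    probability meets the threshold.

    trans: dict mapping state -> {next_state: probability}.
    Normalization is ASSUMED to be the geometric mean of the
    transition probabilities along the path.
    """
    results = []

    def dfs(path, prob):
        node = path[-1]
        if node == end and len(path) > 1:
            n_edges = len(path) - 1
            norm = prob ** (1.0 / n_edges)
            if norm >= threshold:
                results.append((tuple(path), norm))
            return
        if len(path) > max_len:  # hypothetical cap to bound the search
            return
        for nxt, p in trans.get(node, {}).items():
            if p > 0 and nxt not in path:  # skip revisits (no loops)
                dfs(path + [nxt], prob * p)

    dfs([start], 1.0)
    # Highest normalized probability first.
    return sorted(results, key=lambda r: -r[1])


# Toy model with made-up probabilities (not taken from the slides).
toy = {
    "S": {"1": 0.5, "2": 0.5},
    "1": {"E": 0.8, "2": 0.2},
    "2": {"E": 1.0},
}
sups = find_sups(toy, "S", "E", threshold=0.6)
```

Normalizing by path length keeps long paths from being penalized merely for having more transitions, which is one plausible reason for the "normalized" qualifier in the definition.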

  11. [Process diagram] Sessionized Web Log → (Abstraction Hierarchy) Sub-abstract URLs → Sub-Abstracted Sessions → Apply Needleman-Wunsch global alignment algorithm → Similarity Matrix → Apply nearest neighbor clustering algorithm → Clusters of User Sessions → (Abstraction Hierarchy) Concept-based Abstracted URLs → Concept-based Abstracted Sessions per Cluster → Build Markov model for each cluster → Transition Matrix per Cluster → Pattern Discovery → Patterns per Cluster Model
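The Needleman-Wunsch step can be sketched with the standard dynamic-programming recurrence for global alignment. The scoring here is an assumption: the slides do not give the gap penalty or how page similarity is turned into a match score, so the sketch defaults to +1 for identical pages, -1 for different pages, and -1 per gap, with a pluggable `score` function where a weighted page similarity could be substituted. The name `nw_score` is hypothetical.

```python
def nw_score(s, t, score=lambda a, b: 1.0 if a == b else -1.0, gap=-1.0):
    """Needleman-Wunsch global alignment score of two sessions.

    s, t: sequences of (abstracted) page identifiers.
    score: pairwise page score (ASSUMED default: +1 match, -1 mismatch).
    gap: ASSUMED penalty for aligning a page against nothing.
    """
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty session costs one gap per page.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align pages
                dp[i - 1][j] + gap,                            # gap in t
                dp[i][j - 1] + gap,                            # gap in s
            )
    return dp[-1][-1]
```

Computing `nw_score` for every pair of sessions fills a similarity matrix of the kind the pipeline then passes to clustering.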

  12. Abstract Web session data. Fig 2. Hierarchy of JCPenney Web site: JCPenney Homepage; Department level (D1, D2, …, Dn); Category level (C1, …, Cn); Item level (I1, …, In). Web session example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015

  13. Alignment of Web Sessions • Compute the similarity between any two Web pages. • The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it should be given more weight. • Step 1: compare the two Web page representation strings from left to right and stop at the first pair where they differ. • Step 2: compute the ratio of the sum of the weights of the matching parts to the sum of the total weights. • Example: Page 1: D0|C875|I, weight = 6+1+4+1+2 = 14; Page 2: D0|C875, weight = 6+1+4+1 = 12; Similarity = 12/14 = 0.857
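The two steps and the worked example can be reproduced with a short sketch. Several details are assumptions read off the example rather than stated on the slide: level letters D, C, I carry weights 6, 4, 2; each numeric identifier carries weight 1 (the item-identifier weight never appears in the example and is a guess); and the denominator is taken to be the larger of the two pages' total weights. The names `tokenize` and `page_similarity` are hypothetical.

```python
# Weights inferred from the slide's example (6+1+4+1+2 = 14).
LEVEL_WEIGHT = {"D": 6, "C": 4, "I": 2}
ID_WEIGHT = 1  # assumption: every identifier counts 1, including item IDs

def tokenize(page):
    """Split 'D0|C875|I' into weighted tokens:
    [('D', 6), ('0', 1), ('C', 4), ('875', 1), ('I', 2)].
    Assumes a well-formed string with non-empty '|'-separated parts."""
    tokens = []
    for part in page.split("|"):
        level, ident = part[0], part[1:]
        tokens.append((level, LEVEL_WEIGHT[level]))
        if ident:
            tokens.append((ident, ID_WEIGHT))
    return tokens

def page_similarity(a, b):
    ta, tb = tokenize(a), tokenize(b)
    # Step 1: scan left to right, stopping at the first differing pair.
    matched = 0
    for (sym_a, w), (sym_b, _) in zip(ta, tb):
        if sym_a != sym_b:
            break
        matched += w
    # Step 2: matched weight over total weight (ASSUMED: the larger total).
    total = max(sum(w for _, w in ta), sum(w for _, w in tb))
    return matched / total
```

With these assumptions, `page_similarity("D0|C875|I", "D0|C875")` evaluates to 12/14, matching the slide's 0.857 after rounding.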

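  14. Generating Significant Usage Patterns [Figure: Markov model graph with start state S, end state E, and intermediate states 1 to 5; edges are labeled with transition probabilities (values shown include 0.2, 0.4, 0.5, 0.6, and 1)]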
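A state graph like the one on this slide can be estimated from the sessions in one cluster. The sketch below assumes a first-order Markov model with maximum-likelihood (count-based) transition estimates and artificial start and end states named "S" and "E" as in the figure; the slides do not spell out the estimation procedure, and the name `transition_probs` is hypothetical.

```python
from collections import Counter, defaultdict

def transition_probs(sessions, start="S", end="E"):
    """Estimate first-order transition probabilities from sessions.

    sessions: iterable of page sequences belonging to one cluster.
    Each session is wrapped in artificial start/end states, and each
    probability is the observed transition count divided by the total
    number of transitions leaving that state.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        path = [start, *session, end]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }
```

The resulting nested dictionary is a sparse form of the per-cluster transition matrix: only transitions that actually occur are stored.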

  15. Examples

  16. Experimental Result

  17. Future/Ongoing Research • Scalability • Fewer patterns • Smaller patterns • MM less space than table • Clusters to identify Behaviors • Business vs Leisure • Cloaked Crawler • Online Identification of Cluster
