Discovery of Significant Usage Patterns from Clickstream Data. Margaret H. Dunham, Lin Lu. CSE Department, Southern Methodist University, Dallas, Texas 75275. mhd@engr.smu.edu. This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741.
OUTLINE • Web Usage Mining Overview • Our Work: Significant Usage Patterns • Ongoing/Future Research
Web Usage Mining Applications • Personalization • Improve structure of a site’s Web pages • Aid in caching and prediction of future page references • Improve design of individual pages • Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Activities • Preprocessing the Web log: cleanse; remove extraneous information; sessionize (a session is the sequence of pages referenced by one user at a sitting) • Pattern Discovery: count patterns that occur in sessions; a pattern is a sequence of pages referenced in a session • Pattern Analysis
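The sessionize and pattern-counting steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(user, timestamp, page)` record layout, the 30-minute inactivity timeout, and the restriction to fixed-length consecutive patterns are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def sessionize(records, timeout=1800):
    """Split (user, timestamp_seconds, page) records into per-user sessions.

    Assumption: a gap longer than `timeout` seconds starts a new session.
    """
    by_user = defaultdict(list)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

def count_patterns(sessions, length=2):
    """Count consecutive page sequences of a fixed length (an assumed pattern type)."""
    counts = Counter()
    for _, pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

# Hypothetical records: (user id, seconds since midnight, page)
records = [("u1", 0, "A"), ("u1", 60, "B"), ("u1", 4000, "C"), ("u2", 10, "A")]
sessions = sessionize(records)
print(sessions)
print(count_patterns(sessions))
```

The timeout is only one possible session boundary rule; the deck itself notes that a session is not well defined.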
Pattern Types • Association Rules: none of the properties hold • Episodes: only ordering holds • Sequential Patterns: ordered and maximal • Forward Sequences: ordered, consecutive, and maximal • Maximal Frequent Sequences: all properties hold • User Preferred Navigation Trail: not a true pattern, but representative of many
Web Usage Mining Issues • The exact user cannot be identified. • The exact sequence of pages referenced by a user cannot be recovered because of caching. • A session is not well defined. • Security, privacy, and legal issues
2003-10-0515:49:20050721435700000026210000000000 0265202652000000000
2003-10-0516:40:49050832595900000872710001142380 0710707107000000000
2003-10-0504:55:10050767799900000191300000670518 0000000000000000000
2003-10-0509:43:10050781766100000603030000000000 0365700469000000000
2003-10-0514:49:360508182420000007066200000000000811a39 0914207107000000000
2003-10-0521:23:57050759031600000465050002794335 1199207107000000000
2003-10-0511:30:16050730512600000465050000195747 1684600597corduroy+coats
The BIG PICTURE CAN’T SEE THE FOREST FOR THE TREES • S-P1-P2-P3-P4-P5-P6-C1-C2-E • S-P1-P2-P3-P4-P5-C4-I6-I7-I8-E
Solution • Clustering • Abstraction • User Preferred Navigation Trails SIGNIFICANT USAGE PATTERNS
[Flow diagram] Interests… Motivations… Web Server → WebLog → Preprocess Web Data (Cleanse, Sessionize, …) → URL Abstraction → Cluster Web Sessions → Markov Model per Cluster → Significant Usage Pattern / User Preferred Navigation Trail. Additional inputs shown: user-defined beginning/ending Web pages; normalized probability.
Significant Usage Pattern (SUP): • A SUP is a path extracted from a Markov model, with user-defined starting and ending states, whose normalized product of probabilities along the path satisfies a given threshold. • Differences from previous research: SUPs are extracted from clusters of user sessions; the user sessions are abstracted sessions; paths start and end with specific Web pages of interest to the user. • A SUP need not be an exact pattern found in any session; rather, it is representative of the patterns found.
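The SUP definition can be illustrated with a short path search. This is a sketch under stated assumptions, not the authors' algorithm: the slides do not say how the product is normalized, so the code assumes the geometric mean of the transition probabilities (product raised to 1/number of transitions); the transition values, the `max_len` cap, and the restriction to paths without repeated states are also hypothetical.

```python
def significant_usage_patterns(trans, start, end, threshold, max_len=10):
    """Enumerate simple paths from `start` to `end` in a Markov model.

    `trans` maps state -> {next_state: probability}.
    Assumed normalization: product of probabilities ** (1 / transitions).
    """
    results = []

    def walk(path, prob):
        state = path[-1]
        if state == end and len(path) > 1:
            steps = len(path) - 1
            normalized = prob ** (1.0 / steps)
            if normalized >= threshold:
                results.append((path, normalized))
            return
        if len(path) > max_len:
            return  # hypothetical cap to keep the search bounded
        for nxt, p in trans.get(state, {}).items():
            if p > 0 and nxt not in path:
                walk(path + [nxt], prob * p)

    walk([start], 1.0)
    return sorted(results, key=lambda r: -r[1])

# Hypothetical transition probabilities (not taken from the slides)
trans = {
    "S": {"1": 0.6, "2": 0.4},
    "1": {"3": 0.5, "E": 0.5},
    "2": {"3": 1.0},
    "3": {"E": 1.0},
}
for path, score in significant_usage_patterns(trans, "S", "E", threshold=0.6):
    print("-".join(path), round(score, 3))
```

With this normalization a longer path is not penalized merely for having more transitions, which is one plausible reason to normalize at all; a different normalization would change which paths pass the threshold.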
[Flow diagram] Sessionized Web Log → sub-abstract URLs (using the Abstraction Hierarchy) → Sub-Abstracted Sessions → apply Needleman-Wunsch global alignment algorithm → Similarity Matrix → apply nearest neighbor clustering algorithm → Clusters of User Sessions → concept-based abstracted URLs (using the Abstraction Hierarchy) → Concept-based Abstracted Sessions per Cluster → build Markov model for each cluster → Transition Matrix per Cluster → Pattern Discovery → Patterns per Cluster Model
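The alignment and clustering stages of this pipeline could look roughly like the sketch below. The Needleman-Wunsch recurrence is the standard one; everything else is an assumption for illustration: the gap penalty of -1, the exact-match page scorer (the deck uses a weighted page similarity instead), the normalization by the longer session's length, and the simple threshold-based nearest neighbor rule.

```python
def needleman_wunsch(a, b, sim, gap=-1.0):
    """Global alignment score of two page sequences (standard DP recurrence)."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align two pages
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    return score[n][m]

def exact_match(p, q):
    """Placeholder page scorer (assumption): +1 for identical pages, -1 otherwise."""
    return 1.0 if p == q else -1.0

def similarity_matrix(sessions, sim=exact_match, gap=-1.0):
    """Pairwise alignment scores, divided by the longer session length (assumed)."""
    n = len(sessions)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            longest = max(len(sessions[i]), len(sessions[j])) or 1
            matrix[i][j] = needleman_wunsch(sessions[i], sessions[j], sim, gap) / longest
    return matrix

def nearest_neighbor_clusters(matrix, threshold):
    """Each session joins its most similar earlier session's cluster if the
    similarity reaches `threshold`; otherwise it starts a new cluster."""
    labels = []
    for i in range(len(matrix)):
        if i == 0:
            labels.append(0)
            continue
        j = max(range(i), key=lambda k: matrix[i][k])
        if matrix[i][j] >= threshold:
            labels.append(labels[j])
        else:
            labels.append(max(labels) + 1)
    return labels

# Hypothetical abstracted sessions
sessions = [["P1", "P2", "P3"], ["P1", "P2", "P3"], ["C1", "C2"]]
print(nearest_neighbor_clusters(similarity_matrix(sessions), threshold=0.5))
```

The result of this clustering rule depends on the order in which sessions are processed, which is worth keeping in mind when comparing cluster assignments across runs.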
Fig 2. Hierarchy of JCPenney Web site: JCPenney Homepage → Department level (D1, D2, …, Dn) → Category level (C1, …, Cn) → Item level (I1, …, In). Abstract Web session data • Web session example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015
Alignment of Web Sessions • Compute the similarity between any two Web pages. • The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it should be given more weight. • Step 1: compare the two Web page representation strings from left to right and stop at the first pair where they differ. • Step 2: compute the ratio of the sum of the weights of the matching parts to the sum of the total weights. • Example: Page 1: D0|C875|I, weight = 6+1+4+1+2 = 14; Page 2: D0|C875, weight = 6+1+4+1 = 12; Similarity = 12/14 = 0.857
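The two steps can be sketched in code, with the caveat that the slide gives only the example sums and not the weighting scheme itself. The sketch assumes one reading that reproduces those sums: the level letter carries the level weight (D = 6, C = 4, I = 2), any identifier following the letter carries weight 1, and the denominator is the larger of the two pages' total weights. The function names are hypothetical, and only D/C/I fields are handled.

```python
# Assumed weights, inferred from the example sum 6+1+4+1+2
LEVEL_WEIGHTS = {"D": 6, "C": 4, "I": 2}
ID_WEIGHT = 1

def weighted_parts(page):
    """Split a page string such as 'D0|C875|I' into (token, weight) pairs."""
    parts = []
    for field in page.split("|"):
        if not field:
            continue
        letter, ident = field[0], field[1:]
        parts.append((letter, LEVEL_WEIGHTS[letter]))
        if ident:
            parts.append((ident, ID_WEIGHT))
    return parts

def page_similarity(a, b):
    """Weight of the matching left-to-right prefix over the larger total weight."""
    pa, pb = weighted_parts(a), weighted_parts(b)
    matched = 0
    for (token_a, weight_a), (token_b, _) in zip(pa, pb):
        if token_a != token_b:
            break  # step 1: stop at the first differing pair
        matched += weight_a
    total = max(sum(w for _, w in pa), sum(w for _, w in pb))
    return matched / total if total else 0.0  # step 2: ratio of weights

print(round(page_similarity("D0|C875|I", "D0|C875"), 3))
```

Under these assumptions the slide's example comes out as 12/14. A function like this could be passed as the page scorer to a session alignment routine, possibly after rescaling so that dissimilar pages receive a negative score.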
Generating Significant Usage Patterns [Figure: example Markov model with start state S, end state E, intermediate states 1 to 5, and transition probabilities labeling the edges]
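A Markov model of this kind could be estimated from one cluster's abstracted sessions roughly as follows. This is a sketch, assuming first-order transitions estimated by relative frequency and artificial `S` and `E` states wrapped around each session; the function name and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict

def build_transition_matrix(sessions, start="S", end="E"):
    """Estimate first-order transition probabilities from page sequences.

    Each session is wrapped with artificial start and end states, and each
    probability is the relative frequency of that transition out of its state.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        states = [start] + list(pages) + [end]
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

# Hypothetical abstracted sessions from a single cluster
sessions = [["P1", "C1"], ["P1", "P2"]]
for state, row in build_transition_matrix(sessions).items():
    print(state, row)
```

Storing the model as a nested dictionary keeps only the transitions that were actually observed, which tends to be much smaller than a full state-by-state table.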
Future/Ongoing Research • Scalability • Fewer patterns • Smaller patterns • Markov model (MM) takes less space than a table • Clusters to identify behaviors • Business vs. leisure • Cloaked crawler • Online identification of cluster