Discovery of Significant Usage Patterns from Clickstream Data

Margaret H. Dunham, Lin Lu
CSE Department, Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741.

OUTLINE
• Web Usage Mining Overview
• Our Work: Significant Usage Patterns
• Ongoing/Future Research

Web Usage Mining Applications
• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
• Preprocessing the Web log
  - Cleanse
  - Remove extraneous information
  - Sessionize (session: the sequence of pages referenced by one user at a sitting)
• Pattern Discovery
  - Count patterns that occur in sessions
  - A pattern is a sequence of pages referenced in a session
• Pattern Analysis
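The sessionizing step can be sketched as follows. This is only an illustration under stated assumptions: the `(user, timestamp, page)` record layout and the 30-minute inactivity timeout are hypothetical choices (the timeout is a common heuristic, not something the slides specify), and the function name `sessionize` is invented for this example.

```python
from collections import defaultdict


def sessionize(records, timeout_seconds=1800):
    """Split each user's page hits into sessions at long gaps of inactivity.

    records: iterable of (user, timestamp_in_seconds, page) tuples (assumed layout).
    timeout_seconds: assumed inactivity cutoff; not taken from the slides.
    Returns a list of (user, [page, ...]) pairs, one per session.
    """
    hits_by_user = defaultdict(list)
    for user, timestamp, page in records:
        hits_by_user[user].append((timestamp, page))

    sessions = []
    for user in sorted(hits_by_user):
        hits = sorted(hits_by_user[user])  # order each user's hits by time
        current = [hits[0][1]]
        last_timestamp = hits[0][0]
        for timestamp, page in hits[1:]:
            if timestamp - last_timestamp > timeout_seconds:
                # Gap too long: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(page)
            last_timestamp = timestamp
        sessions.append((user, current))
    return sessions


example_records = [
    ("u1", 0, "A"),
    ("u1", 60, "B"),
    ("u1", 4000, "C"),  # more than 1800 s after the previous hit
    ("u2", 10, "A"),
]
print(sessionize(example_records))
```

Because the exact user cannot always be identified, the `user` key here would in practice be some proxy such as an IP address or cookie value.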

Pattern Types
• Association Rules: none of the properties hold
• Episodes: only ordering holds
• Sequential Patterns: ordered and maximal
• Forward Sequences: ordered, consecutive, and maximal
• Maximal Frequent Sequences: all properties hold
• User Preferred Navigation Trail: not a true pattern, but representative of many

Web Usage Mining Issues
• The exact user cannot be identified.
• The exact sequence of pages referenced by a user cannot be recovered because of caching.
• A session is not well defined.
• Security, privacy, and legal issues

The Big Picture: Can’t See the Forest for the Trees
Raw clickstream records: 2003-10-0515:49:20050721435700000026210000000000 02652026520000000002003-10-0516:40:49050832595900000872710001142380 07107071070000000002003-10-0504:55:10050767799900000191300000670518 00000000000000000002003-10-0509:43:10050781766100000603030000000000 03657004690000000002003-10-0514:49:360508182420000007066200000000000811a39 09142071070000000002003-10-0521:23:57050759031600000465050002794335 11992071070000000002003-10-0511:30:16050730512600000465050000195747 1684600597corduroy+coats
• S-P1-P2-P3-P4-P5-P6-C1-C2-E
• S-P1-P2-P3-P4-P5-C4-I6-I7-I8-E

Solution: SIGNIFICANT USAGE PATTERNS
• Clustering
• Abstraction
• User Preferred Navigation Trails

[Figure: system overview. Labels: Web server; Web log; preprocess Web data (cleanse, sessionize, …); URL abstraction; cluster Web sessions; Markov model per cluster; Markov model; user-defined beginning/ending Web pages; normalized probability; Significant Usage Pattern; User Preferred Navigation Trail; interests, motivations.]

Significant Usage Pattern (SUP)
• A SUP is a path extracted from a Markov model, with user-defined starting and ending states, whose normalized product of probabilities along the path satisfies a given threshold.
• Differences from previous research:
  - SUPs are extracted from clusters of user sessions
  - User sessions are abstracted sessions
  - Patterns start and end at specific Web pages of interest to the user
• A SUP need not be an exact pattern found in any session; rather, it is representative of the patterns found.
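To make the definition concrete, here is a minimal sketch of extracting such paths from a Markov model. Several things are assumptions rather than details from the slides: the normalization is taken to be the geometric mean of the transition probabilities (the product raised to 1 over the number of transitions), only loop-free paths are enumerated, `max_states` is a hypothetical bound on path length, and the toy model is invented for illustration.

```python
def significant_usage_patterns(transitions, start, end, threshold, max_states=10):
    """Enumerate loop-free paths from start to end whose normalized
    probability meets the threshold.

    transitions: dict mapping state -> {next_state: probability}.
    Normalization (an assumption): geometric mean of the probabilities on the path.
    Returns a list of (path, normalized_probability), best first.
    """
    if start == end:
        raise ValueError("start and end states must differ")

    results = []

    def extend(state, path, product):
        if state == end:
            steps = len(path) - 1
            normalized = product ** (1.0 / steps)
            if normalized >= threshold:
                results.append((path, normalized))
            return
        if len(path) >= max_states:
            return  # give up on paths longer than the assumed bound
        for next_state, probability in transitions.get(state, {}).items():
            if probability > 0 and next_state not in path:
                extend(next_state, path + [next_state], product * probability)

    extend(start, [start], 1.0)
    return sorted(results, key=lambda item: -item[1])


# Hypothetical toy model (not the model shown in the slides).
toy_model = {
    "S": {"1": 0.6, "2": 0.4},
    "1": {"E": 0.5, "2": 0.5},
    "2": {"E": 1.0},
}
for path, score in significant_usage_patterns(toy_model, "S", "E", threshold=0.6):
    print("-".join(path), round(score, 3))
```

Under this normalization a longer path is not automatically penalized for having more transitions, which is why the search cannot prune a partial path just because its raw product has dropped below the threshold.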

[Figure: processing pipeline]
• Sessionized Web log
• Sub-abstract URLs using the abstraction hierarchy, giving sub-abstracted sessions
• Apply the Needleman-Wunsch global alignment algorithm, giving a similarity matrix
• Apply the nearest neighbor clustering algorithm, giving clusters of user sessions
• Apply concept-based URL abstraction using the abstraction hierarchy, giving concept-based abstracted sessions per cluster
• Build a Markov model for each cluster, giving a transition matrix per cluster
• Pattern discovery, giving patterns per cluster model
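Two of the pipeline stages, clustering from a similarity matrix and building a per-cluster Markov model, can be sketched as below. This is a hedged illustration, not the authors' implementation: the clustering is a simple single-pass nearest neighbor scheme with an assumed similarity `threshold`, the transition probabilities are plain relative frequencies, and the function names and the `S`/`E` state labels are choices made for this example.

```python
from collections import defaultdict


def nearest_neighbor_clusters(similarity, threshold):
    """Single-pass nearest neighbor clustering over a similarity matrix.

    Each session joins the cluster of its most similar earlier session if
    that similarity is at least `threshold` (assumed parameter); otherwise
    it starts a new cluster. Returns one cluster label per session.
    """
    labels = []
    next_label = 0
    for i in range(len(similarity)):
        best_index, best_similarity = None, threshold
        for j in range(i):
            if similarity[i][j] >= best_similarity:
                best_index, best_similarity = j, similarity[i][j]
        if best_index is None:
            labels.append(next_label)
            next_label += 1
        else:
            labels.append(labels[best_index])
    return labels


def build_transition_model(sessions, start="S", end="E"):
    """Estimate transition probabilities from the sessions of one cluster.

    Each session is wrapped in artificial start and end states, transitions
    are counted, and each row is normalized to sum to 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        states = [start] + list(session) + [end]
        for current, following in zip(states, states[1:]):
            counts[current][following] += 1
    model = {}
    for current, row in counts.items():
        total = sum(row.values())
        model[current] = {following: n / total for following, n in row.items()}
    return model


similarity = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
print(nearest_neighbor_clusters(similarity, threshold=0.5))
print(build_transition_model([["D0", "C1"], ["D0", "C2"], ["C1"]]))
```

Storing only the observed transitions in nested dictionaries keeps the model sparse, which matters when the number of abstracted pages is large.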

Abstract Web Session Data
Fig 2. Hierarchy of JCPenney Web site:
• JCPenney Homepage
  - Department level: D1, D2, …, Dn
  - Category level: C1, …, Cn
  - Item level: I1, …, In
Web session example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015

Alignment of Web Sessions
• Compute the similarity between any two Web pages.
• The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it is given more weight.
• Procedure:
  - Step 1: compare the two Web page representation strings from left to right and stop at the first pair of parts that differ.
  - Step 2: compute the ratio of the sum of the weights of the matching parts to the sum of the total weights.
• Example
  - Page 1: D0|C875|I, weight = 6+1+4+1+2 = 14
  - Page 2: D0|C875, weight = 6+1+4+1 = 12
  - Similarity = 12/14 = 0.857
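The page similarity and the session alignment it feeds can be sketched as follows. The weights are inferred from the arithmetic in the example (6, 4, and 2 for the level letters D, C, and I, and 1 for each numeric identifier), and using the larger of the two page totals as the denominator is an assumption that reproduces 12/14. The gap penalty of -1 and the use of the raw similarity as the substitution score are also assumptions, as are all function names.

```python
import re

# Weights inferred from the slide's example; treat them as assumptions.
LEVEL_WEIGHTS = {"D": 6, "C": 4, "I": 2}
ID_WEIGHT = 1


def weighted_parts(page):
    """Split 'D0|C875|I' into [('D', 6), ('0', 1), ('C', 4), ('875', 1), ('I', 2)]."""
    parts = []
    for field in page.split("|"):
        match = re.fullmatch(r"([A-Z])(\d*)", field)
        if match is None or match.group(1) not in LEVEL_WEIGHTS:
            raise ValueError(f"unsupported page string: {page!r}")
        letter, digits = match.groups()
        parts.append((letter, LEVEL_WEIGHTS[letter]))
        if digits:
            parts.append((digits, ID_WEIGHT))
    return parts


def page_similarity(page_a, page_b):
    """Weight of the matching left-to-right prefix over the larger total weight."""
    parts_a, parts_b = weighted_parts(page_a), weighted_parts(page_b)
    matched = 0
    for (text_a, weight), (text_b, _) in zip(parts_a, parts_b):
        if text_a != text_b:
            break  # step 1: stop at the first differing pair
        matched += weight
    total = max(sum(w for _, w in parts_a), sum(w for _, w in parts_b))
    return matched / total  # step 2: ratio of matched weight to total weight


def needleman_wunsch_score(session_a, session_b, score=page_similarity, gap=-1.0):
    """Optimal global alignment score of two sessions (lists of page strings)."""
    n, m = len(session_a), len(session_b)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(session_a[i - 1], session_b[j - 1]),
                table[i - 1][j] + gap,  # gap in session_b
                table[i][j - 1] + gap,  # gap in session_a
            )
    return table[n][m]


print(round(page_similarity("D0|C875|I", "D0|C875"), 3))
print(needleman_wunsch_score(["D0", "D0|C875"], ["D0"]))
```

The alignment score is returned unnormalized; turning it into an entry of the similarity matrix would need a normalization step that the slides do not describe.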

Generating Significant Usage Patterns
[Figure: example Markov model with start state S, end state E, and intermediate states 1 to 5; edges are labeled with transition probabilities.]

Examples

Experimental Result

Future/Ongoing Research
• Scalability
• Fewer patterns
• Smaller patterns
• Markov model (MM) takes less space than a table
• Clusters to identify behaviors
• Business vs. leisure
• Cloaked crawler
• Online identification of cluster
