Discovery of Significant Usage Patterns from Clusters of Clickstream Data

Lin Lu, Margaret Dunham, and Yu Meng
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275-0122
llu(mhd,ymeng)@engr.smu.edu
WebKDD’05

Introduction
• Significant Usage Patterns (SUPs)
  - Extracted from clusters of abstracted user sessions
  - Use a unique two-phase abstraction technique
  - Can be restricted to desired beginning and/or ending Web pages
  - Carry a normalized probability

Model
Sessionized Web Log + Abstraction Hierarchy
  -> Sub-abstract URLs -> Sub-Abstracted Sessions
  -> Apply Needleman-Wunsch global alignment algorithm -> Similarity Matrix
  -> Apply Nearest neighbor clustering algorithm -> Clusters of User Sessions
Clusters of User Sessions + Abstraction Hierarchy
  -> Concept-based Abstracted URLs -> Concept-based Abstracted Sessions per Cluster
  -> Build Markov model for each cluster -> Transition Matrix per Cluster
  -> Pattern Discovery -> SUPs per Cluster

Alignment of Web sessions
Fig 1. Hierarchy of J.C. Penney Web site: JCPenney Homepage; Department level (D1, D2, …, Dn); Category level (C1, …, Cn); Item level (I1, …, In)
• Create sub-abstracted Web sessions
  URL -> {<Concept hierarchy keyword> <Unique ID> <|>}
  Example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015

Alignment of Web sessions
• Computing the similarity between any two Web pages
• The higher a level is in the hierarchy, the more it matters in determining the similarity of two Web pages, so it should be given more weight.
• Scoring scheme
  - step 1: determine the longer of the two Web page representation strings.
  - step 2: assign a weight to each level in the hierarchy: in the longer page representation string, the abstract level of the lowest level is given weight 2, the abstract level of the second-lowest level is given weight 4, and so on. Each corresponding ID is always given weight 1.

Alignment of Web sessions
• Computing the similarity between any two Web pages
  - step 1: compare the two Web page representation strings from left to right and stop at the first pair of parts that differ.
  - step 2: compute the ratio of the sum of the weights of the matching parts to the weight of the longer page representation string.
  Example:
    Page 1: D0|C875|I    Weight = 6+1+4+1+2 = 14
    Page 2: D0|C875      Weight = 6+1+4+1 = 12
    Similarity = 12/14 = 0.857
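The weighting and comparison steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the names `weighted_parts` and `page_similarity` are hypothetical, and the sketch assumes that each level is a one-letter keyword optionally followed by an ID, with levels separated by `|`.

```python
def weighted_parts(page, depth):
    """Flatten 'D0|C875|I' into [(text, weight), ...] for a hierarchy of `depth` levels.

    Assumption: each level is a one-letter keyword followed by an optional ID.
    """
    parts = []
    for k, level in enumerate(page.split("|")):
        keyword, ident = level[:1], level[1:]
        parts.append((keyword, 2 * (depth - k)))  # lowest level -> 2, next -> 4, ...
        if ident:
            parts.append((ident, 1))              # an ID always weighs 1
    return parts


def page_similarity(page_a, page_b):
    """Ratio of the matched prefix weight to the weight of the longer representation."""
    depth = max(page_a.count("|"), page_b.count("|")) + 1
    parts_a = weighted_parts(page_a, depth)
    parts_b = weighted_parts(page_b, depth)
    total = max(sum(w for _, w in parts_a), sum(w for _, w in parts_b))
    matched = 0
    for part_a, part_b in zip(parts_a, parts_b):
        if part_a != part_b:      # stop at the first pair that differs
            break
        matched += part_a[1]
    return matched / total


print(round(page_similarity("D0|C875|I", "D0|C875"), 3))  # 0.857, as in the slide example
```

Weights are derived from the deeper of the two pages, so the shorter page's parts receive the same weights as the matching positions in the longer one.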

Model, current step: Apply Needleman-Wunsch global alignment algorithm (Sub-Abstracted Sessions -> Similarity Matrix)

Alignment of Web sessions
• Computing the optimal alignment of two sequences using the Needleman-Wunsch algorithm
  A(i, j) = max[A(i-1, j-1) + s(Xi, Yj); A(i-1, j) - d; A(i, j-1) - d]
  where s(Xi, Yj) is the similarity between Xi and Yj, and d is the penalty for aligning Xi (or Yj) with a gap
  [Figure: cell A(i, j) is computed from its neighbors A(i-1, j-1), A(i-1, j), and A(i, j-1)]
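The recurrence can be sketched as a small dynamic program. The function name `needleman_wunsch` and the toy scoring function are illustrative assumptions; the first row and column use the standard global-alignment initialization (i gaps cost i times d), which the slide does not show.

```python
def needleman_wunsch(xs, ys, score, gap):
    """Return the optimal global alignment score of sequences xs and ys.

    score(x, y) plays the role of s(Xi, Yj); gap is the penalty d (a positive number).
    """
    rows, cols = len(xs) + 1, len(ys) + 1
    A = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        A[i][0] = -i * gap                 # xs[:i] aligned entirely with gaps
    for j in range(1, cols):
        A[0][j] = -j * gap                 # ys[:j] aligned entirely with gaps
    for i in range(1, rows):
        for j in range(1, cols):
            A[i][j] = max(
                A[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]),  # align Xi with Yj
                A[i - 1][j] - gap,                              # align Xi with a gap
                A[i][j - 1] - gap,                              # align Yj with a gap
            )
    return A[-1][-1]


# Toy scoring: 20 for identical symbols, -10 otherwise; gap penalty d = 10.
toy_score = lambda x, y: 20 if x == y else -10
print(needleman_wunsch(["a", "b", "c"], ["a", "c"], toy_score, 10))  # 30.0
```

In the toy run the best alignment pairs a with a and c with c and puts b against a gap: 20 - 10 + 20 = 30.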

Alignment of Web sessions
• Apply Needleman-Wunsch global alignment algorithm
• Scoring scheme [3]
  - if (matching) score = 20; // a pair of Web pages with similarity 1
  - else if (mis-matching) score = -10; // a pair of Web pages with similarity 0
  - else if (gap) score = -10; // a Web page aligns with a gap
  - else score = -10 ~ 20; // a pair of Web pages with similarity between 0 and 1
• Example:
  P47104 D0|C0|I D469|C469 D2652|C2652
  D469|C16758|I D0|C0|I D469|C469
  Thus, session similarity = 32.1/4 = 8.025
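One way to read this scoring scheme in code is sketched below. Two points are assumptions rather than statements from the slide: partial similarities are mapped to scores by linear interpolation between -10 and 20, and the alignment score is divided by the length of the longer session (4 in the example). The names `pair_score` and `session_similarity` are hypothetical.

```python
MATCH_SCORE = 20.0      # pair of Web pages with similarity 1
MISMATCH_SCORE = -10.0  # pair of Web pages with similarity 0
GAP_SCORE = -10.0       # Web page aligned with a gap


def pair_score(similarity):
    """Map a page similarity in [0, 1] to an alignment score in [-10, 20].

    Assumption: linear interpolation between the two endpoints given on the slide.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    return MISMATCH_SCORE + (MATCH_SCORE - MISMATCH_SCORE) * similarity


def session_similarity(alignment_score, len_a, len_b):
    """Normalize an alignment score by the longer session length (assumed divisor)."""
    return alignment_score / max(len_a, len_b)


print(pair_score(1.0), pair_score(0.0))  # 20.0 -10.0
print(session_similarity(32.1, 4, 3))    # the slide's example: 32.1 / 4
```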

Model, current step: Apply Nearest neighbor clustering algorithm (Similarity Matrix -> Clusters of User Sessions)
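The slides name the clustering step without giving its details. The sketch below shows a common threshold-based form of nearest neighbor clustering over a session similarity matrix: each session joins the cluster of its most similar already-processed session if that similarity reaches a threshold, and otherwise starts a new cluster. The threshold value, the function name, and the toy matrix are assumptions for illustration only.

```python
def nearest_neighbor_clusters(similarity, threshold):
    """Cluster items 0..n-1 given a symmetric n x n similarity matrix.

    Returns a list of clusters, each a list of item indices.
    """
    n = len(similarity)
    if n == 0:
        return []
    clusters = [[0]]        # the first item seeds the first cluster
    assignment = {0: 0}     # item index -> cluster index
    for i in range(1, n):
        # Most similar item among those already clustered.
        nearest = max(range(i), key=lambda j: similarity[i][j])
        if similarity[i][nearest] >= threshold:
            cluster_id = assignment[nearest]
            clusters[cluster_id].append(i)
        else:
            cluster_id = len(clusters)
            clusters.append([i])
        assignment[i] = cluster_id
    return clusters


# Hypothetical session similarities (not taken from the slides).
toy_matrix = [
    [20, 15, -5, -8],
    [15, 20, -6, -7],
    [-5, -6, 20, 12],
    [-8, -7, 12, 20],
]
print(nearest_neighbor_clusters(toy_matrix, threshold=5))  # [[0, 1], [2, 3]]
```

The result of this variant depends on the order in which sessions are processed and on the chosen threshold.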


Create Concept-based Abstracted Sessions
• Represent the abstracted page accesses in a session as a sequence like: P1 D1 C1 I1 P2 D2 C2 I2 …
• Within a session, the same Pi, Di, Ci, or Ii (i = 1, 2, …) always represents the same page. Across sessions, however, the same page may be represented by different elements.
  Example:
    Original session: D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 P55730 P96 P27 P14 P27592 P28 P33711 -505884861
    Abstracted session: C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7 -505884861
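A minimal sketch of this relabeling is given below. It assumes that a page's type is the keyword letter of its lowest hierarchy level and that indices are assigned per type in order of first appearance within the session; the function name `abstract_session` is hypothetical. The trailing numeric token from the example is left out of the input.

```python
def abstract_session(pages):
    """Relabel each distinct page in one session as <type><index>, e.g. C1, I2, P3."""
    labels = {}    # page string -> abstract label, local to this session
    counters = {}  # type letter -> number of distinct pages of that type seen so far
    abstracted = []
    for page in pages:
        if page not in labels:
            kind = page.split("|")[-1][0]  # keyword of the lowest level (assumption)
            counters[kind] = counters.get(kind, 0) + 1
            labels[page] = f"{kind}{counters[kind]}"
        abstracted.append(labels[page])
    return abstracted


original = (
    "D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 "
    "P55730 P96 P27 P14 P27592 P28 P33711"
).split()
print(" ".join(abstract_session(original)))
# C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7
```

Because the label table is rebuilt for every session, the same page can receive different labels in different sessions, as the slide notes.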

Generating Significant Usage Patterns
• Use a Markov model to represent the sessions in each cluster
  Example sessions:
    (1) 1, 2, 3, 5, 4
    (2) 2, 4, 3, 5
    (3) 3, 2, 4, 5
    (4) 1, 3, 4, 3
    (5) 4, 2, 3, 4, 5
  [Figure: Markov model with start state S, states 1 to 5, and end state E; edges are labeled with transition probabilities]
• The probability of a path is normalized [equation shown on slide], where Pti is the transition probability between two adjacent states
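The transition probabilities for the five example sessions can be estimated by counting transitions, with an artificial start state S and end state E added to every session. The sketch below is an illustration under that assumption; `transition_probabilities` is a hypothetical name, and the path normalization is not implemented because its formula is not given in the text.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions, start="S", end="E"):
    """Estimate first-order transition probabilities from a list of sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        path = [start, *session, end]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(outgoing.values()) for dst, n in outgoing.items()}
        for src, outgoing in counts.items()
    }


sessions = [
    [1, 2, 3, 5, 4],
    [2, 4, 3, 5],
    [3, 2, 4, 5],
    [1, 3, 4, 3],
    [4, 2, 3, 4, 5],
]
probs = transition_probabilities(sessions)
print(probs["S"][1])  # 0.4  (2 of the 5 sessions start at page 1)
print(probs[5]["E"])  # 0.75 (3 of the 4 visits to page 5 end the session)
```

Each row of the resulting table sums to 1, so it can serve as one row of the per-cluster transition matrix.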

Generating Significant Usage Patterns
• Significant Usage Patterns
  Example: [figure]

Experimental Result
• On average, purchase sessions are longer than sessions without a purchase. Purchasing visitors:
  - review the information and compare price, quality, etc.
  - fill out the billing and shipping information to commit the purchase

Experimental Result
SUPs in non-purchase cluster
• Interested in gathering information about products in different categories:
    S-C1-C1-C2-C3-C4-C5-C5-I1-E
    S-C1-C1-I1-C1-C2-C3-C4-C5-E
    S-I1-C1-C2-C3-C4-C5-C6-C7-E
• Interested in reviewing general pages (to gather general information)
• Not serious visitors (the average session length is 3)

Experimental Result
• The average length of SUPs is longer in the purchase cluster than in the non-purchase cluster (review the information, compare among products, and fill out the payment and shipping information).
• SUPs in the purchase cluster have higher probability than those in the non-purchase cluster (having a purchase in mind vs. random browsing behavior).

Conclusion and Future Work
• Summary
  - Applying clustering to abstracted user sessions makes it more likely to find groups of users with similar motivations for visiting a specific website.
  - Letting users specify the beginning and/or ending Web page(s) gives them more control over generating the patterns they are interested in.
• Future
  - Scalability
  - Clustering to identify different user groups
  - Online assignment of a user to a predefined cluster

References
[1] J. Borges and M. Levene, “Data Mining of User Navigation Patterns”, in Proc. of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), pp. 31-36, San Diego, August 15, 1999.
[2] J. Borges and M. Levene, “An average linear time algorithm for web data mining”, International Journal of Information Technology and Decision Making, 3 (2004), pp. 307-320.
[3] W. Wang and O. R. Zaïane, “Clustering Web Sessions by Sequence Alignment”, Third International Workshop on Management of Information on the Web, in conjunction with the 13th International Conference on Database and Expert Systems Applications (DEXA'2002), pp. 394-398, Aix en Provence, France, September 2-6, 2002.

Thank you
Questions?
