Characterizing Visitors to a Website Across Multiple Sessions. NGDM Workshop, Nov 2002. Arindam Banerjee, Joydeep Ghosh.
Characterizing Visitors to a Website Across Multiple Sessions NGDM Workshop, Nov 2002 Arindam Banerjee Joydeep Ghosh Banerjee and Ghosh
Motivation
Why characterize or predict web user behavior?
• Site-centric view: personalization, sticky websites
• User-centric view: personal agents for information acquisition
• Universalist approaches: PageRank, web metrics, …
Clustering Users from Web Logs
• Web behavior varies widely; segment users based on surfing behavior as a first step to further analysis
• User: a set of sessions
• Session: a sequence of (page ID, time spent on that page) tuples
• How to cluster sets of sequences?
The Approach
• Cluster Sessions
  • Session Similarity Measure
  • Session Similarity Graph
  • Outlier Detection
  • Graph Partitioning
• Create a Cluster Space
• Cluster users in this Space
A Similarity Measure for Sessions
• Overlap between two sessions is represented by their longest common subsequence (LCS)
• Obtain session similarity using the LCS and time information: session similarity = (time similarity in LCS) × (importance of LCS)
• The similarity component: average min-max similarity (the smaller of the two times divided by the larger) over the pages in the LCS
• The importance component: average of the fraction of overall session time spent in the LCS
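Written out, the measure on this slide can be read as the formula below. The notation is an assumption introduced here, not the authors': A and B are the two sessions, m is the length of their LCS, t^A_i and t^B_i are the times each session spent on the i-th LCS page, and T_A, T_B are the total session times.

```latex
\[
S(A,B) \;=\;
\underbrace{\frac{1}{m}\sum_{i=1}^{m}
  \frac{\min(t^{A}_{i},\,t^{B}_{i})}{\max(t^{A}_{i},\,t^{B}_{i})}}_{\text{similarity component}}
\;\times\;
\underbrace{\frac{1}{2}\left(
  \frac{\sum_{i=1}^{m} t^{A}_{i}}{T_{A}} +
  \frac{\sum_{i=1}^{m} t^{B}_{i}}{T_{B}}\right)}_{\text{importance component}}
\]
```

Both factors lie in [0, 1], so the product does too; it reaches 1 only when the two sessions visit the same pages in the same order with identical times.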
Session Clustering
• Find the pairwise similarity values between all pairs of sessions; record only similarities > q
• Incrementally construct the similarity graph Gq
  • the vertices are the sessions; the edge weights are the session similarity values
  • no isolated vertices (discard “outliers”)
• Balanced Graph Partitioning
  • we used Metis [Karypis, Kumar]
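The graph-construction step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_similarity_graph` and its arguments are hypothetical, the similarity function is passed in by the caller, and the partitioning step (Metis in the slides) is deliberately left out.

```python
from itertools import combinations


def build_similarity_graph(items, similarity, threshold):
    """Keep only edges whose weight exceeds the threshold.

    Returns (edges, vertices, outliers): edges maps an index pair (i, j)
    to its weight, vertices are the items with at least one edge, and
    outliers are the isolated items discarded before partitioning.
    """
    edges = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:
            edges[(i, j)] = weight
    connected = {v for edge in edges for v in edge}
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, sorted(connected), outliers


# Toy usage with a stand-in similarity (set Jaccard), purely for illustration.
def toy_jaccard(a, b):
    return len(a & b) / len(a | b)


toy_items = [{1, 2, 3}, {2, 3, 4}, {9}]
toy_edges, toy_vertices, toy_outliers = build_similarity_graph(
    toy_items, toy_jaccard, threshold=0.3
)
```

The all-pairs loop is quadratic in the number of sessions, which is consistent with the slides' remark that forming the similarity graph is slow.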
The Cluster Space
• Given: each session assigned to one of k clusters (sets)
• Sessions of a user are distributed among the k sets
  • vector u = [u1 u2 … uk]^T, where ui = number of sessions of the user belonging to cluster i
• Stage II : User Clustering
  • find pairwise similarity values using the extended Jaccard measure
  • partition the similarity graph
• Gives l user clusters and a set of outlier users
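A small sketch of the cluster-space mapping and the user similarity follows. The function names are hypothetical, cluster labels are assumed to be integers in 0..k-1, and the extended Jaccard measure is taken in its usual form x·y / (‖x‖² + ‖y‖² − x·y); returning 0.0 for two all-zero vectors is a choice made here.

```python
def user_vector(session_labels, k):
    """Count how many of a user's sessions fall into each of the k session clusters."""
    vec = [0] * k
    for label in session_labels:
        vec[label] += 1
    return vec


def extended_jaccard(u, v):
    """Extended Jaccard similarity: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0  # assumption: two empty users are dissimilar


# Toy usage: two users whose sessions landed in clusters 0 and 2.
user_a = user_vector([0, 0, 2], k=3)  # [2, 0, 1]
user_b = user_vector([0, 2, 2], k=3)  # [1, 0, 2]
score = extended_jaccard(user_a, user_b)
```

Once every user is a k-dimensional count vector, the second stage can reuse the same threshold-and-partition procedure as the session stage.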
The Dataset : Sulekha.com
Dataset details
• Logs over a one-month period
• Raw log size: 184 MB
• 453,953 files accessed
• 37,753 sessions in all
• 23,310 sessions after some preprocessing/filtering
• 2,493 users
Results : Session Clusters
Results : User Clusters
• user : [(128.194.xxx.xxx)]
  • (/authors,3)(/articles,129)
  • (/authors,8)(/articles,8)
  • (/authors,80)(/articles,2141)
• user : [(209.30.xxx.xxx)]
  • (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
  • (/articles,2627)
• user : [(171.68.xxx.xxx)]
  • (/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster : people who read the articles
Results : User Clusters
• user : [(152.170.xxx.xxx)]
  • (/home,21)(/wo-men,1075)(/philosophy,52)
• user : [(209.244.xxx.xxx)]
  • (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
  • (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
  • (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster : people interested in wo-men, philosophy, coffeehouse
Results : User Clusters
• user : [(216.154.xxx.xxx)]
  • (/coffeehouse,12)(/biztech,25)(/books,48)
  • (/coffeehouse,13)(/biztech,26)(/books,19)
• user : [(204.220.xxx.xxx)]
  • (/coffeehouse,162)
  • (/coffeehouse,40)
• user : [(32.100.xxx.xxx)]
  • (/coffeehouse,12)(/contests,12)
  • (/coffeehouse,43)(/contests,44)
A user cluster : people interested in coffeehouse – bookmarked it!
Result Visualization using CLUSION [Strehl & Ghosh 01]
[Figure: two CLUSION visualization panels, titled "Sessions" and "Users"]
Conclusions
• Segmentation: a basic pre-processing step for Web Mining
• Similarity measure + Cluster Space concept: applicable to clustering of sets of any data structure
• For certain websites, time spent on the pages matters
  • not handled by current commercial tools
• Outlier detection before clustering is important
• Results QA-ed by human subjects
  • Results for clusters & outliers at both levels were subjectively good
  • No good way to find cluster quality analytically
• Formation of the similarity graph is a slow process
Future Work
• Improve the present method by:
  • using cluster seeds for cluster growing
  • using alternative clustering algorithms for each stage
  • studying the effect of thresholds and number of clusters on performance
  • studying the importance of the order of page visits
  • studying the importance of balanced clustering
Backup
Issues : Choice of Parameters
• Number of session clusters, k, should be chosen appropriately
• Thresholds for forming session & user similarity graphs:
  • threshold value should be chosen after looking at the distribution of edge weights
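One simple way to "look at the distribution of edge weights" is to tabulate how many candidate edges would survive each threshold. The helper below is a hypothetical sketch of that idea, not something described in the slides.

```python
def retained_fraction(edge_weights, thresholds):
    """For each candidate threshold, return the fraction of edges with weight above it."""
    total = len(edge_weights)
    return {
        t: (sum(w > t for w in edge_weights) / total if total else 0.0)
        for t in thresholds
    }


# Toy usage with made-up weights.
fractions = retained_fraction([0.1, 0.2, 0.5, 0.9], thresholds=[0.15, 0.6])
```

A threshold near the point where the retained fraction drops sharply keeps the graph sparse without isolating most vertices; where that point lies depends on the data.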
Related Work
• Research in Web Mining:
  • Extraction of navigational patterns: Spiliopoulou, Faulstich
  • Ordering relationships: Mannila, Meek
  • Surfing prediction: Pitkow, Pirolli
  • Clustering web usage sessions: Fu, Sandhu, Shih
Example
• Sessions:
  • Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
  • Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
• LCS pages = [(b)(d)(c)]
• Corresponding index and time sequences (indices are zero-based):
  • Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]
  • Index2 = [(0)(1)(4)], Time2 = [(5) (12) (5)]
• Similarity over each LCS page: min/max of the two times
  • Similarity on page b = 5/100 = 0.05
  • Similarity on page d = 8/12 = 0.67
  • Similarity on page c = 5/5 = 1.00
Example (contd.)
• The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
• The importance component:
  • Fraction of time spent in the LCS by Session1 = 113/149 = 0.76
  • Fraction of time spent in the LCS by Session2 = 22/30 = 0.73
  • The mean = (0.76 + 0.73)/2 = 0.75
• The overall similarity = 0.57 × 0.75 = 0.43
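The worked example can be reproduced with a short script. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and the LCS is found with the textbook dynamic program. The LCS is not unique here ((b)(d)(a) also has length 3), so the tie-breaking rule in the backtrack, which happens to recover the (b)(d)(c) alignment used on the slides, is an assumption.

```python
def lcs_indices(pages1, pages2):
    """Return index lists (into pages1, pages2) of one longest common subsequence."""
    n, m = len(pages1), len(pages2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pages1[i - 1] == pages2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    idx1, idx2 = [], []
    i, j = n, m
    while i > 0 and j > 0:
        if pages1[i - 1] == pages2[j - 1]:
            idx1.append(i - 1)
            idx2.append(j - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:  # tie-break: an assumption
            i -= 1
        else:
            j -= 1
    return idx1[::-1], idx2[::-1]


def session_similarity(s1, s2):
    """(average min/max time ratio over the LCS) x (mean fraction of time in the LCS)."""
    idx1, idx2 = lcs_indices([p for p, _ in s1], [p for p, _ in s2])
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    if not idx1 or total1 == 0 or total2 == 0:
        return 0.0  # assumption: no overlap or no recorded time means zero similarity
    t1 = [s1[i][1] for i in idx1]
    t2 = [s2[j][1] for j in idx2]
    ratios = [min(a, b) / max(a, b) if max(a, b) > 0 else 1.0 for a, b in zip(t1, t2)]
    similarity = sum(ratios) / len(ratios)
    importance = (sum(t1) / total1 + sum(t2) / total2) / 2
    return similarity * importance


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(round(session_similarity(session1, session2), 2))  # 0.43
```

Choosing the other LCS, (b)(d)(a), would change the time ratios and therefore the score, so a real implementation needs an explicit rule for ties.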
Issues : Session Resolution
• Generate coarse-resolution paths using the concept hierarchy of the website
• Reduces computation; increases interpretability of results
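As an illustration of coarsening, the sketch below maps each page to its top-level directory and merges consecutive visits to the same section. Both choices are assumptions made here: the slides do not say how the concept hierarchy is derived, the example URLs are invented, and `coarsen_session` is a hypothetical name.

```python
def coarsen_session(session, depth=1):
    """Map each URL to its first `depth` path components and merge consecutive repeats."""
    coarse = []
    for url, time_spent in session:
        parts = [p for p in url.split("/") if p]
        concept = "/" + "/".join(parts[:depth])
        if coarse and coarse[-1][0] == concept:
            # Same section as the previous page view: accumulate the time.
            coarse[-1] = (concept, coarse[-1][1] + time_spent)
        else:
            coarse.append((concept, time_spent))
    return coarse


# Toy usage with made-up page paths.
raw_session = [("/articles/a.html", 10), ("/articles/b.html", 5), ("/home", 3)]
coarse_session = coarsen_session(raw_session)
```

Shorter sequences over a smaller page vocabulary make the LCS computation cheaper and give paths like the (/articles, …)(/home, …) tuples shown in the results slides.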
Comments
• Results QA-ed by human subjects
  • Results for clusters & outliers at both levels were subjectively good
  • No good way to find cluster quality analytically
• Clustering algorithms for the two stages
  • Stage I : Graph partitioning works well for large sparse graphs, so it is desirable in this stage
  • Stage II : Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
• Cluster space
  • Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem