Characterizing Visitors to a Website Across Multiple Sessions

NGDM Workshop, Nov 2002
Arindam Banerjee, Joydeep Ghosh

Motivation
Why characterize or predict web user behavior?
• Site-centric view: personalization, sticky websites
• User-centric view: personal agents for information acquisition
• Universalist approaches: PageRank, web metrics, …

Clustering Users from Web Logs
• Wide variety of web behavior → segment users based on surfing behavior as a first step toward further analysis
• User: a set of sessions
• Session: a sequence of (page ID, time spent on that page) tuples
• How to cluster sets of sequences?

The Approach
• Cluster sessions
  • Session similarity measure
  • Session similarity graph
  • Outlier detection
  • Graph partitioning
• Create a cluster space
• Cluster users in this space

A Similarity Measure for Sessions
• Overlap between two sessions is represented by their longest common subsequence (LCS)
• Obtain session similarity using the LCS and time information:
  session similarity = (time similarity in LCS) x (importance of LCS)
• The similarity component:
  • Average min-max similarity (min/max of the two times spent) over the pages in the LCS
• The importance component:
  • Average, over the two sessions, of the fraction of overall session time spent in the LCS
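One way to write the measure above in symbols, consistent with the worked example in the backup slides. The notation is introduced here for illustration and does not appear in the slides: L is the set of matched page visits in the LCS, t_i(p) is the time session i spent on matched visit p, and T_i is the total time of session i.

```latex
\[
\mathrm{sim}(s_1, s_2) =
\underbrace{\frac{1}{|L|} \sum_{p \in L}
  \frac{\min\bigl(t_1(p),\, t_2(p)\bigr)}{\max\bigl(t_1(p),\, t_2(p)\bigr)}}_{\text{time similarity in LCS}}
\;\times\;
\underbrace{\frac{1}{2}\left(
  \frac{\sum_{p \in L} t_1(p)}{T_1} + \frac{\sum_{p \in L} t_2(p)}{T_2}
\right)}_{\text{importance of LCS}}
\]
```

Both factors lie in [0, 1], so the product does too; an empty LCS is naturally assigned similarity 0.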

Session Clustering
• Find the pairwise similarity values between all pairs of sessions; record only similarities > q
• Incrementally construct the similarity graph Gq
  • the vertices are the sessions; the edge weights are the session similarity values
  • no isolated vertices (discard “outliers”)
• Balanced graph partitioning
  • we used Metis [Karypis, Kumar]
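The graph-construction step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `toy_page_overlap` is a stand-in for the LCS-based session similarity, and the Metis partitioning step is not shown.

```python
from itertools import combinations


def toy_page_overlap(s1, s2):
    """Stand-in similarity (assumption): Jaccard overlap of the page sets."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_similarity_graph(sessions, similarity, threshold):
    """Keep only edges with similarity > threshold; report isolated vertices.

    Returns (edges, outliers), where edges maps (i, j) with i < j to a weight
    and outliers lists the indices of sessions left without any edge.
    """
    edges = {}
    for i, j in combinations(range(len(sessions)), 2):
        weight = similarity(sessions[i], sessions[j])
        if weight > threshold:
            edges[(i, j)] = weight
    connected = {v for edge in edges for v in edge}
    outliers = [i for i in range(len(sessions)) if i not in connected]
    return edges, outliers


sessions = [["a", "b"], ["a", "b", "c"], ["x"]]
edges, outliers = build_similarity_graph(sessions, toy_page_overlap, 0.5)
print(edges, outliers)
```

The loop compares every pair of sessions, so its cost grows quadratically with the number of sessions, which fits the remark in the conclusions that forming the similarity graph is slow.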

The Cluster Space
• Given: each session assigned to one of k clusters (sets)
• Sessions of a user are distributed among the k sets
  • vector u = [u1 u2 … uk]^T, where ui = number of sessions of the user belonging to cluster i
• Stage II: User Clustering
  • find pairwise similarity values using the extended Jaccard measure
  • partition the similarity graph
• Gives l user clusters and a set of outlier users
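A small sketch of the cluster-space mapping and the extended Jaccard measure, assuming session clusters are labeled 0 to k-1 and using the usual definition x·y / (|x|² + |y|² − x·y). The function names are hypothetical, and how outlier sessions are treated (here, any label outside 0..k-1 is simply not counted) is an assumption.

```python
from collections import Counter


def user_vector(session_labels, k):
    """Map a user's sessions to a k-dimensional count vector.

    session_labels holds the cluster label of each of the user's sessions;
    entry i of the result is the number of sessions in cluster i.
    """
    counts = Counter(session_labels)
    return [counts.get(i, 0) for i in range(k)]


def extended_jaccard(u, v):
    """Extended Jaccard similarity: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0


u = user_vector([0, 2, 2], k=3)   # one session in cluster 0, two in cluster 2
v = user_vector([2, 0, 0], k=3)   # two sessions in cluster 0, one in cluster 2
print(u, v, extended_jaccard(u, v))
```

Unlike cosine similarity, the extended Jaccard measure is sensitive to vector length, so two users with the same mix of session types but very different session counts are not treated as identical.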

The Dataset: Sulekha.com

Dataset details
• Logs over a one-month period
• Raw log size 184 Mb
• 453,953 files accessed
• 37,753 sessions in all
• 23,310 sessions after some preprocessing/filtering
• 2,493 users

Results: Session Clusters

Results: User Clusters
• user: [(128.194.xxx.xxx)]
  • (/authors,3)(/articles,129)
  • (/authors,8)(/articles,8)
  • (/authors,80)(/articles,2141)
• user: [(209.30.xxx.xxx)]
  • (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
  • (/articles,2627)
• user: [(171.68.xxx.xxx)]
  • (/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster: people who read the articles

Results: User Clusters
• user: [(152.170.xxx.xxx)]
  • (/home,21)(/wo-men,1075)(/philosophy,52)
• user: [(209.244.xxx.xxx)]
  • (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
  • (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
  • (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster: people interested in wo-men, philosophy, coffeehouse

Results: User Clusters
• user: [(216.154.xxx.xxx)]
  • (/coffeehouse,12)(/biztech,25)(/books,48)
  • (/coffeehouse,13)(/biztech,26)(/books,19)
• user: [(204.220.xxx.xxx)]
  • (/coffeehouse,162)
  • (/coffeehouse,40)
• user: [(32.100.xxx.xxx)]
  • (/coffeehouse,12)(/contests,12)
  • (/coffeehouse,43)(/contests,44)
A user cluster: people interested in coffeehouse (bookmarked it!)

Result Visualization using CLUSION [Strehl & Ghosh 01]
[Figure panels: Sessions, Users]

Conclusions
• Segmentation: a basic pre-processing step for Web Mining
• Similarity measure + cluster space concept: applicable to clustering sets of any data structure
• For certain websites, time spent on the pages matters
  • not handled by current commercial tools
• Outlier detection before clustering is important
• Results QA-ed by human subjects
  • Results for clusters & outliers at both levels were subjectively good
  • No good way to find cluster quality analytically
• Formation of the similarity graph is a slow process

Future Work
• Improve the present method by:
  • using cluster seeds for cluster growing
  • using alternative clustering algorithms for each stage
  • studying the effect of thresholds and number of clusters on performance
  • studying the importance of the order of page visits
  • studying the importance of balanced clustering

Backup

Issues: Choice of Parameters
• Number of session clusters, k, should be chosen appropriately
• Thresholds for forming session & user similarity graphs:
  • threshold value should be chosen after looking at the distribution of edge weights
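One simple way to look at the edge-weight distribution before fixing a threshold is to tabulate, for a few candidate values, how many edges survive and how many vertices would become isolated outliers. The sketch below is only an illustration under that assumption; the function name, the toy weights, and the candidate thresholds are all hypothetical.

```python
def threshold_report(weights, n_vertices, thresholds):
    """For each candidate threshold, count surviving edges and isolated vertices.

    weights maps an edge (i, j) to its similarity value.
    Returns a list of (threshold, edges_kept, isolated_vertices) rows.
    """
    rows = []
    for q in thresholds:
        kept = [edge for edge, w in weights.items() if w > q]
        connected = {v for edge in kept for v in edge}
        rows.append((q, len(kept), n_vertices - len(connected)))
    return rows


# Hypothetical pairwise similarities among four sessions.
weights = {(0, 1): 0.9, (1, 2): 0.4, (2, 3): 0.1}
for row in threshold_report(weights, n_vertices=4, thresholds=[0.0, 0.3, 0.5]):
    print(row)
```

A table like this makes the trade-off visible: a higher threshold gives a sparser graph that is cheaper to partition, but discards more sessions as outliers.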

Related Work
• Research in Web Mining:
  • Extraction of navigational patterns: Spiliopoulou, Faulstich
  • Ordering relationships: Mannila, Meek
  • Surfing prediction: Pitkow, Pirolli
  • Clustering web usage sessions: Fu, Sandhu, Shih

Example
• Sessions:
  • Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
  • Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
• LCS pages = [(b)(d)(c)]
• Corresponding index and time sequences (indices are 0-based):
  • Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]
  • Index2 = [(0)(1)(4)], Time2 = [(5) (12) (5)]
• Similarity over each LCS page: min/max of the two times
  • Similarity on page b = 5/100 = 0.05
  • Similarity on page d = 8/12 = 0.67
  • Similarity on page c = 5/5 = 1.00

Example (contd.)
• The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
• The importance component:
  • Fraction of time spent in the LCS by Session1 = 113/149 = 0.76
  • Fraction of time spent in the LCS by Session2 = 22/30 = 0.73
  • The mean = (0.76 + 0.73)/2 = 0.75
• The overall similarity = 0.57 x 0.75 = 0.43
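The worked example can be reproduced with a short script. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the LCS is found with the standard dynamic program, and ties between equally long subsequences are broken in a way that happens to select (b, d, c) here (the subsequence (b, d, a) has the same length). The slides do not say how such ties or zero durations are handled.

```python
def lcs_indices(pages1, pages2):
    """Return index lists (into pages1, pages2) of one longest common subsequence."""
    n, m = len(pages1), len(pages2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pages1[i - 1] == pages2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    idx1, idx2 = [], []
    i, j = n, m
    while i > 0 and j > 0:
        if pages1[i - 1] == pages2[j - 1]:
            idx1.append(i - 1)
            idx2.append(j - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:  # tie-break: an assumption
            i -= 1
        else:
            j -= 1
    return idx1[::-1], idx2[::-1]


def session_similarity(s1, s2):
    """(average min/max time ratio over the LCS) x (mean fraction of time in the LCS)."""
    idx1, idx2 = lcs_indices([p for p, _ in s1], [p for p, _ in s2])
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    if not idx1 or total1 == 0 or total2 == 0:
        return 0.0
    t1 = [s1[i][1] for i in idx1]
    t2 = [s2[j][1] for j in idx2]
    # Two zero durations are treated as identical (ratio 1.0): an assumption.
    ratios = [min(a, b) / max(a, b) if max(a, b) > 0 else 1.0 for a, b in zip(t1, t2)]
    time_similarity = sum(ratios) / len(ratios)
    importance = (sum(t1) / total1 + sum(t2) / total2) / 2
    return time_similarity * importance


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(lcs_indices([p for p, _ in session1], [p for p, _ in session2]))
print(round(session_similarity(session1, session2), 2))  # 0.43
```

Without intermediate rounding the product is about 0.427, which agrees with the 0.43 on the slide.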

Issues: Session Resolution
• Generate coarse-resolution paths using the concept hierarchy of the website
• Reduces computation; increases interpretability of results
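A rough sketch of coarsening a session, assuming the concept hierarchy can be read off the URL path (for example, everything under /articles/ belongs to the concept /articles) and that consecutive visits to the same concept are merged with their times summed. Both choices, the function name, and the sample URLs are assumptions for illustration; the slides do not specify the mapping.

```python
def coarsen_session(session, depth=1):
    """Map (url, seconds) pairs to (concept, seconds) pairs.

    A concept is the first `depth` components of the URL path. Consecutive
    visits that fall in the same concept are merged and their times summed.
    """
    coarse = []
    for url, seconds in session:
        parts = [p for p in url.split("/") if p]
        concept = ("/" + "/".join(parts[:depth])) if parts else "/"
        if coarse and coarse[-1][0] == concept:
            coarse[-1] = (concept, coarse[-1][1] + seconds)
        else:
            coarse.append((concept, seconds))
    return coarse


# Hypothetical page-level session.
session = [
    ("/articles/a1.html", 30),
    ("/articles/a2.html", 99),
    ("/authors/x", 8),
    ("/articles/a3", 5),
]
print(coarsen_session(session))
```

Coarsened sessions are shorter, which cuts the cost of each LCS computation, and they read like the (/section, seconds) paths shown on the results slides.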

Comments
• Results QA-ed by human subjects
  • Results for clusters & outliers at both levels were subjectively good
  • No good way to find cluster quality analytically
• Clustering algorithms for the two stages
  • Stage I: graph partitioning works well for large sparse graphs, so it is desirable in this stage
  • Stage II: since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
• Cluster space
  • Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem
