Separating the Swarm: Categorization Methods for User Sessions on the Web
Jeffrey Heer, Ed H. Chi
Palo Alto Research Center
2002.04.24 – CHI Web Behavior Patterns
Web Analytics: What can you measure?
• Want to improve site design, content, and performance
• Marketing, Infrastructure: content, page traffic, load testing
• Site Design: user intent, usability, user experience
The Change in Web Sites: What should you measure?
[Figure: Site Complexity plotted against Time. Page-based websites (e.g. "Products", "Management Team") are labeled TRAFFIC; activity-based websites (e.g. "I’d like information on used cars.", "Search for a car dealer in my neighborhood.") are labeled USER EXPERIENCE.]
Motivation • System Description • Evaluation • Implications • Conclusion
What are users’ information goals?
• Understanding the composition of web user traffic.
Strategy: Use all available data to discover user goals (Content, Usage, Topology).
System Description
• Generate a user profile for each user session.
  • How: Use access logs and site content to build a multi-featured model of user activity (multi-modal clustering).
• Group user profiles into common activities like “product browsing” and “job seeking”.
  • How: Apply clustering algorithms to user profiles.
System Description
Steps:
• Process Access Logs
• Crawl Web Site
• Build Document Model
• Extract User Sessions
• Build User Profiles
• Cluster Profiles
[Diagram: Access Logs → User Sessions; Web Crawl → Document Model; User Sessions + Document Model → User Profiles → Clustered Profiles]
Document Model
• Site is crawled
  • Pay special attention to pages in logs.
• Documents described by feature vectors:
  • Content: TF.IDF weighted keyword vector
  • URL: Tokenized and TF.IDF weighted
  • Inlinks: Column vectors in topology matrix
  • Outlinks: Row vectors in topology matrix
• Vectors are concatenated to form a single multi-modal vector Pd for each document.
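The concatenation described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-page site, its tokens, and the helper names (`tfidf_vectors`, `document_vector`) are invented, and the TF.IDF variant (term count times log of N over document frequency) is an assumption, since the slides do not say which one was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Return one TF.IDF vector per token list, ordered by vocab.

    Assumption: weight = term count * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vectors

# Hypothetical three-page site (page names and tokens are invented).
pages = ["A", "B", "C"]
content = [["printer", "ink"], ["printer", "jobs"], ["jobs", "careers"]]
urls = [["products", "printer"], ["products", "jobs"], ["jobs", "index"]]
# topology[i][j] == 1 when page i links to page j.
topology = [[0, 1, 1],
            [0, 0, 1],
            [1, 0, 0]]

content_vocab = sorted({t for d in content for t in d})
url_vocab = sorted({t for d in urls for t in d})
content_vecs = tfidf_vectors(content, content_vocab)
url_vecs = tfidf_vectors(urls, url_vocab)

def document_vector(i):
    """Concatenate the four modalities into one multi-modal vector."""
    outlinks = topology[i]                   # row i of the topology matrix
    inlinks = [row[i] for row in topology]   # column i of the topology matrix
    return content_vecs[i] + url_vecs[i] + inlinks + outlinks

P = [document_vector(i) for i in range(len(pages))]
```

Keeping the modalities in fixed, contiguous ranges of the vector makes it possible to weight each modality separately when similarities are computed later.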
User Sessions
• Sessions extracted and represented by a vector s:
  • For path i = ABD, si = <1,1,0,1,0> (for a site with 5 documents <A,B,C,D,E>)
• Different weightings can be employed in creating the session vector s:
  • Frequency: number of times each page is accessed. ABD, s = <1,1,0,1,0>
  • TF.IDF: hits / # paths including page
  • Position: Use order of pages within surfing path. ABD, s = <1,2,0,3,0>
  • View Time: Use time spent viewing pages. A(10s) B(20s) D(15s), s = <10,20,0,15,0>
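The weightings on this slide can be reproduced with a short sketch. The function names are hypothetical, the handling of repeated pages (last position wins, view times are summed) is an assumption, and the TF.IDF variant simply follows the slide's "hits / # paths including page" wording over an invented set of paths.

```python
PAGES = ["A", "B", "C", "D", "E"]

def frequency_vector(path, pages=PAGES):
    """Number of times each page is accessed in the path."""
    return [path.count(p) for p in pages]

def position_vector(path, pages=PAGES):
    """1-based position of each page in the path (last visit wins on repeats)."""
    vec = [0] * len(pages)
    for pos, page in enumerate(path, start=1):
        vec[pages.index(page)] = pos
    return vec

def view_time_vector(timed_path, pages=PAGES):
    """Total seconds spent on each page; timed_path is [(page, seconds), ...]."""
    vec = [0] * len(pages)
    for page, seconds in timed_path:
        vec[pages.index(page)] += seconds
    return vec

def tfidf_vector(path, all_paths, pages=PAGES):
    """Hits in this path divided by the number of paths that include the page."""
    vec = []
    for p in pages:
        containing = sum(1 for other in all_paths if p in other)
        vec.append(path.count(p) / containing if containing else 0.0)
    return vec

freq = frequency_vector("ABD")                               # [1, 1, 0, 1, 0]
pos = position_vector("ABD")                                 # [1, 2, 0, 3, 0]
timed = view_time_vector([("A", 10), ("B", 20), ("D", 15)])  # [10, 20, 0, 15, 0]
# Hypothetical log containing three paths.
tfidf = tfidf_vector("ABD", ["ABD", "AC", "AD"])
```

The first three results match the example vectors given on the slide.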
User Profiles
• User profiles are a linear combination of the viewed pages.
• “You are what you see.”
[Equation: User Profile = sum over documents d of (Session weight sd × Document Vector Pd)]
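A linear combination of this kind might look like the following sketch. The function name and the toy vectors are hypothetical, and no normalization is applied; whether the original system normalized profiles is not stated on the slide.

```python
def user_profile(session_weights, document_vectors):
    """Sum each document vector scaled by the session's weight for that page."""
    dims = len(document_vectors[0])
    profile = [0.0] * dims
    for weight, doc_vec in zip(session_weights, document_vectors):
        for k in range(dims):
            profile[k] += weight * doc_vec[k]
    return profile

# Toy example: three pages, each described by a two-feature vector.
doc_vectors = [[1.0, 0.0],   # page A
               [0.0, 1.0],   # page B
               [1.0, 1.0]]   # page C (not visited in this session)
session = [10, 20, 0]        # e.g. view-time weights for A, B, C

profile = user_profile(session, doc_vectors)
```

Pages with a session weight of zero contribute nothing, so the profile reflects only what the user actually viewed.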
Clustering
• Clustering is a form of statistical analysis which organizes data into individual clusters.
• Groupings are determined by a shared similarity.
• Similarity is defined by a computable similarity metric; weights wm specify the contribution of each modality.
• Clustering proceeds by recursive bisection, using K-Means to perform the bisections [Zhao01].
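A rough sketch of recursive bisection with a modality-weighted similarity is shown below. It is not the CLUTO algorithm used by the authors: the weighted sum of per-modality cosines, the seeding rule, and the choice to always split the largest cluster are all assumptions made for illustration, and every name is hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity(p, q, slices, weights):
    """Weighted sum of per-modality cosines; weights play the role of wm."""
    return sum(w * cosine(p[s], q[s]) for s, w in zip(slices, weights))

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def bisect(profiles, sim, iterations=10):
    """2-means split, seeded with the first profile and the one least like it."""
    centers = [profiles[0], min(profiles, key=lambda p: sim(profiles[0], p))]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in profiles:
            nearest = 0 if sim(p, centers[0]) >= sim(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break
        centers = [centroid(g) for g in groups]
    return groups

def bisecting_kmeans(profiles, k, sim):
    """Repeatedly split the largest cluster until k clusters exist."""
    clusters = [list(profiles)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        largest = clusters.pop(0)
        left, right = bisect(largest, sim) if len(largest) > 1 else (largest, [])
        if not left or not right:
            clusters.append(largest)   # cannot split further
            break
        clusters.extend([left, right])
    return clusters

# Toy profiles with two modalities of two features each (invented data).
profiles = [[1, 0, 1, 0], [0.9, 0.1, 1, 0], [0, 1, 0, 1], [0.1, 0.9, 0, 1]]
slices = [slice(0, 2), slice(2, 4)]
weights = [0.7, 0.3]
clusters = bisecting_kmeans(
    profiles, 2, lambda p, q: similarity(p, q, slices, weights))
```

Raising or lowering an entry in `weights` changes how much that modality influences the groupings, which is the tuning knob the slide refers to.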
[Screenshot of system output. Callouts: user population breakdown; keywords describing user groups; frequent documents accessed by group; detailed stats.]
Clustering Results
http://www.diamondreview.com
Users reached end of tutorial, had nowhere to go.
Evaluation
Does the system correctly infer user intentions?
[Diagram: Logs → System → User Intent Groupings, which are compared against the actual User Intent]
User Study
• Asked users to perform specific tasks on www.xerox.com
  • Captured actions using the WebQuilt proxy logger [Hong01]
  • Done at their leisure.
• 15 unique tasks:
  • Tasks developed after exploring xerox.com and reading user e-mail feedback
  • 5 task groups with 3 tasks per group.
  • Products, TechSupport, Supplies, Company Info, and Jobs
• Participation:
  • 21 users signed up, 18 completed the study, yielding 104 usable sessions.
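Because every study session has a known task group, a clustering can be scored against those labels. The slides do not state how accuracy was computed, so the following is only one plausible scheme, assumed for illustration: label each cluster with its majority task group and count the sessions that agree. The function name and the data are invented.

```python
from collections import Counter

def clustering_accuracy(cluster_ids, task_groups):
    """Fraction of sessions whose task group matches their cluster's majority."""
    by_cluster = {}
    for cid, task in zip(cluster_ids, task_groups):
        by_cluster.setdefault(cid, []).append(task)
    correct = sum(Counter(tasks).most_common(1)[0][1]
                  for tasks in by_cluster.values())
    return correct / len(task_groups)

# Invented example: six sessions assigned to three clusters.
cluster_ids = [0, 0, 0, 1, 1, 2]
task_groups = ["Products", "Products", "Jobs", "Jobs", "Jobs", "Supplies"]

accuracy = clustering_accuracy(cluster_ids, task_groups)  # 5 of 6 sessions agree
```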
Results: 340 combinations of clustering schemes
Outlink-based schemes performed poorly (omitted).
Analysis: Modalities
Content is King! Mean=0.96, StdDev=0.07
Linear Contrast shows Content significantly different:
• (unimodal) F(1,105)=32.51, MSE=.005361, p<0.0001
• (multimodal) F(1,35)=33.36, MSE=.007332, p<0.0001
Analysis: Path Weighting
View Time is best!
Paired t-Test between Time-based and non-Time-based weightings: n=60, t(59)=4.85, p=4.68e-6
• View Time: mean=89.5%, s.d.=12.7%
• Non-View Time: mean=83.2%, s.d.=12.0%
Observation: Multi-Modal vs. Unimodal
• In practice, Multi-Modal should be more robust
  • Some pages don’t have much content: Images, Audio, Video; PDF, PS (if you don’t have necessary software)
  • URL Tokens: All pages have URLs.
  • Inlinks: don’t depend on any features of a page!
• In our experience, Content-based Multi-Modal Clustering retains accuracy.
Linear Contrast shows no significant difference between multi-modal and uni-modal schemes: F(1,77)=1.63, MSE=.004407, p=.21
Findings
• Incorporating View Time improves clustering accuracy.
• Though it involves extra work, extracting Content can provide very high accuracy.
• Adding other modalities makes clustering more robust.
• Modalities should be chosen carefully, and tailored for each specific site.
Implications for Designers
• Good design means understanding your users.
• It’s possible to understand trends of user activities accurately.
  • Requires well-defined user tasks doable on the site.
• Now you can design and tailor user experience.
  • Address discovered usability issues.
  • Update design to facilitate common tasks.
Summary: “You are what you see.”
Users follow the best Information Scent to accomplish their goals.
[Diagram labels: Web site (Page Content, Topology); User Information Goals; Observed Usage; InfoScent; Clustering]
Future Work
• Determining # of clusters
  • Currently done semi-manually
• Model unstructured tasks more directly
• Directly recommend design changes
• Integrate with
  • Clustering Visualization
  • User Path Visualization
• Lots of Commercial Interest, Licensing
Conclusion
• Performed first known user study to characterize the analytic space of session clustering techniques.
• Found that session clustering can be highly accurate with respect to user intentions.
• Demonstrated our method is scalable and useful in real-world scenarios.
• This should prove to be a useful tool for web designers and researchers!
Acknowledgements
• Peter Pirolli, Stu Card, Adam Rosien, Pam Schraedley and the UIR and Bloodhound Team at PARC.
• George Karypis for CLUTO software
• Participants in our user study
• Office of Naval Research
Contact: Jeff Heer (jheer@parc.com), Ed H. Chi (echi@parc.com)