Discovery of Aggregate Usage Profiles for Web Personalization

Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, Jim Wiltshire. School of Computer Science, Telecommunications, and Information Systems, DePaul University

Web Personalization • The Problem • dynamically serve customized content (pages, products, etc.) to users based on their profiles, preferences, or expected interests • Current Approaches • rule-based filtering • usually relies on static user profiles obtained in part through explicit registration • collaborative filtering • usually requires explicit ratings from users on similar types of objects • content-based filtering: learn/store personal profiles locally or on the server side • based on content similarity of user profile to pages or product descriptions • Limitations of Current Technologies • user input may be subjective and prone to bias • explicit (and non-binary) user ratings may not be available • profiles may be static and can become outdated quickly • collaborative filtering: problems with scalability due to sparse data • content-based filtering: may miss other semantic relationships among objects

Usage-Based Web Personalization • Basic Idea • find aggregate user profiles by automatically discovering user access patterns through Web usage mining (offline process) • data sources for mining include server logs, other click-stream data (e.g., product-oriented user events), and site structure • match a user’s active session against the discovered profiles to provide dynamic content (online process) • Advantages / Goals • profiles are based on objective information (how users actually use the site) • no explicit user ratings or interaction with users (to enter a profile, etc.) • helps preserve user privacy, by making effective use of anonymous data • usage data captures relationships missed by content-based approaches • can help enhance the effectiveness of collaborative or content-based filtering techniques

Automatic Web Personalization: Offline Process [Diagram: Server Logs & Other Click-Stream Data, Site Files, and Domain Knowledge feed Data Preparation (Data Cleaning, Session Identification, Pageview Identification, Transaction Identification, Support Filtering), which yields the User Transaction File. Usage Mining (Transaction Clustering, Pageview Clustering, Association-Rule Discovery with Frequent Itemsets) then produces the Usage Profiles.]

Automatic Web Personalization: Online Process [Diagram: the Client Browser sends the Active Session through the Web Server to the Recommendation Engine, which combines it with the Usage Profiles (input from the batch process) and returns Recommendations.]

Data Preparation Tasks • Preprocess and filter logs and other usage data • remove redundant references and create pageviews • domain knowledge to assign types to pageviews • handle references to scripts creating dynamic pages • map logs against site topology • Identify user sessions and transactions • heuristics based on IP, referrer, agent fields, and session time-outs used to identify unique user sessions (may need to infer missing references) • intra-session transactions can be obtained based on a model of user behavior (involves classifying references as “content” or “navigational” for each user) • weights are assigned to each pageview based on static pageview types as well as some measure of user interest (e.g., duration of pageview) • Support filtering - remove very low/high support pageviews
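
The session-identification heuristic above can be sketched as follows. This is only a minimal illustration: it assumes log records have already been parsed into (ip, agent, timestamp, url) tuples, and the function name `sessionize` and the 30-minute time-out are hypothetical choices, not values given in the slides. Referrer-based heuristics and inference of missing references are left out.

```python
from collections import defaultdict

def sessionize(records, timeout=1800):
    """Group (ip, agent, timestamp_seconds, url) records into sessions.

    A new session starts whenever the gap between consecutive requests
    from the same (ip, agent) pair exceeds `timeout` seconds.
    The 1800-second default is an assumed value for illustration.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in records:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user in sorted(by_user):
        hits = sorted(by_user[user])          # order each user's hits by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:        # time-out: close the session
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions

# Toy log records (made up for this example).
records = [
    ("10.0.0.1", "Mozilla", 0, "/index.html"),
    ("10.0.0.1", "Mozilla", 60, "/cfp.html"),
    ("10.0.0.1", "Mozilla", 4000, "/news.html"),
    ("10.0.0.2", "Mozilla", 30, "/index.html"),
]
sessions = sessionize(records)
```

With these toy records, the first user's requests split into two sessions because of the long gap before the third request, and the second user forms a session of their own.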

Aggregate Usage Profiles • Characteristics of Aggregate Profiles • the goal is to effectively capture common usage patterns from potentially anonymous click-stream data • profiles are represented as weighted collections of pageviews • weights represent the significance of pageviews within each profile • profiles are overlapping in order to capture common interests among different groups/types of users • multiple profiles may contribute to the recommendation set for a given user • Example Profiles from the ACR (Assoc. for Consumer Research) Site:

Profile 1
1.00 Call for Papers
0.67 ACR News Special Topics
0.67 CFP: Journal of Psychology and Marketing I
0.67 CFP: Journal of Psychology and Marketing II
0.67 CFP: Journal of Consumer Psychology II
0.67 CFP: Journal of Consumer Psychology I

Profile 2
1.00 CFP: Winter 2000 SCP Conference
1.00 Call for Papers
0.36 CFP: ACR 1999 Asia-Pacific Conference
0.30 ACR 1999 Annual Conference
0.25 ACR News Updates
0.24 Conference Update

Methodologies for the Discovery of Aggregate Profiles • Discovery of Profiles Based on Transaction Clusters • cluster user transactions - features are significant pageviews identified in the preprocessing stage • derive usage profiles (set of pageview-weight pairs) based on characteristics of each transaction cluster • Cluster Pageviews • directly compute overlapping clusters of pageviews based on co-occurrence patterns across transactions • features are user transactions, so dimensionality poses a problem for traditional clustering algorithms • we use Association-Rule Hypergraph Partitioning with an overlap factor

Profile Aggregation Based on Clustering Transactions (PACT) • Input • set of relevant pageviews in preprocessed log • set of user transactions • each transaction is a pageview vector • Transaction Clusters • each cluster contains a set of transaction vectors • for each cluster compute centroid as cluster representative • Aggregate Usage Profiles • a set of pageview-weight pairs: for transaction cluster C, select each pageview p_i whose weight in the cluster centroid (the mean of p_i's weights over the transactions in C) is greater than a pre-specified threshold
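
The profile-derivation step of PACT can be sketched as below. This is a minimal illustration under stated assumptions: transactions are plain lists of weights, one per pageview, and the clustering itself is taken as already done. The function name `pact_profile`, the toy data, and the 0.5 threshold are hypothetical and not from the slides.

```python
def pact_profile(cluster, pageviews, threshold=0.5):
    """Derive a usage profile from one transaction cluster.

    `cluster` is a list of transaction vectors (one weight per pageview).
    The centroid is the per-pageview mean; the profile keeps each pageview
    whose centroid weight exceeds `threshold` (an assumed value).
    """
    n = len(cluster)
    centroid = [sum(t[i] for t in cluster) / n for i in range(len(pageviews))]
    profile = [(p, w) for p, w in zip(pageviews, centroid) if w > threshold]
    # Highest-weight pageviews first.
    return sorted(profile, key=lambda pw: -pw[1])

# Toy cluster of three binary transaction vectors (made up for this example).
pageviews = ["cfp", "news", "conference", "membership"]
cluster = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
profile = pact_profile(cluster, pageviews, threshold=0.5)
```

Here "cfp" appears in all three transactions and "news" in two of them, so both pass the threshold, while "conference" (one of three) and "membership" (none) are dropped.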

Hypergraph-Based Clustering • Recursively partition hypergraph so that each partition contains only highly connected data items • Given a hypergraph G=(V,E) we find a k-way partitioning such that the weight of the hyperedges that are cut is minimized • The fitness of a partition is measured in terms of the ratio of weights of cut edges to the weights of uncut edges within the partition • The connectivity of a vertex measures the percentage of edges within the partition with which the vertex is associated -- used for filtering partitions • Vertices from partial edges can be added back to clusters based on a user-specified overlap factor • Construct a hypergraph from sets of related items • Each hyperedge represents a frequent itemset • Weight of each hyperedge can be based on the characteristics of frequent itemsets or association rules
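
The hypergraph-construction step can be illustrated with a small sketch. It enumerates frequent itemsets by brute force and treats each one as a hyperedge weighted by its support. Using support as the weight is an assumption made here for simplicity (the slides only say the weight can be based on characteristics of the itemsets or rules), and the function name `frequent_itemsets`, the size cap, and the toy data are hypothetical. A real system would use an Apriori-style algorithm rather than full enumeration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets of size 2..max_size.

    Brute-force enumeration, adequate only for small illustrative inputs.
    Each returned itemset can serve as one hyperedge, with its support
    as the (assumed) hyperedge weight.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                result[candidate] = support
    return result

# Toy transactions over pageviews a..d (made up for this example).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
]
hyperedges = frequent_itemsets(transactions, min_support=0.5)
```

With a minimum support of 0.5, only the pairs {a, b} and {a, c} survive, each occurring in two of the four transactions.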

Profiles Based on Hypergraph Clusters of Pageviews • Input • input for clustering is the set of large itemsets from association rule module • each itemset is a hyperedge (weights are a function of the interest of the itemset) • Aggregate Profiles (Pageview Clusters) • hMETIS used as the underlying hypergraph partitioning algorithm • clustering program directly outputs a set of overlapping pageview clusters • the weight associated with pageview p in a cluster C is based on the connectivity value of p in the hypergraph partition, i.e., the fraction of hyperedges within C that contain p: $\mathrm{weight}(p, C) = \frac{|\{e \mid e \subseteq C,\ p \in e\}|}{|\{e \mid e \subseteq C\}|}$
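
The connectivity-based weight can be sketched directly from the definition given earlier (the percentage of edges within the partition with which the vertex is associated). This is a minimal illustration, not the hMETIS interface: the function name `connectivity`, the toy hyperedges, and the cluster are hypothetical, and hyperedge weights are ignored.

```python
def connectivity(page, cluster, hyperedges):
    """Fraction of hyperedges lying entirely inside `cluster` that contain `page`."""
    inside = [e for e in hyperedges if e <= cluster]
    if not inside:
        return 0.0
    return sum(1 for e in inside if page in e) / len(inside)

# Toy hyperedges (frequent itemsets) and one pageview cluster, made up here.
hyperedges = [frozenset(e) for e in (
    {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"c", "d"},
)]
cluster = frozenset({"a", "b", "c"})
weights = {p: connectivity(p, cluster, hyperedges) for p in sorted(cluster)}
```

Three hyperedges fall inside the cluster; "a" belongs to all of them, while "b" and "c" each belong to two, so "a" receives the highest weight in the resulting profile.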

Recommendations Based on Usage Profiles • Match current user’s activity against the discovered usage profiles • a sliding window over the active session captures the current user’s “short-term” history depth • usage profiles and the active session are treated as vectors • matching score is computed based on the similarity between vectors (e.g., normalized cosine similarity) • Recommendations • each pageview is assigned a recommendation score based on • matching score to aggregate profiles • “information value” of the pageview based on domain knowledge (e.g., link distance of the candidate recommendation to the active session) • recommendations are contributed by multiple matching aggregate profiles
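
The matching step can be sketched as follows. This is a minimal illustration under several assumptions: profiles are dicts mapping pageviews to weights, the active session is a binary vector over the last `window` pageviews, and the recommendation score is taken as the square root of the profile weight times the matching score. That combination, the function names, the threshold, and the toy data are choices made for this sketch; the domain-knowledge "information value" factor is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(session, profiles, window=3, threshold=0.5):
    """Score pageviews not yet in the window, using every matching profile."""
    # Sliding window: only the most recent pageviews form the active session.
    active = {p: 1.0 for p in session[-window:]}
    scores = {}
    for profile in profiles:
        match = cosine(active, profile)
        for page, weight in profile.items():
            if page in active:
                continue
            # Assumed scoring rule: sqrt(profile weight * matching score).
            score = math.sqrt(weight * match)
            # A page offered by several profiles keeps its best score.
            scores[page] = max(scores.get(page, 0.0), score)
    return {p: s for p, s in scores.items() if s >= threshold}

# Toy profiles and session (made up for this example).
profiles = [
    {"cfp": 1.0, "news": 0.67, "jcp": 0.67},
    {"conference": 1.0, "updates": 0.25},
]
recs = recommend(["home", "cfp", "news"], profiles, window=2, threshold=0.5)
```

With a window of 2, the active session is {cfp, news}, which overlaps only the first profile, so "jcp" is the single recommendation; the second profile has a matching score of zero and contributes nothing.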

Experimental Set-up • The Data Sets • Log data from the Association for Consumer Research Web site • 18342 transactions, 62 pageview URLs (after filtering) • Data set divided into training and evaluation sets • Evaluation Methodology • A portion of each transaction in the evaluation set (based on a specified window size) was used to generate a recommendation set (based on a given recommendation threshold) • For each transaction, the overall coverage of the recommendation set was divided by the number of recommendations to produce an accuracy measure • The overall score was computed (for each threshold) by averaging the scores over all transactions in the evaluation set
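
One possible reading of the evaluation methodology is sketched below. It assumes that "coverage" means the number of recommended pageviews that actually occur in the remainder of the transaction, so the per-transaction score is a precision-like ratio. That interpretation, the function names, the treatment of empty recommendation sets as a score of 0, and the stub recommender are all assumptions made for illustration.

```python
def accuracy(transaction, recommend_fn, window=2):
    """Score one evaluation transaction.

    The first `window` pageviews generate the recommendations; the rest of
    the transaction is what the recommendations are checked against.
    """
    head, rest = transaction[:window], set(transaction[window:])
    recs = set(recommend_fn(head))
    if not recs:
        return 0.0  # assumed convention for an empty recommendation set
    return len(recs & rest) / len(recs)

def mean_accuracy(transactions, recommend_fn, window=2):
    """Average the per-transaction scores over the evaluation set."""
    scores = [accuracy(t, recommend_fn, window) for t in transactions]
    return sum(scores) / len(scores)

def stub_recommender(head):
    # Hypothetical stand-in for a real recommendation engine.
    return ["c", "d"] if "a" in head else []

# Toy evaluation set (made up for this example).
evaluation_set = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["x", "y", "z"],
]
overall = mean_accuracy(evaluation_set, stub_recommender, window=2)
```

In this toy run the three transactions score 1.0, 0.5, and 0.0, so the overall score is their mean. Repeating the computation for each recommendation threshold would give one point per threshold on an accuracy curve.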

Average Visit Percentage • AVP measures the likelihood that a user who visits any page in a given profile also visits other pages in that profile

Evaluation: Measuring Recommendation Accuracy • Recommendation accuracy results, using an active session window of size 3.

Evaluation: Impact of Filtering Comparison of PACT and Hypergraph (using window size 2) for filtered and unfiltered data sets. Filtering involved the removal of top-level navigational pages from the data set, leaving only deeper content-oriented pages.

Conclusions • Usage-Based Web Personalization • results suggest that effective personalization can be achieved even with anonymous and short-term click-stream data • possibly useful in the early stages of personalization when more detailed profiles are not available for individual users • could be used effectively in conjunction with other methods based on content-based or collaborative filtering • Which Method is Best? • PACT may be most appropriate when the goal is to provide a more general personalization solution involving a variety of objects across the whole site • Hypergraph may be most appropriate when the goal is to provide a highly focused set of recommendations for specific portions of the site • In practice, usage-based methods need to be combined with other techniques to provide an integrated solution
