A Content-Based Approach to Collaborative Filtering Brandon Douthit-Wood CS 470 – Final Presentation
Collaborative Filtering • A method of automating word-of-mouth recommendations • Large groups of users collaborate by rating products, services, news articles, etc. • Analyze the group's ratings data to produce recommendations for individual users • Find users with similar tastes
Problems with Collaborative Filtering Methods • Performance • Prohibitively large dataset • Scalability • Will the solution scale to millions of users on the Internet? • Sparsity of data • User who has rated few items • Item with few ratings
Problems with Collaborative Filtering Methods • Cannot compare users who have no common ratings (ratings are on a 1-5 scale)
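To make the problem concrete, here is a minimal sketch with hypothetical users and movies: a similarity computed over co-rated items (such as Pearson correlation on shared ratings) has nothing to work with when the two rated sets are disjoint.

```python
# Hypothetical example: two users whose rated movies do not overlap.
# A similarity computed over co-rated items (e.g. Pearson) is undefined here.
alice = {"Toy Story": 5, "Fargo": 2}        # ratings on a 1-5 scale
bob = {"GoldenEye": 4, "Braveheart": 3}

common = set(alice) & set(bob)
print(common)  # set() -- no co-rated movies, so no rating-based correlation exists
```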
A Content-Based Approach • Build a feature list for each user based on content of items rated • Compare users’ features to make recommendations • Now we can find similarity between users with no common ratings
Data Source • EachMovie Project • Compaq Systems Research Center • Over 18 months collected 2,811,983 ratings for 1,628 movies from 72,916 users • Ratings given on 1-5 scale • Dataset split into 75% training, 25% testing • Internet Movie Database (IMDb) • Huge database of movie information • Actors, director, genre, plot description, etc.
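A minimal sketch of the 75/25 split, assuming the EachMovie ratings have already been parsed into (user, movie, rating) tuples; the function name and loading details are hypothetical:

```python
import random

def split_ratings(ratings, train_fraction=0.75, seed=0):
    """Shuffle (user, movie, rating) tuples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```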
Creating the Feature List • Retrieve content information for each movie from the IMDb dataset to create a "bag of words" • Throw out common words (e.g., the, and, but) • Calculate the frequency of the remaining words to create the movie's feature list • Frequencies are weighted by the movie's total number of terms
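A minimal sketch of the feature-list construction, assuming each movie's IMDb content (actors, director, genre, plot description) has been concatenated into one text string; the stop-word list here is abbreviated:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "but", "a", "an", "of", "in", "to", "is"}  # abbreviated

def movie_feature_list(content_text):
    """Build a bag-of-words feature list with frequencies weighted by total term count."""
    words = re.findall(r"[a-z']+", content_text.lower())
    words = [w for w in words if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    # Weight each term's count by the total number of terms in the document.
    return {term: count / total for term, count in counts.items()}
```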
Comparing Users • Each user has a positive and a negative feature list • Combine the feature lists of the movies they have rated • Compare users' feature lists using the Pearson correlation coefficient • Users can be compared even with no common ratings • Items with few ratings can still be recommended • Users only need to rate a few items to receive recommendations
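One possible reading of the comparison step, as a sketch: user feature lists are term-to-weight dictionaries combined from the movies they rated (the positive/negative split is omitted here), and the Pearson correlation is taken over the union of terms, treating missing terms as 0.

```python
import math

def combine_feature_lists(movie_feature_lists):
    """Merge the feature lists of a user's rated movies into one user feature list."""
    combined = {}
    for features in movie_feature_lists:
        for term, weight in features.items():
            combined[term] = combined.get(term, 0.0) + weight
    return combined

def pearson(f1, f2):
    """Pearson correlation between two feature lists over the union of their terms."""
    terms = set(f1) | set(f2)
    if not terms:
        return 0.0
    x = [f1.get(t, 0.0) for t in terms]
    y = [f2.get(t, 0.0) for t in terms]
    n = len(terms)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0
    return cov / (sd_x * sd_y)
```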
Methods • Three methods were tried to improve performance: • Clustering of users • Random groups of users • Comparing users directly to items
User Clustering • Simple greedy algorithm, starting with the first user: • Compare the user to existing clusters first • If similarity is high, merge the user into that cluster • Otherwise compare to each remaining unclustered user, stopping when the correlation exceeds a threshold • Once a similar user is found, create a new cluster from the two users • Each cluster keeps the combined feature list of all its users • Not as efficient as possible - O(n²)
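A sketch of that clustering pass, reusing the `pearson` helper from the earlier sketch; the similarity threshold and data layout are hypothetical:

```python
def merge_into(target, source):
    """Add one feature list into another (used to keep a cluster's combined feature list)."""
    for term, weight in source.items():
        target[term] = target.get(term, 0.0) + weight

def cluster_users(user_features, threshold=0.5):
    """Greedy single pass: merge each user into the first sufficiently similar cluster,
    otherwise pair them with the first similar unclustered user to seed a new cluster."""
    clusters = []                      # each cluster: {"members": [...], "features": {...}}
    unclustered = list(user_features)  # user ids still waiting to be placed
    while unclustered:
        user = unclustered.pop(0)
        placed = False
        for cluster in clusters:       # compare to existing clusters first
            if pearson(user_features[user], cluster["features"]) > threshold:
                cluster["members"].append(user)
                merge_into(cluster["features"], user_features[user])
                placed = True
                break
        if placed:
            continue
        for other in unclustered:      # otherwise scan remaining users, stop at first match
            if pearson(user_features[user], user_features[other]) > threshold:
                unclustered.remove(other)
                features = dict(user_features[user])
                merge_into(features, user_features[other])
                clusters.append({"members": [user, other], "features": features})
                placed = True
                break
        if not placed:                 # no similar user found: singleton cluster
            clusters.append({"members": [user], "features": dict(user_features[user])})
    return clusters
```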
User Clustering • Once clusters are formed, we can predict ratings for each item • For each user, find their 10 nearest neighbors • The predicted rating is the average of those neighbors' ratings for the item
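A sketch of the neighborhood prediction, again assuming the `pearson` helper and per-user feature lists from the earlier sketches; `candidates` is whatever pool the neighbors are drawn from:

```python
def predict_rating(user, movie, candidates, user_features, ratings, k=10):
    """Predict a rating as the average rating for the movie among the user's
    k most similar candidate neighbors who have actually rated it."""
    others = [c for c in candidates if c != user]
    others.sort(key=lambda o: pearson(user_features[user], user_features[o]), reverse=True)
    neighbor_ratings = [ratings[o][movie] for o in others[:k]
                        if movie in ratings.get(o, {})]
    if not neighbor_ratings:
        return None  # none of the nearest neighbors rated this movie
    return sum(neighbor_ratings) / len(neighbor_ratings)
```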
Selecting a Random Group • Randomly select 5000 users as a (hopefully) representative sample • As before, find a user's 10 nearest neighbors from the random group • The predicted rating is the average of those neighbors' ratings for the item • Much less work than clustering • How much accuracy (if any) will be lost?
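The random-group variant only changes where the candidate neighbors come from; a short sketch, reusing the hypothetical `predict_rating` above:

```python
import random

def random_candidate_pool(all_users, size=5000, seed=0):
    """Draw a fixed random sample of users to serve as the candidate neighbor pool."""
    rng = random.Random(seed)
    users = list(all_users)
    return rng.sample(users, min(size, len(users)))

# pool = random_candidate_pool(all_users)
# prediction = predict_rating(user, movie, pool, user_features, ratings)
```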
Comparing Users to Items • No collaborative filtering involved • Compare the user's positive and negative feature lists to the item's feature list • Predict based on which of the user's feature lists has the higher correlation with the item • Quick and easy to do • How accurate will this be?
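A sketch of the purely content-based prediction, assuming each user keeps a positive feature list (built from movies they liked) and a negative one (from movies they disliked), and reusing the `pearson` helper:

```python
def content_based_prediction(positive_features, negative_features, movie_features):
    """Predict 'like' or 'dislike' for a movie depending on which of the user's
    two feature lists correlates more strongly with the movie's feature list."""
    pos_corr = pearson(positive_features, movie_features)
    neg_corr = pearson(negative_features, movie_features)
    return "like" if pos_corr >= neg_corr else "dislike"
```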
Analyzing Predictions • Collected 3 metrics to evaluate predictions • Accuracy: fraction of all test items predicted correctly • Precision: fraction of items predicted as positive that the user actually liked • Recall: fraction of the user's unseen positive items that were predicted as positive • Precision and recall have an inverse relationship
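As a sketch, the three metrics can be computed from binary like/dislike predictions over the test set; the variable names are hypothetical:

```python
def evaluate(predictions, actuals):
    """Accuracy, precision, and recall for parallel lists of True (like) / False (dislike)."""
    tp = sum(1 for p, a in zip(predictions, actuals) if p and a)
    fp = sum(1 for p, a in zip(predictions, actuals) if p and not a)
    fn = sum(1 for p, a in zip(predictions, actuals) if not p and a)
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    accuracy = correct / len(actuals) if actuals else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```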
Conclusions • Large gain from clustering users • Is the extra work worth it? • Depends on the application • Purely content-based predictions worked pretty well • Simple, fast solution • Random group prediction also performed reasonably well • Problems solved by content-based analysis: • Sparsity of data • Performance • Scalability