
A Content-Based Approach to Collaborative Filtering



  1. A Content-Based Approach to Collaborative Filtering Brandon Douthit-Wood CS 470 – Final Presentation

  2. Collaborative Filtering • A method of automating word-of-mouth recommendations • Large groups of users collaborate by rating products, services, news articles, etc. • Analyze the group's ratings data to produce recommendations for individual users • Find users with similar tastes

  3. Problems with Collaborative Filtering Methods • Performance • Prohibitively large datasets • Scalability • Will the solution scale to millions of users on the Internet? • Sparsity of data • A user who has rated few items • An item with few ratings

  4. Problems with Collaborative Filtering Methods • Users who have no ratings in common cannot be compared at all (ratings on a 1-5 scale); the sketch below illustrates the problem
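
A minimal sketch of the problem, using hypothetical users and movie titles: with no overlapping rated items, any similarity measure computed over common ratings simply has nothing to work with.

```python
# Hypothetical users and ratings (1-5 scale), for illustration only.
alice = {"Alien": 5, "Blade Runner": 4, "Clue": 2}
bob = {"Dune": 3, "Eraserhead": 5, "Fargo": 4}

common = set(alice) & set(bob)
print(common)  # set(): with no common ratings, a rating-based
               # similarity such as Pearson correlation is undefined
```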

  5. A Content-Based Approach • Build a feature list for each user based on the content of the items they rated • Compare users' features to make recommendations • We can now measure similarity between users who have no ratings in common

  6. Data Source • EachMovie Project • Compaq Systems Research Center • Over 18 months, collected 2,811,983 ratings for 1,628 movies from 72,916 users • Ratings given on a 1-5 scale • Dataset split into 75% training, 25% testing • Internet Movie Database (IMDb) • Huge database of movie information: actors, director, genre, plot description, etc.

  7. Creating the Feature List • Retrieve content information for each movie from the IMDb dataset and create a "bag of words" • Throw out common words (e.g., the, and, but) • Calculate the frequency of the remaining words to create the movie's feature list • Frequencies are weighted by the total number of terms (see the sketch below)
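
A minimal Python sketch of this step, assuming the movie's IMDb content is available as plain text; the stop-word list here is abbreviated and the exact tokenization rules are assumptions.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "but", "a", "an", "of", "in", "to", "is"}  # abbreviated

def movie_feature_list(content: str) -> dict[str, float]:
    """Build a movie's feature list: term frequencies from its IMDb
    content (actors, director, genre, plot, ...), with common words
    removed and each frequency weighted by the total number of terms."""
    words = [w for w in content.lower().split()
             if w.isalpha() and w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}
```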

  8. Comparing Users • Each user has a positive and a negative feature list • Combine the feature lists of the movies they have rated • Compare users' feature lists using the Pearson correlation coefficient (sketched below) • Users can be compared even with no common ratings • Able to recommend items with few ratings • Users only need to rate a few items to receive recommendations
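
A sketch of the comparison step. The slides don't say whether the correlation is computed over the union or the intersection of the two term sets; this version assumes the union, treating missing terms as weight 0. The `pearson` helper is reused in the later sketches.

```python
import math

def pearson(f1: dict[str, float], f2: dict[str, float]) -> float:
    """Pearson correlation coefficient between two feature lists
    (term -> weight), over the union of their terms."""
    terms = sorted(set(f1) | set(f2))
    n = len(terms)
    if n == 0:
        return 0.0
    x = [f1.get(t, 0.0) for t in terms]
    y = [f2.get(t, 0.0) for t in terms]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```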

  9. Methods • Three methods were tried to improve performance: • Clustering of users • Random groups of users • Comparing users directly to items

  10. User Clustering • Simple algorithm, starting with the first user: • Compare to existing clusters first • If similarity is high, merge the user into that cluster • Otherwise, compare to each remaining user, stopping once the correlation is above a threshold • Once a similar user is found, create a new cluster from the two users • A cluster keeps the combined feature list of all its users • Not as efficient as possible: O(n²) (see the sketch after this list)
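
A sketch of the greedy clustering pass described above, reusing `pearson` from the earlier sketch. The threshold value, the merge rule (averaging term weights), and what happens to a user with no match (a singleton cluster) are all assumptions; the slides don't specify them.

```python
def merge_features(f1, f2):
    """Combine two feature lists by averaging term weights (one plausible
    way to maintain a cluster's combined list)."""
    terms = set(f1) | set(f2)
    return {t: (f1.get(t, 0.0) + f2.get(t, 0.0)) / 2 for t in terms}

def cluster_users(users, threshold=0.7):
    """users: list of (user_id, feature_list) pairs.
    Worst case compares every pair of users: O(n^2)."""
    clusters = []  # each cluster: {"members": [...], "features": {...}}
    remaining = list(users)
    while remaining:
        uid, feats = remaining.pop(0)
        # Compare to existing clusters first; merge on high similarity.
        merged = False
        for c in clusters:
            if pearson(feats, c["features"]) > threshold:
                c["members"].append(uid)
                c["features"] = merge_features(c["features"], feats)
                merged = True
                break
        if merged:
            continue
        # Otherwise scan the remaining users; stop at the first match
        # and form a new cluster from the two users.
        for i, (vid, vfeats) in enumerate(remaining):
            if pearson(feats, vfeats) > threshold:
                remaining.pop(i)
                clusters.append({"members": [uid, vid],
                                 "features": merge_features(feats, vfeats)})
                break
        else:
            # No similar user found: this user forms a singleton cluster.
            clusters.append({"members": [uid], "features": feats})
    return clusters
```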

  11. User Clustering • Once clusters are formed, we can predict ratings for each item • For each user, find their 10 nearest neighbors • The predicted rating is the average rating of the item from these neighbors (sketched below)
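
A sketch of the prediction step. How the 10 nearest neighbors relate to the clusters isn't fully spelled out; here they are simply the k most similar users (by feature-list correlation) among the given candidates who have rated the item.

```python
def predict_rating(user_feats, item, candidates, ratings, k=10):
    """candidates: (user_id, feature_list) pairs to search for neighbors.
    ratings: user_id -> {item: rating}. Returns the average rating of
    `item` among the user's k nearest neighbors, or None if no
    candidate has rated it."""
    rated = [(uid, f) for uid, f in candidates if item in ratings.get(uid, {})]
    rated.sort(key=lambda uf: pearson(user_feats, uf[1]), reverse=True)
    top = rated[:k]
    if not top:
        return None
    return sum(ratings[uid][item] for uid, _ in top) / len(top)
```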

  12. Selecting a Random Group • Randomly select 5,000 users as a (hopefully) representative sample • As before, find a user's 10 nearest neighbors from the random group • The predicted rating is the average rating of the item from these neighbors (see the sketch below) • Much less work than clustering • How much accuracy (if any) will be lost?
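
The random-group variant only changes where the candidate neighbors come from; this short sketch reuses `predict_rating` from the previous sketch.

```python
import random

def predict_from_random_group(user_feats, item, all_users, ratings,
                              sample_size=5000, k=10):
    """all_users: list of (user_id, feature_list) pairs."""
    group = random.sample(all_users, min(sample_size, len(all_users)))
    return predict_rating(user_feats, item, group, ratings, k)
```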

  13. Comparing Users to Items • No collaborative filtering involved • Compare the user's positive and negative feature lists to the item's feature list • Make the prediction based on which of the two lists correlates more strongly with the item (sketched below) • Quick and easy to do • How accurate will this be?
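
A sketch of the purely content-based prediction, assuming a binary like/dislike output; the slides don't give the exact decision rule, so a tie goes to "positive" here.

```python
def content_only_prediction(pos_feats, neg_feats, item_feats):
    """Predict whether the user will like the item by checking which of
    the user's two feature lists correlates more strongly with it."""
    if pearson(pos_feats, item_feats) >= pearson(neg_feats, item_feats):
        return "positive"
    return "negative"
```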

  14. Analyzing Predictions • Collected 3 metrics to evaluate predictions • Accuracy: the fraction of all items predicted correctly • Precision: the fraction of items predicted positive that are actually positive • Recall: the fraction of (unseen) positive items that are predicted positive • Precision and recall have an inverse relationship (see the sketch below)
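
A sketch of the three metrics over binary (positive/negative) predictions; how the original evaluation mapped 1-5 ratings to positive and negative labels is an assumption not covered here.

```python
def evaluate(predicted, actual):
    """predicted, actual: parallel lists of booleans (True = positive)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives
    return accuracy, precision, recall
```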

  15. Results

  16. Conclusions • Large gain from clustering users • Is the extra work worth it? It depends on the application • Purely content-based predictions worked quite well • A simple, fast solution • Random-group prediction also performed reasonably well • Problems addressed by content-based analysis: • Sparsity of data • Performance • Scalability
