Netflix Challenge: Combined Collaborative Filtering

Greg Nelson, Alan Sheinberg


Presentation Transcript


  1. Netflix Challenge: Combined Collaborative Filtering. Greg Nelson, Alan Sheinberg

  2. Overall Idea
     • Data
       • 480,189 users
       • 17,770 movies
       • 99% sparse
     • Query: how will user u rate movie m?
     • One Idea
       • Find users similar to u and average the ratings they gave movie m (where they exist), weighted by user similarity
     • Another Idea
       • Find movies similar to m and average the ratings u gave those movies (where they exist), weighted by movie similarity
     • Combine Ideas
       • Consider how u and similar users rated movie m and similar movies
       • Take the average of the existing ratings, weighted by the product of similarities
       • More pairs to consider will overcome sparsity (need about 100)
       • Normalize rating scales: subtract the mean, divide by the standard deviation
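
The combined predictor on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the dictionary layout, the function names `normalize` and `predict`, and the fallback to the user's own mean when no ratings exist in the neighborhood are all assumptions. The neighbor sets are expected to include u and m themselves with similarity 1.0.

```python
import math

def normalize(user_ratings):
    """Z-score one user's ratings: subtract the mean, divide by the std dev."""
    values = list(user_ratings.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # assumption: treat constant raters as std 1
    return {m: (r - mean) / std for m, r in user_ratings.items()}, mean, std

def predict(u, m, ratings, user_sims, movie_sims):
    """Predict user u's rating of movie m.

    ratings:    {user: {movie: raw rating}}
    user_sims:  {user: similarity to u}, including u itself
    movie_sims: {movie: similarity to m}, including m itself
    """
    _, u_mean, u_std = normalize(ratings[u])
    num = den = 0.0
    for v, sim_u in user_sims.items():
        if not ratings.get(v):
            continue
        z, _, _ = normalize(ratings[v])
        for n, sim_m in movie_sims.items():
            if (v, n) == (u, m) or n not in z:
                continue  # skip the target itself and missing ratings
            w = sim_u * sim_m  # weight = product of similarities
            num += w * z[n]
            den += w
    if den == 0.0:
        return u_mean  # assumption: no support, fall back to u's mean
    # Map the averaged z-score back onto u's own rating scale.
    return u_mean + u_std * num / den
```

Normalizing per user before averaging means a harsh rater's 3 and a generous rater's 5 can count as the same opinion; the result is then rescaled with u's own mean and standard deviation.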

  3. Finding Similar Users/Movies
     • Data set is huge, can't just use the naïve approach
     • View users as vectors: entry i = 1 if movie i was rated, 0 if not
     • Use MinHash (Jaccard similarity) to create signatures
     • Use LSH to find similar users
     • Same idea to find similar movies
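
A small in-memory sketch of the MinHash plus LSH banding idea, assuming each user is represented by the set of integer movie ids they rated. The hash family `(a*x + b) mod p`, the function names, and the band/row parameters are illustrative choices, not details taken from the slides.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # a large prime; ids are assumed to be smaller than this

def make_hashes(k, seed=0):
    """Draw k random hash functions of the form (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def signature(items, hashes):
    """MinHash signature of a non-empty set of integer ids."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]

def lsh_candidates(signatures, bands, rows):
    """Return pairs of keys whose signatures agree on at least one band.

    signatures: {key: signature of length bands * rows}
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(key)
        for keys in buckets.values():
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    candidates.add(tuple(sorted((keys[i], keys[j]))))
    return candidates
```

The fraction of signature positions on which two users agree estimates their Jaccard similarity, and banding trades recall against the number of candidate pairs that need an exact check. Swapping the roles of users and movies gives the movie neighborhoods.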

  4. LSH Implementation
     • Each band requires a disk-based hash table
       • Minimize the number of IOs
       • Minimize the number of seeks
     • Batch entries in memory
       • Batch in FIFO order: ~really long time
       • Group by bucket (minimize IO): ~5 hours
       • Group by bucket + sort writes by bucket # (minimize IO and seeks): ~25 minutes
     • Could sort data by bucket #, but that would make changing hash functions and LSH parameters a big pain. Also, not much faster than the last approach.
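
The third batching strategy (group by bucket, then write in bucket order) might look something like the sketch below. The class name `BatchedBucketWriter`, the `write_bucket` callback standing in for the disk-based hash table, and the `max_entries` threshold are all hypothetical; the slides do not describe the actual on-disk layout.

```python
from collections import defaultdict

class BatchedBucketWriter:
    """Buffer (bucket, key) entries in memory and flush them grouped by bucket.

    write_bucket(bucket, keys) is a hypothetical callback that appends keys
    to one bucket of the on-disk table.
    """

    def __init__(self, write_bucket, max_entries=1_000_000):
        self.write_bucket = write_bucket
        self.max_entries = max_entries
        self.pending = defaultdict(list)
        self.count = 0

    def add(self, bucket, key):
        self.pending[bucket].append(key)
        self.count += 1
        if self.count >= self.max_entries:
            self.flush()

    def flush(self):
        # One write per bucket (fewer IOs), issued in ascending bucket
        # number so the disk head moves in one direction (fewer seeks).
        for bucket in sorted(self.pending):
            self.write_bucket(bucket, self.pending[bucket])
        self.pending.clear()
        self.count = 0
```

Callers would invoke `flush()` once more after the last `add` so that a partially filled batch is not lost.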

  5. Results -- Terminology
     • U := neighborhood of user u
     • M := neighborhood of movie m
     • Support := # of existing ratings in U x M
     • Graph: RMSE vs. |U x M|
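
For concreteness, the two quantities defined on this slide could be computed as below. The `ratings` layout (`{user: {movie: rating}}`) and the function names are assumptions made for the example.

```python
import math

def support(U, M, ratings):
    """Number of (user, movie) pairs in U x M that have an existing rating."""
    return sum(1 for v in U for n in M if n in ratings.get(v, {}))

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

A query would then only be answered by the combined predictor when `support(U, M, ratings)` reaches the chosen threshold, such as the 5 and 10 used on the following slides.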

  6. Results -- Support = 5

  7. Results -- Support = 10

  8. Future Ideas
     • Missing values are the biggest problem
     • Use a content-based predictor to fill in the "holes"?
     • We did begin some work on this:
       • Scraped IMDb for genre, director, producer, cast, and plot summary for most movies
       • Use a classifier instead of just using the average in place of missing values
       • Focused mainly on CF and didn't get far with this
