Netflix Challenge: Combined Collaborative Filtering
Greg Nelson, Alan Sheinberg
Overall Idea
• Data
  • 480,189 users
  • 17,770 movies
  • 99% sparse
• Query: how will user u rate movie m?
• One Idea
  • Find users similar to u and average the ratings they gave to movie m (where they exist), weighted by user similarity
• Another Idea
  • Find movies similar to m and average the ratings u gave to those movies (where they exist), weighted by movie similarity
• Combine Ideas
  • Consider how u and similar users rated movie m and similar movies
  • Take the average of the existing ratings, weighted by the product of the user and movie similarities
  • More pairs to consider will overcome sparsity (need about 100)
  • Normalize rating scales: subtract mean, divide by standard deviation
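The combined idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `normalize` and `predict`, the dictionary layout of `ratings`, and the convention that u and m appear in their own neighborhoods with similarity 1.0 are all hypothetical. The fallback to the user's mean when no rating exists is also an assumption.

```python
from math import sqrt

def normalize(user_ratings):
    """Subtract the user's mean and divide by the user's standard deviation."""
    values = list(user_ratings.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = sqrt(var) or 1.0  # assumption: treat a constant rater as std = 1
    return {m: (r - mean) / std for m, r in user_ratings.items()}, mean, std

def predict(u, m, ratings, user_sim, movie_sim):
    """Predict u's rating of m from existing ratings in (similar users) x (similar movies).

    ratings:   {user: {movie: raw rating}}
    user_sim:  {neighbor user: similarity to u}, assumed to include u with 1.0
    movie_sim: {neighbor movie: similarity to m}, assumed to include m with 1.0
    """
    _, u_mean, u_std = normalize(ratings[u])
    num = den = 0.0
    for v, s_user in user_sim.items():
        if not ratings.get(v):
            continue  # neighbor has no ratings at all
        z, _, _ = normalize(ratings[v])
        for n, s_movie in movie_sim.items():
            if n not in z or (v, n) == (u, m):
                continue  # skip missing ratings and the target pair itself
            w = s_user * s_movie  # weight = product of the two similarities
            num += w * z[n]
            den += w
    if den == 0.0:
        return u_mean  # assumption: fall back to the user's mean
    # Map the weighted average back onto u's own rating scale.
    return u_mean + u_std * num / den

ratings = {"u": {"a": 4, "b": 2}, "v": {"a": 5, "b": 1, "m": 5}}
estimate = predict("u", "m", ratings, {"u": 1.0, "v": 0.5}, {"m": 1.0, "a": 0.5})
```

Here the estimate draws on three existing ratings: u's rating of the similar movie a, v's rating of m, and v's rating of a. A real run would want many more pairs, in line with the "about 100" figure on the slide.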
Finding Similar Users/Movies
• Data set is huge, so the naïve approach is not an option
• View users as vectors: entry i = 1 if movie i was rated, 0 if not
• Use MinHash (Jaccard similarity) to create signatures
• Use LSH to find similar users
• Same idea to find similar movies
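A minimal sketch of the MinHash and LSH banding steps, to make the pipeline concrete. Everything specific here is an assumption for illustration: the linear hash family modulo a prime, the names `make_hashes`, `minhash`, and `lsh_candidates`, and the in-memory dictionary standing in for the per-band hash tables. Each user is represented by the set of movie ids they rated, which is equivalent to the 0/1 vector on the slide.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2^31 - 1, assumed larger than any movie id

def make_hashes(k, seed=0):
    """Draw k hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def minhash(items, hashes):
    """Signature of a non-empty set: the minimum hash value under each function."""
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in hashes)

def lsh_candidates(signatures, bands, rows):
    """Return pairs whose signatures agree on all rows of at least one band."""
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for band in range(bands):
            chunk = sig[band * rows:(band + 1) * rows]
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs

users = {"u1": {1, 2, 3, 4, 5}, "u2": {1, 2, 3, 4, 5}, "u3": {100, 200, 300}}
hashes = make_hashes(20)  # 5 bands x 4 rows
signatures = {name: minhash(movies, hashes) for name, movies in users.items()}
candidates = lsh_candidates(signatures, bands=5, rows=4)
```

Users with identical rating sets always collide in every band, while users with no movie in common cannot collide under this hash family. Candidate pairs in between would still need their Jaccard similarity checked. Swapping the roles of users and movies gives the movie-side search.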
LSH Implementation
• Each band requires a disk-based hash table
• Minimize the number of IOs
• Minimize the number of seeks
• Batch entries in memory
  • Batch in FIFO order: ~really long time
  • Group by bucket (minimize IO): ~5 hours
  • Group by bucket + sort writes by bucket # (minimize IO and seeks): ~25 minutes
• Could sort data by bucket #, but that would make changing hash functions and LSH parameters a big pain. Also, not much faster than the last approach.
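The fastest strategy on this slide (group by bucket, then write in bucket order) might look like the sketch below. The on-disk layout is purely an assumption made for illustration: fixed-size buckets, each holding a 4-byte count followed by 4-byte ids, with overflow silently dropped. The class name `BatchedBucketFile` and the `SLOTS` capacity are hypothetical, and the slides do not describe the actual file format.

```python
import struct
from collections import defaultdict

SLOTS = 64                    # hypothetical fixed capacity per bucket
BUCKET_BYTES = 4 + 4 * SLOTS  # 4-byte count, then up to SLOTS 4-byte ids

class BatchedBucketFile:
    """One band's hash table on disk, written in batches sorted by bucket number."""

    def __init__(self, path, num_buckets, batch_size=100_000):
        self.batch_size = batch_size
        self.pending = []  # (bucket, item_id) pairs buffered in memory
        self.f = open(path, "w+b")
        self.f.write(b"\0" * (num_buckets * BUCKET_BYTES))  # all counts start at 0

    def add(self, bucket, item_id):
        self.pending.append((bucket, item_id))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Group by bucket so each bucket is touched once per batch.
        grouped = defaultdict(list)
        for bucket, item_id in self.pending:
            grouped[bucket].append(item_id)
        # Visit buckets in ascending order so file offsets only move forward.
        for bucket in sorted(grouped):
            offset = bucket * BUCKET_BYTES
            self.f.seek(offset)
            (count,) = struct.unpack("<I", self.f.read(4))
            ids = grouped[bucket][:SLOTS - count]  # sketch only: drop overflow
            self.f.seek(offset + 4 + 4 * count)
            self.f.write(struct.pack(f"<{len(ids)}I", *ids))
            self.f.seek(offset)
            self.f.write(struct.pack("<I", count + len(ids)))
        self.pending.clear()

    def read_bucket(self, bucket):
        self.f.seek(bucket * BUCKET_BYTES)
        (count,) = struct.unpack("<I", self.f.read(4))
        return list(struct.unpack(f"<{count}I", self.f.read(4 * count)))

    def close(self):
        self.flush()
        self.f.close()
```

Because hash functions and LSH parameters only affect which bucket number an entry gets, this approach leaves the input data untouched, which matches the slide's reason for not pre-sorting the data itself.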
Results -- Terminology
• U := neighborhood of user u
• M := neighborhood of movie m
• Support := # existing ratings in U × M
• Graph RMSE vs. |U × M|
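As a rough illustration of these definitions, the sketch below counts the support of a neighborhood pair and groups prediction error by support so that RMSE can be plotted against it. The function names, the `(support, predicted, actual)` record layout, and the bin width are assumptions, and the numbers in the example are made up rather than taken from the authors' results.

```python
from math import sqrt
from collections import defaultdict

def support(U, M, ratings):
    """Number of existing ratings in U x M, with ratings as {user: {movie: rating}}."""
    return sum(1 for v in U for n in M if n in ratings.get(v, {}))

def rmse_by_support(records, bin_width=25):
    """Group (support, predicted, actual) records into bins and return RMSE per bin."""
    squared_error = defaultdict(float)
    count = defaultdict(int)
    for s, predicted, actual in records:
        bin_start = (s // bin_width) * bin_width
        squared_error[bin_start] += (predicted - actual) ** 2
        count[bin_start] += 1
    return {b: sqrt(squared_error[b] / count[b]) for b in sorted(count)}

toy_ratings = {"u": {"a": 4}, "v": {"a": 5, "m": 3}}
toy_support = support({"u", "v"}, {"a", "m"}, toy_ratings)
toy_curve = rmse_by_support([(10, 3.0, 4.0), (12, 5.0, 4.0), (30, 4.0, 4.0)])
```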
Future Ideas
• Missing values are the biggest problem
• Use a content-based predictor to fill in the “holes”?
• We did begin some work on this:
  • Scraped IMDb for genre, director, producer, cast, and plot summary for most movies
  • Use a classifier instead of just using the average in place of missing values
• Focused mainly on CF and didn’t get far with this