Netflix Prize: Predicting Ratings
Data
• mv_00(movieID).txt — first line is the movie ID followed by a colon (e.g. "1:"), then one rating per line: userID (1-2,649,429), rating (1-5); a parsing sketch follows below
• Over 17,000 movie txt files
• Over 400,000 userIDs
• About two gigabytes zipped
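To make the file layout concrete, here is a minimal Java sketch of reading one movie file. The class and method names are illustrative, not from the slides; the sketch assumes comma-separated lines starting with userID and rating, and ignores any additional fields (the published files also carry a date column).

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MovieFileReader {

    /** One (userID, rating) pair from a movie file. */
    public static class RatingRow {
        public final int userId;
        public final int rating;
        public RatingRow(int userId, int rating) {
            this.userId = userId;
            this.rating = rating;
        }
    }

    /**
     * Reads one mv_*.txt file: the first line is the movie ID with a
     * trailing colon (e.g. "1:"), every following line starts with
     * "userID,rating". Fields after the rating are ignored.
     */
    public static List<RatingRow> read(String path) throws IOException {
        List<RatingRow> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            in.readLine();                       // skip the "movieID:" header line
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                rows.add(new RatingRow(Integer.parseInt(parts[0].trim()),
                                       Integer.parseInt(parts[1].trim())));
            }
        }
        return rows;
    }
}
```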
Overall Plan
• Compute user similarity using:
  • termFrequency: # of movies in common
  • documentFrequency: 1 / |rating1 - rating2|
  • tfdf = (# of movies in common) * 1 / |rating1 - rating2| (sketched below)
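A minimal Java sketch of the tfdf similarity between two users. The slides do not say how the rating difference is aggregated when users share several movies, or what happens when two ratings are identical; this sketch sums |rating1 - rating2| over the shared movies and floors the denominator at 1, which are assumptions, not the author's stated method.

```java
import java.util.HashMap;
import java.util.Map;

public class TfDfSimilarity {

    /**
     * tfdf = (# of movies in common) * 1 / |rating1 - rating2|,
     * with each user given as a map of movieID -> rating.
     */
    public static double tfdf(Map<Integer, Integer> user1, Map<Integer, Integer> user2) {
        int moviesInCommon = 0;      // termFrequency
        int totalRatingDiff = 0;     // basis for documentFrequency
        for (Map.Entry<Integer, Integer> e : user1.entrySet()) {
            Integer otherRating = user2.get(e.getKey());
            if (otherRating != null) {
                moviesInCommon++;
                totalRatingDiff += Math.abs(e.getValue() - otherRating);
            }
        }
        if (moviesInCommon == 0) {
            return 0.0;              // no overlap, no similarity
        }
        // Floor at 1 so identical ratings give the largest weight (assumption).
        double documentFrequency = 1.0 / Math.max(1, totalRatingDiff);
        return moviesInCommon * documentFrequency;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> a = new HashMap<>();
        a.put(8, 4);  a.put(42, 5);  a.put(17770, 3);
        Map<Integer, Integer> b = new HashMap<>();
        b.put(8, 4);  b.put(42, 3);
        System.out.println(tfdf(a, b));  // 2 movies in common, total diff 2 -> 1.0
    }
}
```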
Plan 1
• Store it all in memory (haha) in Java
• Store a User class with:
  • userID
  • array of Movie classes:
    • movieID
    • rating
• Then keep a matrix of users, each with an array of its top similar users (ranked by tfdf); a class sketch follows below
• Problem 1 - memory issues
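A sketch of the in-memory layout Plan 1 describes (names are illustrative). With one object per (movieID, rating) pair for every user, the full data set far exceeds a typical Java heap, which is the "memory issues" problem noted above.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory layout described by Plan 1. */
public class User {

    /** One rated movie for this user. */
    public static class Movie {
        public final int movieId;
        public final int rating;   // 1-5
        public Movie(int movieId, int rating) {
            this.movieId = movieId;
            this.rating = rating;
        }
    }

    public final int userId;
    public final List<Movie> movies = new ArrayList<>();

    /** Top similar users, ranked by tfdf, filled in once similarity is computed. */
    public final List<User> topSimilarUsers = new ArrayList<>();

    public User(int userId) {
        this.userId = userId;
    }
}
```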
Plan 2*
• Step 1: store the data in text files on the hard drive using Java
  • one text file per user
• Step 2: compute similarity (tfdf)
  • one text file of the top ten similar users for each user
• Step 3: predictions
  • run through the two directories of text files to compute an average movie rating as the prediction (see the sketch after this list)
• Problem 2 - very slow:
  • Step 1: 3 days - ~5,000 movie text files processed so far
  • Step 2: 1 user every 35 mins, improved to 1 user every 5 mins
  • Step 3: ~10 minutes so far
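A rough Java sketch of Plan 2's Step 3: average the ratings that a user's top similar users gave to the target movie. The directory names and file formats here are assumptions ("similar/&lt;userID&gt;.txt" with one similar userID per line, "users/&lt;userID&gt;.txt" with "movieID,rating" lines), as is the fallback of 3.0 when no neighbor has rated the movie.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Plan2Predictor {

    /** Predicts a rating for (userId, movieId) from the top-similar-users file. */
    public static double predict(int userId, int movieId) throws IOException {
        double sum = 0;
        int count = 0;
        try (BufferedReader similar = new BufferedReader(
                new FileReader("similar/" + userId + ".txt"))) {
            String neighborId;
            while ((neighborId = similar.readLine()) != null) {
                Integer rating = ratingFor(neighborId.trim(), movieId);
                if (rating != null) {
                    sum += rating;
                    count++;
                }
            }
        }
        // Fall back to 3.0 when no similar user rated this movie (assumption).
        return count == 0 ? 3.0 : sum / count;
    }

    /** Scans a neighbor's "movieID,rating" file for the requested movie. */
    private static Integer ratingFor(String neighborId, int movieId) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new FileReader("users/" + neighborId + ".txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                if (Integer.parseInt(parts[0].trim()) == movieId) {
                    return Integer.parseInt(parts[1].trim());
                }
            }
        }
        return null;
    }
}
```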
Plan 3
• Step 1: store the text files' data in a database using PHP (schema sketched below)
  • Table: userID | movieID | rating
  • Primary key: (userID, movieID)
• Step 2: compute similarity
  • Table: userID | 1st similar userID | 2nd similar userID | etc.
  • Primary key: userID
• Step 3: predictions
• Problem 3 - very slow:
  • Step 1: 4 days - ~7,000 movie text files loaded so far
  • Step 2: n/a
  • Step 3: n/a
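The slides do the loading in PHP; purely to illustrate the same Step 1 table layout, here is a hedged JDBC sketch. The connection URL, database name, use of MySQL, and the sample values are all placeholders, not taken from the slides.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

/** Plan 3, Step 1: one row per (userID, movieID, rating). */
public class RatingsTable {

    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders, not from the slides.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/netflix", "user", "password")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS ratings ("
                        + " userID  INT NOT NULL,"
                        + " movieID INT NOT NULL,"
                        + " rating  TINYINT NOT NULL,"
                        + " PRIMARY KEY (userID, movieID))");
            }

            // Insert one parsed line from a movie file (example values).
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO ratings (userID, movieID, rating) VALUES (?, ?, ?)")) {
                insert.setInt(1, 1488844);   // example userID
                insert.setInt(2, 1);         // example movieID
                insert.setInt(3, 3);         // example rating
                insert.executeUpdate();
            }
        }
    }
}
```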
Results
• Predicting 3.0 for everything:
  • RMSE = 1.3149
• With the similarities computed so far:
  • RMSE = 1.3149 | 384 users
  • RMSE = 1.3149 | 575 users
• http://www.netflixprize.com/leaderboard
  • Grand Prize target: RMSE = 0.8563
• RMSE = sqrt(avg((actual_rating - predicted_rating)^2))
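The RMSE line above, spelled out as runnable Java. The data in main is a toy example, not the actual probe set.

```java
public class Rmse {

    /** Root mean squared error over parallel arrays of actual and predicted ratings. */
    public static double rmse(double[] actual, double[] predicted) {
        double sumSquaredError = 0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquaredError += diff * diff;
        }
        return Math.sqrt(sumSquaredError / actual.length);
    }

    public static void main(String[] args) {
        double[] actual    = {4, 2, 5, 3};
        double[] predicted = {3, 3, 3, 3};   // "predict everything 3.0"
        System.out.println(rmse(actual, predicted));  // ~1.2247 for this toy data
    }
}
```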