80 likes | 91 Views
Utilize the Netflix Prize dataset to forecast movie ratings based on user similarities through memory storage and database computation, addressing memory and speed challenges. Explore advanced predictive strategies for accurate predictions. Solve complex data analysis dilemmas for optimal outcome.
E N D
Data • mv_00(movieID).txt: 1: (1-2,649,429) (1-5) • Over 17,000 movie txt files • Over 400,000 userID • Two Gigs zipped
Overall Plan • Compute user similarity using: • termFrequency: # of movies in common • documentFrequency: 1/|rating1 – rating2| • tfdf = (# of movies in common) * 1/|rating1 – rating2|
Plan 1 • Store it all in memory (haha) in java • Store a User class with: • UserID • Array of Movies classes: • movieID • Rating • Then have matrix of users with an array of top similar users using (tfdf) • Problem 1 - Memory issues
Plan 2* • Step 1: store in text files on hard drive in java • text file for each user • Step 2: compute similarity (tfdf) • text file of top then users for each user • Step 3: predictions • Run through two directories of text files to compute an average movie rating prediction • Problem 2 - Very Slow: • Step 1: 3 days – ~5000 movie text files currently • Step 2: 1 user every 35 mins | 1 user every 5 mins • Step 3: ~10 minutes currently
Plan 3 • Step 1: Store in text file’s data in a database using php • Table: userID | movieID | rating • Primary keys: userID, movieID • Step 2: Compute Similarity • Table: userID | 1st userIDs | 2nd userID | etc. • Primary key: userID • Step 3: Predictions • Problem 3 - Very Slow: • Step 1: 4 days – 7000 movie text files currently • Step 2: n/a • Step 3: n/a
Results • Predicting everything 3.0: • RMSE = 1.3149 • Similarities I have so far: • RMSE = 1.3149| 384 users • RMSE = 1.3149| 575 users • http://www.netflixprize.com/leaderboard • Grand Prize RMSE = 0.8563 • RMSE: • sqrt(avg((actual_rating - predicted rating) * (actual_rating - predicted rating))).