Story of IBM Research's success at KDD/Netflix Cup 2007
Saharon Rosset, TAU Statistics (formerly IBM)
IBM Research's teams:
Task 1: Yan Liu, Zhenzhen Kou (CMU intern)
Task 2: Saharon Rosset, Claudia Perlich, Yan Liu
October 2006: Announcement of the NETFLIX Competition
USA Today headline: "Netflix offers $1 million prize for better movie recommendations"
Details:
• Beat Netflix's current recommender model 'Cinematch' by 10% in rating prediction error before 2011
• $50,000 for the annual progress prize (relative to baseline)
• Data contains a subset of 100 million movie ratings from Netflix, covering 480,189 users and 17,770 movies
• Performance is evaluated on held-out (movie, user) pairs
• The NETFLIX competition has attracted 24,396 contestants on 19,799 teams from 155 different countries
• 14,891 valid submissions from 2,282 different teams
• Current best result is 7.8% better than baseline (up from 6.7% as of March)
Data Overview [diagram]: from all ~80K movies in the Internet Movie Data Base, Netflix selected 17K (selection criteria unclear); from all ~6.8M users, 480K with at least 20 ratings by the end of 2005. The NETFLIX competition data contains 100M ratings; the qualifier dataset contains 3M.
NETFLIX data generation process [timeline diagram]: users and movies arrive throughout the training data period, 1998-2005; the qualifier dataset (3M) and the KDD Cup Task 1 and Task 2 test sets come from 2006, where no new users or movies arrive (the 17K existing movies).
KDD-CUP 2007 based on the NETFLIX competition
• Knowledge Discovery and Data Mining (KDD) Cup: annual competition of the premier conference in data mining
• Training: NETFLIX competition data from 1998-2005
• Test: 2006 ratings randomly split by movie into two tasks
• Task 1: Who rated what in 2006
  • Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
  • Result: we are the second runner-up, No. 3 out of 39 teams
  • Many of the competing teams had been working on the Netflix data for over six months, giving them a decided advantage in Task 1
• Task 2: Number of ratings per movie in 2006
  • Given a list of 8,863 movies, predict the number of additional reviews that all existing users will give in 2006
  • Result: we are the winner, No. 1 out of 34 teams
Generation of test sets from 2006 for Task 1 and Task 2 [diagram]: compute the marginal 2006 rating distributions over users and over movies, sample (movie, user) pairs according to the product of the marginals, and remove pairs that were rated prior to 2006 to obtain the Task 1 test set (100K pairs). The Task 2 test set (8.8K movies) is evaluated on log(n+1) of each movie's rating total.
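For concreteness, here is a minimal Python sketch of the test-set generation as the diagram describes it: sample (movie, user) pairs from the product of the 2006 marginals and reject pairs already rated before 2006. Function and variable names (`sample_test_pairs`, `rated_before_2006`, etc.) are illustrative, not the organizers' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_test_pairs(movie_counts_2006, user_counts_2006, rated_before_2006, n_pairs):
    """Sample (movie, user) pairs from the product of the 2006 marginals,
    rejecting pairs that were already rated before 2006."""
    movies = np.array(list(movie_counts_2006.keys()))
    users = np.array(list(user_counts_2006.keys()))
    p_movie = np.array([movie_counts_2006[m] for m in movies], dtype=float)
    p_movie /= p_movie.sum()
    p_user = np.array([user_counts_2006[u] for u in users], dtype=float)
    p_user /= p_user.sum()

    pairs = set()
    while len(pairs) < n_pairs:
        m = rng.choice(movies, p=p_movie)
        u = rng.choice(users, p=p_user)
        if (m, u) not in rated_before_2006:   # the rejection step
            pairs.add((m, u))
    return list(pairs)
```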
Insights from the battlefields: What makes a model successful?
Previous successful 'engagements' of our team:
• Competitions: KDD-CUP 1999, 2000, 2003, ILP Challenge 2005
• Applications: MAP, OnTarget, …
Components of successful modeling (and their relative importance?):
1. Data and domain understanding
  • Generation of data and task
  • Cleaning and representation/transformation
2. Statistical insights
  • Statistical properties
  • Testing the validity of assumptions
  • Performance measure
3. Modeling and learning approach
  • The most "publishable" part
  • Choice or development of the most suitable algorithm
Task 1: Did User A review Movie B in 2006?
• Task formulation
  • A classification task: will "existing" users review "existing" movies?
• Challenges
  • Huge amount of data: how to sample the data so that any learning algorithm can be applied is critical
  • Complex affecting factors: decreasing interest in old movies, growing tendency of Netflix users to watch (review) more movies
• Key solutions
  • Effective sampling strategies that keep as much information as possible
  • Careful feature extraction from multiple sources
Task 1: Effective Sampling Strategies
• Sample the (movie, user) pairs for "existing" users and "existing" movies from 2004-2005 as the training set and Q4 2005 as the development set
• The probability of picking a movie was proportional to the number of ratings that movie received; the same strategy was used for users
[Diagram: per-movie and per-user sampling probabilities derived from the historical rating records]
Task 1: Effective Sampling Strategies (ctd.)
[Chart: the ratio of positive examples obtained under this sampling scheme]
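A minimal sketch of this sampling idea in Python, assuming rating data is held as sets of (movie, user) tuples; the names and details are illustrative rather than the team's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)

def build_training_pairs(ratings_hist, ratings_target, n_samples):
    """Sample candidate (movie, user) pairs with probability proportional to
    each movie's / user's historical rating count, then label a pair positive
    if the user actually rated the movie in the target period."""
    movie_freq, user_freq = {}, {}
    for m, u in ratings_hist:
        movie_freq[m] = movie_freq.get(m, 0) + 1
        user_freq[u] = user_freq.get(u, 0) + 1

    movies, m_w = zip(*movie_freq.items())
    users, u_w = zip(*user_freq.items())
    p_m = np.array(m_w, dtype=float) / sum(m_w)
    p_u = np.array(u_w, dtype=float) / sum(u_w)

    sampled_m = rng.choice(movies, size=n_samples, p=p_m)
    sampled_u = rng.choice(users, size=n_samples, p=p_u)
    return [((m, u), int((m, u) in ratings_target))
            for m, u in zip(sampled_m, sampled_u)]
```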
Task 1: Multiple Information Sources
• Graph-based features based on the NETFLIX training set: construct a graph with users and movies as nodes, and create an edge whenever a user reviews a movie
• Content-based features: plot, director, actors, genre, movie connections, box office, and scores of the movie crawled from Netflix and IMDB
[Diagram: the user-movie graph built from the raw rating records]
Task 1: Feature Extraction
• Movie-based features
  • Graph topology: # of ratings per movie (across different years); adjacency scores between movies calculated using SVD on the graph matrix
  • Movie content: similarity of two movies calculated using Latent Semantic Indexing on a bag of words from (1) the movie's plot and (2) other information, such as director, actors, and genre
• User profile
  • Graph topology: # of ratings per user (across different years)
  • User preferences based on the movies already rated: keyword match count and average/min/max of similarity scores between the movie being predicted and the movies the user has rated
[Diagram: computing keyword match counts and similarity aggregates between the movie to predict and the user's rated movies]
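To illustrate the user-preference features, a small sketch that aggregates similarities between the movie to predict and the user's rated movies. It assumes movies are already represented by LSI/SVD vectors; the function and feature names are illustrative.

```python
import numpy as np

def user_preference_features(candidate_vec, rated_vecs):
    """Average/min/max cosine similarity between the candidate movie's vector
    and the vectors of movies the user has already rated."""
    if len(rated_vecs) == 0:
        return {"sim_avg": 0.0, "sim_min": 0.0, "sim_max": 0.0}
    sims = []
    for v in rated_vecs:
        denom = np.linalg.norm(candidate_vec) * np.linalg.norm(v)
        sims.append(float(candidate_vec @ v / denom) if denom > 0 else 0.0)
    return {"sim_avg": float(np.mean(sims)),
            "sim_min": float(np.min(sims)),
            "sim_max": float(np.max(sims))}
```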
Task 1: Learning Strategy
• Learning algorithms:
  • Single classifiers: logistic regression, ridge regression, decision trees, support vector machines
  • Naïve ensemble: combining sub-classifiers built on different types of features with pre-set weights
  • Ensemble classifiers: combining sub-classifiers with weights learned from the development set
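One way to realize the "weights learned from the development set" idea is to stack the sub-classifiers' scores with a logistic-regression combiner. This is only one possible combiner, sketched under that assumption, not necessarily the one the team used.

```python
from sklearn.linear_model import LogisticRegression

def learn_ensemble_weights(dev_subscores, dev_labels):
    """Fit a combiner on the development set: rows are dev pairs, columns are
    the scores produced by the individual sub-classifiers."""
    combiner = LogisticRegression()
    combiner.fit(dev_subscores, dev_labels)
    return combiner

# usage: probs = learn_ensemble_weights(S_dev, y_dev).predict_proba(S_test)[:, 1]
```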
Task 2 description: How many reviews did a movie receive in 2006?
• Task formulation
  • Regression task: predict the total count of reviews from "existing" users for 8,863 "existing" movies
• Challenges
  • Movie dynamics and life-cycle: interest in movies changes over time
  • User dynamics and life-cycle: no new users are added to the database
• Key solutions
  • Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal
  • Build a set of quarterly lagged models to determine the overall scalar
  • Use Poisson regression
Some data observations
• The Task 1 test set is a potential response for training a model for Task 2
  • It was sampled according to the marginal (= # reviews for the movie in 2006 / # reviews in 2006), which is proportional to the Task 2 response (= # reviews for the movie in 2006)
  • BIG advantage: we get a view of 2006 behavior for half the movies → build a model on this half and apply it to the other half (the Task 2 test set)
  • Caveats:
    • Proportional sampling leaves a scaling parameter that we do not know
    • Recall that, after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set → correcting for this is an interesting research challenge of inverse rejection sampling
• No new movies or reviewers appear in 2006
  • Need to emphasize modeling the life-cycle of movies (and reviewers)
    • How are older movies reviewed relative to newer movies?
    • Does this depend on other features (like the movie's genre)?
  • This is especially critical when we consider the scaling caveat above
Some statistical perspectives
• The Poisson distribution is very appropriate for counts
  • Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
  • Implies the appropriate modeling approach for the true counts is Poisson regression:
    n_i ~ Pois(λ_i·t),  log(λ_i) = Σ_j β_j·x_ij,  β* = argmax_β l(n; X, β)  (the maximum likelihood solution)
• What happens when we sub-sample for the Task 1 test set?
  • With the sum fixed, the counts are multinomial; for large N and small p, each sub-sampled count is well approximated by a Poisson
  • It can be shown that Poisson regression (= assuming independence) is still appropriate
• What does this imply for the model evaluation approach?
  • The variance-stabilizing transformation for the Poisson is the square root → √n_i has roughly constant variance
  • RMSE of log(prediction + 1) against log(# ratings + 1) emphasizes performance on unpopular movies (small Poisson parameter → larger log-scale variance)
  • We still assumed that if we do well in a likelihood formulation, we will do well under any evaluation approach
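A toy illustration of this Poisson regression formulation, using statsmodels on synthetic data (the feature matrix and coefficients are made up for the example):

```python
import numpy as np
import statsmodels.api as sm

# n_i ~ Pois(lambda_i), log(lambda_i) = sum_j beta_j * x_ij, beta fit by ML.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                  # stand-in movie features
beta_true = np.array([0.5, -0.3, 0.8])
counts = rng.poisson(np.exp(1.0 + X @ beta_true))

model = sm.GLM(counts, sm.add_constant(X),
               family=sm.families.Poisson())   # log link by default
fit = model.fit()                              # maximum likelihood estimates
print(fit.params)                              # roughly [1.0, 0.5, -0.3, 0.8]
```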
Some statistical perspectives (ctd.)
• Can we invert the rejection sampling mechanism?
  • This can be viewed as a missing-data problem
  • Can we design a practical EM algorithm at our huge data size? An interesting research problem…
• We implemented an ad-hoc inversion algorithm
  • Iterate until convergence between:
    - assuming the movie marginals are correct and adjusting the reviewer marginals
    - assuming the reviewer marginals are correct and adjusting the movie marginals
  • We verified that it indeed improved our data, since it increased the correlation with Q4 2005 counts
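The slide does not spell out the adjustment formulas, so the following is only a rough sketch of the alternating idea under my own assumptions: hold one set of marginals fixed and rescale the other so that the expected observed counts, under product sampling restricted to pairs not rated before 2006, match the observed ones.

```python
import numpy as np

def adjust_marginals(obs_movie, obs_user, allowed, n_iter=50):
    """Alternately rescale movie and user marginals. `obs_movie`/`obs_user`
    are observed counts in the (post-rejection) sample; `allowed` is a 0/1
    movies x users matrix marking pairs not rated before 2006. Illustrative
    only; the team's actual algorithm may differ."""
    p_m = obs_movie / obs_movie.sum()
    p_u = obs_user / obs_user.sum()
    for _ in range(n_iter):
        # expected movie share if the user marginals were correct
        exp_m = p_m * (allowed @ p_u)
        p_m = p_m * (obs_movie / obs_movie.sum()) / np.maximum(exp_m / exp_m.sum(), 1e-12)
        p_m /= p_m.sum()
        # symmetric step for the users
        exp_u = p_u * (allowed.T @ p_m)
        p_u = p_u * (obs_user / obs_user.sum()) / np.maximum(exp_u / exp_u.sum(), 1e-12)
        p_u /= p_u.sum()
    return p_m, p_u
```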
Modeling Approach Schema [flowchart]: count ratings by movie from the Task 1 test set (100K) and apply inverse rejection sampling; construct movie features from the NETFLIX challenge data and IMDB; estimate Poisson regression M1, validate against the 2006 Task 1 counts, and use M1 to predict the Task 2 movies. Separately, construct lagged features from Q1-Q4 2005, estimate four Poisson regressions G1…G4 and predict for 2006 to estimate the total 2006 ratings for the Task 2 test set; find the optimal scalar and scale the M1 predictions to that total.
Some observations on the modeling approach
• Lagged datasets are meant to simulate forward prediction to 2006
  • Select a quarter (e.g., Q1 2005), remove all movies and reviewers that "started" later
  • Build a model on this data with, e.g., Q3 2005 as the response
  • Apply the model to our full dataset, which is naturally cropped at Q4 2005 → gives a prediction for Q2 2006
  • With several models like this, predict all of 2006
• Two potential uses:
  • Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
  • Consider only the sum of their predictions, to use for scaling the Task 1 model
• We evaluated models on the Task 1 test set
  • Used a holdout when also building them on this set
• How can we evaluate the models built on lagged datasets?
  • A scaling parameter between the 2006 prediction and the sampled set is missing
  • Solution: select the optimal scaling based on Task 1 test-set performance → since the other model was still better, we knew we should use it!
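A rough sketch of one such lagged model, assuming quarterly rating counts per movie are held in a pandas DataFrame; the cropping of movies and reviewers that "started" after the cutoff is omitted, and the names are illustrative.

```python
from sklearn.linear_model import PoissonRegressor

def lagged_forecast(counts, train_cols, response_col, apply_cols):
    """Train on past quarters (e.g. up to Q1 2005) with a later quarter
    (e.g. Q3 2005) as response, then apply to the most recent quarters
    (ending Q4 2005) to forecast the corresponding 2006 quarter.
    `counts` is a movies x quarters pandas DataFrame."""
    model = PoissonRegressor()
    model.fit(counts[train_cols], counts[response_col])

    X_apply = counts[apply_cols].copy()
    X_apply.columns = train_cols      # align feature names for predict
    return model.predict(X_apply)
```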
Some details on our models and submission
• All models are at the movie level. Features we used:
  • Historical reviews in previous months/quarters/years (on log scale)
  • Movie's age since premiere, movie's age in Netflix (since first review)
    • Also consider log, square, etc. → flexibility in the form of the functional dependence
  • Movie's genre
    • Include interactions between genre and age → the "life cycle" seems to differ by genre!
• Models we considered (MSE on log scale on the Task 1 holdout):
  • Poisson regression on the Task 1 test set (0.24)
  • Log-scale linear regression model on the Task 1 test set (0.25)
  • Sum of lagged models built on 2005 quarters + best scaling (0.31)
• Scaling based on the lagged models
  • Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
  • Implied scaling parameter for the predictions: about 90
  • The total of our submitted predictions for the Task 2 test set was 9.3M
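A back-of-the-envelope view of the scaling step; the exact bookkeeping is not on the slide, so this is an assumption-laden sketch in which the scalar simply maps the sampled-scale predictions up to the lagged models' estimate of the 2006 total.

```python
def scale_task2_predictions(task2_preds, estimated_total_2006, task1_pred_sum):
    """Rescale the Task 1-trained model's (sampled-scale) predictions so that
    its predictions over the Task 1 movies add up to the estimated 2006 total.
    Per the slide, an estimated total of ~9.5M implied a scalar of about 90.
    Function name and arguments are illustrative."""
    scale = estimated_total_2006 / task1_pred_sum
    return [p * scale for p in task2_preds]
```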
Competition evaluation
• First we were informed that we had won with an RMSE of ~770
  • They mistakenly evaluated on the non-log scale, which puts strong emphasis on the most popular movies
  • We won by a large margin → our model did well on popular movies!
• Then they re-evaluated on the log scale, and we still won
  • On the log scale the least popular movies are emphasized
  • Recall that the variance-stabilizing transformation is in between (square root)
  • So our predictions did well on unpopular movies too!
• Interesting question: would we win on the square-root scale (or, similarly, under a Poisson likelihood-based evaluation)? We sure hope so!
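The two evaluation scales discussed here are easy to state as code; a small sketch, assuming numpy arrays of predicted and true rating counts:

```python
import numpy as np

def rmse_log(pred, actual):
    """Log-scale RMSE: compares log(prediction + 1) with log(# ratings + 1),
    which emphasizes the least popular movies."""
    return float(np.sqrt(np.mean((np.log(pred + 1) - np.log(actual + 1)) ** 2)))

def rmse_sqrt(pred, actual):
    """Square-root-scale RMSE: the variance-stabilizing scale for Poisson counts."""
    return float(np.sqrt(np.mean((np.sqrt(pred) - np.sqrt(actual)) ** 2)))
```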
Competition evaluation (ctd.)
• Results of the competition (log-scale evaluation):
• Components of our model's MSE:
  • The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
  • Additional error from an incorrect scaling factor
• Scaling numbers:
  • True total reviews: 8.7M
  • Sum of our predictions: 9.3M
• Interesting question: what would be the best scaling?
  • For log-scale evaluation? Conjecture: we would need to under-estimate the true total
  • For square-root evaluation? Conjecture: we would need to estimate about right
Effect of scaling on the two evaluation approaches [chart: log-scale MSE and square-root MSE as a function of the sum of predictions (in millions), with the true sum and the submitted sum marked]
Acknowledgements
• Rick Lawrence
• Naoki Abe
• Prem Melville
• Hisashi Kashima (TRL)
• Shohei Hido (TRL)
• Chandan Reddy
• Grzegorz Swirszcz
• And many more…