Probability based Recommendation System
Course: ECE541
Chetan Tonde, Vrajesh Vyas, Ashwin Revo
Under the guidance of Prof. R. D. Yates
Problem Statement
• Given Netflix™ data for users U = {u1, u2, ..., un}, each with a movie history Y = {y1, y2, ..., ym} and corresponding ratings V = {v1, v2, ..., vm}.
• Estimate the ratings for unseen movies. [To recommend movies to a user, we select the K movies with the highest estimated ratings.]
[Figure: rating data indexed by user_id and movie_id, with entries rating]
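A minimal sketch of the selection step, assuming the estimated ratings for one user are already available as an array. The function name `top_k_unseen` and the boolean `seen` mask are illustrative and do not come from the slides.

```python
import numpy as np

def top_k_unseen(estimates, seen, k):
    """Return indices of the k unseen movies with the highest estimated rating."""
    estimates = np.asarray(estimates, dtype=float)
    seen = np.asarray(seen, dtype=bool)
    unseen_ids = np.flatnonzero(~seen)
    # Sort unseen movies by estimate, highest first (stable, so ties keep index order).
    order = np.argsort(-estimates[unseen_ids], kind="stable")
    return unseen_ids[order[:k]].tolist()

# Hypothetical estimates for four movies; movie 2 has already been seen.
recommended = top_k_unseen([4.1, 2.0, 4.8, 3.3], [False, False, True, False], k=2)
```

Movies the user has already rated are excluded before ranking, so a high estimate for a seen movie never displaces a recommendation.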
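Basic Idea
• Method proposed by T. Hofmann, called probabilistic Latent Semantic Analysis (pLSA).
• The random variables u (user) and y (movie) are not independent.
• Introduce a latent variable z that makes user u and movie y conditionally independent given z (θ = model parameters).
• In effect, z groups users and movies into latent classes within which u and y are treated as independent.
[Figure: graphical models relating u and y, without and with the latent variable z]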
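A sketch of the factorization that this conditional-independence assumption implies, written in the standard pLSA form. The notation p(z | u) and p(y | z) is assumed here and is not taken from the slide.

```latex
p(u, y \mid z) = p(u \mid z)\, p(y \mid z)
\qquad\Longrightarrow\qquad
p(y \mid u; \theta) = \sum_{z} p(y \mid z)\, p(z \mid u)
```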
Basic Idea - Prediction of ratings*
• Based on z and y, the model is extended to predict the rating v.
• The distribution of ratings is assumed to be a mixture of Gaussians whose mean µ and standard deviation σ depend on y and z (µy,z and σy,z).
[Figure: graphical model with nodes u, z, y and v]
* Hofmann [1]
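A minimal sketch of how such a mixture yields a rating estimate, assuming the mixing weights are the user's latent-class probabilities p(z|u), as in Hofmann's Gaussian pLSA. The function names and the numbers are illustrative.

```python
import numpy as np

def rating_density(v, p_z_given_u, mu_y, sigma_y):
    """Mixture density p(v | u, y) = sum_z p(z|u) * N(v; mu_{y,z}, sigma_{y,z})."""
    v = np.asarray(v, dtype=float)[..., None]
    mu_y = np.asarray(mu_y, dtype=float)
    sigma_y = np.asarray(sigma_y, dtype=float)
    components = np.exp(-0.5 * ((v - mu_y) / sigma_y) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_y)
    return components @ np.asarray(p_z_given_u, dtype=float)

def expected_rating(p_z_given_u, mu_y):
    """Mean of the mixture: E[v | u, y] = sum_z p(z|u) * mu_{y,z}."""
    return float(np.dot(p_z_given_u, mu_y))

# Hypothetical user with two latent classes, and one movie's per-class parameters.
p_z_given_u = [0.25, 0.75]
mu_y = [2.0, 4.0]
sigma_y = [1.0, 0.5]
estimate = expected_rating(p_z_given_u, mu_y)  # 0.25 * 2.0 + 0.75 * 4.0 = 3.5
```

The expected value depends only on the class weights and the class means; the standard deviations shape the density but not the point estimate.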
Model fitting
• Expectation-Maximization (EM) algorithm.
• EM is an iterative procedure that converges to a (local) maximum of the posterior probability P(θ|X) ∝ p(X|θ) p(θ), where θ = {σyz, µyz} is the set of unknown parameters of the data X.
• With a flat prior p(θ) this coincides with maximum likelihood: EM is a general method for finding the maximum-likelihood estimate of the parameters from a given data set.
• The purpose is to estimate θ of the real data distribution.*
* Hofmann [1]
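A sketch of one EM iteration for a Gaussian pLSA model of this kind, assuming the data are given as parallel arrays of (user index, movie index, rating) and that p(z|u) is estimated alongside µyz and σyz. The function names, array layout and the 1e-4 floor on σ are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def gaussian_pdf(v, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at v (broadcasts over arrays)."""
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def em_step(users, movies, ratings, p_z_u, mu, sigma, tiny=1e-12):
    """One EM iteration. Returns updated parameters and the log-likelihood
    of the data under the parameters that were passed in."""
    n_users, k = p_z_u.shape
    n_movies = mu.shape[0]
    v = ratings[:, None]  # shape (N, 1)

    # E-step: responsibility of each latent state z for each observed rating.
    joint = p_z_u[users] * gaussian_pdf(v, mu[movies], sigma[movies])  # (N, K)
    total = np.maximum(joint.sum(axis=1, keepdims=True), tiny)
    q = joint / total
    log_lik = float(np.log(total).sum())

    # M-step: p(z|u) is the average responsibility over each user's ratings.
    new_p = np.zeros((n_users, k))
    np.add.at(new_p, users, q)
    new_p /= np.maximum(new_p.sum(axis=1, keepdims=True), tiny)

    # M-step: responsibility-weighted mean and spread per (movie, z) pair.
    weight = np.zeros((n_movies, k))
    np.add.at(weight, movies, q)
    weight = np.maximum(weight, tiny)
    weighted_sum = np.zeros((n_movies, k))
    np.add.at(weighted_sum, movies, q * v)
    new_mu = weighted_sum / weight
    weighted_sq = np.zeros((n_movies, k))
    np.add.at(weighted_sq, movies, q * (v - new_mu[movies]) ** 2)
    # Assumed floor so that a collapsed component cannot produce sigma = 0.
    new_sigma = np.maximum(np.sqrt(weighted_sq / weight), 1e-4)
    return new_p, new_mu, new_sigma, log_lik
```

Calling `em_step` repeatedly and tracking the returned log-likelihood gives the curve used to judge convergence; each exact EM iteration should not decrease it.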
Results - total log-likelihood*
[Figure: total log-likelihood]
* Hofmann [1]
Removing user outliers
• Compute the mean µ and standard deviation σ of the number of movies viewed per user, across all users.
• Replace each user whose view count is above µ + 3σ or below µ - 3σ with another user from the data set: the one with the maximum (first-ranked) user-user correlation over commonly rated movies.
• This removes outliers and so improves the RMSE.
• RMSE before: 0.6905
• RMSE after: 0.6598
Results - Observations
• We ran the model on a data set of 8000 users and 600 movies, with ratings ranging from 0 to 5 (0 meaning the movie was not viewed).
• The results were verified by blanking out a few hundred ratings and predicting them (observed RMSE ≈ 0.6598).
• The number of states k of the latent variable z was chosen empirically as the value giving the lowest RMSE (k = 6; k < 6 under-fits and k > 6 over-fits).
• Since users rate on a personal scale, each user's ratings need to be normalized to mean 0 and variance 1.
• For some movies the ratings vary little across users, so the computed σyz becomes very small for certain z. To avoid this, we replaced every σyz below 1e-4 with 1.5.
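A sketch of the three steps mentioned above: per-user normalization, the guard on small σyz, and the RMSE used for verification. It assumes a users x movies matrix with 0 meaning "not viewed"; the function names are illustrative, while the threshold 1e-4 and the replacement value 1.5 are the ones quoted in the slide.

```python
import numpy as np

def normalize_per_user(R):
    """Scale each user's observed ratings to mean 0 and variance 1.
    Unobserved entries (0 in R) become NaN in the result."""
    R = np.asarray(R, dtype=float)
    observed = R > 0
    counts = np.maximum(observed.sum(axis=1, keepdims=True), 1)
    mean = np.where(observed, R, 0.0).sum(axis=1, keepdims=True) / counts
    var = np.where(observed, (R - mean) ** 2, 0.0).sum(axis=1, keepdims=True) / counts
    std = np.sqrt(var)
    std = np.where(std > 0, std, 1.0)  # users with constant ratings keep scale 1
    return np.where(observed, (R - mean) / std, np.nan), mean, std

def guard_sigma(sigma, threshold=1e-4, replacement=1.5):
    """Replace every sigma below the threshold with the replacement value."""
    sigma = np.asarray(sigma, dtype=float)
    return np.where(sigma < threshold, replacement, sigma)

def rmse(predicted, actual):
    """Root-mean-square error between predicted and held-out ratings."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Predictions made on the normalized scale can be mapped back to a user's own scale with `mean + std * z`, using the per-user values returned by `normalize_per_user`.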
Results - Observations
• The EM algorithm is assumed to have converged when the increase in the total log-likelihood per iteration is very small (< 100), which is taken to mean that the data are clustered to the best possible approximation.
• The rating matrix is sparse, which makes the method sensitive to unreliable ratings and makes it difficult to fit the model to the true ratings.
• Convergence of the EM algorithm depends significantly on the initial estimate of the model, which is a mixture of Gaussians, so initialization was done using the observed ratings.
• The model over-fits the given data, which is undesirable because it also fits the sampling noise. This shows up when comparing the RMSE on the training data with the RMSE on unseen data (where the RMSE is high).
• The execution time is around one minute per iteration, which is high. This is due to the complexity of the equations involved and the large data set.
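A sketch of the stopping rule described above, assuming a `step` callable that performs one EM iteration and returns the updated parameters together with the total log-likelihood. The names `run_em`, `step` and `max_iter` are hypothetical; the tolerance of 100 is the value quoted in the slide.

```python
import math

def run_em(step, params, tol=100.0, max_iter=200):
    """Iterate `step` until the log-likelihood gain drops below `tol`.
    Returns the final parameters and the log-likelihood history."""
    previous = -math.inf
    history = []
    for _ in range(max_iter):
        params, log_lik = step(params)
        history.append(log_lik)
        if log_lik - previous < tol:
            break  # gain is small enough: treat as converged
        previous = log_lik
    return params, history

# Toy stand-in for an EM step: replays a fixed log-likelihood sequence.
toy_values = [-1000.0, -500.0, -300.0, -250.0, -240.0]
final_params, history = run_em(lambda i: (i + 1, toy_values[i]), 0)
```

Because the threshold is an absolute gain in total log-likelihood, it scales with the number of ratings; a different data size would likely need a different tolerance.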
References
[1] T. Hofmann: Collaborative Filtering via Gaussian Latent Semantic Analysis. In Proceedings of ACM Transactions on Information Systems, Volume 22, January 2004.
[2] A. Das, M. Datar, A. Garg, S. Rajaram: Google News Personalization: Scalable Online Collaborative Filtering. WWW 2007 / Track: Industrial Practice and Experience, May 8-12, Banff, Alberta, Canada.
[3] Andrew Ng: Lectures on Machine Learning, WWW, http://www.youtube.com/user/stanforduniversity.