A Shot at the Netflix Challenge: Hybrid Recommendation System. Priyank Chodisetti
Problem and Approach
• A data set of 240,000 users and their ratings for 17770 movies is provided.
• Given a user 'p' and a movie 'm', we must predict how 'p' will rate 'm'.
• My idea: take two entirely different approaches and merge their results.
• Applied Latent Semantic Analysis (LSI) and collaborative filtering on the dataset independently.
• Through LSI, mapped the dataset to a lower-dimensional space and tried to extract relations between different movies.
• Through collaborative filtering, tried to find a user's tastes by comparing with other, similar users.
• Major problems:
  • Computationally large: one solution of mine ran for 14 hours with most disappointing results.
  • The matrix is ~99% sparse, and hence ~99% of the values are missing (a compact storage sketch follows this slide).
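A minimal sketch (not the author's code) of one way to hold a ~99%-sparse ratings matrix: store only the observed (movie, rating) pairs per user instead of the full users-by-movies table. The type names and the tiny hard-coded sample are illustrative assumptions.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Rating {
    int32_t movieId;   // 1..17770 in the Netflix data
    int8_t  stars;     // 1..5
};

// userId -> list of observed ratings; unobserved cells are simply absent
using SparseRatings = std::unordered_map<int32_t, std::vector<Rating>>;

int main() {
    SparseRatings data;
    data[6].push_back({17770, 5});   // user 6 rated movie 17770 with 5 stars
    data[6].push_back({12, 3});
    data[42].push_back({12, 4});

    for (const auto& [user, ratings] : data)
        for (const auto& r : ratings)
            std::cout << "user " << user << " -> movie " << r.movieId
                      << " : " << int(r.stars) << "\n";
}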
Handling Major Problems
• Missing values are usually handled by substituting the user's average rating or the overall average rating across all users. I believe, however, that the $1,000,000 winner will be the one who handles the missing values well.
• Adopted the method described in [2], which fits the current situation aptly.
• LSI:
  • Apply SVD on the matrix and retain the first 'k' largest singular values. This gives us a space of 'k' dimensions, i.e. the best rank-k approximation.
  • But how many dimensions? Experiment.
  • To predict person p's rating for movie m, take the mth row of U, matrix-multiply it with S, and matrix-multiply the result with the pth column of V^T (see the sketch after this slide).
• Collaborative filtering:
  • Find the k nearest neighbours (kNN) and derive a predicted rating from them.
  • With Euclidean distance as the measure we would be working in 17770 dimensions, so the Pearson correlation coefficient is used instead.
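A minimal sketch of the two building blocks named on this slide: the rank-k prediction r(m, p) ≈ U[m,:] · diag(s) · V[p,:]^T (the pth column of V^T is the pth row of V), and the Pearson correlation between two users. It assumes the truncated factors have already been computed (e.g. by SVDLIBC) and loaded into memory; all names are illustrative, not the author's actual code.

#include <cmath>
#include <cstddef>
#include <vector>

// Predicted rating for movie m and user p from the rank-k factors:
//   U is movies x k, s holds the k singular values, V is users x k.
double predictRating(const std::vector<std::vector<double>>& U,
                     const std::vector<double>& s,
                     const std::vector<std::vector<double>>& V,
                     int m, int p) {
    double r = 0.0;
    for (std::size_t j = 0; j < s.size(); ++j)
        r += U[m][j] * s[j] * V[p][j];
    return r;   // typically clamped to the 1..5 star range afterwards
}

// Pearson correlation between two users, given their ratings on the movies
// both of them rated (the vectors are assumed aligned and non-empty).
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double num = 0.0, dx = 0.0, dy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (x[i] - mx) * (y[i] - my);
        dx  += (x[i] - mx) * (x[i] - mx);
        dy  += (y[i] - my) * (y[i] - my);
    }
    return (dx == 0.0 || dy == 0.0) ? 0.0 : num / std::sqrt(dx * dy);
}

The raw prediction can fall outside the 1..5 star range, so in practice it would be clamped before scoring.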
Implementation
• Mixing LSI and collaborative filtering: find the kNN in the reduced-dimension space, with Euclidean distance as the distance measure (see the sketch after this slide).
• Used SVDLIBC, which uses the Lanczos method for singular value decomposition.
• Computational challenges:
  • All the files in the training set are merged into one single large file, to reduce disk access and improve response time.
  • Converted the whole data into a sparse text format.
  • Also generated a large data set in user: (movie, rating) format, in contrast to the given movie: (user, rating) format.
  • Using C++.
• Future extensions this winter:
  • Plan to implement the Generalized Hebbian Algorithm, to reduce computation time and make missing values easier to handle.
  • Interested and motivated friends can join me this winter.
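A minimal sketch of the mixing step, under the assumptions that each user is already a point in the reduced k-dimensional LSI space and that the neighbours' ratings for the target movie are simply averaged (the plain average, the fallback value, and all names here are illustrative choices, not the author's code).

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

// users:        each user's coordinates in the reduced space (users x k)
// knownRatings: each user's rating for the target movie, or -1 if unrated
double knnPredict(const std::vector<std::vector<double>>& users,
                  const std::vector<double>& knownRatings,
                  int targetUser, std::size_t kNeighbours) {
    std::vector<std::pair<double, int>> dist;   // (distance, userId)
    for (std::size_t u = 0; u < users.size(); ++u)
        if (static_cast<int>(u) != targetUser && knownRatings[u] >= 0)
            dist.push_back({euclidean(users[targetUser], users[u]),
                            static_cast<int>(u)});

    const std::size_t taken = std::min(kNeighbours, dist.size());
    if (taken == 0) return 3.0;   // no rated neighbours: fall back to a default

    // keep only the 'taken' nearest neighbours at the front
    std::partial_sort(dist.begin(), dist.begin() + taken, dist.end());

    double sum = 0.0;
    for (std::size_t i = 0; i < taken; ++i)
        sum += knownRatings[dist[i].second];
    return sum / taken;
}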
References
[1] M. Brand. Fast online SVD revisions for lightweight recommender systems. In Proc. SIAM International Conference on Data Mining, 2003.
[2] M. W. Berry. Incremental singular value decomposition of uncertain data. In Proceedings, European Conference on SIGIR, ACM, 1999.
[3] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system - a case study. In ACM WebKDD Workshop, 2000.