60 likes | 197 Views
Netflix. Netflix. Movie. User. Rents. MID Mname Date. UID CNAME AGE. TranID Rating Date. We Classify using a TraingingTable on a column (possibly composite), called class label column.
E N D
Netflix Netflix Movie User Rents MID Mname Date UID CNAME AGE TranID Rating Date We Classify using a TraingingTable on a column (possibly composite), called class label column. The Netflix Contest classification use the RentsTrainingTable, Rents(MID,UID,Rating,Date) and class label Rating, to classify new (MID,UID,Date) tuples (i.e., predict ratings). How? Since there are no features except date? Nearest Neighbor User Voting: We won’t base “near” on a distance over features since there is only one feature, date. uid votes if it is near enough to UID in it’s ratings of movies M={mid1, ..., midk} (i.e., near is based on a User-User correlation over M ). We need to select the User-User-Correlation (Pearson or Cosine or?) and the set M={mid1,…, midk }. Nearest Neighbor Movie Voting: mid votes if it’s ratings by U={uid1,..., uidk} are near enough to those of MID (i.e., near is based on a Movie-Movie correlation over U). We need to select the Movie-Movie-Correlation (Pearson or Cosine or?) and the set U={uid1,…, uidk }. Netflix Cinematch predicts potential new ratings. Using these predicted ratings, Netflix recommend next rentals (those that are predicted to be 5’s or 4’s?).
The Program: Code Structure -the main modules mpp-mpred.C mpp-user.C movie-vote.C user-vote.C prune.C mpp-mpred.Creads a Neflix PROBE file, loops thru the (Mi, ProbeSupport(Mi) records passing each to mpp-user.C. mpp-mpred.C can also call separate instances of mpp-user.C for many Us, to be processed in parallel (governed by the number of slots specified in the 1st code line.) ( Mi, ProbeSupport(Mi)={Ui1, …, Uik}) mpp-user.Cloops thru ProbeSupport(M), reads in the matchedup config file, prints prediciton(M,U) to file, predictions. For the user-vote-approach, mpp-user.C calls user-vote.C For the movie-vote-approach, mpp-user.C calls movie-vote.C Loops thru ProbeSupport and from the user-vote and/or movie-VOTE calculates and writes Predict(Mi,Uik ) to predictions UikProbeSupport(Mi) user-vote.Cdoes specified pruning (from config etc) by calling prune.C, then loops thru pruned set of user voters. For each V, calculates a vote. Combines those votes (weighted?) into one vote and returns it. vote(Mi ,Uik ) ( Mi , Support(Mi), Uik , Support(Uik )) ( Mi , Support(Mi), Uik , Support(Uik )) VOTE(Mi ,Uik ) movie-vote.Cdoes similarly. Must we loop thru V’s (Vertical Processing of Horizontal Data or VPHD) rather than Horizontally Processing across Vertical Data (HPVD) consisting of the MoviePtrees for mid1, ..., midk, recalling from slide1 that V votes if it is near enough to UID in it’s ratings of movies M={mid1,..., midk} The reason: the Horizontal Processing (Ptree processing) required of most correlation calculations is nearly impossible to formulate using AND/OR/COMP (can anyone do it???).
What kind of pruning can be specified? mpp-mpred.C mpp-user.C movie-vote.C user-vote.C prune.C Again, all parameters are specified in a configuration file and the values specified there are consumed at runtime using, e.g., the call: mpp -i Input_.txt_file -c config -n 16 where Input_.txt_file is the input Probe subset file and 16 is the number of parallel threads that mpp-mpred.C will generate (here, 16 movies are processed in parallel, each sent to a separate instantiation of mpp-user.C) A sample config file is given later. There are up to 3 types of pruning used (for pruning down support(M) as the set of all users that rate M or pruning down support(U) as the set of all movies that rate U: 1. correlation or similarity threshold based pruning 2. count based pruning 3. ID window based pruning Under correlation or similarity threshold based pruning, and using support(M)=supM for example (pruning support(U) is similar) we allow any function f:supMsupM [0,HighValue] to be called a user correlation provided only that f(u,u)=HighValue for every u in supM. Examples include Pearson_Correlation, Gaussian_of_Distance, 1_perp_Correlation (see appendix of these notes), relative_exact_rating_match_count (Tingda is using), dimension_of_common_cosupport, and functions based on Standard Deviations. Under count based pruning, we usually order by one of the correlations above first (into a multimap) then prune down to a specified count of the most highly correlated. Under ID window based pruning we prune down to a window of userIDs within supM (or movieIDs within supU) by specifying a leftside (number added to U, so leftside is relative to U as a userID) and a width.
How does one specify prunings? mpp-mpred.C specifies type of prune ( 3 types: UserPrune with a full range of possibilities; UserFastPrune with just PearsonCorrelation pruning; CommonCoSupportPrune which orders users, V, according to the size of their CommonCoSupport with U only (note that this is a correlation of sorts too.) mpp-user.C movie-vote.C user-vote.C threshold "diff of vectors" population-based std_dev prune specify leftside (from Uid) of an ID interval prune of supM specify the width of an ID interval prune of supM specify starting movie (intercept and slope) for N loop specify starting movie (intercept and slope) for V loop threshol for count based prune specify PearsonCorr threshold (b=bill, meaning: use bill's formula - note if prior pruning this will have a different value than Amal's) specify PearsonCorr threshold (a=Amal, meaning: use Amal's table lookup) threshold "vectorof diffs" population-based std_dev prune threshold "vector of diffs"sample-based std_dev prune threshold (Gaussian of) Euclidean distance based prune threshold for (Gaussian of) 1perpendicular distance prune exponent for (Gaussian of) 1perpendicular distance prune threshold (Gaussian of) a variation based prune threshold std_dev based prune Picks odering for count-based prune below: 1=Amal_Pearson, 2=Bill_Pearson, etc. threshold "diff of vectors"sample-based std_dev prune prune.C Again, in a file (this one is named config) there is a section for specifying the parameters for user-voting and a separate section for specifying parameters for movie-voting. E.g., for movie voting, at the bottom, there are 3 external prunings possible (0 or more can be chosen): 1. an intial pruning of dimensions to be used (since dimensions are user, it prunes supM): 2. a pruning of movie voters, N, (in supU) 3 a final pruning of dimensions (CoSupport(M,N) for the specific movie voter, N. E.g., parameters are specified for this final prune as follows: [movie_voting Prune_Users_in_CoSupMN] method = UserCommonCoSupportPrune leftside = 0 width = 8000 mstrt = 0 mstrt_mult = 0.0 ustrt = 0 ustrt_mult = 0.0 TSa = -100 TSb = -100 Tdvp = -1 Tdvs = -1 Tvdp = -1 Tvds = -1 TD = -1 TP = -1 PPm = .1 TV = -1 TSD = -1 Ch = 1 Ct = 2 Note: all thresholds are for similarities, not distance i.e., when we start with a distance we follow it with the Gaussian to make it a similarity or correlation.
APPENDIX: The Program (old) mpp-mpred.C mpp-user.C movie-vote.C user-vote.C prune.C mpp-mpred.C reads a Neflix PROBE file, loops thru the (Mi, ProbeSupport(Mi)’s passing each to mpp-user.C, which calculates and prints PredictedRating(Mi,U) to the file, “prediction”, UProbeSupport(Mi). mpp-mpred.C can also call separate instances of mpp-user.C for many Us, to be processed in parallel (governed by the number of "slots" specified in 1st code line.) (Mi, ProbeSupport(Mi)) From votes, calculates and writes Predict(Mi,U) to predictions UProbeSupport(Mi) mpp-user.Cloops thru ProbeSupport(M), reads in the matchedup config file, prints prediciton(M,U) to file, predictions. For the user-vote-approach, mpp-user.C calls user-vote.C, passing (M, Support(M), U, Support(U)). For the movie-vote-approach, mpp-user.C calls movie-vote.C, passing (M, Support(M), U, Support(U). vote(M,U) ( M, Support(M), U, Support(U)) ( M, Support(M), U, Support(U) ) VOTE(M,U) user-vote.Cdoes the specified pruning (from the config etc.) by calling prune.C, then loops thru the pruned set of user voters. For each V, calculates a vote, then combines those votes and returns predictedRating(M,U). Must we loop thru V’s (Vertical Processing of Horizontal Data or VPHD) rather than to calculate the vote by Horizontally Processing across the Vertical Data (HPVD) consisting of the MoviePtrees for mid1, ..., midk, recalling from slide1 that V=uid votes if it is near enough to UID in it’s ratings of movies M={mid1, ..., midk}. The reason: the Horizontal Processing (Ptree processing) required of most correlation calculations is nearly impossible to formulate using AND/OR/COMP (so far, anyway). movie-vote.C similar.
APPENDIX: Collaborative Filteringis the prediction of likes and dislikes (retail or rental) from the history of previous expressed ratings (filtering new likes thru the historical filter of “collaborator” likes) E.g., the $1,000,000 Netflix Contestwas to develop aratings prediction program that can beat the one Netflix currently uses (called Cinematch) by 10% in predicting what rating users gave to movies. I.e., predict rating(M,U) where (M,U) QUALIFYING(MovieID, UserID). Netflix uses Cinematch to decide which movies a user will probably like next (based on all past rating history). All ratings are "5-star" ratings (5 is highest. 1 is lowest. Caution: 0 means “did not rate”). Unfortunately rating=0 does not mean that the user "disliked" that movie, but that it wasn't rated at all. Most “ratings” are 0. That’s the main reason we don’t want to use standard vector space distance for “near”. A "history of ratings given by users to movies“, TRAINING(MovieID, UserID, Rating, Date) is provided, with which to train your predictor, which will predict the ratings given to QUALIFYING movie-user pairs (Netflix knows the rating given to Qualifying pairs, but we don't.) Since the TRAINING is very large, Netflix also provides a “smaller, but representative subset” of TRAINING, PROBE(MovieID, UserID)(~2 orders of magnitude smaller than TRAINING). Netflix gave 5 years to submit QUALIFYING predictions. That contest was won in the late summer of 2009, when the submission window was about 1/2 gone. The Netflix Contest Problem is an example of the Collaborative Filtering Problem which is ubiquitous in the retail business world (How do you filter out what a customer will want to buy or rent next, based on similar customers?).