Netflix Contest: Ratings Prediction Program Development

The $1,000,000 Netflix Contest is to develop a "ratings prediction program" that beats Netflix's own program (called Cinematch) by 10% (measured by RMSE) in predicting the ratings users gave to movies, i.e., predict rating(M,U) where (M,U) ∈ QUALIFYING(MovieID, UserID). Netflix uses Cinematch to decide which movies a user will probably like next (based on all past rating history).
All ratings are "5-star" ratings (5 is highest, 1 is lowest). Caution: 0 means "did not rate"; it does not mean that the user disliked the movie. Most "ratings" are 0. Therefore, the ratings data sets are NOT vector spaces!
One can approach the Netflix contest problem as a data mining Classification or Prediction problem. A "history of ratings by users to movies", TRAINING(MovieID, UserID, Rating, Date), is given with which to train your predictor, which will predict the ratings given to QUALIFYING movie-user pairs (Netflix knows the ratings given to QUALIFYING pairs, but you don't). Since TRAINING is very large, Netflix also provides a "smaller, but representative subset" of TRAINING, PROBE(MovieID, UserID) (~2 orders of magnitude smaller than TRAINING).
Netflix gives 5 years to submit QUALIFYING predictions. That contest window is about 1/2 gone now. A team can submit as many solutions as it wishes, at any time. Each October, Netflix gives $50,000 to the team at the top of the so-called Netflix Leaderboard. BellKor has won that twice.

The Netflix Contest (USER versus MOVIE voting)
One can address the prediction or classification problem using several different "approaches".
USER VOTERs (approach 1): To predict the rating of a pair (M,U), we take TRAINING as a vector space of user rating vectors. The users are the points in the vector space and the movies are its dimensions. Since there are 17,770 movies, each user is a tuple of 17,770 ratings if all movies are used as dimensions. That's too many dimensions!
The first dimension pruning restricts to only those movies that U has rated (= supportU). We also allow another round of dimension pruning based on correlation with M. Once the dimension (movie) set is pruned, we pick a "Near Neighbor Set of users to U" (NNS) from the users, V, who have rated M (= supportM). "Near" is defined based on correlation with U. One can think of this step as the voter pruning step.
Note: most correlation calculations involve the other variable also, i.e., the result of a user pruning depends on the pruned movie set and vice versa. Thus, theoretically, the movie/user pruning steps could be alternated ad infinitum! Our current approach is to allow an initial global dimension prune, then the voter prune, then a final dimension prune. You will see these 3 prune steps in the .config files.
The best way to think about the 3 pruning steps (and there could be more!) is that we prune down the dimensions so that vector space methods are tractable, ameliorating the curse of dimensionality:
1. The first, which may be turned off, is a global dimension prune (not based on individual voters).
2. The second is the voter prune, based on the currently pruned dimensions.
3. The third is a final dimension prune (different for each voter), which gives the final vector space over which the vote by that voter is calculated.
Then we let those VOTERS vote as to the best rating prediction to be made, though they don't necessarily cast the straightforward rating(M,V) vote. There are many ways to prune, vote, tally, and decide on the final prediction. These choices make up the .config file.
MOVIE VOTERs (approach 2) is identical, with the roles of Movies (voters) and Users (dimensions) reversed.
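The user-voter idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the program's code: the sparse Ratings map, the names mean_rating, pearson and predict_user_vote, the neighbor count k, and the correlation-weighted mean-offset vote are all hypothetical choices made for the example. The real pruning and voting choices come from the .config file.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sparse store: ratings[user][movie] = 1..5 (absent = not rated).
using UserRatings = std::map<int, int>;
using Ratings = std::map<int, UserRatings>;

// Mean of the ratings a user actually gave (0.0 if the user rated nothing).
double mean_rating(const UserRatings& r) {
    double sum = 0.0;
    for (const auto& mr : r) sum += mr.second;
    return r.empty() ? 0.0 : sum / r.size();
}

// Pearson correlation over the movies both users rated (their co-support).
// Returns 0 when fewer than 2 co-rated movies exist or a variance is 0.
double pearson(const UserRatings& a, const UserRatings& b) {
    std::vector<std::pair<double, double>> co;
    for (const auto& mr : a) {
        auto it = b.find(mr.first);
        if (it != b.end()) co.emplace_back(mr.second, it->second);
    }
    if (co.size() < 2) return 0.0;
    double ma = 0.0, mb = 0.0;
    for (const auto& p : co) { ma += p.first; mb += p.second; }
    ma /= co.size();
    mb /= co.size();
    double sab = 0.0, saa = 0.0, sbb = 0.0;
    for (const auto& p : co) {
        sab += (p.first - ma) * (p.second - mb);
        saa += (p.first - ma) * (p.first - ma);
        sbb += (p.second - mb) * (p.second - mb);
    }
    return (saa == 0.0 || sbb == 0.0) ? 0.0 : sab / std::sqrt(saa * sbb);
}

// Predict rating(M,U), assuming U has not rated M in `ratings`.
double predict_user_vote(const Ratings& ratings, int M, int U, std::size_t k) {
    assert(ratings.count(U) == 1);
    const UserRatings& ru = ratings.at(U);
    const double ubar = mean_rating(ru);
    // Voter prune: users in supportM that correlate positively with U.
    std::vector<std::pair<double, int>> voters;   // (correlation, userID)
    for (const auto& uv : ratings) {
        if (uv.first == U || uv.second.count(M) == 0) continue;
        const double c = pearson(ru, uv.second);
        if (c > 0.0) voters.emplace_back(c, uv.first);
    }
    std::sort(voters.rbegin(), voters.rend());    // most correlated first
    if (voters.size() > k) voters.resize(k);      // keep the k nearest
    // Each voter offers its offset from its own mean, weighted by correlation.
    double num = 0.0, den = 0.0;
    for (const auto& v : voters) {
        const UserRatings& rv = ratings.at(v.second);
        num += v.first * (rv.at(M) - mean_rating(rv));
        den += v.first;
    }
    const double vote = den > 0.0 ? ubar + num / den : ubar;
    return std::min(5.0, std::max(1.0, vote));    // force into the 1..5 range
}
```

With no eligible voters the sketch falls back to U's own mean rating. Swapping the roles of users and movies in the same code gives a movie-voter variant.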

The Netflix Contest (Using SLURM to generate a clustering)
SLURM has been set up to run on the Penryn Cluster2 (32 8-processor machines, 1 terabyte of main memory) so that one can create a .config file (the name must end in .config) which specifies all the parameters for the program. Issuing:
./mpp-submit -S -i Data/probe-full.txt -c pf.0001/u.00.00/u.00.00.config -t .0001 -d ./pf.0001
the program pulls its parameters from the .config file; -t .0001 means SquareError threshold = .0001, and -d ./pf.0001 means results go to the ./pf.0001 directory. The program takes as input the file Data/probe-full.txt (which is not quite the full probe, but close).
General form:
mpp-submit -S -i InputFile.txt -c ConfigFile.config -t SqErrThrhld -d Dir
Takes as input:
InputFile.txt (MovieID with interleaved UserIDs format, or .txt format; see next slide)
ConfigFile.config (shows which program to run, in .config format; see next slide)
SqErrThrhld (if PredictionSqErr ≤ SqErrThrhld, put the pair in Dir/lo-InputFile.txt, else put it in Dir/hi-InputFile.txt)
Dir (existing directory for the output)
Puts as output (in Dir):
lo-InputFileName.txt
hi-InputFileName.txt
InputFileName.config
InputFileName.rmse
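The lo/hi split by squared error can be sketched as below. This is a minimal illustration, not the mpp-submit implementation: the Scored struct and the function names are hypothetical, and computing a root mean squared error for the .rmse output is an assumption based on the file name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One scored prediction for a (movie, user) pair. Field names are illustrative.
struct Scored {
    int movie;
    int user;
    double predicted;
    double actual;   // known rating (available for PROBE pairs)
};

// Squared error of one prediction.
double sq_err(const Scored& s) {
    const double d = s.predicted - s.actual;
    return d * d;
}

// Split pairs by squared error: <= threshold goes to `lo`, the rest to `hi`.
void split_by_sq_err(const std::vector<Scored>& all, double threshold,
                     std::vector<Scored>& lo, std::vector<Scored>& hi) {
    for (const Scored& s : all)
        (sq_err(s) <= threshold ? lo : hi).push_back(s);
}

// Root mean squared error over a non-empty set of scored pairs.
double rmse(const std::vector<Scored>& all) {
    assert(!all.empty());
    double sum = 0.0;
    for (const Scored& s : all) sum += sq_err(s);
    return std::sqrt(sum / all.size());
}
```

With a threshold as tight as .0001, only predictions within 0.01 of the true rating land in the lo set, so each lo file collects the pairs that one config predicts almost exactly.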

The Netflix Contest (Using SLURM to generate a clustering)
./mpp-submit -S -i Data/probe-full.txt -c pf.0001/u.00.00/u.00.00.config -t .0001 -d ./pf.0001
InputFile: Data/probe-full.txt
1: 30878 2647871 1283744 2488120 317050 1904905 1989766 14756 1027056 1149588 1394012 1406595 2529547 1682104 2625019 2603381 1774623 470861 712610 1772839 1059319 2380848 548064
2: 1959936 748922 1131325 1312846 2314531 1636093 584750 2418486 715897 1172326
etc.
where 1: and 2: are movieIDs and the others are userIDs. Note that this is an interleaved format of a 2-column DB file, probe-full(movieID, userID).
ConfigFile: pf.0001/u.00.00/u.00.00.config
The program sets parameters as specified in the .config:
user_voting = enabled
movie_voting = disabled
user_vote_weight = 1 # processed only if user voting enabled.
[user_voting]
Prune_Movie_in_SupU = disabled
Prune_Users_in_SupM = enabled
Prune_Movies_in_CoSupUV = enabled
[Prune_Movies_in_SupU]
method=MoviePrune
leftside = 0
width = 30000
mstrt = 0
mstrt_mult=0
ustrt = 0
ustrt_mult=0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 1
[Prune_Users_in_SupM]
method=UserCommonCoSupportPrune
leftside = 0
width = 30000
mstrt = 0
mstrt_mult=0
ustrt = 0
ustrt_mult=0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 1
[Prune_Movies_in_CoSupUV]
method=MovieCommonCoSupportPrune
leftside = 0
width = 2000
mstrt = 0
mstrt_mult=0
ustrt = 0
ustrt_mult=0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 1
(A part identical to the user-voting sections above follows for the movie-voting parameters.)
Only the method, leftside, width, Ch (=Choice), and Ct (=Count) parameters are used at this time. Using this program, the many "lo-u.xx.xx" and, if movie voting is also enabled, "lo-m.yy.yy" files constitute what we have called a clustering (though they're not mutually exclusive).
Once we have {lo-z.xx.yy | z = u or m}, we can make a submission as follows: for each qualifying pair (m,u), use correlations to pick the program to make that prediction.
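The interleaved input format is easy to read token by token. The sketch below is a standalone illustration of that idea (mpp-mpred.C does something similar inside its movie loop); the function names parse_interleaved and parse_interleaved_text are hypothetical.

```cpp
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse the interleaved format: a token ending in ':' starts a new movie;
// every other token is a userID belonging to the current movie.
std::map<int, std::vector<int>> parse_interleaved(std::istream& in) {
    std::map<int, std::vector<int>> pairs;   // movieID -> userIDs
    std::string tok;
    int movie = -1;                          // no movie header seen yet
    while (in >> tok) {
        if (tok.back() == ':') {
            tok.pop_back();
            movie = std::atoi(tok.c_str());
            pairs[movie];                    // create the entry even if no users follow
        } else if (movie != -1) {
            pairs[movie].push_back(std::atoi(tok.c_str()));
        }
    }
    return pairs;
}

// Convenience wrapper for in-memory text.
std::map<int, std::vector<int>> parse_interleaved_text(const std::string& text) {
    std::istringstream in(text);
    return parse_interleaved(in);
}
```

For example, parse_interleaved_text("1: 30878 2647871 2: 1959936") yields movie 1 with two users and movie 2 with one user. UserIDs that appear before any movie header are ignored.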

The Netflix Contest (Using this scheme to predict Qualifying pair ratings)
The above prediction scheme requires the existence of Square Errors (SqErr); e.g., the cluster files lo-u.vv.nn.txt and lo-m.nn.vv.txt are composed of all input pairs such that SqErr ≤ .0001. To predict rating(M,U) for pairs from Qualifying, we won't have answers, so we won't have SqErrs of our predictions relative to those answers. So how can we form good clusters then? Once that's decided, what matchup algorithm should we use to match a cluster (program) to a Qualifying pair to be predicted?
After the clusters are created, we can try the matchup algorithms that worked best for Probe predictions, but we may want to develop new ones because the performance of those matchup algorithms may depend on the way the clusters were created.
Could we use the same 288 configs to generate a new config-subset-collection of Qualifying pairs using, e.g., some kind of prediction variation instead of thresholded prediction SqErr? lo-u.vv.nn.txt could be constructed to consist of Qualifying pairs as follows (a variation-based method):
Set all answers in Qualifying to 1. Use ./mpp-submit to create clusters as above (threshold=.0001) in a directory, q1.
Set all answers in Qualifying to 2. Use ./mpp-submit to create clusters as above (threshold=.0001) in a directory, q2.
Etc., for each of the 5 rating values.
This will create a clustering of 288*5=1440 cluster sets (but, of course, only 288 different program configs).
One could match up a Qualifying pair using count-based correlations, Pearson correlations, 1-perpendicular correlations, or something else. One could match up (M,U) with the cluster in which the sum of the M and U counts (or the counts relative to cluster size) is max. Other ideas?
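The count-based matchup suggested above can be sketched as follows. This is one possible reading, not an implemented algorithm from the program: a cluster is taken to be the list of (movieID, userID) pairs in one lo- file, and match_cluster, Pair and Cluster are hypothetical names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;    // (movieID, userID)
using Cluster = std::vector<Pair>;   // contents of one lo- file

// Index of the cluster maximizing count(M) + count(U), where count(M) is the
// number of pairs in the cluster with movie M and count(U) likewise for user U.
// If `relative` is true, the sum is divided by the cluster size.
// Returns -1 when no cluster contains M or U at all.
int match_cluster(const std::vector<Cluster>& clusters, int M, int U, bool relative) {
    int best = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        double count = 0.0;
        for (const Pair& p : clusters[i]) {
            if (p.first == M) ++count;
            if (p.second == U) ++count;
        }
        const double score = (relative && !clusters[i].empty())
                                 ? count / clusters[i].size()
                                 : count;
        if (score > best_score) {
            best_score = score;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The absolute and relative variants can disagree: a large cluster may hold more occurrences of M and U in total, while a small cluster may be more concentrated on them.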

The Netflix Files
{Mi}, i=1..17770, given by Netflix as Mi(uID, Rating, Date): for each MovieID, Mi, this is a file of all users who rated it, the rating, and the rating date.
Training(Uid, Mid, R, D) ordered by Uid: TRAINING in MySQL with key (uID, mID); avg: 209 m/u.
Training(Mid, Uid, R, D) ordered by Mid: TRAINING in MySQL with key (mID, uID); avg: 5655 u/m.
Both use 11-bit day numbers starting at 1=1/1/99 and ending at 2922=12/31/06.
Sizes: 17,770 movies (m1..m17770); 480,189 users (u1..u480189, with userIDs ranging up to 2649429); 100,480,507 ratings.
[Figure: bit-sliced TRAINING as an M-U interaction cube (Rolodex Model, m\u). Axes: movies m1..m17770, users u1..u480189, and bit slices b of the ratings r(mh,uk) and day_numbers; example PTrees are labeled Pmh,2 and Pu480189,0.]
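Bit slicing of ratings can be illustrated with plain bit vectors. This is a sketch under assumptions, not the PTree implementation: uncompressed std::vector<bool> slices stand in for PTrees, and RatingSlices, slice, support_count and count_equal are hypothetical names. Three slices suffice because ratings 0..5 fit in 3 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One movie's ratings as three vertical bit slices (bit 2, bit 1, bit 0),
// indexed by user position. Uncompressed stand-in for three PTrees.
struct RatingSlices {
    std::vector<bool> b2, b1, b0;
};

// Build the slices from ratings in 0..5 (0 = did not rate).
RatingSlices slice(const std::vector<int>& ratings) {
    RatingSlices s;
    for (int r : ratings) {
        assert(r >= 0 && r <= 5);
        s.b2.push_back(((r >> 2) & 1) != 0);
        s.b1.push_back(((r >> 1) & 1) != 0);
        s.b0.push_back((r & 1) != 0);
    }
    return s;
}

// Support: users with a nonzero rating, i.e. b2 OR b1 OR b0.
std::size_t support_count(const RatingSlices& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.b0.size(); ++i)
        if (s.b2[i] || s.b1[i] || s.b0[i]) ++n;
    return n;
}

// Users whose rating equals `value`: AND of each slice or its complement.
std::size_t count_equal(const RatingSlices& s, int value) {
    assert(value >= 0 && value <= 5);
    const bool v2 = ((value >> 2) & 1) != 0;
    const bool v1 = ((value >> 1) & 1) != 0;
    const bool v0 = (value & 1) != 0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.b0.size(); ++i)
        if (s.b2[i] == v2 && s.b1[i] == v1 && s.b0[i] == v0) ++n;
    return n;
}
```

The point of the vertical layout is that questions such as "who rated this movie?" or "how many users gave it a 5?" reduce to logical operations and counts on whole slices.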

The Program: Code Structure - the main modules
mpp-mpred.C, mpp-user.C, movie-vote.C, user-vote.C, prune.C
mpp-mpred.C reads a Netflix PROBE file Mi(Uid) and passes Mi and ProbeSupport(Mi) to mpp-user.C to make predictions for each pair (Mi,U), for each U ∈ ProbeSupport(Mi). It can also call separate instances of mpp-user.C for many movies, to be processed in parallel (governed by the number of "slots" specified in the 1st code line).
mpp-user.C loops through ProbeSupport(M) (the ULOOP), reading in the designated (matched-up) config file, then writing out an (Mi,U) prediction for each U. If the user-vote approach is used, mpp-user.C calls user-vote.C, passing it (M, Support(M), U, Support(U)). If the movie-vote approach is used, mpp-user.C calls movie-vote.C, passing it (M, Support(M), U, Support(U)).
user-vote.C does the specified pruning by calling prune.C, loops through the pruned set of user voters, V, calculating a vote for each, then combines those votes and returns a prediction_vote(M,U). movie-vote.C does similarly.

What kind of pruning can be specified?
Again, all parameters are specified in a configuration file, and the values specified there are consumed at runtime using, e.g., the call:
mpp -i Input_.txt_file -c config -n 16
where Input_.txt_file is the input Probe subset file and 16 is the number of parallel processes that mpp-mpred.C will generate (here, 16 movies are processed in parallel, each sent to a separate instantiation of mpp-user.C). A sample config file is given later.
There are up to 3 types of pruning used (for pruning down support(M), the set of all users that rated M, or pruning down support(U), the set of all movies that U rated):
1. correlation or similarity threshold based pruning
2. count based pruning
3. ID window based pruning
Under correlation or similarity threshold based pruning, and using support(M)=supM as the example (pruning support(U) is similar), we allow any function f: supM × supM → [0, HighValue] to be called a user correlation provided only that f(u,u)=HighValue for every u in supM. Examples include Pearson_Correlation, Gaussian_of_Distance, 1_perp_Correlation (see appendix of these notes), relative_exact_rating_match_count (Tingda is using), dimension_of_common_cosupport, and functions based on Standard Deviations.
Under count based pruning, we usually order by one of the correlations above first (into a multimap), then prune down to a specified count of the most highly correlated.
Under ID window based pruning, we prune down to a window of userIDs within supM (or movieIDs within supU) by specifying a leftside (a number added to U, so leftside is relative to U as a userID) and a width.
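Count based and ID window based pruning can be sketched as below. These are illustrations, not the code in prune.C: the names BySimilarity, prune_by_count and prune_by_id_window are hypothetical, and treating the window as the half-open ID interval [U + leftside, U + leftside + width) is an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Candidates ordered by descending similarity (a multimap allows ties).
using BySimilarity = std::multimap<double, int, std::greater<double>>;

// Count based prune: keep the `count` most highly correlated IDs.
std::vector<int> prune_by_count(const BySimilarity& by_sim, std::size_t count) {
    std::vector<int> kept;
    for (const auto& entry : by_sim) {
        if (kept.size() == count) break;
        kept.push_back(entry.second);
    }
    return kept;
}

// ID window based prune: keep IDs in [U + leftside, U + leftside + width).
std::vector<int> prune_by_id_window(const std::vector<int>& ids, long U,
                                    long leftside, long width) {
    const long lo = U + leftside;
    const long hi = lo + width;
    std::vector<int> kept;
    for (int id : ids)
        if (id >= lo && id < hi) kept.push_back(id);
    return kept;
}
```

A window prune needs no correlation calculation at all, which makes it a cheap first cut before a count based prune orders the survivors.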

How does one specify prunings?
Again, in a file (this one is named config) there is a section for specifying the parameters for user voting and a separate section for specifying the parameters for movie voting. E.g., for movie voting there are 3 external prunings possible (0 or more can be chosen):
1. an initial pruning of the dimensions to be used (since the dimensions are users, it prunes supM);
2. a pruning of the movie voters, N (in supU);
3. a final pruning of the dimensions (CoSupport(M,N)) for the specific movie voter, N.
E.g., parameters are specified for this final prune as follows:
[movie_voting Prune_Users_in_CoSupMN]
method = UserCommonCoSupportPrune # specifies the type of prune (see the 3 types below)
leftside = 0 # specify leftside (from Uid) of an ID interval prune of supM
width = 8000 # specify the width of an ID interval prune of supM
mstrt = 0 # specify starting movie (intercept) for N loop
mstrt_mult = 0.0 # specify starting movie (slope) for N loop
ustrt = 0 # specify starting movie (intercept) for V loop
ustrt_mult = 0.0 # specify starting movie (slope) for V loop
TSa = -100 # specify PearsonCorr threshold (a=Amal, meaning: use Amal's table lookup)
TSb = -100 # specify PearsonCorr threshold (b=bill, meaning: use bill's formula; note that if there has been prior pruning this will have a different value than Amal's)
Tdvp = -1 # threshold, "diff of vectors" population-based std_dev prune
Tdvs = -1 # threshold, "diff of vectors" sample-based std_dev prune
Tvdp = -1 # threshold, "vector of diffs" population-based std_dev prune
Tvds = -1 # threshold, "vector of diffs" sample-based std_dev prune
TD = -1 # threshold, (Gaussian of) Euclidean distance based prune
TP = -1 # threshold for (Gaussian of) 1perpendicular distance prune
PPm = .1 # exponent for (Gaussian of) 1perpendicular distance prune
TV = -1 # threshold, (Gaussian of) a variation based prune
TSD = -1 # threshold, std_dev based prune
Ch = 1 # picks ordering for count-based prune below: 1=Amal_Pearson, 2=Bill_Pearson, etc.
Ct = 2 # threshold for count based prune
There are 3 types of prune: UserPrune, with a full range of possibilities; UserFastPrune, with just PearsonCorrelation pruning; and CommonCoSupportPrune, which orders users, V, according to the size of their CommonCoSupport with U only (note that this is a correlation of sorts too).
Note: all thresholds are for similarities, not distances; i.e., when we start with a distance we follow it with the Gaussian to make it a similarity or correlation.
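Turning a distance into a similarity with a Gaussian can be sketched as follows. This is an illustration only: the exact Gaussian form, the sigma parameter and the names euclidean, gaussian_similarity and passes_threshold are assumptions, not the formulas used in prune.C.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two rating vectors over the same pruned dimensions.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Gaussian of a distance: exp(-d^2 / (2 sigma^2)). It equals 1 at d = 0, so a
// vector compared with itself gets the highest value, as a similarity should.
double gaussian_similarity(double d, double sigma) {
    assert(sigma > 0.0);
    return std::exp(-(d * d) / (2.0 * sigma * sigma));
}

// Similarity threshold prune: a candidate survives if it is similar enough.
bool passes_threshold(double similarity, double threshold) {
    return similarity >= threshold;
}
```

Under this reading, a negative threshold such as TD = -1 lets every candidate through, since the Gaussian is never negative. That is consistent with the sample values above, though the config semantics are not confirmed here.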

mpp-mpred.C1
/** \file
 *
 * This contains the main entry point and contains the code for driving
 * the multi-process shared memory implementation of the vertical PTree
 * based predictor system.
 */
/* Standard includes. */
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <wait.h>
#include <sys/types.h>
#include <time.h>
/* Standard C++ includes. */
#include <fstream>
#include <iostream>
#include <vector>
/* Local C++ includes. */
#include "mppConfig.H"
#include "PredictionConfig.H"
#include "UserSet.H"
#include "MovieSet.H"
#include "mpp.h"
using namespace std;
/* Definition of structures static to this module. */
struct task_table {
    int pid;
    int movie;
    int predictions;
    time_t start;
};
/*
 * The following two global variables define the two sets of PTree's
 * which will be used to carry out the predictions.
 *
 * The UserSet of PTree's have user rating PTree's across the vertical
 * axis of the table.  Each rating is encoded using three PTree's.
 *
 * The MovieSet has movie rating PTree's across the vertical axis of
 * the table.  Each movie is encoded using three PTree's.
 */
UserSet Users;
MovieSet Movies;
int topMovK = 5, verK = 50;
bool use_pearson_movies = false;
/*
 * The minimum user correlation required to be eligible to participate
 * in voting.
 */
float Minimum_User_Correlation = 0.5;
float corData[17771];
unsigned short int supData[17771];
string probe;
/* External functions. */
extern int Mpred_User_Predict(mppConfig &, unsigned long int, vector <int> &, \
                              PTree &);
/**
 * Internal private function.
 *
 * This function prints the current status of the task table.  It is
 * an encapsulation function for reducing the complexity of the
 * job_table function.
 * In the case of either transaction a status table is printed out
 * which reflects the current progress of the simulation.
 *
 * \param max_slots  The maximum number of subordinate processes
 *                   which will be managed.
 *
 * \param table      A pointer to the task table which is to
 *                   be changed.
 *
 * \param changed    The slot number in the task table which is
 *                   being updated.

mpp-mpred.C2
 * \param reason     A character pointer to a description string
 *                   indicating why the table is being updated.
 */
extern void print_job_table(int max_slots, \
                            struct task_table const * const table, \
                            int const changed, char const * const reason)
{
    auto int entry;
    auto time_t now = time(NULL);
    fprintf(stdout, "Task status change: %s", ctime(&now));
    fputs("\tSlot\t PID\tMovie\tUsers\n", stdout);
    fputs("\t----\t-----\t-----\t-----\n", stdout);
    for (entry= 0; entry < max_slots; ++entry) {
        fprintf(stdout, "\t%-5d\t%5d\t%5d\t%5d", entry, \
                table[entry].pid, table[entry].movie, \
                table[entry].predictions);
        if ( entry == changed )
            fprintf(stdout, "\t<- %s\n", reason);
        else
            fputs("\n", stdout);
    }
    fputs("\n", stdout);
    return;
}
/**
 * Internal private function.
 *
 * This function maintains a table which correlates process ID's with
 * the movies they are processing, the total number of predictions
 * required per movie and the time required to process a movie.
 *
 * Depending on the value of the movie number argument this function
 * either stores the relationship or retrieves the movie associated
 * with the PID.
 *
 * In the case of either transaction a status table is printed out
 * which reflects the current progress of the simulation.
 *
 * \param max_slots     The maximum number of subordinate processes
 *                      which are under management.
 * \param pid           The process ID number.
 * \param movie_number  A movie value of zero causes this function to locate
 *                      and return the PID of the subordinate slave process
 *                      which is processing the movie.  A non-zero value
 *                      causes the PID to be stored in the relationship array.
 * \param predictions   This argument is only referenced when an update
 *                      is made to the task table.  This argument is
 *                      the number of customer predictions to be made
 *                      for the movie being scheduled
 * \return              No return values are defined.
 */
extern void job_table(int max_slots, int const pid, int const movie_number, \
                      int const predictions)
{
    auto char msg[50];
    auto int lp, changed = 0;
    auto time_t now = time(NULL);
    static int movie_count = 0, prediction_count = 0;
    static bool first = true;
    static struct task_table *table;
    /* Initialize the process table on the first call. */
    if ( first ) {
        size_t amt = max_slots * sizeof(struct task_table);
        table = (struct task_table *) malloc(amt);
        if ( table == NULL ) {
            fputs("Cannot allocate job table.\n", stderr);
            exit(1);
        }
        for (lp= 0; lp < max_slots; ++lp) {
            table[lp].pid = 0;
            table[lp].movie = 0;
            table[lp].predictions = 0;
            table[lp].start = 0;
        }

mpp-mpred.C4
        first = false;
    }
    /* Add a task to the table. */
    if ( movie_number != 0 ) {
        for (lp= 0; lp < max_slots; ++lp) {
            if ( table[lp].pid == 0 ) {
                changed = lp;
                table[lp].pid = pid;
                table[lp].movie = movie_number;
                table[lp].predictions = predictions;
                table[lp].start = now;
                print_job_table(max_slots, table, changed, \
                                "Started");
                fflush(stdout);
                return;
            }
        }
    }
    /* Remove a task from the table. */
    for (lp= 0; lp < max_slots; ++lp) {
        if ( table[lp].pid == pid ) {
            auto time_t run_time = time(NULL) - table[lp].start;
            auto float per_user = run_time;
            prediction_count += table[lp].predictions;
            snprintf(msg, sizeof(msg), "Completed: %lu " \
                     "[%.2f/user] secs.", run_time, \
                     per_user/table[lp].predictions);
            print_job_table(max_slots, table, lp, msg);
            table[lp].pid = 0;
            table[lp].movie = 0;
            table[lp].predictions = 0;
            table[lp].start = 0;
            fprintf(stdout, "\tMovies: %5d\tPredictions: %d\n\n", \
                    ++movie_count, prediction_count);
            fflush(stdout);
            return;
        }
    }
}
/**
 * Main program starts here.
 */
int main(int argc, char **argv)
{
    /* The following variable controls whether or not movie predictions
     * are to be run in parallel, ie. each in its own process. */
    auto bool have_input = false, single_threaded = true;
    char snbufr[10];
    int movie_count = 0;
    int max_process_slots, process_count = 0;
    pid_t pid;
    time_t run_start, t1, t2;
    string data_root = PTREEDATA"/";
    string corr_root = data_root + "mv_corr/co_mv_";
    string supp_root = data_root + "mv_supp/sp_mv_";
    string ptree_set_id = data_root + "nf_us_mv_pt";
    string ptree_set_idT = data_root + "nf_mv_us_pt";
    ifstream inFile1;
    ifstream inFile2;
    auto mppConfig config;
    /* Option parsing. */
    auto int gopt;
    while ( (gopt = getopt(argc, argv, "C:c:i:n:")) != EOF ) {
        switch ( gopt ) {
        case 'c':
            if ( !config.read_config(optarg) ) {
                fprintf(stderr, "%s: Cannot read " \
                        "standard configuration - " \
                        "%s\n", argv[0], optarg);
                exit(1);
            }

mpp-mpred.C5
            break;
        case 'C':
            if ( !config.read_cluster_config(optarg) ) {
                fprintf(stderr, "%s: Cannot read " \
                        "cluster configuration - " \
                        "%s\n", argv[0], optarg);
                exit(1);
            }
            break;
        case 'i':
            have_input = true;
            probe.assign(optarg);
            break;
        case 'n':
            single_threaded = false;
            max_process_slots = atoi(optarg);
            break;
        }
    }
    if ( !have_input ) {
        fprintf(stderr, "%s: No input file specified.\n", argv[0]);
        return 1;
    }
    if ( !config.is_standard_config() && !config.is_cluster_config() ) {
        fprintf(stderr, "%s: No configuration specified.\n", argv[0]);
        return 1;
    }
    fprintf(stderr, "%s: Vertical Rating Predictor - %s\n\n", argv[0], VERSION);
    fputs("Data files:\n", stderr);
    fprintf(stderr, "\tid:\t%s\n", ptree_set_id.c_str());
    fprintf(stderr, "\tidT:\t%s\n", ptree_set_idT.c_str());
    fprintf(stderr, "\tsupp:\t%s*\n", supp_root.c_str());
    fprintf(stderr, "\tcorr:\t%s*\n\n", corr_root.c_str());
    fprintf(stderr, "\tInput:\t%s\n\n", probe.c_str());
    if ( single_threaded )
        fputs("Mode: single-threaded\n", stderr);
    else
        fprintf(stderr, "Mode: %d way multi-processor\n", \
                max_process_slots);
    if ( config.is_standard_config() ) {
        auto PredictionConfig *pcfg = config.get_standard_config();
        fputs("\nPrediction configuration:\n", stderr);
        pcfg->print(stderr);
    }
    /** Load the rating data as two separate sets of PTree's. */
    t1=time(NULL);
    fputs("Data load started.\n", stderr);
    fputs("\tUser ptrees - ", stderr);
    if ( !Users.load_binary() ) {
        fputs("\n\nFailed load.\n", stderr);
        return 1;
    }
    fputs("identities - ", stderr);
    if ( !Users.load_identities() ) {
        fputs("\n\nFailed load.\n", stderr);
        return 1;
    }
    fputs("completed.\n", stderr);
    fputs("\tMovie ptrees - ", stderr);
    if ( !Movies.load_binary() ) {
        fputs("\n\nFailed load.\n", stderr);
        return 1;
    }
    fputs("completed.\n", stderr);
    t2=time(NULL);
    fprintf(stderr, "Data load completed, time = %u\n\n", t2 - t1);
    ifstream inFile;
    inFile.open(probe.c_str() );
    char str[100];
    int last_movie_id = 0, new_movie_id = 0;
    bool last_movie = true;
    inFile>>str;
    string str1(str);
    str1.erase(str1.size()-1);
    new_movie_id = atoi(str1.c_str());
    /* Start of loop over movies begins here. */
    run_start = time(NULL);
    for(int movie_cnt= 0; !inFile.eof(); movie_cnt++) {
        vector <int> probeUs;

mpp-mpred.C6
        ++movie_count;
        last_movie_id = new_movie_id;
        last_movie = true;
        while( last_movie && (inFile>>str) ) {
            string str1(str);
            if (str1.at(str1.size() - 1) == ':') {
                str1.erase(str1.size() - 1);
                new_movie_id = atoi(str1.c_str());
                last_movie = false;
            }
            else
                probeUs.push_back(atoi(str1.c_str()));
        }
        /* M is the movie to be predicted. */
        t1 = time(NULL);
        unsigned long int M = last_movie_id - 1;
        /* read the pearson correlations for movies
         * NOTE using pearson not Perp
         * Try to find best correlated movie set for
         * pmv */
        snprintf(snbufr, sizeof(snbufr), "%d", last_movie_id);
        string sn(snbufr);
        string outCorr1 = corr_root + sn + ".bin";
        inFile1.open( outCorr1.c_str() );
        string outSupp1 = supp_root + sn + ".bin";
        inFile2.open( outSupp1.c_str() );
        inFile1.read(reinterpret_cast<char*>(&corData), \
                     17771*sizeof(float));
        inFile2.read(reinterpret_cast<char*>(&supData), \
                     17771*sizeof(short int));
        inFile1.close();
        inFile2.close();
        /* Get the list of users who have rated this movie. */
        auto PTree user_list = Movies.get_users(M);
        /* Check to see if predictions of movies are
         * to be single-threaded.  If so run the
         * movie prediction synchronously and then
         * skip to the next movie. */
        if ( single_threaded ) {
            auto time_t now = time(NULL);
            auto float start = now;
            fprintf(stderr, "Starting movie: %d, " \
                    "Users: %d, ", M, probeUs.size());
            Mpred_User_Predict(config, M, probeUs, user_list);
            now = time(NULL);
            fprintf(stderr, "Completed: %2.0f " \
                    "[%.2f/user] secs.\n\n", now - start, \
                    (now - start)/probeUs.size());
            continue;
        }
        /* Start prediction for movie pmv for given
         * users in probeUser set.  Fork a new process and
         * generate customer predictions in this new fork. */
        if ( process_count < max_process_slots ) {
            pid = fork();
            if ( pid == -1 ) {
                perror("FPP fork failed.");
                exit(1);
            }
            /* Child - process movie and exit. */
            if ( pid == 0 ) {
                Mpred_User_Predict(config, M, probeUs, \
                                   user_list);
                _exit(0);
            }
            /* Parent - update task table. */
            ++process_count;
            job_table(max_process_slots, pid, M, probeUs.size());
        }
        /* Wait for any child processes to complete. */
        if ( process_count == max_process_slots ) {
            int status;
            pid = wait(&status);
            if ( pid == -1 ) {
                perror("FPP wait failed.");
                exit(1);
            }
            --process_count;
            job_table(max_process_slots, pid, 0, 0);
            if ( WIFEXITED(status) == 0 ) {
                fprintf(stderr, "\tError in movie, " \
                        "status = %d\n", \
                        WEXITSTATUS(status));
            }
        }
    }
    /* Capture all remaining slave processes. */
    do {
        int status;
        pid = wait(&status);
        if ( pid == -1 ) {
            fputs("No processes left.\n", stderr);
            process_count = 0;
            continue;
        }
        --process_count;
        job_table(max_process_slots, pid, 0, 0);
        if ( WIFEXITED(status) == 0 ) {
            fprintf(stderr, "\tError in movie, " \
                    "status = %d\n", \
                    WEXITSTATUS(status));
        }
    } while ( process_count > 0 );
    inFile.close();
    fputs("\nPredictions completed.\n", stderr);
    fprintf(stderr, "\tMovies: %d\n", movie_count);
    fprintf(stderr,"\tTime: %d\n", time(NULL) -run_start);
    return 0;
}

mpp-user.C1
/** \file
 * This file contains the driver code which
 * implements predictions of recommendations.
 */
/* Program compilation defines follow.
 *
 * These defines enable and control generation of movie specific logfiles.
 * The MOVIE_LOGGING define needs to be enabled to turn on generation of
 * logfiles.  Other defines increase the amount of output generated.
 */
#if 0
#define MOVIE_LOGGING
#endif
#if 0
#define MEMORY_LOGGING
#endif
#if 0
#define VOTE_LOGGING
#endif
// Include files.
#include <stdio.h>
#include <time.h>
// Standard C++ includes.
#include <fstream>
#include <iostream>
#include <vector>
#include <map>
#include <utility>
// Local C++ include files.
#include <PTreeSet.H>
#include "mppConfig.H"
#include "UserSet.H"
#include "MovieSet.H"
/* Standard C include files. */
#include "mpp.h"
using namespace std;
// External variables.
extern int topMovK, verK;
extern bool use_pearson_movies;
extern float Minimum_User_Correlation;
extern float corData[17771];
extern unsigned short int supData[17771];
extern string probe;
// LOGGING: creates and opens the logfile if logging is enabled, else NULL is returned.
#if defined(MOVIE_LOGGING)
static inline FILE * open_logfile(string movie_number)
{
    auto string logname("./Output/" + probe.substr(probe.find_last_of('/') + 1) + \
                        "_" + movie_number + ".log");
    return(fopen(logname.c_str(), "w+"));
}
#else
static inline FILE * open_logfile(string movie_number) {return NULL;}
#endif
// Enabling causes nearest neighbor user voting to print for each prediction.
#if defined(VOTE_LOGGING)
static inline void print_votes( FILE *logfile, int user, double vote, double weight, \
                                double vRt, double VBar, double Ub, double voter_corr)
{
    if ( logfile == NULL ) return;
    fprintf(logfile, "\t\tVote: %.2f\tWeight: %.2f\tUser: %d\n", vote, weight, user);
    fprintf(logfile, "\t\t\tvRt: %.2f\tVbar: %.2f\tUb: %.2f\n", vRt, VBar, Ub);
    fprintf(logfile, "\t\t\tCor: %.2f\n\n", voter_corr);
    return;
}
#else
static inline void print_votes( FILE *logfile, int user, double vote, double weight,\
                                double vRt, double VBar, double Ub, double voter_corr){ return; }
#endif
// Enabling prints the amount of memory consumed against a given starting point.
#if defined(MEMORY_LOGGING)
static inline void log_memory(FILE *logfile, const char *fmt, void *start)
{
    fprintf(logfile, fmt, (char *) sbrk(0) - (char *) start);
    return;
}
#else
static inline void log_memory(FILE *logfile __attribute__ ((unused)), \
                              const char *fmt __attribute__ ((unused)), \
                              void *start __attribute__ ((unused)))
{ return; }
#endif
extern int Mpred_User_Predict (mppConfig &config, unsigned long int M, \
                               vector <int> & user_list, PTree & M_support)
{
    auto void *movie_memory_start;
    auto char snbufr[10];
    auto time_t start_time = time(NULL);
    auto unsigned long int U;
    auto FILE *predictions;
    auto FILE *logfile;
    auto PredictionConfig *pcfg = NULL;

mpp-user.C2
    // OPEN log and prediction files.
    snprintf(snbufr, sizeof(snbufr), "%lu", Movies.get_identity(M));
    string sn(snbufr);
    string outPredName("./Output/"+probe.substr(probe.find_last_of('/')+1) \
                       + "_" + sn + ".predict");
    logfile = open_logfile(sn);
    if ( (predictions = fopen(outPredName.c_str(), "w+")) == NULL ) {
        fputs("Cannot open prediction file.\n", stderr);
        return 0;
    }
    fprintf(predictions, "%lu:\n", Movies.get_identity(M));
    if ( logfile != NULL ) fflush(logfile);
    /*
     * Write descriptor to output logfile and the number of the movie
     * to the prediction file.
     */
    if ( logfile != NULL )
        fprintf(logfile, "\nBeginning movie: %5d\tUsers: %d\t" \
                "PID: %d\n", Movies.get_identity(M), user_list.size(),\
                getpid());
    if ( logfile != NULL ) movie_memory_start = sbrk(0);
    /* Select eligible clusters for this movie. */
    if ( config.is_cluster_config() ) config.select_clusters(Movies, M);
    /* Loop over users starts here. */
    for (unsigned int user= 0; user < user_list.size(); ++user) {
        auto double vote = DEFAULT_VOTE, VOTE = DEFAULT_VOTE, vote_wt = 0.0, VOTE_wt = 0.0;
        U = Users.get_index(user_list[user]);
        auto PTree supportM(M_support), supportU = Users.get_movies(U);
        supportM.clearbit(U);
        supportU.clearbit(M);
        if ( supportM.get_count() < 1) {
            fprintf(predictions, "%.2f\n", vote);
            fflush(predictions);
            continue;
        }
        /* Get configuration information. */
        if ( config.is_standard_config() ) pcfg = config.get_standard_config();
        if ( config.is_cluster_config() ) {
            pcfg = config.select_configuration(Users, U);
            config.show_selection(logfile);
        }
        /* Config file needs: (mpp-user part)
         * External Pruning:
         *   1. Reset support in movie-vote call: yes, no.
         *
         * Voting selection:
         *   2. Set vote_wt: 0 <= vote_wt <= 1
         *      (VOTE_wt = 1 - vote_wt)
         * Forcing in Range:
         *   5. Select 0, 1 or 2 force_vote_in_ranges:
         *      user-vote movie-VOTE */
        /* User voting.*/
        if ( pcfg->do_user_voting() ) vote = user_vote(pcfg, M, supportM, U, supportU);
        //if ( vote < 1 ) vote = 1; else if ( vote > 5 ) vote = 5;
        /* Movie voting. */
        if ( pcfg->do_movie_voting() ) VOTE = movie_vote(pcfg, M, supportM, U, supportU);
        //if ( VOTE < 1 ) VOTE = 1; else if ( VOTE > 5 ) VOTE = 5;
        /* Set user_vote_weight here. */
        vote_wt = pcfg->get_user_vote_weight();
        VOTE_wt = 1.0 - vote_wt;
        vote = (vote * vote_wt + VOTE * VOTE_wt ) / \
               (vote_wt + VOTE_wt);

mpp-user.C3 //sumSCor=sumSCor/countdimMN; sumPCor=sumPCor/countdimMN; sumDCor=sumDCor/countdimMN; sumdimMN=sumdimMN/countdimMN; //sumsCor=sumsCor/countdimUV; sumpCor=sumpCor/countdimUV; sumdCor=sumdCor/countdimUV; sumdimUV=sumdimUV/countdimUV; // vote=(vote*sumdimUV + VOTE*sumdimMN)/(sumdimUV+sumdimMN); //auto double red=.4; vote=(vote*exp(-pow(Vsdp,2))+VOTE*exp(-red*pow(Nsdp,2)))/(exp(-pow(Vsdp,2))+exp(-red*pow(Nsdp,2))); //auto double red=1.0; vote=(vote*exp(-pow(Vsdp,2))+VOTE*red*exp(-pow(Nsdp,2)))/(exp(-pow(Vsdp,2))+red*exp(-pow(Nsdp,2))); // if ( sumsCor>sumSCor + 0.1 ){ vote=( vote*sumdimUV*(2+sumsCor)+VOTE*sumdimMN*(2+sumSCor) )/( sumdimUV*(2+sumsCor)+sumdimMN*(2+sumSCor)); } // vote=VOTE; // if ( Nsdp < 2.0 && Vsdp > 2.0 && sumSCor + .5 > sumsCor ) vote=VOTE; // if ( Nsdp < 0.5 && Vsdp > 2 ) vote=VOTE; // vote=(vote*exp(-pow(Vsdp,2) ) + VOTE*exp(-pow(Nsdp,2)))/(exp(-pow(Vsdp,2)) + exp(-pow(Nsdp,2))); //.937465(95) // Final output occurs here. if ( (vote < 1) && (vote != DEFAULT_VOTE) ) vote = 1; if ( (vote > 5) && (vote != DEFAULT_VOTE) ) vote = 5; // force vote into range fprintf(predictions, "%.2f\n", vote); fflush(predictions); if (logfile != NULL) fprintf(logfile,"\tPrediction #%d: %0.1f\tuser: %u\t" \ "config: %s\n\n", user, vote, Users.get_identity(U), \ pcfg->get_name()); } // ULOOP end if (logfile!=NULL) { float total_time = time(NULL) - start_time; fprintf(logfile,"Ending movie: %d\tTime: %.2f [%.2f/user] " \ "secs.\t", Movies.get_identity(M), total_time, \ (float) (total_time/user_list.size())); log_memory(logfile, "Memory: %d\n", movie_memory_start); fputs("\n", logfile); fclose(logfile); } fclose(predictions); return 0; } // MLOOP end
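The final blending and range-forcing steps of Mpred_User_Predict can be sketched in isolation. This is a minimal illustration rather than the contest code: `blend_votes`, `force_in_range` and `kDefaultVote` are hypothetical names, and the sentinel value 0.0 is an assumption, since the listing does not show how DEFAULT_VOTE is defined.

```cpp
#include <algorithm>
#include <cassert>

// Assumed sentinel meaning "no prediction"; the real DEFAULT_VOTE is not shown.
constexpr double kDefaultVote = 0.0;

// Weighted blend of the user-based vote and the movie-based VOTE.
// The two weights sum to 1, so no renormalization is needed.
double blend_votes(double user_vote, double movie_vote, double user_weight) {
    assert(user_weight >= 0.0 && user_weight <= 1.0);
    const double movie_weight = 1.0 - user_weight;
    return user_vote * user_weight + movie_vote * movie_weight;
}

// Force a prediction into the 1..5 star range, leaving the sentinel untouched.
double force_in_range(double vote) {
    if (vote == kDefaultVote) return vote;
    return std::min(5.0, std::max(1.0, vote));
}
```

With `user_weight` set to 1.0 the blend reduces to pure user voting. The listing also divides by `vote_wt + VOTE_wt`, which is always 1 when `VOTE_wt = 1.0 - vote_wt`, so the sketch omits that division.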

User-vote.C1
/** \file This file contains the implementation of the user voting function. */

/* Include files. */
#include <stdio.h>
#include <math.h>
#include <PTree.H>
#include "MovieSet.H"
#include "UserSet.H"
#include "mppConfig.H"
#include "PredictionConfig.H"
#include "mpp.h"

/* Config file needs: (user-vote part)
 *
 * uCor Internal Pruning:
 *   1.   Select 0 or 1 of dvCorp, dvCors, vdCorp, vdCors, pCor, dCor, sCor
 *   1.1  For selected in 1, set Threshold: dvThrp, dvThrs, vdThrp, vdThrs, pThr, dThr, sThr
 *        Threshold defaults are: 0 0 0 0 0 0 0
 *
 * uCor vote weighting: (Default uCor=1. By selecting 1 of these, we reset uCor value to it.)
 *   2.   Select 0 or 1 of dvCorp, dvCors, vdCorp, vdCors, pCor, dCor, sCor
 *
 * Standard Deviation Internal Pruning: (population/sample; difference_of_vectors/vector_of_differences)
 *   3.   Select 0 or more of: dUVsdp, dUVsds, Vsdp_Usdp, Vsds_Usds
 *   3.1  Foreach selected in 3, set Threshold: dUVsdpThr, dUVsdsThr, Vsdp_UsdpThr, Vsds_UsdsThr
 *        Threshold defaults are: 0 0 0 0
 *   3.2  Foreach selected in 3, set pow exp: dUVsdpExp, dUVsdsExp, Vsdp_UsdpExp, Vsds_UsdsExp
 *        Power Exponent defaults are: -1 -1 -1 -1
 *
 * External Pruning:
 *   4.   Select 0 or more of: Prune_Movies_In_SupU, Prune_Users_In_SupM, Prune_Movies_In_CoSupUV
 *   4.1  Foreach selected in 4, select 1 of: Prune, FastPrune, CommonCoSupportPrune
 *   4.2  Reset non-pruned support in 2nd: yes, no.
 *   4.3  Foreach selected in 4, set parameter: mstrt, ustrt, TSa, TSb, Tdvp, Tdvs, Tvdp, Tvds, TD, TP, PPm, TV, TSD, Ch, Ct
 *        Prune Parameter defaults are: 0 0 -100 -100 -1 -1 -1 -1 -1 -1 .1 -1 -1 1 no def
 *
 * Forcing in Range:
 *   5.   Select 0 or more force_vote_in_range: in_Voter_LOOP after_Voter_LOOP before_return
 */
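The configuration comment distinguishes population from sample statistics, and the difference of two standard deviations from the standard deviation of the difference vector. The sketch below shows how each can be computed from the running sums that the voting code keeps. The struct and function names are hypothetical, and the clamp inside `safe_sqrt` is an addition that the listings do not contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running sums over the movies co-rated by users U and V.
struct CoSupportSums {
    double n = 0, smU = 0, smV = 0, UU = 0, VV = 0, UV = 0;
};

CoSupportSums accumulate_sums(const std::vector<double>& u,
                              const std::vector<double>& v) {
    assert(u.size() == v.size());
    CoSupportSums s;
    for (std::size_t i = 0; i < u.size(); ++i) {
        s.n += 1;
        s.smU += u[i];        s.smV += v[i];
        s.UU += u[i] * u[i];  s.VV += v[i] * v[i];
        s.UV += u[i] * v[i];
    }
    return s;
}

// sqrt clamped at zero: rounding can push a tiny radicand slightly negative.
double safe_sqrt(double x) { return std::sqrt(std::max(0.0, x)); }

// Population standard deviation from a sum and a sum of squares (needs n > 0).
double sd_population(double n, double sum, double sum_sq) {
    return safe_sqrt(n * sum_sq - sum * sum) / n;
}

// Sample standard deviation (needs n > 1).
double sd_sample(double n, double sum, double sum_sq) {
    const double mean = sum / n;
    return safe_sqrt((sum_sq - n * mean * mean) / (n - 1));
}

// Population standard deviation of the difference vector V - U ("dUVsdp").
double sd_difference_population(const CoSupportSums& s) {
    const double dsSq = s.VV - 2 * s.UV + s.UU;  // sum of squared differences
    return sd_population(s.n, s.smV - s.smU, dsSq);
}
```

The "Vsdp_Usdp" style tests compare `sd_population` of V against that of U, while the "dUVsdp" style tests look at how much the two users disagree movie by movie.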

User-vote.C2 /** * Public function. * This function implements user voting. * * \param pcfg A pointer to the class containing the parameters * which configure the voting. * \param M The movie number for which a prediction is to be * made * * \param supportM The PTree identifying the support for the movie * to be predicted. * \param U The identity number of the user for which a * prediction is to be made. * \param supportU The Ptree identifying the support for the user * who a predication is being made for. * \return The recommended prediction. */ extern double user_vote(PredictionConfig *pcfg, unsigned long int M, \ PTree & supportM, unsigned long int U, \ PTree & supportU) { /* Enabled for boundary based prediction revisions. */ #if 0 auto double z0IP55=0, z0IP44=0, z0IP33=0, z0IP22=0, z0IP11=0, z0IP15=0, z0IP14=0, z0IP13=0, z0IP12=0, z0IP51=0, z0IP41=0, z0IP31=0, z0IP21=0, z0IP25=0, z0IP24=0, z0IP23=0, z0IP52=0, z0IP42=0, z0IP32=0, z0IP35=0, z0IP34=0, z0IP53=0, z0IP43=0, z0IP45=0, z0IP54=0; #endif auto double vote = DEFAULT_VOTE, vote_sum = 0, vote_cnt = 0; auto double Vb, Ub, dsSq, uCor = 1; struct pruning *internal_prune; struct external_prune *external_prune; auto PTree supM = supportM, supU = supportU; supM.clearbit(U); supU.clearbit(M); /* External pruning: PRUNE MOVIES supU */ external_prune = pcfg->get_user_Prune_Movies_in_SupU(); if ( external_prune->enabled ) { if( supU.get_count() > external_prune->params.Ct ) do_pruning(external_prune, M, U, supM, supU); supM.clearbit(U); supU.clearbit(M); if( (supM.get_count() < 1) || (supU.get_count() < 1) ) return vote; } /* Reset user support if requested. 
*/ if ( pcfg->reset_user_support() ) { supM = supportM; supM.clearbit(U); } /* External pruning: Prune Users supM */ external_prune = pcfg->get_user_Prune_Users_in_SupM(); if ( external_prune->enabled ) { if ( supM.get_count() > external_prune->params.Ct ) do_pruning(external_prune, M, U, supM, supU); supM.clearbit(U); supU.clearbit(M); if( (supM.get_count() < 1) || (supU.get_count() < 1) ) return vote; } /* VN: VLOOP strt (Vs are user voters)*/ auto unsigned long long int *supMlist = supM.get_indexes(); for (unsigned long long int v= 0; v < supM.get_count(); ++v) { auto unsigned long long int V = supMlist[v]; auto double MV = Users.get_rating(V, M) - 2, max = 0, smV = 0, smU = 0, UU = 0, UV = 0, VV = 0, dm;

User-vote.C3
        auto PTree csUV = supU & Users.get_movies(V);
        csUV.clearbit(M);
        dm = csUV.get_count();
        if( dm < 1) continue;

        /* turn on only if doing Inner-Product Boundary-Based prediction revisions */
#if 0
        auto double S1=0, S2=0, S3=0, S4=0, S5=0, C1=0, C2=0, C3=0, C4=0, C5=0,
                    A1=0, A2=0, A3=0, A4=0, A5=0,
                    S11=0, S22=0, S33=0, S44=0, S55=0,
                    C11=0, C22=0, C33=0, C44=0, C55=0,
                    A11=0, A22=0, A33=0, A44=0, A55=0,
                    smN=0, smM=0, NN=0, MN=0, MM=0;
#endif

        /* External pruning: PRUNE MOVIES CoSupUV */
        external_prune = pcfg->get_user_Prune_Movies_in_CoSupUV();
        if ( external_prune->enabled ) {
            if( csUV.get_count() > external_prune->params.Ct )
                do_pruning(external_prune, M, U, supM, csUV);
            csUV.clearbit(M);
            supM.clearbit(U);
            dm = csUV.get_count();
            if( dm < 1 ) continue;
        }

        /* VN: NLOOP strt (Ns are movie vector_space_dimensions) */
        auto unsigned long long int *csUVlist = csUV.get_indexes();
        for (unsigned long long int n= 0; n < csUV.get_count(); ++n) {
            auto unsigned long long int N = csUVlist[n];
            auto double NU = Users.get_rating(U, N) - 2,
                        NV = Users.get_rating(V, N) - 2;
            if( pow(NU-NV, 2) > max) max = pow(NU-NV, 2);
            smV += NV; smU += NU;
            UU += NU * NU; UV += NU * NV; VV += NV * NV;
            Vb = smV / dm; Ub = smU / dm;
            dsSq = VV - 2*UV + UU;
            vote = MV - Vb + Ub;

            /* SAMPLE-statistic-based pruning through early exit. */
            if( dm > 1) {
                /* method dUVsds */
                internal_prune = pcfg->get_internal_prune(user_dUVsds);
                if ( internal_prune->enabled ) {
                    auto double dUVsds, thr = internal_prune->threshold,
                                expnt = internal_prune->exponent;
                    dUVsds = pow((dsSq-dm*(Vb-Ub)*(Vb-Ub))/(dm-1),.5);
                    if( dUVsds > (thr * pow(dm, expnt)) ) continue;
                }

                /* method Vsds_Usds. NO exponent. */
                internal_prune = pcfg->get_internal_prune(user_Vsds_Usds);
                if ( internal_prune->enabled ) {
                    auto double Usds, Vsds, thr=internal_prune->threshold;
                    Usds = pow((UU-dm*Ub*Ub)/(dm-1), 0.5);
                    Vsds = pow((VV-dm*Vb*Vb)/(dm-1), 0.5);
                    if( Vsds > (thr * Usds) ) continue;
                }

                /* e.g., -10 is exponent.
*/ /* e.g., 0 in if statement is threshold. */ internal_prune = pcfg->get_internal_prune(user_dvCors); if ( internal_prune->enabled ) { auto double dvCors, Usds, Vsds, thr = internal_prune->threshold, expnt = internal_prune->exponent; Usds = pow((UU-dm*Ub*Ub)/(dm-1), 0.5); Vsds = pow((VV-dm*Vb*Vb)/(dm-1), 0.5); dvCors = exp(expnt * (Vsds-Usds)*(Vsds-Usds)); if ( dvCors < thr ) continue; if ( internal_prune->weight ) uCor = dvCors; } internal_prune = pcfg->get_internal_prune(user_vdCors); if ( internal_prune->enabled ) { auto double vdCors, dUVsds, thr = internal_prune->threshold, expnt = internal_prune->exponent; dUVsds=pow((dsSq-dm*(Vb-Ub)*(Vb-Ub))/(dm-1),.5); vdCors = exp(expnt * dUVsds * dUVsds); if ( vdCors < thr ) continue; if ( internal_prune->weight ) uCor=vdCors; } } //turn on only if doing Inner-Product Boundary-Based prediction revisions #if 0 if(NU==1&&NV>0){S1+=NV;++C1;}else{ if(NU==2&&NV>0){S2+=NV;++C2;}else{ if(NU==3&&NV>0){S3+=NV;++C3;}else{ if(NU==4&&NV>0){S4+=NV;++C4;}else{ if(NU==5&&NV>0){S5+=NV;++C5;} }}}} #endif }

/* POPULATION-statistics-based pruning through early exit. */ if( dm > 0 ) { internal_prune = pcfg->get_internal_prune(user_dUVsdp); if ( internal_prune->enabled ) { auto double dUVsdp, thr = internal_prune->threshold, expnt = internal_prune->exponent; dUVsdp=pow(dm*dsSq-(smV-smU)*(smV-smU),.5)/dm; if ( dUVsdp > thr * pow(dm, expnt) ) continue; } /* method Usds_Vsds */ // Usdp=pow(dm*UU-smU*smU,.5)/dm; // Vsdp=pow(dm*VV-smV*smV,.5)/dm; // if( Vsdp > 0.5 * Usdp )continue; // Threshold is 0.5 // No exponent internal_prune = \ pcfg->get_internal_prune(user_Vsdp_Usdp); if ( internal_prune->enabled ) { auto double Usdp, Vsdp, thr = internal_prune->threshold; Usdp = pow(dm*UU - smU*smU, 0.5) / dm; Vsdp = pow(dm*VV - smV*smV, 0.5) / dm; if ( Vsdp > thr * Usdp ) continue; } // e.g., Threshold: 0.9 // e.g., Exponent: -10 // dvCorp=exp(-10 *(Vsdp-Usdp) * (Vsdp-Usdp)); // if ( dvCorp < .9 ) continue; // uCor=dvCorp; internal_prune = pcfg->get_internal_prune(user_dvCorp); if ( internal_prune->enabled ) { auto double dvCorp, Usdp, Vsdp, thr = internal_prune->threshold, expnt = internal_prune->exponent; Usdp = pow(dm*UU - smU*smU, 0.5) / dm; Vsdp = pow(dm*VV - smV*smV, 0.5) / dm; dvCorp = exp(expnt * (Vsdp-Usdp)*(Vsdp-Usdp)); if ( dvCorp < thr ) continue; if ( internal_prune->weight ) uCor = dvCorp; } internal_prune = pcfg->get_internal_prune(user_vdCorp); if ( internal_prune->enabled ) { auto double vdCorp, dUVsdp, thr = internal_prune->threshold, expnt = internal_prune->exponent; dUVsdp = pow(dm*dsSq-(smV-smU)*(smV-smU), .5) \ / dm; vdCorp = exp(expnt * dUVsdp * dUVsdp); if ( vdCorp < thr) continue; if ( internal_prune->weight ) uCor = vdCorp; } } /* OTHER Correlation pruning * (pearson=s, pureshift=p, distance=d) */ internal_prune = pcfg->get_internal_prune(user_sCor); if ( internal_prune->enabled ) { auto double sCor, thr = internal_prune->threshold; sCor = (UV - dm*Ub*Vb)/(.0001 + \ (pow((UU-dm*pow(Ub,2)),0.5))* \ (.0001+pow((VV-dm*pow(Vb,2)),.5))); if ( sCor < thr ) 
continue; if ( internal_prune->weight ) uCor = sCor; } internal_prune = pcfg->get_internal_prune(user_pCor); if ( internal_prune->enabled ) { auto double OnePDS, pCor = -1, thr = internal_prune->threshold, expnt = internal_prune->exponent; OnePDS = dsSq - dm*pow(Vb-Ub, 2); if ( max > 0 ) pCor=exp(expnt*OnePDS/(pow(max,.75)*pow(dm,.5))); if ( pCor < thr ) continue; if ( internal_prune->weight ) uCor = pCor; } User-vote.C4

User-vote.C5 internal_prune = pcfg->get_internal_prune(user_dCor); if ( internal_prune->enabled ) { auto double dCor, OnePDS, thr = internal_prune->threshold; OnePDS = dsSq - dm*pow(Vb-Ub, 2); dCor = exp(-dsSq / 100); if ( dCor < thr ) continue; if ( internal_prune->weight ) uCor = dCor; } /* Turn on for boundary based predication revisions. */ #if 0 if(C1>0&&C2+C3+C4+C5>0) {A1=S1/C1; A11=(S2+S3+S4+S5)/(C2+C3+C4+C5); z0IP11+=(A1-((A1+A11)/2))*(MV-((A1+A11)/2));} if(C1>0&&C2>0) {A1=S1/C1; A2=S2/C2; z0IP12+=(A1-((A1+A2 )/2))*(MV-((A1+A2 )/2));} if(C1>0&&C3>0) {A1=S1/C1; A3=S3/C3; z0IP13+=(A1-((A1+A3 )/2))*(MV-((A1+A3 )/2));} if(C1>0&&C4>0) {A1=S1/C1; A4=S4/C4; z0IP14+=(A1-((A1+A4 )/2))*(MV-((A1+A4 )/2));} if(C1>0&&C5>0) {A1=S1/C1; A5=S5/C5; z0IP15+=(A1-((A1+A5 )/2))*(MV-((A1+A5 )/2));} z0IP51=-z0IP15; z0IP41=-z0IP14; z0IP31=-z0IP13; z0IP21=-z0IP12; if(C2>0&&C1+C3+C4+C5>0) {A2=S2/C2; A22=(S1+S3+S4+S5)/(C1+C3+C4+C5); z0IP22+=(A2-((A2+A22)/2))*(MV-((A2+A22)/2));} if(C2>0&& C3>0) {A2=S2/C2; A3=S3/C3; z0IP23+=(A2-((A2+A3 )/2))*(MV-((A2+A3 )/2));} if(C2>0&& C4>0) {A2=S2/C2; A4=S4/C4; z0IP24+=(A2-((A2+A4 )/2))*(MV-((A2+A4 )/2));} if(C2>0&& C5>0) {A2=S2/C2; A5=S5/C5; z0IP25+=(A2-((A2+A5 )/2))*(MV-((A2+A5 )/2));} z0IP32=-z0IP23; z0IP42=-z0IP24; z0IP52=-z0IP25; if(C3>0&&C1+C2+C4+C5>0) {A3=S3/C3; A33=(S1+S2+S4+S5)/(C1+C2+C4+C5); z0IP33+=(A3-((A3+A33)/2))*(MV-((A3+A33)/2));} if(C3>0&& C4>0) {A3=S3/C3; A4=S4/C4; z0IP34+=(A3-((A3+A4 )/2))*(MV-((A3+A4 )/2));} if(C3>0&& C5>0) {A3=S3/C3; A5=S5/C5; z0IP35+=(A3-((A3+A5 )/2))*(MV-((A3+A5 )/2));} z0IP43=-z0IP34; z0IP53=-z0IP35; if(C4>0&&C1+C2+C3+C5>0) {A4=S4/C4; A44=(S1+S2+S3+S5)/(C1+C2+C3+C5); z0IP44+=(A4-((A4+A44)/2))*(MV-((A4+A44)/2));} if(C4>0&& C5>0) {A4=S4/C4; A5=S5/C5; z0IP45+=(A4-((A4+A5 )/2))*(MV-((A4+A5 )/2));} z0IP54=-z0IP45; if(C5>0&&C1+C2+C3+C4>0) {A5=S5/C5; A55=(S1+S2+S3+S4)/(C1+C2+C3+C4); z0IP55+=(A5-((A5+A55)/2))*(MV-((A5+A55)/2));} //auto double MU = Users.get_rating(U,M)-2; fprintf(stderr,"MU=%1.0f %8.1f %8.1f %8.1f 
\n", MU,z0IP55,z0IP11,z0IP51); //auto double MU = Users.get_rating(U,M)-2; fprintf(stderr,"MU=%1.0f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f %5.1f \ %5.1f\n",MU,z0IP11,z0IP22,z0IP33,z0IP44,z0IP55,z0IP12,z0IP13,z0IP14,z0IP15,z0IP23,z0IP24,z0IP25,z0IP34,z0IP35,z0IP45); #endif

if ( uCor > 0 ) { vote_sum += vote*uCor; vote_cnt += uCor; } else continue; /* Check and implement forcing of vote in the user loop. */ if ( pcfg->user_vote_force_in_loop() ) { if( (vote < 1) && (vote != DEFAULT_VOTE) ) vote = 1; if( (vote > 5) && (vote != DEFAULT_VOTE) ) vote = 5; } } if ( vote_cnt > 0 ) vote = vote_sum / vote_cnt; else vote = DEFAULT_VOTE; /* force_vote_after_Voter_Loop goes here. */ if ( pcfg->user_vote_force_after_loop() ) { if( (vote < 1) && (vote != DEFAULT_VOTE) ) vote=1; if( (vote > 5) && (vote != DEFAULT_VOTE) ) vote=5; } /* Turn on only if doing Inner-Product Boundary-Based prediction revisions. */ #if 0 //Boundary-Based-Inner-Product vote CHANGE start if ( z0IP55>-.01 //&& z0IP55> z0IP33 && z0IP55> z0IP44 && z0IP51>-.01 //&& z0IP52> .1 && z0IP53> THRZ0 && z0IP54> THRZ0 ) vote=5; #endif #if 0 //Boundary-Based-Inner-Product vote CHANGE start auto double FACZ0=-0.1, THRZ0=-0.1 ; //fauto double FACZ0= 0.40, THRZ0=0.7, z0IP51=-z0IP15, z0IP52=-z0IP25, z0IP53=-z0IP35, z0IP54=-z0IP54; #if 1 //Change vote to 5? if ( true && z0IP55> FACZ0 + z0IP11 && z0IP55> FACZ0+z0IP22 && z0IP55> FACZ0+z0IP33 && z0IP55> FACZ0 + z0IP44 && z0IP51> THRZ0 && z0IP52> THRZ0 && z0IP53> THRZ0 && z0IP54> THRZ0 ) vote=5; #endif #if 1 //Change vote to 1? if ( true && z0IP11>(FACZ0 )*z0IP22 && z0IP11>(FACZ0 )*z0IP33 && z0IP11>(FACZ0 )*z0IP44 && z0IP11>(FACZ0 )*z0IP55 && z0IP12> THRZ0 && z0IP13> THRZ0 && z0IP14> THRZ0 && z0IP15> THRZ0 ) vote=1; #endif #endif //Boundary-Based-Inner-Product vote CHANGE end return vote; } User-vote.C6
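The core of user_vote, stripped of all pruning, is a correlation-weighted average of mean-shifted ratings. The following is a minimal sketch under stated assumptions: `Voter` and `weighted_user_vote` are hypothetical names, ratings are taken as raw 1..5 values (the listing works on ratings offset by 2), and the means are assumed to be computed over the movies co-rated by U and V.

```cpp
#include <cassert>
#include <vector>

// One near-neighbor voter V for the pair (M, U). All fields are illustrative.
struct Voter {
    double rating_of_m;  // V's rating of the movie M being predicted
    double mean_v;       // V's mean rating over movies co-rated with U
    double mean_u;       // U's mean rating over the same movies
    double weight;       // similarity weight (uCor in the listing)
};

// Each voter casts rating(M,V) shifted by the difference of the two means.
// Votes are averaged with their weights; voters with a non-positive weight
// are skipped, mirroring the `if ( uCor > 0 )` test in the listing.
double weighted_user_vote(const std::vector<Voter>& voters, double default_vote) {
    double vote_sum = 0.0, weight_sum = 0.0;
    for (const Voter& v : voters) {
        assert(v.rating_of_m >= 1.0 && v.rating_of_m <= 5.0);
        if (v.weight <= 0.0) continue;
        const double vote = v.rating_of_m - v.mean_v + v.mean_u;
        vote_sum += vote * v.weight;
        weight_sum += v.weight;
    }
    return weight_sum > 0.0 ? vote_sum / weight_sum : default_vote;
}
```

The mean shift is what keeps a voter from casting the straightforward rating(M,V) vote: a voter who rates everything one star higher than U has that star removed before voting.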

/** \file This file contains the implementation of the movie voting algorithm. */

/* Include files. */
#include <stdio.h>
#include <math.h>
#include <PTree.H>
#include "MovieSet.H"
#include "UserSet.H"
#include "mppConfig.H"
#include "PredictionConfig.H"
#include "mpp.h"

/* Config file needs: (movie-vote part)
 *
 * UCor Internal Pruning:
 *   1.   Select 0 or 1 of DVCorp, DVCors, VDCorp, VDCors, PCor, DCor, SCor
 *   1.1  For selected in 1, set Threshold: DVThrp, DVThrs, VDThrp, VDThrs, PThr, DThr, SThr
 *        Threshold defaults are: 0 0 0 0 0 0 0
 *
 * UCor VOTE weighting: (Default is UCor=1. By selecting 1 of these, we reset UCor's value to it.)
 *   2.   Select 0 or 1 of DVCorp, DVCors, VDCorp, VDCors, PCor, DCor, SCor
 *
 * Standard Deviation Internal Pruning: (population/sample; difference_of_vectors/vector_of_differences)
 *   3.   Select 0 or more of: dMNsdp, dMNsds, Nsdp_Msdp, Nsds_Msds
 *   3.1  Foreach selected in 3, set Threshold: dMNsdpThr, dMNsdsThr, Nsdp_MsdpThr, Nsds_MsdsThr
 *        Threshold defaults are: 0 0 0 0
 *   3.2  Foreach selected in 3, set pow exp: dMNsdpExp, dMNsdsExp, Nsdp_MsdpExp, Nsds_MsdsExp
 *        Power Exponent defaults are: -1 -1 -1 -1
 *
 * External Pruning:
 *   4.   Select 0 or more of: Prune_Users_In_SupM, Prune_Movies_In_SupU, Prune_Users_In_CoSupMN
 *   4.1  Foreach selected in 4, select 1 of: Prune, FastPrune, CommonCoSupportPrune
 *   4.2  Reset non-pruned support in 2nd: yes, no.
 *   4.3  Foreach selected in 4, set parameter: mstrt, ustrt, TSa, TSb, Tdvp, Tdvs, Tvdp, Tvds, TD, TP, PPm, TV, TSD, Ch, Ct
 *        Prune Parameter defaults are: 0 0 -100 -100 -1 -1 -1 -1 -1 -1 .1 -1 -1 1 no def
 *
 * Forcing in Range:
 *   5.   Select 0, 1 or 2 force_vote_in_ranges: in_Voter_LOOP (for each voter) outside_Voter_LOOP (for composite VOTE)
 */

/**
 * Public function.
 * This function implements movie voting.
 * \param pcfg      A pointer to the class containing the parameters
 *                  which configure the voting.
 * \param M         The movie number for which a prediction is to be made.
 * \param supportM  The PTree identifying the support for the movie to be predicted.
 * \param U         The identity number of the user for which a prediction is to be made.
 * \param supportU  The PTree identifying the support for the user for whom a prediction is being made.
 * \return          The recommended prediction.
 */
movie-vote.C1

extern double movie_vote(PredictionConfig *pcfg, unsigned long int M, \ PTree & supportM, unsigned long int U, \ PTree & supportU) { auto double vote = DEFAULT_VOTE, VOTE = DEFAULT_VOTE, VOTE_sum = 0, VOTE_cnt = 0; auto double Nb, Mb, dsSq, UCor = 1; struct pruning *internal_prune; struct external_prune *external_prune; auto PTree supM = supportM, supU = supportU; supM.clearbit(U); supU.clearbit(M); /* External pruning: Prune Users supM */ external_prune = pcfg->get_movie_Prune_Users_in_SupM(); if ( external_prune->enabled ) { if( supM.get_count() > external_prune->params.Ct) do_pruning(external_prune, M, U, supM, supU); supM.clearbit(U); supU.clearbit(M); if ( (supM.get_count() < 1) || (supU.get_count() < 1) ) return vote; } /* Reset support if requested. */ if ( pcfg->reset_movie_support() ) { supU = supportU; supU.clearbit(M); } /* External pruning: Prune Movies supU */ external_prune = pcfg->get_movie_Prune_Movies_in_SupU(); if ( external_prune->enabled ) { if( supU.get_count() > external_prune->params.Ct ) do_pruning(external_prune, M, U, supM, supU); supM.clearbit(U); supU.clearbit(M); if( (supM.get_count() < 1) || (supU.get_count() < 1) ) return vote; } /* NV: NLOOP strt (Ns are movie voters) */ auto unsigned long long int *supUlist = supU.get_indexes(); for (unsigned long long int nn= 0; nn < supU.get_count(); ++nn) { auto unsigned long long int N = supUlist[nn]; auto double NU = Users.get_rating(U,N)-2, MAX = 0, smN = 0, smM = 0, MM = 0, MN = 0, NN = 0, dm; auto PTree csMN = supM & Movies.get_users(N); csMN.clearbit(U); dm = csMN.get_count(); if( dm < 1 ) continue; /* External pruning: PRUNE USERS CoSupMN */ external_prune = pcfg->get_movie_Prune_Users_in_CoSupMN(); if ( external_prune->enabled ) { if( csMN.get_count() > external_prune->params.Ct) do_pruning(external_prune, M, U, csMN, supU); csMN.clearbit(U); supU.clearbit(M); dm = csMN.get_count(); if( dm < 1) continue; } /* NV: VLOOP strt (Vs are user vector_space_dimensions) */ auto unsigned long long 
int *csMNlist = csMN.get_indexes(); for (unsigned long long int v= 0; v < csMN.get_count(); ++v) { auto unsigned long long int V = csMNlist[v]; auto double MV = Users.get_rating(V,M) - 2, NV = Users.get_rating(V,N) - 2; if( pow(MV-NV, 2) > MAX ) MAX = pow(MV-NV, 2); smN += NV; smM += MV; MM += MV * MV; MN += NV * MV; NN += NV * NV; } Nb = smN / dm; Mb = smM / dm; dsSq = NN - 2*MN + MM; VOTE = NU - Nb + Mb; movie-vote.C2

internal_prune = \ pcfg->get_internal_prune(movie_VDCors); if ( internal_prune->enabled ) { auto double VDCors, dMNsds, thr = internal_prune->threshold, expnt = internal_prune->exponent; dMNsds=pow((dsSq-dm*(Nb-Mb)*(Nb-Mb))/(dm-1),.5); VDCors = exp(expnt * dMNsds * dMNsds); if ( VDCors < thr ) continue; if ( internal_prune->weight ) UCor = VDCors; } } /* POPULATION-statistics-based pruning through early exit. */ if ( dm > 0 ) { internal_prune = \ pcfg->get_internal_prune(movie_dMNsdp); if ( internal_prune->enabled ) { auto double dMNsdp,thr=internal_prune->threshold; dMNsdp=pow(dm*dsSq-(smN-smM)*(smN-smM),.5)/dm; if ( dMNsdp > (thr * pow(dm,0.9)) ) continue; } /* method Usds_Vsds */ internal_prune = \ pcfg->get_internal_prune(movie_Nsdp_Msdp); if ( internal_prune->enabled ) { auto double Nsdp, Msdp, thr = internal_prune->threshold; Msdp = pow(dm*MM - smM*smM, 0.5) / dm; Nsdp = pow(dm*NN - smN*smN, 0.5) / dm; if( Nsdp > (thr * Msdp) ) continue; } internal_prune = \ pcfg->get_internal_prune(movie_VDCorp); if ( internal_prune->enabled ) { auto double DVCorp, Msdp, Nsdp, thr = internal_prune->threshold, expnt = internal_prune->exponent; Msdp = pow(dm*MM - smM*smM, 0.5) / dm; Nsdp = pow(dm*NN - smN*smN, 0.5) / dm; DVCorp = exp(expnt * (Nsdp-Msdp)*(Nsdp-Msdp)); if ( DVCorp < thr ) continue; if ( internal_prune->weight ) UCor = DVCorp; } /* force_vote_in_Voter_Loop goes here. */ if ( pcfg->movie_vote_force_in_loop() ) { if ( (VOTE < 1) && (VOTE != DEFAULT_VOTE) ) VOTE=1; if ( (VOTE > 5) && (VOTE != DEFAULT_VOTE) ) VOTE=5; } /* SAMPLE-statistic-based pruning through early exit. */ if( dm > 1 ) { /* method dMNsds */ internal_prune = \ pcfg->get_internal_prune(movie_dMNsds); if ( internal_prune->enabled ) { auto double dMNsds, thr = internal_prune->threshold, expnt = internal_prune->exponent; dMNsds = pow((dsSq-dm*(Nb-Mb)*(Nb-Mb))/(dm-1),\ 0.5); if( dMNsds > (thr * pow(dm, expnt)) ) continue; } /* method Msds_Nsds NO exponent. 
*/
                internal_prune = pcfg->get_internal_prune(movie_Nsds_Msds);
                if ( internal_prune->enabled ) {
                    auto double Msds, Nsds, thr = internal_prune->threshold;
                    Msds = pow((MM-dm*Mb*Mb)/(dm-1), 0.5);
                    Nsds = pow((NN-dm*Nb*Nb)/(dm-1), 0.5);
                    if ( Nsds > (thr * Msds) ) continue;
                }

                /* method DVCors: uses SAMPLE standard deviations, as in Nsds_Msds above. */
                internal_prune = pcfg->get_internal_prune(movie_DVCors);
                if ( internal_prune->enabled ) {
                    auto double Msds, Nsds, DVCors,
                                thr = internal_prune->threshold,
                                expnt = internal_prune->exponent;
                    Msds = pow((MM-dm*Mb*Mb)/(dm-1), 0.5);
                    Nsds = pow((NN-dm*Nb*Nb)/(dm-1), 0.5);
                    DVCors = exp(expnt * (Nsds-Msds)*(Nsds-Msds));
                    if ( DVCors < thr ) continue;
                    if ( internal_prune->weight ) UCor = DVCors;
                }
movie-vote.C3

if ( internal_prune->enabled ) { auto double VDCorp, dMNsdp, thr = internal_prune->threshold, expnt = internal_prune->exponent; dMNsdp=pow(dm*dsSq-(smN-smM)*(smN-smM),.5)/dm; VDCorp = exp(expnt * dMNsdp * dMNsdp); if ( VDCorp < thr ) continue; if ( internal_prune->weight ) UCor = VDCorp; } } /* OTHER Correlation pruning (pearson=s,pureshift=p,distance=d)*/ internal_prune = pcfg->get_internal_prune(movie_SCor); if ( internal_prune->enabled ) { auto double SCor, thr=internal_prune->threshold; SCor= (MN-dm*Mb*Nb)/(.0001+(pow((MM-dm*pow(Mb,2)),.5)) * (.0001+pow((NN-dm*pow(Nb, 2)),.5))); if ( SCor < thr ) continue; if ( internal_prune->weight ) UCor = SCor; } /* CHECK for exponent */ internal_prune = pcfg->get_internal_prune(movie_PCor); if ( internal_prune->enabled ) { auto double ONEPDS, PCor = 1, thr = internal_prune->threshold; ONEPDS = dsSq - dm * pow(Nb-Mb, 2); if (MAX>0) PCor=exp(-.1*ONEPDS/(pow(MAX,.75)*pow(dm,.5))); if( PCor < thr ) continue; if ( internal_prune->weight ) UCor = PCor; } internal_prune = pcfg->get_internal_prune(movie_DCor); if ( internal_prune->enabled ) { auto double DCor, ONEPDS, thr = internal_prune->threshold; ONEPDS = dsSq - dm*pow(Nb-Mb, 2); DCor = exp(-dsSq / 100); if ( DCor < thr ) continue; if ( internal_prune->weight ) UCor = DCor; } if (UCor>0) {VOTE_sum += VOTE*UCor; VOTE_cnt+=UCor; } else continue; movie-vote.C4 /* force_vote_in_Voter_Loop goes here. */ if ( pcfg->movie_vote_force_in_loop() ) { if ( (VOTE < 1) && (VOTE != DEFAULT_VOTE) ) VOTE=1; if ( (VOTE > 5) && (VOTE != DEFAULT_VOTE) ) VOTE=5; } } if ( VOTE_cnt > 0 ) VOTE = VOTE_sum / VOTE_cnt; else VOTE = DEFAULT_VOTE; /* force_vote_after_Voter_Loop goes here. */ if ( pcfg->movie_vote_force_after_loop() ) { if ( (VOTE < 1) && (VOTE != DEFAULT_VOTE) ) VOTE=1; if ( (VOTE > 5) && (VOTE != DEFAULT_VOTE) ) VOTE=5; } return VOTE; }
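Two of the similarity weights used in movie_vote, the Pearson-style SCor and the distance-based DCor, depend only on the sums accumulated in the inner loop. The sketch below is an illustration under assumptions: `PairSums`, `pearson_from_sums` and `distance_similarity` are hypothetical names, and the epsilon is added once to the whole denominator, which is a simplification of the listing's placement of its two `.0001` terms.

```cpp
#include <cassert>
#include <cmath>

// Sums over the users who rated both movies M and N (field names follow the listing).
struct PairSums {
    double dm;   // number of co-rating users
    double smM;  // sum of ratings of M
    double smN;  // sum of ratings of N
    double MM;   // sum of squared ratings of M
    double NN;   // sum of squared ratings of N
    double MN;   // sum of products of the two ratings
};

// Pearson-style correlation computed from the sums. The epsilon guards against
// division by zero when one movie has constant ratings on the common support.
double pearson_from_sums(const PairSums& s, double eps = 1e-4) {
    assert(s.dm > 0);
    const double Mb = s.smM / s.dm, Nb = s.smN / s.dm;
    const double cov = s.MN - s.dm * Mb * Nb;
    const double sdM = std::sqrt(std::fmax(0.0, s.MM - s.dm * Mb * Mb));
    const double sdN = std::sqrt(std::fmax(0.0, s.NN - s.dm * Nb * Nb));
    return cov / (eps + sdM * sdN);
}

// Distance-based similarity: 1 for identical rating vectors, decaying
// exponentially with the squared Euclidean distance between them.
double distance_similarity(const PairSums& s, double scale = 100.0) {
    const double dsSq = s.NN - 2 * s.MN + s.MM;
    return std::exp(-dsSq / scale);
}
```

Either value can serve both as a pruning test (skip the voter when it falls below a threshold) and as the weight of the voter's VOTE, which is how the `threshold` and `weight` fields are used in the listing.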

    /* Set the starting point based on the specified start point
     * and a multiplier if it is specified. If the starting point
     * exceeds the support count start at the beginning of the
     * support list.
     */
    start = start + (unsigned long long int) (mult * supcnt);
    if ( start > supcnt ) start = (unsigned long long int) (mult * supcnt);
    if ( start > supcnt ) start = 0;

    /* The simple case is a start of zero. Return here so that the
     * two-pass code below does not load every index a second time. */
    if ( start == 0 ) {
        for (unsigned long long int lp= 0; lp < supcnt; ++lp)
            list.push_back(indexes[lp]);
        return;
    }

    /* Two loop passes are needed for a non-zero start value. */
    for (unsigned long long int lp= start; lp < supcnt; ++lp)
        list.push_back(indexes[lp]);
    for (unsigned long long int lp= 0; lp < start; ++lp)
        list.push_back(indexes[lp]);
    return;
}

/* Private function.
 * This function checks whether a voting entity lies outside a
 * selection window. A selection window is defined by a minimum (leftside)
 * voter index and a window width.
 * \param voter  The voter being considered.
 * \param pp     A pointer to the structure containing the
 *               leftside and width parameters for a pruning method.
 * \return       True if the voter lies outside the selection window,
 *               false if it lies inside. False is always returned
 *               when the width is zero, so a width of zero
 *               disables window-based selection.
 */
static bool outside_window(unsigned long long int voter,
                           struct pruning_parameters *pp)
{
    if ( pp->width == 0 ) return false;
    if ( voter < pp->leftside ) return true;
    if ( voter > pp->leftside + pp->width ) return true;
    return false;
}

Prune.C1
/** \file contains implementations of routines
 * for pruning user and movie voting lists.
 */

/* Standard C++ include files. */
#include <map>
#include <vector>
#include <unistd.h>
#include <stdlib.h>

/* Local C++ include files. */
#include <PTree.H>
#include "UserSet.H"
#include "MovieSet.H"
#include "mppConfig.H"
#include "mpp.h"

/* Global accessible variables.
*/ extern float corData[17771]; using namespace std; /* Shorthand type definition for the correlation map. */ typedef multimap<double, unsigned long long int, greater<double > > map_t; /* Private function. * * This function loads a vector with a list of support indexes from * the given PTree. The list contains N elements where N is the support * count. The actual order of the list is determined by the start and * multiplier values passed in from the caller. * * \param suptree A reference to PTree whose support list is to be generated. * \param list A reference to vector loaded with support indexes. * \param start The starting element in the support list which * will be 0th element in the completed support list. * \param mult The multiplier value to be used in determining * the support starting point. */ static void load_support_vector(PTree & suptree, \ vector<unsigned long long int> & list, \ unsigned long long int start, double mult) { auto unsigned long long int *indexes = suptree.get_indexes(), supcnt = suptree.get_count();
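The list rotation that load_support_vector performs can be sketched without the PTree class. This is an illustration only: `rotated_support` and `support_index` are hypothetical names, and a plain vector stands in for the index array returned by `get_indexes()`.

```cpp
#include <cassert>
#include <vector>

typedef unsigned long long support_index;

// Return the support indexes rotated so that element `start` comes first.
// The effective start is start + mult * count; if that overshoots the list,
// fall back to mult * count alone, and finally to the beginning of the list.
std::vector<support_index> rotated_support(const std::vector<support_index>& indexes,
                                           support_index start, double mult) {
    assert(mult >= 0.0);
    const support_index supcnt = indexes.size();
    start += static_cast<support_index>(mult * supcnt);
    if (start > supcnt) start = static_cast<support_index>(mult * supcnt);
    if (start > supcnt) start = 0;

    std::vector<support_index> list;
    list.reserve(supcnt);
    // First pass: from the start position to the end of the support list.
    for (support_index lp = start; lp < supcnt; ++lp) list.push_back(indexes[lp]);
    // Second pass: wrap around and pick up the elements before the start.
    for (support_index lp = 0; lp < start; ++lp) list.push_back(indexes[lp]);
    return list;
}
```

In this sketch a start of zero needs no special case, because the second pass is then empty and every index is loaded exactly once. Rotating the list changes which candidates are examined first, which matters when a later step keeps only the first Ct entries among equally ranked voters.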

auto PTree csMN = supM&Movies.get_users(N); if( csMN.get_count() < 1 ) continue; /* moviePRUNE (NV loops) VLOOP start */ auto vector<unsigned long long int> ilp; load_support_vector(csMN, ilp, pp->ustrt, pp->ustrt_mult); for (unsigned long long int lp1= 0; lp1 < ilp.size(); ++lp1) { auto unsigned long long int V = ilp[lp1]; #if 0 if ( outside_window(V, pp) ) continue; #endif MV = Movies.get_rating(V, M) - 2; NV = Movies.get_rating(V, N) - 2; if(pow(MV-NV,2)>max) max=pow(MV-NV,2); smM += MV; smN += NV; MM += MV*MV; NN += NV*NV; MN += MV*NV; } dm=csMN.get_count(), Mb=smM/dm, Nb=smN/dm, dsSq=NN-2*MN+MM, OnePDS=dsSq-dm*pow(Nb-Mb,2), sCor=(MN-dm*Mb*Nb)/(.0001+ (pow((MM-dm*pow(Mb,2)),.5))*(pow((NN-dm*pow(Nb,2)),.5))), dCor=exp(-dsSq/100), pCor=1; if(max>0)pCor=exp(-pp->PPm*OnePDS/(.0001+pow(max,.75)*pow(dm,.5))); if(dm>0){Nsdp=pow(dm*NN-smN*smN,.5)/dm; Msdp=pow(dm*MM-smM*smM,.5)/dm; dMNsdp=pow(dm*dsSq-(smN-smM)*(smN-smM),.5)/dm;} if(dm>1){Nsds=pow((NN-dm*Nb*Nb)/(dm-1),.5); Msds=pow((MM-dm*Mb*Mb)/(dm-1),.5); dMNsds=pow((dsSq-dm*(Nb-Mb)*(Nb-Mb))/(dm-1),.5);} dvCorp=exp(-10 * (Nsdp-Msdp) * (Nsdp-Msdp) ); dvCors=exp(-10 * (Nsds-Msds) * (Nsds-Msds) ); vdCorp=exp(-10 * dMNsdp * dMNsdp ); vdCors=exp(-10 * dMNsds * dMNsds ); if( pp->Ch == 1) mCor = corData[N+1]; if( pp->Ch == 2) mCor = sCor; if( pp->Ch == 3) mCor = dCor; if( pp->Ch == 4) mCor = pCor; if( pp->Ch == 5) mCor=vCor; if( pp->Ch == 6) mCor = stdCor; if( pp->Ch == 7 ) mCor = dvCorp; if( pp->Ch == 8 ) mCor = dvCors; if( pp->Ch == 9 ) mCor = vdCorp; if( pp->Ch == 0 ) mCor = vdCors; // THRESHOLD PRUNING if ( corData[N+1] < pp->TSa || sCor < pp->TSb || \ pCor < pp->TP || dCor < pp->TD || vCor < pp->TV || \ stdCor < pp->TSD || dvCorp < pp->Tdvp || \ dvCors < pp->Tdvs || vdCorp < pp->Tvdp || vdCors < pp->Tvds ) Prune.C2 /* Private function. * This function implements the final step in 'pruning' of a PTree. 
It * clears the destination PTree and then sets only those bits in the PTree * which have been selected by a previous correlation strategy. * \param tree A reference to the PTree which is reflect the * contents of the multimap. * \param index_map The map specifying the index bits to be set. * \param max_count Maximum number of indexes to be selected from PTree. */ static void load_ptree(PTree & tree, map_t index_map, double max_count) { map_t::iterator index_ptr = index_map.begin(); if ( index_map.size() < max_count ) max_count = index_map.size(); tree.clearall(); for (unsigned int lp= 0; lp < max_count; ++lp) { tree.setbit(index_ptr->second); ++index_ptr; } return; } /* Movie prune standard. */ /* movie_vote: Prune */ static void mPrune(unsigned long long int M, PTree & supM, PTree & supU, struct pruning_parameters *pp) { if ( supU.get_count() < (pp->Ct + 1) ) return; map_t corRm; auto vector<unsigned long long int> support; /* moviePRUNE (NV loops) NLOOP start */ load_support_vector(supU, support, pp->mstrt, pp->mstrt_mult); for (unsigned long long int lp= 0; lp < support.size(); lp++) { auto unsigned long long int N = support[lp]; if ( outside_window(N, pp) ) continue; auto double smM = 0, smN = 0, MM = 0, NN = 0, MN = 0, MV, NV, max=0, dm, Mb, Nb, dsSq, OnePDS, Nsdp = 0, Msdp = 0, Nsds = 0, Msds = 0, dMNsdp = 0, dMNsds = 0, mCor = 1, sCor = 1, dCor = 1, pCor = 1, vCor = 1, stdCor = 1, dvCorp = 1, dvCors = 1, vdCorp = 1, vdCors = 1;

auto double smU=0, smV=0, UU=0, VV=0, UV=0, max=0, Vsdp=0, Usdp=0, Vsds=0, Usds=0, dUVsdp=0, dUVsds=0, mCor=1, sCor=1, dCor=1, pCor=1, vCor=1, stdCor=1, dvCorp=1, dvCors=1, vdCorp=1, vdCors=1, NU, NV, dm, Ub, Vb, dsSq, OnePDS; auto PTree csUV = supU & Users.get_movies(V); if( csUV.get_count() < 1 ) continue; /* user PRUNE (VN loops) NLOOP start */ auto vector<unsigned long long int> ilp; load_support_vector(csUV, ilp, pp->mstrt, pp->mstrt_mult); for (unsigned long long int lp1= 0; lp1 < ilp.size(); ++lp1) { auto unsigned long long int N = ilp[lp1]; #if 0 if ( outside_window(N, pp) ) continue; #endif NU = Movies.get_rating(U, N) - 2; NV = Movies.get_rating(V, N) - 2; if ( pow(NU-NV,2) > max ) max=pow(NU-NV, 2); smU += NU; smV += NV; UU += NU*NU; VV += NV*NV; UV += NU*NV; } //user PRUNE (VN loops) NLOOP end dm = csUV.get_count(); Ub = smU/dm; Vb = smV/dm; dsSq = VV - 2*UV + UU; OnePDS = dsSq - dm*pow(Vb-Ub,2); sCor=(UV-dm*Ub*Vb)/((pow((UU-dm*pow(Ub,2)),.5))*(pow((VV-dm*pow(Vb,2)),.5))); dCor = exp(-dsSq/100); if (max>0) pCor=exp(-pp->PPm*OnePDS/(pow(max,.75)*pow(dm,.5))); if(dm>0){ Vsdp=pow(dm*VV-smV*smV,.5)/dm; Usdp=pow(dm*UU-smU*smU,.5)/dm; dUVsdp=pow(dm*dsSq-(smV-smU)*(smV-smU),.5)/dm;} if(dm>1){ Vsds=pow((VV-dm*Vb*Vb)/(dm-1),.5); Usds=pow((UU-dm*Ub*Ub)/(dm-1),.5); dUVsds=pow((dsSq-dm*(Vb-Ub)*(Vb-Ub))/(dm-1),.5);} dvCorp=exp(-10 * (Vsdp-Usdp) * (Vsdp-Usdp) ); dvCors=exp(-10 * (Vsds-Usds) * (Vsds-Usds) ); vdCorp=exp(-10 * dUVsdp * dUVsdp ); vdCors=exp(-10 * dUVsds * dUVsds ); if( pp->Ch == 1 ) mCor = sCor; if( pp->Ch == 2 ) mCor = sCor; if( pp->Ch == 3 ) mCor = dCor; if( pp->Ch == 4 ) mCor = pCor; if( pp->Ch == 5 ) mCor = vCor; if( pp->Ch == 6 ) mCor = stdCor; if( pp->Ch == 7 ) mCor = dvCorp; if( pp->Ch == 8 ) mCor = dvCors; if( pp->Ch == 9 ) mCor = vdCorp; if( pp->Ch == 0) mCor = vdCors; Prune.C3 else { auto pair<double,unsigned long long int> entry(mCor,N); corRm.insert(entry); } } if ( corRm.size() == 0 ) return; load_ptree(supU, corRm, pp->Ct); return; } /* 
movie_vote: FastPrune */ static void fmPruneS(PTree & supU, struct pruning_parameters *pp) { if ( supU.get_count() < pp->Ct + 1 ) return; map_t corRm; auto vector<unsigned long long int> support; /* moviePRUNE (NV loops) NLOOP start */ load_support_vector(supU, support, pp->mstrt, pp->mstrt_mult); for (unsigned long long int lp= 0; lp < support.size(); lp++) { auto unsigned long long int N = support[lp]; #if 0 if ( outside_window(N, pp) ) continue; #endif if( corData[N+1] < pp->TSa ) continue; auto pair<double, unsigned long long int> \ entry(corData[N+1], N); corRm.insert(entry); } if ( corRm.size() == 0 ) return; load_ptree(supU, corRm, pp->Ct); return; } //userPRUNE (VN loops) start /* user_vote: Prune */ static void uPrune (unsigned long long int U, PTree & supM, PTree & supU, \ struct pruning_parameters *pp) { if ( supM.get_count() < pp->Ct + 1) return; map_t corR; auto vector<unsigned long long int> support; /* userPrune (VN loops) VLOOP start */ load_support_vector(supM, support, pp->ustrt, pp->ustrt_mult); for (unsigned long long int lp= 0; lp < support.size(); lp++) { auto unsigned long long int V = support[lp]; if ( outside_window(V, pp) ) continue;

auto double dm = csUV.get_count(), Ub = smU / dm, Vb = smV / dm, SCor=(UV-dm*Ub*Vb)/(.00001+(pow((UU-dm*pow(Ub,2)),.5))* (pow((VV-dm*pow(Vb,2)),.5))); if( SCor < pp->TSb ) continue; auto pair<double,unsigned long long int> entry(SCor,V); corR.insert(entry); } if ( corR.size() == 0 ) return; load_ptree(supM, corR, pp->Ct); return; } /* user_vote: CommonCoSupportPrune */ static void uPrune2(PTree & supM, PTree & supU, struct pruning_parameters *pp) { if ( supM.get_count() < pp->Ct+1) return; map_t corR; auto PTree csUV; auto vector<unsigned long long int> support; /* CommonCoSup userPRUNE VN loops VLOOP start */ load_support_vector(supM, support, pp->ustrt, pp->ustrt_mult); for (unsigned long long int lp= 0; lp < support.size(); lp++) { auto unsigned long long int V = support[lp]; if ( outside_window(V, pp) ) continue; csUV = supU & Users.get_movies(V); auto double dm = csUV.get_count(); auto pair<double, unsigned long long int> entry(dm, V); corR.insert(entry); } auto unsigned int select_count = (unsigned int) pp->Ct; auto PTree ccsU = supU; map_t::iterator begin = corR.begin(); supM.clearall(); if ( corR.size() < pp->Ct ) select_count = corR.size(); for(unsigned int lp= 0; lp < select_count; ++lp) { supM.setbit(begin->second); ccsU = ccsU & Users.get_movies(begin->second); ++begin; } supU = ccsU; return; } Prune.C4 // THRESHOLD PRUNE if ( sCor < pp->TSb || pCor < pp->TP || dCor < pp->TD || vCor < pp->TV || stdCor < pp->TSD || dvCorp < pp->Tdvp|| dvCors < pp->Tdvs|| vdCorp < pp->Tvdp|| vdCors < pp->Tvds) continue; else { auto pair<double,unsigned long long int> entry(mCor,V); corR.insert(entry); } } if ( corR.size() == 0 ) return; load_ptree(supM, corR, pp->Ct); return; } /* user_vote: FastPrune */ static void fuPruneS(unsigned long long int U, PTree & supM, PTree & supU, \ struct pruning_parameters *pp) { if ( supM.get_count() < (pp->Ct + 1) ) return; map_t corR; auto vector<unsigned long long int> support; load_support_vector(supM, support, pp->ustrt, 
pp->ustrt_mult); for (unsigned long long int lp= 0; lp < support.size(); lp++) { auto unsigned long long int V = support[lp]; if ( outside_window(V, pp) ) continue; auto PTree csUV = supU & Users.get_movies(V); if ( csUV.get_count() < 1 ) continue; auto double smU = 0, smV = 0, UU = 0, VV = 0, UV = 0, NU, NV; /* fast user Prune (VN loops) NLOOP start */ auto vector<unsigned long long int> ilp; load_support_vector(csUV, ilp, pp->mstrt, pp->mstrt_mult); for(unsigned long long int lp1= 0; lp1 < ilp.size(); ++lp1) { auto unsigned long long int N = ilp[lp1]; #if 0 if ( outside_window(N, pp) ) continue; #endif NU = Movies.get_rating(U, N) - 2; NV = Movies.get_rating(V, N) - 2; smU += NU; smV += NV; UU += NU * NU; VV += NV * NV; UV += NU * NV; }

Prune.C5
/* movie_voting: CommonCoSupportPrune */
static void mPrune2(PTree & supM, PTree & supU, struct pruning_parameters *pp)
{
    if ( supU.get_count() < (pp->Ct + 1) ) return;

    map_t corRm;
    PTree csMN;
    vector<unsigned long long int> support;

    /* moviePRUNE NV loops NLOOP start */
    load_support_vector(supU, support, pp->mstrt, pp->mstrt_mult);
    for (unsigned long long int lp= 0; lp < support.size(); lp++) {
        unsigned long long int N = support[lp];
        if ( outside_window(N, pp) ) continue;
        /* Cosupport of M and N: the users who rated both movies. */
        csMN = supM & Movies.get_users(N);
        double dm = csMN.get_count();
        pair<double, unsigned long long int> entry(dm, N);
        corRm.insert(entry);
    }

    unsigned int select_count = (unsigned int) pp->Ct;
    PTree ccsM = supM;
    map_t::iterator begin = corRm.begin();
    supU.clearall();
    if ( corRm.size() < select_count ) select_count = corRm.size();
    for (unsigned int lp= 0; lp < select_count; ++lp) {
        supU.setbit(begin->second);
        ccsM = ccsM & Movies.get_users(begin->second);
        ++begin;
    }
    supM = ccsM;
    return;
}

/* Internal function.
 * This function dispatches execution to the pruning method which has
 * been selected for an external pruning routine.
 * \param prune A pointer to the structure defining the
 *              external pruning to be conducted.
 * \param M     The movie whose rating is to be predicted.
 * \param U     The user for whom the prediction is to be made.
 * \param supM  A PTree describing the movie support.
 * \param supU  A PTree describing user support.
 */
void do_pruning(struct external_prune * const prune, unsigned long int M,
                unsigned long int U, PTree & supM, PTree & supU)
{
    struct pruning_parameters *params = &prune->params;

    switch ( prune->method ) {
        case UserPrune:
            uPrune(U, supM, supU, params);
            break;
        case UserFastPrune:
            fuPruneS(U, supM, supU, params);
            break;
        case UserCommonCoSupportPrune:
            uPrune2(supM, supU, params);
            break;
        case MoviePrune:
            mPrune(M, supM, supU, params);
            break;
        case MovieFastPrune:
            fmPruneS(supU, params);
            break;
        case MovieCommonCoSupportPrune:
            mPrune2(supM, supU, params);
            break;
    }
    return;
}
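The prune routines above share one pattern: accumulate running sums over the co-supported dimensions, form a Pearson-style correlation (sCor), drop candidates below a threshold, and keep at most Ct of the survivors. The following is a minimal sketch of that pattern, assuming plain std::vector ratings instead of PTrees. The names pearson_from_sums, Candidate and select_top are hypothetical, and ranking strongest-first is an assumption, since the comparator of map_t is not shown in the listing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Pearson correlation of two equally long rating vectors, computed from the
// same running sums the prune loops accumulate (smU, smV, UU, VV, UV).
// Returns 0 when there are fewer than 2 entries or either variance is zero.
double pearson_from_sums(const std::vector<double>& u,
                         const std::vector<double>& v) {
    assert(u.size() == v.size());
    const double dm = static_cast<double>(u.size());
    if (dm < 2) return 0.0;
    double smU = 0, smV = 0, UU = 0, VV = 0, UV = 0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        smU += u[i];
        smV += v[i];
        UU += u[i] * u[i];
        VV += v[i] * v[i];
        UV += u[i] * v[i];
    }
    const double Ub = smU / dm, Vb = smV / dm;
    const double varU = UU - dm * Ub * Ub;
    const double varV = VV - dm * Vb * Vb;
    if (varU <= 0.0 || varV <= 0.0) return 0.0;
    return (UV - dm * Ub * Vb) / (std::sqrt(varU) * std::sqrt(varV));
}

// Hypothetical candidate record: a voter (or movie) index and its correlation.
struct Candidate {
    unsigned long long id;
    double corr;
};

// Threshold prune followed by a count prune: discard candidates whose
// correlation is below `threshold`, then keep at most `max_count` of the
// rest, strongest first (assumed ordering).
std::vector<unsigned long long> select_top(const std::vector<Candidate>& cands,
                                           double threshold,
                                           std::size_t max_count) {
    std::multimap<double, unsigned long long, std::greater<double>> ranked;
    for (const Candidate& c : cands)
        if (c.corr >= threshold) ranked.insert({c.corr, c.id});
    std::vector<unsigned long long> kept;
    for (auto it = ranked.begin();
         it != ranked.end() && kept.size() < max_count; ++it)
        kept.push_back(it->second);
    return kept;
}
```

In the listing the kept indexes are written back into a PTree with setbit; here they are simply returned as a vector so the selection step can be inspected on its own.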

mpp-glue
Takes all InputFileName_movieID1.predict … InputFileName_movieIDn.predict in the current directory as input (deleted after processing).
Puts as output (in the current directory) InputFileName.txt.predictions.

run script for processing the movie .predict files into 1 .predictions file (and also 1 .rmse and 1 .out log file):
cd Output
../mpp-glue1 ../$1
cd ..
mpp-rmse1 ./$1

mpp-glue script
#! /bin/bash
# This utility 'glues' a set of .predict files for a given run
# of mpp-mpred into a single file. This program is driven
# by the input file used for the prediction run. When it finds
# a movie (delimited by a trailing :) ALL entries in files,
# InputFileName_movieID.predict, in the current directory
# are printed to a file, InputFileName.txt.predictions.
# The utility takes as the single argument, InputFileName
# used for the prediction run

# Verify input file is found.
if [ -z "$1" ]; then
    echo "Error: Input file not specified."; exit 1;
fi;
if [ ! -e "$1" ]; then
    echo "Error: Input file not found - >$1<"; exit 1;
fi;

# Variables global to this module.
declare -r Name=`basename $1`;
declare -r Output="$Name.predictions" Logfile="$Name.logfile";
declare -r Backup="$Name.backup";
declare Inputfile=$1;
declare Movie;
declare Predictions Log;
declare Current_Dir;

# Main body of the program occurs here.
# If a directory named Output is present assume
# we should use that directory.
if [ -d "./Output" ]; then
    Current_Dir=`pwd`;
    Inputfile="../$Inputfile";
    cd Output;
fi;

# Remove any old output files and make sure we have a fresh backup directory.
rm -f $Output $Logfile;
if [ -d "$Backup" ]; then
    echo "Error: Backup directory present."; exit 1;
fi;
mkdir $Backup;

# Loop over prediction input file and generate outputs.
cat $Inputfile | while read input; do
    if [ "$input" != "${input%%:}" ]; then
        Movie=${input%%:};
        Predictions="$Name"_$Movie.predict;
        Log="$Name"_$Movie.log;
        if [ ! -e "$Predictions" ]; then
            echo "Error: Prediction file not found - " \
                ">$Predictions<";
            exit 1;
        fi;
        echo "Processing: $Movie";
        cat $Predictions >>$Output;
        # [ -e "$Log" ] && cat $Log >>$Logfile;
        rm $Predictions;
        # The previous line was added; with the following commented out,
        # it seems to eliminate backing up.
        # mv $Predictions $Backup;
        # if [ $? -ne 0 ]; then echo "Error: Unable to create predictions backup.";
        #     exit 1; fi;
        # if [ -e "$Log" ]; then mv $Log $Backup;
        #     if [ $? -ne 0 ]; then echo "Error: Unable to create logs backup.";
        #         exit 1; fi; fi;
    fi;
done;

# All done.
echo -e "\nInputfile: $Inputfile";
echo -e "\tPredictions:\t$Output";
echo -e "\tLogfile:\t$Logfile";
echo -e "\tBackups:\t$Backup";
echo -e "\nLine count verifications:";
echo -e "\t$(wc -l $Inputfile)";
echo -e "\t$(wc -l $Output)";
[ -n "$Current_Dir" ] && cd ..;
exit 0

mpp-rmse1 script mpp-rmse1.pl $answers = $ARGV[0]; $predictions = $ARGV[1]; $lp = 0; $cnt = 0; $error = 0; $error_sum = 0; $total_error = 0; $total_cnt = 0; $last_movie = ""; chomp(@answers = `cat $answers`); chomp(@predictions = `cat $predictions`); foreach(@answers) { if ( /:$/ ) { if ( $last_movie ne "" ) { printf "\n\tSum: %.5f\tTotal: %-5d\tRMSE: %f\n\n", $error_sum, $cnt, sqrt($error_sum/$cnt); printf "\tRunning RMSE: %f / %d predictions\n\n", sqrt($total_error/$total_cnt), $total_cnt; $error_sum = 0; $cnt = 0; } $last_movie = $_; print "Movie: $_\n"; if ( $_ ne $predictions[$lp] ) { print "Movies don't match\n"; print "\t$_ vs. $predictions[$lp]\n"; exit 1; } ++$lp; next; } # Correct for an NAN if ( $predictions[$lp] eq "nan" ) { print "NAN"; $predictions[$lp] = "3.70"; } if ( $predictions[$lp] eq "corm-nan" ) { print "CORM-NAN"; $predictions[$lp] = "3.70"; } #! /bin/bash # This utility generates an RMSE report based on predictions carried # out on the 'probe' dataset. It compares a prediction list against # the set of known files. # This program is driven by the input file used for the prediction # run. The majority of the comparative work and generation of the # RMSE values is done by the PERL script called from this script. # The PERL script reads both the prediction file # (Output/InputFileName.txt.prediction) and the list of known answers # (InputFileName.txt.answers in the current directory). # When a movie is found it verifies the movie is # also present in the companion file. This is to insure there are # no discrepancies between the two files. # The utility takes as a single argument the name of the input file # used for the prediction run. # Verify input file is found. if [ -z "$1" ]; then echo "Error: Input file not specified."; exit 1; fi; if [ ! -e "$1" ]; then echo "Error: Input file not found - >$Input<"; exit 1; fi; # Variables global to this module. 
declare -r Startdir=`dirname $0`; declare -r Basename=`basename $1`; declare -r Answers="$1.answers"; declare -r Predictions="Output/$Basename.predictions"; if [ ! -e "$Answers" ]; then echo "Answers file not found - >$Answers<."; exit 1; fi; if [ ! -e "$Predictions" ]; then echo "Predictions file not found - >$Predictions<."; exit 1; fi; # Main body of the program occurs here. perl $Startdir/mpp-rmse.pl $Answers $Predictions | tee "$Basename.rmse"; exit 0 $error = ($_ - $predictions[$lp])**2; $error_sum += $error; $total_error += $error; ++$total_cnt; ++$cnt; printf "\t%4d:\tAnswer: %2d\tPrediction: $predictions[$lp]\tError: %.5f\n", $cnt -1, $_, $error; ++$lp; } # Print the RMSE from the last movie. printf "\n\tSum: %.5f\tTotal: %-5d\tRMSE: %f\n\n", $error_sum, $cnt, sqrt($error_sum/$cnt); # Then the total RMSE for the run. print "Prediction summary:\n"; printf "\tSum: %.5f\tTotal: %-5d\tRMSE: %f\n\n", $total_error, $total_cnt, sqrt($total_error/$total_cnt); exit 0; Puts as output (in current dir) InputFileName.txt.rmse Takes Output/InputFileName.txt.predictions and InputFileName.txt.answers from current directory as input mpp-rmse
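The core of the RMSE report is small: each prediction is compared with the known answer, a prediction of "nan" or "corm-nan" is replaced by 3.70 as in the Perl script, and the root of the mean squared error is reported. A minimal sketch of that calculation follows, assuming the answers and predictions for one movie are already loaded into vectors. The function names parse_prediction and rmse are hypothetical, and returning 0 for an empty list is an added guard that the Perl script does not have.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Convert one prediction line to a number. "nan" and "corm-nan" fall back
// to 3.70, mirroring the correction applied in the Perl script.
double parse_prediction(const std::string& text) {
    if (text == "nan" || text == "corm-nan") return 3.70;
    return std::stod(text);
}

// Root mean squared error between known answers and predicted ratings.
// Returns 0 for an empty list (assumed guard against division by zero).
double rmse(const std::vector<double>& answers,
            const std::vector<std::string>& predictions) {
    assert(answers.size() == predictions.size());
    if (answers.empty()) return 0.0;
    double error_sum = 0.0;
    for (std::size_t i = 0; i < answers.size(); ++i) {
        const double diff = answers[i] - parse_prediction(predictions[i]);
        error_sum += diff * diff;
    }
    return std::sqrt(error_sum / static_cast<double>(answers.size()));
}
```

The running RMSE printed by the script is the same formula applied to the error sum and count accumulated over all movies processed so far.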

mpp-user-reduce script if [ -z "$2" ]; then echo "$Pgm: Error - RMSE threshold not specified."; exit 1; fi; if [ -z "$3" ]; then echo "$Pgm: Error - Low output filename not specified."; exit 1; fi; if [ -z "$4" ]; then echo "$Pgm: Error - High output filename not specified."; exit 1; fi; # Variables global to this module which are dependent on command-line options. declare -r Input=$1; declare -r Startdir=`dirname $0`; declare -r Basename=`basename $1`; declare -r Answers="$1.answers"; declare -r Predictions="Output/$Basename.predictions"; declare -r Threshold=$2; declare -r LowOut=$3; declare -r HighOut=$4; if [ ! -e "$Answers" ]; then echo "$Pgm: Error - Answers file not found: >$Answers<."; exit 1; fi; if [ ! -e "$Predictions" ]; then echo "$Pgm - Predictions file not found: >$Predictions<."; exit 1; fi; # Main body of the program occurs here. perl -w $Startdir/mpp-user-reduce.pl $Input $Answers $Predictions $Threshold \ $LowOut $HighOut $Mode; exit 0 #! /bin/bash # Variables global to this module. declare -r Pgm=`basename $0`; declare Mode="both"; # This utility reduces a set of movies to be predicted by outputting # movies which have an RMSE value greater than a specified threshold. # This program is driven by the input file used for the prediction # run. The majority of the comparative work and generation of the # RMSE values is done by the PERL script called from this script. # If the first argument to the utility is a -m the next argument # is interpreted as a mode value. The following arguments are accepted: # low: Output only low RMSE pairings. # high: Output only high RMSE pairings. # both: Output both files. # The default is for both files to be output. 
if [ "$1" = "-m" ]; then case $2 in low) Mode="low";; high) Mode="high";; both) Mode="both";; *) echo -e "$Pgm: Unknown argument to mode switch, \c"; echo "specify low, high or both."; exit 1;; esac; shift 2; fi; # The utility takes four general argumns as follows: # # $1: Inputfile # $2: RMSE threshold value. # $3: Root name of output file for movies below threshold. # $4: Root name of output file for movies above threshold. # Verify input file is found. if [ -z "$1" -o ! -e "$1" ]; then echo "$Pgm: Error - Input file not specified."; echo echo "Command format:" echo -e "\t$Pgm [-m low|high|both] Inputfile Threshold \c"; echo -e "LowOutFile HighOutfile"; exit 1; fi; mpp-user-reduce -m both Data/probe19.txt .0001 lo19 hi19Takes input, Data/probe19.txt (movieID with interleaved userIDs format or .txt format) SqErrThrhld (if SqErr ≤ .0001, put pair in lo19.txt, else put in hi19.txt) -m both means both lo and hi will be produced (other options: low or high) Puts as output lo-FileName hi-FileName mpp-user-reduce –m both|low|high InputFile.txt SqErrThrhd

mpp-user-reduce.pl # Main program starts here. # Load input, answers and predictions into arrays which are stored in # hashes keyed by movie number. open(INPUT, $Input) || die "Cannot open input: $Input"; while ( <INPUT> ) {chomp; if (/:$/) {$key = $_; $Input{$key}=[];} else { push(@{$Input{$key}}, $_); } } close(INPUT); open(INPUT, $Answers) || die "Cannot open answer file: $Answers"; while ( <INPUT> ) {chomp; if (/:$/) {$key = $_; $Answers{$key}=[];} else { push(@{$Answers{$key}}, $_); } } close(INPUT); open(INPUT, $Predictions) || die "Cannot open predictions file: $Predictions"; while ( <INPUT> ) { chomp; if ( /:$/ ) { $key = $_; $Predictions{$key} = []; } else { push(@{$Predictions{$key}}, $_); } } close(INPUT); foreach( keys(%Answers) ) { my $lp; my $error; $movie = $_; @users = @{$Input{$movie}}; @ans = @{$Answers{$movie}}; @pred = @{$Predictions{$movie}}; for ($lp= 0; $lp <= $#ans; ++$lp) { $user = $users[$lp]; $predict = $pred[$lp]; # Correct for NAN's and CORM-NAN if ($pred[$lp] eq "nan"){print "NAN"; $predict="3.70";} if ($pred[$lp] eq "corm-nan"){print "CORM-NAN";$predict="3.70";} $error = ($ans[$lp] - $predict)**2; if ( $error > $Threshold ) { $HighRMSE{$movie} = [] if !defined($HighRMSE{$movie}); push(@{$HighRMSE{$movie}},"$user $ans[$lp]");++$High_Count;} else { $LowRMSE{$movie} = [] if !defined($LowRMSE{$movie}); push(@{$LowRMSE{$movie}},"$user $ans[$lp]");++$Low_Count;} } } # Output new input and predictions files based on the reduced set. print "Selected movie/user pairings based on RMSE = $Threshold:\n"; if ( ($Mode eq "low") or ($Mode eq "both") ) { print "\tLow rmse pairs: ", $Low_Count, "\n"; Output_Pairing($LowOut, \%LowRMSE); print "\n"; } if ( ($Mode eq "high") or ($Mode eq "both") ) { print "\tHigh rmse pairs: ", $High_Count, "\n"; Output_Pairing($HighOut, \%HighRMSE); } # All done. 
exit 0; $Input = $ARGV[0]; $Answers = $ARGV[1]; $Predictions = $ARGV[2]; $Threshold = $ARGV[3]; $LowOut = $ARGV[4]; $HighOut = $ARGV[5]; $Mode = $ARGV[6]; $Low_Count = 0; $High_Count = 0; # Subroutine outputs pairing results for a given collection of user/movie ratings; sub Output_Pairing { my($file, $rmse_ptr) = @_; my($inputfile, $answerfile, $user, $answer); # Open input and answer files. $inputfile = $file . ".txt"; print "\t\tInput: $inputfile\n"; open(NEWINPUT, ">$inputfile") || die "Cannot open new inputfile: $inputfile"; $answerfile = $file . ".txt.answers"; print "\t\tAnswers: $answerfile\n"; open(ANSWERS, ">$answerfile") || die "Cannot open new answer file: $answerfile."; # The outer loop runs over the movies in a grouping. The inner # loop then runs over the set of inputs and answers for that movie. foreach ( keys(%{$rmse_ptr}) ) { print NEWINPUT "$_\n"; print ANSWERS "$_\n"; foreach ( @{$$rmse_ptr{$_}} ) { ($user, $answer) = split; print NEWINPUT "$user\n"; print ANSWERS "$answer\n"; } } close(NEWINPUT); close(ANSWERS); return; }
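The reduction step in mpp-user-reduce.pl amounts to a partition: a movie/user pair whose squared error exceeds the threshold goes to the high set, and every other pair goes to the low set. A minimal sketch of that partition follows, assuming the pairs are already joined with their answers and numeric predictions. The types RatedPair and Split and the function split_by_error are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record joining one input pair with its answer and prediction.
struct RatedPair {
    std::string movie;   // movie key as it appears in the input, e.g. "19:"
    std::string user;    // user ID line from the input file
    double answer;       // known rating
    double prediction;   // predicted rating (NaN fallback already applied)
};

// The two output groups written by the reduce step.
struct Split {
    std::vector<RatedPair> low;   // squared error <= threshold
    std::vector<RatedPair> high;  // squared error >  threshold
};

// Partition pairs by squared error, using the same comparison as the Perl
// script: strictly greater than the threshold means "high".
Split split_by_error(const std::vector<RatedPair>& pairs, double threshold) {
    assert(threshold >= 0.0);
    Split out;
    for (const RatedPair& p : pairs) {
        const double diff = p.answer - p.prediction;
        const double error = diff * diff;
        if (error > threshold) out.high.push_back(p);
        else out.low.push_back(p);
    }
    return out;
}
```

Note that the comparison is made per pair on the squared error, not on a per-movie RMSE, which matches the SqErr description of the threshold argument.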

mpp-filter script (for unioning (-M or), intersecting (-M and) clusters (to check coverage, etc.) mpp-filter.pl #! /bin/bash This is a driver program for implementing a utility for ANDing or # ORing two input files. # Variables global to this module. declare -r Pgm=`basename $0`; declare Mode=""; # Parse arguements. while getopts "M:" Arg; do case $Arg in M) Mode=$OPTARG;; esac; done; # Sanity checks. if [ -z "$Mode" ]; then echo "$Pgm: No mode specified."; exit 1; fi; if [ "$Mode" != "and" -a "$Mode" != "or" ]; then echo "$Pgm: Invalid mode specifed - $Mode"; exit 1; fi; # Verify two filenames are present. shift `expr $OPTIND - 1`; if [ $# -ne 2 ]; then echo "$Pgm: Insufficient filenames specified."; exit ; fi; # Call Perl to carry out the boolean filtering operation. exec perl $Pgm.pl $Mode $*; # This script implements boolean filtering operations between two input # files. The results of the filtering operation are output on stdout. # Two merge modes are supported: # AND: A user index is output if it exists for a given movie in both input files. # OR: A movie/user pair is output if it exists in either input file. $Mode = $ARGV[0]; $Input1 = $ARGV[1]; $Input2 = $ARGV[2]; # The following subroutine loads a file into an associative array. The # filename to be read is passed to the subroutine as the first arguement. # A reference to the associative array is passed as the second arguement. # If the filename cannot be opened an error exit is taken from the applic. sub Load_File { my $key, $file = $_[0], $hptr = $_[1]; open(IN, $file) || die "Cannot open file: $file"; while ( $_ = <IN> ) { chomp; if ( /:$/ ) { $key = $_; $$hptr{$key} = []; } else { push(@{$$hptr{$key}}, $_); } } close(IN); return; } # Subroutine outputs a file which has been stored in hashed/array format. sub Output_File { foreach ( keys(%{$_[0]}) ) { print "$_\n"; my @hlist = @{$_[0]{$_}}; foreach ( @hlist ) { print "$_\n"; } }

Appendix 1: additional codes Directories drwxr-xr-x 75 perrizo faculty 2.3M Feb 2 13:13 Output drwxr-xr-x 3 perrizo faculty 4.0K Jan 8 13:28 p19 drwxr-xr-x 5 perrizo faculty 4.0K Jan 8 13:34 p95 drwxr-xr-x 5 perrizo faculty 4.0K Jan 31 10:51 pf -rw-r--r-- 1 perrizo faculty 22K Nov 29 11:22 PredictionConfig.C -rw-r--r-- 1 perrizo faculty 5.3K Nov 29 11:22 PredictionConfig.H -rw-r--r-- 1 perrizo faculty 22K Nov 29 11:25 PredictionConfig.o -rw-r--r-- 1 perrizo faculty 19K Feb 2 12:38 prune.C -rw-r--r-- 1 perrizo faculty 29K Feb 2 12:38 prune.o -rw-r--r-- 1 perrizo faculty 1.2K Nov 29 11:22 read-user-ptrees.C -rwxr-xr-x 1 perrizo faculty 146 Nov 29 13:59 run -rwxr-xr-x 1 perrizo faculty 74K Dec 16 20:47 show-config -rw-r--r-- 1 perrizo faculty 454 Nov 29 11:22 show-config.C -rw-r--r-- 1 perrizo faculty 2.7K Nov 29 11:25 show-config.o -rw-r--r-- 1 perrizo faculty 6.4K Nov 29 11:22 UserSet.C -rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 UserSet.H -rw-r--r-- 1 perrizo faculty 7.2K Nov 29 11:25 UserSet.o -rw-r--r-- 1 perrizo faculty 17K Jan 19 06:57 user-vote.C -rw-r--r-- 1 perrizo faculty 9.3K Jan 19 07:07 user-vote.o $ ls -l -rwxr-xr-x 1 perrizo faculty 259 Nov 29 11:22 cluster-corr -rw-r--r-- 1perrizo faculty 1.2K Nov 29 11:22 cluster-corr.pl -rwxr-xr-x 1 perrizo faculty 7.7K Feb 1 12:26 config -rw-r--r-- 1 perrizo faculty 14K Nov 29 11:22 config.c -rw-r--r-- 1 perrizo faculty 1.4K Nov 29 11:22 config.h -rw-r--r-- 1 perrizo faculty 5.6K Nov 29 11:25 config.o -rw-r--r-- 1 perrizo faculty 38K Nov 29 11:25 config-parser.c -rw-r--r-- 1 perrizo faculty 806 Nov 29 11:22 config-parser.l -rw-r--r-- 1 perrizo faculty 15K Nov 29 11:25 config-parser.o -rw-r--r-- 1 perrizo faculty 2.4K Nov 29 11:22 cosupport.C drwxr-xr-x 2 perrizo faculty 12K Feb 2 12:40 Data drwxr-xr-x 2 perrizo faculty 4.0K Nov 29 11:25 libPTree -rw-r--r-- 1 perrizo faculty 4.1K Nov 29 11:22 Makefile -rwxr-xr-x 1 perrizo faculty 16K Nov 29 11:25 movie-corr -rw-r--r-- 1 perrizo faculty 1.3K Nov 29 11:22 
movie-corr.C -rw-r--r-- 1 perrizo faculty 2.3K Nov 29 11:22 MovieCorrelation.C -rw-r--r-- 1 perrizo faculty 1.4K Nov 29 11:22 MovieCorrelation.H -rw-r--r-- 1 perrizo faculty 9.6K Nov 29 11:25 MovieCorrelation.o -rw-r--r-- 1 perrizo faculty 3.3K Nov 29 11:25 movie-corr.o -rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 movie-rating.C -rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 movie-set.C -rw-r--r-- 1 perrizo faculty 2.0K Nov 29 11:22 MovieSet.C -rw-r--r-- 1 perrizo faculty 1.1K Nov 29 11:22 MovieSet.H -rw-r--r-- 1 perrizo faculty 4.2K Nov 29 11:25 MovieSet.o -rw-r--r-- 1 perrizo faculty 14K Jan 19 07:07 movie-vote.C -rw-r--r-- 1 perrizo faculty 9.7K Jan 19 07:07 movie-vote.o -rwxr-xr-x 1 perrizo faculty 303 Nov 29 11:22 mpp -rwxr-xr-x 1 perrizo faculty 1.3K Nov 29 11:22 mpp-cluster-list -rw-r--r-- 1 perrizo faculty 2.5K Nov 29 11:22 mpp-cluster-list.pl -rw-r--r-- 1 perrizo faculty 1.7K Nov 29 11:22 mppConfig.C -rw-r--r-- 1 perrizo faculty 1.1K Nov 29 11:22 mppConfig.H -rw-r--r-- 1 perrizo faculty 2.9K Nov 29 11:25 mppConfig.o -rwxr-xr-x 1 perrizo faculty 745 Dec 5 11:32 mpp-filter -rw-r--r-- 1 perrizo faculty 3.0K Dec 5 11:32 mpp-filter.pl -rwxr-xr-x 1 perrizo faculty 2.3K Nov 29 11:22 mpp-glue -rw-r--r-- 1 perrizo faculty 591 Nov 29 11:22 mpp.h -rwxr-xr-x 1 perrizo faculty 101K Feb 2 12:38 mpp-mpred -rw-r--r-- 1 perrizo faculty 13K Nov 29 11:22 mpp-mpred.C -rw-r--r-- 1 perrizo faculty 29K Nov 29 11:25 mpp-mpred.o -rwxr-xr-x 1 perrizo faculty 1.4K Nov 29 11:22 mpp-rmse -rw-r--r-- 1 perrizo faculty 1.5K Nov 29 11:22 mpp-rmse.pl -rw-r--r-- 1 perrizo faculty 6.9K Jan 19 06:36 mpp-user.C -rwxr-xr-x 1 perrizo faculty 1.3K Nov 29 11:22 mpp-user-cluster -rw-r--r-- 1 perrizo faculty 3.8K Nov 29 11:22 mpp-user-cluster.pl -rw-r--r-- 1 perrizo faculty 11K Jan 19 07:07 mpp-user.o -rwxr-xr-x 1 perrizo faculty 2.5K Jan 21 17:38 mpp-user-reduce -rw-r--r-- 1 perrizo faculty 3.2K Jan 21 17:38 mpp-user-reduce.pl $ ls -l Data -rw-r--r-- 1 perrizo faculty 67 Dec 18 01:32 p1.txt 
-rw-r--r-- 1 perrizo faculty 23 Dec 18 01:32 p1.txt.answers -rw-r--r-- 1 perrizo faculty 533K Dec 18 01:33 probe-1000.txt -rw-r--r-- 1 perrizo faculty 146K Dec 18 01:33 probe-1000.txt.answers -rw-r--r-- 1 perrizo faculty 1.9K Dec 18 01:32 probe19.txt -rw-r--r-- 1 perrizo faculty 611 Dec 18 01:32 probe19.txt.answers -rw-r--r-- 1 perrizo faculty 23K Dec 18 01:32 probe95.txt -rw-r--r-- 1 perrizo faculty 6.4K Dec 18 01:32 probe95.txt.answers -rw-r--r-- 1 perrizo faculty 594K Dec 18 01:32 test-probe-1000.txt -rw-r--r-- 1 perrizo faculty 162K Dec 18 01:32 test-probe-1000.txt.answers -rw-r--r-- 1 perrizo faculty 51K Dec 18 01:32 test-probe-100.txt -rw-r--r-- 1 perrizo faculty 14K Dec 18 01:32 test-probe-100.txt.answers $ ls -l libPTree -rw-r--r-- 1 perrizo faculty 18672 Nov 29 11:25 libPTree.a -rw-r--r-- 1 perrizo faculty 3192 Nov 29 11:22 Makefile -rw-r--r-- 1 perrizo faculty 15813 Nov 29 11:22 PTree.C -rw-r--r-- 1 perrizo faculty 2973 Nov 29 11:22 PTree.H -rw-r--r-- 1 perrizo faculty 11096 Nov 29 11:25 PTree.o -rw-r--r-- 1 perrizo faculty 18135 Nov 29 11:22 PTree-omp.C -rw-r--r-- 1 perrizo faculty 3796 Nov 29 11:22 ptree-op-test.C -rw-r--r-- 1 perrizo faculty 488 Nov 29 11:22 ptree-read.C -rw-r--r-- 1 perrizo faculty 779 Nov 29 11:22 ptree-save.C -rw-r--r-- 1 perrizo faculty 7485 Nov 29 11:22 PTreeSet.C -rw-r--r-- 1 perrizo faculty 1179 Nov 29 11:22 PTreeSet.H -rw-r--r-- 1 perrizo faculty 6464 Nov 29 11:25 PTreeSet.o -rw-r--r-- 1 perrizo faculty 2265 Nov 29 11:22 ptreeset-read.C -rw-r--r-- 1 perrizo faculty 420 Nov 29 11:22 ptree-test.C -rw-r--r-- 1 perrizo faculty 16127 Nov 29 11:22 PTree-x86_64.C -rw-r--r-- 1 perrizo faculty 16127 Nov 29 11:22 PTree-x86.C $ ls -lOutput ... -rw-r--r-- 1 perrizo faculty 32157 Feb 2 13:25 probe-full.txt_9939.predict ... drwxr-xr-x 2 perrizo faculty 901120 Jan 20 06:23 probe-full.txt.backup -rw-r--r-- 1 perrizo faculty 7059980 Jan 20 06:23 probe-full.txt.predictions

Makefile OPT = -O2 ${VECTOR} ifeq (${ARCH}, x86_64) OPT += -msse2 endif C_DEBUG = -g -pg LD_DEBUG = -g -pg endif ifeq (${COMPILER}, pgroup) CC = pgcc C++ = pgCC OPT = -fast -Minline=levels:10 C_DEBUG = -g -Minfo #-pg LD_DEBUG = -g -tp core2-64 #-pg endif ifeq (${COMPILER}, intel) CC = icpc C++ = icpc OPT = -O2 C_DEBUG = -g -p LD_DEBUG = -g -p endif endif INCLUDES = -I./libPTree CFLAGS = ${OPT} ${WARNINGS} ${INCLUDES} ifdef DEBUG CFLAGS += ${C_DEBUG} endif ifdef DEBUG LDFLAGS += ${LD_DEBUG} endif OBJS = mpp-mpred.o mpp-user.o mppConfig.o PredictionConfig.o \ MovieCorrelation.o UserSet.o MovieSet.o movie-vote.o user-vote.o \ prune.o config.o config-parser.o LIB = ./libPTree/libPTree.a LIBS = -lfl -L ./libPTree -lPTree # Executable target definitions. all: mpp-mpred show-config movie-corr mpp-mpred: ${OBJS} ${LIB} ${C++} ${LDFLAGS} -o $@ $^ ${LIBS}; cosupport: cosupport.o UserSet.o MovieSet.o ${LIB} ${C++} ${LDFLAGS} -o $@ $^ ${LIBS}; tools: movie-rating movie-set movie-rating: movie-rating.o UserSet.o MovieSet.o ${LIB} ${C++} ${LDFLAGS} -o $@ $^ ${LIBS} movie-set: movie-set.o UserSet.o ${LIB} ${C++} ${LDFLAGS} -o $@ $^ ${LIBS}; movie-corr: movie-corr.o MovieCorrelation.o ${LIB} VERSION = 2.6.0 # Default directory where PTree data is stored. # Overriden below depending on architecture. PTREEDATA = /tmp # Set compiler behavior based on architecture. 
ARCH := $(shell uname -m | sed -e s/i686/x86/) ifeq (${ARCH}, x86_64) COMPILER = gcc # COMPILER = gcc4 PTREEDATA = /scratch/perrizo endif ifeq (${ARCH}, ia64) # COMPILER = intel COMPILER = gcc4 endif ifeq (${ARCH}, x86) COMPILER = gcc endif ifndef (${COMPILER},) ifeq (${COMPILER}, gcc4) CC = /opt/gcc4/bin/gcc C++ = /opt/gcc4/bin/g++ # WARNINGS = -W -Wall -Wchar-subscripts -Wshadow \ -Wpointer-arith -Wwrite-strings -Wmissing-prototypes # VECTOR = -ftree-vectorize -ftree-vectorizer-verbose=5 OPT = -O2 ${VECTOR} ifeq (${ARCH}, x86_64) OPT += -msse2 endif C_DEBUG = -g -pg LD_DEBUG = -g -pg endif ifeq (${COMPILER}, gcc) CC = gcc C++ = g++ # WARNINGS = -W -Wall -Wchar-subscripts -Wshadow \ -Wpointer-arith -Wwrite-strings -Wmissing-prototypes # VECTOR = -ftree-vectorize -ftree-vectorizer-verbose=5

cosupport.C
/**
 * This file contains a driver program to
 * determine the rating given to
 * a movie by a user.
 */

/* Standard include files. */
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Local include files. */
#include "UserSet.H"
#include "MovieSet.H"

extern int main(int argc, char *argv[])
{
    MovieSet Movies;
    UserSet Users;

    fputs("Loading user PTree's.\n", stdout);
    if ( !Movies.load_binary() ) {
        fputs("Cannot load binary PTree's.\n", stderr);
        return 1;
    }
    fputs("Loading movie PTree's.\n", stdout);
    if ( !Users.load_binary() ) {
        fputs("Cannot load binary PTree's.\n", stderr);
        return 1;
    }
    fputs("Loading user identities.\n\n", stdout);
    if ( !Users.load_identities() ) {
        fputs("Cannot load user identities.\n", stderr);
        return 1;
    }

    if ( argv[1] == NULL ) {
        fputs("Need V specified.\n", stderr);
        return 1;
    }

    unsigned long int U = 421582, M = 0, V = strtoul(argv[1], NULL, 10);

    /* Voters: every user who rated M, except U. */
    PTree M_support = Movies.get_users(M);
    PTree Voters(M_support);
    Voters.clearbit(U);
    unsigned long long int *voters = Voters.get_indexes();
    fputs("Voter list:\n", stdout);
    for (size_t voter= 0; voter < Voters.get_count(); ++voter)
        fprintf(stdout, "%zu: %llu\n", voter, voters[voter]);
    fputc('\n', stdout);

    PTree cosupport;
    fputs("Voter Map:\n", stdout);
    Voters.dump(stdout);
    fputs("U Map:\n", stdout);
    (Users.get_movies(U)).dump(stdout);
    fputs("V Map:\n", stdout);
    (Users.get_movies(V)).dump(stdout);

    /* Cosupport: movies rated by both U and V, excluding M itself. */
    cosupport = Users.get_movies(U) & Users.get_movies(V);
    fputs("Cosupport Map:\n", stdout);
    cosupport.dump(stdout);
    cosupport.clearbit(M);

    fprintf(stdout, "Cosupport, M= %lu, U = %lu, V = %lu\n", M, U, V);
    double Ubar = Users.get_mean(U, cosupport),
           Vbar = Users.get_mean(V, cosupport),
           Vrt = Users.get_rating(V, M);
    /* V's vote for U: V's rating of M, shifted by the difference in means. */
    double vote = Vrt - Vbar + Ubar;

    unsigned long long int *movies = cosupport.get_indexes();
    for (unsigned long int movie= 0; movie < cosupport.get_count(); ++movie)
        fprintf(stdout, "\t\t\t\t%lu [%llu]:\tU = %0.2f, V = %0.2f\n",
                Movies.get_identity(movies[movie]), movies[movie],
                Movies.get_rating(U, movies[movie]),
                Movies.get_rating(V, movies[movie]));
    fprintf(stdout, "\t\t\t%.2f\t[Vrt: %.2f Vbar: %.2f Ubar: %.2f]\n",
            vote, Vrt, Vbar, Ubar);
    return 0;
}

movie-corr.C mpp #! /bin/bash if [ "$1" != "-i" ]; then echo "No input file specified."; exit 1; fi; shift; inputfile="$1"; run_name=`basename $inputfile`; rm -f $run_name.out; ./mpp-mpred -i $inputfile $* >"$run_name.out" 2>&1 & while [ ! -e "$run_name.out" ]; do sleep 1s; done; tail -f "$run_name.out"; exit; /** \file This file implements a program for * printing movie-movie correlations.*/ /* Standard include files. */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> /* Local include files. */ #include "MovieCorrelation.H" /* Program entry point. */ extern int main(int argc, char *argv[]) { auto bool dump = false; auto int gopt; auto unsigned int target=0, movie=0; auto MovieCorrelation mvcorr; while ((gopt=getopt(argc,argv,"dm:t:"))!=EOF){ switch ( gopt ) { case 'd': dump = true; break; case 'm': movie = atoi(optarg); break; case 't': target = atoi(optarg); break; } } mpp.h /** \file * This file contains general definitions and * defines for the PTree * based Netflix prediction system. */ /* External variable declarations. */ extern UserSet Users; extern MovieSet Movies; /* Function declarations. */ extern void do_pruning(struct external_prune * const prune, unsigned long int M, unsigned long int U, \ PTree & supM, PTree & supU); double user_vote(PredictionConfig *, unsigned long int, PTree &, unsigned long int, PTree &); double movie_vote(PredictionConfig *, unsigned long int, PTree &, unsigned long int, PTree &); if ( movie == 0 ) { fputs("movie-corr: No movie specified.\n", stderr); return 1; } if ( !mvcorr.load(movie) ) { printf("Error loading movies.\n"); return 1; } /* Dump movies and correlations. */ if ( dump ) { fprintf(stdout, "Correlations for movie: %u\n", movie); for (unsigned int lp= 0; lp < MOVIE_COUNT; ++lp) fprintf(stdout, "\t%5u: %7.4f / %d\n", lp + 1, \ mvcorr.supp(lp), mvcorr.corr(lp)); return 0; } /* Print correlation of target movie. */ if ( target > 0 ) { fprintf(stdout,"%-7.4f\n",mvcorr.corr(target-1));return 0;} return 0;}

MovieCorrelation.C

/** \file
 * This file contains the implementation of a class which encapsulates
 * management of correlation info for a particular movie to all other movies.
 */

/* System include files. */
#include <stdlib.h>

/* Standard C++ includes. */
#include <string>
#include <iostream>
#include <fstream>

/* Local include files. */
#include "MovieCorrelation.H"

using namespace std;

/** Constructor. */
MovieCorrelation::MovieCorrelation(void)
{
    movie_index = 0;

    /* Initialize correlation and support count (indexes 0..MOVIE_COUNT). */
    for (unsigned int lp = 0; lp <= MOVIE_COUNT; ++lp) {
        support[lp] = 0;
        correlations[lp] = 0.0;
    }
    return;
}

/** Destructor. */
MovieCorrelation::~MovieCorrelation(void)
{
    return;
}

/* Public method.
 * Implements loading of correlation and support vector for given movie.
 * \param index The index number of the movie to be loaded.
 * \return A boolean value is used to indicate the success or failure
 *         of the load. A true value indicates success.
 */
bool MovieCorrelation::load(unsigned long int index)
{
    char snbufr[10];
    string root = PTREEDATA "/mpred-data/",
           corr_path = root + "mv_corr/co_mv_",
           supp_path = root + "mv_supp/sp_mv_";
    ifstream corr_file, supp_file;

    /* Sanity check for movie index size. */
    if ( index > MOVIE_COUNT ) return false;
    movie_index = index;

    /* Synthesize the filename of the correlations file and read it. */
    snprintf(snbufr, sizeof(snbufr), "%lu", movie_index);
    string sn(snbufr);
    string corr_fname = corr_path + sn + ".bin";
    corr_file.open(corr_fname.c_str(), ios::in | ios::binary);
    if ( corr_file.fail() ) { corr_file.close(); return false; }
    corr_file.read(reinterpret_cast<char*>(correlations),
                   (MOVIE_COUNT + 1) * sizeof(float));
    if ( corr_file.fail() ) { corr_file.close(); return false; }
    corr_file.close();

    /* Synthesize the filename of the support file and read it. */
    string supp_fname = supp_path + sn + ".bin";
    supp_file.open(supp_fname.c_str(), ios::in | ios::binary);
    if ( supp_file.fail() ) { supp_file.close(); return false; }
    supp_file.read(reinterpret_cast<char*>(support),
                   (MOVIE_COUNT + 1) * sizeof(short int));
    if ( supp_file.fail() ) { supp_file.close(); return false; }
    supp_file.close();

    return true;
}

MovieCorrelation.H

#if !defined(MOVIECORRELATION_H)
#define MOVIECORRELATION_H

/* Total number of movies. */
#define MOVIE_COUNT 17770

/* Standard include files. */
#include <stdio.h>

/* Local include files. */

class MovieCorrelation {
private:
    /* The index number of the movie whose correlations are loaded. */
    unsigned long int movie_index;

    /*
     * The following array contains the list of correlations for
     * a movie to all the other movies. The array is one based
     * so a value of one needs to be added to the movie index
     * number to retrieve the correlation.
     */
    float correlations[MOVIE_COUNT + 1];

    /*
     * The following array contains the support list for the
     * correlations vector. The vector is one based as is the
     * correlations vector.
     */
    unsigned short int support[MOVIE_COUNT + 1];

public:
    /* Void constructor. */
    MovieCorrelation(void);

    /* Destructor. */
    ~MovieCorrelation(void);

    /*
     * Inline accessor methods for returning movie supports and
     * correlations. The index is zero based (0..MOVIE_COUNT-1).
     */
    inline float corr(unsigned int index)
    {
        if ( index >= MOVIE_COUNT ) return 0;
        return correlations[index + 1];
    }

    inline unsigned short int supp(unsigned int index)
    {
        if ( index >= MOVIE_COUNT ) return 0;
        return support[index + 1];
    }

    /* Public method for loading the correlation vector for a movie. */
    bool load(unsigned long int);
};

#endif
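The listings load precomputed correlation and support vectors but do not show how those files are produced. The following is only a sketch under stated assumptions: it assumes the correlation is a Pearson correlation taken over the users who rated both movies, and that the support is the number of such users. Neither assumption is confirmed by the listings, and the names CorrResult and movie_correlation are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: one correlation value plus its support count.
struct CorrResult {
    double corr;       // assumed: Pearson correlation over co-raters
    unsigned support;  // assumed: number of users who rated both movies
};

// a[u] and b[u] hold the rating user u gave to each movie; 0 = not rated.
inline CorrResult movie_correlation(const std::vector<int> &a,
                                    const std::vector<int> &b) {
    assert(a.size() == b.size());
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    unsigned k = 0;
    for (std::size_t u = 0; u < a.size(); ++u) {
        if (a[u] == 0 || b[u] == 0) continue;  // skip users missing either rating
        ++k;
        sa += a[u];
        sb += b[u];
        saa += double(a[u]) * a[u];
        sbb += double(b[u]) * b[u];
        sab += double(a[u]) * b[u];
    }
    CorrResult r = {0.0, k};
    if (k < 2) return r;  // correlation undefined; report 0
    const double cov = sab - sa * sb / k;
    const double va = saa - sa * sa / k;
    const double vb = sbb - sb * sb / k;
    if (va <= 0.0 || vb <= 0.0) return r;  // constant ratings; report 0
    r.corr = cov / std::sqrt(va * vb);
    return r;
}
```

Treating 0 as "not rated" rather than as a low rating is the point made in the contest description: unrated entries are dropped from the sums instead of being averaged in.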

MovieSet.C

/* System include files. */
#include <limits.h>

/* Local include files. */
#include "MovieSet.H"

/* Variables static to this module. */

/* No argument constructor. */
MovieSet::MovieSet(void) : ptree_set() { return; }

/* Destructor. */
MovieSet::~MovieSet(void) { return; }

/* Public method calculates rating user provided for movie.
 * \param user_index  The identity number of the user.
 * \param movie_index The identity number of the movie.
 * \return The rating number is returned to the caller.
 */
double MovieSet::get_rating(unsigned long int user_index,
                            unsigned long int movie_index)
{
    double rating = 0;
    size_t slot = movie_index * 3;

    /* Three bit-slice PTrees per movie; slot + 2 holds the low-order bit. */
    for (int tree = 2, bit = 0; tree >= 0; --tree, ++bit) {
        if ( ptree_set[slot + tree].is_set(user_index) )
            rating += pow(2.0, bit);
    }
    return rating;
}

/* Public method returns PTree describing set of users who rated movie. */
PTree MovieSet::get_users(unsigned long int index)
{
    size_t slot = index * 3;
    return ptree_set[slot] | ptree_set[slot + 1] | ptree_set[slot + 2];
}

/* Public method dumps the PTree set.
 * \param output The output descriptor to which the PTrees are to be directed.
 */
void MovieSet::dump(FILE *output)
{
    for (int lp = 0; lp < ptree_set.size(); ++lp)
        ptree_set[lp].dump(output);
    return;
}

/* Public method loads binary PTree set which has as its
 * X-axis user indexes with movie rating PTrees on Y-axis.
 */
bool MovieSet::load_binary(void)
{
    char bufr[PATH_MAX];
    FILE *input;

    for (int pt = 22; pt <= 53331; ++pt) {
        snprintf(bufr, sizeof(bufr),
                 "%s/mpred-data/nf_us_mv_pt/p%d.pct", PTREEDATA, pt);
        if ( (input = fopen(bufr, "r")) == NULL ) return false;
        if ( !ptree_set.load_binary_file(input) ) {
            fclose(input);
            return false;
        }
        fclose(input);
    }
    return true;
}

MovieSet.H

#if !defined(MOVIESET_H)
#define MOVIESET_H

/* Standard include files. */
#include <stdio.h>
#include <math.h>

/* Local include files. */
#include "PTreeSet.H"

class MovieSet {
private:
    PTreeSet ptree_set;

public:
    /* Void constructor. */
    MovieSet(void);

    /* Constructor to initialize an in-memory tree. */

    /* Destructor. */
    ~MovieSet(void);

    /* Public inline method to return identity of movie index. */
    unsigned long int get_identity(unsigned long int offset) { return offset + 1; }

    /* Public inline method to return index of movie identity. */
    unsigned long int get_index(unsigned long int identity) { return identity - 1; }

    /* Public method to return rating of movie by user. */
    double get_rating(unsigned long int, unsigned long int);

    /* Public method to return set of users rating movie. */
    PTree get_users(unsigned long int);

    /* Public method to print sparseness of set. */
    void dump(FILE *);

    /* Public method to load a binary PTree set. */
    bool load_binary(void);
};

#endif
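MovieSet::get_rating rebuilds a rating from three bit-slice PTrees per movie, and MovieSet::get_users ORs the same three slices to find who rated the movie. A minimal sketch of that bit-slice idea follows. It uses an uncompressed std::vector<bool> as a stand-in for a PTree, which is an assumption made purely for illustration; the names BitSlice, MovieSlices, slices_from_ratings, rating_from_slices and has_rated are hypothetical and not part of the PTree library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a PTree: one uncompressed bit per user.
typedef std::vector<bool> BitSlice;

// Three slices per movie, high-order bit first (slot, slot+1, slot+2),
// matching the slot layout used by MovieSet::get_rating.
struct MovieSlices {
    BitSlice bits[3];
};

// Encode plain ratings (0 = not rated, 1..5 = stars) into three bit slices.
inline MovieSlices slices_from_ratings(const std::vector<unsigned> &ratings) {
    MovieSlices m;
    for (int tree = 0; tree < 3; ++tree) {
        m.bits[tree].resize(ratings.size());
        for (std::size_t u = 0; u < ratings.size(); ++u)
            m.bits[tree][u] = ((ratings[u] >> (2 - tree)) & 1u) != 0;
    }
    return m;
}

// Rebuild one user's rating by reading the three bits, high-order first.
inline unsigned rating_from_slices(const MovieSlices &m, std::size_t user) {
    unsigned rating = 0;
    for (int tree = 0; tree < 3; ++tree) {
        assert(user < m.bits[tree].size());
        rating = (rating << 1) | (m.bits[tree][user] ? 1u : 0u);
    }
    return rating;
}

// A user rated the movie if any of the three bits is set; this is the
// per-user equivalent of the OR computed in MovieSet::get_users.
inline bool has_rated(const MovieSlices &m, std::size_t user) {
    return rating_from_slices(m, user) != 0;
}
```

For example, encoding the ratings {0, 5, 3, 1} and reading user 1 back yields 5, while user 0 is reported as not having rated the movie.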

mppConfig.C

/** \file
 * This file contains the implementation of a class which encapsulates
 * the info needed to configure a prediction run. The purpose of the
 * class is to abstract out the difference between a single config
 * run and a run based on a cluster of configurations.
 */

/* System include files. */

/* Local include files. */
#include "mppConfig.H"

/* No argument constructor. */
mppConfig::mppConfig(void)
{
    standard_config = false;
    standard = NULL;
    cluster_config = false;
    return;
}

/* Destructor. */
mppConfig::~mppConfig(void)
{
    if ( standard != NULL ) delete standard;
    return;
}

/* Public method causes the object to be initialized
 * as a standard single file configuration.
 * \param cfgfile Pointer to buffer containing the name of the
 *                standard configuration file.
 * \return If initialization of the configuration is successful
 *         a boolean true value is returned. Otherwise a
 *         false value is returned.
 */
bool mppConfig::read_config(const char * const cfgfile)
{
    standard = new PredictionConfig;
    if ( standard == NULL ) return false;
    if ( !standard->read_config(cfgfile) ) return false;
    standard_config = true;
    return true;
}

/* Public method intended to initialize the object from a cluster
 * configuration file. In this listing it is a stub.
 * \param cfgfile Pointer to buffer containing the name of the
 *                cluster configuration file.
 * \return A false value is always returned.
 */
bool mppConfig::read_cluster_config(const char * const cfgfile)
{
    return false;
}

mppConfig.H

#if !defined(MPPCONFIG_H)
#define MPPCONFIG_H

/* Standard include files. */
#include <stdio.h>

/* Local include files. */
#include "PredictionConfig.H"

class mppConfig {
private:
    bool standard_config, cluster_config;
    PredictionConfig *standard;

public:
    /* Void constructor. */
    mppConfig(void);

    /* Destructor. */
    ~mppConfig(void);

    /* Public inline accessor methods to determine if a standard
     * or cluster configuration is being used.
     */
    inline bool is_standard_config(void) { return standard_config; }
    inline bool is_cluster_config(void) { return cluster_config; }

    /* Public inline accessor method for the standard configuration. */
    inline PredictionConfig *get_standard_config(void) { return standard; }

    /* Public method to read a configuration file. */
    bool read_config(const char * const);

    /* Public method to read a cluster configuration file. */
    bool read_cluster_config(const char * const);

    /* Public method to print out a configuration. */
    void print(FILE *);
};

#endif

PredictionConfig.C

/** \file
 * File contains implementation of class which encapsulates
 * info which regulates how Movie/User pair predictions are made.
 */

/* System include files. */
#include <stdlib.h>
#include <string.h>

/* Local include files. */
#include "PredictionConfig.H"
extern "C" {
#include "config.h"
}

/* Internal private function.
 * This function initializes an internal pruning structure.
 * \param p A pointer to the structure to be initialized.
 */
static void _init_internal_prune(struct pruning *p)
{
    p->enabled = false;
    p->weight = false;
    p->threshold = 0.0;
    p->exponent = 1.0;
    return;
}

/* Internal private function.
 * This function initializes a structure defining external pruning.
 * \param p A pointer to the structure to be initialized.
 */
static void _init_external_prune(struct external_prune *p)
{
    p->enabled = false;
    p->method = UserPrune;
    p->params.mstrt = 0;
    p->params.mstrt_mult = 0.0;
    p->params.ustrt = 0;
    p->params.ustrt_mult = 0.0;
    p->params.TSa = -100;
    p->params.TSb = -100;
    p->params.Tdvp = -1;
    p->params.Tdvs = -1;
    p->params.Tvdp = -1;
    p->params.Tvds = -1;
    p->params.TD = -1;
    p->params.TP = -1;
    p->params.PPm = 0.1;
    p->params.TV = -1;
    p->params.TSD = -1;
    p->params.Ch = 1;
    p->params.Ct = 2;
    return;
}

/* No argument constructor. */
PredictionConfig::PredictionConfig(void)
{
    /* Initialize general prediction parameters. */
    name = NULL;
    user_voting = false;
    movie_voting = false;
    user_vote_weight = 1;

    /* Initialize user voting parameters. */
    user_force_vote_in_Voter_Loop = false;
    user_force_vote_after_Voter_Loop = false;
    user_reset_support = false;
    user_boundary_override = false;
    user_facz = 0.0;
    user_thrz = 1.0;
    _init_internal_prune(&dvCorp);
    _init_internal_prune(&dvCors);
    _init_internal_prune(&vdCorp);
    _init_internal_prune(&vdCors);
    _init_internal_prune(&pCor);
    _init_internal_prune(&dCor);
    _init_internal_prune(&sCor);
    _init_internal_prune(&dUVsdp);
    _init_internal_prune(&dUVsds);
    _init_internal_prune(&Vsdp_Usdp);
    _init_internal_prune(&Vsds_Usds);
    _init_external_prune(&Prune_Users_in_SupM);
    _init_external_prune(&Prune_Movies_in_SupU);
    _init_external_prune(&Prune_Movies_in_CoSupUV);

    /* Initialize movie voting parameters. */
    movie_force_vote_in_Voter_Loop = false;
    movie_force_vote_outside_Voter_Loop = false;
    movie_reset_support = false;
    movie_boundary_override = false;
    movie_facz = 0.0;
    movie_thrz = 1.0;
    _init_internal_prune(&DVCorp);
    _init_internal_prune(&DVCors);
    _init_internal_prune(&VDCorp);
    _init_internal_prune(&VDCors);
    _init_internal_prune(&PCor);
    _init_internal_prune(&DCor);
    _init_internal_prune(&SCor);
    _init_internal_prune(&dMNsdp);
    _init_internal_prune(&dMNsds);
    _init_internal_prune(&Nsdp_Msdp);
    _init_internal_prune(&Nsds_Msds);
    _init_external_prune(&Movie_Prune_Users_in_SupM);
    _init_external_prune(&Movie_Prune_Movies_in_SupU);
    _init_external_prune(&Movie_Prune_Users_in_CoSupMN);
    return;
}

PredictionConfig.C page 2

/* Destructor. */
PredictionConfig::~PredictionConfig(void)
{
    if ( name != NULL ) free(name);
    return;
}

/* Internal private function determines if a config option is enabled.
 * \param cf  The configuration to be tested for the option.
 * \param var Pointer to name of variable to be tested.
 * \return Boolean value returned to indicate whether the
 *         configuration option has been enabled. A true value
 *         indicates the variable is enabled, else false is returned.
 */
static bool _is_enabled(Config cf, const char * const var)
{
    char *p;

    p = Config_Get(cf, var);
    if ( p == NULL ) return false;
    if ( strcmp(p, "enabled") == 0 ) return true;
    return false;
}

/* Internal private function.
 * Initializes config struct for internal pruning method.
 * \param cf        The configuration which is being used.
 * \param sp        Pointer to the structure to be initialized.
 * \param name      Name of the internal pruning method.
 * \param threshold Name of variable containing the threshold.
 * \param weight    Name of variable specifying whether the method
 *                  should be used to set the value of uCor.
 */
void _set_internal_prune(Config cf, struct pruning *sp, const char *name,
                         const char *threshold, const char *weight)
{
    char *val;

    sp->enabled = _is_enabled(cf, name);
    if ( !sp->enabled ) return;
    val = Config_Get(cf, threshold);
    if ( val != NULL ) sp->threshold = atof(val);
    sp->weight = _is_enabled(cf, weight);
    return;
}

/* Internal private function.
 * Function initializes config structure for standard
 * deviation based pruning method.
 * \param cf        Configuration which is being used.
 * \param sp        Pointer to structure to be initialized.
 * \param name      Name of the internal pruning method.
 * \param threshold Name of variable containing the threshold value.
 * \param exponent  Name of variable specifying the exponent
 *                  which should be used for the GAUSSIAN method.
 */
void _set_stddev_prune(Config cf, struct pruning *sp, const char *name,
                       const char *threshold, const char *exponent)
{
    char *val;

    sp->enabled = _is_enabled(cf, name);
    if ( !sp->enabled ) return;
    val = Config_Get(cf, threshold);
    if ( val != NULL ) sp->threshold = atof(val);
    val = Config_Get(cf, exponent);
    if ( val != NULL ) sp->exponent = atof(val);
    return;
}

/* Internal private function.
 * Initializes configuration structure for an external pruning method.
 * \param cf   The configuration which is being used.
 * \param sp   A pointer to the external pruning definition
 *             structure which is to be initialized.
 * \param name The name of the external pruning method.
 */
void _set_external_prune(Config cf, struct external_prune *sp,
                         const char *name)
{
    char *val;
    struct pruning_parameters *pp = &sp->params;

    if ( !Config_Set_Section(cf, name) ) return;

    /* Select the pruning method; an absent key leaves the default in place. */
    val = Config_Get(cf, "method");
    if ( val != NULL ) {
        if ( strcmp(val, "UserPrune") == 0 ) sp->method = UserPrune;
        if ( strcmp(val, "UserFastPrune") == 0 ) sp->method = UserFastPrune;
        if ( strcmp(val, "UserCommonCoSupportPrune") == 0 )
            sp->method = UserCommonCoSupportPrune;
        if ( strcmp(val, "MoviePrune") == 0 ) sp->method = MoviePrune;
        if ( strcmp(val, "MovieFastPrune") == 0 ) sp->method = MovieFastPrune;
        if ( strcmp(val, "MovieCommonCoSupportPrune") == 0 )
            sp->method = MovieCommonCoSupportPrune;
    }

    /* Set the external pruning parameters. */
    val = Config_Get(cf, "mstrt");
    if ( val != NULL ) pp->mstrt = atoll(val);
    val = Config_Get(cf, "mstrt_mult");
    if ( val != NULL ) pp->mstrt_mult = atof(val);
    val = Config_Get(cf, "ustrt");
    if ( val != NULL ) pp->ustrt = atoll(val);
    val = Config_Get(cf, "ustrt_mult");
    if ( val != NULL ) pp->ustrt_mult = atof(val);
    val = Config_Get(cf, "TSa");
    if ( val != NULL ) pp->TSa = atof(val);
    val = Config_Get(cf, "TSb");
    if ( val != NULL ) pp->TSb = atof(val);
    val = Config_Get(cf, "Tdvp");
    if ( val != NULL ) pp->Tdvp = atof(val);
    val = Config_Get(cf, "Tdvs");
    if ( val != NULL ) pp->Tdvs = atof(val);
    val = Config_Get(cf, "Tvdp");
    if ( val != NULL ) pp->Tvdp = atof(val);
    val = Config_Get(cf, "Tvds");
    if ( val != NULL ) pp->Tvds = atof(val);
    val = Config_Get(cf, "TD");
    if ( val != NULL ) pp->TD = atof(val);
    val = Config_Get(cf, "TP");
    if ( val != NULL ) pp->TP = atof(val);
    val = Config_Get(cf, "PPm");
    if ( val != NULL ) pp->PPm = atof(val);
    val = Config_Get(cf, "TV");
    if ( val != NULL ) pp->TV = atof(val);
    val = Config_Get(cf, "TSD");
    if ( val != NULL ) pp->TSD = atof(val);
    val = Config_Get(cf, "Ch");
    if ( val != NULL ) pp->Ch = atof(val);
    val = Config_Get(cf, "Ct");
    if ( val != NULL ) pp->Ct = atof(val);
    return;
}
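The parsing helpers on this page share one pattern: an option counts as on only when its value is the literal string "enabled", and a numeric parameter keeps its default unless the key is present. The sketch below illustrates that pattern in isolation. It assumes the key/value pairs of one section are already held in a std::map, which stands in for the Config handle whose real API is visible here only through calls such as Config_Get; KeyValues, Pruning, is_enabled, set_if_present and read_internal_prune are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical stand-in for one section of a parsed configuration file.
typedef std::map<std::string, std::string> KeyValues;

// Mirrors the fields that _init_internal_prune initializes in the listing.
struct Pruning {
    bool enabled;
    bool weight;
    double threshold;
    double exponent;
};

// An option is on only if its value is exactly "enabled".
inline bool is_enabled(const KeyValues &kv, const std::string &key) {
    KeyValues::const_iterator it = kv.find(key);
    return it != kv.end() && it->second == "enabled";
}

// Overwrite a numeric field only when the key is present.
inline void set_if_present(const KeyValues &kv, const std::string &key,
                           double &field) {
    KeyValues::const_iterator it = kv.find(key);
    if (it != kv.end()) field = std::atof(it->second.c_str());
}

// Same flow as _set_internal_prune: defaults first, then optional overrides.
inline Pruning read_internal_prune(const KeyValues &kv,
                                   const std::string &name,
                                   const std::string &threshold_key,
                                   const std::string &weight_key) {
    assert(!name.empty());
    Pruning p = {false, false, 0.0, 1.0};
    p.enabled = is_enabled(kv, name);
    if (!p.enabled) return p;  // disabled methods keep all defaults
    set_if_present(kv, threshold_key, p.threshold);
    p.weight = is_enabled(kv, weight_key);
    return p;
}
```

With a map holding pCor = enabled and pThr = 0.25, read_internal_prune(kv, "pCor", "pThr", "pCorWeight") yields an enabled method with threshold 0.25 and weighting off, while a method whose name key is absent stays disabled with the default threshold of 0.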

PredictionConfig.C page 3

/* Public method used to obtain the parameters associated with an
 * internal pruning type.
 * \param pr Enumerated type describing the internal pruning
 *           for which parameter information is to be obtained.
 * \return Pointer to structure describing how the internal
 *         pruning method is to be implemented.
 */
struct pruning *PredictionConfig::get_internal_prune(enum internal_pruning pr)
{
    switch ( pr ) {
        case user_dvCorp: return &dvCorp;
        case user_dvCors: return &dvCors;
        case user_vdCorp: return &vdCorp;
        case user_vdCors: return &vdCors;
        case user_pCor: return &pCor;
        case user_dCor: return &dCor;
        case user_sCor: return &sCor;
        case movie_DVCorp: return &DVCorp;
        case movie_DVCors: return &DVCors;
        case movie_VDCorp: return &VDCorp;
        case movie_VDCors: return &VDCors;
        case movie_PCor: return &PCor;
        case movie_DCor: return &DCor;
        case movie_SCor: return &SCor;

        /* Standard deviation types */
        case user_dUVsdp: return &dUVsdp;
        case user_dUVsds: return &dUVsds;
        case user_Vsdp_Usdp: return &Vsdp_Usdp;
        case user_Vsds_Usds: return &Vsds_Usds;
        case movie_dMNsdp: return &dMNsdp;
        case movie_dMNsds: return &dMNsds;
        case movie_Nsdp_Msdp: return &Nsdp_Msdp;
        case movie_Nsds_Msds: return &Nsds_Msds;
    }
    return NULL;
}

/* Public method. Parses configuration file and translates the ASCII
 * key/value pairs into appropriate configuration variables.
 * \param file A character pointer to the file name containing
 *             the configuration to be read.
 * \return A boolean value is returned to indicate whether
 *         or not the read of the configuration file was
 *         successful. A true value indicates success while
 *         failure is indicated by a false value.
 */
bool PredictionConfig::read_config(const char * const file)
{
    char *val;
    Config cf;

    /* Open and parse the configuration file. */
    cf = Config_Init();
    if ( cf == NULL ) return false;
    if ( Config_Parse(cf, file) < 0 ) { Config_Destroy(cf); return false; }

    /* Set general prediction parameters. */
    if ( !Config_Set_Section(cf, "Default") ) { Config_Destroy(cf); return false; }
    val = Config_Get(cf, "name");
    if ( val != NULL ) name = strdup(val);
    user_voting = _is_enabled(cf, "user_voting");
    movie_voting = _is_enabled(cf, "movie_voting");
    val = Config_Get(cf, "user_vote_weight");
    if ( val != NULL ) user_vote_weight = atof(val);

    /* Process user voting parameters. */
    if ( user_voting && Config_Set_Section(cf, "user_voting") ) {
        user_force_vote_in_Voter_Loop = _is_enabled(cf, "force_vote_in_Voter_Loop");
        user_force_vote_after_Voter_Loop = _is_enabled(cf, "force_vote_after_Voter_Loop");
        user_reset_support = _is_enabled(cf, "reset_support");
        user_boundary_override = _is_enabled(cf, "boundary_override");
        if ( user_boundary_override ) {
            val = Config_Get(cf, "facz");
            if ( val != NULL ) user_facz = atof(val);
            val = Config_Get(cf, "thrz");
            if ( val != NULL ) user_thrz = atof(val);
        }

PredictionConfig.C page 4

        _set_internal_prune(cf, &dvCorp, "dvCorp", "dvThrp", "dvCorpWeight");
        _set_internal_prune(cf, &dvCors, "dvCors", "dvThrs", "dvCorsWeight");
        _set_internal_prune(cf, &vdCorp, "vdCorp", "vdThrp", "vdCorpWeight");
        _set_internal_prune(cf, &vdCors, "vdCors", "vdThrs", "vdCorsWeight");
        _set_internal_prune(cf, &pCor, "pCor", "pThr", "pCorWeight");
        _set_internal_prune(cf, &dCor, "dCor", "dThr", "dCorWeight");
        _set_internal_prune(cf, &sCor, "sCor", "sThr", "sCorWeight");
        _set_stddev_prune(cf, &dUVsdp, "dUVsdp", "dUVsdpThr", "dUVsdpExp");
        _set_stddev_prune(cf, &dUVsds, "dUVsds", "dUVsdsThr", "dUVsdsExp");
        _set_stddev_prune(cf, &Vsdp_Usdp, "Vsdp_Usdp", "Vsdp_UsdpThr", "Vsdp_UsdpExp");
        _set_stddev_prune(cf, &Vsds_Usds, "Vsds_Usds", "Vsds_UsdsThr", "Vsds_UsdsExp");

        Prune_Movies_in_SupU.enabled = _is_enabled(cf, "Prune_Movies_in_SupU");
        Prune_Users_in_SupM.enabled = _is_enabled(cf, "Prune_Users_in_SupM");
        Prune_Movies_in_CoSupUV.enabled = _is_enabled(cf, "Prune_Movies_in_CoSupUV");
        if ( Prune_Movies_in_SupU.enabled )
            _set_external_prune(cf, &Prune_Movies_in_SupU,
                                "user_voting Prune_Movies_in_SupU");
        if ( Prune_Users_in_SupM.enabled )
            _set_external_prune(cf, &Prune_Users_in_SupM,
                                "user_voting Prune_Users_in_SupM");
        if ( Prune_Movies_in_CoSupUV.enabled )
            _set_external_prune(cf, &Prune_Movies_in_CoSupUV,
                                "user_voting Prune_Movies_in_CoSupUV");
    }

    /* Process movie voting configuration. */
    if ( movie_voting && Config_Set_Section(cf, "movie_voting") ) {
        movie_force_vote_in_Voter_Loop = _is_enabled(cf, "force_vote_in_Voter_Loop");
        movie_force_vote_outside_Voter_Loop = _is_enabled(cf, "force_vote_outside_Voter_Loop");
        movie_reset_support = _is_enabled(cf, "reset_support");
        movie_boundary_override = _is_enabled(cf, "boundary_override");
        if ( movie_boundary_override ) {
            val = Config_Get(cf, "facz");
            if ( val != NULL ) movie_facz = atof(val);
            val = Config_Get(cf, "thrz");
            if ( val != NULL ) movie_thrz = atof(val);
        }

        _set_internal_prune(cf, &DVCorp, "DVCorp", "DVThrp", "DVCorpWeight");
        _set_internal_prune(cf, &DVCors, "DVCors", "DVThrs", "DVCorsWeight");
        _set_internal_prune(cf, &VDCorp, "VDCorp", "VDThrp", "VDCorpWeight");
        _set_internal_prune(cf, &VDCors, "VDCors", "VDThrs", "VDCorsWeight");
        _set_internal_prune(cf, &PCor, "PCor", "PThr", "PCorWeight");
        _set_internal_prune(cf, &DCor, "DCor", "DThr", "DCorWeight");
        _set_internal_prune(cf, &SCor, "SCor", "SThr", "SCorWeight");
        _set_stddev_prune(cf, &dMNsdp, "dMNsdp", "dMNsdpThr", "dMNsdpExp");
        _set_stddev_prune(cf, &dMNsds, "dMNsds", "dMNsdsThr", "dMNsdsExp");
        _set_stddev_prune(cf, &Nsdp_Msdp, "Nsdp_Msdp", "Nsdp_MsdpThr", "Nsdp_MsdpExp");
        _set_stddev_prune(cf, &Nsds_Msds, "Nsds_Msds", "Nsds_MsdsThr", "Nsds_MsdsExp");

        Movie_Prune_Users_in_SupM.enabled = _is_enabled(cf, "Prune_Users_in_SupM");
        Movie_Prune_Movies_in_SupU.enabled = _is_enabled(cf, "Prune_Movies_in_SupU");
        Movie_Prune_Users_in_CoSupMN.enabled = _is_enabled(cf, "Prune_Users_in_CoSupMN");
        if ( Movie_Prune_Users_in_SupM.enabled )
            _set_external_prune(cf, &Movie_Prune_Users_in_SupM,
                                "movie_voting Prune_Users_in_SupM");
        if ( Movie_Prune_Movies_in_SupU.enabled )
            _set_external_prune(cf, &Movie_Prune_Movies_in_SupU,
                                "movie_voting Prune_Movies_in_SupU");
        if ( Movie_Prune_Users_in_CoSupMN.enabled )
            _set_external_prune(cf, &Movie_Prune_Users_in_CoSupMN,
                                "movie_voting Prune_Users_in_CoSupMN");
    }

    Config_Destroy(cf);
    return true;
}

/* ... (the opening lines of the routine that prints an external
 * pruning definition are not shown on this page) ... */
    fputs("\t\t\tPruning method: ", output);
    switch ( sp->method ) {
        case UserPrune: fputs("UserPrune\n", output); break;
        case UserFastPrune: fputs("UserFastPrune\n", output); break;
        case UserCommonCoSupportPrune: fputs("UserCommonCoSupportPrune\n", output); break;
        case MoviePrune: fputs("MoviePrune\n", output); break;
        case MovieFastPrune: fputs("MovieFastPrune\n", output); break;
        case MovieCommonCoSupportPrune: fputs("MovieCommonCoSupportPrune\n", output); break;
    }
    fprintf(output, "\t\t\t\tmstrt: %-llu\tmultiplier: %-7.2f\n", pp->mstrt, pp->mstrt_mult);
    fprintf(output, "\t\t\t\tustrt: %-llu\tmultiplier: %-7.2f\n", pp->ustrt, pp->ustrt_mult);
    fprintf(output, "\t\t\t\tTSa: %-7.2f\tTSb: %-7.2f\n", pp->TSa, pp->TSb);
    fprintf(output, "\t\t\t\tTdvp: %-7.2f\tTdvs: %-7.2f\n", pp->Tdvp, pp->Tdvs);
    fprintf(output, "\t\t\t\tTvdp: %-7.2f\tTvds: %-7.2f\n", pp->Tvdp, pp->Tvds);
    fprintf(output, "\t\t\t\tTD: %-7.2f\tTP: %-7.2f\n", pp->TD, pp->TP);
    fprintf(output, "\t\t\t\tPPm: %-7.2f\n", pp->PPm);
    fprintf(output, "\t\t\t\tTV: %-7.2f\tTSD: %-7.2f\n", pp->TV, pp->TSD);
    fprintf(output, "\t\t\t\tCh: %-7.2f\tCt: %-7.2f\n", pp->Ch, pp->Ct);
    return;
}
