This paper explores the use of item-based P-Trees in collaborative filtering for recommendation systems, with a focus on the Netflix dataset. It discusses the P-Tree algorithm, similarity measurements, and experimental results. The paper concludes with the potential of P-Trees in improving recommendation accuracy.
Item-Based P-Tree Collaborative Filtering applied to the Netflix Data. Tingda Lu, Yan Wang, William Perrizo, Gregory Wettstein, Amal S. Perera. Computer Science, North Dakota State University, Fargo, ND 58108 USA
Agenda • Introduction to the Recommendation Systems and Collaborative Filtering using P-Trees • What are P-Trees? • Item-based P-Tree CF algorithm • Similarity measurements • Experimental results • Conclusion
Recommendation Systems • analyze a customer’s purchase (or rental) history and recommend the next purchase(s) or rental(s) most likely to satisfy • good recommendations eventually lead to business success
Amazon.com Book Recommendations • As customers make purchases, rate items and make lists, recommendations get “better” • Collaborative filtering: similar users like similar things • More choice necessitates better filters and recommendation engines
Netflix Movie Recommendation http://www.netflixprize.com/ “The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on previous ratings (by that person or similar others).” A $1 million prize was awarded last fall to a team for a >10% improvement over Netflix’s movie recommender at the time, Cinematch.
Collaborative Filtering • Collaborative Filtering (CF) algorithms are widely used in recommender systems • User-based CF is limited by its computational complexity • Item-based CF raises fewer processing-speed concerns, but standard processing methods still need a speedup • We use P-tree technology for these purposes
P-Tree Collaborative Filtering • P-Trees are a lossless, compressed, data-mining-ready vertical data structure • P-trees are used for fast computation and for masking specific phenomena • Data is first converted to P-trees
Predicate trees (Ptrees)
Given a table structured into horizontal data records (which are traditionally vertically processed, so VPHD): vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a one-dimensional Ptree.

R(A1 A2 A3 A4), in base 2 (the last two records are 7 0 1 4 in base 10):
  R[A1] R[A2] R[A3] R[A4]
  010   111   110   001
  011   111   110   000
  010   110   101   001
  010   111   101   111
  011   010   001   100
  010   010   001   101
  111   000   001   100
  111   000   001   100

Bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 are the columns of this table, one per bit position; they compress into the basic Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43. For example, R11 = 0 0 0 0 0 0 1 1.

To find the number of occurrences of 7 0 1 4: with VPHD, scan the horizontally structured records vertically (the count is 2); with HPVD, AND the basic Ptrees (next slide).

Top-down construction of the 1-dimensional Ptree of R11, denoted P11: record the truth of the [universal] predicate pure1 (for all bits, bit = 1) in a tree, recursively on halves, until the half is pure (purely 1's or purely 0's).
  1. Whole is pure1? false, so 0
  2. Left half pure1? false, so 0. But it is pure (pure0), so this branch ends
  3. Right half pure1? false, so 0
  4. Left half of right half pure1? false, so 0
  5. Right half of right half pure1? true, so 1
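The top-down construction of P11 described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the node encoding (`1` for a pure1 leaf, `0` for a pure0 leaf, and a tuple `(0, left, right)` for a mixed half) and the name `build_ptree` are assumptions made for the example.

```python
def build_ptree(bits):
    """Top-down 1-dimensional P-tree of a bit slice.

    Returns 1 for a pure1 half, 0 for a pure0 half, and
    (0, left_subtree, right_subtree) for a mixed half.
    The tuple encoding is an assumption made for this sketch.
    """
    if all(b == 1 for b in bits):
        return 1                      # pure1: predicate true, branch ends
    if not any(bits):
        return 0                      # pure0: predicate false, branch ends
    mid = len(bits) // 2              # mixed: recurse on the two halves
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))


R11 = [0, 0, 0, 0, 0, 0, 1, 1]        # first bit slice of attribute A1
P11 = build_ptree(R11)
print(P11)                            # (0, 0, (0, 0, 1))
```

The printed tree mirrors the five steps on the slide: the root and the right half are mixed, the left half is a pure0 leaf, and the right half of the right half is the only pure1 leaf.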
Counting occurrences by ANDing basic Ptrees
Same table with one change: record 5 is now 5 2 1 4 (A1 = 101).
  R[A1] R[A2] R[A3] R[A4]
  010   111   110   001
  011   111   110   000
  010   110   101   001
  010   111   101   111
  101   010   001   100
  010   010   001   101
  111   000   001   100
  111   000   001   100

To count occurrences of 7,0,1,4 use 111000001100:
  P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43
(P’ is the complement of P; a Ptree is complemented wherever the pattern bit is 0.)
• A (terminal) 0 in one operand makes the entire left branch of the result 0. There is no need to look at the other operands.
• 0s in the operands make the corresponding node of the result 0.
• 1s in the uncomplemented operands, together with 0s in the complemented operands (which when complemented are 1's), make the node 1.
The 2^1-level has the only 1-bit, so 1-count = 1*2^1 = 2.
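The counting step on this slide can be checked with a small Python sketch. It works on the uncompressed bit slices rather than on compressed Ptrees, so it shows only the logic of the AND (complementing the slices where the pattern bit is 0), not the speedup that the tree form gives. The function names are assumptions made for the example.

```python
# The eight records of this slide, each as 12 bits (A1 A2 A3 A4, 3 bits each).
rows = [
    "010111110001",
    "011111110000",
    "010110101001",
    "010111101111",
    "101010001100",
    "010010001101",
    "111000001100",
    "111000001100",
]


def bit_slices(rows):
    """Vertical bit slices: slice k holds bit k of every record."""
    return [[int(r[k]) for r in rows] for k in range(len(rows[0]))]


def count_pattern(rows, pattern):
    """Count records equal to `pattern` by ANDing bit slices.

    A slice is used as-is where the pattern bit is 1 and
    complemented where the pattern bit is 0.
    """
    result = [1] * len(rows)
    for slice_k, p in zip(bit_slices(rows), pattern):
        operand = slice_k if p == "1" else [1 - b for b in slice_k]
        result = [a & b for a, b in zip(result, operand)]
    return sum(result)


count = count_pattern(rows, "111000001100")   # 7 0 1 4 in base 2
print(count)                                  # 2
```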
R11 = 0 0 0 0 1 0 1 1
Top-down construction of basic P-trees is best for understanding; bottom-up is much faster (once across). Bottom-up construction of the 1-dimensional Ptree P11 is done using in-order tree traversal, collapsing pure siblings as we go. For example, the first two bits of R11 are siblings and both are pure0, so they collapse!
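A bottom-up build can be sketched as follows. Note the assumptions: this version works level by level (pairing neighbours and halving the list each round) rather than by the in-order traversal the slide mentions, it expects a slice whose length is a power of two, and it reuses a hypothetical tuple encoding `(0, left, right)` for mixed nodes. The collapsing rule is the one from the slide: two siblings that are pure and equal merge into a single pure node.

```python
def build_ptree_bottom_up(bits):
    """Bottom-up P-tree of a bit slice whose length is a power of two.

    Leaves are 0 (pure0) or 1 (pure1); a mixed node is (0, left, right).
    """
    level = list(bits)
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level), 2):
            left, right = level[k], level[k + 1]
            both_pure = not isinstance(left, tuple) and not isinstance(right, tuple)
            if both_pure and left == right:
                next_level.append(left)              # pure siblings collapse
            else:
                next_level.append((0, left, right))  # mixed parent
        level = next_level
    return level[0]


R11 = [0, 0, 0, 0, 1, 0, 1, 1]
print(build_ptree_bottom_up(R11))   # (0, 0, (0, (0, 1, 0), 1))
```

Each bit is visited once, which is the sense in which the bottom-up build goes "once across" the slice.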
Item-based P-Tree CF based on similarities
  PTree.load_binary();
  // Calculate the similarity between every pair of items
  for each i in I {
    for each j in I {
      sim[i][j] = sim(PTree[i], PTree[j]);
    }
  }
  // Get the top K nearest neighbors to item i among the items rated by user u
  pt = PTree.get_items(u);
  sort pt by sim[i][pt[j]], most similar first;
  // Prediction of rating on item i by user u
  sum = 0.0; weight = 0.0;
  for (j = 0; j < K; ++j) {
    sum += r[u][pt[j]] * sim[i][pt[j]];
    weight += sim[i][pt[j]];
  }
  pred = sum / weight;
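The prediction step of this pseudocode can be written as a small runnable Python function. It is a sketch under assumptions: plain dictionaries stand in for the P-tree structures, the similarities to the target item are taken as already computed, and the names `predict`, `user_ratings` and `sim_to_target` are hypothetical.

```python
def predict(user_ratings, sim_to_target, k):
    """Similarity-weighted average over the k rated items most similar to the target.

    user_ratings:  {item: rating given by user u}
    sim_to_target: {item: similarity between that item and target item i}
    Returns None when the similarity weights sum to zero.
    """
    # Top-k neighbours: the user's rated items, most similar to the target first.
    neighbours = sorted(
        user_ratings, key=lambda j: sim_to_target.get(j, 0.0), reverse=True
    )[:k]
    total = sum(user_ratings[j] * sim_to_target.get(j, 0.0) for j in neighbours)
    weight = sum(sim_to_target.get(j, 0.0) for j in neighbours)
    return total / weight if weight else None


ratings_u = {"a": 5, "b": 3, "c": 1}
sims_i = {"a": 0.5, "b": 0.5, "c": 0.1}
pred = predict(ratings_u, sims_i, k=2)
print(pred)   # 4.0
```

With k = 2 only items "a" and "b" are used, so the prediction is (5*0.5 + 3*0.5) / (0.5 + 0.5).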
Item-Based Similarity (I) • Cosine based • Pearson correlation
Item-Based Similarity (II) • Adjusted Cosine • SVD item-feature
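For reference, the cosine, Pearson and adjusted cosine measures named on these two slides are commonly written as below. These are the usual item-based CF definitions, given here as a sketch; the notation is an assumption and not taken from the slides: U_ij is the set of users who rated both items i and j, r_{u,i} is the rating of item i by user u, \bar{r}_i is the mean rating of item i, and \bar{r}_u is the mean rating given by user u.

```latex
% Cosine based
\[
\mathrm{sim}_{\cos}(i,j) =
  \frac{\sum_{u \in U_{ij}} r_{u,i}\, r_{u,j}}
       {\sqrt{\sum_{u \in U_{ij}} r_{u,i}^{2}}\;\sqrt{\sum_{u \in U_{ij}} r_{u,j}^{2}}}
\]

% Pearson correlation (ratings centred on the item means)
\[
\mathrm{sim}_{\mathrm{pearson}}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)^{2}}\;
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_j)^{2}}}
\]

% Adjusted cosine (ratings centred on the user means)
\[
\mathrm{sim}_{\mathrm{adj}}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)^{2}}\;
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_u)^{2}}}
\]
```

The only difference between the last two is what the ratings are centred on: the item mean for Pearson, the user mean for adjusted cosine.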
Similarity Correction • Two items should not be considered similar if only a few customers purchased or rated both • Therefore we include the co-support in item similarity
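One common way to fold co-support into a similarity is to shrink it toward zero when few users rated both items. The sketch below assumes a shrinkage weight of n / (n + shrink); this particular form, the `shrink` parameter and the function name are illustrative assumptions, not necessarily the correction used in the experiments.

```python
def support_corrected(sim, co_support, shrink=50.0):
    """Scale a similarity by how many users rated both items.

    co_support: number of users who rated both items.
    shrink:     hypothetical constant; larger values penalise
                low co-support more strongly.
    """
    return sim * co_support / (co_support + shrink)


print(support_corrected(0.9, 2))      # small co-support: similarity shrinks sharply
print(support_corrected(0.9, 5000))   # large co-support: similarity barely changes
```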
Prediction • Weighted Average • Item Effects
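A prediction with item effects is often formed by predicting the deviation from each item's mean rating instead of the raw rating, then adding the target item's mean back. The sketch below assumes that form; the exact formula behind "Item Effects" on this slide may differ, and the function and parameter names are hypothetical.

```python
def predict_with_item_effects(target_mean, neighbours):
    """Baseline-adjusted weighted average.

    target_mean: mean rating of the target item i
    neighbours:  list of (rating_by_user, neighbour_item_mean, similarity)
    Falls back to the target item's mean when there is no usable weight.
    """
    num = sum(s * (r - m) for r, m, s in neighbours)
    den = sum(abs(s) for _, _, s in neighbours)
    return target_mean if den == 0 else target_mean + num / den


# The user rated one neighbour a full star above its mean, the other at its mean.
pred = predict_with_item_effects(3.5, [(5, 4.0, 0.5), (3, 3.0, 0.5)])
print(pred)   # 4.0
```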
Analysis • The Adjusted Cosine similarity algorithm gets a much lower RMSE • The reason is that the other algorithms do not exclude users based on their rating variance • The Adjusted Cosine based algorithm discards users with high variance and hence gets better prediction accuracy
Similarity Correction • All algorithms get better RMSE with similarity correction except Adjusted Cosine.
Item Effects • Improves RMSE for all algorithms.
Conclusion • Experiments were carried out on Cosine, Pearson, Adjusted Cosine and SVD item-feature algorithms. • Support corrections and item effects significantly improve the prediction accuracy. • Pearson and SVD item-feature algorithms achieve better results when similarity correction and item effects are included.