Recommender Systems Session I Robin Burke, DePaul University, Chicago, IL
Roadmap • Session A: Basic Techniques I (Introduction, Knowledge Sources, Recommendation Types, Collaborative Recommendation) • Session B: Basic Techniques II (Content-based Recommendation, Knowledge-based Recommendation) • Session C: Domains and Implementation I (Recommendation Domains, Example Implementation, Lab I) • Session D: Evaluation I (Evaluation) • Session E: Applications (User Interaction, Web Personalization) • Session F: Implementation II (Lab II) • Session G: Hybrid Recommendation • Session H: Robustness • Session I: Advanced Topics (Dynamics, Beyond Accuracy)
Current research • Question 1 • do we lose something when we think of a ratings database as static? • my work • Question 2 • does a summary statistic like MAE hide valuable information? • Mike O’Mahoney (UCD colleague)
Collaborative Dynamics • Remember our evaluation methodology • get all the ratings • divide them up into test / training data sets • run prediction tests
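A minimal sketch of this static protocol in Python, assuming the ratings database is a list of (user, item, rating, timestamp) tuples and that predict() is a placeholder for any rating predictor; note that the random split ignores the timestamps entirely, which is exactly the problem raised on the next slide.

```python
# A minimal sketch of the static evaluation protocol described above.
# The ratings database is assumed to be a list of (user, item, rating, timestamp)
# tuples; predict(train, user, item) is a placeholder for any rating predictor.
import random

def static_split(ratings, test_fraction=0.1, seed=0):
    """Randomly divide all known ratings into training and test sets.
    The timestamps play no role in the split."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]          # (training set, test set)

def run_prediction_tests(ratings, predict):
    """Predict every held-out rating from the training data and return the errors."""
    train, test = static_split(ratings)
    return [abs(predict(train, u, i) - r) for (u, i, r, t) in test]
```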
Problem • That isn’t how real recommender systems operate • They get a stream of ratings over time • They have to respond to user requests • predictions • recommendation lists • dynamically
Questions • Are early ratings more predictive than later ratings? • Is there a pattern to how users build their profiles? • How long does it take to get past the cold-start?
Some ideas • Temporal leave-one-out • Profile MAE • Profile Hit Ratio
Temporal leave-one-out (TL1O) • for a rating r(u,i) at time t • predict that r(u,i) using the ratings database immediately prior to t • the information that would have been available right before we learned u’s real rating • Average the error over time intervals • we see how error evolves as data is added • cold-start in action
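A minimal sketch of TL1O, assuming ratings are (user, item, rating, timestamp) tuples, that predict() is a placeholder for any predictor trained on the ratings seen so far, and that errors are bucketed into week-long intervals (the interval width is an assumption, not from the slides).

```python
from collections import defaultdict

def temporal_leave_one_out(ratings, predict, interval=7 * 24 * 3600):
    """Replay the ratings in time order; predict each rating r(u,i) from the
    database as it stood just before time t, then average the absolute errors
    within each time interval to see how error evolves as data is added."""
    ratings = sorted(ratings, key=lambda x: x[3])        # oldest first
    start = ratings[0][3]
    errors_by_interval = defaultdict(list)
    history = []                                         # ratings known so far
    for (u, i, r, t) in ratings:
        if history:                                      # nothing to predict from at the start
            p = predict(history, u, i)
            errors_by_interval[(t - start) // interval].append(abs(p - r))
        history.append((u, i, r, t))
    return {k: sum(v) / len(v) for k, v in errors_by_interval.items()}
```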
Profile MAE • For each profile • do the TL1O ratings • average over all profiles of that length • See the aggregate evolution of profiles
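A sketch of profile MAE along the same lines; for brevity it pools the TL1O errors observed at each profile length rather than averaging per profile first, and predict() is again a placeholder.

```python
from collections import defaultdict

def profile_mae(ratings, predict):
    """Group TL1O errors by how many ratings the user had already supplied
    (the profile length at prediction time), then average within each group."""
    ratings = sorted(ratings, key=lambda x: x[3])        # replay in time order
    history = []
    profile_len = defaultdict(int)                       # ratings per user so far
    errors_by_length = defaultdict(list)
    for (u, i, r, t) in ratings:
        if history:                                      # need some data to predict from
            p = predict(history, u, i)
            errors_by_length[profile_len[u]].append(abs(p - r))
        history.append((u, i, r, t))
        profile_len[u] += 1
    return {n: sum(e) / len(e) for n, e in errors_by_length.items()}
```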
Profile Hit Ratio • Do a similar thing for hit ratio • For each liked item r(u,i) > 3 at time t • create a recommendation list at time t • measure the rank of item i on that list • compute the hit ratio of such items on lists of length k
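And a sketch of the hit-ratio variant, where recommend() is a placeholder that returns a ranked top-n list of item ids from the ratings available so far; it reports the fraction of liked items (rating > 3) that appear on the length-k list built just before the rating arrived.

```python
def profile_hit_ratio(ratings, recommend, k=50):
    """Hit ratio over liked items, using only the data available before each rating."""
    ratings = sorted(ratings, key=lambda x: x[3])        # replay in time order
    history, hits, trials = [], 0, 0
    for (u, i, r, t) in ratings:
        if history and r > 3:                            # "liked" items only
            top_k = recommend(history, u, n=k)           # ranked list of item ids
            trials += 1
            if i in top_k:
                hits += 1
        history.append((u, i, r, t))
    return hits / trials if trials else 0.0
```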
Cold Start • Seems to take about 150 days to get past the initial cold start • about 15% of the data • Temporal MAE improves after that • but not as steeply
Profile MAE • Decrease in MAE as profiles get longer • Strongest decrease earlier in the curve • Seems to be a kNN property • the same pattern appears even if the first 150 days (the cold-start period) are excluded
Diminishing returns • Appears to be diminishing returns in longer profile sizes • paradoxical given what we know about sparsity • More data should be better
A clue • ML100K data • 10% data size • Sparser data compresses the curve • Diminishing returns may be a function of the average profile length
Average rating • Users seem to add positive ratings first and negative ratings later
Application-dependence • Could be because ratings are added in response to recommendations • Easy (popular) recommendations given first • likely to be right • Later recommendations • more errors • users rate lower
Profile Hit Ratio • Cumulative hit ratio • n=50 • Dashed line is random performance
Interestingly • Harder to see • Appear to be diminishing returns • like MAE • but then a jump at the end • Need to examine this data more • ML100K data • experiments very slow to run
MAE for different ratings • An odd result • MAE for each rating value is correlated with the # of ratings of that value in the profile, once the contribution of the total # of ratings of that value is subtracted out • May tell us the average value of adding a rating of a particular type • Look at R=5? • saturation • more about this later
What Have The Neighbours Ever Done for Us? A Collaborative Filtering Perspective. Michael O’Mahony 5th March, 2009
Presentation based on paper submitted to UMAP ’09 • Authors: • R. Rafter, M.P. O’Mahony, N. J. Hurley and B. Smyth
Collaborative Filtering • Collaborative filtering (CF) – one of the key techniques used in recommender systems • Harnesses past ratings to make predictions & recommendations for new items • Recommend items with high predicted ratings and suppress those with low predicted ratings • Assumption: CF techniques provide a considerable advantage over simpler average-rating approaches
Valid Assumption? • We analyse the following: • What do CF techniques actually contribute? • How is accuracy performance measured? • What datasets are used to evaluate CF techniques? • Consider two standard CF techniques: • User-based and item-based CF
CF Algorithms • Two components to user-based and item-based CF: • Initial estimate: based on the average rating of the target user or item • Neighbour estimate: based on the ratings of similar users or items • Must perturb the initial estimate: • By the correct magnitude • In the correct direction • General formula: prediction = initial estimate + neighbour estimate
CF Algorithms • User-based CF: pred(u,i) = r̄(u) + Σ_{v∈N(u)} sim(u,v)·(r(v,i) − r̄(v)) / Σ_{v∈N(u)} |sim(u,v)|, where N(u) is the neighbourhood of users similar to u who have rated i • Item-based CF: pred(u,i) = r̄(i) + Σ_{j∈N(i)} sim(i,j)·(r(u,j) − r̄(j)) / Σ_{j∈N(i)} |sim(i,j)|, where N(i) is the neighbourhood of items similar to i that u has rated • In each case the first term is the initial estimate and the second term is the neighbour estimate
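A minimal sketch of the user-based case, with the initial estimate (the target user's mean rating) perturbed by a similarity-weighted neighbour estimate; sim(), the neighbourhood, and the mean-rating table are assumptions here rather than taken from the paper, and the item-based case is symmetric (start from the item mean and sum over similar items the user has rated).

```python
def user_based_predict(u, i, neighbours, ratings, sim, mean_rating):
    """neighbours: users similar to u who have rated item i;
    ratings[(v, i)]: known rating of item i by user v;
    mean_rating[v]: mean rating of user v (all placeholders / assumptions)."""
    initial = mean_rating[u]                                   # initial estimate
    num = sum(sim(u, v) * (ratings[(v, i)] - mean_rating[v]) for v in neighbours)
    den = sum(abs(sim(u, v)) for v in neighbours)
    neighbour_estimate = num / den if den else 0.0             # neighbour estimate
    return initial + neighbour_estimate                        # perturbed prediction
```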
Evaluating Accuracy • Predictive accuracy: • Mean Absolute Error (MAE): MAE = (1/|T|) Σ_{(u,i)∈T} |pred(u,i) − r(u,i)|, where T is the set of test ratings • MAE calculated over all test set ratings (problem?) • Other metrics: RMSE, ROC curves … – give similar trends
Evaluation • Datasets: MovieLens and Book-crossing • Procedure: • Create test set by randomly removing 10% of ratings • Make predictions for test set ratings using remaining data • Repeat x10 and compute average MAE
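A sketch of that procedure, assuming ratings are (user, item, rating) tuples and predict() stands in for either CF variant.

```python
import random

def average_mae(ratings, predict, runs=10, holdout=0.1):
    """Hold out 10% of the ratings at random, predict them from the remaining
    data, compute MAE, and average the result over ten independent runs."""
    maes = []
    for seed in range(runs):
        rng = random.Random(seed)
        shuffled = ratings[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * holdout)
        test, train = shuffled[:cut], shuffled[cut:]
        errors = [abs(predict(train, u, i) - r) for (u, i, r) in test]
        maes.append(sum(errors) / len(errors))
    return sum(maes) / len(maes)
```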
Results • Average performance, computed over all test set ratings • Neighbour estimate magnitudes are small: between 8.5% and 11% of the rating range • Item-based CF is comparable to or outperforms user-based CF w.r.t. MAE (smaller magnitudes observed for item-based CF) • Book-crossing dataset: user-based CF shifts the initial estimate in the correct direction in only 53% of cases (just slightly better than chance!)
Datasets • Frequency of occurrence of ratings: • Bias (natural?) toward ratings on the higher end of the scale • Consider MovieLens: • Most ratings are 3 and 4 • Mean user rating ≈ 3.6 – a small neighbour estimate magnitude is required in most cases • Consequences of such dataset characteristics for CF research: • Computing average MAE across all test set ratings hides performance issues in light of such characteristics [Shardanand and Maes 1995] • For example, can CF achieve large magnitudes when needed?
MAE vs Actual Ratings Recall: average overall MAE = 0.73 for both UB and IB …
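The fine-grained breakdown behind this kind of plot is straightforward to compute; a sketch, assuming the test runs yield a list of (actual, predicted) pairs.

```python
from collections import defaultdict

def mae_by_actual_rating(pairs):
    """pairs: iterable of (actual_rating, predicted_rating) from the test sets.
    Returns the MAE for each distinct actual rating value."""
    errors = defaultdict(list)
    for actual, predicted in pairs:
        errors[actual].append(abs(predicted - actual))
    return {r: sum(e) / len(e) for r, e in sorted(errors.items())}
```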
Neighbour Contribution • Effect of neighbour estimate versus initial (mean-based) estimate:
Conclusions • Examined the contribution of standard CF techniques: • Neighbours have a small influence (magnitude) which is not always reliable (direction) • Evaluating accuracy performance: • Need for more fine-grained error analysis [Shardanand and Maes 1995] • Focus on developing CF algorithms which offer improved accuracy for extreme ratings • Test datasets: • Standard datasets have particular characteristics – e.g. bias in ratings toward the higher end of the rating scale – need for new datasets • Such characteristics, combined with using overall MAE to evaluate accuracy, have "hidden" performance issues – and hindered CF development (?)
That’s all folks! • Questions?