Recommender Systems Session I Robin Burke, DePaul University, Chicago, IL
Roadmap • Session A: Basic Techniques I (Introduction, Knowledge Sources, Recommendation Types, Collaborative Recommendation) • Session B: Basic Techniques II (Content-based Recommendation, Knowledge-based Recommendation) • Session C: Domains and Implementation I (Recommendation Domains, Example Implementation, Lab I) • Session D: Evaluation I (Evaluation) • Session E: Applications (User Interaction, Web Personalization) • Session F: Implementation II (Lab II) • Session G: Hybrid Recommendation • Session H: Robustness • Session I: Advanced Topics (Dynamics, Beyond Accuracy)
Current research • Question 1 • do we lose something when we think of a ratings database as static? • my work • Question 2 • does a summary statistic like MAE hide valuable information? • Mike O’Mahoney (UCD colleague)
Collaborative Dynamics • Remember our evaluation methodology • get all the ratings • divide them up into test / training data sets • run prediction tests
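A minimal sketch of this static protocol in Python, assuming the ratings database is a list of (user, item, rating, timestamp) tuples and that predict() is a placeholder for any rating predictor; note that the random split ignores the timestamps entirely, which is exactly the problem raised on the next slide.

```python
# A minimal sketch of the static evaluation protocol described above.
# The ratings database is assumed to be a list of (user, item, rating, timestamp)
# tuples; predict(train, user, item) is a placeholder for any rating predictor.
import random

def static_split(ratings, test_fraction=0.1, seed=0):
    """Randomly divide all known ratings into training and test sets.
    The timestamps play no role in the split."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]          # (training set, test set)

def run_prediction_tests(ratings, predict):
    """Predict every held-out rating from the training data and return the errors."""
    train, test = static_split(ratings)
    return [abs(predict(train, u, i) - r) for (u, i, r, t) in test]
```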
Problem • That isn’t how real recommender systems operate • They get a stream of ratings over time • They have to respond to user requests • predictions • recommendation lists • dynamically
Questions • Are early ratings more predictive than later ratings? • Is there a pattern to how users build their profiles? • How long does it take to get past the cold-start?
Some ideas • Temporal leave-one-out • Profile MAE • Profile Hit Ratio
Temporal leave-one-out (TL1O) • for a rating r(u,i) at time t • predict that r(u,i) using the ratings database immediately prior to t • the information that would have been available right before we learned u’s real rating • Average the error over time intervals • we see how error evolves as data is added • cold-start in action
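A minimal sketch of TL1O, assuming ratings are (user, item, rating, timestamp) tuples, that predict() is a placeholder for any predictor trained on the ratings seen so far, and that errors are bucketed into week-long intervals (the interval width is an assumption, not from the slides).

```python
from collections import defaultdict

def temporal_leave_one_out(ratings, predict, interval=7 * 24 * 3600):
    """Replay the ratings in time order; predict each rating r(u,i) from the
    database as it stood just before time t, then average the absolute errors
    within each time interval to see how error evolves as data is added."""
    ratings = sorted(ratings, key=lambda x: x[3])        # oldest first
    start = ratings[0][3]
    errors_by_interval = defaultdict(list)
    history = []                                         # ratings known so far
    for (u, i, r, t) in ratings:
        if history:                                      # nothing to predict from at the start
            p = predict(history, u, i)
            errors_by_interval[(t - start) // interval].append(abs(p - r))
        history.append((u, i, r, t))
    return {k: sum(v) / len(v) for k, v in errors_by_interval.items()}
```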
Profile MAE • For each profile • do the TL1O ratings • average over all profiles of that length • See the aggregate evolution of profiles
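A sketch of profile MAE along the same lines; for brevity it pools the TL1O errors observed at each profile length rather than averaging per profile first, and predict() is again a placeholder.

```python
from collections import defaultdict

def profile_mae(ratings, predict):
    """Group TL1O errors by how many ratings the user had already supplied
    (the profile length at prediction time), then average within each group."""
    ratings = sorted(ratings, key=lambda x: x[3])        # replay in time order
    history = []
    profile_len = defaultdict(int)                       # ratings per user so far
    errors_by_length = defaultdict(list)
    for (u, i, r, t) in ratings:
        if history:                                      # need some data to predict from
            p = predict(history, u, i)
            errors_by_length[profile_len[u]].append(abs(p - r))
        history.append((u, i, r, t))
        profile_len[u] += 1
    return {n: sum(e) / len(e) for n, e in errors_by_length.items()}
```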
Profile Hit Ratio • Do a similar thing for hit ratio • For each liked item r(u,i) > 3 at time t • create a recommendation list at time t • measure the rank of item i on that list • compute the hit ratio of such items on lists of length k
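And a sketch of the hit-ratio variant, where recommend() is a placeholder that returns a ranked top-n list of item ids from the ratings available so far; it reports the fraction of liked items (rating > 3) that appear on the length-k list built just before the rating arrived.

```python
def profile_hit_ratio(ratings, recommend, k=50):
    """Hit ratio over liked items, using only the data available before each rating."""
    ratings = sorted(ratings, key=lambda x: x[3])        # replay in time order
    history, hits, trials = [], 0, 0
    for (u, i, r, t) in ratings:
        if history and r > 3:                            # "liked" items only
            top_k = recommend(history, u, n=k)           # ranked list of item ids
            trials += 1
            if i in top_k:
                hits += 1
        history.append((u, i, r, t))
    return hits / trials if trials else 0.0
```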
Cold Start • Seems to take about 150 days to get past the initial cold start • about 15% of the data • Temporal MAE improves after that • but not as steeply
Profile MAE • Decrease in MAE as profiles get longer • Strongest decrease earlier in the curve • Seems to be a kNN property • the same pattern appears even if the first 150 days (the cold-start period) are excluded
Diminishing returns • Appears to be diminishing returns in longer profile sizes • paradoxical given what we know about sparsity • More data should be better
A clue • ML100K data • 10% data size • Sparser data compresses the curve • Diminishing returns may be a function of the average profile length
Average rating • Users seem to add positive ratings first and negative ratings later
Application-dependence • Could be because ratings are added in response to recommendations • Easy (popular) recommendations given first • likely to be right • Later recommendations • more errors • users rate lower
Profile Hit Ratio • Cumulative hit ratio • n=50 • Dashed line is random performance
Interestingly • Harder to see • Appear to be diminishing returns • like MAE • but then a jump at the end • Need to examine this data more • ML100K data • experiments very slow to run
MAE for different ratings • An odd result • MAE for each rating value is correlated with the # of ratings of that value in the profile, once the contribution of the total # of ratings of that value is subtracted out • May tell us the average value of adding a rating of a particular type • Look at R=5? • saturation • more about this later
What Have The Neighbours Ever Done for Us? A Collaborative Filtering Perspective. Michael O’Mahony 5th March, 2009
Presentation based on paper submitted to UMAP ’09 • Authors: • R. Rafter, M.P. O’Mahony, N. J. Hurley and B. Smyth
Collaborative Filtering • Collaborative filtering (CF) – one of the key techniques used in recommender systems • Harnesses past ratings to make predictions & recommendations for new items • Recommend items with high predicted ratings and suppress those with low predicted ratings • Assumption: CF techniques provide a considerable advantage over simpler average-rating approaches
Valid Assumption? • We analyse the following: • What do CF techniques actually contribute? • How is accuracy performance measured? • What datasets are used to evaluate CF techniques? • Consider two standard CF techniques: • User-based and item-based CF
CF Algorithms • Two components to user-based and item-based CF: • Initial estimate: based on the average rating of the target user or item • Neighbour estimate: based on the ratings of similar users or items • Must perturb the initial estimate: • By the correct magnitude • In the correct direction • General formula: prediction = initial estimate + neighbour estimate
CF Algorithms • User-based CF: pred(u,i) = r̄(u) + Σ_{v∈N(u)} sim(u,v)·(r(v,i) − r̄(v)) / Σ_{v∈N(u)} |sim(u,v)|, where N(u) is the neighbourhood of users similar to u who have rated i • Item-based CF: pred(u,i) = r̄(i) + Σ_{j∈N(i)} sim(i,j)·(r(u,j) − r̄(j)) / Σ_{j∈N(i)} |sim(i,j)|, where N(i) is the neighbourhood of items similar to i that u has rated • In each case the first term is the initial estimate and the second term is the neighbour estimate
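A minimal sketch of the user-based case, with the initial estimate (the target user's mean rating) perturbed by a similarity-weighted neighbour estimate; sim(), the neighbourhood, and the mean-rating table are assumptions here rather than taken from the paper, and the item-based case is symmetric (start from the item mean and sum over similar items the user has rated).

```python
def user_based_predict(u, i, neighbours, ratings, sim, mean_rating):
    """neighbours: users similar to u who have rated item i;
    ratings[(v, i)]: known rating of item i by user v;
    mean_rating[v]: mean rating of user v (all placeholders / assumptions)."""
    initial = mean_rating[u]                                   # initial estimate
    num = sum(sim(u, v) * (ratings[(v, i)] - mean_rating[v]) for v in neighbours)
    den = sum(abs(sim(u, v)) for v in neighbours)
    neighbour_estimate = num / den if den else 0.0             # neighbour estimate
    return initial + neighbour_estimate                        # perturbed prediction
```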
Evaluating Accuracy • Predictive accuracy: • Mean Absolute Error (MAE): MAE = (1/|T|) Σ_{(u,i)∈T} |pred(u,i) − r(u,i)|, where T is the set of test ratings • MAE calculated over all test set ratings (problem?) • Other metrics: RMSE, ROC curves … – give similar trends
Evaluation • Datasets: MovieLens and Book-crossing • Procedure: • Create test set by randomly removing 10% of ratings • Make predictions for test set ratings using remaining data • Repeat x10 and compute average MAE
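A sketch of that procedure, assuming ratings are (user, item, rating) tuples and predict() stands in for either CF variant.

```python
import random

def average_mae(ratings, predict, runs=10, holdout=0.1):
    """Hold out 10% of the ratings at random, predict them from the remaining
    data, compute MAE, and average the result over ten independent runs."""
    maes = []
    for seed in range(runs):
        rng = random.Random(seed)
        shuffled = ratings[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * holdout)
        test, train = shuffled[:cut], shuffled[cut:]
        errors = [abs(predict(train, u, i) - r) for (u, i, r) in test]
        maes.append(sum(errors) / len(errors))
    return sum(maes) / len(maes)
```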
Results • Average performance, computed over all test set ratings • Neighbour estimate magnitudes are small: between 8.5% and 11% of the rating range • Item-based CF is comparable to or outperforms user-based CF w.r.t. MAE (smaller magnitudes observed for item-based CF) • Book-crossing dataset: user-based CF shifts the initial estimate in the correct direction in only 53% of cases (just slightly better than chance!)
Datasets • Frequency of occurrence of ratings: • Bias (natural?) toward ratings on the higher end of the scale • Consider MovieLens: • Most ratings are 3 and 4 • Mean user rating ≈ 3.6 – a small neighbour estimate magnitude is required in most cases • Consequences of such dataset characteristics for CF research: • Computing average MAE across all test set ratings hides performance issues in light of such characteristics [Shardanand and Maes 1995] • For example, can CF achieve large magnitudes when needed?
MAE vs Actual Ratings Recall: average overall MAE = 0.73 for both UB and IB …
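The fine-grained breakdown behind this kind of plot is straightforward to compute; a sketch, assuming the test runs yield a list of (actual, predicted) pairs.

```python
from collections import defaultdict

def mae_by_actual_rating(pairs):
    """pairs: iterable of (actual_rating, predicted_rating) from the test sets.
    Returns the MAE for each distinct actual rating value."""
    errors = defaultdict(list)
    for actual, predicted in pairs:
        errors[actual].append(abs(predicted - actual))
    return {r: sum(e) / len(e) for r, e in sorted(errors.items())}
```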
Neighbour Contribution • Effect of neighbour estimate versus initial (mean-based) estimate:
Conclusions • Examined the contribution of standard CF techniques: • Neighbours have a small influence (magnitude) which is not always reliable (direction) • Evaluating accuracy performance: • Need for more fine-grained error analysis [Shardanand and Maes 1995] • Focus on developing CF algorithms which offer improved accuracy for extreme ratings • Test datasets: • Standard datasets have particular characteristics – e.g. bias in ratings toward the higher end of the rating scale – need for new datasets • Such characteristics, combined with using overall MAE to evaluate accuracy, have "hidden" performance issues – and hindered CF development (?)
That’s all folks! • Questions?