Collaborative filtering with temporal dynamics. Max Naylor. Background.
Background • Netflix Prize: From 2006 to 2009, Netflix ran an open competition for the best collaborative filtering algorithm to predict user ratings for films (grand prize: $1M), releasing 100+ million anonymized user-movie ratings, now called the Netflix Prize dataset • Collaborative filtering is the most popular approach to implementing recommender systems, leveraging past user behaviour to give customized recommendations • Concept drift is the phenomenon of user behaviour changing over time, either on a global scale or a local scale
The problem: how do you model users' preferences as they change over time? • Goals for modeling concept drift • Drop temporary effects with very low impact on future behaviour • Capture longer-term trends that reflect the inherent nature of the data • Modeling localized concept drift • Global drifts can affect the whole population • Seasons / holidays / etc. • What about the unique local drifts that affect specific users differently? • Change in a user's music taste / family structure / etc.
Three usual approaches to the concept drift problem 1. Instance selection (time-window approach) • Discard instances that are deemed less relevant to current state • Reasonable with abrupt time shifts • Not great with gradual shifts
Three usual approaches to the concept drift problem 2. Instance weighting • Use a time decay function to underweight instances as they occur deeper in the past
Discarding or under-weighting the past trashes valuable signal • The best time decay function turns out to be no decay at all • Previous preferences tend to linger • Or, previous preferences help establish cross-user/cross-product patterns that are indirectly useful in modeling other users
Three usual approaches to the concept drift problem 3. Ensemble learning • Maintain a family of predictors that jointly produce the final outcome • Predictors that were more successful on recent instances get higher weights
Ensemble learning is not a good fit for collaborative filtering + temporal dynamics • Misses global patterns by using multiple models that each only consider a fraction of total behaviour • To keep track of independent, localized drifting behaviours, ensemble learning requires a separate ensemble for each user • Which complicates integrating information across users • Which is the cornerstone of how collaborative filtering works
Model Goals • Explain user behaviour along the full time period • Capture multiple separate drifting concepts • User-dependent or item-dependent • Sudden or gradual • Combine all drifting concepts within a single framework • Model interactions that cross users and items, identifying higher-level patterns • Do not try to extrapolate future temporal dynamics • Could be helpful, but too difficult • BUT capturing past temporal dynamics helps predict future behaviour
Two Collaborative Filtering Methods How to compare fundamentally different objects -- users and items? 1. Neighbourhood approach • Item-item • Transform users into item space by viewing them as baskets of rated items • Leverages the similarity between items to estimate a user's preference for a new item • User-user • Transform items into user space by viewing items as baskets of user ratings • Leverages the similarity between users to estimate a user's preference for a new item
Two Collaborative Filtering Methods How to compare fundamentally different objects -- users and items? 2. Latent factor models • Transform users and items to the same latent factor space to be directly comparable • Use singular value decomposition to automatically infer factors from user ratings that characterize movies and users • pu represents the user's affinity for each factor • qi represents the movie's relation to each factor
Static latent factor model Vanilla model: captures interactions between users and items • Set f = number of factors/dimensions in the space • Find vector pu ∈ ℝ^f for each user u and vector qi ∈ ℝ^f for each item i • A rating is predicted as r̂ui = qi^T pu ∈ ℝ • Learn pu and qi by minimizing the (L2-regularized) squared error: min Σ(u,i,t)∈K (rui(t) − qi^T pu)² + λ(‖qi‖² + ‖pu‖²), where K = {(u,i,t) | rui(t) is known} and λ is the regularization weight
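The minimization above can be sketched with stochastic gradient descent. This is a minimal illustration, not the authors' implementation: the choice of SGD, the function names, the learning rate `lr`, the regularization weight `lam`, and the toy ratings are all assumptions made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sse(ratings, p, q):
    """Sum of squared errors over known (user, item, rating) triples."""
    return sum((r - dot(q[i], p[u])) ** 2 for u, i, r in ratings)

def train_mf(ratings, n_users, n_items, f=2, lr=0.02, lam=0.02, epochs=500, seed=0):
    """Learn p_u and q_i by SGD on the L2-regularized squared error."""
    rng = random.Random(seed)
    p = [[rng.gauss(0, 0.1) for _ in range(f)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(q[i], p[u])
            for k in range(f):
                pu_k, qi_k = p[u][k], q[i][k]
                # Move each factor along the negative gradient of the loss.
                p[u][k] += lr * (err * qi_k - lam * pu_k)
                q[i][k] += lr * (err * pu_k - lam * qi_k)
    return p, q

# Hypothetical toy data: (user, item, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0)]
p0, q0 = train_mf(ratings, 2, 2, epochs=0)  # untrained starting point
p, q = train_mf(ratings, 2, 2)              # trained factors
```

Comparing `sse(ratings, p, q)` against `sse(ratings, p0, q0)` shows how much of the rating signal the learned factors absorb.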
Adding baseline predictors to LFM Baseline predictors: absorb user or item biases • Let µ = overall average rating • Let bu and bi be the observed biases of user u and item i • Then the baseline predictor for an unknown rating rui is bui = µ + bu + bi • Adding the user-item interaction term, ratings are predicted as r̂ui = µ + bu + bi + qi^T pu • Example (user Jordan, movie Black Panther): average rating over all movies: µ = 3.7 • Black Panther tends to be rated higher than average: bi = 0.5 • Jordan tends to be more critical than average: bu = -0.3 • Baseline estimate: bui = µ + bu + bi = 3.7 + 0.5 - 0.3 = 3.9
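The Jordan / Black Panther arithmetic can be checked in a few lines. The function name `baseline` is an assumption for illustration; the numbers are the ones on the slide.

```python
def baseline(mu, b_u, b_i):
    """Baseline estimate: overall mean plus user bias plus item bias."""
    return mu + b_u + b_i

# Slide values: mu = 3.7, Jordan's bias = -0.3, Black Panther's bias = 0.5.
estimate = baseline(mu=3.7, b_u=-0.3, b_i=0.5)
print(round(estimate, 1))  # 3.9
```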
Modeling time dynamics of baseline predictors Time-dependent baseline predictors: Absorbing user- and item- biases as they change over time • Let bu= bu(t) and bi= bi(t) be functions of time • Now the baseline predictor for unknown rating by user u of item i at time t is bui(t)= µ + bu(t) + bi(t) • Movie likeability doesn’t usually fluctuate significantly over time • User-biases can change daily, so model requires finer time resolution
Modeling time dynamics of item biases Modeling item bias bi(t) over time using time-based bins • How big to make the bins? • Want finer resolution → smaller bins • Need enough ratings per bin → larger bins • Authors' choice: • 30 bins total, 10 consecutive weeks per bin • For any day t, Bin(t) ∈ {1, …, 30} represents which bin t belongs to • So the bias of an item i on day t is bi(t) = bi + bi,Bin(t) • Baseline predictor using time-based bins to absorb item bias over time: bui(t) = µ + bu(t) + bi(t) = µ + bu(t) + bi + bi,Bin(t)
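A minimal sketch of the binning, assuming t is counted in whole days from the first day of the dataset (day 0) and that days beyond the last bin are clamped into bin 30. Both conventions, and the names `bin_index` and `item_bias`, are assumptions for illustration.

```python
DAYS_PER_BIN = 70  # 10 consecutive weeks per bin, as on the slide
N_BINS = 30        # 30 bins total, as on the slide

def bin_index(t):
    """Map day t (days since the first day, t >= 0) to a bin in 1..30."""
    return min(t // DAYS_PER_BIN, N_BINS - 1) + 1

def item_bias(b_i, b_i_bin, t):
    """b_i(t) = b_i + b_{i,Bin(t)}; bins with no learned offset count as 0."""
    return b_i + b_i_bin.get(bin_index(t), 0.0)
```

For example, `item_bias(0.5, {2: 0.1}, 75)` looks up bin 2, because day 75 falls in the second 70-day window.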
Modeling time dynamics of user biases Capturing user biases bu(t) over time with a linear model • Let tu = overall average date of ratings by user u • Then |t − tu| measures the number of days between day t and tu • Define the time deviation of a rating by user u on day t to be devu(t) = sign(t − tu) · |t − tu|^β, where β = 0.4 by cross-validation • Find αu = regression coefficient of devu(t), so the bias of a user u at time t is bu(t) = bu + αu · devu(t) • Now, the baseline predictor with linear user bias looks like this: bui(t) = µ + bu(t) + bi + bi,Bin(t) = µ + bu + αu · devu(t) + bi + bi,Bin(t)
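The time deviation and the linear user bias translate directly into code. A small sketch, with hypothetical function names and days represented as plain numbers:

```python
import math

BETA = 0.4  # exponent reported on the slide (chosen by cross-validation)

def dev(t, t_mean, beta=BETA):
    """dev_u(t) = sign(t - t_u) * |t - t_u| ** beta."""
    diff = t - t_mean
    return math.copysign(abs(diff) ** beta, diff)

def linear_user_bias(b_u, alpha_u, t, t_mean):
    """b_u(t) = b_u + alpha_u * dev_u(t)."""
    return b_u + alpha_u * dev(t, t_mean)
```

Because β < 1, the deviation grows sub-linearly: a rating 32 days after the user's mean date has a deviation of about 4, not 32, which damps the influence of very distant dates.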
Modeling time dynamics of user biases Capturing user biases bu(t) over time with a more flexible splines model instead of a simple linear model • User u gives nu ratings • Choose ku time points (control points), {t1, …, tku}, spaced uniformly across the total rating time of user u • btl = coefficient for each control point tl (learned from data) • bu(t) = bu + [Σl e^(−γ|t − tl|) · btl] / [Σl e^(−γ|t − tl|)], with l = 1, …, ku • Number of control points balances flexibility and computational efficiency • Authors' choice: ku = nu^0.25, grows with number of available ratings (some users rate more movies) • Parameter γ determines smoothness of spline • Authors' choice: γ = 0.3 by cross-validation
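A sketch of the spline-style user bias, read as a kernel-weighted average of control-point coefficients with weights exp(−γ·|t − tl|). The rounding of ku, the placement of the first and last control points at the ends of the rating span, and all function names are assumptions for illustration.

```python
import math

GAMMA = 0.3  # smoothness parameter reported on the slide

def num_control_points(n_ratings):
    """k_u = n_u ** 0.25, rounded to an integer (rounding is an assumption)."""
    return max(1, round(n_ratings ** 0.25))

def control_points(t_first, t_last, k):
    """k time points spaced uniformly over the user's rating span."""
    if k == 1:
        return [(t_first + t_last) / 2]
    step = (t_last - t_first) / (k - 1)
    return [t_first + l * step for l in range(k)]

def spline_user_bias(t, b_u, points, coeffs, gamma=GAMMA):
    """b_u plus the kernel-weighted average of the control-point coefficients."""
    dists = [abs(t - tl) for tl in points]
    nearest = min(dists)
    # Subtracting the nearest distance rescales every weight by the same
    # factor, so the ratio is unchanged while underflow to 0/0 is avoided.
    weights = [math.exp(-gamma * (d - nearest)) for d in dists]
    weighted = sum(w * c for w, c in zip(weights, coeffs))
    return b_u + weighted / sum(weights)
```

With γ = 0.3 the weights fall off quickly, so a rating's bias is dominated by the control points closest to its date.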
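But what about sudden drifts? • Previous smooth functions model gradual concept drift • However, sudden concept drifts emerge as "spikes" associated with a single day or session • To address short-lived effects, assign a single parameter bu,t per user u and day t to absorb day-specific variability • Linear model: bu(t) = bu + αu · devu(t) + bu,t • Splines model: bu(t) = bu + [Σl e^(−γ|t − tl|) · btl] / [Σl e^(−γ|t − tl|)] + bu,t (tl = control points, btl = their coefficients)
Find parameters and build baseline models Select baseline predictor model (e.g., linear): bui(t) = µ + bu + αu · devu(t) + bu,t + bi + bi,Bin(t) • Learn bu, αu, bu,t, bi and bi,Bin(t): min Σ(u,i,t)∈K (rui(t) − µ − bu − αu · devu(t) − bu,t − bi − bi,Bin(t))² + λ(bu² + αu² + bu,t² + bi² + bi,Bin(t)²)
Baseline models compared: static • mov • linear • spline • linear+: bui(t) = linear + bu,t • spline+: bui(t) = spline + bu,t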
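One way to learn these parameters is stochastic gradient descent on the regularized squared error. The sketch below assumes the linear model with the day-specific term bu,t; SGD itself, the learning rate, the regularization weight, and every function name are assumptions for illustration rather than details given on the slide.

```python
import math
from collections import defaultdict

def dev(t, t_mean, beta=0.4):
    diff = t - t_mean
    return math.copysign(abs(diff) ** beta, diff)

def bin_of(t, days_per_bin=70, n_bins=30):
    # Assumes t >= 0 is a whole number of days since the first rating day.
    return min(int(t) // days_per_bin, n_bins - 1) + 1

def new_params():
    names = ("b_u", "alpha_u", "b_ut", "b_i", "b_ibin")
    return {name: defaultdict(float) for name in names}

def predict(params, mu, t_mean, u, i, t):
    return (mu + params["b_u"][u] + params["alpha_u"][u] * dev(t, t_mean[u])
            + params["b_ut"][(u, t)] + params["b_i"][i]
            + params["b_ibin"][(i, bin_of(t))])

def rmse(params, ratings, mu, t_mean):
    se = sum((r - predict(params, mu, t_mean, u, i, t)) ** 2
             for u, i, t, r in ratings)
    return math.sqrt(se / len(ratings))

def fit_baseline(ratings, mu, t_mean, lr=0.005, lam=0.01, epochs=100):
    params = new_params()
    for _ in range(epochs):
        for u, i, t, r in ratings:
            err = r - predict(params, mu, t_mean, u, i, t)
            # Partial derivative of the prediction w.r.t. each parameter.
            grads = {
                ("b_u", u): 1.0,
                ("alpha_u", u): dev(t, t_mean[u]),
                ("b_ut", (u, t)): 1.0,
                ("b_i", i): 1.0,
                ("b_ibin", (i, bin_of(t))): 1.0,
            }
            for (name, key), g in grads.items():
                value = params[name][key]
                params[name][key] = value + lr * (err * g - lam * value)
    return params

# Hypothetical toy data: (user, item, day, rating).
ratings = [(0, 0, 0, 4.0), (0, 1, 64, 2.0), (1, 0, 10, 5.0), (1, 1, 40, 3.0)]
t_mean = {0: 32.0, 1: 25.0}  # mean rating day per user
mu = 3.5                     # overall mean rating
fitted = fit_baseline(ratings, mu, t_mean)
```

On real data the learning rate and regularization would normally be tuned per parameter type; the single `lr` and `lam` here are only for brevity.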
Results (baseline only!) • Add a time-dependent scaling feature cu(t) = cu + cu,t per user, applied to the item bias • cu = stable part of user u's scaling • cu,t = day-specific variability of that scaling • Baseline predictor is now bui(t) = µ + bu + αu · devu(t) + bu,t + (bi + bi,Bin(t)) · cu(t) • RMSE = 0.9555 for the baseline model; even before capturing any user-item interactions, it can explain almost as much variability as the commercial Netflix Cinematch recommender system (RMSE = 0.9514 on the same test set)
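A minimal sketch of the scaled baseline, assuming the scaling feature cu(t) multiplies the time-dependent item bias, as "scaling feature ... to item bias" suggests. The function name and the example numbers are hypothetical.

```python
def scaled_baseline(mu, b_u_t, b_i_t, c_u, c_ut=0.0):
    """mu + b_u(t) + b_i(t) * c_u(t), with c_u(t) = c_u + c_{u,t}."""
    return mu + b_u_t + b_i_t * (c_u + c_ut)

# With c_u = 1 and no day-specific term, the unscaled baseline is recovered.
plain = scaled_baseline(mu=3.7, b_u_t=-0.3, b_i_t=0.5, c_u=1.0)
# A user who amplifies item effects (c_u > 1) shifts the estimate further.
amplified = scaled_baseline(mu=3.7, b_u_t=-0.3, b_i_t=0.5, c_u=1.2)
```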
Adding back user-item interaction... • Temporal dynamics affect user preferences ⇒ Temporal dynamics affect user-item interactions • e.g., "psychological thrillers" fan → "crime dramas" fan • Similarly, define the latent factor vector pu as a function of time, pu(t) • So for a user u and a factor k, the element puk(t) of pu(t) becomes puk(t) = puk + αuk · devu(t) + puk,t
Putting it all together (baseline + user-item interaction) • SVD: r̂ui(t) = qi^T pu • SVD++: r̂ui(t) = qi^T (pu + |R(u)|^(−1/2) Σj∈R(u) yj), where R(u) is the set of items rated by user u • timeSVD++: r̂ui(t) = µ + bi(t) + bu(t) + qi^T (pu(t) + |R(u)|^(−1/2) Σj∈R(u) yj) • (f = number of factors)
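A sketch of a timeSVD++-style prediction, assuming the final score adds the time-dependent baseline to the interaction term, as the slide title "baseline + user-item interaction" indicates. All names and the example vectors are hypothetical; `y_rated` holds one implicit-feedback vector yj per item the user has rated.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def time_factor(p_uk, alpha_uk, dev_t, p_ukt=0.0):
    """One element of p_u(t): static part + linear drift + day-specific part."""
    return p_uk + alpha_uk * dev_t + p_ukt

def predict_timesvdpp(mu, b_u_t, b_i_t, q_i, p_u_t, y_rated):
    """mu + b_u(t) + b_i(t) + q_i^T (p_u(t) + |R(u)|^(-1/2) * sum of y_j)."""
    f = len(q_i)
    implicit = [0.0] * f
    if y_rated:
        scale = len(y_rated) ** -0.5
        implicit = [scale * sum(y[k] for y in y_rated) for k in range(f)]
    user_vec = [p + s for p, s in zip(p_u_t, implicit)]
    return mu + b_u_t + b_i_t + dot(q_i, user_vec)

# Hypothetical two-factor example with four rated items.
y_rated = [[0.1, 0.0], [0.1, 0.2], [0.1, 0.2], [0.1, 0.0]]
score = predict_timesvdpp(3.7, -0.3, 0.5, [1.0, 0.5], [0.2, -0.4], y_rated)
```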
Example: What can we learn from our results? Question: Why do ratings rise as movies become older? Two hypotheses: • People watch new movies indiscriminately, but only watch an older movie after a more careful selection process. An improved user-to-movie match would be captured by the interaction part of the model rising with movies' age. • Older movies are just inherently better than newer ones. This would be captured by the baseline part of the model.
Example: Neighbourhood approach Question: Why do ratings rise as movies become older? Answer:
Takeaways • Addressing temporal dynamics in the data can have a more significant impact on accuracy than designing more complex learning algorithms • (Yielded the best results published so far on a widely-analyzed high-quality movie rating dataset) • Modeling time as a dimension of the data can help uncover interesting inherent patterns in the data • Even past behaviour that is entirely different from current behaviour is still useful for predicting future behaviour
More effects: Periodic • Some items more popular in specific seasons or near certain holidays • e.g., period(t) = {fall, winter, spring, summer} • Users may have different attitudes or buying patterns during weekend vs working week • e.g., period(t) = {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}
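A sketch of how period(t) could be computed and used, assuming a periodic effect enters the item bias as one extra additive term per period. The month-to-season mapping (northern hemisphere, meteorological seasons) and all names are assumptions for illustration.

```python
from datetime import date

SEASONS = ("winter", "winter", "spring", "spring", "spring", "summer",
           "summer", "summer", "fall", "fall", "fall", "winter")
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

def season(d):
    """period(t) over seasons, indexed by calendar month (an assumption)."""
    return SEASONS[d.month - 1]

def weekday_name(d):
    """period(t) over days of the week."""
    return WEEKDAYS[d.weekday()]

def periodic_item_bias(b_i, b_i_period, d, period=season):
    """b_i plus a learned offset for the period that date d falls in."""
    return b_i + b_i_period.get(period(d), 0.0)
```

Passing `period=weekday_name` switches the same helper from seasonal to day-of-week effects.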
Neighbourhood approach • Less accurate than factor models • But popular for explaining the reasoning behind computed recommendations and for seamlessly accounting for newly entered ratings No temporal dynamics: With temporal dynamics: