Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai
Problem: finding stuff on the Internet • Know what you want: • content-based filtering • search • Don’t know what you want: • browse • How to handle: “I don’t know what I want, but show me something interesting!”
Google News • Top Stories • Recommendations for registered users • Based on user click history, community clicks
Problem Scale • Lots of users (more is better) • Millions of clicks from millions of users • Problem: high churn in item set • Several million items (clusters of news articles about the same story, as identified by GN) per month • Continuous addition, deletion • Strict timing budget (a few hundred ms) • Existing systems not suitable
Memory-based Ratings • General form: r(u_a, s_k) ∝ Σ_i w(u_a, u_i) · I(u_i, s_k), where r(u_a, s_k) is the rating of item s_k for user u_a, w(u_a, u_i) is the similarity between users u_a and u_i, and I(u_i, s_k) is 1 if user u_i clicked item s_k (0 otherwise) • Problem: scalability; the sum runs over all users, even when similarity is computed offline
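A minimal sketch of the memory-based rating above, with Jaccard similarity over click sets standing in for the generic w(u_a, u_i); the names here are illustrative, not from the paper:

```python
def jaccard(a, b):
    """Jaccard similarity between two click sets (an example choice of w)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def memory_based_rating(user, item, clicks):
    """Score `item` for `user` as a similarity-weighted sum over every
    other user who clicked it -- O(#users) per prediction, which is
    exactly the scalability problem the slide points out."""
    return sum(jaccard(clicks[user], clicks[other])
               for other in clicks
               if other != user and item in clicks[other])
```

Even with similarities precomputed, each prediction touches every user, which is why the paper moves to cluster-based scoring.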
Model-based techniques • Clustering / segmentation, e.g. based on interests • Bayesian models, Markov Decision, … • All are computationally expensive
What’s in this paper? • Investigate 2 different ways to cluster users: MinHash, and PLSI • Implement both on MapReduce
Google News Rating Model • 1 click = 1 positive vote • Noisier than 1-5 ratings (Netflix) • No explicit negatives • Why might it work? Partly due to the fairly substantial article snippets shown, so a user who clicks is likely genuinely interested
Design guidelines for a scalable rating system • Associate users into clusters of similar users (based on prior clicks, offline) • Users can belong to multiple clusters • Generate rating using much smaller sets of user clusters, rather than all users: r(u_a, s_k) ∝ Σ_{c: u_a ∈ c} w(u_a, c) · Σ_{u ∈ c} I(u, s_k)
Technique 1: MinHash • Probabilistically assign users to clusters based on click history • Use the Jaccard coefficient as similarity: the corresponding distance (1 − Jaccard) is a metric • Computing this pairwise over all users is too expensive, not feasible even offline
MinHash as a form of Locality Sensitive Hashing • Basic idea: assign a hash value to each user based on click history • How: randomly permute the set of all items; assign the id of the first item in this order that appears in the user’s click history as the hash value for the user • Probability that 2 users have the same hash is equal to the Jaccard coefficient
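A sketch of this idea, using a seeded linear hash as a cheap stand-in for a random permutation (the seeded-hash trick is mentioned on the next slide; the function names are illustrative):

```python
import random

_PRIME = (1 << 61) - 1  # large Mersenne prime for the hash modulus

def minhash(click_set, seed):
    """One MinHash value for a set of integer item ids: hash every item
    with a seeded linear hash and keep the minimum, which plays the role
    of 'first item under a random permutation'."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, _PRIME), rng.randrange(_PRIME)
    return min((a * item + b) % _PRIME for item in click_set)
```

Over many independent seeds, the fraction of seeds on which two users collide approaches the Jaccard coefficient of their click sets, which is the locality-sensitive property the slide describes.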
Using MinHash for clusters • Concatenate p>1 such hashes as cluster id for increased precision • Apply q>1 in parallel (users belong to q clusters) to improve recall • Don’t actually maintain p*q permutations: hash item id with random seed to get proxy for permutation index, for p*q different seeds
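The p-concatenation and q-replication steps above can be sketched as follows; `minhash` is the single-hash primitive from the previous slide, repeated here so the snippet stands alone (names are illustrative):

```python
import random

_PRIME = (1 << 61) - 1

def minhash(click_set, seed):
    """One seeded MinHash value (proxy for one random permutation)."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, _PRIME), rng.randrange(_PRIME)
    return min((a * item + b) % _PRIME for item in click_set)

def cluster_ids(click_set, p=3, q=4):
    """q cluster ids per user, each the concatenation of p MinHash values.
    Only p*q seeds are kept, not p*q explicit permutations."""
    return [tuple(minhash(click_set, seed=qi * p + pi) for pi in range(p))
            for qi in range(q)]
```

Concatenating p hashes makes a collision require agreement on all p values (higher precision); running q such groups in parallel gives each user q chances to land near similar users (higher recall).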
MinHash on MapReduce • Generate p × q hashes for each user based on click history; concatenate into q cluster ids of p hashes each • Map using cluster ids as keys • Reduce to form membership lists for each cluster id
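The map and reduce steps can be sketched in a few lines, assuming each input record already carries a user's q cluster ids (this is a single-machine illustration, not the sharded implementation):

```python
from collections import defaultdict

def map_phase(records):
    """Map: for each (user, [cluster ids]) record, emit (cluster_id, user)
    pairs, so cluster id becomes the shuffle key."""
    for user, cids in records:
        for cid in cids:
            yield cid, user

def reduce_phase(pairs):
    """Reduce: group users by cluster id into membership lists."""
    members = defaultdict(list)
    for cid, user in sorted(pairs):
        members[cid].append(user)
    return dict(members)
```

In the real system the shuffle between map and reduce is what the MapReduce framework provides; the reducer itself is just list concatenation per key.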
Technique 2: PLSI clustering • Probabilistic Latent Semantic Indexing • Main idea: a hidden state z that correlates users and items, p(s|u) = Σ_z p(s|z) · p(z|u) • Generate this clustering from the training set using the EM algorithm given by Hofmann • Iterative technique: generates new probability estimates from the previous estimates
PLSI as MapReduce • Q* can be independently computed for each (u,s), given prior N(z,s), N(z), p(z|u): map to RxK machines (R, K partitions for u, s respectively) • Reduce is simply addition
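A sketch of the per-(u,s) E-step that each mapper computes, with p(s|z) approximated by N(z,s)/N(z) from the previous iteration (variable names chosen here for readability, not taken from the paper):

```python
def e_step(u, s, N_zs, N_z, p_z_given_u):
    """q*(z | u, s): posterior over hidden clusters z for one (user, item)
    pair, computable independently per pair -- which is what makes the
    map phase embarrassingly parallel."""
    scores = {z: (N_zs.get((z, s), 0.0) / N_z[z]) * p_z_given_u[u][z]
              for z in N_z}
    total = sum(scores.values())
    return {z: v / total for z, v in scores.items()} if total else scores
```

The reduce phase then just sums these q* values over clicks to produce the next iteration's N(z,s), N(z), and p(z|u), matching the slide's "reduce is simply addition".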
PLSI in a dynamic environment • Treat z as user clusters • On each click, update p(s|z) for all clusters the user belongs to • This approximates PLSI, but is updated dynamically as new items are added • Does not handle the addition of new users
Cluster-based recommendation • For each cluster, maintain number of clicks, decayed by time, for each item visited by a member • For a candidate item, lookup user’s clusters, add up age-discounted visitation counts, normalized by total clicks • Do this using both MinHash and PLSI clustering
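The scoring step above can be sketched as follows; the exponential half-life decay is an assumed concrete form of the paper's age discounting, and all names are illustrative:

```python
def decayed(count, age_hours, half_life_hours=24.0):
    """Age-discount a click count (assuming exponential half-life decay)."""
    return count * 0.5 ** (age_hours / half_life_hours)

def cluster_score(user_clusters, item, item_clicks, total_clicks):
    """Sum the (already discounted) click counts for `item` across the
    user's clusters, each normalized by that cluster's total clicks."""
    return sum(item_clicks[c].get(item, 0.0) / total_clicks[c]
               for c in user_clusters
               if total_clicks.get(c, 0.0) > 0)
```

The same scoring function is applied twice, once with MinHash cluster memberships and once with PLSI memberships, since both just produce a set of clusters per user.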
One more technique: Covisitation • Memory-based technique • Create adjacency matrix between all pairs of items (can be directed) • Increment corresponding count if one item visited soon after another • Recommendation: for candidate item j, sum of all counts from i to j for all items i in recent click history of user, normalized appropriately
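A sketch of the covisitation bookkeeping and scoring, normalizing each count by the source item's total outgoing covisitations (one plausible reading of "normalized appropriately"; the class and method names are illustrative):

```python
from collections import defaultdict

class Covisitation:
    """Directed item-item covisitation counts, scored against a user's
    recent click history."""

    def __init__(self):
        self.counts = defaultdict(float)      # (i, j) -> covisit count
        self.row_totals = defaultdict(float)  # i -> total outgoing count

    def record(self, i, j):
        """Item j was visited soon after item i."""
        self.counts[(i, j)] += 1.0
        self.row_totals[i] += 1.0

    def score(self, recent_clicks, candidate):
        """Sum normalized counts from each recently clicked item to the
        candidate item."""
        return sum(self.counts[(i, candidate)] / self.row_totals[i]
                   for i in recent_clicks
                   if self.row_totals[i] > 0)
```

In the deployed system these counts would also be age-discounted like the cluster statistics; that detail is omitted here for brevity.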
Whole System • Offline clustering • Online click history update, cluster item stats update, covisitation update
Results • Generally around 30-50% better than popularity-based recommendations
Discussion • Covisitation appears to work as well as clustering • Operational details missing: how big are cluster memberships, etc. • All of the clustering is done offline