“You might also like: …” Privacy risks of collaborative filtering Yuval Madar, June 2012 Based on a paper by J.A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten & V. Shmatikov
The setting: Recommendation systems • Suggest items, and other like-minded users, to a user by inferring their tastes from previous purchases. • Numerous examples • Amazon, iTunes, CNN, Last.fm, Pandora, Netflix, YouTube, Hunch, Hulu, LibraryThing, IMDb and many others.
Recommendation types • User to item – “You might also like…” • Item to item – Similar items list • User to user – Another customer with common interests
Recommendation methods • Content-based filtering • Based on a priori similarity between items; recommendations are derived from a user's own history. • Doesn't pose a privacy threat. • Collaborative filtering • Based on correlations between other users' purchases. • Our attacks will target this type of system. • Hybrid • A system employing both filtering techniques
Attack Model • The data the recommendation system uses is modeled as a matrix, where rows correspond to users and columns correspond to items. • Some auxiliary information on the target user is available (a subset of the target user's transaction history) • An attack is successful if it allows the attacker to learn transactions that are not part of its auxiliary information.
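The model above can be sketched in a few lines of Python. This is a minimal illustration with invented users and items, not data from the paper: the matrix is a map from users to purchase sets, and success means recovering a transaction outside the auxiliary subset.

```python
# Minimal sketch of the attack model (invented data).
# Rows are users, columns are items; membership in the set = a purchase.
matrix = {
    "alice": {"book_a", "book_b", "book_d"},
    "bob":   {"book_a", "book_c"},
    "carol": {"book_b", "book_c", "book_d"},
}

# Auxiliary information: a known subset of the target's history.
target = "alice"
auxiliary = {"book_a", "book_b"}

# An inference attack succeeds if it reveals a transaction
# outside the auxiliary set.
hidden = matrix[target] - auxiliary
print(sorted(hidden))  # the transactions the attacker wants to learn
```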
Sources for auxiliary information • Users' public ratings and comments on products • Shared transactions (via Facebook or other media) • Discussions on third-party sites • Favorite books in a Facebook profile • Offline interactions (with friends, neighbors, coworkers, etc.) • Other sources…
Attack 1 - Related-items list inference • Input: • a set of target items T and a set of auxiliary items A • Observe the related-items lists of the items in A until an item in T appears or moves up. • If a target item appears in enough related-items lists at around the same time, the attacker may infer it was bought by the target user. • Note 1 – Scoring may be far more complex, since different items in A are correlated. (Books which belong to a single series, bundle discounts, etc.) • Note 2 – It is preferable that A consist of obscure and uncommon items, to improve the effect of the target user's choices on their related-items lists.
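The observation step can be sketched as follows. This is a simplified, hypothetical scorer, not the paper's algorithm: it counts how many auxiliary items' related-items lists a target item newly entered or climbed in between two snapshots, and accepts the inference once a threshold is reached.

```python
# Hypothetical sketch of Attack 1: compare two snapshots of the
# related-items lists of the auxiliary items and flag target items
# that appear or move up in several lists at around the same time.
def infer_purchases(lists_before, lists_after, target_items, threshold=2):
    """Return target items that newly appear or rise in at least
    `threshold` related-items lists between the two observations."""
    inferred = []
    for t in target_items:
        signals = 0
        for aux_item, after in lists_after.items():
            if t not in after:
                continue
            before = lists_before.get(aux_item, [])
            # New appearance, or a better (lower) position than before.
            if t not in before or after.index(t) < before.index(t):
                signals += 1
        if signals >= threshold:
            inferred.append(t)
    return inferred

# Invented snapshots: "rare_lp" enters one list and climbs in another.
before = {"obscure_cd": ["x", "y"], "rare_book": ["z", "rare_lp"]}
after  = {"obscure_cd": ["rare_lp", "x"], "rare_book": ["rare_lp", "z"]}
print(infer_purchases(before, after, ["rare_lp"]))  # → ['rare_lp']
```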
Attack 2 - Covariance matrix inference • Some sites expose to their users the covariance matrix describing the correlation between items. (Hunch is one such website) • As in Attack 1, the attacker watches for an increase in the correlation between the auxiliary items and the target items. • Note 1 – Different matrix cells may be updated asynchronously. • Note 2 – Inference probability improves if the auxiliary items are user-unique (no other user bought all auxiliary items). This is more likely if some of them are unpopular, or if there are enough of them.
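A toy calculation shows the signal the attacker watches for. The purchase vectors below are invented (one entry per user, user 3 playing the target); when the target buys the target item, the exposed covariance cell between it and an auxiliary item the target already owns moves up.

```python
# Sketch of the Attack 2 signal (invented data, not from the paper).
def covariance(xs, ys):
    """Population covariance of two equal-length purchase vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# One entry per user: 1 = bought the item. User 3 is the target.
aux_item      = [1, 0, 0, 1]  # auxiliary item, already owned by the target
target_before = [0, 1, 0, 0]  # target item, before the target buys it
target_after  = [0, 1, 0, 1]  # target item, after the target buys it

delta = covariance(aux_item, target_after) - covariance(aux_item, target_before)
print(delta > 0)  # the exposed matrix cell moved up: evidence of a purchase
```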
Attack 3 - kNN recommender system inference • System model • For each user, the system finds the k users most similar to them, and ranks items purchased by those users by total number of sales. • Active attack • Create k dummy users, each buying all known auxiliary items. • With high probability, the k dummy users and the target user will be clustered together. (Given an auxiliary items list of size logarithmic in the total number of users. In practice, 8 items were found to be enough for most sites) • In that case, the recommendations to the dummy users will consist of transactions of the target user previously unknown to the attacker. • Note – The attack is more feasible in a system where user interactions with items do not involve spending money.
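The sybil attack can be simulated against a toy kNN recommender. The recommender below is a deliberately simplified stand-in (overlap similarity, sales counts over the neighborhood), with invented users and an invented "secret_item"; it only illustrates why a dummy user cloning the auxiliary set ends up neighboring the target and receiving the target's hidden purchases.

```python
# Toy kNN recommender + sybil attack (invented data and similarity metric).
def knn_recommend(user, histories, k):
    """Rank items bought by the k most similar users (overlap similarity),
    excluding items the querying user already owns."""
    others = [u for u in histories if u != user]
    others.sort(key=lambda u: len(histories[u] & histories[user]), reverse=True)
    counts = {}
    for u in others[:k]:
        for item in histories[u] - histories[user]:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

auxiliary = {"a", "b", "c"}  # known part of the target's history
histories = {
    "target": auxiliary | {"secret_item"},
    "other":  {"x", "y"},
}
k = 2
# Active step: create k dummy users buying exactly the auxiliary items.
for i in range(k):
    histories[f"sybil{i}"] = set(auxiliary)

# Each sybil is now clustered with the target, leaking "secret_item".
recs = knn_recommend("sybil0", histories, k)
print(recs)  # → ['secret_item']
```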
Attack Metrics • The main parameters for evaluating an inference attack are: • Yield – How many inferences are produced. • Accuracy – How likely each inference is to be correct. • Yield-accuracy tradeoff – stricter algorithms reject less probable inferences, increasing accuracy at the cost of yield.
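The tradeoff can be made concrete with a toy evaluation. The scored inferences below are invented numbers: raising the acceptance threshold drops low-confidence inferences, so yield falls while accuracy rises.

```python
# Toy yield/accuracy tradeoff (invented scores, not the paper's results).
inferences = [  # (confidence score, whether the inference is actually correct)
    (0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False),
]

def evaluate(threshold):
    """Accept inferences scoring at or above `threshold`;
    return (yield, accuracy) over the accepted set."""
    accepted = [(s, ok) for s, ok in inferences if s >= threshold]
    yield_ = len(accepted) / len(inferences)
    accuracy = sum(ok for _, ok in accepted) / len(accepted) if accepted else 0.0
    return yield_, accuracy

print(evaluate(0.0))  # lenient: full yield, lower accuracy
print(evaluate(0.7))  # strict: lower yield, perfect accuracy
```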
Actual attacks • The paper further discusses specific attacks performed against: • Hunch • LibraryThing • Last.fm • Amazon • It measures the accuracy and yield of these attacks, in some instances arriving at impressive tradeoff figures. (Such as 70% accuracy at 100% yield on Hunch)
Differential privacy • Not discussed in the paper. • Achieved in other papers for static recommendation databases. • Remains an open problem for dynamic systems. (Which all real-world systems are)
Decreasing the privacy risk • Limited-length related-items lists – The first elements of such lists have low sensitivity to single purchases. • Factoring item popularity into update frequency – Less popular items are more sensitive to single purchases; batching their purchases together will decrease the information leak.
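The batching mitigation can be sketched as a simple publication policy. The popularity cutoff and batch size below are hypothetical parameters, not values from the paper: popular items update immediately, while updates for unpopular items are held until several purchases have accumulated, so no single purchase is observable.

```python
# Sketch of the popularity-aware batching mitigation
# (hypothetical thresholds, chosen only for illustration).
def should_publish_update(pending_purchases, popularity, batch_size=10):
    """Publish related-items updates immediately for popular items;
    hold updates for unpopular items until a batch has accumulated."""
    if popularity >= 1000:  # popular: one extra sale barely moves the list
        return True
    return pending_purchases >= batch_size

print(should_publish_update(1, popularity=5))   # single purchase stays hidden
print(should_publish_update(10, popularity=5))  # released only as a batch
```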
Decreasing the privacy risk • Limit data access rate – Prevents large-scale privacy attacks, though it lowers utility and may be circumvented using a botnet. • User opt-out – A privacy-conscious user may decide to opt out of recommender systems entirely. (At a clear cost in utility)
Conclusion • Presented a passive attack on recommender systems that uses auxiliary information on a target user's purchases, allowing the attacker to infer undisclosed private transactions. • Increased user awareness is required. • Suggested several methods to decrease the information leaked by these systems.