Towards Publishing Recommendation Data With Predictive Anonymization
Chih-Cheng Chang†, Brian Thompson†, Hui Wang‡, Danfeng Yao†
ACM Symposium on Information, Computer, and Communications Security (ASIACCS 2010)
Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work
Motivation
• Inevitable trend towards data sharing
  • Medical records
  • Social networks
  • Web search data
  • Online shopping, ads
• Databases contain sensitive information
• Growing need to protect privacy
Privacy in Relational Databases
[Figure: table whose columns are labeled "identifiers" and "sensitive information"]
Privacy in Relational Databases
“Pseudo-identifiers” (quasi-identifiers): 87% of the U.S. population can be uniquely identified by date of birth, gender, and ZIP code! [S00]
Approaches to Achieving Privacy
• Statistical databases
  • Only aggregate queries are allowed, e.g. "What is the average salary?"
  • Differential privacy [Dinur-Nissim ’03, Dwork ’06]: adaptively add random noise to the output so that the querier cannot determine whether a given user is in the database
  • Quality decreases over multiple queries
• Publishing of anonymized databases
  • No restriction on how the data is used; good for complex data mining applications
  • How to address privacy concerns?
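To make the noise-adding idea concrete, here is a minimal sketch of the Laplace mechanism applied to the "average salary" query. It is an illustration only, not the authors' method. The function names, the salary values, the clipping bounds, and the assumption that the number of records is fixed and public are all hypothetical. The `noise_scale` helper illustrates why quality decreases over multiple queries: splitting a fixed privacy budget evenly over more queries inflates the noise added to each answer.

```python
import random

def noise_scale(sensitivity, total_epsilon, num_queries=1):
    """Laplace scale when a total privacy budget is split evenly across queries."""
    return sensitivity * num_queries / total_epsilon

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Return the mean of `values` plus Laplace noise.

    Assumption: the record count n is fixed and public, and values are
    clipped to [lower, upper], so replacing one record changes the mean
    by at most (upper - lower) / n.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(noise_scale(sensitivity, epsilon), rng)

rng = random.Random(0)
salaries = [40000, 55000, 62000, 71000, 90000]  # hypothetical data
print(private_mean(salaries, 0, 200000, epsilon=1.0, rng=rng))
# Ten queries under the same total budget need ten times the noise scale.
print(noise_scale(40000, total_epsilon=1.0, num_queries=10))
```

With only five records the noise dominates the answer; the mechanism becomes useful when n is large, because the sensitivity of the mean shrinks as 1/n.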
Anonymization of Databases
Techniques:
• Perturbation (e.g. the values 52, 24, 45 become 53, 26, 42)
• Swapping (e.g. 52 and 45 are exchanged)
• Generalization (e.g. 52 → 50s, 24 → 20s, 45 → 40s)
Def. A database entry is k-anonymous if at least k-1 other entries match it identically on the insensitive attributes. [SS98]
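As a hedged illustration of generalization and of the k-anonymity definition, the sketch below maps exact ages to decades and then checks whether every record shares its generalized value with at least k-1 others. The table contents, the field names `age` and `condition`, and both function names are hypothetical.

```python
from collections import Counter

def generalize_age(age):
    """Map an exact age to its decade, e.g. 52 -> '50s'."""
    return f"{(age // 10) * 10}s"

def is_k_anonymous(records, attributes, k):
    """True if every record matches at least k-1 others on `attributes`."""
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical table: 'age' is the insensitive attribute, 'condition' is sensitive.
raw = [
    {"age": 52, "condition": "flu"},
    {"age": 57, "condition": "asthma"},
    {"age": 24, "condition": "flu"},
    {"age": 26, "condition": "diabetes"},
    {"age": 45, "condition": "asthma"},
    {"age": 42, "condition": "flu"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]

print(is_k_anonymous(raw, ["age"], k=2))
print(is_k_anonymous(generalized, ["age"], k=2))
```

Every exact age in the raw table is unique, so it fails the check for k = 2; after generalization each decade is shared by two records.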
The Generalization Approach
Recommender Systems
• Users register for the service
• After buying a good, they submit a rating for it
• They get recommendations based on their own and other users' ratings
Recommender Systems
The Netflix Challenge: “Anonymized” Netflix data is released to the public, with a $1 million prize for the best movie prediction algorithm.
Question: Is privacy really protected? NO!
Privacy in Recommender Systems
Narayanan and Shmatikov [NS08] exploited external information to re-identify users in the released Netflix Challenge dataset. Privacy breach!
News Timeline
How can we enable sharing of recommendation data without compromising users’ privacy?
Challenges in Anonymization of Recommender Systems
• All data may be considered “sensitive” by users.
• All data could be used as quasi-identifiers.
• Data sparsity helps re-identification attacks and makes anonymization difficult. [NS08]
• Scalability: the Netflix matrix has 8.5 billion cells!
Attack Models
We represent the recommendation database as a labeled bipartite graph.
[Figure: bipartite graph linking users 0001 to 0004 to the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with edges labeled by ratings. External information about Ben (Godfather, English Patient) and Tim (Star Wars, English Patient) illustrates the “structure-based attack” and the “label-based attack”.]
Privacy Models
• Node re-identification privacy: It should not be possible to re-identify individuals.
• Link existence privacy: It should not be possible to infer whether a user has seen a particular movie.
Our approach, Predictive Anonymization, provides these notions of privacy against both the structure-based and label-based attacks.
Predictive Anonymization
Our solution takes a 3-step approach:
• Use predictive padding to reduce sparsity.
• Cluster users into groups of size k.
• Perform homogenization by assigning the same ratings to all users in each group.
Achieves k-anonymity!
Predictive Anonymization
• Want to cluster users, but there is not enough information due to data sparsity.
• Solution: Fill empty cells with predicted values.
• Cluster users based on similar tastes, not necessarily similar lists of movies rated.
Predictive Anonymization
• Use predictive padding to reduce sparsity.
• Cluster users into groups of size k.
• Perform homogenization by assigning the same ratings to all users in each group.
The final step, homogenization, can be done in one of several ways. We describe two methods, “padded” and “pure” homogenization.
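The three steps can be sketched end to end on a toy rating table. This is a rough illustration under loudly stated assumptions, not the authors' implementation: padding here uses each movie's mean rating as a simple stand-in predictor (the experiments in the talk use SVD), clustering is a greedy nearest-neighbour grouping chosen only for brevity, and homogenization averages over the padded data. All ratings and function names are hypothetical.

```python
import math

def pad(ratings, movies):
    """Step 1: fill empty cells with a prediction (stand-in: the movie's mean)."""
    all_values = [r for row in ratings.values() for r in row.values()]
    global_mean = sum(all_values) / len(all_values)
    movie_mean = {}
    for m in movies:
        seen = [row[m] for row in ratings.values() if m in row]
        movie_mean[m] = sum(seen) / len(seen) if seen else global_mean
    return {u: [row.get(m, movie_mean[m]) for m in movies]
            for u, row in ratings.items()}

def cluster(padded, k):
    """Step 2: greedily group each seed user with its k-1 nearest neighbours.

    Assumes at least k users; the last group absorbs the k..2k-1 leftovers.
    """
    unassigned = sorted(padded)
    clusters = []
    while len(unassigned) >= 2 * k:
        seed = unassigned[0]
        nearest = sorted(unassigned[1:],
                         key=lambda u: math.dist(padded[seed], padded[u]))
        group = [seed] + nearest[:k - 1]
        clusters.append(group)
        unassigned = [u for u in unassigned if u not in group]
    clusters.append(unassigned)
    return clusters

def homogenize(padded, clusters):
    """Step 3: give every user in a cluster the cluster's average ratings."""
    published = {}
    for group in clusters:
        width = len(padded[group[0]])
        avg = [sum(padded[u][j] for u in group) / len(group) for j in range(width)]
        for u in group:
            published[u] = avg
    return published

movies = ["Star Wars", "Godfather", "English Patient", "Pretty in Pink"]
ratings = {  # hypothetical sparse ratings
    "0001": {"Star Wars": 5, "Godfather": 4},
    "0002": {"Star Wars": 4, "Godfather": 5},
    "0003": {"English Patient": 5, "Pretty in Pink": 4},
    "0004": {"Star Wars": 1, "English Patient": 4, "Pretty in Pink": 5},
}
padded = pad(ratings, movies)
clusters = cluster(padded, k=2)
published = homogenize(padded, clusters)
print(clusters)
print(published)
```

Because every user in a group ends up with an identical rating vector, each published record is indistinguishable from at least k-1 others, which is the k-anonymity property the slides claim.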
Predictive Anonymization: “Padded Homogenization”
[Figure: recommendation graph for users 0001 to 0004 and the movies Star Wars, Godfather, English Patient, and Pretty in Pink after padded homogenization]
• All edges are added to the recommendation graph.
• Each cluster is averaged using the padded data.
Predictive Anonymization: “Pure Homogenization”
[Figure: recommendation graph for users 0001 to 0004 and the movies Star Wars, Godfather, English Patient, and Pretty in Pink after pure homogenization]
• Only necessary edges are added to the graph.
• Each cluster is averaged using the original data.
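One possible reading of pure homogenization is sketched below; the slide does not spell out the procedure, so this interpretation is an assumption. Within each cluster, an edge is kept only for movies that at least one member actually rated, and the published rating is the average of the original ratings for that movie. The ratings, the cluster assignment, and the function name are hypothetical.

```python
def pure_homogenize(ratings, clusters):
    """Average original ratings per cluster, adding only the edges needed
    to make all members of a cluster identical."""
    published = {}
    for group in clusters:
        rated = sorted({m for u in group for m in ratings[u]})
        avg = {}
        for m in rated:
            seen = [ratings[u][m] for u in group if m in ratings[u]]
            avg[m] = sum(seen) / len(seen)
        for u in group:
            published[u] = dict(avg)
    return published

ratings = {  # hypothetical sparse ratings
    "0001": {"Star Wars": 5, "Godfather": 4},
    "0002": {"Star Wars": 4, "Godfather": 5},
    "0003": {"English Patient": 5, "Pretty in Pink": 4},
    "0004": {"Star Wars": 1, "English Patient": 4, "Pretty in Pink": 5},
}
clusters = [["0001", "0002"], ["0003", "0004"]]  # assumed output of a clustering step
pure = pure_homogenize(ratings, clusters)
print(pure["0001"])
print(pure["0003"])
```

In this toy example user 0003 gains one edge (Star Wars) so that it matches user 0004, whereas the padded variant would give every user an edge to all four movies. The published graph therefore stays sparse.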
Experiments
• Performed on the Netflix Challenge dataset:
  • 480,189 users and 17,770 movies
  • more than 100 million ratings
• Singular value decomposition (SVD) is used for padding and prediction.
• We compute the root mean squared error (RMSE) for a test set of 1 million ratings on the original and anonymized data.
$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} \left(\hat{r}_{u,m} - r_{u,m}\right)^2}$, where $T$ is the test set, $r_{u,m}$ is the actual rating of movie $m$ by user $u$, and $\hat{r}_{u,m}$ is the predicted rating.
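For reference, RMSE as it is commonly defined can be computed in a few lines. The function name and the sample ratings are hypothetical; the predictions would come from whatever model is trained on the original or the anonymized data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings and model predictions.
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 2.0]
print(rmse(predicted, actual))
```

A lower value means the predictions are closer to the held-out ratings, so comparing the RMSE obtained from the original data with the RMSE obtained from the anonymized data measures how much prediction accuracy the anonymization costs.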
Analysis: Prediction Accuracy
• Padded Anonymization preserves prediction accuracy.
• However, sparsity is eliminated, which affects the utility of the published dataset for data mining applications.
Summary
[Figure: summary in terms of utility and privacy]
Conclusions
• We have formalized privacy and attack models for recommender systems.
• Our solutions show that privacy-preserving publishing of anonymized recommendation data is feasible.
• More work is required to find a practical solution that satisfies real-world privacy and utility goals.
Future Work
• Investigate the use of differential privacy-like guarantees for recommendation databases
• Analyze how to protect against more complex attacks with greater background knowledge
• Evaluate the utility of anonymized recommendation data for advanced data mining applications
Thank you!