Towards Publishing Recommendation Data With Predictive Anonymization

Chih-Cheng Chang†, Brian Thompson†, Hui Wang‡, Danfeng Yao†
ACM Symposium on Information, Computer, and Communications Security (ASIACCS 2010)

Outline
• Introduction
• Privacy in recommender systems
• Predictive Anonymization
• Experimental results
• Conclusions and future work

Motivation
• Inevitable trend towards data sharing
  • Medical records
  • Social networks
  • Web search data
  • Online shopping, ads
• Databases contain sensitive information
• Growing need to protect privacy

Privacy in Relational Databases
[Figure: example relational table with columns marked as identifiers and as sensitive information]

Privacy in Relational Databases
“Pseudo-identifiers” (also called quasi-identifiers): 87% of the U.S. population can be uniquely identified by date of birth, gender, and zip code! [S00]

Approaches to Achieving Privacy
• Statistical databases
  • Only aggregate queries: What is the average salary?
  • Differential privacy [Dinur-Nissim ’03, Dwork ’06]: adaptively add random noise to the output so that the querier cannot determine whether a user is in the database
  • Quality decreases over multiple queries
• Publishing of anonymized databases
  • No restriction on how data is utilized; good for complex data mining applications
  • How to address privacy concerns?

Anonymization of Databases
Techniques:
• Perturbation
[Figure: example values 52, 24, 45 shown alongside perturbed values 53, 26, 42]

Anonymization of Databases
Techniques:
• Perturbation
• Swapping
[Figure: example in which the values 52 and 45 are swapped]

Anonymization of Databases
Techniques:
• Perturbation
• Swapping
• Generalization
[Figure: example values 52, 24, 45 generalized to 50s, 20s, 40s]
Def. A database entry is k-anonymous if at least k-1 other entries match it identically on the insensitive attributes. [SS98]
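The k-anonymity definition above can be checked mechanically. The following is a minimal sketch, assuming hypothetical records of the form (age, zip code prefix); the helper names `generalize_age` and `is_k_anonymous` and the sample data are illustrative and do not come from the slides.

```python
from collections import Counter

def generalize_age(age):
    """Map an exact age to a decade label, e.g. 52 -> '50s'."""
    return f"{(age // 10) * 10}s"

def is_k_anonymous(rows, k):
    """True if every row shares its attribute tuple with at least k-1 other rows."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())

# Hypothetical records: (age, zip code prefix). Not data from the slides.
raw = [(52, "088"), (57, "088"), (24, "089"), (26, "089"), (45, "070"), (41, "070")]
generalized = [(generalize_age(age), zip_prefix) for age, zip_prefix in raw]

print(is_k_anonymous(raw, 2))          # exact ages: every row is unique
print(is_k_anonymous(generalized, 2))  # decade labels: rows match in pairs
```

Generalizing to coarser values trades detail for larger groups of matching entries, which is the same trade-off the later slides make for ratings.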

The Generalization Approach

Recommender Systems
• Users register for the service
• After buying a good, they submit a rating for it
• They get recommendations based on their own and others’ ratings

Recommender Systems
The Netflix Challenge: “Anonymized” Netflix data is released to the public. $1 million prize for the best movie prediction algorithm.
Question: Is privacy really protected? NO!

Privacy in Recommender Systems
Narayanan and Shmatikov [NS08] exploited external information to re-identify users in the released Netflix Challenge dataset. Privacy breach!

News Timeline
How can we enable sharing of recommendation data without compromising users’ privacy?

Challenges in Anonymization of Recommender Systems
• All data may be considered “sensitive” by users.
• All data could be used as quasi-identifiers.
• Data sparsity helps re-identification attacks and makes anonymization difficult. [NS08]
• Scalability: the Netflix matrix has 8.5 billion cells!
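The scalability and sparsity points can be made concrete with the dataset dimensions reported on the Experiments slide (480,189 users, 17,770 movies, more than 100 million ratings). A short worked calculation, assuming exactly 100 million ratings as a lower bound:

```python
users = 480_189
movies = 17_770
ratings = 100_000_000  # assumption: lower bound, the slides say "more than 100 million"

cells = users * movies     # size of the full user-by-movie matrix
density = ratings / cells  # fraction of cells that hold a rating

print(f"{cells:,} cells")
print(f"{density:.2%} of cells filled (lower bound)")
```

Under that assumption only about one cell in a hundred holds a rating, which is why two users with similar tastes may still share very few rated movies.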

Attack Models
We represent the recommendation database as a labeled bipartite graph.
[Figure: bipartite graph linking users 0001 to 0004 to the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with ratings as edge labels. Adversary knowledge about Ben (Godfather, English Patient) and Tim (Star Wars, English Patient) illustrates the “structure-based attack” and the “label-based attack”.]
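A labeled bipartite graph of this kind can be sketched as a nested dictionary. The example below assumes one reading of the two attack names: in a structure-based attack the adversary knows which movies a target has rated, and in a label-based attack the adversary also knows the ratings. The toy ratings and the function names are hypothetical and are not recovered from the figure.

```python
# Hypothetical toy data: user ID -> {movie: rating}. Not taken from the slides' figure.
graph = {
    "0001": {"Star Wars": 3, "Godfather": 5},
    "0002": {"Godfather": 4, "English Patient": 5},
    "0003": {"Star Wars": 4, "English Patient": 1},
    "0004": {"English Patient": 5, "Pretty in Pink": 4},
}

def structure_candidates(graph, known_movies):
    """Users whose rated movies include every movie the adversary knows about."""
    return [user for user, edges in graph.items() if set(known_movies) <= set(edges)]

def label_candidates(graph, known_ratings):
    """Users whose ratings match every (movie, rating) pair the adversary knows."""
    return [user for user, edges in graph.items()
            if all(edges.get(movie) == rating for movie, rating in known_ratings.items())]

# Knowing only that the target rated two movies can already narrow the candidates.
print(structure_candidates(graph, ["Godfather", "English Patient"]))
# Knowing a rating value narrows further than knowing the movie alone.
print(structure_candidates(graph, ["English Patient"]))
print(label_candidates(graph, {"English Patient": 5}))
```

A user is re-identified when the candidate list shrinks to a single entry, so anonymization aims to keep every such list at size k or larger.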

Privacy Models
• Node re-identification privacy: It should not be possible to re-identify individuals.
• Link existence privacy: It should not be possible to infer whether a user has seen a particular movie.
Our approach, Predictive Anonymization, provides these notions of privacy against both the structure-based and label-based attacks.

Predictive Anonymization
Our solution takes a 3-step approach:
• Use predictive padding to reduce sparsity.
• Cluster users into groups of size k.
• Perform homogenization by assigning the same ratings to all users in each group.
Achieves k-anonymity!

Predictive Anonymization
• We want to cluster users, but there is not enough information due to data sparsity.
• Solution: Fill empty cells with predicted values.
• Cluster users based on similar tastes, not necessarily similar lists of movies rated.
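The padding idea can be illustrated on a tiny matrix. This is a rough sketch only: it fills gaps with column means and then takes a low-rank SVD reconstruction, which is a simplification and not necessarily the SVD predictor used in the experiments. The function name `pad_with_svd`, the `rank` parameter, the toy matrix, and the 1 to 5 clipping range are assumptions.

```python
import numpy as np

def pad_with_svd(ratings, rank=2):
    """Fill missing entries (NaN) using a rank-`rank` SVD of the mean-filled matrix."""
    observed = ~np.isnan(ratings)
    col_means = np.nanmean(ratings, axis=0)          # per-movie average rating
    filled = np.where(observed, ratings, col_means)  # temporary fill for the SVD
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]    # low-rank reconstruction
    approx = np.clip(approx, 1, 5)                   # assumed 1-to-5 rating scale
    return np.where(observed, ratings, approx)       # original ratings are kept

# Hypothetical toy matrix: rows are users, columns are movies, NaN means "not rated".
toy = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 5.0, np.nan],
    [np.nan, 1.0, 2.0, 5.0],
    [1.0, 2.0, np.nan, 4.0],
])
padded = pad_with_svd(toy, rank=2)
print(np.round(padded, 2))
```

After padding, every user has a value for every movie, so distances between users reflect predicted tastes and not just the overlap of their rated movie lists.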

Predictive Anonymization
• Use predictive padding to reduce sparsity.
• Cluster users into groups of size k.
• Perform homogenization by assigning the same ratings to all users in each group.
The final step, homogenization, can be done in one of several ways. We describe two methods, “padded” and “pure” homogenization.
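The clustering step is not specified in detail on the slides. The sketch below uses a simple greedy nearest-neighbor grouping on the padded matrix as a stand-in; it produces groups of at least k users and is not claimed to be the authors' algorithm. The function name `cluster_users` and the toy data are hypothetical.

```python
import numpy as np

def cluster_users(padded, k):
    """Greedily partition row indices into clusters of at least k similar users."""
    if len(padded) < k:
        raise ValueError("need at least k users")
    remaining = list(range(len(padded)))
    clusters = []
    while len(remaining) >= k:
        seed = remaining[0]
        dists = [np.linalg.norm(padded[seed] - padded[j]) for j in remaining]
        # Stable sort keeps the seed (distance 0, position 0) first.
        order = np.argsort(dists, kind="stable")[:k]
        members = [remaining[i] for i in order]
        clusters.append(members)
        remaining = [j for j in remaining if j not in members]
    # Fewer than k users are left: attach each one to the nearest cluster centroid.
    for j in remaining:
        centroids = [padded[c].mean(axis=0) for c in clusters]
        nearest = int(np.argmin([np.linalg.norm(padded[j] - c) for c in centroids]))
        clusters[nearest].append(j)
    return clusters

# Hypothetical padded ratings: rows are users, columns are movies.
toy_padded = np.array([
    [5.0, 5.0, 1.0],
    [5.0, 4.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 5.0],
    [5.0, 5.0, 2.0],
])
clusters = cluster_users(toy_padded, k=2)
print(clusters)
```

Because every cluster has at least k members, homogenizing each cluster afterwards leaves every user indistinguishable from at least k-1 others.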

Predictive Anonymization
“Padded Homogenization”
[Figure: bipartite graph of users 0001 to 0004 and the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with ratings as edge labels]
• All edges are added to the recommendation graph.
• Each cluster is averaged using the padded data.
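Padded homogenization, as described in the two bullets, can be sketched as a per-cluster average over the padded matrix. The function name and toy values are hypothetical, and the clusters are assumed to be given as lists of row indices.

```python
import numpy as np

def padded_homogenize(padded, clusters):
    """Replace each user's row with the mean of the padded rows in their cluster."""
    out = np.empty_like(padded, dtype=float)
    for members in clusters:
        out[members] = padded[members].mean(axis=0)  # same row for every member
    return out

# Hypothetical padded ratings for four users and two movies, in two clusters.
toy_padded = np.array([
    [4.0, 3.0],
    [5.0, 4.0],
    [1.0, 2.0],
    [2.0, 1.0],
])
published = padded_homogenize(toy_padded, [[0, 1], [2, 3]])
print(published)
```

Every user ends up with a rating for every movie, which matches the first bullet: the published graph is complete, so the original sparsity pattern is no longer visible.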

Predictive Anonymization
“Pure Homogenization”
[Figure: bipartite graph of users 0001 to 0004 and the movies Star Wars, Godfather, English Patient, and Pretty in Pink, with ratings as edge labels]
• Only necessary edges are added to the graph.
• Each cluster is averaged using the original data.
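One possible reading of pure homogenization is sketched below: within a cluster, a movie gets an edge only if at least one member originally rated it, and the published rating is the average of those original ratings. This interpretation, the function name, and the toy data are assumptions, not details confirmed by the slides.

```python
import numpy as np

def pure_homogenize(original, clusters):
    """Average original ratings per cluster; movies no member rated stay NaN."""
    out = np.full(original.shape, np.nan)
    for members in clusters:
        block = original[members]
        rated = ~np.isnan(block).all(axis=0)         # movies rated by some member
        means = np.nanmean(block[:, rated], axis=0)  # average over actual ratings only
        for m in members:
            out[m, rated] = means                    # every member gets the same edges
    return out

# Hypothetical original ratings: NaN means "not rated".
toy_original = np.array([
    [5.0, np.nan, np.nan],
    [3.0, 4.0, np.nan],
    [np.nan, np.nan, 1.0],
    [np.nan, 2.0, 1.0],
])
published = pure_homogenize(toy_original, [[0, 1], [2, 3]])
print(published)
```

Compared with padded homogenization, this keeps the published data sparse, at the cost of revealing which movies were rated by someone in each cluster.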

Experiments
• Performed on the Netflix Challenge dataset:
  • 480,189 users and 17,770 movies
  • more than 100 million ratings
• Singular value decomposition (SVD) is used for padding and prediction.
• We compute the root mean squared error (RMSE) for a test set of 1 million ratings on the original and anonymized data.
RMSE = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} (r_{u,m} - \hat{r}_{u,m})^2}, where T is the test set, r_{u,m} is the actual rating of movie m by user u, and \hat{r}_{u,m} is the predicted rating.
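The RMSE metric used in the experiments follows the standard definition: the square root of the mean squared difference between actual and predicted ratings over the test set. A minimal sketch, with a hypothetical function name and toy inputs:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical test ratings and predictions.
print(rmse([4, 3, 5], [4, 3, 5]))  # perfect predictions
print(rmse([1, 5], [3, 3]))        # each prediction is off by 2
```

Comparing this value for predictors trained on the original data and on the anonymized data gives a single number for how much prediction accuracy the anonymization costs.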

Analysis: Prediction Accuracy
• Padded Anonymization preserves prediction accuracy.
• However, sparsity is eliminated, which affects the utility of the published dataset for data mining applications.

Summary
[Figure: comparison of the approaches in terms of utility and privacy]

Conclusions
• We have formalized privacy and attack models for recommender systems.
• Our solutions show that privacy-preserving publishing of anonymized recommendation data is feasible.
• More work is required to find a practical solution that satisfies real-world privacy and utility goals.

Future Work
• Investigate the use of differential privacy-like guarantees for recommendation databases
• Analyze how to protect against more complex attacks with greater background knowledge
• Evaluate the utility of anonymized recommendation data for advanced data mining applications

Thank you!
