On Link Privacy in Randomizing Social Networks

Xiaowei Ying, Xintao Wu
Univ. of North Carolina at Charlotte
PAKDD-09, April 28, Bangkok, Thailand

Motivation
• Privacy Preserving Social Network Publishing
  • node-anonymization
    • cannot guarantee identity/link privacy due to subgraph queries
    • Backstrom et al. WWW07, Hay et al. UMass TR07
  • edge randomization
    • Random Add/Del
    • Random Switch
  • K-anonymity
    • Hay et al. VLDB08, Liu&Terzi SIGMOD08, Zhou&Pei ICDE08
  • Utility preserving randomization
    • Spectral feature preserving, Ying&Wu SDM08
    • Real space feature preserving, Ying&Wu SDM09

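Problem Formalization
• Randomization: add k edges, then delete k edges
• Prior belief vs. posterior belief
  • Ying&Wu SDM08: posterior belief given the randomized graph
  • This paper: the posterior belief additionally uses the similarity measure value between node i and j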
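The add/del step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper name `rand_add_del` and the toy ring graph are hypothetical, and the sketch assumes the k added edges are drawn from the original non-edges and the k deleted edges from the original edges. Under that assumption exactly m - k of the m edges in the randomized graph are true, which is what a posterior belief without similarity information would be based on.

```python
import random
from itertools import combinations

def rand_add_del(n, edges, k, seed=None):
    """Hypothetical helper: add k false edges, then delete k true edges."""
    rng = random.Random(seed)
    true_edges = {tuple(sorted(e)) for e in edges}
    # Candidate false edges: node pairs that are not linked in the original graph.
    non_edges = [p for p in combinations(range(n), 2) if p not in true_edges]
    added = set(rng.sample(non_edges, k))
    deleted = set(rng.sample(sorted(true_edges), k))
    return (true_edges - deleted) | added

# Toy graph (assumption for illustration): a ring of 10 nodes, m = 10 edges.
n = 10
edges = [(i, (i + 1) % n) for i in range(n)]
m = len(edges)
k = 3

randomized = rand_add_del(n, edges, k, seed=0)
true_set = {tuple(sorted(e)) for e in edges}
surviving = len(randomized & true_set)

# Proportion of true edges among observed edges: (m - k) / m = 0.7 here.
print(surviving / len(randomized))
```

The seed only changes which edges are touched; the proportion of true edges is fixed by m and k.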

Polbooks network
Network of US political books (105 nodes, 441 edges, r=8%)
• Books about US politics sold by Amazon.com.
• Edges represent frequent co-purchasing of books by the same buyers.
• Nodes have been given colors of blue, white, or red to indicate whether they are "liberal", "neutral", or "conservative".
http://www-personal.umich.edu/~mejn/netdata/

Proportion of true edges vs. similarity
After randomly adding/deleting 200 edges (441 edges in total)

Similarity measures vs. Link prediction
• Similarity measures
  • Common neighbors: the number of neighbors shared by two nodes
  • Adamic/Adar: the weighted number of common neighbors
  • Katz: a weighted sum of the number of paths connecting two nodes
  • Commute time: the expected number of steps for a random walk from node i to reach j and return to i
• Similarity measures have been exploited in the classic link prediction problem. Liben-Nowell&Kleinberg CIKM03
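Three of the measures listed above can be sketched with the standard library alone. The function names and the toy graph are illustrative assumptions. Adamic/Adar is written in its usual form (each common neighbor weighted by 1/log of its degree), and Katz is truncated at a maximum length with a damping factor `beta`, which is a simplification of the full infinite sum. Commute time is omitted because it needs a linear-algebra routine.

```python
import math

def adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, i, j):
    return len(adj.get(i, set()) & adj.get(j, set()))

def adamic_adar(adj, i, j):
    # Each shared neighbor z counts 1 / log(degree(z)); degree(z) >= 2 when i != j.
    shared = adj.get(i, set()) & adj.get(j, set())
    return sum(1.0 / math.log(len(adj[z])) for z in shared)

def katz_truncated(adj, i, j, beta=0.05, max_len=4):
    # Sum over lengths 1..max_len of beta^length * (number of walks from i to j).
    walks = {i: 1}
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for v, count in walks.items():
            for w in adj.get(v, ()):
                nxt[w] = nxt.get(w, 0) + count
        walks = nxt
        score += (beta ** length) * walks.get(j, 0)
    return score

# Toy graph (assumption): nodes 0 and 3 are not linked but share neighbors 1 and 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
adj = adjacency(edges)
print(common_neighbors(adj, 0, 3))
print(adamic_adar(adj, 0, 3))
print(katz_truncated(adj, 0, 3))
```

An attacker would compute such scores on the randomized graph, since the original graph is not available to him.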


Calculating Posterior belief
• Applying Bayes' theorem
• The attacker does not know this value; what can he do?
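The formula on this slide did not survive extraction. As a hedged sketch, Bayes' theorem applied to this setting would read as follows. The notation is an assumption, not taken from the slide: a_ij and ã_ij are the indicators of edge (i,j) in the original and randomized graphs, and m̃_ij is the similarity value of (i,j) computed on the randomized graph.

```latex
P\left(a_{ij}=1 \mid \tilde{a}_{ij}=1,\ \tilde{m}_{ij}=x\right)
  = \frac{P\left(\tilde{m}_{ij}=x \mid a_{ij}=1,\ \tilde{a}_{ij}=1\right)\,
          P\left(a_{ij}=1 \mid \tilde{a}_{ij}=1\right)}
         {P\left(\tilde{m}_{ij}=x \mid \tilde{a}_{ij}=1\right)}
```

If k false edges are added and k true edges are deleted from a graph with m edges, the factor P(a_ij = 1 | ã_ij = 1) equals (m - k)/m. The likelihood term in the numerator depends on which observed edges are true, so it is presumably the value the attacker cannot compute directly.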

MLE estimation
• Estimate the unknown value from the randomized graph
• The posterior belief can then be calculated by the attacker

Comparison

Empirical Evaluation
Attacker's Prediction Strategy
• Calculate the posterior probability of all node pairs
• Choose the top t node pairs (those with the highest posterior probability) as predicted candidate links
• For each t, report the precision of the predictions (k=0.5m)
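The prediction strategy above amounts to a precision-at-t computation. A minimal sketch follows. The function name `precision_at_t`, the toy posterior values, and the toy set of true links are all hypothetical; in the evaluation the scores would be the attacker's posterior beliefs and the true links would come from the original graph.

```python
def precision_at_t(scores, true_edges, t):
    """Fraction of the t highest-scoring node pairs that are true links."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:t]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_edges)
    return hits / len(top)

# Toy posterior beliefs per node pair (made-up numbers for illustration only).
scores = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.4, (1, 3): 0.2}
true_edges = {(0, 1), (1, 2)}

for t in (1, 2, 3):
    print(t, precision_at_t(scores, true_edges, t))
```

Sweeping t from 1 upward gives a precision curve of the kind the slide describes.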

Empirical Evaluation
• Posterior beliefs that exploit similarity measures achieve higher precision than posterior beliefs that do not.
• The measure that works best on one data set is not necessarily the best on another.

Determining k to guarantee privacy Data Owner

Conclusion & Future Work
• We have shown that node proximity measures can be exploited by attackers to breach link privacy in edge add/del randomized networks
• How about other topological properties?
• How about other randomization strategies?
• Privacy vs. utility tradeoff

Thank You! Questions?
Acknowledgments: This work was supported in part by U.S. National Science Foundation grants IIS-0546027 and CNS-0831204.

Utility preserving randomization
• Graph space: {G: with the given degree seq. & }
• Examine the proportion of sample graphs in which a link exists between nodes i and j
• Ying&Wu, SDM09
• Attacker's confidence on link (i,j)
