Privacy Preserving K-means Clustering on Vertically Partitioned Data • Presented by: Jaideep Vaidya • Joint work with: Prof. Chris Clifton
Overview • Global Problem: Privacy Preserving Distributed Data Mining • Specific Problem: Clustering (K-Means) for Vertically Partitioned Data, using Cryptographic Tools
Vertical Partitioning of Data • [Figure: a global database view, vertically partitioned into cell phone data and medical records]
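A minimal sketch of what vertical partitioning implies for distance computations. It assumes squared Euclidean distance, two hypothetical sites (`medical_site`, `phone_site`) holding the same entities in the same row order, and made-up attribute values. The sum is taken in the clear here, so this shows only the data layout, not a private protocol.

```python
# Hypothetical two-site vertical partition: same entities (same row order),
# disjoint attribute sets. All values are invented for illustration.
medical_site = [[120.0, 5.4], [135.0, 6.1], [110.0, 4.9]]  # two medical attributes
phone_site = [[300.0], [45.0], [210.0]]                    # one phone-usage attribute


def partial_sq_distance(row, center_part):
    """Squared Euclidean distance restricted to one site's attributes."""
    return sum((x - c) ** 2 for x, c in zip(row, center_part))


def global_sq_distance(entity, center_parts, sites):
    """Sum the per-site partial distances (in the clear, NOT privately)."""
    return sum(
        partial_sq_distance(site[entity], part)
        for site, part in zip(sites, center_parts)
    )


sites = [medical_site, phone_site]
center_parts = [[118.0, 5.0], [250.0]]  # each site's share of one cluster center
d = global_sq_distance(0, center_parts, sites)
```

Because the global distance is a sum of per-site components, each site can compute its own component locally; the privacy problem is how to combine and compare those components without revealing them.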
Privacy Preserving Data Mining • Perturbation: Agrawal & Srikant; Agrawal & Aggarwal; Rizvi & Haritsa; Evfimievski et al. • Cryptographic: Lindell & Pinkas; Du & Zhan; Vaidya & Clifton; Kantarcioglu & Clifton
Secure Multiparty Computation (SMC) • Given a function f and n inputs, distributed at n sites, compute the result while revealing nothing to any site except its own input(s) and the result.
Results • Cluster assignment for entities: not private • Cluster centers: semi-private
Secure K-means clustering • K-means clustering: arbitrarily select k starting points, then repeat until no change: • Assign to respectively • (Re)assign each object to the closest cluster based on its distance from the cluster mean • Recompute the cluster means
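For reference, the ordinary (non-private) K-means loop can be sketched as follows. This is a plain single-site baseline, assuming squared Euclidean distance and a hypothetical `kmeans` helper; it does not include any of the privacy machinery.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means: returns (cluster means, cluster index per point)."""
    rng = random.Random(seed)
    # Arbitrarily select k starting points.
    means = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the closest cluster mean.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:  # until no change
            break
        assignment = new_assignment
        # Re-compute the cluster means (empty clusters keep their old mean).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return means, assignment
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` separates the two points near the origin from the two points near (10, 10).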
Key Idea • Disguise site components with random values • Compare distances while revealing only comparison result • Permute order of clusters to conceal meaning of comparison results
Closest Cluster Computation • 3 special sites, P1, P2 and Pr • P1 generates • r random vectors such that • Permutation π (over 1 .. K)
Permutation Protocol (Du and Atallah ’01) • [Figure: exchange between parties A and B] • Homomorphic encryption: Ek(x) * Ek(y) = Ek(x + y)
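The homomorphic property above can be illustrated with a toy additively homomorphic scheme. The sketch below uses Paillier encryption with tiny fixed primes, which is an assumption made here for illustration: the slides do not name a specific cryptosystem, and these parameters are far too small to be secure. The exchange at the end assumes the permutation protocol has party A holding a vector, party B holding a random mask and a permutation, and A ending up with the masked, permuted vector.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier key pair with g = n + 1 (tiny primes, NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)


def encrypt(n, m, rng):
    """E(m) = (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


rng = random.Random(1)
n, priv = keygen()          # A's key pair; only A can decrypt
x = [12, 7, 30]             # A's private vector
r_mask = [5, 100, 42]       # B's random mask (hypothetical values)
perm = [2, 0, 1]            # B's permutation: output[i] = masked[perm[i]]

enc_x = [encrypt(n, v, rng) for v in x]                       # A -> B
masked = [(c * encrypt(n, m, rng)) % (n * n)                  # E(x)*E(r) = E(x+r)
          for c, m in zip(enc_x, r_mask)]
shuffled = [masked[j] for j in perm]                          # B -> A
result = [decrypt(n, priv, c) for c in shuffled]              # permuted x + r
```

In this run A recovers the masked values in permuted order without learning the mask or the permutation separately, and B only ever sees ciphertexts.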
Closest Cluster Computation • [Figure: message flow among P1, P2, P3, ..., Pr-1 and Pr in Stage 1 and Stage 2]
Closest Cluster Computation • Stage 3 • P2 and Pr determine i, the index of the cluster with minimum distance • Stage 4 • P1 computes and broadcasts
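The four stages can be simulated in the clear to show why disguising and permuting still yields the right answer. Several details below are assumptions, since the transcript has lost the formulas and the figure: the random vectors are taken to sum to zero across sites, all sites other than P2 are taken to forward their disguised vectors to Pr, and the secure comparison between P2 and Pr is replaced by an ordinary comparison. The function name `closest_cluster_sim` and the integer distances are hypothetical.

```python
import random


def closest_cluster_sim(site_distances, seed=0):
    """site_distances[s][c]: site s's distance component to cluster c.

    Index 0 plays P1, index 1 plays P2, the last index plays Pr.
    Returns the index of the cluster with minimum total distance.
    """
    rng = random.Random(seed)
    r, k = len(site_distances), len(site_distances[0])

    # P1: one random vector per site, summing to zero per cluster (assumed),
    # plus a random permutation over the k clusters.
    masks = [[rng.randrange(-1000, 1000) for _ in range(k)] for _ in range(r - 1)]
    masks.append([-sum(m[c] for m in masks) for c in range(k)])
    perm = list(range(k))
    rng.shuffle(perm)

    # Stage 1: each site ends up with its disguised, permuted vector.
    disguised = [
        [dist[perm[j]] + mask[perm[j]] for j in range(k)]
        for dist, mask in zip(site_distances, masks)
    ]

    # Stage 2: every site except P2 forwards its vector to Pr, which sums them.
    pr_total = [sum(disguised[s][j] for s in range(r) if s != 1) for j in range(k)]
    p2_vec = disguised[1]

    # Stage 3: P2 and Pr find the permuted index of the minimum total.
    # (Plain comparison here; the protocol reveals only comparison results.)
    j_min = min(range(k), key=lambda j: p2_vec[j] + pr_total[j])

    # Stage 4: P1 undoes the permutation and broadcasts the real cluster index.
    return perm[j_min]
```

Because the masks cancel when summed, the minimum of the disguised totals is the minimum of the true totals, while no single forwarded vector exposes a site's real distance components.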
When to stop? • Locally compute the difference in means • Compare against a globally known threshold • Use a simple random-adding technique to disguise actual values • The first party adds a random value to its distance and sends the result to the next party • Each party adds its value to the total and sends it on • The last party compares the total with the first party’s random value + threshold
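The random-adding steps above can be sketched as a single-process simulation. The function name `should_stop`, the integer (fixed-point) differences, the mask range, and the direction of the final test (stop when the total does not exceed the threshold) are assumptions. The final comparison is done in the clear here, whereas in a real deployment the last party should not see the first party's random value directly.

```python
import random


def should_stop(local_differences, threshold, seed=None):
    """Simulate the ring of parties summing their local mean differences.

    local_differences[i] is party i's locally computed difference in means,
    given as an integer (e.g. a fixed-point encoding).
    """
    rng = random.Random(seed)
    mask = rng.randrange(0, 10**9)       # first party's random value

    # First party adds the random value to its distance and sends it on.
    running = mask + local_differences[0]

    # Each subsequent party adds its own value and passes the total along.
    for diff in local_differences[1:]:
        running += diff

    # Last party's total is compared with (first party's random + threshold).
    return running <= mask + threshold
```

Each intermediate party sees only a running total offset by the unknown random value, so it cannot read off the earlier parties' individual differences.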
Communication Cost • r parties, n data elements, m-bit distances
Conclusion • Presented a solution for the Privacy Preserving K-Means Clustering problem • How to use the clusters? • Will parties share the required information for the possible benefits? • Improve efficiency • Working on EM clustering and implementations