Privacy Preserving K-means Clustering on Vertically Partitioned Data

Privacy Preserving K-means Clustering on Vertically Partitioned Data • Presented by: Jaideep Vaidya • Joint work: Prof. Chris Clifton

Overview • Global Problem: Privacy Preserving Distributed Data Mining • Specific Problem: Clustering (K-Means) for Vertically Partitioned Data using Cryptographic Tools

Vertical Partitioning of Data [Figure: Global Database View, split into Cell Phone Data and Medical Records]

Is the problem trivial?

Privacy Preserving Data Mining • Perturbation: Agrawal & Srikant; Agrawal & Aggarwal; Rizvi & Haritsa; Evfimievski et al. • Cryptographic: Lindell & Pinkas; Du & Zhan; Vaidya & Clifton; Kantarcioglu & Clifton

Secure Multiparty Computation (SMC) • Given a function f and n inputs, distributed at n sites, compute the result while revealing nothing to any site except its own input(s) and the result.

Results • Cluster assignment for entities: not private • Cluster centers: semi-private

Secure K-means clustering • K-means clustering: arbitrarily select k starting points, then repeat until no change: • Assign to respectively • (Re)assign each object to the closest cluster based on its distance from the cluster mean • Re-compute the cluster means
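For reference, the plain (non-private) K-means loop on this slide can be sketched as follows. This is only an illustrative baseline, not the secure protocol: the function names `squared_distance` and `k_means`, the use of squared Euclidean distance, and the `max_iter` safety cap are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of tuples; returns (means, assignment)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)      # arbitrarily select k starting points
    assignment = None
    for _ in range(max_iter):          # safety cap (assumption, not on the slide)
        # (Re)assign each object to the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_assignment == assignment:   # until no change
            break
        assignment = new_assignment
        # Re-compute the cluster means; an empty cluster keeps its old mean.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignment

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(k_means(data, k=2))
```

In the vertically partitioned setting each site holds only some of the coordinates of every point, so the assignment step is the part that needs a secure protocol.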

Assigning objects to closest cluster

Key Idea • Disguise site components with random values • Compare distances while revealing only comparison result • Permute order of clusters to conceal meaning of comparison results

Closest Cluster Computation • 3 special sites, P1, P2 and Pr • P1 generates • r random vectors such that • Permutation π (over 1 .. K)

Permutation Protocol (Du and Atallah ’01) [Figure: exchange between parties A and B] • Homomorphic encryption: Ek(x) * Ek(y) = Ek(x + y)
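The additive homomorphic property Ek(x) * Ek(y) = Ek(x + y) can be illustrated with a toy example. The slide does not name an encryption scheme; the sketch below assumes a Paillier-style scheme, which is one well-known scheme with this property, and uses tiny hard-coded primes that are far too small for real use. The function names `keygen`, `encrypt`, and `decrypt` are hypothetical.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier-style key generation (insecure demo primes, an assumption)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)     # modular inverse; needs Python 3.8+
    return n, (lam, mu)      # public key n, private key (lam, mu)

def encrypt(n, m, rng=random):
    """Encrypt m (0 <= m < n) as (n + 1)^m * r^n mod n^2 for a random r."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover m from ciphertext c using the private key."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

if __name__ == "__main__":
    n, priv = keygen()
    cx, cy = encrypt(n, 123), encrypt(n, 456)
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    print(decrypt(n, priv, (cx * cy) % (n * n)))
```

The point of the property is that a party holding only ciphertexts can add a value to an encrypted quantity without ever seeing the plaintext.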

Closest Cluster Computation [Figure: Stage 1 and Stage 2 message flow among sites P1, P2, P3, ..., Pr-1, Pr]

Closest Cluster Computation • Stage 3 • P2 and Pr determine i, the index of the cluster with minimum distance • Stage 4 • P1 computes and broadcasts
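The overall idea of the closest-cluster computation (disguise, permute, compare, un-permute) can be simulated in a single process. This is a heavily simplified sketch under stated assumptions, not the protocol itself: the condition on the random vectors is missing from the slide, so the code assumes they sum to zero so that the masks cancel; the encrypted permutation exchange and the secure comparison between P2 and Pr are replaced by plain arithmetic; and the name `closest_cluster` and the integer distance shares are hypothetical.

```python
import random

def closest_cluster(local_distances, seed=0):
    """local_distances[i][j]: site i's share of the distance to cluster j.

    Single-process simulation for r >= 2 sites; returns the true index of
    the cluster with the minimum total distance.
    """
    rng = random.Random(seed)
    r = len(local_distances)
    k = len(local_distances[0])

    # P1: r random mask vectors, assumed here to sum to zero component-wise.
    masks = [[rng.randrange(-10**6, 10**6) for _ in range(k)]
             for _ in range(r - 1)]
    masks.append([-sum(col) for col in zip(*masks)])
    # P1: secret permutation pi over the k cluster indices.
    pi = list(range(k))
    rng.shuffle(pi)

    # Stage 1 (simulated in the clear): site i ends up with pi(X_i + V_i).
    permuted = [
        [x[pi[j]] + v[pi[j]] for j in range(k)]
        for x, v in zip(local_distances, masks)
    ]
    # Stages 2 and 3 (simplified): the masked shares are combined and the
    # minimum is located; position j corresponds to true cluster pi[j].
    totals = [sum(col) for col in zip(*permuted)]
    i = min(range(k), key=totals.__getitem__)
    # Stage 4: P1 maps the permuted index back to the true cluster index.
    return pi[i]

if __name__ == "__main__":
    shares = [[5, 1, 9], [4, 2, 7], [3, 3, 8]]   # 3 sites, 3 clusters
    print(closest_cluster(shares))
```

Because the masks cancel only in the sum, no single masked share reveals a site's real distance components, and the permutation hides which cluster a comparison result refers to.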

When to stop? • Locally compute the difference in means • Compare the total against a globally known threshold • Use a simple random-adding technique to disguise the actual values: • The first party adds a random value to its distance and sends the result to the next party • Each party adds its own value to the running total and sends it on • The last party compares the total with the first party’s random value + threshold
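The random-adding check described on this slide can be sketched as a single-process simulation. The sketch assumes integer (for example, fixed-point) local differences and treats "at most the threshold" as the stopping condition, which the slide does not specify; the final comparison is done in the clear here, whereas a real deployment would need a secure comparison so that neither end learns the other's value. The name `should_stop` is hypothetical.

```python
import random

def should_stop(local_changes, threshold, seed=None):
    """Simulate the ring-based threshold test over per-party changes in means."""
    rng = random.Random(seed)
    mask = rng.randrange(10**9)        # random value known only to the first party
    # First party adds the random value to its own change and passes it on.
    running_total = mask + local_changes[0]
    # Each subsequent party adds its own change to the running total.
    for change in local_changes[1:]:
        running_total += change
    # Last party's total is compared with the first party's random + threshold.
    return running_total <= mask + threshold

if __name__ == "__main__":
    print(should_stop([1, 2, 3], threshold=10))
    print(should_stop([5, 6, 7], threshold=10))
```

Each intermediate party sees only a masked running total, so it cannot read off the contributions of the parties before it.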

Communication Cost • r parties, n data elements, m-bit distances

Conclusion • Presented a solution for the Privacy Preserving K-Means Clustering problem • How to use the clusters? • Will parties share the required information for the possible benefits? • Improve efficiency • Working on EM-Clustering, implementations
