Privacy-Preserving Clustering
Outline • Introduction • Related Work • Secure Multi-Party Computation • Data Sanitization • Preliminaries • Yao’s Millionaires’ Problem • Homomorphic Encryption • Privacy-Preserving K-Means Clustering • Conclusion
Introduction • Why do we need privacy preservation? • Data sharing in today's globally networked systems poses a threat to individual privacy and organizational confidentiality. • The privacy problem is not data mining itself, but the way data mining is done. • So, privacy and data mining can coexist. • An important data mining problem: clustering.
Related Work • Privacy-preserving clustering: • Secure multi-party computation. • High computation and communication costs. • Data sanitization. • Loss of accuracy. • Dimensionality reduction. • Model-based solutions.
Yao’s Millionaires’ Problem • Millionaires’ problem: • Two millionaires wish to know who is richer; however, they do not want to find out any additional information about each other’s wealth.
Solutions • Suppose • Alice has i million dollars. • Bob has j million dollars. • 1 ≤ i, j ≤ 10.
Solutions • Example: Alice has i = 5, Bob has j = 3.
1. (B) Bob picks a random x (say x = 7) and encrypts it with Alice's public key: k = Ea(x) = 4.
2. (B) Bob sends Alice k - j + 1 = 2.
3. (A) Alice privately decrypts the ten values y_u = Da(k - j + u) for u = 1, …, 10; by construction y_j = Da(k) = x.
4. (A) Alice picks a random prime p, computes z_u = y_u mod p, adds 1 to every z_u with u > i (here i = 5), and sends the ten resulting values to Bob, together with p.
5. (B) Bob checks whether the j-th value (z_3) equals x mod p. If yes, then i ≥ j; if no, then i < j. Here position 3 ≤ 5 was left unchanged, so Bob learns that i ≥ j.
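Below is a minimal Python sketch of these five steps. The RSA key pair, the prime p, and the range 1..10 are toy assumptions for illustration; the full protocol additionally requires Alice to check that the z values are pairwise well separated mod p.

```python
# Toy sketch of Yao's millionaires' protocol (illustrative, not secure).
import random

# Alice's toy RSA key pair; Ea is public encryption, Da private decryption.
n, e, d = 3233, 17, 413            # n = 61 * 53 (textbook-sized parameters)
Ea = lambda m: pow(m, e, n)
Da = lambda c: pow(c, d, n)

i, j = 5, 3                        # Alice's and Bob's wealth in millions

# (B) Bob picks a random x, encrypts it, and sends Alice k - j + 1.
x = random.randrange(2, n)
k = Ea(x)
msg = k - j + 1

# (A) Alice decrypts msg, msg + 1, ..., msg + 9 to get y_1..y_10 (y_j = x),
#     reduces them mod a prime p, and adds 1 to every value after position i.
y = [Da(msg + u) for u in range(10)]
p = 101                            # chosen at random in a real run
z = [v % p for v in y]
reply = [z[u] if u < i else z[u] + 1 for u in range(10)]

# (B) Bob inspects position j: it still equals x mod p exactly when i >= j.
print("i >= j" if reply[j - 1] == x % p else "i < j")
```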
Homomorphic Encryption • Homomorphic encryption: • An encryption function H is homomorphic for an operation ⊕ if H(x ⊕ y) can be computed from H(x) and H(y) alone, without revealing x or y: • H(x ⊕ y) = H(x) ⊙ H(y) • e.g., RSA (multiplicative), … • Additive homomorphic: • H(x + y) = H(x) * H(y) • e.g., Paillier, …
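For example, the additive property can be checked directly with a minimal Paillier implementation (toy parameters, purely illustrative):

```python
# Minimal Paillier sketch (toy primes, not secure) demonstrating the
# additive homomorphism H(x + y) = H(x) * H(y) mod n^2.
import random
from math import gcd

p, q = 293, 433                    # toy primes (illustrative only)
n, n_sq = p * q, (p * q) ** 2
g = n + 1                          # standard choice g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)      # lcm(p-1, q-1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)     # inverse of L(g^lam mod n^2)

def encrypt(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return (pow(c, lam, n_sq) - 1) // n * mu % n

x, y = 15, 27
# Multiplying ciphertexts adds the underlying plaintexts.
assert decrypt(encrypt(x) * encrypt(y) % n_sq) == x + y
```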
Privacy-Preserving K-Means Clustering over Vertically Partitioned Data (Vaidya & Clifton, SIGKDD 2003)
Problem Definition • Goal: • Cluster the known set of common entities without revealing any value that the clustering is based on. • Input: • Each user provides one attribute of all items. • Output: • Assignment of entities to clusters. • Cluster centers themselves.
K-Means Clustering • (Figure: the iteration loop: compute the distance matrix, make the cluster decision for each point, then perform the new center computation, and repeat.)
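For reference, a plain (non-private) k-means loop matching this figure, with illustrative parameter names and a hypothetical convergence threshold:

```python
# Plain k-means sketch: distance matrix -> cluster decision -> new centers.
import random

def kmeans(points, k, threshold=1e-4, max_iter=100):
    means = random.sample(points, k)                  # initial cluster means
    clust = [0] * len(points)
    for _ in range(max_iter):
        # Distance matrix: squared distance of every point to every mean.
        dist = [[sum((a - b) ** 2 for a, b in zip(pt, m)) for m in means]
                for pt in points]
        # Cluster decision: assign each point to its closest cluster.
        clust = [row.index(min(row)) for row in dist]
        # New center computation: mean of the points in each cluster.
        new_means = []
        for c in range(k):
            members = [pt for pt, ci in zip(points, clust) if ci == c]
            new_means.append([sum(col) / len(members) for col in zip(*members)]
                             if members else means[c])
        # Check threshold: stop once the means barely move.
        shift = max(sum((a - b) ** 2 for a, b in zip(m, nm))
                    for m, nm in zip(means, new_means))
        means = new_means
        if shift < threshold:
            break
    return means, clust
```

In the privacy-preserving setting, each user can evaluate only its own share of each squared distance, which is why the distance matrix and the cluster decision must be computed with the secure protocol described later.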
Vertically Partitioned Data • (Figure: User 1 and User 2 each hold a different subset of the attributes for the same set of items.)
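As a concrete toy example (hypothetical attribute names), vertical partitioning means both users index the same items but hold different columns:

```python
# Hypothetical vertically partitioned data: same items, different attributes.
user1 = {                      # e.g., User 1 holds "age" and "income"
    "item1": [34, 52000],
    "item2": [27, 31000],
    "item3": [45, 78000],
}
user2 = {                      # e.g., User 2 holds "weekly visits"
    "item1": [12],
    "item2": [3],
    "item3": [7],
}
# The full record for an item (never materialized in the protocol) is the
# concatenation of the two projections:
full_item1 = user1["item1"] + user2["item1"]      # [34, 52000, 12]
```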
Terminology • r: # of users, each holding different attributes for the same set of items. • n: # of common items. • k: # of clusters required. • ui: the mean of cluster i, i = 1, …, k. • uij: the projection of ui onto the attributes held by user j. • Final result for user j: • The final value / position of uij, i = 1, …, k. • Cluster assignments: clusti for all points i = 1, …, n.
Securely Finding the Closest Cluster • The security of the algorithm is based on three key ideas. • Disguise the site components of the distance with random values that cancel out when combined. • Permute the order of clusters so the real meaning of the comparison results is unknown. • Compare distances so only the comparison result is learned; no party knows the distances being compared.
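A minimal sketch of the first two ideas, with illustrative numbers (this is not the paper's full protocol, which also relies on secure comparison and non-colluding helper parties):

```python
# Sketch: blind each party's distance components with noise that cancels
# across parties, then permute the cluster order.
import random

K, R = 4, 3                     # clusters and parties (illustrative)

# Hypothetical per-party components of one point's squared distance to
# each cluster mean; party p only knows comp[p].
comp = [[random.uniform(0.0, 10.0) for _ in range(K)] for _ in range(R)]

# Idea 1: per cluster, the parties' noise terms are arranged to sum to zero.
noise = [[random.uniform(-5.0, 5.0) for _ in range(K)] for _ in range(R - 1)]
noise.append([-sum(noise[p][c] for p in range(R - 1)) for c in range(K)])
blinded = [[comp[p][c] + noise[p][c] for c in range(K)] for p in range(R)]

# Combining all blinded shares reproduces the true total distances, while a
# single blinded share reveals nothing about that party's true component.
true_total = [sum(comp[p][c] for p in range(R)) for c in range(K)]
combined = [sum(blinded[p][c] for p in range(R)) for c in range(K)]
assert all(abs(t - c) < 1e-9 for t, c in zip(true_total, combined))

# Idea 2: a random permutation hides which comparison maps to which cluster.
perm = random.sample(range(K), K)
permuted = [[blinded[p][perm[c]] for c in range(K)] for p in range(R)]
```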
Check Threshold • (Figure: securely test whether each cluster mean has moved by less than the threshold before stopping.)
Conclusion • Horizontally partitioned data: (Figure: User 1 and User 2 each hold all attributes for a different subset of the items.)