Privacy-Preserving Clustering
Outline • Introduction • Related Work • Secure Multi-Party Computation • Data Sanitization • Preliminaries • Yao’s Millionaires’ Problem • Homomorphic Encryption • Privacy-Preserving K-Means Clustering • Conclusion
Introduction • Why do we need privacy preservation? • Data sharing in today's globally networked systems poses a threat to individual privacy and organizational confidentiality. • The privacy problem is not data mining itself, but the way data mining is done. • So, privacy and data mining can coexist. • An important data mining problem: clustering.
Related Work • Privacy-preserving clustering: • Secure multi-party computation. • High computation and communication costs. • Data sanitization. • Loss of accuracy. • Dimensionality reduction. • Model-based solutions.
Yao’s Millionaires’ Problem • Millionaires’ problem: • Two millionaires wish to know who is richer; however, they do not want to find out any additional information about each other’s wealth.
Solutions • Suppose • Alice has i million dollars. • Bob has j million dollars. • 1 ≤ i, j ≤ 10.
Solutions • Example: Alice has i = 5, Bob has j = 3.
1. (B) Bob picks a random x (say x = 7) and encrypts it with Alice's public key: k = Ea(x) = 4.
2. (B) Bob sends Alice k - j + 1 = 2.
3. (A) Alice privately decrypts the ten values y_u = Da(k - j + u) for u = 1, …, 10; by construction y_j = Da(k) = x.
4. (A) Alice picks a random prime p, computes z_u = y_u mod p, adds 1 to every z_u with u > i (here i = 5), and sends the ten resulting values to Bob, together with p.
5. (B) Bob checks whether the j-th value (z_3) equals x mod p. If yes, then i ≥ j; if no, then i < j. Here position 3 ≤ 5 was left unchanged, so Bob learns that i ≥ j.
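Below is a minimal Python sketch of these five steps. The RSA key pair, the prime p, and the range 1..10 are toy assumptions for illustration; the full protocol additionally requires Alice to check that the z values are pairwise well separated mod p.

```python
# Toy sketch of Yao's millionaires' protocol (illustrative, not secure).
import random

# Alice's toy RSA key pair; Ea is public encryption, Da private decryption.
n, e, d = 3233, 17, 413            # n = 61 * 53 (textbook-sized parameters)
Ea = lambda m: pow(m, e, n)
Da = lambda c: pow(c, d, n)

i, j = 5, 3                        # Alice's and Bob's wealth in millions

# (B) Bob picks a random x, encrypts it, and sends Alice k - j + 1.
x = random.randrange(2, n)
k = Ea(x)
msg = k - j + 1

# (A) Alice decrypts msg, msg + 1, ..., msg + 9 to get y_1..y_10 (y_j = x),
#     reduces them mod a prime p, and adds 1 to every value after position i.
y = [Da(msg + u) for u in range(10)]
p = 101                            # chosen at random in a real run
z = [v % p for v in y]
reply = [z[u] if u < i else z[u] + 1 for u in range(10)]

# (B) Bob inspects position j: it still equals x mod p exactly when i >= j.
print("i >= j" if reply[j - 1] == x % p else "i < j")
```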
Homomorphic Encryption • Homomorphic encryption: • An encryption function H is homomorphic for an operation ⊕ if H(x ⊕ y) can be computed from H(x) and H(y) alone, without revealing x or y: • H(x ⊕ y) = H(x) ⊙ H(y) • e.g., RSA (multiplicative), … • Additive homomorphic: • H(x + y) = H(x) * H(y) • e.g., Paillier, …
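For example, the additive property can be checked directly with a minimal Paillier implementation (toy parameters, purely illustrative):

```python
# Minimal Paillier sketch (toy primes, not secure) demonstrating the
# additive homomorphism H(x + y) = H(x) * H(y) mod n^2.
import random
from math import gcd

p, q = 293, 433                    # toy primes (illustrative only)
n, n_sq = p * q, (p * q) ** 2
g = n + 1                          # standard choice g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)      # lcm(p-1, q-1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)     # inverse of L(g^lam mod n^2)

def encrypt(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return (pow(c, lam, n_sq) - 1) // n * mu % n

x, y = 15, 27
# Multiplying ciphertexts adds the underlying plaintexts.
assert decrypt(encrypt(x) * encrypt(y) % n_sq) == x + y
```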
Privacy-Preserving K-Means Clustering over Vertically Partitioned Data (Vaidya & Clifton, SIGKDD 2003)
Problem Definition • Goal: • Cluster the known set of common entities without revealing any value that the clustering is based on. • Input: • Each user provides one attribute of all items. • Output: • Assignment of entities to clusters. • Cluster centers themselves.
K-Means Clustering • (Figure: the iteration loop: compute the distance matrix, make the cluster decision for each point, then perform the new center computation, and repeat.)
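For reference, a plain (non-private) k-means loop matching this figure, with illustrative parameter names and a hypothetical convergence threshold:

```python
# Plain k-means sketch: distance matrix -> cluster decision -> new centers.
import random

def kmeans(points, k, threshold=1e-4, max_iter=100):
    means = random.sample(points, k)                  # initial cluster means
    clust = [0] * len(points)
    for _ in range(max_iter):
        # Distance matrix: squared distance of every point to every mean.
        dist = [[sum((a - b) ** 2 for a, b in zip(pt, m)) for m in means]
                for pt in points]
        # Cluster decision: assign each point to its closest cluster.
        clust = [row.index(min(row)) for row in dist]
        # New center computation: mean of the points in each cluster.
        new_means = []
        for c in range(k):
            members = [pt for pt, ci in zip(points, clust) if ci == c]
            new_means.append([sum(col) / len(members) for col in zip(*members)]
                             if members else means[c])
        # Check threshold: stop once the means barely move.
        shift = max(sum((a - b) ** 2 for a, b in zip(m, nm))
                    for m, nm in zip(means, new_means))
        means = new_means
        if shift < threshold:
            break
    return means, clust
```

In the privacy-preserving setting, each user can evaluate only its own share of each squared distance, which is why the distance matrix and the cluster decision must be computed with the secure protocol described later.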
Vertically Partitioned Data • (Figure: User 1 and User 2 each hold a different subset of the attributes for the same set of items.)
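As a concrete toy example (hypothetical attribute names), vertical partitioning means both users index the same items but hold different columns:

```python
# Hypothetical vertically partitioned data: same items, different attributes.
user1 = {                      # e.g., User 1 holds "age" and "income"
    "item1": [34, 52000],
    "item2": [27, 31000],
    "item3": [45, 78000],
}
user2 = {                      # e.g., User 2 holds "weekly visits"
    "item1": [12],
    "item2": [3],
    "item3": [7],
}
# The full record for an item (never materialized in the protocol) is the
# concatenation of the two projections:
full_item1 = user1["item1"] + user2["item1"]      # [34, 52000, 12]
```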
Terminology • r: # of users, each holding different attributes for the same set of items. • n: # of common items. • k: # of clusters required. • ui: the mean of cluster i, i = 1, …, k. • uij: the projection of ui onto the attributes held by user j. • Final result for user j: • The final value / position of uij, i = 1, …, k. • Cluster assignments: clusti for all points i = 1, …, n.
Securely Finding the Closest Cluster • The security of the algorithm is based on three key ideas. • Disguise the site components of the distance with random values that cancel out when combined. • Permute the order of clusters so the real meaning of the comparison results is unknown. • Compare distances so only the comparison result is learned; no party knows the distances being compared.
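A minimal sketch of the first two ideas, with illustrative numbers (this is not the paper's full protocol, which also relies on secure comparison and non-colluding helper parties):

```python
# Sketch: blind each party's distance components with noise that cancels
# across parties, then permute the cluster order.
import random

K, R = 4, 3                     # clusters and parties (illustrative)

# Hypothetical per-party components of one point's squared distance to
# each cluster mean; party p only knows comp[p].
comp = [[random.uniform(0.0, 10.0) for _ in range(K)] for _ in range(R)]

# Idea 1: per cluster, the parties' noise terms are arranged to sum to zero.
noise = [[random.uniform(-5.0, 5.0) for _ in range(K)] for _ in range(R - 1)]
noise.append([-sum(noise[p][c] for p in range(R - 1)) for c in range(K)])
blinded = [[comp[p][c] + noise[p][c] for c in range(K)] for p in range(R)]

# Combining all blinded shares reproduces the true total distances, while a
# single blinded share reveals nothing about that party's true component.
true_total = [sum(comp[p][c] for p in range(R)) for c in range(K)]
combined = [sum(blinded[p][c] for p in range(R)) for c in range(K)]
assert all(abs(t - c) < 1e-9 for t, c in zip(true_total, combined))

# Idea 2: a random permutation hides which comparison maps to which cluster.
perm = random.sample(range(K), K)
permuted = [[blinded[p][perm[c]] for c in range(K)] for p in range(R)]
```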
Check Threshold • (Figure: securely test whether each cluster mean has moved by less than the threshold before stopping.)
Conclusion • Horizontally partitioned data: (Figure: User 1 and User 2 each hold all attributes for a different subset of the items.)