Cryptographic methods for privacy-aware computing: applications
Outline • Review: three basic methods • Two applications • Distributed decision tree with horizontally partitioned data • Distributed k-means with vertically partitioned data
Three basic methods • 1-out-of-K Oblivious Transfer • Random shares • Homomorphic encryption * Cost is the major concern
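As a toy sketch of the "random share" idea (plain additive secret sharing; the names `share`/`reconstruct` and the modulus are illustrative, not the exact constructions used in the protocols below):

```python
import random

MOD = 2 ** 32  # all share arithmetic is modular

def share(x):
    """Split secret x into two additive random shares.
    Either share alone is uniformly random and reveals nothing about x."""
    s1 = random.randrange(MOD)
    s2 = (x - s1) % MOD
    return s1, s2

def reconstruct(s1, s2):
    """Adding the two shares (mod MOD) recovers the secret."""
    return (s1 + s2) % MOD
```

Each party holds one share; only when the shares are combined does the value reappear, which is the property the two protocols below rely on.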
Two example protocols • The basic idea is • Do not release original data • Exchange only intermediate results • Apply the three basic methods to combine the intermediate results securely
Building decision trees over horizontally partitioned data • Horizontally partitioned data • Entropy-based information gain • Major ideas in the protocol
Horizontally Partitioned Data • Table with key and one set of d attributes X1…Xd • The rows are split across sites: Site 1 holds keys k1…ki, Site 2 holds keys ki+1…kj, …, Site r holds keys km+1…kn • Every site stores the full attribute set X1…Xd for its own rows
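A minimal sketch of the layout, with hypothetical keys and values:

```python
# Horizontal partitioning: each site holds the FULL attribute set
# (X1, X2, X3) for a disjoint subset of the keys.
table = {
    "k1": (1, 2, 3), "k2": (4, 5, 6),    # site 1's rows
    "k3": (7, 8, 9), "k4": (10, 11, 12), # site 2's rows
}
site1 = {k: table[k] for k in ("k1", "k2")}
site2 = {k: table[k] for k in ("k3", "k4")}

# The union of the sites' rows is the whole table.
merged = {**site1, **site2}
assert merged == table
```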
Review: decision tree algorithm (ID3) • Find the cut that maximizes gain • a certain attribute Ai, with sorted values v1…vn • a certain value in that attribute • For categorical data the test is Ai = vi • For numerical data the test is Ai < vi • E(): entropy of the label distribution • Choose the attribute/value pair that gives the highest gain! (Figure: a tree of yes/no tests Ai < vi, Aj < vj, …; attribute Ai with sorted values v1…vn, labels l1…ln, and a candidate cut.)
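The gain computation for a numerical cut Ai < vi can be sketched as follows (a plain, non-private version with illustrative data; the helper names are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(): entropy of a label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, cut):
    """Information gain of splitting a numerical attribute on Ai < cut."""
    left = [l for v, l in zip(values, labels) if v < cut]
    right = [l for v, l in zip(values, labels) if v >= cut]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# Illustrative data: the cut Ai < 3 separates the labels perfectly.
values = [1, 2, 3, 4]
labels = ["a", "a", "b", "b"]
best = max([2, 3, 4], key=lambda v: info_gain(values, labels, v))  # best == 3
```

The distributed protocol computes exactly these entropy terms, but from sums held jointly by the two parties.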
Key points • Calculating entropy: the key is calculating x log x, where x is the sum of values from the two parties P1 and P2, i.e., x = x1 + x2 • The computation is decomposed into several steps; in each step, each party learns only a random share of the result (Figure: attribute Ai with sorted values v1…vn, labels l1…ln, and a candidate cut.)
Steps Step 1: compute shares w1 and w2 such that w1 + w2 = (x1 + x2) ln(x1 + x2) * a dedicated sub-protocol is used to compute ln(x1 + x2) Step 2: for a condition (Ai, vi), find the random shares of E(S), E(S1) and E(S2), respectively Step 3: repeat steps 1 & 2 for all possible (Ai, vi) pairs Step 4: a circuit gate determines which (Ai, vi) pair yields the maximum gain (Figure: the parties' shares w11, w21 and w12, w22 of the values x1 and x2; output is the (Ai, vi) with maximum gain.)
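The share arithmetic of Step 1 can be sketched as below. This is only a plaintext simulation with a hypothetical helper: the real protocol computes ln(x1 + x2) cryptographically, without either party seeing the other's input.

```python
import math
import random

def xlogx_shares(x1, x2):
    """Simulate Step 1: return w1, w2 with w1 + w2 = (x1 + x2) * ln(x1 + x2).
    A local simulation stands in for the secure ln sub-protocol, so the
    privacy guarantee is NOT modeled here -- only the share arithmetic."""
    value = (x1 + x2) * math.log(x1 + x2)
    w1 = random.uniform(-1e6, 1e6)  # P1 keeps a random share ...
    w2 = value - w1                 # ... and P2 holds the remainder
    return w1, w2
```

Neither w1 nor w2 alone says anything about x1 + x2; only their sum equals (x1 + x2) ln(x1 + x2), which is exactly the term needed for the entropy shares in Step 2.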
2. K-means over vertically partitioned data • Vertically partitioned data • Normal k-means algorithm • Applying secure sum and secure comparison among multiple sites in the secure distributed algorithm
Vertically Partitioned Data • Table with key and r sets of attributes • The columns are split across sites: Site 1 holds X1…Xi, Site 2 holds Xi+1…Xj, …, Site r holds Xm+1…Xd • Every site stores the key for all records
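A minimal sketch of this layout, with hypothetical keys and values:

```python
# Vertical partitioning: every site holds ALL keys, but only its own
# subset of the attribute columns.
table = {"k1": (1.0, 2.0, 3.0, 4.0), "k2": (5.0, 6.0, 7.0, 8.0)}

site1 = {k: v[:2] for k, v in table.items()}  # attributes X1, X2
site2 = {k: v[2:] for k, v in table.items()}  # attributes X3, X4

# Joining the sites on the key reconstructs each full record.
joined = {k: site1[k] + site2[k] for k in table}
assert joined == table
```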
Motivation • Naïve approach: send all data to a trusted site and do k-means clustering there • Costly • Requires a trusted third party • Preferable: distributed privacy-preserving k-means
Basic k-means algorithm • 4 main steps: Step 1. Randomly select k initial cluster centers (the k means) repeat Step 2. Assign each point to its closest cluster center Step 3. Recalculate the k means with the new point assignment until Step 4. The k means do not change
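The 4 steps can be sketched in plain (non-private) Python; data and parameter names here are illustrative only:

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means following the 4 steps above.
    points: list of equal-length tuples of floats."""
    centers = random.sample(points, k)                      # step 1: random initial means
    for _ in range(iters):
        # step 2: assign each point to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # step 3: recalculate the k means from the new assignment
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                                  # step 4: means unchanged -> stop
            break
        centers = new
    return centers
```

The distributed version performs exactly this loop, but with step 2's distance computation and comparison replaced by the secure protocol described next.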
Distributed k-means • Why k-means can be done over vertically partitioned data • All 4 steps are decomposable! • The most costly parts (steps 2 and 3) can be done locally • We will focus on step 2 (assign each point to its closest cluster center)
Step 1 • All sites share the indices of the k randomly chosen initial records that serve as the centroids µ1 … µk • Each site stores the components of every centroid that correspond to its own attributes (Site 1: components 1…i; Site 2: components i+1…j; …; Site r: components m+1…d)
Step 2 • Assign each point x to its closest cluster center 1. Calculate the distance of point X = (X1, X2, …, Xd) to each cluster center µk -- each distance calculation is decomposable! d2 = [(X1 - µk1)2 + … + (Xi - µki)2] + [(Xi+1 - µk,i+1)2 + … + (Xj - µkj)2] + … = d1 + d2 + … (partial distances at Site 1, Site 2, …) 2. Compare the k full distances to find the minimum one • For each X, each site i has a k-element vector of its partial distances to the k centroids, denoted Xi
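A quick check of the decomposition, with hypothetical values:

```python
# The squared distance decomposes into per-site partial distances over
# each site's attribute subset.
x  = [1.0, 2.0, 3.0, 4.0]   # full point, attributes X1..X4
mu = [0.0, 1.0, 1.0, 2.0]   # one cluster center
d_full = sum((a - b) ** 2 for a, b in zip(x, mu))

d_site1 = sum((a - b) ** 2 for a, b in zip(x[:2], mu[:2]))  # site 1: X1, X2
d_site2 = sum((a - b) ** 2 for a, b in zip(x[2:], mu[2:]))  # site 2: X3, X4
assert d_full == d_site1 + d_site2
```

Because the terms simply add, each site can compute its own d_i locally; only the secure summation and comparison of these partial results needs cryptography.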
Privacy concerns for step 2 • Some concerns: • Partial distances d1, d2, … may breach privacy (revealing the Xi and µki) – need to hide them • The distance of a point to each cluster may breach privacy – need to hide it • Basic ideas to ensure security • Disguise the partial distances • Compare distances so that only the comparison result is learned • Permute the order of clusters so the real meaning of the comparison results is unknown • Needs 3 non-colluding sites (P1, P2, Pr)
Secure Computing of Step 2 • Stage 1: prepare for the secure sum of partial distances • P1 generates V1 + V2 + … + Vr = 0, where each Vi is a random k-element vector used to hide the partial distances of site i • Use homomorphic encryption to do the randomization: Ei(Xi)Ei(Vi) = Ei(Xi + Vi) • Stage 2: calculate the secure sum over r − 1 parties • P1, P3, P4, …, Pr−1 send their perturbed and permuted partial distances to Pr • Pr sums up the r − 1 partial distances (including its own part)
Secure Computing of Step 2 (figure legend: Stage 1, Stage 2) * Xi contains the partial distances to the k partial centroids at site i * Ei(Xi)Ei(Vi) = Ei(Xi + Vi): homomorphic encryption, where Ei is site i's public key * π(Xi): permutation function that perturbs the order of elements in Xi * V1 + V2 + … + Vr = 0; Vi is used to hide the partial distances
Stage 3: secure_add_and_compare to find the minimum distance (k − 1 comparisons) • Involves only Pr and P2 • Uses a standard Secure Multiparty Computation protocol to find the result • Stage 4: • The index of the minimum distance (the permuted cluster id) is sent back to P1 • P1 knows the permutation function and thus the original cluster id • P1 broadcasts the cluster id to all parties
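The masking trick behind Stages 1–2 can be sketched in plaintext (illustrative values; in the actual protocol the masks are added under homomorphic encryption, Ei(Xi)Ei(Vi) = Ei(Xi + Vi), and the vectors are additionally permuted):

```python
import random

def zero_sum_vectors(r, k):
    """P1 generates r random k-element vectors V1..Vr with V1 + ... + Vr = 0."""
    vs = [[random.uniform(-100.0, 100.0) for _ in range(k)] for _ in range(r - 1)]
    vs.append([-sum(col) for col in zip(*vs)])  # last vector cancels the rest
    return vs

r, k = 3, 4  # 3 sites, 4 clusters
partials = [[random.uniform(0.0, 10.0) for _ in range(k)] for _ in range(r)]
masks = zero_sum_vectors(r, k)

# Each site i hides its partial-distance vector Xi by adding Vi.
masked = [[x + v for x, v in zip(Xi, Vi)] for Xi, Vi in zip(partials, masks)]

# Pr sums the masked vectors; the masks cancel, leaving the true full distances,
# while no individual masked vector reveals any site's partial distances.
total = [sum(col) for col in zip(*masked)]
true_sums = [sum(col) for col in zip(*partials)]
assert all(abs(a - b) < 1e-9 for a, b in zip(total, true_sums))
```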
Step 3: can also be done locally • Update the partial means µi locally according to the new cluster assignments • Each site recomputes, for each of the k clusters, the mean of its own attribute components over the points assigned to that cluster (Figure: each site holds its attribute columns of all n records X1…Xn together with the shared cluster labels.)
Extra communication cost • O(nrk) • n : # of records • r: # of parties • k: # of means • Also depends on # of iterations
Conclusion • Cryptographic privacy-preserving protocols are appealing • Their cost is the major concern • The cost can be reduced using novel algorithms