Privacy Preserving Data Mining Yehuda Lindell & Benny Pinkas
Summary • Objective • Various components / tools needed • Algorithm
Objective • Perform data mining on the union of two private databases • The data stays private, i.e. neither party learns anything beyond the output
Assumptions • Large databases – generic secure-computation solutions are impractical • Semi-honest (honest-but-curious) parties
Classification by Decision Tree Learning • A transaction is a set of <attribute, value> pairs • One of the attributes is the class attribute • Goal: predict the class using only the non-class attributes
Decision Tree • Rooted tree with nodes/edges • Internal Nodes => Attributes • Edges leaving nodes => Possible values • Leaves => Expected Class for transaction • Traverse tree using known attributes • Predict class given leaf node’s value
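To make the traversal concrete, here is a minimal sketch (not from the paper) of classifying a transaction with such a tree; the Node/Leaf classes and the "play tennis" attributes are hypothetical.

```python
# Illustrative sketch only (not the paper's code): classifying a transaction
# by walking a decision tree. Internal nodes test one attribute, edges are
# labelled with that attribute's possible values, leaves hold the class.

class Leaf:
    def __init__(self, predicted_class):
        self.predicted_class = predicted_class

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute   # attribute tested at this node
        self.children = children     # maps attribute value -> subtree

def classify(tree, transaction):
    """Traverse the tree using the transaction's known (non-class) attributes."""
    while isinstance(tree, Node):
        tree = tree.children[transaction[tree.attribute]]
    return tree.predicted_class

# Hypothetical example: predict whether to play tennis.
tree = Node("outlook", {
    "sunny": Node("humidity", {"high": Leaf("no"), "normal": Leaf("yes")}),
    "overcast": Leaf("yes"),
    "rain": Node("wind", {"strong": Leaf("no"), "weak": Leaf("yes")}),
})
print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # yes
```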
Constructing the Tree • Top-down • At each level, find the attribute that "best" classifies the remaining transactions, i.e. adds the least overhead • Best => the attribute that minimizes entropy (maximizes information gain) • Entropy terms have the form -x ln x • Entropy 0 means the class is completely determined
Entropy calculations • H_C(T) = Σ_c -( |T(c)| / |T| ) ln( |T(c)| / |T| ): information needed to identify the class of a transaction in T, where T(c) is the set of transactions of class c and the sum runs over all possible classes • H_C(T | A): information needed to identify the class of a transaction given the value of attribute A; a weighted sum of H_C(T(v)) over the subsets T(v) of transactions with value v for A • Gain(A) = H_C(T) – H_C(T | A)
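A minimal in-the-clear sketch of these formulas (this is the computation the private protocol emulates); the representation of transactions as dicts is an assumption for illustration.

```python
# In-the-clear sketch of the entropy and gain formulas above. Transactions are
# assumed to be dicts mapping attribute names to values, one of which is the
# class attribute.
from collections import Counter
from math import log

def entropy(transactions, class_attr):
    """H_C(T) = sum over classes c of -(|T(c)|/|T|) * ln(|T(c)|/|T|)."""
    total = len(transactions)
    counts = Counter(t[class_attr] for t in transactions)
    return sum(-(n / total) * log(n / total) for n in counts.values())

def conditional_entropy(transactions, class_attr, attr):
    """H_C(T | A): entropy of the subsets T(v) with A = v, weighted by |T(v)|/|T|."""
    total = len(transactions)
    h = 0.0
    for v in set(t[attr] for t in transactions):
        subset = [t for t in transactions if t[attr] == v]
        h += (len(subset) / total) * entropy(subset, class_attr)
    return h

def gain(transactions, class_attr, attr):
    """Information gain of attribute A: Gain(A) = H_C(T) - H_C(T | A)."""
    return entropy(transactions, class_attr) - conditional_entropy(transactions, class_attr, attr)
```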
Private Computation • Privacy is defined by simulation: there exists a simulator S1 that, given only Party 1's input x1 and its output f1(x1, y), can generate Party 1's entire view of the protocol (e.g. the desired <attribute, value> pairs it sees) • So whatever Party 1 observes during the protocol, it could have computed from its own input and output alone, i.e. it learns nothing more; the same holds symmetrically for Party 2
Oblivious Evaluation • What if, in the previous example, Party 2 does not want Party 1 to learn which input (x1) it is providing? • Oblivious (polynomial) evaluation: the receiver obtains P(x) without learning anything else about the polynomial P, and the sender learns nothing about x
Oblivious Evaluation (2) – Simplified Version • Receiver holds a secret key s and input x; Sender holds the polynomial P • Receiver picks a random r and sends the ElGamal-style encryption (a^r, a^(s·r) · a^x) • Sender picks a random R and, using the homomorphic structure, replies with a fresh encryption of P(x): a pair of the form (a^R, a^(s·R) · a^(P(x))) • Receiver divides the 2nd element by the 1st element raised to the power s to obtain a^(P(x))
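A toy sketch of this exchange for a degree-1 polynomial P(z) = c1·z + c0, under stated assumptions: the group parameters p and a, and the function name, are illustrative only (the paper's actual construction differs in details and group size), and the receiver recovers a^(P(x)) rather than P(x) itself.

```python
# Toy sketch (illustrative assumptions, not the paper's exact construction):
# oblivious evaluation of P(z) = c1*z + c0 in the style sketched above.
# Higher-degree polynomials would need encryptions of x, x^2, ..., x^d.
import random

p = 2 ** 127 - 1   # a Mersenne prime, used only to keep the toy example fast
a = 3              # fixed group element
q = p - 1          # exponents are taken modulo p - 1

def oblivious_eval_linear(c1, c0, x):
    """Receiver learns a^(P(x)); Sender learns nothing about x."""
    # Receiver: secret key s, random r; sends (a^r, a^(s*r) * a^x).
    s = random.randrange(1, q)
    r = random.randrange(1, q)
    pk = pow(a, s, p)
    ct = (pow(a, r, p), (pow(pk, r, p) * pow(a, x, p)) % p)

    # Sender: random R; homomorphically builds an encryption of P(x) = c1*x + c0.
    R = random.randrange(1, q)
    u = (pow(ct[0], c1, p) * pow(a, R, p)) % p                   # a^(c1*r + R)
    v = (pow(ct[1], c1, p) * pow(a, c0, p) * pow(pk, R, p)) % p  # a^(s*(c1*r + R)) * a^(P(x))

    # Receiver: divide the 2nd element by the 1st raised to the power s.
    return (v * pow(pow(u, s, p), -1, p)) % p                    # = a^(P(x))

# P(z) = 2z + 5 at x = 7: the receiver recovers a^19.
assert oblivious_eval_linear(2, 5, 7) == pow(a, 19, p)
```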
Algorithm • Step 1 - Each party computes ID3 – decision tree learning – (O(# attributes)) • Step 2 - Combine results using cryptographic protocols like oblivious evaluation - (O(log(#transactions))) • Result - Each party gains results of data-mining without learning more than necessary
Algorithm (2) – Finding the "best" attribute is the hardest part • Each party computes its "share" of the entropy • For each attribute, the two parties' shares are combined (see the sketch below) • This requires private computation of the entropy terms -x ln x, done with oblivious evaluation • Choose the attribute that minimizes entropy • Provides maximum information gain • Ensures the most efficient tree with the least overhead
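To make the combination step concrete, here is a sketch of the ideal, in-the-clear computation this step emulates: pooling the two parties' per-attribute counts and picking the attribute with the smallest conditional entropy. The counts_pi[A][v][c] layout and the function name are assumptions for illustration; in the real protocol each x·ln x term is computed from the parties' shares via oblivious evaluation, so the counts are never pooled in the clear.

```python
# Illustrative sketch of the ideal computation behind "find the best attribute".
# counts_pi[A][v][c] = party i's number of transactions with attribute A = v
# and class c (a hypothetical layout). The quantity h below equals
# |T| * H_C(T | A); since |T| is the same for every attribute, minimizing h is
# equivalent to minimizing the conditional entropy itself.
from math import log

def x_ln_x(x):
    return x * log(x) if x > 0 else 0.0

def best_attribute(counts_p1, counts_p2, attributes):
    best, best_h = None, float("inf")
    for A in attributes:
        h = 0.0
        values = set(counts_p1.get(A, {})) | set(counts_p2.get(A, {}))
        for v in values:
            classes = set(counts_p1.get(A, {}).get(v, {})) | set(counts_p2.get(A, {}).get(v, {}))
            # Joint counts x = x1 + x2. In the real protocol each x*ln(x) term
            # is computed from the parties' shares via oblivious evaluation,
            # so the individual counts are never revealed.
            per_class = [counts_p1.get(A, {}).get(v, {}).get(c, 0) +
                         counts_p2.get(A, {}).get(v, {}).get(c, 0) for c in classes]
            h += x_ln_x(sum(per_class)) - sum(x_ln_x(x) for x in per_class)
        if h < best_h:
            best, best_h = A, h
    return best
```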
Discussion of Algorithm • Efficient: • Large databases are accommodated: the algorithm's cost depends on the number of possible values per attribute – NOT on the number of transactions in the database • Private: • Each step relies only on local computation and a private sub-protocol • Uses techniques like oblivious transfer / oblivious evaluation to exchange information • The paper proves each individual step is private, and the control flow between steps can be determined from the input/output alone – so the composed protocol is also private
Discussion of Algorithm (2) • An approximation of ID3 is computed instead of exact ID3 – it is shown to be just as secure and to provide essentially the same information