Privacy Preserving Data Mining
Lecture 3: Non-Cryptographic Approaches for Preserving Privacy
(Based on slides of Kobbi Nissim)
Benny Pinkas, HP Labs, Israel
10th Estonian Winter School in Computer Science
Why not use cryptographic methods?
• Many users contribute data. Cannot require them to participate in a cryptographic protocol.
• In particular, cannot require peer-to-peer communication between users.
• Cryptographic protocols incur considerable overhead.
Data Privacy
[Figure: the data are exposed to users only through an access mechanism; the question is whether users can breach privacy through it.]
An Easy, Tempting Solution (A Bad Solution)
• Idea:
  a. Remove identifying information (name, SSN, …)
  b. Publish the data
• But ‘harmless’ attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, …)
• Recall: DOB + gender + zip code identify people whp.
• Worse: ‘rare’ attributes (e.g. a disease with probability 1/3000)
What is Privacy?
Intuition: privacy is breached if it is possible to compute someone’s private information from his identity.
• Something should not be computable from the query answers
  – E.g. the mapping Joe ↦ {Joe’s private data}
• The definition should take into account the adversary’s power (computational, # of queries, prior knowledge, …)
• Quite often it is much easier to say what is surely non-private
  – E.g. strong breaking of privacy: the adversary is able to retrieve (almost) everybody’s private data
The Data Privacy Game: an Information-Privacy Tradeoff
• Private functions:
  – want to hide π_x(DB) = d_x
• Information functions:
  – want to reveal f(q, DB) for queries q
• Here: explicit definition of the private functions
  – The question: which information functions may be allowed?
• Different from crypto (secure function evaluation):
  – There, want to reveal f() (explicit definition of the information function)
  – want to hide all functions π() not computable from f()
  – Implicit definition of the private functions
  – The question whether f() should be revealed is not asked
A Simplistic Model: Statistical Database (SDB)
• Database: d ∈ {0,1}^n — one private bit per person (Mr. Fox 0/1, Ms. John 0/1, …, Mr. Doe 0/1)
• Query: a subset q ⊆ [n]
• Answer: a_q = Σ_{i∈q} d_i
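A minimal Python sketch of this model (the names and sizes are illustrative, not from the lecture): the database is a bit vector d ∈ {0,1}^n, a query is a subset q ⊆ [n], and the exact answer is the subset sum.

    import random

    n = 16
    d = [random.randint(0, 1) for _ in range(n)]    # database: one private bit per person

    def exact_answer(d, q):
        """Answer a subset-sum query q, given as a set of indices into d."""
        return sum(d[i] for i in q)

    q = {0, 3, 5, 7}                                 # a query: a subset of [n]
    print(exact_answer(d, q))                        # a_q = sum_{i in q} d_i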
Approaches to SDB Privacy
• Studied extensively since the 70s
• Perturbation
  – Add randomness. Give ‘noisy’ or ‘approximate’ answers
  – Techniques:
    – Data perturbation (perturb the data and then answer queries as usual) [Reiss 84, Liew Choi Liew 85, Traub Yemini Wozniakowski 84] …
    – Output perturbation (perturb the answers to queries) [Denning 80, Beck 80, Achugbue Chin 79, Fellegi Phillips 74] …
  – Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], …
• Query Restriction
  – Answer queries accurately but sometimes disallow queries
  – Require queries to obey some structure [Dobkin Jones Lipton 79]
  – Restrict the number of queries
  – Auditing [Chin Ozsoyoglu 82, Kleinberg Papadimitriou Raghavan 01]
Some Recent Privacy Definitions
X – data, Y – (noisy) observation of X
[Agrawal, Srikant ‘00]: interval of confidence
• Perturb the input data: let Y = X + noise (e.g. noise uniform in [-100,100]). One can still estimate the underlying distribution.
• Tradeoff: more noise ⇒ less accuracy but more privacy.
• Intuition: a large possible interval ⇒ privacy preserved
  – Given Y, we know that with c% confidence X is in [a1,a2]. For example, for Y=200, with 50% confidence X is in [150,250].
  – a2−a1 defines the amount of privacy at c% confidence
• Problem: there might be a-priori information about X
  – E.g. X = someone’s age and Y = −97: since ages are non-negative, X must lie in [0,3].
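A rough illustration of the [AS] notion, as a sketch under the simplifying assumption of a flat prior on X (the numbers mirror the example above):

    import random

    W = 100                                   # noise is uniform in [-W, W]

    def randomize(x):
        """[AS]-style input perturbation: Y = X + uniform noise."""
        return x + random.uniform(-W, W)

    def confidence_interval(y, c):
        """Interval containing X with confidence c, given Y and a flat prior:
        X is uniform on [Y-W, Y+W], so a centered c-confidence interval has width 2*c*W."""
        return (y - c * W, y + c * W)

    print(confidence_interval(200, 0.5))      # -> (150.0, 250.0), as in the example above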
The [AS] Scheme Can Be Turned Against Itself
• Assume that the number of data points N is large
• Even if the data miner has no a-priori information about X, it can estimate the distribution f_X from the randomized data Y.
• Example: the perturbation is uniform in [-1,1]
  – [AS]: a privacy interval of length 2 with confidence 100%
  – Let f_X put 50% of the mass on x ∈ [0,1] and 50% on x ∈ [4,5].
  – But after learning f_X, the value of X can be localized within an interval of size at most 1.
• Problem: aggregate information provides information that can be used to attack individual data
Some Recent Privacy Definitions
X – data, Y – (noisy) observation of X
[Agrawal, Aggarwal ‘01]: mutual information
• Intuition:
  – High entropy is good. I(X;Y) = H(X) − H(X|Y) (mutual information)
  – Small I(X;Y) ⇒ privacy preserved (Y provides little information about X).
• Problem [EGS]:
  – An average notion. A privacy loss can happen with low but significant probability without affecting I(X;Y).
  – Sometimes I(X;Y) seems good but privacy is breached
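A toy sketch of the [AA] measure (the joint distribution here is made up purely for illustration): I(X;Y) is computed as H(X) + H(Y) − H(X,Y), which equals H(X) − H(X|Y).

    from math import log2

    # Hypothetical joint distribution p(x, y) on a small domain
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p

    mutual_information = entropy(px) + entropy(py) - entropy(joint)
    print(mutual_information)                 # I(X;Y) = H(X) - H(X|Y)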
Output Perturbation (Randomization Approach)
• Exact answer to query q: a_q = Σ_{i∈q} d_i
• Actual SDB answer: â_q
• Perturbation ε: for all q, |â_q − a_q| ≤ ε
• Questions:
  – Does perturbation give any privacy?
  – How much perturbation is needed for privacy?
  – Usability?
Privacy Preserved by ≈√n Perturbation
• Database: d ∈_R {0,1}^n (uniform input distribution!)
• Algorithm: on query q,
  1. Let a_q = Σ_{i∈q} d_i
  2. If |a_q − |q|/2| < ε, return â_q = |q|/2
  3. Otherwise return â_q = a_q
  (with ε ≈ √n·(lg n)²)
• Privacy is preserved:
  – Assume poly(n) queries
  – If ε ≈ √n·(lg n)², whp rule 2 is always used
  – No information about d is given!
  – (but the database is completely useless…)
• Shows that a perturbation of ≈√n is sometimes enough for privacy. Can we do better?
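A sketch of this (deliberately useless) mechanism, assuming the threshold ε ≈ √n·(lg n)² reconstructed above; names and sizes are illustrative.

    import math
    import random

    n = 1024
    d = [random.randint(0, 1) for _ in range(n)]       # uniform random database
    eps = math.sqrt(n) * math.log2(n) ** 2             # perturbation threshold

    def perturbed_answer(q):
        """Hide the true sum behind |q|/2 whenever it is within eps of |q|/2."""
        a_q = sum(d[i] for i in q)
        if abs(a_q - len(q) / 2) < eps:
            return len(q) / 2          # rule 2: reveals nothing about d
        return a_q                     # rule 3: whp never reached for a random d

    q = random.sample(range(n), n // 2)
    print(perturbed_answer(q))         # whp simply |q|/2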
Perturbation << √n Implies no Privacy
• The previous (useless) database achieves the best possible perturbation.
• Theorem [Dinur-Nissim]: Given any DB and any DB response algorithm with perturbation ε = o(√n), there is a poly-time reconstruction algorithm that outputs a database d’ such that dist(d,d’) = o(n).
⇒ strong breaking of privacy
The Adversary as a Decoding Algorithm
[Figure: the database d is “encoded” into the partial sums a_q1,…,a_qt, which are perturbed into â_q1,…,â_qt; the adversary “decodes” the perturbed sums back into a database d’ close to d.]
Proof of Theorem [DN03]: the Adversary’s Reconstruction Algorithm
• Query phase: get â_qj for t random subsets q1,…,qt
• Weeding phase: solve the linear program (over ℝ):
  – 0 ≤ x_i ≤ 1
  – |Σ_{i∈qj} x_i − â_qj| ≤ ε for every j
• Rounding: let c_i = round(x_i), output c
Observation: a solution always exists, e.g. x = d.
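A sketch of the attack for a small database, assuming SciPy’s linprog as the LP solver (any LP solver would do); the sizes and the value of eps are illustrative, and the three phases follow the algorithm above.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, t, eps = 64, 400, 2                              # toy sizes; eps = o(sqrt(n))
    d = rng.integers(0, 2, size=n)                      # the secret database

    # Query phase: t random subsets, answers perturbed by at most eps
    queries = [(rng.random(n) < 0.5).astype(float) for _ in range(t)]
    answers = [float(q @ d) + rng.uniform(-eps, eps) for q in queries]

    # Weeding phase: find 0 <= x <= 1 with |sum_{i in q_j} x_i - a_hat_j| <= eps for all j
    A_ub = np.vstack([np.vstack([q, -q]) for q in queries])
    b_ub = np.concatenate([[a + eps, eps - a] for a in answers])
    lp = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)

    # Rounding phase
    c_hat = (lp.x > 0.5).astype(int)
    print("entries reconstructed incorrectly:", int(np.sum(c_hat != d)))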
Why Does the Reconstruction Algorithm Work?
• Consider x ∈ {0,1}^n s.t. dist(x,d) = c·n = Ω(n)
• Observation:
  – A random q contains c’·n coordinates in which x ≠ d
  – The difference between the sums over these coordinates is, with constant probability, at least Ω(√n) (> ε = o(√n)).
  – Such a q disqualifies x as a solution for the LP
• With t = poly(n) random queries, such an x survives all of them with probability 2^{−Ω(t)}; a union bound over all (at most 2^n) such vectors x shows that, for a large enough polynomial t, all of them are disqualified with overwhelming probability.
Summary of Results (statistical database)
• [Dinur, Nissim 03]:
  – Unlimited adversary:
    – A perturbation of magnitude Ω(n) is required
  – Polynomial-time adversary:
    – A perturbation of magnitude Ω(√n) is required (shown above)
  – In both cases, the adversary may reconstruct a good approximation of the database
    – Disallows even very weak notions of privacy
  – Bounded adversary, restricted to T << n queries (SuLQ):
    – There is a privacy-preserving access mechanism with perturbation ≈ √T << √n
    – A chance for usability
    – A reasonable model as databases grow larger and larger
SuLQ for a Multi-Attribute Statistical Database (SDB)
• Database: {d_{i,j}} — n persons (rows) × k attributes (columns), d_{i,j} ∈ {0,1}
• Query: (q, f), where q ⊆ [n] and f : {0,1}^k → {0,1}
• Answer: a_{q,f} = Σ_{i∈q} f(d_i)
• Row distribution: D = (D_1, D_2, …, D_n)
[Figure: an n×k binary matrix; f is applied to each selected row and the results are summed.]
Privacy and Usability Concerns for the Multi-Attribute Model [DN]
• A rich set of queries: subset sums over any property of the k attributes
  – Obviously increases usability, but how is privacy affected?
• More to protect: functions of the k attributes
• Relevant factors:
  – What is the adversary’s goal?
  – Row dependency
  – Vertically split data (between k or fewer databases):
    – Can privacy still be maintained with independently operating databases?
Privacy Definition – Intuition
• A 3-phase adversary:
  – Phase 0: defines a target set G of poly(n) functions g : {0,1}^k → {0,1}
    – Will try to learn some of this information about someone
  – Phase 1: adaptively queries the database T = o(n) times
  – Phase 2: using all the information gained, chooses an index i of a row it intends to attack and a function g ∈ G
• Attack: given d_{−i} (all rows except row i), try to guess g(d_{i,1} … d_{i,k})
The Privacy Definition
• p^0_{i,g} – the a-priori probability that g(d_{i,1} … d_{i,k}) = 1
• p^T_{i,g} – the a-posteriori probability that g(d_{i,1} … d_{i,k}) = 1, given the answers to the T queries and d_{−i}
• Define conf(p) = log(p / (1−p))
  – A 1-1 relationship between p and conf(p)
  – conf(1/2) = 0; conf(2/3) = 1; conf(1) = ∞
• Δconf_{i,g} = conf(p^T_{i,g}) − conf(p^0_{i,g})
• (ε,T)-privacy (“relative privacy”):
  – For all distributions D_1 … D_n, every row i, every function g, and any adversary making at most T queries: Pr[Δconf_{i,g} > ε] = neg(n)
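A tiny sketch of the confidence function and of the quantity the definition bounds (the probability values are chosen purely for illustration):

    from math import log2, inf

    def conf(p):
        """Log-odds: conf(1/2) = 0, conf(2/3) = 1, conf(1) = infinity."""
        return inf if p == 1 else log2(p / (1 - p))

    def delta_conf(p_prior, p_posterior):
        """The adversary's confidence gain about g(d_i) after the T queries."""
        return conf(p_posterior) - conf(p_prior)

    # (eps, T)-privacy: delta_conf exceeds eps only with negligible probability
    print(conf(0.5), conf(2 / 3), delta_conf(0.5, 0.9))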
The SuLQ* Database
• The adversary is restricted to T << n queries
• On query (q, f):
  – q ⊆ [n]
  – f : {0,1}^k → {0,1} (a binary function)
  – Let a_{q,f} = Σ_{i∈q} f(d_{i,1} … d_{i,k})
  – Let N be binomial noise with mean 0 and magnitude ≈ √T
  – Return a_{q,f} + N
*SuLQ – Sub-Linear Queries
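A sketch of the answering rule, under the assumption (my reading of the noise line above) that N is centered binomial noise with standard deviation on the order of √T; the data and the property f are made up for illustration.

    import numpy as np

    rng = np.random.default_rng()
    T = 100                                        # query budget, T << n

    def sulq_answer(rows, q, f):
        """rows: list of k-bit tuples; q: set of row indices; f: {0,1}^k -> {0,1}."""
        a = sum(f(rows[i]) for i in q)
        noise = rng.binomial(4 * T, 0.5) - 2 * T   # mean 0, std dev = sqrt(T)
        return a + noise

    rows = [tuple(rng.integers(0, 2, size=3)) for _ in range(1000)]
    print(sulq_answer(rows, set(range(500)), lambda r: r[0] & r[2]))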
Privacy Analysis of the SuLQ Database
• p^m_{i,g} – the a-posteriori probability that g(d_{i,1} … d_{i,k}) = 1, given d_{−i} and the answers to the first m queries
• conf(p^m_{i,g}) describes a random walk on the line with:
  – Starting point: conf(p^0_{i,g})
  – Compromise: conf(p^m_{i,g}) − conf(p^0_{i,g}) > ε
• W.h.p. more than T steps are needed to reach compromise (i.e. to move from conf(p^0_{i,g}) to conf(p^0_{i,g}) + ε)
Usability: One Multi-Attribute SuLQ DB
• Statistics of any property f of the k attributes
  – I.e. for what fraction of the (sub)population does f(d_1 … d_k) hold?
  – Easy: just put f in the query
• Other applications:
  – k independent multi-attribute SuLQ DBs
  – Vertically partitioned SuLQ DBs
  – Testing whether Pr[β | α] ≥ Pr[β] + Δ
• Caveat: we hide g() about a specific row (not about multiple rows)
Overview of Methods
• Input perturbation: the data are perturbed into SDB’, and queries are answered from SDB’
• Output perturbation: a (possibly restricted) query is answered with a perturbed response
• Query restriction: a (restricted) query receives either an exact response or a denial
Query Restriction
• The decision whether to answer or deny a query:
  – can be based on the content of the query and on the answers to previous queries,
  – or can be based on the above and on the content of the database
Auditing
• [AW89] classify auditing as a query restriction method:
  – “Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued”
• Partial motivation: may allow more queries to be posed, if no privacy threat occurs.
• Early work: Hofmann 1977, Schlorer 1976, Chin, Ozsoyoglu 1981, 1986
• Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003
How Auditors may Inadvertently Compromise Privacy
The Setting
• Dataset: d = {d_1,…,d_n}
  – Entries d_i: real, integer, or Boolean
• Query: q = (f, i_1,…,i_k), answered by f(d_{i1},…,d_{ik})
  – f: min, max, median, sum, average, count, …
• Bad users will try to breach the privacy of individuals
  – Compromise = uniquely determining some d_i (a very weak definition)
Auditing
[Figure: the user poses a new query q_{i+1} to the statistical database; the auditor, holding the query log q_1,…,q_i, either returns the answer or denies the query (as the answer would cause privacy loss).]
Example 1: Sum/Max Auditing
• d_i real, sum/max queries; privacy is breached if some d_i is learned
• q1 = sum(d1,d2,d3): answered, sum(d1,d2,d3) = 15
• q2 = max(d1,d2,d3): denied (the answer would cause privacy loss)
• But there must be a reason for the denial… q2 is denied iff d1 = d2 = d3 = 5
• Attacker: “I win!”  Auditor: “Oh well…”
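A sketch of this interaction with a hypothetical naive auditor that denies exactly when answering would pin down some d_i; the denial itself then hands the attacker the values.

    def naive_max_auditor(d, known_sum):
        """Deny max(d1,d2,d3) iff revealing it would determine some d_i.
        With sum = 15 that happens exactly when max = 5, i.e. d1 = d2 = d3 = 5."""
        m = max(d)
        return "denied" if 3 * m == known_sum else m

    print(naive_max_auditor([7.0, 7.0, 1.0], 15))   # answers 7; no single d_i is determined
    print(naive_max_auditor([5.0, 5.0, 5.0], 15))   # denied -> attacker infers d1 = d2 = d3 = 5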
Sounds Familiar?
• David Duncan, former auditor for Enron and partner in Andersen: “Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States.”
Max Auditing
• d_i real
• q1 = max(d1,d2,d3,d4): answered, M1234
• q2 = max(d1,d2,d3): answered M123, or denied
  – If denied: d4 = M1234
• q3 = max(d1,d2): answered M12, or denied
  – If denied: d3 = M123
• Continuing this way, the attacker learns an item with probability ½
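A sketch of the adaptive attack against a hypothetical naive max auditor (one that denies whenever its answer, combined with the earlier answers, would reveal some value):

    import random

    d = [random.random() for _ in range(4)]            # d1..d4, real values

    M1234 = max(d)                                     # q1: always safe to answer
    if max(d[:3]) < M1234:
        # q2 = max(d1,d2,d3) is denied, since answering would reveal d4 = M1234;
        # but the denial itself tells the attacker the same thing
        print("q2 denied -> attacker learns d4 =", M1234)
    else:
        M123 = max(d[:3])
        if max(d[:2]) < M123:
            print("q3 denied -> attacker learns d3 =", M123)
        else:
            print("nothing learned from q1-q3 this time")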
Boolean Auditing?
• d_i Boolean
• q1 = sum(d1,d2), q2 = sum(d2,d3), …: each answered “1” or denied
• q_i is denied iff d_i = d_{i+1} (an answer of 0 or 2 would reveal both bits)
• The pattern of answers/denials therefore reveals which neighbouring bits are equal ⇒ learn the database or its complement
The Problem
• Query denials leak (potentially sensitive) information
• Users cannot decide denials by themselves
[Figure: within the set of possible assignments to {d1,…,dn}, the denial of q_{i+1} reveals that the true database lies in a particular subset of the assignments consistent with (q1,…,qi, a1,…,ai).]
Solution to the Problem: Simulatable Auditing
• An auditor is simulatable if there exists a simulator that, given only the queries q1,…,q_{i+1} and the answers a1,…,ai (but not the database), makes the same deny/answer decision as the auditor.
• Simulation ⇒ denials do not leak information.
Why do Simulatable Auditors not Leak Information?
[Figure: the deny/allow decision for q_{i+1} is a function of (q1,…,qi, a1,…,ai) only, so it is the same for every one of the possible assignments to {d1,…,dn} consistent with those answers — the decision conveys nothing new.]
Simulatable auditing
Query Restriction for Sum Queries
• Given: a dataset D = {x_1,…,x_n}, x_i ∈ ℝ
• Query: a subset S of the elements; the answer is Σ_{x_i∈S} x_i
• Is it possible to compromise D?
  – Here compromise means: uniquely determine some x_i from the queries
• Compromise is possible if subsets can be arbitrarily small: sum(x_9) = x_9
Query Set Size Control
• Do not permit queries that involve a small subset of the database.
• Compromise is still possible
  – Want to discover x: sum(x, y_1,…,y_k) − sum(y_1,…,y_k) = x
  – Issue: overlap
• In general, limiting overlap is not enough.
  – Need to restrict also the number of queries
  – Note that overlap itself sometimes restricts the number of queries (e.g. if the size of queries is cn and the overlap is constant, there are only about 1/c possible queries)
Restricting Set-Sum Queries
• Restrict the sum queries based on:
  – the number of database elements in the sum
  – the overlap with previous sum queries
  – the total number of queries
• Note that the criteria are known to the user
  – They do not depend on the contents of the database
  – Therefore, the user can simulate the denial/no-denial answer given by the DB
  ⇒ simulatable auditing (see the sketch below)
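A sketch of such a restriction rule; since it looks only at the queries themselves (size, pairwise overlap, count) and never at the data, the user can evaluate the very same function, which is what makes it simulatable. The parameters k, r, T are placeholders.

    def allow(query, past_queries, k=10, r=2, T=20):
        """Allow a set-sum query iff it is large enough, overlaps each previous
        query in at most r elements, and the query budget T is not exceeded.
        The decision depends only on the queries, never on the data."""
        if len(query) < k or len(past_queries) + 1 > T:
            return False
        return all(len(query & q) <= r for q in past_queries)

    log = []
    q1 = set(range(12))
    if allow(q1, log):
        log.append(q1)                      # the SDB would also return sum(x_i for i in q1)
    print(allow(set(range(5, 17)), log))    # False: overlap with q1 is 7 > r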
Restricting Overlap and Number of Queries
• Assume:
  – |Q_i| ≥ k for every query Q_i
  – |Q_i ∩ Q_j| ≤ r for every two queries
  – The adversary knows at most L values a priori, L + 1 < k
• Claim: the data cannot be compromised with fewer than 1 + (2k − L)/r sum queries.
[Figure: the queries Q_1,…,Q_t as rows of a 0/1 incidence matrix over x_1,…,x_n, each row with at least k ones and any two rows sharing at most r ones.]
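For a concrete feel for the bound (the numbers are chosen purely for illustration): with minimum query size k = 100, maximum pairwise overlap r = 10 and no a-priori knowledge (L = 0), compromising even a single x_i requires at least 1 + (2·100 − 0)/10 = 21 sum queries.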
Overlap + Number of Queries
Claim: the data cannot be compromised with fewer than 1 + (2k − L)/r sum queries [Dobkin, Jones, Lipton] [Reiss]
• k ≤ query size, r ≥ overlap, L = number of a-priori known items
• Suppose x_c is compromised after t queries, where each query is represented by
  – Q_i = x_{i1} + x_{i2} + … + x_{ik}, for i = 1,…,t
• This implies that
  – x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{ij}
• Let η_{iℓ} = 1 if x_ℓ is in query i, and 0 otherwise. Then
  – x_c = Σ_{i=1..t} Σ_{ℓ=1..n} α_i η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ
Overlap + Number of Queries
We have: x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ
• In the above sum, the coefficient Σ_{i=1..t} α_i η_{iℓ} must be 0 for every x_ℓ except x_c (in order for x_c to be compromised)
• For a given ℓ, this happens iff η_{iℓ} = 0 for all i, or there are queries i ≠ j with η_{iℓ} = η_{jℓ} = 1 where α_i and α_j have opposite signs,
• or α_i = 0, in which case the i-th query did not matter
Overlap + Number of Queries
• Wlog, the first query contains x_c and the second query has a coefficient of the opposite sign.
• In the first query, k elements are probed.
• The second query adds at least k − r new elements.
• Elements from the first and the second query cannot be canceled within the same (additional) query (opposite signs would be required).
• Therefore each new query cancels items from the first or from the second query, but not from both.
• 2k − r − L elements need to be canceled, and each additional query cancels at most r of them.
• Hence 2 + (2k − r − L)/r queries are needed, i.e. 1 + (2k − L)/r.
Notes
• The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small:
  – If k = n/c for some constant c and r = const, then there are only ~c queries where no two queries overlap by more than 1.
  – Hence the query sequence length may be uncomfortably short.
  – Alternatively, if r = k/c (the overlap is a constant fraction of the query size), then the number of queries, 1 + (2k − L)/r, is O(c).
Conclusions
• Privacy should be defined and analyzed rigorously
  – In particular, assuming that randomization ⇒ privacy is dangerous
• High perturbation is needed for privacy against polynomial adversaries
  – A threshold phenomenon: above √n – total privacy, below √n – no privacy (for a poly-time adversary)
  – Main tool: a reconstruction algorithm
• Careless auditing might leak private information
• Self-auditing (simulatable auditors) is safe
  – The decision whether to allow a query is based on previous ‘good’ queries and their answers
  – Without access to the DB contents
  – Users may apply the decision procedure by themselves
ToDo
• Come up with a good model and requirements for database privacy
  – Learn from crypto
  – Protect against more general loss of privacy
• Simulatable auditors are a starting point for designing more reasonable audit mechanisms
References
• Course web page: A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html
• Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html