Privacy Preserving Data Mining Lecture 1 Motivating privacy research, Introducing Crypto Benny Pinkas HP Labs, Israel 10th Estonian Winter School in Computer Science
Course structure • Lecture 1: • Introduction to privacy • Introduction to cryptography, in particular, to rigorous cryptographic analysis. • Definitions • Proofs of security • Lecture 2 • Cryptographic tools for privacy preserving data mining. • Lecture 3 • Non-cryptographic tools for privacy preserving data mining • In particular, answer perturbation.
Privacy-Preserving Data Mining • Allow multiple data holders to collaborate in order to compute important information while protecting the privacy of other information. • Security-related information • Public health information • Marketing information • Advantages of privacy protection • protection of personal information • protection of proprietary or sensitive information • enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information) • compliance with the law
Privacy Preserving Data Mining • Two papers appeared in 2000 • “Privacy preserving data mining”, Agrawal and Srikant, SIGMOD 2000. (statistical approach) • “Privacy preserving data mining”, Lindell and Pinkas, Crypto 2000. (cryptographic approach) • Why privacy now? • Technological changes erode privacy: ubiquitous computing, cheap storage. • Public awareness: health coverage, employment, personal relationships. • Historical changes: Small towns vs. Cities vs. Connected society. • Privacy is a real problem that needs to be solved
Some data privacy cases: hospital data • Hospital data contains • Identifying information: name, id, address • General information: age, marital status • Medical information • Billing information • Database access issues: • Your doctor should get all the information that is required to take care of you • Emergency rooms should get all medical information that is required to take care of whoever comes there • Billing department should only get information relevant to billing • Problem: how to stop employees from getting information about family, neighbors, celebrities?
Some data privacy cases: Medical Research • Medical research: • Trying to learn patterns in the data, in “aggregate” form. • Problem: how to enable learning aggregate data without revealing personal medical information? • Hiding names is not enough, since there are many ways to uniquely identify a person • A single hospital or medical researcher might not have enough data • How can different organizations share research data without revealing personal data?
Public Data • Many public records are available in electronic form: birth records, property records, voter registration • “Your information serves as an error correcting code of your identity” • Latanya Sweeney: • Date of birth uniquely identifies 12% of the population of Cambridge, MA. • Date of birth + gender: 29% • Date of birth + gender + (9 digit) zip code: 95% • Sweeney was therefore able to get her medical information from an “anonymized” database
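The percentages above measure how many people are singled out by a combination of attributes. A minimal sketch of that measurement follows; the records are hypothetical toy data (not Sweeney's), and `fraction_unique` is a name introduced here for illustration only.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, gender, zip_code). Not real data.
records = [
    ("1970-01-01", "F", "021380001"),
    ("1970-01-01", "M", "021380001"),
    ("1970-01-01", "M", "021390002"),
    ("1982-05-17", "F", "021380001"),
    ("1982-05-17", "F", "021400003"),
    ("1990-11-30", "M", "021390002"),
]


def fraction_unique(rows, columns):
    """Fraction of rows whose values on `columns` are shared by no other row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


if __name__ == "__main__":
    print("date of birth only:        ", fraction_unique(records, (0,)))
    print("date of birth + gender:    ", fraction_unique(records, (0, 1)))
    print("date of birth + gender+zip:", fraction_unique(records, (0, 1, 2)))
```

In this toy table each added attribute raises the fraction of uniquely identified rows, which is the effect the slide describes: removing names does not help if the remaining attributes jointly act as an identifier.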
Census data • A trusted party (the census bureau) collects information about individuals • Collected data: • Explicitly identifying data (names, address..) • Implicitly identifying data (combination of several attributes) • Private data • The data is collected to help decision making • Partial or aggregate data should therefore be made public
Total Information Awareness (TIA) • Collects information about transactions (credit card purchases, magazine subscriptions, bank deposits, flights) • Early detection of terrorist activity • Check out a chemistry book from the library, buy something at a hardware store and something in a pharmacy… • Early detection of epidemic outbreaks • Early symptoms of Anthrax are similar to the flu • Check non-traditional data sources: grocery and pharmacy data, school attendance records, etc. • Such systems are developed and used • Could the collection of data be done in a privacy preserving manner (without learning about individuals)?
Basic Scenarios • Single (centralized) database, e.g., census data: • This is often a simple abstraction of a more complicated scenario, so we had better solve this one • Need to collect data and present it in a privacy preserving way • Published data (e.g., on a CD) • A “trusted” party collects data and then publishes a “sanitized” version • Users can do any computation they wish with the sanitized data • For example, statistical tabulations.
Basic Scenarios • Multi database scenarios: • Two or more parties with private data want to cooperate. • Horizontally split: Each party has a large database. Databases have the same attributes but are about different subjects. For example, the parties are banks which each have information about their customers. • Vertically split: Each party has some information about the same set of subjects. For example, the participating parties are government agencies, each with some data about every citizen. [Figure: horizontal split, with bank 1 holding subjects u1..un and bank 2 holding subjects u’1..u’n; vertical split, with bank, houses and taxes data about the same subjects u1..un]
Issues and Tools • Best privacy can be achieved by not giving any data, but.. • Privacy tools: cryptography [LP00] • Encryption: data is hidden unless you have the decryption key. However, we also want to use the data. • Secure function evaluation: two or more parties with private inputs can compute any function they wish without revealing anything else. • Strong theory. Starting to be relevant to real applications. • Non-cryptographic tools [AS00] • Query restriction: prevent certain queries from being answered. • Data/Input/output perturbation: add errors to inputs – hide personal data while keeping aggregates accurate. (randomization, rounding, data swapping.) • Can these be understood as well as we understand Crypto? Do they provide the same level of security as Crypto?
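The idea behind input perturbation (individual values are distorted, aggregates stay close) can be illustrated with a minimal sketch. It assumes zero-mean uniform noise and synthetic ages; the function name `perturb`, the noise scale and the data are all assumptions made for illustration, and this is not the specific method of [AS00].

```python
import random


def perturb(values, noise_scale, rng):
    """Add independent zero-mean uniform noise in [-noise_scale, noise_scale]."""
    return [v + rng.uniform(-noise_scale, noise_scale) for v in values]


# Synthetic data, generated with a fixed seed so the run is reproducible.
rng = random.Random(0)
ages = [rng.randint(20, 80) for _ in range(10_000)]
noisy_ages = perturb(ages, 30, rng)

true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy_ages) / len(noisy_ages)

if __name__ == "__main__":
    print("first true ages: ", ages[:5])
    print("first noisy ages:", [round(a, 1) for a in noisy_ages[:5]])
    print("true mean:", round(true_mean, 2), "noisy mean:", round(noisy_mean, 2))
```

Each published value may be off by up to 30 years, yet the noise averages out over many records, so the mean changes very little. Whether such noise actually protects individuals is exactly the question the slide raises.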
Introduction to Cryptography
Why learn/use crypto to solve privacy issues? • Why are we referring to crypto? • Cryptography is one of the tools we can use for preserving privacy • A mature research area: • many useful results/tools • Can reflect on our thinking – how is “security” defined in cryptography? How should we define “privacy”?
What is Cryptography? Traditionally: how to maintain secrecy in communication. Alice and Bob talk while Eve tries to listen. [Figure: Alice and Bob communicate over a channel; Eve eavesdrops]
History of Cryptography • Very ancient occupation • Up to the mid 70’s - mostly classified military work • Exception: Shannon, Turing* • Since then - explosive growth • Commercial applications • Scientific work: tight relationship with Computational Complexity Theory • Major works: Diffie-Hellman, Rivest, Shamir and Adleman (RSA) • Recently - more involved models for more diverse tasks. • Scope: How to maintain the secrecy, integrity and functionality of computer and communication systems.
Relation to computational hardness • Cryptography uses problems that are infeasible to solve. • Uses the intractability of some problems in order to construct secure systems. • Feasible – computable in probabilistic polynomial time (PPT) • Infeasible – no probabilistic polynomial time algorithm • Usually average case hardness is needed • For example, the discrete log problem
The Discrete Log Problem • Let $G$ be a group and $g$ an element in $G$. • Given $y \in G$, let $x$ be the minimal non-negative integer satisfying the equation $y = g^x$. $x$ is called the discrete log of $y$ to base $g$. • Example: $y = g^x \bmod p$ in the multiplicative group $Z_p^*$ ($p$ is prime). (For example, $p=7$, $g=3$, $y=4$ gives $x=4$.) • In general, it is easy to exponentiate • (using repeated squaring and the binary representation of $x$) • Computing the discrete log is believed to be hard in $Z_p^*$ if $p$ is large. (E.g., $p$ is a prime, $|p| > 768$ bits, $p = 2q+1$ and $q$ is also a prime.)
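The asymmetry between exponentiation and discrete log can be seen in a small sketch. `mod_exp` implements repeated squaring over the binary representation of x; `discrete_log_bruteforce` is an exhaustive search that is only usable for toy parameters such as p=7. Both function names are introduced here for illustration.

```python
def mod_exp(g, x, p):
    """Compute g**x mod p by repeated squaring over the bits of x."""
    result = 1
    base = g % p
    while x > 0:
        if x & 1:                      # current low bit of x is set
            result = (result * base) % p
        base = (base * base) % p       # square for the next bit
        x >>= 1
    return result


def discrete_log_bruteforce(g, y, p):
    """Smallest non-negative x with g**x = y (mod p), or None if none exists.

    Exhaustive search over exponents 0..p-2: about p steps, so toy sizes only.
    """
    value = 1                          # g**0
    for x in range(p - 1):
        if value == y % p:
            return x
        value = (value * g) % p
    return None


if __name__ == "__main__":
    print(mod_exp(3, 4, 7))                  # the slide's example: 3^4 mod 7
    print(discrete_log_bruteforce(3, 4, 7))  # recovers the exponent
```

Exponentiation takes a number of multiplications proportional to the bit length of x, while the search above takes a number of steps proportional to p itself, which is hopeless for a prime of hundreds of bits. Better algorithms than exhaustive search exist, but none is known to run in polynomial time.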
Encryption • Alice wants to send a message $m \in \{0,1\}^n$ to Bob • Set-up phase is secret • Symmetric encryption: Alice and Bob share a secret key $k$ • They want to prevent Eve from learning anything about the message [Figure: Alice sends $E_k(m)$ to Bob; Alice and Bob both hold $k$; Eve observes the channel]
Public key encryption • Alice generates a private/public key pair (SK,PK) • Only Alice knows the secret key SK • Everyone (even Eve) knows the public key PK, and can encrypt messages to Alice • Only Alice can decrypt (using SK) [Figure: Bob and Charlie each send $E_{PK}(m)$ to Alice, who holds SK; Bob, Charlie and Eve all know PK]
Rigorous Specification of Security To define the security of a system we must specify: • What constitutes a failure of the system • The power of the adversary • computational • access to the system • What it means to break the system.
What does `learn’ mean? • Even if Eve has some prior knowledge of $m$, she should not have any advantage in • Probability of guessing $m$, or probability of guessing whether $m$ is $m_0$ or $m_1$, or prob. of computing any other function $f$ of $m$, or even computing $|m|$ • Ideally: the message sent is independent of the message $m$ • Implies all the above • Achievable: one-time pad (symmetric encryption) • Let $r \in_R \{0,1\}^n$ be the shared key. • Let $m \in \{0,1\}^n$ • To encrypt $m$, send $z = r \oplus m$ • To decrypt $z$, compute $m = z \oplus r$ • Shannon: achievable only if the entropy of the shared secret is at least as large as that of $m$. Therefore must use a long key.
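A minimal sketch of the one-time pad described above, working on bytes rather than single bits. The function names are introduced here for illustration; the key must be as long as the message and must never be reused.

```python
import secrets


def keygen(n_bytes):
    """Sample a uniformly random key r of the same length as the message."""
    return secrets.token_bytes(n_bytes)


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("one-time pad needs key and message of equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def otp_encrypt(r, m):
    """z = r XOR m."""
    return xor_bytes(r, m)


def otp_decrypt(r, z):
    """m = z XOR r."""
    return xor_bytes(r, z)


if __name__ == "__main__":
    m = b"attack at dawn"
    r = keygen(len(m))
    z = otp_encrypt(r, m)
    print(otp_decrypt(r, z))

    # Any other message of the same length is consistent with z under some key,
    # which is why z alone carries no information about the message content.
    other = b"retreat at six"
    print(otp_decrypt(xor_bytes(z, other), z))
```

The second part of the usage shows the intuition for perfect secrecy: for every candidate message of the right length there is exactly one key that maps it to the observed ciphertext, and all keys are equally likely.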
Defining security The power of the adversary • Computational: Probabilistic polynomial time machine (PPTM) • Access to the system: e.g. can it change messages? • Passive adversary, (adaptive) chosen plaintext attack, chosen ciphertext attack… • What constitutes a failure of the system? • Recovering plaintext from ciphertext – not enough • Allows for the leakage of partial information • In general, hard to answer which partial information may/should not be leaked. Application dependent. • How would partial information the adversary already holds be combined with what he learns to affect privacy? • Better: Prevent learning anything about an encrypted message • There are two common, equivalent, definitions…
Security of Encryption: Definition 1, Indistinguishability of Encryptions • Adversary $A$ chooses any $X_0, X_1 \in \{0,1\}^n$ • Receives an encryption of $X_b$ for $b \in_R \{0,1\}$ • Has to decide whether $b = 0$ or $b = 1$. For every PPTM $A$, choosing a pair $X_0, X_1 \in \{0,1\}^n$: $|\Pr[A(E(X_0)) = 1] - \Pr[A(E(X_1)) = 1]| = \mathrm{neg}(n)$ • (Probability is over the choice of keys, randomization in the encryption and $A$’s coins) • Note that a proof of security must be rigorous
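The indistinguishability experiment can be simulated empirically. The sketch below estimates the distinguishing gap |Pr[A(E(X0))=1] - Pr[A(E(X1))=1]| for two toy schemes: a fresh-key one-time pad, and a deliberately broken variant that sends the first byte in the clear. All names are hypothetical, and Python's `random` module is used only to make the simulation reproducible; it is not a cryptographic generator.

```python
import random


def otp_encrypt(message, rng):
    """One-time pad with a fresh random key per message (simulation only)."""
    key = bytes(rng.randrange(256) for _ in message)
    return bytes(k ^ m for k, m in zip(key, message))


def leaky_encrypt(message, rng):
    """Broken on purpose: the first byte is sent unencrypted."""
    return message[:1] + otp_encrypt(message[1:], rng)


# The adversary's chosen pair of messages; they differ in the first byte.
X0 = b"AAAA"
X1 = b"BAAA"


def first_byte_adversary(ciphertext):
    """Outputs 1 if the ciphertext starts with the first byte of X1."""
    return 1 if ciphertext[:1] == X1[:1] else 0


def estimate_advantage(encrypt, adversary, x0, x1, trials, rng):
    """Empirical |Pr[A(E(x0)) = 1] - Pr[A(E(x1)) = 1]| over `trials` runs each."""
    ones_on_x0 = sum(adversary(encrypt(x0, rng)) for _ in range(trials))
    ones_on_x1 = sum(adversary(encrypt(x1, rng)) for _ in range(trials))
    return abs(ones_on_x0 - ones_on_x1) / trials


if __name__ == "__main__":
    rng = random.Random(1)
    print("leaky scheme:", estimate_advantage(
        leaky_encrypt, first_byte_adversary, X0, X1, 2000, rng))
    print("one-time pad:", estimate_advantage(
        otp_encrypt, first_byte_adversary, X0, X1, 2000, rng))
```

A simulation like this can only exhibit an attack; it can never establish security, because the definition quantifies over every PPTM adversary. That is the reason the slide insists on a rigorous proof.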
Security of Encryption: Definition 2, Semantic Security Simulation: Whatever Adversary $A$ can compute given an encryption of $X \in \{0,1\}^n$, so can a `simulator’ $S$ that does not get to see the encryption of $X$. • $A$ selects a distribution $D_n$ on $\{0,1\}^n$ and a relation $R(X,Y)$ computable in PPT (e.g. $R(X,Y)=1$ iff $Y$ is the last bit of $X$). • $X \in_R D_n$ is sampled • Given $E(X)$, $A$ outputs $Y$ trying to satisfy $R(X,Y)$ • The simulator $S$ does the same without access to $E(X)$ • Simulation is successful if $A$ and $S$ have the same success probability • Successful simulation $\Rightarrow$ semantic security
Security of Encryption (2): Semantic Security More formally: For every PPTM $A$ there is a PPTM $S$ so that for all PPTM relations $R$, for $X \in_R D_n$, $|\Pr[R(X, A(E(X)))] - \Pr[R(X, S())]|$ is negligible. In other words: The outputs of $A$ and $S$ are indistinguishable even for a test that is aware of $X$.
Which is the Right Definition? • Semantic security seems to convey that the message is protected • But it is usually easier to prove indistinguishability of encryptions • Would like to argue that the two definitions are equivalent • Must define the attack: chosen plaintext attack • Adversary can obtain the encryption for any message it chooses, in an adaptive manner • More severe attacks: chosen ciphertext • The Equivalence Theorem: A cryptosystem is semantically secure if and only if it has the indistinguishability of encryptions property
Equivalence Proof (informal) Semantic security $\Rightarrow$ Indistinguishability of encryptions • Suppose no indistinguishability: • $A$ chooses a pair $X_0, X_1 \in \{0,1\}^n$ for which it can distinguish encryptions with non-negligible advantage $\varepsilon$ • Choose • Distribution $D_n = \{X_0, X_1\}$ • Relation $R$ which is “equality with $X$” • For any $S$ that doesn’t get $E(X)$ and outputs $Y'$, we have $\Pr[R(X, Y')] = 1/2$ • Given $E(X_b)$, run $A(E(X_b))$, get output $b' \in \{0,1\}$, set $Y = X_{b'}$ • Now, $|\Pr[A(E(X_b)) = 1 \mid b = 1] - \Pr[A(E(X_b)) = 1 \mid b = 0]| > \varepsilon$ • Therefore, $|\Pr[R(X,Y)] - \Pr[R(X,Y')]| > \varepsilon/2$
Equivalence Proof (informal) Indistinguishability of encryptions $\Rightarrow$ Semantic security • Suppose no semantic security: $A$ chooses some distribution $D_n$ and some relation $R$ • Choose $X_0, X_1 \in_R D_n$, choose $b \in_R \{0,1\}$, compute $E(X_b)$. • Give $E(X_b)$ to $A$, ask $A$ to compute $Y_b = A(E(X_b))$ • For $X_0, X_1 \in_R D_n$ let • $\alpha_0 = \Pr[R(X_0, Y_b)]$, $\alpha_1 = \Pr[R(X_1, Y_b)]$ • With noticeable probability $|\alpha_0 - \alpha_1|$ is non-negligible, since otherwise $Y_b$ can be computed without the encryption. • If $|\alpha_0 - \alpha_1|$ is non-negligible, then we can distinguish between an encryption of $X_0$ and an encryption of $X_1$
Lessons learned? • Rigorous approach to cryptography • Defining security • Proving security
References Books: • O. Goldreich, Foundations of Cryptography Vol 1, Basic Tools, Cambridge, 2001 • (Pseudo-randomness, zero-knowledge) • Vol 2, Basic Applications (to be available May 2004) • (Encryption, Secure Function Evaluation) • Other volumes in www.wisdom.weizmann.ac.il/~oded/books.html Web material/courses: • S. Goldwasser and M. Bellare, Lecture Notes on Cryptography, http://www-cse.ucsd.edu/~mihir/papers/gb.html • M. Naor, 9th EWSCS, http://www.cs.ioc.ee/yik/schools/win2004/naor.php
Secure Function Evaluation • A major topic of cryptographic research • How to let $n$ parties, $P_1, \ldots, P_n$, compute a function $f(x_1, \ldots, x_n)$ • Where input $x_i$ is known to party $P_i$ • Parties learn the final output and nothing else
The Millionaires Problem [Yao] Alice holds x and Bob holds y. Whose value is greater? Leak no other information!
Comparing Information without Leaking it • Alice holds x and Bob holds y • Output: Is x=y? • The following solution is insecure: • Use a one-way hash function H() • Alice publishes H(x), Bob publishes H(y)
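One standard reason the hash-based comparison fails is that inputs often come from a small domain, so anyone who sees H(x) can hash every candidate and find x. The sketch below assumes, purely for illustration, that x is a 4-digit number and that H is SHA-256; neither assumption comes from the slides.

```python
import hashlib


def H(value):
    """A one-way hash of the value (SHA-256 here, as an illustrative choice)."""
    return hashlib.sha256(str(value).encode()).hexdigest()


def dictionary_attack(published_hash, candidates):
    """Recover the input by hashing every candidate from a small domain."""
    for candidate in candidates:
        if H(candidate) == published_hash:
            return candidate
    return None


if __name__ == "__main__":
    alice_x = 4321                       # hypothetical secret, e.g. a 4-digit code
    published = H(alice_x)               # what Alice publishes in the protocol
    print(dictionary_attack(published, range(10_000)))
```

One-wayness only says that inverting H is hard for inputs drawn from a large, unpredictable space. It says nothing when the adversary can enumerate the inputs, so the published hashes leak far more than the single bit "is x=y?".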
Secure two-party computation - definition • Input: Alice holds x, Bob holds y • Output: F(x,y) and nothing else • As if… each party hands its input to a trusted third party, which returns F(x,y) to both
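The ideal world with a trusted third party can be written down directly. The sketch below is only the specification that a secure protocol must emulate, not a secure protocol itself; the names `trusted_third_party` and `richer` are introduced here for illustration.

```python
def trusted_third_party(F, x, y):
    """Ideal world: receives both private inputs and returns only F(x, y).

    The returned pair is (Alice's output, Bob's output). Neither party
    receives the other's input, only the function value.
    """
    result = F(x, y)
    return result, result


def richer(x, y):
    """The millionaires' function: who holds the larger value."""
    if x > y:
        return "Alice"
    if y > x:
        return "Bob"
    return "tie"


if __name__ == "__main__":
    alice_out, bob_out = trusted_third_party(richer, 5_000_000, 3_000_000)
    print(alice_out, bob_out)
```

In this ideal world Alice's entire view is her own x together with F(x,y). A real protocol, run without any trusted party, is considered secure if it gives her nothing beyond what this view already implies.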
Leak no other information • A protocol is secure if it emulates the ideal solution • Alice learns F(x,y), and therefore can compute everything that is implied by x, her prior knowledge of y, and F(x,y). • Alice should not be able to compute anything else • Simulation: • A protocol is considered secure if: For every adversary in the real world there exists a simulator in the ideal world, which outputs an indistinguishable “transcript”, given access to the information that the adversary is allowed to learn
More tomorrow…