Microdata Sharing Via Pseudonymization. UNECE Work session on statistical data confidentiality, Manchester, December 18th, 2007.
Motivation
• Microdata on individuals is essential for empirical research
• Releasing it directly compromises the privacy of those individuals
• Goal: to build privacy-preserving microdata sharing systems through pseudonymization
Problem statement
• Suppliers own confidential microdata on individuals: (id1,D(id1)),…,(idn,D(idn))
• Researchers want to correlate microdata from different Suppliers
• Example: a Researcher wants to find the correlation between drug prescriptions (Chemists) and traffic accidents (Insurers)
• Question: how can Researchers correlate microdata without having access to sensitive information?
Framework [Figure; callouts: "I want to correlate" and "Maybe de-identified data?"]
Supplying de-identified data
• If Suppliers de-identify the data by:
  - removing the identifier field
  - applying Statistical Disclosure Control (SDC) mechanisms
• then no sensitive information is leaked, but matching is not possible!
Pseudonymizing data via TTPs
• Solution 1: a Trusted Third Party (TTP) replaces real identifiers by random identifiers (pseudonyms) P(id), where each P(id) is random
• The table mapping id to P(id) is known only to the TTP
• Matching is possible!
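Solution 1 can be sketched in a few lines of Python. This is only an illustration: the class name `TablePseudonymizer`, the sample records, and the 16-byte pseudonym length are assumptions, not part of the slides.

```python
import secrets


class TablePseudonymizer:
    """Toy TTP that maps each real identifier to a random pseudonym."""

    def __init__(self):
        self._table = {}  # secret table: real id -> pseudonym

    def pseudonym(self, real_id):
        # Draw a fresh random pseudonym the first time an id is seen,
        # then reuse it so records from different Suppliers still match.
        if real_id not in self._table:
            self._table[real_id] = secrets.token_hex(16)
        return self._table[real_id]


ttp = TablePseudonymizer()

# Hypothetical Supplier data, keyed by real identifier.
chemists = {"alice": "sedative", "bob": "antibiotic"}
insurers = {"alice": "1 accident", "carol": "0 accidents"}

pnym_chemists = {ttp.pseudonym(i): d for i, d in chemists.items()}
pnym_insurers = {ttp.pseudonym(i): d for i, d in insurers.items()}

# The Researcher joins on pseudonyms without seeing real identifiers.
matched = {p: (pnym_chemists[p], pnym_insurers[p])
           for p in pnym_chemists.keys() & pnym_insurers.keys()}
print(matched)
```

The secret table grows with the number of identifiers, which is the storage cost this approach imposes on the TTP.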
Pseudonymizing data via TTPs (II)
• Solution 1:
  - Advantages: unconditional security (with respect to pseudonymization); matching is possible
  - Drawback: the TTP must store a huge table secretly
• Solution 2: use a block cipher (Enc(K,·), Dec(K,·)) and set P(id) = Enc(K,id)
  - Advantage: only the key K must be stored secretly
  - Drawbacks: security is not unconditional; different Researchers might not have the same access rights
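A minimal sketch of Solution 2 follows. The Python standard library has no block cipher, so a toy 4-round Feistel network with HMAC-SHA256 as round function stands in for Enc/Dec here; this stand-in, the 16-byte block size, and the zero padding of identifiers are all assumptions for illustration. A real deployment would use a vetted cipher such as AES.

```python
import hashlib
import hmac

HALF = 8  # bytes per Feistel half, so blocks are 16 bytes


def _round(key, rnd, half):
    # Round function: HMAC-SHA256 keyed with K, truncated to one half.
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:HALF]


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key, block, rounds=4):
    left, right = block[:HALF], block[HALF:]
    for rnd in range(rounds):
        left, right = right, _xor(left, _round(key, rnd, right))
    return left + right


def dec(key, block, rounds=4):
    left, right = block[:HALF], block[HALF:]
    for rnd in reversed(range(rounds)):
        left, right = _xor(right, _round(key, rnd, left)), left
    return left + right


def pseudonym(key, real_id):
    # P(id) = Enc(K, id), with the id zero-padded to one block.
    raw = real_id.encode()
    if len(raw) > 2 * HALF:
        raise ValueError("identifier longer than one block")
    return enc(key, raw.ljust(2 * HALF, b"\0"))


K = b"demo key known only to the TTP"  # hypothetical key
p = pseudonym(K, "alice")
recovered = dec(K, p).rstrip(b"\0").decode()
print(p.hex(), recovered)
```

Because the mapping is a keyed permutation, the TTP can recompute or invert any pseudonym from K alone and no longer needs the table.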
Pseudonymizing data via TTPs (III) [Figure; callouts: "We share and win!" and "Not allowed to match Chemists and Insurers data"]
Pseudonymizing data via TTPs (IV)
• Solution 3: allocate a different key Ki to every Researcher Ri
• Pseudonyms are destination-dependent: P(id,Ri) = Enc(Ki,id)
• P(id*,R1) and P(id*,R2) look unrelated
Pseudonymizing data via TTPs (V)
• Advantage: disallowed matching among malicious Researchers is prevented
• Drawback: the TTP must be on-line to perform sensitive operations (pseudonymization and matching)
Let's see why…
Pseudonymization with symmetric encryption
Supplying pseudonymized data:
• Supplier Sj sends the datablocks D(id1),…,D(idl) to Researcher Ri
• Sj sends the identities id1,…,idl, in the same order, to the TTP
• The TTP sends the list P(id,Ri) = Enc(Ki,id) to Ri
• Ri forms the pnymized database (P(id1,Ri),D(id1)),…,(P(idl,Ri),D(idl))
Pseudonymization with symmetric encryption
Matching the pnymized databases of Ri and Rd:
• Ri sends to Rd the data D(id1,i),…,D(idl,i)
• Ri sends to the TTP the pseudonyms P(id1,Ri),…,P(idl,Ri)
• The TTP decrypts Dec(Ki,P(id,Ri)) = id and encrypts P(id,Rd) = Enc(Kd,id). The result is sent to Rd
• Rd matches the pnymized databases (P(id1,Rd),D(id1,i)),…,(P(idl,Rd),D(idl,i)) and (P(id1,Rd),D(id1,d)),…,(P(idm,Rd),D(idm,d))
• As a result, the TTP is a bottleneck for the system
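The supply and matching steps can be sketched end to end as follows. As an assumption, a toy Feistel network built on HMAC-SHA256 stands in for the block cipher, and the `TTP` class, its method names, and the sample records are hypothetical. The point to notice is that every pseudonymization and every translation goes through the TTP.

```python
import hashlib
import hmac
import secrets

HALF = 8  # toy cipher works on 16-byte blocks


def _f(key, rnd, half):
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:HALF]


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def enc(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def dec(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return left + right


class TTP:
    """Holds one secret key Ki per Researcher Ri."""

    def __init__(self, researchers):
        self._keys = {r: secrets.token_bytes(32) for r in researchers}

    def pseudonymize(self, real_id, researcher):
        # P(id, Ri) = Enc(Ki, id)
        raw = real_id.encode()
        if len(raw) > 2 * HALF:
            raise ValueError("identifier longer than one block")
        return enc(self._keys[researcher], raw.ljust(2 * HALF, b"\0"))

    def translate(self, pnym, source, dest):
        # Dec(Ki, P(id, Ri)) = id, then P(id, Rd) = Enc(Kd, id)
        return enc(self._keys[dest], dec(self._keys[source], pnym))


ttp = TTP(["Ri", "Rd"])

# Supply: each Researcher receives data under its own pseudonyms.
db_i = {ttp.pseudonymize(i, "Ri"): d
        for i, d in {"alice": "sedative", "bob": "antibiotic"}.items()}
db_d = {ttp.pseudonymize(i, "Rd"): d
        for i, d in {"alice": "1 accident", "carol": "0 accidents"}.items()}

# Matching: the TTP re-keys Ri's pseudonyms for Rd, who then joins locally.
db_i_for_d = {ttp.translate(p, "Ri", "Rd"): d for p, d in db_i.items()}
matched = {p: (db_i_for_d[p], db_d[p])
           for p in db_i_for_d.keys() & db_d.keys()}
print(matched)
```

Without the `translate` call the two databases share no pseudonyms, so Ri and Rd cannot match on their own.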
Pseudonymization using public key crypto
• Let $G = \langle g \rangle$ be a group of prime order $p$, and let $H: \{0,1\}^* \to G$ be a hash function
• The TTP assigns a secret key $x_i \in \mathbb{Z}_p$ to Researcher $R_i$
• $P(id, R_i) = H(id)^{x_i}$
• Supplying pseudonymized data from $S_j$ to $R_i$:
  - Supplier $S_j$ and Researcher $R_i$ jointly compute the pnymized database $\{P(id, R_i), D(id)\}$
  - The TTP allocates pnymizing keys $(\mu, \nu) \in \mathbb{Z}_p \times \mathbb{Z}_p$ such that $\mu \cdot \nu = x_i$; $\mu$ is sent to $S_j$ and $\nu$ is sent to $R_i$
  - $S_j$ computes and sends $H(id_1)^{\mu}, \dots, H(id_l)^{\mu}$ to $R_i$
  - $R_i$ computes $(H(id)^{\mu})^{\nu} = H(id)^{x_i} = P(id, R_i)$
  - $R_i$ forms the pnymized database $(P(id_1, R_i), D(id_1)), \dots, (P(id_l, R_i), D(id_l))$
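The split-key computation can be checked numerically with a small sketch. Everything concrete here is an assumption for illustration: the group is the subgroup of squares modulo the toy safe prime 2039 (prime order 1019, far too small for real use), the hash-to-group map is "SHA-256, reduce, square", and the names `MODULUS`, `ORDER`, `hash_to_group`, and `split_key` are hypothetical. `ORDER` plays the role of the slide's group order. The modular inverse via `pow(mu, -1, ORDER)` needs Python 3.8 or later.

```python
import hashlib
import secrets

MODULUS = 2039  # toy safe prime: MODULUS = 2 * ORDER + 1
ORDER = 1019    # prime order of the subgroup of squares mod MODULUS


def hash_to_group(real_id):
    # Hash, map into [1, MODULUS - 1], then square to land in the subgroup.
    digest = int.from_bytes(hashlib.sha256(real_id.encode()).digest(), "big")
    return pow(digest % (MODULUS - 1) + 1, 2, MODULUS)


def split_key(x):
    # TTP side: pick mu at random and set nu = x / mu mod ORDER,
    # so that mu * nu = x (mod ORDER).
    mu = secrets.randbelow(ORDER - 1) + 1
    nu = x * pow(mu, -1, ORDER) % ORDER
    return mu, nu


x_i = secrets.randbelow(ORDER - 1) + 1  # Ri's key, known only to the TTP
mu, nu = split_key(x_i)                 # mu goes to Sj, nu goes to Ri

ids = ["alice", "bob"]
blinded = [pow(hash_to_group(i), mu, MODULUS) for i in ids]  # computed by Sj
pnyms = [pow(b, nu, MODULUS) for b in blinded]               # computed by Ri

# Reference value H(id)^{x_i}, which only the TTP could compute directly.
direct = [pow(hash_to_group(i), x_i, MODULUS) for i in ids]
print(pnyms == direct)
```

Neither share alone determines the key, since for any fixed share the other is uniformly distributed over the non-zero exponents.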
Pseudonymization with public key crypto (II)
Matching the pnymized databases of Ri and Rd:
• Ri and Rd can do this themselves with a 1-round interactive protocol, provided certain keys are obtained off-line from the TTP
• Neither Ri nor Rd learns its pnymizing key (xi, xd), even if they collude
• Rd only learns D(id,Ri) for the ids in the intersection
• Security is based on the Decision Diffie-Hellman assumption
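The slides do not spell out the matching protocol. The identity below only shows why a pseudonym under one key can be converted to a pseudonym under another by exponentiation. The conversion exponent $c$ is an assumption introduced here, and how the TTP would split and distribute it is left open.

```latex
P(id, R_i)^{c}
  = \left( H(id)^{x_i} \right)^{x_d \, x_i^{-1}}
  = H(id)^{x_d}
  = P(id, R_d),
\qquad c = x_d \, x_i^{-1} \bmod p
```

Knowing $c$ reveals only the ratio of the two keys, not $x_i$ or $x_d$ themselves.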
Pseudonymization with public key crypto (III)
• Advantages:
  - Matching is possible
  - Disallowed matching among malicious Researchers is prevented
  - The TTP is not a bottleneck (it only delivers crypto keys off-line)
• Drawbacks:
  - Suppliers must collaborate in every pnymization
  - Protocols are interactive (on-line communication)
Properties
• Suppliers and Accumulators are assumed Honest-But-Curious
• Researchers are assumed Malicious
• Accumulators' intersection and union operations are non-interactive
• Two levels of pseudonymization, corresponding to the different levels of trust
• It uses 'composite bilinear groups'
Governance
• From a functional perspective, permission to run these protocols is governed by a Regulatory Privacy Body (RPB). The RPB will enforce a strict licensing infrastructure describing:
  - Which parties are allowed to perform which protocols with each other
  - What kind of data can be exchanged
  - Which subsets of identities or pnyms are allowed as input to the protocols