Revealing Information while Preserving Privacy
Kobbi Nissim, NEC Labs, DIMACS
Based on work with: Irit Dinur, Cynthia Dwork, and Joe Kilian
The Hospital Story
[Figure: users pose queries q about the patient data in a medical DB and receive answers a]
A Bad Solution: the Easy, Tempting Approach
Idea:
a. Remove identifying information (name, SSN, …)
b. Publish the data
• Observation: 'harmless' attributes uniquely identify many patients (gender, approximate age, approximate weight, ethnicity, marital status, …)
• Worse: a 'rare' attribute (e.g., CF, frequency ≈ 1/3000) identifies a patient on its own
Our Model: Statistical Database (SDB)
• Database: d ∈ {0,1}^n, one bit per patient (Mr. Smith, Ms. John, Mr. Doe, …)
• Query: a subset q ⊆ [n]
• Answer: a_q = Σ_{i∈q} d_i
The Privacy Game: Information–Privacy Tradeoff
• Private functions: want to hide π_i(d_1, …, d_n) = d_i
• Information functions: want to reveal f_q(d_1, …, d_n) = Σ_{i∈q} d_i
• Explicit definition of private functions
Compare crypto (secure function evaluation):
• want to reveal f(d)
• want to hide all functions π(d) not computable from f(d)
• Implicit definition of private functions
Approaches to SDB Privacy [AW 89]
• Query restriction: require queries to obey some structure
• Perturbation: give 'noisy' or 'approximate' answers (this talk)
Perturbation
• Database: d = d_1, …, d_n
• Query: q ⊆ [n]
• Exact answer: a_q = Σ_{i∈q} d_i
• Perturbed answer: â_q
• Perturbation E: for all q, |â_q − a_q| ≤ E
• General perturbation: Pr_q[|â_q − a_q| ≤ E] = 1 − neg(n), e.g. 99% or even 51%
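These definitions can be stated as a minimal Python sketch (the function names are mine, not from the talk; the uniform noise is just one way to realize a perturbation-E answer):

```python
import random

def exact_answer(d, q):
    """Exact subset-sum answer: a_q = sum of d_i over i in q."""
    return sum(d[i] for i in q)

def perturbed_answer(d, q, E):
    """A perturbation-E answer: any value within +-E of a_q qualifies."""
    return exact_answer(d, q) + random.randint(-E, E)

d = [1, 0, 1, 1, 0, 1, 0, 0]
q = [0, 2, 3, 5]
print(exact_answer(d, q))                        # 4
print(abs(perturbed_answer(d, q, 2) - 4) <= 2)   # True
```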
Perturbation Techniques [AW89]
Data perturbation:
• Swapping [Reiss 84] [Liew, Choi, Liew 85]
• Fixed perturbations [Traub, Yemini, Wozniakowski 84] [Agrawal, Srikant 00] [Agrawal, Aggarwal 01]
  – Additive perturbation: d'_i = d_i + E_i
Output perturbation:
• Random sample queries [Denning 80]: sample drawn from the query set
• Varying perturbations [Beck 80]: perturbation variance grows with the number of queries
• Rounding [Achugbue, Chin 79], randomized [Fellegi, Phillips 74], …
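The data-vs-output distinction can be made concrete with a toy sketch (the noise distributions below are placeholders, not those of the cited schemes): data perturbation noises the stored bits once, output perturbation noises every answer.

```python
import random

d = [1, 0, 1, 1, 0, 1]

# Data perturbation: publish d'_i = d_i + E_i once; all queries use d'
d_prime = [d_i + random.choice([-1, 0, 1]) for d_i in d]

def data_perturbed_answer(q):
    return sum(d_prime[i] for i in q)

# Output perturbation: keep d secret; add fresh noise to every answer
def output_perturbed_answer(q, E=1):
    return sum(d[i] for i in q) + random.randint(-E, E)

q = [0, 2, 4]
print(data_perturbed_answer(q), output_perturbed_answer(q))
```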
Main Question: How much perturbation is needed to achieve privacy?
Privacy from √n Perturbation (an example of a useless database)
Database: d ∈_R {0,1}^n
On query q:
1. Let a_q = Σ_{i∈q} d_i
2. If |a_q − |q|/2| > E, return â_q = a_q
3. Otherwise return â_q = |q|/2
• Privacy is preserved: if E ≈ √(n (lg n)^2), whp rule 3 is always used
• But then no information about d is given: no usability!
Can we do better? Smaller E? Usability?
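The response rule above is short enough to sketch directly (the exact constant in E is illustrative):

```python
import math
import random

n = 10_000
E = int(math.sqrt(n * math.log2(n) ** 2))   # E ~ sqrt(n (lg n)^2)
d = [random.randint(0, 1) for _ in range(n)]

def useless_answer(q):
    """Return the true sum only if it is more than E away from |q|/2."""
    a_q = sum(d[i] for i in q)
    if abs(a_q - len(q) / 2) > E:
        return a_q               # rule 2: essentially never fires
    return len(q) / 2            # rule 3: data-independent answer

q = random.sample(range(n), n // 2)
print(useless_answer(q))         # whp |q|/2: reveals nothing about d
```

For a random d, a_q deviates from |q|/2 by only Θ(√n), far below this E, so rule 3 fires on essentially every query.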
Defining Privacy
• Elusive definition
• Application dependent
• Partial vs. exact compromise
• Prior knowledge, and how to model it?
• Other issues …
Instead of defining privacy, we define what is surely non-private:
• Strong breaking of privacy
Strong Breaking of Privacy
Perturbation << √n implies no privacy! (So the useless database achieves the best possible perturbation.)
• Main Theorem: given a DB response algorithm with perturbation E << √n, there is a poly-time reconstruction algorithm that outputs a database d', s.t. dist(d, d') < o(n).
The Adversary as a Decoding Algorithm
• The 2^n subsets of [n] encode the n bits of d (recall â_q = Σ_{i∈q} d_i + pert_q)
• Decoding problem: given access to â_{q_1}, …, â_{q_{2^n}}, reconstruct d in time poly(n)
Side remark: the Goldreich–Levin Hardcore Bit
• The 2^n subsets of [n] encode n bits, where â_q = Σ_{i∈q} d_i mod 2 on 51% of the subsets
• The GL algorithm finds, in time poly(n), a small list of candidates containing d
Side remark: Comparing the Tasks
            GL setting                       Our setting
Encoding:   a_q = Σ_{i∈q} d_i (mod 2)        a_q = Σ_{i∈q} d_i
Noise:      ½ − ε of the queries corrupted   additive perturbation; an ε fraction of the queries deviate from it
Queries:    dependent                        random
Decoding:   list decoding                    one d' s.t. dist(d, d') < εn (list decoding impossible)
Recall Our Goal: Perturbation << √n Implies no Privacy!
• Main Theorem: given a DB response algorithm with perturbation E << √n, there is a poly-time reconstruction algorithm that outputs a database d', s.t. dist(d, d') < o(n).
Proof of Main Theorem: The Adversary's Reconstruction Algorithm
• Query phase: get â_{q_j} for t random subsets q_1, …, q_t of [n]
• Weeding phase: solve the linear program
    0 ≤ x_i ≤ 1
    |Σ_{i∈q_j} x_i − â_{q_j}| ≤ E
• Rounding phase: let c_i = round(x_i); output c
Observation: an LP solution always exists, e.g. x = d.
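The three phases can be sketched with scipy (parameters n, t, E are illustrative; each absolute-value constraint splits into two linear ones, and since any feasible point suffices the LP objective is zero):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, t, E = 60, 600, 2                    # E well below sqrt(n)
d = rng.integers(0, 2, n)

# Query phase: t random subsets of [n], each answered within perturbation E
Q = rng.integers(0, 2, (t, n))          # row j = indicator vector of q_j
a_hat = Q @ d + rng.integers(-E, E + 1, t)

# Weeding phase: 0 <= x_i <= 1 and |sum_{i in q_j} x_i - a_hat_j| <= E,
# i.e.  Q x <= a_hat + E  and  -Q x <= E - a_hat
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([a_hat + E, E - a_hat])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)

# Rounding phase
c = np.round(res.x).astype(int)
print("dist(d, d'):", int(np.sum(c != d)))
```

The LP is always feasible (x = d satisfies every constraint), and with t random queries the feasible region concentrates around d, so the rounded vector disagrees with d on few coordinates.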
Proof of Main Theorem: Correctness of the Algorithm
• Consider x = (0.5, …, 0.5) as a candidate LP solution
• Observation: a random q often shows a √n advantage either to the 0's or to the 1's; such a q disqualifies x as an LP solution
• We prove: if dist(x, d) > εn, then whp some q among q_1, …, q_t disqualifies x
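The √n advantage of a random query is easy to see empirically (sizes here are illustrative): for x = (0.5, …, 0.5) the LP predicts |q|/2 on every query, while the true answer of a random q typically misses that by Θ(√n), so any perturbation E << √n leaves such x infeasible.

```python
import math
import random
import statistics

n = 10_000
d = [random.randint(0, 1) for _ in range(n)]

# Deviation of the true answer a_q from the |q|/2 predicted by the
# all-halves vector, over many random queries q.
devs = []
for _ in range(200):
    q = [i for i in range(n) if random.random() < 0.5]
    a_q = sum(d[i] for i in q)
    devs.append(abs(a_q - len(q) / 2))

print(statistics.mean(devs), math.sqrt(n))   # typical deviation vs sqrt(n)
```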
Extensions of the Main Theorem
• 'Imperfect' perturbation: we can approximate the original bit string even if the database answer is within the perturbation bound for only 99% of the queries
• Other information functions: given access to a 'noisy majority' of subset sums, we can approximate the original bit string
Notes on Impossibility Results
• Exponential adversary: strong breaking of privacy if E << n
• Polynomial adversary:
  – non-adaptive queries
  – oblivious of the perturbation method and the database distribution
  – tight threshold E ≈ √n
• What if the adversary is more restricted?
Bounded Adversary Model
• Database: d ∈_R {0,1}^n
• Theorem: if the number of queries is bounded by T, then there is a DB response algorithm with perturbation ~√T that maintains privacy (under a reasonable definition of privacy)
Summary and Open Questions
• Very high perturbation is needed for privacy
  – Threshold phenomenon: above √n, total privacy; below √n, none (poly-time adversary)
  – Rules out many currently proposed solutions for SDB privacy
  – Q: what lies at the threshold? Usability?
• Main tool: a reconstruction algorithm
  – reconstructs an n-bit string from perturbed partial sums/thresholds
• Privacy for a T-bounded adversary with a random database
  – ~√T perturbation
  – Q: other database distributions?
• Q: crypto and SDB privacy?
Our Privacy Definition (bounded adversary model)
• Database: d ∈_R {0,1}^n
• The adversary interacts with the DB, is then given an index i (and the remaining bits d_{−i}), and tries to guess d_i from (transcript, i)
• Privacy: the guess fails w.p. > ½ − ε