Foundations of Privacy Lecture 5. Lecturer: Moni Naor. Recap of last week’s lecture. The Exponential Mechanism Differential privacy May yield utility/approximation Is defined and evaluated by considering all possible answers Counting Queries The BLR Algorithm Efficient Algorithm.
Foundations of Privacy, Lecture 5
Lecturer: Moni Naor
Recap of last week's lecture
• The Exponential Mechanism
• Differential privacy
• May yield utility/approximation
• Is defined and evaluated by considering all possible answers
• Counting Queries
• The BLR Algorithm
• Efficient Algorithm
Synthetic DB: Output is a DB?
[Figure: Database feeds a Sanitizer; queries (query 1, query 2, ...) go in and answers (answer 1, answer 2, answer 3) come out]
• Synthetic DB: the output is also a DB (of entries from the same universe X); the user reconstructs answers by evaluating the query on the output DB
• Software and people compatible
• Consistent answers
Counting Queries
• Database D of size n
• Queries with low sensitivity
• Counting queries: C is a set of predicates c: U → {0,1}
• Query: how many participants in D satisfy c?
• Relaxed accuracy: answer each query within α additive error w.h.p.
• Not so bad: error is inherent in statistical analysis anyway
• Assume all queries are given in advance (non-interactive)
[Figure: query c drawn as a subset of the universe U]
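A minimal Python sketch of a counting query and of what "low sensitivity" means. The row layout (age, sex) and the names `counting_query`, `db`, `neighbour`, and `over_30` are illustrative assumptions, not part of the lecture.

```python
def counting_query(db, predicate):
    """Number of rows in db that satisfy the predicate c: U -> {0, 1}."""
    return sum(1 for row in db if predicate(row))


# Hypothetical toy database: each row is (age, sex).
db = [(25, "F"), (40, "M"), (31, "F"), (58, "M")]
over_30 = lambda row: row[0] > 30

true_count = counting_query(db, over_30)

# Neighbouring database: one participant's row is replaced.
neighbour = [(25, "F"), (40, "M"), (31, "F"), (19, "M")]

# Replacing one row moves the count by at most 1 (low sensitivity).
sensitivity_gap = abs(true_count - counting_query(neighbour, over_30))
```

An answer is α-accurate in the slide's sense if it differs from `true_count` by at most α.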
The BLR Algorithm [Blum Ligett Roth 08]
• For DBs F and D: dist(F,D) = max_{q∈C} |q(F) - q(D)|
• Intuition: far-away DBs get smaller probability
• Algorithm on input DB D: sample from a distribution on DBs of size m (m < n); DB F gets picked w.p. ∝ e^{-ε·dist(F,D)}
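The sampling step can be sketched in Python by brute-force enumeration over a tiny universe. This is only an illustration under stated assumptions: candidate DBs are enumerated as multisets, counts on F are rescaled by n/m so that they are comparable with counts on D, and the names `scaled_count`, `dist`, and `blr_sample` are hypothetical.

```python
import itertools
import math
import random


def scaled_count(db, predicate, n):
    """Count of rows satisfying predicate, rescaled to a database of size n."""
    return n * sum(1 for row in db if predicate(row)) / len(db)


def dist(f, d, queries):
    """max over q in C of |q(F) - q(D)|, with F's counts rescaled to |D|."""
    n = len(d)
    return max(abs(scaled_count(f, q, n) - scaled_count(d, q, n)) for q in queries)


def blr_sample(d, universe, queries, m, eps, rng):
    """Pick a size-m DB F with probability proportional to exp(-eps * dist(F, D))."""
    # Enumerate every size-m multiset over the universe (exponential in m).
    candidates = list(itertools.combinations_with_replacement(universe, m))
    weights = [math.exp(-eps * dist(f, d, queries)) for f in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


universe = range(4)
d = [0, 0, 1, 3, 3, 3]
queries = [lambda x: x >= 2, lambda x: x % 2 == 1]
f = blr_sample(d, universe, queries, m=3, eps=1.0, rng=random.Random(0))
```

The explicit enumeration in `candidates` is exactly what makes this approach slow for realistic universe sizes.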
Counting Queries
• Database D of size n
• Queries with low sensitivity
• Counting queries: C is a set of predicates c: U → {0,1}
• Query: how many participants in D satisfy c?
• Relaxed accuracy: answer each query within α additive error w.h.p.
• Not so bad: error is inherent in statistical analysis anyway
• Sample F of size m that approximates D on all given predicates c
[Figure: query c drawn as a subset of the universe U]
The BLR Algorithm: Error Õ(n^{2/3} log|C|)
• There exists F_good of size m = Õ((n/α)^2·log|C|) s.t. dist(F_good,D) ≤ α
• Pr[F_good] ∝ e^{-εα}
• For any F_bad with dist(F_bad,D) ≥ 2α: Pr[F_bad] ∝ e^{-2εα}
• Union bound: ∑_{bad DBs F_bad} Pr[F_bad] ∝ |U|^m e^{-2εα}
• For α = Õ(n^{2/3} log|C|): Pr[F_good] >> ∑ Pr[F_bad]
• Algorithm on input DB D: sample from a distribution on DBs of size m (m < n); DB F gets picked w.p. ∝ e^{-ε·dist(F,D)}
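The comparison between good and bad databases can be written out as a short worked inequality. This is a sketch that assumes a common normalization constant Z for the sampling distribution (a symbol not on the slide) and counts at most |U|^m bad databases.

```latex
\begin{align*}
\Pr[F_{\mathrm{good}}] &\ge \frac{1}{Z}\, e^{-\varepsilon\alpha},
\qquad
\sum_{F_{\mathrm{bad}}} \Pr[F_{\mathrm{bad}}] \le \frac{1}{Z}\, |U|^{m} e^{-2\varepsilon\alpha}, \\
\frac{\sum_{F_{\mathrm{bad}}} \Pr[F_{\mathrm{bad}}]}{\Pr[F_{\mathrm{good}}]}
  &\le |U|^{m} e^{-\varepsilon\alpha}
   = \exp\bigl(m \ln|U| - \varepsilon\alpha\bigr).
\end{align*}
```

So the bad databases become negligible once εα is much larger than m ln|U|. Plugging in m = Õ((n/α)^2·log|C|) asks for α^3 to dominate n^2·log|C|·ln|U|/ε, which is where the n^{2/3} dependence in the stated error comes from.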
The BLR Algorithm: Running Time
• Generating the distribution by enumeration: need to enumerate every size-m database, where m = Õ((n/α)^2·log|C|)
• Running time ≈ |U|^{Õ((n/α)^2·log|C|)}
• Algorithm on input DB D: sample from a distribution on DBs of size m (m < n); DB F gets picked w.p. ∝ e^{-ε·dist(F,D)}
Conclusion
• Offline algorithm, 2ε-differential privacy for any set C of counting queries
• Error α is Õ(n^{2/3} log|C|/ε)
• Super-poly running time: |U|^{Õ((n/α)^2·log|C|)}
Can we Efficiently Sanitize?
• The good news: if the universe is small, can sanitize EFFICIENTLY, in time poly(|C|,|U|)
• The bad news: cannot do much better, namely sanitize in time sub-poly(|C|) AND sub-poly(|U|)
How Efficiently Can We Sanitize?
                 |C| subpoly    |C| poly
|U| subpoly          ?              ?
|U| poly             ?              ?   (Good news!)
The Good News: Can Sanitize When Universe is Small
Efficient sanitizer for query set C:
• DB size n ≥ Õ(|C|^{o(1)} log|U|)
• Error is ~ n^{2/3}
• Runtime poly(|C|,|U|)
• Output is a synthetic database
Compare to [Blum Ligett Roth]: n ≥ Õ(log|C| log|U|), runtime super-poly(|C|,|U|)
Recursive Algorithm
• Start with DB D and large query set C
• Repeatedly choose a random subset C_{i+1} of C_i: shrink the query set by a (small) factor
[Figure: nested query sets C_0 = C, C_1, C_2, ..., C_b]
Recursive Algorithm
• Start with DB D and large query set C
• Repeatedly choose a random subset C_{i+1} of C_i: shrink the query set by a (small) factor
• End recursion: sanitize D w.r.t. the small query set C_b
• Output is good for all queries in the small set C_{i+1}
• Extract utility on almost all queries in the large set C_i
• Fix the remaining "underprivileged" queries in the large set C_i
[Figure: nested query sets C_0 = C, C_1, C_2, ..., C_b]
Recursive Algorithm Overview
• Want to sanitize DB D for query set C
• Say we have a sanitizer A' for smaller subsets C' ⊂ C, and A' outputs a small synthetic database
• Choose a random C' ⊂ C, sanitize D for C' using A'
• "Magic": the sanitization gives accurate answers on all but a small subset B ⊂ C
• Fix the "underprivileged" queries in B "manually"
[Figure: query set C containing C' (sanitized by A') and B (fixed manually); callouts "Where?", "Why?", "How?"]
Sanitize for few queries, get utility for almost all (Occam's Razor)
• Consider an m-bit synthetic DB output y of A' vs. DB D (y: potential m-bit output DB)
• If y is "bad" for a query set B_y of fractional size ≥ m/s: Pr_{C'}[C' ∩ B_y = ∅] ≤ (1 - m/s)^{|C'|} ≈ e^{-m}
• W.h.p., simultaneously for all y's with a large set B_y of bad queries, C' intersects B_y
• y* = A'(D) is good for all of C', hence y* is good for almost all of C
[Figure: query set C containing C' and the bad sets B_y, B_y*]
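A small numeric check of this bound in Python. It assumes |C'| = s and a union bound over all 2^m possible m-bit outputs y; the concrete values of `m` and `s` are arbitrary illustrations.

```python
import math


def miss_probability(m, s, subset_size):
    """Upper bound on Pr[C' misses B_y] when B_y has fractional size m/s."""
    return (1 - m / s) ** subset_size


m, s = 10, 1000
p_miss = miss_probability(m, s, subset_size=s)  # (1 - m/s)^s
bound = math.exp(-m)                            # e^{-m}

# Union bound over all 2^m possible m-bit outputs y.
p_any_miss = (2 ** m) * p_miss
```

Since 1 - x ≤ e^{-x}, `p_miss` never exceeds `bound`, and the union bound is at most (2/e)^m, which shrinks as m grows.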
How to get Synthetic DB? Syntheticizer
• Problem: need a small synthetic DB, but have some other, large output
• Lemma ["Syntheticizer"]: given a sanitizer A with α-accuracy and arbitrary output, produce a sanitizer A' with 2α-accuracy and synthetic DB output of size Õ(log|C|/α^2)
• Runtime is poly(|U|,|C|)
• Transform the output to a synthetic DB using linear programming: a variable per item in U, a constraint per query in C
The Linear Program
• Run the sanitizer A and then use it to get differentially private counts v_c on all the concepts in C
• The database is never used again, which preserves privacy
• Come up with a low-weight fractional database that approximates these counts
• Transform this fractional database into a standard synthetic database by rounding the fractional counts
• For all i ∈ U: a variable x_i
• For all c ∈ C: a constraint v_c - α ≤ ∑_{i s.t. c(i)=1} x_i ≤ v_c + α
The Linear Program
• Why is there a fractional solution? The real database, an integer solution, is one example!
• Rounding:
• Scale the fractional database so that its total weight is 1
• Round down each fractional point to the closest multiple of α/|U|
• Treat the rounded fractional database as an integer synthetic database of size at most |U|/α
• If too large, sample
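The rounding step can be sketched in Python as follows. The sketch assumes the rounding granularity is α/|U| and uses exact fractions; the function name `round_fractional_db` and the dictionary representation of the fractional database are illustrative choices.

```python
from fractions import Fraction
from math import floor


def round_fractional_db(weights, alpha):
    """Turn a fractional DB (item -> nonnegative weight) into integer row counts.

    Scales the total weight to 1, then rounds each weight down to a
    multiple of alpha/|U|; the count for an item is that multiple.
    """
    total = sum(Fraction(w) for w in weights.values())
    step = Fraction(alpha) / len(weights)
    return {item: floor(Fraction(w) / total / step) for item, w in weights.items()}


fractional = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
counts = round_fractional_db(fractional, alpha=Fraction(1, 4))
synthetic_db = [item for item, k in counts.items() for _ in range(k)]
```

Each item loses less than α/|U| of weight, so the total loss is below α, and the resulting database has at most |U|/α rows.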
How Do We Use Synthetic DB? Why Synthetic DB?
• Easy to "shrink" DBs by sub-sampling Õ(log|C|/α^2) DB items
• Gives counts for every query: the output is well-defined even for queries that were not around when sanitizing
Utility for all queries: First Attempt
• Sanitizing a small C' is easy ("brute force"), and can "shrink" using the syntheticizer
• Sub-sample a small C', work for all but a few queries
• Repeat many times, take majority
• Doesn't work: underprivileged queries
[Figure: query set C containing sub-samples C', C'' and the bad set B]
Utility for all queries: fix "underprivileged"
Lemma: given a query set C and a differentially private sanitizer A that
• works for every C' ⊂ C with |C'| = s
• outputs a synthetic DB of size ≤ m
we get a sanitizer for C with utility on all queries. Need DB size n ≥ Õ(|C|m/s).
Proof Outline
• Subsample a small C', get a synthetic DB that works for all but a few (~|C|m/s) "underprivileged" queries
• Now "manually" correct those few by "brute force": release noisy counts v_c (noise ~|C|m/s)
• Also need to say which queries are underprivileged, and that depends on DB D. What about privacy?
• Key point: regardless of D, almost all queries are strongly privileged. Release a noisy indicator vector.
• For the privacy analysis, need only consider the ~|C|m/s potentially underprivileged queries
Recursive Algorithm: Recap
• Start with DB D and large query set C
• Repeatedly choose a random subset C_{i+1} of C_i: shrink by a factor f
[Figure: nested query sets C_0 = C, C_1, C_2, ..., C_b]
Recursive Algorithm: Recap
• Start with DB D and large query set C
• Repeatedly choose a random subset C_{i+1} of C_i: shrink by a factor f
• Sanitize D w.r.t. the small C_b (use the "brute force" sanitizer)
• Syntheticizer transforms the output to a small synthetic DB
• Fix "underprivileged" queries (need n ≥ Õ(f))
• Lose a 2^b factor in accuracy; "brute force" needs n ≥ 2^b·|C_b|
• n ≥ |C|^{o(1)} by trading off b, f
[Figure: nested query sets C_0 = C, C_1, C_2, ..., C_b]
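The trade-off between b and f can be explored numerically. The Python sketch below makes simplifying assumptions that are not on the slide: polylog factors hidden by Õ are ignored, and f is set to |C|^{1/b} so that |C_b| = |C|/f^b = 1. The function names are hypothetical.

```python
def log2_n_required(log2_c, b):
    """log2 of max(f, 2^b * |C_b|) when f = |C|^(1/b), so that |C_b| = 1."""
    log2_f = log2_c / b
    return max(log2_f, b)


def best_depth(log2_c):
    """Recursion depth b that minimises the requirement, by direct search."""
    return min(range(1, log2_c + 1), key=lambda b: log2_n_required(log2_c, b))


b_star = best_depth(100)             # |C| = 2^100
bits = log2_n_required(100, b_star)  # log2 of the required DB size
```

Under these assumptions the best depth is about sqrt(log|C|), giving n around 2^{sqrt(log|C|)}, which is |C|^{o(1)}.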
And Now… Bad News
Runtime cannot be subpoly in |C| or |U|
• Output is a synthetic DB (as in the positive result)
• General output
Exponential Mechanism cannot be implemented
Want hardness… Got Crypto?
The Bad News
For large C and U, can't get efficient sanitizers!
• Output is a synthetic DB (as in the positive result)
• General output
Exponential Mechanism cannot be implemented
Want hardness… Got Crypto?
Digital Signatures
• Key pair (sk, vk)
• Can build from a one-way function [NaYu, Ro]
• Hard to forge a new signature
[Figure: messages m1, m2, ..., mn with sig(m1), sig(m2), ..., sig(mn), all valid signatures under vk; a new pair m', sig(m')]
Signatures ⇒ No Synthetic DB
• Universe: (m,s), a message/signature pair
• Queries: c_vk(m,s) outputs 1 iff s is a valid signature of m under vk
• Input: (m1, sig(m1)), (m2, sig(m2)), ..., (mn, sig(mn)), valid signatures under the same vk
• Sanitizer output: (m'1, s1), ..., (m'k, sk), most of which are valid signatures under vk
• So inputs appear in the output: no privacy!
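The query c_vk can be illustrated with a toy Python sketch. Python's standard library has no public-key signature scheme, so this uses an HMAC tag as a stand-in: that is a symmetric-key MAC, not a signature, and the key, messages, and function names are all hypothetical. It only shows why an accurate synthetic row must carry a valid pair.

```python
import hashlib
import hmac


def sign(key, message):
    """Stand-in for sig(m): an HMAC tag, NOT a real signature scheme."""
    return hmac.new(key, message, hashlib.sha256).digest()


def c_vk(key, row):
    """Query c_vk(m, s): 1 iff s is a valid tag for m."""
    message, tag = row
    return int(hmac.compare_digest(sign(key, message), tag))


key = b"hypothetical-key"
db = [(m, sign(key, m)) for m in (b"m1", b"m2", b"m3")]

# Every input row satisfies the query, so an accurate synthetic DB
# must consist mostly of rows that satisfy it too.
true_count = sum(c_vk(key, row) for row in db)

# A row made up without the key fails the query (except with negligible probability).
guessed_row = (b"m4", bytes(32))
# Copying an input row passes the query, but publishes that row verbatim.
copied_row = db[0]
```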
Can We Output Synthetic DB Efficiently?
                 |C| subpoly    |C| poly
|U| subpoly          ?              ?
|U| poly             ?
Where is Hardness Coming From?
• Signature example: hard to satisfy a given query; easy to maintain utility for all queries but one
• More natural: easy to satisfy each individual query; hard to maintain utility for most queries
Hardness on Average
• Universe: (vk,m,s), a key/message/signature triple
• Queries: c_i(vk,m,s) = i-th bit of ECC(vk); c_v(vk,m,s) = 1 iff valid signature under vk
• Input: (vk, m1, sig(m1)), (vk, m2, sig(m2)), ..., (vk, mn, sig(mn)), valid signatures under vk
• Sanitizer output: (vk'1, m'1, s1), ..., (vk'k, m'k, sk)
• Are these keys related to vk? Yes! At least one is vk!
Hardness on Average
• Samples: (vk,m,s), a key/message/signature triple
• Queries: c_i(vk,m,s) = i-th bit of ECC(vk); c_v(vk,m,s) = 1 iff valid signature under vk
• Sanitizer output: (vk'1, m'1, s1), ..., (vk'k, m'k, sk). Are these keys related to vk? Yes! At least one is vk!
• ∀i: 3/4 of the vk'_j agree with ECC(vk)[i]
• ⇒ ∃ vk'_j s.t. ECC(vk'_j) and ECC(vk) are 3/4-close
• ⇒ vk'_j = vk (error-correcting code)
• ⇒ m'_j appears in the input. No privacy!
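The step from "every bit position agrees for 3/4 of the keys" to "some key agrees on 3/4 of the positions" is an averaging argument. A sketch in LaTeX, where the indicator a_{j,i} and the code length ℓ are notation introduced here for illustration:

```latex
% a_{j,i} = 1 iff ECC(vk'_j)[i] = ECC(vk)[i], for j = 1..k and i = 1..\ell
\begin{align*}
\frac{1}{k}\sum_{j=1}^{k}\Bigl(\frac{1}{\ell}\sum_{i=1}^{\ell} a_{j,i}\Bigr)
 = \frac{1}{\ell}\sum_{i=1}^{\ell}\Bigl(\frac{1}{k}\sum_{j=1}^{k} a_{j,i}\Bigr)
 \ge \frac{3}{4}
 \quad\Longrightarrow\quad
 \exists\, j:\ \frac{1}{\ell}\sum_{i=1}^{\ell} a_{j,i} \ge \frac{3}{4}.
\end{align*}
```

If the code has relative minimum distance above 1/4 (an assumption about the ECC), two distinct codewords cannot agree on 3/4 of the positions, so this vk'_j must equal vk.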
Can We Output Synthetic DB Efficiently?
                 |C| subpoly    |C| poly
|U| subpoly          ?              ?
|U| poly             ?
Labels on the slide: Signatures; Hard on Avg.; Using PRFs
General output sanitizers
• Theorem: traitor tracing schemes exist if and only if sanitizing is hard
• Tight connection between the |U|, |C| that are hard to sanitize and the key, ciphertext sizes in traitor tracing
• Separation between efficient/non-efficient sanitizers uses the [BoSaWa] scheme
Traitor Tracing: The Problem
• Center transmits a message to a large group
• Some users leak their keys to pirates
• Pirates construct a clone: an unauthorized decryption device
• Given a Pirate Box, want to find who leaked the keys
• The traitors' "privacy" is violated!
[Figure: traitor keys K1, K3, K8 go into the Pirate Box, which turns E(Content) into Content]
Equivalence of TT and Hardness of Sanitizing
Traitor Tracing            ↔  Sanitizing hard for distribution of DBs
(collection of) Key        ↔  Database entry
(collection of) Ciphertext ↔  Query
TT Pirate                  ↔  Sanitizer
Traitor Tracing ⇒ Hard Sanitizing
Theorem: if there exists a TT scheme with cipher length c(n) and key length k(n), can construct:
• Query set C of size ≈ 2^{c(n)}
• Data universe U of size ≈ 2^{k(n)}
• Distribution D on n-user databases with entries from U
D is "hard to sanitize": there exists a tracer that can extract an entry in D from any sanitizer's output, violating its privacy!
Separation between efficient/non-efficient sanitizers uses the [BoSaWa06] scheme