Private Inference Control: Protecting Privacy in Data Access

Private Inference Control • David Woodruff, MIT (dpwood@mit.edu) • Joint work with Jessica Staddon (PARC)

Contents • Background • Access Control and Inference Control • Our contribution: Private Inference Control (PIC) • Related Work • PIC model & definitions • Our Results • Conclusions

Access Control • User queries a database of n records held by a server. Some info in DB is sensitive. • Example: user asks “What’s Bob’s salary?”; the answer is sensitive, so access is denied. • Access control prevents user from learning individual sensitive relations/attributes. • Does access control prevent user from learning sensitive info?

Inference Control • Combining non-sensitive info may yield something sensitive • Inference Channel: {(name, job), (job, salary)} • Inference Control : block all inference channels

Inference Control • Inference engine: generates collection C of subsets of [m] denoting all the inference channels • We assume we have an engine [QSKLG93] (exhaustive search) • Database x ∈ ({0,1}^m)^n • DB of n records, m attributes 1, …, m per record • n tending to infinity, m = O(1) • F ∈ C means for all i, user shouldn’t learn x_{i,j} for all j ∈ F • Assume C is monotone. • Assume C is input to both user and server • User learns C anyway when his queries are blocked • C is data-independent, reveals info only about attributes

Our contribution: Private Inference Control • Existing inference control schemes require server to learn user queries to check if they form an inference • This talk: arbitrary malicious users U*, semi-honest S • Our goal: User Privacy + Inference Control = PIC • Privacy: polytime S learns nothing about honest user’s queries except # made so far • # queries made so far enables S to do inference control • Private and symmetrically-private information retrieval • Not sufficient since they are stateless • User’s permissions change over time • Generic secure function evaluation • Not efficient – our communication is exponentially smaller

Application • Government analysts inspect repositories for terrorist patterns • Inference Control: prevent analysts from learning sensitive info about non-terrorists. • User Privacy: prevent server from learning what analysts are tracking – if discovered this info could go to terrorists!

Related Work • Data perturbation [AS00, B80, TYW84] • So much noise is required that the data is not as useful [DN03] • Adaptive Oblivious Transfer [NP99] • One record can be queried adaptively at most k times • Priced Oblivious Transfer [AIR01] • One record, supports more inference channels than threshold version considered in [NP99] • We generalize [NP99] and [AIR01] • Arbitrary inference channels and multiple records • More efficient/private than parallelizing [NP99] and [AIR01] on each record

The Model • Offline Stage: S given x, C, 1^k, and can preprocess x • Online Stage: at time t, honest U generates query (i_t, j_t) • (i_t, j_t) can depend on all prior info/transactions with S • Let T denote all queries U makes, (i_1, j_1), …, (i_|T|, j_|T|) • T is a r.v. - depends on U’s code, x, and randomness • T is permissible if there is no i s.t. (i, j) ∈ T for all j ∈ F for some F ∈ C. We require honest U to generate permissible T. • U and S interact in a multiround protocol, then U outputs out_t • View_U consists of C, n, m, 1^k, all messages from S, randomness • View_S consists of C, n, m, 1^k, x, all messages from U, randomness
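The permissibility condition on T can be sketched as a small check. This is only an illustration of the definition, not part of the protocol: the function name `is_permissible` and the representation of C as a list of frozensets are assumptions made here.

```python
def is_permissible(queries, channels):
    """Return True if no record i has had every attribute of some channel F queried.

    queries:  iterable of (i, j) pairs, the query sequence T
    channels: iterable of frozensets of attribute indices, the collection C
    (both representations are assumptions made for this sketch)
    """
    seen = {}  # record index i -> set of attributes j queried so far
    for i, j in queries:
        seen.setdefault(i, set()).add(j)
    # T is NOT permissible if some record's queried attributes cover a channel F.
    return not any(F <= attrs for attrs in seen.values() for F in channels)
```

With a single channel F = {1, 2}, querying attribute 1 of one record and attribute 2 of another is permissible, while querying both attributes of the same record is not.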

Security Definitions • Correctness: For all x, C, for all honest users U, for all t ∈ [|T(U, x)|], out_t = x_{i_t, j_t} • User Privacy: For all x, C, for all honest U, for any two sequences T_1, T_2 with |T_1| = |T_2|, for all semi-honest servers S* and random coin tosses of S* • (View_S* | T(U, x) = T_1) ≈ (View_S* | T(U, x) = T_2) • Inference Control: Comparison with ideal model – for every U*, every x, any random coins of U*, for every C there exists a simulator U’ interacting with trusted party Ch for which View_U* ≈ View_<U’, Ch>, where U’ just asks Ch for tuples (i_t, j_t) that are permissible
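The ideal model in the Inference Control definition can be pictured as a stateful trusted party that answers a query only if the query sequence stays permissible. The sketch below is a hedged illustration: the class name `TrustedParty`, the `None` refusal value, and the 0-indexed attributes are assumptions made here, not details from the talk.

```python
class TrustedParty:
    """Hypothetical ideal-model party Ch: holds x and C, tracks queries per record."""

    def __init__(self, database, channels):
        self.database = database                      # x: list of n records, each a list of m bits
        self.channels = [frozenset(F) for F in channels]  # C: the inference channels
        self.seen = {}                                # record i -> attributes already released

    def query(self, i, j):
        attrs = self.seen.get(i, set()) | {j}
        # Refuse if releasing x[i][j] would complete some channel F on record i.
        if any(F <= attrs for F in self.channels):
            return None
        self.seen[i] = attrs
        return self.database[i][j]
```

The per-record state in `seen` is what a stateless PIR or SPIR protocol lacks, which is why those primitives alone do not give inference control.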

Efficiency • Efficiency measures are per query • Minimize communication & round complexity • Ideally O(polylog(n)) bits and 1 round • Minimize server’s time-complexity • Ideally O(n) without preprocessing • W/preprocessing, potentially better, but O(n) optimal w.r.t. known single-server PIR schemes

Our Results • For any PIR scheme, let C(n), W(n) denote communication and server work for DB size n • PIC scheme #1 • Communication: O(k log n C(n^2)), 1-round • Work: O(k log n W(n^2)) • PIC scheme #2 • Communication: O(k(n + C(n))), O(1)-round • Work: O(k(n + W(n))) • Plugging in best PIR parameters, • Scheme #1: comm. O(polylog(n)), work O(n^2) • Scheme #2: comm. & work: O(n polylog(n))

A Generic Reduction • A protocol is a threshold PIC (TPIC) if it satisfies the definitions of a PIC scheme assuming C = {[m]}. • Theorem (roughly speaking): If there exists a TPIC with communication C(n), work W(n), and round complexity R(n), then there exists a PIC with communication O(C(n)), work O(W(n)), and round complexity O(R(n)).

PIC ideas • User/server do SPIR on table of encryptions • Idea: Encryptions of both data and keys that will help user decrypt encryptions on future queries • User can only decrypt if he has the appropriate keys – only possible if not in danger of making an inference

Stateless PIC • Minimizing communication is a data structures problem • What type of keys require least communication for user to: • Update as user makes new queries? • Prove user is not in danger of making an inference on current/future queries? • Keys must prevent replay attacks: can’t use “old” keys to pretend to have made fewer queries to records than were actually made

PIC Scheme #1 – Stage 1 • Let E be a homomorphic semantically secure encryption scheme (e.g., Paillier) • Suppose we allow accessing each record at most once • User holds PK, SK; server holds PK • On query (i_3, j_3), user sends E(i_3), E(j_3), ZKPOK • Server maps E(i_1) -> E(r_1(i_1 – i_3)) and E(i_2) -> E(r_2(i_2 – i_3)) • User recovers r_1, r_2 iff he hasn’t previously accessed i_3 • From r_1 and r_2 user can reconstruct a secret S_3
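The blinding step E(i_1) -> E(r_1(i_1 – i_3)) relies only on the additive homomorphism of E. Below is a minimal sketch using a textbook Paillier instance with toy parameters; the function names, the choice of primes, and the convention that the user divides out (i_1 – i_3) after decrypting are assumptions made for illustration, and the ZKPOK is omitted entirely.

```python
import math
import secrets

# Toy Paillier key (two Mersenne primes); far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), part of SK
MU = pow(LAM, -1, N)                               # valid because g = N + 1

def encrypt(m):
    """E(m) = (1 + m*N) * r^N mod N^2, with fresh randomness r."""
    r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def decrypt(c):
    """Standard Paillier decryption: L(c^lambda mod N^2) * mu mod N."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def blind_difference(enc_prev, enc_cur, r):
    """Server side: from E(i_prev), E(i_cur) compute E(r * (i_prev - i_cur))."""
    enc_diff = enc_prev * pow(enc_cur, -1, N2) % N2  # E(i_prev - i_cur)
    return pow(enc_diff, r, N2)                      # E(r * (i_prev - i_cur))

def recover_share(cipher, i_prev, i_cur):
    """User side: learn r only if the two record indices differ."""
    d = decrypt(cipher)
    if d == 0:  # i_prev == i_cur: the plaintext is 0 and r is hidden
        return None
    return d * pow((i_prev - i_cur) % N, -1, N) % N
```

If the current index equals an earlier one, the plaintext collapses to 0 regardless of r, which is how a repeated access to the same record yields nothing useful in this sketch.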

PIC Scheme #1 – Stage 2 • Same setup as Stage 1: user holds PK, SK and query (i_3, j_3); server holds PK and receives E(i_3), E(j_3), ZKPOK • User recovers S_3 • User does “SPIR on records” on table of encryptions

PIC Scheme #1 - Wrapup • To extend to querying a record < m times, on t-th query, let r_1, …, r_{t-1} be a (t-m+1)-out-of-(t-1) secret sharing of S_t • This scheme can be proven to be a TPIC – use generic reduction to get a PIC • User Privacy: semantic security of E, ZK of proof, privacy of SPIR • Inference Control: user can recover at most t-m of the r_i if he already queried the record m-1 times – can build a simulator using SPIR w/knowledge extractor [NP99]
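The (t-m+1)-out-of-(t-1) sharing of S_t can be illustrated with Shamir secret sharing. The talk does not say which sharing scheme is used, so Shamir's scheme, the field modulus, and the function names below are assumptions chosen for this sketch.

```python
import secrets

PRIME = 2**61 - 1  # field modulus (a Mersenne prime), chosen for illustration

def share(secret, threshold, num_shares):
    """Split secret into num_shares points of a random degree-(threshold-1) polynomial."""
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, num_shares + 1)]

def reconstruct(shares):
    """Lagrange interpolation at 0; correct when given at least `threshold` shares."""
    secret = 0
    for xj, yj in shares:
        num, den = 1, 1
        for xm, _ in shares:
            if xm != xj:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, with t = 6 and m = 3 there are t-1 = 5 shares and the threshold is t-m+1 = 4. A user who already queried the record m-1 = 2 times loses two shares and holds at most t-m = 3, one short of the threshold.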

PIC Scheme #2 - Glimpse at polylog(n)-communication PIC • Balanced binary tree B • Leaves are attributes • Parents of leaves are records • An internal node is accessed when record r is queried and the node is on the path from r to the root • Keys encode # times nodes in B have been accessed • [Figure: tree B with node keys K_{u,a}, K_{v,b}, K_{w,c}, K_{x,d}, K_{y,e}, K_{z,f}; leaves 1, 2, 3, 4; a + b = t]
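The access counts that the keys encode can be sketched with a plain counter tree. This is only a bookkeeping illustration under simplifying assumptions: records sit at the leaves (the attribute level is dropped), the tree is stored heap-style in a list, and no keys or encryption are modeled. The names `make_counts` and `record_query` are hypothetical.

```python
def make_counts(num_records):
    """Heap-style balanced binary tree: root at index 1, records at the leaves."""
    size = 1
    while size < num_records:
        size *= 2
    return [0] * (2 * size)  # index 0 is unused

def record_query(counts, record):
    """Increment the access count of every node on the path from the record to the root."""
    node = len(counts) // 2 + record
    while node >= 1:
        counts[node] += 1
        node //= 2
```

After t queries the root count is t, and the counts of any node's two children sum to the node's own count, which appears to be what the figure's a + b = t annotation expresses.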

Conclusions • Extensions not in this talk • Multiple users (pseudonyms) • Collusion resistance: c-resistance => m-channel becomes collection of (m-1)/c channels. • Summary • New Primitive – PIC • (Almost) Communication-optimal implementations
