The Role of History and Prediction in Data Privacy. Kristen LeFevre University of Michigan May 13, 2009. Employment history. Healthcare, insurance information. E-mail. Supermarket transaction data. RFID, GPS Data. Data Privacy. Personal information collected every day.
The Role of History and Prediction in Data Privacy Kristen LeFevre University of Michigan May 13, 2009
Data Privacy • Personal information collected every day • Employment history • Healthcare, insurance information • E-mail • Supermarket transaction data • RFID, GPS data • Web search / clickstream
Data Privacy • Legal, ethical, technical issues surrounding • Data ownership • Data collection • Data dissemination and use • Considerable recent interest from technical community • High-profile mishaps and lawsuits • Compliance with data-sharing mandates
Privacy Protection Technologies for Public Datasets • Goal: Protect sensitive personal information while preserving data utility • Privacy Policies and Mechanisms • Example Policies: • Protect individual identities • Protect the values of sensitive attributes • Differential privacy [Dwork 06] • Example Mechanisms: • Generalize (“coarsen”) the data • Aggregate the data • Add random noise to the data • Add random noise to query results
Observations • Much work has focused on static data • One-time snapshot publishing • Disclosure by composing multiple different snapshots of a static database [Xiao 07, Ganta 08] • Auditing queries on a static database [Chin 81, Kenthapadi 06, …] • What are the unique challenges when the data evolves over time?
Outline • Sample Problem: Continuously publishing privacy-sensitive GPS traces • Motivation & problem setup • Framework for reasoning about privacy • Algorithms for continuous publishing • Experimental results • Speculation: applications to other dynamic data
GPS Traces (ongoing work w/ Wen Jin, Jignesh Patel) • GPS devices attached to phones, cars • Interest in collecting and distributing location traces in real time • Real-time traffic reporting • Adaptive pricing / placement of outdoor ads • Simultaneous concern for personal privacy • Challenge: Can we continuously collect and publish location traces without compromising individual privacy?
Problem Setting • [Figure: GPS users (7 AM, 7:05 AM) report to a central trace repository, which applies a privacy policy and sends a “sanitized” location snapshot to the data recipient for each epoch]
Problem Setting • Finite population of n users with unique identifiers {u1,…,un} • Assume users’ locations are reported and published in discrete epochs t1, t2, … • Location snapshot D(tj) • Associates each user with a location during epoch tj • Publish sanitized version D*(tj)
Threat Model • Attacker wants to determine the location of a target user ui during epoch tj • Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)
Some Naïve Solutions • Strawman 1: Replace users’ identifiers ({u1,…,un}) with pseudonyms ({p1,…,pn}) • Problem: Once attacker “unmasks” user pi, he can track her location forever • Strawman 2: New pseudonyms ({p1j,…,pnj}) at each epoch tj • Problem: Users can still be tracked using multi-target tracking tools [Gruteser 05, Krumm 07]
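Why fresh pseudonyms alone do not help can be illustrated with a toy linking attack. The sketch below is not one of the multi-target tracking tools cited above; it is a much simpler greedy nearest-neighbor matcher, and the function name, the dictionary layout, and the pseudonym labels are all assumptions made for illustration.

```python
import math

def link_epochs(prev_points, next_points):
    """Greedily link pseudonyms across two consecutive epochs.

    prev_points / next_points: dict mapping a per-epoch pseudonym to an
    (x, y) location. Returns a dict mapping each earlier pseudonym to the
    later pseudonym it most plausibly corresponds to.
    This is a hypothetical stand-in for real multi-target tracking.
    """
    # All candidate pairs, closest first.
    pairs = sorted(
        (math.dist(p_loc, q_loc), p, q)
        for p, p_loc in prev_points.items()
        for q, q_loc in next_points.items()
    )
    links, used = {}, set()
    for _, p, q in pairs:
        # Accept a pair only if neither endpoint is already linked.
        if p not in links and q not in used:
            links[p] = q
            used.add(q)
    return links

# Pseudonyms change between epochs, but users barely move in one epoch.
epoch_0 = {"p1_0": (0.0, 0.0), "p2_0": (100.0, 0.0)}
epoch_1 = {"p1_1": (101.0, 0.0), "p2_1": (1.0, 0.0)}
links = link_epochs(epoch_0, epoch_1)
```

When users are far apart relative to how far they travel in one epoch, the matcher reconnects the traces even though every label changed.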
Key Problem: Motion Prediction • [Figure: numbered locations 1–6 across two snapshots, each labeled with the candidate set {Alice, Bob, Charlie}, with Alice marked in each] • What if the speed limit is 60 mph?
Threat Model • Attacker wants to determine the location of a target user ui during epoch tj • Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages) • Motion prediction: Given one or more locations for ui, attacker can predict (probabilistically) ui’s location during following and preceding epochs
Privacy Principle: Temporal Unlinkability • Consider an attacker who is able to identify (locate) target user uj during m sequential epochs • Under reasonable assumptions, he should not be able to locate uj with high confidence during any other epochs* *Similar in spirit to “mix zones” [Beresford 03], which addressed a related problem in a less-formal way.
Sanitization Mechanism • Needed to select a sanitization mechanism; chose one for maximum flexibility • Assign each user ui a consistent pseudonym pi • Divide users into clusters • Within each cluster, break the association between pseudonym and location • Release candidate for D(tj): D*(tj) = {(C1(tj), L1(tj)),…, (CB(tj), LB(tj))} • ∪i=1..B Ci(tj) = {p1,…,pn} • Ci(tj) ∩ Ch(tj) = ∅ (i ≠ h) • Each Li(tj) contains the locations of users in Ci(tj)
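The cluster-based mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sanitize`, the dictionary-based snapshot, and the use of shuffling to break the pseudonym-location association are all choices made here.

```python
import random

def sanitize(snapshot, pseudonym_of, clusters, rng=None):
    """Build a release candidate D*(tj) from a snapshot D(tj).

    snapshot:     dict user -> location for this epoch
    pseudonym_of: dict user -> consistent pseudonym
    clusters:     list of lists of users; must partition the population
    Returns a list of (pseudonyms, locations) pairs, one per cluster.
    """
    rng = rng or random.Random()
    users = [u for cluster in clusters for u in cluster]
    # Partition check: every user appears in exactly one cluster.
    if sorted(users) != sorted(snapshot):
        raise ValueError("clusters must partition the user population")
    release = []
    for cluster in clusters:
        pseudonyms = sorted(pseudonym_of[u] for u in cluster)
        locations = [snapshot[u] for u in cluster]
        rng.shuffle(locations)  # break the pseudonym-location association
        release.append((pseudonyms, locations))
    return release

snapshot = {"u1": (0, 0), "u2": (1, 1), "u3": (5, 5), "u4": (6, 6)}
pseudonym_of = {"u1": "p1", "u2": "p2", "u3": "p3", "u4": "p4"}
release = sanitize(snapshot, pseudonym_of,
                   [["u1", "u2"], ["u3", "u4"]], random.Random(0))
```

Each pair lists which pseudonyms are in a cluster and which locations they occupy, without saying who is where.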
Sanitization Mechanism: Example • Pseudonyms {p1, p2, p3, p4} • Epochs t0, t1, t2 • [Figure: clusters at t0: {p1,p2}, {p3,p4}; at t1: {p1,p2}, {p3,p4}; at t2: {p1,p3}, {p2,p4}; published locations numbered 1–12]
Reasoning about Privacy • How can we guarantee temporal unlinkability under the threats of auxiliary information and motion prediction? • (Using the cluster-based sanitization mechanism) • Novel framework with two key components • Motion model describes location correlations between epochs • Breach probability function describes an attacker’s ability to compromise temporal unlinkability
Motion Models • Model motion using an h-step Markov chain • Conditional probability for user’s location, given his location during h prior (future) epochs • Same motion model used by attacker and publisher • Forward motion model template • Pr[Loc(P,Tj) = Lj | Loc(P,Tj-1) = Lj-1, …, Loc(P,Tj-h) = Lj-h] • Backward motion model template • Pr[Loc(P,Tj) = Lj | Loc(P,Tj+1) = Lj+1, …, Loc(P,Tj+h) = Lj+h] • Independent and replaceable component • For this work, used 1-step motion model based on velocity distribution (speed and direction)
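A 1-step forward model over a finite set of candidate locations might look like the sketch below. The slides say only that the model is based on a velocity distribution (speed and direction); the Gaussian speed weighting, the decision to ignore direction, and all names and parameters here are assumptions made for illustration.

```python
import math

def transition_probs(prev, candidates, dt, mean_speed, std_speed):
    """Approximate Pr[Loc(P,Tj) = c | Loc(P,Tj-1) = prev] for each candidate c.

    Each candidate is weighted by a Gaussian score of the speed needed to
    reach it within one epoch of length dt, then weights are normalized.
    Direction is ignored in this simplified sketch.
    """
    weights = []
    for cand in candidates:
        speed = math.dist(prev, cand) / dt
        z = (speed - mean_speed) / std_speed
        weights.append(math.exp(-0.5 * z * z))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no candidate is reachable under this model")
    return [w / total for w in weights]

# A user last seen at the origin; one candidate is plausible, one is not.
probs = transition_probs((0.0, 0.0), [(10.0, 0.0), (100.0, 0.0)],
                         dt=1.0, mean_speed=10.0, std_speed=2.0)
```

A backward model has the same shape, conditioning on the following epoch instead of the preceding one.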
Motion Models: Example • Pseudonyms {p1, p2, p3, p4} • Epochs t0, t1, t2 • Clusters {p1,p2}, {p3,p4} • Forward: Pr[Loc(p1,t1) = a | Loc(p1,t0) = x], Pr[Loc(p1,t1) = b | Loc(p1,t0) = x] • Backward: Pr[Loc(p1,t1) = a | Loc(p1,t2) = y] • [Figure: pseudonyms p1–p4 and locations a, b, c, d across epochs t0, t1, t2]
Privacy Breaches • Forward breach probability • Pr[Loc(P,Tj) = Lj | D(Tj-1), …, D(Tj-h), D*(Tj)] • Backward breach probability • Pr[Loc(P,Tj) = Lj | D(Tj+1), …, D(Tj+h), D*(Tj)] • Privacy Breach: Release candidate D*(Tj) causes a breach iff either of the following is true for threshold C • max over P, Lj of Pr[Loc(P,Tj) = Lj | D(Tj-1), …, D(Tj-h), D*(Tj)] > C • max over P, Lj of Pr[Loc(P,Tj) = Lj | D(Tj+1), …, D(Tj+h), D*(Tj)] > C
Privacy Breaches: Example • Epochs t0, t1 • Clusters {p1,p2}, {p3,p4} • e1 = Pr[Loc(p1,t1) = a | Loc(p1,t0) = x] • e2 = Pr[Loc(p1,t1) = b | Loc(p1,t0) = x] • e3 = Pr[Loc(p2,t1) = a | Loc(p2,t0) = y] • e4 = Pr[Loc(p2,t1) = b | Loc(p2,t0) = y] • Pr[Loc(p1,t1) = a | D(T0), D*(T1)] = (e1 * e4) / (e1 * e4 + e2 * e3), … • Goal: Verify that all (forward and backward) breach probabilities < threshold C • [Figure: pseudonyms p1–p4 and locations x, y, a, b, c, d]
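The two-user case can be checked numerically. The sketch below evaluates the forward breach probability for p1 at location a from the four transition probabilities e1..e4; the function name and the sample values are hypothetical.

```python
def two_user_breach_probability(e1, e2, e3, e4):
    """Pr[Loc(p1,t1) = a | D(T0), D*(T1)] for a cluster {p1, p2}.

    The cluster publishes locations {a, b}, so only two assignments are
    possible: (p1 at a, p2 at b) or (p1 at b, p2 at a).
    """
    p1_at_a = e1 * e4  # p1 moved to a and p2 moved to b
    p1_at_b = e2 * e3  # p1 moved to b and p2 moved to a
    return p1_at_a / (p1_at_a + p1_at_b)

# Hypothetical values: each user is much more likely to reach "their" spot.
risky = two_user_breach_probability(e1=0.9, e2=0.1, e3=0.2, e4=0.8)
# If the motion model cannot tell the two users apart, the attacker is at 50%.
safe = two_user_breach_probability(e1=0.5, e2=0.5, e3=0.5, e4=0.5)
```

With the first set of values the attacker's confidence exceeds 0.97, so any threshold C below that would flag this release candidate.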
Checking for Breaches • Does release candidate D*(Tj) cause a breach? • Brute force algorithm • Exponential in release candidate cluster size • Heuristic pruning tools • Reduce the search space considerably in practice
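A brute-force check for a single cluster with a 1-step forward model could be sketched as follows. It enumerates every assignment of users to published locations, which is where the exponential cost comes from. The matrix layout and function names are assumptions, and none of the heuristic pruning mentioned above is included.

```python
from itertools import permutations

def breach_probabilities(trans):
    """Posterior Pr[user i is at published location k] for one cluster.

    trans[i][k] is the motion-model probability that user i moved to
    published location k. Every permutation of locations is one possible
    assignment, weighted by the product of its transition probabilities.
    """
    n = len(trans)
    post = [[0.0] * n for _ in range(n)]
    total = 0.0
    for perm in permutations(range(n)):  # n! assignments
        weight = 1.0
        for user, loc in enumerate(perm):
            weight *= trans[user][loc]
        total += weight
        for user, loc in enumerate(perm):
            post[user][loc] += weight
    if total == 0.0:
        raise ValueError("no assignment is consistent with the motion model")
    return [[v / total for v in row] for row in post]

def causes_breach(trans, threshold):
    """True if any user-location posterior exceeds the threshold C."""
    return max(max(row) for row in breach_probabilities(trans)) > threshold

posterior = breach_probabilities([[0.9, 0.1], [0.2, 0.8]])
```

The backward check has the same form, with `trans` built from the backward motion model.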
Publishing Algorithms • How to publish useful data, without causing a privacy breach? • Cluster-based sanitization mechanism offers two main options • Increase cluster size (or change composition) • Reduce publication frequency
Publishing Algorithms • General Case • At each epoch Tj, publish the most compact release candidate D*(Tj) that does not cause a breach • Need to delay publishing until epoch Tj+h to check for backward breaches • NP-hard optimization problem; proposed alternative heuristics • Special Case • Durable clusters (same individuals at each epoch) • Motion model satisfies symmetry property • No need to delay publishing
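The general-case selection step can be sketched as a simple search over a given list of release candidates. This sidesteps the NP-hard optimization by assuming the candidates are already enumerated. The compactness measure and the breach check are passed in as hypothetical callables, and the breach check is assumed to cover both forward and backward breaches, which in practice means running it only once the next h epochs are available.

```python
def choose_release(candidates, causes_breach, compactness):
    """Return the most compact candidate that passes the breach check.

    candidates:    iterable of release candidates for this epoch
    causes_breach: callable, True if publishing the candidate is unsafe
    compactness:   callable, smaller values mean more compact (more useful)
    Returns None when every candidate is unsafe, i.e. skip this epoch.
    """
    for candidate in sorted(candidates, key=compactness):
        if not causes_breach(candidate):
            return candidate
    return None

# Toy example: clusterings of four pseudonyms, where (by assumption)
# any clustering with clusters smaller than 3 is treated as a breach.
largest_cluster = lambda clustering: max(len(c) for c in clustering)
options = [[["p1", "p2"], ["p3", "p4"]], [["p1", "p2", "p3", "p4"]]]
chosen = choose_release(options, lambda c: largest_cluster(c) < 3, largest_cluster)
```

Returning `None` corresponds to the second option above, reducing publication frequency, while falling back to a larger candidate corresponds to increasing cluster size.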
Experimental Study • Used real highway traffic data from UM Transportation Research Institute • GPS data sampled from cars of 72 volunteers • Sampling rate (epoch) = 0.01 seconds • Speed range 0-170 km/hour • Also synthetic data • Able to control the generative motion distribution
Experimental Study • All static “snapshot” anonymization mechanisms are vulnerable to motion prediction attacks • Applied two representative algorithms (r-Gather [Aggarwal 06] and k-Condense [Aggarwal 04]) • Each produces a set of clusters with k users each • [Figure: results for k-Condense and r-Gather]
Speculation / Future Work • GPS example illustrates the importance of reasoning about data dynamics, history, and predictable patterns of change when protecting privacy • Dynamic private data in other apps. • E.g., longitudinal social science data • Study subjects age predictably • Most people don’t move very far • Income changes predictably • Hypothesis: History and prediction are important in these settings, too!