Explore the issue of data leakage and strategies to identify the guilty agents involved in sharing data from various sources. Discuss guilt models, distribution tactics, and approaches for minimizing overlap in distributed data.
Detecting Data Leakage Panagiotis Papadimitriou papadimitriou@stanford.edu Hector Garcia-Molina hector@cs.stanford.edu
Leakage Problem [Figure: profiles of Kathryn, Jeremy, Sarah and Mark (e.g. Name: Sarah, Sex: Female, …; Name: Mark, Sex: Male, …), applications U1 and U2, and other sources, e.g. Sarah’s network] Stanford Infolab
Outline • Problem Description • Guilt Models (e.g. Pr{U1 leaked data} = 0.7, Pr{U2 leaked data} = 0.2) • Distribution Strategies
Problem Description • Guilt Models • Distribution Strategies
Problem Entities
Agents’ Data Requests • Sample requests, e.g. 100 profiles of Stanford people • Explicit requests, e.g. all people who added the application (the example used so far), or all Stanford profiles
Problem Description • Guilt Models • Distribution Strategies
Guilt Models (1/3) • p: posterior probability that a leaked profile comes from other sources (e.g. Sarah’s network) • Guilty agent: an agent who leaks at least one profile • Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S
Guilt Models (2/3) • Agents leak all their data items OR nothing • Agents leak each of their data items independently • [Figure: alternative leakage scenarios with probabilities $p^2$, $p(1-p)$, $(1-p)p$ and $(1-p)^2$]
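The independent-leak model can be illustrated with a short Python sketch. It rests on an assumption that the slide does not spell out: each leaked item independently comes from other sources with probability p, and otherwise from one of the agents holding it, each equally likely. The function name `guilt_probability` and the example allocation are hypothetical.

```python
def guilt_probability(agent, allocations, leaked, p):
    """Estimate Pr{G_i | S} under the assumed independent-leak model.

    allocations maps each agent to the set of items it received,
    leaked is the leaked set S, and p is the probability that a
    leaked item comes from other sources.
    """
    prob_no_leak = 1.0
    for item in leaked & allocations[agent]:
        # Number of agents that hold this item (assumed equally likely leakers).
        holders = sum(1 for items in allocations.values() if item in items)
        # Probability that this agent did NOT leak this particular item.
        prob_no_leak *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_no_leak


# Hypothetical allocation: U1 holds both leaked profiles, U2 holds one.
allocations = {"U1": {"Sarah", "Mark"}, "U2": {"Sarah"}}
leaked = {"Sarah", "Mark"}
for agent in sorted(allocations):
    print(agent, guilt_probability(agent, allocations, leaked, p=0.5))
```

In this example U1 comes out more likely to be guilty than U2, because U1 is the only agent holding Mark's profile.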
Guilt Models (3/3) [Figure: two plots with axes Pr{G1} and Pr{G2}, one for agents that leak items independently and one for agents that do NOT leak items independently]
Problem Description • Guilt Models • Distribution Strategies
The Distributor’s Objective (1/2) [Figure: agents U1, U2, U3 and U4 each send a request and receive a data set; S is the leaked set] • Objective: Pr{G1|S} >> Pr{G2|S} and Pr{G1|S} >> Pr{G4|S}
The Distributor’s Objective (2/2) • To achieve his objective, the distributor has to distribute sets R1, …, Rn that minimize [formula not preserved in this transcript] • Intuition: minimizing the data shared among agents makes the leaked data reveal the guilty agents
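The intuition about shared data can be made concrete with a small Python sketch. It does not reproduce the objective from the slide; it assumes, purely for illustration, that overlap is measured as the total number of items shared across all pairs of agents. The function name `total_pairwise_overlap` and both example allocations are hypothetical.

```python
from itertools import combinations


def total_pairwise_overlap(allocations):
    """Sum of |Ri ∩ Rj| over all pairs of agents (an assumed overlap measure)."""
    return sum(len(a & b) for a, b in combinations(allocations.values(), 2))


# Two hypothetical ways of serving two agents who each want two profiles.
shared = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"}}
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}

print(total_pairwise_overlap(shared))
print(total_pairwise_overlap(disjoint))
```

With the disjoint allocation, any leaked profile was held by exactly one agent, which is what makes the leaked set informative about who leaked it.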
Distribution Strategies – Sample (1/4) • Set T has four profiles: Kathryn, Jeremy, Sarah and Mark • There are 4 agents: U1, U2, U3 and U4 • Each agent requests a sample of any 2 profiles of T for a market survey
Distribution Strategies – Sample (2/4) [Figure: two candidate distributions of profiles to agents U1, U2, U3 and U4, one labeled "Poor" and one labeled "Minimize"]
Distribution Strategies – Sample (3/4) • Optimal distribution: avoid full overlaps and minimize [formula not preserved in this transcript] • [Figure: distribution of profiles to agents U1, U2, U3 and U4]
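For the four-profile, four-agent example, a distribution that avoids full overlaps can be found by exhaustive search. The Python sketch below is an illustration rather than the method from the talk: it assumes overlap is counted as shared items per pair of agents, and it ranks candidate distributions first by the number of agent pairs with identical samples and then by total shared items. The names `score` and `best_sample_distribution` are hypothetical.

```python
from itertools import combinations, product

PROFILES = ["Kathryn", "Jeremy", "Sarah", "Mark"]
AGENTS = ["U1", "U2", "U3", "U4"]


def score(assignment):
    """Return (pairs with identical samples, total shared items over pairs)."""
    pairs = list(combinations(assignment, 2))
    full_overlaps = sum(1 for a, b in pairs if a == b)
    shared_items = sum(len(a & b) for a, b in pairs)
    return (full_overlaps, shared_items)


def best_sample_distribution(profiles, n_agents, sample_size):
    """Brute-force search over every way to give each agent one sample."""
    samples = [frozenset(c) for c in combinations(profiles, sample_size)]
    return min(product(samples, repeat=n_agents), key=score)


best = best_sample_distribution(PROFILES, len(AGENTS), 2)
for agent, sample in zip(AGENTS, best):
    print(agent, sorted(sample))
print(score(best))
```

Under these assumptions the best score is (0, 4): no two agents receive the same pair, and each profile ends up with exactly two agents. Brute force is only practical for toy sizes like this one, since the search space grows exponentially with the number of agents.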
Distribution Strategies – Sample (4/4)
Distribution Strategies
Sample Data Requests • The distributor has the freedom to select the data items to provide to the agents • General idea: provide agents with sets of data that are as disjoint as possible • Problem: there are cases where the distributed data must overlap, e.g. |R1| + … + |Rn| > |T|
Explicit Data Requests (NOT COVERED in this talk) • The distributor must provide agents with the data they request • General idea: add fake data to the distributed data to minimize the overlap of distributed data • Problem: agents can collude and identify fake data
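The overlap condition for sample requests is a counting argument: if the requested sample sizes add up to more than |T|, at least one item must go to more than one agent. A minimal Python sketch of that check follows; the function name `disjoint_allocation` and the slicing scheme are assumptions made for illustration only.

```python
def disjoint_allocation(items, sizes):
    """Return disjoint samples of the requested sizes, or None if impossible."""
    if sum(sizes) > len(items):
        # |R1| + ... + |Rn| > |T|: some overlap is unavoidable.
        return None
    allocation, start = [], 0
    for size in sizes:
        # Hand each agent the next unused slice of items.
        allocation.append(items[start:start + size])
        start += size
    return allocation


T = ["Kathryn", "Jeremy", "Sarah", "Mark"]
print(disjoint_allocation(T, [2, 2]))        # two agents: disjoint samples exist
print(disjoint_allocation(T, [2, 2, 2, 2]))  # four agents: overlap is forced
```

The second call corresponds to the four-agent example, where eight requested profiles exceed the four available ones.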
Conclusions • Data leakage: modeled as a maximum likelihood problem • Data distribution strategies that help identify the guilty agents