
Detecting Data Leakage

Panagiotis Papadimitriou (papadimitriou@stanford.edu), Hector Garcia-Molina (hector@cs.stanford.edu)


Presentation Transcript


  1. Detecting Data Leakage. Panagiotis Papadimitriou (papadimitriou@stanford.edu), Hector Garcia-Molina (hector@cs.stanford.edu)

  2. Leakage Problem (Diagram: profiles such as "Name: Sarah, Sex: Female, …" and "Name: Mark, Sex: Male, …"; people Jeremy, Sarah, Mark and Kathryn; applications U1 and U2; other sources, e.g. Sarah’s network.) Stanford Infolab

  3. Outline • Problem Description • Guilt Models • Pr{U1 leaked data} = 0.7 • Pr{U2 leaked data} = 0.2 • Distribution Strategies

  4. Problem Description • Guilt Models • Distribution Strategies

  5. Problem Entities

  6. Agents’ Data Requests • Sample: e.g., 100 profiles of Stanford people • Explicit: e.g., all people who added the application (the example used so far), or all Stanford profiles

  7. Problem Description • Guilt Models • Distribution Strategies

  8. Guilt Models (1/3) • p: posterior probability that a leaked profile comes from other sources • Guilty agent: an agent who leaks at least one profile • Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S

  9. Guilt Models (2/3) • Agents leak all their data items OR nothing • Agents leak each of their data items independently (Probabilities shown in the diagram: p^2, p(1-p), (1-p)p and (1-p)^2.)
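
The slides do not spell out how Pr{Gi|S} is computed. A minimal sketch for the independent-leak case, under assumptions that are not in the transcript: each leaked item came from other sources with probability p, and otherwise was leaked by exactly one of the agents holding it, each equally likely. The function name `guilt_probability` and the example allocation are hypothetical.

```python
def guilt_probability(agent, leaked, allocation, p):
    """Estimate Pr{G_i | S} when agents leak each item independently.

    allocation maps an agent name to the set of items given to that agent.
    Assumption (not from the slides): a leaked item comes from other sources
    with probability p, otherwise from one of its holders, chosen uniformly.
    """
    prob_innocent = 1.0
    for item in leaked & allocation[agent]:
        # Number of agents that were given this item (at least 1: `agent`).
        holders = sum(1 for items in allocation.values() if item in items)
        # Probability that this particular agent leaked this item.
        leaked_by_agent = (1.0 - p) / holders
        prob_innocent *= 1.0 - leaked_by_agent
    # Guilty means the agent leaked at least one item of S.
    return 1.0 - prob_innocent


# Hypothetical example: Mark's profile is shared, Sarah's is held only by U1.
allocation = {"U1": {"Sarah", "Mark"}, "U2": {"Mark", "Kathryn"}}
leaked = {"Sarah", "Mark"}
for agent in allocation:
    print(agent, round(guilt_probability(agent, leaked, allocation, p=0.2), 2))
```

In this example U1 comes out as more likely guilty than U2, because one of the leaked profiles was given to U1 alone.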

  10. Guilt Models (3/3) (Two plots, "Independently" and "NOT Independently", each with axes Pr{G1} and Pr{G2}.)

  11. Problem Description • Guilt Models • Distribution Strategies

  12. The Distributor’s Objective (1/2) (Diagram: agents U1, U2, U3 and U4 send requests and receive sets R1, R2, R3 and R4; S is the leaked set.) Pr{G1|S} >> Pr{G2|S} and Pr{G1|S} >> Pr{G4|S}

  13. The Distributor’s Objective (2/2) • To achieve his objective, the distributor has to distribute sets R1, …, Rn that minimize the data shared among agents • Intuition: minimizing data sharing among agents makes the leaked data reveal the guilty agents
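
The exact objective formula did not survive in this transcript. As an illustration of the intuition only, the sketch below scores an allocation by a sum of relative overlaps: for each agent, the number of items it shares with every other agent, divided by the size of its own set. Both the scoring formula and the name `overlap_objective` are assumptions, not taken from the slides.

```python
def overlap_objective(allocation):
    """Score an allocation by how much data agents share (lower is better).

    allocation maps an agent name to the non-empty set of items it received.
    Assumed score: sum over agents i of (1 / |R_i|) * sum over j != i of
    |R_i & R_j|. This is an illustrative choice, not the slides' formula.
    """
    agents = list(allocation)
    total = 0.0
    for i in agents:
        shared = sum(len(allocation[i] & allocation[j]) for j in agents if j != i)
        total += shared / len(allocation[i])
    return total


# Hypothetical allocations: fully disjoint sets versus identical sets.
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}
identical = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Kathryn", "Jeremy"}}
print(overlap_objective(disjoint), overlap_objective(identical))
```

Disjoint sets score zero, so any leaked item points at a single agent; identical sets score highest, and a leak cannot separate the agents.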

  14. Distribution Strategies – Sample (1/4) • Set T has four profiles: • Kathryn, Jeremy, Sarah and Mark • There are 4 agents: • U1, U2, U3 and U4 • Each agent requests a sample of any 2 profiles of T for a market survey

  15. Distribution Strategies – Sample (2/4) (Diagram: two example allocations to U1, U2, U3 and U4, labeled "Poor" and "Minimize".)

  16. Distribution Strategies – Sample (3/4) • Optimal distribution: avoid full overlaps and minimize the overlap among the agents’ sets (Diagram: allocation to U1, U2, U3 and U4.)
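
The running example (four profiles, four agents, samples of two) is small enough to search exhaustively. The sketch below is a brute-force illustration, not the strategy from the talk: it assumes a cost that first counts pairs of agents with identical sets ("full overlaps") and then the total pairwise overlap. The names `pairwise_overlaps` and `best_sample_allocation` are hypothetical.

```python
from itertools import combinations, product


def pairwise_overlaps(sets):
    """Return |R_i & R_j| for every unordered pair of agents."""
    return [len(a & b) for a, b in combinations(sets, 2)]


def best_sample_allocation(items, n_agents, sample_size):
    """Exhaustively pick one sample per agent, minimizing an assumed cost.

    Cost (an assumption): (number of fully overlapping pairs, total overlap),
    compared lexicographically. Exponential in n_agents; toy sizes only.
    """
    candidates = [frozenset(c) for c in combinations(items, sample_size)]

    def cost(sets):
        overlaps = pairwise_overlaps(sets)
        full = sum(1 for o in overlaps if o == sample_size)
        return (full, sum(overlaps))

    return min(product(candidates, repeat=n_agents), key=cost)


T = ["Kathryn", "Jeremy", "Sarah", "Mark"]
best = best_sample_allocation(T, n_agents=4, sample_size=2)
# A poor allocation for comparison: every agent gets the same two profiles.
poor = (frozenset({"Kathryn", "Jeremy"}),) * 4

for name, sets in zip(["U1", "U2", "U3", "U4"], best):
    print(name, sorted(sets))
print("total overlap, best vs poor:",
      sum(pairwise_overlaps(best)), sum(pairwise_overlaps(poor)))
```

Because the agents request eight profiles in total and T holds only four, some overlap is unavoidable; the search finds an allocation in which no two agents receive the same pair.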

  17. Distribution Strategies – Sample (4/4)

  18. Distribution Strategies • Sample Data Requests: the distributor has the freedom to select the data items to provide the agents with. General idea: provide agents with sets of data that are as disjoint as possible. Problem: there are cases where the distributed data must overlap, e.g., |R1|+…+|Rn| > |T| • Explicit Data Requests: the distributor must provide agents with the data they request. General idea: add fake data to the distributed data to minimize the overlap of distributed data. Problem: agents can collude and identify fake data. NOT COVERED in this talk

  19. Conclusions • Data Leakage • Modeled as a maximum likelihood problem • Data distribution strategies that help identify the guilty agents

  20. Thank You!
