Top-k Queries Across Multiple Private Databases (2005) Li Xiong (Emory University) Subramanyam Chitti (GA Tech) Ling Liu (GA Tech) Presented by: Cesar Gutierrez Mining Multiple Private Databases
About Me • ISYE Senior and CS minor • Graduating December, 2008 • Humanitarian Logistics and/or Supply Chain • Originally from Lima, Peru • Travel, paintball and politics
Outline • Intro. & Motivation • Problem Definition • Important Concepts & Examples • Private Algorithm • Conclusion
Introduction • ↓ information-sharing restrictions, thanks to technology • ↑ need for distributed data-mining tools that preserve privacy • Trade-off: Accuracy vs. Efficiency vs. Privacy
Motivating Scenarios • CDC needs to study insurance data to detect disease outbreaks • Disease incidents • Disease seriousness • Patient background • Legal and commercial constraints prevent release of policyholders' information
Motivating Scenarios (cont'd) • Industrial trade group collaboration • Useful pattern: "manufacturing processes using chemical supplies from supplier X have high failure rates" • Trade secret: "manufacturing process Y gives low failure rate"
Problem & Assumptions • Model: n nodes, horizontal partitioning • Assume Semi-honesty: • Nodes follow specified protocol • Nodes attempt to learn additional information about other nodes ...
Challenges • Why not use a Trusted Third Party (TTP)? • Difficult to find one that is trusted • Increased danger from single point of compromise • Why not use secure multi-party computation techniques? • High communication overhead • Feasible for small inputs only
Recall Our 3-D Goal • Accuracy • Efficiency • Privacy
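Private Max • Actual data sent on first pass • Static starting point (known start) • [Figure: ring of four nodes, 1 to 4, with local values 30, 10, 40, 20; the running maximum passed from node to node is 30, 30, 40, 40]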
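The single-pass Private Max idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the presenters' or the paper's code: the function name `naive_ring_max` and the list-based ring are hypothetical, and the values 30, 10, 40, 20 are the ones shown on the slide.

```python
def naive_ring_max(local_values, start=0):
    """Pass a running maximum once around a ring, from a fixed, known start.

    Hypothetical helper for illustration; each list entry stands for
    one node's private value.
    """
    n = len(local_values)
    running = float("-inf")
    trace = []  # (node index, value that node passes to its successor)
    for step in range(n):
        node = (start + step) % n
        running = max(running, local_values[node])
        trace.append((node, running))
    return running, trace


values = [30, 10, 40, 20]
result, trace = naive_ring_max(values)
for node, passed in trace:
    print(f"node {node} passes {passed}")
print("max:", result)
```

The trace makes the privacy problem visible: the first value passed is exactly the starting node's private value, and because the start is static, every participant knows whose value it is.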
Multi-Round Max • Randomly perturbed data passed to successor during multiple passes • No successor can determine actual data from its predecessor • Randomized starting point • [Figure: ring of four nodes with local values 30, 10, 40, 20; starting from 0, the values passed between nodes over successive rounds (18, 32, 35, 40) rise toward the true maximum, 40]
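A rough sketch of the multi-round idea follows. Several details are assumptions made for the example rather than facts from the slides: the parameter names `p0` (initial randomization probability), `d` (per-round decay factor), `rounds`, and `floor` (a public lower bound used as the starting value) are hypothetical, the decay schedule `p0 * d ** (round - 1)` is one plausible choice, and the final round is forced to be deterministic so that this sketch always returns the exact maximum.

```python
import random


def multi_round_max(local_values, rounds=5, p0=1.0, d=0.5, floor=0, rng=None):
    """Ring-based max in which nodes may pass perturbed values in early rounds.

    Illustrative sketch only; all parameter names are assumptions.
    """
    rng = rng or random.Random()
    n = len(local_values)
    start = rng.randrange(n)  # randomized starting point
    g = floor                 # running value circulated around the ring
    for rnd in range(1, rounds + 1):
        # Randomization probability decays each round; the last round
        # is deterministic (an assumption that guarantees exactness here).
        p = p0 * d ** (rnd - 1) if rnd < rounds else 0.0
        for step in range(n):
            v = local_values[(start + step) % n]
            if g >= v:
                continue  # node has nothing larger: pass g on unchanged
            if rng.random() < p:
                # Hide the true value: pass something between g and v
                g = rng.uniform(g, v)
            else:
                g = v     # reveal the true local value
    return g


print(multi_round_max([30, 10, 40, 20], rng=random.Random(7)))
```

Because a node never passes a value above its own, the circulated value can only rise toward the true maximum. In this sketch a larger `d` keeps randomization alive for more rounds, which matches the slide's "large d = more randomization" intuition.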
Evaluation Parameters • Large k = "avoid information leaks" • Large d = more randomization = more privacy • Small d = more accurate (deterministic) • Large r = "as accurate as ordinary classifier"
Conclusion • Problems Tackled • Preserving efficiency and accuracy while introducing provable privacy to the system • Improving a naive protocol • Reducing privacy risk in an efficient manner
Critique • Full understanding depends on reading other research papers • Few or no illustrations • A real-life example would have made the charts easier to understand