DISTRIBUTING DATA FOR SECURE DATA SERVICES
Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia Molina, Rajeev Motwani
March 25, 2011
Stanford, TRDDC, TRUST
Road Map
• Motivation for Secure Databases
• Distributing Data
• Encryption, Distribution
• Privacy Constraints
• Schema Decomposition
• Query Partitioning
• Cost Estimation
• Where and Select clause processing
• Query Decomposition
• Experiments
• Related Work
Motivation 1: Data Privacy in Enterprises
• Health: personal medical details, disease history, clinical research data
• Govt. Agencies: census records, economic surveys, hospital records
• Banking: bank statement, loan details, transaction history
• Manufacturing: process details, blueprints, production data
• Finance: portfolio information, credit history, transaction records, investment details
• Outsourcing: customer data for testing, remote DB administration, BPO & KPO
• Insurance: claims records, accident history, policy details
• Retail Business: inventory records, individual credit card details, audits
Motivation 3: Personal Information
• Emails
• Searches on Google/Yahoo
• Profiles on social networking sites
• Passwords / credit card / personal information at multiple e-commerce sites / organizations
• Documents on the computer / network
Data Privacy
• Value disclosure: what is the value of the attribute salary for person X?
  • Perturbation: Privacy Preserving OLAP
• Identity disclosure: whether an individual is present in the database table
  • Randomization, K-Anonymity etc.: data for outsourcing / research
• Linkage disclosure: linking columns from multiple sites
Losses due to Lack of Privacy: ID-Theft
• 3% of households in the US affected by ID-Theft
• US $5-50B losses/year
• UK £1.7B losses/year
• AUD $1-4B losses/year
Two Can Keep a Secret: A Distributed Architecture for Secure Database Services
How to distribute data across multiple sites for redundancy and privacy, so that a single site being compromised does not lead to data loss.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu. CIDR 2005
Cloud Data Services
• Data outsourcing growing in popularity
  • Cheap, reliable data storage and management
  • 1 TB for $399 (< $0.50 per GB)
  • $5000 for Oracle 10g / SQL Server
  • $68k/year for a DB administrator
• Privacy concerns looming ever larger
  • High-profile thefts (often insiders)
  • UCLA lost 900k records
  • Berkeley lost laptop with sensitive information
  • Acxiom, JP Morgan, Choicepoint
  • www.privacyrights.org
Present solutions
• Application level: Salesforce.com
  • On-Demand Customer Relationship Management
  • $65/user/month; $995 for 5 users for 1 year
• Amazon Elastic Compute Cloud
  • 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth
  • Elastic, completely controlled, reliable, secure
  • $0.10 per instance hour
  • $0.20 per GB of data in/out of Amazon
  • $0.15 per GB-month of Amazon S3 storage used
• Google Apps for your domain
  • Small businesses, enterprise, school, family or group
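Encryption Based Solution
[Diagram: the client encrypts its data and stores it at the DSP; a client-side processor turns query Q into Q’ for the DSP, receives the “relevant data”, and returns the answer to the client.]
Problem: Q’ ≈ “SELECT *”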
The Power of Two
[Diagram: a client connected to two providers, DSP1 and DSP2.]
The Power of Two
[Diagram: the client-side processor splits query Q into Q1 for DSP1 and Q2 for DSP2.]
Key: Ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)
Privacy Constraints
SB1386 Privacy
• The following sets are all to be kept private:
  • {Name, SSN}
  • {Name, LicenceNo}
  • {Name, CaliforniaID}
  • {Name, AccountNumber}
  • {Name, CreditCardNo, SecurityCode}
• A set is private if at least one of its elements is “hidden”.
• An element stored in encrypted form counts as hidden.
Techniques for Satisfying Privacy Constraints
• Vertical Fragmentation
  • Partition attributes across R1 and R2
  • E.g., to obey constraint {Name, SSN}: R1 holds Name, R2 holds SSN
  • Use tuple IDs for reassembly: R = R1 JOIN R2
• Encoding
  • One-time Pad
    • For each value v, construct a random bit sequence r
    • R1: v XOR r, R2: r
  • Deterministic Encryption
    • R1: E_K(v), R2: K
    • Can detect equality and push selections with an equality predicate
  • Random addition
    • R1: v + r, R2: r
    • Can push aggregate SUM
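The one-time pad and random addition encodings listed on this slide can be illustrated with a minimal Python sketch. The function names and the modulus are assumptions made for illustration (the slide only states "v XOR r" and "v + r"); the modular arithmetic is added here so that the R1 share stays within a fixed range.

```python
import secrets

def otp_split(value: bytes) -> tuple[bytes, bytes]:
    """One-time pad: R1 stores v XOR r, R2 stores the random pad r."""
    pad = secrets.token_bytes(len(value))
    masked = bytes(a ^ b for a, b in zip(value, pad))
    return masked, pad

def otp_join(masked: bytes, pad: bytes) -> bytes:
    """Client-side reassembly: (v XOR r) XOR r = v."""
    return bytes(a ^ b for a, b in zip(masked, pad))

# Assumed modulus for random addition; not specified on the slide.
MODULUS = 2**64

def additive_split(value: int) -> tuple[int, int]:
    """Random addition: R1 stores v + r, R2 stores r (both mod MODULUS)."""
    r = secrets.randbelow(MODULUS)
    return (value + r) % MODULUS, r

# Pushing SUM: each server sums its own column, the client subtracts.
salaries = [52000, 61000, 75000]
shares = [additive_split(s) for s in salaries]
sum_at_r1 = sum(s1 for s1, _ in shares) % MODULUS
sum_at_r2 = sum(r for _, r in shares) % MODULUS
total = (sum_at_r1 - sum_at_r2) % MODULUS

ssn = b"123-45-6789"
recovered = otp_join(*otp_split(ssn))
```

Neither share alone reveals the value, yet the SUM aggregate can be computed at each server independently and combined by the client with a single subtraction.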
Example Schema & Privacy Constraints
• An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
• Privacy Constraints
  • {Telephone}, {Email}
  • {Name, Salary}, {Name, Position}, {Name, DoB}
  • {DoB, Gender, ZipCode}
  • {Position, Salary}, {Salary, DoB}
• Will use just Vertical Fragmentation and Encoding.
• An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
• Privacy Constraints
  • {Telephone}, {Email}
  • {Name, Salary}, {Name, Position}, {Name, DoB}
  • {DoB, Gender, ZipCode}
  • {Position, Salary}, {Salary, DoB}
• Decomposed schema
  • R1: {TID, Name, Email, Telephone, Gender, Salary}
  • R2: {TID, Name, Email, Telephone, DoB, Position, ZipCode}
  • Encrypted Attributes E: {Telephone, Email, Name}
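As a sanity check, the decomposition on this slide can be tested against the privacy constraints with a short Python sketch. It assumes the rule stated earlier in the talk (a set is private if at least one element is hidden, and encrypted elements count as hidden) and applies it to each server separately; the function name `violations` is hypothetical.

```python
def violations(constraints, fragments, encrypted):
    """Return (constraint, fragment) pairs where one fragment exposes
    every attribute of a constraint in the clear."""
    found = []
    for constraint in constraints:
        for name, attrs in fragments.items():
            visible = attrs - encrypted  # encrypted attributes are hidden
            if constraint <= visible:
                found.append((sorted(constraint), name))
    return found

constraints = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]
fragments = {
    "R1": {"TID", "Name", "Email", "Telephone", "Gender", "Salary"},
    "R2": {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"},
}
encrypted = {"Telephone", "Email", "Name"}

print(violations(constraints, fragments, encrypted))
```

For the schema above the list is empty. Moving Salary from R1 to R2, for example, would expose {Position, Salary} and {Salary, DoB} at the second server.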
Partitioning, Execution
• Partitioning Problem
  • Partition to minimize communication cost for a given workload
  • Even a simplified version is hard to approximate
  • Hill Climbing algorithm after starting with weighted set cover
• Query Reformulation and Execution
  • Consider only centralized plans
  • Algorithm to partition SELECT and WHERE clause predicates between the two partitions
State Definitions for Bottom Up Evaluation
• 0: condition clause cannot be pushed to either server
• 1: condition clause can be pushed to Server 1
• 2: condition clause can be pushed to Server 2
• 3: condition clause can be pushed to both servers
• 4: condition clause can be pushed to either server
Query Partitioning
• Original Query:
  SELECT Name, DoB, Salary FROM R
  WHERE (Name = 'Tom' AND Position = 'Staff') AND (ZipCode = '94305' OR Salary > 60000)
• R1: {TID, Name, Email, Telephone, Gender, Salary}
• R2: {TID, Email, Telephone, DoB, Position, ZipCode}
• Query 1: SELECT TID, Name, Salary FROM R1 WHERE Name = 'Tom'
• Query 2: SELECT TID, DoB, ZipCode FROM R2 WHERE Position = 'Staff'
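The split of the WHERE clause in this example can be reproduced with a small bottom-up pass over the predicate tree. The following Python sketch is a simplification under stated assumptions, not the algorithm from the talk: each node is labelled with the set of servers able to evaluate it (the empty set plays the role of state 0, {1} and {2} of states 1 and 2, and states 3 and 4 are not distinguished), and all helper names are hypothetical.

```python
# Fragments from the slide; predicates are pushed by attribute placement only.
FRAGMENTS = {
    1: {"TID", "Name", "Email", "Telephone", "Gender", "Salary"},
    2: {"TID", "Email", "Telephone", "DoB", "Position", "ZipCode"},
}

def leaf(attr, text):
    return ("leaf", attr, text)

def and_(left, right):
    return ("and", left, right)

def or_(left, right):
    return ("or", left, right)

def servers(node):
    """Bottom-up: servers that can evaluate this clause on their own."""
    if node[0] == "leaf":
        return {s for s, attrs in FRAGMENTS.items() if node[1] in attrs}
    return servers(node[1]) & servers(node[2])

def render(node):
    if node[0] == "leaf":
        return node[2]
    op = " AND " if node[0] == "and" else " OR "
    return "(" + render(node[1]) + op + render(node[2]) + ")"

def partition_where(root):
    """Split a WHERE tree into per-server clauses and a client-side residual."""
    pushed, residual = {1: [], 2: []}, []

    def visit(node):
        where = servers(node)
        if where:
            pushed[min(where)].append(render(node))  # arbitrary tie-break
        elif node[0] == "and":
            visit(node[1])  # conjuncts may be pushed separately
            visit(node[2])
        else:
            residual.append(render(node))  # OR across servers stays local

    visit(root)
    return pushed, residual

where = and_(
    and_(leaf("Name", "Name = 'Tom'"), leaf("Position", "Position = 'Staff'")),
    or_(leaf("ZipCode", "ZipCode = '94305'"), leaf("Salary", "Salary > 60000")),
)
pushed, residual = partition_where(where)
print(pushed)
print(residual)
```

The disjunction over ZipCode and Salary spans both fragments, so it remains with the client. This is consistent with the two sub-queries on the slide, which fetch Salary and ZipCode along with TID so that the client can join on TID and apply the remaining condition.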
Papers
• [CIDR05] Two Can Keep A Secret.
• [SIGMOD05] Privacy Preserving OLAP.
• [ICDT05] Anonymizing Tables.
• [PODS06] Clustering For Anonymity.
• [KDD07] Probabilistic Anonymity.
Acknowledgements: Collaborators
• Stanford Privacy Group
• TRDDC Privacy Group
• PORTIA, TRUST, Google