Methods of Secure Computation and Data Integration
Jerome Reiter, Duke University
Alan Karr, NISS
Xiaodong Lin, University of Cincinnati
Ashish Sanil, Bristol Myers Squibb
General setting
• Multiple agencies seek to improve analyses by “pooling” their data.
• Agencies do not want to reveal individual data values that the other agencies do not already know.
• Agencies want accurate results from the pooling procedures.
Pooling situations
• Horizontally partitioned: Agencies have different records but the same variables.
• Purely vertically partitioned: Agencies have the same records but different variables.
• Partially overlapping, vertically partitioned: Agencies have different records and different variables, with some common records and variables.
Horizontal partitioning
Karr, Lin, Sanil, Reiter (JCGS, 2005)
• Secure data integration
-- shares data but protects sources.
-- allows any analysis to be done.
• Secure summation
-- shares sums without sharing data.
-- allows regressions, association rules, classification, clustering.
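To see why sharing sums is enough for a regression, consider ordinary least squares on horizontally partitioned data. The notation is an assumption introduced here for illustration: there are K agencies, and agency k holds the rows X^{(k)} and responses y^{(k)}. Each cross-product matrix is then a sum of per-agency terms, so every entry can be obtained by secure summation:

```latex
\[
X^\top X = \sum_{k=1}^{K} \bigl(X^{(k)}\bigr)^\top X^{(k)}, \qquad
X^\top y = \sum_{k=1}^{K} \bigl(X^{(k)}\bigr)^\top y^{(k)}, \qquad
\hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top y .
\]
```

Only the summed matrices are revealed, and each agency can compute the same least-squares estimate from them.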
Secure summation
Obtain $\sum_j x_j$ without sharing individual values
• Agency A adds a random number R, known only to itself, to its x and passes (x + R) to Agency B.
• Agency B adds its x to this value and passes the sum to Agency C.
• The process continues until all agencies have added their x.
• Agency A subtracts R from the returned sum.
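The steps above can be sketched in a few lines of Python. This is a single-process simulation, not a networked implementation, and it rests on assumptions that are not on the slide: the values are non-negative integers, arithmetic is done modulo a large `modulus` so the masked running total reveals nothing about the partial sums, the true total is smaller than `modulus`, and every agency follows the protocol honestly. The function name `secure_sum` is hypothetical.

```python
import secrets

def secure_sum(values, modulus=2**64):
    """Simulate the ring-passing protocol; values[0] belongs to Agency A."""
    r = secrets.randbelow(modulus)        # Agency A's private random mask R
    running = (values[0] + r) % modulus   # A passes x_A + R to the next agency
    for x in values[1:]:                  # each later agency adds its own x
        running = (running + x) % modulus
    return (running - r) % modulus        # A removes R from the returned total

# Three agencies holding 12, 7, and 30.
total = secure_sum([12, 7, 30])
```

Each agency sees only a masked running total, so it cannot recover any other agency's value on its own. Two agencies that collude could still difference the totals they see and isolate the value of the agency between them.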
Purely vertical partitioning
• Secure dot/matrix product
-- shares dot/matrix products without sharing data.
-- allows regressions, association rules, classification, clustering.
-- assumes semi-honest agencies.
• Synthetic data approaches
-- share synthetic copies of data across agencies.
-- allow any analysis when the distributions used to generate the data are accurate.
-- generate a public use data file.
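Secure dot/matrix products
Karr, Lin, Reiter, Sanil (NISS tech. report)
Compute $X^\top Y$ without revealing individual values (Agency A holds X, Agency B holds Y)
• Agency A passes $Z = [Z_1, \ldots, Z_g]$, where $Z_i^\top X_j = 0$ for all i, j, to Agency B.
• Agency B sends $W = (I - Z Z^\top) Y$ to Agency A.
• Agency A computes $X^\top W = X^\top (I - Z Z^\top) Y = X^\top Y$.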
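A minimal sketch of one protocol of this kind, assuming Agency A holds X and Agency B holds Y with the same records in the same order. A sends a matrix Z whose columns are orthonormal and orthogonal to every column of X, B returns W = (I - ZZ^T)Y, and A computes X^T W, which equals X^T Y because X^T Z = 0. The function names, the number of vectors `g`, and the toy data are hypothetical. Z is built here with Gram-Schmidt in pure Python, which is one choice among several.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def columns(rows):
    """Return the columns of a row-major matrix as lists."""
    return [list(col) for col in zip(*rows)]

def gram_schmidt(vectors, basis=()):
    """Orthonormalize `vectors` against `basis` (assumed orthonormal) and each other."""
    out = []
    for v in vectors:
        w = list(v)
        for b in list(basis) + out:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-10:  # skip vectors that are numerically dependent
            out.append([wi / norm for wi in w])
    return out

def agency_a_make_z(x_rows, g, rng):
    """Agency A: build g orthonormal vectors orthogonal to every column of X."""
    n = len(x_rows)
    q = gram_schmidt(columns(x_rows))  # orthonormal basis of the column space of X
    candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(g)]
    return gram_schmidt(candidates, basis=q)

def agency_b_make_w(z_cols, y_rows):
    """Agency B: W = (I - Z Z^T) Y, computed column by column."""
    w_cols = []
    for y in columns(y_rows):
        w = list(y)
        for z in z_cols:
            c = dot(z, y)
            w = [wi - c * zi for wi, zi in zip(w, z)]
        w_cols.append(w)
    return w_cols

def agency_a_product(x_rows, w_cols):
    """Agency A: X^T W, which equals X^T Y because X^T Z = 0."""
    return [[dot(x, w) for w in w_cols] for x in columns(x_rows)]

rng = random.Random(0)  # fixed seed only so the example is reproducible
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [2.0, 1.5], [-1.0, 4.0], [0.0, 2.5]]
Y = [[2.0, 1.0], [1.0, 0.0], [0.0, 3.0], [4.0, -2.0], [1.5, 1.5], [-3.0, 2.0]]

Z = agency_a_make_z(X, g=2, rng=rng)
W = agency_b_make_w(Z, Y)
secure = agency_a_product(X, W)
direct = [[dot(x, y) for y in columns(Y)] for x in columns(X)]
```

Here `secure` and `direct` agree up to floating-point error, while the W that Agency A receives differs from Y. The sketch requires the number of records to be at least the number of columns of X plus `g`.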
Synthetic data approach
Kohnen (PhD thesis, 2005)
Assume X is not sensitive.
• Agency A passes the real X to Agency B.
• Agency B simulates multiple copies of Y from f(Y|X), estimated using the dataset from Agency A, and passes the copies to Agency A.
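Agency B's simulation step can be illustrated with a deliberately simple stand-in for f(Y|X). The sketch below assumes a single numeric X and a normal linear regression of Y on X, and it plugs in point estimates rather than drawing the parameters from a posterior distribution, which a full synthetic-data procedure would typically do. The function names `fit_line` and `synthesize_y` and the toy data are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, plus residual std. dev."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(e * e for e in resid) / (n - 2)) ** 0.5
    return intercept, slope, sigma

def synthesize_y(x, y, m, rng):
    """Agency B: draw m synthetic copies of Y from the fitted model of Y given X."""
    intercept, slope, sigma = fit_line(x, y)
    return [[rng.gauss(intercept + slope * xi, sigma) for xi in x]
            for _ in range(m)]

x_from_a = [1.0, 2.0, 3.0, 4.0, 5.0]   # real X received from Agency A
y_at_b = [2.1, 3.9, 6.2, 7.8, 10.1]    # real Y, which never leaves Agency B
copies = synthesize_y(x_from_a, y_at_b, m=3, rng=random.Random(1))
```

Only `copies` would be passed back to Agency A. How useful they are depends on how well the fitted model approximates the true f(Y|X).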
Synthetic data approach
Kohnen (PhD thesis, 2005)
• Agency A uses partially synthetic data methods (Reiter, Surv. Meth., 2003) for inferences based on Y|X.
• Agency A can release fully synthetic data to the public.
Synthetic data approaches
Kohnen (PhD thesis, 2005)
• Agency A simulates disguisers of X that look like the genuine values of X, ideally from a distribution close to f(X|Y), and passes the real X and the disguisers to Agency B.
• Agency B simulates multiple copies of Y for each X dataset it receives, drawing from f(Y|X) estimated using the datasets from Agency A, and passes the copies to Agency A.
Synthetic data approaches
Kohnen (PhD thesis, 2005)
• Agency A discards the disguisers and uses partially synthetic data methods (Reiter, Surv. Meth., 2003) to obtain inferences using the real X.
• Agency A can release fully synthetic data to the public.
Partially overlapping, vertical partitioning
• Secure EM algorithm
-- uses secure dot products.
-- continuous data: estimate the covariance matrix for multivariate normal data.
-- categorical data: estimate the parameters of log-linear models.
Limitations of methods: Defining a research agenda
• Secure computation methods:
-- How to specify models without viewing data?
-- What if sophisticated models are needed?
-- How to do posterior simulation?
• Synthetic data methods:
-- How to generate good disguisers?
• All methods:
-- How to incorporate matching errors and differences in data quality and definitions?
-- How to account for disclosure risks from models that “fit too well”?