New Measures of Data Utility
Mi-Ja Woo, National Institute of Statistical Sciences
Question: How to evaluate the characteristics of statistical disclosure limitation (SDL) methods?
• Previously, data utility measures were studied in the context of moments and linear regression models.
- Differences between inferences obtained from the original and masked data are used as the data utility.
- The regression model relies on the multivariate normality assumption.
- Comparing means and variances makes sense under the normality assumption.
• Questions:
- Is the assumption satisfied in realistic situations?
- What if the assumption is violated?
Example: Two-dimensional original data and two masked data sets produced by the synthetic and resampling methods.
Moment and Regression Models
• The distributions differ, but the moments and the estimated regression coefficients are the same.
• New measures are needed.
1. CDF Utility Measure
• Extension of the univariate case.
• Kolmogorov statistic, MD.
• Cramér-von Mises statistic, MCM.
• Both statistics compare the empirical distributions of the original and masked data.
• Large values of MD and MCM indicate that the two data sets are distributed differently.
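The slide's formulas did not survive in this transcript, so the following is only a minimal sketch of one common way to compute such statistics. It assumes the two empirical CDFs are evaluated at every pooled record, that MD is the largest absolute difference, and that MCM is the mean squared difference. The function names `ecdf_at` and `cdf_utility` are hypothetical.

```python
def ecdf_at(data, point):
    """Multivariate empirical CDF: share of records <= point in every coordinate."""
    count = sum(1 for row in data if all(a <= b for a, b in zip(row, point)))
    return count / len(data)


def cdf_utility(original, masked):
    """Return (md, mcm) for two lists of equal-length numeric tuples.

    Assumption: differences are evaluated at all pooled records;
    md is the maximum absolute difference, mcm the mean squared difference.
    """
    points = list(original) + list(masked)
    diffs = [ecdf_at(original, p) - ecdf_at(masked, p) for p in points]
    md = max(abs(d) for d in diffs)
    mcm = sum(d * d for d in diffs) / len(diffs)
    return md, mcm


if __name__ == "__main__":
    original = [(0.0,), (1.0,)]
    masked = [(2.0,), (3.0,)]
    print(cdf_utility(original, masked))
```

Identical data sets give zero for both statistics under these assumptions; the pairwise evaluation is quadratic in the number of records, so it is meant for illustration rather than for n = 10,000.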
2. Cluster Data Utility
• A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
• Cluster data utility: the difference between the proportion of observations from the original data in each cluster and a constant “c”.
• The measure is defined in terms of the total number of records, the number of records from the original data, and the weight assigned to the i-th cluster.
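As a rough illustration of the cluster utility idea, the sketch below assumes the measure is the weighted mean, over clusters, of the squared difference between the original-data share in a cluster and c. That exact form is an assumption, since the slide's formula is not reproduced here. Cluster labels are taken as given (from any clustering of the combined data), and `cluster_utility` is a hypothetical name.

```python
from collections import defaultdict


def cluster_utility(labels, is_original, c=0.5, weights=None):
    """Weighted mean of (share of original records in cluster - c)^2.

    labels:      cluster label of each record in the combined data
    is_original: True if the record comes from the original data
    weights:     optional dict mapping label -> weight (assumed 1.0 if omitted)
    """
    totals = defaultdict(int)
    originals = defaultdict(int)
    for label, from_original in zip(labels, is_original):
        totals[label] += 1
        if from_original:
            originals[label] += 1

    clusters = sorted(totals)
    if weights is None:
        weights = {label: 1.0 for label in clusters}

    total = sum(
        weights[label] * (originals[label] / totals[label] - c) ** 2
        for label in clusters
    )
    return total / len(clusters)


if __name__ == "__main__":
    # Each cluster holds one original and one masked record: well mixed.
    print(cluster_utility([0, 0, 1, 1], [True, False, True, False]))
    # Clusters separate original from masked records completely.
    print(cluster_utility([0, 0, 1, 1], [True, True, False, False]))
```

Under this assumed form, values near zero mean original and masked records are mixed within clusters in the expected proportion, and larger values mean the clustering can tell the two sources apart.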
3. Propensity Score Data Utility
• A propensity score, e(x), is generally defined as the conditional probability of assignment to a particular treatment, z = 1, given a vector of observed covariates, x (Rosenbaum and Rubin 1983).
• Rosenbaum and Rubin (1983) proved that .
• A data set is said to be randomly assigned when the propensity score for each covariate is constant (1/2 when the two groups have equal numbers of observations).
• In the propensity score method, a propensity score is estimated for each observed covariate, and utility is measured by:
Estimation of Propensity Scores:
• Combine the original and masked data sets, and create an indicator variable Rj with the value 0 for observations from the original data and 1 otherwise.
1) Logistic regression model, in which a polynomial in x is used as the predictor.
2) Tree model.
3) Modified logistic regression model: classify all data points into g groups, and fit a logistic model for each group. It combines the logistic model with clustering and borrows strength from both.
• Note: Cluster utility is one form of propensity score utility.
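The estimation steps for the logistic model can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the talk: the logistic model is fit by plain gradient ascent, the polynomial uses per-coordinate powers without cross terms, and the utility is taken to be the mean squared difference between the estimated scores and c, with c set to the share of masked records. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def poly_features(row, degree):
    """Powers 1..degree of each coordinate (assumption: no cross terms)."""
    return [x ** d for x in row for d in range(1, degree + 1)]


def fit_logistic(rows, targets, steps=2000, rate=0.1):
    """Gradient ascent on the logistic log-likelihood; beta[0] is the intercept."""
    dim = len(rows[0]) + 1
    beta = [0.0] * dim
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * dim
        for row, t in zip(rows, targets):
            feats = [1.0] + list(row)
            p = sigmoid(sum(b * f for b, f in zip(beta, feats)))
            for k, f in enumerate(feats):
                grad[k] += (t - p) * f
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_utility(original, masked, degree=1):
    """Mean squared deviation of estimated propensity scores from c."""
    rows = [poly_features(r, degree) for r in list(original) + list(masked)]
    # Indicator as on the slide: 0 for original records, 1 otherwise.
    targets = [0] * len(original) + [1] * len(masked)
    beta = fit_logistic(rows, targets)
    c = len(masked) / len(rows)  # assumption: c is the share of masked records
    scores = [
        sigmoid(beta[0] + sum(b * f for b, f in zip(beta[1:], r))) for r in rows
    ]
    return sum((p - c) ** 2 for p in scores) / len(rows)


if __name__ == "__main__":
    original = [(0.0,), (1.0,), (2.0,)]
    shifted = [(5.0,), (6.0,), (7.0,)]
    print(propensity_utility(original, original))  # indistinguishable sources
    print(propensity_utility(original, shifted))   # easily distinguished sources
```

When the masked data cannot be told apart from the original, every estimated score stays near c and the value is near zero. The fixed step size suits small, modestly scaled inputs like the ones shown; larger or unscaled data would need standardized features or a proper optimizer.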
4. Application I: Simulation Studies
• Eight different types of two-dimensional data with n = 10,000:
1) Symmetric / non-symmetric
2) High / low correlation
3) Negative / positive correlation
• Masking strategies considered: synthetic, microaggregation, microaggregation followed by noise, rank swapping, and resampling.
• Computational details:
1) Cluster utility: g = 500 (5%) and g = 1,000 (10%).
2) Propensity score utility with logistic model:
4. Application I: Simulation Studies (continued)
3) Propensity score utility with tree model: tree size is controlled by the complexity parameter, with cp = 0.001 and cp = 0.0001 considered. That is, any split that does not decrease the overall lack of fit by a factor of cp is not attempted.
4) Propensity score utility with modified logistic model: the number of groups is g = 100 (1%), and linear and quadratic logistic functions are used to fit the logistic regression models.
Application II: March 2000 US CPS, n = 51,016.
• Disclosure limitation techniques:
● Race: randomly swap 10% or 30% of the values.
● Marital status: randomly swap 10% or 30% of the values.
● Age: rounded.
● Household property tax: for positive values, add random noise. When an altered value is negative, re-draw until a positive value is obtained. Zero values are not altered. Topcoding.
● Income: microaggregation with k = 20.
• Utilities are obtained for the various masked data sets.
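The masking steps listed above can be sketched roughly as below. The slide does not give the details, so several choices are assumptions: swapping permutes the values of a randomly chosen subset of records among themselves, the noise is Gaussian with a user-supplied standard deviation, and microaggregation is univariate, replacing each run of k sorted values by its mean. All function names are hypothetical.

```python
import random


def swap_random(values, fraction, rng):
    """Permute the values of a random `fraction` of records among themselves."""
    n = len(values)
    chosen = rng.sample(range(n), int(round(fraction * n)))
    sources = chosen[:]
    rng.shuffle(sources)
    out = list(values)
    for target, source in zip(chosen, sources):
        out[target] = values[source]
    return out


def add_positive_noise(values, sd, rng):
    """Add Gaussian noise (assumed) to positive values, re-drawing until positive."""
    out = []
    for v in values:
        if v <= 0:  # zero values are left unaltered
            out.append(v)
            continue
        noisy = v + rng.gauss(0.0, sd)
        while noisy <= 0:
            noisy = v + rng.gauss(0.0, sd)
        out.append(noisy)
    return out


def topcode(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [min(v, cap) for v in values]


def microaggregate(values, k):
    """Replace each group of k consecutive sorted values by the group mean.

    A final remainder smaller than k is merged into the last full group.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        if n - end < k:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    print(swap_random(list("AABBC"), 0.4, rng))
    print(add_positive_noise([0.0, 100.0, 250.0], 50.0, rng))
    print(topcode([10, 500, 9000], 1000))
    print(microaggregate([1, 2, 3, 4, 10, 20], 2))
```

Passing a seeded `random.Random` instance keeps the masked output reproducible, which helps when comparing utility values across masking strengths such as 10% and 30% swapping.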
Summary:
• CDF utility:
- Does not distinguish well between masking methods.
- Kolmogorov-Smirnov is known to have low power for categorical data.
• Cluster utility:
- Does not measure within-cluster variation, that is, differences between the structures of the original and masked data within a cluster.
- A large number of clusters tends to produce worse utility for data masked by microaggregation, since there are overlaps in microaggregated data.
Summary (continued)
• Propensity score utility: most appropriate overall.
● Logistic model:
- The choice of polynomial degree is crucial.
- It is hard to deal with high-dimensional data.
● Modified logistic model:
- It has the advantages and disadvantages of both the logistic model and clustering, since it combines the cluster and propensity score utilities.
Acknowledgement
• NISS
• Dr. Alan Karr
• Dr. Jerry Reiter
• Dr. Anna Oganyan