Comparison of Perturbation Approaches for Spatial Outliers in Microdata
Natalie Shlomo* and Jordi Marés**
* Social Statistics, University of Manchester, Natalie.Shlomo@manchester.ac.uk
** IIIA and CSIC, Barcelona, jmares@iiia.csic.es
The project was funded by the Census Statistical Disclosure Control project at Westat, Inc. through the sponsorship of the U.S. Bureau of the Census.
Topics Covered
• Introduction
• Description of Data
• Outlier Detection
• Coherence Function
• Perturbation Methods
  • Record Swapping Method
  • Hot Deck Method
• Results
• Conclusions
Introduction
• Geographical spatial outliers arise from multivariate relationships between spatial and non-spatial characteristics and have a high probability of identification
• They are treated through targeted SDC perturbation in the microdata
• Focus on US American Community Survey (ACS) transportation outputs, with trajectories defined as vectors of coordinates: place of residence (origin) and workplace (destination)
• Example of an outlier: an overly long commute to work on a non-typical means of transportation (MOT), such as cycling
• Objective: to inform and guide decisions about best practices that the US Census Bureau could use in future dissemination strategies for these and similar types of datasets
Description of Data
• Simulation study based on an artificial population produced from the 2006-2008 combined PUMS of the ACS
• Population: those living in California who were employed and worked within the US (N=438,850)
• Latitude and longitude of residence and workplace were generated by adding random distances within a radius of the centroid of the relevant PUMA (public-use microdata area with a population greater than 100K)
• Survey weights were not taken into account (they need to be recalibrated following perturbation); however, other calibration variables are used as controls to minimize distortions to the original weights
Outlier Detection
• Outlier detection methods include univariate and multivariate methods and can take parametric or non-parametric forms
• This study uses multivariate outlier detection based on the Mahalanobis distance, where large values indicate outliers
• The mean vector is replaced by the median vector and the covariance matrix by the minimum covariance determinant (MCD) (Rousseeuw, 1985)
• Let h be the minimum number of points that are not outlying
• The cut-off for squared Mahalanobis distances based on p variables generally uses a quantile of the chi-squared distribution with p degrees of freedom
• Under robust Mahalanobis distances, an adjusted cut-off is used
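To make the robust distance concrete, here is a minimal sketch for p = 2 variables (distance travelled, minutes to work). It is not the SAS macro used in the study. The toy data, the choice h = 7, and the exhaustive search over h-subsets are all assumptions for illustration. It uses the plain chi-squared cut-off, which for 2 degrees of freedom has the closed form -2 ln(1 - q), and does not apply the adjusted cut-off mentioned on the slide.

```python
import math
from itertools import combinations
from statistics import median

def cov2(points):
    """Sample covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def mcd_scatter(points, h):
    """Raw MCD by brute force: covariance of the h-subset with the smallest
    determinant. Only feasible for tiny samples (assumption for this sketch)."""
    best = None
    for subset in combinations(points, h):
        sxx, sxy, syy = cov2(subset)
        det = sxx * syy - sxy ** 2
        if det > 0 and (best is None or det < best[0]):
            best = (det, (sxx, sxy, syy))
    return best[1]

def robust_sq_distances(points, h):
    """Squared Mahalanobis distances using the median vector and MCD scatter."""
    cx = median(x for x, _ in points)
    cy = median(y for _, y in points)
    sxx, sxy, syy = mcd_scatter(points, h)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]]
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

# Hypothetical (miles, minutes) pairs; the last record is an implausible commute.
data = [(5, 15), (8, 20), (10, 25), (12, 30), (6, 18),
        (9, 24), (11, 26), (7, 17), (60, 45)]
d2 = robust_sq_distances(data, h=7)       # h = 7 is an assumed value
cutoff = -2 * math.log(1 - 0.975)         # chi-squared quantile for p = 2
flags = [d > cutoff for d in d2]
```

For real data sizes an approximate MCD algorithm would be needed, since the number of h-subsets grows combinatorially.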
Outlier Detection
• Robust Mahalanobis distances are calculated on distance travelled and minutes to work
DistanceToWork=geodist(latitude,longitude,POW_latitude,POW_longitude,'DM');
• Determine explanatory variables predictive of distance travelled to produce classes: mode of transport, sex, earnings and occupation
• SAS macro: ‘Robcov’ Version 1.3-2 (written by Michael Friendly)
• Collapse classes so that each contains at least 20 individuals, calculate the robust Mahalanobis distance, and flag a record if the distance exceeds the critical value
• The dataset was reduced to 283,423 records without missing values and with a high degree of consistency: 60,007 outliers (21.2%), reduced to 59,080 outliers (20.8%) after deleting the ‘other’ mode of transport
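The SAS call above takes coordinates in degrees and returns miles (as the 'DM' options are understood here). A rough Python counterpart can be sketched with the haversine great-circle formula. This is an approximation on a sphere and will not match SAS `geodist` exactly. The function name `distance_to_work`, the Earth radius constant, and the sample coordinates are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def distance_to_work(lat, lon, pow_lat, pow_lon):
    """Haversine distance in miles between residence and place of work,
    both given in decimal degrees."""
    phi1, phi2 = math.radians(lat), math.radians(pow_lat)
    dphi = phi2 - phi1
    dlmb = math.radians(pow_lon - lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Illustrative coordinates (approximately Los Angeles and San Francisco)
la_to_sf = distance_to_work(34.0522, -118.2437, 37.7749, -122.4194)
same_place = distance_to_work(34.0522, -118.2437, 34.0522, -118.2437)
```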
Coherence Function
• The coherence function uses the maximum and minimum velocity for each mode of transport, based on the set of non-outliers
• Assign high coherence to individuals whose travelled distance is close to the mean, and low coherence to individuals whose travelled distance is far from the mean
• Use it as the objective function to guide perturbation, where the aim is to obtain a higher coherence for outliers
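The slide does not give the functional form, so the following is only one possible reading: a triangular score on the implied velocity (distance divided by minutes), equal to 1 at the mean velocity of non-outliers for the mode of transport and 0 at or beyond the minimum and maximum velocities. The triangular shape, the function names, and the toy cycling records are all assumptions.

```python
def velocity_bounds(non_outliers):
    """Min, mean and max velocity (miles per minute) from (distance, minutes)
    pairs of non-outliers for one mode of transport."""
    speeds = [d / m for d, m in non_outliers if m > 0]
    return min(speeds), sum(speeds) / len(speeds), max(speeds)

def coherence(distance, minutes, v_min, v_mean, v_max):
    """Assumed triangular coherence in [0, 1]; requires v_min < v_mean < v_max."""
    if minutes <= 0:
        return 0.0
    v = distance / minutes
    if v <= v_min or v >= v_max:
        return 0.0
    if v <= v_mean:
        return (v - v_min) / (v_mean - v_min)
    return (v_max - v) / (v_max - v_mean)

# Hypothetical cycling commutes: (miles, minutes)
cycling = [(2, 12), (3, 15), (5, 20), (4, 24)]
v_min, v_mean, v_max = velocity_bounds(cycling)

typical = coherence(3.9, 20, v_min, v_mean, v_max)     # near the mean velocity
implausible = coherence(40, 30, v_min, v_mean, v_max)  # far too fast for cycling
```

Any function that peaks at the mean and decays toward the velocity bounds would serve the same role as an objective for the perturbation step.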
Record Swapping
• Pair outliers with different workplaces and swap their places of residence so that the coherence function increases for at least one of the outliers (without decreasing it for either)
• Carry out within classes: mode of transport, sex and age group
• Split outliers according to workplace, and calculate the coherence function obtained by swapping the residence of an outlier with every other outlier in a different workplace
• If one of the outliers has a higher coherence, then swap
• Continue iteratively
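The swapping loop can be sketched as a greedy search. The sketch below assumes planar coordinates in miles, a stand-in `score` function (negative distance of the implied velocity from an assumed mean), and hypothetical record fields. It is an illustration of the rule "swap if no coherence drops and at least one rises", not the procedure as implemented in the study.

```python
import math

def score(rec, residence, v_mean=0.5):
    """Stand-in coherence (assumption): higher when the implied velocity
    in miles per minute is closer to v_mean."""
    dist = math.dist(residence, rec["workplace"])
    return -abs(dist / rec["minutes"] - v_mean)

def swap_outliers(outliers):
    """Swap residences between outliers of the same class but different
    workplaces while a swap helps at least one and hurts neither."""
    changed = True
    while changed:  # each swap strictly raises the total score, so this ends
        changed = False
        for i, a in enumerate(outliers):
            for b in outliers[i + 1:]:
                if a["cls"] != b["cls"] or a["workplace"] == b["workplace"]:
                    continue
                old_a, old_b = score(a, a["residence"]), score(b, b["residence"])
                new_a, new_b = score(a, b["residence"]), score(b, a["residence"])
                if (new_a >= old_a and new_b >= old_b
                        and (new_a > old_a or new_b > old_b)):
                    a["residence"], b["residence"] = b["residence"], a["residence"]
                    changed = True
    return outliers

# Two hypothetical cycling outliers who each live near the other's workplace
rec_a = {"cls": "bike|F|30-39", "workplace": (0.0, 0.0),
         "residence": (50.0, 0.0), "minutes": 20}
rec_b = {"cls": "bike|F|30-39", "workplace": (48.0, 0.0),
         "residence": (5.0, 0.0), "minutes": 15}
swap_outliers([rec_a, rec_b])
```

Because residences are only exchanged between records, the marginal distribution of residence is unchanged, which is consistent with the remark in the results that swapping does not change marginal frequencies.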
Hot Deck
• Impute the residence of an outlier with the residence of a non-outlier within the same class and having the same workplace
• Two approaches for selecting the donor (note: this requires more than one individual in the workplace):
  • Candidate donors are those whose distance to work lies within the coherence range of distances; the donor selected is the one that maximizes the coherence function, i.e. the candidate whose distance to work is closest to that implied by the mean velocity
  • Instead of the coherence function, choose the donor from the non-outliers in the same workplace with similar travelled minutes (nearest neighbor)
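The second (nearest neighbor) donor rule lends itself to a short sketch. The record fields `cls`, `workplace`, `minutes`, `residence` and `outlier` are hypothetical names, and the toy records are made up for illustration.

```python
def pick_donor(outlier, records):
    """Nearest-neighbor donor: a non-outlier in the same class and workplace
    whose travel minutes are closest to the outlier's."""
    candidates = [r for r in records
                  if not r["outlier"]
                  and r["cls"] == outlier["cls"]
                  and r["workplace"] == outlier["workplace"]]
    if not candidates:
        return None  # no other eligible individual in the workplace
    return min(candidates, key=lambda r: abs(r["minutes"] - outlier["minutes"]))

def hot_deck(records):
    """Return copies of the records with outlier residences imputed from donors."""
    imputed = []
    for r in records:
        new = dict(r)
        if r["outlier"]:
            donor = pick_donor(r, records)
            if donor is not None:
                new["residence"] = donor["residence"]
        imputed.append(new)
    return imputed

records = [
    {"cls": "bike", "workplace": "W1", "minutes": 25, "residence": "PUMA_A", "outlier": True},
    {"cls": "bike", "workplace": "W1", "minutes": 22, "residence": "PUMA_B", "outlier": False},
    {"cls": "bike", "workplace": "W1", "minutes": 40, "residence": "PUMA_C", "outlier": False},
    {"cls": "bike", "workplace": "W2", "minutes": 25, "residence": "PUMA_D", "outlier": False},
    {"cls": "car",  "workplace": "W1", "minutes": 25, "residence": "PUMA_E", "outlier": False},
]
imputed = hot_deck(records)
```

The coherence-based variant would differ only in the `key` passed to `min`, ranking candidates by their coherence score instead of the gap in minutes.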
Results
• Swapping corrected fewer outliers than the hot deck methods (16K vs 31K), but swapping was carried out only on outliers
• Some non-outliers became outliers because the perturbation changed the distribution structure (4K under swapping vs 8K under hot deck)
• The number of non-outliers classified as outliers following perturbation was much smaller than the number of outliers corrected to non-outliers
Results
• Individuals who had their PUMA changed due to the perturbation:
  • Swapping Method: 56,562
  • Hot Deck Method (Minutes): 53,945
  • Hot Deck Method (Coherence): 53,181
• Hot deck methods perturb bivariate counts more than swapping, since swapping does not change marginal frequencies
• Hot deck using the coherence function approach resulted in less information loss than the nearest neighbor approach
Discussion
• Record swapping had the lowest information loss (especially for bivariate counts of the swapping variable with other control variables) but corrected only 21.3% of the outliers, while the hot deck methods corrected about 40.0% of the outliers
• The hot deck method transformed more non-outliers into outliers than record swapping did
• The recommendation would be to carry out both methods, starting with record swapping and then applying the hot deck method to the remaining outliers
• Survey weights need to be recalibrated to the new place of residence, but including calibration variables as controls minimizes distortion to the survey weights, especially under record swapping