Georeferencing in the Social Sciences – Promise and Peril
Micah Altman, Harvard University
Archival Director, Henry A. Murray Research Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Sciences
E: micah_altman@harvard.edu
W: http://maltman.hmdc.harvard.edu/
The Structural Challenges for Progress in Social Sciences
• Pervasive Measurement Error
• Scattered Data
• Controlled Experiments not Available in Many Fields
• Weak Theory
Georeferencing Can Make Measurements Far More Accurate
• E.g., travel, time spent exercising, commutes, time at work, agriculture, distance to voting booth
Figure: Correlation between reported and real distance to tax office. Source: [McKenzie and Sakho, 2007, as quoted in Gibson and McKenzie, 2007]
Figure: LA voting precincts relocated. Source: [Brady and Hui, 2006]
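Illustrative sketch (not from the original slides): once two locations are georeferenced, a distance can be computed from the coordinates and compared against what a respondent reports. The haversine formula below is one common great-circle approximation; the coordinates, the reported figure, and the function name are invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treating the Earth as a sphere is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical respondent: both coordinate pairs and the self-report are made up.
home = (14.6937, -17.4441)
office = (14.7167, -17.4677)
reported_km = 10.0

measured_km = haversine_km(*home, *office)
reporting_error_km = reported_km - measured_km  # positive means the respondent overstated
```

Straight-line distance is only a proxy for travel distance; a road-network distance would need additional data that this sketch does not assume.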
Georeferencing Can Unify Data
• Establishing comparability of most social science measurements is a major undertaking
• Yet… most social science phenomena are unambiguously located in time and space
• Complete georeferencing would link almost all datasets at a basic conceptual level
• However, most social science data is not yet georeferenced … this is an engineering challenge
• Once done, coincident concepts can be revealed …
Figure source: [Weeks, et al. 2007]
Can Georeferencing Fix Experiments and Theory?
• Not in general … although visualizations may help
Figure sources: [Altman & McDonald 2008]; [J. Snow, 1855]; [Calabrese, et al. 2007; Real Time Rome Project 2007]
Mountains of Unified, Accurate Data… What’s not to like?
• “The increasing use of linked social-spatial data has created significant uncertainties about the ability to protect the confidentiality promised to research participants... At this time, however, no known technical strategy … adequately resolves conflicts among the objectives of data linkage, open access, data quality, and confidentiality protection across datasets and data uses” -- [Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, National Research Council, 2007]
Can Privacy Problems be Fixed?
• Maybe not; some challenging findings…
• Large, sparse datasets can “leak” private information when correlated with external data, even when significantly sub-sampled or perturbed [Narayanan and Shmatikov 2008]
• Repeated release of perturbation-masked geospatial point data leaks increasing amounts of information. Combining it with aggregation masking does not help [Zimmerman and Pavlik 2008]
• An attacker who can create seemingly innocuous relationships in a network can identify other relationships in the same network [Backstrom, et al. 2007]
• Pseudonymous communication can be linked through textual analysis [Novak, et al. 2004]
• K-anonymized data is still vulnerable if it is homogeneous, or if the attacker has enough background knowledge. L-diversity is offered as a replacement [Machanavajjhala, et al. 2007]
• Additional anonymization challenges for geospatial data
    • Very fine-grained location, versus the multi-state aggregation mask required by HIPAA and large social science surveys
    • Background knowledge very likely
    • Easy to integrate with other datasets
    • Some data points may be directly observable
    • Sequences of locations even more challenging
        • May cross aggregation units
        • Repetitive, temporally correlated
        • Induce unique networks
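Illustrative sketch (not from the original slides): the homogeneity weakness of k-anonymity can be seen on a toy table. The records, field names, and helper functions below are invented, and the diversity check uses only the simplest "count of distinct sensitive values" reading of l-diversity, not the entropy or recursive variants.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class, i.e. the k actually achieved."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len(g) for g in groups.values())


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any equivalence class."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in g}) for g in groups.values())


# Invented toy release: location and age are already coarsened.
release = [
    {"tract": "A", "age_band": "30-39", "diagnosis": "flu"},
    {"tract": "A", "age_band": "30-39", "diagnosis": "flu"},
    {"tract": "A", "age_band": "30-39", "diagnosis": "flu"},
    {"tract": "B", "age_band": "40-49", "diagnosis": "asthma"},
    {"tract": "B", "age_band": "40-49", "diagnosis": "flu"},
    {"tract": "B", "age_band": "40-49", "diagnosis": "diabetes"},
]
quasi = ["tract", "age_band"]

achieved_k = k_anonymity(release, quasi)                        # 3
achieved_l = distinct_l_diversity(release, quasi, "diagnosis")  # 1
```

Here every record hides among three, yet anyone known to live in tract A and be aged 30-39 is revealed to have the same diagnosis, because that class contains a single sensitive value.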
Managing Privacy Issues With Digital Libraries
• Embedding all sensitive data access in a digital library can greatly improve subject privacy:
    • Authentication, vetting, and access control
    • Standardized license terms governing analysis (derived from metadata and data characteristics)
    • Models can be run on-line without access to raw data
    • Monitoring and auditing of data use
    • Limit sequence of analyses by a user, in some cases (for promising results, see [Dwork, et al. 2006])
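Illustrative sketch (not from the original slides): one way a library could limit a user's sequence of analyses is to answer queries with noise calibrated to query sensitivity, in the spirit of [Dwork, et al. 2006], and to stop answering once a per-user budget is spent. The class, its names, the budget policy, and the sample records are all assumptions made for illustration; this is not a description of any existing system.

```python
import random


class BudgetExhausted(Exception):
    """Raised when a user has spent the whole privacy budget."""


class PrivateCounter:
    """Answers counting queries with Laplace noise under a total epsilon budget."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = total_epsilon
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
        return self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)

    @property
    def remaining(self):
        return self._remaining

    def count(self, predicate, epsilon):
        """Noisy count of records matching `predicate`, charged `epsilon` against the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise BudgetExhausted("privacy budget spent; no further queries allowed")
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # A counting query changes by at most 1 when one record changes,
        # so its sensitivity is 1 and the noise scale is 1 / epsilon.
        return true_count + self._laplace(1 / epsilon)


# Hypothetical usage: invented records, a budget of 1.0, one query costing 0.5.
library = PrivateCounter(
    [{"precinct": 1}, {"precinct": 2}, {"precinct": 1}], total_epsilon=1.0, seed=0
)
noisy = library.count(lambda r: r["precinct"] == 1, epsilon=0.5)
```

Charging each query against a running total assumes that privacy loss adds up across queries; a smaller epsilon per query gives more queries but noisier answers.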
Federated and Virtually Hosted Digital Libraries
http://dvn.iq.harvard.edu/
Summary
• Georeferencing would (partially) solve big problems for the social sciences: measurement error and data integration
• Privacy is likely the fundamental challenge for social scientists using this data
• The privacy problem may never be fully solved mathematically
• Digital libraries can provide leverage for managing data privacy issues through social, legal, and technical means
References
• M. Altman, M.P. McDonald, 2008. "Better Automated Redistricting", Journal of Statistical Software, forthcoming.
• L. Backstrom, C. Dwork, J. Kleinberg, 2007. "Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography", Proceedings of the 16th International World Wide Web Conference.
• H.E. Brady, I. Hui, 2006. "Is It Worth Going the Extra Mile to Improve Causal Inference?", Political Methodology Annual Meeting, Davis.
• F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, C. Ratti, 2007. "Real-Time Urban Monitoring Using Cellular Phones: a Case-Study in Rome", Working paper #1, SENSEable City Laboratory, MIT, Boston. http://senseable.mit.edu/papers/ [also see the Real Time Rome Project, http://senseable.mit.edu/realtimerome/]
• C. Dwork, F. McSherry, K. Nissim, A. Smith, 2006. "Calibrating Noise to Sensitivity in Private Data Analysis", Proceedings of the 3rd IACR Theory of Cryptography Conference.
• J. Gibson, D. McKenzie, 2007. "Using Global Positioning Systems in Household Surveys for Better Economics and Better Policy", The World Bank Research Observer 22(2): 217-241.
• A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, 2007. "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data 1(1): 1-52.
• D. McKenzie, Y.S. Sakho, 2007. "Does It Pay Firms to Register for Taxes? The Impact of Formality on Firm Profitability", World Bank, Washington, D.C.
• A. Narayanan, V. Shmatikov, 2008. "Robust De-anonymization of Large Sparse Datasets", Proceedings of the 29th IEEE Symposium on Security and Privacy, forthcoming.
• J. Novak, P. Raghavan, A. Tomkins, 2004. "Anti-aliasing on the Web", Proceedings of the 13th International Conference on World Wide Web.
• Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, National Research Council, 2007. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. National Academies Press.
• J. Snow, 1855. On the Mode of Communication of Cholera. London.
• J.R. Weeks, A. Hill, D. Stow, A. Getis, D. Fugate, 2007. "Can we spot a neighborhood from the air? Defining neighborhood structure in Accra, Ghana", GeoJournal 69(1-2): 9-22.
• D.L. Zimmerman, C. Pavlik, 2008. "Quantifying the Effects of Mask Metadata, Disclosure and Multiple Releases on the Confidentiality of Geographically Masked Health Data", Geographical Analysis 40: 52-76.
Contact Information
http://maltman.hmdc.harvard.edu/
<Micah_Altman@harvard.edu>