Curating and Managing Research Data for Re-Use Review & Processing Jared Lyle. We Are Here Today: Review & Processing. http://weknowmemes.com/2011/12/this-is-my-room-what-i-think-it-looks-like-what-my-mom-thinks-it-looks-like/.
Curating and Managing Research Data for Re-Use Review & Processing Jared Lyle
http://weknowmemes.com/2011/12/this-is-my-room-what-i-think-it-looks-like-what-my-mom-thinks-it-looks-like/
A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users. Do no harm.
Review • Documentation • Data • [Disclosure Review]
Is the data collection complete, accurate, and well-documented?
Documentation http://dx.doi.org/10.3886/ICPSR31521.v1
Essential Descriptive Elements • Basic front matter • Variable level details • Methodology
Documentation: Front Matter Title http://dx.doi.org/10.3886/ICPSR31521.v1 Principal Investigator(s)
Documentation: Front Matter Description Monitoring the Future: A Continuing Study of American Youth (12th-Grade Survey), 2009. Johnston, Lloyd D., Jerald G. Bachman, Patrick M. O'Malley, and John E. Schulenberg. Monitoring the Future: A Continuing Study of American Youth (12th-Grade Survey), 2009 [Computer file]. ICPSR28401-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-10-27. doi:10.3886/ICPSR28401.v1
Documentation: Variable-level Details National Longitudinal Study of Adolescent Health (Add Health), 1994-1995. Wave I School Administrator Codebook. http://www.cpc.unc.edu/projects/addhealth/codebooks/wave1/index.html
Documentation: Variable-level Details Variable Name
Documentation: Variable-level Details Variable Label
Documentation: Variable-level Details Variable Type
Documentation: Variable-level Details Question Text
Documentation: Variable-level Details Value Labels
Documentation: Variable-level Details Missing Data
Documentation: Variable-level Details Summary Statistics
Documentation: Variable-level Details Constructed Variables
Documentation: Variable-level Details Notes Skip Patterns
Documentation: Variable-level Details (examples) American National Election Study, 2008-2009 Panel Study Frequency codebook, version 20090903. http://electionstudies.org/studypages/2008_2009panel/anes2008_2009panel_fcodebook.txt
Documentation: Variable-level Details (examples) Davis, James A., Tom W. Smith, and Peter V. Marsden. General Social Surveys, 1972-2008 [Cumulative File] [Computer file]. ICPSR25962-v2. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut/Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2010-02-08. doi:10.3886/ICPSR25962
Documentation: Variable-level Details (examples) United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies. National Survey on Drug Use and Health, 2009 [Computer file]. ICPSR29621-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-11-16. doi:10.3886/ICPSR29621
Documentation: Variable-level Details (examples) United States Department of Justice. Office of Justice Programs. Bureau of Justice Statistics. Capital Punishment in the United States, 1973-2008 [Computer file]. ICPSR27982-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-09-07. doi:10.3886/ICPSR27982
Documentation: Methodology • Sample design: A description of how the cases that appear in the study were selected, including details about target populations, sampling frames, sample sizes, sampling errors, and sampling methods. • Data collection procedures: The methods used to collect the data (e.g., telephone, mail, computer-assisted). Where applicable, this includes the exact instructions and protocols used by interviewers when they collected the data. • Data processing: The activities and quality checks performed on the data collection to generate the final data products from the raw collected data. If files were merged, a full description of the process should be provided.
Documentation: Methodology • Weighting: Where applicable, a description of the criteria for using weights in the analysis of a data collection, including how the weights were created, all weighting formulae or coefficients, a definition of their elements, and an indication of how the formulae are applied to the data. • Confidentiality issues: Where applicable, a discussion of any confidentiality issues in the data, as well as the steps taken to mitigate disclosure risk.
Other Documentation • Questionnaire • User Guide • Handbook • Manual • Report • Table • User Agreement • Errata
Useful Resources: Description ICPSR, “What is a codebook?” http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/faqs/2006/01/what-is-codebook Institute for Health and Care Research Quality Handbook http://www.emgo.nl/kc/preparation/data%20collection/3%20Codebook.html Princeton University Data and Statistical Services, “How to Use a Codebook” http://dss.princeton.edu/online_help/analysis/codebook.htm UCLA Social Science Data Archive, “Codebooks” http://dataarchives.ss.ucla.edu/tutor/tutcode.htm
Data Labels • Does each variable have a variable name and label? • Do all categorical variables have value labels? • Are labels consistent?
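The first two label questions above can be checked mechanically. The following is a minimal sketch, not a prescribed procedure: the codebook dictionary layout, the function name `find_label_issues`, and the example value labels for Q14 are all assumptions made for illustration.

```python
# Hypothetical codebook layout (an assumption): variable name -> metadata.
codebook = {
    "Q14": {
        "label": "Q14: Assessment of R's Health",
        "categorical": True,
        # Illustrative value labels only; not taken from any real study.
        "value_labels": {1: "Excellent", 2: "Good", 3: "Fair", 4: "Poor"},
    },
    "AGE": {"label": "", "categorical": False, "value_labels": {}},
    "EMPSTAT": {
        "label": "Respondent's Employment Status",
        "categorical": True,
        "value_labels": {},
    },
}


def find_label_issues(codebook):
    """Return (variable, problem) pairs for missing labels."""
    issues = []
    for name, meta in codebook.items():
        # Every variable should carry a non-empty variable label.
        if not meta.get("label", "").strip():
            issues.append((name, "missing variable label"))
        # Every categorical variable should carry value labels.
        if meta.get("categorical") and not meta.get("value_labels"):
            issues.append((name, "categorical variable has no value labels"))
    return issues


issues = find_label_issues(codebook)
for name, problem in issues:
    print(f"{name}: {problem}")
```

Checking whether labels are consistent across variables still needs a human reviewer; this sketch only flags labels that are absent.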
Naming Conventions: Variables Variable Names: • One-up numbers (V1, V2) • Question numbers (Q1, Q2) • Mnemonic names (age, race) • Prefix, root, suffix systems (FAED, MOED)
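One way to see which naming convention a file follows, and whether it mixes several, is to classify each variable name by pattern. This is a rough sketch under stated assumptions: the regular expressions and the `classify` helper are hypothetical, and mnemonic names cannot be told apart from prefix/root/suffix names by pattern alone, so both fall into one bucket.

```python
import re
from collections import Counter

# Assumed patterns for two of the conventions listed above.
PATTERNS = {
    "one-up": re.compile(r"^V\d+$", re.IGNORECASE),           # V1, V2, ...
    "question": re.compile(r"^Q\d+[A-Z]?$", re.IGNORECASE),   # Q1, Q2, Q14A, ...
}


def classify(name):
    """Return the first convention whose pattern matches the variable name."""
    for convention, pattern in PATTERNS.items():
        if pattern.match(name):
            return convention
    # Mnemonic and prefix/root/suffix names need a human to distinguish.
    return "mnemonic or other"


names = ["Q1", "Q2", "Q3", "V4", "age"]
counts = Counter(classify(n) for n in names)
for convention, n in counts.items():
    print(f"{convention}: {n}")
```

A file where more than one bucket is populated is not necessarily wrong, but it is a prompt to ask whether the mix is documented.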
Naming Conventions: Variables Variable Labels: • Item/Question number • Indicate variable content • Indicate if variable constructed Q14: Assessment of R’s Health
Naming Conventions: Values Value Labels: • Mutually exclusive, exhaustive, and defined • Preserve original information • Retain original coding scheme Respondent’s Employment Status Self-employed (1) Somewhere-else (2) No answer (9) Not applicable (BK)
Missing Data • Are there missing data? • Are missing data labeled? 77 = Inapplicable 88 = Don’t Know 99 = No Answer
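As a small illustration of the missing-data questions, the sketch below tabulates how often each labeled missing code occurs and produces an analysis copy in which those codes are masked. The codes 77, 88, and 99 come from the slide; the function names and the sample responses are assumptions.

```python
from collections import Counter

# Missing-data codes as labeled on the slide.
MISSING_CODES = {77: "Inapplicable", 88: "Don't Know", 99: "No Answer"}


def summarize_missing(values, missing_codes=MISSING_CODES):
    """Count how often each labeled missing code appears."""
    counts = Counter(v for v in values if v in missing_codes)
    return {missing_codes[code]: n for code, n in counts.items()}


def mask_missing(values, missing_codes=MISSING_CODES):
    """Return a copy with missing codes replaced by None; the input is untouched."""
    return [None if v in missing_codes else v for v in values]


# Hypothetical responses to a single survey item.
responses = [3, 88, 1, 99, 77, 2, 88]
summary = summarize_missing(responses)
masked = mask_missing(responses)
print(summary)
print(masked)
```

Masking happens on a copy so that the original coding scheme stays intact in the archived file.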
Values • Are the values reasonable (for example, date variables contain dates, gender variables don't have 10 categories, variables aren't all system missing)? • Are there weight variables? If so, are they well documented?
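The plausibility checks named above (out-of-range values, too many categories, variables that are entirely missing) can be sketched as a single helper. Everything here is an assumption for illustration: the function name `check_variable`, the use of `None` for system-missing values, and the example ranges and thresholds.

```python
def check_variable(name, values, valid_range=None, max_categories=None):
    """Return a list of human-readable problems found in one variable."""
    problems = []
    # None stands in for system-missing values (an assumption of this sketch).
    observed = [v for v in values if v is not None]
    if not observed:
        problems.append(f"{name}: all values are missing")
        return problems
    if valid_range is not None:
        low, high = valid_range
        out_of_range = sorted({v for v in observed if not low <= v <= high})
        if out_of_range:
            problems.append(f"{name}: values outside {low}-{high}: {out_of_range}")
    if max_categories is not None:
        distinct = len(set(observed))
        if distinct > max_categories:
            problems.append(
                f"{name}: {distinct} distinct values, expected at most {max_categories}"
            )
    return problems


# Hypothetical variables and thresholds.
print(check_variable("BIRTHYEAR", [1990, 1985, 2990], valid_range=(1900, 2010)))
print(check_variable("GENDER", list(range(1, 11)), max_categories=3))
print(check_variable("INCOME", [None, None]))
```

The ranges and category limits have to come from the study documentation; the code only applies them.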
Matching Data & Documentation • Do the data match the documentation? Are values and/or labels listed in one but not in the other? • Are all codes in the data valid (documented) according to the data collection instrument or PI's codebook? • Are there duplicate records? • Does the spelling look OK?
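Two of the questions above, undocumented codes and duplicate records, lend themselves to a quick script. The sketch below is illustrative only: the function names, the sample records, and the restriction to numeric codes are assumptions (a non-numeric code such as BK would need codes compared as strings).

```python
from collections import Counter


def compare_codes(values, documented_codes):
    """Compare codes found in the data with codes listed in the codebook.

    Assumes all codes are of one sortable type (e.g., all integers).
    """
    observed = set(values)
    documented = set(documented_codes)
    return {
        "undocumented": sorted(observed - documented),  # in data, not in codebook
        "unused": sorted(documented - observed),        # in codebook, not in data
    }


def find_duplicates(records):
    """Return each record that appears more than once, listed a single time."""
    counts = Counter(tuple(r) for r in records)
    return [list(r) for r, n in counts.items() if n > 1]


# Hypothetical employment-status codes and data values.
documented = {1: "Self-employed", 2: "Somewhere-else", 9: "No answer"}
code_report = compare_codes([1, 2, 2, 3], documented)
print(code_report)

# Hypothetical records: (case ID, employment status, age).
records = [[1001, 1, 34], [1002, 2, 51], [1001, 1, 34]]
duplicates = find_duplicates(records)
print(duplicates)
```

Spelling and label wording still call for a read-through by a person; the script only covers the mechanical comparisons.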
Useful Resources: Data UK Data Archive, “Documenting Your Data/Data Level/Structured Tabular Data” http://www.data-archive.ac.uk/create-manage/document/data-level?index=1 ICPSR Guide to Social Science Data Preparation and Archiving: Phase 3: Data Collection and File Creation, “Documenting Your Data/Data Level/Structured Tabular Data” http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/chapter3quant.html
Activity • Review the following data output and report any issues you find.
Discussion • How much cleaning do you do to a data collection? • When is it appropriate to change the ‘original order’ of a data collection? • How many processing details do you include in the study documentation?