Producing public use data files

Producing public use data files Felicia LeClere University of Michigan IASSIST 2007

Data management plans • From research data to public use data • Research data: useful to original research team. Have implicit and explicit knowledge of data sources, weighting, questionnaire structure, recoding, and design. Cleaned a piece at a time … documentation is institutional knowledge • Public use data: useful to secondary data users with no a priori knowledge of either instrumentation or data documentation. • Always produce data files as if they were for public release!

Producing public use data files • Maximizing measurement: What makes a good data set • How to clean and organize data files: strategies for editing and data display • Issues in Data Preparation: Complex data files • Confidentiality review and protection

What makes a good secondary data set • Good documentation! • Enough information to understand both the survey process and each variable individually • Consistent information on all variables in the data set • Information at the project, file, and variable level • Clear instructions as to the source of the variable (question, recode, cleaned data) • Original data collection instrument and original survey documents • Data editing and imputation procedures

What makes a good secondary data set • Clean data!! • Questionnaire items should match variables or should map to recodes • Missing values should clearly noted and consistent • All text data should be coded where possible • Data errors such as range, consistency, and adherence to questionnaires should be addressed

What makes a good secondary data set • Easy access to data: dissemination • Data and documentation should be in the same place • Clear instructions on how access the data • Examples of how to use the data • Clear metadata description

Coding Editing Imputation Weighting Documentation From raw data to usable data

What is data editing? • Data collected in the field and entered into electronic format by hand or collected through a CASI method has many potential sources of error. Data editing is designed to correct those errors. • Questionnaire error: the questionnaire is written in such a way that the respondent cannot answer • Interviewer error: the interviewer asks or records the wrong information • Data entry error : Data entry technicians do not transfer the data properly • Programming error: CAPI or CATI instrumentation is not appropriate or too complex for dissemination • Coding error: Data coding done later in the process is done poorly or inconsistently

Perceptions of data editing • Most academic scholars have inconsistent views of data editing –there are no established methodological guidelines in academic fields. Government data systems take a wide variety of approaches. • Leahey, Entwisle, and Einaudi(2003) found in a study of academics that nearly two-thirds of respondents “objected” to data editing. Of those who objected, over a quarter did so for ethical reasons…. • Leahey (2004) found that belief in the quality of the data edits depended on the status of the “editor” --- academic status, that is…

BATH G3 Do you currently have any difficulty bathing by yourself? YES 1 NO 5 SKIP: IF 1, GO TO BATHDIF IF 5, GO TO CHECKG1 BATHDIF G3a How much difficulty do you have bathing by yourself—a little,some,a lot, or can't you do this on your own? A LITTLE 1 SOME 2 A LOT 3 CANNOT DO 4 SKIP: IF 1 OR 2, GO TO INTERVIEWER CHECKPOINT IF 3 OR 4, GO TO SRH CHECKG1 G3b INTERVIEWER CHECKPOINT: If BED = 1, go to SRH All others, go to CLIMB CLIMB G4 Do you currently have any difficulty climbing a few flights of stairs because of your health? YES .1 (If vol) Age is only limitation .3 NO .5 SKIP: IF 1 OR 3, GO TO CLMBDIF IF 5, GO TO WALK

Strategies for editing data • Record by record and variable by variable edit --100% edit –use both the data and the questionnaire to edit the data. From previous example, if there is data in G3a but G3=5, we would go back and change G3 to 1. We use knowledge gained from the data to help make decisions about which piece of data is correct. Back editing • Questionnaire edit – use the structure of the questionnaire to perform edits independent of the data. Thus, even if there was data in G3a – the questionnaire rules says when G3=5, G3a must be blank. Thus, we do not use the data but rather the logic of the questionnaire to edit the data

Strategies for editing data • Edit tolerances – we choose a threshold error levels for variables before we edit them. Thus, we can set a 5, 10, or 15% error rate for the variables in the questionnaire. For instance if the error type noted above only occurs in 4% of the sample cases and we have chosen a 5% tolerance, we do not correct it. Although, we have to check for the consistency errors, we don’t have to modify the data

Strategies for editing data • Fellegi-Holt* method which checks variable that may cause the largest number of edit failures and edit that variable only (strategy of the US Census Bureau) –for instance a skip pattern that creates a lot of tied edit failures *Fellegi, I and T. Holt. 1976. “A systematic approach to automatic edit and imputation.” Journal of the American Statistical Association. 71:17-35.

Types of edits • Micro edits –within variable edits • Range edits – within allowable range for variable • Identify outliers – for continuous variables, check values lying outside plausible range. • Macro edits --- across variable edits • Ratio edits (the denominator and the numerator must be a reasonable match) • Balance edits (percentages or proportions must add to a total) • Consistency edits (violation of skip rules or plausible eligibility rules)

Four steps in the editing process • Design and implement “edit checks” • Design and document rules applied to data errors • Fix errors • Check consequences of edits on other variables

Fixing errors • Documenting choices about out of range, imputation errors • Decide back and forward editing strategies • Create working “edits” file • Check consequences of edits on other variables

Consequences of editing for analysis • Examples from literature suggest that editing rules may make substantial differences. Informal edits (that is what analyst do when creating work files) may have some impact on both comparison between data sets and the ability to compare across surveys. • Fendrich and Johnson (2001) found differences in drug use prevalence estimates for three national surveys of youth in the US were mostly due to mode differences not different methods of allocating inconsistent answers or imputation of missing variables. • Bauer and Johnson (2000) conversely found that estimates of smoking among teens varied by 4% when they compared 5 possible methods of handling data inconsistencies. This range was even larger for estimates among minority groups. • Argues for consistent and documented methods for handling inconsistent data

Purposes of documentation • For users to gain knowledge about the data • For users to have enough information to use the data accurately • To provide information specific to the data collection, especially complex data collections • So users can reproduce benchmark statistics as a first step

Why academic data producers usually don’t document…. • The CAPI instrument does that doesn’t it…. • I know what is in the variable description…. • There isn’t any money in the grant for dissemination…. • Recodes are intellectual property….

Benefits of quality documentation • For original data collectors • Preserve original data and project information • Enables continued use of the data at a later date • Reduces questions by secondary analysts of the data • For secondary data users • Allows analysis of data without expense of data collection • Provides quality data sets for training and researchers

Producing quality documentation • Process is easier if it is started early • Use documentation to record decisions about the data when the decisions are made • Plan correspondence between data and the data collection instrument using variable names or labels • Choose software package to record documentation • Create DDI-compliant documentation

Types of documentation • Study documentation –study level descriptions • Codebook –what are the variables, values, and formats • User guide ---how to use the data well • Technical documentation –instruments coding documents

Principal investigator Study title Funding source and grant number Data collector/producer Project description Sample and sampling procedures Weighting information Date of collection and time period covered Geographic location of the data collection Data source Unit of analysis/observation Summary of major variables Response rates Common scales Restrictions on the use of the data Elements in study documentation (meta data record)

Variable names Value labels for each code Missing data designations Variable formats Variable type Variable length Decimal specifications Exact question text Skip patterns Weighted or unweighted univariate statistics Column locations for fixed length ASCII data file Variable groupings Elements in codebooks

Final project reports User guide Overview or background of data collection Sample consent forms Citations of publications related to the data Interviewing schedule or procedures Weighting procedures Core questions asked over waves Difference between public use and restricted use agreements Strengths and limitations of the data Contact information and technical assistance Disclaimer Technical documentation: project level

Data collection instruments Inventory of data and documentation files Data file structure Coder manual Variable naming Skip pattern instructions Missing data policy Code lists Maps List of acronyms Imputation procedures Sample weights or variance estimation descriptions Revisions made after data collection started How to aggregate data How to merge data How to subset data Sample code Errata Software requirements Technical documentation: file level

Technical documentation: variable level • Variable lists • By order stored in the data • By variable groups • Alphabetical • Statistical tables • Code used to create derived variables and scales • Item response rates • Skip pattern instructions

Complex data • Both documentation and data editing are challenged by complex data • Complex data is defined as any data system that may contain more than one file structure, file type, unit of observation, time period…… • The challenge for dissemination is to make access to complex data easier

Ways in which data can be complex 1: Hierarchy • single hierarchy • multiple hierarchy • non-overlapping hierarchy relationships, geography, records, land segments, event records such as hospital stays or doctor visits 2: Record matching Probabilistic and deterministic matching --- either already accomplished or must be completed

Ways in which data can be complex 3. longitudinal files • changing frequency • changing samples • changing questionnaires/items • changing respondents • new sample units • disposition of initial sample • addition of new/purposive samples 4. repeated cross sections • changing instruments • changing samples • changing periodicity

Complex data need additional documentation • ID linkage • Weighting information • Variable comparability • Respondent tracking • Merging criteria • Loss to follow up description

Confidentiality checks ---limiting the paranoia • Direct identifiers • Should never have been part of the research data file to begin with…. Make sure they are not! • Indirect identifiers • Preliminary check of marginals for small cell sizes …particularly for items that may be identifiable. Age, sex, race, occupation, dates of birth, marriage, divorce or transition. • Finding rare populations • Distinguish between rare sample and rare population cases. The later poses the largest risk for independent reidentification. The former only a risk for reidentification for known respondents.

SUMMARY • What will be your editing standard? • What will be the edit work flow documentation? Should have natural language summary of edit steps that can be replicated • Should the editing steps be broken up? How can they be automated? • What does your documentation inventory look like? • How will complex data documentation be enhanced? • Preliminary confidentiality checks

Producing public use data files

Producing public use data files

Presentation Transcript

The Dutch Censuses of 1960, 1971 and 2001 Producing public use files in the IPUMS project

Producing Data: Experiments

GIS data files

Producing Data: Sampling

Producing data: experiments

Producing Data

Producing Data

Binary Data Files

National Immunization Survey: Data Quality and Public-Use Data Files

Chapter 5.1 Producing Data

Producing Data - Introduction

Data Files

Producing Data

Data Files

Chapter 5 Producing Data

Chapter 5: Producing Data

Data Files Structure

Introduction to National Immunization Survey and Public-Use Data Files

ACS Using Public Use Microdata Files

Methods of Producing Data

GATHERING AND PRODUCING DATA

NATIONAL IMMUNIZATION SURVEY: SAMPLING, ESTIMATION AND PUBLIC-USE DATA FILES