Bridging the Gaps: Dealing with Major Survey Changes in Data Set Harmonization Joint Statistical Meetings Minneapolis, MN August 9, 2005 Presented by: Michael Davern, Ph.D. Assistant Professor, Research Director SHADAC, Health Services Research and Policy University of Minnesota Supported by a grant from The Robert Wood Johnson Foundation
Co-authors This work is co-authored with: • Miriam King, Ph.D., Research Associate. We are both with the Minnesota Population Center at the University of Minnesota.
Data set harmonization • The goal is to simplify access to all available years of a data set for analysis of trends over time. • Reaching this goal is difficult for many reasons. • We focus on how to handle major sources of survey error that change over time.
Survey changes present challenges to harmonization • Sample design • The way people and records are drawn into a data set changes over time, which affects how variances should be estimated. • Nonresponse • The way surveys account for unit, supplement, person and item nonresponse changes over time. • Survey questions and measurement • Question wording and question universes change. • Survey processing/editing • Processing and data editing procedures change.
Decennial census sample designs • Decennial census sampling • Involves both sampling of people/households to receive the “long form” and sampling of long form records to release (1% and 5%). • Both the household/person selection and the process used to select the public use microdata samples change over time. • Data users need access to the sample design information to calculate appropriate variances/standard errors. • Although appropriate estimates can be obtained with replicate weights, at the moment most users do not use them. • We are testing sample design variables to add to the IPUMS for Taylor Series estimation. • These will include a stratification variable, a cluster variable and a weighting variable (when available), so analysts can specify the design simply in SAS, Stata, SUDAAN, etc. • Our approach will make changes in sample design appear seamless to the data user and will increase the use of more appropriate estimation methods.
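To illustrate what a stratification variable, a cluster variable and a weight make possible, here is a minimal Python sketch of a Taylor Series (linearization) standard error for a weighted mean. It is not the IPUMS procedure or any package's implementation. It assumes clusters are treated as sampled with replacement within strata, ignores finite population corrections, and skips strata that contain a single cluster. The function name and record layout are hypothetical.

```python
from collections import defaultdict
import math


def weighted_mean_se(records):
    """Weighted mean and its linearized standard error.

    records: iterable of (stratum, cluster, weight, y) tuples.
    Assumption: clusters are sampled with replacement within strata,
    with no finite population correction.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized score for a ratio-type mean: w * (y - mean) / total weight,
    # summed to the cluster level within each stratum.
    cluster_scores = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in records:
        cluster_scores[stratum][cluster] += w * (y - mean) / total_w

    variance = 0.0
    for clusters in cluster_scores.values():
        n_h = len(clusters)
        if n_h < 2:
            # A single-cluster stratum contributes nothing in this sketch.
            continue
        z_bar = sum(clusters.values()) / n_h
        variance += n_h / (n_h - 1) * sum(
            (z - z_bar) ** 2 for z in clusters.values()
        )
    return mean, math.sqrt(variance)
```

With equal weights and one record per cluster in a single stratum, this reduces to the familiar simple random sampling standard error of a mean, which is a quick sanity check on the design variables.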
Survey sample designs • The NHIS and CPS change sample designs over time. • Non-self-representing PSUs are reshuffled, so some PSUs in one design are not included in the next. • Self-representing PSUs (MSAs) can also change (their boundaries annex or lose counties). • Pooling data across two sample designs is a major challenge. • Data users often like to pool data to get larger samples or to study rare characteristics (e.g., those with SSI income). • When working with data from years that span two sample designs, it is best to produce estimates and standard errors for single years and then average them. • Also, some surveys (e.g., NHIS) release sample design information that can be used for Taylor Series estimates, whereas others do not (e.g., CPS).
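The averaging advice above can be read literally as the following Python sketch: estimate each year separately under its own design, then take the simple average of the single-year estimates and of the single-year standard errors. The function name is hypothetical, and the sketch assumes the single-year results have already been computed with design-appropriate methods.

```python
from statistics import fmean


def average_single_year_results(estimates, standard_errors):
    """Average single-year estimates and their standard errors.

    Assumption: each (estimate, standard error) pair was computed
    separately for one survey year under that year's sample design.
    """
    if not estimates or len(estimates) != len(standard_errors):
        raise ValueError("need one standard error per estimate")
    return fmean(estimates), fmean(standard_errors)
```

The averaged standard error is never smaller than the one that would apply if the yearly samples were independent, so this reading errs on the cautious side.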
Nonresponse • There are several types of survey nonresponse. • Unit, person, supplement and item. • The various surveys handle nonresponse differently, which can cause problems for data users. • Unit nonresponse is generally handled by adjusting the survey weights of responders to account for nonresponders. • The resulting heterogeneity among the weights makes it important to use appropriate statistical routines for variance estimation.
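As a rough illustration of how responder weights can be adjusted for unit nonresponse, the Python sketch below applies a simple weighting-class adjustment: within each adjustment cell, responder weights are inflated so they sum to the total weight of all sampled units in that cell. This is a generic textbook approach, not the specific procedure used by the Census Bureau or NCHS. The field names `cell`, `weight` and `responded` are hypothetical.

```python
from collections import defaultdict


def adjust_for_unit_nonresponse(units):
    """Return responders only, with weights inflated within each cell.

    units: list of dicts with keys 'cell', 'weight' and 'responded'.
    Assumption: every cell contains at least one responder.
    """
    total_weight = defaultdict(float)
    responder_weight = defaultdict(float)
    for unit in units:
        total_weight[unit["cell"]] += unit["weight"]
        if unit["responded"]:
            responder_weight[unit["cell"]] += unit["weight"]

    adjusted = []
    for unit in units:
        if unit["responded"]:
            cell = unit["cell"]
            factor = total_weight[cell] / responder_weight[cell]
            adjusted.append({**unit, "weight": unit["weight"] * factor})
    return adjusted
```

Because the adjustment factor differs by cell, the adjusted weights vary more than the base weights, which is one reason variance estimation needs to account for the weights.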
Person and supplement nonresponse • Person and supplement nonresponse can be more difficult to deal with. • The NHIS, for example, contains information on a household, but if the household refused the supplement there is no supplement data for it. • This makes the data structure uneven. • The CPS, on the other hand, fully imputes data for ASEC (i.e., March) supplement nonresponders (currently about 10% of the cases). • This evens out the data structure, making it easier for data users to work with. • However, the CPS full supplement imputation process can lead to rather large biases in estimates (e.g., health insurance coverage). • We are investigating ways of evening out portions of the NHIS data structure to make it easier to work with and disseminate.
Item nonresponse • Item nonresponse is also a challenge. • The decennial census and CPS are fully imputed for item nonresponse. • This makes the data much easier to use. • However, it can simplify things too much. • The NHIS, on the other hand, does not impute missing values. • This is a major problem for people who want to work with the income series on the NHIS (separate imputed income files were released recently). • We are experimenting with imputing income on the NHIS files using CPS income data.
Question wording and measurement • Question wording changes take many forms. • Changes in the basic question • The inclusion of examples • The placement of the question in the survey • Changes in the type of response allowed (e.g., can income amounts be reported for intervals shorter than a year?) • Providing facsimiles of question wording, and highlighting wording changes in variable documentation, allows users to decide whether comparability is possible for their analyses.
Changes to question universes • Changes in universe definitions affect multiple variables (e.g., the age limit for “adults” answering work and income questions). • Other changes affect single variables. • Providing universe definitions in variable documentation tells users how to restrict their data to achieve comparability. • Testing variable universes reveals when data cleaning is needed before the data are released to users.
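As a small illustration of restricting data to achieve comparability, the Python sketch below keeps only records that fall inside the universe shared by every year being compared. The years, age cutoffs and field names are hypothetical and are not taken from any survey's documentation.

```python
def restrict_to_common_universe(records, min_age_by_year):
    """Keep records that are in universe under every year's definition.

    records: list of dicts with an 'age' key.
    min_age_by_year: documented minimum age for the question, by year.
    """
    # The strictest (highest) minimum age defines the shared universe.
    cutoff = max(min_age_by_year.values())
    return [record for record in records if record["age"] >= cutoff]


# Hypothetical universe definitions for a work question in two years.
example_universes = {1970: 14, 1980: 16}
```

Applying the same cutoff to every year means a trend is not driven by a change in who was asked the question.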
Changes in response categories • Many data harmonization projects lose detail by adopting a “least common denominator” approach. • IPUMS projects adopt the joint goals of: • Losing no information • Providing comparability over time • IPUMS projects achieve these goals through composite coding schemes. • The first digit(s) provide detail available across all years • Trailing digits provide additional detail available in only limited years
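The composite coding idea can be sketched in a few lines of Python: dropping the trailing digits of a code yields the category that is comparable across all years, while the full code preserves the extra detail. The codes and labels below are invented for illustration and are not actual IPUMS codes.

```python
def general_code(detailed_code, detail_digits=1):
    """Drop the trailing detail digits, keeping the all-years category."""
    return detailed_code // 10 ** detail_digits


# Hypothetical two-digit scheme: first digit comparable in all years,
# second digit available only in some years (0 means no extra detail).
HYPOTHETICAL_LABELS = {
    10: "Married (no further detail)",
    11: "Married, spouse present",
    12: "Married, spouse absent",
    20: "Never married",
}
```

Under this hypothetical scheme, `general_code(11)` and `general_code(12)` both map to category 1, so an analyst can compare years with and without the spouse present/absent detail.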
Other strategies for handling changes in response categories • Creating “bridging” variables is another means of achieving comparability over time. • When responses are given in intervalled form in some years, and in full detail in other years, IPUMS projects provide both detailed and intervalled variables. • Recoding data using a common standard (e.g., the 1950 occupation and industry codes), together with providing the original, unrecoded data, is a third strategy employed by IPUMS projects. • When response changes are too great to achieve comparability (e.g., the shift from 4 to 5 categories for health status in NHIS), the data are provided in separate variables and the issue is discussed in the documentation.
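To show how a detailed variable and an intervalled variable can sit side by side, the Python sketch below derives an interval category from a detailed amount. The interval boundaries are hypothetical. In years where only intervals were collected, the intervalled variable would be taken directly from the source data instead.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the intervals used in the coarser years.
HYPOTHETICAL_LOWER_BOUNDS = [0, 10000, 25000, 50000]


def to_interval(value, lower_bounds=HYPOTHETICAL_LOWER_BOUNDS):
    """Return the index of the interval that contains value.

    Values below the first lower bound return -1 (out of range).
    """
    return bisect_right(lower_bounds, value) - 1
```

Keeping the detailed amount alongside the derived interval preserves the original information while giving users a variable that is comparable across all years.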
Changes in data processing • Variable documentation also helps users by pointing out subtle changes in data processing by the agency releasing the non-harmonized public use data.
Conclusions • Simplifying data dissemination through harmonization is difficult, and changes in demographic survey design and processing are a major source of that difficulty: • Sample design • Survey nonresponse • Survey questions and items • Survey processing/editing