Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information. Donsig Jang, Xiaojing Lin, Amang Sukasih (Mathematica Policy Research, Inc.); Steve Cohen, Kelly Kang (National Science Foundation). ITSEW 2008, Research Triangle Park, NC, June 2, 2008.
Disclaimer The opinions and assertions are those of the authors and do not reflect the views or policies of the National Science Foundation
Survey Data Collection • Involves many complex processes, including • Sampling frame construction • Sample selection • Data collection • Data processing • Estimation • Each process is subject to error • Attempt to decompose the total survey error by stage of the process
Total Survey Errors (diagram) • Stages: Parameter, Sampling Frame, Sample, Respondent Data, Estimator • Error sources: Misclassification error, Coverage error, Sampling error, Nonresponse error, Measurement error, Estimation error
Misclassification Error in Stratification • Focus of this talk • A part of non-sampling error • Important but often overlooked component
Stratification in Sampling • Enhance precision of survey estimates • Precision requirements for analytic domains • Often imperfect information on stratification variables • Misclassification in stratification • Trade-off: cost to gather stratification information at the frame construction vs. optimal sample allocation • Loss of effective sample sizes for some analytic domains
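Misclassification Matrix • True classification: $A$ • Stratification classification: $A^*$ • $\Theta = [\theta_{jk}]$, where $\theta_{jk} = P(A^* = j \mid A = k)$ is the proportion of units classified as category $j$ in true category $k$, and $\sum_j \theta_{jk} = 1$ for each $k$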
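When both classifications are observed for the same sampled units, the matrix described above can be estimated with weighted cross-tabulation. The following is a minimal sketch, not the authors' code: the function name, the integer category coding, and the toy data are all hypothetical, and it assumes every true category appears at least once in the sample.

```python
def misclassification_matrix(frame_class, true_class, weights, n_categories):
    """Weighted estimate of theta[j][k] = P(A* = j | A = k).

    frame_class: category codes used for stratification (A*), coded 0..K-1
    true_class:  category codes taken as true (A), coded 0..K-1
    weights:     sampling weights w_i
    """
    # joint[j][k] accumulates the weight of units with A* = j and A = k
    joint = [[0.0] * n_categories for _ in range(n_categories)]
    for j, k, w in zip(frame_class, true_class, weights):
        joint[j][k] += w
    # Weighted size of each true category k (column totals)
    col_totals = [sum(joint[j][k] for j in range(n_categories))
                  for k in range(n_categories)]
    # Normalise each column so that it sums to 1
    return [[joint[j][k] / col_totals[k] for k in range(n_categories)]
            for j in range(n_categories)]


# Hypothetical toy data: 0 = male, 1 = female
frame = [0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 0, 0]
w = [10.0, 10.0, 20.0, 20.0, 20.0, 10.0]
theta = misclassification_matrix(frame, true, w, 2)
```

Each column of `theta` describes how one true category is spread across the stratification categories, so an identity matrix would indicate no misclassification.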
Measures for Misclassification Effects • Bias • Effective sample size change
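Bias Due to Misclassification • $\mathrm{Bias}(p^{A^*}) = (\Theta - I)\,P^{A}$, where $P^{A}$ = true population props., $\Theta$ = misclassification matrix, $I$ = identity matrix, $p^{A^*}$ = sample proportions • $p_j^{A^*} = \sum_{i \in s} w_i\, I(A_i^* = j) \big/ \sum_{i \in s} w_i$, where $s$ denotes the sample, $w_i$ the sampling weight for unit $i$, and $I(\cdot)$ the indicator function • Kuha and Skinner 1997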
Bias Estimation If the true classification is available from the sample: where
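A rough sketch of how such a bias estimate could be computed when the true classification is observed in the sample: estimate the misclassification matrix and the true-category proportions with the sampling weights, then form (matrix minus identity) times the proportions. This follows the matrix formulation attributed to Kuha and Skinner (1997) only in outline; the function names, category coding, and toy data are hypothetical, and every true category is assumed to appear in the sample.

```python
def weighted_proportions(classes, weights, n_categories):
    """Weighted share of each category: weight in category / total weight."""
    total = sum(weights)
    props = [0.0] * n_categories
    for c, w in zip(classes, weights):
        props[c] += w / total
    return props


def estimated_bias(frame_class, true_class, weights, n_categories):
    """Return (theta_hat - I) p_hat, one entry per category."""
    K = n_categories
    # Weighted cross-tabulation of stratification class (rows) by true class (cols)
    joint = [[0.0] * K for _ in range(K)]
    for j, k, w in zip(frame_class, true_class, weights):
        joint[j][k] += w
    col = [sum(joint[j][k] for j in range(K)) for k in range(K)]
    theta = [[joint[j][k] / col[k] for k in range(K)] for j in range(K)]
    p_true = weighted_proportions(true_class, weights, K)
    # (theta - I) p  ==  theta p - p, computed row by row
    return [sum(theta[j][k] * p_true[k] for k in range(K)) - p_true[j]
            for j in range(K)]


# Hypothetical toy data: 0 = male, 1 = female
frame = [0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 0, 0]
w = [10.0, 10.0, 20.0, 20.0, 20.0, 10.0]

bias = estimated_bias(frame, true, w, 2)
p_true = weighted_proportions(true, w, 2)
# Relative bias: bias as a fraction of the true proportion
rel_bias = [b / p for b, p in zip(bias, p_true)]
```

Because the matrix and the proportions are estimated from the same weighted sample, each entry of `bias` equals the weighted proportion under the stratification classification minus the weighted proportion under the true classification, and the entries sum to zero.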
Effective Sample Sizes and Variance Inflation Factors • Computed for domain d constructed based on the true value • Computed for domain d constructed based on the misclassified value • The variance inflation factor measures the inflation of variance due to weight variation
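One common way to quantify variance inflation from weight variation is Kish's approximation, 1 + CV²(w), with the effective sample size taken as the nominal size divided by that factor. Whether the talk used exactly this form is an assumption; the sketch below simply illustrates the idea for the weights of one domain, with hypothetical function names and toy weights.

```python
def variance_inflation_factor(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2 = 1 + CV^2(w)."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def effective_sample_size(weights):
    """Nominal sample size deflated by the variance inflation factor."""
    return len(weights) / variance_inflation_factor(weights)


# Hypothetical domain weights: equal weights vs. one unit from a
# stratum sampled at a much lower rate (larger weight)
equal_weights = [5.0, 5.0, 5.0, 5.0]
mixed_weights = [5.0, 5.0, 5.0, 20.0]

vif_equal = variance_inflation_factor(equal_weights)
vif_mixed = variance_inflation_factor(mixed_weights)
n_eff_mixed = effective_sample_size(mixed_weights)
```

With equal weights the factor is 1 and nothing is lost. Mixing units selected at different rates into one domain, as happens when misclassified cases are moved to their reported category, raises the factor and lowers the effective sample size.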
Example: National Survey of Recent College Graduates (NSRCG) • Sponsored by the National Science Foundation • Collects education, employment, and demographic information from recent graduates with a Bachelor’s or Master’s degree in science, engineering, or health fields • For details, see http://www.nsf.gov/statistics/srvyrecentgrads
NSRCG (Continued) • Two-stage sample design: school sample at the first stage and graduate sample at the second stage • Crucial to collect key sampling variables (degree date, degree level, field of major, race/ethnicity, and gender) from schools for eligibility determination and stratification (frame variables) • Sample was designed to have moderate weight variation within domains while meeting certain sample size thresholds • Quality of sampling variables compromised by schools’ reluctance to release student information, non-standard formats used by schools, and inaccurate or incomplete administrative data (Jang and Lin, 2007 JSM)
NSRCG (Continued) • The same information (degree date, degree level, field of major, race/ethnicity, and gender) was also collected from sampled graduates • Able to measure the quality of school-provided information for stratification by assessing discrepancies between school-provided information and reported values • Data from two survey cycles are examined (2003 and 2006 NSRCG)
Misclassification for Gender • NSRCG 2003: relative bias (ReBias) for P_Male = -0.01% • NSRCG 2006: relative bias (ReBias) for P_Male = 0.50%
Misclassification for Race/Ethnicity NSRCG2003 NSRCG2006
Effective Sample Sizes and Variance Inflation Factors • What if reported values are used for discrepant cases? • Results in more weight variation within domains based on reported values, due to unequal selection probabilities across classes • Check domain-specific sample sizes and variance inflation factors
Figure: Variance Inflation Factors • Domain: race/ethnicity by degree level by major field by gender • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Ratio of Sample Size, n_R / n_F • Domain: race/ethnicity by degree level by major field by gender • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Ratio of Effective Sample Size, n_R / n_F • Domain: race/ethnicity by degree level by major field by gender • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Variance Inflation Factors • Domain: race/ethnicity by degree level by major field • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Ratio of Sample Size, n_R / n_F • Domain: race/ethnicity by degree level by major field • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Ratio of Effective Sample Size, n_R / n_F • Domain: race/ethnicity by degree level by major field • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Variance Inflation Factors • Domain: race/ethnicity by gender • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Ratio of Sample Size, n_R / n_F • Domain: race/ethnicity by gender • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Figure: Ratio of Effective Sample Size, n_R / n_F • Domain: race/ethnicity by gender • Panels: NSRCG 2003, NSRCG 2006 • Series: White, Asian, Minority
Summary • Misclassification in stratification may reduce the effective sample sizes for domains that were sampled at high sampling rates • Crucial to have good classification in stratification, especially when substantially unequal selection probabilities are used
Next Steps • Population counts for key domains are available but are based on misclassified values • Estimation of population counts: • Weighted sums of correct classification from the sample • Use of misclassification parameter estimates, $\hat{N}^{A} = \hat{\Theta}^{-1} N^{A^*}$, where $\hat{\Theta}$ is the estimated misclassification matrix and $N^{A^*}$ is the vector with population counts of domains defined by $A^*$ • Raking adjustments of the weights using the estimated population counts • Comparison of key estimates
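The second estimation option can be illustrated with a small sketch: if frame counts by stratification category equal the misclassification matrix times the true counts, the true counts follow by solving that linear system. This is only an illustration under that assumption, not the authors' procedure; the solver, the function names, and the numbers are hypothetical, and the raking step is not shown.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left unchanged
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("misclassification matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def corrected_counts(theta_hat, frame_counts):
    """Estimate true-category counts from counts by stratification category."""
    return solve_linear_system(theta_hat, frame_counts)


# Hypothetical inputs: theta_hat[j][k] = P(A* = j | A = k), frame counts by A*
theta_hat = [[0.6, 0.0],
             [0.4, 1.0]]
frame_counts = [300.0, 600.0]
true_counts = corrected_counts(theta_hat, frame_counts)
```

Because each column of the matrix sums to 1, the corrected counts keep the same grand total as the frame counts. With a noisy or near-singular estimated matrix the solution can be unstable or even negative, so results from a sketch like this would need checking before being used as raking targets.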