260 likes | 273 Views
This paper presents a total error framework for hybrid estimation, including simple and complex decomposition methods for error source targeting. It also provides a mathematical formulation and illustration comparing random samples with large censuses.
E N D
A Total Error Framework for Hybrid Estimation Paul P. Biemer1,2 Asley Amaya1 1RTI International 2University of North Carolina at Chapel Hill BIGSURV18 Barcelona, Spain October 26, 2018
Outline • Total error framework for the mean of some data set • Simple dichotomous decomposition • More complex decompositions for error source targeting • Mathematical formulation of concepts • Expression for the “Total” Mean Squared Error of the unweighted data set mean • Illustration comparing a small random sample with a large census • Data weighting and other error mitigation strategies
Generalized TE Framework Total Error= Sample Recruitment Error + Data Encoding Error Sample Recruitment Error is a generalization of the concept of representation error Data Encoding Error is a generalization of the concept of measurement error
Generalized TE Framework – Sample Recruitment Process No Population members 1. Chance to encounter recruitment system? Exit Process (Non-coverage) No Exit Process (Non-selection) 2. Encounters recruitment system? Exit Process (Nonresponse by noncontact) No 3. Receives stimulus for data input? Exit Process (Nonresponse by refusal) No 4.Inputs data? Data are recorded (observation)
Generalized TE Framework – Sample Recruitment Process No Population members 1. Chance to encounter recruitment system? Exit Process (Non-coverage) No Exit Process (Non-selection) 2. Encounters recruitment system? Exit Process (Nonresponse by noncontact) No 3. Receives stimulus for data input? To be included in the data set, a population member must pass through all 4 gates. Exit Process (Nonresponse by refusal) No 4.Inputs data? Data are recorded (observation)
Generalized TE Framework – Sample Recruitment Process No Population members 1. Chance to encounter recruitment system? Exit Process (Non-coverage) Let Ri = 1 if a population unit makes it through all 4 gates; Let Ri = 0 otherwise No Exit Process (Non-selection) 2. Encounters recruitment system? Exit Process (Nonresponse by noncontact) No 3. Receives stimulus for data input? To be included in the data set, a population member must pass through all 4 gates. Exit Process (Nonresponse by refusal) No 4.Inputs data? Data are recorded (observation)
Generalized TE Framework – Data Encoding Process 1. Construct represented by data is the desired construct? Specification error Construct of interest, x No No Measurement error 2. Data captured without error? No Data processing error 3. Data processed without error? Final data value, y = x + ε
Total Error Identity for the Mean of the Encoded Data Total Error= Data Enc Error + Samp Recr Error Specification Measurement Data Processing Non-coverage Non-selection Nonresponse (noncontact/ refusal)
Data Encoding Error • xi is the true characteristic for the ith sample unit • yi is the encoded value of xi • εi = yi - xiis the error in the encoded value for the ith sample unit • Assume εi ~ i.i.d (Bε, ) Data capture error bias Data capture error variance
Sample Recruitment Error (or Errors of Nonobservation) • Xi denotes the characteristic measured for the ith person in the Recruitment Process • , a measure of selection bias Bias induced by the data capture process Population variance Meng, 2017; Bethlehem, 1988
Sample Recruitment Error • Xi denotes the characteristic measured for the ith person in the Recruitment Process • , a measure of selection bias Example: For SRS sampling and no nonresponse,
Total Mean Squared Error of the Mean of a Generic Data Set Total Error= Data Enc Error + Samp Recr Error • Bias from • Specification error • Measurement error • Data processing error • Variance & Bias • from • Noncoverage • Nonselection • Nonresponse Recording error/capture error interaction • Variance from • Measurement error • Data processing error
Interpretation of • Not much is known about ρRX for nonprobability samples. • However, ρRX has been studied extensively for surveys (through the estimation of nonresponse bias). • ρRXwill be smaller for nonprobability samples when gates 1, 2 and 3 are entered for all members of the population. • Ability to adjust for sample recruitment bias is better for surveys because • We have more control over who enters gates 1-3 and thus more control over ρRX • We often know a lot about sample recruitment failures and how to adjust for them through weighting and imputation.
Illustration of Some Uses of these Results? Which is more accurate? • An estimate of the population average based upon an administrative data set with almost 100,000,000 population records and over 80% coverage? or • A national survey estimate based upon probability sample of size 3,300 completed cases with a 55% response rate?
We try to answer this question for the US Residential Energy Consumption Survey (RECS) • Survey Data: 2015 RECS • Mode changed from face to face to web/mail • Respondent reports of housing unit square footage not reliable • Substituting administrative data could be more accurate • response rate ≈ 55% • n ≈ 6,000*.55 = 3,300 completed cases • Administrative data: Zillow real estate data base • Coverage ≈ 82% • n ≈ 100,000,000 records • Other data bases were also considered (Acxiom and CoreLogic)
We try to answer this question for the US Residential Energy Consumption Survey (RECS) • HU square footage is primarily used for micro-econometric modeling • We will consider estimation of the U.S. average HU square footage to demonstrate the MSE analysis • Similar analysis could be considered for other parameters of interest (i.e., regression coefficients) • However, the current formulation would not be appropriate. • Our analysis will use results from Amaya (2017)
Evidence of Nonsampling Error from the RECS RECS Average Reported Square Footage by Source Official Respondent Zillow
Evidence of Nonsampling Error from the RECS RECS Average Reported Square Footage by Source Official Respondent Zillow
RMSEs as a Function of ρRX for Zillow and RECS Ignoring Data Encoding Error, Zillow has uniformly smaller RMSE Value of ρRX
RMSEs as a Function of ρRX for Zillow and RECS With Data Encoding Error, RECS has the smallest RMSE for most of the range Value of ρRX
RMSEs as a Function of ρRX for Zillow and RECS Value of ρRX
RMSEs as a Function of ρRX for Zillow Data Encoding Error Component Sample Recruitment Error Component Value of ρRX
Results Summary • Whether RMSEZillow < RMSERECS depends on value of ρRX • In this case, reducing ρRX may lead to larger RMSE because two biases are offsetting one another • Ideally, both biases should be minimized because an offsetting biases situation is not sustainable
Potential Zillow Error Mitigation Strategies Data Encoding Error • Estimate the bias and adjust for it • May need ground truth square footage data to model this bias Sample Recruitment Error • Weight the Zillow data to reduce |ρRX| • Weights will approximate [E(Ri)]-1 • Modeling E(Ri|X) will require understanding how Ri varies by housing unit and other characteristics (X) of the sample recruitment process
A Few Take Aways • As we move towards integrating survey and Big Data, need to consider the total error. • Sample recruitment bias is the least understood component of the total error. • Understanding this component will lead to statistical products of greater quality, utility and efficiency. • Focusing solely on data missingness may substantially underestimate total uncertainty. In this illustration, data encoding error dominated the MSE. • The monograph paper will further explore error mitigation strategies for dealing with both data encoding and sample recruitment error.