Evaluation of Multiple Components of Error in the Collection and Integration of Survey and Administrative Record Data
John L. Eltinge
International Total Survey Error Workshop
June 15, 2010
Acknowledgements and Disclaimer
The author thanks John Bosley, Moon Jung Cho, Larry Cox, Mike Davern, Mark Denbaly, Jennifer Edgar, Gretchen Falk, Bob Fay, Scott Fricker, Jenna Fulton, Karen Goldenberg, Jeff Gonzalez, Mike Horrigan, Bill Iwig, Alan Karr, Frauke Kreuter, Francois Laflamme, Judy Lessler, Shelly Martinez, Bill Mockovak, Jay Ryan, Adam Safir and Clyde Tucker for many helpful discussions.
The views expressed in this paper are those of the author and do not necessarily reflect the policies of the Bureau of Labor Statistics.
Overview:
I. Introduction: Types and Uses of Admin Records
II. Example: Prospective Redesign of the U.S. Consumer Expenditure Survey
III. Expansion of TSE to “Total Statistical Risk”
IV. Cost Issues
V. Mathematical Structures
I. Introduction: Types and Uses of Administrative Records
A. Goals in Statistical Work with Administrative Data
   1. State of nature: Current levels, comparison across groups, changes over time
   2. Evaluation of a current policy or program
   3. Evaluation of a prospective policy or program
   Question: How do (1), (2) and (3) link with the mission, formal constraints, incentive structures and institutional culture of a statistical agency or a “record originating” agency?
B. Statistical Uses of Administrative Records
   1. Frames, auxiliary information for surveys
   2. Direct use: Simple aggregates, complex modeling, use in microsimulation
   3. Complex integration of survey and administrative data
   4. Suggestion today: Work with administrative records will require us to expand the ideas of “total survey error” to incorporate “total statistical risk”
II. Example: Prospective Revision of the U.S. Consumer Expenditure Survey
A. Population Structure
   1. Population: All consumer purchases (transactions) in specified categories by:
      - Six-digit Universal Classification Code (UCC)
      - Geography
      - Characteristics of the consumer unit (CU, household) and outlet (store)
      - Time period
   2. Inferential goals: Vary widely across stakeholders
      a. Mean of consumer expenditures in specified categories
      b. Analytic uses (regression, generalized linear models, quantiles)
B. Prospective Data Sources
   1. Traditional sample surveys: Diary and interview
   2. Administrative record systems
      a. Full population
      b. Superset of full population
         Ex: Aggregate sales in specified UCCs
      c. Specialized subsets of the full population
         Ex: “Loyalty card” data linked with sample CUs (conditional on informed consent and confidentiality protections)
   3. Specialized surveys to calibrate data from administrative records (cf. Lessler, 2006)
III. Expansion of “Total Survey Error” to “Total Statistical Risk”
A. Working Model for Methodological Properties
   X = Frame, weight information
   Y = Sample survey data
   Z = Additional auxiliary information
   Properties of an estimator are based on variability from:
   1. Population structure (superpopulation model)
   2. Administrative and survey collection processes (“filters” including all TSE components)
   3. Homogeneity of (1) and (2) across cases
B. Formal Evaluation of Properties: Evaluate expectation with respect to each component in (III.A)
   - What information is currently available at the conceptual and empirical levels?
   - Critical importance of understanding the underlying processes for collection and reporting of administrative data
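One way to read "evaluate expectation with respect to each component" is as a nested expectation. The notation below is an assumption introduced for illustration and does not come from the slides: \xi denotes the superpopulation model in (III.A.1), p denotes the combined administrative and survey collection processes in (III.A.2), U is the realized finite population, \theta = \theta(U) is the target quantity, and \hat\theta is an estimator built from (X, Y, Z).

```latex
\begin{align}
\mathrm{E}(\hat\theta - \theta)
  &= \mathrm{E}_{\xi}\bigl[\, \mathrm{E}_{p}(\hat\theta - \theta \mid U) \,\bigr], \\
\mathrm{Var}(\hat\theta - \theta)
  &= \mathrm{E}_{\xi}\bigl[\, \mathrm{Var}_{p}(\hat\theta \mid U) \,\bigr]
   + \mathrm{Var}_{\xi}\bigl[\, \mathrm{E}_{p}(\hat\theta \mid U) - \theta \,\bigr].
\end{align}
```

The first line is the law of iterated expectations and the second is the law of total variance, using the fact that \theta is fixed once U is given. Under this reading, the homogeneity question in (III.A.3) asks whether the inner moments are roughly the same across subpopulations or record sources.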
C. Prior Literature (Examples)
   Davern (2007, 2009), FCSM (1980, SPWP #6), Herzog, Winkler and Scheuren (2007), Jabine and Scheuren (1985), Jeskanen-Sundstrom (2007), Ord and Iglarsh (2007), Penneck (2007), Royce (2007), Winkler (2009)
D. From prior literature: Two concepts of data quality
   1. Per Davern (2007), extend usual ideas of “total survey error” (TSE) to admin data:
      (Estimator) - (True value) = (frame error)
                                 + (sampling error)
                                 + (nonresponse effects)
                                 + (measurement error)
                                 + (processing effects)
   2. Consider broader definitions of data quality, e.g., Brackstone (1999):
      Accuracy (all components of TSE), AND: Timeliness, Relevance, Interpretability, Accessibility, Coherence
   3. Risk: Failure in any component of data quality
      a. Aggregate risks: Historical focus of quantitative work
      b. Systemic risks: Often very important for statistical programs
         - cf. “complex and tightly coupled systems” (Perrow, 1984)
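The additive decomposition in (D.1) can be illustrated as a telescoping sum over successive stages of data collection. The sketch below is a toy simulation, not the author's method: the population, the frame rule, the response propensities, the reporting model and the rounding step are all hypothetical choices made only so that each component has something to measure.

```python
import random
from statistics import mean


def tse_components(seed=0):
    """Toy telescoping decomposition of (estimator - true value).

    Every modeling choice here (normal population, frame cutoff,
    response propensities, 5% underreporting, rounding) is an
    illustrative assumption.
    """
    rng = random.Random(seed)

    # Hypothetical population of expenditures and its true mean.
    population = [rng.gauss(100, 20) for _ in range(5000)]
    true_value = mean(population)

    # Frame with undercoverage of low-expenditure units.
    frame = [y for y in population if y > 70]

    # Simple random sample from the frame.
    sample = rng.sample(frame, 400)

    # Nonresponse: higher-expenditure units respond less often.
    respondents = [y for y in sample
                   if rng.random() < (0.9 if y < 100 else 0.6)]

    # Measurement error: systematic underreporting plus noise.
    reported = [0.95 * y + rng.gauss(0, 5) for y in respondents]

    # Processing: values rounded to the nearest 10 during data capture.
    processed = [round(r / 10) * 10 for r in reported]

    estimator = mean(processed)
    components = {
        "frame error": mean(frame) - true_value,
        "sampling error": mean(sample) - mean(frame),
        "nonresponse effects": mean(respondents) - mean(sample),
        "measurement error": mean(reported) - mean(respondents),
        "processing effects": estimator - mean(reported),
    }
    return components, estimator - true_value


if __name__ == "__main__":
    components, total = tse_components(seed=1)
    for name, value in components.items():
        print(f"{name:>20}: {value:8.3f}")
    print(f"{'total error':>20}: {total:8.3f}")
```

Because each component is the difference between two adjacent stages, the five terms sum to the total error by construction. In practice the intermediate quantities (for example, true values for respondents) are not observed, which is why the evaluation in (III.B) needs conceptual and empirical information about each process.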
IV. Cost Issues
A. Statistical products (including surveys and administrative records) are capital intensive
   - Primarily intangible capital
   1. Data originators:
      - Initial administrative purpose (amortize?)
      - Accommodate statistical agency (data quality, learning curve, systems)
   2. Statistical agencies:
      - Learning curves
      - Systems for acquisition, edit/impute
      - Disclosure limitation
B. Broad acknowledgement of substantial costs
C. Less empirical information generally available on:
   1. Relative magnitudes of specific cost components
   2. Extent of homogeneity of results from (1) with respect to:
      i. Type of administrative agency
      ii. Type of administrative records
      iii. Subpopulation
      iv. Other factors
D. Level of precision available on cost information
   1. Purely qualitative
   2. Order-of-magnitude
   3. Relatively precise
E. Practical uses of cost information
   1. Qualitative decisions among options
   2. Fine-tuning specific procedures
F. Sources of cost information (F. Laflamme, 2008)
   1. Special studies (risks: Hawthorne effect, incomplete accounting)
   2. Cost-recovery contract accounting
V. Mathematical Structure for Full-Population Inference Based on Integration of Data from a General Survey and Specialized Administrative Records
A. Population:
   Goal: Inference for
B. Data Sources
   1. General survey for (most of) full pop
   2. Administrative records: Estimators
   3. Integrate (2) with general-survey data? Costs? Risks? Improvements in precision?
   4. Example: U.S. Consumer Expenditure Survey
      Supplement usual (expensive) collection with specialized data from retail administrative records, transaction intermediaries (with permission)?
C. Easy case:
   1. Frame allows partition of
   2. Use high-quality estimators for direct use in
D. Harder cases:
   1. Screening questions, multiple-frame surveys
   2. Adjust or downweight use of due to quality problems?
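A minimal sketch of the easy case in (C), under assumptions the text does not state: the frame splits the population into one domain covered by administrative records and one domain covered only by the general survey, the two sources are independent, and the target is a population total. The names `Estimate`, `survey_total` and `combine_partition` are hypothetical, and the survey variance uses the standard with-replacement approximation rather than any design described in the slides.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    value: float     # point estimate of a domain total
    variance: float  # estimated variance of that point estimate


def survey_total(values, weights):
    """Weighted total with a with-replacement variance approximation.

    Assumes a single-stage sample with one weight per responding unit;
    this is an illustrative simplification.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two sample units for a variance")
    z = [w * y for w, y in zip(weights, values)]
    total = sum(z)
    z_bar = total / n
    variance = n / (n - 1) * sum((zi - z_bar) ** 2 for zi in z)
    return Estimate(total, variance)


def combine_partition(admin, survey):
    """Add domain totals from disjoint domains.

    Variances add only under the assumed independence of the
    administrative and survey sources.
    """
    return Estimate(admin.value + survey.value,
                    admin.variance + survey.variance)


if __name__ == "__main__":
    # Hypothetical inputs: an admin-record total treated as error-free,
    # and a three-unit survey sample for the remaining domain.
    admin = Estimate(value=500.0, variance=0.0)
    survey = survey_total(values=[10.0, 20.0, 30.0],
                          weights=[2.0, 2.0, 2.0])
    combined = combine_partition(admin, survey)
    print(combined)
```

Treating the administrative total as having zero variance is the optimistic end of the easy case. The harder cases in (D) correspond to replacing that zero with a nonzero variance or bias term, or to downweighting the administrative domain estimate when its quality is in doubt.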
E. Risk Factors
   1. Each of the usual data-quality issues: accuracy, timeliness, relevance, interpretability, accessibility, coherence
   2. Operational risk: Degradation of quality of admin source
   3. Costs for:
      a. Supplementary data source
      b. Screening for subpopulation membership
      c. Microdata review, edit and imputation
      d. Production systems for integration
      e. Investments in human resources
VI. Closing Remarks: Prospective Integration of Administrative Records with Survey Data
A. Good opportunities for
   1. Expanded statistical information for stakeholders
   2. Reduction of overall production costs
B. Suggestion: Expand Evaluation from “Total Survey Error” to “Total Statistical Risk”
C. Evaluation of dominant factors of statistical risk and aggregate costs
D. Example: Prospective Revision of the U.S. Consumer Expenditure Survey
The papers today help us to understand more about some components in (A)-(C).
Lots of interesting work for future years.
John L. Eltinge
Associate Commissioner
Office of Survey Methods Research
www.bls.gov
202-691-7404
Eltinge.John@bls.gov