550 likes | 1.07k Views
Data Editing, Coding, and Just a Little Imputation. Katherine (Jenny) Thompson Office of Statistical Methods and Research for Economic Programs Katherine.J.Thompson@census.gov (301) 763-4941. The Basics: What is Editing?.
E N D
Data Editing, Coding, and Just a Little Imputation Katherine (Jenny) Thompson Office of Statistical Methods and Research for Economic Programs Katherine.J.Thompson@census.gov (301) 763-4941
The Basics: What is Editing? • Editing (procedures) review reported/keyed data for errors and pinpoints “inconsistent” values • For “industry” • For respondent Editing does not change the data. Items that fail edits are • referred to an analyst; or • automatically imputed (replaced with consistent values)
The Basics: What Is Imputation? • Imputation is the replacement of a missing or incorrectly reported item using logical edits or statistical procedures. • In other words, • Imputation replaces a missing or incorrect data item with an “educated guess.”
The Basics: What is Coding? • Coding is the assignment of recognizable values to flags that describe key characteristics of the unit or item, such as • Industry (unit level) • Response status (unit or item level) • Source of data correction (item level) • Imputation model (item level)
We Begin With Coding • Before we can evaluate whether a response is reasonable, we have to know where it comes from: • Classification variable(s) value, e.g., industry, state • Frame information may be erroneous or • unit may have changed classification value Each unit must be assigned classification code(s) before editing/imputation
We End With Coding • At the end of the processing cycle, we want to know • How the data were changed, • Where the data were changed, • Why (if possible) data were changed, and • The final status of the reporting unit (respondent, non-respondent).
Some Edit Definitions • Editing: Procedures for detecting • “incorrect” keyed or respondent • data. • Micro-Editing: Editing at the individual record • (questionnaire) level • Macro-Editing: Editing at the tabulated value level
“Typical” Editing Processing Flow • Micro-editing (static) • Performed on a flow basis • Predetermined edit tests and edit parameters (historic data) • Administered by machine • Resolved by machine and human • Outlier detection (dynamic) • Performed after “close-out” • Administered by machine • Often resolved by human • Macro-editing (dynamic) • See above
Micro-Edits are Either: • Fatal Must be resolved before subsequent • editing • Unit is Out-of-Scope for Survey • Unit is missing classification variable value • Required data item not reported • Query Can be corrected “automatically” • Detail items do not add to reported total • Ratio of two items is outside (user-determined) limits
Where Do Micro-Edits Come From? • Questionnaire • Reality • Subject-Matter Expert Rules • (Enforced) Statistical Relationships
Edit Sources: Questionnaire Balance Edit Item 3.a. Value + Item 3.b. Value = Item 3.c. Value Things have to add up!
Edits Sources: Questionnaire/Reality Ratio Edit ANNUAL PAYROLL/1ST QUARTER PAYROLL 1 Can’t spend more on payroll in one quarter than for the entire year!
Edit Sources: Questionnaire/Reality Ratio Edit 0.96 < TOTAL HOURS WORKED/EMPLOYMENT < 8.76
Edits Sources: Questionnaire/Reality Range Edit 0 EMPLOYMENT 5,615,727 A unit can’t have more employees than the population of the resident state (or negatively-value employees!)
Edit Sources: Subject-Matter Rules Ratio Edit TOTAL SALES/ANNUAL PAYROLL > 1 “Full-year reporters should operate at a profit!”
Edit Sources: Statistical Relationships Ratio Edit A ANNUAL PAYROLL/EMPLOYMENT B Wage per employee should be within the (industry) range.
Examples of Fatal Micro-edits • Classification Edits • Required Data Item Tests
Examples of Query Micro-edits • List Directed (Verification) Edits • Skip Pattern Validation Edits • Range Edits (Including negative tests) • Ratio Edits • Within same questionnaire • Current to prior period • Balance Edits • Subject-matter rules
List Directed/Verification Edits • Purpose: To compare the reported value of a data field to a pre-determined list of legal values. • Machine edits, but highly dependent on data-quality of list • Human (manual) correction of edit failures
Skip Pattern Validation Edits • Purpose: To verify that values of skip items are • consistent with the skip instructions • provided on the questionnaire. • Machine edits that CAN be resolved by machine-imputation • Subject-matter rules (if..then..logic) • Operations Research approach
Range Edits • Purpose: To check the reported value of a data item to see if it is within specified minimum and maximum values. • Form of edit: lower bound data item upper bound • Upper and lower bounds are tolerances. • If data item is not contained within the bounds, then it fails the range edit (“out of tolerance”). • Negative tests are a special case of range edits. • Can be used to define an imputation region.
Range Edits • Examples: • 0 Employment 301,064,982 (2006 U.S. Population) • 0 Sales 12,455.8 billion (2005 Gross Domestic Product) • 0 Percent of work done in category 100%
Ratio Edits • Purpose: To compare two “related” items in a questionnaire to see if reported values are consistent. • Form of Ratio Edit: • Upper and lower bounds are known as tolerances. • Tolerances generally developed from prior period data. • If ratio is not contained within the bounds, then it fails the ratio edit (“out of tolerance”).
Some Reasons for Ratio Editing • One data item is a function of another. • Annual Payroll = 1st Quarter Payroll + Payroll for Remaining 3 Quarters • Ratio Edit:
Some Reasons for Ratio Editing • One data item can only be evaluated in comparison with another item • (reasonable lower bound) • (reasonable upper bound)
Some Reasons for Ratio Editing • One data item is a good predictor of another. • Annual Payroll = factor Total Employment
Advantages of Ratio Edits • Useful for detecting systematic and random errors • Reasonable comparisons for quantitative data • Verifiable assumptions • Often insensitive to changes in economy when both items are in the same units • Imply certain imputation models • Can be solved simultaneously • imputation region implications
Disadvantages of Ratio Edits • Edit failure identifies a pair of potentially incorrect data fields • Need to have a “tie-breaker” • Often work best when combined with other edits (can be ratio edits) • Very dependent on the distribution of ratios • Highly correlated • Goes through origin
“Best” Practices for Ratio Edits • Incorporate unit size categories as well as classification variables in editing cells • Perform preliminary data analysis to determine validity of edit model • Incorporate tests to prior data from same unit and item when reasonable • Use non-parametric outlier-resistant methods for setting ratio edit tolerances • Audit edits • An edit test that has a high rate of failure could indicate problems with the tolerances or the test itself
Brief Digression on Imputation • Situation: Missing item or item marked for • imputation (replacement) due to • edit failure(s) • We would like the machine to automatically replace the “inconsistent” item with a consistent value.
The ideal “imputations” find replacement values that are still considered reported (from the same respondent) • Examples • divide reported data by correct reporting unit • replace reported total with sum of details
Link Between Imputation and Program • Published tabulations(macro-data) • Ratio imputation models • Regression imputation models • Published micro-data • Hot deck imputation
Commonly-Used Imputation Methods(Economic Data) • Rounding/Data Slides (systematic error) • Respondent data divided by unit conversion factor (e.g., imputed value = reported value/1,000) • Direct Substitution • Another data item (same questionnaire) • Absolute value of reported/keyed item • Sum of Reported Details (logical edit) • Derived value from other reported/keyed item • Previously reported value (historic) from same respondent • Administrative data value (same respondent)
Ratio Imputation (Model Imputation) • imputed item = (factor) (another data field) • Same reporting unit/questionnaire • Edit-passing item • Industry (Category) Average Ratio • (use average ratio of two items in industry/category) • e.g., factor = industry wage/employee ratio • Historic Imputation (Auxiliary Trend) • (use ratio of prior data to current data for same respondent) • e.g. factor = previous tabulated value of edit-failing item • previous tabulated value of auxiliary data field
Balance Edits • Purpose: To determine if detail items add to associated reported total. • Form of Edit: TOTAL = DETAIL1 + DETAIL2 + ... + DETAILn • Developed from questionnaire • A set of details along with their associated total is called a balance complex. • More complicated balance complexes • Nested 1-Dimensional • 2-Dimensional
“Fixing” a Failed Balance Edit • Editing generally integrated with imputation: • Editor decides which is more believable: TOTAL or SUM OF DETAILS • Only change one side of balance complex (TOTAL or SUM OF DETAILS)
Balance Edit Definitions • Residual: TOTAL - SUM OF DETAILS • Failed edit solution can depend on • SIZE of residual (absolute tolerance) • RATIO of residual to total (relative tolerance)
A Few Balance Edit Fixes • RAKE* Rake all detail items to TOTAL • YSUMX* Replace TOTAL with the SUM OF DETAILS • ROUND Divide all details by 1000 or • divide TOTAL by 1000 • RESIDUAL Set one missing DETAIL to the RESIDUAL • IMPUTE* Replace all DETAILS with imputed values • *Briefly discussed…
Raking • Adjust each detail item as • Conditions: • Reported TOTAL must be “acceptable.” • Relative tolerance is “small” (e.g., within 5%).
Raking -- Considerations • Is not considered imputation • Preserves reported distribution of the detail items
YSUMX • Set TOTAL equal to SUM OF DETAILS • Conditions: • TOTAL can be changed by edit (not “fixed”); • (Optional, but preferable) SUM OF DETAILS is “reasonable” (e.g., verify with ratio test or range test)
YSUMX -- Considerations • Not (considered: imputation; • logical edit or deductive imputation • Useful when TOTAL is missing (and details are not); • Can be imputation solution to ratio edit
Impute • Replace ALL reported DETAILS with imputed values • Imputed DETAILi for reporting unit c is given by • factori TOTAL • Conditions • TOTAL > 0 (and value of TOTAL “acceptable”) • No restriction on SUM OF DETAILS (all DETAILS are replaced...] • Difference between TOTAL and SUM OF DETAILS too large for raking
Macro-Editing (Brief Comments) • Systematic review of tabulations (estimates) • Tendency to rely on ratio comparisons to identify outlying estimates • Hidiroglou-Berthelot edit • Ratio Edits • Need to analyze micro-data in outlying cells
Back to Coding… • Throughout the editing and imputation process, what do we need to keep track of?
Back to Coding… • Original source of data item • Reported from respondent • Elicited by analyst/subject-matter expert • Missing/not reported