Imputation
Overview • What is imputation? • Definition • Imputation methods • Imputation strategy • Examples
What is imputation? • Not every selected sample unit will respond (some cannot be enumerated). Some will be partial non-response (not all data items filled in), others full non-response (form completely blank). • Not every responding unit gives correct or consistent answers. • Broadly, the conditions above give two states: • 'Dirty' data • dirty unit (if missing all data) • dirty data item (for a missing data item, or inconsistent data items) • A correct response is referred to as 'clean'.
What is imputation? (cont.) • To produce estimates from a survey we need to deal with 'dirty' data. • We need to calculate or estimate the true values for the 'dirty' data. • This process of calculating or estimating is imputation.
What is imputation? (cont.) • Full imputation • Full unit non-response, or errors in almost all items. • Data are imputed for all items • Partial imputation • A mixture of dirty and clean data items within a unit. • Data are imputed for the dirty data items only. • Refusal • The unit is known to exist but data could not be collected • Data are imputed for all items
Consequences of non-response: • If we rely only on the responding units to represent the sample: • We lose information from some of the selected units • Estimates based on only a smaller part of the sample have a larger sampling error • If non-respondents have different characteristics from respondents, the estimates become increasingly biased (non-response bias) • Why do we impute for non-response?
Explicit vs implicit imputation • For full non-response/full imputation there are two approaches: • Implicit imputation • Dirty units are not given a weight. • The weights of clean units are adjusted. • Explicit imputation • Dirty units are given a weight. • Values are imputed for every dirty unit and written into the survey data
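A minimal sketch of the two approaches for a single stratum. The stratum size, sample size, values, and the choice of a simple unweighted mean as the explicit impute are all illustrative assumptions, not figures or methods taken from the slides.

```python
def adjusted_weight(design_weight, n_sampled, n_clean):
    """Implicit imputation: clean units carry the weight of the dirty units."""
    if n_clean == 0:
        raise ValueError("no clean units available to carry the weight")
    return design_weight * n_sampled / n_clean

# Hypothetical stratum: 100 units in the population, 10 sampled, 8 clean.
N_h = 100
n_sampled = 10
clean_values = [12.0, 15.0, 9.0, 11.0, 14.0, 10.0, 13.0, 12.0]
design_weight = N_h / n_sampled

# Implicit: dirty units get no weight, clean weights are scaled up.
implicit_total = (
    adjusted_weight(design_weight, n_sampled, len(clean_values))
    * sum(clean_values)
)

# Explicit: dirty units keep their weight and receive an imputed value
# (here the mean of the clean units, purely as an example).
clean_mean = sum(clean_values) / len(clean_values)
imputed_values = [clean_mean] * (n_sampled - len(clean_values))
explicit_total = design_weight * (sum(clean_values) + sum(imputed_values))

print(implicit_total, explicit_total)
```

With a single class and a mean impute the two totals coincide; they differ once the explicit impute uses unit-level information such as historical or auxiliary data.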
Overview Explicit imputation methods • Clerical Imputation • Automated Imputation types • Mean • Ratio • Donor • Constant Value • Pro-ration • Auxiliary variable • Historical • Regression
Explicit imputation methods Explicit imputation methods use information from other sampled units to calculate or estimate values for the missing units. To do this we need to specify which other units are similar to the unit whose characteristics are being estimated. These specifications are called the donor class and the imputation class.
Donor class • Donor class: The donor class is the level at which the mean is calculated. • Determining the most appropriate donor class involves finding a compromise between competing requirements: • Units as similarly behaved as possible to the unit requiring imputation • Sufficiently broad classification to generate classes with enough sample units to calculate a robust impute
Imputation class • Imputation class: The imputation class is the level at which the mean is applied • Usually the imputation class is the same as the donor class but not always • For example: Missing values for Region 1 industry may be replaced by the mean from Region 2 respondents in the same industry, where region 2 is sufficiently similar to region 1.
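The Region 1 / Region 2 example above can be sketched as follows. The record layout, the field names, and the use of an unweighted mean are assumptions made for illustration only.

```python
def class_mean(records, in_donor_class):
    """Mean of reported values over the donor class (unweighted, for brevity)."""
    values = [r["y"] for r in records if in_donor_class(r) and r["y"] is not None]
    return sum(values) / len(values) if values else None

def impute_from_class(records, in_donor_class, in_imputation_class):
    """Calculate the mean at the donor class, apply it at the imputation class."""
    mean = class_mean(records, in_donor_class)
    result = []
    for r in records:
        r = dict(r)  # copy so the input is left untouched
        if r["y"] is None and in_imputation_class(r) and mean is not None:
            r["y"] = mean
            r["imputed"] = True
        result.append(r)
    return result

# Hypothetical records: the Region 1 unit in industry A has not responded.
records = [
    {"id": 1, "region": 1, "industry": "A", "y": None},
    {"id": 2, "region": 2, "industry": "A", "y": 40.0},
    {"id": 3, "region": 2, "industry": "A", "y": 60.0},
    {"id": 4, "region": 2, "industry": "B", "y": 500.0},
]

imputed = impute_from_class(
    records,
    in_donor_class=lambda r: r["region"] == 2 and r["industry"] == "A",
    in_imputation_class=lambda r: r["region"] == 1 and r["industry"] == "A",
)
print(imputed[0])
```

Keeping the two classes as separate arguments makes the usual case (both the same) and the borrowed-region case use the same code path.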
Quality assurance of Imputation • Checking the impact and the quality of the imputes is a very important step in the imputation process.
Quality assurance of Imputation • The major way of quality assuring the Imputation is to evaluate the effects on estimation; • Number of actual imputes compared with previous cycles • The contribution of the imputes to estimates • May need to revise imputation if: • The response rate is low • Imputes based on small size donor/imputation classes • Large amount of the estimate is based on imputed data (guesstimate)
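One of the checks listed above, the contribution of the imputes to the estimate, can be computed as a simple share of the weighted total. This is a sketch under assumed field names (`weight`, `value`, `imputed`); the slides do not prescribe a formula or a threshold.

```python
def impute_contribution(units):
    """Share of the weighted total that comes from imputed values."""
    total = sum(u["weight"] * u["value"] for u in units)
    from_imputes = sum(u["weight"] * u["value"] for u in units if u["imputed"])
    return from_imputes / total if total else 0.0

# Hypothetical units: one reported value and one imputed value.
units = [
    {"weight": 10.0, "value": 5.0, "imputed": False},
    {"weight": 10.0, "value": 15.0, "imputed": True},
]
share = impute_contribution(units)
print(share)
```

A large share, as in this toy data, is the kind of result that would prompt a review of the imputation approach.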
Estimation following imputation • Units need to be classed as: • Complete non-response or Partial non-response and • Clerical impute • Unit impute (ratio, look-up, pro-ration, aux var) • Donor impute or • Mean impute. • So that we can track how they have been treated. • All imputes are used to produce estimates. • Donor and mean imputes are excluded from variance calculations
Imputation strategy • An imputation strategy should be made for each collection PRIOR to imputation beginning • The strategy allows a consistent and clear approach to imputation to be applied • The strategy outlines the approach you will take for imputation on the collection
Imputation strategy • The strategy should outline different approaches/processes taken for different streams • Full non response • Partial non response • Sampled units • Completely enumerated (CE) units • Continuing units • New units • Continuing data items • New data items
Imputation strategy: Fallbacks • For each stream mentioned in the previous slide you should have: • Primary Imputation Method • Fallback 1 Imputation Method • Fallback 2 Imputation Method • Etc. • You should have enough fallbacks to ensure that no units in scope of imputation fail to be imputed.
Imputation strategy: Outline • One example of an imputation strategy outline: • Overview: the survey, reference period etc. • Definitions: imputation classes used, donor classes used • Units in scope of explicit imputation, implicit imputation and those not in scope of imputation • Key data items to be imputed • Clearly define the different streams • Table/diagram of the imputation process used for each stream, and for data items within the stream if applicable
ABS example: Imputation strategy You can see in the table above, this collection has specified different Imputation methods for different streams. They have also applied different Imputation methods for different data items
Imputation software in ABS • The ABS has developed its own software, called ABSImp, which is used for imputation in all collections • Suite of SAS macros; • Comprehensive (designed to handle any imputation method we've dreamed up so far); • Can easily be extended to include other methods, in the rare event that someone comes up with a new approach.
How ABSImp works • It takes one imputation step at a time, in the order given by a Fallback Sequence. • An imputation step specifies: • a single variable (to be imputed; e.g. SSITOT), • a single imputation method (e.g. Live Respondent Mean) • an imputation class (e.g. STRATUM, or SIZE*STATE) • a single set of parameters • other special constraints that records must meet.
How ABSImp works (2) • During each imputation step, ABSImp attempts to impute a value for the specified variable for each record meeting the following constraints: • the specified variable is missing for this record; • the record satisfies an impute_where condition (often no constraint is imposed);
How ABSImp works (3) • A given imputation step may fail its attempt to impute a variable for a particular record. Reasons depend on the method and parameters; examples include: • There are too few donor records • There isn't clean historical data for the unit • In this instance the variable is simply left missing. • Since the variable is still missing, the next relevant imputation step in the fallback sequence will attempt this record again.
A diagram of the process in ABSImp • Prepare data for imputation • Take an imputation step from the fallback sequence • Which records need imputing for this variable? • Try to impute values with this method • Is there another imputation step to try? • Update USI & PSI codes & wrap up • Update codes & evaluate quality
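The fallback behaviour described on the ABSImp slides can be illustrated in Python. This is not ABSImp itself (which is a suite of SAS macros); the three methods, the field names, and the `min_donors` threshold are hypothetical stand-ins, and the impute_where condition is left out for brevity.

```python
def respondent_mean(record, data, min_donors=2):
    """Step fails (returns None) when there are too few donor records."""
    donors = [
        r["y"] for r in data
        if r["y"] is not None
        and "imputed_by" not in r          # imputed values never act as donors
        and r["stratum"] == record["stratum"]
    ]
    if len(donors) < min_donors:
        return None
    return sum(donors) / len(donors)

def historical_unit(record, data):
    """Step fails when the unit has no clean historical value."""
    return record.get("y_prev")

def constant_value(record, data):
    """Last-resort step that always succeeds."""
    return 0.0

def run_fallback_sequence(data, steps):
    """Apply one imputation step at a time, in the order given."""
    for name, method in steps:
        for record in data:
            if record["y"] is None:        # only records still missing
                value = method(record, data)
                if value is not None:
                    record["y"] = value
                    record["imputed_by"] = name
    return data

data = [
    {"id": "a", "stratum": 1, "y": 10.0, "y_prev": 9.0},
    {"id": "b", "stratum": 1, "y": 20.0, "y_prev": 18.0},
    {"id": "c", "stratum": 1, "y": None, "y_prev": 5.0},
    {"id": "d", "stratum": 2, "y": None, "y_prev": 7.0},
    {"id": "e", "stratum": 2, "y": None, "y_prev": None},
]
steps = [
    ("respondent_mean", respondent_mean),
    ("historical_unit", historical_unit),
    ("constant_value", constant_value),
]
run_fallback_sequence(data, steps)
for record in data:
    print(record["id"], record["y"], record.get("imputed_by"))
```

Because a failed step simply leaves the value missing, later steps pick the record up without any extra bookkeeping, and a final step that cannot fail guarantees that nothing in scope is left unimputed.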
Clerical imputation • Clerical imputation involves editing staff drawing on their knowledge of the unit and/or subject matter concerned. • It can be expensive because it is time and resource consuming • Can lead to inconsistent treatments (different staff have different ideas!) • Usually reserved for large Completely Enumerated units only, because there are few of them and their values are usually important to estimates
Clerical Imputation: Example • An editor contacts a large business because their export earnings are missing on the form • The business cannot supply current figures on export earnings, but thinks it is roughly up by 10% • The imputation staff assign the figure reported by the business last cycle and apply an increase factor of 10%
Mean imputation • The imputed values for the mean methods are calculated as the weighted average value of the respondents/live respondents in either the current or previous time point, at the donor class level. • Methods are of the form:
Mean imputation • Live Respondent Mean (LRM) method • Respondent mean method • Historical live respondent mean (LRM) method • Historical respondent mean method
Live Respondent Mean imputation • The LRM method calculates the weighted average value of LRM contributors at a specified imputation class • LRM contributors are units which are clean and live respondents, excluding imputed units other than clerical imputes
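LRM imputation • $\bar{y}_d = \dfrac{\sum_{i \in d} w_i\, y_i}{\sum_{i \in d} w_i}$, summed over the $n_d$ LRM contributors in donor class d • where • $\bar{y}_d$ is the live respondent mean impute • $w_i$ is the weight for unit i • $y_i$ is the value for variable Y for unit i • $N_h$ is the population size in stratum h • $n_d$ is the number of LRM contributors in imputation donor class d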
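A minimal sketch of the live respondent mean as a weighted average over LRM contributors. The field names (`cls`, `w`, `y`, `clean`, `live`, `auto_imputed`) are assumptions; `auto_imputed` stands for "imputed by anything other than a clerical impute".

```python
def is_lrm_contributor(unit):
    """Clean, live respondents, excluding imputes other than clerical ones."""
    return unit["clean"] and unit["live"] and not unit["auto_imputed"]

def live_respondent_mean(units, donor_class):
    """Weighted average value of the LRM contributors in one donor class."""
    contributors = [
        u for u in units if u["cls"] == donor_class and is_lrm_contributor(u)
    ]
    weight_sum = sum(u["w"] for u in contributors)
    if weight_sum == 0:
        return None  # no contributors: this method cannot impute
    return sum(u["w"] * u["y"] for u in contributors) / weight_sum

# Hypothetical units; the third is not live, so it does not contribute.
units = [
    {"cls": "A", "w": 2.0, "y": 10.0, "clean": True, "live": True, "auto_imputed": False},
    {"cls": "A", "w": 6.0, "y": 20.0, "clean": True, "live": True, "auto_imputed": False},
    {"cls": "A", "w": 5.0, "y": 999.0, "clean": True, "live": False, "auto_imputed": False},
    {"cls": "B", "w": 1.0, "y": 50.0, "clean": True, "live": True, "auto_imputed": False},
]
lrm_a = live_respondent_mean(units, "A")
print(lrm_a)
```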
Ratio imputation method • Take a data item we do have for the unit and scale it to predict a response value. • Example 1 (repeated survey): assume the unit's inventory has grown since last quarter at the same rate as for all respondents in that stratum. • Example 2: assume turnover from tax data is in the same proportion to sales for all units in the stratum. • There are many closely related variants.
Ratio Imputation methods • Unit Ratio • Look-up Ratio
Aggregate Ratio imputation • In general, the aggregate ratio imputes are calculated by taking a clean response for the same unit (from a previous cycle of the survey or from a different data item) and applying an adjustment to the clean response. This adjustment is based on the ratio of means, calculated using clean responses from similar units. The general form of the ratio impute is:
Aggregate Ratio • There are eight types of aggregate ratio imputation methods used in the ABS • Historical Live Respondent Aggregate Ratio • Historical Respondent Aggregate Ratio • Historical Live Respondent Auxiliary Aggregate Ratio • Historical Respondent Auxiliary Aggregate Ratio • Historical Auxiliary Live Respondent Aggregate Ratio • Historical Auxiliary Respondent Aggregate Ratio • Auxiliary Live Respondent Aggregate Ratio • Auxiliary Respondent Aggregate Ratio
Historical Live Respondent Aggregate Ratio Method • The Imputed data item value for Unit i is derived by adjusting the previous cycle's value using a ratio of weighted average values of live respondent mean (LRM) contributors in the current and previous cycles for a specified imputation (donor) class, d
Historical Live Respondent Auxiliary Aggregate Ratio Method • The historical live respondent auxiliary aggregate ratio imputation method adjusts the value of the variable of interest in the previous time point by the ratio of the weighted average auxiliary values of live respondent mean (LRM) contributors in the current and previous time points, at a specified imputation donor class d. The LRM contributors are those units which are clean and live respondents in both the current and previous time points, excluding imputed units other than clerical imputes.
Historical Auxiliary Live Respondent Aggregate Ratio Method • The historical auxiliary live respondent aggregate ratio imputation method adjusts the value of the auxiliary variable in the previous time point by the ratio of the weighted average values of live respondent mean (LRM) contributors in the current time point over the weighted average auxiliary values of LRM contributors in the previous time point, at a specified imputation donor class d. The LRM contributors are those units which are clean and live respondents in both the current and previous time points, excluding imputed units other than clerical imputes.
Auxiliary Live Respondent Aggregate Ratio (ALRAR) Method • Imputed data item value for Unit i is derived by adjusting the unit's value of the auxiliary variable using a ratio of weighted average values of live respondent mean (LRM) contributors over the weighted average auxiliary values of LRM contributors, for a specified imputation (donor) class, d
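The four live respondent aggregate ratio methods described above share one shape: a base value for the unit multiplied by a ratio of two weighted means over the LRM contributors. The sketch below follows the prose descriptions; the field names (`y_t`, `y_prev`, `x_t`, `x_prev`, `w`) and the data are illustrative assumptions.

```python
def weighted_mean(contributors, key):
    """Weighted average of one field over the contributors."""
    total_weight = sum(c["w"] for c in contributors)
    if total_weight == 0:
        return None
    return sum(c["w"] * c[key] for c in contributors) / total_weight

def aggregate_ratio_impute(base_value, contributors, numerator, denominator):
    """base_value * (weighted mean of numerator / weighted mean of denominator)."""
    top = weighted_mean(contributors, numerator)
    bottom = weighted_mean(contributors, denominator)
    if top is None or not bottom:
        return None  # no contributors or zero denominator: method fails
    return base_value * top / bottom

# Hypothetical LRM contributors in donor class d (y = variable of interest,
# x = auxiliary variable, t = current time point, prev = previous time point).
lrm = [
    {"w": 1.0, "y_t": 110.0, "y_prev": 100.0, "x_t": 55.0, "x_prev": 50.0},
    {"w": 1.0, "y_t": 220.0, "y_prev": 200.0, "x_t": 110.0, "x_prev": 100.0},
]
unit = {"y_prev": 80.0, "x_t": 44.0, "x_prev": 40.0}  # y_t is missing

# Historical Live Respondent Aggregate Ratio
hlrar = aggregate_ratio_impute(unit["y_prev"], lrm, "y_t", "y_prev")
# Historical Live Respondent Auxiliary Aggregate Ratio
hlraar = aggregate_ratio_impute(unit["y_prev"], lrm, "x_t", "x_prev")
# Historical Auxiliary Live Respondent Aggregate Ratio
halrar = aggregate_ratio_impute(unit["x_prev"], lrm, "y_t", "x_prev")
# Auxiliary Live Respondent Aggregate Ratio (ALRAR)
alrar = aggregate_ratio_impute(unit["x_t"], lrm, "y_t", "x_t")
print(hlrar, hlraar, halrar, alrar)
```

The toy data are perfectly proportional, so all four variants agree; on real data they differ according to which relationship (growth over time, or the ratio to the auxiliary variable) is more stable.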
Historical Unit Impute • The historical unit imputation methods take a clean response for the same unit from a previous cycle of the survey and may also apply an adjustment to the clean response. The general form of the historical unit impute is: • or • where fc is a factor applied to the clean response from a previous cycle and will differ depending on the type of historical unit imputation method chosen. • There are four types of historical unit impute: • Historical Unit (fc=1) • Historical Unit Classification (fc=1). Note: this method is equivalent to method 1 for classification data items • Historical Unit Growth (fc is a growth factor to be applied) • Historical Unit Ratio (fc is a ratio of an auxiliary variable for unit i in the current and previous time points)
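A sketch of the historical unit impute, assuming the general form is the factor fc multiplied by the unit's clean response from the previous cycle (the formula itself did not survive in the slide text, so this form is an assumption based on the description). The function and argument names are hypothetical.

```python
def historical_unit_impute(y_prev, factor=1.0):
    """Assumed form: impute = fc * previous-cycle value (fc = 1 by default)."""
    if y_prev is None:
        return None  # no clean historical value: method fails
    return factor * y_prev

def historical_unit_ratio(y_prev, x_current, x_prev):
    """fc is the unit's own auxiliary ratio, current over previous."""
    if y_prev is None or not x_prev:
        return None
    return historical_unit_impute(y_prev, x_current / x_prev)

carried_forward = historical_unit_impute(200.0)           # Historical Unit
grown = historical_unit_impute(200.0, factor=1.1)         # Historical Unit Growth
ratio_based = historical_unit_ratio(200.0, 60.0, 50.0)    # Historical Unit Ratio
print(carried_forward, grown, ratio_based)
```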
Auxiliary Unit Impute • The auxiliary unit imputation methods take the value of an auxiliary variable in the current or previous time point and may additionally apply an adjustment to the value. The general form of the auxiliary unit impute is: • or • There are generally five auxiliary unit imputation methods used in the ABS: • Auxiliary Unit (fc=1 and xt,i is the auxiliary variable in the current time point) • Historical Auxiliary Unit (fc=1 and xt,i is the auxiliary variable in the previous time point) • Auxiliary Unit Adjustment (fc=imputation adjustment factor and xt,i is the auxiliary variable in the current time period) • Historical Auxiliary Unit Growth (fc=imputation adjustment factor and xt,i is the auxiliary variable in the previous time period) • Multiple Auxiliary Unit Adjustment Method (fc=imputation adjustment factor for each xt,i and xt,i are multiple auxiliary variables in the current time period)
Donor Imputation • There are two main types of donor imputation; nearest neighbour and percentile, where both are performed at the imputation class level. • The percentile donor imputation methods involve selecting a unit from each imputation class from a list of acceptable donor units (in the current or previous time point), and 'donating' the response of the selected unit to any units, within the imputation class, requiring imputation. The 'donor' unit is selected through the specification of a percentile value • Percentile requires auxiliary variable for clean records only (to enable sorting) and the specification of a percentile to be used as the donor. • Live Respondent Percentile Donor • Respondent Percentile Donor • Historical Live Respondent Percentile Donor • Historical Respondent Percentile Donor
Live respondent percentile donor • The live respondent percentile donor imputation method selects the value of the unit at the specified percentile from a sorted list of acceptable live respondent donor units in a specified imputation class c. The acceptable live respondent donor units are those units which are clean (i.e. PSI>=p0) and live respondents, excluding imputed units other than clerical imputes (i.e. E = [1,2(R=11,21,31),4,7]). • 1. Sort the list of acceptable live respondent donor units by the variable of interest in imputation class c • 2. Calculate the percentile position (w.d) in donor imputation class c • 3. Interpolate between the two sorted values on either side of that position: $\hat{y}_i = p\,[\,y_{(w),c} + d\,(y_{(w+1),c} - y_{(w),c})\,]$ • where p is the proportion of the reference period the unit is alive, • $n_c$ is the number of live respondent mean (LRM) contributors in imputation class c, • d is the decimal component of (w.d), • $y_{(w),c}$ is the y value of the w-th sorted LRM contributor in imputation class c, • $y_{(w+1),c}$ is the y value of the (w+1)-th sorted LRM contributor in imputation class c
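The three steps above can be sketched as follows. Two things are assumptions rather than facts from the slides: the rule used to turn a percentile q into the position (w.d), taken here as q × (n + 1) clamped to the range 1..n, and the linear interpolation between the w-th and (w+1)-th sorted values.

```python
import math

def percentile_donor_impute(donor_values, q, p=1.0):
    """Percentile donor impute from the acceptable donors of one class.

    q is the percentile as a fraction (0.5 = median); p is the proportion
    of the reference period the unit was alive.
    """
    ys = sorted(donor_values)                 # step 1: sort the donors
    n = len(ys)
    if n == 0:
        return None                           # no donors: method fails
    # Step 2 (assumed rule): position (w.d) = q * (n + 1), kept within 1..n.
    position = min(max(q * (n + 1), 1.0), float(n))
    w = math.floor(position)
    d = position - w
    # Step 3: interpolate between the w-th and (w+1)-th sorted values.
    lower = ys[w - 1]
    upper = ys[w] if w < n else ys[w - 1]
    return p * (lower + d * (upper - lower))

median_impute = percentile_donor_impute([40.0, 10.0, 30.0, 20.0], q=0.5)
half_period = percentile_donor_impute([40.0, 10.0, 30.0, 20.0], q=0.5, p=0.5)
print(median_impute, half_period)
```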
Nearest neighbour donor imputation • The nearest neighbour donor imputation methods involve selecting a unit which is considered to be similar to the unit requiring imputation (based on a numeric or character auxiliary variable) from a list of acceptable donor units, and 'donating' the response of the selected unit to the unit requiring imputation. • Requires auxiliary/extra variable for clean and dirty units • Sorts all clean units within donor class by some auxiliary variable (also available for dirty unit). • Identify clean units on either side of the dirty unit and randomly select one as the donor.
Nearest neighbour donor imputation • Live Respondent Nearest Numeric Neighbour Donor • Live Respondent Nearest Character Neighbour Donor • Respondent Nearest Numeric Neighbour Donor • Respondent Nearest Character Neighbour Donor • Live Respondent Nearest Numeric Neighbour Classification Donor • Live Respondent Nearest Character Neighbour Classification Donor • Respondent Nearest Numeric Neighbour Classification Donor • Respondent Nearest Character Neighbour Classification Donor
Nearest neighbour • The live respondent nearest numeric neighbour donor imputation method selects the value of a similar unit from a list of acceptable live respondent donor units sorted by an auxiliary numeric variable at a specified imputation class c. The acceptable live respondent donor units are those units which are clean and live respondents, excluding imputed units other than clerical imputes. • 1. Allocate all units in the current time point a random number from the uniform distribution on the interval [0, 1) • 2. Sort the list of acceptable live respondent donor units by the auxiliary numeric variable in imputation class c • 3. Calculate the live respondent nearest numeric neighbour donor impute value for unit i in imputation class c • where the two candidate donors are the closest units to unit i in the sorted list
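A sketch of the nearest numeric neighbour donor steps. How the allocated random numbers decide between the two neighbours is not stated on the slide; choosing the neighbour with the smaller random number is an assumption made here, as are the field names (`x` auxiliary value, `y` reported value, `u` random number).

```python
import bisect
import random

def assign_random_numbers(units, seed=0):
    """Step 1: give every unit a uniform random number on [0, 1)."""
    rng = random.Random(seed)
    for unit in units:
        unit["u"] = rng.random()

def nearest_neighbour_impute(target_x, donors):
    """Steps 2-3: sort donors by x and donate from a neighbouring donor."""
    ordered = sorted(donors, key=lambda donor: donor["x"])
    if not ordered:
        return None                              # no donors: method fails
    xs = [donor["x"] for donor in ordered]
    k = bisect.bisect_left(xs, target_x)
    if k < len(ordered) and xs[k] == target_x:
        return ordered[k]["y"]                   # exact match on x
    candidates = []
    if k > 0:
        candidates.append(ordered[k - 1])        # neighbour just below
    if k < len(ordered):
        candidates.append(ordered[k])            # neighbour just above
    # Assumed rule: the candidate with the smaller random number donates.
    chosen = min(candidates, key=lambda donor: donor["u"])
    return chosen["y"]

# Random numbers are fixed by hand here so the outcome is easy to follow;
# assign_random_numbers(donors) would allocate them in practice.
donors = [
    {"x": 20.0, "y": 200.0, "u": 0.9},
    {"x": 30.0, "y": 300.0, "u": 0.1},
    {"x": 10.0, "y": 100.0, "u": 0.5},
]
between = nearest_neighbour_impute(25.0, donors)
below_all = nearest_neighbour_impute(5.0, donors)
print(between, below_all)
```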
Regression Imputation • The regression imputation methods calculate the imputed values by regressing the value of the variable of interest on an auxiliary variable, within imputation donor class. Thus, the methods are of the form: • Two regression methods are used in the ABS: • Live respondent regression • Respondent regression
Live Respondent Regression Imputation • The live respondent regression imputation method regresses the value of the variable of interest against the value of the auxiliary variable for live respondent mean (LRM) contributors at a specified imputation donor class d.
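A sketch of regression imputation within one donor class, assuming an ordinary least squares line with an intercept and no survey weights. The slides do not show the exact model form, so both of those choices are assumptions.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b * x; returns (a, b) or None."""
    n = len(xs)
    if n < 2:
        return None                      # too few contributors to fit a line
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None                      # no spread in x: slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def regression_impute(x_missing_unit, contributors):
    """Predict y for a unit from its auxiliary value x."""
    fit = fit_line([c["x"] for c in contributors], [c["y"] for c in contributors])
    if fit is None:
        return None
    intercept, slope = fit
    return intercept + slope * x_missing_unit

# Hypothetical LRM contributors in donor class d.
contributors = [
    {"x": 1.0, "y": 3.0},
    {"x": 2.0, "y": 5.0},
    {"x": 3.0, "y": 7.0},
]
predicted = regression_impute(4.0, contributors)
print(predicted)
```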
Constant Value • For each data item within an imputation class, an impute value is specified in a table. • All 'dirty' units in an imputation class receive the same impute. • The imputed value is based on a constant value $c_d$ specified for imputation class d: • $\hat{y}_i = c_d$ • or • $\hat{y}_i = p\,c_d$ • where p is the proportion of the reference period the unit was live for