Missing Data: Problems & Prospects Daniel A. Newman University of Illinois
Overview • Missing Data Levels • Item-level, Scale-level, and Person-level • Missing Data Problems • Bias/Poor External Validity, Low Power • Missing Data Mechanisms • MCAR, MAR, MNAR • Missing Data Techniques • Listwise & Pairwise Deletion, Single Imputation • Maximum Likelihood and Multiple Imputation • Sensitivity Analysis
Missing Data Levels • Item-Level Missingness • Answering only j out of J possible items on a scale (i.e., leaving a few items blank) • Scale-Level Missingness • Answering zero items from a scale (i.e., omitting an entire scale or an entire construct) • Person-Level Missingness • Failure to return the survey • In the aggregate, this is called response rate
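The three levels can be made concrete with a small sketch. Everything here is an illustrative assumption rather than part of the original slides: the scale names, the item names, and the `classify_missingness` helper are hypothetical, and a missing response is represented as `None`.

```python
from typing import Dict, List, Optional

# Hypothetical survey layout: two scales with three items each.
SCALES: Dict[str, List[str]] = {
    "satisfaction": ["sat1", "sat2", "sat3"],
    "commitment": ["com1", "com2", "com3"],
}


def classify_missingness(
    person: Dict[str, Optional[int]],
    scales: Dict[str, List[str]] = SCALES,
) -> Dict[str, str]:
    """Label each scale, then the person, by level of missingness."""
    result: Dict[str, str] = {}
    for scale, items in scales.items():
        # Count the items on this scale that received an answer.
        answered = sum(person.get(item) is not None for item in items)
        if answered == len(items):
            result[scale] = "complete"
        elif answered == 0:
            result[scale] = "scale-level missing"   # zero of J items answered
        else:
            result[scale] = "item-level missing"    # j of J items answered
    # A person who answered nothing on any scale is a nonrespondent.
    nothing_answered = all(
        label == "scale-level missing" for label in result.values()
    )
    result["person"] = "person-level missing" if nothing_answered else "respondent"
    return result


partial = {"sat1": 4, "sat2": None, "sat3": 5,
           "com1": None, "com2": None, "com3": None}
print(classify_missingness(partial))
print(classify_missingness({}))  # survey never returned
```

The person-level label is derived from the scale-level labels, which are derived from item-level counts, so the sketch treats the levels as nested.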
Missing Data Levels • Complete Data
Missing Data Levels • Incomplete Data
Missing Data Levels • Item-Level Missing Data • Scale-Level Missing Data • Person-Level Missing Data
Missing Data Levels • Missing data levels are nested • Item-level missingness can aggregate into scale-level missingness • Scale-level missingness can aggregate into person-level missingness • Choice of appropriate missing data technique can depend upon level of missingness • Person-level missingness can be far more problematic, because you have no information about the nonrespondent
Practical Advice (Newman, 2009)
Missing Data Problems • Missing data reduce the sample size (low N) • More Sampling Error • Lower Statistical Power • Systematically missing data can lead to systematic over- or under-estimation of effect sizes • Bias in Parameter Estimates (mean, SD, corr.)
Parameter Estimates • Sample Estimate → Population Parameter
Purpose of Data Analysis • Data → Parameter Estimates • Data → Hypothesis Tests (p-values, standard errors)
Missing Data Problems • Missing Data → Parameter Estimates: Bias • Missing Data → Hypothesis Tests (p-values, standard errors): Low Power
Sampling Distribution • [Figure: sampling distribution of r with its standard error; a biased versus an unbiased distribution; a larger standard error widens the distribution and, relative to the .05 critical value, increases Type II error]
Missing Data Problems Two Major Missing Data Problems: • Bias in Effect Size estimates • Errors of Statistical Inference (p < .05?) • Low Power • Systematically missing data can create Inaccurate Standard Errors (and p-values)
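The link between a smaller N, a larger standard error, and lower power can be sketched numerically. This is not from the slides: it assumes the standard Fisher z approximation for a correlation, in which the standard error is 1/sqrt(N - 3), and the effect size (.30) and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_se(n: int) -> float:
    """Approximate standard error of a Fisher z-transformed correlation."""
    return 1.0 / math.sqrt(n - 3)


def approx_power(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-tailed power to detect a population correlation rho."""
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1 - alpha / 2)      # critical z for the test
    shift = math.atanh(rho) / fisher_z_se(n)      # expected test statistic
    upper = 1 - std_normal.cdf(crit - shift)      # reject in the upper tail
    lower = std_normal.cdf(-crit - shift)         # reject in the lower tail
    return upper + lower


# Hypothetical scenario: 200 people are surveyed, but only 100 respond.
for n in (200, 100):
    print(f"N={n}: SE={fisher_z_se(n):.3f}, power={approx_power(0.30, n):.3f}")
```

Halving N raises the standard error and lowers power even when the missingness is completely random. Bias is a separate problem that arises only when missingness is systematic.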
Missing Data Mechanisms • Missing Data can be missing: • Randomly • Systematically • But what does “Systematic” mean?
Missing Data Mechanisms 1) Random • Missing Completely at Random (MCAR) 2) Systematic (Rubin, 1976) • “Missing at Random” (MAR) • Missing Not at Random (MNAR)
Missing Data Mechanisms • MCAR – p(missing) is unrelated to all variables, observed and unobserved: p(missing|complete data) = p(missing) • MAR – p(missing) is related to observed variables [observed data] only: p(missing|complete data) = p(missing|observed data) • MNAR – p(missing) is related to the unobserved/missing variables [missing data]: p(missing|complete data) ≠ p(missing|observed data) • (see Schafer & Graham, 2002)
Missing Data Mechanisms • MCAR – Rmiss_Y is not related to X or Y • MAR – Rmiss_Y is related to X, but is not related to Y after controlling for X • MNAR – Rmiss_Y is related to Y, even after controlling for X • [Figure: path diagrams linking X, Y, and Rmiss_Y under MCAR, MAR, and MNAR]
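A small simulation can illustrate how the three mechanisms differ in their effect on the observed mean of Y. It is a sketch under assumptions that are not in the slides: standard-normal X and Y with a population correlation of .50, roughly half of Y deleted in each condition, and simple median cutoffs as the systematic deletion rules.

```python
import random
from statistics import mean, median

random.seed(2009)          # arbitrary seed, for reproducibility only
N = 20_000
RHO = 0.5                  # assumed population correlation between X and Y

x = [random.gauss(0, 1) for _ in range(N)]
y = [RHO * xi + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1) for xi in x]

x_cut, y_cut = median(x), median(y)

# Each list is Rmiss_Y: True where Y is missing.
r_mcar = [random.random() < 0.5 for _ in range(N)]   # unrelated to X and Y
r_mar = [xi > x_cut for xi in x]                     # depends on observed X only
r_mnar = [yi > y_cut for yi in y]                    # depends on Y itself


def observed_mean(values, is_missing):
    """Mean of the values that remain after deleting the missing ones."""
    return mean(v for v, m in zip(values, is_missing) if not m)


full_mean = mean(y)
mcar_mean = observed_mean(y, r_mcar)
mar_mean = observed_mean(y, r_mar)
mnar_mean = observed_mean(y, r_mnar)

print(f"complete data: {full_mean:.3f}")
print(f"MCAR:          {mcar_mean:.3f}")
print(f"MAR:           {mar_mean:.3f}")
print(f"MNAR:          {mnar_mean:.3f}")
```

In this setup the MCAR mean stays close to the complete-data mean, while the MAR and MNAR means are pulled downward because high scorers are the ones removed. Under MAR the bias arises only through X, which is observed, so a method that uses X can in principle correct it.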
Missing Data Mechanisms • Some missing data techniques (e.g., listwise deletion) assume missing data are MCAR • Some missing data techniques (e.g., maximum likelihood, multiple imputation) assume MAR • It is impossible to test whether missing data are MAR vs. MNAR, because we would need to compare observed values of Y against unobserved values of Y, and unobserved values of Y are unknown
Missing Data Mechanisms Practically speaking … • In the real world, MCAR almost never happens • One exception: “planned missingness” (Graham et al., 2006) • Most missingness falls on a continuum between MAR and MNAR
Missing Data Mechanisms Practically speaking … • Even though the MAR assumption may not be strictly met in practice, missing data techniques based on this assumption (e.g., Max. Likelihood, Mult. Imputation) can still provide less-biased, more powerful estimates
Missing Data Mechanisms Practically speaking … • An MNAR mechanism can begin to approximate an MAR mechanism if the researcher incorporates more observed variables (i.e., “auxiliary variables”) • (Collins et al., 2001; Graham, 2003)
Missing Data Techniques 1) Listwise Deletion 2) Pairwise Deletion 3) Ad Hoc Single Imputation 4) Multiple Imputation 5) Maximum Likelihood • (EM algorithm, FIML) 6) Sensitivity Analysis
Missing Data Techniques Listwise Deletion – deleting all cases (persons) for whom any data are missing, then proceeding with the analysis • This procedure converts item-level and scale-level missingness into person-level missingness!
Missing Data Techniques • Listwise Deletion: [Figure: incomplete data matrix before and after listwise deletion]
Missing Data Techniques Listwise Deletion • Unbiased under MCAR • But biased under systematic missingness (MAR & MNAR) • [Mean is biased, SD is biased, Correlation is biased] • Amount of bias depends on amount of missing data and strength of missingness mechanism (from completely random to strongly systematic) • Lowest power • Smallest N
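A minimal sketch of listwise deletion on a toy data set shows how it shrinks N. The variable names follow the slides' X1, X2, and Y, but the values and the `listwise_delete` helper are hypothetical, with `None` marking a missing response.

```python
# Hypothetical data: six persons, three variables, None = missing.
ROWS = [
    {"x1": 3,    "x2": 4,    "y": 5},
    {"x1": 2,    "x2": None, "y": None},
    {"x1": None, "x2": 1,    "y": 2},
    {"x1": 5,    "x2": 5,    "y": None},
    {"x1": 4,    "x2": 3,    "y": 3},
    {"x1": 1,    "x2": 2,    "y": 1},
]


def listwise_delete(rows):
    """Keep only the persons who have no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = listwise_delete(ROWS)

# Number of observed values per variable before deletion.
n_observed = {
    var: sum(row[var] is not None for row in ROWS) for var in ROWS[0]
}

print("observed per variable:", n_observed)
print("N after listwise deletion:", len(complete))
```

Each variable is observed for four or five persons, yet only three persons survive listwise deletion, which is why the technique has the smallest N and the lowest power.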
Missing Data Techniques Pairwise Deletion – calculating summary estimates (e.g., means, SDs, correlations) using all available cases (persons) who provided data relevant to each estimate, then proceeding with analysis based on these summary estimates • Different correlations are based on different (partly overlapping) subsamples!
Missing Data Techniques • Pairwise Deletion: [Figure: incomplete data matrix highlighting the cases that contribute to the mean & SD of X1, the mean & SD of Y, the correlation of X2 & X3, and the correlation of X3 & Y]
Missing Data Techniques Pairwise Deletion • Unbiased under MCAR • But still biased under MAR & MNAR • Usually less biased than listwise deletion • Sometimes covariance matrix is not positive definite • Different correlations represent different population mixtures
Missing Data Techniques Pairwise Deletion • More power than listwise • Different Ns for different correlations, so no single N makes sense for the whole correlation matrix • minimum N => SEs too big • mean N => some SEs too big, some SEs too small • harmonic mean N => same problem as mean N
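The "different Ns for different correlations" point can be seen in a short sketch of pairwise deletion. The data values and the `pairwise_r` helper are hypothetical, with `None` marking a missing response.

```python
from itertools import combinations
from statistics import mean

# Hypothetical data: six persons, three variables, None = missing.
ROWS = [
    {"x1": 3,    "x2": 4,    "y": 5},
    {"x1": 2,    "x2": None, "y": None},
    {"x1": None, "x2": 1,    "y": 2},
    {"x1": 5,    "x2": 5,    "y": None},
    {"x1": 4,    "x2": 3,    "y": 3},
    {"x1": 1,    "x2": 2,    "y": 1},
]


def pairwise_r(rows, a, b):
    """Pearson r for variables a and b, using every person observed on both."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    mean_a = mean(p[0] for p in pairs)
    mean_b = mean(p[1] for p in pairs)
    s_ab = sum((pa - mean_a) * (pb - mean_b) for pa, pb in pairs)
    s_aa = sum((pa - mean_a) ** 2 for pa, _ in pairs)
    s_bb = sum((pb - mean_b) ** 2 for _, pb in pairs)
    return len(pairs), s_ab / (s_aa * s_bb) ** 0.5


results = {
    (a, b): pairwise_r(ROWS, a, b)
    for a, b in combinations(["x1", "x2", "y"], 2)
}

for (a, b), (n, r) in results.items():
    print(f"r({a}, {b}) = {r:.3f} based on N = {n}")
```

Each correlation comes from a different, partly overlapping subsample, so the matrix has no single N to report. Nothing guarantees that the resulting matrix is positive definite.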
Missing Data Techniques Ad hoc single imputation – replacing each missing datum with a “good guess” • Mean imputation (i.e., mean(across persons)) • Hot deck imputation • Regression imputation
Missing Data Techniques Ad hoc single imputation • Mean imputation (i.e., mean(across persons)) – replacing each missing datum with the group mean for the corresponding variable • Hot deck imputation - replacing each missing datum with a value from a “donor” who has similar scores on other variables • Regression imputation - replacing each missing datum with a predicted value based on a multiple regression equation derived from observed cases
Missing Data Techniques Ad hoc single imputation • Mean imputation (i.e., mean(across persons)) – underestimates variance and correlation • Hot deck imputation – using “donors” increases error—worse than regression imputation • Regression imputation – using predicted values underestimates variance and can bias the correlation
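The variance and correlation claims for mean and regression imputation can be checked with a simulation. This sketch rests on assumptions not stated in the slides: standard-normal X and Y with a population correlation of .50, 40% of Y deleted completely at random, and a single predictor in the regression. Hot deck imputation is not shown.

```python
import random
from statistics import mean, variance

random.seed(48)            # arbitrary seed, for reproducibility only
N, RHO = 5_000, 0.5        # assumed sample size and population correlation

x = [random.gauss(0, 1) for _ in range(N)]
y = [RHO * xi + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.4 for _ in range(N)]   # MCAR missingness on Y


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / (s_aa * s_bb) ** 0.5


# Cases with Y observed.
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Mean imputation: every missing Y gets the observed mean of Y.
y_bar = mean(y_obs)
y_mean_imp = [y_bar if m else yi for yi, m in zip(y, missing)]

# Regression imputation: every missing Y gets its predicted value from a
# simple regression of Y on X, estimated on the observed cases.
slope = pearson(x_obs, y_obs) * (variance(y_obs) / variance(x_obs)) ** 0.5
intercept = y_bar - slope * mean(x_obs)
y_reg_imp = [intercept + slope * xi if m else yi
             for xi, yi, m in zip(x, y, missing)]

print(f"observed cases:  var(Y)={variance(y_obs):.3f}  r={pearson(x_obs, y_obs):.3f}")
print(f"mean imputation: var(Y)={variance(y_mean_imp):.3f}  r={pearson(x, y_mean_imp):.3f}")
print(f"regr imputation: var(Y)={variance(y_reg_imp):.3f}  r={pearson(x, y_reg_imp):.3f}")
```

In this setup both methods shrink the variance of Y, because the imputed values carry no residual variation. Mean imputation also attenuates the correlation, while regression imputation inflates it, because the imputed points fall exactly on the regression line.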