Missing Values Adapting to missing data
Sources of Missing Data • People refuse to answer a question • Responses are indistinct or ambiguous • Numeric data are obviously wrong • Broken objects cannot be measured • Equipment failure or malfunction • Detailed analysis of subsample
Assumptions 1 • Missing Completely at Random • probability of data missing on X is unrelated to the value of X or to values on other variables in data set • Missing at Random • the probability of missing data on X is unrelated to the value of X after controlling for other variables in the analysis
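The difference between the two assumptions can be illustrated with a small simulation. This is only a sketch: the variables `age` and `income`, the 30% missing rate, and the logistic missingness rule are made-up assumptions, not part of the original slides.

```r
set.seed(1)
n <- 1000
age    <- rnorm(n, mean = 40, sd = 10)          # always observed
income <- 20 + 0.5 * age + rnorm(n, sd = 5)     # will receive missing values

# MCAR: every income value has the same 30% chance of being missing
income_mcar <- ifelse(runif(n) < 0.3, NA, income)

# MAR: the chance that income is missing depends only on the observed age
p_miss     <- plogis((age - 40) / 5)
income_mar <- ifelse(runif(n) < p_miss, NA, income)

# Compare the mean of the observed values under each mechanism
c(full = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar,  na.rm = TRUE))
```

Under the MAR rule used here, older cases are more often missing, so the mean of the observed incomes drifts away from the full-data mean unless age is taken into account.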
Assumptions 2 • Ignorable • MAR plus parameters governing missing data process unrelated to parameters being estimated • Nonignorable • If not MAR, missing data mechanism must be modeled to get good estimates of parameters
Methods • Listwise Deletion • Pairwise Deletion • Dummy Variable Adjustment • Imputation
Listwise Deletion 1 • Delete any cases (rows) with missing data • Can be used for any statistical analysis • No special computational methods • If data are MCAR, the complete cases are a random subsample of the full data set, so estimates based on them are unbiased
Listwise Deletion 2 • If data are MAR, can produce biased estimates if missing values in the independent variables depend on the dependent variable • Main issue is the loss of observations and the increase in standard errors (meaning a decrease in the power of the test)
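A minimal sketch of listwise deletion in R, using a small made-up data frame (the values are illustrative only):

```r
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(2, NA, 6, 8, 10),
                z = c(1, 1, 2, 2, 3))

# Keep only the rows with no missing value in any column
complete <- d[complete.cases(d), ]

nrow(d)          # number of cases before deletion
nrow(complete)   # number of cases after deletion

# na.omit() performs the same deletion in one step
nrow(na.omit(d)) == nrow(complete)
```

Two missing values spread over two rows are enough to remove two of the five cases, which is the loss of observations the slide refers to.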
Listwise Deletion 3 • In anthropology listwise deletion often includes removal of variables (columns) as well as cases (rows) • Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values
Pairwise Deletion 1 • Compute means using available data and covariances using cases with observations for the pair being computed • Uses more of the data • If MCAR, reasonably unbiased estimates, but if MAR, estimates may be seriously biased
Pairwise Deletion 2 • Covariance/Correlation matrix may be singular or not positive definite, because each entry is computed from a different subset of cases • Less of an issue with distance matrices
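The matrix problem can be shown with a deliberately extreme, made-up data set in which each pair of variables is observed on a different set of cases. It is a constructed illustration, not a typical data set.

```r
# x and y are observed together on cases 1-3, y and z on cases 4-6,
# and x and z on cases 7-9; no case is complete on all three variables
d <- data.frame(
  x = c(1, 2, 3, NA, NA, NA, 1, 2, 3),
  y = c(1, 2, 3, 1, 2, 3, NA, NA, NA),
  z = c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
)

r <- cor(d, use = "pairwise.complete.obs")
r

# A valid correlation matrix has no negative eigenvalues
eigen(r, symmetric = TRUE)$values
```

Here x and y correlate perfectly, y and z correlate perfectly, yet x and z correlate perfectly in the opposite direction. No complete data set could produce that pattern, and the negative eigenvalue reflects it.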
Dummy Variable • Create variable to flag observations missing on a particular variable • Used in regression analysis but provides biased estimators
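The mechanics of the dummy variable adjustment can be sketched as follows. The data are simulated and the names `miss_x` and `x_fill` are hypothetical. The sketch shows only how the flag is built and used, not whether the resulting estimates are trustworthy.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[sample(n, 50)] <- NA                  # remove 50 values of the predictor

miss_x <- as.integer(is.na(x))          # flag: 1 where x is missing, else 0
x_fill <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # fill with a constant

# Regress on the filled predictor plus the missingness flag
fit <- lm(y ~ x_fill + miss_x)
coef(fit)
```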
Imputation • Replace missing values with an estimate: • Mean for that variable – biased estimates of variances and covariances • Multiple regression to predict value – complicated when several variables contain missing values, and can still lead to underestimated standard errors
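Why mean imputation biases the variance can be seen in a short sketch with simulated values (the sample size and the number of missing values are arbitrary assumptions):

```r
set.seed(3)
x <- rnorm(100, mean = 50, sd = 10)
x_miss <- x
x_miss[sample(100, 30)] <- NA            # remove 30 values

# Replace each missing value with the mean of the observed values
x_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)

# The mean is preserved, but the variance shrinks
c(mean_observed = mean(x_miss, na.rm = TRUE), mean_imputed = mean(x_imp))
c(var_observed  = var(x_miss, na.rm = TRUE),  var_imputed  = var(x_imp))
```

The imputed values add nothing to the sum of squared deviations but do add to the number of cases, so the variance of the filled-in variable is always smaller than the variance of the observed values.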
Maximum Likelihood • Try to reconstruct the complete data set by selecting values that would maximize the probability of observing the actually observed data • Categorical and continuous data • Expectation-maximization algorithm gives estimates of means and covariances
Expectation Maximization • Iterative steps of expectation and maximization to produce estimates that converge on the ML estimates • These estimates will generally underestimate the standard errors in regression and other statistical models
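The expectation and maximization steps can be sketched in base R for the simplest case: two variables assumed to be bivariate normal, with x fully observed and y missing for some cases. The simulated data, the starting values, and the convergence tolerance are all assumptions chosen for illustration, and dedicated packages should be preferred for real work.

```r
set.seed(4)
n <- 300
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.6)
miss <- runif(n) < plogis(x)   # MAR: y is more often missing when x is large
y[miss] <- NA

# Starting values from the available data
mu <- c(mean(x), mean(y, na.rm = TRUE))
S  <- matrix(c(var(x), 0, 0, var(y, na.rm = TRUE)), 2, 2)

for (iter in 1:200) {
  # E-step: expected y and y^2 for the missing cases, given x and current estimates
  b   <- S[1, 2] / S[1, 1]                  # slope of y on x
  res <- S[2, 2] - S[1, 2]^2 / S[1, 1]      # residual variance of y given x
  ey  <- ifelse(miss, mu[2] + b * (x - mu[1]), y)
  ey2 <- ifelse(miss, ey^2 + res, y^2)

  # M-step: re-estimate means and covariances from the completed sufficient statistics
  mu_new <- c(mean(x), mean(ey))
  sxy    <- mean(x * ey) - mu_new[1] * mu_new[2]
  S_new  <- matrix(c(mean(x^2) - mu_new[1]^2, sxy,
                     sxy, mean(ey2) - mu_new[2]^2), 2, 2)

  delta <- max(abs(mu_new - mu), abs(S_new - S))
  mu <- mu_new
  S  <- S_new
  if (delta < 1e-8) break                   # stop once the estimates settle
}

mu   # estimated means of x and y
S    # estimated covariance matrix (maximum likelihood, divisor n)
```

Note that the loop returns point estimates only. It gives no standard errors, which is one reason the estimates cannot simply be plugged into a regression as if they came from complete data.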
Multiple Imputation 1 • Has the same optimal properties as ML but several advantages • Can be used with any kind of data and any kind of statistical model • But produces multiple estimates which must be combined • Random component used to give unbiased estimates
Multiple Imputation 2 • Multivariate normal model (relatively resistant to deviations from normality) • Each variable represented as a linear function of the other variables • Methods • Data Augmentation, package norm • Sampling Importance/Resampling, package Amelia
Multiple Imputation 3 • Categorical data, multinomial model, package cat • Categorical and interval/ratio data, package mix • Also can use multivariate normal models with dummy variables
Multiple Imputation 4 • Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values, package Hmisc, function aregImpute
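The idea behind predictive mean matching can be sketched in base R. This is a simplified, single-imputation illustration on simulated data and is not how aregImpute is implemented; the number of donors `k` is a hypothetical choice.

```r
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
y_obs <- y
y_obs[sample(n, 20)] <- NA
miss <- is.na(y_obs)

# Regression fitted on the complete cases, then predictions for every case
fit  <- lm(y_obs ~ x)
pred <- predict(fit, newdata = data.frame(x = x))

k <- 5                                   # number of candidate donors (assumption)
y_imp <- y_obs
for (i in which(miss)) {
  # Distance between this case's prediction and each complete case's prediction
  d <- abs(pred[!miss] - pred[i])
  donors <- which(!miss)[order(d)[1:k]]  # the k closest complete cases
  y_imp[i] <- y_obs[sample(donors, 1)]   # borrow one donor's actual value
}

summary(y_imp)
```

Because every imputed value is an actually observed value, the imputations stay within the range of the real data.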
Analysis • The analysis is run on each imputed data set and the estimates (e.g. regression coefficients) are combined • Packages such as Zelig provide ways of combining the results for generalized linear models
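The combining step is usually done with what are often called Rubin's rules: average the estimates, and add the between-imputation variance to the average within-imputation variance. The function name `pool_rubin` and the numbers below are hypothetical, used only to show the arithmetic.

```r
# Combine one parameter estimated on m imputed data sets
pool_rubin <- function(est, se) {
  m     <- length(est)
  qbar  <- mean(est)                 # pooled estimate
  W     <- mean(se^2)                # average within-imputation variance
  B     <- var(est)                  # between-imputation variance
  total <- W + (1 + 1 / m) * B       # total variance
  c(estimate = qbar, se = sqrt(total))
}

# Hypothetical slope estimates and standard errors from m = 5 imputed data sets
est <- c(1.42, 1.51, 1.38, 1.47, 1.55)
se  <- c(0.20, 0.21, 0.19, 0.20, 0.22)

pooled <- pool_rubin(est, se)
pooled
```

The pooled standard error is larger than the average of the individual standard errors because it also carries the uncertainty due to the missing values.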
Missing Data with R 1 • NA is used to identify a missing value • is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10)) • na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active Data set | Remove cases with missing values)
Missing Data with R 2 • Some functions (e.g. mean, sum, sd) have an na.rm= option. TRUE means remove missing values before computing; FALSE (the default for these functions) means do not remove them, so that the function returns NA if there are missing values.
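A short sketch of the na.rm= behaviour, reusing the vector from the is.na() example:

```r
x <- c(1:4, NA, 6:10)

is.na(x)                # TRUE only in the fifth position
sum(is.na(x))           # number of missing values

mean(x)                 # NA, because na.rm defaults to FALSE
mean(x, na.rm = TRUE)   # mean of the nine observed values
```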
Missing Data in R 3 • Other functions (e.g. lm, princomp, glm) have an na.action= option that can be set to one of the following: na.fail to have the analysis fail, or na.omit or na.exclude to remove cases with missing values (na.exclude also pads residuals and predictions with NA for the removed cases)
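The three na.action= settings can be compared on a small made-up data frame with one missing response:

```r
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(2.1, NA, 6.2, 7.9, 10.1, 12.2))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# Same coefficients, but residuals differ in length
length(residuals(fit_omit))   # one per case actually used
residuals(fit_excl)           # one per original case, NA for the dropped case

# na.fail stops with an error instead of dropping cases
failed <- inherits(try(lm(y ~ x, data = d, na.action = na.fail), silent = TRUE),
                   "try-error")
failed
```

na.exclude is convenient when residuals or fitted values need to be added back to the original data frame, because the rows still line up.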
Missing Data in R 4 • Other functions (e.g. cor, cov, var) have a use= option: • everything (NA’s propagate) • all.obs (NA causes error) • complete.obs (delete cases with NA’s; error if no complete cases remain) • na.or.complete (delete cases with NA’s; returns NA if no complete cases remain) • pairwise.complete.obs (complete pairs of observations)
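The effect of the use= option can be seen on a small made-up data frame in which two variables each have one missing value:

```r
d <- data.frame(a = c(1, 2, 3, 4, NA),
                b = c(2, 4, NA, 8, 10),
                c = c(5, 3, 4, 1, 2))

r_all <- cor(d, use = "everything")             # NA for pairs involving a or b
r_cc  <- cor(d, use = "complete.obs")           # only rows 1, 2 and 4 are used
r_pw  <- cor(d, use = "pairwise.complete.obs")  # each pair uses its own rows

r_all
r_cc
r_pw

# all.obs refuses to run when any value is missing
errored <- inherits(try(cor(d, use = "all.obs"), silent = TRUE), "try-error")
errored
```

The a and c correlation differs between the last two matrices because pairwise deletion can use row 3, where only b is missing, while listwise deletion cannot.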
Example 1 • ErnestWitte data set has missing values among the 242 cases and 38 variables • Using R to remove all cases with missing values reduces the number of cases to 52! • If we don’t need all of the variables we can retain more cases
Example 2 • Total NA’s in ErnestWitte (815) • sum(is.na(ErnestWitte)) • Check missing values by variable: • sort(apply(ErnestWitte, 2, function(x) sum(is.na(x))), decreasing=TRUE) • Looking has 171, SkullPos 126, Depos 112 • Removing these gives 112 cases
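The same workflow can be sketched on a small simulated data frame. The ErnestWitte data are not reproduced here, so the variables and counts below are made up; `colSums(is.na())` is used as a shorter equivalent of the `apply()` call.

```r
set.seed(6)
d <- data.frame(id = 1:20, a = rnorm(20), b = rnorm(20), c = rnorm(20))
d$a[sample(20, 12)] <- NA     # one heavily missing variable
d$b[sample(20, 2)]  <- NA
d$c[sample(20, 1)]  <- NA

sum(is.na(d))                                    # total number of NA's
na_by_var <- sort(colSums(is.na(d)), decreasing = TRUE)
na_by_var                                        # NA's by variable, worst first

n_all <- sum(complete.cases(d))                  # complete cases, all variables

# Drop the variable with the most NA's, then count complete cases again
keep      <- setdiff(names(d), names(na_by_var)[1])
n_reduced <- sum(complete.cases(d[, keep]))

c(all_variables = n_all, after_dropping_worst = n_reduced)
```

Dropping a variable can never reduce the number of complete cases, so the trade-off is between how many variables and how many cases the analysis retains.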
Multiple Imputation with R • A wide variety of options: • Packages norm, cat, mix • Package Amelia • Package mi (relatively new, but flexible)