Missing Data
What is missing? Missing data are unavoidable, and the concept is more encompassing than the usual association of the term. What can be missing? • Cases • Variables • Values
Missing cases - 1 Too few cases • Here, missing data means not enough data due to the ‘curse of dimensionality’. • N must increase rapidly as you add variables if you want to maintain even coverage of the space of explanatory variables: • If 1 variable requires N=10, then . . . • 2 variables need N=10×10=100, • 3 variables need N=10×10×10=1000, • D variables need N=10^D.
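The growth in required cases can be tabulated with a few lines of Python. This is only a sketch of the slide's arithmetic; the function name `cases_needed` and the default of 10 cases per variable are illustrative assumptions.

```python
def cases_needed(n_variables: int, cases_per_variable: int = 10) -> int:
    """Cases needed for even coverage: cases_per_variable ** n_variables."""
    return cases_per_variable ** n_variables


# Required N grows exponentially with the number of explanatory variables D.
for d in range(1, 7):
    print(f"D={d}: N={cases_needed(d):,}")
```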
Missing cases - 1 Too few cases – ctd • But remember Gelman’s Observation • We do not have as much data as we would like for our research question. But if we had more data we would try to fit a more complicated model. And then we would again not have as much data as we would like for our research question...
Missing cases - 2 Sampling and Descriptive Inference • We are interested in some parameter (say µ) describing a population (size N) • We only have observations of cases from a random sample (size n) • Missing cases: (N - n) • However, the sample mean is a consistent and unbiased estimator of µ • Cost of missing data: uncertainty about the exact value of the population parameter (expressed in the confidence interval of the estimate)
Missing cases - 3 Prediction • If we are interested in a particular element in a population, which is not (yet) observed, we have a missing cases problem that can be addressed by prediction. • Prediction of the value of an element is based on estimating the relevant population parameter (e.g., µ) and expressing the uncertainty in terms of the standard error of the estimate, which combines the uncertainty generated by variation in the population with the uncertainty generated by estimation.
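A standard way to write this combination is given below. It assumes a simple random sample of size n, the sample mean used as the prediction for one new element, and a sample standard deviation s; these symbols are introduced here for illustration.

```latex
\mathrm{SE}_{\text{pred}}
  = \sqrt{\underbrace{s^{2}}_{\text{variation in the population}}
        + \underbrace{\frac{s^{2}}{n}}_{\text{estimation of } \mu}}
  = s\,\sqrt{1 + \frac{1}{n}}
```

As n grows, the estimation term vanishes, but the population term does not: the prediction for a single element never becomes more certain than s.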
Missing cases - 4 Causal Inference • Causal inference (about the effect of a factor X) involves the comparison of observations (where X is present) with counterfactual ‘observations’ where X is absent (see King, Keohane and Verba, 1994: 75-84) • In this situation, half the required cases are missing, and unavoidably so because they pertain to a counterfactual
Missing cases - 4 Causal Inference ctd • Practice: compare observed cases where X is present with other observed cases where X is absent: • Involves assumption of unit homogeneity • If possible: condition on relevant factors or match
Missing cases - 5 Inaccessibility for Observation Particular cases which one would like to observe turn out to be unobservable (at least with the data collection methods chosen): • Documents are classified • Crimes/accidents are unreported • People cannot be interviewed (cannot be found / refuse / other causes) • Particularly problematic if unobservable cases differ systematically from other ones, resulting in selection bias in the observations
Missing cases - 6 Selection Bias • Inaccessibility for observation is often related to variables of interest: • Classified documents pertain to particularly interesting cases • Politically uninterested and cynical people are less likely to consent to being interviewed • Economic sanctions are only imposed where they are expected to have effect • Particular kinds of crimes go unreported because the victims feel ashamed or embarrassed (e.g., blackmail)
Missing cases - 6 Selection Bias ctd Selection bias has pernicious consequences for • Descriptive inference (biased estimates of frequency) • Causal inference (biased estimates of effects, see King, Keohane and Verba 1994, xx-xx).
Missing variables – 1 Latent variables Latent variables are always missing. They can often be estimated in situations of multiple-item operationalization and the use of measurement models such as factor analysis and IRT – see Measurement clinic.
Missing variables – 2 Manifest variables Missing manifest variables: • under-coverage of elaborated concepts (yielding validity problems) • absent additional operationalizations which would allow the estimation of latent variables (and partial tests of validity assumptions) • use of ‘container’ measures • absent independent and control variables required in analysis stages • Diminish the problem by creative use of proxy variables (including instrumental variables), strategic use of secondary analyses, and possibly by strategies of (synthetic) data linking
Missing values This is the common association with ‘missing data’: for some of the cases observations are missing for some of the variables • ‘Swiss cheese’ analogy • This is the situation that ‘methods for dealing with missing data’ refer to, but these methods do not deal with (completely) missing cases and (completely) missing variables.
Why worry? • Practicalities of data-analysis: Most methods require complete data. When data are missing, software makes the data complete one way or another; you had better know how, and what consequences this may have. The simplest ‘solution’ is deletion of cases with missing values. • Quality of substantive findings: The manner of handling missing data has consequences for bias, consistency and precision of inferences. • Cost/benefit considerations: Data required resources (money, time, effort) to be collected and constructed; there is no compelling rationale for not using them optimally (hence one should be wary of deleting available information).
Modelling and missing data • Data-analysis and modelling are done on empirical data • Data = f (SER, MDGP), where SER: system of empirical relationships; MDGP: missing data generating process • SER is the object of our substantive interest; adequately modelling SER from data thus also requires modelling MDGP; failure to do so may lead to inferential errors about SER
Types of missing - MCAR • MCAR (missing completely at random): the MDGP is independent of any of the observed variables and independent of the SER • (usually) data entry errors: neither case attributes (variables X1 to Xk), nor their (unknown) scores on the variable with missing data (Y) predict missingness • instrument rotation: for each case determine randomly which version of an instrument to use: the probability of using a particular version is independent of the X’s or Y.
MCAR • If MCAR missings are deleted: • Inferences are unbiased • Inferences are less precise (due to smaller # of cases) • But, MCAR is uncommon in actual practice
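A small simulation can illustrate both points about deleting MCAR missings. It is a sketch on made-up data: the population (normal with mean 50 and standard deviation 10) and the 30% missingness rate are arbitrary assumptions.

```python
import random
import statistics

random.seed(1)

# Hypothetical variable Y for 20,000 cases.
y = [random.gauss(50, 10) for _ in range(20_000)]

# MCAR: every value has the same 30% chance of being missing,
# regardless of Y or of any other variable.
observed = [v for v in y if random.random() > 0.30]

full_mean = statistics.fmean(y)
mcar_mean = statistics.fmean(observed)

# The standard error of the mean grows because fewer cases remain.
se_full = statistics.stdev(y) / len(y) ** 0.5
se_mcar = statistics.stdev(observed) / len(observed) ** 0.5

print(f"mean, all cases:      {full_mean:.2f} (SE {se_full:.3f})")
print(f"mean, after deletion: {mcar_mean:.2f} (SE {se_mcar:.3f})")
```

The two means should agree closely, while the standard error after deletion is larger.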
Types of missing - MAR • MAR (missing at random): missingness (I) is random after conditioning on X (observed variables), which implies that I is random within groups (or sub-populations) defined on X • Implies that missing values can be (partly) predicted from observed values on other variables (as long as there are sufficient cases which have valid scores on both X and Y)
MAR (grey: missing on D2 but observed on D1) • NB: the dependency on X (D1 in the graph) will generally not be as deterministic as depicted here
MAR • Ignoring missing data will lead to biased estimates (the mean of the black dots underestimates the mean of all dots on D2) • But the distribution of D1 is known, as well as the relationship between D1 and D2. From this a correct estimate of the mean of D2 can be obtained • Using information about D1 and about the relationship between D1 and D2 makes the missing data MAR, and allows a correct estimate of the mean of D2.
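The argument above can be sketched with simulated data. Everything in the example is an assumption made for illustration: the linear relationship between D1 and D2, the missingness probabilities, and the use of a least-squares line fitted on the complete cases as the 'relationship between D1 and D2'.

```python
import random
import statistics

random.seed(2)
n = 20_000

d1 = [random.gauss(0, 1) for _ in range(n)]
d2 = [10 + 2 * x + random.gauss(0, 1) for x in d1]

# MAR: the chance that D2 is missing depends only on the observed D1
# (high D1 -> D2 missing more often), not on D2 itself given D1.
seen = [random.random() > (0.8 if x > 0 else 0.1) for x in d1]

cc_x = [x for x, s in zip(d1, seen) if s]
cc_y = [y for y, s in zip(d2, seen) if s]

# Least-squares line fitted on the complete cases only.
mx, my = statistics.fmean(cc_x), statistics.fmean(cc_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(cc_x, cc_y))
         / sum((x - mx) ** 2 for x in cc_x))
intercept = my - slope * mx

true_mean = statistics.fmean(d2)   # unknowable in practice
naive_mean = my                    # ignores the missing data
# Combine the full D1 distribution with the D1-D2 relationship.
adjusted_mean = intercept + slope * statistics.fmean(d1)

print(f"true {true_mean:.2f}  naive {naive_mean:.2f}  adjusted {adjusted_mean:.2f}")
```

In this setup the naive mean underestimates the mean of D2, while the adjusted estimate comes close to it.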
Types of missing - NMAR • NMAR: not missing at random • The probability that a value is missing depends on the true, but unknown missing value
NMAR (grey: missing on D2 but observed on D1) • NB: dependency on Y (D2 in the graph) will generally not be as deterministic as depicted here
NMAR • Ignoring missing data will lead to biased estimates (the mean of the black dots underestimates the mean of all dots on D2) • Knowledge about the distribution of D1 does not help to solve this; it does not make the missing data MAR. • The only hope in this kind of situation is that variables other than D1 may help to turn the missing data into MAR.
MAR/NMAR Mixture As in selection-bias situations, the selection (on D2 or Y) generated by the missing cases results in a biased estimate of the relation between D1 and D2
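A simulated contrast, under assumptions chosen purely for illustration (linear relation between D1 and D2, missingness probabilities of 0.8 and 0.1), shows that a regression adjustment on D1 does not remove the bias when missingness depends on D2 itself.

```python
import random
import statistics

random.seed(5)
n = 20_000

d1 = [random.gauss(0, 1) for _ in range(n)]
d2 = [10 + 2 * x + random.gauss(0, 1) for x in d1]

# NMAR: the chance that D2 is missing depends on the (unobserved) value of D2.
seen = [random.random() > (0.8 if y > 10 else 0.1) for y in d2]

cc_x = [x for x, s in zip(d1, seen) if s]
cc_y = [y for y, s in zip(d2, seen) if s]

# Same regression adjustment that works under MAR.
mx, my = statistics.fmean(cc_x), statistics.fmean(cc_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(cc_x, cc_y))
         / sum((x - mx) ** 2 for x in cc_x))
intercept = my - slope * mx

true_mean = statistics.fmean(d2)   # unknowable in practice
naive_mean = my
adjusted_mean = intercept + slope * statistics.fmean(d1)

print(f"true {true_mean:.2f}  naive {naive_mean:.2f}  adjusted {adjusted_mean:.2f}")
print(f"slope on complete cases: {slope:.2f} (true slope is 2)")
```

Here both the adjusted mean and the slope estimated from the complete cases remain biased, because the selection operates on D2.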
NMAR into MAR • The problems generated by NMAR missing values are not an inherent characteristic of the empirical world, but of our data and our imagination • Additional variables and sensible proxies that are systematically correlated with the variable with NMAR missings may turn those missings into MAR (if no such variables existed, missing values would be MCAR) • Hence the value of (simultaneously) looking at all other possible variables, rather than just a few
What to do? Data deletion strategies: unless MCAR, will generally bias estimates; always inefficient (loss of precision/power) • Pairwise deletion: to be discouraged. May lead to inconsistent results (e.g., correlation/covariance matrices that are not positive definite) • Listwise deletion (aka Complete Case Analysis): except in the case of very few missing values, the cumulation of deleted cases may be enormous
What to do? - 2 ‘Working around’ strategies • Full Information Maximum Likelihood (FIML) integrates out the missing data when fitting the desired model • Requires particular assumptions (e.g., multivariate normality) • Available in a restricted form in the SPSS MVA procedure
What to do? - 3 Imputation strategies all consist of replacing a missing value with an estimate of the actual value for that case • ‘hot-deck’ and ‘cold-deck’ • Mean imputation • EM procedures • Regression mean imputation • Multiple imputation
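How quickly listwise deletion accumulates can be shown with a back-of-the-envelope calculation. The sketch assumes, for simplicity, that each variable is missing independently with the same probability; the function name `share_complete` and the 5% rate are illustrative.

```python
def share_complete(n_variables: int, p_missing: float) -> float:
    """Expected share of cases with no missing value on any variable,
    assuming each variable is missing independently with p_missing."""
    return (1 - p_missing) ** n_variables


# With only 5% missing per variable, few complete cases remain
# once an analysis involves many variables.
for k in (1, 5, 10, 20, 40):
    print(f"{k:>2} variables: {share_complete(k, 0.05):.1%} complete cases")
```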
Imputation - 1 • ‘hot-deck’ imputation consists of replacing the missing value by the observed value from another, similar case from the same dataset for which that variable was not missing. • Requires definition of ‘similar’ • Reifies the observed value from the donor case, tends to inflate precision • ‘cold-deck’ uses cases from another (but similar) dataset • Used to be popular amongst Census Bureaus
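A minimal hot-deck sketch follows. It assumes one possible definition of ‘similar’ (identical values on a set of matching variables) and picks the donor at random; the function `hot_deck` and the variable names are hypothetical.

```python
import random


def hot_deck(cases, target, match_on, rng=random):
    """Fill missing `target` values (None) with the value of a randomly
    chosen donor that has the same values on the `match_on` variables."""
    donors = {}
    for case in cases:
        if case[target] is not None:
            key = tuple(case[v] for v in match_on)
            donors.setdefault(key, []).append(case[target])

    completed = []
    for case in cases:
        filled = dict(case)
        if filled[target] is None:
            key = tuple(case[v] for v in match_on)
            if key in donors:  # stays missing if no similar donor exists
                filled[target] = rng.choice(donors[key])
        completed.append(filled)
    return completed


cases = [
    {"region": "north", "employed": True, "income": 31},
    {"region": "north", "employed": True, "income": None},
    {"region": "south", "employed": False, "income": 12},
    {"region": "south", "employed": False, "income": None},
    {"region": "south", "employed": True, "income": None},
]
result = hot_deck(cases, "income", ["region", "employed"])
for row in result:
    print(row)
```

The last case has no similar donor and stays missing, which shows why the definition of ‘similar’ matters. The imputed incomes are copies of observed ones and are treated as if they were real observations.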
Imputation - 2 • Mean imputation consists of replacing the missing value by the mean of the variable in question
Imputation - 3 Mean imputation • Is still offered as an option in many analysis procedures of statpacks (e.g., SPSS: regression, factor analysis) • From previous slide: leads generally to severe bias • General advice: do not do this!
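The bias can be illustrated on simulated data. The sketch uses the most favourable case (40% of Y deleted completely at random) and invented variables X and Y; the helper `correlation` is defined here only for the example.

```python
import random
import statistics

random.seed(3)
n = 10_000

x = [random.gauss(0, 1) for _ in range(n)]
y = [x_i + random.gauss(0, 1) for x_i in x]

# Delete 40% of the Y values completely at random.
missing = [random.random() < 0.40 for _ in range(n)]
observed_mean = statistics.fmean(v for v, m in zip(y, missing) if not m)

# Mean imputation: every missing Y gets the same value.
y_imputed = [observed_mean if m else v for v, m in zip(y, missing)]


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5


print(f"variance of Y: true {statistics.variance(y):.2f}, "
      f"after mean imputation {statistics.variance(y_imputed):.2f}")
print(f"corr(X, Y):    true {correlation(x, y):.2f}, "
      f"after mean imputation {correlation(x, y_imputed):.2f}")
```

Even under MCAR, the variance of Y and its correlation with X are both understated after mean imputation.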
Imputation - 4 • Expectation Maximization (EM) procedures: procedure for arriving at the best point estimates of the true values, given the model (which itself is estimated on the basis of the imputed missings) • Procedure does not take account of uncertainty in the point estimates, therefore tends to inflate precision of estimates • Procedure assumes that the model is correct • Procedure increasingly available in statpacks, e.g., SPSS MVA
Imputation - 5 • Regression-mean imputation: replaces the missing value by the conditional regression mean (ŷ): • Estimate of slope unbiased • Precision inflated
Imputation – 6 • Regression-simulation imputation: replaces the missing value by ŷ + error, where error is a random draw from a distribution with the regression-derived residual variance • Estimate of slope unbiased • Inflation of precision much less, but • Reifies the single imputed value (and the certainty of the imputation process); solution: multiple imputation
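The difference between regression-mean and regression-simulation imputation can be sketched on simulated data. The data-generating model, the 40% MCAR missingness and the normally distributed error draw are assumptions made for the example; `fit_line` is a helper defined here.

```python
import random
import statistics

random.seed(4)
n = 10_000

x = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * x_i + random.gauss(0, 1) for x_i in x]
missing = [random.random() < 0.40 for _ in range(n)]  # MCAR for simplicity


def fit_line(xs, ys):
    """Return (intercept, slope, residual standard deviation) by least squares."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(xs, ys)]
    resid_sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd


# Fit the imputation model on the complete cases.
obs_x = [a for a, m in zip(x, missing) if not m]
obs_y = [b for b, m in zip(y, missing) if not m]
a0, b0, s0 = fit_line(obs_x, obs_y)

# Regression-mean imputation: y-hat only.
y_mean_imp = [a0 + b0 * xi if m else yi for xi, yi, m in zip(x, y, missing)]
# Regression-simulation imputation: y-hat plus a random residual draw.
y_sim_imp = [a0 + b0 * xi + random.gauss(0, s0) if m else yi
             for xi, yi, m in zip(x, y, missing)]

_, slope_mean, sd_mean = fit_line(x, y_mean_imp)
_, slope_sim, sd_sim = fit_line(x, y_sim_imp)

print(f"complete cases:        slope {b0:.3f}, residual sd {s0:.3f}")
print(f"regression-mean:       slope {slope_mean:.3f}, residual sd {sd_mean:.3f}")
print(f"regression-simulation: slope {slope_sim:.3f}, residual sd {sd_sim:.3f}")
```

Regression-mean imputation reproduces the slope but shrinks the residual spread, because every imputed point lies exactly on the line. Adding the random error restores the spread, but each imputed value is still a single draw treated as if it were observed.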
Imputation - 7 Multiple imputation: rather than a single imputed value, multiple ones are (stochastically) derived from a prediction equation. Each is, in principle, as good as any other one, yet they are not the same. • King et al. (2001) recommend the creation of a number of different, imputed, datasets, on each of which the same model is fitted/estimated. Subsequently the parameters of interest are combined in an appropriate fashion.
Imputation - 8 • Software for multiple imputation: Amelia II, free download from http://gking.harvard.edu/amelia/ Requires that the statistical software package R is installed (also a free download, see http://www.r-project.org/) • Amelia lets you define the missing data model: variables to be used (at least those in subsequent analyses, but also any others that are thought to be predictive); specific data features (e.g., time dependencies, TSCS) • It assumes MAR (or NMAR turned into MAR via the specified variables) and multivariate normality • The software samples from the conditional distribution of the missings given the observed values of the other variables, which is equivalent to many simultaneous little regressions
Using multiple imputations • On each of the imputed datasets: run the same analysis • Let Q be the outcome of interest (parameter, mean, etc.); the results from the imputed datasets are then combined into a single estimate and standard error (more details in King et al 2001, 2009).
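The combination step can be sketched as follows, assuming the standard combining rules for multiple imputation (often called Rubin's rules). The combined estimate is the average of the m estimates. Its variance is the average squared standard error plus the variance of the estimates across datasets, the latter multiplied by (1 + 1/m). The function name `combine` and the numbers in the example are invented for illustration.

```python
import statistics


def combine(estimates, std_errors):
    """Combine m per-dataset estimates of Q and their standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)
    # Within-imputation variance: average squared standard error.
    within = statistics.fmean(se ** 2 for se in std_errors)
    # Between-imputation variance: spread of the m estimates (m - 1 denominator).
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return q_bar, total_variance ** 0.5


# Hypothetical results from m = 5 imputed datasets.
q, se = combine([0.52, 0.48, 0.55, 0.50, 0.45],
                [0.10, 0.11, 0.10, 0.09, 0.10])
print(f"combined estimate {q:.3f}, combined standard error {se:.4f}")
```

The combined standard error exceeds the average per-dataset standard error, which is how the uncertainty of the imputation itself enters the final inference.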
Literature • Allison, P. D. (2000) Multiple imputation for missing data, Sociological Methods and Research 28, pp.301-309. • Honaker, J. and King, G. (ms) What to do about missing values in time series cross-section data, Available at http://gking.harvard.edu/files/pr.pdf • Horton, N. J. and Kleinman (2007) Much Ado About Nothing, The American Statistician 61(1), pp.79–90. • King, G., Honaker, J., Joseph, A., and Scheve, K. (2001), Analyzing incomplete political science data, American Political Science Review 95, pp.49–69. • Little, R. J. A. and Rubin, D. B. (2002) Statistical Analysis With Missing Data (2nd ed.) Chichester: Wiley. • Schafer, J. L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman and Hall.