Todd D. Little University of Kansas Director, Quantitative Training Program

On the Merits of Planning and Planning for Missing Data* • *You’re a fool for not using planned missing data design Todd D. Little University of Kansas Director, Quantitative Training Program Director, Center for Research Methods and Data Analysis Director, Undergraduate Social and Behavioral Sciences Methodology Minor Member, Developmental Psychology Training Program crmda.KU.edu Workshop presented 5-23-2012 @ University of Turku, Finland Special Thanks to: Mijke Rhemtulla & Wei Wu crmda.KU.edu

Learn about the different types of missing data • Learn about ways in which the missing data process can be recovered • Understand why imputing missing data is not cheating • Learn why NOT imputing missing data is more likely to lead to errors in generalization! • Learn about intentionally missing designs Road Map crmda.KU.edu

Key Considerations • Recoverability • Is it possible to recover what the sufficient statistics would have been if there was no missing data? • (sufficient statistics = means, variances, and covariances) • Is it possible to recover what the parameter estimates of a model would have been if there was no missing data. • Bias • Are the sufficient statistics/parameter estimates systematically different than what they would have been had there not been any missing data? • Power • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data? crmda.KU.edu

Types of Missing Data • Missing Completely at Random (MCAR) • No association with unobserved variables (selective process) and no association with observed variables • Missing at Random (MAR) • No association with unobserved variables, but maybe related to observed variables • Random in the statistical sense of predictable • Non-random (Selective) Missing (MNAR) • Some association with unobserved variables and maybe with observed variables crmda.KU.edu

Effects of imputing missing data crmda.KU.edu

Effects of imputing missing data Statistical Power: Will always be greater when missing data is imputed! crmda.KU.edu

Modern Missing Data Analysis MI or FIML • In 1978, Rubin proposed Multiple Imputation (MI) • An approach especially well suited for use with large public-use databases. • First suggested in 1978 and developed more fully in 1987. • MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm. • Beginning in the 1980’s, likelihood approaches developed. • Multiple group SEM • Full Information Maximum Likelihood (FIML). • An approach well suited to more circumscribed models crmda.KU.edu

60% MAR correlation estimates with 1 PCA auxiliary variable (r = .60) Figure 7. Simulation results showing XY correlation estimates (with 95 and 99% confidence intervals) associated with a 60% MAR Situation and 1 PCA auxiliary variable. crmda.KU.edu 9

Three-form design • What goes in the Common Set? crmda.KU.edu

Three-form design: Example • 21 questions made up of 7 3-question subtests crmda.KU.edu

Three-form design: Example • Common Set (X) crmda.KU.edu

Three-form design: Example • Common Set (X) crmda.ku.edu

Three-form design: Example • Set A I start conversations. I get stressed out easily. I am always prepared. I have a rich vocabulary. I am interested in people. crmda.KU.edu

Three-form design: Example • Set B I am the life of the party. I get irritated easily. I like order. I have excellent ideas. I have a soft heart. crmda.KU.edu

Three-form design: Example • Set C I am comfortable around people. I have frequent mood swings. I pay attention to details. I have a vivid imagination. I take time out for others. crmda.KU.edu

crmda.KU.edu

Expansions of 3-Form Design • (Graham, Taylor, Olchowski, & Cumsille, 2006) crmda.KU.edu

2-Method Planned Missing Design crmda.KU.edu

2-Method Planned Missing Design Self-Report Bias Self- Report 1 Self- Report 2 CO Cotinine Smoking crmda.KU.edu

2-Method Measurement • Expensive Measure 1 • Gold standard– highly valid (unbiased) measure of the construct under investigation • Problem: Measure 1 is time-consuming and/or costly to collect, so it is not feasible to collect from a large sample • Inexpenseive Measure 2 • Practical– inexpensive and/or quick to collect on a large sample • Problem: Measure 2 is systematically biased so not ideal crmda.KU.edu

2-Method Measurement • e.g., measuring stress • Expensive Measure 1 = collect spit samples, measure cortisol • Inexpensive Measure 2 = survey querying stressful thoughts • e.g., measuring intelligence • Expensive Measure 1 = WAIS IQ scale • Inexpensive Measure 2 = multiple choice IQ test • e.g., measuring smoking • Expensive Measure 1 = carbon monoxide measure • Inexpensive Measure 2 = self-report • e.g., Student Attention • Expensive Measure 1 = Classroom observations • Inexpensive Measure 2 = Teacher report crmda.KU.edu

2-Method Measurement • How it works • ALL participants receive Measure 2 (the cheap one) • A subset of participants also receive Measure 1 (the gold standard) • Using both measures (on a subset of participants) enables us to estimate and remove the bias from the inexpensive measure (for all participants) using a latent variable model crmda.KU.edu

2-Method Measurement • Example • Does child’s level of classroom attention in Grade 1 predict math ability in Grade 3? • Attention Measures • 1) Direct Classroom Assessment (2 items, N = 60) • 2) Teacher Report (2 items, N = 200) • Math Ability Measure, 1 item (test score, N = 200) crmda.KU.edu

Attention (Grade 1) Math Score (Grade 3) Teacher Report 1 (N = 200) Math Score (Grade 3) (N = 200) Teacher Report 2 (N = 200) Direct Assessment 1 (N = 60) Direct Assessment 2 (N = 60) Teacher Bias crmda.KU.edu

2-Method Planned Missing Design crmda.KU.edu

Developmental time-lag model • Use 2-time point data with variable time-lags to measure a growth trajectory + practice effects (McArdle & Woodcock, 1997) crmda.KU.edu

Time Age student T1 T2 2 4 6 0 1 3 5 1 5;6 5;7 2 5;3 5;8 3 4;9 4;11 4 4;6 5;0 5 4;11 5;4 6 5;7 5;10 7 5;2 5;3 8 5;4 5;8 crmda.KU.edu

T0 T1 T2 T3 T4 T5 T6 crmda.KU.edu

Intercept 1 1 1 1 1 1 1 T0 T1 T2 T3 T4 T5 T6 crmda.KU.edu

Linear growth Intercept Growth 1 0 6 1 1 5 1 2 4 3 1 1 1 1 T0 T1 T2 T3 T4 T5 T6 crmda.KU.edu

Constant Practice Effect Intercept Growth Practice 0 1 0 6 1 1 1 5 1 1 2 4 3 1 1 1 1 1 1 1 1 T0 T1 T2 T3 T4 T5 T6 crmda.KU.edu

Exponential Practice Decline Intercept Growth Practice 0 1 0 6 1 1 1 5 .87 1 2 4 3 .67 1 1 .55 1 .45 .35 1 T0 T1 T2 T3 T4 T5 T6 crmda.KU.edu

The Equations for Each Time Point Constant Practice Effect Declining Practice Effect crmda.KU.edu

Developmental time-lag model • Summary • 2 measured time points are formatted according to time-lag • This formatting allows a growth-curve to be fit, measuring growth and practice effects crmda.KU.edu

age grade 5;6- 5;11 6;6- 6;11 7;6- 7;11 4;6- 4;11 5;0- 5;5 6;0- 6;5 7;0- 7;5 2 student K 1 1 5;6 6;7 7;3 2 5;3 6;0 7;4 3 4;9 5;11 6;10 4 4;6 5;5 6;4 5 4;11 5;9 6;10 6 5;7 6;7 7;5 7 5;2 6;1 7;3 8 5;4 6;5 7;6 crmda.KU.edu

age • Out of 3 waves, we create 7 waves of data with high missingness • Allows for more fine-tuned age-specific growth modeling • Even high amounts of missing data are not typically a problem for estimation 5;6- 5;11 6;6- 6;11 7;6- 7;11 4;6- 4;11 5;0- 5;5 6;0- 6;5 7;0- 7;5 5;6 6;7 7;3 5;3 6;0 7;4 4;9 5;11 6;10 4;6 5;5 6;4 4;11 5;9 6;10 5;7 6;7 7;5 5;2 6;1 7;3 5;4 6;5 7;6 crmda.KU.edu

Growth-Curve Design crmda.KU.edu

Growth Curve Design II crmda.KU.edu

The Impact of Auxiliary Variables • Consider the following Monte Carlo simulation: • 60% MAR (i.e., Aux1) missing data • 1,000 samples of N = 100 www.crmda.ku.edu crmda.KU.edu 49

Excluding A Correlate of Missingness www.crmda.ku.edu crmda.KU.edu 50

Todd D. Little University of Kansas Director, Quantitative Training Program