Data Management Approaches: Handling, Storage, & Missingness. Renée El-Gabalawy, PhD. July 24, 2019. 2019 Department of Psychiatry Summer School.
Study Example
You are interested in understanding the effects of pre-surgical cognitive functioning, psychiatric status, and functional brain abnormalities on post-operative delirium (POD). Prior to running a large trial, you complete a feasibility study.
Components: cognitive testing, psychiatric self-report, fMRI, surgery, POD assessment
Creating a Successful Data Plan STEP 1: Create a specific data collection plan • Aim to recruit 15 patients (no power analysis because it is a feasibility study) from the pre-anesthesia clinic who are undergoing high-risk surgery • Provide informed consent, have patients fill out 5 self-report measures, book fMRI time (within 2 weeks prior to surgery) • Patients undergo a neuropsychological assessment (1 hour) and fMRI (1 hour) • Perioperative monitoring and management? • Post-operative delirium assessments on days 0 through 5
Creating a Successful Data Plan STEP 2: Assess feasibility and required personnel • Pre-anesthesia recruitment (1 staff member) • Pre-operative fMRI (2-3 staff) and neuropsychological assessment (1 hire with clinical expertise) • Perioperative management • Post-operative assessments (1 CAM-trained hire, nursing staff) Considerations: • Blinding to avoid bias (neuropsych hire ≠ CAM hire) • Required expertise • Budget (what funds are available to hire people? pay for MRI? provide an honorarium?) • Increasing the validity of the post-op assessment (multiple measures? chart review?)
Creating a Successful Data Plan STEP 3: Apply for ethics • Have all testing batteries prepared • Personnel identified • Organization strategy in place • Data management plan in place • Each patient is coded with a unique number; the list of patients and numbers is kept on a password-protected computer • Each patient has a file labeled with their number; all patient materials go in the file; files are kept in a locked laboratory (study personnel have the key) • Files = consent form, pre-op battery, neuropsych testing, periop info, all CAM assessments Important note: organization can make or break a study
Data Entry and Management STEP 4: Conduct the study & select a method of data entry and management (ongoing or post-study entry?) Important elements of data entry: • Use Excel/SPSS (or another data management platform) with clear labels • Use numerical values to identify patients • Create a data dictionary where variables and values are clearly defined • Decide on a value for missing data (888, 99, 9?) • Only include text in the data file when patients can report an open-ended response
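The data dictionary and missing-data code can be sketched in plain Python. The variable names (`moca_pre`, `cam_d3`) and the code 888 are illustrative assumptions, not part of the actual study materials:

```python
# A minimal sketch of a data dictionary plus an agreed missing-data code.
# Variable names and the code 888 are hypothetical examples.

MISSING = 888  # chosen so it cannot overlap with any real response value

data_dictionary = {
    "pt_id":    "Unique numeric patient identifier",
    "moca_pre": "Pre-operative cognitive screen total (0-30); 888 = missing",
    "cam_d3":   "CAM delirium assessment, post-op day 3 (1 = positive, 0 = negative; 888 = missing)",
}

def to_none(value, missing_code=MISSING):
    """Convert the agreed missing-data code to Python's None."""
    return None if value == missing_code else value

row = {"pt_id": 7, "moca_pre": 888, "cam_d3": 0}
clean = {k: to_none(v) for k, v in row.items()}
```

Keeping the missing code in one constant means it can be checked against every variable's real range before data entry begins.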
Important points to remember: 1. Make sure the variable label matches the label in the dataset 2. Clearly identify continuous vs. categorical vs. text data 3. Make sure the missing value does not overlap with real values 4. Variable labels should differentiate longitudinal data 5. Identify the question in the questionnaire associated with each variable
Management & Analysis • With a well-organized data file you can: • Perform basic statistics in Excel • Graph your variables in Excel • Transfer the data easily to statistical software (SPSS) for analysis • Clearly identify missings and analyze accordingly • Send the data and data dictionary to a statistician for analysis
How do data go missing? • You lose them • Participants not responding to certain questions • Order/fatigue • Sensitivity/refusal • Skip-outs • Attrition • Data entry errors • Statistical coding errors
Why don’t we think/talk about missings in epidemiological analyses? • You should • Many variables have <5% missing (thus, “ignorable”) • Weighting procedures partially account for missingness (“missing by design”) • Weights adjust for non-response
Methods to reduce missings in primary data collection • Select shorter scales • Avoid long, confusing questions • Carefully consider alternate ways of asking sensitive questions and include surrogate questions • Change ordering of scales (e.g., have 3 different versions) • Check for completeness immediately • Have second person review data entry
Understanding missingness • Understand reasons and patterns for missing • Understand distribution of missings • Use this information to select the best method of analysis and imputation (if appropriate)
Defining missing values Missing Completely at Random (MCAR): Missingness is unrelated to the actual data. Parameter estimates are unbiased, but power is reduced Example: Some survey questions are asked in a random sub-sample of a larger sample Missing at Random (MAR): Missingness is related to another factor, but not to the variable of interest Example: Respondents in service occupations are less likely to report income Missing Not at Random (MNAR): Missingness is related to the variable itself Example: Respondents with high incomes are less likely to report their incomes
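The three mechanisms can be illustrated with a small simulation in plain Python. All distributions, probabilities, and the "service occupation" flag are invented for illustration:

```python
import random

random.seed(1)

# Toy income data plus a hypothetical occupation flag (both made up).
incomes = [random.gauss(50_000, 15_000) for _ in range(10_000)]
service = [random.random() < 0.3 for _ in incomes]

def apply_missing(values, miss_probs):
    """Blank out each value with its own missingness probability."""
    return [v if random.random() > p else None for v, p in zip(values, miss_probs)]

# MCAR: every value has the same chance of going missing.
mcar = apply_missing(incomes, [0.10] * len(incomes))
# MAR: missingness depends on occupation, not on income itself.
mar = apply_missing(incomes, [0.30 if s else 0.05 for s in service])
# MNAR: high incomes are more likely to go missing.
mnar = apply_missing(incomes, [0.40 if v > 60_000 else 0.05 for v in incomes])

def observed_mean(xs):
    obs = [x for x in xs if x is not None]
    return sum(obs) / len(obs)
```

Under MNAR the observed mean is pulled below the true mean because the high values vanish preferentially; under MCAR the observed mean stays close to the true mean, only the sample shrinks.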
MCAR vs MAR vs MNAR • It is important to understand missingness as statistical methods to deal with missing often assume MCAR or MAR • It is problematic to have MNAR
Analysis strategies Deletion Methods: • Listwise deletion, pairwise deletion Single Imputation Methods: • Mean/mode substitution, dummy variable method, single regression • Must be MCAR Model-Based Methods: • Maximum likelihood, multiple imputation, inverse probability weighting • Can be MAR
Deletion Methods • Simplest approach • Only analyzes cases with available data on each variable • Reduces power • Increases the likelihood of biased findings Listwise deletion: dropping an entire case (e.g., participant) from all analyses Pairwise deletion: dropping a case only from analyses involving its missing variable
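The two deletion methods can be sketched on a toy table (variables and values are made up; `None` stands in for a missing cell):

```python
# Toy dataset: each row is one participant; None = missing.
rows = [
    {"anxiety": 1,    "alcohol": 2},
    {"anxiety": None, "alcohol": 3},
    {"anxiety": 0,    "alcohol": None},
    {"anxiety": 1,    "alcohol": 1},
]

# Listwise deletion: keep only cases complete on EVERY analysis variable.
listwise = [r for r in rows if all(v is not None for v in r.values())]

# Pairwise deletion: each statistic uses whatever cases are complete
# for the variables it involves, so each variable keeps its own n.
anxiety_n = sum(r["anxiety"] is not None for r in rows)
alcohol_n = sum(r["alcohol"] is not None for r in rows)
```

Listwise keeps only 2 of the 4 cases here, while pairwise retains 3 cases per variable, which is exactly why pairwise sample sizes differ across statistics in the output.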
Single Imputation Mean/mode substitution: replace the missing value with the sample mean or mode • Can specify conditions (e.g., over 80% complete) • Reduces the variability of the data (values are regressed toward the mean) Regression imputation: replaces missing values with the predicted score from a regression • Can overestimate model fit • Reduces variance Dummy variable adjustment with imputation: • Create a missingness indicator (1 = missing, 0 = not missing) • Impute the data (e.g., mean substitution) • Add the missingness indicator as a covariate in the regression
Dummy Variable Adjustment • Treat missing data as a “level” in categorical variables • E.g., if you had 3 levels for income (<$30,000, $30,000-$60,000, $60,000+), you could include a “missing” level and include this 4-level variable as a covariate in your models • This is preferable to deletion, which reduces your sample size
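The "missing as a level" recode can be sketched with the income example above (the numeric codes and labels are illustrative):

```python
# 3-level income variable plus a 4th level for missing (codes are hypothetical).
LEVELS = {1: "<$30,000", 2: "$30,000-$60,000", 3: "$60,000+", 4: "missing"}

def with_missing_level(income_code):
    """Map a missing income (None) to its own categorical level."""
    return 4 if income_code is None else income_code

raw = [1, 3, None, 2, None]
recoded = [with_missing_level(x) for x in raw]
```

No case is dropped: the recoded variable has the same length as the raw one, so the full sample enters the regression.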
Model Based Methods Maximum likelihood estimation (EM in SPSS): • Method that identifies a set of parameter values that results in the highest log-likelihood • Unbiased estimates for MCAR and MAR Multiple imputation: • Uses a specified regression model to create multiple datasets with various completed possible values • Analyses are performed within each dataset • Results pooled into one estimate
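The pooling step of multiple imputation can be sketched with Rubin's combining rules. The five per-dataset estimates and variances below are made-up numbers, and this is a simplified illustration, not SPSS's exact implementation:

```python
import statistics

# Hypothetical coefficient from the same regression run in each of
# 5 imputed datasets, with its squared standard error.
estimates = [0.42, 0.45, 0.40, 0.44, 0.43]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]

m = len(estimates)
pooled_est = statistics.mean(estimates)       # combined point estimate
within_var = statistics.mean(variances)       # average sampling variance
between_var = statistics.variance(estimates)  # variance across imputations

# Rubin's total variance adds the between-imputation uncertainty,
# inflated by (1 + 1/m) to account for the finite number of datasets.
total_var = within_var + (1 + 1 / m) * between_var
```

The total variance is always larger than the within-imputation variance alone, which is how MI honestly reflects the extra uncertainty introduced by imputing.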
Preferred Method • Multiple imputation is the gold standard of imputation • Creates most accurate values by taking into account variability due to (1) sampling and (2) imputation • Disadvantage: Time consuming and involves several decisions • MI method, dataset count, iterations between datasets, selection of prior distribution
Missing Data with SPSS Tips and Tricks
What can you do in SPSS? • Understand missingness • Conduct a formal missing value analysis along with single imputation methods • Conduct model-based imputation including MI • Mean value imputation in factor analysis and linear regression • Forecasting add on allows for imputation in time series
Identifying values as missing in SPSS You can define missings in 3 ways: • Discrete missing values (up to 3) • A range of values • A range of values plus one discrete value **You don’t need to re-code as system-missing (sysmis)
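A plain-Python analogue of those three definitions; the specific codes (888, 999) and the range 90-99 are illustrative, not SPSS defaults:

```python
# Hypothetical user-missing definitions: two discrete codes plus a range,
# mirroring SPSS's "range plus one discrete value" option.
DISCRETE_MISSING = {888, 999}
MISSING_RANGE = (90, 99)  # e.g., refusal / don't-know codes

def is_missing(value):
    """True if the value matches a discrete missing code or falls in the range."""
    lo, hi = MISSING_RANGE
    return value in DISCRETE_MISSING or lo <= value <= hi
```

As in SPSS, the raw codes stay in the data file; they are merely flagged as missing at analysis time rather than overwritten.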
Understanding missings in SPSS • Run frequencies on all primary variables to understand the n that is missing • Identify variables that have high missings: based on your knowledge of the variable/data collection process/survey, is there a reason why there are more missings? • Understand the pattern of missingness and ignorability • Missing Value Analysis module, Multiple Imputation module, or Index for Sensitivity to Nonignorability
Missing Value Analysis • Describes the pattern of missing data: where missings are located, how extensive they are, whether missingness is random, and whether pairs of variables tend to have missings together • Little’s MCAR test • Estimates means, SDs, correlations, etc. for missing value methods • Conducts single imputation
Example SPSS dataset Primary Variables: • Anxiety (yes, no) • Anxiety type (phobia, OCD, GAD, PTSD, panic, other) • How nervous are you (1 through 5) • How restless are you (1 through 5) • Frequency of alcohol use (never to 4 or more times per week) • Quantity of alcohol use (1-2/day to 10+/day) Does anything stand out that may impact missingness?
Running frequencies Most missings occur here: suspect skip-outs and go back to the questionnaire. If participants answered no to whether they had an anxiety disorder, they would not be asked about type. If they indicated they did not drink, they would not be asked how many drinks they have per day
Missing Value Analysis • The continuous measures of nervousness and restlessness have a non-significant Little’s MCAR test • This is a good thing! We fail to reject the null, which means the data are missing in a random way • Can proceed with single imputation (with the continuous measures)
Replace missing values • Expectation-maximization approach • Ensure that imputation is done on variables that are coded appropriately (e.g., reverse-coded) and impute by subscale (homogeneous items) • Use multiple items to enhance accuracy
Single Imputation in SPSS: Analyze → Missing Value Analysis → EM… → Save Completed Data
Multiple Imputation: Analyze → Multiple Imputation → Analyze Patterns
Multiple Imputation: Missing Patterns • 100% of the variables have at least some missing data • 86.85% have complete cases (rows complete) • 88.53% of cells have data Should not impute variables with >15% missing
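The complete-case and complete-cell percentages reported by Analyze Patterns can be reproduced on a toy table (values are made up; `None` = missing):

```python
# Toy data matrix: 4 cases x 3 variables; None = missing cell.
rows = [
    [1, 5, 2],
    [None, 4, 3],
    [2, None, None],
    [1, 3, 2],
]

n_cells = sum(len(r) for r in rows)
# Share of cases (rows) with no missing value at all.
complete_cases = sum(all(v is not None for v in r) for r in rows) / len(rows)
# Share of individual cells that contain data.
complete_cells = sum(v is not None for r in rows for v in r) / n_cells
```

Here 50% of cases are complete but 75% of cells have data, which mirrors why the two percentages on the slide differ (86.85% vs. 88.53%).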
Multiple imputation: Missing patterns • 7 patterns of missingness Pattern 1 = no missings across all variables; pattern 7 = missings on all variables Around 80% of cases have no missings, while around 15% have missings across all variables We cannot impute cases with all missings, so remove them from the dataset (these are participants who enrolled in the study but did not complete any measures)
MI: Removal of cases with all missings • Should no longer see pattern 7 (all missings) • Missing data dropped from around 11% to less than 2% • Conduct Little’s MCAR test
Multiple imputation: Model specification • Number of imputed datasets • Can specify the method, or let it be chosen automatically • Constraints: specify the min and max of variables in the models; scan the data to provide descriptives
Multiple imputation: new dataset • Yellow cells represent imputed values • Imputation variable = dataset number (0 = original dataset; 1 through 5 = imputed datasets)
MI: Regression Run the linear regression; pull-down menus include a swirl icon indicating the analysis will be conducted on each imputed dataset individually and on the pooled dataset
Revise & Resubmit, AJGP “The authors write ‘Second, we are unable to determine the effect of respondent attrition at Wave 2 on our results.’ Actually, the authors could describe the range of possible influence of respondent attrition at Wave 2 on their results with the use of a sensitivity analysis. See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2047601/ Selective attrition seems to be a very likely candidate to explain the decreasing trend of lifetime psychiatric disorder with older age, far more compelling than ‘This finding once again supports the notion of older age generally being associated with good emotional health.’”
What we did • Re-merged Waves 1 & 2 (the merged NESARC file had missings due to attrition removed) • Coded in missings by identifying those with no data for the Wave 2 stratum • Ran a number of t-tests and chi-square analyses examining significant differences between missing and non-missing cases on primary variables and sociodemographics
Second round of revisions • Our analysis identified factors predicting missingness, but it was not a sensitivity analysis • For variables that cannot be imputed, re-run analyses under “best case” and “worst case” scenarios
What we did • Replaced missing independent variables from Wave 1 with extreme scores (endorsed vs. not endorsed, and 1 SD above or below the mean) • Conducted the same analyses under each scenario to examine whether the extreme cases changed the results
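The best-case/worst-case fill can be sketched in a few lines; the baseline scores below are made up, and the ±1 SD fills follow the slide's description:

```python
import statistics

# Hypothetical Wave 1 scores; None = missing due to attrition.
scores = [10, 12, None, 9, None, 11, 13]

obs = [s for s in scores if s is not None]
mean, sd = statistics.mean(obs), statistics.stdev(obs)

# "Best case": missing participants scored 1 SD above the observed mean.
best = [mean + sd if s is None else s for s in scores]
# "Worst case": missing participants scored 1 SD below the observed mean.
worst = [mean - sd if s is None else s for s in scores]
```

If the substantive conclusions hold under both extreme fills, attrition is unlikely to be driving the results; if they flip, the finding is sensitive to the missingness.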
Bottom Line: Who, What, When, Where, Why Always be transparent about missingness. WHO: Who is missing? WHAT: What proportion is missing? WHEN: When are data missing? (in what context) WHERE: Where are these missing? WHY: Why are these missing?