
Analyzing Patterns of Missing Data

While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.

Albert_Lan

Presentation Transcript


  1. Analyzing Patterns of Missing Data

     While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.

     The first set of tasks in the missing data analysis involves creating diagnostic variables that support the analysis: first, a variable that counts the number of variables with missing data for each case; second, one new dichotomous variable for each original variable that indicates whether or not the original variable had a missing value; and third, a single pattern variable for each case that summarizes the missing or valid status of all of the variables in the analysis.

     Using the diagnostic variable that counts the missing values for each case, we can identify cases with large concentrations of missing data as candidates for elimination from the analysis. After we remove the cases with large numbers of missing variables, we run a frequency distribution for the remaining cases to see whether any variable has so many missing cases that it should be considered a candidate for exclusion.

     Next, we compute a frequency distribution for the pattern variable to identify patterns that occur often in the data, which would indicate a problematic missing data process.

     Next, using the valid/missing variables as grouping variables, we examine whether the missing cases are statistically different from the valid cases on each of the other variables in the analysis. If the variable is metric, we do a t-test for group differences; if the variable is nonmetric, we do a chi-square test of independence to detect group differences.

     Finally, we compute a correlation matrix of the valid/missing variables to detect concentrations of missing data across multiple variables.

  2. 1. Download the Data Set

     Download the HATMISS data set from the course web page and save it in your C:\SW388R7 folder.

  3. 2. Tallying the Number of Missing Variables

     One of the major pieces of information we need for the missing data analysis is the number of variables that have missing data for each case in the sample.

     We will create a new variable, num_miss, that contains the number of variables with missing values among the first ten variables in the data set, x1 through x10. We include only the first ten variables in this calculation to maintain consistency with the text.

     The SPSS function NMISS counts how many of the variables passed to it have missing values. We will use this function to calculate the value of num_miss for each case.
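
The slides carry out this count through the SPSS Compute dialog. As a rough illustration of the same logic outside SPSS, the count can be sketched in plain Python. The toy rows, the use of `None` for a missing value, and the helper name `count_missing` are assumptions made for this sketch; none of them come from the HATMISS data set.

```python
# Toy rows standing in for cases; None marks a missing value.
# These values are illustrative only and are not taken from HATMISS.
cases = [
    {"x1": 4.1, "x2": None, "x3": 6.9},
    {"x1": None, "x2": None, "x3": 8.2},
    {"x1": 3.0, "x2": 1.5, "x3": 5.4},
]
variables = ["x1", "x2", "x3"]


def count_missing(case, names):
    """Return how many of the named variables are missing for one case."""
    return sum(1 for name in names if case[name] is None)


# Store the count on each case, mirroring the num_miss variable in the slides.
for case in cases:
    case["num_miss"] = count_missing(case, variables)

print([case["num_miss"] for case in cases])
```

Limiting `variables` to a chosen subset plays the same role as restricting the calculation to x1 through x10.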

  4. Computing the Number Missing by Case

  5. Specifying the Variables in the Function

  6. 3. Creating Dichotomous Valid/Missing Variables for Diagnosing Missing Data

     To determine whether or not the pattern of missing data is random, we create, for each original variable, a diagnostic variable that indicates whether that variable is missing or valid for each case in the data set. Each diagnostic variable is dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.

     Since we may need to refer back to the original variables in the course of the missing data analysis, I recommend a naming convention for the diagnostic variables that makes it easy to identify the original variable. If the original variable name has fewer than eight characters, an underscore is appended to the end of the original variable name, e.g. the diagnostic variable for race would be race_. If the original variable name has eight characters, the last character is replaced with an underscore, e.g. the diagnostic variable name for response would be respons_. If replacing the last character with an underscore duplicates the name already assigned to another diagnostic variable, we drop the last two characters from the original name and append an underscore followed by a sequence letter or digit, e.g. the diagnostic variable name for response would be respon_1 if we had already used the name respons_.

     When we assign variable labels to the diagnostic variables, we can add a keyword to the original variable label to designate it as a valid/missing diagnostic variable, e.g. if the original variable label is Grade Level, the label for its diagnostic variable could be Grade Level (Valid/Missing).

     We will demonstrate the process of creating dichotomous valid/missing variables using the variables in the HATMISS.SAV data set. If the copy of HATMISS.SAV that you are working with does not have variable labels and value labels, do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is to position the mouse over a variable name in the data editor. If a variable label appears in a yellow tips box, a variable label has been added for that variable.
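
The naming convention and the 1/0 coding described above can be written down as two small functions. This is only a sketch in Python, not SPSS syntax: the function names `diagnostic_name` and `valid_flag` are hypothetical, missing values are assumed to be represented as `None`, and original names are assumed to be at most eight characters long. The sketch uses a digit rather than a letter as the sequence character.

```python
def diagnostic_name(original, used):
    """Build a diagnostic variable name under the eight-character convention.

    `used` is the set of diagnostic names that have already been assigned.
    """
    if len(original) < 8:
        # Short name: simply append an underscore.
        return original + "_"
    # Eight-character name: replace the last character with an underscore.
    candidate = original[:7] + "_"
    if candidate not in used:
        return candidate
    # Collision: keep six characters, then an underscore and a sequence digit.
    for digit in "123456789":
        candidate = original[:6] + "_" + digit
        if candidate not in used:
            return candidate
    raise ValueError("no free diagnostic name for " + original)


def valid_flag(value):
    """Return 1 for a valid value and 0 for a missing one (None)."""
    return 0 if value is None else 1


print(diagnostic_name("race", set()))
print(diagnostic_name("response", {"respons_"}))
```

Passing the set of names already in use is what lets the function detect a collision and fall back to the sequence-digit form.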

  7. Recoding Diagnostic Variables for Missing Data

  8. Opening the Dialog for Old and New Values

  9. Add the Value for Missing Data

  10. Add the Value for Valid Data

  11. Completing the Values Dialog Box

  12. Adding Diagnostic Variables for the Remaining Variables

  13. Adding Value Labels to the Diagnostic Variables

  14. Adding the Value Label for Missing

  15. Add the Value Label for Valid

  16. Apply the Value Labels

  17. Displaying the Value Labels for the Variables

  18. The Diagnostic Variables

  19. 4. Adding a Pattern Variable to the Data Set

     Another indication of a problematic missing data process is the frequent occurrence of the same pattern of missing data among the variables. While patterns can be detected by sorting and scanning the data set, the task is easier with a pattern variable.

     The pattern variable is a string variable containing one character for each variable in the data set. Each character is set either to a character indicating missing data or to a character indicating valid data. To make the pattern easy to read, the two characters should have the same width when printed. If they do not, we cannot scan down the values to compare them, because the characters do not line up in columns from one value to the next. We will use an X for missing data and a tilde, ~, for valid data, because both are full-width characters.

     To create the pattern variable, we first create a one-character string variable for each of the original variables. Then, we use the SPSS CONCAT function to join the string variables into a single variable.
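
The two-step construction (one character per variable, then concatenation) can be illustrated outside SPSS with a short Python sketch. The helper name `missing_pattern`, the toy case, and the use of `None` for a missing value are assumptions made for illustration; the X and ~ characters follow the slide.

```python
def missing_pattern(case, names, missing_char="X", valid_char="~"):
    """Return one character per variable: X if missing, ~ if valid."""
    # Step 1: recode each variable into a single character.
    flags = [missing_char if case[name] is None else valid_char for name in names]
    # Step 2: concatenate the characters into one pattern string.
    return "".join(flags)


# Toy case for illustration only; not taken from HATMISS.
case = {"x1": 4.1, "x2": None, "x3": 6.9}
print(missing_pattern(case, ["x1", "x2", "x3"]))
```

The order of `names` fixes the position of each variable in the string, so the same list should be used for every case.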

  20. Recode the Original Variables into String Variables

  21. Opening the Dialog for Old and New Values

  22. Add the Value for Missing Data

  23. Add the Value for Valid Data

  24. Completing the Values Dialog Box

  25. Adding String Variables for the Other Original Variables

  26. The String Variables

  27. Create the Variable Containing the Concatenated Data

  28. Enter the Formula for the Concatenated Variable

  29. The Missing Data Pattern Variable

  30. 5. Removing Cases with a Large Proportion of Missing Variables

     To identify the cases that we should consider removing, we will sort the data set in descending order by the number of missing variables. The candidates for elimination will appear at the top of the data set.

     Once we have located the cases that we want to eliminate, we specify a filter condition that excludes them from further analysis. The cases are not deleted from the data set, so we can include them in later analyses should we want to.
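
The sort-then-filter idea can be sketched in Python as follows. The toy cases and the names `N_VARIABLES`, `CUTOFF`, and `keep` are assumptions for illustration. The cutoff of 50% follows the rule the next step of the presentation reports having used; the key point is that excluded cases are flagged rather than deleted.

```python
# Toy cases with a precomputed count of missing variables (illustrative values).
cases = [
    {"id": 1, "num_miss": 0},
    {"id": 2, "num_miss": 7},
    {"id": 3, "num_miss": 2},
    {"id": 4, "num_miss": 5},
]
N_VARIABLES = 10  # number of variables included in the count
CUTOFF = 0.5      # exclude cases with 50% or more of the variables missing

# Sort in descending order so candidates for elimination appear first.
ranked = sorted(cases, key=lambda case: case["num_miss"], reverse=True)

# Flag each case instead of deleting it, so it can be restored later.
for case in cases:
    case["keep"] = case["num_miss"] / N_VARIABLES < CUTOFF

retained = [case for case in cases if case["keep"]]
print([case["id"] for case in ranked])
print([case["id"] for case in retained])
```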

  31. Sorting the Cases

  32. The Cases Sorted by Number Missing

  33. Excluding the Cases

  34. Specifying the If Condition

  35. Specify Filtering for Unselected Cases

  36. The Data Set with Filtered Cases

  37. 6. Summary Statistics for the Unfiltered Cases

     Filtering cases with 50% or more missing data removed six cases from the data set, reducing our effective sample size to 64 cases. We next look at a frequency distribution for each variable to see whether any variable has such a high proportion of missing data that it should be considered a candidate for removal from the analysis.

     We can see the distribution of missing data on each of our variables by using the Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page 56 of the text. We use the Frequencies command instead of the Descriptives command because Frequencies provides a count of the remaining missing cases for each variable.
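
What this step tabulates is, for each variable, the number of valid and missing cases among the retained cases. A minimal Python sketch of that tally is shown below; the toy rows and the helper name `missing_summary` are assumptions for illustration and do not reproduce Table 2.2.

```python
# Toy retained cases; None marks a missing value (illustrative values only).
cases = [
    {"x1": 4.1, "x2": None, "x3": 6.9},
    {"x1": None, "x2": None, "x3": 8.2},
    {"x1": 3.0, "x2": 1.5, "x3": 5.4},
    {"x1": 2.2, "x2": 0.9, "x3": None},
]


def missing_summary(rows, names):
    """Count valid and missing cases, and percent missing, for each variable."""
    summary = {}
    for name in names:
        n_missing = sum(1 for row in rows if row[name] is None)
        summary[name] = {
            "valid": len(rows) - n_missing,
            "missing": n_missing,
            "pct_missing": 100.0 * n_missing / len(rows),
        }
    return summary


summary = missing_summary(cases, ["x1", "x2", "x3"])
for name, counts in summary.items():
    print(name, counts)
```

A variable whose percent missing stands well above the others is the kind of candidate for removal the slide describes.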

  38. Requesting the Frequency Distributions

  39. Requesting Specific Statistics

  40. The Frequencies Output

  41. Changing the Orientation of the Table

  42. The Transposed Frequencies Table

  43. 7. Tabulating Missing Data Patterns

     In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern variable that contains a single string of ten characters representing valid or missing data for the first ten variables in the data set. To create Table 2.4 on page 58, we run a frequency distribution on the pattern variable. This frequency distribution will tell us whether one or two patterns of missing data occur often enough to require further investigation.
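
A frequency distribution of pattern strings amounts to counting identical strings. As a sketch outside SPSS, Python's `collections.Counter` does this directly. The pattern strings below are made-up three-character examples, not the ten-character patterns from the data set.

```python
from collections import Counter

# Made-up pattern strings for illustration (X = missing, ~ = valid).
patterns = ["~~~", "~X~", "~~~", "X~~", "~X~", "~~~"]

# Count how often each pattern occurs, most frequent first.
frequencies = Counter(patterns).most_common()

for pattern, count in frequencies:
    print(pattern, count)
```

Patterns near the top of the list other than the all-valid pattern are the ones worth investigating further.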

  44. Request a Frequency Distribution for the Pattern Variable

  45. The Frequency of Different Patterns

  46. 8. T-tests and Chi-square Tests for Diagnosing Randomness of Missing Data

     In previous exercises, we created dichotomous grouping variables for the variables X1 through X10, where the grouping variable was assigned a 1 if the data was valid and a 0 if the data was missing. We will use these grouping variables to determine whether the valid and missing groups differ in their relationship to other variables in the data set. If the missing and valid groups do not differ significantly on the other variables, then the missing cases can be characterized as random, and of no consequence to our analysis. If the missing group shows a statistically significant relationship to another variable, it suggests that there is a missing data process that requires further understanding.

     The statistical tests that we use in this analysis are chi-square tests of independence, if the variable to be tested is nonmetric, or t-tests for two independent samples, if the variable to be tested is metric. The authors use the separate variance output for all t-tests instead of examining individual tests of homogeneity. We will follow this practice.

     When this analysis is conducted, a large number of statistical relationships is usually tested. Using an alpha level of 0.05 in these tests implies that, when there is truly no difference, we expect to make an incorrect inference in about one out of every twenty tests. With a large number of tests, we will get some statistically significant relationships even when there is no serious problem with our data. We are less concerned with the individual test results than with the overall pattern of relationships.

     NOTE. I cannot reconcile the findings on these tests with the discussion of findings on page 58 of the text. The statistical results are consistent with Table 2.5 on page 59, while the text discussion appears to be a carryover from the fourth edition of the text, which does not contain the same statistical results as the fifth edition.
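
To make "separate variance" concrete, the sketch below computes the t statistic and approximate degrees of freedom for two independent samples without assuming equal variances (commonly called Welch's t-test). It is a plain Python illustration, not the SPSS procedure itself: the function name `welch_t` and the sample values are assumptions, and the sketch stops short of a p-value because the Python standard library has no t distribution.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return the separate-variance t statistic and its approximate df."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up values of one metric variable, split by a valid/missing indicator.
valid_group = [2.0, 4.0, 6.0]
missing_group = [1.0, 2.0, 3.0]
t_stat, df = welch_t(valid_group, missing_group)
print(round(t_stat, 3), round(df, 3))
```

Each group needs at least two observations for the sample variance to be defined, which can be a practical limit when very few cases are missing on a variable.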

  47. The Statistical Tests to Be Computed

     We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore differences among the next nine variables in the data set, 'Price Level' through 'Satisfaction Level' (X2 through X10). In each statistical test, we are testing the null hypothesis of no relationship with the grouping variable, 'Delivery Speed (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who did not answer the question on Delivery Speed had a different pattern of responses than persons who did provide Delivery Speed.

     The variable 'Firm Size' (x8) is a nonmetric variable, and we will do a chi-square test of independence for this variable. The variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) are all metric, and we will do t-tests for these variables.
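
For the nonmetric variable, the chi-square test of independence compares observed cell counts in a crosstabulation with the counts expected if the two variables were unrelated. The Python sketch below computes the Pearson chi-square statistic and its degrees of freedom for a contingency table. The function name `chi_square_statistic` and the 2 x 2 counts are made up for illustration, and no continuity correction or p-value is computed.

```python
def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and df for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Made-up counts: rows are missing/valid on the grouping variable,
# columns are two categories of a nonmetric variable.
table = [
    [10, 20],
    [20, 10],
]
chi2, df = chi_square_statistic(table)
print(round(chi2, 3), df)
```

The statistic is only a reliable guide when expected cell counts are not too small, which is worth checking when the missing group has few cases.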

  48. The Chi-square Test of Independence

  49. Requesting the Chi-square Test

  50. Specifying Cell Contents
