
Data cleaning

GAP Toolkit 5: Training in basic drug abuse data management and analysis. Data cleaning. Training session 12. Objectives: to establish methods of uncovering coding errors; to discuss techniques for implementing logical tests; to present methods of selecting cases.

lpetty

Presentation Transcript


  1. GAP Toolkit 5: Training in basic drug abuse data management and analysis • Data cleaning • Training session 12

  2. Objectives • To establish methods of uncovering coding errors • To discuss techniques for implementing logical tests • To present methods of selecting cases • To reinforce the SPSS skills presented to date

  3. Boolean operators: AND • The AND operator is a logical operator in Boolean algebra • Imagine two statements: X and Y • For the operation (X AND Y) to be true X has to be true and Y has to be true • The rules for Boolean operators are commonly displayed in Truth Tables

  4. Truth table: AND

  5. Boolean operators: OR • The OR operator is a logical operator in Boolean algebra • Imagine two statements: X and Y • For the operation (X OR Y) to be true either X is true or Y is true or both X and Y are true

  6. Truth table: OR
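
The truth tables on slides 4 and 6 did not survive as text. As a minimal sketch, the Python below rebuilds both tables directly from the definitions of AND and OR given on slides 3 and 5. The helper name `truth_table` is an illustration only and is not part of the training material.

```python
from itertools import product


def truth_table(op):
    """Return rows (X, Y, result) for a two-input Boolean operator."""
    return [(x, y, op(x, y)) for x, y in product([True, False], repeat=2)]


# X AND Y is true only when both statements are true.
and_table = truth_table(lambda x, y: x and y)
# X OR Y is true when either statement, or both, is true.
or_table = truth_table(lambda x, y: x or y)

for name, table in (("AND", and_table), ("OR", or_table)):
    print(f"X      Y      X {name} Y")
    for x, y, result in table:
        print(f"{x!s:<6} {y!s:<6} {result}")
    print()
```

AND yields a true result in one row only (both inputs true), while OR yields a false result in one row only (both inputs false).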

  7. Data cleaning • Check the data for errors • Clean the data before any data analysis

  8. Types of error • There are two broad areas of error: • Coding errors • Logical errors

  9. Coding error • Data entry errors • Out-of-range values

  10. Detecting out-of-range values • For categorical variables, having declared valid values, frequency counts will highlight any peculiar entries • For continuous variables, descriptive statistics, in particular the range and a histogram, will highlight any peculiar values

  11. Examples • Age: generate descriptive statistics • Treatment type: generate a frequency distribution

  12. Descriptives

  13. Treatment type
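
The two checks from slides 10 and 11 can also be sketched outside SPSS. The Python below is an illustration under stated assumptions: the records, the field names `age` and `treatment`, and the set of valid treatment codes are all invented for the example and do not come from the training data set.

```python
from collections import Counter

# Hypothetical records; field names and codes are illustrative only.
records = [
    {"id": 1, "age": 24, "treatment": 1},
    {"id": 2, "age": 1, "treatment": 2},
    {"id": 3, "age": 31, "treatment": 9},
    {"id": 4, "age": 77, "treatment": 1},
]
VALID_TREATMENT = {1, 2, 3}  # assumed set of declared valid codes

# Continuous variable: the minimum, maximum and range expose extreme values.
ages = [r["age"] for r in records]
age_summary = {
    "n": len(ages),
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
}

# Categorical variable: a frequency count exposes codes outside the valid set.
treatment_counts = Counter(r["treatment"] for r in records)
invalid_codes = sorted(c for c in treatment_counts if c not in VALID_TREATMENT)

print(age_summary)
print(dict(treatment_counts))
print("Codes to check:", invalid_codes)
```

The summary only points at suspicious values; each one still has to be checked against the questionnaire before it is changed.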

  14. Resolving errors • The questionnaires should be checked • If possible, return to the interviewer or interviewee • If still unresolved, consider setting the value as missing • Note the importance of ID numbers for linking each computer record to its questionnaire
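
As a rough sketch of the last two points, the Python below sets one unresolved value to missing by looking the case up through its ID number. The function name `set_missing`, the records and the field names are hypothetical, and `None` stands in for a missing value.

```python
def set_missing(records, case_id, field):
    """Return a copy of the records with one field of one case set to None."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy, so the original data stay untouched
        if record["id"] == case_id:
            record[field] = None
        cleaned.append(record)
    return cleaned


# Hypothetical data: suppose the age of case 2 could not be verified
# against the questionnaire or with the interviewer.
records = [
    {"id": 1, "age": 24},
    {"id": 2, "age": 1},
    {"id": 3, "age": 31},
]
cleaned = set_missing(records, case_id=2, field="age")
print(cleaned)
```

Working on a copy keeps the original entry available in case the questionnaire is later found and the value can be corrected rather than dropped.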

  15. Selecting cases • The ability to select a set of cases according to a criterion is essential in data cleaning • Generating statistics for subsets of the data is also a useful analytical tool

  16. Example: Age • Descriptive statistics of Age indicate that there is a case with a value of 1 and a case with the value 77 • It is advisable to check the extreme values • [Table: Descriptive Statistics]

  17. Example: Age • It would be reasonable to check for values 10 and under and 70 and over • The task is to select those cases and display the results • Data/Select Cases generates the following dialogue box

  18. Choose these options to define selection criteria.

  19. Data/Select Cases • SPSS creates a new variable in the data set called filter_$, which equals 1 when AGE <= 10 OR AGE >= 70 and 0 otherwise • All subsequent analysis will be on the reduced data set until Data/Select Cases/All Cases is chosen • Cases that are filtered out (not selected) are identified by a slash through the case number
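
The selection logic can be mirrored in a few lines of Python. This is a sketch rather than SPSS syntax: the records are invented, and the key `filter_flag` is a stand-in for the filter_$ variable that SPSS creates.

```python
# Hypothetical records; only the ages 1 and 77 echo the values on slide 16.
records = [
    {"id": 1, "age": 24},
    {"id": 2, "age": 1},
    {"id": 3, "age": 31},
    {"id": 4, "age": 77},
    {"id": 5, "age": 10},
]

# Flag each case: 1 if it meets the criterion AGE <= 10 OR AGE >= 70, else 0.
for record in records:
    record["filter_flag"] = 1 if (record["age"] <= 10 or record["age"] >= 70) else 0

# Subsequent work uses the selected subset only.
selected = [r for r in records if r["filter_flag"] == 1]
for r in selected:
    print(r["id"], r["age"])
```

Keeping the flag as a separate variable, rather than deleting cases, means the full data set is still available once the check is finished.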

  20. Age

  21. Generating a report • Analyse/Reports/Case Summaries • Select the variables to be included in the summary

  22. Case summaries (limited to first 100 cases)

  23. Note: All Cases • Don’t forget that, once certain cases have been selected, all subsequent analysis is on the selected cases only • Once you have finished working with the subset, restore the file to All Cases before doing any further analysis • Data/Select Cases… • Select the All Cases radio button • OK

  24. Locating a case • From the Data Editor: • Data/Go To Case OR • Select a variable, then Edit/Find

  25. Logical errors • Detecting logical errors involves comparing answers to ensure that they are consistent • The type of logical checks appropriate to identify particular errors will depend on the questions in the questionnaire

  26. Detecting logical errors • Cross-tabulations between categorical variables can be used to highlight errors • Check criteria using conditional statements and the Compute facility • Some software, such as SPSS Databuilder, allows tests for logical and coding errors to be built into a data entry form

  27. Example: Cross-tabulation • Cross-tabulations provide a simple method of investigating the joint distribution of two variables • The following slide is a cross-tabulation of Drug1 against Mode1 to check that appropriate modes of ingestion have been reported

  28. Most Frequently Used Drug (Cross-tabulation)
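
A cross-tabulation is a count of cases for each pair of values. The sketch below builds one in Python for a drug variable against a mode-of-ingestion variable. The cases, the labels and the helper `cases_in_cell` are hypothetical and do not reproduce the codes used in main.sav.

```python
from collections import Counter

# Hypothetical cases; labels are illustrative only.
cases = [
    {"id": 1, "drug1": "cannabis", "mode1": "smoke"},
    {"id": 2, "drug1": "heroin", "mode1": "inject"},
    {"id": 3, "drug1": "cannabis", "mode1": "inject"},  # combination worth checking
    {"id": 4, "drug1": "cannabis", "mode1": "smoke"},
    {"id": 5, "drug1": "heroin", "mode1": "smoke"},
]

# Count the cases falling into each (drug, mode) cell.
crosstab = Counter((c["drug1"], c["mode1"]) for c in cases)
drugs = sorted({c["drug1"] for c in cases})
modes = sorted({c["mode1"] for c in cases})

# Print drugs as rows and modes as columns.
print(f"{'drug1':<10}" + "".join(f"{m:>8}" for m in modes))
for d in drugs:
    print(f"{d:<10}" + "".join(f"{crosstab[(d, m)]:>8}" for m in modes))


def cases_in_cell(drug, mode):
    """Return the ID numbers of the cases in one cell of the table."""
    return [c["id"] for c in cases if c["drug1"] == drug and c["mode1"] == mode]


print(cases_in_cell("cannabis", "inject"))
```

Once an implausible cell has a non-zero count, the ID numbers of the cases in that cell lead back to the questionnaires that need checking.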

  29. Example: conditional statements • Main.sav contains information on the three most frequently used drugs: Drug1, Drug2 and Drug3 • In a single case, no drug should appear in more than one of the three variables • To check this, generate a test variable on the basis of a conditional statement; the test variable should take the value 0 if all three drug variables are different and the value 1 if there is any duplication

  30. Compute: TEST = 0 • Transform/Compute • Enter the name of the new variable: TEST • Click the Type and Label button and declare the variable as numeric with the label: TEST VARIABLE FOR DRUG DUPLICATION • Set TEST = 0

  31. Compute: TEST = 1 • If any of the drug options are the same, TEST should equal 1 EXCEPT when Drug2 = Drug3 = 77 (not applicable) • The condition is: IF • Drug1 = Drug2 OR • Drug1 = Drug3 OR • (Drug2 = Drug3 AND Drug2 ≠ 77) • THEN TEST = 1
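
The condition on this slide translates directly into code. The Python below is a minimal sketch of the same logic: the function name `duplication_test` and the example drug codes are assumptions, and only the not-applicable code 77 comes from the slide.

```python
NOT_APPLICABLE = 77  # code for "not applicable", as stated on the slide


def duplication_test(drug1, drug2, drug3):
    """Return 1 if any drug is duplicated across the three variables, else 0."""
    if (
        drug1 == drug2
        or drug1 == drug3
        or (drug2 == drug3 and drug2 != NOT_APPLICABLE)
    ):
        return 1
    return 0


# Hypothetical drug codes, chosen only to exercise each branch.
print(duplication_test(1, 2, 3))    # all different
print(duplication_test(1, 1, 3))    # Drug1 = Drug2
print(duplication_test(1, 2, 2))    # Drug2 = Drug3, a real duplicate
print(duplication_test(1, 77, 77))  # Drug2 = Drug3 = not applicable
```

The parentheses around the third clause matter: the AND has to apply only to the Drug2 = Drug3 comparison, so that two not-applicable entries are not reported as a duplicate.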

  32. Click If… button to define the conditional statement.

  33. Case summaries (limited to first 100 cases)

  34. Exercise • Check for consistency between the drug reported and the method of ingestion for the second and third drugs of use • What additional logical tests could be completed on the data in main.sav?

  35. Summary • Data entry errors • Out-of-range errors • Logical errors • Conditional statements • Selecting cases • Reports
