Data Cleaning 101: The Ron Cody Story DAWG April 26, 2007 Katherine Semrau
So, who is Ron Cody anyway? Professor at RWJ Medical School, expert SAS programmer, author of SAS books, and an avid cyclist
5 rules of data management (Suzette Levenson's 5 Rules) • What can go wrong, will. • Nothing is ever as simple as it first appears. • Everything takes longer than you expect. • One size does NOT fit all. • A calm sea does not make a skilled sailor. (African proverb)
Outline • Purpose and Flow • Data cleaning from A to Z • Types of Data Cleaning • SAS Coding & examples • Reporting Errors & Corrections • Conclusions
Purpose of Data Cleaning • To verify that the dataset to be used for analysis accurately reflects the truth
First things first… Data cleanliness starts with: • A good form • Is it clear? Readable? Understandable? In the appropriate language? AND • Legible handwriting
Once data is collected… • Review by the data collector • Check by a 2nd pair of eyes • Both asking the following: • Is the form complete (i.e. all pages)? • Are there blanks? • Are there strange answers? • Are multiple choices marked when only one choice is allowed?
On to data entry… • Double Data Entry (CSPro, Access, EpiInfo...) • TeleForms • The data entry specialist will catch queries that were initially missed
On to data entry… • Queries are sent to the field site for variables that don't make sense or are illegible • The original data collector should be asked about the query • Continue with data entry or wait for the query to be returned?
Now to data checking & cleaning… • So now you have a database with data entered… now what? • Where to begin… • Start small and basic • Then move to more complex cleaning • Get your data dictionary out
Types of data checking • Validity Checks • Range Checks • Logic Checks • Missing Data
What's in the database? (proc contents) • "proc contents" should be run first to make sure all the expected variables are present • Check variable names • Format of each variable • Character vs. Numeric
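A minimal sketch of this first-pass check (the library and dataset names here are hypothetical):

    * List every variable with its name, type, length, and format ;
    proc contents data=study.visits;
    run;

Compare the output against your data dictionary before running anything else.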
Validity Checks (proc freq) • "proc freq" is your friend • Use the standard "proc freq" just to see the possible answers • Do the answers make sense? • Are there many missing values? • Are there strange outliers from what is expected? • Example: Gender: M, F
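A minimal validity check on the gender example (dataset and variable names are hypothetical):

    * One-way frequency table, counting missing values too ;
    proc freq data=study.visits;
      tables gender / missing;
    run;

Any value other than M or F, or a large missing count, flags a query.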
Range Checks (proc freq & proc means) • Do the variables fall within expected limits: • Age: 0-100 years • Weight: 2500g-4000g • proc freq, proc means (example) • proc tabulate
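A sketch of the range check, assuming hypothetical age and weight variables:

    * Summary statistics reveal out-of-range minima and maxima ;
    proc means data=study.visits n nmiss min max mean;
      var age weight;
    run;

    * Print the offending records for follow-up ;
    proc print data=study.visits;
      where age < 0 or age > 100;
      var subject_id age;
    run;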
Logic Checks • Use “if..then” statements in the data step • “Where” statements to ensure the skip patterns were met
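A minimal logic check, assuming a hypothetical skip pattern where num_pregnancies should be blank when ever_pregnant = 'N':

    * Output only records that violate the skip pattern ;
    data logic_errors;
      set study.visits;
      if ever_pregnant = 'N' and not missing(num_pregnancies) then output;
    run;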
Duplicate Records • Records mistakenly entered twice • Two options: • NODUPKEY • Eliminates duplicate records by the key variables you identify in the BY statement… • But you have to be very careful • proc sort data=old out=new nodupkey; by id; • NODUP • proc sort data=old out=new nodup; by _all_; • Eliminates records that are EXACTLY identical for all variables
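Both options in full, with hypothetical dataset and key names:

    * Keep one record per subject/visit key; extra records with the same key are dropped ;
    proc sort data=study.visits out=visits_clean nodupkey;
      by subject_id visit_date;
    run;

    * Drop only records that match on every variable (sorting by _all_ makes identical records adjacent) ;
    proc sort data=study.visits out=visits_clean2 nodup;
      by _all_;
    run;

NODUPKEY can silently discard genuinely different records that share a key, so review what it removes before trusting the output.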
What to do when you find an error… • Generate a report • Send to field staff or review against original charts • Receive answers • Correct the answers • In the database or • In the code • Document, document, document
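One way to generate the query report, reusing the hypothetical logic_errors dataset from the logic-check step:

    * Export flagged records as a document the field team can annotate ;
    ods rtf file='query_report.rtf';
    title 'Query report: skip-pattern violations';
    proc print data=logic_errors noobs;
      var subject_id ever_pregnant num_pregnancies;
    run;
    ods rtf close;
    title;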
When is data cleaning done? • Batches of forms are processed or • Time Schedule (i.e. monthly) or…. • NOT just before the analytic dataset is created
When is the data cleaning as good as it is going to get… • How many rounds do you go? • Who decides how clean the data need to be? • A conversation between the data analyst, PI, and field staff…
Conclusions • Data cleaning is a very important step • Planning is key • Think of it as the CSI of the data