350 likes | 795 Views
Generating new variables and manipulating data with STATA. Biostatistics 212 Lecture 3. Housekeeping. Questions re: Log and Do files?. Today. What we did in Lab 1, and why it was unrealistic What does “data cleaning” mean? How to generate a variable
E N D
Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3
Housekeeping • Questions re: Log and Do files?
Today... • What we did in Lab 1, and why it was unrealistic • What does “data cleaning” mean? • How to generate a variable • How to manipulate the data in your new variable • How to label variables and otherwise document your work • Examples
Last time… • What was unrealistic?
Last time… • What was unrealistic? • The dataset came as a Stata .dta file
Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze
Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze • Most variables were labeled
Last time… • i.e. – The data was “clean”
How your data will arrive • On paper forms • In a text file (comma or tab delimited) • In Excel • In Access • In another data format (SAS, etc)
Importing into Stata • Options: • Copy and Paste • insheet, infile, fdause, other flexible Stata commands • A convenience program like “Stat/Transfer”
Importing into Stata • Make sure it worked • Look at the data
Importing into Stata • Demo – neonatal opiate withdrawal data
Exploring your data • Figure out what all those variables mean • Options • Browse, describe, summarize, list in STATA • Refer to a data dictionary • Refer to a data collection form • Guess, or ask the person who gave it to you
Exploring your data • Demo: Neonatal opiate withdrawal data
Exploring your data • Demo: Neonatal opiate withdrawal data • Problems arise… • Sex is m/f, not 1/0 • Gestational age has nonsense values (0, 60) • Breastfeeding has a bunch of weird text values • Drug variables coded y or blank • Many variable names are obscure
Cleaning your data • You must “clean” your data so it is ready to analyze.
Cleaning your data • Cleaning tasks • Check for consistency and clean up nonsense data and outliers • Deal with missing values • Code all dichotomous variables 1/0 • Categorize variables as needed (for Table 1, etc) • Derive new variables • Rename variables • With common sense, or with a consistent scheme • Label variables • Label the VALUES of coded variables
Cleaning your data • The importance of documentation • Retracing your steps • Document every step using a “do” file
Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression
Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression • An “expression” can be: • A number (constant) - generate allzeros = 0 • A variable - generate ageclone = age • A function - generate agesqrt = sqrt(age)
Data cleaningBasic skill 2 – Getting rid of variables/observations • Getting rid of a variable drop var • Getting rid of observations drop if boolean exp
Data cleaningBasic skill 3 – Manipulating values of a variable • Changing the values of a variable replace var = exp [if boolean exp] • A boolean expression evaluates to true or false for each observation
Data cleaningBasic skill 3 – Manipulating values of a variable • Examples generate male = 0 replace male = 1 if sex==“male” generate ageover50 = 0 replace ageover 50 = 1 if age>50 generate complexvar = age replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
Data cleaningBasic skill 3 – Manipulating values of a variable • Logical operators for boolean expressions: EnglishStata Equal to == Not equal to !=, ~= Greater than > Greater than/equal to >= Less than < Less than/equal to <= And & Or |
Data cleaningBasic skill 3 – Manipulating values of a variable • Mathematical operators: EnglishStata Add + Subtract - Multiply * Divide / To the power of ^ Natural log of ln(expression) Base 10 log of log10(expression) Etcetera…
Data cleaningBasic skill 3 – Manipulating values of a variable • Another way to manipulate data Recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression] • More complicated, but more flexible command than replace
Data cleaningBasic skill 3 – Manipulating values of a variable • Examples Generate male = 0 Recode male 0=1 if sex==“male” Generate raceethnic = race Recode raceethnic 1=6 if ethnic==“hispanic” (Replace raceethnic = 6 if ethnic==“hispanic” & race==1) Generate tertilescac = cac Recode min/54=1 55/82=2 83/max=3
Data cleaningBasic skill 4 – Labeling things • You can label: • A dataset label data “label” • A variable label var varname “label” • Values of a variable (2-step process) label define labelnamevalue1 “label1” [value2 “value2”…] Label values varnamelabelname
Data cleaningBasic skill 5 –Dealing with missing values • Missing values are important, easy to forget • . for numbers • “” for text • tab var1 var2, missing • Watch the total “n” for tab, summarize commands, regression analyses, etc.
Data cleaning • Demo: Neonatal opiate withdrawal data
Cleaning your data • Cleaning tasks • Check for consistency and clean up non-sense data • Deal with missing values • Code all dichotomous variables 1/0 • Categorize variables meaningfully (for Table 1, etc) • Derive new variables • Rename variables • With common sense, or with a consistent scheme • Label variables • Label the VALUES of coded variables
Data cleaning • At the end of the day you have: • 1 raw data file, original format • 1 raw data file, Stata format • 1 do file that cleans it up • 1 log file that documents the cleaning • 1 clean data file, Stata format
Summary • Data cleaning • ALWAYS necessary to some extent • ALWAYS use a do file • NEVER overwrite original data • Check your work • Watch out for missing values • Label as much as you can
Lab this week • It’s long • It’s hard • It’s important • Email lab to your section leader’s email • Due at the beginning of lecture next week
Preview of next week… • Using Excel • What is it good for? • Formulas • Designing a good spreadsheet • Formatting