380 likes | 608 Views
Generating new variables and manipulating data with STATA. Biostatistics 212 Lecture 3. Housekeeping. Lab 1 handed back today Think of red ink as teaching points, not penalties… Log and Do files Cookbook labs vs. complete understanding. The .do file template. cd “C:databiostat212”
E N D
Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3
Housekeeping • Lab 1 handed back today • Think of red ink as teaching points, not penalties… • Log and Do files • Cookbook labs vs. complete understanding
The .do file template cd “C:\data\biostat212\” set more off clear set memory 10m capture log close log using “name of your log.log”, replace /* here are my comments */ use “name of your dataset”, clear summarize this browse that tabulate this log close set more on • Tell STATA where to look for things and where to put things • Stop STATA from prompting you to push a button to continue • Tell STATA to clear any datasets in memory and increase its mem capacity • Since your do file may not be perfect, tell STATA to close any logs that are open when you try to run your do file • Tell STATA to create a log of your output for you and what you’re going to call that log. Tell it to overwrite it each time • Stick in some comments to remind you what this do file is for • Tell STATA what dataset to work on • Leave some SPACE for putting in analysis commands you want to keep • Lastly, tell STATA to close the log and go back to its usual “more” mode
Today... • What we did in Lab 1, and why it was unrealistic • What does “data cleaning” mean? • How to generate a variable • How to manipulate the data in your new variable • How to label variables and otherwise document your work • Examples
Last time… • What was unrealistic?
Last time… • What was unrealistic? • The dataset came as a Stata .dta file
Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze
Last time… • What was unrealistic? • The dataset came as a Stata .dta file • The variables were ready to analyze • Most variables were labeled
Last time… • i.e. – The data was “clean”
How your data will arrive • On paper forms • In a text file (comma or tab delimited) • In Excel • In Access • In another data format (SAS, etc)
Importing into Stata • Options: • Copy and Paste • insheet, infile, fdause, other flexible Stata commands • A convenience program like “Stat/Transfer”
Importing into Stata • Make sure it worked • Look at the data
Importing into Stata • Demo – neonatal opiate withdrawal data • Import with cut and paste from Excel • Import with insheet (save as .csv file first)
Exploring your data • Figure out what all those variables mean • Options • Browse, describe, summarize, list in STATA • Refer to a data dictionary • Refer to a data collection form • Guess, or ask the person who gave it to you
Exploring your data • Demo: Neonatal opiate withdrawal data
Exploring your data • Demo: Neonatal opiate withdrawal data • Problems arise… • Sex is m/f, not 1/0 • Gestational age has nonsense values (0, 60) • Breastfeeding has a bunch of weird text values • Drug variables coded y or blank • Many variable names are obscure
Cleaning your data • You must “clean” your data so it is ready to analyze.
Cleaning your data • What does the variable measure? • rename or label var so it’s clear • Find nonsense values and outliers • recode as missing or track down real value? • Deal with missing values • Too many? Coding consistent? • drop variable or observation? • Categorize as needed • generate a new numeric variable • recode (dichotomous variables coded as 1/0, watch missing values) • label define and then label values • Check • tab oldvar newvar, missing • bysort catvar: sum contvar
Cleaning your data • The importance of documentation • Retracing your steps • Document every step using a “do” file
Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression
Data cleaningBasic skill 1 – Making a new variable • Creating new variables generate newvar = expression • An “expression” can be: • A number (constant) - generate allzeros = 0 • A variable - generate ageclone = age • A function - generate agesqrt = sqrt(age)
Data cleaningBasic skill 2 – Manipulating values of a variable • Changing the values of a variable replace var = exp [if boolean_expression] • A boolean expression evaluates to true or false for each observation
Data cleaningBasic skill 2 – Manipulating values of a variable • Examples generate bmi = weight/(height^2) generate male = 0 replace male = 1 if sex==“male” generate ageover50 = 0 replace ageover 50 = 1 if age>50 generate complexvar = age replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
Data cleaningBasic skill 2 – Manipulating values of a variable • Logical operators for boolean expressions: EnglishStata Equal to == Not equal to !=, ~= Greater than > Greater than/equal to >= Less than < Less than/equal to <= And & Or |
Data cleaningBasic skill 2 – Manipulating values of a variable • Mathematical operators: EnglishStata Add + Subtract - Multiply * Divide / To the power of ^ Natural log of ln(expression) Base 10 log of log10(expression) Etcetera…
Data cleaningBasic skill 2 – Manipulating values of a variable • Another way to manipulate data recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean_expression] • More complicated, but more flexible command than replace
Data cleaningBasic skill 2 – Manipulating values of a variable • Examples generate male = 0 recode male 0=1 if sex==“male” generate female = male recode female 1=0 0=1
Data cleaningBasic skill 2 – Manipulating values of a variable • Examples generate raceethnic = race recode raceethnic 1=6 if ethnic==“hispanic” (replace raceethnic = 6 if ethnic==“hispanic” & race==1) generate tertilescac = cac recode min/54=1 55/82=2 83/max=3
Data cleaningBasic skill 3 – Getting rid of variables/observations • Getting rid of a variable drop var • Getting rid of observations drop if boolean_expression
Data cleaningBasic skill 4 – Labeling things • You can label: • A dataset label data “label” • A variable label var varname “label” • Values of a variable (2-step process) label define labelnamevalue1 “label1” [value2 “value2”…] label values varnamelabelname
Data cleaningBasic skill 5 –Dealing with missing values • Missing values are important, easy to forget • . for numbers • “” for text • tab var1 var2, missing • Watch the total “n” for tab, summarize commands, regression analyses, etc.
Data cleaning • Demo: Neonatal opiate withdrawal data
Cleaning your data • What does the variable measure? • rename or label var so it’s clear • Find nonsense values and outliers • recode as missing or track down real value? • Deal with missing values • Too many? Coding consistent? • drop variable or observation? • Categorize as needed • generate a new numeric variable • recode (dichotomous variables coded as 1/0, watch missing values) • label define and then label values • Check • tab oldvar newvar, missing • bysort catvar: sum contvar
Data cleaning • At the end of the day you have: • 1 raw data file, original format • 1 raw data file, Stata format • 1 do file that cleans it up • 1 log file that documents the cleaning • 1 clean data file, Stata format
Summary • Data cleaning • ALWAYS necessary to some extent • ALWAYS use a do file • NEVER overwrite original data • Check your work • Watch out for missing values • Label as much as you can
Lab this week • It’s long • It’s hard • It’s important • Email lab to your section leader’s email • Due at the beginning of lecture next week
Preview of next week… • Using Excel • What is it good for? • Formulas • Designing a good spreadsheet • Formatting