STATA Lab: EP521 ‘Learning by Doing’ Session 1: Exploring Data

Ray Boston
boston@vet.upenn.edu
Room 604 Blockley
610 925 6557

This 6-session Stata Lab series will expose the 2nd-level functionality of Stata through practical demonstrations and exercises relating to your course, EP 521.

Course Schedule:
Presenter: Ray Boston
Location: Room 604 Blockley
Phone: 610 925 6557
boston@vet.upenn.edu

Commands used in this lab

use         use a Stata dataset earlier stored on disk: note clear option
inspect     inspect specific variables: note missing value info available here
describe    describe variables in a Stata dataset: note detail option
summarize   summarize a Stata dataset: note works on individual variables
codebook    report details of data coding for indicated variable(s)
display     display a message, or a variable value (scalar or local)
label       label a categorical variable
encode      make a numeric variable for a string variable
generate    generate a new Stata variable
replace     replace the value of a variable
list        list specific variables: note tables can also be generated
table       tabulate an interval variable by a categorical variable
tabstat     tabulate some statistics for specific variables
tabulate    tabulate some information: note this is the command for Fisher’s test
sort        sort the data
gsort       sort the dataset in a specified way

Secondary commands used in this lab*

for         loop for a series of ‘objects’: note that this is an out-of-date command
collapse    reduce your dataset to summary statistics
scatter     produce a Stata 8 scatter plot
gr7         produce a Stata 7 graph
#d ; #d cr  change the end-of-line delimiter (; and cr)
preserve    preserve a copy of the current Stata dataset in computer memory
restore     retrieve the preserved copy of the Stata dataset: note the stored copy is no longer available, and the original dataset is replaced
scalar      generate a Stata scalar variable
local       generate a Stata local variable: note also called a local macro
cc          generate a case-control type of epi table
cs          generate a cohort-study type of epi table
logit       perform a logistic regression
poisson     perform a Poisson regression

* We may return to these commands for specific purposes in later labs

Problem: Woodward presents the following table (Table 2.9, p. 48) relating to sex versus smoking status in the Scottish Heart Health Study. Adapt the information in this table for analysis with STATA.

Variables: sex, smoker, count
Coding:
  smoker: 0, non-smoker; 1, smoker
  sex: 0, female; 1, male
  count: actual cell count

The data can be entered into STATA via the data editor.

Label the values of sex and smoker so that our tables make sense:

. label define smlabel 0 "Non smoker" 1 " Smoker "
. label define selabel 0 "Female" 1 " Male "
. label val sex selabel
. label val smoker smlabel

Note that cell counts, and NOT margins, are entered into STATA.
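As an alternative to typing into the data editor, the same four cells could be entered from the command line or a ‘do’ file. This is only a sketch: the counts are the ones shown in the listing of this dataset in the lab, and the storage types (byte, int) are an assumption chosen to fit the values.

```stata
* Enter the four cell counts directly (not the margins)
clear
input byte sex byte smoker int count
0 1 1562
1 0 2241
0 0 2259
1 1 2279
end
```

The label commands above can then be applied to this dataset unchanged.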

Label the variable count:

. label var count "Cell count"

For some preliminary limbering up, let’s explore the data as it stands:

. list sex smoker count

  1.   Female       Smoker   1562
  2.     Male   Non smoker   2241
  3.   Female   Non smoker   2259
  4.     Male       Smoker   2279

Why wasn’t count labeled like sex and smoker?

We should now save the table as a file:

. save "table 2_9 Woodward.dta", replace

Where was the data saved? (cd)
Why did we include the replace option? (pre-existence)
Why do we refer to replace as an option? (‘,’)
Why did we use quotes ("") around the file name? (space)
What format was the data saved in? (‘.dta’)
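One way to answer the ‘where’ and ‘what format’ questions from within Stata is sketched below. It assumes the file was saved in the current working directory under the name used above.

```stata
* Show the current working directory (where save puts files by default)
pwd

* Save, overwriting any earlier copy of the file
save "table 2_9 Woodward.dta", replace

* Describe the file on disk without loading it into memory
describe using "table 2_9 Woodward.dta"
```

Using describe with a file name reports on the stored copy, which is a quick check that the save really happened.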

Let’s see a table of this data:

. table sex smoker [fwe=count], row col

----------------------------------------------
          |              smoker
      sex | Non smoker      Smoker       Total
----------+-----------------------------------
   Female |      2,259       1,562       3,821
     Male |      2,241       2,279       4,520
          |
    Total |      4,500       3,841       8,341
----------------------------------------------

Let’s see how we recall the coding schemes .. why would this be needed?

. codebook sex

sex --------------------------------------------------------------- (unlabeled)

          type: numeric (byte)
         label: selabel
         range: [0,1]            units: 1
 unique values: 2        coded missing: 0 / 4

    tabulation: Freq.   Numeric   Label
                    2         0   Female
                    2         1   Male
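Two further checks may be useful here, offered as a sketch rather than as part of the original command sequence: label list prints the coding schemes directly, and a little arithmetic with display confirms that the row margins are consistent with the cell counts entered.

```stata
* Print the value-label definitions attached to sex and smoker
label list selabel smlabel

* Cross-tabulate with row percentages, weighting each cell by its count
tabulate sex smoker [fweight=count], row

* Row margins recomputed from the cells: these should match the
* Female (3,821) and Male (4,520) totals in the table
display 2259 + 1562
display 2241 + 2279
```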

We will explore this data using the Stata command sequence which follows.

First some EXTREMELY important points:
• In practice you will ALWAYS build your statistical exploration of data using command sequences such as we now demonstrate. Why?
• The nature of the commands in the command sequence is ALWAYS retained on your computer in a disk file, usually close to the dataset (“table 2_9 Woodward.dta”) for which it was developed. Why?
• Commands are stored as ordinary text in files called ‘do’ files. Why?
• Stata has a special editor, the ‘do’ file editor, for the creation and editing of ‘do’ files. Why?
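To make the idea of a ‘do’ file concrete, here is a minimal skeleton. The file names session1.do and session1.log are hypothetical, and the sketch assumes the dataset sits in the current working directory.

```stata
* session1.do (hypothetical name): a minimal command sequence
* Close any log left open by an earlier run, ignoring the error if none
capture log close

* Keep a plain-text record of commands and output next to the data
log using "session1.log", replace text

use "table 2_9 Woodward.dta", clear
list
describe

log close
```

Running the file with do session1 repeats every step exactly, which is one answer to the ‘Why?’ questions above: the analysis is documented and reproducible.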

use "C:\Stata\EP521\Epi 521 04\Session 1\table 2_9 Woodward.dta",clear

* Information about the raw data: correctness/screening
list
codebook
describe
summarize
summarize sex smoke [fwe=count]
label define smlabel 0 "Non smoker" 1 " Smoker "
label define selabel 0 "Female" 1 " Male "
label val sex selabel
label val smoker smlabel
list

* If we want to copy the table to Excel:
* Select, and Edit | copy table, and Paste
* the following table
list, nolabel noobs clean
codebook
inspect
describe

* Some tables describing the data
tabulate sex [fwe=count], su(smoke) mean
table sex [fwe=count], c(mean smoke freq) format(%7.2f)
tabulate sex smoke [fwe=count], chi
table sex smoke [fwe=count], row col
tabstat smoke [fwe=count], s(mean sd sem N) by(sex) long

* Present some simple graphs of this data
preserve
collapse smoke [fwe=count], by(sex)
gen pos=3^(sex+1)

Steps in this part of the sequence:
• Get the data into Stata
• Screening the input using: list, describe, summarize, codebook, inspect, and table variations
• Preparing to graph

scatter smoke sex, c(l) ml(sex)
more
scatter smoke sex, c(l) ml(sex) mlabv(pos)
more

* Now for adjustments required by Stata 8 graphics syntax
#d ;
scatter smoke sex, c(l) ms(Sh) mlabv(pos) xlabel(0 1, valuelabel)
    title("Smoking Proportion By Sex") ytitle(" ") ylabel(,angle(0)) ;
#d cr
more

* gr7 requests a Stata 7 type graph
* You establish Stata 7 graph preferences using 'oldgprefs'
gr7 smoke sex, c(l) s([sex]) xlabel(0 1) ylabel l1("Smoking Proportion By Sex")
more

* Let's determine the male:female risk ratio for smoking
di "Risk ratio = " max(smoke[1],smoke[2])/min(smoke[1],smoke[2])
restore

* Two alternate ways of looking at the data - Risk perspective
cs smoke sex [fwe=count]
poisson smoke sex [fwe=count], irr nolog ro

* Using scalars let's calculate the male:female odds ratio for smoking
gsort sex -smoke
scalar prob_female=count[1]/(count[1]+count[2])
scalar odds_female= prob_female/(1-prob_female)
scalar prob_male=count[3]/(count[3]+count[4])
scalar odds_male=prob_male/(1-prob_male)
scalar odds_ratio=odds_male/odds_female
scalar list _all

* Two alternate ways of looking at the data - Odds perspective
cc smoke sex [fwe=count]
logit smoke sex [fwe=count], or nolog

Steps in this part of the sequence:
• Stata 8 graphing commands
• Stata 7 graphing command
• Manual ‘rr’ calculation
• Two other ways of determining risk ratio - rr
• Manual ‘or’ calculation
• Two other ways of determining odds ratio - or
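As a cross-check on the manual calculations, the two ratios can also be computed straight from the four cell counts, without any dataset in memory. This sketch uses the counts from the table earlier in the lab and Stata’s immediate epi-table commands csi and cci, treating smoking as the ‘case’ outcome and male sex as the ‘exposure’; that framing is an assumption made to mirror the cs and cc calls above.

```stata
* Risk ratio: proportion of males smoking over proportion of females smoking
display "Risk ratio = " (2279/4520)/(1562/3821)

* Odds ratio: cross-product of the 2x2 table
display "Odds ratio = " (2279*2259)/(2241*1562)

* Immediate forms: arguments are exposed cases, unexposed cases,
* exposed noncases, unexposed noncases
csi 2279 1562 2241 2259
cci 2279 1562 2241 2259
```

The risk ratio works out to roughly 1.23 and the odds ratio to roughly 1.47, so the output of cs, poisson, cc and logit can be compared against these figures.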

An exercise to get you started using Stata productively on your own

The following table is from Kahn & Sempos (p. 81) and reflects a distillation of some information extracted from the Framingham study. Ultimately we would like to use these numbers to possibly tell us:
• to what degree blood pressure elevation predisposes us to CHD
• what the overall risk for CHD is amongst study participants in the table
• how much the risk of CHD is elevated if we have high blood pressure

Getting the CHD data into STATA and naming the variables. What do we mean by naming the variables?

Perform the following tasks:
• Screen the data entered to confirm its correctness. How could you generate the margins to add confidence here? Do it.
• Label the variables appropriately. What constitutes appropriate labeling?
• Save the Stata data file. Where did you save it? What format was used?
• Verify that you have indeed saved the Stata data file.
• Perform tests to verify that you have correctly prepared your data.

Tables:
• Reproduce the table in which the problem is first introduced.
• Tabulate the proportion of subjects with CHD by blood pressure grouping.
• Add standard error estimates to this table.
• Are the proportions with CHD different by blood pressure group?

Graphs:
• Collapse the data into proportions with CHD, by blood pressure group.
• Produce a simple Stata 8 graph of CHD proportion against blood pressure.
• Add features to your graph to make it publication ready.
• Produce a Stata 7 graph of the same data … which was easier?
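If you get stuck, the pattern from the smoking example carries over almost unchanged. The sketch below is a starting point only: chd (coded 0/1), bpgroup and count are hypothetical variable names, and you will need to substitute the names and coding you actually chose.

```stata
* chd, bpgroup and count are hypothetical names for your own variables
* Cross-tabulation with row percentages and a chi-squared test
tabulate bpgroup chd [fweight=count], row chi2

* Proportion with CHD, its standard error, and n, by blood pressure group
tabstat chd [fweight=count], s(mean sem n) by(bpgroup)

* Collapse to one proportion per group, graph it, then get the data back
preserve
collapse chd [fweight=count], by(bpgroup)
scatter chd bpgroup, c(l)
restore
```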

The Excel file, ‘cardatarb.xls’, contains some recent (New Yorker, Jan 05, 2004) accident statistics relating to indirect and direct road deaths when a range of different car types were involved. The purpose of the investigation underpinning the data was to see if large vehicles are associated with different types of accidents than small cars.

You are asked to perform the following tasks:
• Get the data from Excel directly into Stata.
• Describe and summarize the data.
• Generate a neat table of all types of deaths (these are actually death rates per million vehicles of the indicated type) by vehicle type. Is there a suggestion of an association here?
• Make a numeric variable out of the car type variable.
• Confirm that the new variable you have created is indeed of the type sought.
• Label the numeric variable appropriately. Hint: you’ll need codebook’s help here.
• The vehicles are essentially of two classes, large and small. Create a new numeric variable which is 1 for large vehicles, and 0 for others. Label this variable appropriately.
• See if your data breaks down equitably by your new numeric variable.
• Tabulate a breakdown of deaths of the different types by your new size-related vehicle group variable. How could you actually detect a statistically significant difference here? (see nptrend)
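A possible starting point for the first few tasks is sketched below. It rests on several assumptions: that you are running a recent Stata release (12 or later) in which import excel is available, that the first spreadsheet row holds the variable names, and that the car type column is called cartype, which is a hypothetical name. In older releases you would instead paste the data into the data editor or read a text export.

```stata
* Assumes Stata 12+ and variable names in the first spreadsheet row
import excel using "cardatarb.xls", firstrow clear
describe
summarize

* cartype is a hypothetical name for the string car-type column;
* encode builds a labeled numeric version of it
encode cartype, generate(cartype_n)

* Confirm the storage type and inspect the codes encode assigned
describe cartype cartype_n
codebook cartype_n
label list cartype_n
```

By default encode stores its value label under the new variable’s name, which is why label list cartype_n shows the mapping from codes to car types.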
