Stata Workshop #1

Chiu-Hsieh (Paul) Hsu, Associate Professor, College of Public Health, pchhsu@email.arizona.edu

Outline
• Do files
• Data entry
• Data management
• Data description
• Estimation: confidence intervals
• Hypothesis testing

Do files
• Stata programs
• Easy to add or skip comments
• One click/command can run the whole program
• Reproducible
• Don’t need to retype all of the commands
• Interactive work vs. do files

Data Entry

Stata Commands
• cd: Change directory
• dir or ls: Show files in current directory
• insheet: Read ASCII (text) data created by a spreadsheet
• infile: Read unformatted ASCII (text) data
• infix: Read ASCII (text) data in fixed format
• input: Enter data from keyboard
• save: Store the dataset currently in memory on disk in Stata data format
• use: Load a Stata-format dataset
• count: Show the number of observations
• list: List values of variables
• clear: Clear the entire dataset and everything else
• memory: Display a report on memory usage
• set memory: Set the size of memory

Ways to enter data
• Change the directory to the folder you like
  • cd c:\Stata
• Comma-separated values (.csv) format files
  • insheet using test.csv, clear (with variable names)
  • infile gender id race ses schtyp str10 prgtype read write math science socst using hs0.raw, clear (without variable names)
• Stata (.dta) files
  • use test
• Type in data one by one
  • input id female race ses str3 schtype prog read write math science socst
  • end (when you are done)
• What’s in the dataset?
  • describe
  • list

Data Management

Stata Commands
• pwd: show current directory (pwd = print working directory)
• keep if: keep observations if condition is met
• keep: keep variables or observations
• drop: drop variables or observations
• append: append a data file to current file
• sort: sort observations
• merge: merge a data file with current file
• codebook: show codebook information for file
• label data: apply a label to a data set
• order: order the variables in a data set
• label variable: apply a label to a variable
• label define: define a set of labels for the levels of a categorical variable
• label values: apply value labels to a variable
• encode: create numeric version of a string variable
• rename: rename a variable
• recode: recode the values of a variable
• notes: apply notes to the data file
• generate: creates a new variable
• replace: replaces one value with another value
• egen: extended generate - has special functions that can be used when creating a new variable

Merging two datasets
• test1 and test2 have the same variables but different subjects
  use test1
  append using test2
  save test12
• test3 and test4 have the same subjects and only share a link variable, e.g. id
  use test3, clear
  sort id
  save test3, replace
  use test4, clear
  sort id
  save test4, replace
  use test3
  merge id using test4
  save test34
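To make the difference between the two operations concrete, here is a minimal sketch in Python (not Stata). It only mimics the logic: append stacks rows, while merge matches rows on the link variable. The records and the helper merge_on_id are hypothetical and stand in for test1 to test4.

```python
# Hypothetical records standing in for test1..test4; all values are made up.
test1 = [{"id": 1, "read": 57}, {"id": 2, "read": 68}]
test2 = [{"id": 3, "read": 44}, {"id": 4, "read": 63}]

# append: same variables, different subjects -> stack the rows.
test12 = test1 + test2

# merge: same subjects, different variables -> match rows on the link variable.
test3 = [{"id": 2, "math": 53}, {"id": 1, "math": 41}]
test4 = [{"id": 1, "science": 47}, {"id": 2, "science": 63}]


def merge_on_id(left, right):
    """Match rows one-to-one on 'id'; ids present in only one input are kept."""
    by_id = {}
    for row in left + right:
        by_id.setdefault(row["id"], {}).update(row)
    # Returning rows in id order plays the role of 'sort id'.
    return [by_id[key] for key in sorted(by_id)]


test34 = merge_on_id(test3, test4)
print(test12)
print(test34)
```

Appending changes the number of observations, while merging changes the number of variables per observation.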

Play with Variables
• use test
• label variable gender "Male"
• rename gender male
• gen female=1-male
• order id male female
• encode prgtype, gen(prog)
• codebook prog
• keep if female==1 (drops the males)
• drop female

Dummy Variables
• A categorical variable with K possible levels
  • Need K-1 dummy variables (one level as the reference)
• Dummy variables are convenient for regression analysis
• How to create dummy variables?
  • Use generate command
    • gen female=1-gender
  • Use tabulate command
    • tabulate gender, gen(male)
  • Use factor variables
    • xi i.gender
    • list, clean
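The K-1 rule can be sketched in a few lines of Python (not Stata). The helper make_dummies, the is_ prefix, and the program-type values are hypothetical and chosen only for illustration.

```python
def make_dummies(values, reference):
    """Return K-1 indicator lists for a K-level variable, omitting the reference level."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference level not found in the data")
    return {
        f"is_{level}": [1 if value == level else 0 for value in values]
        for level in levels
        if level != reference
    }


# Hypothetical 3-level program type: K = 3, so 2 dummies are created.
prgtype = ["general", "academic", "vocation", "academic"]
dummies = make_dummies(prgtype, reference="general")
print(dummies)
```

An observation at the reference level has 0 in every dummy, which is why the K-th indicator would be redundant in a regression with an intercept.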

Data Description

Stata Commands
• describe: describe a dataset
• log: create a log file
• summarize: descriptive statistics
• tabstat: table of descriptive statistics
• table: create a table of statistics
• stem: stem-and-leaf plot
• graph: high resolution graphs
• kdensity: kernel density plot
• histogram: histogram for continuous and categorical variables
• tabulate: one- and two-way frequency tables
• correlate: correlations
• pwcorr: pairwise correlations

Example: raw data
• log using test.txt, text replace
• use lead
• describe
• sum maxfwt, detail
• histogram maxfwt, by(Group) normal
• graph box maxfwt, by(Group)
• stem maxfwt
• kdensity maxfwt
• tab Group sex
• pwcorr ageyrs maxfwt, sig (significance levels need pwcorr; correlate has no sig option)
• pwcorr ageyrs maxfwt if sex==1, sig (males only)
• pwcorr ageyrs maxfwt fwt_r, sig
• log close

Example: grouped data
• use group (a grouped dataset)
• sum age [fweight=freq], detail
• hist age [fweight=freq]
• Pretty much the same as raw data. Just need to specify the weight.
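Why a frequency weight gives the same answer as raw data can be checked with a small Python sketch (not Stata). The ages and frequencies are hypothetical, and the two helper functions are written only for this illustration.

```python
import math
import statistics

# Hypothetical grouped data: each age occurs 'freq' times.
ages = [20, 30, 40]
freq = [2, 3, 1]


def weighted_mean(values, weights):
    """Mean of grouped data, treating each weight as a count of observations."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights):
    """Sample standard deviation of grouped data (divisor: total count minus 1)."""
    m = weighted_mean(values, weights)
    total = sum(weights)
    return math.sqrt(sum(w * (v - m) ** 2 for v, w in zip(values, weights)) / (total - 1))


# Expanding the grouped data back to one row per observation gives the same answers.
expanded = [age for age, count in zip(ages, freq) for _ in range(count)]
print(weighted_mean(ages, freq), statistics.mean(expanded))
print(weighted_sd(ages, freq), statistics.stdev(expanded))
```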

Some Review
• Use both location and spread measures to summarize a dataset
• Mean, standard deviation and range are easily affected by extreme observations
• Median and inter-quartile range are less affected by extreme observations
• Coefficient of variation (standard deviation divided by mean) removes the scale effect.
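These points can be illustrated with a short Python sketch (not Stata) on made-up numbers: one extreme observation moves the mean and standard deviation but leaves the median and inter-quartile range unchanged, and rescaling the data leaves the coefficient of variation unchanged. The quartile rule is the default of Python's statistics.quantiles, which may differ slightly from Stata's.

```python
import math
import statistics

# Hypothetical data; 'with_outlier' replaces the largest value by an extreme one.
clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = clean[:-1] + [180]


def iqr(values):
    """Inter-quartile range using the default rule of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def cv(values):
    """Coefficient of variation: standard deviation divided by mean."""
    return statistics.stdev(values) / statistics.mean(values)


for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name, statistics.mean(data), statistics.stdev(data),
          statistics.median(data), iqr(data))

# Changing the unit (multiplying by 100) leaves the CV unchanged.
print(cv(clean), cv([100 * x for x in clean]))
```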

Estimation

Estimation of Parameters
• Binomial distribution
  • Parameters n (usually known) and p
  • How to estimate p?
• Poisson distribution
  • Parameter λ
  • How to estimate λ?
• Normal distribution
  • Parameters µ and σ²
  • How to estimate µ and σ²?
  • σ² unknown → t distribution

Stata Commands
• Raw data
  • ci [varlist] [if] [in] [weight] [, options]
  • confidence intervals for a mean (default), a proportion (option b, binomial) and a count (option p, Poisson)
• Summary statistics
  • cii #obs #mean #sd [, ciin_option] (normal)
  • cii #obs #succ [, ciib_options] (binomial)

Examples
• gen female=sex-1
• tab female Group
• What’s the average maxfwt for females in the exposed group?
  • ci maxfwt if female==1 & Group==2 (raw data)
  • sum maxfwt if female==1 & Group==2
  • cii 16 59 20.887, level(95) (summary statistics)
• What proportion of females are in the exposed group?
  • gen expose=Group-1
  • ci expose if female==1, b
  • cii 48 16, level(95)
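The arithmetic behind the two cii calls can be sketched in Python (not Stata). Two assumptions are made here: the critical value t(0.975, 15 df) = 2.1314 is taken from a t table rather than computed, and the binomial interval is assumed to be the exact (Clopper-Pearson) interval, which may not be the default in every Stata version. The function names are hypothetical.

```python
import math

T_975_DF15 = 2.1314  # table value of the t distribution, 15 df, upper 2.5% point


def mean_ci(n, mean, sd, t_crit):
    """t-based confidence interval for a mean from summary statistics."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=200):
    """Bisection on (0, 1) for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact binomial interval: each tail probability equals alpha / 2."""
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


mean_low, mean_high = mean_ci(16, 59, 20.887, T_975_DF15)  # mirrors cii 16 59 20.887
prop_low, prop_high = clopper_pearson(16, 48)              # mirrors cii 48 16
print(round(mean_low, 2), round(mean_high, 2))
print(round(prop_low, 4), round(prop_high, 4))
```

The mean interval is 59 ± 2.1314 × 20.887/√16, roughly 47.87 to 70.13.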

Hypothesis Testing

Stata Commands (mean)
• ttest
• Raw data
  • ttest varname == # [if] [in] [, level(#)]
  • ttest varname1 == varname2 [if] [in], unpaired [unequal welch level(#)]
  • ttest varname1 == varname2 [if] [in] [, level(#)]
  • ttest varname [if] [in] , by(groupvar) [options1]
• Summary statistics
  • ttesti #obs #mean #sd #val [, level(#)]
  • ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, options2]

Examples
• One sample
  • Is the average maxfwt for females in the exposed group significantly lower than 45?
  • ttest maxfwt==45 if female==1 & Group==2
  • ttesti 16 59 20.887 45 (summary statistics)
• Two samples
  • Do females have a higher average maxfwt than males in the exposed group?
  • ttest maxfwt if Group==2, by(female)
  • sum maxfwt if female==0 & Group==2
  • ttesti 16 59 20.887 30 60.167 27.28
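The test statistics behind the two ttesti calls can be reproduced with a short Python sketch (not Stata). It assumes the equal-variance (pooled) form for the two-sample test and computes only the statistic and degrees of freedom; p-values need the t distribution, which Python's standard library does not provide. The function names are hypothetical.

```python
import math


def one_sample_t(n, mean, sd, mu0):
    """t statistic and degrees of freedom for H0: mu = mu0."""
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1


def two_sample_t_pooled(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic and degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


t_one, df_one = one_sample_t(16, 59, 20.887, 45)                      # ttesti 16 59 20.887 45
t_two, df_two = two_sample_t_pooled(16, 59, 20.887, 30, 60.167, 27.28)
print(round(t_one, 3), df_one)
print(round(t_two, 3), df_two)
```

The one-sample statistic is positive because the sample mean (59) is above 45, so these data give no support to the "lower than 45" alternative.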

Stata Commands (variance)
• sdtest
• Raw data
  • sdtest varname == # [if] [in] [, level(#)]
  • sdtest varname1 == varname2 [if] [in] [, level(#)]
  • sdtest varname [if] [in] , by(groupvar) [level(#)]
• Summary statistics
  • sdtesti #obs {#mean | . } #sd #val [, level(#)]
  • sdtesti #obs1 {#mean1 | . } #sd1 #obs2 {#mean2 | . } #sd2 [, level(#)]

Examples
• One sample
  • Is the variance of maxfwt for females in the exposed group significantly greater than 100 (i.e., standard deviation greater than 10)?
  • sdtest maxfwt==10 if female==1 & Group==2
  • sdtesti 16 59 20.887 10 (summary statistics)
• Two samples
  • Do females have a greater variation in maxfwt than males in the exposed group?
  • sdtest maxfwt if Group==2, by(female)
  • sum maxfwt if female==0 & Group==2
  • sdtesti 16 59 20.887 30 60.167 27.28
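The statistics behind the two sdtesti calls are simple enough to compute by hand; the Python sketch below (not Stata) shows the arithmetic. It reports only the chi-square and F statistics with their degrees of freedom, since p-values need distributions that Python's standard library does not provide. The function names are hypothetical.

```python
def one_sample_variance_stat(n, sd, sd0):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2 with n - 1 degrees of freedom."""
    return (n - 1) * sd**2 / sd0**2, n - 1


def variance_ratio_stat(n1, sd1, n2, sd2):
    """F statistic s1^2 / s2^2 with (n1 - 1, n2 - 1) degrees of freedom."""
    return sd1**2 / sd2**2, (n1 - 1, n2 - 1)


chi2, chi2_df = one_sample_variance_stat(16, 20.887, 10)   # sdtesti 16 59 20.887 10
f_ratio, f_df = variance_ratio_stat(16, 20.887, 30, 27.28)
print(round(chi2, 2), chi2_df)
print(round(f_ratio, 3), f_df)
```

The sample means do not enter either statistic, which is why sdtesti accepts "." in place of a mean.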

Stata Commands (proportion)
• prtest
• Raw data
  • prtest varname == #p [if] [in] [, level(#)]
  • prtest varname1 == varname2 [if] [in] [, level(#)]
  • prtest varname [if] [in] , by(groupvar) [level(#)]
• Summary statistics
  • prtesti #obs1 #p1 #p2 [, level(#) count]
  • prtesti #obs1 #p1 #obs2 #p2 [, level(#) count]

Examples
• One sample
  • Are more than 50% of females in the exposed group?
  • prtest expose==0.5 if female==1
  • prtesti 48 0.3333333 0.5
• Two samples
  • Is the proportion of females higher in the exposed group than in the control group?
  • prtest female, by(expose)
  • tab expose female, r
  • prtesti 78 0.4103 46 0.3478
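A minimal Python sketch (not Stata) of the large-sample z tests behind the two prtesti calls. The counts 16/48, 32/78 and 16/46 are inferred from the proportions on the slide; the two-sample test is assumed to use the pooled proportion for its standard error. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def one_sample_prop_z(x, n, p0):
    """z statistic for H0: p = p0, using the null standard error."""
    return (x / n - p0) / sqrt(p0 * (1 - p0) / n)


def two_sample_prop_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se


z_one = one_sample_prop_z(16, 48, 0.5)       # 16 of 48 females are exposed
z_two = two_sample_prop_z(32, 78, 16, 46)    # females: 32 of 78 controls, 16 of 46 exposed
p_one_two_sided = 2 * Z.cdf(-abs(z_one))
p_two_two_sided = 2 * Z.cdf(-abs(z_two))
print(round(z_one, 3), round(p_one_two_sided, 4))
print(round(z_two, 3), round(p_two_two_sided, 4))
```

In the one-sample case the observed proportion (one third) is below 0.5, so the data point away from the "more than 50%" alternative.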

Power and Sample Size

Stata Command (sample size)
• One sample
  • Continuous
    • sampsi μ0 μ1, sd(.) p(.) a(.) onesam
    • sampsi 3500 3800, sd(420) p(.9) onesam
  • Binary proportions
    • sampsi p0 p1, p(.) onesam
    • sampsi 0.4 0.25, p(0.9) onesam
• Two samples
  • Continuous
    • sampsi μ1 μ2, p(.) sd1(.) sd2(.) a(.)
    • sampsi 132.86 127.44, p(0.8) sd1(15.34) sd2(18.23)
  • Binary proportions
    • sampsi p1 p2, p(.)
    • sampsi 0.4 0.25, p(0.9)
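For the continuous cases, the sample size follows from a standard normal-approximation formula, sketched here in Python (not Stata). The sketch assumes a two-sided test at α = 0.05 and equal group sizes; it does not cover the proportion cases, where continuity corrections can change the answer. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_one_sample(mu0, mu1, sd, power, alpha=0.05):
    """n = ((z_{1-alpha/2} + z_{power}) * sd / (mu1 - mu0))^2, rounded up."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil((z_sum * sd / (mu1 - mu0)) ** 2)


def n_per_group(mu1, mu2, sd1, sd2, power, alpha=0.05):
    """Per-group n for two equal-sized groups with possibly different SDs."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(z_sum**2 * (sd1**2 + sd2**2) / (mu1 - mu2) ** 2)


n_one = n_one_sample(3500, 3800, sd=420, power=0.9)
n_two = n_per_group(132.86, 127.44, sd1=15.34, sd2=18.23, power=0.8)
print(n_one, n_two)
```

Smaller differences between the means or larger standard deviations both push the required sample size up, since n grows with (sd/difference)².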

Stata Command (power)
• One sample
  • Continuous
    • sampsi μ0 μ1, sd(.) n(.) a(.) onesam
    • sampsi 84.4 90.1, sd(10.3) n(5) onesam onesided
  • Binomial proportion
    • sampsi p0 p1, n1(.) onesam
    • sampsi 0.25 0.4, n1(100) onesam
• Two samples
  • Continuous
    • sampsi μ1 μ2, n1(.) n2(.) sd1(.) sd2(.) a(.)
    • sampsi 9 14, n1(100) n2(100) sd1(15.34) sd2(18.23)
  • Binomial proportions
    • sampsi p1 p2, n1(.) n2(.)
    • sampsi 0.4 0.25, n1(100) n2(150)
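Power for the continuous cases can be approximated with the normal distribution, as in this Python sketch (not Stata). It assumes α = 0.05 and, for two-sided tests, ignores the negligible probability of rejecting in the wrong direction. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_one_sample(mu0, mu1, sd, n, alpha=0.05, onesided=False):
    """Approximate power of a one-sample test of H0: mu = mu0 when the truth is mu1."""
    z_crit = Z.inv_cdf(1 - alpha if onesided else 1 - alpha / 2)
    return Z.cdf(abs(mu1 - mu0) * sqrt(n) / sd - z_crit)


def power_two_sample(mu1, mu2, sd1, sd2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample test of H0: mu1 = mu2."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return Z.cdf(abs(mu1 - mu2) / se - z_crit)


power_one = power_one_sample(84.4, 90.1, sd=10.3, n=5, onesided=True)
power_two = power_two_sample(9, 14, sd1=15.34, sd2=18.23, n1=100, n2=100)
print(round(power_one, 3), round(power_two, 3))
```

With only 5 observations the one-sample design has power of about one third, which is why the sample-size calculation is usually run first.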

Useful links
• http://www.ats.ucla.edu/stat/stata/
• Once the D2L site is created, all of the handouts and related materials will be posted on the D2L site.
