Stata Workshop #1 • Chiu-Hsieh (Paul) Hsu • Associate Professor, College of Public Health • pchhsu@email.arizona.edu
Outline • Do files • Data entry • Data management • Data description • Estimation: Confidence Interval • Hypothesis testing
Do files • Stata programs • Easy to add or skip comments • One click/command can run the whole program • Reproducible • Don’t need to retype all of the commands • Interactive work vs. do files
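A minimal sketch of what such a do-file might look like. The file name workshop1.do and the log name are illustrative assumptions; test.dta is the dataset used elsewhere in these slides.

```stata
* workshop1.do  (hypothetical file name)
* Lines starting with * are comments and are skipped when the file runs.
version 12                      // assumption: adjust to your Stata release
capture log close               // close any log left open by an earlier run
log using workshop1.txt, text replace

use test, clear                 // text after // is also a comment
describe
summarize

/* A block comment can temporarily skip commands:
list in 1/5
*/

log close
```

Typing `do workshop1` in the Command window (or clicking the Do button in the Do-file Editor) runs every line in order, which is what makes the analysis reproducible.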
Stata Commands • cd: Change directory • dir or ls: Show files in current directory • insheet: Read ASCII (text) data created by a spreadsheet • infile: Read unformatted ASCII (text) data • infix: Read ASCII (text) data in fixed format • input: Enter data from keyboard • save: Store the dataset currently in memory on disk in Stata data format • use: Load a Stata-format dataset • count: Show the number of observations • list: List values of variables • clear: Clear the dataset in memory • memory: Display a report on memory usage • set memory: Set the size of memory (not needed in Stata 12 and later, where memory is managed automatically)
Ways to enter data • Change the directory to the folder you like • cd c:\Stata • Comma-separated values (.csv) format files • insheet using test.csv, clear (with variable names) • infile gender id race ses schtyp str10 prgtype read write math science socst using hs0.raw, clear (without variable names) • Stata (.dta) files • use test • Type in data one by one • input id female race ses str3 schtype prog read write math science socst • end (when you are done; must be lowercase) • What’s in the dataset? • describe • list
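As an illustration of keyboard entry, here is a short sketch with a reduced variable list. The three data rows are made-up values, not taken from the workshop data.

```stata
clear
* str3 declares schtype as a string of up to 3 characters
input id female str3 schtype read
1 0 pub 57
2 1 pri 68
3 0 pub 44
end
describe
list

* Assumption: Stata 13 or later, where import delimited supersedes insheet
* import delimited using test.csv, clear
```

The `end` line tells Stata that data entry is finished; everything between `input` and `end` is read as one observation per line.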
Stata Commands • pwd: show current directory (pwd = print working directory) • keep if: keep observations if condition is met • keep: keep variables or observations • drop: drop variables or observations • append: append a data file to current file • sort: sort observations • merge: merge a data file with current file • codebook: show codebook information for file • label data: apply a label to a data set • order: order the variables in a data set • label variable: apply a label to a variable • label define: define a set of labels for the levels of a categorical variable • label values: apply value labels to a variable • encode: create numeric version of a string variable • rename: rename a variable • recode: recode the values of a variable • notes: apply notes to the data file • generate: create a new variable • replace: replace the values of an existing variable • egen: extended generate, with special functions that can be used when creating a new variable
Merging two datasets • test1 and test2 have the same variables but different subjects • use test1 • append using test2 • save test12 • test3 and test4 have the same subjects and only share a link variable, e.g. ID • use test3, clear • sort id • save test3, replace • use test4, clear • sort id • save test4, replace • use test3 • merge id using test4 • save test34
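The merge above uses the older syntax. In Stata 11 and later the same one-to-one merge can be written more compactly; this sketch assumes id uniquely identifies each subject in both test3 and test4.

```stata
use test3, clear
merge 1:1 id using test4       // no prior sort is required with this syntax
tabulate _merge                // 1 = only in test3, 2 = only in test4, 3 = matched
save test34, replace
```

Checking `_merge` before saving is a cheap way to confirm that every subject was matched as expected.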
Play with Variables • use test • label variable gender "Male" • rename gender male • gen female=1-male • order id male female • encode prgtype, gen(prog) • codebook prog • keep if female==1 (drops the males) • drop female
Dummy Variables • A categorical variable with K possible levels • Need K-1 dummy variables (one as the reference) • Dummy variables are convenient for regression analysis • How to create dummy variables? • Use generate command • gen female=1-gender • Use tabulate command • tabulate gender, gen(male) • Use factor variables • xi i.gender • list,clean
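To see the K-1 rule with more than two levels, here is a sketch using the auto dataset that ships with Stata. That dataset is an assumption for illustration and is not part of the workshop files; rep78 has five levels.

```stata
sysuse auto, clear
tabulate rep78, gen(rep)       // creates rep1-rep5, one dummy per level
regress price i.rep78          // factor-variable notation (Stata 11+); lowest level is the reference
regress price rep2-rep5        // same fit with K-1 = 4 dummies entered by hand
```

Both regressions leave out the first level, so the coefficients are differences from the reference category.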
Stata Commands • describe: describe a dataset • log: create a log file • summarize: descriptive statistics • tabstat: table of descriptive statistics • table: create a table of statistics • stem: stem-and-leaf plot • graph: high resolution graphs • kdensity: kernel density plot • histogram: histogram for continuous and categorical variables • tabulate: one- and two-way frequency tables • correlate: correlations • pwcorr: pairwise correlations
Example: raw data • log using test.txt, text replace • use lead • describe • sum maxfwt, detail • histogram maxfwt, by(Group) normal • graph box maxfwt, by(Group) • stem maxfwt • kdensity maxfwt • tab Group sex • pwcorr ageyrs maxfwt, sig (sig is a pwcorr option) • pwcorr ageyrs maxfwt if sex==1, sig (males only) • pwcorr ageyrs maxfwt fwt_r, sig • log close
Example: grouped data • use group (a grouped dataset) • sum age [fweight=freq],detail • hist age [fweight=freq] • Pretty much the same as raw data. Just need to specify the weight.
Some Review • Use both location and spread measures to summarize a dataset • Mean, standard deviation and range are easily affected by extreme observations • Median and inter-quartile range are less affected by extreme observations • Coefficient of variation (standard deviation divided by mean) removes the scale effect.
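The measures in this review can be requested together with tabstat. A short sketch, assuming the lead dataset and the variables maxfwt and Group from the earlier example:

```stata
use lead, clear
* Location and spread side by side, including the coefficient of variation
tabstat maxfwt, statistics(n mean sd cv median iqr range)
tabstat maxfwt, statistics(mean sd median iqr) by(Group)

* The coefficient of variation computed by hand from stored results
quietly summarize maxfwt
display r(sd)/r(mean)
```

Comparing mean with median, and sd with iqr, gives a quick sense of how much extreme observations are influencing the summary.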
Estimation of Parameters • Binomial distribution • Parameters n (usually known) and p • How to estimate p? • Poisson distribution • Parameter λ • How to estimate λ? • Normal distribution • Parameters µ and σ² • How to estimate µ and σ²? • σ² unknown → t distribution
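The slide leaves these questions open. The usual point estimators are summarized below as a sketch, where x is the number of successes in n binomial trials and x_1, ..., x_n is a sample from the Poisson or normal distribution (this notation is an assumption, not taken from the slides):

```latex
\hat{p} = \frac{x}{n}, \qquad
\hat{\lambda} = \bar{x}, \qquad
\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

% With \sigma^2 unknown, the confidence interval for \mu uses the t distribution:
\bar{x} \pm t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}
```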
Stata Commands • Raw data • ci [varlist] [if] [in] [weight] [, options] • confidence intervals for mean, proportion (option b, binomial) and count (option p, poisson) • Summary statistics • cii #obs #mean #sd [, ciin_option] • Normal • cii #obs #succ [, ciib_options] • Binomial
Examples • gen female=sex-1 • tab female Group • What’s the average maxfwt for females in the exposed group? • ci maxfwt if female==1 & Group==2 (raw data) • sum maxfwt if female==1 & Group==2 • cii 16 59 20.887, level(95) (summary statistics) • What proportion of females are in the exposed group? • gen expose=Group-1 • ci expose if female==1, b • cii 48 16, level(95)
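To make the intervals less of a black box, the following sketch reproduces them by hand from the summary statistics quoted above (n = 16, mean = 59, sd = 20.887; 16 of 48). The Wald interval for the proportion is shown only for illustration; by default cii reports an exact binomial interval, so the two will not agree exactly.

```stata
* t-based 95% CI for the mean
scalar se    = 20.887/sqrt(16)
scalar tcrit = invttail(15, 0.025)       // upper 2.5% point of t with 15 df
scalar lo    = 59 - tcrit*se
scalar hi    = 59 + tcrit*se
display "95% CI for the mean: " lo " to " hi

* Wald (normal-approximation) 95% CI for the proportion 16/48
scalar phat = 16/48
scalar sep  = sqrt(phat*(1 - phat)/48)
scalar plo  = phat - invnormal(0.975)*sep
scalar phi  = phat + invnormal(0.975)*sep
display "95% Wald CI for the proportion: " plo " to " phi

* In Stata 14 and later the commands take a subcommand, for example:
*   cii means 16 59 20.887
*   cii proportions 48 16
```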
Stata Commands (mean) • ttest • Raw data • ttest varname == # [if] [in] [, level(#)] • ttest varname1 == varname2 [if] [in], unpaired [unequal welch level(#)] • ttest varname1 == varname2 [if] [in] [, level(#)] • ttest varname [if] [in] , by(groupvar) [options1] • Summary statistics • ttesti #obs #mean #sd #val [, level(#)] • ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, options2]
Examples • One sample • Is the average maxfwt for females in the exposed group significantly lower than 45? • ttest maxfwt==45 if female==1 & Group==2 • ttesti 16 59 20.887 45 (summary statistics) • Two samples • Do females have a higher average maxfwt than males in the exposed group? • ttest maxfwt if Group==2, by(female) • sum maxfwt if female==0 & Group==2 • ttesti 16 59 20.887 30 60.167 27.28
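The one-sample test statistic can also be computed by hand, which shows where the ttesti output comes from. A sketch using the same summary statistics (n = 16, mean = 59, sd = 20.887, null value 45):

```stata
scalar tstat  = (59 - 45)/(20.887/sqrt(16))   // about 2.68
scalar plower = 1 - ttail(15, tstat)          // P(T < t), for Ha: mean < 45
scalar pupper = ttail(15, tstat)              // P(T > t), for Ha: mean > 45
display "t = " tstat
display "P(T < t) = " plower
display "P(T > t) = " pupper
```

Because the sample mean (59) lies above 45, the lower-tailed p-value is large, and the data give no support for a mean below 45.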
Stata Commands (variance) • sdtest • Raw data • sdtest varname == # [if] [in] [, level(#)] • sdtest varname1 == varname2 [if] [in] [, level(#)] • sdtest varname [if] [in] , by(groupvar) [level(#)] • Summary statistics • sdtesti #obs {#mean | . } #sd #val [, level(#)] • sdtesti #obs1 {#mean1 | . } #sd1 #obs2 {#mean2 | . } #sd2 [, level(#)]
Examples • One sample • Is the variance of maxfwt for females in the exposed group significantly greater than 100 (i.e., is the standard deviation greater than 10)? • sdtest maxfwt==10 if female==1 & Group==2 • sdtesti 16 59 20.887 10 (summary statistics) • Two samples • Do females have a greater variation in maxfwt than males in the exposed group? • sdtest maxfwt if Group==2, by(female) • sum maxfwt if female==0 & Group==2 • sdtesti 16 59 20.887 30 60.167 27.28
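The one-sample variance test rests on the chi-squared statistic (n-1)s²/σ₀². A sketch of the hand calculation with the summary statistics above (n = 16, sd = 20.887, null standard deviation 10):

```stata
scalar chi2stat = (16 - 1)*20.887^2/10^2   // about 65.4 on 15 df
scalar pupper   = chi2tail(15, chi2stat)   // P(chi2 > statistic), for Ha: sd > 10
display "chi2(15) = " chi2stat
display "upper-tail p-value = " pupper
```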
Stata Commands (proportion) • prtest • Raw data • prtest varname == #p [if] [in] [, level(#)] • prtest varname1 == varname2 [if] [in] [, level(#)] • prtest varname [if] [in] , by(groupvar) [level(#)] • Summary statistics • prtesti #obs1 #p1 #p2 [, level(#) count] • prtesti #obs1 #p1 #obs2 #p2 [, level(#) count]
Examples • One sample • Are more than 50% of females in the exposed group? • prtest expose==0.5 if female==1 • prtesti 48 0.3333333 0.5 • Two samples • Is the proportion of females higher in the exposed group than in the control group? • prtest female, by(expose) • tab expose female, r • prtesti 78 0.4103 46 0.3478
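The one-sample proportion test is a z test with the standard error evaluated at the null value. A sketch of the hand calculation for 16 exposed out of 48 females against a null proportion of 0.5:

```stata
scalar phat   = 16/48
scalar z      = (phat - 0.5)/sqrt(0.5*(1 - 0.5)/48)   // about -2.31
scalar pupper = 1 - normal(z)                         // P(Z > z), for Ha: p > 0.5
scalar plower = normal(z)                             // P(Z < z), for Ha: p < 0.5
display "z = " z
display "P(Z > z) = " pupper
display "P(Z < z) = " plower
```

The observed proportion (about 0.33) is below 0.5, so the upper-tailed p-value is large.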
Stata Command (sample size) • One sample • continuous • sampsi μ0 μ1, sd(.) p(.) a(.) onesam • sampsi 3500 3800, sd(420) p(.9) onesam • Binary proportions • sampsi p0 p1, p(.) onesam • sampsi 0.4 0.25, p(0.9) onesam • Two samples • continuous • sampsi μ1 μ2, p(.) sd1(.) sd2(.) a(.) • sampsi 132.86 127.44, p(0.8) sd1(15.34) sd2(18.23) • Binary proportions • sampsi p1 p2, p(.) • sampsi 0.4 0.25, p(0.9)
Stata Command (power) • One sample • continuous • sampsi μ0 μ1, sd(.) n(.) a(.) onesam • sampsi 84.4 90.1, sd(10.3) n(5) onesam onesided • Binomial proportion • sampsi p0 p1, n1(.) onesam • sampsi 0.25 0.4, n1(100) onesam • Two samples • continuous • sampsi μ1 μ2, n1(.) n2(.) sd1(.) sd2(.) a(.) • sampsi 9 14, n1(100) n2(100) sd1(15.34) sd2(18.23) • Binomial proportions • sampsi p1 p2, n1(.) n2(.) • sampsi 0.4 0.25, n1(100) n2(150)
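In Stata 13 and later, the power command covers the same sample-size and power calculations. The sketch below mirrors the sampsi examples, assuming such a release is available. The results may differ somewhat from sampsi, since the two commands do not use identical default methods.

```stata
* Sample size for a target power
power onemean 3500 3800, sd(420) power(0.9)
power oneproportion 0.4 0.25, power(0.9)
power twomeans 132.86 127.44, sd1(15.34) sd2(18.23) power(0.8)
power twoproportions 0.4 0.25, power(0.9)

* Power for given sample sizes
power onemean 84.4 90.1, sd(10.3) n(5) onesided
power oneproportion 0.25 0.4, n(100)
power twomeans 9 14, n1(100) n2(100) sd1(15.34) sd2(18.23)
power twoproportions 0.4 0.25, n1(100) n2(150)
```

Supplying power() makes the command solve for sample size, while supplying n() or n1() and n2() makes it solve for power.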
Useful links • http://www.ats.ucla.edu/stat/stata/ • Once the D2L site is created, all of the handouts and related materials will be posted on the D2L site.