Automating Your Work: An Introduction to Programming in Stata • Shawna N. Smith • 29 July 2009
…but why? • GSS Mental Health Replication Study • Respondents received one of four different vignettes: depression, schizophrenia, alcohol abuse, normal troubles • 38 outcomes [binary] • Two waves of data: 1996 & 2006 • First question: Is there a survey year difference? • 4 vignettes x 38 outcomes = 152 potential differences
Roadmap • Writing effective do-files [Review] • Automation • Macros • Using stored info • foreach and forvalues loops • Ado-files {brief preview}
The Workflow of Data Analysis: Principles and Practices, by J. Scott Long • Much of this talk is from Chapter 4: Automating your work • For example files: type findit workflow and follow the instructions
[aside] Writing effective do-files • Robust: To be robust, a do-file must produce exactly the same result when run at a later time or on another computer • Legible: To be legible, a do-file must be documented and formatted so that it is easy to understand what is being done
Robust • Self-contained • Include version control • Exclude directory information • Never hardcode your directory! Rather set your working directory before you start your work
Legible • Use comments • Use alignment and indentation • Use short lines [<80 characters] • Limit the use of abbreviations
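The robust and legible principles above can be sketched as a do-file skeleton. This is only a minimal illustration, not a required template: the file name wf-example and the dataset name mydata are placeholders that do not come from the talk.

```stata
* wf-example.do: hypothetical skeleton (file, log, and dataset names are placeholders)
capture log close                   // close any log left open by an earlier run
log using wf-example, replace text  // log is written to the working directory
version 10                          // version control: run commands as Stata 10 would
clear all
set linesize 80                     // keep output lines short

* Data are read from the working directory, so no path is hardcoded.
* use mydata, clear                 // placeholder dataset name

log close
exit
```

The working directory is set interactively (for example with cd) before the do-file is run, so the same file works unchanged on another computer.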
Automating Your Work • Macros • Saved results • Loops • Ado-files {brief preview}
Macros • A macro assigns a string of text or a number to an abbreviation • Two types of macros: local and global • Global macros • Persist until you delete them or exit Stata • Can lead to do-files that unintentionally depend on a global macro created by another do-file • Such do-files are not robust and can lead to unpredictable results • Local macros • Can only be used within the do-file or ado-file in which they are defined • When that program ends, the local macro disappears • Macros are the simplest tool for automating your work
Syntax
• local local-name "string"
    . local rhs "var1 var2 var3"
    . display "The local rhs contains: `rhs'"
• local local-name = expression
    . local ncases = 198
    . display "The local ncases equals: `ncases'"
• With the equals sign, expression is limited to 80 characters; without, "string" is limited to 67,784 characters. It is usually better to use "string"
Here is a simple example. I want to estimate the model:
    . logit y var1 var2 var3
I can create the macro rhs with the names of the independent or right-hand-side variables:
    . local rhs "var1 var2 var3"
Then, I can write the logit command as:
    . logit y `rhs'
where the ` and ' indicate that I want to insert the contents of the macro rhs. That is, the command
    . logit y `rhs'
works exactly the same as
    . logit y var1 var2 var3
Macros can be combined to specify a sequence of nested models. First, I create macros for four groups of independent variables:
    . local set1_age "age agesquared"
    . local set2_educ "wc hc"
    . local set3_kids "k5 k618"
    . local set4_money "lwg inc"
Next, I specify four nested models. The first model includes only the first set of variables and is specified as:
    . local model_1 "`set1_age'"
The macro model_2 combines the content of the local model_1 with the variables in local set2_educ. The space between the two macro references keeps the variable names separate:
    . local model_2 "`model_1' `set2_educ'"
The next two models are specified the same way:
    . local model_3 "`model_2' `set3_kids'"
    . local model_4 "`model_3' `set4_money'"
Next, I check the variables in each model:
    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
    . display "model_4: `model_4'"
    model_4: age agesquared wc hc k5 k618 lwg inc
Using these locals, I estimate a series of logits:
    . logit lfp `model_1'
    . logit lfp `model_2'
    . logit lfp `model_3'
    . logit lfp `model_4'
The whole thing:
    . local set1_age "age agesquared"
    . local set2_educ "wc hc"
    . local set3_kids "k5 k618"
    . local set4_money "lwg inc"
    . local model_1 "`set1_age'"
    . local model_2 "`model_1' `set2_educ'"
    . local model_3 "`model_2' `set3_kids'"
    . local model_4 "`model_3' `set4_money'"
    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
    . display "model_4: `model_4'"
    model_4: age agesquared wc hc k5 k618 lwg inc
    . logit lfp `model_1'
    . logit lfp `model_2'
    . logit lfp `model_3'
    . logit lfp `model_4'
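The same pattern can be tried without the talk's data. The sketch below uses the auto dataset that ships with Stata as a stand-in; the macro names set1_size and set2_perf and the choice of variables are illustrative assumptions, not part of the original example.

```stata
* Nested models built from locals, using Stata's shipped auto data
sysuse auto, clear

local set1_size "weight"            // first block of predictors (illustrative)
local set2_perf "mpg"               // second block of predictors (illustrative)

local model_1 "`set1_size'"
local model_2 "`model_1' `set2_perf'"   // note the space between the two macros

display "model_1: `model_1'"
display "model_2: `model_2'"

logit foreign `model_1'
logit foreign `model_2'
```

Adding a predictor to set1_size changes both models at once, which is the point of building the larger model from the smaller one.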
Saved results • Stata commands send results to your log file but also save those results to memory • Drukker's Dictum: Never type anything that you can obtain from a saved result • This information can be moved into macros and matrices, and used in many ways
Consider a simple example using -prvalue-. Use -prvalue- to calculate the discrete change for a one-standard-deviation change (ΔSD) in age, centered on the mean.
• [The old way…]
    . sum age
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60
    . di 42.53785 + (8.072574/2)
    46.574137
    . di 42.53785 - (8.072574/2)
    38.501563
    . qui prvalue, x(age=38.501563) rest(mean) save label(SD-)
    . prvalue, x(age=46.574137) rest(mean) dif label(SD+)
    :::
A simpler [& more robust] way:
    . local c "age"
    . sum `c'
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60
    . return list
    scalars:
                  r(N) =  753
              r(sum_w) =  753
               r(mean) =  42.53784860557769    // scalar for mean of age
                r(Var) =  65.16645121641095
                 r(sd) =  8.072574014303674    // scalar for sd of age
                r(min) =  30
                r(max) =  60
                r(sum) =  32031
    . local sdup = r(mean) + (r(sd)/2)
    . local sddn = r(mean) - (r(sd)/2)
    . qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)
    . prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)
    :::
Question: • I discover a problem with my age variable and decide to change my continuous variable (the local c) to income. Which parts of the code above do I need to change if: • [1] I hardcoded my numbers; or • [2] I used the locals and returned scalars?
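To experiment with returned scalars without SPost's -prvalue- installed, the following sketch applies the same idea to the auto dataset that ships with Stata. The dataset and the variable mpg are stand-ins chosen for illustration only.

```stata
* Move results saved by -summarize- into locals instead of retyping them
sysuse auto, clear

local c "mpg"                        // the only line to edit when the variable changes
quietly summarize `c'
return list                          // shows r(mean), r(sd), and the other scalars

local sdup = r(mean) + (r(sd)/2)     // half a standard deviation above the mean
local sddn = r(mean) - (r(sd)/2)     // half a standard deviation below the mean

display "`c': from `sddn' to `sdup'"
```

Because nothing after the first local is hardcoded, switching to another variable means changing one word.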
foreach and forvalues loops • Loops let you execute a group of commands multiple times • By combining macros with loops, you can speed up tasks ranging from creating variables to estimating models • Loops can be used in many ways that make your workflow faster and more accurate. For example: • Creating interaction variables • Using the same command for multiple variables • Using information returned by Stata for other purposes
Syntax: foreach
• foreach local-name {in | of list-type} list {
      commands referring to `local-name'
  }
• Examples of the opening line:
    foreach name in var1 var2 var3 {
    foreach var of varlist var1-var10 {
Syntax: forvalues
• forvalues lname = range {
      commands referring to `lname'
  }
• Examples of the opening line:
    forvalues nage = 40(5)80 {
    forvalues n = 1(.1)100 {
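As a small illustration of the two range forms, the loops below only display the values they step through. The local names nage and i are arbitrary.

```stata
* first(step)last form: 40, 45, 50, ..., 80
forvalues nage = 40(5)80 {
    display "age = `nage'"
}

* first/last form: steps of 1
forvalues i = 1/3 {
    display "pass `i'"
}
```

Replacing display with the command of interest turns either loop into a working template.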
Here is a simple example that illustrates the key features of loops. I have a four-category ordinal variable y with values from 1 to 4. I want to create the binary variables y_lt2, y_lt3, and y_lt4 that equal 1 if y is less than the indicated value, else 0. I can create the variables with three generate commands:
    . generate y_lt2 = y<2 if y<.
    . generate y_lt3 = y<3 if y<.
    . generate y_lt4 = y<4 if y<.
I can do the same thing with a foreach loop:
    1> foreach cutpt in 2 3 4 {
    2>     generate y_lt`cutpt' = y<`cutpt' if y<.
    3> }
• The first time through, the local cutpt is assigned the first value in the list.
• Next, the generate command is run, where `cutpt' is replaced by the value assigned to cutpt. The first time through the loop, line 2 is evaluated as:
    . generate y_lt2 = y<2 if y<.
• Next, the closing brace } is encountered, which sends us back to the foreach command in line 1.
• In the second pass, foreach assigns cutpt the second value in the list, which means that the generate command is evaluated as:
    . generate y_lt3 = y<3 if y<.
• This continues once more, assigning 4 to cutpt. When the foreach loop ends, three variables have been generated.
Suppose that I need variables that are interactions between the binary variable male and a set of independent variables. I can do this quickly with a loop:
    foreach varname of varlist yr89 white age ed prst {
        generate maleX`varname' = male*`varname'
        label var maleX`varname' "male*`varname'"
    }
To examine the new variables and their labels, I use codebook:
    . codebook maleX*, compact
    Variable      Obs Unique      Mean  Min  Max  Label
    ---------------------------------------------------------------------------
    maleXyr89    2293      2  .1766245    0    1  male*yr89
    maleXwhite   2293      2  .4147405    0    1  male*white
    maleXage     2293     71  20.50807    0   89  male*age
    maleXed      2293     21  5.735717    0   20  male*ed
    maleXprst    2293     59  18.76625    0   82  male*prst
    ---------------------------------------------------------------------------
• How can we use what we learned about extended macros to improve upon this?
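One possible answer to the question, offered as a sketch rather than as the talk's own solution: an extended macro function can pull each variable's existing label into the new label. It assumes the same dataset as above, with male and the five listed variables already in memory.

```stata
foreach varname of varlist yr89 white age ed prst {
    * extended macro function: copy the variable's label into a local
    local varlabel : variable label `varname'
    generate maleX`varname' = male*`varname'
    label var maleX`varname' "male*`varlabel'"
}
codebook maleX*, compact
```

The labels then describe the interaction in words instead of repeating the variable name.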
Suppose I want to estimate the discrete change for a one-standard-deviation change (ΔSD), using -prvalue, save- and -prvalue, dif-, for multiple continuous variables.
• Earlier, we used the following commands:
    . local c "age"
    . sum `c'
    . local sdup = r(mean) + (r(sd)/2)
    . local sddn = r(mean) - (r(sd)/2)
    . qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)
    . prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)
• To expand this to multiple continuous variables, we'll use a -foreach- loop:
    foreach var in age lwg {
        qui sum `var'
        local sdup = r(mean) + (r(sd)/2)
        local sddn = r(mean) - (r(sd)/2)
        di ""
        di "**Change in `var' from `sddn' to `sdup'"
        qui prvalue, x(`var'=`sddn') rest(mean) save label(SD-)
        prvalue, x(`var'=`sdup') rest(mean) dif label(SD+)
    }
Output:

    **Change in age from 38.50156159842585 to 46.57413561272952

    logit: Change in Predictions for lfp

    Confidence intervals by delta method

                           SD+      SD-
                       Current    Saved   Change    95% CI for Change
      Pr(y=inLF|x):     0.5150   0.6382  -0.1232   [-0.1717, -0.0747]
      Pr(y=NotInLF|x):  0.4850   0.3618   0.1232   [ 0.0747,  0.1717]

                     k5       k618        age        wc         hc        lwg        inc
    Current=   .2377158  1.3532537  46.574136  .2815405  .39176627  1.0971148  20.128965
      Saved=   .2377158  1.3532537  38.501562  .2815405  .39176627  1.0971148  20.128965
       Diff=          0          0   8.072574         0          0          0          0

    **Change in lwg from .8033366225286708 to 1.390893047643295

    logit: Change in Predictions for lfp

    Confidence intervals by delta method

                           SD+      SD-
                       Current    Saved   Change    95% CI for Change
      Pr(y=inLF|x):     0.6204   0.5340   0.0865   [ 0.0445,  0.1285]
      Pr(y=NotInLF|x):  0.3796   0.4660  -0.0865   [-0.1285, -0.0445]

                     k5       k618        age        wc         hc        lwg        inc
    Current=   .2377158  1.3532537  42.537849  .2815405  .39176627   1.390893  20.128965
      Saved=   .2377158  1.3532537  42.537849  .2815405  .39176627  .80333662  20.128965
       Diff=          0          0          0         0          0  .58755643          0
Question:
• If I also wanted to compute the discrete change for a one-standard-deviation change (ΔSD) in income, what would I need to change?
    foreach v in age lwg {
        qui sum `v'
        local sdup = r(mean) + (r(sd)/2)
        local sddn = r(mean) - (r(sd)/2)
        di ""
        di "**Change in `v' from `sddn' to `sdup'"
        qui prvalue, x(`v'=`sddn') rest(mean) save label(SD-)
        prvalue, x(`v'=`sdup') rest(mean) dif label(SD+)
    }
As mentioned earlier, when we run a command in Stata, it stores the results in memory. We can access them there and use them in our programs. This includes not only scalars [as seen earlier with -sum-] but also matrices:
    . qui logit lfp k5 k618 age wc hc lwg inc
    . ereturn list
    scalars:
                  e(N) =  753
    [:::]
    macros:
              e(title) : "Logistic regression"
    [:::]
    matrices:
                  e(b) :  1 x 8
                  e(V) :  8 x 8
              e(rules) :  1 x 4
    . mat list e(b)
    e(b)[1,8]
                k5        k618         age         wc         hc        lwg  [:::]
    y1   -1.462913  -.06457068  -.06287055  .80727378  .11173357  .60469312
Many commands create matrices that we can use, e.g., to build up cumulative matrices. For example, running -prvalue, save- and -prvalue, dif- generates the following matrices:
    . prvalue, x(age=20) dif
    [:::]
    . matrix dir
        _PEtemp[3,7]
        pedifsep[2,1]
        pelower[7,2]     //Matrix for lower CI bound
        peupper[7,2]     //Matrix for upper CI bound
        pepred[7,2]      //Matrix that includes discrete change
        peinfo[3,12]
        pebase[3,7]
        PE_in[1,7]
        PE_base[1,7]
        PRVinfo[1,12]
        PRVlower[2,2]
        PRVupper[2,2]
        PRVmisc[1,2]
        PRVprob[1,2]
        PRVbase[1,7]
        _PRVsav[1,6]
        pegrad_pr[2,8]
    . matrix list pepred
    pepred[7,2]
                     c1          c2
    1values           0           1
      2prob   .15049911   .84950089
      3misc   1.7306918           .
     saved=   .06454021   .93545979
     saved=   2.6737502           .
     saved=    .0859589   -.0859589    // Discrete change [6,2]
     saved=  -.94305837           .
We can make use of these stored matrices to generate our own matrix of discrete changes and confidence intervals:
    matrix dc = J(9,4,.)                 //create empty matrix with 9 rows & 4 columns
    matrix colnames dc = x dc dcLB dcUB  //label columns
    local irow1 = 0                      //initialize a counter for the row where I want to put info
    forvalues n = 30(5)70 {
        local ++irow1                    //this adds 1 to the counter
        prvalue , x(wc=1 age=`n') save rest(mean) lab(WC)
        prvalue , x(wc=0 age=`n') diff rest(mean) lab(noWC)
        matrix dc[`irow1',1] = `n'
        matrix dc[`irow1',2] = pepred[6,2]
        matrix dc[`irow1',3] = pelower[6,2]
        matrix dc[`irow1',4] = peupper[6,2]
        mat list dc
    }
• Final output:
    dc[9,4]
                x         dc       dcLB       dcUB
    r1         30  .13744253  .06500561  .20987945
    r2         35  .16046226  .07829062   .2426339
    r3         40  .18021131  .08726909  .27315353
    r4         45  .19384143  .09151363  .29616923
    r5         50  .19910133  .09109842  .30710424
    r6         55  .19506124  .08578821  .30433427
    r7         60   .1824384  .07516111   .2897157
    r8         65  .16334447  .05965045  .26703849
    r9         70  .14059515  .04131751   .2398728
And for my final trick:
    //change matrix to variables
    svmat dc , names(col)
    label var x "value of x"
    label var dc "discrete change"
    label var dcLB "95% CI"
    label var dcUB "95% CI"
    twoway ///
        (connected dcLB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
        (connected dc x, msymbol(i) clpat(solid) clwidth(medthick) ) ///
        (connected dcUB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
        , ytitle("Pr(Wife no college) - Pr(Wife college)") ylabel(0(.2)1) ///
        xtitle(age) xlabel(30(5)70) ///
        legend(pos(11) order(2 1) ring(0) cols(1) region(ls(none))) ///
        title("Labor Force Participation by" "Wife's College Attendance")
Ado-files • Ado-files are like do-files, except that they are automatically run • Indeed, ado stands for automatically loaded do-file • Stata 10 has nearly 2,000 ado-files • When you run a command, you cannot tell whether it is part of the executable or is an ado-file • This means that Stata users like you can write new commands and use them just like official Stata commands
Ado-files: An Example • List variable names and labels • nmlabel.ado
My first version of nmlabel lists the names and labels with no options. It looks like this:
    *! version 1.0.0 \ trm 2008-03-29
    program define nmlabelV1
        version 10
        syntax varlist
        foreach varname in `varlist' {
            local varlabel : variable label `varname'
            display in yellow "`varname'" _col(10) "`varlabel'"
        }
    end
• Here is how the command works:
    . nmlabelV1 lfp-inc
    lfp      Paid Labor Force: 1=yes 0=no
    k5       # kids < 6
    k618     # kids 6-18
    age      Wife's age in years
    wc       Wife College: 1=yes 0=no
    hc       Husband College: 1=yes 0=no
    lwg      Log of wife's estimated wages
    inc      Family income excluding wife's
The new version of the program looks like this:
     1> *! version 2.0.0 \ trm 2008-03-29
     2> program define nmlabelV2
     3>     version 10
     4>     syntax varlist [, skip]
     5>     if "`skip'"=="skip" {
     6>         display
     7>     }
     8>     foreach varname in `varlist' {
     9>         local varlabel : variable label `varname'
    10>         display in yellow "`varname'" _col(10) "`varlabel'"
    11>     }
    12> end
If I enter the command with the skip option, the syntax command in line 4 creates a local named skip that contains the string skip:
    local skip "skip"
If I do not specify the skip option, syntax creates the local skip as a null string:
    local skip ""
The third version looks like this:
    *! version 3.0.0 \ trm 2008-03-29
    program define nmlabelV3
        version 10
        syntax varlist [, skip NUMber ]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" { // do not number lines
                display in yellow "`varname'" _col(10) "`varlabel'"
            }
            else { // number lines
                display in green "#`varnumber': " ///
                    in yellow "`varname'" _col(13) "`varlabel'"
            }
        }
    end
Here is the new ado-file:
    *! version 4.0.0 \ trm 2008-03-29
    program define nmlabelV4
        version 10
        syntax varlist [, skip NUMber COLnum(integer 16)]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" { // do not number lines
                display in yellow "`varname'" ///
                    _col(`colnum') "`varlabel'"
            }
            else { // number lines
                display in green "#`varnumber': " ///
                    in yellow _col(6) "`varname'" ///
                    _col(`colnum') "`varlabel'"
            }
        }
    end
Counters are so useful that Stata has a simpler way to increment them. The command local ++counter is equivalent to local counter = `counter' + 1. So instead of this:
    local counter = 0
    foreach varname of varlist warm yr89 male white age ed prst {
        local counter = `counter' + 1
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }
We can use this:
    local counter = 0
    foreach varname of varlist warm yr89 male white age ed prst {
        local ++counter
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }
Next, I use a matrix command to create a matrix named stats:
    matrix stats = J(`nvars',2,.)
The J function creates a matrix based on three arguments. The first is the number of rows, the second is the number of columns, and the third is the value used to fill the matrix. In this case, I want the matrix to be initialized with missing values, which are indicated by a period. The matrix looks like this:
    . matrix list stats
    stats[6,2]
        c1  c2
    r1   .   .
    r2   .   .
    r3   .   .
    r4   .   .
    r5   .   .
    r6   .   .
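A matrix initialized this way is typically filled inside a loop. The sketch below is an assumption about how that could look, not the original program: it uses the auto dataset that ships with Stata, and the variable list and the column names mean and sd are chosen only for illustration.

```stata
* Fill a matrix of means and standard deviations, one row per variable
sysuse auto, clear

local varlist "price mpg weight"
local nvars : word count `varlist'   // extended macro function: number of names

matrix stats = J(`nvars', 2, .)      // nvars rows, 2 columns, all missing
matrix colnames stats = mean sd
matrix rownames stats = `varlist'

local irow = 0
foreach v of local varlist {
    local ++irow                     // move to the next row
    quietly summarize `v'
    matrix stats[`irow', 1] = r(mean)
    matrix stats[`irow', 2] = r(sd)
}
matrix list stats
```

Counting the words in the list, rather than typing the number of rows, keeps the matrix the right size when variables are added or dropped.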
Nested Loops
You can nest loops by placing one loop inside of another loop.
• Consider the earlier example of creating binary variables indicating if y was less than a given value:
    foreach cutpt in 2 3 4 {
        generate y_lt`cutpt' = y<`cutpt' if y<.
    }
• Suppose that I need to do this for variables ya, yb, yc, and yd.
    foreach y of varlist ya yb yc yd {    // loop 1 begins
        foreach cutpt in 2 3 4 {          // loop 2 begins
            * create binary variable
            generate `y'_lt`cutpt' = `y'<`cutpt' if `y'<.
        }                                 // loop 2 ends
    }                                     // loop 1 ends
• What is the first variable created? The last?
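A dry run is a simple way to check the order in which a nested loop works. The sketch below replaces generate with display, so it needs no data in memory and only lists the variable names that the nested loop above would create, in order.

```stata
* Dry run: show each variable name the nested loop would generate
foreach y in ya yb yc yd {            // outer loop over variable stubs
    foreach cutpt in 2 3 4 {          // inner loop runs fully for each stub
        display "`y'_lt`cutpt'"
    }
}
```

The inner loop finishes all of its cut points before the outer loop moves to the next variable.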