Automatisation in Stata Jan Hagemejer & Joanna Tyrowicz
Plan
• Standard solutions
• Where do they not work?
• Usually there is more than one way to estimate: how to choose?
• Using loops and global macros together
• Generating resultssets for atypical estimations
• Difficulties with using bootstrap (and obtaining resultssets)
• Summary comments ... and some advice
The standard route
• Problem: several estimations of similar form
• Need to compare results
• Three simple solutions:
  • Solution 1: brute force = sit & type
  • Solution 2: use parmby/parmest (commands developed by Roger Newson) if the estimations run over simple categories in the data (limitations of the „by” command)
  • Solution 3: use loops (see N. Cox's material from the previous SUGM)
• Reporting commands: outreg/outreg2
  • nicely formatted tables,
  • publication-ready,
  • in many formats, even LaTeX.
• Note: if you need nice summary statistics, you can use outsum either with by or within loops
Where do the problems come from?
• The 2nd and 3rd solutions work only with regression-type estimations
• However, some procedures are incompatible with pre-cooked solutions
• Examples:
  • Marginal effects
    • Use outreg2 in Stata 10 if you used probit/logit instead of probit/logit
    • Use outreg2 in Stata 11 with margins and/or mfx2 (remember the replace option)
  • Nice statistics
    • Use the tempname and postfile syntax
  • Rolling window on any of these types of analysis
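The tempname and postfile route mentioned above can be sketched as follows. This is only an illustration: the auto dataset shipped with Stata, the grouping by foreign and the chosen statistics are assumptions made for the example, not part of the original application.

```stata
* Sketch only: dataset, grouping variable and statistics are illustrative assumptions
sysuse auto, clear
tempname results
tempfile stats
* declare the resultsset: one row per group, four columns
postfile `results' foreign mean_price sd_price n using `stats'
forvalues f = 0/1 {
    quietly summarize price if foreign == `f'
    * every posted expression goes in its own pair of parentheses
    post `results' (`f') (r(mean)) (r(sd)) (r(N))
}
postclose `results'
* the resultsset is now an ordinary dataset
use `stats', clear
list
```

The posted file can then be merged, appended or exported like any other dataset.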
Not everything may be solved this way...
• Reason 1: things are more complex than they seem (to come in a sec...)
• Reason 2: some things are not listed in the output:
  • Example: various versions of R2 or the sample size in simple regressions
  • outreg/parmest typically do not include them
  • they can be included as additional locals
  • you need to know which stored results they are => solution: the family of „return list” commands
    • ret li => results stored in r(), general commands
    • eret li => results stored in e(), estimation commands
    • sret li => results stored in s(), programming commands
• Practical example
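A minimal sketch of how the „return list” family is used in practice. The regression on the auto dataset is an assumption chosen only so that the example runs anywhere.

```stata
* Sketch only: the model and dataset are illustrative assumptions
sysuse auto, clear
quietly regress price mpg weight
* everything an estimation command leaves behind is listed here
ereturn list
local r2  = e(r2)
local adj = e(r2_a)
local n   = e(N)
display "R2 = " %6.4f `r2' ", adjusted R2 = " %6.4f `adj' ", N = " `n'
* general (r-class) commands store their results in r() instead
quietly summarize price
return list
```

Note that r() results last only until the next r-class command, while e() results last until the next estimation command.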
Cookbook for „simple” problems
• Run the procedure
• Check with the „return list” family which statistics you need
• Add locals that should be generated after the procedure
• Add these statistics to the outreg2/parmest commands

    forvalues no=1(1)10 {
        xi: xtreg x y z i.year i.month if g`no'==1, fe robust
        local Between=e(r2_b)
        local Within=e(r2_w)
        local No_min=e(g_min)
        local No_max=e(g_max)
        * average group size; without this local, addstat() below receives an empty value
        local No_avg=e(g_avg)
        outreg2 using file.xls, bdec(4) title(Title) ctitle(`no') append excel addstat(R2 between, `Between', R2 within, `Within', No min, `No_min', No max, `No_max', No average, `No_avg')
    }
Our problem is different: application to PSM
• Need to report:
  • output of the procedure
  • sample properties after matching
  • balancing properties of matching
• Problem 1: actually, none of these is in the typical output
• Problem 2: we need it for many estimations looped over many variables, and each one of them takes a looooong time
Detailed problem description
• Analyse the effects of privatisation
• Observe what happens before and after the „event” of privatisation, but time runs relative to the event:
  • E.g. firm A may be one year before privatisation in 1999 and firm B in 2006, so the „event” is an anchor and time „runs” both ways.
• Effects may be observed in many spheres:
  • E.g. profits, investments, international competitiveness, employment
• Effects may be due to self-selection
  • E.g. only better firms are privatised, so the difference in performance is not due to the privatisation
• Effects may be largely due to self-selection
  • Heckman correction will tell about the statistical significance but not about the economic relevance
  • Propensity score matching is the best solution
Detailed problem description
• Run logistic regression:
  • Dependent variable: Y = 1 if participating; Y = 0 otherwise.
  • Choose appropriate conditioning (instrumental) variables.
  • Obtain the propensity score: predicted probability (p) or log[p/(1 − p)].
• Match each participant to one or more nonparticipants on the propensity score:
  • Choose an adequate metric
• Compare outcome variables
  • Example: test equality of means in the treated and control groups
• In PSM: obtaining the pscore is irrelevant, but matching is key
• To verify whether the matching is OK, run some diagnostics
  • Example: compare the balancing properties after matching (the so-called bias reduction thanks to matching)
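The first step above (logistic regression, then the propensity score as p or log[p/(1 − p)]) can be sketched with official Stata commands only. The auto dataset, with foreign standing in for the treatment and mpg and weight for the conditioning variables, is purely an assumption for illustration.

```stata
* Sketch only: "foreign" plays the role of the treatment dummy (assumption)
sysuse auto, clear
quietly logit foreign mpg weight
* propensity score as the predicted probability p
predict double ps, pr
* the same score on the log-odds scale, log[p/(1-p)]
generate double logit_ps = ln(ps/(1 - ps))
summarize ps logit_ps
```

The log-odds version equals the linear index of the logit model, so matching on it is matching on the fitted index.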
Detailed problem description
• Thus, in our case:
  • Many time periods (a separate estimation for each „time-to-anchor”)
  • Many variables (separate outcomes for each variable, but the same balancing properties within one period)
  • Two ways of estimating: regular and bootstrapping (especially the latter made things complex)
  • Each estimation: roughly 1.5-3.5 hours
  • Over a hundred estimations
• Additional pitfalls:
  • We needed some statistics for all estimations and they were not in the return list
  • More precisely: the procedure computes them to be able to produce the output, but the authors did not add them to the return list
Summary of the problems
Our problem was quite specific... BUT it consisted of many general problems:
• Loops take a lot of time: need to find efficient ways
• Some things cannot be obtained fast => even more reason to run them automatically
• Obtaining datasets of the variables we need (so-called resultssets)
• Getting visible data if they are not an output
• Using invisible data
• Getting around with bootstrap
The structure of our estimations
Event loop: using pscore or psmatch2?
• Typical psmatch2 syntax:
    psmatch2 treat treatment_determinants, out(outcomes) options
• Alternative
  • Estimate the pscore first:
      pscore treatment treatment_determinants, pscore(name)
  • Run:
      psmatch2 treatment pscore, out(outcomes) options
• How to choose?
  • If you want to bootstrap, a pscore estimated once will save you time
  • If you want to introduce a data-fitted caliper into the options, pscore first is a must
Event loop: using global macros for estimations
• Our application: observe the same firms back and forth from the moment of privatisation („event”)
• „Events” happen in different years
• But we can only match on one dimension: has or has not had the „event”
• Conceptual solution: use lags and forwards to get the time dimension
• Technical problem: many outcome variables and de facto many loops
• Technical solution: define the matching variables and the outcome variables separately

    global in="cut* remoteness eksporter energia obrot klratio roa ros indebtedness wsk_plynnosci net_income_efficiency klratio_new roa_new indebtedness_new indebtedness_new wsk_plynnosci_new"
    global out="te_new redukcja wzrost_zatr share_export lewar s_eff"
    global outf1="ff1_te_new ff2_te_new ff3_te_new ff4_te_new ff5_te_new ff1_redukcja ff2_redukcja ff3_redukcja ff4_redukcja ff5_redukcja ff1_wzrost_zatr ff2_wzrost_zatr ff3_wzrost_zatr ff4_wzrost_zatr ff5_wzrost_zatr"
    global outf2="ff1_share_export ff2_share_export ff3_share_export ff4_share_export ff5_share_export ff1_lewar ff2_lewar ff3_lewar ff4_lewar ff5_lewar ff1_s_eff ff2_s_eff ff3_s_eff ff4_s_eff ff5_s_eff"
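A minimal sketch of the same idea on a toy scale: one global holds the right-hand-side variables, another the outcomes, and a loop walks over the outcomes. The variable names come from the auto dataset and are assumptions for illustration only.

```stata
* Sketch only: variable lists are illustrative assumptions
sysuse auto, clear
global in  "mpg weight"
global out "price length"
* "of global" loops over the words stored in the global macro
foreach v of global out {
    quietly regress `v' $in
    display "`v': R2 = " %5.3f e(r2) ", N = " e(N)
}
```

Changing the specification then means editing one global definition rather than every estimation command.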
Event loop: the beginning of the estimations (so far)

    forvalues d=6(1)18 {
        use data, clear
        capture log close
        capture drop our_pscore* caliper* mean* diff* ttest* se_after* se_before* treated nontreated
        log using priv_caliper_`d', text replace
        * estimate the propensity score once per event period
        pscore d`d' $in, pscore(our_pscore_`d')
        ttest our_pscore_`d', by(d`d') unequal
        capture drop sd_nontreated sd_treated
        gen sd_nontreated=`r(sd_1)'
        gen sd_treated=`r(sd_2)'
        * data-fitted caliper: pooled standard deviation of the pscore
        gen caliper_`d'= ((sd_treated^2+sd_nontreated^2)/2)^0.5
        sum caliper_`d'
        local c_real=`r(mean)'
        hist our_pscore_`d', by(d`d')
        * graph export (not graph save) writes a real PNG file
        graph export "our_pscore_d`d'.png", replace
        psmatch2 d`d' our_pscore_`d', out($out $outf1 $outf2) common add mahalanobis(nace) caliper(`c_real')
        * (the loop continues on the following slides)
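The data-fitted caliper used above is the square root of the average of the two group variances of the propensity score. A small sketch with official commands only; the auto dataset and the logit specification are assumptions standing in for the real pscore step.

```stata
* Sketch only: logit on the auto data stands in for the pscore estimation (assumption)
sysuse auto, clear
quietly logit foreign mpg weight
predict double ps, pr
* ttest leaves the two group standard deviations in r(sd_1) and r(sd_2)
quietly ttest ps, by(foreign) unequal
scalar c_real = sqrt((r(sd_1)^2 + r(sd_2)^2)/2)
display "data-fitted caliper = " %6.4f c_real
```

Storing the value in a scalar (or a local) avoids generating a constant variable only to summarize it again.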
Getting from results to „resultssets”
Why (and what) do we need (in) the resultssets?
• Why?
  • Most importantly: without resultssets we cannot
    • analyse the changes over time
    • decompose the observed differentials
  • If we do not do it automatically, everything would have to be copied manually from the logs: many estimations, many variables, etc.
• What? Step 1: find out the reality
  • Size of each of the three groups: treated, total and control (= matched)
  • Averages in all three groups (medians, etc.)
  • Knowledge of whether they are in fact different (= test of statistical significance based on the difference and the standard error of this difference)
• What? Step 2: find out how good the findings are statistically
  • Balancing properties!
Variables loop: our solution to step 1

    * store the psmatch2 results first: summarize and ttest below overwrite r()
    foreach out in $out $outf1 $outf2 {
        local se_after=r(seatt_`out')
        gen se_after_`out'=`se_after'
        local diff_after=r(att_`out')
        gen diff_after_`out'=`diff_after'
    }
    foreach out in $out $outf1 $outf2 {
        sum `out' if d`d'==0 & _support==1
        local mean_nontreated=r(mean)
        gen mean_nontreated_`out'=`mean_nontreated'
        sum `out' if d`d'==1 & _support==1
        local mean_treated=r(mean)
        gen mean_treated_`out'=`mean_treated'
        ttest `out' if _support==1, by(d`d') unequal
        local se_before=r(se)
        gen se_before_`out'=`se_before'
        local mean_before=r(mu_2)-r(mu_1)
        gen diff_before_`out'=`mean_before'
        gen ttest_before_`out'=diff_before_`out'/se_before_`out'
        gen ttest_after_`out'=diff_after_`out'/se_after_`out'

CONTINUED ON THE NEXT SLIDE
Variables loop: our solution to step 1 (continued)

        foreach type in before after {
            label var se_`type'_`out' "Standard error of difference `type' matching"
            label var diff_`type'_`out' "Difference `type' matching"
            label var ttest_`type'_`out' "T-test of difference"
        }
        label var mean_treated_`out' "Mean of treated companies"
        label var mean_nontreated_`out' "Mean of non-treated companies (before matching)"
    }
    count if d`d'==1 & _support==1
    local treated=r(N)
    gen treated=`treated'
    label var treated "No of treated companies"
    count if d`d'==0 & _support==1
    local nontreated=r(N)
    gen nontreated=`nontreated'
    label var nontreated "No of control companies"
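The „before matching” statistics in step 1 are just the difference in group means divided by its standard error. A minimal sketch, assuming the auto dataset with foreign as the grouping variable, that checks the hand-made statistic against the one ttest reports:

```stata
* Sketch only: dataset and variables are illustrative assumptions
sysuse auto, clear
quietly ttest price, by(foreign) unequal
* copy everything needed out of r() before any other r-class command runs
scalar s_diff      = r(mu_2) - r(mu_1)
scalar s_se        = r(se)
scalar s_t_builtin = r(t)
scalar s_t         = s_diff/s_se
display "difference = " s_diff ", s.e. = " s_se ", t = " s_t
```

ttest computes its own t statistic for group 1 minus group 2, so the hand-made value has the opposite sign and the same magnitude.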
Variables loop: our solution to step 2

    pstest $in
    foreach v in $in {
        capture local bias_reduction=r(bired_`v')
        capture local pvalue_bef=r(pbef_`v')
        capture local pvalue_after=r(paft_`v')
        capture gen b_red_`v'=`bias_reduction'
        capture gen pval_bef_`v'=`pvalue_bef'
        capture gen pval_aft_`v'=`pvalue_after'
    }
    outsheet b_red* pval* using stats_priv_`d', replace
    psgraph
    graph save priv_support_`d', replace
    graph export priv_support`d'.png, replace
    drop b_red* pval*
Solving the problem of „missing” statistics
• Look into the „ado” file you are using (the procedure)
• Throughout the file, there are commands of the form
    return scalar x=`somelocal'
• Sometimes, for clarity, scalars are dropped at the end of the procedure
• Your preferred statistic (if it is in the output, it has to be at least a local) would simply have to have a local like that too
• If it does not, you can always generate it based on your preferences and the available locals
=> Modify the original ado file
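How „return scalar” lines make a statistic available can be seen in a tiny r-class program. The program below, meandiff_demo, is hypothetical and written only for this illustration; it is not taken from psmatch2 or any other ado file.

```stata
* Sketch only: meandiff_demo is a hypothetical program made up for this example
capture program drop meandiff_demo
program define meandiff_demo, rclass
    syntax varname, by(varname)
    tempname m0 m1
    quietly summarize `varlist' if `by' == 0, meanonly
    scalar `m0' = r(mean)
    quietly summarize `varlist' if `by' == 1, meanonly
    scalar `m1' = r(mean)
    display "difference in means: " `m1' - `m0'
    * anything not returned here is computed but invisible to the caller
    return scalar mean0 = `m0'
    return scalar mean1 = `m1'
    return scalar diff  = `m1' - `m0'
end

sysuse auto, clear
meandiff_demo price, by(foreign)
return list
```

Deleting one of the three return lines leaves the displayed output unchanged but removes that statistic from return list, which is exactly the situation described above.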
Solving the problem of „missing” statistics: example 1

Original ado file, line 380:

    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        sum `v' if _treated==1, mean
        scalar `u1u' = r(mean)
        sum `v' if _treated==0, mean
        scalar `u0u' = r(mean)
        sum `v' if _treated==1 & _support==1, mean
        scalar `m1t' = r(mean)
        local n1 = r(N)
        sum _`v' if _treated==1 & _support==1, mean
        scalar `m0t' = r(mean)
        scalar `att' = `m1t' - `m0t'
        scalar `dif0' = `u1u' - `u0u'
        return scalar att = `att'
        return scalar att_`v' = `att'

Modified ado file, line 380:

    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        * ... all the same as earlier, plus:
        return scalar diff = `dif0'
        return scalar diff_`v' = `dif0'
        return scalar mean0 = `u0u'
        return scalar mean0_`v' = `u0u'
        return scalar mean1 = `u1u'
        return scalar mean1_`v' = `u1u'
Solving the problem of „missing” statistics: example 2

Original ado file, line 440:

    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]

Modified ado file, line 440:

    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]
    return scalar seols = `seols'
    return scalar seols_`v' = `seols'
Problems with bootstrap
Problems with bootstrap
• Why did we need bootstrap?
  • After estimation, the s.e.'s were relatively large (heterogeneous sample)
  • When we tried bootstrapping, the reduction in the size of the s.e.'s was roughly 50%, while the estimators were essentially unaffected
• What problems with bootstrap?
  • Need to run it separately for each variable (it bootstraps only one standard error at a time)
  • Output is given in a totally different form
  • It takes a looong time
  • New piece of code for just the BS standard errors => new variable loops within each time loop
Problems with bootstrap

    foreach out in $out $outf1 $outf2 {
        use data, clear
        * this is where the initial pscore comes in useful
        sum caliper_`d'
        local c_real=`r(mean)'
        bootstrap r(att): psmatch2 d`d' our_pscore_`d', out(`out') common add mahalanobis(nace) caliper(`c_real')
        * without this, no resultssets
        matrix mat = e(b), e(se)
        mat li mat
        svmat mat
        rename mat1 a`d'_diff_after_bs_`out'
        rename mat2 a`d'_se_after_bs_`out'
        gen time_of_event=`d'
        keep se* diff* ttest* mean* time_of_event a*
        drop if _n>1
        save priv_bs_`out'`d', replace
    }
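The same route from bootstrap output to a one-row resultsset can be tried with official commands only. In this sketch a difference in means from ttest stands in for r(att) from psmatch2; the auto dataset, the seed, the number of replications and the variable names are all assumptions for illustration.

```stata
* Sketch only: ttest stands in for psmatch2; all names and settings are assumptions
sysuse auto, clear
set seed 12345
bootstrap diff = (r(mu_2) - r(mu_1)), reps(50) strata(foreign) nodots: ttest price, by(foreign) unequal
* point estimate and bootstrap standard error side by side in one matrix
matrix res = e(b), e(se)
matrix list res
* svmat turns the matrix columns into variables res1 and res2
svmat res
rename res1 diff_bs
rename res2 se_bs
keep diff_bs se_bs
drop if _n > 1
list
```

The resulting one-observation dataset can be saved and later merged or appended with the resultssets from the other estimations.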
Final steps
• Merge the files obtained from bootstrap on „event” (to have a complete resultsset within each „event” period)
• Merge bootstrap resultssets with
• Append the files for „event” periods
• Organise the data
• Produce tables and graphs (again in loops)
• Write the paper
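A rough sketch of the merge and append steps, using tiny made-up resultssets. All numbers and variable names below are invented for illustration, and the merge 1:1 syntax assumes Stata 11 or newer.

```stata
* Sketch only: every value and file below is made up for illustration
clear
input time_of_event diff_after
6 1.2
end
tempfile main6 bs6 main7
save `main6'

clear
input time_of_event se_after_bs
6 0.5
end
save `bs6'

clear
input time_of_event diff_after se_after_bs
7 0.9 0.4
end
save `main7'

* merge the bootstrap results into the main resultsset of the same event period
use `main6', clear
merge 1:1 time_of_event using `bs6', nogenerate
* then stack the event periods on top of each other
append using `main7'
sort time_of_event
list
```

With one row per event period, producing the tables and graphs in further loops becomes straightforward.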
The resulting graphs (1) • There are 6x3 figures altogether
The resulting graphs (2) • There are 6x2 figures altogether
The resulting graphs (3) • There are 6x3 figures altogether
Some advice we did not take at the right time
• Save your computer's time (your wasted time is your problem)
• Use „sample 10” for testing your procedures: it saves a lot of time
• Leaving a mess is not useful if you ever want to come back
• Your memory lasts shorter than that of saved files: describing do-files really helps
• Loops are better than copy & paste, and less messy too
• Stata is not that complicated: modifying ado-files is really easy if you know what you want
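The „sample 10” advice in practice: draw a 10% random subsample, debug the do-file on it, and only then run on the full data. The dataset, the seed and the regression are assumptions for illustration.

```stata
* Sketch only: dataset, seed and model are illustrative assumptions
sysuse auto, clear
* fix the seed so that the test subsample is reproducible
set seed 2010
* keep a 10% random sample of the observations
sample 10
count
* run the (here trivial) procedure on the small sample first
quietly regress price mpg
display "test run on " e(N) " observations"
```

Once the code runs cleanly, commenting out the sample line is the only change needed for the full run.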
Thank you for your attention!