Experiences with multiple propensity score matching

Jan Hagemejer & Joanna Tyrowicz
University of Warsaw & National Bank of Poland

Plan
• Standard solutions to the automation challenge
• Where do they not work? The example of propensity score matching
• Using loops and global macros together
• Generating resultssets for atypical estimations
• Difficulties with using bootstrap (and obtaining resultssets)
• Summary comments … and some (hard-learned) advice

The standard route
• Problem: several estimations of similar form + need to compare results.
• Three simple solutions:
  • Solution 1: brute force = sit & type (copy/paste from output)
  • Solution 2: use parmest (Roger Newson) if the estimations are on simple categories in the data (limitations of the "by" command)
  • Solution 3: use loops
    • outreg/outreg2
    • nicely formatted tables,
    • publication-ready,
    • in many formats, even directly to Word or LaTeX.
• Note: if you need nice summary statistics, you can use outsum either with by or within loops
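The loop route (Solution 3) can be sketched as follows. This is a minimal illustration, not the authors' code: it uses Stata's bundled auto dataset and the built-in estimates store / estimates table as a stand-in for outreg2, so that it runs without installing anything. The outcome list and the regression specification are assumptions made for the example.

```stata
* Minimal sketch on Stata's bundled auto data (not the authors' dataset).
sysuse auto, clear
local outcomes price mpg weight          // illustrative outcome list
local models
foreach y of local outcomes {
    quietly regress `y' foreign length   // same specification, different outcome
    estimates store m_`y'                // keep the results under a name
    local models `models' m_`y'
}
* One table with all models side by side
estimates table `models', b(%9.3f) se(%9.3f) stats(N r2)
```

With outreg2 installed from SSC, the usual pattern would be a call such as outreg2 using results, append inside the loop instead of estimates store.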

Where do the problems come from?
• The 2nd and 3rd solutions work only with regression-type estimations
• However, some procedures are incompatible with pre-cooked solutions
• Need to report:
  • output of the procedure
  • sample properties after matching
  • balancing properties of matching
• Problem 1: actually, none of these is in the typical output
• Problem 2: we need it for many estimations looped over many variables, and each one of them takes a long time

Detailed problem description
• Analyse the effects of privatisation
• Take two firms, A and A'. Firm A gets privatised. Firm A' does not get privatised (ever). We want to compare firms A and A' each year before and after the privatisation of firm A (in fact we are comparing private firms to privatised SOEs, because few SOEs are left in the sample)
• Observe what happens before and after the "event" of privatisation
  • E.g. firm A may be one year before privatisation in 1999 and firm B in 2006, so the "event" is an anchor and time "runs" both ways.
• Effects may be observed in many spheres:
  • E.g. profits, investments, international competitiveness, employment, productivity
• Effects may be due to self-selection
  • E.g. only better firms are privatised, so the difference in performance is not due to privatisation (there might be other reasons why firms are privatised, related to, for instance, budget pressure).
• Use propensity score matching to compare privatised firms to non-privatised firms

What we want to get:

Detailed problem description
• Thus, in our case:
  • Many time periods (for each "time-to-anchor" a separate estimation)
  • Many variables (for each variable separate outcomes, but within one "anchor" the same balancing properties)
  • Two ways of estimating: regular and bootstrapping (especially the latter made things complex)
  • Each estimation: roughly 1.5-3.5 hours (big dataset)
  • Over a hundred estimations
  • To verify if the matching is OK, need to check balancing properties
• Additional pitfalls:
  • We needed some statistics for all estimations and they were not in the return list
  • More precisely: the procedure computes them to be able to produce output, but they were not added to the return list by the authors

Summary of the problems
Our problem was quite specific… BUT consisted of many general problems:
• Loops take a lot of time, so we need to find efficient ways
• Some things cannot be obtained fast => even more reason to run them automatically
• Obtaining datasets of the results we need (so-called resultssets)
• Getting visible data if they are not returned as output
• Using invisible data
• Getting around the difficulties with bootstrap

The structure of our estimations

How can global macros be useful?

Time loop: Using global macros for estimations
• Our application: observe the same firms back and forth from the moment of privatisation ("anchor")
• "Anchors" happen in different years
• But we can only match on one dimension: whether or not the firm has the "anchor"
• Conceptual solution: use lags and forwards to get the time dimension
• Technical problem: many outcome variables and de facto many loops
• Technical solution: define matching variables and output variables separately
    global in "capital roa export_status etc…"                // MATCHING VARS!
    global out "productivity employment efficiency etc…"      // COMPARISON VARS!
    global outf "forwards of $out"
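One way the "forwards of $out" list could be built is sketched below. This is an assumption about the mechanics, not the authors' code: the panel is simulated, the variable names (firm, year, capital, roa, productivity, employment) and the _f1 suffix are hypothetical, and only one-period forwards are generated with the F1. time-series operator.

```stata
* Simulated firm-year panel; all names here are hypothetical.
clear
set seed 12345
set obs 50
gen firm = _n
expand 10                                // 10 years per firm
bysort firm: gen year = 1995 + _n
gen capital      = runiform()
gen roa          = rnormal()
gen productivity = rnormal()
gen employment   = rpoisson(100)
xtset firm year                          // needed for the F. and L. operators

global in  "capital roa"                 // matching variables
global out "productivity employment"     // comparison (outcome) variables

* Build one-period forwards of every outcome and collect their names
global outf ""
foreach v of global out {
    gen `v'_f1 = F1.`v'
    global outf "$outf `v'_f1"
}
display "$outf"
```

Lags can be collected in the same way with L1. in place of F1., which would give a list like the $outl used later in the slides.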

Getting from results to "resultssets"

Why (and what) do we need (in) the resultssets?
• Why?
  • Most importantly: without resultssets we cannot
    • analyse the changes over time
    • decompose the observed differentials
  • If we do not do it automatically, it would have to be copied manually from logs: many estimations, many variables, etc.
• What? Step 1: find out the reality
  • Size of each of the three groups: treated, total and control (= matched)
  • Averages in all three groups (medians, etc.)
  • Knowledge of whether they are in fact different (= test of statistical significance based on the difference and the standard error of this difference)
• What? Step 2: find out how good the findings are statistically
  • Balancing properties!

Event loop: Our solution to step 1
• Initialize the store for our resultssets using postfile. Index the result table with variable names, years and other things that the code loops around (indices, variable_names_for_results and filename are placeholders)
    tempname memhold
    postfile `memhold' indices variable_names_for_results using filename
• Start the big loop (event)
    forvalues d=6(1)18 {
• Run pscore (needed for bootstrap) and subsequently psmatch
    psmatch2 d`d' our_pscore_`d', out($out $outf $outl) someoptions

Variables loop: Our solution to step 1
• Run pscore and psmatch
    psmatch2 d`d' our_pscore_`d', out($out $outf $outl) someoptions
• Start the loop
    foreach out in $out $outf1 $outf2 {
• Generate means and standard errors for treated/matched/unmatched, using output from psmatch (some more about this later)
    local se_after=r(seatt_`out')
• Post the locals to the postfile using the post command in each loop iteration
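The postfile pattern from step 1 (event loop outside, variables loop inside, one post per iteration) can be sketched end to end. This is a stand-in under stated assumptions, not the authors' code: regress on the bundled auto data replaces psmatch2, rep78 plays the role of the event index d, and the posted variable names (event, outcome, diff, se) are hypothetical.

```stata
* Stand-in sketch: regress replaces psmatch2, rep78 replaces the event index.
sysuse auto, clear
global out "price mpg weight"            // illustrative outcome list

tempname memhold
tempfile results
postfile `memhold' int event str32 outcome double(diff se) using `results'

forvalues d = 3/5 {                      // stand-in for the event loop
    foreach out of global out {          // variables loop
        quietly regress `out' foreign if rep78 == `d'
        * one row per (event, outcome) pair
        post `memhold' (`d') ("`out'") (_b[foreign]) (_se[foreign])
    }
}
postclose `memhold'

use `results', clear                     // the resultsset is now a dataset
list, sepby(event)
```

Because the results end up in an ordinary dataset, they can be merged, reshaped and graphed like any other data.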

Specific loop: Our solution to step 2
• For balancing properties we need to use pstest over all the matching variables
    pstest $in
• In order to produce nice tables, we need to loop over all the matching variables in $in and create some locals in memory to later save them as separate variables:
    foreach in in $in {
        capture local bias_reduction=r(bired_`in')
        capture local pvalue_bef=r(pbef_`in')
        capture local pvalue_after=r(paft_`in')
        capture gen b_red_`in'=`bias_reduction'
        capture gen pval_ber_`in'=`pvalue_bef'
        capture gen pval_aft_`in'=`pvalue_after'
    }
• Spit out everything to a spreadsheet (alternatively you can use postfile again):
    outsheet b_red* pval* using stats_priv_`d', replace
• Make some graphs and clean up
    psgraph
    graph save priv_support_`d', replace
    drop b_red* pval*

"Missing statistics"

Solving the problem of "missing" statistics
• psmatch2 produces nice tables with all the required statistics. However, they are only shown on the screen and vanish right after that
• Look into the ado-file of the procedure you are using
• Throughout the file, there are commands
    return scalar x=`somelocal'
• Sometimes, for clarity, scalars are dropped at the end of the procedure
• Your preferred statistic (if it is in the output, it has to be at least a local) would simply need a return line like that too
• If it does not exist, you can always generate it based on your preferences and the available locals
=> Modify the original ado file
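The mechanism can be illustrated with a toy r-class program. The program below (mydiff) is hypothetical and written only for this illustration; it is not part of psmatch2. It computes two group means in temporary scalars, displays their difference, and makes the values available afterwards only because of the return scalar lines at the end.

```stata
* Locate the source file of an installed command (harmless if not installed)
capture noisily which psmatch2

* Hypothetical toy program showing the "return scalar" pattern
capture program drop mydiff
program define mydiff, rclass
    syntax varname, by(varname)
    tempname m1 m0
    quietly summarize `varlist' if `by' == 1, meanonly
    scalar `m1' = r(mean)
    quietly summarize `varlist' if `by' == 0, meanonly
    scalar `m0' = r(mean)
    display as text "Difference in means: " as result `m1' - `m0'
    * Without the lines below, the numbers would only appear on screen
    return scalar mean1 = `m1'
    return scalar mean0 = `m0'
    return scalar diff  = `m1' - `m0'
end

sysuse auto, clear
mydiff price, by(foreign)
return list                              // shows r(mean1), r(mean0), r(diff)
```

Deleting the three return scalar lines reproduces the situation described above: the result is displayed but cannot be picked up by a loop.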

Solving the problem of "missing" statistics: example 1

Original ado file, line 380:
    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        sum `v' if _treated==1, mean
        scalar `u1u' = r(mean)
        sum `v' if _treated==0, mean
        scalar `u0u' = r(mean)
        sum `v' if _treated==1 & _support==1, mean
        scalar `m1t' = r(mean)
        local n1 = r(N)
        sum _`v' if _treated==1 & _support==1, mean
        scalar `m0t' = r(mean)
        scalar `att' = `m1t' - `m0t'
        scalar `dif0' = `u1u' - `u0u'
        return scalar att = `att'
        return scalar att_`v' = `att'
        /* no "return" of needed scalars */

Modified ado file, line 380:
    qui foreach v of varlist `varlist' {
        replace _`v' = . if _support==0
        tempname m1t m0t u0u u1u att dif0
        … /* all the same as earlier, plus */
        return scalar diff = `dif0'
        return scalar diff_`v' = `dif0'
        return scalar mean0 = `u0u'
        return scalar mean0_`v' = `u0u'
        return scalar mean1 = `u1u'
        return scalar mean1_`v' = `u1u'

Solving the problem of "missing" statistics: example 2

Original ado file, line 440:
    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]

Modified ado file, line 440:
    return scalar seatt = `stderr'
    return scalar seatt_`v' = `stderr'
    qui regress `v' _treated
    scalar `ols' = _b[_treated]
    scalar `seols' = _se[_treated]
    return scalar seols = `seols'
    return scalar seols_`v' = `seols'

Problems with bootstrap

Problems with bootstrap
• When calculating standard errors, the psmatch procedure does not take into account that the propensity score is estimated. A possible solution to this is to use bootstrap.
• What are the problems with bootstrap?
  • Need to run it separately for each variable (it bootstraps only one standard error at a time)
  • Output is given in a totally different form
  • It takes a long time
  • New piece of code for just the BS standard errors => new variable loops within each time loop

Problems with bootstrap
• Again, create the postfile
• Run the actual bootstrap in loops (post results in every iteration)
    foreach out in $out $outf1 $outf2 {
        use data, clear
        bootstrap r(att): psmatch2 d`d' $in, out(`out') someoptions
        matrix mat = e(b), e(se)                 /* without this, no resultssets */
        svmat mat                                /* convert matrix to variables */
        rename mat1 a`d'_diff_after_bs_`out'     /* create meaningful names */
        rename mat2 a`d'_se_after_bs_`out'
        gen time_of_event=`d'
        post `postfile' indices (a`d'_diff_after_bs_`out') (a`d'_se_after_bs_`out')
    }
    postclose `postfile'
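The same bootstrap-and-post pattern can be tried on a small built-in example before running the long jobs. This sketch is a stand-in, not the authors' code: ttest on the bundled auto data replaces psmatch2, the bootstrapped statistic is a simple difference in means, and the posted variable names (outcome, diff_bs, se_bs) and the number of replications are assumptions chosen for the illustration.

```stata
* Stand-in sketch: ttest replaces psmatch2; one bootstrap per outcome.
tempname memhold res
tempfile bsresults
postfile `memhold' str32 outcome double(diff_bs se_bs) using `bsresults'

foreach out in price mpg {
    sysuse auto, clear                   // reload the data in every iteration
    quietly bootstrap diff = (r(mu_2) - r(mu_1)), reps(50) seed(12345): ///
        ttest `out', by(foreign)
    matrix `res' = e(b), e(se)           // point estimate and bootstrap s.e.
    local d_bs  = `res'[1,1]
    local se_bs = `res'[1,2]
    post `memhold' ("`out'") (`d_bs') (`se_bs')
}
postclose `memhold'

use `bsresults', clear
list
```

Posting scalars directly, as here, avoids the svmat and rename steps; which variant is more convenient depends on how the rest of the resultsset is organised.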

Final steps
• Merge the files obtained from bootstrap on "anchor" (to have a complete resultsset within each "anchor" period)
• Organise the data
• Produce tables and graphs (again in loops)
• Write the paper

The resulting graphs (1)
• 6 figures showing levels for 3 groups (15 matches each)

The resulting graphs (2)
• 6 figures showing the decomposition of the treated-unmatched difference (15 matches each)

The resulting graphs (3)
• 6xn figures showing the "balanced panel" version for all variables of the treated-unmatched difference

Some advice we did not take at the right time
• Use "sample 10" for testing procedures: it saves a lot of time
• Leaving a mess is not useful if we ever want to come back
• Your memory lasts shorter than that of saved files: documenting do-files really helps
• Loops are better than copy & paste, and less messy too
• Beware of changes in Stata syntax (all the time…)
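The "sample 10" advice can be combined with preserve and restore so that the full data come back after the test run. A minimal sketch on the bundled auto data; the seed value is arbitrary and the summarize call is only a placeholder for the procedure being tested.

```stata
sysuse auto, clear
set seed 2024                            // arbitrary seed, for a reproducible draw
preserve                                 // keep a copy of the full data
sample 10                                // keep a random 10% of observations
display "Testing on " _N " observations"
quietly summarize price                  // placeholder for the code under test
restore                                  // back to the full dataset
display "Full data restored: " _N " observations"
```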

Thank you for your attention!
