PhD Methods Duration Models December 1 & 15, 2008


Sessions & Inter-session • Session 1: Objectives & Tools; Questions → Models; Models → Analyses • Inter-session: Offline assignments • Session 2: Analyses → Presentations; Presentations → Questions

Objectives and Tools

Objectives • Understand duration models • What questions do they help answer? • What ‘flavors’ exist? • How do they work? • What do you need to check? • Get more practice with Stata • Watch me do tricks… • …and do your own with assignments!

Applied researcher’s toolkit • Reference texts • Wooldridge’s “Econometric Analysis of Cross Section and Panel Data” • Greene’s “Econometric Analysis” • StataCorp’s “Survival Analysis and Epidemiological Tables” reference manual • Rabe-Hesketh & Everitt’s “A Handbook of Statistical Analyses Using Stata” • Stata v10

Questions to Models

Typical questions? • Asked in life, engineering, economic and administrative sciences • Interested in the length of a spell of time: • What are predictors of this duration? • Does duration depend on elapsed time?

Key terms? • Spell (origin, failure, duration) • Hazard • At-risk • Risk set • Censoring (left, right)
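To make the key terms concrete, here is a minimal Python sketch (not part of the original deck). The `Spell` class, the `risk_set` helper and the three example units are hypothetical, and only right censoring is shown.

```python
from dataclasses import dataclass

@dataclass
class Spell:
    unit: str
    origin: float   # calendar time at which the unit becomes at risk
    exit: float     # calendar time of failure, or of last observation
    failed: bool    # False means the spell is right-censored

    @property
    def duration(self) -> float:
        # Analysis time: elapsed time since origin
        return self.exit - self.origin

def risk_set(spells, t):
    """Units still at risk at analysis time t (duration of at least t)."""
    return [s.unit for s in spells if s.duration >= t]

# Made-up spells for illustration only
spells = [
    Spell("A", origin=0.0, exit=4.0, failed=True),
    Spell("B", origin=1.0, exit=3.0, failed=True),
    Spell("C", origin=2.0, exit=9.0, failed=False),  # observation ended first
]

print([s.duration for s in spells])
print(risk_set(spells, 3.0))
```

Unit C contributes to every risk set up to its censoring time even though its failure is never observed, which is why censored spells should not simply be dropped.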

A little theory • Spell length: $T \sim f(t)$, $f$ ‘nice’ • Cumulative probability: $F(t)$ • Survival function: $S(t) = 1 - F(t) = \Pr\{T \ge t\}$ • Hazard rate: $\lambda(t) = \lim_{\Delta t \to 0} \Pr\{T \in [t, t+\Delta t] \mid T \ge t\} / \Delta t = \lim_{\Delta t \to 0} \{F(t+\Delta t) - F(t)\} / \Delta t / S(t) = f(t) / S(t) = -d \ln S(t)/dt$
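The chain of equalities for the hazard rate can be checked numerically. The sketch below assumes a Weibull-type survival function with made-up parameters `LAM` and `P` (neither comes from the slides) and compares the limit definition, the ratio f(t)/S(t), and the negative log-derivative of S(t).

```python
import math

LAM, P = 0.5, 1.5  # hypothetical parameters, for illustration only

def S(t):
    # Assumed survival function: S(t) = exp(-(LAM * t) ** P)
    return math.exp(-((LAM * t) ** P))

def F(t):
    return 1.0 - S(t)

def f(t):
    # Density: f(t) = -dS/dt
    return LAM * P * (LAM * t) ** (P - 1) * S(t)

def hazard_limit(t, dt):
    # Pr{T in [t, t + dt] | T >= t} / dt for a finite dt
    return (F(t + dt) - F(t)) / dt / S(t)

def hazard_log_derivative(t, h=1e-6):
    # Central-difference approximation to -d ln S(t) / dt
    return -(math.log(S(t + h)) - math.log(S(t - h))) / (2 * h)

t = 2.0
exact = f(t) / S(t)
for dt in (1.0, 0.1, 0.001):
    print(dt, hazard_limit(t, dt))   # approaches `exact` as dt shrinks
print(exact, hazard_log_derivative(t))
```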

Hazard models • Non-parametric (e.g. Kaplan-Meier) • Semi-parametric (e.g. Cox) • Fully parametric (e.g. Weibull)

Kaplan-Meier non-parametric • Let’s look at a purely empirical estimate of the survivor function • Suppose $n_j$ is the number of units at risk before $d_j$ failures occur at $t_j$ • Then estimated $\hat{S}(t) = \prod_{j \mid t_j \le t} (n_j - d_j)/n_j$
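A minimal Python sketch of the product-limit formula, assuming the usual convention that units censored at a failure time still count as at risk at that time. The function name `kaplan_meier` and the six spells are made up for illustration; this is not Stata's implementation.

```python
def kaplan_meier(times, failed):
    """Return [(t_j, S_hat(t_j))] at each distinct failure time t_j."""
    data = sorted(zip(times, failed))
    at_risk = len(data)      # n_j: units at risk just before t_j
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0                # d_j: failures at t
        leaving = 0          # failures plus censorings at t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            surv *= (at_risk - d) / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

# Made-up durations; 1 = failure observed, 0 = right-censored
times = [2, 3, 3, 5, 6, 8]
failed = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, failed)
for t, s in curve:
    print(t, round(s, 4))
```

The censored spells at 3 and 6 never trigger a drop in the curve, but they shrink the risk set for later failure times.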

Kaplan-Meier implementation • KM curves are easily plotted in Stata:
    use http://www.stata-press.com/data/r10/drugtr
    list
    stset
    sts graph
    sts graph, by(drug) ci level(95)
    sts graph, by(drug)
    sts test drug
    sts test drug, wilcoxon
    gen ageless57 = (age < 57)
    sts test ageless57
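By default `sts test` reports a log-rank test of equal survivor functions across groups. The sketch below computes the standard two-group log-rank chi-squared statistic by hand; the function names and the six observations are hypothetical, and this is the textbook formula rather than Stata's code.

```python
import math

def logrank_chi2(times, failed, group):
    """Two-group log-rank chi-squared statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    failure_times = sorted({t for t, f in zip(times, failed) if f})
    for t in failure_times:
        n = sum(1 for s in times if s >= t)                        # at risk
        n1 = sum(1 for s, g in zip(times, group) if s >= t and g == 1)
        d = sum(1 for s, f in zip(times, failed) if s == t and f)  # failures
        d1 = sum(1 for s, f, g in zip(times, failed, group)
                 if s == t and f and g == 1)
        obs_minus_exp += d1 - d * n1 / n    # observed minus expected in group 1
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance

def chi2_pvalue_1df(stat):
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2.0))

# Made-up data: group 1 fails early, group 0 fails late, no censoring
times = [1, 2, 3, 4, 5, 6]
failed = [1, 1, 1, 1, 1, 1]
group = [1, 1, 1, 0, 0, 0]
chi2 = logrank_chi2(times, failed, group)
print(chi2, chi2_pvalue_1df(chi2))
```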

Cox’s semi-parametric model • Specification: $\lambda(t_i) = \exp(x_i \beta)\, \lambda_0(t_i)$ • Problem: how to estimate $\beta$ in the presence of the unknown baseline hazard $\lambda_0(t_i)$? • Solution: condition on exactly 1 individual leaving the risk set at the time of interest

Cox’s model • Let $T_k$ be the kth exit time, and let $R_k$ be the at-risk set at $T_k$. • Then $\Pr\{t_i = T_k \mid R_k\} = \exp(x_i \beta) / \sum_{j \in R_k} \exp(x_j \beta)$, sweeping out the $\lambda_0(t_i)$ terms • Maximize this partial likelihood function: $\ln L = \sum_k [x_k \beta - \ln(\sum_{j \in R_k} \exp(x_j \beta))]$
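The partial log-likelihood can be sketched in a few lines of Python for a single covariate and no tied exit times, with the coefficient written as `beta`. The four observations are made up, and the ternary search is only a stand-in for a proper optimizer (it relies on the partial log-likelihood being concave); Stata uses Newton-type iterations instead.

```python
import math

def cox_partial_loglik(beta, times, failed, x):
    """ln L = sum over exits k of [x_k*beta - ln(sum_{j in R_k} exp(x_j*beta))]."""
    ll = 0.0
    for k, (t_k, f_k) in enumerate(zip(times, failed)):
        if not f_k:
            continue  # censored spells enter only through the risk sets
        risk_sum = sum(math.exp(x[j] * beta)
                       for j, t_j in enumerate(times) if t_j >= t_k)
        ll += x[k] * beta - math.log(risk_sum)
    return ll

def maximize_concave(fn, lo=-5.0, hi=5.0, iters=200):
    # Ternary search; valid because fn is concave in beta
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if fn(m1) < fn(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Made-up data: the third spell is right-censored
times = [1, 2, 3, 4]
failed = [1, 1, 0, 1]
x = [1.0, 0.0, 1.0, 0.0]

beta_hat = maximize_concave(lambda b: cox_partial_loglik(b, times, failed, x))
print(beta_hat, cox_partial_loglik(beta_hat, times, failed, x))
```

For this toy data set the first-order condition can be solved by hand and gives exp(beta) equal to the square root of 2, so the search can be checked against 0.5 ln 2.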

Cox’s model if tied exit times • The partial likelihood function must now account for non-unique exit times. • Suppose $D_j$ is the set of units failing at time $t_j$, $d_j$ is the cardinality of that set, and $R_j$ is the at-risk set of units at $t_j$. Then (Breslow’s approximation): $\ln L = \sum_j [\sum_{k \in D_j} x_k \beta - d_j \ln(\sum_{i \in R_j} \exp(x_i \beta))]$
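A small Python sketch of the tied-exit log-likelihood above (the Breslow form), again for one covariate. The function name and the four observations are hypothetical; the point is only that each tied failure time contributes the sum of the failing units' linear predictors minus d_j times the log of the risk-set sum.

```python
import math

def breslow_partial_loglik(beta, times, failed, x):
    """Partial log-likelihood with tied exit times, Breslow approximation."""
    ll = 0.0
    for t_j in sorted({t for t, f in zip(times, failed) if f}):
        # D_j: units failing at t_j; R_j: units still at risk at t_j
        failing = [i for i, (t, f) in enumerate(zip(times, failed))
                   if t == t_j and f]
        at_risk = [i for i, t in enumerate(times) if t >= t_j]
        d_j = len(failing)
        risk_sum = sum(math.exp(x[i] * beta) for i in at_risk)
        ll += sum(x[k] * beta for k in failing) - d_j * math.log(risk_sum)
    return ll

# Made-up data: two units exit together at time 1; the last spell is censored
times = [1, 1, 2, 3]
failed = [1, 1, 1, 0]
x = [1.0, 0.0, 1.0, 0.0]

print(breslow_partial_loglik(0.0, times, failed, x))
print(breslow_partial_loglik(1.0, times, failed, x))
```

With no ties every d_j equals 1 and the expression reduces to the untied partial log-likelihood.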

Let’s try Cox’s model • Stata implements Cox’s model very clearly:
    use http://www.stata-press.com/data/r10/kva
    stset
    stcox load
    estimates store load
    stcox load bearings
    lrtest . load
    drop bearings
    stcox load
    * look at the estimated coefficient on load
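The `lrtest . load` step compares the current model (load and bearings) with the stored load-only model. As a rough sketch of the arithmetic behind a likelihood-ratio test with one added covariate, the Python below takes two log-likelihoods and returns the statistic and its p-value. The input values are made-up placeholders, not output from the kva data.

```python
import math

def lr_test_1df(ll_restricted, ll_full):
    """Likelihood-ratio test for one added parameter (chi-squared, 1 df)."""
    stat = max(0.0, 2.0 * (ll_full - ll_restricted))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail, 1 df
    return stat, p_value

# Hypothetical log-likelihoods, for illustration only
stat, p_value = lr_test_1df(ll_restricted=-100.0, ll_full=-98.0)
print(stat, p_value)
```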

What’s Stata doing with Cox? • Look at the Excel spreadsheet in http://faculty.fuqua.duke.edu/~willm/Classes/PhD/PhD_2008_2009_LongStrat/Strategy591_2008_2009_ResearchMethods.htm • I’ve tried to show in easy sequences how the ado file in Stata parallels the partial likelihood function we just learned. • Note the log-likelihood and the estimate of $\beta$ this spreadsheet yields. Compare with Stata

Key Cox assumptions • Recall the Cox specification for the hazard rate for individual i at time $t_k$: $\lambda_i(t_k) = \exp(x_i \beta)\, \lambda_0(t_k)$ • Consider the hazard ratio for two individuals i and m, again at time $t_k$: $\lambda_i(t_k)/\lambda_m(t_k) = \exp(x_i \beta)\lambda_0(t_k) / [\exp(x_m \beta)\lambda_0(t_k)] = \exp((x_i - x_m)\beta)$, a proportionality constant that does not depend on $t_k$
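A quick numerical illustration of the proportionality property: whatever the baseline hazard looks like, the hazard ratio between two individuals is the same at every time. The coefficient `BETA`, the two baseline functions and the covariate values are all hypothetical.

```python
import math

BETA = 0.7  # hypothetical coefficient

def baseline_flat(t):
    return 0.2          # constant baseline hazard

def baseline_rising(t):
    return 0.05 * t     # baseline hazard that grows with elapsed time

def hazard(x, t, baseline):
    # Cox specification: exp(x * beta) times the baseline hazard
    return math.exp(x * BETA) * baseline(t)

x_i, x_m = 2.0, 0.5
for baseline in (baseline_flat, baseline_rising):
    ratios = [hazard(x_i, t, baseline) / hazard(x_m, t, baseline)
              for t in (1.0, 5.0, 10.0)]
    print(baseline.__name__, ratios)

print(math.exp((x_i - x_m) * BETA))  # the common value of every ratio
```

If the effect of a covariate changes with elapsed time, this cancellation fails, which is what the graphical checks on the next slide look for.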

Testing Cox’s assumptions • Is global proportionality reasonable?
    use http://www.stata-press.com/data/r10/drugtr
    gen ageless57 = (age < 57)
    sts graph, by(ageless57)
    * curves roughly parallel?
    stcox drug
    stcox drug, strata(ageless57)
    stphplot, by(ageless57)
    * curves roughly parallel?
    stcoxkm, by(ageless57)
    * predicted vs observed?
• Mitigation with stratification

Testing Cox model residuals • Are there significant outliers?
    use http://www.stata-press.com/data/r10/kva
    stcox load bearings, mgale(mart)
    predict devr, deviance
    predict xb, xb
    twoway scatter devr xb
    * residuals look reasonable?
    stcox load bearings, esr(score*)
    twoway scatter score1 failtime
    * large deviations?
    twoway scatter score2 failtime
    * large deviations?
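For intuition about what is being plotted, here is a sketch of the textbook definitions: the martingale residual is the failure indicator minus the estimated cumulative hazard for the unit, and the deviance residual rescales it to be more symmetric around zero. The function names and the cumulative-hazard values are made up, and this is an illustration of the standard formulas rather than a reproduction of Stata's code.

```python
import math

def martingale_residual(failed, cum_hazard):
    # failed: 1 if the failure was observed, 0 if right-censored
    return failed - cum_hazard

def deviance_residual(mart, failed):
    # sign(M) * sqrt(-2 * [M + failed * ln(failed - M)])
    inner = mart + (failed * math.log(failed - mart) if failed else 0.0)
    return math.copysign(math.sqrt(max(0.0, -2.0 * inner)), mart)

# Hypothetical (failed, estimated cumulative hazard) pairs
for failed, cum_hazard in [(1, 0.5), (0, 0.5), (1, 3.0)]:
    m = martingale_residual(failed, cum_hazard)
    print(failed, cum_hazard, m, deviance_residual(m, failed))
```

Martingale residuals are bounded above by 1 but unbounded below, so the deviance transformation makes large residuals in either direction easier to spot in a scatter plot.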

Fully parametric • So far the underlying baseline hazard rate has been left unspecified. Parametric models specify it explicitly. • The easiest choice is the exponential survival function, in which the hazard rate is constant: $-d \ln S(t)/dt = \lambda(t) \equiv \lambda \Rightarrow S(t) = \exp(-\lambda t)$
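One consequence of a constant hazard is that elapsed time carries no information: the chance of surviving a further stretch of time is the same however long the spell has already lasted. A short check in Python, with a made-up hazard rate `LAM`:

```python
import math

LAM = 0.3  # hypothetical constant hazard rate

def survival(t):
    # Exponential survival function: S(t) = exp(-LAM * t)
    return math.exp(-LAM * t)

def conditional_survival(t, s):
    """Pr{T >= s + t | T >= s}: survive t more units after surviving s."""
    return survival(s + t) / survival(s)

for s in (0.0, 2.0, 10.0):
    print(s, conditional_survival(5.0, s))  # same value for every s
```

This is the sense in which the exponential model answers "does duration depend on elapsed time?" with a flat no, and why richer parametric families are often needed.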

Other fully parametric models • Weibull specification of a monotonic hazard rate with $p > 0$: $\lambda(t) \equiv \lambda p (\lambda t)^{p-1}$
    use http://www.stata-press.com/data/r10/kva
    streg load bearings, d(weibull)
    stcurve, hazard
    streg load bearings, d(exponential)
    stcurve, hazard
    sts graph, hazard
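The shape parameter p controls the direction of duration dependence in the Weibull hazard. The sketch below evaluates the hazard written as lam * p * (lam * t) ** (p - 1) on a small grid; the rate `lam = 0.5` and the grid are arbitrary choices for illustration.

```python
def weibull_hazard(t, lam, p):
    # Weibull hazard: lam * p * (lam * t) ** (p - 1)
    return lam * p * (lam * t) ** (p - 1)

grid = [0.5, 1.0, 2.0, 4.0]
for p in (0.5, 1.0, 2.0):
    values = [round(weibull_hazard(t, 0.5, p), 4) for t in grid]
    print(p, values)
# p < 1: hazard falls with elapsed time
# p = 1: constant hazard (the exponential model)
# p > 1: hazard rises with elapsed time
```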

Models to Analyses

Practice (1): Equine risks • stset the data • Basic non-parametric exploration • More parametric models • Model assumptions tested • Constructing additional variables as needed

Practice (2): Military risks • stset the data • Basic non-parametric exploration • More parametric models • Model assumptions tested • Constructing additional variables as needed

Practice (3): Hospital stay risks • stset the data (warning: this is a huge set) • Basic non-parametric exploration • More parametric models • Model assumptions tested • Constructing additional variables as needed

Offline assignments

Assignments • Data assignment • Reading assignment

Assignments: data • Datasets from military, veterinary and medical science • Data may be fictional, is certainly de-identified, and should not be re-used • Think of a simple, plausible research question, model it, analyze one set of data, then write up and present your results (1-2 pages)

Assignments: reading • Read and briefly critique each of: • Jensen, M. 2006. Should we stay or should we go? Accountability, status anxiety, and client defections. ASQ 51: 97-128 • Rao H, Greve HR, Davis GF. 2001. Fool's gold: social proof in the initiation and abandonment of coverage by Wall Street analysts. ASQ 46(3): 502-526 • My 2008 working paper on cardiologists

Assignments: reading… • Typical questions we’ll discuss • Are the research question, data and the model choice congruent? • How else could they have answered the question? • Different data? • Different model? • Different analysis? • Is the presentation of the analyses clear and compelling? • Do you buy it? Why or why not? • What is left to do?

Assignments… • Reading and datasets posted at http://faculty.fuqua.duke.edu/~willm/Classes/PhD/PhD_2008_2009_LongStrat/Strategy591_2008_2009_ResearchMethods.htm • Email your write-up by next Friday, Dec 12, by close-of-business • Be prepared to discuss reading and answer questions on Monday, Dec 15

Analyses to Presentation

Equine data • What predicts fatal injury hazard here? • Does that make sense? • What model did you use and why? • How did you check it? • What summary results do you have? • What’s missing in the data? • What’s wrong with our model?

Discharge data • What predicts the discharge hazard? • Does that make sense? • What model did you use and why? • How did you check it? • What summary results do you have? • What’s missing in the data? • What’s wrong with our model?

Military data • What predicts the fatal wound hazard? • Does that make sense? • What model did you use and why? • How did you check it? • What summary results do you have? • What’s missing in the data? • What’s wrong with our model?

Presentation to Questions

Some recent ‘presentations’ • Jensen, M. 2006. Should we stay or should we go? Accountability, status anxiety, and client defections. Administrative Science Quarterly 51: 97-128 • Rao H, Greve HR, Davis GF. 2001. Fool's gold: social proof in the initiation and abandonment of coverage by Wall Street analysts. Administrative Science Quarterly 46(3): 502-526 • My working paper on cardiologists

Loose Ends & The End

We (probably) didn’t cover… • What to do when covariates vary over time? • What to do about a lot of left censoring? • Frailty models for omitted variables • Shared frailty models to explain similarity in duration in groups of units → The Stata manual and experimentation are almost always the best next steps

Summary • Neat modeling tools exist when you have data on timings and care about differences in those timings and the reasons for them • Really neat when you care about firm longevity, leadership durations, or spells of some management activity

Thank you!
