SC968: Panel Data Methods for Sociologists

SC968: Panel Data Methods for Sociologists Introduction to survival/event history models

Types of outcome Continuous OLS Linear regression Binary Binary regression Logistic or probit regression Time to event data Survival or event history analysis

Examples of time to event data • Time to death • Time to incidence of disease • Unemployed - time till find job • Time to birth of first child • Smokers – time till quit smoking

Time to event data • Analyse durations or length of timeto reach endpoint • Data are usually censored • Don’t follow sample long enough for everyone to get to the endpoint (e.g. death)

4 key concepts for survival analysis • States • Events • Risk period • Duration

States • States are categories of the outcome variable of interest • Each person occupies exactly one state at any moment in time • Examples • alive, dead • single, married, divorced, widowed • never smoker, smoker, ex-smoker • Set of possible states called the state space

Events • A transition from one state to another • From an origin state to a destination state • Possible events depend on the state space • Examples • From smoker to ex-smoker • From married to widowed • Not all transitions can be events • E.g. from smoker to never smoker

Risk period • Not all people can experience each state throughout the study period • To be able to have a particular event, one must be in the origin state at some stage • Example • can only experience divorce if married • The period of time that someone is at risk of a particular event is called the risk period • All subjects at risk of an event at a point in time called the risk set

Duration • Event history analysis is to do with the analysis of the duration of a nonoccurrence of an event or the length of time during the risk period • Examples • Duration of marriage • Length of life • In practice we model the probability of a transition conditional on being in the risk set

Example data ID Entry date Died End date 1 01/01/1991 01/01/2008 • 01/01/1991 01/01/2000 01/01/2000 3 01/01/1995 01/01/2005 4 01/01/1994 01/07/2004 01/07/2004

Calendar time Study follow-up ended 1991 1994 1997 2000 2003 2006 2009

Study time in years censored event censored event 0 3 6 9 12 15 18

Censoring • An observation is censored if it has incomplete information • We will only consider right censoring • That is, the person did not have an event during the time that they were studied • Common reasons for right censoring • the study ends • the person drops-out of the study • the person has to be taken off a drug

Survival or event history data characterised by 2 variables Time or duration of risk period Failure (event) 1 if not survived or event observed 0 if censored or event not yet occurred Data

What is the data structure? ID Entry date Died End date Duration Event • 01/01/1991 01/01/2008 17.0 0 • 01/01/1991 01/01/2000 01/01/2000 9.0 1 • 01/01/1995 01/01/2005 10.0 0 • 01/01/1994 01/07/2004 01/07/2004 10.5 1 The row is a person The tricky part is often calculating the duration Remember we need an indicator for observed events/ censored cases

Worked example • Random 20% sample from BHPS • Waves 1 – 15 • One record per person/wave • Outcome: Duration of cohabitation • Conditions on cohabiting in first wave • Survival time: years from entry to the study in 1991 till year living without a partner

The data Duration = 6 years Event = 1 Ignore data after event = 1

The data (continued) Note missing waves before event

Preparing the data Select records for respondents who were cohabiting in 1991 Declare that you want to set the data to survival time Important to check that you have set data as intended

Checking the data setup time of entry 1 if observation is to be used and 0 otherwise time of exit 1 if event, 0 if censoring or event not yet occurred

Checking the data setup How do we know when this person separated?

Trying again!

Checking the new data setup Now censored instead of an event

Summarising time to event data • Individuals followed up for different lengths of time • So can’t use prevalence rates (% people who have an event) • Use rates instead that take account of person years at risk • Incidence rate per year • Death rate per 1000 person years

Summarising time to event data Number of observations Person-years <25% of sample had event by 15 elapsed years Rate per year

List the cumulative hazard function Default is the survivor function

Graphs of survival time • Kaplan-Meier estimate of survival curve • The Kaplan-Meier method estimates the cumulative probability of an individual surviving after baseline to any time, t

Kaplan-Meier graphs • Can read off the estimated probability of surviving a relationship at any time point on the graph • E.g. at 5 years 88% are still cohabiting • The survival probability only changes when an event occurs • So the graph is stepped and not a smooth curve

Testing equality of survival curves among groups The log-rank test A non –parametric test that assesses the null hypothesis that there are no differences in survival times between groups

Log-rank test example Significant difference between men and women

The Cox regression model

Event History with Cox regression model Event History with Cox Model • No longer modelling the duration • Modelling the hazard • Hazard: measure of the probability that an event occurs at time t conditional on it not having occurred up until t • Also known as the Cox proportional hazard model

Some hazard shapes • Increasing • Onset of Alzheimer's • Decreasing • Survival after surgery • U-shaped • Age specific mortality • Constant • Time till next email arrives

Cox regression model • Regression model for survival analysis • Can model time invariant and time varying explanatory variables • Produces estimated hazard ratios (sometimes called rate ratios or risk ratios) • Regression coefficients are on a log scale • Exponentiate to get hazard ratio • Similar to odds ratios from logistic models

Cox regression equation is the hazard function for individual i is the baseline hazard function and can take any form It is estimated from the data (non parametric) are the covariates are the regression coefficients estimated from the data Effect of covariates is constant over time (parameterised) This is the proportional hazards assumption Therefore, Cox regression referred to as a semi-parametric model

Cox regression in Stata • Will first model a time invariant covariate (sex) on risk of partnership ending • Then will add a time dependent covariate (age) to the model

Cox regression in Stata

Interpreting output from Cox regression • Cox model has no intercept • It is included in the baseline hazard • In our example, the baseline hazard is when sex=1 (male) • The hazard ratio is the ratio of the hazard for a unit change in the covariate • HR = 1.3 for women vs. men • The risk of partnership breakdown is increased by 30% for women compared with men • Hazard ratio assumed constant over time • At any time point, the hazard of partnership breakdown for a woman is 1.3 times the hazard for a man

Interpreting output from Cox regression (ii) • The hazard ratio is equivalent to the odds that a female has a partnership breakdown before a man • The probability of having a partnership breakdown first is = (hazard ratio) / (1 + hazard ratio) • So in our example, a HR of 1.30 corresponds to a probability of 0.57 that a woman will experience a partnership breakdown first • The probability or risk of partnership breakdown can be different each year but the relative risk is constant • So if we know that the probability of a man having a partnership breakdown in the following year is 1.5% then the probability of a woman having a partnership breakdown in the following year is 0.015*1.30 = 1.95%

Time dependent covariates • Examples • Current age group rather than age at baseline • GHQ score may change over time and predict break-ups • Will use age to predict duration of cohabitation • Nonlinear relationship hypothesised • Recode age into 8 equally spaced age groups

Cox regression with time dependent covariates

Cox regression assumptions • Assumption of proportional hazards • No censoring patterns • True starting time • Plus assumptions for all modelling • Sufficient sample size, proper model specification, independent observations, exogenous covariates, no high multicollinearity, random sampling, and so on

Proportional hazards assumption • Cox regression with time-invariant covariates assumes that the ratio of hazards for any two observations is the same across time periods • This can be a false assumption, for example using age at baseline as a covariate • If a covariate fails this assumption • for hazard ratios that increase over time for that covariate, relative risk is overestimated • for ratios that decrease over time, relative risk is underestimated • standard errors are incorrect and significance tests are decreased in power

Testing the proportional hazards assumption • Graphical methods • Comparison of Kaplan-Meier observed & predicted curves by group. Observed lines should be close to predicted • Survival probability plots (cumulative survival against time for each group). Lines should not cross • Log minus log plots (minus log cumulative hazard against log survival time). Lines should be parallel

Testing the proportional hazards assumption • Formal tests of proportional hazard assumption • Include an interaction between the covariate and a function of time. Log time often used but could be any function. If significant then assumption violated • Test the proportional hazards assumption on the basis of partial residuals. Type of residual known as Schoenfeld residuals.

SC968: Panel Data Methods for Sociologists

SC968: Panel Data Methods for Sociologists

Presentation Transcript

What is research methods?

IMPLEMENTATION STRATEGIES

METHODS OF PERSUASION

Graham-Harvey (2001): Theory-Practice of Corporate Finance

Descriptive vs. Inferential Statistics

Econometric Analysis of Panel Data

Econometric Analysis of Panel Data

AHRQ 2010 Annual Conference AHRQ Data Resources to Inform Health Policy Research: The Medical Expenditure Panel Survey

Advanced Methods and Analysis for the Learning and Social Sciences

Multivariate Display

Qualitative Methods

Statistical Methods

Methods of Sociological Inquiry

Data Mining: Concepts and Techniques Cluster Analysis: Basic Concepts and Methods

Qualitative Data Analysis

CIS 2168 Data Structures and Algorithms in JAVA

3. Optimization Methods for Molecular Modeling

NEW RMU PORTAL – RMU Panel

Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 10 —

Spatial Clustering Methods

Quantitative Methods

Reversible Data Hiding