Challenges in survival analysis with large datasets. Noori Akhtar-Danesh, PhD McMaster University, Hamilton, Canada daneshn@mcmaster.ca
Challenges in survival analysis with large datasets Noori Akhtar-Danesh, PhD McMaster University, Hamilton, Canada daneshn@mcmaster.ca San Francisco, USA
Background
• The Canadian Community Health Survey, Cycle 3.1 (CCHS-3.1) is a large cross-sectional survey that includes information on over 130,000 Canadians.
• Preliminary results show that 63% of Canadians aged 12 years or older have ever smoked a whole cigarette.
Objectives & Challenges
• The main objective was to investigate the age of smoking initiation by gender and place of birth.
• We compared different survival analysis techniques, including Cox regression and the available parametric methods.
• We highlight some challenges that we encountered in the search for an appropriate model.
Challenge: PH Assumption
• In large datasets, test-based assessment of the proportional hazards (PH) assumption is difficult because the Schoenfeld residual test is significant even for very small values of rho, simply because of the sample size.
• For the CCHS-3.1 dataset, the Schoenfeld test is significant for both the Sex and Birth Place variables, although the rho values are small.
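The sample-size effect described above can be illustrated with a generic correlation test. This is only a sketch: it uses the usual large-sample test of a zero correlation, not Stata's Schoenfeld residual test, and the value rho = 0.02 and the sample sizes are hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and a standard normal
    approximation to the t distribution (reasonable for large n).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    # erfc(|t| / sqrt(2)) is the two-sided tail area of a standard normal
    return math.erfc(abs(t) / math.sqrt(2.0))

rho = 0.02  # hypothetical, very weak correlation
for n in (500, 5_000, 80_000):  # hypothetical sample sizes
    print(f"n={n:>6}  rho={rho}  p={correlation_p_value(rho, n):.3g}")
```

The same tiny correlation is far from significant in a small sample and highly significant in a sample the size of a national survey, which is why the p-value alone says little about the size of a PH violation.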
Challenge: PH Assumption
• However, the log(-log) survival plot showed nearly parallel lines for these variables, which suggests that the PH assumption is satisfied.
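Why parallel lines point to proportional hazards can be seen from a short derivation. The notation (baseline survival S_0, coefficient beta, covariate x) is assumed here for illustration and does not come from the slides.

```latex
% Under proportional hazards with covariate x and coefficient \beta:
S(t \mid x) = S_0(t)^{\exp(\beta x)}
% Taking \log(-\log(\cdot)) of both sides:
\log\bigl(-\log S(t \mid x)\bigr) = \beta x + \log\bigl(-\log S_0(t)\bigr)
% So for two groups x = 0 and x = 1 the curves differ by the constant \beta
% at every t, i.e. they are parallel.
```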
Challenge: PH Assumption
• Perhaps we need to specify a minimum value for the correlation, for instance r = 0.33, below which a departure is not accepted as significant (as is common in fields such as factor analysis).
Challenge: PH Assumption
• However, if we incorporate the survey design into the analysis, the PH test works well for these variables, but the global test remains significant (in Stata).
Challenge: Appropriate parametric model
Challenge: Parametric models
• We fitted different parametric models that incorporate the survey design and sampling weights (using the svy: prefix in Stata).
• A Weibull model with frailty appeared to be the best model.
• However, we were not able to draw diagnostic graphs or obtain an overall goodness-of-fit (GOF) test because of the large sample size.
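How sampling weights enter a parametric fit can be sketched as a weighted (pseudo) log-likelihood. The sketch below is an assumption-laden illustration, not the Stata implementation: it uses a plain Weibull model with hazard p * lam * t^(p-1), leaves out the frailty term and the design-based variance, and all data and parameter values are hypothetical.

```python
import math

def weibull_pseudo_loglik(times, events, weights, lam, p):
    """Weighted Weibull log-likelihood (pseudo-likelihood).

    Each observation contributes w * (d * log h(t) + log S(t)), where
    h(t) = p * lam * t**(p - 1) and S(t) = exp(-lam * t**p).
    d is 1 for an observed event and 0 for a censored observation.
    """
    total = 0.0
    for t, d, w in zip(times, events, weights):
        log_hazard = math.log(p) + math.log(lam) + (p - 1.0) * math.log(t)
        log_surv = -lam * t ** p
        total += w * (d * log_hazard + log_surv)
    return total

# Hypothetical ages at initiation (or at interview, if censored)
times = [14.0, 16.0, 18.0, 25.0]
events = [1, 1, 1, 0]                  # last respondent never smoked
weights = [120.5, 80.0, 210.0, 95.5]   # hypothetical sampling weights
print(weibull_pseudo_loglik(times, events, weights, lam=0.001, p=2.0))
```

A weight of 2 on one respondent gives the same contribution as entering that respondent twice, which is the sense in which the weights scale the sample up to the population.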
Using Cure Fraction Models
• One main assumption in survival analysis is that everyone will eventually experience the event.
• However, a large proportion (37%) of individuals in the CCHS-3.1 dataset are censored (those who never started smoking).
Using Cure Fraction Models
• Therefore, it is more appropriate to use a cure fraction model (Lambert 2007; Stata Journal, 7(3), pp. 1-25).
• In this model, both the cure fraction (the proportion who never experience the event) and the time to failure (age of smoking initiation) depend, separately, on the explanatory variables.
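For reference, the two standard forms of a cure fraction model can be written as follows. The symbols (pi for the cure fraction, S_u for the survival function of those who will experience the event, F_z for a distribution function) are notation assumed here, and the simplified forms below ignore any background hazard.

```latex
% Mixture cure model: a fraction \pi never experiences the event
S(t) = \pi + (1 - \pi)\, S_u(t)

% Non-mixture cure model: F_z is a distribution function, F_z(t) \to 1
S(t) = \pi^{F_z(t)}

% In both forms S(t) \to \pi as t \to \infty, so the survival curve
% levels off at the cure fraction instead of falling to zero.
% Covariates may enter \pi and the parameters of S_u or F_z separately.
```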
Using Cure Fraction Models
• We used the strsnmix command in Stata to fit a non-mixture model (Lambert 2007; Stata Journal, 7(3), pp. 1-25).
• Challenge: sampling weights cannot be incorporated in the estimation.
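The shape of a non-mixture cure model can be illustrated numerically. This is a simplified sketch, not what the Stata command computes: it assumes the form S(t) = pi^F(t) with a Weibull distribution function F, and the Weibull parameters are hypothetical. Only the cure fraction of 0.37 comes from the slides.

```python
import math

def nonmixture_survival(t, cure_fraction, lam, gamma):
    """Non-mixture cure survival S(t) = pi ** F(t).

    F(t) = 1 - exp(-lam * t**gamma) is a Weibull distribution function,
    chosen here purely for illustration.
    """
    weibull_cdf = 1.0 - math.exp(-lam * t ** gamma)
    return cure_fraction ** weibull_cdf

pi = 0.37  # proportion who never started smoking (from the slides)
for age in (10, 15, 20, 30, 60):
    s = nonmixture_survival(age, pi, lam=0.0005, gamma=3.0)
    print(f"age={age:>2}  proportion not yet smoking={s:.3f}")
```

The curve starts at 1 and flattens out at the cure fraction, which matches a population in which a sizeable share never starts smoking.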
Conclusion
• Survival analysis of large datasets with sampling weights cannot be conducted easily.
• Common challenges:
  • Assessment of the PH assumption
  • Model diagnostics
• Cure fraction models may not be appropriate because sampling weights cannot be incorporated in the estimation.