Challenges in survival analysis with large datasets

This study investigates how the age of smoking initiation varies by gender and place of birth, using several survival analysis techniques. Challenges include assessing the proportional hazards assumption and finding an appropriate parametric model.

Presentation Transcript


  1. Challenges in survival analysis with large datasets
     Noori Akhtar-Danesh, PhD
     McMaster University, Hamilton, Canada
     daneshn@mcmaster.ca
     San Francisco, USA

  2. Background
     • The Canadian Community Health Survey, Cycle 3.1 (CCHS-3.1) is a large cross-sectional survey that includes information on more than 130,000 Canadians.
     • Preliminary results show that 63% of Canadians aged 12 years or older have smoked a whole cigarette at some point.

  3. Objectives & Challenges
     • The main objective was to investigate how the age of smoking initiation varies by gender and place of birth.
     • We compared different survival analysis techniques, including Cox regression and the available parametric methods.
     • We also highlight some challenges that we encountered in the search for an appropriate model.

  4. Challenge: PH Assumption
     • In large datasets, test-based assessment of the proportional hazards (PH) assumption is challenging because the Schoenfeld residual test can be significant even for very small values of rho, simply because of the sample size.
     • For the CCHS-3.1 dataset, the Schoenfeld test is significant for both the Sex and Birth Place variables, although the rho values are small.
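
The sample-size effect described on this slide can be illustrated with a rough calculation. The sketch below assumes, purely for illustration, that the test statistic behaves like (number of events) x rho^2 with a chi-square distribution on 1 degree of freedom; this is an approximation to a correlation-type test, not the exact statistic Stata reports. The event counts and rho = 0.02 are hypothetical values, not figures from the CCHS-3.1 analysis.

```python
import math


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # P(Z^2 > stat) = 2 * (1 - Phi(sqrt(stat))) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2.0))


def approx_ph_test_pvalue(rho: float, n_events: int) -> float:
    """Approximate p-value, ASSUMING the statistic is about n_events * rho**2."""
    return chi2_1df_pvalue(n_events * rho ** 2)


if __name__ == "__main__":
    # Same tiny correlation, increasing (hypothetical) numbers of events.
    for n_events in (100, 1_000, 10_000, 80_000):
        p = approx_ph_test_pvalue(rho=0.02, n_events=n_events)
        print(f"events={n_events:>6}  rho=0.02  approx p={p:.3g}")
```

Under this approximation the p-value shrinks steadily as the number of events grows, even though rho is held fixed, which is the behaviour the slide describes.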

  5. Challenge: PH Assumption

  6. Challenge: PH Assumption
     • However, the log(-log) survival plot showed nearly parallel curves for these variables, which suggests that the PH assumption is reasonable.
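
Why parallel log(-log) curves point to proportional hazards can be checked numerically. Under PH with hazard ratio HR, the two survival functions satisfy S1(t) = S0(t)^HR, so log(-log S1(t)) - log(-log S0(t)) = log(HR) at every t. The sketch below uses a Weibull baseline with made-up scale and shape values and a hypothetical HR of 1.5; none of these numbers come from the study.

```python
import math


def weibull_survival(t: float, scale: float, shape: float) -> float:
    """Baseline survival S0(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def cloglog(s: float) -> float:
    """The log(-log) transform used on the vertical axis of the plot."""
    return math.log(-math.log(s))


def loglog_gap(t: float, hazard_ratio: float,
               scale: float = 20.0, shape: float = 3.0) -> float:
    """Vertical distance between the two log(-log) curves at time t."""
    s0 = weibull_survival(t, scale, shape)
    s1 = s0 ** hazard_ratio  # proportional hazards: S1(t) = S0(t) ** HR
    return cloglog(s1) - cloglog(s0)


if __name__ == "__main__":
    for age in (10, 15, 20, 25, 30):
        print(f"age={age}  gap={loglog_gap(age, hazard_ratio=1.5):.6f}")
    print(f"log(HR)={math.log(1.5):.6f}")
```

The gap is the same at every age, so the curves are vertical shifts of one another. Curves that converge, diverge, or cross would indicate that the hazard ratio changes over time.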

  7. Challenge: PH Assumption

  8. Challenge: PH Assumption
     • Perhaps we need to specify a minimum value for the correlation, for instance r = 0.33, below which a result is not accepted as significant (as is common in fields such as factor analysis).
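
One way to read this suggestion is as a two-part decision rule: flag a PH violation only when the test is statistically significant and the correlation is also large enough to matter. The sketch below is a hypothetical helper, not a Stata feature; the function name, the 0.05 significance level, and the example inputs are assumptions, and only the 0.33 cutoff comes from the slide.

```python
def flag_ph_violation(rho: float, p_value: float,
                      alpha: float = 0.05, min_abs_rho: float = 0.33) -> bool:
    """Flag a violation only if the test is significant AND |rho| is large enough."""
    return p_value < alpha and abs(rho) >= min_abs_rho


if __name__ == "__main__":
    # Tiny rho, tiny p-value (typical of a very large sample): not flagged.
    print(flag_ph_violation(rho=0.02, p_value=1e-8))
    # Large rho and a significant test: flagged.
    print(flag_ph_violation(rho=-0.40, p_value=0.001))
```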

  9. Challenge: PH Assumption
     • However, if we incorporate the survey design into the analysis, the PH test works well for these variables individually, but the global test is still significant (in Stata).

  10. Challenge: PH Assumption

  11. Challenge: Appropriate parametric model

  12. Challenge: Parametric models
      • We fitted different parametric models that incorporated the survey design and sampling weights (using Stata's svy: prefix).
      • A Weibull model with frailty appeared to be the best model.
      • However, we were not able to draw diagnostic graphs or obtain an overall goodness-of-fit (GOF) test because of the large sample size.
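
For readers unfamiliar with frailty models, the sketch below shows how a frailty term changes a Weibull survival curve. It assumes a gamma-distributed frailty with mean 1 and variance theta, which gives the population survival function S_theta(t) = (1 + theta * H0(t))^(-1/theta), where H0(t) = lam * t^p is the Weibull cumulative hazard. The slide does not say which frailty distribution was used, so the gamma choice and all parameter values here are assumptions for illustration only.

```python
import math


def weibull_cumhaz(t: float, lam: float, p: float) -> float:
    """Weibull cumulative hazard H0(t) = lam * t ** p."""
    return lam * t ** p


def weibull_survival(t: float, lam: float, p: float) -> float:
    """Weibull survival without frailty: S(t) = exp(-H0(t))."""
    return math.exp(-weibull_cumhaz(t, lam, p))


def gamma_frailty_survival(t: float, lam: float, p: float, theta: float) -> float:
    """Population survival under gamma frailty (mean 1, variance theta)."""
    h = weibull_cumhaz(t, lam, p)
    if theta == 0:
        return math.exp(-h)  # no frailty: plain Weibull
    return (1.0 + theta * h) ** (-1.0 / theta)


if __name__ == "__main__":
    lam, p = 1e-4, 3.0  # hypothetical Weibull parameters
    for age in (10, 15, 20, 30):
        s_plain = weibull_survival(age, lam, p)
        s_frail = gamma_frailty_survival(age, lam, p, theta=1.0)
        print(f"age={age}  Weibull={s_plain:.4f}  with frailty={s_frail:.4f}")
```

As theta approaches 0 the frailty curve collapses to the ordinary Weibull curve. For theta > 0 the population survival is higher at later times, because low-risk individuals make up a growing share of those who have not yet had the event.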

  13. Using Cure Fraction Models
      • A key assumption of standard survival analysis is that everyone will eventually experience the event.
      • However, a large proportion (37%) of individuals in the CCHS-3.1 dataset are censored (those who never started smoking).

  14. Using Cure Fraction Models

  15. Using Cure Fraction Models
      • Therefore, it is more appropriate to use a cure fraction model (Lambert 2007; Stata Journal, 7(3), pp. 1-25).
      • In this model, both the cure fraction (the proportion who never experience the event) and the time to failure (age of smoking initiation) depend, separately, on the explanatory variables.

  16. Using Cure Fraction Models
      • We used the strsnmix command in Stata to fit a non-mixture model (Lambert 2007; Stata Journal, 7(3), pp. 1-25).
      • Challenge: sampling weights cannot be incorporated in the estimation.
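
The general shape of a cure fraction model can be sketched as follows. In a non-mixture model the survival function is S(t) = pi^F(t), where pi is the cure fraction and F(t) is a cumulative distribution function. The mixture form, S(t) = pi + (1 - pi) * Su(t), is shown alongside for comparison. This is a generic sketch, not the strsnmix implementation: the Weibull choice for F(t) and its parameter values are assumptions, and pi = 0.37 simply reuses the proportion of never-smokers quoted earlier in the talk.

```python
import math


def weibull_cdf(t: float, lam: float, gamma: float) -> float:
    """F(t) = 1 - exp(-lam * t ** gamma)."""
    return 1.0 - math.exp(-lam * t ** gamma)


def nonmixture_survival(t: float, cure: float, lam: float, gamma: float) -> float:
    """Non-mixture cure model: S(t) = cure ** F(t)."""
    return cure ** weibull_cdf(t, lam, gamma)


def mixture_survival(t: float, cure: float, lam: float, gamma: float) -> float:
    """Mixture cure model: S(t) = cure + (1 - cure) * Su(t), with Su = 1 - F."""
    return cure + (1.0 - cure) * (1.0 - weibull_cdf(t, lam, gamma))


if __name__ == "__main__":
    cure, lam, gamma = 0.37, 1e-4, 3.0  # lam and gamma are hypothetical
    for age in (0, 10, 15, 20, 30, 60):
        s_nm = nonmixture_survival(age, cure, lam, gamma)
        s_mx = mixture_survival(age, cure, lam, gamma)
        print(f"age={age:>2}  non-mixture={s_nm:.4f}  mixture={s_mx:.4f}")
```

Both curves start at 1 and level off at the cure fraction instead of falling to 0, which is what distinguishes them from a standard survival model.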

  17. Conclusion
      • Survival analysis of large datasets with sampling weights is not straightforward.
      • Common challenges:
          - Assessment of the PH assumption
          - Model diagnostics
      • Use of cure fraction models may not be appropriate because sampling weights cannot be incorporated in the estimation.
