Example of using data from the MIMIC-II database to build a predictive score for mortality in intensive care. Mortality Prediction in the ICU: Can we do better? Super ICU Learner Algorithm (SICULA) Project.
Pirracchio R, Petersen M, Carone M, Resche-Rigon M, Chevret S and van der Laan M • Division of Biostatistics, UC Berkeley, USA • Département de Biostatistiques et Informatique Médicale, UMR-717, Paris, France • Service d'Anesthésie-Réanimation, HEGP, Paris
Motivations for Mortality Prediction • Improved mortality prediction for ICU patients remains an important challenge: • Clinical research: stratification/adjustment on patients' severity • ICU care: adaptation of the level of care/monitoring; choice of the appropriate structure • Health policies: performance indicators
Currently used Scores • SAPS, APACHE, MPM, LODS, SOFA, … • And several updates for each of them • The most widely used in practice are: • The SAPS II score in Europe (Le Gall, JAMA 1993) • The APACHE II score in the US (Knaus, Crit Care Med 1985) • PROBLEM: fair discrimination but poor calibration
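To make the distinction between discrimination and calibration concrete, here is a minimal Python sketch. The patient data are invented for illustration, and the two helper functions (`auc`, `observed_to_expected`) are hypothetical names, not part of the SICULA project. Discrimination is summarised by the area under the ROC curve, and calibration by a simple observed-to-expected death ratio (calibration-in-the-large only).

```python
def auc(y_true, y_prob):
    """Chance that a random death is given a higher risk than a random survivor (ties count 1/2)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def observed_to_expected(y_true, y_prob):
    """Observed deaths divided by predicted deaths; 1.0 means good calibration-in-the-large."""
    return sum(y_true) / sum(y_prob)


# Toy data, invented for illustration: 3 deaths among 10 patients.
y = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
p_good = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.3, 0.5, 0.6, 0.8]

# Same ranking of patients, but every predicted risk is inflated (square root).
p_inflated = [p ** 0.5 for p in p_good]

print("AUC, original risks:", round(auc(y, p_good), 3))
print("AUC, inflated risks:", round(auc(y, p_inflated), 3))
print("O/E, original risks:", round(observed_to_expected(y, p_good), 3))
print("O/E, inflated risks:", round(observed_to_expected(y, p_inflated), 3))
```

Because the inflated risks keep the same ordering of patients, the AUC does not change, while the observed-to-expected ratio drops well below 1: the score still ranks patients fairly but overestimates mortality.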
Why do the current scores perform so poorly? • 3 potential reasons: • Global decrease of ICU mortality • Covariate selection • Parametric logistic regression => which means we accept the assumption of a linear relationship between the covariates and the outcome (on the logit scale) • WHY would we accept that? • We have alternatives! • Data-adaptive machine learning techniques • Non-parametric modelling algorithms
Super Learner • Method to choose the optimal regression algorithm among a set of (user-supplied) candidates, both parametric regression models and data-adaptive algorithms (SL Library) • Selection strategy relies on estimating a risk associated with each candidate algorithm, based on: • a loss function (its expectation defines the risk of each prediction method) • V-fold cross-validation • Discrete Super Learner: selects the best candidate algorithm, defined as the one with the smallest cross-validated risk, and reruns it on the full data for the final prediction model • Super Learner convex combination: weighted linear combination of the candidate learners, where the weights are non-negative, sum to one, and are chosen to minimize the cross-validated risk. van der Laan, Stat Appl Genet Mol Biol 2007
Discrete Super Learner (or Cross-validated Selector) van der Laan, Targeted Learning, Springer 2011
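The cross-validated selector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three candidate learners, the squared-error loss, the simulated one-covariate data and all function names are invented here, and they are not the library used in the SICULA project.

```python
import math
import random


def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=5):
    """Candidate 3: average outcome of the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def cv_risk(fit, xs, ys, V=5):
    """V-fold cross-validated risk under squared-error loss."""
    n = len(xs)
    total = 0.0
    for v in range(V):
        train = [i for i in range(n) if i % V != v]
        valid = [i for i in range(n) if i % V == v]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in valid)
    return total / n


def discrete_super_learner(library, xs, ys, V=5):
    """Pick the candidate with the smallest CV risk, then refit it on the full data."""
    risks = {name: cv_risk(fit, xs, ys, V) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](xs, ys)


# Simulated data (invented): a clearly non-linear signal plus noise.
random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.1) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
best, risks, final_model = discrete_super_learner(library, xs, ys)
print("selected:", best)
print({name: round(r, 3) for name, r in risks.items()})
```

On this non-linear toy signal the data-adaptive candidate is selected, which mirrors the point of the slide: the selector lets the data, not the analyst, decide between parametric and data-adaptive algorithms.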
Discrete Super Learner • The discrete SL can only do as well as the best algorithm included in the library • Not bad, but… • We can do better than that!
Super Learner • Method to choose the optimal regression algorithm among a set of (user-supplied) candidates, both parametric regression models and data-adaptive algorithms (SL Library) • Selection strategy relies on estimating a risk associated with each candidate algorithm, based on: • a loss function • V-fold cross-validation • Discrete Super Learner: selects the best candidate algorithm, defined as the one with the smallest cross-validated risk, and reruns it on the full data for the final prediction model • Super Learner convex combination: weighted linear combination of the candidate learners, where the weights themselves are fitted data-adaptively using cross-validation to give the best overall fit. van der Laan, Stat Appl Genet Mol Biol 2007
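The convex combination can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: two invented candidates, simulated data, squared-error loss, and a plain grid search over the weight in place of the constrained optimizer a real implementation would use. The key step is that the weight is fitted on out-of-fold (cross-validated) predictions, not on in-sample fits.

```python
import math
import random


def fit_linear(xs, ys):
    """Candidate 1: simple least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_1nn(xs, ys):
    """Candidate 2: outcome of the single nearest training point (flexible but noisy)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold_predictions(fit, xs, ys, V=5):
    """Predict each observation from a model trained without its fold."""
    n = len(xs)
    z = [0.0] * n
    for v in range(V):
        train = [i for i in range(n) if i % V != v]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(v, n, V):
            z[i] = predict(xs[i])
    return z


def mix_risk(alpha, z1, z2, ys):
    """Cross-validated squared-error risk of alpha * z1 + (1 - alpha) * z2."""
    return sum((y - (alpha * a + (1 - alpha) * b)) ** 2
               for a, b, y in zip(z1, z2, ys)) / len(ys)


# Simulated data (invented): linear trend plus a wiggle plus noise.
random.seed(1)
xs = [random.random() for _ in range(200)]
ys = [2 * x + 0.5 * math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]

z_lin = out_of_fold_predictions(fit_linear, xs, ys)
z_nn = out_of_fold_predictions(fit_1nn, xs, ys)

# Grid search over the convex weight; alpha = 1 and alpha = 0 recover each candidate alone.
grid = [i / 100 for i in range(101)]
alpha = min(grid, key=lambda a: mix_risk(a, z_lin, z_nn, ys))
risk_mix = mix_risk(alpha, z_lin, z_nn, ys)
risk_lin = mix_risk(1.0, z_lin, z_nn, ys)
risk_nn = mix_risk(0.0, z_lin, z_nn, ys)

# Final predictor: refit both candidates on all data, blend with the fitted weight.
final_lin, final_nn = fit_linear(xs, ys), fit_1nn(xs, ys)


def super_learner(x):
    return alpha * final_lin(x) + (1 - alpha) * final_nn(x)


print("weight on linear:", alpha)
print("CV risks:", round(risk_lin, 3), round(risk_nn, 3), round(risk_mix, 3))
```

Since the two single-candidate solutions are part of the grid, the cross-validated risk of the combination can never exceed that of the best single candidate, which is the sense in which the combination can do better than the discrete selector.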
On which data should we perform our analyses? • Needs: • Large database • Not too specific • Reflects current medical practice… and therefore current mortality • Complete • With all items used in previous scores • With few missing data • Well described • Easily accessible
MIMIC-II! • Publicly available dataset including all patients admitted to an ICU at the Beth Israel Deaconess Medical Center (BIDMC) in Boston, MA since 2001 • Medical (MICU), trauma-surgical (TSICU), coronary (CCU), cardiac surgery recovery (CSRU) and medico-surgical (MSICU) critical care units • Patient recruitment is still ongoing • Data collected before December 1st, 2012 • Adult ICU patients (>15 years old) Lee, Conf Proc IEEE Eng Med Biol Soc 2011; Saeed, Crit Care Med 2011
MIMIC-II: beginnings of an obstacle course • Access to the Clinical Database: • On-line course on protecting human research participants (minimum 3 hours) • Required for all participants • Basic Access Web interface: • Requires knowledge of SQL… user-friendly for database specialists • Limited size of the data export • number of patients • number of variables • Slow • Suited to small studies, rare diseases or rare events
MIMIC-II: beginnings of an obstacle course • Entire MIMIC-II Clinical Database: • More than 40,000 folders (1 per patient) • Within each folder, around 25 .txt files • Around 20 GB • Needs: • Endpoint: hospital mortality • Explanatory variables: those included in the SAPS II score (20 variables) • Dichotomized as in the SAPS II (Super Learner 1) • Non-transformed (Super Learner 2)
Constraints • Statistical software: R • Package works with data frames • Slow with large datasets • Little knowledge of SQL • Time • Distance between statisticians
Choices • Decision to use R to read the datasets and build the working file • Allows us to define precisely the covariates that we need • Quick to write • Independent of database specialists • But • Long (3 hours to obtain the .Rdata) • Not so easy to modify • Requires a good understanding of the database and of the way the data are coded
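The idea of building one working file from per-patient folders of text files can be sketched as follows. The authors used R; this Python sketch is only an illustration, and the folder layout, the file name `chart.txt`, the `variable,value` columns and the "first recorded value" rule are all hypothetical placeholders, not the real MIMIC-II format or the authors' extraction rules.

```python
import csv
import os
import tempfile


def build_working_table(root, wanted):
    """Return one row per patient folder, keeping the first recorded value of each wanted variable."""
    rows = []
    for patient_id in sorted(os.listdir(root)):
        folder = os.path.join(root, patient_id)
        if not os.path.isdir(folder):
            continue
        row = {"patient_id": patient_id}
        for filename in sorted(os.listdir(folder)):
            with open(os.path.join(folder, filename), newline="") as handle:
                for record in csv.DictReader(handle):
                    name = record["variable"]
                    if name in wanted and name not in row:
                        row[name] = float(record["value"])
        rows.append(row)
    return rows


# Toy directory tree, invented for illustration (two patients, one file each).
with tempfile.TemporaryDirectory() as root:
    toy = {
        "00001": [("age", 71), ("heart_rate", 88)],
        "00002": [("age", 54), ("heart_rate", 120)],
    }
    for pid, records in toy.items():
        os.makedirs(os.path.join(root, pid))
        with open(os.path.join(root, pid, "chart.txt"), "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["variable", "value"])
            writer.writerows(records)

    table = build_working_table(root, wanted={"age", "heart_rate"})

print(table)
```

Reading only the wanted variables keeps the working table small, at the cost of a full pass over every folder each time the covariate list changes, which is consistent with the "long" and "not so easy to modify" drawbacks listed above.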
SAPS II Super Learner 1
Conclusion • As compared to conventional severity scores, our Super Learner-based proposal offers improved performance for predicting hospital mortality in ICU patients. • The score will evolve together with • New observations • New explanatory variables • SICULA: just play with it! http://webapps.biostat.berkeley.edu:8080/sicula/
Personal experience • Increasing number of reviews for papers with only one angle of attack: the size of the database • Forgetting: • Quality of the data • Difficulties in terms of modelling • Increasing number of requests for analysis… but late in the process => Better collaboration
Limits • Good example of what a large database allows • In this example: • Covariates and endpoints were: • Relatively well defined • Systematically collected • The question was simple and unique • But this is still a single-center observational study
Size does not matter! • From our point of view, the key characteristic of such data is not the size • Size always reflects issues of statistical power related to • the frequency of exposures and outcomes • the effect sizes • Key characteristics are: • Observational data • Not planned to answer a specific question • Complex data
Size does not matter! • This implies • Questions should be adapted to that type of data, or the data collection should be adapted to the questions • Choice of an adapted analysis model • a trade-off between analytic flexibility and the granularity of information • Recent methodological advances have given us new analytic tools to perform complex statistical analyses • Fortunately we have a lot of data • Unfortunately we have a lot of data
Size does not matter! • Strong hypotheses • Non-informative censoring • No unmeasured confounders • Missing at random • … • Strong computing constraints