Relationship between performance measures: From statistical evaluations to decision-analysis Ewout Steyerberg Dept of Public Health, Erasmus MC, Rotterdam, the Netherlands E.Steyerberg@ErasmusMC.nl Chicago, October 23, 2011
General issues • Usefulness / Clinical utility: what do we mean exactly? • Evaluation of predictions • Evaluation of decisions • Adding a marker to a model • Statistical significance? Testing β is enough (no need to test the increase in R2, AUC, IDI, …) • Clinical relevance: is the measurement worth the costs? (patient and physician burden, financial costs)
Overview • Case study: residual masses in testicular cancer • Model development • Evaluation approach • Performance evaluation • Statistical • Overall • Calibration and discrimination • Decision-analytic • Utility-weighted measures
Prediction approach • Outcome: malignant or benign tissue • Predictors: • primary histology • 3 tumor markers • tumor size (postchemotherapy, and reduction) • Model: • logistic regression • 544 patients, 299 malignant tissue • Internal validation by bootstrapping • External validation in 273 patients, 197 malignant tissue
Lessons • Plot observed versus expected outcome with the distribution of predictions by outcome (‘Validation graph’) • Performance should be assessed in validation sets, since apparent performance is optimistic (model developed in the same data set as used for evaluation) • Preferably external validation • At least internal validation, e.g. by bootstrap cross-validation (sketched below)
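A minimal sketch of bootstrap optimism correction for the c-statistic, assuming a numpy predictor matrix `X` and 0/1 outcome `y` (hypothetical inputs; this is the standard optimism-correction procedure, not code from the case study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_validated_auc(X, y, n_boot=200, seed=1):
    """Apparent AUC minus the mean optimism over bootstrap samples."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(y, LogisticRegression(max_iter=1000)
                             .fit(X, y).predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample with replacement
        if len(np.unique(y[idx])) < 2:         # skip one-class resamples
            continue
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])  # apparent in resample
        test = roc_auc_score(y, m.predict_proba(X)[:, 1])            # tested on original data
        optimism.append(boot - test)
    return apparent - np.mean(optimism)
```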
Performance evaluation • Statistical criteria: predictions close to observed outcomes? • Overall: consider residuals y – ŷ, or y – p • Discrimination: separate low risk from high risk • Calibration: e.g. 70% predicted = 70% observed • Clinical usefulness: better decision-making? • One cut-off, defined by expected utility / relative weight of errors • Consecutive cut-offs: decision curve analysis
Predictions close to observed outcomes? Penalty functions • Logarithmic score: Y*log(p) + (1 – Y)*log(1 – p) • Quadratic score: Y*(1 – p)^2 + (1 – Y)*p^2
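Both penalty functions translate directly into code; a small sketch, assuming a 0/1 numpy array `y` and predicted risks `p`:

```python
import numpy as np

def logarithmic_score(y, p):
    """Mean logarithmic score; closer to 0 is better."""
    p = np.clip(p, 1e-10, 1 - 1e-10)  # guard against log(0)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def quadratic_score(y, p):
    """Mean quadratic (Brier) score; lower is better."""
    return np.mean(y * (1 - p) ** 2 + (1 - y) * p ** 2)
```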
Overall performance measures • R2: explained variation • Logistic / Cox model: Nagelkerke’s R2 • Brier score: Y*(1 – p)^2 + (1 – Y)*p^2 • Brier_scaled = 1 – Brier / Brier_max • Brier_max = mean(p)*(1 – mean(p))^2 + (1 – mean(p))*mean(p)^2 • Brier_scaled is very similar to Pearson R2 for binary outcomes
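A sketch of the scaled Brier score following the slide’s formulas (mean(p) is used in Brier_max, as above; for a model calibrated-in-the-large this equals the observed event rate):

```python
import numpy as np

def brier_scaled(y, p):
    """Scaled Brier score: 1 - Brier / Brier_max."""
    brier = np.mean(y * (1 - p) ** 2 + (1 - y) * p ** 2)
    pbar = np.mean(p)  # average prediction, per the slide's Brier_max
    brier_max = pbar * (1 - pbar) ** 2 + (1 - pbar) * pbar ** 2
    return 1 - brier / brier_max
```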
Measures for discrimination • Concordance statistic, or area under the ROC curve • Discrimination slope • Lorenz curve
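The concordance statistic and discrimination slope are straightforward to compute; a sketch, again assuming numpy arrays `y` (0/1) and `p`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def discrimination_measures(y, p):
    """c-statistic (AUC) and discrimination slope: the difference in
    mean predicted risk between events and non-events."""
    c = roc_auc_score(y, p)
    slope = p[y == 1].mean() - p[y == 0].mean()
    return c, slope
```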
Measures for calibration • Graphical assessments • Cox recalibration framework (1958) • Tests for miscalibration: Cox; Hosmer–Lemeshow; Goeman–Le Cessie
Lessons • Visual inspection of calibration is important at external validation, combined with tests for calibration-in-the-large and the calibration slope (sketched below)
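A sketch of these two recalibration statistics using statsmodels (one possible implementation, not the talk’s own code): the calibration slope is the coefficient when regressing the outcome on the linear predictor, and calibration-in-the-large is the intercept with the linear predictor held fixed as an offset.

```python
import numpy as np
import statsmodels.api as sm

def calibration_measures(y, p):
    """Calibration-in-the-large and calibration slope (Cox 1958 framework).
    Ideal values at validation: intercept ~ 0, slope ~ 1."""
    lp = np.log(p / (1 - p))  # linear predictor (logit of predicted risk)
    # calibration slope: coefficient of lp in a logistic recalibration model
    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    # calibration-in-the-large: intercept with slope fixed at 1 (lp as offset)
    citl_fit = sm.GLM(y, np.ones((len(y), 1)),
                      family=sm.families.Binomial(), offset=lp).fit()
    return citl_fit.params[0], slope_fit.params[1]
```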
Clinical usefulness: making decisions • Diagnostic work-up • Test ordering • Starting treatment • Therapeutic decision-making • Surgery • Intensity of treatment
Decision curve analysis Andrew Vickers Departments of Epidemiology and Biostatistics Memorial Sloan-Kettering Cancer Center
How to evaluate predictions? Prediction models are wonderful! How do you know that they do more good than harm?
Overview of talk • Traditional statistical and decision analytic methods for evaluating predictions • Theory of decision curve analysis
Illustrative example • Men with raised PSA are referred for prostate biopsy • In the USA, ~25% of men with raised PSA have positive biopsy • ~750,000 unnecessary biopsies / year in US • Could a new molecular marker help predict prostate cancer?
Molecular markers for prostate cancer detection • Assess a marker in men undergoing prostate biopsy for elevated PSA • Create “base” model: • Logistic regression: biopsy result as dependent variable; PSA, free PSA, age as predictors • Create “marker” model • Add marker(s) as predictor to the base model • Compare “base” and “marker” model
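A sketch of this base-versus-marker comparison on synthetic stand-in data; every column name and coefficient below is illustrative, not taken from the actual study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

# synthetic stand-in for the biopsy cohort
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"psa": rng.lognormal(1.5, 0.5, n),
                   "free_psa": rng.uniform(5, 30, n),
                   "age": rng.normal(65, 7, n),
                   "marker": rng.normal(0, 1, n)})
lp = -4 + 0.2 * df["psa"] - 0.03 * df["free_psa"] + 0.03 * df["age"] + 0.3 * df["marker"]
df["biopsy"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

base = smf.logit("biopsy ~ psa + free_psa + age", data=df).fit(disp=0)
marker = smf.logit("biopsy ~ psa + free_psa + age + marker", data=df).fit(disp=0)
print(marker.pvalues["marker"])                          # significance of the added marker
print(roc_auc_score(df["biopsy"], base.predict(df)),
      roc_auc_score(df["biopsy"], marker.predict(df)))   # apparent AUCs
```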
How to evaluate models? • Biostatistical approach (ROC’ers) • P values • Accuracy (area-under-the-curve: AUC) • Decision analytic approach (VOI’ers) • Decision tree • Preferences / outcomes
PSA velocity • P value for PSAv in multivariable model: <0.001 • PSAv an “independent” predictor • AUC: Base model = 0.609; Marker model = 0.626
AUCs and p values • I have no idea whether to use the model or not • Is an AUC of 0.626 high enough? • Is an increase in AUC of 0.017 enough to make measuring velocity worth it?
Decision analysis • Identify every possible decision • Identify every possible consequence • Identify probability of each • Identify value of each
Decision tree
• Apply model → Biopsy: Cancer with probability p1 (value a); No cancer with probability p2 (value b)
• Apply model → No biopsy: Cancer with probability p3 (value c); No cancer with probability 1 – (p1 + p2 + p3) (value d)
• Biopsy all → Cancer with probability p1 + p3 (value a); No cancer with probability 1 – (p1 + p3) (value b)
• Biopsy none → Cancer with probability p1 + p3 (value c); No cancer with probability 1 – (p1 + p3) (value d)
Optimal decision • Use model: p1*a + p2*b + p3*c + (1 – p1 – p2 – p3)*d • Treat all: (p1 + p3)*a + (1 – (p1 + p3))*b • Treat none: (p1 + p3)*c + (1 – (p1 + p3))*d • Which gives the highest value? (see the sketch below)
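A sketch evaluating the three strategies directly from the tree; p1–p3 and the outcome values a–d are inputs as defined above:

```python
def expected_values(p1, p2, p3, a, b, c, d):
    """Expected value per strategy.
    a: biopsy & cancer, b: biopsy & no cancer,
    c: no biopsy & cancer, d: no biopsy & no cancer."""
    return {
        "use model":  p1 * a + p2 * b + p3 * c + (1 - p1 - p2 - p3) * d,
        "treat all":  (p1 + p3) * a + (1 - (p1 + p3)) * b,
        "treat none": (p1 + p3) * c + (1 - (p1 + p3)) * d,
    }
```

The strategy with the highest expected value is optimal.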
Problems with traditional decision analysis • The p’s require a cut-point to be chosen • Extra data needed on the health value of the outcomes (a – d) • Harms of biopsy • Harms of delayed diagnosis • Harms may vary between patients
Evaluating values of health outcomes • Obtain data from the literature on: • Benefit of detecting cancer (compared to a missed / delayed cancer) • Harms of unnecessary prostate biopsy (compared to no biopsy) • Burden: pain and inconvenience • Cost of biopsy
Evaluating values of health outcomes • Obtain data from the individual patient: • What are your views on having a biopsy? • How important is it for you to find a cancer?
Either way • Investigator: “here is a data set, is my model or marker of value?” • Analyst: “I can’t tell you, you have to go away and do a literature search first. Also, you have to ask each and every patient.”
ROCkers’ methods are simple and elegant but useless VOIers’ methods are useful, but complex and difficult to apply ROCkers and VOIers
Threshold probability • Probability of disease is p̂ • Define a threshold probability of disease as pt • Patient accepts treatment if p̂ ≥ pt
Solve the decision tree • pt: cut-point for choosing whether to treat or not • The Harm:Benefit ratio defines pt • Harm: d – b (unnecessary biopsy, FP) • Benefit: a – c (cancer found by biopsy, TP) • pt / (1 – pt) = Harm:Benefit (see the sketch below)
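Weighting false positives by pt / (1 – pt) leads to the net benefit used in decision curve analysis; a sketch, assuming numpy arrays `y` (0/1) and predicted risks `p`:

```python
import numpy as np

def net_benefit(y, p, pt):
    """Net benefit at threshold pt: TP/n - FP/n * pt/(1 - pt),
    where biopsy is recommended when predicted risk >= pt."""
    treat = p >= pt
    n = len(y)
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - (fp / n) * pt / (1 - pt)

def decision_curve(y, p, thresholds=np.arange(0.05, 0.50, 0.01)):
    """Net benefit of the model vs. biopsy-all across thresholds;
    biopsy-none has net benefit 0 everywhere."""
    prev = np.mean(y)
    model_nb = [net_benefit(y, p, t) for t in thresholds]
    all_nb = [prev - (1 - prev) * t / (1 - t) for t in thresholds]
    return thresholds, model_nb, all_nb
```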