
Relationship between performance measures: From statistical evaluations to decision analysis






Presentation Transcript


  1. Relationship between performance measures: From statistical evaluations to decision analysis Ewout Steyerberg Dept of Public Health, Erasmus MC, Rotterdam, the Netherlands E.Steyerberg@ErasmusMC.nl Chicago, October 23, 2011

  2. General issues • Usefulness / clinical utility: what do we mean exactly? • Evaluation of predictions • Evaluation of decisions • Adding a marker to a model • Statistical significance? Testing β is enough (no need to test the increase in R2, AUC, IDI, …) • Clinical relevance: is the measurement worth the costs? (patient and physician burden, financial costs)

  3. Overview • Case study: residual masses in testicular cancer • Model development • Evaluation approach • Performance evaluation • Statistical • Overall • Calibration and discrimination • Decision-analytic • Utility-weighted measures

  4. www.clinicalpredictionmodels.org

  5. Prediction approach • Outcome: malignant or benign tissue • Predictors: • primary histology • 3 tumor markers • tumor size (postchemotherapy, and reduction) • Model: • logistic regression • 544 patients, 299 malignant tissue • Internal validation by bootstrapping • External validation in 273 patients, 197 malignant tissue
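
A minimal sketch of this modeling setup in Python with statsmodels, using simulated stand-in data (the predictor values, coefficients, and event rate are illustrative, not the study's):

```python
# Sketch of the model-development step: a logistic regression fit,
# as on slide 5. Data are simulated stand-ins, not the study data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 544                                   # development-set size (slide 5)
X = rng.normal(size=(n, 5))               # stand-ins for histology, markers, size
beta = np.array([0.8, -0.6, 0.5, -0.4, 0.7])              # illustrative effects
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + X @ beta))))  # malignant yes/no

Xc = sm.add_constant(X)                   # add the intercept column
fit = sm.Logit(y, Xc).fit(disp=0)         # maximum-likelihood logistic fit
p = fit.predict(Xc)                       # predicted probability of malignancy
print(fit.params)                         # coefficients, cf. slide 6
```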

  6. Logistic regression results

  7. Evaluation approach: graphical assessment

  8. Lessons • Plot observed versus expected outcome with the distribution of predictions by outcome (‘validation graph’) • Performance should be assessed in validation sets, since apparent performance is optimistic (the model is developed on the same data set used for evaluation) • Preferably external validation • At least internal validation, e.g. by bootstrap cross-validation
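
Bootstrap internal validation can be sketched as follows: refit the model on bootstrap samples and subtract the average optimism from the apparent performance. Here y and X are NumPy arrays such as those simulated above; 200 replications and the AUC as the metric are arbitrary choices:

```python
# Optimism correction by bootstrapping: apparent AUC minus the average
# gap between bootstrap-sample AUC and the bootstrap model's AUC on
# the original data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])
    optimism = 0.0
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # sample patients with replacement
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism += (auc_boot - auc_orig) / n_boot
    return apparent - optimism
```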

  9. Performance evaluation • Statistical criteria: predictions close to observed outcomes? • Overall: consider residuals y – ŷ, or y – p • Discrimination: separate low risk from high risk • Calibration: e.g. 70% predicted = 70% observed • Clinical usefulness: better decision-making? • One cut-off, defined by expected utility / relative weight of errors • Consecutive cut-offs: decision curve analysis

  10. Predictions close to observed outcomes? Penalty functions • Logarithmic score: (1 – Y)*log(1 – p) + Y*log(p) (higher is better) • Quadratic score: Y*(1 – p)^2 + (1 – Y)*p^2 (lower is better)
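
Both penalty functions are one-liners in NumPy; a minimal sketch, with y a 0/1 outcome array and p the predicted probabilities:

```python
import numpy as np

def log_score(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)                # avoid log(0)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # higher is better

def quadratic_score(y, p):
    return np.mean(y * (1 - p) ** 2 + (1 - y) * p ** 2)       # lower is better
```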

  11. Overall performance measures • R2: explained variation • Logistic / Cox model: Nagelkerke’s R2 • Brier score: Y*(1 – p)^2 + (1 – Y)*p^2 • Brier_scaled = 1 – Brier / Brier_max • Brier_max = mean(p) * (1 – mean(p))^2 + (1 – mean(p)) * mean(p)^2 • Brier_scaled is very similar to Pearson R2 for binary outcomes
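
The scaled Brier score follows directly from these formulas; a minimal NumPy sketch (mean(p) is used for Brier_max, as on the slide):

```python
import numpy as np

def brier(y, p):
    return np.mean((y - p) ** 2)               # identical to the quadratic score

def brier_scaled(y, p):
    m = np.mean(p)                             # mean predicted risk, per slide 11
    brier_max = m * (1 - m) ** 2 + (1 - m) * m ** 2   # simplifies to m * (1 - m)
    return 1 - brier(y, p) / brier_max
```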

  12. Overall performance in case study

  13. Measures for discrimination • Concordance statistic, or area under the ROC curve • Discrimination slope • Lorenz curve
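
A sketch of all three discrimination measures, using scikit-learn and NumPy (y and p as before; the Lorenz curve is returned as its plotting coordinates):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def c_statistic(y, p):
    return roc_auc_score(y, p)                 # concordance = area under ROC

def discrimination_slope(y, p):
    return p[y == 1].mean() - p[y == 0].mean() # mean p in events minus non-events

def lorenz_curve(y, p):
    order = np.argsort(p)                      # rank patients from low to high risk
    cum_patients = np.arange(1, len(y) + 1) / len(y)
    cum_events = np.cumsum(y[order]) / y.sum()
    return cum_patients, cum_events            # x, y coordinates of the curve
```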

  14. ROC curves for case study

  15. Box plots with discrimination slope for case study

  16. Lorenz concentration curves: general pattern

  17. Lorenz concentration curves: case study

  18. Discriminative ability of testicular cancer model

  19. Characteristics of measures for discrimination

  20. Measures for calibration • Graphical assessments • Cox recalibration framework (1958) • Tests for miscalibration: Cox; Hosmer-Lemeshow; Goeman–Le Cessie
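
Cox's recalibration framework amounts to regressing the outcome on the linear predictor logit(p). A statsmodels sketch that returns calibration-in-the-large (the intercept with the slope fixed at 1, fitted via an offset) plus the recalibration intercept and slope:

```python
import numpy as np
import statsmodels.api as sm

def cox_recalibration(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    lp = np.log(p / (1 - p))                   # linear predictor logit(p)
    refit = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
    intercept, slope = refit.params            # ideal: intercept 0, slope 1
    citl = sm.GLM(y, np.ones((len(y), 1)), offset=lp,
                  family=sm.families.Binomial()).fit().params[0]
    return citl, intercept, slope
```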

  21. Calibration: general principle

  22. Calibration: case study

  23. Calibration tests

  24. Hosmer-Lemeshow test for testicular cancer model
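
A plain-NumPy/SciPy sketch of the Hosmer-Lemeshow statistic: group patients into (by default ten) risk groups and compare observed with expected events per group. The g – 2 degrees of freedom apply to model-development data:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, g=10):
    order = np.argsort(p)
    groups = np.array_split(order, g)          # roughly equal-size risk groups
    chi2 = 0.0
    for idx in groups:
        observed, expected = y[idx].sum(), p[idx].sum()
        n_g, pbar = len(idx), p[idx].mean()
        chi2 += (observed - expected) ** 2 / (n_g * pbar * (1 - pbar))
    return chi2, stats.chi2.sf(chi2, g - 2)    # statistic and p-value
```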

  25. Some calibration and goodness-of-fit tests

  26. Lessons • Visual inspection of calibration important at external validation, combined with test for calibration-in-the-large and calibration slope

  27. Clinical usefulness: making decisions • Diagnostic work-up • Test ordering • Starting treatment • Therapeutic decision-making • Surgery • Intensity of treatment

  28. Decision curve analysis Andrew Vickers Departments of Epidemiology and Biostatistics Memorial Sloan-Kettering Cancer Center

  29. How to evaluate predictions? Prediction models are wonderful!

  30. How to evaluate predictions? Prediction models are wonderful! How do you know that they do more good than harm?

  31. Overview of talk • Traditional statistical and decision analytic methods for evaluating predictions • Theory of decision curve analysis

  32. Illustrative example • Men with raised PSA are referred for prostate biopsy • In the USA, ~25% of men with raised PSA have positive biopsy • ~750,000 unnecessary biopsies / year in US • Could a new molecular marker help predict prostate cancer?

  33. Molecular markers for prostate cancer detection • Assess a marker in men undergoing prostate biopsy for elevated PSA • Create “base” model: • Logistic regression: biopsy result as dependent variable; PSA, free PSA, age as predictors • Create “marker” model • Add marker(s) as predictor to the base model • Compare “base” and “marker” model
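
A hedged sketch of this base-versus-marker comparison, with simulated data standing in for PSA, free PSA, age, and the candidate marker (apparent AUCs only, so the caveats of slide 8 apply):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 1000
base = rng.normal(size=(n, 3))                 # stand-ins for PSA, free PSA, age
marker = rng.normal(size=(n, 1))               # the candidate new marker
logit = base @ np.array([0.6, -0.4, 0.3]) + 0.3 * marker[:, 0]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # biopsy result

for name, X in [("base", base), ("marker", np.hstack([base, marker]))]:
    p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    print(name, round(roc_auc_score(y, p), 3)) # compare the two AUCs
```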

  34. How to evaluate models? • Biostatistical approach (ROC’ers) • P values • Accuracy (area-under-the-curve: AUC) • Decision analytic approach (VOI’ers) • Decision tree • Preferences / outcomes

  35. PSA velocity • P value for PSAv in multivariable model: <0.001 • PSAv an “independent” predictor • AUC: base model = 0.609; marker model = 0.626

  36. AUCs and p values • I have no idea whether to use the model or not • Is an AUC of 0.626 high enough? • Is an increase in AUC of 0.017 enough to make measuring velocity worth it?

  37. Decision analysis • Identify every possible decision • Identify every possible consequence • Identify probability of each • Identify value of each

  38. Decision tree • Apply model → Biopsy: Cancer with probability p1 (value a); No cancer with probability p2 (value b) • Apply model → No biopsy: Cancer with probability p3 (value c); No cancer with probability 1 – (p1 + p2 + p3) (value d) • Biopsy all: Cancer with probability p1 + p3 (value a); No cancer with probability 1 – (p1 + p3) (value b) • Biopsy none: Cancer with probability p1 + p3 (value c); No cancer with probability 1 – (p1 + p3) (value d)

  39. Optimal decision • Use model: p1*a + p2*b + p3*c + (1 – p1 – p2 – p3)*d • Treat all: (p1 + p3)*a + (1 – (p1 + p3))*b • Treat none: (p1 + p3)*c + (1 – (p1 + p3))*d • Which gives the highest value?
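
These expected values are simple arithmetic; a small function that returns the value of each strategy so the highest can be read off (p1–p3 and a–d as defined in the tree above):

```python
def strategy_values(p1, p2, p3, a, b, c, d):
    """Expected value of each strategy from slide 39's decision tree."""
    return {
        "use model":  p1 * a + p2 * b + p3 * c + (1 - p1 - p2 - p3) * d,
        "treat all":  (p1 + p3) * a + (1 - (p1 + p3)) * b,
        "treat none": (p1 + p3) * c + (1 - (p1 + p3)) * d,
    }
```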

  40. Drawbacks of traditional decision analysis • p’s require a cut-point to be chosen

  41. Decision tree [same decision tree as on slide 38]

  42. Problems with traditional decision analysis • p’s require a cut-point to be chosen • Extra data needed on the values of health outcomes (a – d) • Harms of biopsy • Harms of delayed diagnosis • Harms may vary between patients

  43. Decision tree [same decision tree as on slide 38]

  44. Evaluating values of health outcomes • Obtain data from the literature on: • Benefit of detecting cancer (compared to a missed / delayed cancer) • Harms of unnecessary prostate biopsy (compared to no biopsy) • Burden: pain and inconvenience • Cost of biopsy

  45. Evaluating values of health outcomes • Obtain data from the individual patient: • What are your views on having a biopsy? • How important is it for you to find a cancer?

  46. Either way • Investigator: “here is a data set, is my model or marker of value?” • Analyst: “I can’t tell you, you have to go away and do a literature search first. Also, you have to ask each and every patient.”

  47. ROCkers and VOIers • ROCkers’ methods are simple and elegant, but useless • VOIers’ methods are useful, but complex and difficult to apply

  48. Solving the decision tree

  49. Threshold probability • The predicted probability of disease is p̂ • Define a threshold probability of disease as pt • The patient accepts treatment if p̂ ≥ pt

  50. Solve the decision tree • pt: cut-point for choosing whether to treat or not • The harm:benefit ratio defines pt • Harm: d – b (the cost of a false positive) • Benefit: a – c (the gain from a true positive) • At the threshold, pt*(a – c) = (1 – pt)*(d – b), so pt / (1 – pt) = (d – b) / (a – c) = H:B
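
This threshold logic is what decision curve analysis builds on: net benefit counts true positives and subtracts false positives weighted by the threshold odds pt / (1 – pt). A minimal sketch using the standard net-benefit definition from decision curve analysis (evaluating it over a range of thresholds gives the decision curve):

```python
import numpy as np

def net_benefit(y, p, pt):
    treat = p >= pt                            # the model recommends biopsy
    tp = np.sum(treat & (y == 1))              # true positives (benefit a - c)
    fp = np.sum(treat & (y == 0))              # false positives (harm d - b)
    n = len(y)
    return tp / n - fp / n * pt / (1 - pt)     # harm weighted by threshold odds

# Decision curve: evaluate net_benefit over a grid of thresholds, e.g.
# thresholds = np.arange(0.05, 0.95, 0.05)
```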
