Relationship between performance measures: From statistical evaluations to decision-analysis Ewout Steyerberg Dept of Public Health, Erasmus MC, Rotterdam, the Netherlands E.Steyerberg@ErasmusMC.nl Chicago, October 23, 2011
General issues • Usefulness / Clinical utility: what do we mean exactly? • Evaluation of predictions • Evaluation of decisions • Adding a marker to a model • Statistical significance? Testing β is enough (no need to test the increase in R2, AUC, IDI, …) • Clinical relevance: is the measurement worth the costs? (patient and physician burden, financial costs)
Overview • Case study: residual masses in testicular cancer • Model development • Evaluation approach • Performance evaluation • Statistical • Overall • Calibration and discrimination • Decision-analytic • Utility-weighted measures
Prediction approach • Outcome: malignant or benign tissue • Predictors: • primary histology • 3 tumor markers • tumor size (postchemotherapy, and reduction) • Model: • logistic regression • 544 patients, 299 malignant tissue • Internal validation by bootstrapping • External validation in 273 patients, 197 malignant tissue
Lessons • Plot observed versus expected outcome with the distribution of predictions by outcome (‘Validation graph’) • Performance should be assessed in validation sets, since apparent performance is optimistic (model developed in the same data set as used for evaluation) • Preferably external validation • At least internal validation, e.g. by bootstrap cross-validation (sketched below)
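A minimal sketch of bootstrap optimism correction for the c-statistic, assuming a numpy predictor matrix `X` and 0/1 outcome `y` (hypothetical inputs; this is the standard optimism-correction procedure, not code from the case study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_validated_auc(X, y, n_boot=200, seed=1):
    """Apparent AUC minus the mean optimism over bootstrap samples."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(y, LogisticRegression(max_iter=1000)
                             .fit(X, y).predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample with replacement
        if len(np.unique(y[idx])) < 2:         # skip one-class resamples
            continue
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])  # apparent in resample
        test = roc_auc_score(y, m.predict_proba(X)[:, 1])            # tested on original data
        optimism.append(boot - test)
    return apparent - np.mean(optimism)
```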
Performance evaluation • Statistical criteria: predictions close to observed outcomes? • Overall: consider residuals y – ŷ, or y – p • Discrimination: separate low risk from high risk • Calibration: e.g. 70% predicted = 70% observed • Clinical usefulness: better decision-making? • One cut-off, defined by expected utility / relative weight of errors • Consecutive cut-offs: decision curve analysis
Predictions close to observed outcomes? Penalty functions • Logarithmic score: Y*log(p) + (1 – Y)*log(1 – p) • Quadratic score: Y*(1 – p)^2 + (1 – Y)*p^2
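Both penalty functions translate directly into code; a small sketch, assuming a 0/1 numpy array `y` and predicted risks `p`:

```python
import numpy as np

def logarithmic_score(y, p):
    """Mean logarithmic score; closer to 0 is better."""
    p = np.clip(p, 1e-10, 1 - 1e-10)  # guard against log(0)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def quadratic_score(y, p):
    """Mean quadratic (Brier) score; lower is better."""
    return np.mean(y * (1 - p) ** 2 + (1 - y) * p ** 2)
```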
Overall performance measures • R2: explained variation • Logistic / Cox model: Nagelkerke’s R2 • Brier score: Y*(1 – p)^2 + (1 – Y)*p^2 • Brier_scaled = 1 – Brier / Brier_max • Brier_max = mean(p)*(1 – mean(p))^2 + (1 – mean(p))*mean(p)^2 • Brier_scaled is very similar to Pearson R2 for binary outcomes
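A sketch of the scaled Brier score following the slide’s formulas (mean(p) is used in Brier_max, as above; for a model calibrated-in-the-large this equals the observed event rate):

```python
import numpy as np

def brier_scaled(y, p):
    """Scaled Brier score: 1 - Brier / Brier_max."""
    brier = np.mean(y * (1 - p) ** 2 + (1 - y) * p ** 2)
    pbar = np.mean(p)  # average prediction, per the slide's Brier_max
    brier_max = pbar * (1 - pbar) ** 2 + (1 - pbar) * pbar ** 2
    return 1 - brier / brier_max
```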
Measures for discrimination • Concordance statistic, or area under the ROC curve • Discrimination slope • Lorenz curve
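The concordance statistic and discrimination slope are straightforward to compute; a sketch, again assuming numpy arrays `y` (0/1) and `p`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def discrimination_measures(y, p):
    """c-statistic (AUC) and discrimination slope: the difference in
    mean predicted risk between events and non-events."""
    c = roc_auc_score(y, p)
    slope = p[y == 1].mean() - p[y == 0].mean()
    return c, slope
```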
Measures for calibration • Graphical assessments • Cox recalibration framework (1958) • Tests for miscalibration: Cox; Hosmer–Lemeshow; Goeman–Le Cessie
Lessons • Visual inspection of calibration is important at external validation, combined with tests for calibration-in-the-large and the calibration slope (sketched below)
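A sketch of these two recalibration statistics using statsmodels (one possible implementation, not the talk’s own code): the calibration slope is the coefficient when regressing the outcome on the linear predictor, and calibration-in-the-large is the intercept with the linear predictor held fixed as an offset.

```python
import numpy as np
import statsmodels.api as sm

def calibration_measures(y, p):
    """Calibration-in-the-large and calibration slope (Cox 1958 framework).
    Ideal values at validation: intercept ~ 0, slope ~ 1."""
    lp = np.log(p / (1 - p))  # linear predictor (logit of predicted risk)
    # calibration slope: coefficient of lp in a logistic recalibration model
    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    # calibration-in-the-large: intercept with slope fixed at 1 (lp as offset)
    citl_fit = sm.GLM(y, np.ones((len(y), 1)),
                      family=sm.families.Binomial(), offset=lp).fit()
    return citl_fit.params[0], slope_fit.params[1]
```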
Clinical usefulness: making decisions • Diagnostic work-up • Test ordering • Starting treatment • Therapeutic decision-making • Surgery • Intensity of treatment
Decision curve analysis Andrew Vickers Departments of Epidemiology and Biostatistics Memorial Sloan-Kettering Cancer Center
How to evaluate predictions? Prediction models are wonderful! How do you know that they do more good than harm?
Overview of talk • Traditional statistical and decision analytic methods for evaluating predictions • Theory of decision curve analysis
Illustrative example • Men with raised PSA are referred for prostate biopsy • In the USA, ~25% of men with raised PSA have positive biopsy • ~750,000 unnecessary biopsies / year in US • Could a new molecular marker help predict prostate cancer?
Molecular markers for prostate cancer detection • Assess a marker in men undergoing prostate biopsy for elevated PSA • Create “base” model: • Logistic regression: biopsy result as dependent variable; PSA, free PSA, age as predictors • Create “marker” model • Add marker(s) as predictor to the base model • Compare “base” and “marker” model
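A sketch of this base-versus-marker comparison on synthetic stand-in data; every column name and coefficient below is illustrative, not taken from the actual study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

# synthetic stand-in for the biopsy cohort
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"psa": rng.lognormal(1.5, 0.5, n),
                   "free_psa": rng.uniform(5, 30, n),
                   "age": rng.normal(65, 7, n),
                   "marker": rng.normal(0, 1, n)})
lp = -4 + 0.2 * df["psa"] - 0.03 * df["free_psa"] + 0.03 * df["age"] + 0.3 * df["marker"]
df["biopsy"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

base = smf.logit("biopsy ~ psa + free_psa + age", data=df).fit(disp=0)
marker = smf.logit("biopsy ~ psa + free_psa + age + marker", data=df).fit(disp=0)
print(marker.pvalues["marker"])                          # significance of the added marker
print(roc_auc_score(df["biopsy"], base.predict(df)),
      roc_auc_score(df["biopsy"], marker.predict(df)))   # apparent AUCs
```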
How to evaluate models? • Biostatistical approach (ROC’ers) • P values • Accuracy (area-under-the-curve: AUC) • Decision analytic approach (VOI’ers) • Decision tree • Preferences / outcomes
PSA velocity • P value for PSAv in multivariable model: <0.001 • PSAv an “independent” predictor • AUC: Base model = 0.609; Marker model = 0.626
AUCs and p values • I have no idea whether to use the model or not • Is an AUC of 0.626 high enough? • Is an increase in AUC of 0.017 enough to make measuring velocity worth it?
Decision analysis • Identify every possible decision • Identify every possible consequence • Identify probability of each • Identify value of each
Decision tree
• Apply model → Biopsy: Cancer with probability p1 (value a); No cancer with probability p2 (value b)
• Apply model → No biopsy: Cancer with probability p3 (value c); No cancer with probability 1 – (p1 + p2 + p3) (value d)
• Biopsy all → Cancer with probability p1 + p3 (value a); No cancer with probability 1 – (p1 + p3) (value b)
• Biopsy none → Cancer with probability p1 + p3 (value c); No cancer with probability 1 – (p1 + p3) (value d)
Optimal decision • Use model: p1*a + p2*b + p3*c + (1 – p1 – p2 – p3)*d • Treat all: (p1 + p3)*a + (1 – (p1 + p3))*b • Treat none: (p1 + p3)*c + (1 – (p1 + p3))*d • Which gives the highest value? (see the sketch below)
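A sketch evaluating the three strategies directly from the tree; p1–p3 and the outcome values a–d are inputs as defined above:

```python
def expected_values(p1, p2, p3, a, b, c, d):
    """Expected value per strategy.
    a: biopsy & cancer, b: biopsy & no cancer,
    c: no biopsy & cancer, d: no biopsy & no cancer."""
    return {
        "use model":  p1 * a + p2 * b + p3 * c + (1 - p1 - p2 - p3) * d,
        "treat all":  (p1 + p3) * a + (1 - (p1 + p3)) * b,
        "treat none": (p1 + p3) * c + (1 - (p1 + p3)) * d,
    }
```

The strategy with the highest expected value is optimal.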
Problems with traditional decision analysis • The p’s require a cut-point to be chosen • Extra data needed on the health value of the outcomes (a – d) • Harms of biopsy • Harms of delayed diagnosis • Harms may vary between patients
Evaluating values of health outcomes • Obtain data from the literature on: • Benefit of detecting cancer (compared to a missed / delayed cancer) • Harms of unnecessary prostate biopsy (compared to no biopsy) • Burden: pain and inconvenience • Cost of biopsy
Evaluating values of health outcomes • Obtain data from the individual patient: • What are your views on having a biopsy? • How important is it for you to find a cancer?
Either way • Investigator: “here is a data set, is my model or marker of value?” • Analyst: “I can’t tell you, you have to go away and do a literature search first. Also, you have to ask each and every patient.”
ROCkers’ methods are simple and elegant but useless VOIers’ methods are useful, but complex and difficult to apply ROCkers and VOIers
Threshold probability • Probability of disease is p̂ • Define a threshold probability of disease as pt • Patient accepts treatment if p̂ ≥ pt
Solve the decision tree • pt: cut-point for choosing whether to treat or not • The Harm:Benefit ratio defines pt • Harm: d – b (unnecessary biopsy, FP) • Benefit: a – c (cancer found by biopsy, TP) • pt / (1 – pt) = Harm:Benefit (see the sketch below)
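Weighting false positives by pt / (1 – pt) leads to the net benefit used in decision curve analysis; a sketch, assuming numpy arrays `y` (0/1) and predicted risks `p`:

```python
import numpy as np

def net_benefit(y, p, pt):
    """Net benefit at threshold pt: TP/n - FP/n * pt/(1 - pt),
    where biopsy is recommended when predicted risk >= pt."""
    treat = p >= pt
    n = len(y)
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - (fp / n) * pt / (1 - pt)

def decision_curve(y, p, thresholds=np.arange(0.05, 0.50, 0.01)):
    """Net benefit of the model vs. biopsy-all across thresholds;
    biopsy-none has net benefit 0 everywhere."""
    prev = np.mean(y)
    model_nb = [net_benefit(y, p, t) for t in thresholds]
    all_nb = [prev - (1 - prev) * t / (1 - t) for t in thresholds]
    return thresholds, model_nb, all_nb
```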