Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies David G. Brown and Frank Samuelson Center for Devices and Radiological Health, FDA 6 July 2014
Course Outline
• Performance measures for Computational Intelligence (CI) observers
  • Accuracy
  • Prevalence dependent measures
  • Prevalence independent measures
  • Maximization of performance: utility analysis / cost functions
• Receiver Operating Characteristic (ROC) analysis
  • Sensitivity and specificity
  • Construction of the ROC curve
  • Area under the ROC curve (AUC)
• Error analysis for CI observers
  • Sources of error
  • Parametric methods
  • Nonparametric methods
  • Standard deviations and confidence intervals
• Bootstrap methods
  • Theoretical foundation
  • Practical use
• References
What’s the problem?
• Emphasis on algorithm innovation to the exclusion of performance assessment
• Use of subjective measures of performance – “beauty contest”
• Use of “accuracy” as a measure of success
• Lack of error bars—my CIO is .01 better than yours (+/- ?)
• Flawed methodology—training and testing on the same data
• Lack of appreciation for the many different sources of error that can be taken into account
Original image: Lena. Courtesy of the Signal and Image Processing Institute at the University of Southern California.
CI improved image: Baboon. Courtesy of the Signal and Image Processing Institute at the University of Southern California.
I. Performance measures for computational intelligence (CI) observers
• Task based: (binary) discrimination task
• Two populations involved: “normal” and “abnormal”
• Accuracy – intuitive but incomplete
• Different consequences for success or failure in each population
• Some measures depend on the prevalence, Pr = (number of abnormal cases) / (total number of cases), and some do not
  • Prevalence dependent: accuracy, positive predictive value, negative predictive value
  • Prevalence independent: sensitivity, specificity, ROC, AUC
• True optimization of performance requires knowledge of cost functions or utilities for successes and failures in both populations
How to make a CIO with >99% accuracy
• Medical problem: screening mammography (“screening” means testing in an asymptomatic population)
• Prevalence of breast cancer in the screening population: Pr = 0.5%
• My CIO always says “normal”
• Accuracy (Acc) is 99.5% (accuracy of accepted present-day systems ~75%)
• Accuracy in a diagnostic setting (Pr ~ 20%) is 80%, since Acc = 1 - Pr for this CIO
CIO operates on two different populations. [Figure: the CIO output t is distributed as p(t|0) for normal cases and p(t|1) for abnormal cases, with the decision threshold t = T marked on the t-axis.]
Must consider effects on normal and abnormal populations separately
• CIO output t
• p(t|0): probability distribution of t for the population of normals
• p(t|1): probability distribution of t for the population of abnormals
• Threshold T: everything to the right of T is called abnormal, and everything to the left of T is called normal
• The area of p(t|0) to the left of T is the true negative fraction (TNF = specificity), and the area to the right is the false positive fraction (FPF = type I error)
  • TNF + FPF = 1
• The area of p(t|1) to the left of T is the false negative fraction (FNF = type II error), and the area to the right is the true positive fraction (TPF = sensitivity)
  • FNF + TPF = 1
• TNF, FPF, FNF, TPF are all prevalence independent, since each is some fraction of one of our two probability distributions
• Accuracy = Pr x TPF + (1-Pr) x TNF
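These fractions are straightforward to compute once the CIO outputs for the two populations are in hand. The sketch below is not from the original slides; it assumes the scores are held in two NumPy arrays, t0 for normals and t1 for abnormals, and simply counts what falls on each side of a threshold T.

```python
import numpy as np

# Sketch only: estimate TNF, FPF, FNF, TPF from CIO output scores.
# t0 and t1 are hypothetical arrays of CIO outputs for the normal and
# abnormal populations; T is the decision threshold.
def decision_fractions(t0, t1, T):
    t0, t1 = np.asarray(t0), np.asarray(t1)
    tnf = np.mean(t0 < T)    # normals called normal (specificity)
    fpf = np.mean(t0 >= T)   # normals called abnormal (type I error)
    fnf = np.mean(t1 < T)    # abnormals called normal (type II error)
    tpf = np.mean(t1 >= T)   # abnormals called abnormal (sensitivity)
    return tnf, fpf, fnf, tpf

# Illustrative synthetic data: two Gaussian score distributions.
rng = np.random.default_rng(0)
t0 = rng.normal(0.0, 1.0, 10_000)   # p(t|0)
t1 = rng.normal(2.0, 1.0, 10_000)   # p(t|1)
tnf, fpf, fnf, tpf = decision_fractions(t0, t1, T=0.0)
print(f"TNF={tnf:.2f}  FPF={fpf:.2f}  FNF={fnf:.2f}  TPF={tpf:.2f}")
# Note that TNF + FPF = 1 and FNF + TPF = 1, as stated above.
```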
[Figure: with threshold T on the t-axis, the normal-case distribution splits into TNF (.5) and FPF (.5), and the abnormal-case distribution into TPF (.95) and FNF (.05).]
Prevalence dependent measures
• Accuracy (Acc)
  • Acc = Pr x TPF + (1-Pr) x TNF
• Positive predictive value (PPV): fraction of positives that are true positives
  • PPV = TPF x Pr / (TPF x Pr + FPF x (1-Pr))
• Negative predictive value (NPV): fraction of negatives that are true negatives
  • NPV = TNF x (1-Pr) / (TNF x (1-Pr) + FNF x Pr)
• Using Pr = .05 with the previous values TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5:
  • Acc = .05 x .95 + .95 x .5 = .52
  • PPV = .95 x .05 / (.95 x .05 + .5 x .95) = .09
  • NPV = .5 x .95 / (.5 x .95 + .05 x .05) = .995
Prevalence dependent measures (continued)
• Same Acc, PPV, NPV formulas, now using the mammography screening prevalence Pr = .005 with TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5:
  • Acc = .005 x .95 + .995 x .5 = .50
  • PPV = .95 x .005 / (.95 x .005 + .5 x .995) = .01
  • NPV = .5 x .995 / (.5 x .995 + .05 x .005) = .9995
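As a check on the two slides above, the prevalence-dependent measures can be computed directly from the formulas. A minimal sketch, assuming the TPF, TNF, FNF, FPF values of the mammography example; the function name acc_ppv_npv is just illustrative.

```python
# Sketch only: prevalence-dependent measures from the formulas above,
# evaluated at the two prevalences used on the slides.
def acc_ppv_npv(pr, tpf=0.95, tnf=0.50, fnf=0.05, fpf=0.50):
    acc = pr * tpf + (1 - pr) * tnf
    ppv = tpf * pr / (tpf * pr + fpf * (1 - pr))
    npv = tnf * (1 - pr) / (tnf * (1 - pr) + fnf * pr)
    return acc, ppv, npv

for pr in (0.05, 0.005):
    acc, ppv, npv = acc_ppv_npv(pr)
    print(f"Pr={pr}: Acc={acc:.3f}  PPV={ppv:.3f}  NPV={npv:.4f}")
# Pr = 0.05 : Acc ≈ 0.52, PPV ≈ 0.09, NPV ≈ 0.995
# Pr = 0.005: Acc ≈ 0.50, PPV ≈ 0.01, NPV ≈ 0.9995
```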
Acc, PPV, NPV as functions of prevalence (screening mammography). [Figure: curves computed with TPF = .95, FNF = .05, TNF = 0.5, FPF = 0.5.]
Acc = NPV as a function of prevalence (forced “normal”-response CIO). [Figure.]
Prevalence independent measures
• Sensitivity = TPF
• Specificity = TNF = 1 - FPF
• Receiver Operating Characteristic (ROC) curve = TPF as a function of FPF (sensitivity as a function of 1 - specificity)
• Area under the ROC curve (AUC) = sensitivity averaged over all values of specificity
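One way to see how the ROC curve is built is to sweep the threshold over all observed output values and record the resulting (FPF, TPF) pairs. The sketch below is not from the slides; it assumes two arrays of CIO scores, t0 (normals) and t1 (abnormals), and estimates AUC with a simple trapezoidal sum.

```python
import numpy as np

# Sketch only: trace the empirical ROC curve by sweeping the threshold over
# all observed score values, then estimate AUC with the trapezoid rule.
def empirical_roc(t0, t1):
    thresholds = np.sort(np.unique(np.concatenate([t0, t1])))[::-1]
    fpf = np.array([np.mean(t0 >= T) for T in thresholds])   # 1 - specificity
    tpf = np.array([np.mean(t1 >= T) for T in thresholds])   # sensitivity
    # add the (0, 0) and (1, 1) corners so the curve spans the full range
    return np.r_[0.0, fpf, 1.0], np.r_[0.0, tpf, 1.0]

rng = np.random.default_rng(1)
t0 = rng.normal(0.0, 1.0, 2_000)   # synthetic normal-case scores
t1 = rng.normal(1.5, 1.0, 2_000)   # synthetic abnormal-case scores
fpf, tpf = empirical_roc(t0, t1)
auc = np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2.0)   # trapezoidal sum
print(f"AUC ~ {auc:.3f}")
```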
[Figure: the normal / class 0 and abnormal / class 1 score distributions with a threshold; sweeping the threshold traces out the entire ROC curve, TPF (sensitivity) versus FPF (1-specificity), and the ROC slope at a point corresponds to the threshold setting.]
[Figure: empirical ROC data for mammography screening in the US (Craig Beam et al.).]
Maximization of performance
• Need to know utilities or costs of each type of decision outcome – but these are very hard to estimate accurately. You don’t just maximize accuracy.
• Need prevalence
• For the mammography example:
  • TPF: prolongation of life minus treatment cost
  • FPF: diagnostic work-up cost, anxiety
  • TNF: peace of mind
  • FNF: delay in treatment => shortened life
• Hypothetical assignment of utilities for some decision threshold T:
  • Utility_T = U(TPF) x TPF x Pr + U(FPF) x FPF x (1-Pr) + U(TNF) x TNF x (1-Pr) + U(FNF) x FNF x Pr
  • U(TPF) = 100, U(FPF) = -10, U(TNF) = 4, U(FNF) = -20
  • With Pr = .05: Utility_T = 100 x .95 x .05 - 10 x .50 x .95 + 4 x .50 x .95 - 20 x .05 x .05 = 1.85
• Now if we only knew how to trade off TPF versus FPF, we could optimize (?) medical performance.
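The expected-utility expression on this slide is easy to evaluate for any operating point. A minimal sketch, using the hypothetical utility values and the Pr = .05, TPF = .95, FPF = .50 operating point quoted above:

```python
# Sketch only: expected utility at a given operating point, with the
# hypothetical utility values from the slide as defaults.
def expected_utility(pr, tpf, fpf, u_tp=100, u_fp=-10, u_tn=4, u_fn=-20):
    tnf, fnf = 1 - fpf, 1 - tpf
    return (u_tp * tpf * pr + u_fp * fpf * (1 - pr)
            + u_tn * tnf * (1 - pr) + u_fn * fnf * pr)

print(f"{expected_utility(pr=0.05, tpf=0.95, fpf=0.50):.2f}")   # 1.85, as above
```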
Choice of ROC operating point through utility analysis—screening mammography
Utility maximization calculation
u = (U_TPF x TPF + U_FNF x FNF) x Pr + (U_TNF x TNF + U_FPF x FPF) x (1-Pr)
  = (U_TPF x TPF + U_FNF x (1-TPF)) x Pr + (U_TNF x (1-FPF) + U_FPF x FPF) x (1-Pr)
du/dFPF = (U_FPF - U_TNF) x (1-Pr) + (U_TPF - U_FNF) x Pr x dTPF/dFPF = 0
dTPF/dFPF = (U_TNF - U_FPF) x (1-Pr) / ((U_TPF - U_FNF) x Pr)
With U_TPF = 100, U_FNF = -20, U_TNF = 4, U_FPF = -10 (the values from the utility slide):
  Pr = .005: dTPF/dFPF = 23
  Pr = .05:  dTPF/dFPF = 2.2
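The optimal-slope formula can be evaluated directly; the sketch below reproduces the two slopes quoted above, assuming the utility values from the earlier slide (including U_FPF = -10).

```python
# Sketch only: optimal ROC slope from the formula derived above,
# dTPF/dFPF = (U_TNF - U_FPF)(1 - Pr) / ((U_TPF - U_FNF) Pr).
def optimal_roc_slope(pr, u_tp=100, u_fn=-20, u_tn=4, u_fp=-10):
    return (u_tn - u_fp) * (1 - pr) / ((u_tp - u_fn) * pr)

print(f"{optimal_roc_slope(0.005):.1f}")   # ~23 at the screening prevalence
print(f"{optimal_roc_slope(0.05):.1f}")    # ~2.2 at Pr = .05
```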
[Figure (repeated): threshold, normal and abnormal case distributions, and the entire ROC curve; the slope of the ROC curve at the operating point corresponds to the chosen threshold.]
Estimators
• TPF, FPF, TNF, FNF, accuracy, the ROC curve, and AUC are all fractions or probabilities.
• Normally we have a finite sample of subjects on which to test our CIO. From this finite sample we try to estimate the above fractions.
• These estimates will vary depending upon the sample selected (statistical variation).
• Estimates can be nonparametric or parametric.
Estimators
• Population TPF = (number of abnormals that would be selected by the CIO in the population) / (number of abnormals in the population)
• Sample estimate of TPF = (number of abnormals that were selected by the CIO in the sample) / (number of abnormals in the sample)
• Number in the sample << number in the population (at least in theory)
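Because the sample is much smaller than the population, the sample estimate of TPF fluctuates from one sample to the next. A small sketch (with an arbitrary, illustrative sample size of 100 abnormal cases, not a number from the slides) shows this case-sampling variation, which is the subject of the error-analysis part of the course.

```python
import numpy as np

# Sketch only: spread of the sample TPF estimate around a "population"
# TPF of 0.95, for repeated samples of 100 abnormal cases.
rng = np.random.default_rng(2)
true_tpf, n_abnormal, n_repeats = 0.95, 100, 1000
estimates = rng.binomial(n_abnormal, true_tpf, size=n_repeats) / n_abnormal
print(f"mean of TPF estimates = {estimates.mean():.3f}")
print(f"std  of TPF estimates = {estimates.std(ddof=1):.3f}")
# The spread is roughly sqrt(TPF * (1 - TPF) / n) ~ 0.022 here: this is the
# statistical variation that the error bars must reflect.
```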
II. Receiver Operating Characteristic (ROC)
• Binary classification
• The test result is compared to a threshold
[Figures: the distribution of CIO output for all subjects, then separated into the distribution for normal / class 0 subjects, p(t|0), and for abnormal / class 1 subjects, p(t|1), with a decision threshold on the t-axis.]
[Figures: for a fixed threshold, the area of p(t|0) to the left of the threshold is the specificity = true negative fraction (TNF), and the area of p(t|1) to the right is the sensitivity = true positive fraction (TPF).]
[Figure/table: at this threshold, the decision (D0 or D1) versus truth (H0 or H1) table gives specificity TNF = 0.50 for the normal / class 0 subjects and sensitivity TPF = 0.95 for the abnormal / class 1 subjects.]
[Figures: the complementary areas give 1 - specificity = false positive fraction (FPF) for the normal / class 0 subjects and 1 - sensitivity = false negative fraction (FNF) for the abnormal / class 1 subjects, completing the decision-versus-truth table:]

              Decision D0    Decision D1
  Truth H0    TNF = 0.50     FPF = 0.50
  Truth H1    FNF = 0.05     TPF = 0.95
[Figures: each threshold setting is one operating point on the ROC curve (TPF/sensitivity versus FPF/1-specificity); a low threshold gives high sensitivity, an intermediate threshold gives sensitivity = specificity, and a high threshold gives high specificity.]
Which CIO is best? [Figure: three observers, CIO #1, CIO #2, and CIO #3, shown as operating points at different thresholds in the TPF-versus-FPF plane.]
Do not compare rates of one class, e.g. TPF, at different rates of the other class (FPF). [Figure: the same three CIO operating points.]
[Figure: sweeping the threshold over all values traces out the entire ROC curve for the normal / class 0 and abnormal / class 1 distributions, TPF (sensitivity) versus FPF (1-specificity).]
[Figure: ROC curves with AUC = 0.98 and AUC = 0.85, and the chance line with AUC = 0.5; AUC measures discriminability, i.e., CIO performance.]
AUC (Area under the ROC Curve)
• AUC is a separation probability
• AUC = probability that
  • the CIO output for an abnormal subject > the CIO output for a normal subject
  • the CIO correctly tells which of 2 subjects is normal
• Estimating AUC from a finite sample:
  • Select an abnormal subject score x_i
  • Select a normal subject score y_k
  • Is x_i > y_k?
  • Average over all pairs (x_i, y_k): AUC-hat = (1 / (N_abnormal x N_normal)) x sum over i,k of s(x_i, y_k), where s = 1 if x_i > y_k, 1/2 if x_i = y_k, and 0 otherwise
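The estimator described on this slide is the two-sample Mann-Whitney (Wilcoxon) statistic. A minimal sketch, assuming the abnormal and normal scores are in hypothetical arrays x and y, and counting ties as 1/2:

```python
import numpy as np

# Sketch only: AUC as the average, over all abnormal/normal pairs, of the
# indicator that the abnormal score exceeds the normal score (ties = 1/2).
def auc_mann_whitney(x_abnormal, y_normal):
    x = np.asarray(x_abnormal)[:, None]   # abnormal scores x_i, as a column
    y = np.asarray(y_normal)[None, :]     # normal scores y_k, as a row
    return np.mean((x > y) + 0.5 * (x == y))

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, 500)   # synthetic normal-case scores
x = rng.normal(1.5, 1.0, 500)   # synthetic abnormal-case scores
print(f"AUC ~ {auc_mann_whitney(x, y):.3f}")
# For two unit-variance Gaussians separated by 1.5, the true AUC is
# Phi(1.5 / sqrt(2)) ~ 0.86, so the estimate should land near there.
```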