An Overview of Contemporary ROC Methodology in Medical Imaging and Computer-Assist Modalities
Robert F. Wagner, Ph.D., OST, CDRH, FDA
ROC:
- Receiver Operating Characteristic (historic name, from radar studies)
- Relative Operating Characteristic (psychology, psychophysics)
- Operating Characteristic (preferred by some)
OUTLINE:
- Efforts toward consensus development on present issues
- The ROC paradigm
- The complication of reader variability
- The multiple-reader multiple-case (MRMC) ROC paradigm
- The measurement scales: categories; patient-management/action; probability scale
- Complications from location uncertainty, truth uncertainty, effective-sample-size uncertainty, reader vigilance
- Summary
EFFORTS TOWARD CONSENSUS DEVELOPMENT ON THE PRESENT ISSUES
- How should the classic concepts of sensitivity, specificity, and ROC analysis be used to assess the performance of diagnostic imaging and computer-assist systems?
- Many new issues and levels of complexity come to the fore as more complex technologies emerge
EFFORTS TOWARD CONSENSUS DEVELOPMENT ON THE PRESENT ISSUES (II)
- RSNA/SPIE/MIPS: various workshops and literature; an evolving work-in-progress
- FDA/CDRH use of multiple-reader multiple-case (MRMC) ROC: digital mammography PMAs; computer aid for lung-nodule detection on CXR (film)
- NCI Lung Image Database Consortium (LIDC) and workshops: consensus-seeking on many issues; two active CDRH members
- Communication of these resources with incoming sponsors
[Figure: score distributions for non-diseased and diseased cases, separated by a decision threshold on the test-result value, or on the subjective judgment of the likelihood that the case is diseased]
[Figure: more typically, the two distributions overlap, so no threshold separates them cleanly]
[Figure: a less aggressive mindset sets a high threshold: low FPF (1-specificity) but also low TPF (sensitivity)]
[Figure: a moderate mindset sets the threshold between the distributions, trading FPF against TPF]
[Figure: a more aggressive mindset sets a low threshold: high TPF at the cost of a high FPF]
[Figure: sweeping the threshold over all possible values traces out the entire ROC curve in the (FPF, TPF) plane]
[Figure: the entire ROC curve bows above the chance line; the higher it bows, the greater the reader skill and/or level of technology]
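The threshold-sweeping idea in the figures above can be sketched in a few lines of Python. This is a toy illustration only: the normal score distributions, their separation, and the sample sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test-result values; higher = more suspicious for disease.
nondiseased = rng.normal(0.0, 1.0, 200)   # non-diseased cases
diseased = rng.normal(1.5, 1.0, 200)      # diseased cases, shifted upward

# Sweep the decision threshold from conservative (high) to aggressive (low);
# each threshold yields one (FPF, TPF) operating point on the ROC curve.
thresholds = np.sort(np.concatenate([nondiseased, diseased]))[::-1]
tpf = np.array([(diseased >= t).mean() for t in thresholds])      # sensitivity
fpf = np.array([(nondiseased >= t).mean() for t in thresholds])   # 1-specificity

# Area under the empirical ROC curve via the trapezoidal rule.
auc = np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2)
print(f"empirical AUC = {auc:.3f}")
```

As the threshold falls, both FPF and TPF rise monotonically, which is exactly the "mindset" progression in the slides.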
. . . at least that’s the idea . . . . . . now to what happens in the real world . . . The Complication of Reader Variability
In the following example from mammography, readers were asked to set their “threshold for action” . . . . . . between their sense of the boundary between category 3 and category 4 of the BIRADS scale
- There is no unique ROC operating point, i.e., no unique (TPF, FPF) point
- There is no unique ROC curve, i.e., there is a band or region of ROCs
. . . dozens of examples of this phenomenon exist . . . The following is an example from plain film chest radiography (CXR)
The Multiple-Reader Multiple-Case (MRMC) paradigm: "Fully-Crossed Design"
- Cases matched across modalities (i.e., the same cases are read unaided vs. aided)
- Readers matched across modalities (i.e., the same readers read unaided vs. aided)
- This design has the most statistical power for a given number of readers and a given number of cases with verified truth; it is thus the least demanding of these resources ("least burdensome")
Some possible bootstrap samples of size 15 from a dataset with 15 elements [14, 6, 3, 5, 12, 9, 11, 14, 4, 10, 7, 12, 3, 14, 2] . . . [9, 15, 11, 2, 13, 1, 6, 7, 12, 4, 8, 1, 12, 6, 14]
The Multiple-Reader Multiple-Case (MRMC) paradigm, enabled by "resampling strategies":
- Jackknife plus ANOVA (parametric) (Dorfman, Berbaum, Metz: DBM, 1992)
- Bootstrap the experiment of interest (nonparametric): draw random readers and random cases, then carry out the experiment of interest
- Obtain mean performance over readers and cases
- Obtain error bars that account for the variability of readers and cases
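A minimal sketch of the bootstrap branch of this idea, in the spirit of the components-of-variance bootstrap cited in the references: draw random readers AND random cases with replacement, redo the "experiment," and read the error bar off the spread of the replications. The reader count, case count, and score model below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_readers, n_cases = 5, 60
truth = rng.integers(0, 2, n_cases)                          # verified truth, 0/1
scores = truth + rng.normal(0.0, 1.0, (n_readers, n_cases))  # each reader's ratings

def auc(s, y):
    """Empirical ROC area = Mann-Whitney statistic (ties counted half)."""
    pos, neg = s[y == 1], s[y == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

boot = []
for _ in range(500):
    r = rng.integers(0, n_readers, n_readers)   # random readers, with replacement
    c = rng.integers(0, n_cases, n_cases)       # random cases, with replacement
    y = truth[c]
    if y.min() == y.max():                      # skip degenerate resamples
        continue
    boot.append(np.mean([auc(scores[i, c], y) for i in r]))

mean_auc = np.mean([auc(scores[i], truth) for i in range(n_readers)])
se = np.std(boot, ddof=1)   # error bar covering reader AND case variability
print(f"mean AUC over readers = {mean_auc:.3f} +/- {se:.3f}")
```

Because both readers and cases are resampled, the standard error reflects both sources of variability, which is the point of the MRMC paradigm.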
Scales used for reporting and measurements:
- Historic ordered categories (usually 5 or 6): almost definitely no . . . maybe . . . almost definitely yes
- "Action item" or "patient management" scale (e.g., no action vs. follow-up, or follow-up vs. biopsy); the BIRADS scale is the classic example
- "Continuous" probability rating scale (e.g., probability of disease or probability of cancer); actually recommended in the BIRADS document
Scales used for reporting and measurements Example of “Best of both worlds”: Classification of benign vs malignant μcalc clusters (Jiang, Nishikawa, Schmidt, Metz, Giger, Doi) Authors studied ROC curves, ROC areas . . . and (Sensitivity, Specificity) operating point (means and uncertainties)
Possible reasons why we do not see more of the "best of both worlds":
- ROC total area is TPF (Se) averaged over FPF (Sp)
- Var(ROC area) ~ (binomial Var)/2
- Var(Se) when Sp is known = binomial Var
- Var(Se) when Sp is estimated > binomial Var
- Var(ROC area) is thus the least burdensome
- "Both worlds" requires consistent conventions, plus training (little documentation so far)
- May require consensus bodies to promote the practice
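The variance comparison above can be made concrete with illustrative numbers: a hypothetical sensitivity of 0.80 measured on 100 diseased cases. The factor of 1/2 is the slide's rule of thumb, not an exact result.

```python
se_hat, n = 0.80, 100  # hypothetical sensitivity and number of diseased cases

var_se_known_sp = se_hat * (1 - se_hat) / n  # Var(Se) when Sp is known: binomial
var_auc_approx = var_se_known_sp / 2         # rule of thumb for Var(ROC area)
# Var(Se) when Sp must also be estimated exceeds the binomial value.

print(var_se_known_sp, var_auc_approx)
```

The smaller variance of the ROC area is why it is the "least burdensome" summary for a fixed case budget.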
Dilemma: Which modality is better?
[Figure: one operating point each for Modality A and Modality B in the plane of False Positive Fraction (1-specificity) vs. True Positive Fraction (sensitivity); the points alone cannot settle the question]
The dilemma is resolved after the ROCs are determined (one scenario):
[Figure: Modality B's ROC curve lies entirely above Modality A's. Conclusion: Modality B is better, with higher TPF at the same FPF, or lower FPF at the same TPF]
A different scenario:
[Figure: the two operating points fall on the same ROC curve; the modalities differ only in decision threshold, not in intrinsic performance]
. . . yet another scenario:
[Figure: Modality A's ROC curve lies entirely above Modality B's. Conclusion: Modality A is better, with higher TPF at the same FPF, or lower FPF at the same TPF]
When ROC curves cross, the total area under the ROC curve is not a sufficient summary measure of performance; other summary measures may be necessary. When this is anticipated, the study protocol is expected to address it.
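The slides do not prescribe a specific alternative; one commonly used option is the partial area under the curve over a clinically relevant FPF range. The range, the two crossing curve shapes, and the grid below are invented for illustration.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of y over x."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def partial_auc(fpf, tpf, fpf_max=0.2):
    """Area under the ROC curve restricted to FPF <= fpf_max (range assumed here)."""
    grid = np.linspace(0.0, fpf_max, 201)
    return trapezoid(np.interp(grid, fpf, tpf), grid)

# Two hypothetical crossing ROC curves with essentially equal total area:
fpf = np.linspace(0.0, 1.0, 1001)
tpf_a = fpf ** 0.25            # strong at low FPF
tpf_b = 1 - (1 - fpf) ** 4     # strong at high FPF

total_a, total_b = trapezoid(tpf_a, fpf), trapezoid(tpf_b, fpf)
p_a, p_b = partial_auc(fpf, tpf_a), partial_auc(fpf, tpf_b)
print(f"total areas: {total_a:.3f} vs {total_b:.3f}")
print(f"partial areas (FPF <= 0.2): {p_a:.3f} vs {p_b:.3f}")
```

Here the total areas tie, yet the restricted measure separates the modalities in the low-FPF region where, say, a screening application operates.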
Location scoring:
- The basic ROC paradigm assesses decision making at the level of the patient
- In complex imaging, assessment of decision making at a finer level is desired, i.e., assessment of localization
- Localization adds more information and more statistical power
The problem of location-specific ROC, or "LROC," analysis:
- Measurement of a "hit" depends on the localization criterion (thus, results are not unique)
- A monotonic relationship between ROC and LROC holds for the special case of zero or one lesion
- More elaborate models require assumptions of independence among multiple lesions and regions
- Lack of validated software for analysis of experiments
Region-of-interest (ROI) approach to location-specific ROC analysis: only require localization to within a quadrant, or some other unit.
Region-of-interest (ROI) approach to location-specific ROC analysis (continued):
- Disadvantage: "does not correspond to the clinical task," etc.
- Advantage: straightforward to account for correlations without additional assumptions
- The most straightforward method is simply to resample using the patient as the statistical unit
THE PROBLEM OF UNCERTAINTY OF TRUTH STATE
- Classic paper: Revesz, Kundel, Bonitatibus (1983) included various ways of obtaining panel-consensus "truth"
- The authors compared three imaging methods; any one of the three could outperform the others, depending on the rule used for reducing the panel's readings to truth
- Today, however, the target is the "actionable nodule" according to an expert panel
- The classic reference above indicates additional uncertainty is present => resample the panel to assess the additional uncertainty
UNCERTAINTY OF TRUTH STATE => UNCERTAINTY IN EFFECTIVE SAMPLE SIZE
- Uncertainty in TPF: driven by the number of actually diseased cases
- Uncertainty in FPF: driven by the number of actually non-diseased cases
- Uncertainty in total area under the ROC curve: driven by the "effective number of cases," the harmonic mean of the numbers in the two classes, which is itself a function of the panel sample
Given: 100 patients – What is the best “split” between “normals” and “abnormals” for purposes of estimating area under ROC?
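The question can be answered directly from the harmonic-mean definition of the effective number of cases. A small sketch (the function name is mine):

```python
def effective_n(n_diseased, n_nondiseased):
    """'Effective number of cases': harmonic mean of the two class sizes."""
    return 2 * n_diseased * n_nondiseased / (n_diseased + n_nondiseased)

# Given 100 patients, sweep every possible split:
best = max(range(1, 100), key=lambda n1: effective_n(n1, 100 - n1))
print(best, effective_n(best, 100 - best))  # an even 50/50 split maximizes it
```

The harmonic mean is largest when the two class sizes are equal, so for AUC estimation an even split makes the best use of a fixed number of verified cases.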
Relaxing the panel criterion from unanimous to majority:
- allows resampling to assess variability
- may increase the effective number of samples
- these effects may tend to cancel
THE PROBLEM OF CONTROLLING FOR READER VIGILANCE
Any measurement setting has artificial conditions vis-à-vis actual practice:
- "Are readers more vigilant in unaided reading when they're subjects in a study?"
- "Are readers less vigilant in unaided reading when they're not subjects in a study?"
One early suggestion: control the time available to readers to mimic the clinic (Chan et al., Invest. Radiol. 1990)
IN SUMMARY
These points reflect the current status of ongoing interactions among the FDA, academia, industry sponsors, the NCI, and the LIDC on the topic and on issues for submissions like the present one.
Selected References
- Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978; 8: 283-298.
- Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21: 720-733.
- Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24: 234-245.
- Metz CE. Fundamentals of ROC analysis. In: Beutel J, Kundel HL, Van Metter RL, eds. Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Bellingham, WA: SPIE Press, 2000; Chapter 15: 751-769.
- Swets JA, Pickett RM. Evaluation of Diagnostic Systems. New York: Academic Press, 1982.
- Wagner RF, Beiden SV, Campbell G, Metz CE, Sacks WM. Assessment of medical imaging and computer-assist systems: Lessons from recent experience. Acad Radiol 2002; 9: 1264-1277.
- Wagner RF, Beiden SV, Campbell G, Metz CE, Sacks WM. Contemporary issues for experimental design in assessment of medical imaging and computer-assist systems. Proc SPIE Medical Imaging 2003; 5034: 213-224.
- Dodd LE, Wagner RF, Armato SG, McNitt-Gray MF, et al. Assessment methodologies and statistical issues for computer-aided diagnosis of lung nodules in computed tomography: Contemporary research topics relevant to the Lung Image Database Consortium. Acad Radiol (in press, Apr. 2004).
- Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Statistics in Medicine 1996; 15: 1807-1826.
- Nishikawa RM, Yarusso LM. Variations in measured performance of CAD schemes due to database composition and scoring protocol. Proc SPIE 1998; 3338: 840-844.
- Giger ML. Current issues in CAD for mammography. In: Doi K, Giger ML, Nishikawa RM, Schmidt RA, eds. Digital Mammography '96. Elsevier Science B.V., 1996; 53-59.
- Clarke LP, Croft BY, Staab E, Baker H, Sullivan DC. National Cancer Institute initiative: Lung image database resource for imaging research. Acad Radiol 2001; 8: 447-450.
- Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: Some quantitative considerations. Acad Radiol 2001; 8: 328-334.
- Revesz G, Kundel HL, Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest Radiol 1983; 18: 194-198.
- Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: An alternative method for random-effects receiver operating characteristic analysis. Acad Radiol 2000; 7: 341-349.
- Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: Hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol 1995; 2 (Suppl 1): S22-S29.
- Chan HP, Doi K, Vyborny CJ, et al. Improvement in radiologists' detection of clustered microcalcifications on mammograms. Invest Radiol 1990; 25: 1102.
- Chakraborty DP, Winter L. Free-response methodology: Alternate analysis and a new observer-performance experiment. Radiology 1990; 174: 873-881.
- Metz CE, Starr SJ, Lusted LB. Observer performance in detecting multiple radiographic signals: Prediction and analysis using a generalized ROC approach. Radiology 1976; 121: 337-347.
- Starr SJ, Metz CE, Lusted LB, Goodenough DJ. Visual detection and localization of radiographic images. Radiology 1975; 116: 533-538.
- Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Medical Physics 1996; 23: 1709-1725.
- Chakraborty DP. The FROC, AFROC and DROC variants of the ROC analysis. In: Beutel J, Kundel HL, Van Metter RL, eds. Handbook of Medical Imaging. Vol. 1. Physics and Psychophysics. Bellingham, WA: SPIE Press, 2000; Chapter 16: 771-796.
- Obuchowski NA. Multireader receiver operating characteristic studies: A comparison of study designs. Acad Radiol 1995; 2: 709-716.
- Gatsonis CA, Begg CB, Wieand S. Advances in statistical methods for diagnostic radiology: A symposium. Acad Radiol 1995; 2 (Suppl 1): S1-S84 (the entire supplement is the proceedings of the symposium).
- Beiden SV, Wagner RF, Doi K, Nishikawa RM, Freedman M, Lo S-CB, Xu X-W. Independent versus sequential reading in ROC studies of computer-assist modalities: Analysis of components of variance. Acad Radiol 2002; 9: 1036-1043.
- Metz CE. Evaluation of CAD methods. In: Doi K, MacMahon H, Giger ML, Hoffmann KR, eds. Computer-Aided Diagnosis in Medical Imaging. Amsterdam: Elsevier Science B.V. (Excerpta Medica International Congress Series, Vol. 1182), 1999; 543-554.
- Chakraborty DP. Statistical power in observer performance studies: Comparison of the receiver operating characteristic and free-response methods in tasks involving localization. Acad Radiol 2002; 9: 147-156.
- Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: Generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27: 723-731.
- Chakraborty DP, Berbaum KS. Comparing inter-modality diagnostic accuracies in tasks involving lesion localization: A jackknife AFROC approach. Radiology 2002; 225 (P, Suppl): 259.
- Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000; 7: 516-525.
- Rutter CM. Bootstrap estimation of diagnostic accuracy with patient-clustered data. Acad Radiol 2000; 7: 413-419.
- Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.
- Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996; 156: 209-213.
- Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6: 22-33.