Statistics and Image Evaluation Oleh Tretiak Medical Imaging Systems Fall, 2002
Which Image is Better? Case A Case B
Method • Rating on a 1 to 5 scale (5 is best) • Rating performed by 21 subjects • Statistics: • Average, maximum, minimum, standard deviation, standard error for Case A, Case B • Difference per viewer between Case A and Case B, and the above statistics on the difference
Conclusions • The image in Case A has a higher average ranking than the one in Case B. • The highest ranking for B equals the lowest ranking for A; in all other cases, the rankings for B are lower than those for A. • Consider the difference (rightmost column on the previous slide). The ratio of the average to the standard error (the z value) is 2.62/0.22 ≈ 12. Such a large z is extremely unlikely if the means are the same.
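The per-viewer difference statistic described above can be sketched in a few lines of Python (standard library only). The ratings below are hypothetical placeholders, not the 21-viewer data from the experiment:

```python
from math import sqrt
from statistics import mean, stdev

def paired_z(ratings_a, ratings_b):
    """z score of per-viewer rating differences: mean / standard error."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    se = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean difference
    return mean(diffs) / se

# Hypothetical ratings from 5 viewers (1-5 scale, 5 is best):
case_a = [5, 4, 5, 4, 5]
case_b = [2, 2, 3, 1, 2]
print(round(paired_z(case_a, case_b), 2))
```

A z this far from zero would, as on the slide, be extremely unlikely if the true mean difference were zero.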
Experimental Design • How many observers should we use to test differences between pictures? • We expect the difference between the two kinds of pictures to be 0.5 ranking units, and we expect the standard deviation of the difference measurement to be 1.0 (see the experiment above). We would like to determine this reliably, so we want the confidence interval on the mean to be [mean - 0.5, mean + 0.5] at 99% confidence. How many observers should we use? • Answer: z0.005 = 2.6, so the standard error must be 0.5/2.6 = 0.19. Std. err. = std. dev./sqrt(n), therefore n = (1.0/0.19)^2 ≈ 28
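The sample-size arithmetic above can be checked with Python's standard library, which exposes the normal quantile via `statistics.NormalDist`. Note that the exact quantile z0.005 ≈ 2.576 gives n = 27, while the slide's rounded values (z ≈ 2.6, standard error ≈ 0.19) give 28:

```python
from math import ceil
from statistics import NormalDist

def observers_needed(half_width, sd, confidence=0.99):
    """Smallest n such that the confidence interval on the mean
    difference is mean +/- half_width: n = (z * sd / half_width)**2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided quantile
    return ceil((z * sd / half_width) ** 2)

print(observers_needed(0.5, 1.0))  # 27 with the exact quantile
```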
Today’s Lecture • Hypothesis testing • Two kinds of errors • ROC analysis • Visibility of blobs • Quantitative quality measures
Hypothesis Testing Example 256x256 128x128
Question: Which is better? • Testing method • Quality rating by multiple viewers • Compute per-viewer difference in quality • Find mean and standard deviation of the difference • Compute the z score (mean/std. error) • How to interpret?
Null Hypothesis (H0) • Assume that the mean is zero (no difference) • Find a range of z that would occur when the mean is zero. • Accept the null hypothesis if z is in this range (no difference) • Reject null hypothesis if z falls outside the range
We show the normal distribution with mean 0 and σ = 1. The shaded area has probability 0.95, and the two white areas each have probability 0.025. If we observe Gaussian variables with mean zero, 95% of the observations will fall between -1.96 and 1.96. The probability outside this interval (0.05 in this case) is called the significance level of the test.
Two Kinds of Errors • In a decision task with two alternatives, there are two kinds of errors • Suppose the alternatives are ‘healthy’ (H0) and ‘sick’ (H1) • Type I error: say sick if healthy • Type II error: say healthy if sick
X - observation, t - threshold α = Pr[X > t | H0] (Type I error) β = Pr[X < t | H1] (Type II error) Choosing t, we can trade off between the two types of errors
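As a sketch, assuming both hypotheses are Gaussian with unit standard deviation (the H1 mean of 2.0 is an arbitrary illustrative choice, not from the slides), the two error probabilities for a threshold t can be computed directly:

```python
from statistics import NormalDist

h0 = NormalDist(mu=0.0, sigma=1.0)  # negative / healthy
h1 = NormalDist(mu=2.0, sigma=1.0)  # positive / sick (illustrative mean)

def errors(t):
    alpha = 1 - h0.cdf(t)  # Type I:  X > t although H0 is true
    beta = h1.cdf(t)       # Type II: X < t although H1 is true
    return alpha, beta

for t in (0.5, 1.0, 1.5):
    alpha, beta = errors(t)
    print(f"t={t}: alpha={alpha:.3f}, beta={beta:.3f}")
```

Sweeping t shows the trade-off: raising the threshold lowers α and raises β.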
Examples of Threshold Measurement • [Figure: blobs in noise]
Examples • Measurement of psychophysical threshold • Detectible flicker, detectible contrast • Medical diagnosis • Negative (healthy), positive (sick) • Home security • Friend or terrorist
Probability of Error • Pe = P0α + P1β • Why bother with two types of error, why not just Pe? • In many cases, P1 << P0! • The two types of error typically have different consequences, so we don’t want to mix them together.
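A quick numeric illustration (the prevalences and error rates below are made up): when P1 << P0, a detector that misses most sick cases can still have a tiny overall Pe, which is exactly why the two error types are kept separate:

```python
# Illustrative prevalences and error rates, not data from the slides:
P0, P1 = 0.99, 0.01       # healthy cases vastly outnumber sick ones
alpha, beta = 0.01, 0.90  # a detector that almost always says "healthy"

Pe = P0 * alpha + P1 * beta
print(round(Pe, 4))  # a small overall error, yet 90% of sick cases are missed
```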
ROC Terminology • ROC — receiver operating characteristic • H0 — friend, negative; H1 — enemy, positive • Pr[X > t | H0] = probability of false alarm = probability of false positive = PFP = α • Pr[X > t | H1] = probability of detection = probability of true positive = PTP = 1 − β
The ROC • The ROC shows the tradeoff between PFP and PTP as the threshold is varied
How Do We Estimate the ROC? • Radiological diagnosis setting • Positive and negative cases • The true diagnosis must be evaluated by a reliable method • Cases are evaluated by radiologist(s), who report the data on a discrete scale • 1 = definitely negative, 5 = definitely positive
Binormal Model • Negative: Normal, mean = 0, st. dev. = 1 • Positive: Normal, mean = a, st. dev. = b
Some Binormal Plots • [Figure: ROC curves for b = 0.5, b = 1, and b = 2, each with a = 1, 2, 3] • Az ~ area under the ROC curve
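For the binormal model, each ROC point and the area Az have closed forms; in particular Az = Φ(a / √(1 + b²)), where Φ is the standard normal CDF. A minimal sketch using Python's standard library:

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF

def binormal_roc_point(t, a, b):
    """One (PFP, PTP) point of the binormal ROC at threshold t."""
    pfp = 1 - phi(t)                             # negatives: N(0, 1)
    ptp = 1 - NormalDist(mu=a, sigma=b).cdf(t)   # positives: N(a, b)
    return pfp, ptp

def binormal_az(a, b):
    """Area under the binormal ROC: Az = Phi(a / sqrt(1 + b^2))."""
    return phi(a / sqrt(1 + b * b))

print(round(binormal_az(1, 1), 3))
```

Sweeping t through `binormal_roc_point` traces out the curves shown in the plots.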
Experimental Framework • Set of positive and negative cases • Need a reliable diagnosis • Radiologists interpret the cases and report on a scale • Certainly Negative, Probably Negative, Unclear, Probably Positive, Certainly Positive • Estimate ROC, Az • Compare results from studies with conventional versus processed images
Statistical Estimation • The result of an experiment is a sample • If N is very large, the estimate is the same as the theory • For practical N, the estimate is the true value ± error
[Figure: Standard deviations of estimates of a, b, and Az for varying numbers of observations. Horizontal axis: number of positive and negative observations. Top curve: sa; middle curve: sb. Trials were with a = b = 1.]
Another Approach: Nonparametric Model • Ordinal Dominance Graph • Donald Bamber, Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph, J. of Math. Psych. 12: 387-415 (1975). • Method: compute frequencies of occurrence for different threshold levels from the sample, plot on a probability scale. [Figure: Monte Carlo, a = 1, b = 1, 10 positive and 10 negative cases]
Ordinal Dominance - examples • [Figure: plots for sample sizes (10, 10), (20, 20), (40, 40), (100, 100)]
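The area below the ROC graph can be estimated nonparametrically as the fraction of (negative, positive) score pairs in which the positive case scores higher, counting ties as one half (the Mann-Whitney statistic that Bamber relates to the ROC area). A sketch with hypothetical rating data:

```python
def auc_mann_whitney(negatives, positives):
    """Nonparametric AUC estimate: fraction of (negative, positive)
    pairs where the positive scores higher; ties count half."""
    wins = 0.0
    for x in negatives:
        for y in positives:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(negatives) * len(positives))

# Hypothetical scores on the 1-5 rating scale (not from the slides):
neg = [1, 2, 2, 3, 1]
pos = [3, 4, 2, 5, 4]
print(auc_mann_whitney(neg, pos))
```

With only 10 positive and 10 negative cases, as in the Monte Carlo example, this estimate is quite noisy; it tightens as the sample sizes grow.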
Theory • The area estimate is asymptotically normal • [Figure: worst-case behavior of the estimate]
Metz • University of Chicago ROC project: • http://www-radiology.uchicago.edu/krl/toppage11.htm • Software for estimating Az, along with sample standard deviations and confidence intervals • Versatile
Example • Compare image processing with conventional imaging • Design: • Should we use the same cases for both? • Yes, for a better comparison • But then the results from the two studies are correlated! • The Metz software can handle this
Design Parameters
(1) Unpaired (uncorrelated) test results. The two "conditions" are applied to independent case samples -- for example, two different diagnostic tests performed on different patients, or two different radiologists who make probability judgments concerning the presence of a specified disease in different images.
(2) Fully paired (correlated) test results, in which data from both conditions are available for each case in a single case sample. The two "conditions" in each test-result pair could correspond, for example, to two different diagnostic tests performed on the same patient, or to two different radiologists who make probability judgments concerning the presence of a specified disease in the same image.
(3) Partially paired test results -- for example, two different diagnostic tests performed on the same patient sample plus some additional patients who received only one of the diagnostic tests.
Summary: ROC • Compare modalities, evaluate effectiveness of a modality • Need to know the truth • Issue: two kinds of error • Specificity, Sensitivity • Scalar comparison not suitable • Statistical problem • More data, better answer • ROC methodology • Metz methods and software allow computation of confidence intervals, significance for tests with practical design parameters
Recent Work • Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: The case of unequal variance structures across modalities. Acad. Radiol. 8: 605-615, 2001. • Gefen S, Tretiak OJ, Piccoli CW, Donohue KD, Petropulu AP, Shankar PM, Dumane VA, Huang L, Kutay MA, Genis V, Forsberg F, Reid JM, Goldberg BB. ROC analysis of ultrasound tissue characterization classifiers for breast cancer diagnosis. IEEE Trans. Med. Imaging, in press.