Comparing Diagnostic Accuracies of Two Tests in Studies with Verification Bias

Marina Kondratovich, Ph.D. Division of Biostatistics, Center for Devices and Radiological Health, U.S. Food and Drug Administration. No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred. September 2005

Outline • Introduction: examples, diagnostic accuracy, verification bias • I. Ratio of true positive rates and ratio of false positive rates • II. Multiple imputation • III. Types of missingness in subsets • Summary

Comparison of two qualitative tests, T1 and T2, or combinations of them • Examples: • Cervical cancer: T1 - Pap test (categorical values), T2 - HPV test (qualitative test); Reference method: colposcopy/biopsy • Prostate cancer: T1 - DRE (qualitative test), T2 - PSA (quantitative test with cutoff of 4 ng/mL); Reference method: biopsy • Abnormal cells on a Pap slide: T1 - manual reading of a Pap slide, T2 - computer-aided reading of a Pap slide; Reference method: reading of the slide by an Adjudication Committee

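Diagnostic Accuracy of a Medical Test. Pair: Sensitivity = TPR, Specificity = TNR. In the (1 - Sp, Se) plane, test T1 is the point (x1, y1), where x1 = FPR = 1 - Sp1 and y1 = TPR = Se1. Pair: PLR1 = Se1/(1 - Sp1) = y1/x1 = tangent of θ1 (slope of the line from (0, 0) through T1), related to PPV; NLR1 = (1 - Se1)/Sp1 = (1 - y1)/(1 - x1) = tangent of θ2 (slope of the line from T1 through (1, 1)), related to NPV. [Figure: T1 at (x1, y1) with angles θ1 and θ2; axes Se versus 1 - Sp.]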

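Boolean Combinations “OR” and “AND” of T1 and Random Test. Random Test: positive with probability α, negative with probability 1 - α. Combination OR: SeOR = Se1 + (1 - Se1)*α = y1 + (1 - y1)*α; SpOR = Sp1*(1 - α) = (1 - x1)*(1 - α). As α varies, the points (T1 OR Random Test) lie on the line y - y1 = NLR1 * (x - x1), that is, y - y1 = (1 - y1)/(1 - x1) * (x - x1), so NLR(T1 OR Random Test) = (1 - y1)/(1 - x1) = NLR1. [Figure: line from T1 through (1, 1) labeled "T1 OR Random Test"; axes Se versus 1 - Sp.]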

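Boolean Combinations “OR” and “AND” of T1 and Random Test. Random Test: positive with probability α, negative with probability 1 - α. Combination AND: SeAND = Se1*α = y1*α; SpAND = Sp1 + (1 - Sp1)*(1 - α) = (1 - x1) + x1*(1 - α). As α varies, the points (T1 AND Random Test) lie on the line y - y1 = PLR1 * (x - x1), that is, y - y1 = y1/x1 * (x - x1), so PLR(T1 AND Random Test) = y1/x1 = PLR1. [Figure: line from (0, 0) through T1 labeled "T1 AND Random Test"; axes Se versus 1 - Sp.]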
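The two combination slides can be checked numerically. The following is a minimal Python sketch, not part of the original presentation; the function names and the values Se1 = 0.80, Sp1 = 0.90 are hypothetical and chosen only for illustration. It applies the OR and AND formulas above and shows that the NLR of the OR combination and the PLR of the AND combination do not change with α.

```python
def plr(se, sp):
    """Positive likelihood ratio Se / (1 - Sp)."""
    return se / (1 - sp)


def nlr(se, sp):
    """Negative likelihood ratio (1 - Se) / Sp."""
    return (1 - se) / sp


def or_with_random(se1, sp1, alpha):
    """T1 OR a random test that is positive with probability alpha."""
    se_or = se1 + (1 - se1) * alpha
    sp_or = sp1 * (1 - alpha)
    return se_or, sp_or


def and_with_random(se1, sp1, alpha):
    """T1 AND a random test that is positive with probability alpha."""
    se_and = se1 * alpha
    sp_and = sp1 + (1 - sp1) * (1 - alpha)
    return se_and, sp_and


if __name__ == "__main__":
    se1, sp1 = 0.80, 0.90  # hypothetical accuracy of T1
    for alpha in (0.1, 0.5, 0.9):
        se_or, sp_or = or_with_random(se1, sp1, alpha)
        se_and, sp_and = and_with_random(se1, sp1, alpha)
        # NLR of the OR combination and PLR of the AND combination
        # stay equal to NLR1 and PLR1 for every alpha.
        print(alpha, nlr(se_or, sp_or), plr(se_and, sp_and))
```

Because the random test carries no diagnostic information, combining it with T1 only moves the operating point along one of the two lines through T1.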

Comparing Medical Tests. [Figure: the (1 - Sp, Se) plane divided by the two lines through T1 into four regions, labeled: PPV>PPV1, NPV>NPV1; PPV<PPV1, NPV>NPV1; PPV>PPV1, NPV<NPV1; PPV<PPV1, NPV<NPV1.] More detail in: Biggerstaff, B.J. Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 2000, 19: 649-663
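The four regions can be identified by comparing likelihood ratios: at a fixed prevalence, PPV increases with PLR and NPV increases as NLR decreases. The sketch below is an illustration under that standard relationship; the function name and the example values are hypothetical.

```python
def compare_by_likelihood_ratios(se1, sp1, se2, sp2):
    """Return how PPV and NPV of Test2 compare with Test1 at equal prevalence.

    PPV2 > PPV1 exactly when PLR2 > PLR1;
    NPV2 > NPV1 exactly when NLR2 < NLR1.
    """
    plr1, plr2 = se1 / (1 - sp1), se2 / (1 - sp2)
    nlr1, nlr2 = (1 - se1) / sp1, (1 - se2) / sp2

    if plr2 > plr1:
        ppv = "PPV>PPV1"
    elif plr2 < plr1:
        ppv = "PPV<PPV1"
    else:
        ppv = "PPV=PPV1"

    if nlr2 < nlr1:
        npv = "NPV>NPV1"
    elif nlr2 > nlr1:
        npv = "NPV<NPV1"
    else:
        npv = "NPV=NPV1"
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical T1 with Se = 0.80, Sp = 0.90.
    print(compare_by_likelihood_ratios(0.80, 0.90, 0.90, 0.95))
    print(compare_by_likelihood_ratios(0.80, 0.90, 0.90, 0.70))
```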

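Formal Model: prospective study, comparison of two qualitative tests, T1 and T2, or combinations of them. [Table: cross-classification of T1 and T2 results, separately for Disease D+ (cells a1, b1, c1, d1; total N1) and Non-Disease D- (cells a0, b0, c0, d0; total N0).] a1 + a0 = A; b1 + b0 = B; c1 + c0 = C; d1 + d0 = D; N1 + N0 = N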

Example: condition of interest - cervical disease; T1 - Pap test; T2 - biomarker; Reference - colposcopy/biopsy. [Table: T1 by T2 results, shown separately for Disease D+ and Non-Disease D-.]

Verification Bias • In studies for the evaluation of diagnostic devices, sometimes the reference (“gold”) standard is not applied to all study subjects. • If the process by which subjects were selected for verification depends on the results of the medical tests, then the statistical analysis of accuracies of these medical tests without the proper corrections is biased. • This bias is often referred to as verification bias (or, in variants, work-up bias, referral bias, and validation bias).

I. Ratio of True Positive Rates and Ratio of False Positive Rates. Not all subjects (or none) with negative results on both tests were verified by the Reference method. • Estimates of sensitivities and specificities based only on verified results are biased. • Estimates of the ratio of sensitivities and of the ratio of false positive rates are unbiased (Schatzkin et al., 1987). [Table: T1 by T2 results, shown separately for Disease D+ and Non-Disease D-.] Reference: Schatzkin, A., Connor, R.J., Taylor, P.R., and Bunnag, B. “Comparing new and old screening tests when a reference procedure cannot be performed on all screeners”. American Journal of Epidemiology 1987, Vol. 125, N.4, p. 672-678
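The reason the ratios survive the missing verification is that the unknown total number of diseased (or non-diseased) subjects cancels. The sketch below assumes that every subject positive on at least one test was verified; the function name and all counts are hypothetical and are not taken from the presentation.

```python
def rate_ratio(n_both_pos, n_t1_only, n_t2_only):
    """Ratio (rate of T2+) / (rate of T1+) within one disease class.

    Each rate is (count positive on that test) / (class total); the
    unknown class total cancels, so double-negative subjects are not needed.
    """
    n_t2_pos = n_both_pos + n_t2_only
    n_t1_pos = n_both_pos + n_t1_only
    return n_t2_pos / n_t1_pos


if __name__ == "__main__":
    # Hypothetical verified counts among diseased subjects (D+).
    ry = rate_ratio(n_both_pos=40, n_t1_only=10, n_t2_only=30)
    # Hypothetical verified counts among non-diseased subjects (D-).
    rx = rate_ratio(n_both_pos=20, n_t1_only=80, n_t2_only=180)
    print("Ry = Se2/Se1 =", ry)
    print("Rx = (1-Sp2)/(1-Sp1) =", rx)
```

With these illustrative counts Ry = 70/50 and Rx = 200/100, so the false positive rate rises more than the true positive rate.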

Ratio of TP Rates and Ratio of FP Rates (cont.) Statement of the problem: Se2/Se1 = y2/y1 = Ry; (1-Sp2)/(1-Sp1) = x2/x1 = Rx. Can we draw conclusions about the effectiveness of Test2 if we know only the ratio of true positive rates and the ratio of false positive rates between Test2 and Test1? For the sake of simplicity, consider that Test2 has the higher theoretical sensitivity, Se2/Se1 = Ry > 1 (true parameters, not estimates).

Ratio of TP Rates and Ratio of FP Rates (cont.) A) Se2/Se1 = Ry > 1 (increase in sensitivity); (1-Sp2)/(1-Sp1) = Rx < 1 (decrease in false positive rates). For any Test1, Test2 is effective (superior to Test1). [Figure: T1 at (x1, y1); axes Se versus 1 - Sp.]

Ratio of TP Rates and Ratio of FP Rates (cont.) B) Se2/Se1 = Ry > 1 (increase in sensitivity); (1-Sp2)/(1-Sp1) = Rx > 1 (increase in false positive rates); Ry >= Rx > 1. It is easy to show that PLR2 = Se2/(1-Sp2) = (Ry/Rx)*PLR1, and therefore PLR2 >= PLR1. For any Test1, Test2 is effective (superior to Test1, because the PPV and NPV of Test2 are at least as high as those of Test1). [Figure: T1 at (x1, y1); axes Se versus 1 - Sp.]

Ratio of TP Rates and Ratio of FP Rates (cont.) Example: condition of interest - cervical disease; T1 - Pap test; T2 - biomarker; Reference - colposcopy/biopsy. [Table: T1 by T2 results, shown separately for Disease D+ and Non-Disease D-.]

Ratio of TP Rates and Ratio of FP Rates (cont.) C) Se2/Se1 = Ry > 1 (increase in sensitivity); (1-Sp2)/(1-Sp1) = Rx > 1 (increase in false positive rates); Ry < Rx: the increase in false positive rates is higher than the increase in true positive rates. Can we draw conclusions about the effectiveness of Test2? [Figure: T1 at (x1, y1) and the line "T1 OR Random Test"; axes Se versus 1 - Sp.]

Ratio of TP Rates and Ratio of FP Rates (cont.) Theorem: Test2 is above the line of the combination T1 OR Random Test if (Rx-1)/(Ry-1) < PLR1/NLR1. Example: Ry = 2 and Rx = 3, so (Rx-1)/(Ry-1) = (3-1)/(2-1) = 2. The conclusion depends on the accuracy of T1: if PLR1/NLR1 > 2, then T2 is superior for confirming absence of disease (NPV↑, PPV↓); if PLR1/NLR1 < 2, then T2 is inferior overall (NPV↓, PPV↓). [Figure: T1 at (x1, y1) and the line "T1 OR Random Test"; axes Se versus 1 - Sp.]
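The theorem can be applied directly once the accuracy of T1 is known. The following sketch uses the slide's example (Ry = 2, Rx = 3) with two hypothetical operating points for T1; the function name and the T1 values are illustrative assumptions.

```python
def classify_test2(ry, rx, se1, sp1):
    """Situation C (Rx > Ry > 1): apply the theorem's threshold.

    Test2 lies above the line 'T1 OR Random Test' (NPV increases)
    when (Rx - 1) / (Ry - 1) < PLR1 / NLR1.
    """
    plr1 = se1 / (1 - sp1)
    nlr1 = (1 - se1) / sp1
    threshold = (rx - 1) / (ry - 1)
    if plr1 / nlr1 > threshold:
        return "superior for confirming absence of disease (NPV up, PPV down)"
    return "inferior overall (NPV down, PPV down)"


if __name__ == "__main__":
    ry, rx = 2, 3  # example from the slide; threshold is 2
    # Hypothetical T1 with PLR1/NLR1 = 6, above the threshold.
    print(classify_test2(ry, rx, se1=0.40, sp1=0.90))
    # Hypothetical T1 with PLR1/NLR1 of about 1.29, below the threshold.
    print(classify_test2(ry, rx, se1=0.30, sp1=0.75))
```

For the first T1, Test2 has Se2 = 0.80 and Sp2 = 0.70, so NLR2 = 0.2/0.7 is below NLR1 = 0.6/0.9, in agreement with the classification.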

Ratio of TP Rates and Ratio of FP Rates (cont.) For situation C: Se2/Se1 = Ry > 1 (increase in sensitivity); (1-Sp2)/(1-Sp1) = Rx > 1 (increase in false positive rates); Ry < Rx (the increase in FPR is higher than the increase in TPR). In order to draw conclusions about the effectiveness of Test2, we need information about the diagnostic accuracy of Test1.

Ratio of TP Rates and Ratio of FP Rates (cont.) If Se2/Se1 = Ry > 1, then Se1 <= 1/Ry; if (1-Sp2)/(1-Sp1) = Rx > 1, then (1-Sp1) <= 1/Rx. [Figure: the region x1 <= 1/Rx, y1 <= 1/Ry of the (1 - Sp, Se) plane, divided by a hyperbola into a green area and a red area.] If T1 is in the green area, then T2 is superior for confirming absence of disease (higher NPV and lower PPV). If T1 is in the red area, then T2 is inferior overall (lower NPV and lower PPV).

Ratio of TP Rates and Ratio of FP Rates (cont.) • Summary: • If, in a clinical study comparing the accuracies of Test2 and Test1, the increase in the TP rate of Test2 is anticipated to be statistically higher than the increase in its FP rate, then conclusions about the effectiveness of Test2 can be made without information about the diagnostic accuracy of Test1. • In most practical situations, the increase in the FP rate of Test2 is anticipated to be higher than the increase in the TP rate (or the sample size is not large enough to demonstrate that the increase in the TP rate is statistically higher than the increase in the FP rate); in that case, information about the diagnostic accuracy of Test1 is needed in order to draw conclusions about the effectiveness of Test2.

II. Verification Bias: Subjects Negative on Both Tests. If a random sample of the subjects with negative results on both tests is verified by the reference standard, then unbiased estimates of the sensitivities and specificities of Test1 and Test2 can be constructed. [Table: T1 by T2 results, shown separately for Disease D+ and Non-Disease D-.]
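One simple way to build such estimates is to scale the verified double-negative sample up to the full double-negative group before computing the rates. The sketch below illustrates that weighting idea only; it is not presented as the exact procedure used in the talk. The function name and all counts are hypothetical, and it assumes every subject positive on at least one test was verified.

```python
def corrected_se_sp(dpos, dneg, n_double_neg_total):
    """Weighted estimates of Se and Sp for T1 and T2.

    dpos / dneg: verified counts keyed by (T1 result, T2 result);
    the ("-", "-") entries are counts in the randomly verified sample.
    """
    key = ("-", "-")
    n_verified = dpos[key] + dneg[key]
    weight = n_double_neg_total / n_verified  # inverse sampling fraction

    est_pos = dict(dpos)
    est_neg = dict(dneg)
    est_pos[key] = dpos[key] * weight
    est_neg[key] = dneg[key] * weight

    n1 = sum(est_pos.values())  # estimated number of D+
    n0 = sum(est_neg.values())  # estimated number of D-
    se1 = (est_pos[("+", "+")] + est_pos[("+", "-")]) / n1
    se2 = (est_pos[("+", "+")] + est_pos[("-", "+")]) / n1
    sp1 = (est_neg[("-", "+")] + est_neg[key]) / n0
    sp2 = (est_neg[("+", "-")] + est_neg[key]) / n0
    return se1, sp1, se2, sp2


if __name__ == "__main__":
    # Hypothetical data: 100 of 1000 double negatives verified at random.
    dpos = {("+", "+"): 40, ("+", "-"): 10, ("-", "+"): 30, ("-", "-"): 2}
    dneg = {("+", "+"): 20, ("+", "-"): 80, ("-", "+"): 180, ("-", "-"): 98}
    print(corrected_se_sp(dpos, dneg, n_double_neg_total=1000))
```

Ignoring the weight would treat the 2 verified diseased double negatives as the only ones, which overstates both sensitivities.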

II. Verification Bias: Bias Correction • Verification Bias Correction Procedures: • Begg, C.B., Greenes, R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39, 207-215. • Hawkins, D.M., Garrett, J.A., Stephenson, B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20, 1987-2001. • Multiple Imputation • The absence of the disease status for some subjects can be considered as a problem of missing data. • Multiple imputation is a Monte Carlo simulation in which the missing disease status of each subject is replaced by simulated plausible values based on the observed data. Each of the imputed datasets is analyzed separately, and the diagnostic accuracies of the tests are evaluated. The results are then combined to produce estimates and confidence intervals that incorporate the uncertainty due to the missing disease status of some subjects.
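The multiple imputation steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis from the talk: only double-negative subjects have missing disease status, their disease probability is drawn from a Beta posterior with a uniform prior, and the per-dataset results are pooled with Rubin's rules. The function name, parameters, and counts are hypothetical.

```python
import random
from statistics import mean, variance


def mi_sensitivity_t1(dpos_t1_pos, dpos_t1_neg_t2_pos, dd_pos, dd_neg,
                      dd_unverified, m=20, seed=0):
    """Multiple-imputation estimate of the sensitivity of T1.

    dd_pos / dd_neg: verified double negatives with and without disease.
    dd_unverified: double negatives with missing disease status.
    Returns (pooled estimate, total variance).
    """
    rng = random.Random(seed)
    estimates, within = [], []
    for _ in range(m):
        # Draw a plausible disease probability for double negatives
        # (Beta posterior under an assumed uniform prior).
        p = rng.betavariate(dd_pos + 1, dd_neg + 1)
        # Impute the missing disease status of each unverified subject.
        imputed_pos = sum(rng.random() < p for _ in range(dd_unverified))
        n_dpos = dpos_t1_pos + dpos_t1_neg_t2_pos + dd_pos + imputed_pos
        se = dpos_t1_pos / n_dpos
        estimates.append(se)
        within.append(se * (1 - se) / n_dpos)  # binomial variance
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    pooled = mean(estimates)
    total_var = mean(within) + (1 + 1 / m) * variance(estimates)
    return pooled, total_var


if __name__ == "__main__":
    est, var = mi_sensitivity_t1(50, 30, 2, 98, 900)
    print("Se1 estimate:", est, "standard error:", var ** 0.5)
```

The between-imputation term is what carries the extra uncertainty from the unverified subjects into the final interval.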

II. Verification Bias: Subjects Negative on Both Tests (cont.) Usually, according to the study protocol, all subjects from the subsets A, B, and C should have verified disease status, and the verification bias is related to the subjects for whom both test results are negative. In practice, however, not all subjects from the subsets A, B, and C comply with disease verification, and this also leads to verification bias.

III. Different Types of Missingness. In order to adjust correctly for verification bias, the type of missingness should be investigated. Missing data mechanisms: • Missing Completely At Random (MCAR): missingness is unrelated to the values of any variables (whether the disease status or observed variables). • Missing At Random (MAR): missingness is unrelated to the disease status but may be related to the observed values of other variables. For details, see Little, R.J.A. and Rubin, D. (1987) Statistical Analysis with Missing Data. New York: John Wiley.

III. Different Types of Missingness. Example: prospective study for prostate cancer. 5,000 men were screened with digital rectal exam (DRE) and prostate specific antigen (PSA) assay. Results of DRE are Positive or Negative. PSA, a quantitative test, is dichotomized at a threshold of 4 ng/mL: Positive (PSA > 4), Negative (PSA ≤ 4). D+ = prostate cancer; D- = no prostate cancer (reference standard = biopsy).

[Tables: DRE by PSA results for all subjects, and for subjects with verified disease status, D+ (positive biopsy) and D- (negative biopsy).]

III. Different Types of Missingness (cont.) • Do the subjects without biopsies differ from the subjects with biopsies? • Propensity score = conditional probability that the subject underwent verification of disease (biopsy in this example) given a collection of observed covariates (the quantitative value of the PSA test, age, race, and so on). • Membership in the group of verified subjects is modeled by logistic regression: • outcome: underwent verification (biopsy), yes or no • predictors: quantitative PSA value and the covariates.

III. Different Types of Missingness (cont.) For subgroup A (PSA+, DRE+), the probability that a subject has a missing biopsy appears to depend neither on the PSA value nor on the observed covariates (age, race). Type of missingness: Missing Completely At Random. The same holds for subgroup B (PSA+, DRE-).

III. Different Types of Missingness (cont.) For subgroup C (PSA-, DRE+), the probability that a subject has a missing biopsy does depend on the quantitative value of PSA. The PSA value is a significant predictor of biopsy missingness in this subgroup (the larger the PSA value, the lower the probability of a missing biopsy). Type of missingness: Missing At Random.

III. Different Types of Missingness (cont.) Adjustment for verification bias without proper investigation of the type of missingness (biased estimates), compared with adjustment for verification bias taking into account the different types of missingness (unbiased estimates). [Tables: adjusted results for D+ and D- under each approach.]

III. Different Types of missingness (cont.) Correct adjustment for verification bias produces the estimates demonstrating that an increase in FP rates for the New test (PSA) is about the same as an increase in TP rates while incorrect adjustment for verification bias showed that the increase in FP rates was larger than the increase in TP rates. So, naïve estimation of the risk for the subgroup C based on the assumption that the missing results of biopsy were Missing Completely At Random produces biased estimation of the performance of the New PSA test (underestimation of the performance of the New test). For proper adjustment, information on the distribution of test results in the subjects who are not selected for verification should be available.
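The effect described for subgroup C can be illustrated with a small calculation. The sketch below uses entirely hypothetical counts, split into two PSA bands chosen only for illustration, in which subjects with higher PSA are both more likely to be biopsied and more likely to have cancer. It contrasts a naive estimate that assumes MCAR with a stratified estimate that conditions on the PSA band (MAR).

```python
def naive_expected_cases(strata):
    """Assume MCAR: apply the pooled verified disease rate to everyone."""
    n_total = sum(s["n_total"] for s in strata)
    n_verified = sum(s["n_verified"] for s in strata)
    n_pos = sum(s["n_pos"] for s in strata)
    return n_total * n_pos / n_verified


def stratified_expected_cases(strata):
    """Assume MAR given PSA band: apply each band's own verified rate."""
    return sum(s["n_total"] * s["n_pos"] / s["n_verified"] for s in strata)


if __name__ == "__main__":
    # Hypothetical subgroup C (PSA-, DRE+), split by quantitative PSA.
    strata = [
        {"band": "PSA 0-2", "n_total": 300, "n_verified": 100, "n_pos": 5},
        {"band": "PSA 2-4", "n_total": 200, "n_verified": 150, "n_pos": 30},
    ]
    print("naive (MCAR) expected D+:", naive_expected_cases(strata))
    print("stratified (MAR) expected D+:", stratified_expected_cases(strata))
```

In this illustration the naive estimate assigns too many cancers to the PSA-negative subgroup, which is the direction that makes the PSA test look worse than it is.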

Summary • In most practical situations, estimation of only ratios of True Positive and False Positive rates does not allow one to make conclusions about effectiveness of the test. • The absence of disease status can be considered as the problem of missing data. Multiple imputation technique can be used for correction of verification bias. Information on the distribution of test results in the subjects who are not selected for verification should be available. • The investigation of the type of missingness should be done for obtaining unbiased estimates of performances of medical tests. All subsets of subjects should be checked for missing disease status. • Precision of the estimated diagnostic accuracies depends primarily on the number of verified cases available for statistical analysis.

References • Begg C.B. and Greenes R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics; 39: 207-215. • Biggerstaff B.J. (2000) Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine; 19: 649-663. • Hawkins D.M., Garrett J.A., and Stephenson B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine; 20: 1987-2001. • Kondratovich M.V. (2003) Verification bias in the evaluation of diagnostic tests. Proceedings of the 2003 Joint Statistical Meeting, Biopharmaceutical Section, San Francisco, CA. • Ransohoff D.F. and Feinstein A.R. (1978) Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine; 299: 926-930. • Schatzkin A., Connor R.J., Taylor P.R., and Bunnag B. (1987) Comparing new and old screening tests when a reference procedure cannot be performed on all screeners. American Journal of Epidemiology; 125(4): 672-678. • Zhou X. (1994) Effect of verification bias on positive and negative predictive values. Statistics in Medicine; 13: 1737-1745. • Zhou X. (1998) Correcting for verification bias in studies of a diagnostic test’s accuracy. Statistical Methods in Medical Research; 7: 337-353. • http://www.fda.gov/cdrh/pdf/p930027s004b.pdf
