Assessing agreement for diagnostic devices

FDA/Industry Statistics Workshop, September 28-29, 2006
Bipasa Biswas
Mathematical Statistician, Division of Biostatistics
Office of Surveillance and Biometrics
Center for Devices and Radiological Health, FDA
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.

Outline
• Accuracy measures for diagnostic tests with a dichotomous outcome: the ideal world, tests with a reference standard.
  - Two indices to measure accuracy: sensitivity and specificity
• Assessing agreement between two tests in the absence of a reference standard.
  - Overall agreement
  - Cohen’s Kappa
  - McNemar’s test
  - Proposed remedy
• Extending agreement to tests with more than two outcomes.
  - Cohen’s Kappa
  - Extension to the random marginal agreement coefficient (RMAC)
  - Should agreement per cell be reported?

Ideal world: tests with a perfect reference standard (single test)
• If a perfect reference standard exists to classify patients as diseased (D+) versus not diseased (D-), then the data can be represented as:

              True status
  Test        D+      D-
  T+          TP      FP
  T-          FN      TN

• If the true disease status is known, we can estimate sensitivity as Se = TP/(TP+FN) and specificity as Sp = TN/(TN+FP).
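The two estimates above can be sketched in a few lines of Python. The function name and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (Se, Sp) from counts against a perfect reference standard."""
    se = tp / (tp + fn)  # fraction of diseased subjects the test calls positive
    sp = tn / (tn + fp)  # fraction of non-diseased subjects the test calls negative
    return se, sp

# Hypothetical counts, for illustration only
se, sp = sensitivity_specificity(tp=80, fp=30, fn=20, tn=70)
print(f"Se = {se:.1%}, Sp = {sp:.1%}")
```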

Ideal world: tests with a perfect reference standard (comparing two tests)
• McNemar’s test can be used to test equality of either sensitivity or specificity.

  True status: disease (D+)          True status: no disease (D-)
              Comparator test                    Comparator test
  New test    R+      R-             New test    R+      R-
  T+          a1      b1             T+          a2      b2
  T-          c1      d1             T-          c2      d2

• McNemar chi-square:
  - Equality of the sensitivities of the two tests: (|b1-c1|-1)^2/(b1+c1)
  - Equality of the specificities of the two tests: (|c2-b2|-1)^2/(c2+b2)

Ideal world: tests with a perfect reference standard (comparing two tests)
• Example

  True status: disease (D+)          True status: no disease (D-)
              Comparator test                    Comparator test
  New test    R+      R-             New test    R+      R-
  T+          80       5             T+          85      20
  T-          10       5             T-           5     790

  SeT = 85.0% (85/100)               SpT = 88.3% (795/900)
  SeR = 90.0% (90/100)               SpR = 90.0% (810/900)

• McNemar chi-square:
  - Equality of the sensitivities of the two tests: (|5-10|-1)^2/(5+10) = 1.07, p-value = 0.30; 95% CI for the difference (-13.5%, 3.5%)
  - Equality of the specificities of the two tests: (|5-20|-1)^2/(5+20) = 7.84, p-value = 0.005; 95% CI for the difference (-2.9%, -0.5%)
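The numbers in this example can be checked with a short Python sketch. The slide does not say how the confidence intervals were computed; the Wald interval with a 1/n continuity correction used here is an assumption that happens to reproduce the quoted limits. The function names are hypothetical.

```python
import math

def mcnemar_cc(b, c):
    """Continuity-corrected McNemar chi-square (1 df) and two-sided p-value."""
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

def paired_diff_ci(x, y, n, z=1.959964):
    """Wald CI with a 1/n continuity correction for the paired difference (x - y)/n.

    x and y are the two discordant counts. This interval method is an
    assumption; the slide does not name the one it used.
    """
    diff = (x - y) / n
    se = math.sqrt((x + y) - (x - y) ** 2 / n) / n
    half = z * se + 1 / n
    return diff - half, diff + half

# Sensitivity (D+ subjects, n = 100): discordant counts 5 (T+R-) and 10 (T-R+)
stat_se, p_se = mcnemar_cc(5, 10)
ci_se = paired_diff_ci(5, 10, 100)   # SeT - SeR

# Specificity (D- subjects, n = 900): discordant counts 5 (T-R+) and 20 (T+R-)
stat_sp, p_sp = mcnemar_cc(5, 20)
ci_sp = paired_diff_ci(5, 20, 900)   # SpT - SpR

print(f"Se: chi2 = {stat_se:.2f}, p = {p_se:.2f}, CI = ({ci_se[0]:.1%}, {ci_se[1]:.1%})")
print(f"Sp: chi2 = {stat_sp:.2f}, p = {p_sp:.3f}, CI = ({ci_sp[0]:.1%}, {ci_sp[1]:.1%})")
```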

McNemar’s test when a reference standard exists
• Note, however, that McNemar’s test only checks for equality: the null hypothesis is that the two tests are equal and the alternative is that they differ. This is not an appropriate formulation when the goal is to show equivalence, because a failure to find a statistically significant difference is then naively interpreted as evidence of equivalence.
• The 95% confidence intervals for the differences in sensitivity and in specificity give a better idea of how much the two tests differ.

Imperfect reference standard
• A subject’s true disease status is seldom known with certainty.
• What is the effect on sensitivity and specificity when the comparator test R itself has error?

              Imperfect reference test (comparator test)
  New test    R+      R-
  T+          a       b
  T-          c       d

Imperfect reference standard
• Example 1: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R (the comparator test) that misses 20% of the diseased subjects but never falsely indicates disease.

              True status                        Imperfect reference test
              D+      D-                         R+      R-
  T+          80      30             T+          64      46
  T-          20      70             T-          16      74

  Se = 80.0% (80/100)                Se (relative to R) = 80.0% (64/80)
  Sp = 70.0% (70/100)                Sp (relative to R) = 61.7% (74/120)

Imperfect reference standard
• Example 2: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R that misses 20% of the diseased subjects, but the error in R is related to the error in T.

              True status                        Imperfect reference test
              D+      D-                         R+      R-
  T+          80      30             T+          80      30
  T-          20      70             T-           0      90

  Se = 80.0% (80/100)                Se (relative to R) = 100.0% (80/80)
  Sp = 70.0% (70/100)                Sp (relative to R) = 75.0% (90/120)

Imperfect reference standard
• Example 3: Now suppose our test is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

              True status                        Imperfect reference test
              D+      D-                         R+      R-
  T+         100       0             T+          90      10
  T-           0     100             T-          10      90

  Se = 100.0% (100/100)              Se (relative to R) = 90.0% (90/100)
  Sp = 100.0% (100/100)              Sp (relative to R) = 90.0% (90/100)
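Examples 1 and 3 can be reproduced with a small Python sketch that computes the expected accuracy of T measured against R. It assumes the errors of T and R are independent given the true disease status, which is why it does not cover Example 2, where the errors are related. The function name is hypothetical.

```python
def apparent_accuracy(n_pos, n_neg, se_t, sp_t, se_r, sp_r):
    """Expected (Se, Sp) of T relative to R.

    Assumes T and R err independently given true disease status.
    n_pos and n_neg are the numbers of truly diseased and non-diseased subjects.
    """
    # Expected joint counts, summed over diseased and non-diseased subjects
    both_pos = n_pos * se_t * se_r + n_neg * (1 - sp_t) * (1 - sp_r)
    both_neg = n_pos * (1 - se_t) * (1 - se_r) + n_neg * sp_t * sp_r
    # Expected totals for the imperfect reference test
    r_pos = n_pos * se_r + n_neg * (1 - sp_r)
    r_neg = n_pos * (1 - se_r) + n_neg * sp_r
    return both_pos / r_pos, both_neg / r_neg

# Example 1: T has Se 80%, Sp 70%; R misses 20% of diseased, no false positives
ex1 = apparent_accuracy(100, 100, se_t=0.8, sp_t=0.7, se_r=0.8, sp_r=1.0)

# Example 3: T is perfect; R has Se 90%, Sp 90%
ex3 = apparent_accuracy(100, 100, se_t=1.0, sp_t=1.0, se_r=0.9, sp_r=0.9)

print(f"Example 1: Se rel. R = {ex1[0]:.1%}, Sp rel. R = {ex1[1]:.1%}")
print(f"Example 3: Se rel. R = {ex3[0]:.1%}, Sp rel. R = {ex3[1]:.1%}")
```

In Example 3 the new test is perfect, so every apparent error is actually an error of R.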

Challenges in assessing agreement in the absence of a reference standard
• Two commonly used overall measures are:
  - Overall agreement measure
  - Cohen’s Kappa
• McNemar’s test
• Instead, report positive percent agreement (PPA) and negative percent agreement (NPA).

Estimate of agreement
• The overall percent agreement can be calculated as: 100% × (a+d)/(a+b+c+d)
• The overall percent agreement, however, does not differentiate between agreement on the positives and agreement on the negatives.
• Instead of overall agreement, report positive percent agreement (PPA) with respect to the imperfect reference standard positives and negative percent agreement (NPA) with respect to the imperfect reference standard negatives (see Feinstein and Cicchetti, 1990):
  PPA = 100% × a/(a+c)
  NPA = 100% × d/(b+d)

Why not report just the overall percent agreement?
• The overall percent agreement is insensitive to off-diagonal imbalance.
  (2×2 table of new test T+/T- against imperfect reference test R+/R-; cell counts not preserved)
• The overall percent agreement is 85.0%, yet it does not account for the off-diagonal imbalance: the PPA is 100% and the NPA is only 50%.

Why report both PPA and NPA?

  Table 1                            Table 2
              Imperfect reference test           Imperfect reference test
  New test    R+      R-             New test    R+      R-
  T+           5       5             T+          35       5
  T-           5      85             T-           5      55

  Overall pct. agreement = 90.0%     Overall pct. agreement = 90.0%
  PPA = 50.0% (5/10)                 PPA = 87.5% (35/40)
    [95% CI = 18.7%, 81.3%]            [95% CI = 73.2%, 95.8%]
  NPA = 94.4% (85/90)                NPA = 91.7% (55/60)
    [95% CI = 87.5%, 98.2%]            [95% CI = 81.6%, 97.2%]
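A minimal Python sketch of PPA and NPA with confidence intervals follows. The slide does not name its interval method; the exact (Clopper-Pearson) interval is assumed here because it reproduces the 18.7% to 81.3% limits quoted for 5/10. The cell counts are those implied by the fractions on the slide (5/10 and 85/90 for Table 1, 35/40 and 55/60 for Table 2), and the function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided CI for a proportion x/n, found by bisection."""
    def solve(f, target):
        # f is decreasing in p on [0, 1]; find p with f(p) = target
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

def ppa_npa(a, b, c, d):
    """PPA = a/(a+c) and NPA = d/(b+d), both relative to the reference test R."""
    return a / (a + c), d / (b + d)

for name, (a, b, c, d) in {"Table 1": (5, 5, 5, 85), "Table 2": (35, 5, 5, 55)}.items():
    ppa, npa = ppa_npa(a, b, c, d)
    ppa_ci = clopper_pearson(a, a + c)
    npa_ci = clopper_pearson(d, b + d)
    print(f"{name}: PPA = {ppa:.1%} ({ppa_ci[0]:.1%}, {ppa_ci[1]:.1%}), "
          f"NPA = {npa:.1%} ({npa_ci[0]:.1%}, {npa_ci[1]:.1%})")
```

Both tables have 90% overall agreement, yet the PPA intervals differ sharply, which is the point of reporting the two indices separately.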

Kappa measure of agreement
• Kappa is defined as the difference between observed and expected agreement, expressed as a fraction of the maximum possible difference. It ranges from -1 to 1.

              Imperfect reference standard
  New test    R+      R-
  T+          a       b
  T-          c       d

• κ = (Io-Ie)/(1-Ie), where Io = (a+d)/n and Ie = ((a+c)(a+b)+(b+d)(c+d))/n^2

Kappa measure of agreement

              Imperfect reference standard
  New test    R+      R-
  T+          35      15
  T-          15      35

• Io = 70/100 = 0.70, Ie = ((50)(50)+(50)(50))/10000 = 0.50
• κ = (0.70-0.50)/(1-0.50) = 0.40 [95% CI = 0.22, 0.58]
• By the way, the overall percent agreement is 70.0%.
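The kappa calculation can be sketched in Python as follows. The counts a = d = 35 and b = c = 15 are the only table consistent with Io = 70/100 and all four marginal totals equal to 50. The standard error is a simple large-sample approximation, an assumption on my part that reproduces the quoted interval; the slide does not state which formula was used.

```python
import math

def cohen_kappa(a, b, c, d, z=1.959964):
    """Cohen's kappa for a 2x2 table, with an approximate 95% CI.

    Uses SE = sqrt(Io(1-Io) / (n (1-Ie)^2)), a simple large-sample
    approximation (assumed here, not taken from the slide).
    """
    n = a + b + c + d
    io = (a + d) / n                                        # observed agreement
    ie = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2     # chance agreement
    kappa = (io - ie) / (1 - ie)
    se = math.sqrt(io * (1 - io) / (n * (1 - ie) ** 2))
    return kappa, (kappa - z * se, kappa + z * se)

kappa, (lo, hi) = cohen_kappa(35, 15, 15, 35)
print(f"kappa = {kappa:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```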

Kappa measure of agreement: sensitive to off-diagonal?
  (2×2 table of new test T+/T- against imperfect reference test R+/R-; cell counts not preserved)
• κ = 0.45 [95% CI = 0.31, 0.59]
• The overall agreement is unchanged (70%) and the marginal differences are much bigger than before, yet kappa is higher than the previous 0.40, which suggests better agreement rather than worse.
• The kappa statistic is affected by the marginal totals even when the overall agreement is the same.

McNemar’s test to check for equality in the absence of a reference standard
• Hypothesis: equality of rates of positive response

              Imperfect reference test
  New test    R+      R-
  T+          37      30
  T-           5      28

• McNemar chi-square = (|b-c|-1)^2/(b+c) = (|30-5|-1)^2/(30+5) = 16.46
• Two-sided p-value = 0.00005

McNemar’s test (insensitivity to main diagonal)
  (2×2 table of new test T+/T- against imperfect reference test R+/R-; cell counts not preserved)
• Same p-value as when a=37 and d=28, even though the new and the old test agree on 99.5% of individual cases.

McNemar’s test (insensitivity to main diagonal)
  (2×2 table of new test T+/T- against imperfect reference test R+/R-; cell counts not preserved)
• Two-sided p-value = 1, even though the old and the new test agree on no cases.
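The insensitivity of McNemar’s test to the main diagonal can be illustrated with a short Python sketch. The first table uses the counts quoted in the text (a=37, d=28, with discordant counts 30 and 5). The other two tables are hypothetical: their counts were lost from the slides, so they are chosen only to match the stated properties (99.5% agreement with the same discordant cells, and no agreement at all with equal discordant cells).

```python
import math

def mcnemar_p(a, b, c, d):
    """Two-sided p-value of the continuity-corrected McNemar test.

    The concordant cells a and d are accepted but never used.
    """
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df

def overall_agreement(a, b, c, d):
    return (a + d) / (a + b + c + d)

tables = {
    "counts from the text": (37, 30, 5, 28),
    "99.5% agreement (hypothetical diagonal)": (3500, 30, 5, 3465),
    "no agreement (hypothetical counts)": (0, 50, 50, 0),
}

for name, cells in tables.items():
    print(f"{name}: agreement = {overall_agreement(*cells):.1%}, "
          f"McNemar p = {mcnemar_p(*cells):.5f}")
```

The first two tables give identical p-values despite 65% versus 99.5% agreement, and the third gives p = 1 with 0% agreement.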

Proposed remedy
• Instead of reporting overall agreement, kappa, or the McNemar’s test p-value, report both positive percent agreement and negative percent agreement.
• In the 510(k) paradigm, where a new device is compared to an already marketed device, the positive percent agreement and the negative percent agreement are relative to the comparator device, which is appropriate.

Agreement of tests with more than two outcomes
• For example, in radiology one often compares the standard film mammogram to a digital mammogram, where the radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity.
• Fay (2005, Biostatistics) proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen’s Kappa).

Comparing two tests with more than two outcomes
• The advantage of RMAC is that differences between the two marginal distributions will not induce greater apparent agreement.
• However, as stated in the paper, RMAC, like Cohen’s Kappa with the fixed marginal assumption, also depends on the heterogeneity of the population. Thus, when the probability of responding in one category is nearly 1, the chance agreement will be large, leading to low agreement coefficients.

Comparing two tests with more than two outcomes
• An omnibus agreement index for situations with more than two outcomes suffers from problems similar to those faced by tests with a dichotomous outcome. Also, in a regulatory setting where a new test device is being compared to a predicate device, RMAC may not be appropriate, as it gives equal weight to the marginals from the test and the predicate device.
• Instead, report individual agreement for each category.
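One way to read "individual agreement for each category" is as a direct generalization of PPA and NPA: for each category assigned by the predicate device, report the fraction of those subjects that the new device places in the same category. The Python sketch below follows that reading, which is an assumption, and uses hypothetical counts for a three-category test.

```python
def per_category_agreement(table):
    """Agreement per predicate category.

    table[i][j] = number of subjects with new test in category i and
    predicate test in category j. Returns, for each predicate category j,
    the fraction of its subjects that the new test also put in category j.
    """
    k = len(table)
    result = []
    for j in range(k):
        col_total = sum(table[i][j] for i in range(k))
        result.append(table[j][j] / col_total if col_total else float("nan"))
    return result

# Hypothetical counts, for illustration only (rows: new test, columns: predicate)
table = [
    [40, 5, 0],
    [8, 30, 4],
    [2, 5, 6],
]

agreement = per_category_agreement(table)
for j, value in enumerate(agreement, start=1):
    print(f"Category {j}: {value:.1%} agreement relative to the predicate")
```

In the 2×2 case this reduces to PPA for the positive category and NPA for the negative category.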

Summary
• If a perfect reference standard exists for a dichotomous test, then both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed.
• If a new test is being compared to an imperfect predicate test, then the positive percent agreement and negative percent agreement, along with their 95% confidence intervals, are a more appropriate basis for comparison than the overall agreement, the kappa statistic, or McNemar’s test.
• For tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems if the goal of the study is to compare the new test against a predicate. A suggestion would be to report agreement for each cell.

References
• Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
• Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.
• Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions (2nd ed.). John Wiley & Sons, New York.
• Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1–6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 326(7379), 41–44.)

References (continued)
• Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.
• Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549.
• Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558.
• Fay, M.P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6, 171–180.
