Advanced Statistical Analysis in Epidemiology: Inter-rater Reliability, Diagnostic Cutpoints, Test Comparison, Discrepant Analysis, Polychotomous Logistic Regression, and Generalized Estimating Equations. Jeffrey J. Kopicko, MSPH, Tulane University School of Public Health and Tropical Medicine
Diagnostic Statistics Typically assess a 2 x 2 contingency table taking the form:

                 Disease (Gold Standard) +   Disease (Gold Standard) -   Total
  New Test +                a                            b                a+b
  New Test -                c                            d                c+d
  Total                    a+c                          b+d                N
Inter-rater Reliability Suppose that two different tests exist for the diagnosis of a specific disease. We are interested in determining if the new test is as reliable in diagnosing the disease as the old test (“gold standard”).
Inter-rater Reliability continued In 1960, Cohen proposed a statistic that would provide a measure of reliability between the ratings of two different radiologists in the interpretation of x-rays. He called it the Kappa coefficient.
Inter-rater Reliability continued Cohen’s Kappa can be used to assess the reliability between two raters or diagnostic tests. Based on the previous contingency table, it has the following form and interpretation:
Inter-rater Reliability continued Cohen’s Kappa: K = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement, (a + d)/N, and p_e is the proportion of agreement expected by chance from the marginal totals, [(a + b)(a + c) + (c + d)(b + d)] / N². *Rosner, 1986
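A minimal sketch of Cohen’s Kappa computed from the 2 x 2 cell counts a, b, c, d defined above. This is an illustration written for these notes (not the authors’ code), and the function name is hypothetical.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's Kappa for the 2 x 2 table above:
    a = both positive, b = test +/standard -, c = test -/standard +, d = both negative."""
    n = a + b + c + d
    p_o = (a + d) / n                                      # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)
```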
Inter-rater Reliability continued Cohen’s Kappa behaves well when the prevalence of the disease is moderate and the marginal totals of the contingency table are distributed evenly between the raters. When the prevalence is very low (or very high), or when the raters’ marginal totals differ, Cohen’s Kappa will be erroneously low.
Inter-rater Reliability continued Byrt et al. proposed a solution to these possible biases in 1994. They called their solution the “Prevalence-Adjusted Bias-Adjusted Kappa” or PABAK. It has the same interpretation as Cohen’s Kappa and, for a 2 x 2 table, the following form: PABAK = 2p_o - 1, where p_o = (a + d)/N is the observed proportion of agreement.
Inter-rater Reliability continued 1. Take the mean of b and c (the discordant cells). 2. Take the mean of a and d (the concordant cells). 3. Compute PABAK by applying the original Cohen’s Kappa formula to this adjusted table (a sketch follows below).
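A minimal sketch of PABAK under the construction above; the closed form 2p_o - 1 is what the adjusted-table procedure reduces to for a 2 x 2 table. The function name is illustrative, not from the original slides.

```python
def pabak(a, b, c, d):
    """PABAK (Byrt et al., 1994): replace b and c by their mean and a and d by
    their mean, then apply Cohen's Kappa formula.  For a 2 x 2 table this
    reduces algebraically to 2 * p_o - 1, where p_o is the observed agreement."""
    n = a + b + c + d
    p_o = (a + d) / n
    return 2 * p_o - 1
```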
Inter-rater Reliability continued • PABAK is preferable in all instances, regardless of the prevalence or the potential bias between raters. • More meaningful statistics regarding the diagnostic value of a test can be computed, however.
Diagnostic Measures • Prevalence • Sensitivity • Specificity • Predictive Value Positive • Predictive Value Negative
Diagnostic Measures continued Prevalence Definition: Prevalence quantifies the proportion of individuals in a population who have the disease at a specific instant and provides an estimate of the probability (risk) that an individual will be ill at a point in time. Formula: Prevalence = (a + c) / N
Diagnostic Measures continued Sensitivity Definition: Sensitivity is defined as the probability of testing positive if the disease is truly present. Formula: Sensitivity = a / (a + c)
Diagnostic Measures continued Specificity Definition: Specificity is defined as the probability of testing negative if the disease is truly absent. Formula: Specificity = d / (b + d)
Diagnostic Measures continued Predictive Value Positive Definition: Predictive Value Positive (PV+) is defined as the probability that a person actually has the disease given that he or she tests positive. Formula: PV+ = a / (a + b)
Diagnostic Measures continued Predictive Value Negative Definition: Predictive Value Negative (PV-) is defined as the probability that a person is actually disease-free given that he or she tests negative. Formula: PV- = d / (c + d)
Example: Cervical Cancer Screening The standard of care for cervical cancer/dysplasia detection is the Pap smear. We want to assess a new serum DNA detection test for the Human Papillomavirus (HPV).
Prevalence = 55/500 = 0.110 Sensitivity = 50/55 = 0.909 Specificity = 410/445 = 0.921 PV+ = 50/85 = 0.588 PV- = 410/415 = 0.988
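A minimal sketch computing the five diagnostic measures from the 2 x 2 cell counts. The counts used below (a = 50, b = 35, c = 5, d = 410) are implied by the ratios reported above for the cervical cancer screening example; the function name is illustrative.

```python
def diagnostic_measures(a, b, c, d):
    """a = test+/disease+, b = test+/disease-, c = test-/disease+, d = test-/disease-."""
    n = a + b + c + d
    return {
        "prevalence":  (a + c) / n,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "PV+":         a / (a + b),
        "PV-":         d / (c + d),
    }

# Cervical cancer screening example (counts implied by the reported ratios)
print(diagnostic_measures(a=50, b=35, c=5, d=410))
# prevalence 0.110, sensitivity 0.909, specificity 0.921, PV+ 0.588, PV- 0.988
```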
Receiver Operating Characteristic (ROC) Curves Sensitivities and Specificities are used to: 1. Determine the diagnostic value of a test. 2. Determine the appropriate cutpoint for continuous data. 3. Compare the diagnostic values of two or more tests.
ROC Curves continued 1. For every gap between successive values of the continuous test measure, the midpoint is taken as a candidate cutpoint; this is where the contingency table distribution changes (see the sketch following this list). 2. At each cutpoint, the sensitivity and specificity are calculated. 3. The sensitivity is graphed versus 1 - specificity.
ROC Curves continued 4. Since the sensitivity and specificity are proportions, the total area of the graph is 1.0 units. 5. The area under the curve is the statistic of interest. 6. The area under a curve produced by chance alone is 0.50 units.
ROC Curves continued 7. If the area under the diagnostic test curve is significantly above 0.50, then the test is a good predictor of disease. 8. If the area under the diagnostic test curve is significantly below 0.50, then the test is an inverse predictor of disease. 9. If the area under the diagnostic test curve is not significantly different from 0.50, then the test is a poor predictor of disease.
ROC Curves continued 10. An individual curve can be compared to 0.50 using the N(0, 1) distribution. 11. Two or more diagnostic tests can be compared also using the N(0, 1) distribution. 12. A diagnostic cutpoint can be determined for tests with continuous outcomes in order to maximize the sensitivity and specificity of the test.
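A minimal sketch of the cutpoint-walking construction in steps 1 through 6 above, with the area under the curve taken by the trapezoidal rule. This is a generic illustration, not the authors’ SAS code; the function name and the trapezoidal AUC are assumptions.

```python
import numpy as np

def empirical_roc(values, disease):
    """Empirical ROC curve: candidate cutpoints are the midpoints of the gaps
    in the sorted test values.  'values' are continuous test results; 'disease'
    is 1 for true cases and 0 for non-cases (both groups must be present).
    Returns (1 - specificity, sensitivity) pairs and the trapezoidal area."""
    values, disease = np.asarray(values, float), np.asarray(disease, int)
    sorted_vals = np.unique(values)
    cutpoints = (sorted_vals[:-1] + sorted_vals[1:]) / 2          # midpoint of each gap
    cutpoints = np.concatenate(([-np.inf], cutpoints, [np.inf]))  # include the extremes
    sens, fpr = [], []
    for c in cutpoints:
        test_pos = values > c
        sens.append((test_pos & (disease == 1)).sum() / (disease == 1).sum())
        fpr.append((test_pos & (disease == 0)).sum() / (disease == 0).sum())
    order = np.argsort(fpr)
    fpr, sens = np.array(fpr)[order], np.array(sens)[order]
    auc = np.trapz(sens, fpr)                                     # area under the curve
    return fpr, sens, auc
```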
ROC Curves continued Determining Diagnostic Cutpoints
ROC Curves continued Diagnostic Value of a Test: z = (a1 - a0) / sqrt(se(a1)² + se(a0)²), where a1 = area under the diagnostic test curve, a0 = 0.50 (the chance area), se(a1) is the standard error of the area, and se(a0) = 0.00.
ROC Curves continued Diagnostic Value of a Test For the RCP example, the area under the curve is 0.987, with a p-value of <0.001. The optimal cutpoint for this test is 1.1 ng/ml.
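A minimal sketch of the z test above, comparing an observed ROC area with the chance value 0.50. The slides report the area (0.987) but not its standard error, so the standard error used in the example call is an assumed value for illustration only.

```python
from math import sqrt, erfc

def auc_z_test(a1, se1, a0=0.50, se0=0.0):
    """z test of an ROC area against the chance value a0 = 0.50:
    z = (a1 - a0) / sqrt(se(a1)**2 + se(a0)**2), with a two-sided normal p-value."""
    z = (a1 - a0) / sqrt(se1**2 + se0**2)
    p = erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical numbers: area = 0.987 (reported), standard error = 0.01 (assumed)
print(auc_z_test(0.987, 0.01))
```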
ROC Curves continued Comparing the areas under 2 or more curves In order to compare the areas under two or more ROC curves, use the same formula, substituting the values for the second curve for those previously defined for chance alone.
ROC Curves continued Comparing the areas under 2 or more curves For the CMV retinitis example, the Digene test had the largest area (although not significantly greater than antigenemia). The cutpoint was determined to be 1,400 cells/cc. The sensitivity was 0.85 and the specificity was 0.84. Bonferroni adjustments must be made for more than 2 comparisons.
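A minimal sketch of the two-curve comparison with a Bonferroni adjustment, assuming the two areas are independent; if the curves are estimated on the same subjects a covariance term would be needed, which the slides do not show. The function name is illustrative.

```python
from math import sqrt, erfc

def compare_auc(a1, se1, a2, se2, n_comparisons=1):
    """Compare the areas under two ROC curves with the same z formula,
    treating the areas as independent; the p-value is multiplied by the number
    of pairwise comparisons (Bonferroni) when more than two tests are compared."""
    z = (a1 - a2) / sqrt(se1**2 + se2**2)
    p = erfc(abs(z) / sqrt(2))
    return z, min(1.0, p * n_comparisons)   # Bonferroni-adjusted p-value
```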
ROC Curves continued Another Application? Remember when Cohen’s Kappa was unstable at extreme prevalence and/or when there was bias among the raters? What about using ROC curves to assess inter-rater reliability?
ROC Curves continued Another limitation to K is that it provides only a measure of agreement, regardless of whether the raters correctly classify the state of the items. K can be high, indicating excellent reliability, even though both raters incorrectly assess the items.
ROC Curves continued The two areas under the curves may be compared as a measure of overall inter-rater reliability. This comparison is made by applying the following formula: droc = 1 - |Area1 - Area2|. By subtracting the difference in areas from one, droc is placed on a scale similar to K, ranging from 0 to 1.
ROC Curves continued If both raters correctly classify the objects at the same rate, their sensitivities and specificities will be equal, resulting in a droc of 1. If one rater correctly classifies all the objects, and the second rater misclassifies all the objects, droc will equal 0.
Statistics for Figure 1 (N = 20):
Rater One: % Correct = 80%, sensitivity = 0.80, specificity = 0.80, area under ROC = 0.80.
Rater Two: % Correct = 55%, sensitivity = 0.60, specificity = 0.533, area under ROC = 0.567.
droc = 0.7667
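A minimal sketch of droc for two binary raters. It assumes the single-rater "area" is the trapezoidal ROC area of a one-point curve, (sensitivity + specificity) / 2, which reproduces the areas reported for Figure 1; the function name is illustrative.

```python
def d_roc(sens1, spec1, sens2, spec2):
    """d_roc = 1 - |Area1 - Area2|; for a single binary rater the ROC area
    reduces to (sensitivity + specificity) / 2."""
    area1 = (sens1 + spec1) / 2
    area2 = (sens2 + spec2) / 2
    return 1 - abs(area1 - area2)

# Figure 1 example: Rater One (0.80, 0.80), Rater Two (0.60, 0.533)
print(d_roc(0.80, 0.80, 0.60, 0.533))   # ~0.767, matching the reported 0.7667
```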
Monte Carlo Simulation Several different levels of disease prevalence, sample size, and rater error rate were assessed using Monte Carlo methods. Total sample sizes of 20, 50, and 100 were generated for each disease prevalence of 5, 15, 25, 50, 75, and 90 percent. Two raters were used in this study. Rater One was fixed at a 5 percent probability of misclassifying the true state of the disease, while Rater Two was allowed varying levels of misclassification. For each condition of disease prevalence, rater error, and sample size, 1,000 valid samples were generated and analyzed using SAS proc IML.
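The original analysis was carried out in SAS proc IML; the following Python sketch only illustrates the stated design (true disease status generated at a given prevalence, each rater misclassifying with a fixed probability, droc averaged over valid replicates). The 20 percent error rate for Rater Two in the example call is an assumed value, not one taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mean_droc(n, prevalence, err1=0.05, err2=0.20, reps=1000):
    """Sketch of the simulation design: generate true disease status, let each
    rater misclassify with a fixed probability, and average d_roc over the
    valid replicates (samples containing both cases and non-cases)."""
    results = []
    while len(results) < reps:
        truth = rng.random(n) < prevalence
        if truth.all() or not truth.any():        # no cases or no non-cases: invalid sample
            continue
        rater1 = truth ^ (rng.random(n) < err1)   # flip the true state with probability err1
        rater2 = truth ^ (rng.random(n) < err2)   # flip the true state with probability err2
        def area(rating):
            sens = (rating & truth).sum() / truth.sum()
            spec = (~rating & ~truth).sum() / (~truth).sum()
            return (sens + spec) / 2
        results.append(1 - abs(area(rater1) - area(rater2)))
    return float(np.mean(results))

# e.g. n = 50, prevalence = 15 %; Rater Two's 20 % error rate is an assumption
print(simulate_mean_droc(50, 0.15))
```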
Based on the above results, it appears that the difference in two ROC curves may be a more stable estimate of inter-rater agreement than K. Based on the metric used to assess K, a similar metric can be formed for the difference in two ROC curves. We propose the following: droc > 0.95, excellent reliability; 0.80 < droc <= 0.95, good reliability; droc <= 0.80, marginal reliability.
ROC Curves continued From the example data provided with Figure 1, it can be seen that droc behaves similarly to K. The droc from these data is 0.7667, while K is 0.30. Both lead to a decision of marginal inter-rater reliability. However, from the ROC plot and the percent correct for each rater, it is clear that Rater One classifies the observations far more accurately than Rater Two, with percent correct of 80% and 55%, respectively.
ROC Curves continued Without the individual calculation of the sensitivities and specificities, information about the correctness of the raters would have remained obscure. Additionally, with the large differential rater error, K may have been underestimated. The difference in ROC curves offers several advantages over K, but only when the true state of the objects being rated is known. Finally, with very little adaptation, these methods may be extended to more than two raters and to continuous outcome data.
So, we now know how to assess whether a test is a good predictor of disease, how to compare two or more tests, and how to determine cutpoints. But what if there is no established “gold standard”?
Discrepant Analysis Discrepant Analysis (DA) is a commonly used (and commonly misused) technique for estimating the sensitivity and specificity of diagnostic tests when the reference (“gold standard”) test is itself imperfect. This technique often results in “upwardly biased” estimates of the diagnostic statistics.
Discrepant Analysis continued Example: Chlamydia trachomatis is a common STI that has been diagnosed using cervical swab culture for years. Often, though, patients only present for screening when they are symptomatic. Symptomatic screening may be closely associated with organism load. Therefore, culture diagnosis may miss carriers and patients with low organism loads.
Discrepant Analysis continued Example continued: GenProbe testing has also been used to capture some cases that are not captured by culture. New polymerase chain reaction (PCR) and ligase chain reaction (LCR) DNA assays may be better diagnostic tests. But, there is obviously no good “gold-standard.”