
Measurement of Rater Consistency by Chance-Corrected Agreement Coefficients





Presentation Transcript


  1. Measurement of Rater Consistency by Chance-Corrected Agreement Coefficients. Zheng Xie, Chai Gadepalli, Barry M.G. Cheetham. UKSIM2018.

  2. Abstract • Measurement of consistency in decisions made by observers or raters is an important problem in clinical medicine. • Chance-corrected agreement coefficients such as the Cohen & Fleiss Kappas are commonly used. • The way they estimate the probability of agreement 'by chance' has been strongly questioned. • Alternatives such as Gwet's AC1 & AC2 coefficients are gaining currency. • A well-known paradox illustrates the deficiencies of the Kappa coefficients. • It is remedied by grading subjects according to their probability of being hard to score; the AC1 & AC2 coefficients result from this approach. • We question the rationale of the 'hard to score' concept. • An alternative approach is proposed. • It is applied to the weighted & unweighted multi-rater Cohen & Fleiss Kappas, and also to Intra-Class Correlation (ICC) coefficients.

  3. Introduction • Consistency of clinical observations is clearly important. • A patient may see just one clinician, so it is useful to know whether observations are likely to be independent of the choice of clinician. • Investigating consistency requires a clinical trial, with groups of subjects & clinical observers (raters). • Such a trial was presented at EMS2017 in Manchester, carried out for a voice quality assessment procedure. • It required measurements of the self-consistency of raters, and of the inter-rater consistency of different raters observing the same subjects. • Decisions may be categorical or ordinal (with weighting).

  4. Measures of Consistency 1. Pearson Correlation (inappropriate for comparing raters): takes into account only variations about the mean score for each rater. 2. Proportion of agreement among raters, Po: biased by the possibility of agreement by chance. 3. Brennan-Prediger (BP) coefficient: the simplest chance-corrected coefficient. 4. Original Cohen Kappa: defined for categorical scoring & generalised to many raters. 5. Weighted Cohen Kappa (for ordinal/numerical scoring): allows differences between scores to be taken into account. 6. Fleiss Kappa: measures consistency of groups of 2 or more raters. 7. Intra-class correlation coefficient (ICC): measures scoring consistency of 2 or more raters. 8. Gwet AC1, AC2 & Aickin coefficients.

  5. Three forms of Consistency • Intra-rater: self-consistency when a rater (e.g. an SLT, a speech & language therapist) scores the same subject twice. To what extent does he give the same scores twice? • Inter-rater: consistency when different SLTs score the same subjects; agreement of each rater with the other raters. • Multi-rater: consistency of groups of more than 2 raters.

  6. Significance tables for interpreting Kappa & ICC values (tables shown on slide).

  7. Chance-corrected Agreement Coefficients • Unweighted & weighted Po may be used for categorical or ordinal scoring. • Both are biased by the probability of some agreement occurring by chance. • If raters made random decisions evenly distributed over Q = 4 categories, 'by chance' agreement would be expected with probability 1/Q = 1/4, even if the raters made decisions without seeing the subjects. • This would make unweighted Po equal to 25%, giving a false impression of consistency. • Chance-corrected agreement coefficients aim to cancel out this bias in Po. • The Cohen & Fleiss Kappas and the AC1 & AC2 coefficients are chance-corrected. • ICC may be considered chance-corrected also.

  8. Formulation • Chance-corrected agreement coefficients are expressed as: κ = (Po − Pe) / (1 − Pe). • Po is as defined before. • Pe is an estimate of the probability of agreement by chance. • For almost complete agreement, Po → 1 & κ → 1, unless Pe is also close to 1. • The problem lies with estimating Pe. • With the Cohen & Fleiss Kappas, this is done from the actual scores given to the subjects in the trial. • It must be assumed that the subjects are a reasonable sample of the usual population of patients. (A sketch in code follows.)
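
As a minimal sketch of this formulation (plain Python, two raters, hypothetical scores; not code from the paper), with Pe estimated Cohen-style from each rater's marginal score distribution:

```python
from collections import Counter

def cohen_kappa(scores1, scores2):
    """Chance-corrected agreement kappa = (Po - Pe) / (1 - Pe) for two raters.

    Pe is estimated Cohen-style from each rater's marginal distribution,
    i.e. from the scores actually given to the subjects in the trial."""
    n = len(scores1)
    po = sum(a == b for a, b in zip(scores1, scores2)) / n   # observed agreement
    m1, m2 = Counter(scores1), Counter(scores2)
    pe = sum((m1[c] / n) * (m2[c] / n) for c in m1 | m2)     # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical ordinal scores from two raters for 8 subjects
r1 = [1, 2, 2, 3, 4, 4, 1, 3]
r2 = [1, 2, 3, 3, 4, 4, 2, 3]
print(cohen_kappa(r1, r2))  # Po = 0.75 here; kappa corrects for chance
```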

  9. Brennan-Prediger Coefficient • The simplest chance-corrected agreement coefficient: BP = (Po − 1/Q) / (1 − 1/Q). • This is the categorical version; the ordinal version is in the paper. • The estimate of Pe assumes a population of subjects whose scores are evenly distributed among the Q possible scores. • The estimate is not dependent on the scores provided, and does not assume that they are representative of the population. • If Q = 4, Pe = 1/Q = 1/4 = 25%. (Sketch below.)
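
A corresponding sketch of the categorical Brennan-Prediger coefficient; note that Pe = 1/Q is fixed in advance and ignores the observed scores:

```python
def brennan_prediger(scores1, scores2, q):
    """Brennan-Prediger coefficient: Pe = 1/Q, assuming scores in a typical
    population would be evenly distributed over the Q categories."""
    n = len(scores1)
    po = sum(a == b for a, b in zip(scores1, scores2)) / n
    pe = 1.0 / q                     # independent of the observed scores
    return (po - pe) / (1 - pe)

print(brennan_prediger([1, 1, 2, 3], [1, 2, 2, 3], q=4))  # Po=0.75, Pe=0.25 -> 2/3
```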

  10. Gwet's Paradox • There is controversy about how the Cohen & Fleiss Kappas estimate by-chance agreement. • Different approaches, such as AC1 & AC2 by Gwet, are gaining currency. • The deficiencies of the Cohen & Fleiss Kappas are illustrated by this example: • There are 2 raters for 20 subjects with 2 scoring categories, 1 & 2. • Rater 1 scores all in category 1 & rater 2 scores 18 out of 20 in category 1. • It may appear that there is a high level of agreement. • However, since Po = 0.9 & Pe ≈ 0.9, both Kappas give a value close to 0, indicating little or no agreement. (Numerical check below.)
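
The paradox numbers can be verified directly; a self-contained check using the same Cohen-style Pe as in the earlier sketch:

```python
from collections import Counter

def cohen_kappa(s1, s2):
    n = len(s1)
    po = sum(a == b for a, b in zip(s1, s2)) / n
    m1, m2 = Counter(s1), Counter(s2)
    pe = sum((m1[c] / n) * (m2[c] / n) for c in m1 | m2)
    return (po - pe) / (1 - pe)

rater1 = [1] * 20                # all 20 subjects in category 1
rater2 = [1] * 18 + [2] * 2      # 18 of 20 subjects in category 1
# Po = 0.9 but Pe = 1.0 * 0.9 = 0.9, so kappa = 0 despite 90% raw agreement
print(cohen_kappa(rater1, rater2))  # 0.0
```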

  11. Explanation • The problem lies with the estimation of Pe: almost all agreement is classified as agreement by chance. • This happens because the 20 subjects are not representative of a population considered typical. • We should specify the expected scoring characteristics for this population. • If the overwhelming majority of subjects are expected to be scored as category 1, then the Cohen & Fleiss Kappas may be adequate. • However, we normally do not expect such a population, which makes the previous result appear to be a paradox. • The Brennan-Prediger coefficient assumes an even distribution of scores in the population, which eliminates the paradox. • However, we may have reason to expect a different distribution; then the Cohen & Fleiss Kappas may prove useful, but some modifications are needed.

  12. Further Investigation of the Paradox • Assume raters must give scores in Q categories. • Let πk be an estimate of the probability that a rater will choose category k. • The paradox occurs whenever πk → 1 for some value of k. • Assume this value of k is 1; then all other values of πk → 0.

  13. Our investigation • Randomly generate a set of scores for 50 subjects, 5 raters & Q = 4 categories. • Make the probability of getting score k equal to πk for each k. • Initially, make πk = 1/Q for k = 1, 2, …, Q, which makes all scores equally probable and gives Pe = 1/Q for both the Cohen & Fleiss Kappas. • Then randomly generate further sets of scores, making π1 increase towards 1 & the other πk values decrease. • This generates a series of scoring patterns that gradually approach the maximally concentrated distribution, where all raters give score 1 to all subjects. (A simulation sketch follows.)
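
A sketch of this experiment (hypothetical seed and sweep values; Fleiss-style Pe computed from the pooled category proportions). Because the simulated raters score independently at random, the Kappa stays near zero even as π1 approaches 1:

```python
import random

def fleiss_kappa(counts, R):
    """counts[i][k] = number of the R raters giving subject i score k."""
    N, Q = len(counts), len(counts[0])
    po = sum(sum(c * (c - 1) for c in row) / (R * (R - 1)) for row in counts) / N
    p = [sum(row[k] for row in counts) / (N * R) for k in range(Q)]
    pe = sum(pk * pk for pk in p)
    return float("nan") if pe == 1 else (po - pe) / (1 - pe)

random.seed(1)
N, R, Q = 50, 5, 4
for pi1 in (0.25, 0.5, 0.75, 0.9, 0.99):             # concentration on category 1
    probs = [pi1] + [(1 - pi1) / (Q - 1)] * (Q - 1)
    counts = []
    for _ in range(N):
        row = [0] * Q
        for _ in range(R):                            # each rater scores at random
            row[random.choices(range(Q), weights=probs)[0]] += 1
        counts.append(row)
    print(pi1, round(fleiss_kappa(counts, R), 3))     # stays near 0 as pi1 -> 1
```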

  14. Resulting Kappas & Coefficients: plot against π1 of AC1, Brennan-Prediger, and the Cohen & Fleiss Kappas (figure shown on slide).

  15. Variation of Pe: plot against π1 of Pe(CohenK & FleissK), Pe(Brennan-Prediger) = 1/4, and Pe(AC1) (figure shown on slide).

  16. Gwet's AC1 & AC2 Coefficients • These generalise the unweighted & weighted Brennan-Prediger coefficients. • To estimate Pe, Gwet uses Aickin's idea: • Divide subjects into 'hard to score' & 'easy to score' subjects. • Estimate Pe for the 'hard' subjects only. • Easy subjects are disregarded, assuming any agreement on them is not 'by chance'. • Gwet defines P(R) as the probability of selecting a 'hard to score' subject. • P(R) is estimated from the distribution of scores given by the R raters. • The degree to which raters give different scores is presumed to determine the degree to which the subjects are hard to score, since there is less agreement in the scoring. • Subjects whose scores agree are presumed to be easier to score. • Gwet's 'probability of hardness' formula is: P(R) = Σk πk(1 − πk) / (1 − 1/Q).
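
The formula is not legible in this transcript; the form above is the normalized-variance expression that Gwet's AC1 chance term reduces to, so treat the normalization by 1 − 1/Q as our reading rather than a quotation. A sketch, with πk the pooled proportion of scores in category k:

```python
def hardness_weight(pi, Q):
    """P(R) = sum_k pi_k (1 - pi_k) / (1 - 1/Q): equals 1 for an even
    spread of scores, 0 when all scores fall in one category."""
    return sum(p * (1 - p) for p in pi) / (1 - 1 / Q)

print(hardness_weight([0.25, 0.25, 0.25, 0.25], 4))  # 1.0 (even spread)
print(hardness_weight([1.0, 0.0, 0.0, 0.0], 4))      # 0.0 (one category)
```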

  17. P(R) plotted against π1 (figure shown on slide) • P(R) decreases as π1 approaches 1, corresponding to maximum concentration of scores. • Compare with the subject-by-subject approach (mentioned later) by plotting P(R) & E on the same axes.

  18. Formula for AC1 • AC1 for categorical scoring: AC1 = (Po − Pe) / (1 − Pe), with Pe = P(R)/Q = (1/(Q − 1)) Σk πk(1 − πk). • AC2 is for ordinal scoring with weighting (see paper). • The effect of P(R) is to reduce the estimate of Pe as the probability of subjects being hard to score decreases. • AC1 & the corresponding Pe were plotted against π1 in the earlier graphs. (Sketch below.)
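
A two-rater sketch of categorical AC1 under the same reading (Pe = P(R)/Q). Applied to the paradox data from slide 10, AC1 stays high where the Kappas collapsed to zero:

```python
from collections import Counter

def gwet_ac1(s1, s2, Q):
    """Gwet's AC1 for two raters, categorical scoring.
    Pe = (1/(Q-1)) * sum_k pi_k (1 - pi_k), i.e. P(R)/Q."""
    n = len(s1)
    po = sum(a == b for a, b in zip(s1, s2)) / n
    pooled = Counter(s1) + Counter(s2)            # pooled category counts
    pi = [pooled[c] / (2 * n) for c in pooled]
    pe = sum(p * (1 - p) for p in pi) / (Q - 1)
    return (po - pe) / (1 - pe)

# Paradox data from slide 10: AC1 ~ 0.89 where the Kappas gave 0
print(gwet_ac1([1] * 20, [1] * 18 + [2] * 2, Q=2))
```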

  19. Rationale questioned • It is reasonable that Pe decreases as the spread of the distribution of scores decreases: in this case, subjects are indeed likely to become easier to score. • But subjects may become easier to score without the scores being concentrated on a single score, as illustrated below (example scoring table shown on slide): • Almost all scores agree, so these subjects are easy to score. • But P(R) will be close to 1 for this example, and P(R) is said to be the 'probability of subjects being hard to score'. • ⇒ this description of P(R) appears to be misleading. • It is better to describe P(R) as the degree to which the N subjects are likely to be representative of a typical population of subjects. • Gwet's approach still works, but the rationale is different: P(R) is defined from the distribution of scores, not hardness. (An illustration follows.)
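
One hypothetical illustration of the case described above: subjects spread evenly over the Q categories with perfect agreement on every subject, so scoring is clearly easy, yet P(R) takes its maximum value of 1:

```python
def hardness_weight(pi, Q):
    return sum(p * (1 - p) for p in pi) / (1 - 1 / Q)

# 4 subjects, 2 raters, Q = 4: both raters give identical scores to every
# subject, with subjects spread evenly across the categories.
r1 = [1, 2, 3, 4]
r2 = [1, 2, 3, 4]
pi = [0.25, 0.25, 0.25, 0.25]        # pooled category proportions
print(hardness_weight(pi, 4))        # 1.0: "maximally hard" according to
                                     # P(R), yet Po = 1.0, so scoring was easy
```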

  20. Application of P(R) to Cohen & Fleiss • There is a case for applying a measure of the degree to which the N subjects are representative of the population to both the Cohen & Fleiss Kappas. • Taking P(R) as such a measure, multiplying the equations for Pe by P(R) gives the graphs referred to as CohenK-AC1 and FleissK-AC1 on the next slide. • The graphs almost coincide, and the paradox exhibited by the Cohen & Fleiss Kappas has now been eliminated. (Sketch below.)
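
A sketch of this modification for the two-rater Cohen case, reading "multiplying the equations for Pe" as scaling Cohen's Pe by P(R). On the paradox data it gives a value close to AC1's, consistent with the near-coincident graphs:

```python
from collections import Counter

def cohen_kappa_ac1(s1, s2, Q):
    """Cohen kappa with its chance term Pe scaled by Gwet's P(R) weight."""
    n = len(s1)
    po = sum(a == b for a, b in zip(s1, s2)) / n
    m1, m2 = Counter(s1), Counter(s2)
    pe = sum((m1[c] / n) * (m2[c] / n) for c in m1 | m2)   # Cohen's Pe
    pooled = Counter(s1) + Counter(s2)
    pi = [pooled[c] / (2 * n) for c in pooled]
    p_r = sum(p * (1 - p) for p in pi) / (1 - 1 / Q)       # P(R)
    pe_mod = p_r * pe                                       # scaled chance term
    return (po - pe_mod) / (1 - pe_mod)

# Paradox data again: the modified kappa (~0.88) no longer collapses to zero
print(cohen_kappa_ac1([1] * 20, [1] * 18 + [2] * 2, Q=2))
```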

  21. Modified Cohen & Fleiss Kappas: plot against π1 of AC1, Cohen-S/S & Fleiss-S/S, and Cohen-AC1 & Fleiss-AC1 (figure shown on slide).

  22. Subject-by-Subject Implementation • A more direct way of implementing Gwet's idea. • Define a 'by chance' probability, E(i), for each subject i, according to its own scores & the overall spread of rater scores. • Then the contribution to Pe of any 'by chance' disagreement within rater pairs scoring subjects i and j may be scaled by max(E(i), E(j)). • A simple & obvious way of defining E(i) is: (formula shown on slide). • The differences between the subject-by-subject approach & Gwet's approach are illustrated by comparing the curves for P(R) & E on the earlier slide: both reduce to zero as concentration on a single score increases. (An illustrative sketch follows.)
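
The slide's E(i) formula is not legible in this transcript, so the sketch below uses a purely illustrative stand-in (NOT the paper's definition): a per-subject analogue of P(R) built from subject i's own score counts, which does reduce to zero as the scores concentrate on a single category:

```python
def e_i(row, R, Q):
    """Hypothetical per-subject 'by chance' weight (not the paper's formula):
    a per-subject analogue of P(R), computed from row[k], the number of the
    R raters giving subject i score k. Equals 0 when all R raters agree."""
    pi = [c / R for c in row]
    return sum(p * (1 - p) for p in pi) / (1 - 1 / Q)

print(e_i([5, 0, 0, 0], R=5, Q=4))   # 0.0: all raters agree on this subject
print(e_i([2, 1, 1, 1], R=5, Q=4))   # 0.96: scores widely spread
```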

  23. Intra-class Correlation (ICC) • The ICC coefficient may be used as a consistency measure. • It has been shown that ICC is exactly equal to the quadratically weighted Fleiss Kappa, so it incorporates a correction for chance agreement. • It therefore also produces Gwet's paradox when there is a concentration on one score. • As with the Cohen & Fleiss Kappas, it may be modified by application of Gwet's P(R) function or its subject-by-subject implementation.

  24. Conclusions • There is a fundamental flaw with the Cohen & Fleiss Kappas & ICC, leading to a paradox when scores are concentrated on one score. • It arises because the subjects & rater scores are used to estimate Pe. • The Cohen & Fleiss Kappas assume that the subjects provided are representative of a typical population of subjects. • The paradox occurs because the subjects & scores are considered not representative. • The Kappas look wrong but are correct if the scores are representative. • The Brennan-Prediger coefficient correctly estimates Pe for an even distribution of scoring. • To cater for other distributions, Gwet sets out to improve this coefficient by de-emphasising contributions from subjects considered easy to score, since these subjects are considered unlikely to have been scored by chance. • The de-emphasis is achieved by multiplying each contribution by P(R). • P(R) is described by Gwet as the probability of subjects being hard to score. • This description is misleading, but the method remains valid. • P(R) may be applied to the Cohen & Fleiss Kappas & ICC. • The subject-by-subject implementation eliminates the paradox & has the potential to improve the estimate of the population statistics.
