Measuring Agreement
Introduction • Different types of agreement • Diagnosis by different methods • Do both methods give the same results? • Disease absent or Disease present • Staging of carcinomas • Will different methods lead to the same results? • Will different raters lead to the same results? • Measurements of blood pressure • How consistent are measurements made • Using different devices? • With different observers? • At different times?
Investigating agreement • Need to consider • Data type • Categorical or continuous • How are the data repeated? • Measuring instrument(s), rater(s), time(s) • The goal • Are ratings consistent? • Estimate the magnitude of differences between measurements • Investigate factors that affect ratings • Number of raters
Data type • Categorical • Binary • Disease absent, disease present • Nominal • Hepatitis • Viral A, B, C, D, E or autoimmune • Ordinal • Severity of disease • Mild, moderate, severe • Continuous • Size of tumour • Blood pressure
How are data repeated? • Same person, same measuring instrument • Different observers • Inter-rater reliability • Same observer at different times • Intra-rater reliability • Repeatability • Internal consistency • Do the items of a test measure the same attribute?
Measures of agreement • Categorical • Kappa • Weighted • Fleiss’ • Continuous • Limits of agreement • Coefficient of variation (CV) • Intraclass Correlation (ICC) • Cronbach’s α • Internal consistency
Number of raters • Two • Three or more
Categorical data: two raters • Kappa • Magnitude quoted on one of two scales • ≥0.75 Excellent, 0.40 to 0.75 Fair to good, <0.40 Poor • 0 to 0.20 Slight, >0.20 to 0.40 Fair, >0.40 to 0.60 Moderate, >0.60 to 0.80 Substantial, >0.80 Almost perfect • Degree of disagreement can be included • Weighted kappa • Values close together do not count towards disagreement as much as those further apart • Linear / quadratic weightings
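The kappa and weighted kappa described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculation behind any figure in these slides: the function name `cohen_kappa`, its arguments and the example scores are all assumptions made for the sketch. It uses the disagreement-weight form, kappa = 1 - (observed weighted disagreement) / (chance-expected weighted disagreement), with weights |i - j| / (k - 1) for linear weighting and the square of that for quadratic weighting.

```python
def cohen_kappa(rater1, rater2, categories, weights=None):
    """Cohen's kappa for two raters (illustrative sketch).

    categories: every possible score, in order (order matters when weighting).
    weights: None (unweighted), "linear" or "quadratic".
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Cross-tabulate the two sets of ratings as counts.
    table = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        table[index[a]][index[b]] += 1

    # Marginal proportions for each rater.
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def disagreement(i, j):
        # 0 on the diagonal; grows with distance between categories when weighted.
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d ** 2

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(disagreement(i, j) * table[i][j] / n for i, j in cells)
    expected = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1 - observed / expected


# Made-up scores on a 1 to 5 scale: every disagreement is a near miss.
scores_a = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
scores_b = [1, 2, 3, 4, 5, 2, 3, 4, 5, 4]
for w in (None, "linear", "quadratic"):
    print(w, round(cohen_kappa(scores_a, scores_b, [1, 2, 3, 4, 5], w), 3))
# None 0.375, linear 0.675, quadratic 0.865
```

Because all the disagreements in the made-up data are one category apart, the weighted values are higher than the unweighted one, which is the same ordering seen in Example 1.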
Categorical data: > two raters • Different tests for • Binomial data • Data with more than two categories • Online calculators • http://www.vassarstats.net/kappa.html
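For three or more raters, Fleiss' kappa (listed earlier among the measures of agreement) is one option. The sketch below is a hedged illustration using the usual textbook formula; the function name `fleiss_kappa`, the input layout and the example counts are assumptions, and it requires every subject to receive the same number of ratings.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa (illustrative sketch).

    counts[i][j] = number of raters who placed subject i in category j.
    Every subject must have the same total number of ratings.
    """
    n_subjects = len(counts)
    n_ratings = sum(counts[0])
    n_categories = len(counts[0])
    total = n_subjects * n_ratings

    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Per subject: proportion of rater pairs that agree.
    p_subj = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]

    p_bar = sum(p_subj) / n_subjects      # mean observed agreement
    p_exp = sum(p * p for p in p_cat)     # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# Made-up data: 4 subjects, 3 raters, categories (absent, present).
example = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(round(fleiss_kappa(example), 3))
```

The same function covers binomial data and data with more than two categories, since the number of columns in `counts` is not fixed.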
Example 1 • Two raters • Scores 1 to 5 • Unweighted kappa 0.79, 95% CI (0.62 to 0.96) • Linear weighting 0.84, 95% CI (0.70 to 0.98) • Quadratic weighting 0.90, 95% CI (0.77 to 1.00)
Example 2 • Binomial data • Three raters • Two ratings each • Inter-rater agreement • Intra-rater agreement
Example 2 ctd. • Inter-rater agreement • Raters 1 and 2: kappa = 0.865 (P<0.001) • Raters 1 and 3: kappa = 0.054 (P=0.765) • Raters 2 and 3: kappa = -0.071 (P=0.696) • Intra-rater agreement • Rater 1: kappa = 0.800 (P<0.001) • Rater 2: kappa = 0.790 (P<0.001) • Rater 3: kappa = 0.000 (P=1.000)
Continuous data • Test for bias • Check differences not related to magnitude • Calculate mean and SD of differences • Limits of agreement • Coefficient of variation • ICC
Test for bias • Student’s paired t (mean) • Wilcoxon matched pairs (median) • If there is bias, agreement cannot be investigated further
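As a rough sketch of the paired t test for bias, the statistic is the mean of the paired differences divided by its standard error. The function name `paired_t` and the data are made up for illustration, and only the standard library is used, so the sketch returns the t statistic and degrees of freedom and leaves the P value to a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(method_a, method_b):
    """Paired t statistic and degrees of freedom for two sets of measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)   # stdev uses the n - 1 divisor
    return mean(diffs) / standard_error, n - 1


# Made-up measurements on four individuals by two methods.
t, df = paired_t([10, 12, 14, 16], [9, 12, 13, 15])
print(t, df)
```

A mean difference that is large relative to its standard error (large |t|) points to systematic bias between the two methods.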
Example 3: Test for bias • Paired t test • P=0.362 • No bias
Check differences unrelated to magnitude • Clearly no relationship
Calculate mean and SD of differences • [Output annotated to mark the mean of the differences and their standard deviation, s]
Limits of agreement • Lower limit of agreement (LLA) = mean - 1.96×s = -37.6 • Upper limit of agreement (ULA) = mean + 1.96×s = 47.5 • About 95% of differences between a pair of measurements for an individual are expected to lie in (-37.6, 47.5)
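The limits of agreement follow directly from the mean and SD of the paired differences. The sketch below assumes two equal-length lists of measurements; the function name `limits_of_agreement` and the data are illustrative only and are not the data behind the figures quoted above.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (lower, upper) 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    d_bar = mean(diffs)    # mean difference (the bias)
    s = stdev(diffs)       # SD of the differences, n - 1 divisor
    return d_bar - z * s, d_bar + z * s


# Made-up measurements on four individuals by two methods.
lower, upper = limits_of_agreement([10, 12, 14, 16], [9, 12, 13, 15])
print(round(lower, 2), round(upper, 2))
```

Whether limits of a given width are acceptable is a clinical judgement rather than a statistical one.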
Coefficient of variation • Measure of variability of differences • Expressed as a proportion of the average measured value • Suitable when error (the differences between pairs) increases with the measured values • Other measures require this not to be the case • 100 × s ÷ mean of the measurements • 100 × 21.72 ÷ 447.88 = 4.85%
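The arithmetic above can be reproduced as a short sketch. It assumes, as the slide's formula suggests, that s is the SD of the paired differences and that the denominator is the mean of all measurements; the function name and the example lists are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(first, second):
    """CV (%) = 100 x SD of paired differences / mean of all measurements."""
    diffs = [a - b for a, b in zip(first, second)]
    overall_mean = mean(list(first) + list(second))
    return 100 * stdev(diffs) / overall_mean


# Made-up paired measurements.
print(round(coefficient_of_variation([10, 12, 14, 16], [9, 12, 13, 15]), 2))

# Check of the figures quoted on the slide: s = 21.72, mean = 447.88.
print(round(100 * 21.72 / 447.88, 2))  # 4.85
```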
Intraclass Correlation • Continuous data • Two or more sets of measurements • Measure of correlation that adjusts for differences in scale • Several models • Absolute agreement or consistency • Raters chosen randomly or same raters throughout • Single or average measures
Intraclass Correlation • ≥0.75 Excellent • 0.4 to 0.75 Fair to Good • <0.4 Poor
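To illustrate how the choice between absolute agreement and consistency changes the result, here is a hedged sketch of two single-measure ICCs computed from a two-way ANOVA decomposition. In the Shrout and Fleiss labelling these are ICC(2,1) (absolute agreement, raters treated as random) and ICC(3,1) (consistency, same raters throughout). The function name `icc_single` and the data are assumptions; complete data with no missing values are required.

```python
def icc_single(scores):
    """scores[i][j] = measurement of subject i by rater j.

    Returns (absolute-agreement ICC, consistency ICC), single measures.
    """
    n = len(scores)       # subjects
    k = len(scores[0])    # raters
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return agreement, consistency


# Made-up data: rater 2 always reads exactly 5 units higher than rater 1.
print(icc_single([[1, 6], [2, 7], [3, 8]]))
```

In the made-up data the raters rank subjects identically, so the consistency ICC is 1, while the constant offset between raters pulls the absolute-agreement ICC far below the 0.4 threshold.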
Cronbach’s α • Internal consistency • Total scores • Several components. • α ≥0.8 good • ≥0.7 adequate
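Cronbach's α can be sketched from the item variances and the variance of the total scores: α = k / (k - 1) × (1 - sum of item variances / variance of totals), where k is the number of items. The function name `cronbach_alpha` and the questionnaire scores below are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """scores[i][j] = score of respondent i on item (component) j."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up questionnaire: 3 respondents, 2 items.
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3]]), 3))  # 0.667
```

When all items move together perfectly, the variance of the totals is as large as it can be relative to the item variances and α reaches 1.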
Investigating agreement • Data type • Categorical • Kappa • Continuous • Limits of agreement • Coefficient of variation • Intraclass correlation • How are the data repeated? • Measuring instrument(s), rater(s), time(s) • Number of raters • Two • Straightforward • Three or more • Help!