Reliability Session
• Definitions & Basic Concepts of Reliability
• Theoretical Approaches
• Empirical Assessments of Reliability
• Interpreting Coefficients
1. Conceptions of Reliability
“My car won’t start!”
“This patient is often late!” *
Roughly: an observed score = the true score ± the S.E.M. (standard error of measurement)
* Does this make him reliable or unreliable?
Classic view of the components of a measurement
Measured Value = True Value + Systematic Error (Bias) + Random Error
The usefulness of a score depends on the ratio of its true-value component to any error variance it contains.
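To make the additive model concrete, here is a minimal simulation sketch in Python (all numbers are invented for illustration): a fixed true value is distorted by a constant bias and by random error. Over many readings the random component averages away, but the bias does not.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000
true_value = 120.0                        # e.g. a "true" systolic blood pressure
bias = 5.0                                # systematic error: device reads 5 mmHg high
random_error = rng.normal(0, 8, size=n)   # random error, SD = 8 mmHg

measured = true_value + bias + random_error

print(f"Mean of measurements: {measured.mean():.1f}")  # ~125: bias does not average away
print(f"SD of measurements:   {measured.std():.1f}")   # ~8: reflects the random error
```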
Several sources of variance in test scores: which to include in estimating reliability?
• Variance between patients
• Variance due to different observers
• Fluctuations over time: day of week or time of day
• Changes in the measurement instrument (e.g. reagents degrade)
• Changes in definitions (e.g. revised diagnostic codes)
• Random errors (various sources)
Reliability
Reliability = Subject Variability / (Subject Variability + Measurement Error)
or, including observer effects:
Reliability = Subject Variability / (Subject Variability + Observer Variability + Measurement Error)
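A small numeric sketch of these two ratios, using made-up variance components, shows how adding observer variability to the denominator lowers the coefficient:

```python
# Illustrative (invented) variance components.
subject_var  = 100.0   # true differences between patients
observer_var = 10.0    # disagreement between raters
error_var    = 15.0    # residual measurement error

reliability_simple = subject_var / (subject_var + error_var)
reliability_full   = subject_var / (subject_var + observer_var + error_var)

print(f"Ignoring observers:  {reliability_simple:.2f}")  # 0.87
print(f"Including observers: {reliability_full:.2f}")    # 0.80
```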
Generalizability Theory
• An ANOVA model that estimates each source of variability separately:
  • Observer inconsistency over time
  • Discrepancies between observers
  • Changes in the subject being assessed over time
• Quantifies each of these sources
• Helps to show how to optimize the design (and administration) of a test, given these performance characteristics.
2. Classical Test Theory
• Distinguishes random error from systematic error (bias). Random error = unreliability; bias = invalidity.
• Classical test theory assumes:
  • Errors are independent of the score (i.e. similar errors occur at all levels of the variable being measured)
  • The mean of the errors is zero (some increase and some decrease the score; these errors balance out)
• Hence random errors tend to cancel out if enough observations are made, so a large sample can give you an accurate estimate of the population mean even if the measure is unreliable. Useful!
• From the above: Observed score = True score + Error (additive: no interaction between score and error)
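A quick simulation of the "errors cancel out" point, assuming the classical model Observed = True + Error with invented numbers: even a quite unreliable measure can estimate the group mean accurately.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
true_scores = rng.normal(50, 10, size=n)   # true trait values in the population
errors = rng.normal(0, 15, size=n)         # large random error: an unreliable measure
observed = true_scores + errors            # classical model: Observed = True + Error

print(f"True mean:     {true_scores.mean():.2f}")  # ~50
print(f"Observed mean: {observed.mean():.2f}")     # also ~50: errors cancel in the aggregate
# Reliability here is var(T)/var(X) = 100/(100 + 225) ≈ 0.31,
# yet the group mean is still estimated accurately.
```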
Reliability versus Sensitivity of a Measurement: Metaphor of the Combs
A fine-grained scale may produce more error variance.
A coarse measure will appear more stable, but is less sensitive.
Reliability and Precision
Some sciences use ‘precision’ to refer to the close grouping of results that, in the metaphor of the shooting target, we called ‘reliability’. You may also see ‘accuracy’ used in place of our ‘validity’. These terms are common in laboratory disciplines, and you should be aware of the contrasting usage. In part the difference arises because social-science measurements need to distinguish 3 concepts: reliability, validity, and the level of detail a measure is capable of revealing (the number of significant digits it provides). Thus, rating pain as “moderate” is imprecise, and yet could be done reliably, and it may also be valid (as far as we can tell!). By contrast, mechanical measurements in the laboratory sciences can be sufficiently consistent that they have little need for our concept of reliability.
3. Consistency over Time, and Internal Consistency
The basic way to test reliability is to repeat the measurement: if you get the same score, the measure is reliable. This works well for fixed attributes (your height), but runs into a problem with attributes that genuinely change over time (such as your health): the resulting low agreement would give a falsely negative impression of reliability. Another problem with repeating measures that use questions (rather than physical measurements) is that people may remember their replies, perhaps falsely inflating reliability. What can we do?
Internal Consistency
To avoid the problem that people remember their answers, you could correlate different, but equivalent, versions of the test.
• For example, divide the whole test into two halves and correlate them. This is called “split-half reliability”.
• You could apply the second version after a time delay, and it would give a less biased indication of test-retest reliability, as long as the 2 halves really were equivalent.
• The equivalence of the 2 halves is called “internal consistency”.
• How can you show that the 2 halves really are equivalent? Correlate them, but without the time delay: ask all the items on one day, and if the scores for the 2 halves correlate highly, they measure the same thing. That means they should also correlate highly after a time delay (giving a high test-retest reliability, assuming the person’s health remained stable).
Internal Consistency (2)
Now comes the really clever idea: if the test really is internally consistent (the scores for those 2 halves correlate highly), then it will have high test-retest reliability. The idea of internal consistency is that a reliable test is one whose items are very similar. So (brilliant idea!) you do not have to do the re-test: internal consistency can replace it. Much easier: no re-testing required! But perhaps, by chance, the division into 2 halves does not give perfectly equivalent sub-tests. So why not try several different ways of splitting the whole scale into 2 parts? Kuder & Richardson worked out formulae for estimating the internal consistency of a set of items split in every possible way. Cronbach’s alpha is the statistic we now normally use to estimate internal consistency.
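One way to see what alpha computes: below is a minimal implementation of the standard formula, alpha = k/(k−1) × (1 − Σ item variances / variance of the total score), applied to simulated item scores (the data and sample sizes are invented for illustration).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated data: 5 items all driven by one underlying trait, plus noise.
rng = np.random.default_rng(1)
trait = rng.normal(0, 1, size=(200, 1))
scores = trait + rng.normal(0, 0.8, size=(200, 5))
print(f"alpha = {cronbach_alpha(scores):.2f}")   # high, since the items share the trait
```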
Number of Items and Reliability
You can think of internal consistency in terms of item-total correlations: if each item correlates with the overall score, the items are measuring a consistent theme (e.g. depression) and the scale is internally consistent. Alpha builds on this idea; it rises with the number of items and with the average correlation among them. Adding more items will therefore increase internal-consistency reliability, up to a point. Spearman and Brown worked out the formula linking scale length to reliability: for k items each of reliability r, the predicted reliability of the whole scale is k·r / (1 + (k − 1)·r).
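The Spearman-Brown formula in code, tabulating predicted reliability as items are added (the single-item reliability of 0.3 is chosen arbitrarily to illustrate the plateau described on the next slide):

```python
def spearman_brown(r_item: float, k: int) -> float:
    """Predicted reliability of a k-item scale from the single-item reliability."""
    return k * r_item / (1 + (k - 1) * r_item)

# Items of intermediate reliability (r = 0.3): gains flatten after ~10 items.
for k in (1, 2, 5, 7, 10, 20, 40):
    print(f"{k:>2} items: predicted reliability = {spearman_brown(0.3, k):.2f}")
```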
Scale Length and Reliability
[Figure: total-scale reliability plotted against number of items, with separate curves for different values of the reliability of each item.]
Conclusion: with items of intermediate reliability you achieve great improvements up to around 7-10 items. Not much gain beyond 10.
4. Statistics to use: Intra-class correlation vs. Pearson r
[Figure: two test-retest scatterplots. Left: perfect agreement, so ICC = 1.0 and r = 1.0. Right: scores shifted by a constant, so r = 1.0 but ICC < 1.0; systematic error (bias).]
Message: a re-test correlation will ignore a systematic change in scores over time. An ICC measures agreement, so it will penalize retest reliability when a shift occurs. Which do you prefer?
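A sketch of the contrast: Pearson r computed alongside a hand-rolled ICC on retest scores that have all shifted upward by a constant. The data are invented, and the choice of ICC(2,1) (two-way random effects, absolute agreement) is an assumption, since the slide does not specify a variant.

```python
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rating.
    x is an (n_subjects x k_raters) matrix."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-occasion (or per-rater) means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-subjects MS
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-occasions MS
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                        # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

test = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
retest = test + 2.0   # every score shifts up by 2 points: pure systematic bias

r = np.corrcoef(test, retest)[0, 1]
icc = icc_2_1(np.column_stack([test, retest]))
print(f"Pearson r = {r:.2f}")    # 1.00: blind to the shift
print(f"ICC(2,1)  = {icc:.2f}")  # 0.83: penalizes the systematic change
```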
Comments on statistics
• ICC measures concordance; a value above 0.75 is generally regarded as excellent reliability.
• The limits-of-agreement statistic plots the mean difference between two tests, plus or minus roughly 2 standard deviations of the differences (Bland JM & Altman DG, Comput Biol Med 1990;20:337, and Lancet 1995;346:1085).
• In test-retest (rtt) studies, varying the delay between retest trials may not make much difference, especially for factual measures like ADL. For SF-36 scores, there was no consistent difference between 2-day and 2-week rtt (Marx RG, J Clin Epidemiol 2003;56:730).
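A minimal Bland-Altman limits-of-agreement calculation on simulated paired measurements (the method names and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
method_a = rng.normal(100, 15, size=50)           # readings from method A
method_b = method_a + rng.normal(3, 5, size=50)   # method B reads ~3 units higher

diff = method_b - method_a
mean_diff = diff.mean()
half_width = 1.96 * diff.std(ddof=1)              # ±1.96 SD of the differences

print(f"Mean difference: {mean_diff:.1f}")
print(f"95% limits of agreement: {mean_diff - half_width:.1f} "
      f"to {mean_diff + half_width:.1f}")
```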
Self-test fun time! What is the reliability when:
• Every student is rated “above average”?
• Physician A rates every BP as 5 mm Hg higher than physician B?
• The measure is applied to a different population?
• The observers change?
• The patients do, in reality, improve over time?