Class 4 Basic Psychometric Characteristics: Variability, Reliability, Interpretability October 15, 2009. Anita L. Stewart Institute for Health & Aging University of California, San Francisco. Overview of Class 4. Concepts of error, sources of error and bias in measures.
Class 4 Basic Psychometric Characteristics: Variability, Reliability, Interpretability October 15, 2009 Anita L. Stewart Institute for Health & Aging University of California, San Francisco
Overview of Class 4 • Concepts of error, sources of error and bias in measures. • Indicators of variability and reasons for poor variability • Indicators of reliability • Interpretability of scores
Components of an Individual’s Observed Item Score (Simplistic view) • Observed item score = true score + error
Components of an Individual’s Observed Item Score • Observed item score = true score + error • True score: “score that would be obtained over repeated testings” (Nunnally, 1994, p211)
Random versus Systematic Error • Observed item score = true score + random error + systematic error
Random versus Systematic Error • Observed item score = true score + random error + systematic error • Random error: relevant to reliability • Systematic error: relevant to validity
Components of Variability in Item Scores of a Group of Individuals • Observed score variance = true score variance + error variance • Observed score variance is the total variance (variation across all observed item scores)
Components of Variability in Item Scores of a Group of Individuals • Observed score variance = true score variance + (random) error variance • Observed score variance is the total variance (variation across all observed item scores)
Combining Items into Multi-Item Scales • When items are combined into a summated scale, random error to some extent “cancels out” • Error variance reduced as # items increases • Reducing random error increases amount of “true score” variance
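The "cancelling out" of random error can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not part of the original slides: every item is assumed to equal the person's true score plus independent random error, both drawn from a standard normal distribution, and the function name `error_share` is hypothetical.

```python
import random
import statistics


def error_share(n_items, n_people=2000, seed=0):
    """Approximate share of summated-scale variance that is random error.

    Assumption (not from the slides): item score = true score + error,
    with true scores and errors both drawn from N(0, 1).
    """
    rng = random.Random(seed)
    error_sums = []
    scale_scores = []
    for _ in range(n_people):
        true_score = rng.gauss(0.0, 1.0)
        # Independent random error on each item
        error_sum = sum(rng.gauss(0.0, 1.0) for _ in range(n_items))
        error_sums.append(error_sum)
        # Summated scale: each item contributes true score + its own error
        scale_scores.append(n_items * true_score + error_sum)
    return statistics.pvariance(error_sums) / statistics.pvariance(scale_scores)


for k in (1, 5, 20):
    print(f"{k:2d} items: error share of scale variance = {error_share(k):.2f}")
```

Under these assumptions the error share shrinks as items are added, because the true-score part of the sum grows faster than the accumulated random error.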
Sources of Error • Subjects • Observers or interviewers • Measure or instrument
Example: Measuring Weight of Children • Observed score is a linear combination of many sources of variation for an individual
Measuring Weight in Pounds (Without Shoes) of One Child • Observed weight = true weight (80 lbs) + amount of water in past 30 min + weight of clothes + scale miscalibration + imprecision of person weighing children
Measuring Weight in Pounds (Without Shoes) of One Child • Observed weight (82.1 lbs) = true weight (80 lbs) + amount of water in past 30 min (+.25 lb) + weight of clothes (+.70 lb) + scale miscalibration (+.1 lb) + imprecision of person weighing children (+1 lb) • 82.1 ≈ 80 + .25 + .70 + .1 + 1 (82.05 before rounding)
Sources of Error in Measuring Weight of Children • Weight of clothes: subject source of random error • Scale is miscalibrated: instrument source of systematic error • Person weighing child is not precise: observer source of random error
Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man • Observed depression score = “true” depression (16) + hard to choose number on the 1-6 response choice scale + measure misses 2 culturally-bound symptoms + unwillingness to tell interviewer + poor memory of feelings
Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man • Observed depression score (12) = “true” depression (16) + hard to choose number on the 1-6 response choice scale (+1) + measure misses 2 culturally-bound symptoms (-2) + poor memory of feelings (-1) + unwilling to tell interviewer (-2) • 12 = 16 + 1 - 2 - 1 - 2
Sources of Error in Measuring Depression • Hard to choose one number on 1-6 response scale: subject source of random error • Unwilling to tell interviewer, poor memory of feelings: subject sources of systematic error (underreport true depression) • Measure misses culturally-bound symptoms: instrument source of systematic error (underestimate true depression)
Four Types of Memory Errors: From Cognitive Psychology • Encoding: information inadequately stored in memory • Storage: memory eroded over time • Retrieval: some events/feelings harder to recall • Reconstruction: errors filling in missing pieces. R Tourangeau, Chap 3, in AA Stone et al. (eds) The Science of Self-Report, London: Lawrence Erlbaum, 2000
Memory and Time • Autobiographical memory: memory of things in time and space • Events not encoded with their calendar dates • Thus time is a poor retrieval method • Numerous errors remembering “when” and “how often” something occurred within a particular time frame. N Bradburn, Chap 4, The Science of Self-Report
Memory and Emotion • People tend to remember: • positive experiences more than negative ones • emotionally intense experiences more than neutral ones • non-threatening events more than threatening, sensitive events. Kihlstrom et al, Chap 6, The Science of Self-Report
Overview • Concepts of error • Basic psychometric characteristics • Variability • Reliability • Interpretability
Variability • Good variability • All (or nearly all) scale levels are represented • Distribution approximates bell-shaped normal • Variability is a function of the sample • Need to understand variability of a measure in sample similar to one you are studying • Review criteria • Adequate variability on the latent variable that is relevant to your study
Indicators of Variability • Range of scores • Mean, median, mode • Standard deviation (or standard error) • Interquartile range • Skewness statistic • % at floor (lowest possible score) • % at ceiling (highest possible score)
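The indicators listed above can be computed with Python's standard library. This is an illustrative sketch only: the scores and the 0-10 scale are made-up data, the helper name `variability_summary` is hypothetical, the skewness shown is the simple moment-based coefficient, and `statistics.quantiles` uses its default quartile convention, which can differ slightly from hand calculations.

```python
import statistics


def variability_summary(scores, floor, ceiling):
    """Summarize the variability of a list of scale scores.

    floor / ceiling are the lowest / highest possible scores on the scale.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    q1, _q2, q3 = statistics.quantiles(scores, n=4)  # default convention
    # Moment-based skewness coefficient: third moment / variance ** 1.5
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return {
        "range": (min(scores), max(scores)),
        "mean": mean,
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "sd": statistics.stdev(scores),
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "pct_floor": 100 * sum(x == floor for x in scores) / n,
        "pct_ceiling": 100 * sum(x == ceiling for x in scores) / n,
    }


# Hypothetical scores on a 0-10 scale, bunched at the low end
example_scores = [0, 0, 1, 2, 2, 2, 3, 5, 8, 10]
summary = variability_summary(example_scores, floor=0, ceiling=10)
for name, value in summary.items():
    print(name, value)
```

With these made-up scores the mean sits above the median and the skewness coefficient is positive, which matches a distribution with a long tail to the right.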
Range of Scores: Possible and Observed • Especially important for multi-item measures • Example: • CES-D possible range is 0-30 • Wong et al. study of mothers of young children: observed range was 0-23 • missing entire high end of the distribution (none had high levels of depression)
Mean, Median, Mode • Mean - average • Median - midpoint • Mode - most frequent score • In normally distributed measures, these are all the same • In non-normal distributions, they will vary
Mean and Standard Deviation • Most information on variability is from mean and standard deviation • Can envision how measure is distributed on the possible range • Mean ± 1 SD covers about 68% of the scores (in a normal distribution)
Interquartile Range (IR) • Difference between the 3rd and 1st quartiles IR = Quartile 3 - Quartile 1 • This range contains the middle 50% of the distribution • 25% of the sample is above and 25% is below this range
Quartiles Divide distribution into 4 parts with 25% of the sample in each part (quartiles) • Quartile 1 - the scale score at the boundary of the lowest 25% of the distribution • Quartile 2 - the score that divides the distribution in half (same as the median) • Quartile 3 - the score at the boundary of the highest 25% (25% of the sample scores above this point)
Set of Scores on 12 people
Person: 1 2 3 4 5 6 7 8 9 10 11 12
Score: 2 3 8 1 7 4 4 3 2 7 5 3
Re-arrange scores in numeric order
Person: 4 9 1 8 2 12 7 6 11 10 5 3
Score: 1 2 2 3 3 3 4 4 5 7 7 8
Example of Quartiles: Set of Scores on 12 people • Ordered scores: 1 2 2 | 3 3 3 | 4 4 5 | 7 7 8 • Q1 = 2.5, Q2 = 3.5, Q3 = 6 • Q1 = boundary of the lowest 25% (lowest 3 people) • Q2 = median (50% below, 50% above) • Q3 = boundary of the highest 25% (highest 3 people)
Example of Quartiles: Set of Scores on 12 people • Ordered scores: 1 2 2 | 3 3 3 | 4 4 5 | 7 7 8 • Q1 = 2.5, Q2 = 3.5, Q3 = 6 • Interquartile range = quartile 3 - quartile 1 = 6 - 2.5 = 3.5
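The quartile example can be reproduced with a short script. This sketch assumes the "median of each half" convention, which matches the values on the slide; other quartile conventions (including the default in `statistics.quantiles`) can give slightly different results. The function name `quartiles_by_halves` is hypothetical.

```python
import statistics


def quartiles_by_halves(scores):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of scores, leave the middle score out of both halves
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


# The 12 scores from the example, in person order
scores = [2, 3, 8, 1, 7, 4, 4, 3, 2, 7, 5, 3]
q1, q2, q3 = quartiles_by_halves(scores)
print(q1, q2, q3, q3 - q1)  # prints 2.5 3.5 6.0 3.5
```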
Skewness • Positive skew - scores bunched at low end, long tail to the right • Negative skew - opposite pattern • Skewness coefficient ranges from - infinity to + infinity • the closer to zero, the more normal (symmetric) the distribution • Coefficients beyond ±2.0 are cause for concern
Ceiling and Floor Effects: Similar to Skewness Information • Ceiling effects: substantial number of people get highest possible score • Floor effects: opposite • More helpful for single-item measures or coarse scales with only a few levels
Example item: “… to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?” • 49% not limited at all (can’t improve)
SF-36 Variability Information in Patients with Chronic Conditions (N=3,445) [table not reproduced] All on 0-100 scales, higher is better. McHorney C et al. Med Care. 1994;32:40-66.
Evidence of Floor and Ceiling Effects in One SF-36 Scale [table not reproduced; highlighted values: 24, 37] All on 0-100 scales, higher is better. McHorney C et al. Med Care. 1994;32:40-66.
Reasons for Poor Variability • Low variability in construct being measured in that “sample” (true low variation) • Items not adequately tapping construct • If only one item, especially hard • Items not detecting variation at one end • What to do: • If developing measures, add items • If selecting measures – find another one
Advantages of Multi-item Scales Revisited • Using multi-item scales minimizes likelihood of ceiling/floor effects • Even if items are skewed, multi-item scale “normalizes” the skew
Percent with “Best” Score on 5 Items in the MOS MHI-5 • 6-level response scale: all of the time to none of the time • [Chart not reproduced; highlighted value: 63] • 5-item scale: only 5% had highest score. Stewart A. et al., Measuring Functioning and Well-Being, 1992
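Why a multi-item scale shows a smaller ceiling effect than its items can be sketched with a simulation. This is a hypothetical illustration, not MOS data: each of 5 items is assumed to reflect a shared latent level plus item-specific noise, and the cut points that map values onto 6 response levels are invented so that the top level is common. A person reaches the top scale score only by choosing the best level on every item.

```python
import bisect
import random

# Hypothetical cut points: values above 0.0 fall in the best level (6)
CUTS = [-2.0, -1.5, -1.0, -0.5, 0.0]


def simulate_best_scores(n_people=5000, n_items=5, seed=1):
    """Return (percent at best level per item, percent at best scale score)."""
    rng = random.Random(seed)
    best_per_item = [0] * n_items
    best_on_scale = 0
    for _ in range(n_people):
        latent = rng.gauss(0.0, 1.0)  # shared underlying level
        levels = []
        for _ in range(n_items):
            value = latent + rng.gauss(0.0, 1.0)  # item-specific noise
            levels.append(bisect.bisect_right(CUTS, value) + 1)  # level 1..6
        for j, level in enumerate(levels):
            if level == 6:
                best_per_item[j] += 1
        # Top scale score requires the best level on every item
        if sum(levels) == 6 * n_items:
            best_on_scale += 1
    item_pcts = [100 * count / n_people for count in best_per_item]
    scale_pct = 100 * best_on_scale / n_people
    return item_pcts, scale_pct


item_pcts, scale_pct = simulate_best_scores()
print("percent at best level, by item:", [round(p, 1) for p in item_pcts])
print("percent at best scale score:", round(scale_pct, 1))
```

Because the top scale score requires the best response on all items, the scale's ceiling percentage can never exceed that of any single item.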
Overview • Concepts of error • Basic psychometric characteristics • Variability • Reliability • Interpretability
Reliability • Extent to which an observed score is free of random error • Produces the same score each time it is administered (all else being equal) • Population-specific - reliability affected by: • sample size • variability in scores (dispersion) • a person’s level on the scale
Back to Components of Variability in Item Scores of a Group of Individuals • Observed score variance = true score variance + error variance • Observed score variance is the total variance (variation across all observed item scores)
Reliability Depends on True Score Variance • Reliability is a group-level statistic • Reliability = 1 - (error variance / total variance) • OR Reliability = true score variance / total variance (proportion of variance due to true score)
Reliability Depends on True Score Variance • Reliability of .70 means 30% of variance in observed scores is due to error • With total variance scaled to 1.0: Reliability = total variance - error variance • .70 = 1.0 - .30
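The two ways of expressing reliability give the same number, as this small sketch shows. It assumes, as in the variance decomposition above, that total variance is the sum of true score variance and error variance; the function name `reliability` and the input values are illustrative.

```python
def reliability(true_variance, error_variance):
    """Reliability = 1 - error variance / total variance.

    Assumes total variance = true score variance + error variance.
    """
    total_variance = true_variance + error_variance
    return 1 - error_variance / total_variance


# Example from the slide: 30% of the observed variance is error
rel = reliability(true_variance=0.70, error_variance=0.30)
print(round(rel, 2))

# Same value expressed as the proportion of variance due to true score
proportion_true = 0.70 / (0.70 + 0.30)
print(round(proportion_true, 2))
```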
Reliability Coefficient • Typically ranges from .00 - 1.00 • Higher scores indicate better reliability
Importance of Reliability • Necessary for validity (but not sufficient) • Low reliability (or high measurement error) attenuates correlations with other variables • May conclude that two variables are not related when they are • Greater reliability = greater power • The more reliable your scales, the smaller sample size you need to detect an association
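How much low reliability attenuates a correlation can be sketched with the standard attenuation formula from classical test theory (observed r = true r × sqrt(reliability of X × reliability of Y)). The formula is not stated on the slide; it is brought in here as an assumption, and the function name and numbers are illustrative.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation given the reliabilities of both measures.

    Uses the classical test theory attenuation formula (an assumption here,
    not taken from the slides).
    """
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical case: true correlation of .50, both measures with reliability .70
observed = attenuated_correlation(0.50, 0.70, 0.70)
print(round(observed, 2))

# The same true correlation measured with perfectly reliable scales
print(round(attenuated_correlation(0.50, 1.0, 1.0), 2))
```

The weaker observed correlation is why less reliable scales need larger samples to detect the same underlying association.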
Reliable Scale? • NO! • There is no such thing as a “reliable” scale • We accumulate “evidence” of reliability in a variety of populations in which it has been tested
How Do You Know if a Scale or Measure Has Adequate Reliability? • Adequacy of reliability judged according to standard criteria • Criteria depend on type of coefficient