Class 4: Psychometric Characteristics Part I: Variability, Reliability, Interpretability. October 20, 2005. Anita L. Stewart, Institute for Health & Aging, University of California, San Francisco
Overview of Class 4 • Basic psychometric characteristics • Variability • Reliability • Interpretability • Validity and bias • Responsiveness and sensitivity to change
Overview • This class: • Variability • Reliability • Interpretability • Next class (class 5) • Validity and bias • Responsiveness and sensitivity to change
Overview • Basic psychometric characteristics • Variability • Reliability • Interpretability
Variability • Good variability • All (or nearly all) scale levels are represented • Distribution approximates bell-shaped normal • Variability is a function of the sample • Need to understand variability of measure of interest in sample similar to one you are studying • Review criteria • Adequate variability in a range that is relevant to your study
Common Indicators of Variability • Range of scores (possible, observed) • Mean, median, mode • Standard deviation (standard error) • Skewness • % at floor (lowest score) • % at ceiling (highest score)
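These indicators are straightforward to compute for any vector of scale scores. The sketch below is illustrative only (the data, possible range, and variable names are hypothetical, not from the slides) and uses numpy and scipy.

```python
import numpy as np
from collections import Counter
from scipy import stats

# Hypothetical scores on a measure with a possible range of 0-30 (illustrative only)
scores = np.array([0, 2, 3, 3, 5, 7, 8, 8, 8, 10, 12, 15, 18, 21, 23])
possible_min, possible_max = 0, 30

print("Observed range:", scores.min(), "-", scores.max())
print("Mean:", scores.mean())
print("Median:", np.median(scores))
print("Mode:", Counter(scores).most_common(1)[0][0])
print("SD:", scores.std(ddof=1))
print("Skewness:", stats.skew(scores, bias=False))
# Test whether the skewness coefficient differs significantly from zero (needs n >= 8)
print("Skew test (z, p):", stats.skewtest(scores))
print("% at floor:", np.mean(scores == possible_min) * 100)
print("% at ceiling:", np.mean(scores == possible_max) * 100)
```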
Range of Scores • Especially important for multi-item measures • Possible and observed • Example of difference: • CES-D possible range is 0-30 • Wong et al. study of mothers of young children: observed range was 0-23 • missing entire high end of the distribution (none had high levels of depression)
Mean, Median, Mode • Mean - average • Median - midpoint • Mode - most frequent score • In normally distributed measures, these are all the same • In non-normal distributions, they will vary
Mean and Standard Deviation • Most information on variability is from mean and standard deviation • Can envision how it is distributed on the possible range
Normal Distributions (Or Approximately Normal) • Mean, SD tell the entire story of the distribution • ± 1 SD on each side of the mean covers about 68% of the scores
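As a quick check of that figure, the proportion of a normal distribution falling within one SD of the mean can be obtained from scipy (a minimal sketch, not part of the original slides):

```python
from scipy.stats import norm

# Proportion of a normal distribution within one SD of the mean
within_one_sd = norm.cdf(1) - norm.cdf(-1)
print(round(within_one_sd, 3))   # ~0.683, i.e., about 68% of scores
```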
Examples from Sarkisian (2002): Expectations Regarding Aging • Scores 0-100; higher scores indicate better expectations
Skewness • Positive skew - scores bunched at low end, long tail to the right • Negative skew - opposite pattern • Coefficient ranges from - infinity to + infinity • the closer to zero, the more normal • Test whether skewness coefficient is significantly different from zero • thus depends on sample size • Coefficients beyond ±2.0 are cause for concern
Skewed Distributions • Mean and SD are not as useful • Mean ± 1 SD often extends beyond the maximum or minimum possible score
Ceiling and Floor Effects: Similar to Skewness Information • Ceiling effects: substantial number of people get highest possible score • Floor effects: opposite • Not very meaningful for continuous scales • there will usually be very few at either end • More helpful for single-item measures or coarse scales with only a few levels
… to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)? 49% not limited at all (can’t improve)
Advantages of multi-item scales revisited • Using multi-item scales minimizes likelihood of ceiling/floor effects • When items are skewed, multi-item scale “normalizes” the skew
Percent with Highest (Best) Score:MOS 5-Item Mental Health Index • Items (6 pt scale - all of the time to none of the time): • Very nervous person - 34% none of the time • Felt calm and peaceful - 4% all of the time • Felt downhearted and blue - 33% none of the time • Happy person - 10% all of the time • So down in the dumps nothing could cheer you up – 63% none of the time • Summated 5-item scale (0-100 scale) • Only 5% had highest score Stewart A. et al., MOS book, 1992
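A small simulation illustrates the point: even when each item has a large share of respondents at its ceiling, few respondents reach the ceiling of the summed scale. The item distributions below are invented for illustration, not the actual MOS data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Five hypothetical items scored 0-5, with responses bunched toward the top
probs = [0.02, 0.03, 0.05, 0.15, 0.35, 0.40]
items = rng.choice(np.arange(6), size=(n, 5), p=probs)

scale = items.sum(axis=1)          # summated scale, possible range 0-25

pct_ceiling_items = (items == 5).mean(axis=0) * 100
pct_ceiling_scale = (scale == 25).mean() * 100

print("% at ceiling, each item  :", np.round(pct_ceiling_items, 1))
print("% at ceiling, summed scale:", round(pct_ceiling_scale, 1))
# The summed scale has far fewer respondents at the maximum possible score
```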
SF-36 Variability Information in Patients with Chronic Conditions (N=3,445) McHorney C et al. Med Care. 1994;32:40-66.
Ceiling and floor effects: Expectations About Aging (Sarkisian)
Reasons for Poor Variability • Low variability in construct being measured in that “sample” (true low variation) • Items not adequately tapping construct • If only one item, especially hard • Items not detecting important differences in construct at one or the other end of the continuum • Solutions: add items
Overview • Basic psychometric characteristics • Variability • Reliability • Interpretability
Reliability • Extent to which an observed score is free of random error • Population-specific; reliability increases with: • sample size • variability in scores (dispersion) • a person’s level on the scale
Components of an Individual’s Observed Item Score • Observed item score = true score + error
Components of Variability in Item Scores of a Group of Individuals • Observed score variance = true score variance + error variance • Total variance is the variance of the observed item scores across the group
Reliability Depends on True Score Variance • Reliability is a group-level statistic • Reliability = proportion of variance due to true score = true score variance / total variance • Equivalently, reliability = (total variance - error variance) / total variance = 1 - (error variance / total variance) • A reliability of .70 means 30% of the variance in observed scores is explained by error (.70 = 100% - 30%)
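A tiny worked example of the decomposition (arbitrary numbers, shown only to make the arithmetic concrete):

```python
# Variance decomposition with illustrative numbers
total_variance = 100.0          # variance of observed scores in the group
error_variance = 30.0           # variance attributable to random error
true_score_variance = total_variance - error_variance

reliability = true_score_variance / total_variance
print(reliability)              # 0.70: 30% of observed variance is error
```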
Importance of Reliability • Necessary for validity (but not sufficient) • Low reliability attenuates correlations with other variables (harder to detect true correlations among variables) • May conclude that two variables are not related when they are • Greater reliability, greater power • Thus the more reliable your scales, the smaller sample size you need to detect an association
Reliability Coefficient • Typically ranges from .00 - 1.00 • Higher scores indicate better reliability
How Do You Know if a Scale or Measure Has Adequate Reliability? • Adequacy of reliability judged according to standard criteria • Criteria depend on type of coefficient
Types of Reliability Tests • Internal-consistency • Test-retest • Inter-rater • Intra-rater
Internal Consistency Reliability: Cronbach’s Alpha • Requires multiple items supposedly measuring same construct to calculate • Extent to which all items measure the same construct (same latent variable)
Internal-Consistency Reliability • For multi-item scales • Cronbach’s alpha • ordinal scales • Kuder Richardson 20 (KR-20) • for dichotomous items
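For reference, a minimal sketch of Cronbach’s alpha computed directly from a respondents-by-items matrix; the function name and data are hypothetical, but the formula is the standard one (k/(k-1) × (1 − Σ item variances / variance of the total score)).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 6 people x 4 items on a 1-5 scale
# (respondents with missing items would be dropped before this step)
data = [[4, 5, 4, 4],
        [2, 2, 3, 2],
        [5, 5, 5, 4],
        [3, 3, 2, 3],
        [1, 2, 1, 2],
        [4, 4, 5, 5]]
print(round(cronbach_alpha(data), 2))
```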
Minimum Standardsfor Internal Consistency Reliability • For group comparisons (e.g., regression, correlational analyses) • .70 or above is minimum (Nunnally, 1978) • .80 is optimal • above .90 is unnecessary • For individual assessment (e.g., treatment decisions) • .90 or above (.95) is preferred (Nunnally, 1978)
Internal-Consistency Reliability Can be Spurious • Based on only those who answered all questions in the measure • If a lot of people are having trouble with the items and skip some, they are not included in test of reliability
Internal-Consistency Reliability is a Function of Number of Items in Scale • Increases with the number of items • Very large scales (20 or more items) can have high reliability without other good scaling properties
Example: 21-item Beck Depression Inventory (BDI) • BDI 1961 version (symptoms “today”) • reliability .88 • 2 items correlated < .30 with other items in the scale • BDI 1978 version (past week) • reliability .86 • 3 items correlated < .30 with other items in the scale Beck AT et al. J Clin Psychol. 1984;40:1365-1367
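The item flagging in this example rests on item-total correlations. A sketch of the “corrected” item-total correlation (each item correlated with the sum of the remaining items), with invented data:

```python
import numpy as np

def corrected_item_total(items):
    """Correlation of each item with the sum of the other items."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    corrs = []
    for j in range(items.shape[1]):
        rest = total - items[:, j]            # total score excluding item j
        corrs.append(np.corrcoef(items[:, j], rest)[0, 1])
    return np.array(corrs)

# Hypothetical 8 respondents x 4 items
data = np.array([[1, 2, 1, 4],
                 [3, 3, 2, 1],
                 [4, 5, 4, 2],
                 [2, 2, 2, 5],
                 [5, 4, 5, 1],
                 [1, 1, 2, 3],
                 [4, 4, 3, 2],
                 [3, 3, 4, 4]])
corrs = corrected_item_total(data)
print(np.round(corrs, 2))
print("Items correlating < .30 with the rest:", np.where(corrs < 0.30)[0])
```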
Test-Retest Reliability • Repeat assessment on individuals who are not expected to change • Time between assessments should be: • Short enough so no change occurs • Long enough so subjects don’t recall first response • Coefficient is a correlation between two measurements • Type of correlation depends on scale properties • For single item measures, the only way to test reliability
Appropriate Test-Retest Coefficients by Type of Measure • Continuous scales (ratio or interval scales, multi-item Likert scales): • Pearson • Ordinal or non-normally distributed scales: • Spearman • Kendall’s tau • Dichotomous (categorical) measures: • Phi • Kappa
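All three coefficients for continuous or ordinal scales are available in scipy.stats; the scores below are hypothetical retest data for illustration (kappa for dichotomous measures is sketched later in the section):

```python
from scipy import stats

# Hypothetical scores for the same 8 people at two time points
time1 = [10, 12, 15, 20, 22, 25, 30, 33]
time2 = [11, 12, 14, 21, 23, 24, 31, 35]

print("Pearson r:    ", stats.pearsonr(time1, time2)[0])
print("Spearman rho: ", stats.spearmanr(time1, time2)[0])
print("Kendall's tau:", stats.kendalltau(time1, time2)[0])
```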
Minimum Standards for Test-Retest Reliability • Significance of a test-retest correlation has NOTHING to do with the adequacy of the reliability • Criteria: similar to those for internal consistency • >.70 is desirable • >.80 is optimal
Observer or Rater Reliability • Inter-rater reliability (across two or more raters) • Consistency (correlation) between two or more observers on the same subjects (one point in time) • Intra-rater reliability (within one rater) • A test-retest within one observer • Correlation among repeated values obtained by the same observer (over time)
Observer or Rater Reliability • Sometimes Pearson correlations are used - correlate one observer with another • Assesses association only • .65 to .95 are typical correlations • >.85 is considered acceptable McDowell and Newell
Association vs. Agreement When Correlating Two Times or Ratings • Association is degree to which one score linearly predicts other score • Agreement is extent to which same score is obtained on second measurement (retest, second observer) • Can have high correlation and poor agreement • If second score is consistently higher for all subjects, can obtain high correlation • Need second test of mean differences
Example of Association and Agreement • Scores at time 2 are exactly 3 points above scores at time 1 • Correlation (association) would be perfect (r=1.0) • Agreement is not perfect (no agreement on scores in any case - a difference of 3 between each score at time 1 and time 2)
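This example can be reproduced in a few lines; the time-1 scores are invented, and the time-2 scores are simply shifted up by 3 points:

```python
import numpy as np
from scipy import stats

time1 = np.array([10, 14, 18, 22, 26, 30])
time2 = time1 + 3                     # every time-2 score is exactly 3 points higher

r, _ = stats.pearsonr(time1, time2)
exact_agreement = np.mean(time1 == time2)   # proportion of identical scores
mean_difference = np.mean(time2 - time1)    # systematic shift between occasions

print("Correlation (association):", r)       # 1.0: perfect linear association
print("Exact agreement:", exact_agreement)   # 0.0: no scores agree
print("Mean difference:", mean_difference)   # 3.0: why a test of mean differences is needed
```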
Types of Reliability Coefficients for Agreement Among Raters • Intraclass correlation • Kappa
Kappa for Testing Inter-rater Reliability • Coefficient indicates level of agreement of two or more judges, exceeding that which would be expected by chance • Appropriate for dichotomous (categorical) scales and ordinal scales • Several forms of kappa: • e.g., Cohen’s kappa is for 2 judges, dichotomous scale • Sensitive to number of observations, distribution of data
Interpreting Kappa: Level of Reliability • <0.00 Poor • .00 - .20 Slight • .21 - .40 Fair • .41 - .60 Moderate • .61 - .80 Substantial • .81 - 1.00 Almost perfect • .60 or higher is acceptable (Landis & Koch, 1977)
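A minimal sketch of Cohen’s kappa for two raters and a dichotomous rating, computed as observed agreement corrected for chance agreement (the ratings are hypothetical):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' dichotomous (0/1) ratings of the same subjects."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_observed = np.mean(r1 == r2)                       # observed agreement
    p_chance = (np.mean(r1 == 1) * np.mean(r2 == 1) +    # agreement expected
                np.mean(r1 == 0) * np.mean(r2 == 0))     # by chance alone
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical ratings of 12 subjects by two observers
rater1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
rater2 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
kappa = cohens_kappa(rater1, rater2)
print(round(kappa, 2))   # compare against the Landis & Koch ranges above
```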
Reliable Scale? • NO! • There is no such thing as a “reliable” scale • We accumulate “evidence” of reliability in a variety of populations in which it has been tested