
Presentation Transcript


  1. Class 4: Psychometric Characteristics Part I: Variability, Reliability, Interpretability. October 20, 2005. Anita L. Stewart, Institute for Health & Aging, University of California, San Francisco

  2. Overview of Class 4 • Basic psychometric characteristics • Variability • Reliability • Interpretability • Validity and bias • Responsiveness and sensitivity to change

  3. Overview • This class: • Variability • Reliability • Interpretability • Next class (class 5) • Validity and bias • Responsiveness and sensitivity to change

  4. Overview • Basic psychometric characteristics • Variability • Reliability • Interpretability

  5. Variability • Good variability • All (or nearly all) scale levels are represented • Distribution approximates bell-shaped normal • Variability is a function of the sample • Need to understand variability of measure of interest in sample similar to one you are studying • Review criteria • Adequate variability in a range that is relevant to your study

  6. Common Indicators of Variability • Range of scores (possible, observed) • Mean, median, mode • Standard deviation (standard error) • Skewness • % at floor (lowest score) • % at ceiling (highest score)
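A minimal sketch of how these indicators might be computed, assuming NumPy/SciPy; the scores and the 0-100 possible range are hypothetical, not from the lecture:

```python
import numpy as np
from scipy import stats

scores = np.array([55, 70, 70, 80, 85, 90, 95, 100, 100, 100])  # hypothetical 0-100 scale

vals, counts = np.unique(scores, return_counts=True)
print("observed range:", scores.min(), "-", scores.max())  # vs. possible range 0-100
print("mean:", scores.mean())
print("median:", np.median(scores))
print("mode:", vals[counts.argmax()])            # most frequent score
print("SD:", scores.std(ddof=1))
print("skewness:", round(stats.skew(scores), 2))
print("% at floor (0):", (scores == 0).mean() * 100)
print("% at ceiling (100):", (scores == 100).mean() * 100)
```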

  7. Range of Scores • Especially important for multi-item measures • Possible and observed • Example of difference: • CES-D possible range is 0-30 • Wong et al. study of mothers of young children: observed range was 0-23 • missing entire high end of the distribution (none had high levels of depression)

  8. Mean, Median, Mode • Mean - average • Median - midpoint • Mode - most frequent score • In normally distributed measures, these are all the same • In non-normal distributions, they will vary

  9. Mean and Standard Deviation • Most information on variability is from mean and standard deviation • Can envision how it is distributed on the possible range

  10. Normal Distributions (or Approximately Normal) • Mean, SD tell the entire story of the distribution • ±1 SD on each side of the mean ≈ 68% of the scores

  11. [Charts] Examples from Sarkisian (2002): Expectations Regarding Aging. Scores 0-100; higher scores indicate better expectations

  12. Skewness • Positive skew - scores bunched at low end, long tail to the right • Negative skew - opposite pattern • Coefficient ranges from negative to positive infinity • the closer to zero, the more normal • Test whether skewness coefficient is significantly different from zero • thus depends on sample size • Coefficients beyond ±2.0 are cause for concern
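A short sketch of the significance test mentioned above, using SciPy's D'Agostino skewness test on simulated, positively skewed data (the data are illustrative only):

```python
import numpy as np
from scipy import stats

scores = np.random.default_rng(0).exponential(scale=10, size=200)  # bunched low, long right tail

coef = stats.skew(scores)
z, p = stats.skewtest(scores)  # tests H0: skewness = 0; requires n >= 8
print(f"skewness = {coef:.2f}, z = {z:.2f}, p = {p:.4f}")
# coefficients beyond roughly +/-2.0 are cause for concern
```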

  13. Skewed Distributions • Mean and SD are not as useful • Mean ± 1 SD often extends beyond the maximum or minimum possible score

  14. Ceiling and Floor Effects: Similar to Skewness Information • Ceiling effects: substantial number of people get highest possible score • Floor effects: opposite • Not very meaningful for continuous scales • there will usually be very few at either end • More helpful for single-item measures or coarse scales with only a few levels

  15-16. [Charts] "…to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?" 49% not limited at all (can't improve)

  17. Advantages of multi-item scales revisited • Using multi-item scales minimizes likelihood of ceiling/floor effects • When items are skewed, multi-item scale “normalizes” the skew
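A hedged illustration of this "normalizing" effect using simulated data (five hypothetical negatively skewed 0-5 items; nothing here comes from the MOS data on the next slide):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# five 0-5 items, each bunched at the high end (ceiling-prone, negatively skewed)
items = np.clip(np.round(5 - rng.exponential(1.0, size=(500, 5))), 0, 5)
scale = items.sum(axis=1)  # summated 0-25 multi-item scale

print("item skewness:", np.round(stats.skew(items, axis=0), 2))
print("scale skewness:", round(stats.skew(scale), 2))            # closer to 0
print("% at ceiling, item 1:", (items[:, 0] == 5).mean() * 100)  # substantial
print("% at ceiling, scale:", (scale == 25).mean() * 100)        # far fewer
```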

  18. Percent with Highest (Best) Score:MOS 5-Item Mental Health Index • Items (6 pt scale - all of the time to none of the time): • Very nervous person - 34% none of the time • Felt calm and peaceful - 4% all of the time • Felt downhearted and blue - 33% none of the time • Happy person - 10% all of the time • So down in the dumps nothing could cheer you up – 63% none of the time • Summated 5-item scale (0-100 scale) • Only 5% had highest score Stewart A. et al., MOS book, 1992

  19. [Table] SF-36 Variability Information in Patients with Chronic Conditions (N=3,445). McHorney C et al. Med Care. 1994;32:40-66.

  20-21. [Charts] Ceiling and floor effects: Expectations About Aging (Sarkisian)

  22. Reasons for Poor Variability • Low variability in construct being measured in that “sample” (true low variation) • Items not adequately tapping construct • If only one item, especially hard • Items not detecting important differences in construct at one or the other end of the continuum • Solutions: add items

  23. Overview • Basic psychometric characteristics • Variability • Reliability • Interpretability

  24. Reliability • Extent to which an observed score is free of random error • Population-specific; reliability estimates depend on: • sample size • variability in scores (dispersion) • a person's level on the scale

  25. Components of an Individual's Observed Item Score: observed item score = true score + error

  26. Components of Variability in Item Scores of a Group of Individuals: observed score variance (total variance) = true score variance + error variance

  27. Reliability Depends on True Score Variance • Reliability is a group-level statistic • Reliability = proportion of variance due to true score = (true score variance) / (total variance)

  28. Reliability Depends on True Score Variance • Reliability = (total variance - error variance) / total variance = proportion of variance due to true score • A reliability of .70 means 30% of the variance in the observed score is explained by error

  29. Reliability Depends on True Score Variance • Reliability = (total variance - error variance) / total variance • e.g., .70 = (100% - 30%) / 100%
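A small simulation of this decomposition (assumed for illustration, not from the lecture): generate true scores and errors with variances in a 70/30 ratio and recover reliability as the true-score share of total variance.

```python
import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(50, 10, size=100_000)                      # true score variance = 100
error = rng.normal(0, np.sqrt(100 * 30 / 70), size=100_000)  # error variance ~ 42.9
observed = true + error                                      # observed = true + error

reliability = true.var() / observed.var()  # true variance / total variance
print(round(reliability, 2))               # ~0.70; the remaining 30% is error
```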

  30. Importance of Reliability • Necessary for validity (but not sufficient) • Low reliability attenuates correlations with other variables (harder to detect true correlations among variables) • May conclude that two variables are not related when they are • Greater reliability, greater power • Thus the more reliable your scales, the smaller sample size you need to detect an association
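The attenuation point can be made concrete with the classical correction-for-attenuation formula (a standard psychometric result; the numbers below are hypothetical):

```python
import math

r_true = 0.50               # true correlation between the two constructs
rel_x, rel_y = 0.70, 0.70   # reliabilities of the two scales

# observed correlation is attenuated by the square root of the reliabilities
r_observed = r_true * math.sqrt(rel_x * rel_y)
print(round(r_observed, 2))  # 0.35: modest reliability shrinks a true .50 to .35
```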

  31. Reliability Coefficient • Typically ranges from .00 - 1.00 • Higher scores indicate better reliability

  32. How Do You Know if a Scale or Measure Has Adequate Reliability? • Adequacy of reliability judged according to standard criteria • Criteria depend on type of coefficient

  33. Types of Reliability Tests • Internal-consistency • Test-retest • Inter-rater • Intra-rater

  34. Internal Consistency Reliability: Cronbach's Alpha • Requires multiple items that are intended to measure the same construct • Extent to which all items measure the same construct (same latent variable)

  35. Internal-Consistency Reliability • For multi-item scales • Cronbach's alpha for ordinal scales • Kuder-Richardson 20 (KR-20) for dichotomous items
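A compact sketch of Cronbach's alpha from its standard formula (KR-20 is the same computation applied to 0/1 items); the data below are simulated, not from any study cited here:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summated scale
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 1))                     # one underlying construct
items = latent + rng.normal(scale=1.0, size=(300, 4))  # 4 items tapping it, plus noise
print(round(cronbach_alpha(items), 2))                 # ~.80
```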

  36. Minimum Standardsfor Internal Consistency Reliability • For group comparisons (e.g., regression, correlational analyses) • .70 or above is minimum (Nunnally, 1978) • .80 is optimal • above .90 is unnecessary • For individual assessment (e.g., treatment decisions) • .90 or above (.95) is preferred (Nunnally, 1978)

  37. Internal-Consistency Reliability Can be Spurious • Based on only those who answered all questions in the measure • If many people have trouble with the items and skip some, they are not included in the test of reliability

  38. Internal-Consistency Reliability is a Function of Number of Items in Scale • Increases with the number of items • Very large scales (20 or more items) can have high reliability without other good scaling properties
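This item-count dependence is captured by the Spearman-Brown prophecy formula (a standard result, not shown in the slides); a quick sketch:

```python
def spearman_brown(reliability: float, n: float) -> float:
    """Projected reliability when a scale is lengthened by a factor of n."""
    return n * reliability / (1 + (n - 1) * reliability)

# a hypothetical 5-item scale with alpha = .60:
print(round(spearman_brown(0.60, 2), 2))  # doubled to 10 items: ~.75
print(round(spearman_brown(0.60, 4), 2))  # 20 items: ~.86, without better items
```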

  39. Example: 20 item Beck Depression Inventory (BDI) • BDI 1961 version (symptoms “today”) • reliability .88 • 2 items correlated < .30 with other items in the scale • BDI 1978 version (past week) • reliability .86 • 3 items correlated < .30 with other items in the scale Beck AT et al. J Clin Psychol. 1984;40:1365-1367

  40. Test-Retest Reliability • Repeat assessment on individuals who are not expected to change • Time between assessments should be: • Short enough so no change occurs • Long enough so subjects don’t recall first response • Coefficient is a correlation between two measurements • Type of correlation depends on scale properties • For single item measures, the only way to test reliability

  41. Appropriate Test-Retest Coefficients by Type of Measure • Continuous scales (ratio or interval scales, multi-item Likert scales): • Pearson • Ordinal or non-normally distributed scales: • Spearman • Kendall’s tau • Dichotomous (categorical) measures: • Phi • Kappa
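A brief sketch, assuming SciPy, of the first three coefficients applied to hypothetical time-1/time-2 scores (phi and kappa for dichotomous measures are shown in a later example):

```python
import numpy as np
from scipy import stats

t1 = np.array([10, 12, 15, 18, 20, 22, 25, 30])  # time 1
t2 = np.array([11, 12, 14, 19, 21, 21, 26, 29])  # time 2, no true change expected

r, _ = stats.pearsonr(t1, t2)      # continuous, roughly normal scales
rho, _ = stats.spearmanr(t1, t2)   # ordinal or non-normal scales
tau, _ = stats.kendalltau(t1, t2)  # ordinal alternative
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```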

  42. Minimum Standards for Test-Retest Reliability • Significance of a test-retest correlation has NOTHING to do with the adequacy of the reliability • Criteria: similar to those for internal consistency • >.70 is desirable • >.80 is optimal

  43. Observer or Rater Reliability • Inter-rater reliability (across two or more raters) • Consistency (correlation) between two or more observers on the same subjects (one point in time) • Intra-rater reliability (within one rater) • A test-retest within one observer • Correlation among repeated values obtained by the same observer (over time)

  44. Observer or Rater Reliability • Sometimes Pearson correlations are used - correlate one observer with another • Assesses association only • .65 to .95 are typical correlations • >.85 is considered acceptable (McDowell and Newell)

  45. Association vs. Agreement When Correlating Two Times or Ratings • Association is degree to which one score linearly predicts other score • Agreement is extent to which same score is obtained on second measurement (retest, second observer) • Can have high correlation and poor agreement • If second score is consistently higher for all subjects, can obtain high correlation • Need second test of mean differences

  46. Example of Association and Agreement • Scores at time 2 are exactly 3 points above scores at time 1 • Correlation (association) would be perfect (r=1.0) • Agreement is not perfect (scores never match; each time-2 score differs from its time-1 score by 3)
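A minimal sketch of exactly this scenario: a constant 3-point shift yields a perfect correlation but no exact agreement, which is why a test of mean differences is also needed.

```python
import numpy as np
from scipy import stats

t1 = np.array([10, 14, 18, 22, 26])
t2 = t1 + 3                         # every time-2 score is 3 points higher

r, _ = stats.pearsonr(t1, t2)
print("correlation:", round(r, 2))            # 1.0: perfect association
print("exact agreement:", (t1 == t2).mean())  # 0.0: scores never match
print("mean difference:", (t2 - t1).mean())   # 3.0: caught by a test of means
```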

  47. Types of Reliability Coefficients for Agreement Among Raters • Intraclass correlation • Kappa

  48. Kappa Coefficient for Testing Inter-rater Reliability • Coefficient indicates level of agreement of two or more judges, exceeding that which would be expected by chance • Appropriate for dichotomous (categorical) scales and ordinal scales • Several forms of kappa: • e.g., Cohen's kappa is for 2 judges, dichotomous scale • Sensitive to number of observations, distribution of data

  49. Interpreting Kappa: Level of Reliability
  <0.00: Poor
  .00-.20: Slight
  .21-.40: Fair
  .41-.60: Moderate
  .61-.80: Substantial
  .81-1.00: Almost perfect
  Kappa of .60 or higher is acceptable (Landis & Koch, 1977)
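A short example of computing kappa for two raters' dichotomous judgments, assuming scikit-learn's cohen_kappa_score (the ratings are hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

rater1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater2 = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]  # agrees on 8 of 10 cases

kappa = cohen_kappa_score(rater1, rater2)
print(round(kappa, 2))  # ~.58: "moderate" on the Landis & Koch scale above
```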

  50. Reliable Scale? • NO! • There is no such thing as a “reliable” scale • We accumulate “evidence” of reliability in a variety of populations in which it has been tested
