Improving Content Validity: A Confidence Interval for Small Sample Expert Agreement


Presentation Transcript


  1. Improving Content Validity: A Confidence Interval for Small Sample Expert Agreement Jeffrey M. Miller & Randall D. Penfield NCME, San Diego April 13, 2004 University of Florida millerjm@ufl.edu & penfield@coe.ufl.edu

  2. INTRODUCING CONTENT VALIDITY • “Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (AERA/APA/NCME, 1999). • Content validity refers to the degree to which the content of the items reflects the content domain of interest (APA, 1954).

  3. THE NEED FOR IMPROVED REPORTING Content is a precursor to drawing a score-based inference. It is evidence-in-waiting (Shepard, 1993; Yalow & Popham, 1983). “Unfortunately, in many technical manuals, content representation is dealt with in a paragraph, indicating that selected panels of subject matter experts (SMEs) reviewed the test content, or mapped the items to the content standards…” (Crocker, 2003)

  4. QUANTIFYING CONTENT VALIDITY • Several indices for quantifying expert agreement have been proposed. • The mean rating across raters is often used in these calculations. • However, the mean alone provides no information about its proximity to the unknown population mean. • We need a usable inferential procedure to gain insight into the accuracy of the sample mean as an estimate of the population mean.

  5. THE CONFIDENCE INTERVAL A simple method is to calculate the traditional Wald confidence interval. However, this interval is inappropriate for rating scales: • There are too few raters and response categories to justify the assumption of population normality. • There is no reason to believe the distribution should be normal. • The rating scale is bounded, with discrete categories.
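
For contrast, here is a minimal Python sketch of the traditional Wald interval (mine, not the authors'); its symmetric normal-theory form is exactly what the points above call into question for small panels of raters:

```python
import math
import statistics

def wald_ci(ratings, z=1.96):
    """Traditional Wald interval: sample mean +/- z * standard error.
    Symmetric by construction, and its limits can spill past the ends
    of the bounded rating scale when the panel of raters is small."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    se = statistics.stdev(ratings) / math.sqrt(n)
    return mean - z * se, mean + z * se
```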

  6. AN ALTERNATIVE IS THE SCORE CONFIDENCE INTERVAL FOR RATING SCALES • Penfield (2003) demonstrated that the Score method outperformed the Wald interval, especially when • The number of raters was small (e.g., ≤ 10) • The number of categories was small (e.g., ≤ 5) • Furthermore, this interval is asymmetric: it is based on the actual sampling distribution of the mean rating of concern, and its limits cannot extend below or above the actual limits of the rating categories.

  7. STEPS TO CALCULATING THE SCORE CONFIDENCE INTERVAL 1. Obtain values for n, k, and z: n = the number of raters; k = the highest possible rating; z = the standard normal variate associated with the confidence level (e.g., 1.96 at 95% confidence)

  8. 2. Calculate the mean item rating: the sum of the ratings for an item divided by the number of raters, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

  9. 3. Calculate p, the mean rating expressed as a proportion of the highest possible rating: $p = \bar{X}/k$. Or, if the scale begins with 1 rather than 0, then $p = (\bar{X} - 1)/(k - 1)$

  10. 4. Use p to calculate the upper and lower limits of a confidence interval for the population proportion (Wilson, 1927), treating the sum of the ratings as a count out of nk: $p_L = \frac{2nkp + z^2 - z\sqrt{z^2 + 4nkp(1-p)}}{2(nk + z^2)}$ and $p_U = \frac{2nkp + z^2 + z\sqrt{z^2 + 4nkp(1-p)}}{2(nk + z^2)}$
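
Not part of the original slides, but as a concrete rendering of step 4, here is a minimal Python sketch of these Wilson limits; it treats the rating sum as a count out of n*k, which reproduces the arithmetic in the shorthand example below:

```python
import math

def wilson_limits(p, n, k, z=1.96):
    """Wilson (1927) limits for a population proportion, with the
    sum of ratings treated as a count out of n*k possible points."""
    nk = n * k
    center = 2 * nk * p + z ** 2                       # fixed part of the numerator
    half = z * math.sqrt(z ** 2 + 4 * nk * p * (1 - p))
    denom = 2 * (nk + z ** 2)
    return (center - half) / denom, (center + half) / denom
```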

  11. 5. Calculate the upper and lower limits of the Score confidence interval for the population mean rating: Lower $= \bar{X} - z\sqrt{k\,p_L(1 - p_L)/n}$ and Upper $= \bar{X} + z\sqrt{k\,p_U(1 - p_U)/n}$
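
As a further illustrative sketch (the name score_ci is mine, and it reuses wilson_limits and the math import from the block above), steps 1 through 5 combine as:

```python
def score_ci(ratings, k, z=1.96):
    """Score confidence interval for the population mean rating,
    following steps 1-5; uses wilson_limits() defined above."""
    n = len(ratings)                          # step 1: number of raters
    mean = sum(ratings) / n                   # step 2: mean item rating
    p = mean / k                              # step 3: proportion of the highest rating
    p_low, p_up = wilson_limits(p, n, k, z)   # step 4: Wilson limits for p
    # step 5: map the proportion limits back onto the rating metric
    lower = mean - z * math.sqrt(k * p_low * (1 - p_low) / n)
    upper = mean + z * math.sqrt(k * p_up * (1 - p_up) / n)
    return lower, upper
```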

  12. Shorthand Example Item: 3 + ? = 8 The content of this item represents the ability to add single-digit numbers. Rating scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree. Suppose the expert review session includes 10 raters. The responses are 3, 3, 3, 3, 3, 3, 3, 3, 3, 4

  13. Shorthand Example n = 10, k = 4, z = 1.96; the sum of the ratings = 31, so the mean rating = 31/10 = 3.10 and p = 31 / (10 × 4) = 0.775

  14. Shorthand Example (cont.) p_L = (65.842 – 11.042) / 87.683 = 0.625 and p_U = (65.842 + 11.042) / 87.683 = 0.877

  15. Shorthand Example (cont.) Lower limit = 3.100 – 1.96*sqrt(0.938/10) = 2.500 and Upper limit = 3.100 + 1.96*sqrt(0.431/10) = 3.507

  16. We are 95% confident that the population mean rating falls somewhere between 2.500 and 3.507
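
Feeding the example ratings to the score_ci sketch defined earlier reproduces these limits:

```python
ratings = [3, 3, 3, 3, 3, 3, 3, 3, 3, 4]   # the 10 expert ratings from the example
lower, upper = score_ci(ratings, k=4)
print(round(lower, 3), round(upper, 3))     # prints: 2.5 3.507
```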

  17. Content Validation • Method 1: Retain only items with a Score interval of a particular width, based on (a) an a priori determination of appropriateness or (b) an empirical standard (e.g., the 25th and 75th percentiles of all interval widths). • Method 2: Retain items based on a hypothesis test that the lower limit is above a particular value.
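
A minimal sketch of how the two retention rules could be coded; the function names and cutoff parameters are illustrative, not from the slides:

```python
def retain_by_width(lower, upper, max_width):
    """Method 1: keep an item only if its Score interval is narrow enough;
    max_width comes from an a priori or empirical standard."""
    return (upper - lower) <= max_width

def retain_by_lower_limit(lower, cutoff):
    """Method 2: keep an item only if the interval's lower limit sits
    above the cutoff (a one-sided test on the population mean rating)."""
    return lower > cutoff
```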

  18. EXAMPLE WITH 4 ITEMS (rating frequencies for 10 raters and the 95% Score CI):

  Item   Frequency of ratings 0-4    Mean   Lower   Upper
    1    0  0  0  4  6               3.60   3.08    3.84
    2    0  0  2  5  3               3.10   2.50    3.51
    3    2  0  2  6  0               2.20   1.59    2.77
    4    1  2  3  3  1               2.10   1.50    2.68
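
The table can be reproduced by rebuilding each item's ratings from its frequency row and calling the score_ci sketch from earlier; the freqs dictionary below simply transcribes the table:

```python
freqs = {1: [0, 0, 0, 4, 6],   # counts of ratings 0..4 for each item, 10 raters apiece
         2: [0, 0, 2, 5, 3],
         3: [2, 0, 2, 6, 0],
         4: [1, 2, 3, 3, 1]}
for item, counts in freqs.items():
    ratings = [r for r, c in enumerate(counts) for _ in range(c)]
    lower, upper = score_ci(ratings, k=4)
    mean = sum(ratings) / len(ratings)
    print(f"Item {item}: mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```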

  19. Conclusions • The Score method provides a confidence interval that does not depend on the normality assumption • It outperforms the Wald interval when the number of raters and the number of scale categories are small • It provides a decision-making method for the fate of items in expert review sessions • The computational burden can be eased through simple programming in Excel, SPSS, and SAS

  20. For further reading: Penfield, R. D. (2003). A score method for constructing asymmetric confidence intervals for the mean of a rating scale item. Psychological Methods, 8, 149-163. Penfield, R. D., & Miller, J. M. (in press). Improving content validation studies using an asymmetric confidence interval for the mean of expert ratings. Applied Measurement in Education.
