Improving Content Validity: A Confidence Interval for Small Sample Expert Agreement
Jeffrey M. Miller & Randall D. Penfield
NCME, San Diego, April 13, 2004
University of Florida
millerjm@ufl.edu & penfield@coe.ufl.edu
INTRODUCING CONTENT VALIDITY
• “Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (AERA/APA/NCME, 1999).
• Content validity refers to the degree to which the content of the items reflects the content domain of interest (APA, 1954).
THE NEED FOR IMPROVED REPORTING
Content is a precursor to drawing a score-based inference. It is evidence-in-waiting (Shepard, 1993; Yalow & Popham, 1983).
“Unfortunately, in many technical manuals, content representation is dealt with in a paragraph, indicating that selected panels of subject matter experts (SMEs) reviewed the test content, or mapped the items to the content standards…” (Crocker, 2003).
QUANTIFYING CONTENT VALIDITY
• Several indices for quantifying expert agreement have been proposed.
• The mean rating across raters is often used in these calculations.
• However, the mean alone provides no information about its proximity to the unknown population mean.
• We need a usable inferential procedure to gain insight into the accuracy of the sample mean as an estimate of the population mean.
THE CONFIDENCE INTERVAL
A simple method is to calculate the traditional Wald confidence interval. However, this interval is inappropriate for rating scales:
• There are too few raters and response categories to assume that population normality holds.
• There is no reason to believe the distribution should be normal.
• The rating scale is bounded and its categories are discrete.
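The Wald interval referred to here is the familiar normal-theory interval for a mean; writing s for the sample standard deviation of the n ratings, it takes the form

$$\bar{X} \pm z\sqrt{\frac{s^{2}}{n}}$$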
AN ALTERNATIVE IS THE SCORE CONFIDENCE INTERVAL FOR RATING SCALES
• Penfield (2003) demonstrated that the Score method outperformed the Wald interval, especially when
  • the number of raters was small (e.g., ≤ 10), and
  • the number of categories was small (e.g., ≤ 5).
• Furthermore, this interval is asymmetric:
  • it is based on the actual distribution of the mean rating of interest, and
  • its limits cannot extend below the lowest or above the highest rating category.
STEPS TO CALCULATING THE SCORE CONFIDENCE INTERVAL
1. Obtain values for n, k, and z
n = the number of raters
k = the highest possible rating
z = the standard normal variate associated with the confidence level (e.g., ±1.96 at 95% confidence)
2. Calculate the mean item rating: the sum of the ratings for an item divided by the number of raters,

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
3. Calculate p,

$$p = \frac{\sum_{i=1}^{n} X_i}{nk}$$

or, if the scale begins with 1,

$$p = \frac{\sum_{i=1}^{n} X_i - n}{n(k-1)}$$
4. Use p to calculate the upper and lower limits of a confidence interval for a population proportion (Wilson, 1927):

$$p_L = \frac{2nkp + z^2 - z\sqrt{z^2 + 4nkp(1-p)}}{2(nk + z^2)}, \qquad p_U = \frac{2nkp + z^2 + z\sqrt{z^2 + 4nkp(1-p)}}{2(nk + z^2)}$$
5. Calculate the upper and lower limits of the Score confidence interval for the population mean rating:

$$\mathrm{Lower} = \bar{X} - z\sqrt{\frac{k\,p_L(1-p_L)}{n}}, \qquad \mathrm{Upper} = \bar{X} + z\sqrt{\frac{k\,p_U(1-p_U)}{n}}$$
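These five steps are straightforward to script. Below is a minimal Python sketch that consolidates them; the function name, argument names, and the 95% default for z are illustrative choices rather than part of the presentation, and it uses the form of p from the worked example that follows.

```python
import math


def score_ci_mean_rating(ratings, k, z=1.96):
    """Score confidence interval for the population mean of a rating scale item.

    ratings : list of integer ratings given by the expert panel
    k       : highest possible rating category
    z       : standard normal variate for the desired confidence level
    """
    n = len(ratings)                     # step 1: number of raters
    total = sum(ratings)
    mean = total / n                     # step 2: mean item rating

    # Step 3: transform the ratings into a proportion of the maximum possible
    # total (the worked example uses this form; the deck also gives an
    # adjusted p for scales that begin at 1).
    p = total / (n * k)

    # Step 4: Wilson (1927) limits for p, with n * k playing the role of the
    # binomial sample size.
    N = n * k
    half = z * math.sqrt(z ** 2 + 4 * N * p * (1 - p))
    p_lower = (2 * N * p + z ** 2 - half) / (2 * (N + z ** 2))
    p_upper = (2 * N * p + z ** 2 + half) / (2 * (N + z ** 2))

    # Step 5: convert the proportion limits back into limits on the mean rating.
    lower = mean - z * math.sqrt(k * p_lower * (1 - p_lower) / n)
    upper = mean + z * math.sqrt(k * p_upper * (1 - p_upper) / n)
    return lower, upper
```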
Shorthand Example
Item: 3 + ? = 8
“The content of this item represents the ability to add single-digit numbers.”
1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree
Suppose the expert review session includes 10 raters. The responses are 3, 3, 3, 3, 3, 3, 3, 3, 3, 4.
Shorthand Example (cont.)
n = 10, k = 4, z = 1.96
the sum of the ratings = 31, so $\bar{X}$ = 31/10 = 3.10
p = 31 / (10 × 4) = 0.775
Shorthand Example (cont.)

$$p_L = \frac{65.842 - 11.042}{87.683} = 0.625, \qquad p_U = \frac{65.842 + 11.042}{87.683} = 0.877$$
Shorthand Example (cont.)

$$\mathrm{Lower} = 3.100 - 1.96\sqrt{0.938/10} = 2.500, \qquad \mathrm{Upper} = 3.100 + 1.96\sqrt{0.431/10} = 3.507$$
We are 95% confident that the population mean rating falls somewhere between 2.500 and 3.507
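Using the function sketched after step 5 above, these limits can be reproduced numerically (values rounded to three decimals):

```python
# Ten raters: nine ratings of 3 and one rating of 4, on a 1-4 scale (k = 4).
ratings = [3] * 9 + [4]
lower, upper = score_ci_mean_rating(ratings, k=4, z=1.96)
print(round(lower, 3), round(upper, 3))   # approximately 2.500 and 3.507
```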
Content Validation
• Method 1: Retain only items whose Score interval width meets a standard based on
  • an a priori determination of appropriateness, or
  • an empirical standard (e.g., the 25th and 75th percentiles of all interval widths).
• Method 2: Retain items for which the lower limit of the interval falls above a particular value (illustrated in the sketch after the table below).
EXAMPLE WITH 4 ITEMS
Rating frequency for 10 raters and the 95% Score confidence interval:

Item    0    1    2    3    4    Mean    Lower    Upper
  1     0    0    0    4    6    3.60     3.08     3.84
  2     0    0    2    5    3    3.10     2.50     3.51
  3     2    0    2    6    0    2.20     1.59     2.77
  4     1    2    3    3    1    2.10     1.50     2.68
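To illustrate Method 2, the short sketch below retains items whose Score lower limit exceeds a cutoff; the cutoff of 2.5 is purely hypothetical, and the limits are taken from the table above.

```python
# 95% Score interval limits (lower, upper) for the four items in the table.
score_limits = {1: (3.08, 3.84), 2: (2.50, 3.51), 3: (1.59, 2.77), 4: (1.50, 2.68)}

cutoff = 2.5   # hypothetical standard; in practice set a priori or empirically
retained = [item for item, (lower, _) in score_limits.items() if lower > cutoff]
print(retained)   # only item 1 clears this particular cutoff
```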
Conclusions
• The Score method provides a confidence interval that does not depend on the normality assumption.
• It outperforms the Wald interval when the number of raters and the number of scale categories are small.
• It provides a decision-making method for retaining or discarding items in expert review sessions.
• Computational complexity can be eased through simple programming in Excel, SPSS, and SAS.
For further reading:
Penfield, R. D. (2003). A score method for constructing asymmetric confidence intervals for the mean of a rating scale item. Psychological Methods, 8, 149-163.
Penfield, R. D., & Miller, J. M. (in press). Improving content validation studies using an asymmetric confidence interval for the mean of expert ratings. Applied Measurement in Education.