Development of a Confidence Interval for Small Sample Expert Review of Item Content Validation
Jeffrey M. Miller & Randall D. Penfield
FERA, November 19, 2003
University of Florida
millerjm@ufl.edu & penfield@coe.ufl.edu
INTRODUCING CONTENT VALIDITY • "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (AERA/APA/NCME, 1999) • Content validity refers to the degree to which the content of the items reflects the content domain of interest (APA, 1954)
THE NEED FOR IMPROVED REPORTING • Content is a precursor to drawing a score-based inference; it is evidence-in-waiting (Shepard, 1993; Yalow & Popham, 1983) • "Unfortunately, in many technical manuals, content representation is dealt with in a paragraph, indicating that selected panels of subject matter experts (SMEs) reviewed the test content, or mapped the items to the content standards – and all is well" (Crocker, 2003)
QUANTIFYING CONTENT VALIDITY • Several indices for quantifying expert agreement have been proposed • For many, experts rate the match of each item to an objective using a rating scale • The mean rating across raters is often used in the calculations • Klein & Kosecoff's correlation (1975) • Aiken's V (1985) • The mean, by itself, does not account for sampling error, so it tells us nothing about how far it may lie from the population mean. WE NEED A CONFIDENCE INTERVAL!
THE CONFIDENCE INTERVAL • The traditional confidence interval assumes a normal distribution for the sample mean of a rating scale. • However, the assumption of population normality cannot be justified when analyzing the mean of an individual scale item because • 1) the outcomes of the item are discrete • 2) the ratings are bounded by the limits of the Likert scale • 3) even if the above were not problematic, the number of raters is typically too small for the sample mean to approach normality
SCORE CONFIDENCE INTERVAL FOR RATING SCALES • The Score confidence interval (Penfield, 2003) treats rating scale variables as outcomes of a binomial distribution. • Because it is based on the actual distribution of ratings for the item of concern, the interval is asymmetric. • Further, its limits cannot extend below or above the actual limits of the rating categories.
1. Obtain values for n, k, and z • n = the number of raters • k = the number of possible ratings minus 1, that is: • the highest rating if the scale starts with 0 • the highest rating minus 1 if the scale starts with 1 • z = the standard normal variate associated with the confidence level (e.g., +/- 1.96 at 95% confidence)
2. Calculate the sample mean rating, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$: the sum of the ratings for an item divided by the number of raters
3. Calculate $p = \bar{X} / k$, or if the scale begins with 1 then $p = (\bar{X} - 1) / k$
4. Use p to calculate the lower and upper limits for the population proportion (Wilson, 1927): $p_L = \frac{2nkp + z^2 - z\sqrt{z^2 + 4nkp(1-p)}}{2(nk + z^2)}$ and $p_U = \frac{2nkp + z^2 + z\sqrt{z^2 + 4nkp(1-p)}}{2(nk + z^2)}$
5. Calculate the lower and upper limits of the Score confidence interval: Lower = $\bar{X} - z\sqrt{\frac{k\,p_L(1-p_L)}{n}}$ and Upper = $\bar{X} + z\sqrt{\frac{k\,p_U(1-p_U)}{n}}$ (all five steps are collected in the sketch below)
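To make the procedure concrete, here is a minimal Python sketch of steps 1–5, assuming the raw ratings for one item are available as a list. The function name score_ci and its argument names (ratings, scale_min, scale_max) are illustrative labels introduced here, not notation from Penfield (2003).

```python
import math

def score_ci(ratings, scale_min, scale_max, z=1.96):
    """Score confidence interval for the mean rating of a single item
    (Penfield, 2003), using Wilson (1927) limits for the proportion."""
    # Step 1: n = number of raters; k = highest rating once the scale is re-coded to start at 0
    n = len(ratings)
    k = scale_max - scale_min

    # Step 2: sample mean of the ratings
    x_bar = sum(ratings) / n

    # Step 3: proportion p (subtract the scale minimum so that 0 <= p <= 1)
    p = (x_bar - scale_min) / k

    # Step 4: Wilson limits for the population proportion,
    # treating the summed ratings as n*k binomial trials
    nk = n * k
    root = z * math.sqrt(z ** 2 + 4 * nk * p * (1 - p))
    p_lower = (2 * nk * p + z ** 2 - root) / (2 * (nk + z ** 2))
    p_upper = (2 * nk * p + z ** 2 + root) / (2 * (nk + z ** 2))

    # Step 5: lower and upper limits of the Score confidence interval
    lower = x_bar - z * math.sqrt(k * p_lower * (1 - p_lower) / n)
    upper = x_bar + z * math.sqrt(k * p_upper * (1 - p_upper) / n)
    return x_bar, lower, upper
```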
Shorthand Example Let n = 10, k = 4, z = 1.96, and let the sum of the ratings = 31. Then the mean is 31/10 = 3.100 and p = 31 / (10*4) = 0.775
Shorthand Example (cont.) Wilson limits for the population proportion: p_L = (65.842 – 11.042) / 87.683 = 0.625 and p_U = (65.842 + 11.042) / 87.683 = 0.877
Shorthand Example (cont.) Score confidence interval: Lower = 3.100 – 1.96*sqrt(0.938/10) = 2.500 and Upper = 3.100 + 1.96*sqrt(0.432/10) = 3.507
Conclusion: We are 95% confident that the population mean rating for this item falls somewhere between 2.500 and 3.507.
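As a quick check, the sketch above reproduces the shorthand example, which corresponds to Item 2 in the table on the next slide (two raters gave a 2, five gave a 3, and three gave a 4).

```python
# Item 2: ratings summing to 31 from 10 raters on a 0-4 scale
item2 = [2] * 2 + [3] * 5 + [4] * 3
mean, lower, upper = score_ci(item2, scale_min=0, scale_max=4)
print(round(mean, 2), round(lower, 2), round(upper, 2))  # 3.1 2.5 3.51
```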
EXAMPLE WITH 4 ITEMS

             Rating Frequency for 10 Raters               95% Score CI
Item      0      1      2      3      4      Mean      Lower      Upper
1         0      0      0      4      6      3.60      3.08       3.84
2         0      0      2      5      3      3.10      2.50       3.51
3         2      0      2      6      0      2.20      1.59       2.77
4         1      2      3      3      1      2.10      1.50       2.68