CRT Dependability: Consistency for criterion-referenced decisions
Challenges for CRT dependability • Raw scores may not show much variation (skewed distributions) • CRT decisions are based on acceptable performance rather than relative position • A measure of the dependability of the classification (i.e., master / non-master) is needed
Approaches using cut-score
• Threshold loss agreement
– In a test-retest situation, how consistently are students classified as master / non-master?
– All misclassifications are considered equally serious
• Squared-error loss agreement
– How consistent are the classifications?
– Misclassifying students far above or far below the cut-point is considered more serious
Berk, R. A. (1984). Selecting the index of reliability. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 231-266). Baltimore, MD: The Johns Hopkins University Press.
Issues with cut-scores • “The validity of the final classification decisions will depend as much upon the validity of the standard as upon the validity of the test content” (Shepard, 1984, p. 169) • “Just because excellence can be distinguished from incompetence at the extremes does not mean excellence and incompetence can be unambiguously separated at the cut-off.” (p. 171) Shepard, L. A. (1984). Setting performance standards. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 169-198). Baltimore, MD: The Johns Hopkins University Press.
Methods for determining cut-scores • Method 1: expert judgments about the performance of hypothetical students on the test • Method 2: the test performance of actual students
Setting cut-scores (Brown, 1996, p. 257)
Institutional decisions (Brown, 1996, p. 260)
Agreement coefficient (po) and kappa (K)

Classification table (first administration × second administration):
                 Master    Non-master   Total
  Master          77 (A)       6 (B)      83
  Non-master       6 (C)      21 (D)      27
  Total           83          27         110

po = (A + D) / N = (77 + 21) / 110 = .89
pchance = [(A + B)(A + C) + (C + D)(B + D)] / N² = .63
K = (po – pchance) / (1 – pchance) = (.89 – .63) / (1 – .63) = .70
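A quick computational check of the slide above: the short Python sketch below (hypothetical function name; not from the source materials) computes po, pchance, and kappa from the four cells of a 2 × 2 master / non-master classification table.

```python
def classification_agreement(a, b, c, d):
    """Agreement coefficient (po) and kappa for a 2 x 2 master / non-master table.

    a = master on both administrations      b = master on the first only
    c = master on the second only           d = non-master on both
    """
    n = a + b + c + d
    po = (a + d) / n                                            # observed agreement
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # agreement expected by chance
    kappa = (po - p_chance) / (1 - p_chance)                    # agreement corrected for chance
    return po, p_chance, kappa

po, p_chance, kappa = classification_agreement(77, 6, 6, 21)
print(round(po, 2), round(p_chance, 2), round(kappa, 2))
# 0.89 0.63 0.71 -- the slide's .70 comes from plugging in the already-rounded .89 and .63
```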
Short-cut methods for one administration • Calculate an NRT reliability coefficient (split-half, KR-20, Cronbach's alpha) • Convert the cut-score to a standardized score: z = (cut-score – .5 – mean) / SD • Use Table 7.9 to estimate the agreement coefficient • Use Table 7.10 to estimate kappa
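The two computational steps lend themselves to a short sketch. The Python below (hypothetical function names and data; Tables 7.9 and 7.10 themselves are not reproduced here) computes a KR-20 estimate from a 0/1 item-response matrix and the standardized cut-score z used to enter those tables.

```python
def kr20(responses):
    """KR-20 reliability estimate from a matrix of 0/1 item responses.

    responses[p][i] = 1 if person p answered item i correctly, else 0.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people  # population variance
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people          # item difficulty
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

def standardized_cut(cut_score, mean, sd):
    """z = (cut-score - .5 - mean) / SD, used to enter Tables 7.9 and 7.10."""
    return (cut_score - 0.5 - mean) / sd

# Hypothetical values: raw cut-score 27 on a test with mean 30 and SD 8
print(round(standardized_cut(27, 30, 8), 2))   # -0.44
```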
Estimate the dependability for the HELP Reading test • Assume a cut-point of 60%. What is the raw score? (27) • z = –0.36 • Look at Table 9.1. What is the approximate value of the agreement coefficient? • Look at Table 9.2. What is the approximate value of the kappa coefficient?
Squared-error loss agreement • Sensitive to degrees of mastery / non-mastery • A short-cut form of a generalizability study • Classical Test Theory: OS = TS + E • Generalizability Theory: OS = TS + (E1 + E2 + . . . + Ek) Brennan, R. (1995). Handout from generalizability theory workshop.
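To make the G-theory idea of multiple error sources concrete, here is a minimal sketch (hypothetical function name and data, not from the Brennan handout) that estimates person, item, and residual variance components for a fully crossed persons × items design using the standard expected-mean-square solutions.

```python
def variance_components(scores):
    """Variance components for a fully crossed persons x items (p x i) design.

    scores[p][i] = score of person p on item i.
    Returns (person, item, residual) variance component estimates.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(scores[p][i] for p in range(n_p)) / n_p for i in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((scores[p][i] - grand) ** 2 for p in range(n_p) for i in range(n_i))
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    var_res = ms_res                  # person-by-item interaction confounded with error
    var_p = (ms_p - ms_res) / n_i     # universe-score (person) variance
    var_i = (ms_i - ms_res) / n_p     # item-difficulty variance
    return var_p, var_i, var_res
```

The relative sizes of these components show which error sources contribute most to unreliability; dependability indices such as Φ and Φ(λ) combine them into a single figure.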
Phi (lambda) dependability index • Computed from: the number of items • the cut-point (λ, expressed as a proportion) • the mean of the proportion scores • the standard deviation of the proportion scores
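A minimal sketch, assuming the commonly cited single-administration short-cut estimate Φ(λ) = 1 – [1/(n–1)] · [Mp(1–Mp) – Sp²] / [(Mp – λ)² + Sp²], where Mp and Sp are the mean and (population) standard deviation of the examinees' proportion scores, n is the number of items, and λ is the cut-point as a proportion. The function name and example data are hypothetical.

```python
def phi_lambda(proportion_scores, cut_point, n_items):
    """Phi(lambda) dependability index from a single administration.

    Assumes the short-cut estimate:
        phi(lambda) = 1 - (1/(n-1)) * (Mp*(1-Mp) - Sp^2) / ((Mp - lambda)^2 + Sp^2)
    where Mp and Sp are the mean and population SD of the proportion scores.
    """
    k = len(proportion_scores)
    mp = sum(proportion_scores) / k
    sp2 = sum((p - mp) ** 2 for p in proportion_scores) / k    # population variance
    return 1 - (1 / (n_items - 1)) * (mp * (1 - mp) - sp2) / ((mp - cut_point) ** 2 + sp2)

# Hypothetical proportion scores on a 45-item test with a 60% cut-point
scores = [0.42, 0.58, 0.60, 0.67, 0.71, 0.78, 0.82, 0.89]
print(round(phi_lambda(scores, 0.60, 45), 2))   # about 0.83
```

Note that Φ(λ) is smallest when the cut-point sits at the mean of the proportion scores and rises as the cut-point moves away from the mean.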
Domain score dependability • Does not depend on cut-point for calculation • “estimates the stability of an individual’s score or proportion correct in the item domain, independent of any mastery standard” (Berk, 1984, p. 252) • Assumes a well-defined domain of behaviors
Confidence intervals • Analogous to the SEM for NRTs • Interpreted in terms of a proportion-correct (domain) score rather than a raw score
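As an illustration of the idea only (Brown's, 1996, own interval is not reproduced here), a simple binomial-error approximation for an individual's domain score gives the sketch below; the function name and values are hypothetical.

```python
import math

def domain_score_interval(p_hat, n_items, z=1.96):
    """Approximate confidence interval for an examinee's proportion-correct (domain) score.

    Assumes a simple binomial error model: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    Shown as an illustration; it is not necessarily the interval recommended in Brown (1996).
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n_items)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

# Hypothetical examinee: 30 of 45 items correct (proportion score about .67)
low, high = domain_score_interval(30 / 45, 45)
print(round(low, 2), round(high, 2))   # roughly 0.53 and 0.80
```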
Reliability Recap • Longer tests are better than shorter tests • Well-written items are better than poorly written items • Items with high discrimination (ID for NRT, B-index for CRT) are better • A test made up of similar items is better • For CRTs, a test that is closely tied to the objectives is better • For NRTs, a test that is well-centered and spreads students out is better