Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory. Quinn N. Lathrop and Ying Cheng. Assistant Professor, Ph.D., University of Illinois at Urbana-Champaign. Undergraduate Institution: University of California, Santa Cruz.
Introduction • Simulation • Results • Discussion
Introduction • Classification consistency is the degree to which examinees would be classified into the same performance categories over parallel replications of the same assessment (Lee, 2010). • Classification accuracy refers to the extent to which actual classifications using observed cut scores agree with “true” classifications based on known true cut scores (Lee, 2010). • Distribution method: estimates the true score distribution and the observed score distribution. • Individual/person method: works from each individual’s classification status. • True CA increases with test length. • True CA decreases when the cut score is located near the mean of the examinee distribution.
Classification accuracy (CA) rate • 1. Total sum scores: the Lee approach. Bergeson (2007): language proficiency tests; CA increased with grade. • 2. Latent trait estimates: the Rudner approach. • Sireci et al. (2008): math tests. • Examinees > 20: good estimates.
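The Lee Approach • The marginal probability of the total summed score X is given by $\Pr(X = x) = \int \Pr(X = x \mid \theta)\, g(\theta)\, d\theta$, where $\Pr(X = x \mid \theta)$ is the conditional summed-score distribution and $g(\theta)$ is the density of θ. • Let x1, x2, . . . , xK−1 denote a set of observed cut scores that are used to classify examinees into K categories. Given the conditional summed-score distribution and the cut scores, the conditional category probability is computed by summing the conditional summed-score probabilities over all x values that belong to category h: $p_\theta(h) = \sum_{x \in \text{category } h} \Pr(X = x \mid \theta)$. • Expected summed scores can be obtained from the θ cut scores as $\tau = E(X \mid \theta) = \sum_x x \Pr(X = x \mid \theta)$, evaluated at each θ cut score.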
Suppose a set of true cut scores on the summed-score metric, τ1, τ2, . . . , τK−1, determines the true categorical status of each examinee with θ or τ (i.e., expected summed score). • If the true categorical status, η (= 1, 2, . . . , K), of an examinee is known, the conditional probability of accurate classification is simply $\gamma_\theta = p_\theta(\eta)$, the conditional probability of category η given θ. • The true category η can be determined by comparing the expected summed score for θ with the true cut scores. • The marginal classification accuracy index, γ, is given by $\gamma = \int \gamma_\theta\, g(\theta)\, d\theta$.
The Lee Approach • Assumes that classifications are made on the basis of the total score x. • Response pattern V’ = (V1, V2, V3, . . . , VJ), where Vj is the response to item j, j = 1, 2, . . . , J, and J is the test length. • Vj can take values m = 0, 1, . . . , M. • The goal is to find the probability of each possible total score (x) by summing the probabilities of all possible response patterns that lead to that total score given θ, and then to aggregate these probabilities according to the cut scores: the probability of scoring in category k is $\Pr(\text{category } k \mid \theta) = \sum_{x \in \text{category } k} \Pr(X = x \mid \theta)$.
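The aggregation described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes dichotomous 2PL items, and instead of enumerating every response pattern it uses the Lord-Wingersky recursion, which yields the same conditional summed-score distribution. The function names, item parameters, and cut score are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (reduces to 1PL when a == 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def summed_score_dist(theta, items):
    """Pr(X = x | theta) for x = 0..J via the Lord-Wingersky recursion.

    `items` is a list of (a, b) pairs; dichotomous items only (assumption).
    """
    dist = [1.0]  # before any item is added, score 0 has probability 1
    for a, b in items:
        p = p_correct(theta, a, b)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)  # item answered incorrectly
            new[x + 1] += pr * p      # item answered correctly
        dist = new
    return dist

def category_probs(theta, items, cuts):
    """Aggregate Pr(X = x | theta) into categories defined by summed-score cuts.

    A score x is placed in category k (0-based) when exactly k cuts are <= x.
    """
    dist = summed_score_dist(theta, items)
    probs = [0.0] * (len(cuts) + 1)
    for x, pr in enumerate(dist):
        k = sum(1 for c in cuts if x >= c)
        probs[k] += pr
    return probs

# Hypothetical 4-item test with one cut score: "pass" when X >= 2.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
print(category_probs(0.3, items, cuts=[2]))
```

The recursion costs on the order of J² operations per θ value, whereas enumerating response patterns grows exponentially with test length, which is why it is used here.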
CA • Using the sample estimate θ̂, the conditional CA estimate under the Lee approach is the probability that an examinee’s total score and his or her estimated expected true score based on θ̂ fall into the same category.
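A minimal sketch of this conditional CA estimate, under stated assumptions: dichotomous 2PL items, and true cut scores equal to the observed summed-score cut scores. The helper names and the example values are hypothetical, not taken from the slides.

```python
import math

def summed_score_dist(theta, items):
    """Pr(X = x | theta) for dichotomous 2PL items (Lord-Wingersky recursion)."""
    dist = [1.0]
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        dist = [
            (dist[x] if x < len(dist) else 0.0) * (1.0 - p)
            + (dist[x - 1] if x >= 1 else 0.0) * p
            for x in range(len(dist) + 1)
        ]
    return dist

def category_of(score, cuts):
    """0-based category index: the number of cut scores at or below `score`."""
    return sum(1 for c in cuts if score >= c)

def lee_conditional_ca(theta_hat, items, cuts):
    """Probability that the total score X lands in the same category as the
    expected summed score implied by theta_hat."""
    dist = summed_score_dist(theta_hat, items)
    expected = sum(x * pr for x, pr in enumerate(dist))  # estimated true score
    true_cat = category_of(expected, cuts)
    return sum(pr for x, pr in enumerate(dist)
               if category_of(x, cuts) == true_cat)

# Hypothetical example: 4 items, one cut at X >= 2, theta_hat = 0.3.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
print(lee_conditional_ca(0.3, items, cuts=[2]))
```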
Rudner-Based Indices • Inputs: C + 1 cut scores, estimated examinee scores, and standard error estimates. • The expected probability of scoring in each performance-level category c follows from these assumptions. • Define an N × C matrix of weights; the index can then be written in terms of these weights.
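Rudner approach • The cut scores are aligned on the latent trait scale. • Classification probabilities are based on normally distributed measurement error. • The probability of scoring in category k with the Rudner approach is calculated as $\Phi\big((\kappa_k - \hat\theta)/SE(\hat\theta)\big) - \Phi\big((\kappa_{k-1} - \hat\theta)/SE(\hat\theta)\big)$, where $\kappa_{k-1}$ and $\kappa_k$ are the lower and upper cut scores of category k and Φ is the standard normal CDF. • Conditional CA is the probability of being placed in the category that the examinee truly belongs to, given θ.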
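The normal-error calculation above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the lowest and highest categories are taken to extend to minus and plus infinity, the "true" category is taken to be the one containing the θ value supplied, and all function names and numbers are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rudner_category_probs(theta_hat, se, cuts):
    """Probability of scoring in each category, assuming normally distributed
    measurement error around theta_hat with standard error `se`."""
    bounds = [-math.inf] + sorted(cuts) + [math.inf]
    cdf = [norm_cdf((b - theta_hat) / se) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def rudner_conditional_ca(theta, se, cuts):
    """Probability of being placed in the category that contains `theta`."""
    true_cat = sum(1 for c in cuts if theta >= c)
    return rudner_category_probs(theta, se, cuts)[true_cat]

# Hypothetical examinee: theta_hat = 0.4, SE = 0.5, cut scores at -1 and 1.
print(rudner_category_probs(0.4, 0.5, cuts=[-1.0, 1.0]))
print(rudner_conditional_ca(0.4, 0.5, cuts=[-1.0, 1.0]))
```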
Marginal Indices • With the D-method, or distribution-based method, the marginal CA is found by integrating the conditional CA over the domain of θ. In practice, estimated quadrature points and weights are used and the integrals are replaced by summations. • The P-method is person based: it simply averages the conditional indices computed for each examinee in the sample (it uses the individual θ estimates).
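The two marginal estimators differ only in what they average over, which a short sketch makes concrete. Assumptions to note: the quadrature here is a simple equally spaced grid with normalized N(0, 1) density weights rather than estimated points and weights, and the conditional CA function is a toy single-cut, normal-error example. All names and values are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_quadrature(n_points=40, lo=-4.0, hi=4.0):
    """Equally spaced points with normalized N(0, 1) density weights."""
    step = (hi - lo) / (n_points - 1)
    points = [lo + i * step for i in range(n_points)]
    dens = [math.exp(-0.5 * q * q) for q in points]
    total = sum(dens)
    return points, [d / total for d in dens]

def marginal_ca_d(cond_ca, points, weights):
    """D-method: weighted sum of conditional CA over quadrature points."""
    return sum(w * cond_ca(q) for q, w in zip(points, weights))

def marginal_ca_p(cond_ca, theta_hats):
    """P-method: plain average of conditional CA over examinees' estimates."""
    return sum(cond_ca(t) for t in theta_hats) / len(theta_hats)

# Toy conditional CA: one cut at 0.5 and a constant standard error of 0.4.
def toy_cond_ca(theta, cut=0.5, se=0.4):
    return norm_cdf(abs(theta - cut) / se)

points, weights = normal_quadrature()
print(marginal_ca_d(toy_cond_ca, points, weights))
print(marginal_ca_p(toy_cond_ca, [-1.2, -0.3, 0.1, 0.6, 1.4]))
```

The D-method needs a distribution for θ (here N(0, 1)), while the P-method needs only the sample of θ estimates.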
Simulation • Dichotomous models: 1PL, 2PL, 3PL • Items: 10 to 80 in steps of 10 • Difficulty parameters: N(0, 1) • Discrimination parameters: narrow N(0, 0.3) / wider N(0, 0.5) • Guessing parameters: U(0, 0.25) • Polytomous model: GRM with five ordered response categories • Items: 10, 20, 40, and 80 • Threshold 1: N(1, .5); thresholds 2-4: N(1, .2) • D-method: 40 quadrature points and weights from N(0, 1) • P-method: N = 250, 500, 1000, with θ ~ N(0, 1)
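As a rough illustration of the data-generating step, the sketch below simulates only the simplest cell of the design: a 1PL test with difficulties and abilities drawn from N(0, 1). It is not the authors' simulation code; the function name and seed are hypothetical, and the 2PL, 3PL, and GRM conditions are left out.

```python
import math
import random

def simulate_1pl(n_items, n_examinees, seed=0):
    """Simulate dichotomous 1PL responses.

    Difficulties b ~ N(0, 1) and abilities theta ~ N(0, 1), as in the design
    above. Returns (b, theta, responses), where responses[i][j] is 0 or 1.
    """
    rng = random.Random(seed)
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    responses = [
        [1 if rng.random() < 1.0 / (1.0 + math.exp(-(t - bj))) else 0
         for bj in b]
        for t in theta
    ]
    return b, theta, responses

b, theta, responses = simulate_1pl(n_items=10, n_examinees=250)
total_scores = [sum(row) for row in responses]  # observed total score X
print(min(total_scores), max(total_scores))
```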
Two empirical accuracies were calculated: • one for classifications made on the basis of θ̂, and • the other for classifications made on the basis of the observed total score X.
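One plausible way to compute such empirical accuracies in a simulation, where the generating θ is known, is the proportion of examinees whose assigned category matches their true category. The sketch below assumes that definition; the helper names, cut scores, and data are hypothetical.

```python
def classify(value, cuts):
    """0-based category index: the number of cut scores at or below `value`."""
    return sum(1 for c in cuts if value >= c)

def empirical_accuracy(true_values, used_values, true_cuts, used_cuts):
    """Share of examinees whose category based on `used_values` (theta-hat or
    total score X) matches the category based on the known true values."""
    hits = sum(
        1 for t, u in zip(true_values, used_values)
        if classify(t, true_cuts) == classify(u, used_cuts)
    )
    return hits / len(true_values)

# Hypothetical data for four examinees.
true_theta = [-1.0, -0.2, 0.3, 1.5]
theta_hat = [-0.8, 0.1, 0.4, 1.1]
total_x = [3, 5, 6, 9]

# Accuracy of classifying on theta-hat (cut at 0 on the theta scale).
print(empirical_accuracy(true_theta, theta_hat, [0.0], [0.0]))
# Accuracy of classifying on X (assumed corresponding summed-score cut of 6).
print(empirical_accuracy(true_theta, total_x, [0.0], [6]))
```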
1PL • Sample size • SE • Bias • D-method > Lee • Shorter tests
2PL • Classifications made on the basis of θ̂ were more accurate than classifications made on the basis of x (when discrimination parameters vary more between items, the superiority of using θ̂ over x is more pronounced).
DISCUSSION • Results indicate that if the classification is made with x, Lee’s approach estimated the accuracy well. Lee’s approach, when coupled with the P-method, was slightly positively biased for short tests. While the D-method performed as well as or better than the P-method, the D-method required an assumption about the distribution of the latent trait. • Rudner’s approach estimated the true accuracy of using θ̂ well, but the pattern of bias changed with the IRT model.
Model fit will affect both the Rudner and Lee approaches • Item parameters and the ability distribution are unknown in practice • Multiple cut scores • Cognitive diagnostic models • Multiple dimensions or multiple tests • Parameter accuracy?
The wrong model • Robustness of the Lee and Rudner approaches • Signal detection theory • Conditional false positive/negative error rates