Detecting Item Parameter Drift in a CAT program using the Rasch Measurement Model • Mayuko Simon, David Chayer, Pam Hermann, and Yi Du • Data Recognition Corporation • April, 2012
How should banked item parameters be checked? • The idea for this study arose when the authors faced a large existing bank of CAT items, with previously estimated item parameters, that needed to be augmented.
Re-calibration of banked item parameters and item parameter drift • Recalibration is recommended at periodic intervals • CAT response data form a sparse matrix, and the range of student abilities encountered by each item is limited
What would be a reasonable way to recalibrate items? • The methods can be applied to • Maintaining a CAT item bank • Detecting item parameter drift • Calibrating field-test items
How did other researchers calibrate/re-calibrate CAT data? • Impute missing responses to avoid sparseness (Harmes, Parshall, and Kromrey, 2003) • Calibrate FT items by anchoring operational items (Wang and Wiley, 2004) • Calibrate FT items by anchoring person ability (Kingsbury, 2009) • Use ability estimates to calibrate item parameters and detect drift (Stocking, 1988)
Simulation study • 300 items in item bank • 20,000 students’ simulated responses, N(0,1) • Known item parameter drift (10% of item bank) • Various drift sizes
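The slide lists the design but not the generating model. As a minimal sketch, assuming a standard Rasch generating model, the Python below simulates this design; the drift size (0.5 logits), drift direction, and seed are illustrative assumptions, and a full response matrix stands in for the sparse matrix a real CAT would produce.

```python
import numpy as np

rng = np.random.default_rng(2012)

# Design from the slide: 300 banked items, 20,000 examinees ~ N(0, 1),
# drift injected into 10% of the bank.
n_items, n_persons = 300, 20_000
theta = rng.normal(0.0, 1.0, size=n_persons)   # person abilities
b_bank = rng.normal(0.0, 1.0, size=n_items)    # banked item difficulties

drifted = rng.choice(n_items, size=n_items // 10, replace=False)
b_true = b_bank.copy()
b_true[drifted] += 0.5                         # assumed drift size and direction

# Rasch model: P(X = 1 | theta, b) = 1 / (1 + exp(-(theta - b)))
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_true[None, :])))
responses = (rng.random((n_persons, n_items)) < p).astype(int)
```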
Four calibration methods in this study • Anchor person ability (AP) • Anchor person ability and anchor the difficulties of 200 of the 300 items (API) • Use the Displacement value from the Winsteps output (Displacement) • Item-by-item calibration (IBI)
IBI: Item-by-item calibration • A vector of responses for an item • A vector of abilities for the students who took that item • Same concept as logistic regression, but Winsteps is used to calibrate (a sketch follows below) • No sparseness involved • Less data needed (especially when not all items in the bank need to be checked)
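The slide frames IBI as logistic regression with the abilities treated as known. Below is a minimal sketch of that idea using plain maximum likelihood rather than the authors' Winsteps runs; the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ibi_difficulty(x, theta):
    """Rasch difficulty for a single item, with person abilities held fixed.

    x     : 0/1 responses from the students who actually took the item
    theta : their ability estimates (same length as x)
    """
    def neg_log_lik(b):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        eps = 1e-12  # guard against log(0)
        return -np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps))

    return minimize_scalar(neg_log_lik, bounds=(-6.0, 6.0), method="bounded").x
```

Because each item is calibrated only from the responses it actually received, no sparse matrix is ever assembled, and only the items under review need to be run.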
Evaluation • One-sample t-test with alpha = 0.01 for AP, API, and IBI • Cutoff value of 0.4 for the Displacement method • Type I error rate • Type II error rate • Sensitivity (Type II error rate + sensitivity = 1) • RMSE (average difference from the banked value for flagged items) • BIAS (average bias from the banked value for flagged items)
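A minimal sketch of this flagging and evaluation logic, assuming the recalibrated estimates are stacked across replications for the t-test; all array names are hypothetical.

```python
import numpy as np
from scipy import stats

def flag_by_ttest(b_recal_reps, b_bank, alpha=0.01):
    """Flag items whose recalibrated difficulty differs from the banked value.

    b_recal_reps : (n_reps, n_items) recalibrated estimates across replications
    b_bank       : (n_items,) banked difficulties
    """
    t, pval = stats.ttest_1samp(b_recal_reps - b_bank, popmean=0.0, axis=0)
    return pval < alpha                            # boolean flag per item

def flag_by_displacement(displacement, cutoff=0.4):
    return np.abs(displacement) > cutoff           # Displacement-method rule

def summarize(flagged, truly_drifted, b_recal, b_bank):
    """Type I/II rates, sensitivity, and RMSE/BIAS over flagged items."""
    type_i = np.mean(flagged[~truly_drifted])      # false alarms
    type_ii = np.mean(~flagged[truly_drifted])     # misses
    diff = b_recal[flagged] - b_bank[flagged]
    return dict(type_i=type_i, type_ii=type_ii,
                sensitivity=1.0 - type_ii,         # Type II + sensitivity = 1
                rmse=np.sqrt(np.mean(diff ** 2)),
                bias=np.mean(diff))
```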
Type I error rate (averaged over 40 replications) • The Type I error rate for the Control condition is also inflated • Condition 1 had a higher Type I error rate
Type II error rate (averaged over 40 replications) • The Type II error rate for the Displacement method is too high • Condition 1 had a higher Type II error rate
Sensitivity (averaged over 40 replications) • Sensitivity for the Displacement method is too low • Condition 1 had a lower sensitivity rate
Items with small sample sizes and small drift are difficult to flag correctly.
Type II errors occurred for items with small sample sizes and/or small drift. [Figure: flagged items plotted by sample size and drift; annotations mark items with large drift, items with small N, and items with small drift; the same items recur across panels.]
Which method produced re-calibrated item difficulties closer to the banked values? • The medians of the RMSE are similar across the three methods • IBI has less variance in RMSE than AP
Which method has less bias in the re-calibrated item difficulties? • All three methods have very small bias • IBI has less variance in BIAS than AP
Conclusion • Use caution when relying on the Displacement value to identify item parameter drift. • AP, API, and IBI worked reasonably well. • Item parameter drift is difficult to detect for items with small drift or small sample sizes. • Compared with AP, IBI had less variance in RMSE and BIAS. • Item parameter drift in one direction (Condition 1) biases the final ability estimates, leading to higher Type I and Type II error rates.
Limitation and Future Study • The proportion of items with item parameter drift was 10% of the bank. • How would the results change with different proportions? What about different drift sizes? • Only the Rasch model was used. • What about other models and software? • The minimum sample size was 10. • What about different minimum sample sizes (e.g., 30, 50)? • No iterative procedure (no updating of the drifted item difficulties). • Would the results improve with an iterative procedure that updates the difficulties after drift is detected?