Some Perspectives on CAT for K-12 Assessments Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment, June 20, 2010
Some CAT Questions • You want to implement CAT, but you wonder which IRT model you should use? • You want to implement CAT, but you wonder how to put together a CAT pool and how best to implement it? • You want to implement CAT, and you wonder whether it has to be limited to on-grade items only?
Which Model to Use for CAT? • The Rasch and three-parameter logistic (3PL) models are the most popular for IRT applications with multiple-choice items • In applications to conventional fixed-form tests, the differences between the two models are not that great: parallel-forms equating gives about the same answer under either model
Which Model to Use for CAT? • With CAT, there are much greater differences between the Rasch and 3PL models. For example: • Rasch CAT only supports a reduction in test length of about 20% compared to a conventional test • 3PL CAT supports a reduction in test length of about 40-50% compared to a conventional test • Why? • With the Rasch model, the information functions for all items have the same shape, and the information from an “optimally administered” item is not much greater than the information from a typical item
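The point about Rasch information functions can be checked numerically. The following is a minimal sketch, assuming the standard 3PL information formula in the logistic metric with a scaling constant `D` of 1.0 (some programs use 1.7); the function names are illustrative and do not come from the presentation.

```python
import math

def p_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c, D=1.0):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def info_rasch(theta, b):
    """Rasch is treated here as the special case a = 1, c = 0."""
    return info_3pl(theta, a=1.0, b=b, c=0.0)

# A perfectly targeted Rasch item versus one that is a full logit too hard.
targeted = info_rasch(theta=0.0, b=0.0)
off_target = info_rasch(theta=0.0, b=1.0)
relative_efficiency = off_target / targeted
```

Under these assumptions, a Rasch item one logit away from the examinee's ability still yields roughly 79% of the maximum information (about 0.197 versus 0.25), which is consistent with the modest gain from optimal targeting described in the slide.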
Reduced Test Length for an Optimal Rasch CAT [Figure not reproduced; annotation on the plot: “Most conventional tests are about here for most students”]
Which Model to Use for CAT? • With CAT, there are much greater differences between the Rasch and 3PL models. For example: • 3PL CAT tends to select some items in the pool very often and may never select many perfectly good items • Rasch CAT selects items in a much more uniform manner • Why? • With the 3PL, some highly discriminating items provide much more information than other items and therefore are more attractive to the item selection algorithm
Rasch vs. 3PL Exposure – An Example • 50% of 3PL items used 5% of the time or less • 10% of 3PL items used more than 25% of the time
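The exposure pattern can be illustrated with a toy pool. This is only a sketch under stated assumptions: the pool, its parameter values, the grid of provisional abilities, and the "five most informative items" rule are all hypothetical, and counting how often an item ranks in the top five is a rough stand-in for exposure, not a simulation of the example reported in the slide.

```python
import math

def item_info(theta, a, b, c):
    """Fisher information of a 3PL item (logistic metric, no 1.7 scaling)."""
    z = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # logistic part of the curve
    p = c + (1.0 - c) * z                         # probability of a correct answer
    return a * a * ((1.0 - p) / p) * z * z

def selection_rates(pool, thetas, k=5):
    """Share of ability points at which each item is among the k most informative."""
    counts = [0] * len(pool)
    for theta in thetas:
        ranked = sorted(range(len(pool)),
                        key=lambda i: item_info(theta, *pool[i]),
                        reverse=True)
        for i in ranked[:k]:
            counts[i] += 1
    return [n / len(thetas) for n in counts]

# Hypothetical 60-item pool with difficulties spread evenly from -3 to 3.
n_items = 60
b_values = [-3.0 + 6.0 * i / (n_items - 1) for i in range(n_items)]
a_cycle = [0.5, 1.0, 1.5, 2.0]  # assumed spread of discriminations

rasch_pool = [(1.0, b, 0.0) for b in b_values]
pool_3pl = [(a_cycle[i % 4], b, 0.2) for i, b in enumerate(b_values)]
thetas = [-3.0 + 0.1 * j for j in range(61)]  # provisional ability grid

rasch_rates = selection_rates(rasch_pool, thetas)
rates_3pl = selection_rates(pool_3pl, thetas)

never_used_rasch = sum(r == 0 for r in rasch_rates)
never_used_3pl = sum(r == 0 for r in rates_3pl)
```

In this toy pool every Rasch item is selected somewhere on the ability grid, while the 3PL items with a = 0.5 are never among the five most informative, and the highly discriminating items are picked far more often. Operational CAT programs typically add exposure control to counter this, which the sketch leaves out.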
Which Model to Use for CAT? • Both Rasch and 3PL models have been used successfully in CAT applications • Psychometricians will offer different opinions about which model is best for CAT • Either model is defensible for CAT, but the models do behave quite differently • IRT model is just one consideration related to CAT; other considerations related to design are as important or even more important
Preparing Item Pools for CAT Transition • Ideally, the number of items in a CAT pool should be 10-12 times the number of items to be administered in the CAT (rule of thumb based on M. Stocking from ETS) • The CAT pool must include a sufficient number of easy and difficult items; this is usually a big challenge • More items are needed if students test from the same CAT pool multiple times, especially if previously seen items are not eligible to be used in repeat administrations
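The rule of thumb translates into simple arithmetic. A minimal sketch follows; the function name and the 40-item test length are hypothetical, and the calculation does not account for the extra items needed for repeat testers or for coverage of easy and difficult content.

```python
def recommended_pool_size(test_length, low_multiple=10, high_multiple=12):
    """Return the (low, high) pool size implied by the 10-12x rule of thumb."""
    if test_length <= 0:
        raise ValueError("test_length must be positive")
    return test_length * low_multiple, test_length * high_multiple

# Example: a CAT that administers 40 items per student.
low, high = recommended_pool_size(40)
```

For a 40-item CAT the rule suggests a pool of roughly 400 to 480 calibrated items.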
Preparing Item Pools for CAT Transition • Items for a CAT item pool must be calibrated to the same IRT scale • Most states have pools of calibrated items with good psychometric properties that might be used for CAT • These items have gone through extensive reviews • These items may have been used operationally • These items have been shown to have good psychometric characteristics
Preparing Item Pools for CAT Transition • However, there are often challenges in using these items with the old statistics, such as: • The items were calibrated on paper but CAT is online • The items were in tests measuring old standards and CAT will be measuring new standards • Minor edits or format changes may be needed • Items may have come from different places • How can we make use of these items in a new adaptive test?
CAT Transition Strategy: Fixed-form Transition • Two-year transition strategy • In year one, construct and administer a number (e.g., 6 to 10) of fixed forms (field-test items can be embedded) using previous (paper-based) statistics for test construction • Administer the fixed forms online • Re-calibrate the data from the fixed forms and link them to a common scale • Conduct standard setting on a subset of the items from the fixed forms (can be a “synthetic” form) • Apply the new cut scores to each fixed form for reporting
CAT Transition Strategy: Fixed-form Transition • In year two, combine all the items from the online conventional fixed forms (plus additional field-tested items) to create the CAT pool • All items in the CAT pool will have item parameters on a common scale based on an online administration • Issues include: • Deciding how many fixed forms to develop • Making the fixed forms as parallel as possible • Building effective equating links between forms • Determining whether the fixed forms should count • Making a smooth transition from fixed forms to CAT (since the measurement properties will be different)
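Putting several fixed forms on a common scale depends on linking items shared between forms. The presentation does not name a linking method; the sketch below assumes Rasch difficulties and a simple mean-shift link through common items, and all item labels and values are made up for illustration.

```python
from statistics import mean

def mean_shift_constant(base_b, new_b):
    """Shift that places new-form Rasch difficulties on the base scale.

    Both arguments map item IDs to difficulty estimates; the constant is
    computed only from the items the two forms have in common.
    """
    common = sorted(set(base_b) & set(new_b))
    if not common:
        raise ValueError("forms share no common items, so no link is possible")
    return mean(base_b[i] for i in common) - mean(new_b[i] for i in common)

def link_to_base(new_b, shift):
    """Apply the shift to every item on the new form."""
    return {item: b + shift for item, b in new_b.items()}

# Hypothetical calibrations: items B and C appear on both forms.
base_form = {"A": -0.5, "B": 0.0, "C": 0.5}
new_form = {"B": 0.25, "C": 0.75, "D": 1.25}

shift = mean_shift_constant(base_form, new_form)
linked = link_to_base(new_form, shift)
```

Here the common items are 0.25 logits harder on the new form's scale, so every new-form difficulty is shifted down by 0.25 and item D lands at 1.0 on the base scale. A 3PL calibration would need a slope as well as a shift, which this sketch does not cover.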
CAT Transition Strategy: Barely Adaptive Tests (BAT) • Another strategy for transition to CAT is to use “Barely Adaptive Testing” (BAT) • In this approach, the CAT algorithm is used to administer items from the pool based on paper-based IRT calibrations • However, the CAT algorithm does not adapt the difficulty to student performance as strongly as it normally would • The result is that each student takes a unique test that is “slightly” targeted to them • Some examples help to clarify
This slide shows how a conventional test would be administered to three students at different levels of ability
Conventional tests are better for calibrating items but not so good for targeting measurement
This slide shows how CAT would be administered to three students at different levels of ability
CAT is best for targeting measurement but not so good for estimating item statistics [Figure not reproduced; two regions of the plot are annotated “No responses here for calibration”]
This slide shows how BAT would be administered to three students at different levels of ability
Why Does BAT Make Sense? • BAT is a compromise during a year of transition: it measures better than a conventional test and is better than CAT for calibrating items • BAT also permits the administration in the transition year to be very similar to the full CAT administration that will occur in year two and beyond (you can even call it CAT!)
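One way to picture the weaker adaptation in BAT is to shrink the difficulty target toward a common reference point before picking an item. The presentation does not specify the mechanism, so the `strength` parameter, the `center` value, and the nearest-difficulty selection rule below are all hypothetical.

```python
def damped_target(theta_hat, strength, center=0.0):
    """Difficulty target pulled toward `center`.

    strength = 1.0 targets the current ability estimate (ordinary CAT);
    strength = 0.0 gives every student the same target (no adaptation).
    """
    return center + strength * (theta_hat - center)

def next_item(pool_b, used, theta_hat, strength):
    """Index of the unused item whose difficulty is closest to the damped target."""
    target = damped_target(theta_hat, strength)
    candidates = [i for i in range(len(pool_b)) if i not in used]
    if not candidates:
        raise ValueError("item pool is exhausted")
    return min(candidates, key=lambda i: abs(pool_b[i] - target))

# Hypothetical five-item pool and a student currently estimated at theta = 2.0.
pool_b = [-2.0, -1.0, 0.0, 1.0, 2.0]
full_cat_pick = next_item(pool_b, used=set(), theta_hat=2.0, strength=1.0)
bat_pick = next_item(pool_b, used=set(), theta_hat=2.0, strength=0.5)
no_adapt_pick = next_item(pool_b, used=set(), theta_hat=2.0, strength=0.0)
```

With full adaptation the high-ability student gets the hardest item, with `strength = 0.5` a moderately hard one, and with no adaptation the middle item. Intermediate settings spread responses across more of the difficulty range, which is the property that makes BAT friendlier to item calibration than a fully adaptive test.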
CAT and Off-Grade-Level Testing • There are obvious psychometric benefits to including off-grade-level content in K-12 assessments, if supported by vertically articulated content standards • These benefits would seem particularly apparent for struggling students, including students with disabilities (SWDs) • Item pools can be substantially improved for measuring struggling students accurately • All students start at the same place (no “out of level” labeling)
CAT and Off-Grade-Level Testing • Some advocates for SWDs insist that CAT should consist only of on-grade-level content • The basis for this position seems to be a concern about washback effects • A psychometrician’s plea: The important consideration is instruction, not assessment • The goal of the common core standards is college readiness for all students • The instructional imperative does not change based on what items are allowed in a CAT item pool
CAT and Off-Grade-Level Testing • Could off-grade-level content be included in accountability? • Yes, if ESEA relaxes “on-grade-level” requirements • Perhaps, if content standards span multiple grades • Some will say it “doesn’t matter” and that CAT works just fine with only on-grade-level content • But it does matter. If we really want to do better at measuring student status and growth and we want to take full advantage of adaptive testing for all students, we need to allow the adaptive test to extend above and below grade level