Maximum Information per Unit Time in Adaptive Testing
Ying (“Alison”) Cheng (1), John Behrens (2), Qi Diao (3)
(1) Lab of Educational and Psychological Measurement, Department of Psychology, University of Notre Dame
(2) Center for Digital Data, Analytics, & Adaptive Learning, Pearson
(3) CTB/McGraw-Hill

Test Efficiency
• Weiss (1982): CAT can achieve the same measurement precision with half the number of items of a linear test when the maximum information (MI) method is used for item selection
• Maximum information method (Lord, 1980)
  • Choose the item that yields the largest amount of information at the most recent ability estimate
  • In effect: maximum information per item

Test Efficiency
• All tests are timed
• Maximum information given a time limit
  • Choose the item that yields the largest ratio of information to time required
  • Maximum information per unit time (MIPUT) (Fan, Wang, Chang, & Douglas, 2013)
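
MI vs. MIPUT
• MI: $j_{t+1} = \arg\max_{l \in R_t} I_l(\hat{\theta}_t)$, where $R_t$ is the eligible set of items after $t$ items have been administered, and $I_l(\hat{\theta}_t)$ is the information of item $l$ evaluated at the current ability estimate $\hat{\theta}_t$.
• MIPUT: $j_{t+1} = \arg\max_{l \in R_t} \dfrac{I_l(\hat{\theta}_t)}{E(T_l \mid \tau)}$, where the denominator is the expected time required to finish item $l$ given the working speed of the examinee, $\tau$.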
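
The two selection rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes 2PL items, and the pool layout, the `expected_time` lookup, and all function names are hypothetical.

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_mi(theta_hat, pool, eligible):
    """MI: pick the eligible item with the largest information."""
    return max(eligible,
               key=lambda l: info_2pl(theta_hat, pool[l]["a"], pool[l]["b"]))

def select_miput(theta_hat, pool, eligible, expected_time):
    """MIPUT: pick the eligible item with the largest information per unit time."""
    return max(eligible,
               key=lambda l: info_2pl(theta_hat, pool[l]["a"], pool[l]["b"])
               / expected_time[l])

# Hypothetical two-item pool: item1 is more discriminating but much slower.
pool = {"item1": {"a": 1.5, "b": 0.0}, "item2": {"a": 1.2, "b": 0.0}}
expected_time = {"item1": 120.0, "item2": 40.0}  # seconds, assumed values

mi_choice = select_mi(0.0, pool, list(pool))
miput_choice = select_miput(0.0, pool, list(pool), expected_time)
```

In this toy pool MI selects the more discriminating item, while MIPUT prefers the item that delivers less information but takes a third of the time.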
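
Implementation of MIPUT
• Log-normal model for response time (van der Linden, 2006): $f(t_l; \tau, \alpha_l, \beta_l) = \dfrac{\alpha_l}{t_l \sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} \left[ \alpha_l \left( \ln t_l - (\beta_l - \tau) \right) \right]^2 \right\}$
• So $E(T_l \mid \tau) = \exp\left( \beta_l - \tau + \dfrac{1}{2\alpha_l^2} \right)$
• $\alpha_l$, $\beta_l$ (time intensity) and $\tau$ (working speed) can be estimated from response time data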
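
The expected time in the MIPUT denominator follows from the mean of a log-normal variable: if ln T is normal with mean β − τ and standard deviation 1/α, then E(T) = exp(β − τ + 1/(2α²)). The sketch below computes that quantity and draws simulated times from the same model; the function names and the parameter values in the example are assumptions made for illustration.

```python
import math
import random

def expected_time(alpha, beta, tau):
    """Mean response time under the log-normal model.

    ln T ~ Normal(beta - tau, (1 / alpha) ** 2), so
    E(T) = exp(beta - tau + 1 / (2 * alpha ** 2)).
    """
    return math.exp(beta - tau + 1.0 / (2.0 * alpha ** 2))

def simulate_time(alpha, beta, tau, rng):
    """Draw one response time from the same log-normal model."""
    return math.exp(rng.gauss(beta - tau, 1.0 / alpha))

# Assumed values: time intensity 4.0 (log seconds), working speed 0.5.
rng = random.Random(1)
draws = [simulate_time(2.0, 4.0, 0.5, rng) for _ in range(50000)]
sample_mean = sum(draws) / len(draws)
model_mean = expected_time(2.0, 4.0, 0.5)
```

A faster examinee (larger τ) shortens the expected time on every item by the same factor, which is why the ratio in MIPUT is individualized.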

Performance of MIPUT
• Fan et al. (2013) showed that, compared with MI, MIPUT leads to: i) shorter testing time; ii) a small loss of measurement precision; iii) visibly worse item pool usage
• Fan et al. (2013) combined a-stratification (Chang & Ying, 1999) with MIPUT to balance item pool usage and found it effective

a-stratification
• Item information, e.g. under the 2PL: $I_l(\theta) = a_l^2 P_l(\theta) \left[ 1 - P_l(\theta) \right]$, which grows with $a_l^2$
• Items with a high discrimination parameter are therefore over-used under MI
• a-stratification restricts item selection to low-a items early in the test and high-a items later
• Apparently high-a items are still over-used under MIPUT
• That is why a-stratification helps balance item usage under MIPUT
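
A rough sketch of a-stratified selection, assuming the common variant in which the pool is split into strata of ascending discrimination and, within the stratum for the current stage, the item whose difficulty is closest to the ability estimate is chosen. The function names and the toy parameters are hypothetical, and the sketch assumes every stratum still holds an unused item.

```python
def build_strata(a_params, n_strata):
    """Split item ids into n_strata groups of ascending discrimination."""
    ordered = sorted(a_params, key=lambda l: a_params[l])
    n = len(ordered)
    return [ordered[k * n // n_strata:(k + 1) * n // n_strata]
            for k in range(n_strata)]

def select_a_stratified(theta_hat, b_params, strata, position, test_length,
                        administered):
    """Pick from the stratum matching the test stage (position is 0-based)."""
    stage = min(position * len(strata) // test_length, len(strata) - 1)
    candidates = [l for l in strata[stage] if l not in administered]
    # Within the stratum, match difficulty to the current ability estimate.
    return min(candidates, key=lambda l: abs(b_params[l] - theta_hat))

# Toy pool with assumed parameters.
a_params = {"i1": 0.6, "i2": 0.8, "i3": 1.0, "i4": 1.2, "i5": 1.6, "i6": 2.0}
b_params = {"i1": -1.0, "i2": 0.2, "i3": 0.0, "i4": 1.0, "i5": 0.1, "i6": -0.5}
strata = build_strata(a_params, 3)

first = select_a_stratified(0.0, b_params, strata, 0, 3, set())
last = select_a_stratified(0.0, b_params, strata, 2, 3, {"i2", "i3"})
```

The first pick comes from the low-a stratum and the last from the high-a stratum, so highly discriminating items are held back until the ability estimate has stabilized.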

Questions that Remain
• Fan et al. (2013) simulated item pools in which:
  • Item difficulty and time intensity are either correlated or not correlated;
  • Item discrimination and difficulty are not correlated;
  • Item discrimination and time intensity are not correlated.
• In reality:
  • Item discrimination and difficulty are positively correlated (~.4-.6) (Chang, Qian, & Ying, 2001).
• Q1: How about item discrimination and time intensity?

Follow-Up Questions
• Q2: If item discrimination and time intensity are indeed related:
  • Will MIPUT still lead to worse item pool usage than MI?
  • If so, is that still due to highly discriminating items, or due to highly time-saving items?
• Q3: Under the 1PL model, where the item discrimination parameter is not a factor:
  • Will MIPUT still lead to worse item pool usage than MI?
  • If so, is that due to highly time-saving items?
  • If so, how can we control item exposure?

Q1: Item Discrimination and Time Intensity
• Calibration of a large item bank
  • Online math testing data
  • 595 items
  • Over 2 million entries of testing data
  • 3PL and 2PL models; the following analysis focuses on the 2PL
• Time intensity measured by the log-transformed average time on each item

Q2
• So item discrimination and time intensity are indeed related. Then:
  • Will MIPUT still lead to worse item pool usage than MI?
  • If so, is that still due to highly discriminating items, or due to highly time-saving items?
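
A Simplified Version of MIPUT
• $j_{t+1} = \arg\max_{l \in R_t} \dfrac{I_l(\hat{\theta}_t)}{\bar{t}_l}$, where the denominator $\bar{t}_l$ is the average time required to finish item $l$
• It is not individualized
• It may be more robust against violations of model assumptions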
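
The denominator of the simplified rule needs only per-item average times, which can be tabulated directly from response logs. The sketch below assumes a hypothetical log of `(item_id, seconds)` pairs; it also computes the log of the average time, the time-intensity measure used in the bank calibration.

```python
import math
from collections import defaultdict

def average_times(records):
    """Average observed time per item from (item_id, seconds) pairs."""
    total = defaultdict(float)
    count = defaultdict(int)
    for item_id, seconds in records:
        total[item_id] += seconds
        count[item_id] += 1
    return {item_id: total[item_id] / count[item_id] for item_id in total}

def log_average_times(avg):
    """Log-transformed average time per item."""
    return {item_id: math.log(t) for item_id, t in avg.items()}

# Hypothetical response-time log.
records = [("i1", 30.0), ("i1", 50.0), ("i2", 100.0)]
avg = average_times(records)
intensity = log_average_times(avg)
```

Because the averages are pooled over examinees, the same denominator is used for everyone, with no working-speed estimate involved.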

Simulation Details
• CAT simulation
  • Test length: 20 or 40
  • First item randomly chosen from the pool
  • 5,000 test takers, θ ~ N(0,1)
  • Ability update: EAP with a N(0,1) prior
  • No exposure control or content balancing unless specified otherwise
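
The EAP ability update can be approximated on a fixed grid: weight each grid point by the N(0,1) prior times the likelihood of the responses so far, then take the weighted mean. This is a generic sketch under the 2PL, not the simulation code behind these slides; the grid range, the number of points, and the function name are assumptions.

```python
import math

def eap_estimate(responses, n_points=61, lo=-4.0, hi=4.0):
    """EAP estimate and posterior SD with a N(0,1) prior.

    responses: list of (a, b, u) with 2PL parameters a, b and score u in {0, 1}.
    """
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # N(0,1) prior, up to a constant
        for a, b, u in responses:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

prior_mean, prior_sd = eap_estimate([])
after_correct, _ = eap_estimate([(1.0, 0.0, 1)])
after_wrong, _ = eap_estimate([(1.0, 0.0, 0)])
```

With no responses the estimate sits at the prior mean; a correct answer moves it up and an incorrect answer moves it down by the same amount for an item centered at zero.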

Findings
• On average, MIPUT leads to shorter tests than MI: about 4 minutes (10%) shorter when test length is 20, and 9 minutes (11%) shorter when test length is 40
• MIPUT leads to slightly worse exposure control
• When item discrimination and time intensity are positively related, the disadvantage of MIPUT in exposure control becomes less conspicuous
• MI and MIPUT differ negligibly in measurement precision
• Over-exposure is still largely attributable to highly discriminating items

Q3
• Under the 1PL model, where the item discrimination parameter is not a factor:
  • Will MIPUT still lead to worse item pool usage than MI?
  • If so, is that due to highly time-saving items?
  • If so, how can we control item exposure?

Findings if Test Length = 20
• MI vs. MIPUT
  • Negligible difference in measurement precision
  • MIPUT reduces testing time by 21 minutes for a 20-item test (55% reduction)
• But MIPUT leads to much worse exposure control
  • Items that are highly time-saving are favored
  • Correlation between exposure rate and time intensity under MI-1PL: -.240 (an artifact of the item bank)
  • Correlation between exposure rate and time intensity under MIPUT-1PL: -.398

Exposure Control
• a-stratification is not going to work (under the 1PL, items do not differ in a)
• Randomesque (Kingsbury & Zara, 1989)
  • Randomly choose one out of the n best items, e.g., n = 5
  • MIPUT_R5
• Progressive Restricted (Revuelta & Ponsoda, 1998)
  • A weighted index, with the weight determined by the stage of the test
  • Combines a random number and the time-adjusted item information
  • Higher weight is given to the time-adjusted item information later in the test
  • MIPUT_PR
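
Both exposure control rules can be layered on top of any per-item score, here taken to be the time-adjusted information. The sketch below is a simplified illustration with hypothetical names: the progressive rule uses a linear weight s = position / test length and a random component drawn between 0 and the largest score, and it omits the maximum-exposure-rate restriction of the full progressive restricted method.

```python
import random

def select_randomesque(scores, eligible, n, rng):
    """Randomesque: choose at random among the n highest-scoring items."""
    ranked = sorted(eligible, key=lambda l: scores[l], reverse=True)
    return rng.choice(ranked[:n])

def select_progressive(scores, eligible, position, test_length, rng):
    """Progressive weighting of a random number and the item score.

    position is the number of items already administered, so early picks
    are mostly random and later picks are driven by the score.
    """
    s = position / test_length
    top = max(scores[l] for l in eligible)
    index = {l: (1.0 - s) * rng.uniform(0.0, top) + s * scores[l]
             for l in eligible}
    return max(eligible, key=lambda l: index[l])

# Assumed time-adjusted information for four eligible items.
scores = {"i1": 5.0, "i2": 4.0, "i3": 3.0, "i4": 2.0}
rng = random.Random(0)
r_pick = select_randomesque(scores, list(scores), 2, rng)
p_pick = select_progressive(scores, list(scores), 10, 20, rng)
```

Setting n = 5 in `select_randomesque` corresponds to the R5 variant; a larger n spreads exposure more widely at the cost of selecting less efficient items.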

Findings if Test Length = 20
• MIPUT_R5
  • Maintains measurement precision
  • Much better exposure control
  • Reduces testing time on average by 12 minutes (>30% reduction)
• MIPUT_PR
  • Maintains measurement precision
  • Better exposure control, but still not as good as MIPUT_R5
  • Reduces testing time on average by 18 minutes (a reduction of almost half)

Findings if Test Length = 40
• The same findings are replicated when test length doubles
• MIPUT leads to much worse item pool usage because of its overreliance on time-saving items
• MIPUT_R5
  • Maintains measurement precision
  • Much better exposure control
  • Reduces testing time on average by 13%
• MIPUT_PR
  • Maintains measurement precision
  • Better exposure control, but still not as good as MIPUT_R5
  • Reduces testing time on average by 41%

Overall Summary
• MIPUT's time-saving advantage is more conspicuous under the 1PL
• Under the 1PL, MIPUT leads to much worse item pool usage than MI and relies heavily on time-saving items
• MIPUT_R5 is a promising method: it maintains measurement precision, balances item pool usage, and still keeps the time-saving advantage

Future Directions
• Develop an exposure control method for MIPUT that parallels a-stratification: stratifying by time
• Investigate the performance of the simplified MIPUT and the original MIPUT when the assumptions of the log-normal model for response time are violated
• More data analysis to explore the relationship between time intensity and item parameters
• Control total testing time (van der Linden & Xiong, 2013)

Thank You!
• CTB/McGraw-Hill 2014 R&D Grant
• For questions or the paper, please visit irtnd.wikispaces.com