The Impact of Item Response Theory in Educational Assessment: A Practical Point of View
Cees A.W. Glas
University of Twente, The Netherlands
c.a.w.glas@gw.utwente.nl
Measuring body height with a questionnaire
1. I bump my head quite often
2. For school pictures I was always asked to stand in the first row
3. In bed, I often suffer from cold feet
4. When walking down the stairs, I often take two steps at a time
5. I think I would do well in a basketball team
6. As a police officer, I would not make much of an impression
7. In most cars I sit uncomfortably
8. I literally look up to most of my friends
9. Etc.
Test of Body Height [figure: example response patterns of three respondents, Ann, Jim, and Jo, on the items of the test]
Item Response Curve (Rasch model) [figure: probability of a correct response plotted against the latent ability scale]
Item Response Function [figure: probability of success plotted against ability, with discrimination, difficulty, and guessing parameters]
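A minimal sketch of the three-parameter logistic item response function suggested by the figure above; the names `a`, `b`, and `c` for discrimination, difficulty, and guessing are conventional and not taken from the slides.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """Three-parameter logistic item response function.

    theta : latent ability
    a     : discrimination (slope at the inflection point)
    b     : difficulty (location on the ability scale)
    c     : guessing parameter (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Probability of success over a range of abilities for one item.
abilities = np.linspace(-3, 3, 7)
print(irf_3pl(abilities, a=1.2, b=0.0, c=0.2))
```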
Applications
• Local reliability and optimal test construction
• Test equating
• Multilevel item response theory in school effectiveness research
Item and Test Information
• Information is a local measure of reliability
• Item and test information functions
• In adaptive testing, items are selected to maximize information at the estimated ability of the examinee
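For reference, a standard form of the item and test information functions, in conventional IRT notation (these exact formulas are not preserved in the scraped slides):

```latex
% Item information for item i with response function P_i(theta),
% and test information as the sum over the items in the test.
I_i(\theta) = \frac{\bigl[P_i'(\theta)\bigr]^2}{P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr]},
\qquad
I(\theta) = \sum_{i \in \mathrm{test}} I_i(\theta).
```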
Adaptive Item Selection [sequence of figures: the item information functions of Item 1, Item 2, and Item 3 are added one by one, building up the test information function over the ability scale; the test information is the sum of the item information functions]
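A minimal Python sketch of this selection rule, assuming a 3PL item bank and a current ability estimate; the function names and toy parameters are illustrative, not taken from the presentation.

```python
import numpy as np

def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta (Birnbaum's formula)."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    q = 1.0 - p
    return (a ** 2) * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta_hat, bank, administered):
    """Return the index of the not-yet-administered item with maximum
    information at the current ability estimate."""
    best, best_info = None, -np.inf
    for i, (a, b, c) in enumerate(bank):
        if i in administered:
            continue
        info = item_information_3pl(theta_hat, a, b, c)
        if info > best_info:
            best, best_info = i, info
    return best

# Toy item bank: (discrimination, difficulty, guessing) per item.
bank = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (0.8, 1.0, 0.2)]
print(select_next_item(theta_hat=0.3, bank=bank, administered={0}))
```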
Adaptive Testing with Content Constraints
• Psychometrically optimal adaptive, individualized testing
• Test content specifications
• Psychometrically optimal within content constraints and practical constraints
• A discrete optimization problem
Adaptive Testing with Content Constraints: the Law School Admission Test
• content constraints
• item type constraints
• word count constraints
• answer key constraints
• gender / minority orientation
• clusters of items (testlets)
• some items contain clues to each other
Test Constraints
• Constraints are imposed by linear programming techniques
• For every item i, a binary decision variable is defined (item i is selected for the test or not)
Test Assembly Model
• Objective: maximize the information in the test
• Decision variables: item i is selected for the test or not
• At most 5 items on statistics
• Items 12 and 35 contain clues to each other, so at most one of them may be selected
• Time available is 60 minutes
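A sketch of how these requirements could be written as a 0-1 linear program; the decision variables x_i, the item time estimates t_i, the set of statistics items, and the target ability point θ₀ are illustrative notation, not copied from the slide.

```latex
% 0-1 linear programming formulation of the test assembly model:
% x_i = 1 if item i is selected, 0 otherwise.
\begin{aligned}
\text{maximize}\quad   & \textstyle\sum_{i} I_i(\theta_0)\, x_i              && \text{(information in the test)}\\
\text{subject to}\quad & \textstyle\sum_{i \in S_{\mathrm{stat}}} x_i \le 5  && \text{(at most 5 statistics items)}\\
                       & x_{12} + x_{35} \le 1                               && \text{(items with mutual clues)}\\
                       & \textstyle\sum_{i} t_i\, x_i \le 60                 && \text{(time limit of 60 minutes)}\\
                       & x_i \in \{0,1\} \ \text{for all } i                 && \text{(item selected or not)}
\end{aligned}
```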
Equating of Examinations
• Problem: the level of students and the difficulty of examinations fluctuate over the years
• Objective: to determine pass/fail cut-off scores on examinations in such a way that they reflect the same level of proficiency on the latent scale,
  • taking into account the difficulty level of the examinations
  • and differences in proficiency level over the years
Simple Deterministic Model
• Important feature of the model: parameter separation (distinct parameters for persons and items)
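The formula on this slide is not preserved. As a hedged sketch of one plausible reading: in a deterministic model with parameter separation, person j answers item i correctly exactly when the person parameter exceeds the item parameter, and the Rasch model shown earlier is its probabilistic counterpart.

```latex
% Deterministic model: correct response iff theta_j > b_i.
X_{ij} = \begin{cases} 1 & \text{if } \theta_j > b_i,\\ 0 & \text{otherwise;} \end{cases}
\qquad
% Probabilistic (Rasch) counterpart with the same parameter separation.
P(X_{ij} = 1) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}.
```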
Model for an Item with 5 Response Categories [figure: category response curves, i.e. the probability of each response category X = 0, 1, 2, 3, 4 plotted against the latent ability scale]
Multidimensional IRT model
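The formula for this slide is not preserved; a common form of a two-parameter multidimensional IRT model, as a sketch in conventional notation (vector of abilities θ_j, item discriminations a_i, intercept d_i):

```latex
% Multidimensional 2PL: the probability of a correct response depends on a
% weighted combination of several latent abilities.
P(X_{ij} = 1 \mid \boldsymbol{\theta}_j) =
\frac{\exp\!\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\bigr)}
     {1 + \exp\!\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\bigr)}.
```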
Problems with the Anchor Item Design
• Student ability increases between test administrations due to learning
• Differences in ability and item ordering between the anchor test and the examination due to low motivation of students
• If the anchor test becomes known, the test functions differently over the years
• All these effects violate the model and bias the estimated cut-off scores
Measurement Model: GPCM
• Generalized Partial Credit Model (Muraki)
• Alternatives to the GPCM:
  • Graded Response Model (Samejima)
  • Sequential Model (Tutz)
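For reference, a standard statement of the GPCM category probabilities; the notation (discrimination a_i, step parameters b_ik) is conventional rather than copied from the slides.

```latex
% GPCM: probability that person j scores in category x (x = 0, ..., m_i) on
% item i; the empty sum for x = 0 is taken to be zero.
P(X_{ij} = x \mid \theta_j) =
\frac{\exp\!\Bigl(\sum_{k=1}^{x} a_i(\theta_j - b_{ik})\Bigr)}
     {\sum_{h=0}^{m_i} \exp\!\Bigl(\sum_{k=1}^{h} a_i(\theta_j - b_{ik})\Bigr)}.
```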
Structural Model
• Takane and de Leeuw (1987): the model is equivalent to a factor analysis model
  • Discrimination parameters are factor loadings
  • Ability parameters are factor scores
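A sketch of the usual underlying-variable argument behind this equivalence (notation is illustrative): the observed item response is a dichotomized latent continuous response that follows a factor model.

```latex
% Latent continuous response Z_ij with loadings a_i and factor scores theta_j;
% dichotomizing it at threshold b_i yields the normal-ogive IRT model.
Z_{ij} = \mathbf{a}_i^{\top}\boldsymbol{\theta}_j + \varepsilon_{ij},
\quad \varepsilon_{ij} \sim N(0,1),
\quad X_{ij} = \begin{cases} 1 & \text{if } Z_{ij} > b_i,\\ 0 & \text{otherwise,} \end{cases}
\quad\Rightarrow\quad
P(X_{ij} = 1 \mid \boldsymbol{\theta}_j) = \Phi\!\bigl(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j - b_i\bigr).
```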
Problems with “Ordinary” Regression and Analysis of Variance Models
• Different aggregation levels: school level and student level
• Variance structure: students within schools are more similar than students from different schools
• Old, unsatisfactory solutions:
  • aggregating to the school level
  • disaggregating to the student level
• Newer solution: multilevel models (Bryk & Raudenbush, Longford, Goldstein)
Motivation for This Approach: All the Niceties of IRT Are Available in Multilevel Analysis
• A method to model unreliability in the dependent and independent variables
• Heteroscedasticity: reliability is defined locally
• Incomplete test administration and calibration designs (with the possibility to include selection models)
• No assumption of normally distributed scores
• Fewer ceiling problems
An Example (Shalabi, Fox, Glas, & Bosker)
• 3384 grade-seven pupils in 119 schools in the West Bank
• Mathematics test
• Gender
• SES
• IQ
• School leadership
• School climate
Model and Intra-Class Correlation [slide formulas not preserved; see the sketch below]
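A sketch of the conventional two-level formulation and the intra-class correlation it implies; in the multilevel IRT setting the dependent variable is the latent ability from the measurement model rather than an observed score, and the exact parameterization on the original slide is not preserved.

```latex
% Two-level model for student i in school j: school-level random effect u_j
% and student-level residual e_ij; rho is the proportion of variance at the
% school level (the intra-class correlation).
y_{ij} = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0,\tau^2), \qquad e_{ij} \sim N(0,\sigma^2),
\qquad
\rho = \frac{\tau^2}{\tau^2 + \sigma^2}.
```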
Conclusions
• IRT is based on the idea of parameter separation
• An IRT measurement model can be combined with a structural model
• The combined model is equivalent to factor analysis and latent variable models, and as such is a generalization of other well-known regression models
• Applications of IRT:
  • Local reliability and optimal test construction
  • Test equating
  • Multilevel IRT in school effectiveness research