Assessing the Fit of IRT Models in Language Testing

Muhammad Naveed Khalid, Ardeshir Geranpayeh

Outline • Item Response Theory (IRT) • Importance of Model Fit within IRT • Fit Procedures • Issues and Limitations • Lagrange Multiplier (LM) Test • An empirical study using LM Fit statistics • Sharing Results • Conclusions

Item Response Theory (IRT) • A family of mathematical models that provide a common framework for describing people and items • Examinee performance can be predicted in terms of the underlying trait • Provides a means for estimating abilities of people and characteristics of items

IRT Models • Dichotomous or Discrete • 1 Parameter Logistic Model / Rasch (1PL) • 2 Parameter Logistic Model (2PL) • 3 Parameter Logistic Model (3PL) • Polytomous or Scalar • Partial Credit Model (PCM) • Generalized Partial Credit Model (GPCM) • Graded Response Model (GRM)
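As an illustration of the dichotomous models listed above, the following is a minimal sketch of the logistic item response function. The function name `irf_3pl` and the parameter names `a` (discrimination), `b` (difficulty) and `c` (lower asymptote, or guessing) are conventional choices assumed here, not taken from the slides, and no scaling constant such as 1.7 is applied.

```python
import math


def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response at trait level theta.

    With c = 0 this reduces to the 2PL model; with c = 0 and a
    discrimination shared by all items it has the 1PL (Rasch) form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Hypothetical item parameters, chosen only for illustration.
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(irf_3pl(theta, a=1.2, b=0.5, c=0.2), 3))
```

When theta equals b, the probability is halfway between the lower asymptote c and 1, which is why b is read as the item's location on the trait scale.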

Shape of Item Response Function

Model for an Item with 5 Response Categories (figure: probability of each response category)
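Category probabilities for a five-category item can be sketched under the Generalized Partial Credit Model as below. The slides do not say which polytomous model the figure used, so the choice of GPCM, the function name `gpcm_probs` and the threshold values are assumptions made for illustration.

```python
import math


def gpcm_probs(theta, a, thresholds):
    """Category probabilities under the GPCM for one item.

    thresholds holds the m step parameters of an item with m + 1
    categories (scores 0..m). Setting a = 1 gives the Partial Credit Model.
    """
    # Cumulative sums of a * (theta - b_v); category 0 has an empty sum (0).
    cum = [0.0]
    for b in thresholds:
        cum.append(cum[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cum)
    weights = [math.exp(z - top) for z in cum]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Four hypothetical step parameters give five response categories.
    probs = gpcm_probs(theta=0.0, a=1.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
    print([round(p, 3) for p in probs])
```

Evaluating the function over a grid of theta values yields one curve per category, which is the kind of plot the slide title describes.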

IRT Applications In language testing, IRT is mainly used for • Test development • Item banking • Differential item functioning (DIF) • Computerized adaptive testing (CAT) • Test equating, linking and scaling • Standard setting The utility of an IRT model depends on how accurately the model reflects the data.

Model Fit from Item Perspective • Measurement Invariance (MI): Item responses can be described by the same parameters in all sub-populations. • Item Characteristic Curve (ICC): Describes the relation between the latent variable and the observable responses to items. • Local Independence (LI): Responses to different items are independent given the value of the latent trait variable. • Unidimensionality • Speededness • Global

Consequences of Misfit Yen (2000) and Wainer & Thissen (2003) have shown the consequences of inadequate model-data fit. Some of the adverse consequences are: • Biased ability estimates • Unfair ranks • Wrongly equated scores • Student misclassifications • Reduced score precision • Threats to validity

Existing Item Fit Procedures Chi-Square Statistics: tests of the discrepancy between the observed and expected frequencies. • Pearson-type item-fit indices (Yen, 1984; Bock, 1972) • Likelihood-ratio-based item-fit indices (McKinley & Mills, 1985)
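The general idea behind a Pearson-type item-fit index can be sketched as follows: examinees are sorted into subgroups by their trait estimates, and the observed proportion correct in each subgroup is compared with the proportion the model expects. This is a simplified illustration in the spirit of such indices, not the exact procedure of any cited author. The function names, the use of a 2PL model, and the assumption that trait estimates and item parameters are already available are all choices made for this example.

```python
import math


def irf_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def pearson_item_fit(thetas, responses, a, b, n_groups=10):
    """Pearson-type discrepancy for one dichotomous item.

    thetas: trait estimates, one per examinee (assumed given).
    responses: 0/1 scores on the item, in the same order.
    """
    # Sort examinees by trait estimate and cut into near-equal subgroups.
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    size = len(order) / n_groups
    stat = 0.0
    for g in range(n_groups):
        idx = order[int(round(g * size)):int(round((g + 1) * size))]
        if not idx:
            continue
        n_g = len(idx)
        observed = sum(responses[i] for i in idx) / n_g
        expected = sum(irf_2pl(thetas[i], a, b) for i in idx) / n_g
        # Squared standardized difference, weighted by subgroup size.
        stat += n_g * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return stat
```

Larger values indicate a larger gap between observed and model-expected proportions; the reference distribution to compare the value against is a separate question that this sketch does not settle.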

Issues in Existing Fit Procedures • The standard theory for chi-square statistics does not hold. • They fail to take into account the stochastic nature of the item parameter estimates. • The subgroups for the test are formed from model-dependent trait estimates. • The number of degrees of freedom is unclear. • They are sensitive to test length and sample size.

Lagrange Multiplier (LM) Test Glas (1999) proposed the LM test for evaluating model fit. LM tests are used for testing a restricted model against a more general alternative. Consider a null hypothesis about a model with parameters. This model is a special case of a general model with parameters.
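For reference, the general textbook form of an LM statistic can be written as below. The notation is assumed for this illustration and is not recovered from the slide: the parameters of the general model are partitioned as $\eta = (\eta_1, \eta_2)$, the null model fixes $\eta_2 = c$, and $\hat{\eta}_1$ is the maximum likelihood estimate under the null model.

```latex
% Score (first-order derivative) for the restricted parameters,
% evaluated at the null-model estimates:
h_2 = \left. \frac{\partial \log L(\eta)}{\partial \eta_2}
      \right|_{\eta = (\hat{\eta}_1,\, c)}

% LM statistic, with \Sigma_{jk} the blocks of the information matrix
% partitioned conformably with (\eta_1, \eta_2):
\mathrm{LM} = h_2^{\top} W^{-1} h_2,
\qquad
W = \Sigma_{22} - \Sigma_{21}\, \Sigma_{11}^{-1}\, \Sigma_{12}

% Under the null model, LM is asymptotically chi-square distributed
% with degrees of freedom equal to the number of elements of \eta_2.
```

Only the null model has to be estimated: the statistic asks whether the likelihood would improve noticeably if the fixed parameters were freed, which is what lets one set of estimates serve many item-level tests.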

LM Item Fit Statistics • MI / DIF: null model vs. alternative model • LI: null model vs. alternative model • ICC: null model vs. alternative model

Empirical Example • Data from Cambridge English First (FCE) • Reading 3 parts/30 questions • Listening 4 parts/30 questions • Sample size over 35000 • The approach can be applied to any other language exam

Conclusions • LM statistics overcome the issues of existing fit procedures • Less computationally intensive • The size of the residuals, reported as Abs.Dif, is highly valuable • The IRT model fits the FCE data reasonably well • Items showing violations: MI (4); ICC (3); LI (7) • The magnitude of the violations is not severe

Thank you! & Questions
