Applications of IRT Models: DIF and CAT
Which of these situations describes a biased test? • The average score on an item is not the same for males and females. • The correlation between males’ scores on an item is stronger than that for the females’ scores. • A group of males and females with exactly the same ability achieve different scores on an item.
Disentangling the Terminology • Item impact • Item impact is evident when examinees from different groups have differing probabilities of responding correctly to (or endorsing) an item because there are true differences between the groups in the underlying ability being measured by the item. • DIF • The differential probability of a correct response for examinees at the same trait level but from different groups. • DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure. • Item bias • Item bias occurs when examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose. • Adverse Impact • Adverse impact is a legal term describing the situation in which group differences in test performance result in disproportionate examinee selection or related decisions (e.g., promotion). This is not evidence for test bias.
There are two types of DIF • Uniform DIF • The reference group has a higher probability of a correct response than the focal group at every point on the ability scale. • Non-uniform DIF • The direction of the advantage of one group’s likelihood of a correct response changes in different regions of the ability scale.
Relationship between IRT and CTST models • It has been shown that there is a relationship between 2PL normal ogive IRT models and the single-factor FA model (Lord & Novick, 1968) • The b-parameter is related to the threshold parameter divided by the item factor loading • The discrimination parameter is equal to the factor loading divided by the square root of the item's uniqueness (one minus the communality) • Highly discriminating items will have high factor loadings
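The conversion above can be sketched in a few lines. This is a minimal illustration, assuming standardized loadings and thresholds from a single-factor model and the normal ogive relations a = loading / sqrt(1 - loading^2) and b = threshold / loading; the function name and the example values are hypothetical.

```python
import math

def factor_to_irt(loading, threshold):
    """Convert standardized single-factor parameters to normal-ogive a and b.

    Assumes a standardized loading strictly between 0 and 1 in absolute value.
    """
    uniqueness = 1.0 - loading ** 2        # one minus the communality
    a = loading / math.sqrt(uniqueness)    # discrimination parameter
    b = threshold / loading                # difficulty (location) parameter
    return a, b

# Hypothetical item: loading 0.6, threshold 0.3
a, b = factor_to_irt(loading=0.6, threshold=0.3)
print(round(a, 3), round(b, 3))
```

Because the uniqueness shrinks as the loading grows, a larger loading always yields a larger discrimination, which matches the last bullet of the slide.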
Examining Measurement Invariance in CTST • Examining factorial invariance • Configural invariance • Zero and non-zero loading patterns are the same across groups • Pattern (metric) invariance • The factor loadings are equal across groups • Scalar (strong) invariance • The factor loadings and intercepts are equal across groups • Any group differences in means can be attributed to the common factors, which allows for meaningful group mean comparisons • Strict invariance • Factor loadings, intercepts, and unique variances are equal across groups • Any systematic differences in group means, variances, or covariances are due to the common factors
Examining DIF in IRT • IRT tests of DIF examine whether the IRC (item response curve) is the same for the reference group as it is for the focal group. • The focal group is the smaller group in question (the minority group). • The reference group is the larger group that generally has the established parameters. • If the curves differ, then the probability that an individual with ability x in one group answers the item correctly is different from the probability for an individual with the same ability x in the other group. • DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group. • DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.
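A small sketch may help make the IRC and test characteristic curve comparison concrete. It assumes a 2PL logistic model and made-up item parameters for each group; none of the values come from the slides.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: sum of the item response functions."""
    return sum(irf_2pl(theta, a, b) for a, b in items)

# Hypothetical (a, b) pairs; the second item is harder for the focal group.
reference_items = [(1.2, -0.5), (0.9, 0.0), (1.5, 0.8)]
focal_items = [(1.2, -0.5), (0.9, 0.6), (1.5, 0.8)]

theta = 0.0  # same ability in both groups
item_gap = irf_2pl(theta, *reference_items[1]) - irf_2pl(theta, *focal_items[1])
test_gap = tcc(theta, reference_items) - tcc(theta, focal_items)
print(round(item_gap, 3), round(test_gap, 3))
```

Here only one item differs between groups, so the gap in expected test score at a given ability equals the gap on that item. With several DIF items the item-level gaps can add up or cancel at the test level.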
Procedures for Detecting DIF/DTF • Parametric Procedures • Compare item parameters from two groups of examinees • Lord’s Chi-Square • Likelihood Ratio Test • Compare IRFs from two groups of examinees by measuring areas between them • Raju’s Area Measures
Likelihood Ratio Test • Distributed as a chi-square with degrees of freedom equal to the difference in the number of parameters estimated in the compact and the augmented model • The compact model assumes item parameters are the same for both groups • The augmented model constrains anchor items to be equal, but allows items of interest to have parameters that vary across groups
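The test statistic can be sketched as follows, assuming the two models have already been fitted elsewhere and that their log-likelihoods and parameter counts are available. The function name and all numbers are hypothetical.

```python
def likelihood_ratio_dif(loglik_compact, loglik_augmented,
                         n_params_compact, n_params_augmented):
    """Return the likelihood ratio statistic G^2 and its degrees of freedom."""
    g2 = -2.0 * (loglik_compact - loglik_augmented)
    df = n_params_augmented - n_params_compact
    return g2, df

# Hypothetical fit results: freeing a and b for one studied 2PL item adds 2 parameters.
g2, df = likelihood_ratio_dif(-5234.8, -5230.1, 40, 42)

# 5.991 is the chi-square critical value for df = 2 at alpha = .05.
flagged = g2 > 5.991
print(round(g2, 2), df, flagged)
```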
Raju’s Area Measures • Signed and unsigned areas • Indicate the area between two IRCs • Require separate calibrations of the item parameters in each group, followed by a linear transformation to put them on the same scale
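As a rough illustration, the signed and unsigned areas can be approximated numerically. This sketch assumes 2PL items whose parameters are already on a common scale and uses a simple midpoint rule over a wide ability range. It is not Raju's closed-form solution, and the parameter values are hypothetical.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def area_measures(ref, focal, lo=-20.0, hi=20.0, steps=8000):
    """Approximate signed and unsigned areas between two IRCs (midpoint rule)."""
    width = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        diff = irf_2pl(theta, *ref) - irf_2pl(theta, *focal)
        signed += diff * width
        unsigned += abs(diff) * width
    return signed, unsigned

# Uniform DIF: equal discriminations, focal item 0.5 units harder.
print(area_measures((1.0, 0.0), (1.0, 0.5)))
# Non-uniform DIF: curves cross, so the signed area cancels but the unsigned area does not.
print(area_measures((1.5, 0.0), (0.7, 0.0)))
```

The second case shows why both measures are reported: when the curves cross, positive and negative differences offset each other in the signed area.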
Procedures for Detecting DIF/DTF • Non-Parametric Procedures • Based on bivariate frequencies between item responses and group membership, conditional on levels of the ability or trait estimate • Simultaneous Item Bias Test (SIBTEST) • Mantel-Haenszel (MH) • Logistic Regression
Procedures for Detecting DIF/DTF • Simultaneous Item Bias Test (SIBTEST) • Examinees are matched on an estimate of their true score • Creates a weighted mean difference between the reference and focal groups, which is then tested statistically • The means are adjusted to correct for differences in the ability distributions with a regression correction procedure • Some research has examined how the Type I error rate of this procedure changes when the percentage of DIF items is large
Mantel-Haenszel (MH) • Compares the item performance of two groups who were previously matched on the ability scale • Total test score can be used • K 2x2 contingency tables are made for each item, one for each of the K ability levels • DIF is shown if the odds of correctly answering the item at a given score level differ between the two groups
Mantel-Haenszel (MH) • The statistic for detecting DIF in an item is ΔαMH = −2.35 ln(αMH), where αMH is the common odds ratio pooled over the K score levels • Type A items – negligible DIF with |ΔαMH| < 1 • Type B items – moderate DIF with 1 <= |ΔαMH| <= 1.5, and the MH test is statistically significant • Type C items – large DIF with |ΔαMH| > 1.5
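The MH computation can be sketched as below, assuming the usual common odds ratio pooled over the K 2x2 tables and the delta transformation −2.35 ln(α). The table layout, function names, and counts are hypothetical, and the significance test itself is not computed here. It is passed in as a flag.

```python
import math

def mantel_haenszel(tables):
    """Common odds ratio and delta from K 2x2 tables.

    Each table is (A, B, C, D): reference correct, reference incorrect,
    focal correct, focal incorrect at one score level.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    alpha = num / den  # fails if no table has both B and C nonzero
    delta = -2.35 * math.log(alpha)
    return alpha, delta

def classify(delta, significant):
    """A/B/C labels following the cut-offs on the slide."""
    size = abs(delta)
    if size < 1.0:
        return "A"
    if size > 1.5:
        return "C"
    # Assumption: a mid-sized but non-significant delta is treated as negligible.
    return "B" if significant else "A"

# Hypothetical counts at two score levels
alpha, delta = mantel_haenszel([(30, 10, 20, 20), (40, 5, 30, 15)])
print(round(alpha, 2), round(delta, 2), classify(delta, significant=True))
```

An odds ratio above 1 (negative delta) means the reference group has the higher odds of a correct response at matched score levels.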
Logistic Regression • If the group effect is significant and the interaction is not, then there is uniform DIF • If the interaction is significant, then there is non-uniform DIF • Conduct model comparisons by adding each successive model term
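The decision rule above can be sketched as a comparison of three nested models: matching score only, score plus group, and score plus group plus the group-by-score interaction. The sketch assumes the models were fitted elsewhere and that each step is judged by a 1-df likelihood ratio test. The function name and log-likelihood values are hypothetical.

```python
def classify_logistic_dif(ll_score, ll_group, ll_interaction, critical=3.841):
    """Classify DIF from the log-likelihoods of three nested logistic models.

    critical=3.841 is the chi-square cut-off for df = 1 at alpha = .05.
    """
    g2_uniform = 2.0 * (ll_group - ll_score)            # adds the group term
    g2_nonuniform = 2.0 * (ll_interaction - ll_group)   # adds the interaction term
    if g2_nonuniform > critical:
        return "non-uniform DIF"
    if g2_uniform > critical:
        return "uniform DIF"
    return "no DIF flagged"

print(classify_logistic_dif(-600.0, -595.0, -594.5))
print(classify_logistic_dif(-600.0, -599.5, -590.0))
```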
Computerized Adaptive Testing (CAT) • To obtain precision of measurement equal to that of a linear (fixed-form) test, but with greater efficiency. • Give people only the items that are informative about them. • Reduce testing time and opportunity for error.
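One common way to "give people only the items that are informative about them" is to pick the unused item with the most Fisher information at the current ability estimate. The slides do not name a selection rule at this point, so the sketch below is an assumption: it uses 2PL items, maximum-information selection, and a made-up item pool.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta, pool, administered):
    """Index of the not-yet-administered item with maximum information at theta."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))

# Hypothetical pool of (a, b) pairs
pool = [(1.0, -1.5), (1.0, 0.1), (1.0, 1.2)]
first = select_next_item(0.0, pool, administered=set())
second = select_next_item(0.0, pool, administered={first})
print(first, second)
```

With equal discriminations the most informative item is the one whose difficulty is closest to the current ability estimate.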
Issues of Research in a CAT System • Early Issues • Precision of measurement • Estimation procedure, Prior estimates • Equivalence • Reliability of Estimate, Test Form Equivalence (Test Information), Testing Mode • Efficiency • Item selection methods, Test length • Newer Issues • Security • Item exposure • Testlet models
Item Exposure and Item Selection Methods • Sympson-Hetter • Directly controls item exposure probabilistically • Places a filter between item selection and item administration • Items are administered below a prespecified maximum exposure rate • P(S) probability that an item is selected as the best item • P(A) probability that an item is administered • P(A|S) conditional probability that an item is administered given that it is selected • Item exposure parameter • P(A)=P(A|S)*P(S)<=rmax • P(A|S) is easy to determine if P(S) is known, but P(S) must be determined through an iterative process
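The filter and one pass of the parameter update can be sketched as follows. It assumes the selection probabilities P(S) have been estimated from a simulated CAT run, which is not shown. In practice the simulation and update are repeated until the exposure parameters stabilize. Function names and values are hypothetical.

```python
import random

def update_exposure_parameters(p_selected, r_max):
    """One update of P(A|S) so that P(A) = P(A|S) * P(S) <= r_max."""
    return [1.0 if ps <= r_max else r_max / ps for ps in p_selected]

def administer(item, exposure_params, rng):
    """Probabilistic filter between selection and administration."""
    return rng.random() < exposure_params[item]

# Hypothetical selection rates from a simulation, with r_max = 0.2
p_selected = [0.5, 0.1]
params = update_exposure_parameters(p_selected, r_max=0.2)
print(params)

rng = random.Random(0)
# If the selected item is filtered out, the CAT moves on to the next-best item.
print(administer(0, params, rng))
```

For the first item, P(A|S) = 0.2 / 0.5 = 0.4, so its expected administration rate is 0.4 * 0.5 = 0.2, which sits exactly at the ceiling.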
Item Exposure and Item Selection Methods • Conditional Sympson-Hetter or SLC (Stocking and Lewis, 1998) • SH controls item exposure for a population, but at particular ability levels the exposure rates can still be quite high • P(A|S) is determined at specific trait levels rather than across a population
Item Exposure and Item Selection Methods • a-stratified design (STR CAT; Chang & Ying, 1996, 1999) • Partition the item pool into multiple levels and the test into multiple stages according to the discrimination parameters • Start with the less discriminating items • This approach seems to improve item pool utilization and balance item exposure rates • Then use a b-matching item selection procedure • It is less computationally complex • No other restrictions on item exposure are imposed
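The two steps, stratifying by discrimination and then b-matching within the current stratum, can be sketched as below. This is a simplified illustration with a made-up pool and roughly equal-sized strata, not the full procedure from the cited papers.

```python
def stratify_by_a(pool, n_strata):
    """Split item indices into strata ordered from low to high discrimination.

    Strata are roughly equal in size; a very small pool may yield fewer strata.
    """
    order = sorted(range(len(pool)), key=lambda i: pool[i][0])
    size = -(-len(order) // n_strata)  # ceiling division
    return [order[k:k + size] for k in range(0, len(order), size)]

def select_b_matched(theta, pool, stratum, administered):
    """Within a stratum, pick the unused item whose b is closest to theta."""
    candidates = [i for i in stratum if i not in administered]
    return min(candidates, key=lambda i: abs(pool[i][1] - theta))

# Hypothetical pool of (a, b) pairs
pool = [(0.5, -1.0), (1.8, 0.2), (0.7, 0.3), (1.1, -0.4), (0.6, 1.2), (1.6, 1.0)]
strata = stratify_by_a(pool, n_strata=3)
print(strata)

# Early stage: draw from the least discriminating stratum.
print(select_b_matched(0.0, pool, strata[0], administered=set()))
```

Later stages would move to `strata[1]` and then `strata[2]`, saving the highly discriminating items for when the ability estimate is more stable.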