Examining Differential Item Functioning of "Insensitive" Test Items
Juliya Golubovich, James A. Grand, M.A., Neal Schmitt, Ph.D., and Ann Marie Ryan, Ph.D.
Michigan State University
INTRODUCTION

Fairness in testing is a prominent concern for selection specialists. To enhance test fairness, test developers commonly use a sensitivity review process (also called a fairness review; Ramsey, 1993). During a typical sensitivity review, reviewers go through test questions to identify and remove content that certain groups of test takers (e.g., gender, racial/ethnic, age, or socioeconomic groups) could perceive as insensitive (e.g., upsetting, offensive). Consider an offensive fill-in-the-blank item: "The fact that even traditional country music singers have ________ words such as 'bling' and 'ho' seems to indicate that urban, hip-hop culture has ________ all music genres."

An unaddressed question is whether items flagged by sensitivity reviewers as problematic are ones that would negatively affect certain test takers' performance (i.e., show psychometric bias). To this end, we examined whether lack of fairness in the form of items judged to be insensitive (determined via a judgmental process before test administration) is associated with psychometric bias (determined via a statistical process after test administration). When test takers of equal ability from different groups show unequal probabilities of responding to an item correctly, there is evidence of differential item functioning (DIF). We examined how item characteristics considered insensitive according to standard sensitivity review guidelines relate to the presence of DIF across male and female test takers. Gender was a reasonable grouping variable given research suggesting women may be more reactive than men to problematic item content (Mael et al., 1996).

METHOD

Sample
336 students, primarily young (M = 19.45, SD = 1.65) and White (73.6%), and about equally split on gender (n = 170 males).

Measures
Demographics. Age, gender, ethnicity, and race.
Test items. Nine insensitive items were embedded within a 30-item verbal ability test. The insensitive items came from a 54-item pool developed by the authors based on insensitive item exemplars from sensitivity reviewer training materials. Items belonged to various categories of insensitivity (e.g., offensive, emotionally provocative, portrays a gender stereotype) derived from fairness guidelines (e.g., ACT, 2006). Testing professionals (n = 49) with experience serving as sensitivity reviewers (10.22 years on average) rated the original pool of 54 items on insensitivity using a four-point scale (1 = highly insensitive; 4 = not problematic); any one reviewer rated 18 of the 54 items. A sample of students (N = 301; 26.4% male; 14.6% non-White; mean age = 19.62) also evaluated these items on the same scale, assuming the role of sensitivity reviewers after a brief tutorial on sensitivity reviews. The nine items chosen for the current study were those rated most insensitive on average by professional and student reviewers and those that showed significant gender differences in student reviewer ratings. The 21 non-problematic items in the current study were selected from the larger set of 108 items rated by student reviewers and were those that received the most favorable ratings (M = 3.85, SD = .04).

Procedure
Students responded to demographic questions in an online survey upon signing up for the study. They then appeared at a scheduled time to take the 30-item verbal ability test in a supervised group setting. The test was not timed.

Statistical Analyses
After administration, we examined the test items for differential item functioning (DIF) based on gender. DIF exists on a particular item when individuals of equal ability but from different groups (in this case, gender groups) have unequal probabilities of answering the item correctly; differences in item characteristic curves (ICCs) across groups provide evidence of DIF. We used the two-parameter logistic (2PL) model, in which the probability of a correct response to an item is modeled as a function of item difficulty, item discrimination, and examinee ability (the model is sketched below). BILOG-MG showed good overall model fit to the data, χ²(201) = 198.6, p > .05, but the model did not fit two of the 30 items, and parameters for a third item could not be estimated for males. Thus, a total of 27 items (7 of them judged insensitive) were used for the final DIF analyses. Item parameters were estimated separately for males and females using BILOG-MG, and the parameter estimates for the two groups were placed on a common scale so that ICCs could be compared. ICCs for each group were plotted on common graphs and visually inspected for large differences. Because this examination suggested differences in difficulty rather than discrimination in each case, the DIF function in BILOG-MG was used to test the significance of difficulty differences between ICCs.
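For reference, here is a minimal sketch of the 2PL model referred to above, in standard IRT notation; the symbols are generic and are not parameter estimates from this study:

P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\left[-a_i\,(\theta_j - b_i)\right]}

where \theta_j is examinee j's ability, a_i is item i's discrimination, and b_i is its difficulty. The difficulty-based (uniform) DIF tested here corresponds to b_i differing across gender groups while a_i is effectively common, which shifts one group's ICC horizontally relative to the other's.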
RESULTS

• No significant difference in overall test performance was observed between women (M = 20.87, SD = 3.59) and men (M = 20.01, SD = 4.43).
• None of the insensitive items analyzed exhibited a large amount of DIF; the three items that showed the greatest evidence of DIF (all favoring females) were not among those rated as insensitive by judges.
• Two of these three items had significantly different difficulty estimates for men and women. Substantive analyses suggested that females' advantage on these items may have been due to response-style differences between men and women.
• Figure 1. Male and female ICCs for item 19 (an illustrative version of this kind of comparison is sketched below).
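To make the ICC comparison in Figure 1 concrete, the short Python sketch below plots male and female 2PL curves for a single item and reports their largest vertical gap. The parameter values are hypothetical placeholders for illustration, not the BILOG-MG estimates from this study.

import numpy as np
import matplotlib.pyplot as plt

def icc_2pl(theta, a, b):
    # 2PL probability of a correct response given ability theta,
    # item discrimination a, and item difficulty b.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical calibrated parameters for one item, already on a common scale
# (placeholder values for illustration; not estimates from this study).
params = {"male": {"a": 1.2, "b": -0.30}, "female": {"a": 1.2, "b": -0.75}}

theta = np.linspace(-3, 3, 301)  # ability range over which ICCs are typically plotted

for group, p in params.items():
    plt.plot(theta, icc_2pl(theta, p["a"], p["b"]), label=group)

# With equal discriminations and unequal difficulties, the curves are shifted
# horizontally; a large gap indicates uniform (difficulty-based) DIF.
gap = np.max(np.abs(icc_2pl(theta, **params["male"]) - icc_2pl(theta, **params["female"])))
print(f"Maximum difference between the two ICCs: {gap:.3f}")

plt.xlabel("Ability (theta)")
plt.ylabel("P(correct response)")
plt.title("Male vs. female ICCs for one item (illustrative)")
plt.legend()
plt.show()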
DISCUSSION

We examined whether insensitive item content would produce item-level male-female performance differences on a verbal ability test. Only three items were flagged as problematic by the DIF analyses, and these were not items that sensitivity reviewers saw as problematic. Our findings are consistent with previous research in that judgmental and statistical processes diverged in their identification of differentially functioning items. Our study moves beyond earlier work, however, in that we examined types of insensitive items that in other studies would not have made it onto a test to be examined post-administration. Our inability to find DIF on items that sensitivity reviewers make certain to remove from tests could suggest that removing these types of insensitive items may not help prevent DIF and may lead to discarding items that are useful for assessing ability. Given the costs and challenges of test development, removing useful items is undesirable. This does not imply, however, that sensitivity reviews are inefficacious. Even if the types of items we examined do not differentially affect groups' performance, their presence may still influence test-taker reactions. Beyond ensuring that certain groups' performance is not influenced by factors other than the construct of interest, sensitivity reviews are also conducted with the goal of minimizing negative test-taker reactions (McPhail, 2010).

Limitations include that the data were not collected in a high-stakes context (where certain groups may react more negatively to insensitivity), that the nine insensitive items were relatively easy (if insensitivity disrupts concentration, this may be more detrimental on difficult items), and that, with our relatively small sample, power may have been insufficient to identify items that were actually problematic.