1. Exploring Person-Item Interaction with Multidimensional Multilevel Rasch Models
Wen-Chung Wang
Department of Psychology
National Chung Cheng University, Taiwan
5. 5 Outline
Person-Item Interaction
Situations where person-item interaction may occur
Testlet-based items
Rater effects
Rating scales
Missing data
Parameter estimation
Concluding remarks
6. 6 Person-Item Interaction
Standard Rasch models (and other IRT models) assume that item parameters remain constant over persons (local item independence), meaning there is no person-item interaction.
If item responses fit the model’s expectation, good measurement quality is achieved, e.g., test-independent ability estimation and sample-independent item calibration.
If persons do interact with items (e.g., an item is easy for one person but difficult for another), then models that ignore this interaction are no longer appropriate, because their parameter estimates are biased.
Fit statistics and DIF analysis may be able to identify this kind of misfit.
7. 7 Person-Item Interaction
If person-item interaction is detected, one may
simply remove such items or persons, or
take the interaction directly into account in the model and obtain more information for further test revision.
The first approach is sometimes inappropriate because too many items or persons may be removed.
The second approach requires extending standard IRT models; we focus on this approach here.
8. 8 Situations where person-item interaction may occur
Testlet-based items (items that share a common stimulus)
Items within a testlet may have chaining effects, which may vary over persons
Rater effects
Raters may have different severities in judging different persons’ responses
Rating scales
Persons may have different subjective judgments in selecting categories of rating scale items
Missing data
Persons may not miss items at random
9. 9 1. Testlet-based items
The testlet design is advocated partly because it corresponds to real-life situations, in which problems are often interrelated.
Two approaches to modeling the person-item interaction
The fixed-effect approach:
Assumption: The interaction remains constant over persons
Analysis unit: Response pattern of a testlet
Parameterization: Adding a set of fixed-effect parameters into the model
Examples: Hoskens & De Boeck, 1997; Tuerlinckx & De Boeck, 2001; Wang, Cheng, & Wilson, 2004; Wilson & Adams, 1995
The random-effect approach:
Assumption: The interaction varies over persons
Analysis unit: Individual items
Parameterization: Adding a set of random-effect parameters into the model
Examples: Wainer, Bradlow & Du, 2000; X. Wang, Bradlow, & Wainer, 2002; W.-C. Wang & Wilson, 2005
10. 10 1. The Rasch testlet model
The Rasch model is
$$\log\left(\frac{P_{ni1}}{P_{ni0}}\right) = \theta_n - b_i,$$
where $\theta_n$ is the latent trait of person n, $b_i$ is the difficulty of item i, and $P_{ni1}$ and $P_{ni0}$ are the probabilities of person n scoring 1 and 0 on item i. When differential person-item interaction is suspected, we can add a random variable $\gamma_{nd(i)} \sim N(0, \sigma^2_d)$, the testlet effect of person n on testlet d(i), into the model to form:
$$\log\left(\frac{P_{ni1}}{P_{ni0}}\right) = \theta_n - b_i + \gamma_{nd(i)}.$$
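To make the role of $\gamma_{nd(i)}$ concrete, here is a minimal simulation sketch (not from the original slides; the ability value, item difficulties, and testlet-effect variance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, b, gamma=0.0):
    """Rasch testlet model: P(X = 1) = exp(theta - b + gamma) / (1 + exp(theta - b + gamma))."""
    return 1.0 / (1.0 + np.exp(-(theta - b + gamma)))

theta = 0.5                      # latent trait of one person (assumed)
b = np.array([-0.5, 0.0, 0.8])   # difficulties of three items in one testlet (assumed)
gamma = rng.normal(0.0, 1.0)     # this person's testlet effect, sigma_d = 1 (assumed)

print(p_correct(theta, b))          # standard Rasch: local independence
print(p_correct(theta, b, gamma))   # testlet model: all three probabilities shift together
```

Because all items in the testlet share the same draw of $\gamma_{nd(i)}$, the person's probabilities move together; this within-testlet dependence is exactly what the standard Rasch model ignores.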
11. 11 1. The Rasch testlet model
For polytomous items, the partial credit model (PCM; Masters, 1982) and the rating scale model (RSM; Andrich, 1978) are
$$\log\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = \theta_n - \delta_{ij} \quad \text{(PCM)}, \qquad \log\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = \theta_n - (\delta_i + \tau_j) \quad \text{(RSM)}.$$
For testlet d (e.g., a scenario followed by several Likert-type items), the differential person-item interaction can be modeled by adding the same kind of random testlet effect:
$$\log\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = \theta_n - \delta_{ij} + \gamma_{nd(i)}.$$
12. 12 1. An example
We analyzed the 2001 English test of the Basic Competence Tests for Junior High School Students. The test contained 44 multiple-choice items: 17 independent items and 27 items in 11 testlets.
A total of 5,000 examinees were randomly selected from a population of more than 300,000 examinees.
The Rasch testlet model and the standard Rasch model were fitted to the data.
13. 13 1. An example
[Figures/tables on slides 13-14 not recoverable from this transcript.]
15. 15 1. An example
The IRT reliability was .93 under the standard model and .92 under the testlet model; reliability was overestimated when the person-item interaction within testlets was ignored under the standard model.
According to the Spearman-Brown prophecy formula, the test would have to be lengthened by approximately 17.4% to raise the reliability from .92 to .93, once the person-item interaction was appropriately taken into account under the testlet model.
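For reference, the Spearman-Brown lengthening factor needed to move from reliability $\rho$ to target reliability $\rho^*$ is
$$n = \frac{\rho^*/(1-\rho^*)}{\rho/(1-\rho)}.$$
With the rounded values above, $n = (.93/.07)/(.92/.08) \approx 1.155$; the quoted 17.4% presumably reflects unrounded reliability estimates. The 77.9% figure in the rater-effects example later follows the same formula: $(.87/.13)/(.79/.21) \approx 1.779$.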
16. 16 2. Rater effects
A typical Rasch model for a three-facet situation is
$$\log\left(\frac{P_{nijk}}{P_{ni(j-1)k}}\right) = \theta_n - B_i - C_j - D_k,$$
where $\theta_n$ is the latent trait of person n; $B_i$ is the difficulty of item i; $C_j$ is the difficulty of category j relative to category j - 1 (B and C belong to the same item facet); $D_k$ is the severity of rater k; and $P_{nijk}$ and $P_{ni(j-1)k}$ are the probabilities of scoring j and j - 1, respectively.
In this model, rater k is assumed to hold a constant severity when judging different persons’ responses.
17. 17 2. Rater effects
When rater k is suspected of showing differential severities when judging different persons, we may add a random variable $\gamma_{nk}$ to model this rater-person interaction:
$$\log\left(\frac{P_{nijk}}{P_{ni(j-1)k}}\right) = \theta_n - B_i - C_j - (D_k + \gamma_{nk}),$$
and
$$\gamma_{nk} \sim N(0, \sigma^2_k).$$
18. 18 2. Rater effects
If $\sigma^2_k > 0$, then rater k shows differential severities when giving ratings to persons. The larger the variance, the greater the variation in severity; it therefore represents the intra-rater inconsistency.
The inter-rater inconsistency can be depicted by the dispersion of $D_k$: the more diverse the $D_k$ are across raters, the greater the variation in severity between raters.
19. 19 2. An example
The data set contained 1797 Year 6 students' responses to a single writing task. Each of the 1797 writing scripts was graded by 2 of the 8 raters.
When assessing the scripts, each rater was required to provide two ratings, one on overall performance and the other on textual features. Both criteria were judged on a six-point scale.
The data set was analyzed with both the random-effects three-facet model and the standard three-facet model.
According to the likelihood ratio test, adding the 8 random variables of the random-effects model significantly improved model-data fit.
20. 20 2. An example
The IRT reliability was .87 under the standard model and .79 under the random-effects model; reliability was overestimated under the standard model.
According to the Spearman-Brown prophecy formula, the test would have to be lengthened by approximately 77.9% to raise the reliability of the random-effects model from .79 to .87.
The overestimation of reliability was serious when the rater-person interaction was ignored.
21. 21 2. An example
The severity estimates for the 8 raters ranged from -1.93 to 1.23, indicating that the raters did not agree in severity; inter-rater consistency was not high.
The variance estimates of severity for the eight raters ranged from 0.48 to 7.10, with a mean of 2.53. Compared to the variance estimate of 8.12 for the person distribution, some raters showed substantial intra-rater variation in severity (low consistency), while others showed little.
22. 22 3. Rating scales
In the RSM, each item is assumed to have a difficulty parameter $\delta_i$, and all the items share the same set of threshold parameters $\tau_j$.
Likert-type or rating-scale items usually require persons' subjective judgments, which are very likely to vary over persons. For example, person A may consider the distance between "Strongly disagree" and "Disagree" a huge gap, whereas person B may consider it a minor gap.
The RSM (or even the PCM) may be too stringent to fit real data from Likert-type or rating-scale items, because the subjective judgments are not properly taken into account.
23. 23 3. Rating scales
Acknowledging the subjective nature of these kinds of items, test analysts usually adopt more liberal criteria in assessing model-data fit when applying the RSM to Likert-type items of self-report inventories than when applying the Rasch model to objective items of ability tests.
Another way of acknowledging the subjective nature is to directly take it into account in the model.
24. 24 3. Rating scales
Instead of treating the threshold parameter $\tau_j$ as a fixed effect, one may treat it as a random effect over persons by imposing a distribution on it, forming the random-effects rating scale model (RE-RSM):
$$\log\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = \theta_n - (\delta_i + \tau_{nj}), \qquad \tau_{nj} \sim N(\tau_j, \sigma^2_j).$$
It is also possible to form the random-effects partial credit model (RE-PCM):
$$\log\left(\frac{P_{nij}}{P_{ni(j-1)}}\right) = \theta_n - \delta_{nij}, \qquad \delta_{nij} \sim N(\delta_{ij}, \sigma^2_{ij}).$$
In general, the RE-PCM is not identifiable because there are too many random effects. Therefore, we may constrain the threshold variances to be equal across items,
$$\sigma^2_{ij} = \sigma^2_j \quad \text{for all } i,$$
which is called the constrained random-effects partial credit model (CRE-PCM).
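To see what random thresholds do to a single response, here is a minimal RE-RSM sketch (the item difficulty, mean thresholds, and $\theta$ are illustrative assumptions; the threshold variances are borrowed from the empirical example reported below):

```python
import numpy as np

rng = np.random.default_rng(7)

def rsm_probs(theta, delta, tau):
    """RSM category probabilities for one item; tau holds the thresholds for
    categories 1..m, and category 0 is the baseline with a logit of zero."""
    steps = theta - delta - np.asarray(tau)
    logits = np.concatenate([[0.0], np.cumsum(steps)])
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

delta = 0.0                                    # item difficulty (assumed)
tau_mean = np.array([-1.0, 0.0, 1.0])          # mean thresholds (assumed)
tau_sd = np.sqrt([1.28, 0.99, 3.08])           # threshold SDs from the example below
theta = 0.3                                    # latent trait (assumed)

tau_person = rng.normal(tau_mean, tau_sd)      # one person's subjective thresholds
print(rsm_probs(theta, delta, tau_mean))       # fixed-threshold RSM probabilities
print(rsm_probs(theta, delta, tau_person))     # this person's RE-RSM probabilities
```

Averaged over persons, this extra threshold variation behaves like noise with respect to $\theta$, which is why it depresses reliability in the example that follows.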
27. 27 3. An example
The data consisted of the responses of 476 college students who rated, on a four-point Likert-type scale (very poor, poor, good, and very good), how well they would perform on 30 teaching-related jobs if they were beginning teachers.
The CRE-PCM, RE-RSM, PCM, and RSM were fitted to the data. According to the likelihood ratio tests, the CRE-PCM fitted the data best, although the RE-RSM also fitted well. Both random-effects models fitted better than the RSM and the PCM.
29. 29 3. An example
The estimated IRT reliabilities were .91 under the RSM (and PCM) and .48 under the RE-RSM (and CRE-PCM).
Under the RE-RSM, the variance estimates for the three thresholds were 1.28, 0.99, and 3.08, which were very large compared to the variance estimate of 0.85 for the latent trait.
A large proportion of the test-score variance was attributable to the random threshold effects over persons, which led to the much lower test reliability of .48.
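The slides do not state the reliability formula; as a rough sketch of the mechanism, if IRT reliability is taken as the proportion of latent-trait variance not due to measurement error,
$$\text{Rel} \approx \frac{\sigma^2_\theta}{\sigma^2_\theta + \bar{\sigma}^2_e},$$
then variation absorbed by the person-specific thresholds acts like error with respect to $\theta$, inflating $\bar{\sigma}^2_e$ relative to $\sigma^2_\theta = 0.85$ and driving the estimate down to .48.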
30. 30 4. Missing data
Item non-response frequently occurs in educational and psychological testing, as well as in surveys in other social sciences. Despite careful test design, any item may conceivably go unanswered by some persons at some time.
Three kinds of missing data:
Missing completely at random (MCAR): if the missing data are missing at random and the observed data are observed at random.
Missing at random (MAR): if the missingness is related to the values of the observed data in the data set, but not to the value of the item itself. That is, conditional on the observed data, the missingness is random.
Not missing at random (NMAR): if the missingness is related to the value of the item itself, and maybe other variables in the data or even variables that are not measured, so that the missing data are not ignorable.
31. 31 4. Missing data
IRT models are especially suited to handle MAR data because the estimation of a person's latent trait is based on the person's observed item responses.
For some commonly used testing designs, such as alternating test forms, targeted testing, and adaptive testing, the missing data mechanism is ignorable for likelihood inferences about latent trait parameters.
When data are NMAR, due to systematic differences between respondents and nonrespondents, standard IRT models are no longer appropriate.
32. 32 4. Missing data
We need an IRT model that can take the non-random missingness into account.
A feasible way of handling NMAR data is to assume a latent trait (e.g., response propensity or temperament) that underlies the binary responses of missingness, which may be correlated with the target latent trait $\theta$ that the test intends to measure.
To model non-ignorable missing data along the same IRT logic, one can treat the binary responses of missingness as a function of item and person following a dichotomous IRT model (e.g., the Rasch model).
33. 33 4. Missing data
Every item has a parameter describing the "difficulty" of being missed (analogous to the standard difficulty parameter), and every person has a parameter $\gamma_n$ describing his or her tendency not to respond:
$$\log\left(\frac{P^*_{ni1}}{P^*_{ni0}}\right) = \gamma_n - b^*_i,$$
where $P^*_{ni1}$ and $P^*_{ni0}$ are the probabilities of not responding (scoring 1) and responding (scoring 0) to item i for person n, respectively.
The larger the item parameter $b^*_i$ is, the less likely the item is to be missed; the larger the person parameter $\gamma_n$ is, the more likely the person is to leave items unanswered.
34. 34 4. Missing data
If there are multiple missing categories (e.g., "No Opinion", "Don't Know", or "Refuse to Answer") and each category can be treated as driven by a distinct latent trait, then the above equation generalizes to multiple equations, one for each kind of missingness.
The equations for the missingness and the standard equation for the IRT model (e.g., the Rasch model, PCM, or RSM) are fitted to the data simultaneously, so the combined model is multidimensional.
You may view the original observed responses as Test A and the missingness indicators (recoded as 1 for missing and 0 otherwise) as Test B, and then calibrate both data sets jointly with a multidimensional model, as in the sketch below.
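A minimal sketch of that recoding step (the data matrix is hypothetical, and using NaN to mark missing responses is an assumption, not something specified in the slides):

```python
import numpy as np

# Hypothetical responses: persons x items, Likert scores 1-4, NaN = missing.
responses = np.array([
    [3.0, 4.0, np.nan, 2.0],
    [1.0, np.nan, np.nan, 1.0],
    [4.0, 4.0, 3.0, np.nan],
])

test_a = responses                        # "Test A": the observed item responses
test_b = np.isnan(responses).astype(int)  # "Test B": 1 = missing, 0 = observed

# test_a is calibrated with a polytomous model (e.g., the RSM) for theta, and
# test_b with a dichotomous Rasch model for gamma; a joint multidimensional
# calibration then estimates the correlation between theta and gamma.
print(test_b)
```

With several missing categories, one indicator matrix is built per category (No Opinion, Don't Know, Refuse to Answer), adding one extra dimension each.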
35. 35 4. Missing data
If $\gamma$ is uncorrelated with $\theta$, meaning that the missingness is driven by a latent trait that is independent of $\theta$, then knowing $\gamma$ will not help predict $\theta$; the data are in fact MAR.
On the other hand, when $\gamma$ is correlated with $\theta$, the missingness contains some collateral information about $\theta$. As a result, the data are NMAR, and the standard unidimensional approach is no longer appropriate.
It follows that the higher the correlation between $\gamma$ and $\theta$, the more efficient the multidimensional approach and the less appropriate the unidimensional one.
36. 36 4. An example
670 adults in Taiwan responded to six 4-point Likert-type items (strongly disagree, disagree, agree, and strongly agree) about government administration (i.e., policy making, saving money, freedom from corruption, long-term goal setting, public security, and economic growth) through face-to-face interviews.
When a respondent chose not to select any category on the four-point scale, the interviewer classified the nonresponse into one of three kinds of missingness: No Opinion, Don't Know, or Refuse to Answer.
38. 38 4. An example
The data were analyzed in two ways:
The RSM was fitted to the observed data.
A four-dimensional model was fitted, in which the RSM applied to the observed data and three Rasch models applied to the missingness indicators, one for each kind of missingness.
The multidimensional approach yielded a statistically better fit than the unidimensional approach.
40. 40 4. An example
$\gamma_N$, $\gamma_D$, and $\gamma_R$ were only slightly correlated with $\theta$, indicating that the unidimensional approach (which assumes the data are MAR) was appropriate.
$\gamma_N$, $\gamma_D$, and $\gamma_R$ were almost uncorrelated with one another, which means that combining the three kinds of missingness into a single category would have been inappropriate.
The multidimensional and the unidimensional approaches yielded IRT reliabilities of .70 and .71, respectively.
In brief, for this particular data set the multidimensional and the unidimensional approaches made little difference.
41. 41 Parameter Estimation
It can be shown that all the models in the four conditions are special cases of the multidimensional random coefficients multinomial logit model (MRCMLM; Adams, Wilson, & Wang, 1997).
The parameters in these models can be estimated using the computer program ConQuest (Wu, Adams, & Wilson, 1998).
42. 42 Parameter Estimation
The MRCMLM, being a member of the exponential family of distributions, can be viewed as a generalized linear mixed model.
In addition to ConQuest, the SAS procedure NLMIXED and the Stata program GLLAMM can fit many common nonlinear and generalized linear mixed models, including the MRCMLM.
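As a rough illustration of what such programs do, here is a minimal marginal maximum likelihood (MML) sketch for the plain dichotomous Rasch model, with the ability distribution fixed at N(0, 1) and the marginal likelihood integrated by Gauss-Hermite quadrature (simulated data; this is neither the full MRCMLM nor ConQuest's actual implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate Rasch data: 2000 persons, 10 items.
n_persons, n_items = 2000, 10
true_b = np.linspace(-1.5, 1.5, n_items)
theta = rng.normal(0.0, 1.0, n_persons)
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - true_b[None, :])))
data = (rng.random((n_persons, n_items)) < prob).astype(float)

# Quadrature nodes/weights for theta ~ N(0, 1) (probabilists' Hermite).
nodes, weights = np.polynomial.hermite_e.hermegauss(21)
weights = weights / weights.sum()

def neg_marginal_loglik(b):
    eta = nodes[:, None] - b[None, :]             # (Q, I) logits at each node
    logp1 = -np.logaddexp(0.0, -eta)              # log P(X = 1 | theta_q)
    logp0 = -np.logaddexp(0.0, eta)               # log P(X = 0 | theta_q)
    ll = data @ logp1.T + (1.0 - data) @ logp0.T  # (N, Q) pattern log-likelihoods
    lse = ll + np.log(weights)[None, :]           # integrate over theta
    return -np.logaddexp.reduce(lse, axis=1).sum()

res = minimize(neg_marginal_loglik, np.zeros(n_items), method="BFGS")
print(np.round(res.x, 2))     # estimated item difficulties
print(np.round(true_b, 2))    # true item difficulties
```

ConQuest, NLMIXED, and GLLAMM generalize this same marginal likelihood to multiple latent dimensions and additional random effects.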
43. 43 Concluding remarks
The multidimensional and multilevel approach provides a means of modeling person-item interaction.
Predictors (e.g., gender, age, SES) can be added to account further for the variation in the random variables (including $\theta$ and $\gamma$), via a latent regression of the form
$$\theta_n = \mathbf{x}_n'\boldsymbol{\beta} + \varepsilon_n, \qquad \varepsilon_n \sim N(0, \sigma^2).$$
A two-parameter generalization (adding discrimination parameters) is also possible.
44. 44 Thank you very much!