Introduction to IRT for non-psychometricians
Paul K. Crane, MD MPH
Internal Medicine, University of Washington
Outline • Definitions of IRT • IRT vs. CTT • Hays et al. paper • Error and information • Rational test construction • DIF
What is IRT? • IRT is a scoring algorithm • Every score (including standard “sum” scores) is a formula score • Score = Σ wᵢxᵢ • Standard scores assume wᵢ = 1 • IRT empirically determines from the data the weights that are applied to each item (see the sketch below)
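To make the weighted-sum idea concrete, here is a minimal Python sketch; the responses and weights are hypothetical illustrations, not values from the Hays et al. data.

```python
# Minimal sketch of "every score is a formula score": Score = sum of w_i * x_i.
# The usual sum score fixes every weight at 1; IRT estimates weights from data.
item_responses = [3, 2, 3, 1, 3]            # hypothetical item codes (1-3)
irt_weights = [0.7, 1.3, 1.0, 1.6, 0.9]     # hypothetical IRT-derived weights

sum_score = sum(item_responses)             # every w_i = 1
formula_score = sum(w * x for w, x in zip(irt_weights, item_responses))
print(sum_score, round(formula_score, 2))
```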
What is IRT? • IRT has emerged as the dominant test development tool in educational psychology • IRT has an increasingly dominant role in test development in psychology in general • IRT is beginning to make inroads in medical testing; there is recent NIH and FDA interest, and the publication record is growing
Hays et al. paper

Item                                              Limited    Limited     Not limited
                                                  a lot      a little    at all
Vigorous activities (running, lifting
heavy objects, strenuous sports)                    1           2            3
Climbing one flight                                 1           2            3
Walking more than 1 mile                            1           2            3
Walking one block                                   1           2            3
Bathing / dressing self                             1           2            3
Preparing meals / doing laundry                     1           2            3
Shopping                                            1           2            3
Getting around inside home                          1           2            3
Feeding self                                        1           2            3
Intuitive approach (sum score) • Simple approach: there are numbers that will be circled; total these up, and there we have a score • But: should “limited a lot” for walking a mile receive the same weight as “limited a lot” in getting around inside the home? • Should “limited a lot” for walking one block be twice as bad as “limited a little” for walking one block? • These questions concern relationships both within a single item and between items
IRT’s role • IRT provides us with a data-driven means of rational scoring for such measures • In practice, the simple sum score is often very good; improvement is at the margins • IRT has other uses in test development and test assessment (second half of the talk) • If there is not a big problem, IRT analyses should validate expert opinion, giving a mathematical basis to gut feelings
Means and ceiling for items

Item                              Mean (SD)      Not limited (%)
Vigorous activities               1.97 (0.86)    45
Walking > 1 mile                  2.22 (0.84)    49
Climbing 1 flight of stairs       2.37 (0.76)    55
Shopping                          2.61 (0.68)    72
Walking 1 block                   2.63 (0.64)    72
Preparing meals, doing laundry    2.67 (0.63)    75
Bathing or dressing               2.80 (0.49)    84
Getting around inside home        2.81 (0.47)    84
Feeding self                      2.90 (0.36)    91
Comments on Table 1 • Range of those with no limitations (45-91%) • Already from a measurement perspective this is a problem; ~45% will have a perfect score; ceiling / floor effect • Significant skew often found in medical settings; implications often not discussed • Especially difficult in longitudinal studies. What can we say about someone who had a perfect score before and a less than perfect score now? (NOT solved by IRT)
Dichotomous IRT models • Arbitrary choice to dichotomize into any limitation = 0 and no limitation = 1 • Rarely a good idea to collect more detailed data and throw it away for analysis; lose power (Samejima 1967; van Belle 2002) • Begin with 1PL model
Implications of 1PL model • All items have same weight (ICCs are parallel) • Only difference is “difficulty” (amount of trait required to endorse) • Nice math (sum score is sufficient) • Nested within 2PL model so can check to see whether it is acceptable • Math: P(y=1 | θ, b) = 1/[1 + exp(-D(θ - b))] (see the sketch below) • This puts item difficulty and person trait level on the same scale
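A minimal Python sketch of the 1PL response curve; D = 1.7 is the conventional scaling constant, and the θ and b values are hypothetical.

```python
import math

D = 1.7  # conventional scaling constant for logistic IRT models

def p_1pl(theta, b):
    """1PL: probability of endorsing an item depends only on theta - b."""
    return 1.0 / (1.0 + math.exp(-D * (theta - b)))

# A person at theta = 0 facing an item with difficulty b = 0.5 (hypothetical):
print(round(p_1pl(0.0, 0.5), 3))  # < 0.5, since the item sits "above" the person
```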
1PL results from Hays et al.

Item                              Limited at all (%)*   Difficulty
Vigorous activities               55                     0.5
Walking > 1 mile                  51                     0.1
Climbing 1 flight of stairs       45                    -0.1
Shopping                          28                    -0.6
Walking 1 block                   28                    -0.7
Preparing meals, doing laundry    25                    -0.8
Bathing or dressing               16                    -1.2
Getting around inside home        16                    -1.2
Feeding self                       9                    -1.6

* This is just 100 - “not limited.” Slopes fixed at 3.49.
Comments on 1PL analysis • Note similarities to Table 1 • Same order of items • Relationships between items are similar • Recall: the same data are being used to derive these estimates • Poor measurement properties are highlighted: only one “hard” item, with a difficulty of 0.5 • Question: how good is this model for these items?
Implications of 2PL model • Items do not need to have the same weights (ICCs are not parallel) • Difficulty differs as before • Now slope differs as well • “Discrimination” parameter • Harder math (sum score no longer sufficient) • Math: P(y=1 | θ, a, b) = 1/[1 + exp(-Da(θ - b))] (see the sketch below) • If a parameters are identical, reduces to 1PL model
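The same sketch extended with a slope parameter; for illustration it uses the “vigorous activities” estimates from Table 4 below (a = 2.5, b = 0.5).

```python
import math

D = 1.7

def p_2pl(theta, a, b):
    """2PL: the slope a lets items discriminate differently; holding a
    constant across items reduces this to the 1PL (parallel ICCs)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# "Vigorous activities" from Table 4 (a = 2.5, b = 0.5), person at theta = 0:
print(round(p_2pl(0.0, 2.5, 0.5), 3))
```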
Table 4: 2PL results

Item                              Difficulty (b)   Discrimination (a)
Vigorous activities                0.5              2.5
Walking > 1 mile                   0.1              4.1
Climbing 1 flight of stairs       -0.1              3.5
Shopping                          -0.6              3.7
Walking 1 block                   -0.7              3.7
Making meals / laundry            -0.8              3.8
Bathing or dressing               -1.2              3.5
Mobility inside home              -1.2              3.6
Feeding self                      -1.6*             3.2

* Note that b parameters are identical to 1PL model!
Assumptions of 1PL model • Because the 1PL model is nested within the 2PL model, we can test whether it is okay to treat the a’s as if they were constant
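One way to run that nested-model check is a likelihood ratio test; the log-likelihoods below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical sketch: 2 * (loglik_2PL - loglik_1PL) is approximately
# chi-square with df = (number of items - 1), the extra slopes the 2PL adds.
from scipy.stats import chi2

loglik_1pl = -4123.4   # hypothetical fitted log-likelihoods
loglik_2pl = -4110.8
n_items = 9

lr_stat = 2 * (loglik_2pl - loglik_1pl)
df = n_items - 1
print(f"LR = {lr_stat:.1f}, df = {df}, p = {chi2.sf(lr_stat, df):.3f}")
```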
Comments on 2PL results • Not that different from 1PL results • Some variability in slopes, however • Easier to look at assumption of identical slopes on the information plane • Ability / trait level estimates tend to be very similar in simulation studies and in real data studies with 1PL and 2PL models • Difficulty much more important than slope in determining score
What about that dichotomization? • Don’t want to throw away all the data we collected; “yes, a little” vs. “yes, a lot” are currently lumped together • Polytomous IRT models allow us to keep those distinctions • GRM vs. PCM vs. RSM • GRM (graded response model) is the most flexible; a 2PL extension (Samejima 1967; sketched below) • PCM (partial credit model) and RSM (rating scale model) are both Rasch extensions
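A minimal sketch of the graded response model with hypothetical parameters: each cumulative category probability follows a 2PL form, and differences between adjacent cumulative curves give the category probabilities.

```python
import math

D = 1.7

def p_graded(theta, a, bs):
    """GRM sketch: P(y >= k) is 2PL-shaped at each ordered threshold in bs;
    category probabilities are differences of adjacent cumulative curves."""
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in bs]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

# Hypothetical 3-category item (limited a lot / a little / not limited):
print([round(p, 3) for p in p_graded(0.0, a=3.5, bs=[-1.0, 0.2])])
```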
Comments on GRM results • Empirical validation of expert opinion on how items relate to the underlying construct • Scores have meaning in terms of items (“Trait estimates anchored to item content,” p. II-35) • Didn’t do PCM or RSM • Can see impact of dichotomizing items
Software for IRT • Currently old and clunky • PARSCALE will be illustrated tomorrow • Another option is MULTILOG • NIH has issued an SBIR for new, more user-friendly software
Outline revisited • Definitions of IRT • IRT vs. CTT • Hays et al. paper • Error and information • Rational test construction • DIF
Error and information • One of the real strengths of IRT is that measurement error is not assumed to be constant across the whole test • Instead, measurement error is modeled directly in the form of item (test) information • SEM(θ) = 1/√I(θ)
Information formulas • For 2PL model: I(θ) = D²a²P(θ)Q(θ) • D² is a constant • a is the slope parameter • P(θ) is the probability of getting the item correct: P(y=1 | θ, a, b) = 1/[1 + exp(-Da(θ - b))] • Q(θ) = 1 - P(θ) • P(θ)Q(θ) generates a hill centered at θ = b • P(θ) → 0 as θ gets small; Q(θ) → 0 as θ gets large • Test information is the sum of item information curves (see the sketch below) • Polytomous information is hard, but solved
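A sketch of item and test information and the resulting SEM, using the first three Table 4 items for illustration.

```python
import math

D = 1.7

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_info(theta, a, b):
    """2PL item information: I(theta) = D^2 * a^2 * P(theta) * Q(theta)."""
    p = p_2pl(theta, a, b)
    return D ** 2 * a ** 2 * p * (1.0 - p)

# First three Table 4 items as (a, b) pairs:
items = [(2.5, 0.5), (4.1, 0.1), (3.5, -0.1)]
theta = 0.0
test_info = sum(item_info(theta, a, b) for a, b in items)  # sum of item curves
sem = 1.0 / math.sqrt(test_info)                           # SEM = 1 / sqrt(I)
print(round(test_info, 2), round(sem, 3))
```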
Information implications • Item information is each item’s individual contribution to measurement precision at each possible score • Test information gives a picture of the test’s measurement precision across all scores • Can use test information to compare tests • Lord (1980) advocated a ratio of test information functions for such comparisons
Individual level measurement error • We estimate θ for each individual, and we can compute I(θ) as well • We know not only the score, but also the precision with which we know the score • Need to train providers to request the measurement precision • Can model error – one of the purposes of this workshop is to integrate error terms into statistical models
Rational test construction • So far we have described existing tests using new IRT tools • Can also build new tests from item banks • Construct a particular information profile and choose items to fill it out • Large literature in educational psychology • van der Linden and colleagues from Twente
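One way to make item-bank selection concrete is a greedy sketch: repeatedly pick the item that adds the most information at target trait levels. This is far simpler than the optimization methods van der Linden and colleagues use; the bank below reuses Table 4 parameters with made-up item names.

```python
import math

D = 1.7

def item_info(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D ** 2 * a ** 2 * p * (1.0 - p)

# Hypothetical bank (names made up; parameters borrowed from Table 4):
bank = {"vigorous": (2.5, 0.5), "walk_mile": (4.1, 0.1),
        "stairs": (3.5, -0.1), "shopping": (3.7, -0.6), "feeding": (3.2, -1.6)}
targets = [-1.0, 0.0, 1.0]     # theta values where precision is wanted

chosen = []
for _ in range(3):             # assemble a 3-item short form greedily
    best = max((name for name in bank if name not in chosen),
               key=lambda n: sum(item_info(t, *bank[n]) for t in targets))
    chosen.append(best)
print(chosen)
```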
Differential item functioning (DIF) • Population is increasingly heterogeneous • Concern about the fairness of tests across culture, gender, education, etc. • Can use standard statistical tools to assess differential test impact, that is, different performance according to culture, etc. • Have to control for underlying trait level to really get at test bias • DIF is how this is accomplished in educational psychology
Different approaches to DIF • IRT approaches – model items separately in different groups and compare • SIBTEST approach – based on dimensionality assessment • Logistic regression approach – treat as an epidemiological problem (see the sketch below) • Mantel-Haenszel approach – simple 2x2 table approach • MIMIC approach using MPLUS
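A minimal sketch of the logistic regression approach with simulated data (statsmodels assumed available): regress the item response on trait level, group, and their interaction; the group term flags uniform DIF and the interaction flags non-uniform DIF.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
theta = rng.normal(size=n)            # trait-level estimates (e.g., IRT scores)
group = rng.integers(0, 2, size=n)    # 0/1 demographic group indicator
logit = 1.5 * theta - 0.2             # simulated item with no true DIF
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Columns: intercept, theta, group, theta x group
X = sm.add_constant(np.column_stack([theta, group, theta * group]))
fit = sm.Logit(y, X).fit(disp=False)
print(fit.pvalues[2], fit.pvalues[3])  # uniform DIF term, non-uniform DIF term
```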
Comments on DIF • Each technique is simple to explain, but each is tricky to apply • The literature is not all that clear on what to do with existing epidemiological data from studies whose measures show DIF • It is probably impossible to measure cognition without bias, especially with respect to education • Years of education doesn’t get to the heart of the problem either (Manley paper) • We re-visit DIF in detail on Thursday
The end. • Definitions of IRT • IRT vs. CTT • Hays et al. paper • Error and information • Rational test construction • DIF • Comments and questions?