
Introduction to IRT for non-psychometricians

Explore the foundations of Item Response Theory (IRT) and its application in medical testing, comparing it to Classical Test Theory (CTT). Learn how IRT determines item weights for more accurate scoring, addressing limitations of sum scores. Understand the implications of using dichotomous IRT models and the mathematical principles behind 1PL and 2PL models. Discover the differences in item difficulty and discrimination parameters, and how these impact scoring. Delve into the assumptions and practical implications of IRT for improving test development practices.


Presentation Transcript


  1. Introduction to IRT for non-psychometricians • Paul K. Crane, MD MPH • Internal Medicine, University of Washington

  2. Outline • Definitions of IRT • IRT vs. CTT • Hays et al. paper • Error and information • Rational test construction • DIF

  3. What is IRT • IRT is a scoring algorithm • Every score (including standard “sum” scores) is a formula score • Score = $\sum_i w_i x_i$ • Standard scores assume $w_i = 1$ • IRT empirically determines from the data the weights that are applied to each item
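
The formula-score idea on this slide can be sketched in a few lines. This is only an illustration of the weighted sum Σ w_i x_i; the responses and the non-unit weights below are invented, and in practice IRT scoring estimates a trait level from a model rather than applying fixed linear weights.

```python
def formula_score(responses, weights=None):
    """Return sum_i w_i * x_i; with weights=None every w_i is 1 (the plain sum score)."""
    if weights is None:
        weights = [1.0] * len(responses)
    if len(weights) != len(responses):
        raise ValueError("need exactly one weight per item")
    return sum(w * x for w, x in zip(weights, responses))


# Hypothetical responses to three items scored 1-3 (invented for illustration).
responses = [3, 2, 1]

sum_score = formula_score(responses)                        # unit weights
weighted_score = formula_score(responses, [0.5, 1.0, 2.0])  # made-up weights
```

With unit weights the function returns the ordinary sum score, which is the special case the slide describes as "standard".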

  4. What is IRT • IRT has emerged as the dominant test development tool in educational psychology • IRT has an increasingly dominant role in test development in psychology in general • IRT is beginning to make inroads in medical testing; recent NIH and FDA interest; publication record increasing

  5. IRT vs. Classical Test Theory (CTT)

  6. Hays et al. paper

     | Item | Limited a lot | Limited a little | Not limited at all |
     |---|---|---|---|
     | Vigorous activities, running, lifting heavy objects, strenuous sports | 1 | 2 | 3 |
     | Climbing one flight | 1 | 2 | 3 |
     | Walking more than 1 mile | 1 | 2 | 3 |
     | Walking one block | 1 | 2 | 3 |
     | Bathing / dressing self | 1 | 2 | 3 |
     | Preparing meals / doing laundry | 1 | 2 | 3 |
     | Shopping | 1 | 2 | 3 |
     | Getting around inside home | 1 | 2 | 3 |
     | Feeding self | 1 | 2 | 3 |

  7. Intuitive approach (sum score) • Simple approach: there are numbers that will be circled; total these up, and there we have a score • But: should “limited a lot” for walking a mile receive the same weight as “limited a lot” in getting around inside the home? • Should “limited a lot” for walking one block be twice as bad as “limited a little” for walking one block? • Relationships in a single item and between items

  8. IRT’s role • IRT provides us with a data-driven means of rational scoring for such measures • In practice, the simple sum score is often very good; improvement is at the margins • IRT has other uses in test development and test assessment (second half of the talk) • If there is not a big problem, IRT analyses should validate expert opinion; give mathematical basis to gut feelings

  9. Means and ceiling for items

     | Item | Mean (SD) | Not limited (%) |
     |---|---|---|
     | Vigorous activities | 1.97 (0.86) | 45 |
     | Walking > 1 mile | 2.22 (0.84) | 49 |
     | Climbing 1 flight of stairs | 2.37 (0.76) | 55 |
     | Shopping | 2.61 (0.68) | 72 |
     | Walking 1 block | 2.63 (0.64) | 72 |
     | Preparing meals, doing laundry | 2.67 (0.63) | 75 |
     | Bathing or dressing | 2.80 (0.49) | 84 |
     | Getting around inside home | 2.81 (0.47) | 84 |
     | Feeding self | 2.90 (0.36) | 91 |

  10. Comments on Table 1 • Range of those with no limitations (45-91%) • Already from a measurement perspective this is a problem; ~45% will have a perfect score; ceiling / floor effect • Significant skew often found in medical settings; implications often not discussed • Especially difficult in longitudinal studies. What can we say about someone who had a perfect score before and a less than perfect score now? (NOT solved by IRT)

  11. Dichotomous IRT models • Arbitrary choice to dichotomize into any limitation = 0 and no limitation = 1 • Rarely a good idea to collect more detailed data and throw it away for analysis; lose power (Samejima 1967; van Belle 2002) • Begin with 1PL model

  12. 1PL (~Rasch) model

  13. Implications of 1PL model • All items have same weight (ICCs are parallel) • Only difference is “difficulty” (amount of trait required to endorse) • Nice math (sum score is sufficient) • Nested within 2PL model so can check to see whether it is acceptable • Math: $P(y=1 \mid \theta, b) = 1/[1+\exp(-D(\theta-b))]$ • This puts item difficulty and person trait level on the same scale
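
A minimal sketch of the 1PL response function above may help make "same scale" concrete. The value D = 1.7 is an assumption (it is the conventional scaling constant; the slides do not give a value), and the function name is hypothetical.

```python
import math

D = 1.7  # assumed scaling constant; the slides do not state its value


def p_1pl(theta, b, d=D):
    """1PL probability of endorsing an item of difficulty b at trait level theta."""
    return 1.0 / (1.0 + math.exp(-d * (theta - b)))


# A person whose trait level equals the item difficulty endorses with probability 0.5.
p_at_b = p_1pl(0.5, 0.5)
# One unit above and one unit below the difficulty give mirror-image probabilities.
p_above = p_1pl(1.5, 0.5)
p_below = p_1pl(-0.5, 0.5)
```

Because θ and b enter only through their difference, a difficulty of 0.5 means "the trait level at which endorsement is a coin flip", which is what places items and people on one scale.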

  14. 1PL results from Hays et al.

     | Item | Limited at all (%)* | Difficulty |
     |---|---|---|
     | Vigorous activities | 55 | 0.5 |
     | Walking > 1 mile | 51 | 0.1 |
     | Climbing 1 flight of stairs | 45 | -0.1 |
     | Shopping | 28 | -0.6 |
     | Walking 1 block | 28 | -0.7 |
     | Preparing meals, doing laundry | 25 | -0.8 |
     | Bathing or dressing | 16 | -1.2 |
     | Getting around inside home | 16 | -1.2 |
     | Feeding self | 9 | -1.6 |

     * This is just 100 minus “not limited.” Slopes fixed at 3.49.

  15. Comments on 1PL analysis • Note similarities to Table 1 • Same order of items • Relationships between items are similar • Recall: the same data are being used to derive these estimates • Poor measurement properties highlighted; only one “hard” item, with a difficulty of 0.5 • How good is this model for these items?

  16. 2PL model

  17. Implications of 2PL model • Items do not need to have the same weights (ICCs are not parallel) • Difficulty differs as before • Now slope differs as well • “Discrimination” parameter • Harder math (sum score no longer sufficient) • $P(y=1 \mid \theta, a, b) = 1/[1+\exp(-Da(\theta-b))]$ • If the $a$ parameters are identical, reduces to 1PL model
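
The 2PL formula differs from the 1PL only by the slope a. The sketch below assumes D = 1.7 (not stated in the slides) and uses two slope values, 2.5 and 4.1, that appear in the 2PL results for this scale; the function name is hypothetical.

```python
import math

D = 1.7  # assumed scaling constant; the slides do not state its value


def p_2pl(theta, a, b, d=D):
    """2PL probability of endorsement with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))


# Same difficulty, different slopes: just above b, the steeper item is
# already more likely to be endorsed.
p_low_slope = p_2pl(0.6, a=2.5, b=0.5)
p_high_slope = p_2pl(0.6, a=4.1, b=0.5)

# With a = 1 the expression is exactly the 1PL form 1 / (1 + exp(-D(theta - b))).
p_unit_slope = p_2pl(0.6, a=1.0, b=0.5)
p_1pl_form = 1.0 / (1.0 + math.exp(-D * (0.6 - 0.5)))
```

A larger a makes the curve steeper around b, so the item separates people just below b from people just above it more sharply.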

  18. Table 4: 2PL results

     | Item | Difficulty (b) | Discrim (a) |
     |---|---|---|
     | Vigorous activities | 0.5 | 2.5 |
     | Walking > 1 mile | 0.1 | 4.1 |
     | Climbing 1 flight stairs | -0.1 | 3.5 |
     | Shopping | -0.6 | 3.7 |
     | Walking 1 block | -0.7 | 3.7 |
     | Making meals / laundry | -0.8 | 3.8 |
     | Bathing or dressing | -1.2 | 3.5 |
     | Mobility inside home | -1.2 | 3.6 |
     | Feeding self | -1.6* | 3.2 |

     * Note that b parameters are identical to 1PL model!

  19. Assumptions of 1PL model • Because the 1PL model is nested within the 2PL model, we can test whether it is okay to treat the a’s as if they were constant
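
One common way to carry out such a nested-model check is a likelihood-ratio test; the slides do not say which test was used, so the following is an assumption about the general approach rather than a description of the Hays et al. analysis. With $L_{\text{1PL}}$ and $L_{\text{2PL}}$ the maximized likelihoods of the two models fitted to the same data:

$$
G^2 = -2\left[\ln L_{\text{1PL}} - \ln L_{\text{2PL}}\right] \;\approx\; \chi^2_{\Delta p},
$$

where $\Delta p$ is the difference in the number of free parameters. If the 1PL model estimates one common slope and the 2PL model estimates one slope per item, then $\Delta p = n - 1$ for $n$ items. A large $G^2$ relative to the $\chi^2_{\Delta p}$ reference distribution suggests that the slopes should not be treated as constant.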

  20. Graph of Hays 2PL results

  21. Graph of Hays 2PL results – I(θ)

  22. Comments on 2PL results • Not that different from 1PL results • Some variability in slopes, however • Easier to look at assumption of identical slopes on the information plane • Ability / trait level estimates tend to be very similar in simulation studies and in real data studies with 1PL and 2PL models • Difficulty much more important than slope in determining score

  23. What about that dichotomization? • Don’t want to throw away all the data we collected; “Yes, a little” vs. “yes, a lot” currently lumped • Polytomous IRT models let us keep these distinctions • Graded response model (GRM) vs. partial credit model (PCM) vs. rating scale model (RSM) • GRM is most flexible; 2PL extension • Samejima (1967)! • PCM and RSM are both Rasch extensions
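
As a rough sketch of how the GRM extends the 2PL: each boundary between adjacent response categories gets its own 2PL-style curve, and category probabilities are differences between neighbouring curves. The parameter values below are invented, D = 1.7 is assumed, and the function name is hypothetical.

```python
import math

D = 1.7  # assumed scaling constant; the slides do not state its value


def grm_category_probs(theta, a, thresholds, d=D):
    """Graded response model category probabilities.

    thresholds must be increasing (b_1 < ... < b_{K-1}); returns K probabilities,
    one per ordered response category, lowest category first.
    """
    # Cumulative curves P(Y >= k): 1 for the lowest category, 0 past the highest.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-d * a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Probability of category k is the gap between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical three-category item ("limited a lot", "limited a little", "not limited").
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-1.0, 0.5])
```

With a single threshold the function returns the two probabilities of a dichotomous 2PL item, which is the sense in which the GRM is a 2PL extension.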

  24. Table 5: GRM results

  25. Comments on GRM results • Empiric validation of expert opinion on how items relate to underlying construct • Scores have meaning in terms of items (“Trait estimates anchored to item content,” p. II-35) • Didn’t do PCM or RSM • Can see impact of dichotomizing items

  26. Software for IRT • Currently old and clunky • PARSCALE will be illustrated tomorrow • Another option is MULTILOG • NIH has issued an SBIR for new and improved software that’s more user friendly

  27. Outline revisited • Definitions of IRT • IRT vs. CTT • Hays et al. paper • Error and information • Rational test construction • DIF

  28. Error and information • One of the real strengths of IRT is that measurement error is not assumed to be constant across the whole test • Instead, measurement error is modeled directly in the form of item (test) information • $SEM(\theta) = 1/\sqrt{I(\theta)}$

  29. Information formulas • For 2PL model: $I(\theta) = D^2 a^2 P(\theta) Q(\theta)$ • $D^2$ is a constant • $a$ is the slope parameter • $P(\theta)$ is the probability of getting the item correct: $P(y=1 \mid \theta, a, b) = 1/[1+\exp(-Da(\theta-b))]$ • $Q(\theta) = 1 - P(\theta)$ • $P(\theta)Q(\theta)$ generates a hill centered at $\theta = b$ • $P(\theta) \to 0$ as $\theta$ gets small; $Q(\theta) \to 0$ as $\theta$ gets large • Test information is sum of item information curves • Polytomous information is hard, but solved

  30. Information implications • Item information is each item’s individual contribution to measurement precision at each possible score • Test information gives a picture of the test’s measurement precision across all scores • Can use test information to compare tests • Lord (1980) advocated a ratio for comparisons
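
The ratio referred to here is presumably relative efficiency, which compares two tests by dividing their test information functions at each trait level. For tests A and B measuring the same trait on the same scale:

$$
\mathrm{RE}(\theta) = \frac{I_A(\theta)}{I_B(\theta)}, \qquad I_A(\theta) = \sum_{i \in A} I_i(\theta), \quad I_B(\theta) = \sum_{j \in B} I_j(\theta).
$$

Values above 1 at a given $\theta$ mean test A measures more precisely than test B at that trait level. Because $SEM(\theta) = 1/\sqrt{I(\theta)}$, the corresponding ratio of standard errors is $SEM_A(\theta)/SEM_B(\theta) = 1/\sqrt{\mathrm{RE}(\theta)}$.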

  31. Relative information, CSI ‘D’ and CASI

  32. Individual level measurement error • We estimate θ for each individual, and we can compute I(θ) as well • We know not only the score, but also the precision with which we know the score • Need to train providers to request the measurement precision • Can model error – one of the purposes of this workshop is to integrate error terms into statistical models
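
A minimal sketch of what "score plus precision" looks like for one person: estimate θ by maximizing the 2PL likelihood over a grid, then report 1/√I(θ) at the estimate. The item parameters and response pattern are invented, D = 1.7 is assumed, and grid search is used only for simplicity; real IRT software uses more refined estimators.

```python
import math

D = 1.7  # assumed scaling constant; the slides do not state its value


def log_likelihood(theta, responses, items, d=D):
    """2PL log-likelihood of a 0/1 response pattern; items are (a, b) pairs."""
    total = 0.0
    for y, (a, b) in zip(responses, items):
        z = d * a * (theta - b)
        # log P = -log(1 + exp(-z)); log Q = -log(1 + exp(z))
        total += -math.log1p(math.exp(-z)) if y == 1 else -math.log1p(math.exp(z))
    return total


def estimate_theta(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


def sem_at(theta, items, d=D):
    """Standard error 1 / sqrt(I(theta)) with 2PL test information."""
    info = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
        info += d * d * a * a * p * (1.0 - p)
    return 1.0 / math.sqrt(info)


# Hypothetical three-item test, (a, b) per item, and one mixed response pattern.
items = [(1.0, -1.0), (1.5, 0.0), (2.0, 1.0)]
theta_hat = estimate_theta([1, 1, 0], items)
theta_sem = sem_at(theta_hat, items)

# A perfect pattern has no finite maximum: the estimate runs to the grid edge.
theta_perfect = estimate_theta([1, 1, 1], items)
```

The perfect-score case ties back to the ceiling problem noted earlier in the talk: the likelihood keeps increasing with θ, so the data cannot say how far above the hardest item the person is.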

  33. Rational test construction • So far we have described existing tests using new IRT tools • Can also build new tests from item banks • Construct a particular information profile and choose items to fill it out • Large literature in educational psychology • van der Linden and colleagues from Twente
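
The "choose items to fill an information profile" idea can be illustrated with a simple greedy heuristic. This is not the optimization approach developed in the test-assembly literature mentioned on the slide, only a sketch of the goal; the item bank, target profile, D = 1.7, and function names are all invented for illustration.

```python
import math

D = 1.7  # assumed scaling constant; the slides do not state its value


def item_information(theta, a, b, d=D):
    """2PL item information D^2 * a^2 * P * Q."""
    p = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
    return d * d * a * a * p * (1.0 - p)


def select_items(bank, theta_points, target, max_items):
    """Greedily pick items that most reduce the shortfall below the target information.

    bank maps item name -> (a, b); target gives the desired information at each
    theta in theta_points.
    """
    remaining = dict(bank)
    current = [0.0] * len(theta_points)
    chosen = []
    while remaining and len(chosen) < max_items:
        shortfall = [max(0.0, t - c) for t, c in zip(target, current)]
        if sum(shortfall) == 0.0:
            break  # target profile already met everywhere

        def gain(name):
            a, b = remaining[name]
            # Count only information that fills a remaining gap.
            return sum(min(s, item_information(th, a, b))
                       for s, th in zip(shortfall, theta_points))

        best = max(remaining, key=gain)
        a, b = remaining.pop(best)
        current = [c + item_information(th, a, b) for c, th in zip(current, theta_points)]
        chosen.append(best)
    return chosen


# Hypothetical four-item bank and a flat target of 1.0 at three trait levels.
bank = {"easy": (1.5, -1.5), "medium": (1.5, 0.0), "hard": (1.5, 1.5), "weak": (0.3, 0.0)}
chosen = select_items(bank, theta_points=[-1.5, 0.0, 1.5],
                      target=[1.0, 1.0, 1.0], max_items=3)
```

In this toy bank the low-discrimination item is never selected, because it contributes little information at any of the targeted trait levels.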

  34. Differential item functioning (DIF) • Population is increasingly heterogeneous • Concern about cultural, gender, education, etc. fairness of tests • Can use standard statistical tools to assess differential test impact, that is, different performance according to culture etc. • Have to control for underlying trait level to really get at test bias • DIF is how this is accomplished in educational psychology

  35. Different approaches to DIF • IRT approaches – model items separately in different groups and compare • SIBTEST approach – based on dimensionality assessment • Logistic regression approach – treat as an epidemiological problem • Mantel-Haenszel approach – simple 2x2 table approach • MIMIC approach using MPLUS
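
Of the approaches listed, the Mantel-Haenszel one is the easiest to sketch: stratify people by total score (a stand-in for trait level), build a 2x2 table of group by item response in each stratum, and pool the odds ratios. The counts below are invented and the function name is hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across score strata.

    Each stratum is (a, b, c, d): reference group correct, reference incorrect,
    focal group correct, focal incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0.0:
        raise ValueError("odds ratio undefined: no discordant cells")
    return numerator / denominator


# Hypothetical counts in two score strata (invented for illustration).
no_dif = mantel_haenszel_odds_ratio([(20, 20, 10, 10), (30, 10, 15, 5)])
with_dif = mantel_haenszel_odds_ratio([(30, 10, 10, 10), (35, 5, 15, 5)])
```

An odds ratio near 1 means that, among people with similar total scores, the two groups answer the item similarly; values far from 1 flag the item for a closer look. Stratifying on score is what implements the requirement to control for underlying trait level.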

  36. Comments on DIF • Each technique is simple to explain but each is tricky to apply • Not all that clear based on the literature what to do with existing epidemiological data from studies with DIF • It is probably impossible to measure cognition without bias, especially bias related to education • Years of education doesn’t get to the heart of the problem either (Manley paper) • We revisit DIF in detail on Thursday

  37. The end. • Definitions of IRT • IRT vs. CTT • Hays et al. paper • Error and information • Rational test construction • DIF • Comments and questions?
