Introduction to Item Response Theory (IRT) Friday Harbor 2009 Paul K. Crane, MD MPH Dan Mungas, PhD
Disclaimer • Funding for this conference was made possible, in part, by Grant R13 AG030995 from the National Institute on Aging. • The views expressed do not necessarily reflect official policies of the Department of Health and Human Services; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government.
Topics • Application: brief introduction to neuropsychology • Application: educational testing • Similarities and differences • IRT is a single factor CFA model • Cool things you can do with IRT that you can’t do easily with classical test theory • Limitations of IRT; extensions to deal with those limitations
Neuropsychology • Tests administered to understand brain functioning • Usually multiple domains are assessed • Typical clinical questions: • Is this person impaired? • If so, what is the diagnosis? • This person has been treated with (medication X, cognitive therapy Y); is it working?
Neuropsychological test interpretation • Consider predicted ability (“premorbid” ability) and use as a benchmark for interpreting testing results • Based on history, such as occupational attainment, educational background, rank in the military • Based on vocabulary testing; vocabulary is relatively preserved in AD (but not other pathologies) • Generally assume premorbid ability is consistent across all cognitive domains • Compare results to estimated premorbid ability
Neuropsychological tests • Most developed decades ago, primarily used in clinical rather than research settings • Patterns of deficits • More recent development of epidemiological studies with large batteries of tests • Potential opportunity for modern test theory to improve validity of statistical inference from these sorts of data
Educational testing • A century+ of educational tests • Lord and Novick’s Statistical Theories of Mental Test Scores (1968) • Psychometrics: educational testing as a discipline • Testing companies (ETS, ACT) • State-level testing • Constant generation of new items • Tests change all the time • IRT has emerged as the dominant paradigm for educational tests
Similarities and differences • In both cases, we care about the mental processes that lead to item responses rather than the item responses themselves • Items as indicators of a latent trait or ability • Multiple choice format more common in educational tests • Ordinal formats more common in neuropsychological tests • Item generation less common in neuropsychological tests • Generally many more indicators of each latent trait in educational tests than neuropsychological tests
IRT is a single factor Confirmatory Factor Analysis (CFA) model • Developed from separate lines of thinking • Mathematically equivalent • IRT has better developed infrastructure for addressing measurement precision / reliability • Better for interpretability of an individual test taker’s score • Facilitates CAT (computerized adaptive testing, to be discussed later) • Structural equation modeling (SEM) has better developed infrastructure for addressing violations of IRT’s assumptions and for “the structural part”, which may be the primary interest from a research perspective • SEM infrastructure for addressing measurement precision is being developed (Dr. Curtis)
Item characteristic curves b parameter: difficulty
Item characteristic curves a parameter: slope
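A minimal sketch of the curve described on the two slides above, assuming the two-parameter logistic (2PL) form; the slides do not say which model was used to draw the curves. The function name `icc` and the parameter values are hypothetical, chosen only to show what b (difficulty) and a (slope) do.

```python
import math

def icc(theta, a, b):
    """Probability of a correct response under an assumed 2PL model.

    theta: latent ability; a: slope (discrimination); b: difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is 0.5, whatever the slope.
print(icc(0.0, a=1.0, b=0.0))

# A larger b moves the curve to the right: same ability, lower probability.
print(icc(0.0, a=1.0, b=1.0) < icc(0.0, a=1.0, b=0.0))

# A larger a makes the curve steeper around b.
print(icc(0.5, a=2.0, b=0.0) > icc(0.5, a=1.0, b=0.0))
```

Reading the parameters this way matches the slide titles: b says where on the ability scale the item sits, and a says how sharply the item separates people just below that point from people just above it.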
Where do item parameters come from? • Large data set • Need a good distribution of the thing measured by the test (some high functioning, some low functioning) • Do not need to be representative of anything in particular (not “norms”) • IRT package figures out where the people are and where the items are • Iterative EM algorithm • Think of items as beads on horizontal strings that can be moved left-right depending on difficulty
What do we do with ICCs? • Important picture: the test characteristic curve • TCC is the sum of all of the item characteristic curves • Plot of the standard score associated with each value of the underlying thing measured by the test • Next slides are TCCs of imaginary tests made up of dichotomous items
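The "sum of all of the item characteristic curves" can be sketched as follows, again assuming 2PL dichotomous items. The five-item test and the names `icc` and `tcc` are hypothetical and are not the imaginary tests plotted on the next slides.

```python
import math

def icc(theta, a, b):
    """Assumed 2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected number-correct score at theta: the sum of the item curves."""
    return sum(icc(theta, a, b) for a, b in items)

# Hypothetical 5-item test: equal slopes, difficulties spread from -2 to 2.
items = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]

for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(theta, round(tcc(theta, items), 2))
```

Because each item curve lies between 0 and 1, the test curve lies between 0 and the number of items, and it rises with theta. How straight it is over a given ability range depends on where the item difficulties sit.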
Comments on that test • Essentially linear test characteristic curve • Immaterial whether the standard score or the IRT score is used in analyses • No ceiling or floor effect • People at the extremes of the thing measured by the test will get some right and get some wrong • Pretty nice test!
Comments on that test • Essentially linear test characteristic curve • Immaterial whether the standard score or the IRT score is used in analyses • No ceiling or floor effect • People at the extremes of the thing measured by the test will get some right and get some wrong • Pretty nice test! • But that’s what we said about the last one and it had twice as many items!
Why might we want twice as many items? • Measurement precision / reliability • CTT: summarized in a single number: alpha • IRT: conceptualized as a quantity that may vary across the range of the test • Information • Mathematical relationship between information and standard error of measurement • Intuitively makes sense that a test with 2x the items will measure more precisely / more reliably than a test with 1x the items
Comments about these information and SEM curves • The information curves differ from one another more than the SEM curves do • Inverse square root relationship: SEM = 1 / sqrt(TIC), where TIC is the test information • TIC = 100, SEM = 0.10 (1/10) • TIC = 25, SEM = 0.20 (1/5) • TIC = 16, SEM = 0.25 (1/4) • TIC = 9, SEM = 0.33 (1/3) • TIC = 4, SEM = 0.50 (1/2) • Trade off between test length and measurement precision • CAT discussion later
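The inverse square root relationship can be checked directly. This short sketch recomputes the SEM for the information values listed on the slide; the function name `sem_from_information` is hypothetical.

```python
import math

def sem_from_information(information):
    """Standard error of measurement as 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(information)

# Information values taken from the slide above.
for tic in (100, 25, 16, 9, 4):
    print(tic, round(sem_from_information(tic), 2))

# Doubling the information shrinks the SEM by a factor of sqrt(2), not 2.
print(round(sem_from_information(4) / sem_from_information(8), 3))
```

This is why the information curves look more different from each other than the SEM curves do: a fourfold gain in information only halves the standard error.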
These were highly selected “tests” • It would be possible to design such a test if we started with a robust item pool • Almost certainly not going to happen by accident / history • What are more realistic tests? • First example: items bunched up in the middle
Comments on these TCCs • Same number of items but very different shapes • Now it may matter whether you use an IRT score or a standard score in analyses • Both ceiling and floor effects
Comments on the TICs and SEMs • Comparing the red test and the blue test: the red test is better for people of moderate ability (more items close to where they are) • For people right in the middle, measurement precision is just as good as a test twice as long • Items far away from your ability level don’t help your standard error • The blue test is better for people at the extremes (more items close to where they are)
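The contrast between a bunched test and a spread-out test can be illustrated with two made-up tests of equal length. This is a sketch assuming 2PL items with equal slopes. The lists `red` and `blue` are hypothetical stand-ins, not the item parameters behind the plotted curves.

```python
import math

def item_information(theta, a, b):
    """Item information under an assumed 2PL model: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def sem(theta, items):
    """Standard error of measurement at theta: 1 / sqrt(summed information)."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical "red" test: ten items bunched at difficulty 0.
red = [(1.0, 0.0)] * 10
# Hypothetical "blue" test: ten items spread evenly from -3 to 3.
blue = [(1.0, -3.0 + 6.0 * i / 9) for i in range(10)]

# Columns: theta, SEM on the red test, SEM on the blue test.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(sem(theta, red), 2), round(sem(theta, blue), 2))
```

With these made-up parameters the bunched test has the smaller SEM at theta = 0 and the spread-out test has the smaller SEM at theta = -3 and 3, which is the pattern the slide describes.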
Where do information curves come from? • Item information curves use the same parameters as the item characteristic curves (difficulty level, b, and strength of association with latent trait or ability, a) (see next slides) • Test information is the sum of all of the item information curves • We can do that because of local independence
More thoughts on IICs • The b parameters shift the location of the curve left and right (where is the bead on the string?) • The a parameters modify the height of the curve • But not tons; location location location • IRT gives us tools to evaluate and manage measurement precision
Comments on global cognitive tests • None of these have hard items, so people with average and high levels of cognition are measured with low precision • Curvilinearity in these tests may be a problem for longitudinal data analyses, especially when people start at different places • For example, education as a risk factor • Varying levels of measurement precision across time
What about specific domains? • Baseline data from ADNI • Assigned items to batteries • Figured out how to generate a TCC vs. a z score composite (which was not as simple as I had thought)
ADNI memory TCC • From 1 to 0: 0.8 z units • From 0 to -1: 0.7 • From -1 to -2: 0.5
ADNI Executive Functioning TCC THAT looks pretty linear
ADNI memory and EF TICs So EF has linear scaling, but does not have much measurement precision
Summary of this section • IRT provides tools to evaluate and manage measurement precision / standard error of measurement • Test characteristic curve and the test information curve are very helpful in understanding how a test is working • Alongside a histogram of abilities from a population of interest, which are on the same metric
