Introduction to Item Response Theory (IRT)
Friday Harbor 2009
Paul K. Crane, MD MPH • Dan Mungas, PhD
Disclaimer • Funding for this conference was made possible, in part, by Grant R13 AG030995 from the National Institute on Aging. • The views expressed do not necessarily reflect official policies of the Department of Health and Human Services; nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government.
Topics • Application: brief introduction to neuropsychology • Application: educational testing • Similarities and differences • IRT is a single factor CFA model • Cool things you can do with IRT that you can’t do easily with classical test theory • Limitations of IRT; extensions to deal with those limitations
Neuropsychology • Tests administered to understand brain functioning • Usually multiple domains are assessed • Typical clinical questions: • Is this person impaired? • If so, what is the diagnosis? • This person has been treated with (medication X, cognitive therapy Y); is it working?
Neuropsychological test interpretation • Consider predicted (“premorbid”) ability and use it as a benchmark for interpreting test results • Based on history, such as occupational attainment, educational background, or rank in the military • Based on vocabulary testing; vocabulary is relatively preserved in AD (though not in other pathologies) • Generally assume premorbid ability is consistent across all cognitive domains • Compare results to estimated premorbid ability
Neuropsychological tests • Most developed decades ago, primarily used in clinical rather than research settings • Patterns of deficits • More recent development of epidemiological studies with large batteries of tests • Potential opportunity for modern test theory to improve validity of statistical inference from these sorts of data
Educational testing • A century+ of educational tests • Lord and Novick’s Statistical Theories of Mental Test Scores (1968) • Psychometrics: educational testing as a discipline • Testing companies (ETS, ACT) • State-level testing • Constant generation of new items • Tests change all the time • IRT has emerged as the dominant paradigm for educational tests
Similarities and differences • In both cases, we care about the mental processes that lead to item responses rather than the item responses themselves • Items as indicators of a latent trait or ability • Multiple choice format more common in educational tests • Ordinal formats more common in neuropsychological tests • Item generation less common in neuropsychological tests • Generally many more indicators of each latent trait in educational tests than neuropsychological tests
IRT is a single factor Confirmatory Factor Analysis (CFA) model • Developed from separate lines of thinking • Mathematically equivalent (a parameter mapping is sketched below) • IRT has better developed infrastructure for addressing measurement precision / reliability • Better for interpreting an individual test taker’s score • Facilitates CAT (computerized adaptive testing, to be discussed later) • Structural equation modeling (SEM) has better developed infrastructure for addressing violations of IRT’s assumptions and for “the structural part,” which may be the primary interest from a research perspective • Infrastructure for addressing measurement precision is being developed (Dr. Curtis)
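For the mathematically inclined, here is the standard mapping between the two parameterizations (a sketch under the normal-ogive convention, following Takane & de Leeuw, 1987; λ_j is the item’s factor loading and τ_j its threshold):

```latex
% Latent response formulation: y^*_j = \lambda_j \theta + \epsilon_j,
% \epsilon_j \sim N(0,\, 1 - \lambda_j^2), with x_j = 1 iff y^*_j > \tau_j.
% Then P(x_j = 1 \mid \theta) = \Phi\big(a_j(\theta - b_j)\big), where
a_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^2}}, \qquad b_j = \frac{\tau_j}{\lambda_j}
```

Multiplying a_j by about 1.7 converts from the normal-ogive metric to the usual logistic metric.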
Item characteristic curves: b parameter (difficulty)
Item characteristic curves: a parameter (slope)
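To make the ICC slides concrete, here is a minimal Python sketch of the two-parameter logistic (2PL) curve; the a and b values are invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct response | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 201)

# b shifts the curve left/right (difficulty);
# a controls its steepness (slope / discrimination)
for a, b in [(1.0, -1.0), (1.0, 1.0), (2.5, 1.0)]:
    plt.plot(theta, icc(theta, a, b), label=f"a={a}, b={b}")
plt.xlabel("theta (latent ability)")
plt.ylabel("P(item correct)")
plt.legend()
plt.show()
```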
Where do item parameters come from? • Large data set • Need a good distribution of the thing measured by the test (some high functioning, some low functioning) • Do not need to be representative of anything in particular (not “norms”) • IRT package figures out where the people are and where the items are • Iterative EM algorithm • Think of items as beads on horizontal strings that can be moved left-right depending on difficulty
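For intuition about the iterative algorithm, below is a stripped-down sketch of a Bock-Aitkin marginal maximum likelihood EM loop for a 2PL model, run on simulated data. Everything here (sample sizes, parameter values, iteration counts) is illustrative; production packages such as mirt add priors, convergence checks, and better quadrature.

```python
import numpy as np
from scipy.optimize import minimize

# Simulate dichotomous responses from a 2PL model (illustrative data)
rng = np.random.default_rng(0)
n_persons, n_items = 1000, 10
theta = rng.normal(size=n_persons)
a_true = rng.uniform(0.8, 2.0, n_items)
b_true = rng.normal(size=n_items)
p_true = 1 / (1 + np.exp(-a_true * (theta[:, None] - b_true)))
resp = (rng.random((n_persons, n_items)) < p_true).astype(int)

# Quadrature grid over a standard-normal prior for theta
nodes = np.linspace(-4, 4, 21)
weights = np.exp(-0.5 * nodes**2)
weights /= weights.sum()

a_hat, b_hat = np.ones(n_items), np.zeros(n_items)
for _ in range(25):  # EM iterations
    # E-step: posterior weight of each quadrature node for each person
    P = 1 / (1 + np.exp(-a_hat * (nodes[:, None] - b_hat)))      # (Q, J)
    loglik = resp @ np.log(P).T + (1 - resp) @ np.log(1 - P).T   # (N, Q)
    post = np.exp(loglik) * weights
    post /= post.sum(axis=1, keepdims=True)
    # Expected people (nq) and correct responses (rq) at each node
    nq = post.sum(axis=0)    # (Q,)
    rq = post.T @ resp       # (Q, J)
    # M-step: move each item's (a, b) to fit the expected counts
    for j in range(n_items):
        def nll(params, j=j):
            pj = 1 / (1 + np.exp(-params[0] * (nodes - params[1])))
            pj = np.clip(pj, 1e-10, 1 - 1e-10)
            return -(rq[:, j] * np.log(pj)
                     + (nq - rq[:, j]) * np.log(1 - pj)).sum()
        a_hat[j], b_hat[j] = minimize(nll, [a_hat[j], b_hat[j]],
                                      method="Nelder-Mead").x
```

This is the beads-on-strings picture in code: each M-step slides the items left or right (b) and adjusts their slopes (a) to match where the E-step currently thinks the people are.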
What do we do with ICCs? • Important picture: the test characteristic curve • TCC is the sum of all of the item characteristic curves • Plot of the standard score associated with each value of the underlying thing measured by the test • Next slides are TCCs of imaginary tests made up of dichotomous items
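A minimal sketch of the TCC computation, assuming a hypothetical 10-item dichotomous test with invented 2PL parameters; summing the ICCs gives the expected number-correct (standard) score at each ability level:

```python
import numpy as np

# Hypothetical item parameters for a 10-item dichotomous test
a = np.full(10, 1.2)              # slopes
b = np.linspace(-2.0, 2.0, 10)    # difficulties spread across the range
theta = np.linspace(-4, 4, 161)

# Each column is one item's ICC; the TCC is their sum, i.e., the
# expected number-correct (standard) score at each value of theta
iccs = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
tcc = iccs.sum(axis=1)
```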
Comments on that test • Essentially linear test characteristic curve • Immaterial whether the standard score or the IRT score is used in analyses • No ceiling or floor effect • People at the extremes of the thing measured by the test will get some right and get some wrong • Pretty nice test!
Comments on that test • Essentially linear test characteristic curve • Immaterial whether the standard score or the IRT score is used in analyses • No ceiling or floor effect • People at the extremes of the thing measured by the test will get some right and get some wrong • Pretty nice test! • But that’s what we said about the last one and it had twice as many items!
Why might we want twice as many items? • Measurement precision / reliability • CTT: summarized in a single number: alpha • IRT: conceptualized as a quantity that may vary across the range of the test • Information • Mathematical relationship between information and standard error of measurement • Intuitively makes sense that a test with 2x the items will measure more precisely / more reliably than a test with 1x the items
Comments about these information and SEM curves • Information curves look more different than the SEM curves • Inverse square root relationship: TIC 100 → SEM 0.10 (1/10); TIC 25 → SEM 0.20 (1/5); TIC 16 → SEM 0.25 (1/4); TIC 9 → SEM 0.33 (1/3); TIC 4 → SEM 0.50 (1/2) • Trade off between test length and measurement precision • CAT discussion later
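The inverse square-root relationship in the list above is exact:

```latex
\mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}}
```

On a metric where the latent trait has variance 1, this also implies a conditional reliability of roughly I(θ) / (I(θ) + 1), which is another way to see the trade-off between test length and precision.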
These were highly selected “tests” • It would be possible to design such a test if we started with a robust item pool • Almost certainly not going to happen by accident / history • What are more realistic tests? • First example: items bunched up in the middle
Comments on these TCCs • Same number of items but very different shapes • Now it may matter whether you use an IRT score or a standard score in analyses • Both ceiling and floor effects
Comments on the TICs and SEMs • Comparing the red test and the blue test: the red test is better for people of moderate ability (more items close to where they are) • For people right in the middle, measurement precision is just as good as a test twice as long • Items far away from your ability level don’t help your standard error • The blue test is better for people at the extremes (more items close to where they are)
Where do information curves come from? • Item information curves use the same parameters as the item characteristic curves (difficulty level, b, and strength of association with latent trait or ability, a) (see next slides) • Test information is the sum of all of the item information curves • We can do that because of local independence
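A sketch of those calculations for the 2PL model (parameters invented for illustration): item information is a_j² P_j(θ)(1 − P_j(θ)), which peaks where θ = b_j, and local independence lets us sum the item curves to get the test information curve:

```python
import numpy as np

# Hypothetical 2PL parameters for a short test
a = np.array([1.0, 1.5, 2.0, 1.2, 0.8])
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
theta = np.linspace(-4, 4, 161)

P = 1 / (1 + np.exp(-a * (theta[:, None] - b)))  # ICCs: (theta, items)
item_info = a**2 * P * (1 - P)     # each item's information; peaks at theta = b
test_info = item_info.sum(axis=1)  # local independence lets us just add them
sem = 1 / np.sqrt(test_info)       # standard error of measurement at each theta
```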
More thoughts on IICs • The b parameters shift the location of the curve left and right (where is the bead on the string?) • The a parameters modify the height of the curve, though not dramatically; location matters most (location, location, location) • IRT gives us tools to evaluate and manage measurement precision
Comments on global cognitive tests • None of these have hard items, so people with average and high levels of cognition are measured with low precision • Curvilinearity in these tests may be a problem for longitudinal data analyses, especially when people start at different places • For example, education as a risk factor • Varying levels of measurement precision across time
What about specific domains? • Baseline data from ADNI • Assigned items to batteries • Figured out how to generate a TCC vs. a z score composite (which was not as simple as I had thought)
ADNI memory TCC: from 1 to 0 on the IRT metric, the standard score changes by 0.8 z units; from 0 to −1, by 0.7; from −1 to −2, by 0.5
ADNI Executive Functioning TCC: that one looks pretty linear
ADNI memory and EF TICs: so EF has linear scaling, but not much measurement precision
Summary of this section • IRT provides tools to evaluate and manage measurement precision / standard error of measurement • The test characteristic curve and the test information curve are very helpful in understanding how a test is working • Especially alongside a histogram of abilities from a population of interest, plotted on the same metric
Topics • Application: brief introduction to neuropsychology • Application: educational testing • Similarities and differences • IRT is a single factor CFA model • Cool things you can do with IRT that you can’t do easily with classical test theory • Limitations of IRT; extensions to deal with those limitations