Item Response Theory. Dan Mungas, Ph.D., Department of Neurology, University of California, Davis. What is it? Why should anyone care? IRT basics. Item Response Theory - What Is It: modern approach to psychometric test development; mathematical measurement theory.
Item Response Theory • Dan Mungas, Ph.D. • Department of Neurology, University of California, Davis
Item Response Theory - What Is It • Modern approach to psychometric test development • Mathematical measurement theory • Associated numeric and computational methods • Widely used in large scale educational, achievement, and aptitude testing • More than 50 years of conceptual and methodological development
Item Response Theory - Methods • Dataset consists of a rectangular table • rows correspond to examinees • columns correspond to items • IRT applications simultaneously estimate examinee ability and item parameters • iterative maximum likelihood estimation algorithms • processor intensive, but this is no longer a practical problem
Item Types • Dichotomous • Multiple Choice • Polytomous • Information is greater for a polytomous item than for the same item dichotomized at a cutpoint
What Is the Item-Level Response • Smallest discrete unit (e.g., Object Naming) • Sum of correct responses (e.g., trials in a word list learning test) • For practical reasons, continuous measures might have to be recoded into ordinal scales with a reduced number of response categories (e.g., 10 or 15)
Item Response Theory - Basic Results • Item parameters • difficulty • discrimination • correction for guessing • most applicable for multiple choice items • Subject Ability (in the psychometric sense) • Capacity to successfully respond to test items (or propensity to respond in a certain direction) • Net result of all genetic and environmental influences • Measured by scales composed of homogeneous items • Item difficulty and subject ability are on the same scale
Item Response Theory - Outcomes • Item-Level Results • Item Characteristic Curve (ICC) • non-linear function relating ability to probability of correct response to item • Item Information Curve (IIC) • non-linear function showing precision of measurement (reliability) at different ability points • Both curves are defined by the item parameters
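The two item-level curves can be sketched in a few lines. This is a minimal illustration, assuming a two-parameter logistic (2PL) item without the 1.7 scaling constant; the function names and the example parameter values (a = 1.5, b = 0.5) are hypothetical and not taken from the slides.

```python
import math


def icc_2pl(theta, a, b):
    """Item Characteristic Curve: probability of a correct response
    at ability theta for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def iic_2pl(theta, a, b):
    """Item Information Curve for the 2PL model: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical item parameters, chosen only for illustration.
a, b = 1.5, 0.5
for theta in (-2.0, 0.5, 2.0):
    print(theta, round(icc_2pl(theta, a, b), 3), round(iic_2pl(theta, a, b), 3))
```

Under this model the information peaks where ability equals the item difficulty, which is why both curves are fully determined by the item parameters.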
Item Response Theory - Outcomes • Test-Level Results • Test Characteristic Curve (TCC) • non-linear function relating ability to expected total test score • Test Information Curve (TIC) • non-linear function showing precision of measurement (reliability) at different ability points • Both are sums of the item-level functions of the included items
Item Response Theory - Fundamental Assumptions • Unidimensionality - items measure a homogeneous, single domain • Local independence - covariance among items is determined only by the latent dimension measured by the item set
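The summation that produces the test-level curves can be sketched as follows. This assumes dichotomous 2PL items; the names `tcc` and `tic` and the three-item parameter list are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Test Characteristic Curve: expected total score = sum of item ICCs."""
    return sum(p_correct(theta, a, b) for a, b in items)


def tic(theta, items):
    """Test Information Curve: sum of the item information functions."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


# Hypothetical (discrimination, difficulty) pairs for a three-item test.
items = [(1.0, -1.0), (1.5, 0.0), (2.0, 1.0)]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(tcc(theta, items), 3), round(tic(theta, items), 3))
```

Because the test curves are plain sums, adding or dropping an item changes them in a predictable way, which is what makes targeted scale construction possible.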
IRT Models • 1PL (Rasch) • Only Difficulty and Ability are estimated • Discrimination is assumed to be equal across items • 2PL • Discrimination, Difficulty and Ability are estimated • Guessing is assumed to not have an effect • 3PL • Discrimination, Difficulty, Guessing, and Ability are estimated (multiple choice items)
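One common way to write the three models is as nested special cases of a single logistic function. The notation below is an assumption, since the slides do not give symbols: θ is ability, and a_i, b_i, c_i are the discrimination, difficulty, and guessing parameters of item i. The optional 1.7 scaling constant is omitted.

```latex
% 3PL: discrimination a_i, difficulty b_i, guessing c_i
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}

% 2PL: no guessing
c_i = 0 \quad\Rightarrow\quad P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

% 1PL (Rasch): no guessing and a common discrimination a
c_i = 0,\; a_i = a \quad\Rightarrow\quad P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b_i)}}
```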
Item Response Theory - Invariance Properties • Invariance requires that basic assumptions are met • Item parameters are invariant across different samples • Within the range of overlap of distributions • Distributions of samples can differ • Ability estimates are invariant across different item sets • Assumes that ability range of items spans ability range of subjects that is of interest
Why Do We Care - Applications of IRT in Health Care Settings • Refined scoring of tests • Characterization of psychometric properties of existing tests • Construction of new tests
Test Scoring • IRT permits refined scoring of items that allows for differential weighting of items based on their item parameters
Physical Function Scale (Hays, Morales & Reise, 2000)
Item                                                                      Limited a lot   Limited a little   Not limited at all
Vigorous activities: running, lifting heavy objects, strenuous sports          1                2                   3
Climbing one flight                                                            1                2                   3
Walking more than 1 mile                                                       1                2                   3
Walking one block                                                              1                2                   3
Bathing / dressing self                                                        1                2                   3
Preparing meals / doing laundry                                                1                2                   3
Shopping                                                                       1                2                   3
Getting around inside home                                                     1                2                   3
Feeding self                                                                   1                2                   3
How to Score Test • Simple approach: there are numbers that will be circled; total these up, and we have a score. • But: should “limited a lot” for walking a mile receive the same weight as “limited a lot” in getting around inside the home? • Should “limited a lot” for walking one block be twice as bad as “limited a little” for walking one block?
How IRT Can Help • IRT provides us with a data-driven means of rational scoring for such measures • Items that are more discriminating are given greater weight • In practice, the simple sum score is often very good; improvement is at the margins
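The weighting idea can be illustrated with a small sketch. It assumes dichotomous 2PL items (the scale above is polytomous, so this is a simplification) and estimates ability by a grid search over the likelihood. All names and parameter values are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct (or endorsed) response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p if u == 1 else 1.0 - p)
    return total


def ml_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of ability on [lo, hi]."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Three hypothetical items with equal difficulty but different discrimination.
items = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0)]

# Both patterns have the same sum score (1), but different items were passed.
weak_only = ml_ability([1, 0, 0], items)    # only the least discriminating item
strong_only = ml_ability([0, 0, 1], items)  # only the most discriminating item
print(weak_only, strong_only)
```

The two patterns receive different ability estimates even though a simple sum score treats them as identical: passing the more discriminating item counts for more.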
Description of Psychometric Properties • The Test Information Curve (TIC) shows reliability that varies continuously with ability • Depicts ability levels associated with high and low reliability • The standard error of measurement is directly related to the information value I(θ) • SEM(θ) = 1 / sqrt(I(θ)) • SEM(θ) and I(θ) also correspond directly to the traditional reliability coefficient r • r(θ) = 1 - 1 / I(θ), with ability scaled to unit variance
TICs for English and Spanish Language Versions of Two Scales (Mungas et al., 2004)
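As a worked example of these two relationships, the sketch below converts a few information values into a standard error and a reliability. The function names are hypothetical, and the reliability formula assumes ability is scaled to unit variance.

```python
import math


def sem_from_information(info):
    """Standard error of measurement: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(info)


def reliability_from_information(info):
    """Reliability at a given ability level: 1 - 1 / information
    (assumes the ability scale has unit variance)."""
    return 1.0 - 1.0 / info


for info in (2.0, 4.0, 10.0):
    print(info,
          round(sem_from_information(info), 3),
          round(reliability_from_information(info), 3))
```

An information value of 10, for instance, corresponds to a reliability of 0.90, while an information value of 4 corresponds to 0.75.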
Construction of New Scales • Items can be selected to create scales with desired measurement properties • Can be used for prospective test development • Can be used to create new scales from existing tests/item pools • IRT will not overcome inadequate items
TICs from an Existing Global Cognition Scale and Re-Calibrated Existing Cognitive Tests (Mungas et al., 2003)
Principles of Scale Construction • Information corresponds to assessment goals • Broad and flat TIC for longitudinal change measure in population with heterogeneous ability • For selection or diagnostic test, peak at point of ability continuum where discrimination is most important • But normal cognition spans a 4.0 s.d. range, and is even greater in demographically diverse populations
Other Issues in IRT • Polytomous IRT models are available • Useful for ordinal (Likert) rating scales • Each possible score of the item (minus 1) is treated like a separate item with a different difficulty parameter • Information is greater for a polytomous item than for the same item dichotomized at a cutpoint
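The "one difficulty per score boundary" idea can be sketched with Samejima's graded response model. The slides do not name a specific polytomous model, so this choice is an assumption, and the function names and parameter values are hypothetical.

```python
import math


def cumulative(theta, a, b):
    """Probability of responding at or above the category whose
    lower boundary has difficulty (threshold) b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probabilities(theta, a, thresholds):
    """Graded response model: K - 1 ordered thresholds give K categories.
    Each category probability is the difference of adjacent cumulative curves."""
    cum = [1.0] + [cumulative(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


# Hypothetical three-category item (e.g. a lot / a little / not at all):
# two thresholds, one shared discrimination.
probs = category_probabilities(0.0, 1.5, [-1.0, 1.0])
print([round(p, 3) for p in probs])
```

A three-category item therefore carries two difficulty parameters, which is the sense in which each score boundary behaves like a separate dichotomous item.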
Other Issues in IRT • Applicable to broad range of content domains • IRT certainly applies to cognitive abilities • Also applies to other health outcomes • Quality of life • Physical function • Fatigue • Depression • Pain
Other Issues in IRT • Differential Item Functioning - Test Bias • IRT provides explicit methods to evaluate and quantify the extent to which items and tests have different measurement properties in different groups • e.g., racial and ethnic groups, linguistic groups, gender
English and Spanish Item Characteristic Curves for “Lamb/Cordero” Item
English and Spanish Item Characteristic Curves for “Stone/Piedra” Item
Differential Item Functioning (DIF) • DIF refers to systematic bias in measuring “true” ability; it does not address group differences in ability
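One simple way to quantify how far two group-specific item characteristic curves diverge is the unsigned area between them. The sketch below is an illustration under assumed 2PL parameters, not the method used in the studies cited here; the function names and values are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def unsigned_area_between_iccs(ref, focal, lo=-4.0, hi=4.0, steps=801):
    """Approximate the area between two groups' ICCs for the same item
    with a simple Riemann sum over the ability range [lo, hi]."""
    width = (hi - lo) / (steps - 1)
    area = 0.0
    for k in range(steps):
        theta = lo + k * width
        area += abs(p_correct(theta, *ref) - p_correct(theta, *focal)) * width
    return area


# Hypothetical calibrations of one item in two language groups:
# same discrimination, difficulty shifted by one unit in the second group.
group_1 = (1.2, 0.0)
group_2 = (1.2, 1.0)
no_dif = unsigned_area_between_iccs(group_1, group_1)
with_dif = unsigned_area_between_iccs(group_1, group_2)
print(round(no_dif, 3), round(with_dif, 3))
```

An area near zero means examinees of equal ability have the same chance of success in both groups; a larger area indicates that the item behaves differently across groups at matched ability.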
Challenges/Limitations of IRT • Large samples required for stable estimation • 150-200 for 1PL • 400-500 for 2PL • 600-1000 for 3PL • Analytic methods are labor intensive • A number of (expensive*) applications are readily available for IRT analyses • Evaluation of basic assumptions, identification of an appropriate model, and systematic IRT analysis require considerable expertise and labor • * but R is a free alternative
Computerized Adaptive Testing (CAT) • IRT based computer driven method • Selects items that most closely match examinee’s ability • Administers only items needed to achieve a pre-specified level of precision in measurement (information, s.e.m., reliability)
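The adaptive loop described above can be sketched as: pick the unused item with the most information at the current ability estimate, administer it, re-estimate ability, and stop once the standard error reaches the target. This is a minimal sketch assuming 2PL items and grid-search maximum likelihood scoring; the item bank, the `answer` callback, and the stopping value are all hypothetical.

```python
import math


def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def ml_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood ability estimate on [lo, hi]."""
    def loglik(theta):
        total = 0.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if u == 1 else 1.0 - p)
        return total
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=loglik)


def run_cat(bank, answer, target_sem=0.7):
    """Administer items until SEM <= target_sem or the bank is exhausted.
    `answer(i)` returns the 0/1 response to bank item i."""
    theta = 0.0  # starting ability estimate
    administered, responses = [], []
    remaining = list(range(len(bank)))
    while remaining:
        # Select the unused item that is most informative at the current estimate.
        nxt = max(remaining, key=lambda i: information(theta, *bank[i]))
        remaining.remove(nxt)
        administered.append(nxt)
        responses.append(answer(nxt))
        used = [bank[i] for i in administered]
        theta = ml_ability(responses, used)
        total_info = sum(information(theta, a, b) for a, b in used)
        if 1.0 / math.sqrt(total_info) <= target_sem:
            break
    return theta, administered


# Hypothetical bank: 13 items, equal discrimination, difficulties -3.0 to 3.0.
bank = [(1.5, -3.0 + 0.5 * k) for k in range(13)]
# Hypothetical deterministic examinee: passes every item easier than 0.3.
theta_hat, order = run_cat(bank, lambda i: 1 if bank[i][1] < 0.3 else 0)
print(theta_hat, order)
```

A production CAT would typically add exposure control, content balancing, and a more robust estimator for all-correct or all-incorrect patterns; the grid bounds here simply keep such estimates finite.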
Why CAT • Efficiency • Administration • Standardization • Time efficiency • Data collection • Scoring • Computer can implement complex scoring algorithms
What You Need for CAT • Computer technology • Item Selection • Item Administration • Scale Scoring • Item bank with IRT parameters • Range of item difficulty relevant to measurement needs
What is Straightforward/Easy? • Dichotomous items • Multiple choice items • Ordered polytomous response scales • Up to 10-15 response options
Technical Challenges • Continuous response scales (memory, timed tasks) • Can be recoded into smaller number of ordered response ranges • Lose information
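Recoding a continuous score into ordered response ranges can be sketched as below. The cutpoints and the example completion times are hypothetical; in practice cutpoints would be chosen from the observed score distribution.

```python
import bisect


def recode_to_ordinal(value, cutpoints):
    """Map a continuous value to a 0-based ordinal category.
    `cutpoints` must be sorted ascending; a value equal to a cutpoint
    falls into the higher category."""
    return bisect.bisect_right(cutpoints, value)


# Hypothetical cutpoints (seconds) giving four ordered categories.
cutpoints = [10.0, 20.0, 30.0]
times = [5.0, 11.0, 19.0, 30.0, 42.5]
categories = [recode_to_ordinal(t, cutpoints) for t in times]
print(categories)
```

Times of 11 and 19 seconds land in the same category, which shows the information loss that comes with recoding.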
Methodological Challenges • Sample size requirements • Minimally 300-600 cases for stable estimation of item parameters • Differential Item Functioning and Measurement Bias • Essentially involves item calibration within groups of interest • e.g., age, education, language, gender, race • Available literature provides minimal guidance
References • Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(9 Suppl), II28-42. • Mungas, D., Reed, B. R., & Kramer, J. H. (2003). Psychometrically matched measures of global cognition, memory, and executive function for assessment of cognitive decline in older persons. Neuropsychology, 17(3), 380-392. • Mungas, D., Reed, B. R., Crane, P. K., Haan, M. N., & González, H. (2004). Spanish and English Neuropsychological Assessment Scales (SENAS): Further development and psychometric characteristics. Psychological Assessment, 16(4), 347-359.