Item Response Theory

What’s wrong with the old approach? • Classical test theory • Sample dependent • Parallel test form issue • Comparing examinee scores • Reliability • No predictability • “Error” is the same for everybody

So, what is IRT? • A family of mathematical models that describe the interaction between examinees and test items • Examinee performance can be predicted in terms of the underlying trait • Provides a means for estimating scores for people and characteristics of items • Common framework for describing people and items

Some Terminology • “Ability” • We use this as a generic term for the “thing” that we are trying to measure • The “thing” can be any old “thing” and we need not concern ourselves with labeling it, but examples include: • Reading ability • Math performance • Depression

The ogive • Naturally occurring form that describes something about people • Used throughout science, engineering, and the social sciences • Also used in architecture, carpentry, photography, art, and so forth

The ogive

The Item Characteristic Curve (ICC) • This function really does everything: • Scales items & people onto a common metric • Helps in standard setting • Foundation of equating • Some meaning in terms of student ability

The ICC • Any line in a Cartesian system can be defined by a formula • The simplest formula for the ogive is the logistic function: $P(x_{ij} = 1 \mid \theta_j, b_i) = \dfrac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}}$

The ICC • Where b_i is the item parameter and θ_j is the person parameter • The equation represents the probability of responding correctly to item i given the ability of person j.

b is the inflection point • Figure: ICC for item i, with b_i = 0.125

We can now use the item parameter to calculate p • Let’s assume we have a student with θ = 1.0 and an item with b = 0.125 • Then we can simply plug the numbers into our formula

Using the item parameters to calculate p • Figure: ICC with θ = 1.00 marked, giving p ≈ 0.706
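The plug-in calculation can be checked numerically. The sketch below assumes the one-parameter logistic form p = exp(θ − b) / (1 + exp(θ − b)); the function name `icc_1pl` is hypothetical and not part of the slides. With θ = 1.0 and b = 0.125 it agrees with the value on the slide up to rounding.

```python
import math


def icc_1pl(theta: float, b: float) -> float:
    """Probability of a correct response under the one-parameter logistic model."""
    # Algebraically the same as exp(theta - b) / (1 + exp(theta - b)).
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Values taken from the slide: student ability 1.0, item parameter 0.125.
p = icc_1pl(theta=1.0, b=0.125)
print(round(p, 3))  # 0.706

# At theta == b the probability is exactly 0.5, the inflection point of the ICC.
print(icc_1pl(theta=0.125, b=0.125))  # 0.5
```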

Wait a minute • What do you mean a student with an ability of 1.0?? • Does an ability of 0.0 mean that a student has NO ability? • What if my student has a reading ability estimate of -1.2?

The ability scale • Ability is on an arbitrary scale that just so happens to be centered around 0.0 • We use arbitrary scales all the time: • Fahrenheit • Celsius • Decibels • DJIA

Scaled Scores • Although ability estimates are centered around zero, reported scores are not • However, scaled scores are typically a linear transformation of ability estimates • Example of a linear transformation: • (Ability × Slope) + Intercept
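As a minimal sketch of the linear transformation, the snippet below converts a few ability estimates to scaled scores. The slope of 50 and intercept of 500 are made-up values for illustration only; real testing programs choose their own constants.

```python
def scaled_score(ability: float, slope: float, intercept: float) -> float:
    """Linear transformation from the ability metric to a reporting scale."""
    return ability * slope + intercept


# Hypothetical scaling constants, not taken from the slides.
SLOPE = 50.0
INTERCEPT = 500.0

for ability in (-1.2, 0.0, 1.0):
    print(ability, round(scaled_score(ability, SLOPE, INTERCEPT), 1))
```

With these constants a negative ability estimate such as -1.2 is reported as a positive scaled score, which is the practical reason for reporting on a separate scale.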

The need for scaled scores • Roughly half of the students will have negative ability estimates

The Two Scales of Measurement • Reporting Scale (Scaled Scores) • Student/parent level report • School/district report • Cross year comparisons • Performance level categorization • The Psychometric Scale (θ) • IRT item and person parameters • Equating • Standard setting

Unfortunately, life can get a lot worse • Items vary from one another in a variety of ways: • Difficulty • Discrimination • Guessing • Item type (MC vs. CR)

Items can vary in terms of difficulty • Figure: ICCs for an easier item and a harder item, compared at the ability of a given student

Items can vary in terms of discrimination • Discrimination is reflected by the “pitch” in the ICC • Thus, we allow the ICCs to vary in terms of their slope

Good item discrimination • Figure: two close ability levels give a noticeable difference in p

Poor item discrimination • Figure: the same two ability levels give a smaller difference in p
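The contrast between good and poor discrimination can be sketched with a two-parameter logistic ICC, in which a slope parameter multiplies (θ − b). The parameter name `a`, the two slope values, and the two ability levels are assumptions chosen for illustration.

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: `a` controls the slope, `b` the difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def p_difference(theta_low: float, theta_high: float, a: float, b: float) -> float:
    """Difference in the probability of success between two ability levels."""
    return icc_2pl(theta_high, a, b) - icc_2pl(theta_low, a, b)


# Two close ability levels (hypothetical) evaluated on a steep and a flat item.
THETA_LOW, THETA_HIGH = -0.25, 0.25
good = p_difference(THETA_LOW, THETA_HIGH, a=2.0, b=0.0)
poor = p_difference(THETA_LOW, THETA_HIGH, a=0.5, b=0.0)
print(f"steep item: {good:.3f}, flat item: {poor:.3f}")
```

The steeper item separates the two students by a larger difference in p, which is what discrimination means here.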

Guessing • Figure: the lower end of this item’s ICC asymptotically approaches 0.25
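A lower asymptote like this one is commonly modeled with a three-parameter logistic ICC. The sketch below assumes that form; the parameter names `a`, `b`, and `c` and every value other than the 0.25 asymptote are illustrative.

```python
import math


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic ICC with lower asymptote `c` (guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# With c = 0.25, even very low-ability examinees succeed about 25% of the time.
for theta in (-6.0, -3.0, 0.0, 3.0):
    print(theta, round(icc_3pl(theta, a=1.0, b=0.0, c=0.25), 3))
```

A value of 0.25 is what one might expect from blind guessing on a four-option multiple-choice item.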

Constructed Response Items

Items and people • Interact in a variety of ways • We can use IRT to show that there exists a nice little s-shaped curve that shows this interaction • As ability increases – the probability of a correct response increases

Advantages of IRT • Because of the stochastic nature of IRT there are many statistical principles we can take advantage of • A test is a sum of its parts

The test characteristic curve • A test is made up of many items • The TCC can be used to summarize across all of our items • The TCC is simply the summation of ICCs along our ability continuum • For any ability level we can use the TCC to estimate the overall test score for an examinee
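Summing ICCs into a TCC can be sketched as follows, assuming the one-parameter logistic ICC and a hypothetical five-item test; the item difficulties are invented for illustration.

```python
import math


def icc_1pl(theta: float, b: float) -> float:
    """One-parameter logistic ICC."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def tcc(theta: float, difficulties: list[float]) -> float:
    """Expected total score at `theta`: the sum of the item ICCs."""
    return sum(icc_1pl(theta, b) for b in difficulties)


# Hypothetical item difficulties for a five-item test.
DIFFICULTIES = [-1.0, -0.5, 0.0, 0.5, 1.0]

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(tcc(theta, DIFFICULTIES), 2))
```

The curve rises from near 0 toward the number of items as ability increases, so each ability level maps to one expected test score.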

Several ICCs are on a test

The test characteristic curve

The test characteristic curve • From an observed test score (i.e., a student’s total test score) we can estimate ability • The TCC is used in standard setting to establish performance levels • The TCC can also be used to equate tests from one year to the next

Estimating Ability • Figure: on the TCC, a total score of 3 corresponds to an ability of θ ≈ 0.175
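Reading an ability off the TCC amounts to inverting it. The sketch below does this by bisection for a hypothetical five-item test under the one-parameter logistic model. Because the item difficulties are invented, the result will not reproduce the 0.175 shown in the figure, which depends on the items actually plotted.

```python
import math


def icc_1pl(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def tcc(theta: float, difficulties: list[float]) -> float:
    return sum(icc_1pl(theta, b) for b in difficulties)


def ability_from_score(score: float, difficulties: list[float],
                       lo: float = -6.0, hi: float = 6.0,
                       tol: float = 1e-8) -> float:
    """Find the theta whose expected total score equals `score` (bisection)."""
    # Zero and perfect scores have no finite solution, since the TCC only
    # approaches 0 and the number of items asymptotically.
    if not 0 < score < len(difficulties):
        raise ValueError("score must lie strictly between 0 and the number of items")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, difficulties) < score:
            lo = mid  # the TCC is increasing, so the solution lies above mid
        else:
            hi = mid
    return (lo + hi) / 2.0


DIFFICULTIES = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical items
theta_hat = ability_from_score(3, DIFFICULTIES)
print(round(theta_hat, 3))
```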

Psychometric “Information” • The amount that an item contributes to estimating ability • Items that are close to a person’s ability provide more information than items that are far away • An item is most informative around the point of inflection

Item Information • The item is most informative here because this is where we can discriminate among nearby θ values

Item Information • The item is much less informative at points along θ where there is little slope in the ICC
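For a concrete version of this idea, the sketch below assumes the one-parameter logistic model, under which item information at θ is P(θ)(1 − P(θ)). The item difficulty is taken from the earlier example; the function names are hypothetical.

```python
import math


def icc_1pl(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta: float, b: float) -> float:
    """Item information under the one-parameter logistic model: P * (1 - P)."""
    p = icc_1pl(theta, b)
    return p * (1.0 - p)


B = 0.125  # item difficulty from the earlier example
for theta in (B - 3.0, B, B + 3.0):
    print(round(theta, 3), round(item_information(theta, B), 3))
```

Information peaks at 0.25 where θ equals b, the inflection point, and falls off toward the flat ends of the ICC.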

Test Information • Test information is the sum of item information • Tests are also most “informative” where the slope of the TCC is the greatest • Information (like everything else in IRT) is a function of ability • Test information really is test “precision”
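The first two bullets can be illustrated together. Assuming the one-parameter logistic model and a hypothetical five-item test, the sketch sums item information and compares it with a numerical slope of the TCC; under this particular model the two coincide.

```python
import math


def icc_1pl(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def tcc(theta: float, difficulties: list[float]) -> float:
    return sum(icc_1pl(theta, b) for b in difficulties)


def total_information(theta: float, difficulties: list[float]) -> float:
    """Test information: the sum of item information P * (1 - P)."""
    total = 0.0
    for b in difficulties:
        p = icc_1pl(theta, b)
        total += p * (1.0 - p)
    return total


def tcc_slope(theta: float, difficulties: list[float], h: float = 1e-5) -> float:
    """Central-difference estimate of the slope of the TCC at `theta`."""
    return (tcc(theta + h, difficulties) - tcc(theta - h, difficulties)) / (2.0 * h)


DIFFICULTIES = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical items
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(total_information(theta, DIFFICULTIES), 3),
          round(tcc_slope(theta, DIFFICULTIES), 3))
```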

Let’s start with a TCC

Information Functions • We can evaluate information at a given cutpoint (BP/P in the figure)

Information and CTT • CTT has reliability and, of course, the famous α coefficient • IRT has the test information function • Test quality can be evaluated conditionally along the performance continuum • In IRT, information is, conveniently, reciprocally related to standard error: SE(θ) = 1/√I(θ)

Standard Error as a function of ability • Figure: at θ = 0.175, SE = 0.25
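The link between information and standard error is SE(θ) = 1/√I(θ). The sketch below applies it in both directions; the SE of 0.25 comes from the slide, while the information value of 16 is simply what that SE implies under this relationship.

```python
import math


def se_from_information(information: float) -> float:
    """Standard error of the ability estimate given test information."""
    return 1.0 / math.sqrt(information)


def information_from_se(se: float) -> float:
    """Test information implied by a given standard error."""
    return 1.0 / (se * se)


print(information_from_se(0.25))   # 16.0
print(se_from_information(16.0))   # 0.25
```

Because the square root is involved, halving the standard error requires four times the information.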

Standard Error of Ability • Figure: total score = 3, ability ≈ 0.175

Standard Error of Ability • Figure: total score = 3, ability ≈ 0.175, with a bracket marking the confidence region of the ability estimate
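One way to turn the standard error into a region around the ability estimate is a normal-theory interval, θ ± z·SE. The sketch uses the ability estimate of 0.175 and the SE of 0.25 from the slides. The multiplier z = 1.96 (an approximate 95% interval) is an assumption, since the slides do not say how wide the bracket is.

```python
def ability_interval(theta_hat: float, se: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate interval around an ability estimate: theta_hat +/- z * SE."""
    half_width = z * se
    return theta_hat - half_width, theta_hat + half_width


# Ability estimate and SE from the slides; z = 1.96 is an assumed choice.
low, high = ability_interval(theta_hat=0.175, se=0.25)
print(round(low, 3), round(high, 3))
```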

Item Response Theory • A vast kingdom of equations and a dizzying array of complex concepts • Ultimately, we use IRT to explain the interaction between students and test items • The cornerstone of IRT is the ICC, which shows that as ability increases, the chance of getting an item correct increases

Item Response Theory • Everything in IRT can be studied conditionally along the performance continuum • The IRT counterpart to the CTT concept of reliability is test information, which we can think of as test precision • SE is related to information and can also be studied along θ

The Utility of Item Response Theory • Can be used to estimate characteristics of items and people • Can be used in the test development process to maximize information (minimize SE) at critical points along θ • Can even be used for test administration purposes
