Equating And Scaling

The Goal of this Session Is… • To give you a general idea of: • What equating is • Why we need to do it • How it works • As part of this, we will discuss: • Different equating models • A quick review of IRT • Scaling

Why Equate? • Why would the average raw scores on a test administered in 2006 and 2007 differ?

Why Equate? • Why would the average raw scores on a test in 2006 and 2007 differ? • The 2006 form is harder than the 2007 form • This year’s students are better prepared than last year’s were • Both

Why Equate? • Equating allows us to determine the extent to which: • one test is harder than the other (which is usually the case) • one group is more “able” (i.e., has more of the construct of interest) than the other (also usually the case) • This enables us to ensure that well-prepared examinees get higher scores than the less well-prepared, regardless of the test they took

Why Equate? • If we gave both groups the same test, we could directly compare their performance, but this is not practical • Security • Release of items • This is where equating items come in: a subset of items administered in both tests

Why Equate? • The performance on the equating items is used to compare student ability across the two groups • We can use this information to determine to what extent the difference in performance is due to one group being better prepared than the other • Once we know that, we can then determine how much harder or easier one test is than the other and adjust so that scores based on the two tests can be compared directly; that is, the tests are equated

Equating Items • In order to ensure that this is done accurately, the equating items should have the following characteristics: • Good psychometric properties • Be parallel to the overall test • Content • MC vs. CR items • Passage length • Graphics • etc.

Equating Items • The difficulty of the non-equating items in each test can vary -- within reason. However… • We don’t want the “feel” of the assessment to change differentially for different subgroups of students. As one example: • If one test is harder than another, lower performing students may be more frustrated on the harder test • If one test is easier than another, higher performing students may be more bored and unmotivated on the easier test • Either of these could result in differences in performance on the two tests that are unrelated to the construct of interest…

Equating Items • In addition, equating items must not be changed in any way from one administration to the next • Any change in the item (wording, location within the test, response options, etc.) can cause a change in student performance that is unrelated to the construct of interest

Equating Models • Classical test theory models (CTT) • Item response theory models (IRT) • Internal anchor (counts toward student scores) • External anchor (doesn’t count toward student scores) • Intact, separate anchor test • Embedded anchor test

Equating Models • CTT models are concerned with estimating the relationships of the anchor test with each total test, and the anchor test in group 1 with the anchor test in group 2. • IRT models focus on estimating the relationship of each item with the underlying trait (θ) that is being measured.

Equating Models • [Figure: classical test theory equating diagram, with elements labeled Test 1, Test 2, Anchor 1, Anchor 2, Ability, and Difficulty]

Example • Group 1 (2006) • Total test score: 30.6 • Score on equating items: 14.2 • Group 2 (2007) • Total test score: 38.6 • Score on equating items: 15.5 • Based on their performance on the equating items, we know that Group 2 is a bit higher performing (15.5 vs. 14.2), but their total score on the test is quite a bit higher (38.6 vs. 30.6), which suggests that the 2007 test is also easier.
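The comparison in this example can be laid out as simple arithmetic. The sketch below uses only the four averages from the slide; the variable names are hypothetical, and an actual CTT equating would also use score variances and the anchor-to-total relationship rather than raw gaps alone.

```python
# Average scores from the example slide.
group1 = {"total": 30.6, "anchor": 14.2}  # 2006
group2 = {"total": 38.6, "anchor": 15.5}  # 2007

# The gap on the common (equating) items reflects a difference between groups,
# because both groups answered exactly the same items.
anchor_gap = group2["anchor"] - group1["anchor"]

# The gap on the total test mixes group ability with form difficulty.
total_gap = group2["total"] - group1["total"]

print(f"Gap on equating items: {anchor_gap:.1f} points")
print(f"Gap on total test:     {total_gap:.1f} points")
```

A small gap on the equating items next to a large gap on the total test is the pattern that points to a difference in form difficulty.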

Equating Models • CTT models are well known, commonly used, and relatively easy computationally • IRT models have a shorter history and are computationally demanding, but they have certain advantages that make their use desirable • At MP, we use IRT equating models almost exclusively

Basics of Item Response Theory • Why Use IRT? • Review of IRT: • The Item Characteristic Curve (ICC) • The Test Characteristic Curve (TCC)

Why Use IRT? • Advantages over CTT • IRT allows us to calculate an estimate of student “ability” (θ), not just observe how a particular student performs on a particular test • IRT uses the same theta scale to describe students and items; this has certain advantages • It provides more sophisticated information that (depending on the specific model used) takes into consideration various characteristics of the item

The ICC • Describes the interaction between examinees and test items • In the simplest case, the probability of a correct response is a function of examinee ability and item difficulty • As more sophisticated models are used, other item characteristics are taken into consideration as well

The Basics

Item Difficulty

Item Discrimination

Item Guessing
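The three item characteristics named on the preceding slides (difficulty, discrimination, guessing) can be combined in one curve. The slides do not say which model is used, so the sketch below assumes the three-parameter logistic (3PL) model; the function name `icc_3pl` and the scaling constant of 1.7 are assumptions made for illustration.

```python
import math

def icc_3pl(theta, a, b, c, scale=1.7):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    scale=1.7 is the conventional constant that approximates the normal
    ogive; using it here is an assumption, not something the slides state.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

# Hypothetical item. When ability equals difficulty, the probability sits
# halfway between the guessing level c and 1.
p = icc_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
print(f"P(correct) at theta = b: {p:.2f}")
```

Raising `b` shifts the curve to the right (harder item), raising `a` makes it steeper, and raising `c` lifts the lower tail.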

A Test is Made up of Many ICCs

For a given examinee with ability (θ) = 1.0

For a given examinee with ability (θ) = 1.0 • The expected score on the total test is equal to the sum of the probabilities for each item on the test: 0.82+0.48+0.98+0.99+0.82+0.35=4.44
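The expected-score calculation on this slide can be reproduced in a few lines. The list name is hypothetical; the six probabilities are the ones given on the slide.

```python
# Probabilities of a correct response at theta = 1.0, as listed on the slide.
item_probabilities = [0.82, 0.48, 0.98, 0.99, 0.82, 0.35]

# Expected raw score = sum of the per-item probabilities.
expected_score = sum(item_probabilities)
print(f"Expected score at theta = 1.0: {expected_score:.2f}")
```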

The TCC • Summation of ICCs • Describes the relationship between “ability” and expected performance on the whole test

TCC is the sum of the ICCs
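As a sketch of how a TCC is built from ICCs, the block below sums six item curves at several ability levels. It assumes a 3PL model, and every item parameter in `ITEMS` is invented for illustration; none of these values come from the slides.

```python
import math

def icc_3pl(theta, a, b, c, scale=1.7):
    """3PL probability of a correct response (model choice is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

# Hypothetical (a, b, c) parameters for a six-item test.
ITEMS = [
    (1.0, -1.0, 0.20),
    (0.8, 0.0, 0.25),
    (1.2, 0.5, 0.20),
    (1.5, 1.0, 0.15),
    (0.9, -0.5, 0.20),
    (1.1, 1.5, 0.25),
]

def tcc(theta, items=ITEMS):
    """Expected raw score at ability theta: the sum of the item ICCs."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  expected score = {tcc(theta):.2f}")
```

Because each ICC rises with ability, the TCC does too, which is what later makes it possible to go from a raw score back to an ability estimate.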

Is It Really That Simple? • Polytomous Items • Parameter Estimation • Item Parameters • Person Parameters • Various IRT Models • Examinee-Model Fit

So What Does This Do For Us? • Using the TCC, we can estimate the total test score for a student at a given level of ability • In actuality, however, this isn’t what we want to do: we already know the students’ total raw scores; what we don’t know is their ability. • Fortunately, once we have the ICCs and TCC, we can go the other way: we can estimate ability based on a student’s observed total test score.
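Going "the other way" amounts to finding the ability at which the TCC equals the observed raw score. The sketch below does this by bisection, assuming a 3PL model and invented item parameters; `theta_from_raw` is a hypothetical helper, and operational scoring may instead use other estimation methods.

```python
import math

def icc_3pl(theta, a, b, c, scale=1.7):
    """3PL probability of a correct response (model choice is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

# Hypothetical (a, b, c) parameters for a six-item test.
ITEMS = [(1.0, -1.0, 0.20), (0.8, 0.0, 0.25), (1.2, 0.5, 0.20),
         (1.5, 1.0, 0.15), (0.9, -0.5, 0.20), (1.1, 1.5, 0.25)]

def tcc(theta, items=ITEMS):
    """Expected raw score at ability theta."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

def theta_from_raw(raw_score, items=ITEMS, lo=-6.0, hi=6.0, tol=1e-6):
    """Find the theta whose expected score equals raw_score, by bisection.

    This works because the TCC increases with theta. Raw scores outside the
    range the TCC covers on [lo, hi] are rejected rather than extrapolated.
    """
    if not tcc(lo, items) < raw_score < tcc(hi, items):
        raise ValueError("raw score is outside the invertible range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw_score:
            lo = mid   # expected score too low: ability must be higher
        else:
            hi = mid   # expected score too high: ability must be lower
    return (lo + hi) / 2.0

theta_hat = theta_from_raw(3.0)
print(f"Raw score 3.0 corresponds to theta of about {theta_hat:.3f}")
```

Very low raw scores (at or below the guessing floor) and perfect scores have no finite solution under this model, which is why the helper raises an error for them.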

So What Does This Have to Do with Equating? • Back in 2006, we established the relationship between the total test and student ability using the theta scale • Using the equating items, we can put the 2007 test on the same scale

How Do We Do This? • Estimate item parameters (i.e., calibrate the items) for 2006 test • Estimate item parameters for 2007 test, fixing the parameters for the equating items to their 2006 values • This “forces” the ability estimates for 2007 to be on the same scale as those for 2006 • As a result, we will get the same ability estimate for a student regardless of which test they took

2006 and 2007 TCCs on the Same Scale • [Figure: test characteristic curves for the 2006 and 2007 tests]

Typical Equating Process • Selecting Equating Items • IRT Calibrations/equating • Determining scores for reporting (scaling)

Selecting Equating Items • Initial Selection • Test questions from last year’s test are included in this year’s test • The total points from equating items should be at least 40% of the total points on the test • The distribution of the items across different relevant categories is similar to that of the whole test • Each item should be in about the same position this year and last year

Selecting Equating Items • We also do some statistical checks to look for items that are functioning very differently in 2007 than they did in 2006, relative to the rest of the equating items • If we find those, we will exclude them from use as equating items
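The slides do not name the statistical check that is used. As one possible illustration, the sketch below compares each equating item's difficulty estimate across years and flags items whose shift is an outlier relative to the other equating items, using a robust z-score. It assumes the 2007 items were first calibrated freely (not fixed); the item labels, difficulty values, and threshold of 3.0 are all invented.

```python
from statistics import median

# Hypothetical difficulty (b) estimates for six equating items, one set per
# year. All values are invented for illustration.
b_2006 = {"E1": -1.20, "E2": -0.40, "E3": 0.10, "E4": 0.60, "E5": 1.00, "E6": 1.50}
b_2007 = {"E1": -1.15, "E2": -0.45, "E3": 0.15, "E4": 1.40, "E5": 0.95, "E6": 1.55}

def flag_drifting_items(old, new, threshold=3.0):
    """Flag items whose change in difficulty is unusual relative to the rest.

    Uses a robust z-score built from the median and the median absolute
    deviation (MAD), so one drifting item does not distort the yardstick.
    """
    shifts = {item: new[item] - old[item] for item in old}
    center = median(shifts.values())
    mad = median(abs(s - center) for s in shifts.values())
    if mad == 0:
        return []  # no spread to compare against
    return [item for item, s in shifts.items()
            if 0.6745 * abs(s - center) / mad > threshold]

flagged = flag_drifting_items(b_2006, b_2007)
print("Candidates to exclude from the equating set:", flagged)
```

Centering on the median shift means a uniform change across all equating items is not flagged; only items that move differently from the rest are.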

Item Calibrations • We talked about this earlier, remember? • Estimate parameters for 2006 items • Estimate parameters for 2007 items, fixing the values for the equating items • Voila: the same ability estimate for students, regardless of which test they took!

Scaling • It does not really make sense to report scores on the raw score metric: • Equated raw scores do not equal the number of points the student achieved on that test, but rather the number of points that the student would be expected to achieve on the “equated to” test
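An equated raw score can be sketched as a two-step lookup: find the ability that corresponds to the raw score on the test taken, then read off the expected score at that ability on the "equated to" test. The block below assumes a 3PL model with both forms already on the same theta scale; the item parameters and helper names are hypothetical.

```python
import math

def icc_3pl(theta, a, b, c, scale=1.7):
    """3PL probability of a correct response (model choice is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at ability theta."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)

def theta_from_raw(raw_score, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Invert the TCC by bisection (the TCC increases with theta)."""
    if not tcc(lo, items) < raw_score < tcc(hi, items):
        raise ValueError("raw score is outside the invertible range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (a, b, c) parameters. The 2007 form is made easier by lowering
# every difficulty by 0.5.
FORM_2006 = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2),
             (1.5, 1.0, 0.2), (0.9, 0.3, 0.2), (1.1, 1.5, 0.2)]
FORM_2007 = [(a, b - 0.5, c) for a, b, c in FORM_2006]

def equate_to_2006(raw_2007):
    """Expected 2006 raw score for a student with this 2007 raw score."""
    theta = theta_from_raw(raw_2007, FORM_2007)
    return tcc(theta, FORM_2006)

equated = equate_to_2006(4.0)
print(f"A 2007 raw score of 4.0 is about {equated:.2f} on the 2006 form")
```

Because the 2007 form is easier in this example, the equated value is lower than the points actually earned and is generally not a whole number, which illustrates why equated raw scores are awkward to report.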

Scaling • Similarly, it does not really make sense to report scores on the theta metric: • While psychometricians are quite fond of theta scores, they have some unfortunate characteristics (decimal and negative values) that would make them alarming to most test users • (Note: “they” in the previous sentence refers to the theta scores…)

Scaling • It does make sense to report scores on an arbitrary scale that has no inherent meaning. • The meaning of the scale is defined by the assessment • Scaled scores are typically a linear transformation of ability estimates • Example of a linear transformation: • (Ability × Slope) + Intercept
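The linear transformation on this slide is a one-line function. The slope of 20 and intercept of 250 below are arbitrary values chosen for illustration, as is the rounding to whole numbers.

```python
def scaled_score(theta, slope=20.0, intercept=250.0):
    """Linear transformation of an ability estimate onto a reporting scale.

    slope and intercept are hypothetical; a real program picks them so that
    reported scores fall in a convenient range.
    """
    return round(theta * slope + intercept)

for theta in (-1.35, 0.0, 0.72):
    print(f"theta = {theta:+.2f}  scaled score = {scaled_score(theta)}")
```

The transformation removes the decimal and negative values of the theta metric without changing the ordering of students.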

Scaling • This appears to be pretty simple, but, like most things, scaling is more complicated than it appears at first

Issues in Scaling • Endpoints: • If one test is more difficult than the other, the highest possible raw score on the harder test ought to result in a higher scaled score than the top score on the easier test. • However, top & bottom scores may be truncated so that a student who gets one or more items wrong may still receive the top scaled score, or a student who gets some items right may still receive the lowest scaled score.
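Truncation at the endpoints can be sketched by clamping the linear transformation to a reporting range. The range of 200 to 300, the slope, and the intercept below are all hypothetical.

```python
def truncated_scaled_score(theta, slope=20.0, intercept=250.0,
                           lowest=200, highest=300):
    """Linear scaled score, clamped to the reporting range [lowest, highest]."""
    untruncated = round(theta * slope + intercept)
    return max(lowest, min(highest, untruncated))

# Two strong students with different ability estimates both receive the top
# scaled score once the scale is truncated.
print(truncated_scaled_score(2.8))   # untruncated value would be 306
print(truncated_scaled_score(3.4))   # untruncated value would be 318
print(truncated_scaled_score(-3.0))  # untruncated value would be 190
```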

Issues in Scaling • Number of points • Should be sufficient to differentiate examinees. • Should not be more than the number of raw score points. • Cut points • If more than two cut points are used and each cut point is a pre-determined scaled score, the scale will be non-linear. In this case, taking averages is questionable.

Issues in Scaling • Scale compression and/or expansion • Occurs if cut points are very close together on the theta scale and far apart on the scaled score scale, or vice versa • You can have compression in one part of the scale and expansion in another part
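One way to see how fixed cut points produce a non-linear scale is to interpolate linearly between them and compare the slope of each segment. The sketch below assumes piecewise-linear interpolation, which the slides do not specify; the theta cuts and scaled-score values are invented.

```python
# Hypothetical anchor points: scale endpoints plus three cut points. The theta
# values stand in for standard-setting results; the scaled scores are the
# pre-determined values they must map to.
THETA_POINTS = [-3.0, -0.5, 0.0, 1.5, 3.0]
SCALED_POINTS = [200, 220, 240, 260, 280]

def piecewise_scaled(theta):
    """Interpolate linearly within the segment that contains theta."""
    theta = max(THETA_POINTS[0], min(THETA_POINTS[-1], theta))
    for i in range(len(THETA_POINTS) - 1):
        t0, t1 = THETA_POINTS[i], THETA_POINTS[i + 1]
        if theta <= t1:
            s0, s1 = SCALED_POINTS[i], SCALED_POINTS[i + 1]
            return s0 + (theta - t0) * (s1 - s0) / (t1 - t0)

def segment_slopes():
    """Scaled-score points per unit of theta in each segment."""
    return [(SCALED_POINTS[i + 1] - SCALED_POINTS[i])
            / (THETA_POINTS[i + 1] - THETA_POINTS[i])
            for i in range(len(THETA_POINTS) - 1)]

slopes = segment_slopes()
for i, slope in enumerate(slopes):
    print(f"theta {THETA_POINTS[i]:+.1f} to {THETA_POINTS[i + 1]:+.1f}: "
          f"{slope:.1f} scaled points per theta unit")
```

In this example the two cuts that sit only 0.5 theta units apart are stretched over 20 scaled points (expansion), while the bottom segment squeezes 2.5 theta units into the same 20 points (compression). A given difference in ability therefore means a different number of scaled points depending on where it falls, which is why averages on such a scale are questionable.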