Test Development and Analysis: Session Three, 28 May 2014. An Ideal Measurement. Margaret Wu
Introduction to Item Response Theory – overview of this session • What are the properties of an ideal measurement? • What tools can help us improve our test? • What are the problems with using raw scores and classical test theory? • Properties of the Rasch model
What are the purposes of constructing tests? To measure something that is usually unobservable. So we need to make sure that • our measures are accurate (reliability); • our measures are indeed tapping into what we intend to measure (validity); • there is a stable frame of reference from one test to another.
Properties of an Ideal Measurement • Scores we obtain are meaningful: what can each of these students do? • Scores are independent of the sample of items used: if a different set of items is used, we get the same results in terms of the placement of people on the scale. [Figure: three students, Ann, Bill and Cath, located at different points on a measurement scale.]
Using Raw Scores • Can raw scores provide the properties of an ideal measurement? • Distances between scores are not easily interpretable. • It is difficult to link item scores to person scores.
Equating raw scores - 2 [Figure: students A, B, C and D plotted by score on the easy test (horizontal axis, 0 to 100%) against score on the hard test (vertical axis, 0 to 100%).]
Link Raw Scores on Items and Persons
Task difficulties (percent of students succeeding)     Person (object) scores
word problems (25% succeed)                        ?   90%
arithmetic with vulgar fractions (50% succeed)     ?   70%
multi-step arithmetic (70% succeed)                ?   50%
single digit addition (90% succeed)                ?   25%
The question marks show the problem: a raw score alone does not tell us which tasks a person with that score can do.
Classical Test Theory • Focuses on • the total score on a test, not performance on each item • the consistency of the test (reliability) • Provides • an ordering of people (through total score) • an ordering of items (through % correct per item) • a norm-referenced interpretation. But the ordering is not quite on an interval scale: the interpretation of "distances" between people, or between items, is not sample-invariant.
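To make these quantities concrete, here is a minimal Python sketch (the response matrix is invented for illustration) computing total scores, item facility, and Cronbach's alpha, one common estimate of reliability:

```python
import numpy as np

# Invented 0/1 response matrix: rows = students, columns = items.
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
])

totals = responses.sum(axis=1)        # orders people by total score
facility = responses.mean(axis=0)     # % correct per item, orders items

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
k = responses.shape[1]
alpha = k / (k - 1) * (1 - responses.var(axis=0, ddof=1).sum()
                       / totals.var(ddof=1))

print("Total scores:  ", totals)
print("Item facility: ", facility)
print("Cronbach alpha:", round(alpha, 2))
```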
Item Response Theory • Item response theory helps us achieve the goals of constructing the "best" measurement. • IRT provides tools to assess the extent to which good measurement properties are achieved. • If item response data fit the IRT model, measurement is at its most powerful level: • person abilities and item difficulties are calibrated on the same scale; • meanings can be constructed to describe scores; • student scores are independent of the particular set of items in the test.
Latent Variables, Manifest Variables and a Sense of Direction [Figure: a latent variable (a bigger idea) linked to six manifest variables (little ideas, numbered 1 to 6), each also influenced by other stuff.]
IRT • IRT models give the probability of success of a person on an item. • IRT models are not deterministic, but probabilistic. • Given the item difficulty and the person ability, one can compute the probability of success for each person on each item.
Building a Model [Figure: empty axes for an item response model; the vertical axis shows probability of success from 0.0 to 1.0, the horizontal axis runs from very low to very high achievement.]
Imagine a middle difficulty task [Figure: probability of success (0.0 to 1.0) plotted against achievement (very low to very high) for a task of middle difficulty.]
Item Characteristic Curve [Figure: an S-shaped curve showing probability of success (0.0 to 1.0) rising with achievement from very low to very high.]
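The S-shaped curve can be sketched with a logistic function, the form the Rasch model uses (this plot is an illustration, not a figure from the session):

```python
import numpy as np
import matplotlib.pyplot as plt

ability = np.linspace(-4, 4, 200)   # achievement scale (logits)
difficulty = 0.0                    # a middle-difficulty item

# Probability of success as a logistic function of ability - difficulty.
p_success = 1 / (1 + np.exp(-(ability - difficulty)))

plt.plot(ability, p_success)
plt.xlabel("Achievement (very low to very high)")
plt.ylabel("Probability of success")
plt.ylim(0, 1)
plt.show()
```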
Variation in item difficulty [Figure: three item characteristic curves, for items 1, 2 and 3, located at different points along the achievement scale.]
Comparing Students and Items [Figure: tasks 1 to 6 ordered on a vertical difficulty scale from advanced knowledge (top) to basic knowledge (bottom), with the location of a student marked on the same scale.]
Student Ability and Item Difficulty [Figure: an item-person map; columns of X's show the distribution of student abilities and the numbers are item identifiers placed at their difficulties, both on the same logit scale (roughly 0.0 to 3.0 shown).]
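A map like this can be generated once abilities and difficulties are calibrated on the same scale. The following is a rough, illustrative sketch (the abilities and difficulty values are invented; the item numbers echo the original figure) that prints students as X's and items as numbers, band by band:

```python
import numpy as np

# Invented values, already calibrated on the same logit scale.
abilities = np.array([2.1, 1.4, 1.3, 0.9, 0.5, 0.2, -0.1, -0.4])
difficulties = {49: 3.1, 110: 2.6, 158: 2.2, 25: 1.9, 105: 1.2,
                107: 0.8, 282: 0.6, 31: 0.3, 44: -0.1}

# One row per half-logit band: X's for students, numbers for items.
for top in np.arange(3.5, -0.5, -0.5):
    low = top - 0.5
    xs = "X" * int(np.sum((abilities > low) & (abilities <= top)))
    items = " ".join(str(i) for i in difficulties if low < difficulties[i] <= top)
    print(f"{top:5.1f} | {xs:<8}| {items}")
```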
Comparing Students and Items [Figure: a vertical scale running from easy to difficult, with the location of a student marked.]
Comparing Students and Items [Figure: items 1 to 11 ordered on a vertical scale from easy to difficult, posing the question of where to locate a student.]
Sort correct and incorrect responses • It is easier to see where a student stands if we separate out the correct and incorrect responses: a KidMap! [Figure: items 1 to 11 ordered from easy to difficult, with the student's correct and incorrect responses shown in separate columns.]
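The KidMap idea is easy to sketch in code. In this illustrative example (responses and difficulties invented), one student's items are split into correct and incorrect lists, each ordered from hard to easy, so surprising responses stand out:

```python
# One student's 0/1 responses and the item difficulties (logits), invented.
responses = {1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0,
             7: 1, 8: 0, 9: 0, 10: 1, 11: 0}
difficulty = {1: -2.0, 2: -1.6, 3: -1.1, 4: -0.7, 5: -0.3, 6: 0.1,
              7: 0.4, 8: 0.9, 9: 1.3, 10: 1.8, 11: 2.2}

by_difficulty = sorted(responses, key=difficulty.get, reverse=True)
correct = [i for i in by_difficulty if responses[i] == 1]
incorrect = [i for i in by_difficulty if responses[i] == 0]

print("Correct (hard to easy):  ", correct)    # surprises near the top
print("Incorrect (hard to easy):", incorrect)  # surprises near the bottom
```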
Constructing Proficiency Scales • Step 1: carry out a skills audit of items • Step 2: locate the skills along the ability scale • Step 3: decide on band level cut-off values along the ability scale, and response probability. • Step 4: write summary descriptions of the skills for each band level • Step 5: calculate student abilities and place students in levels. • Step 6: decide on any transformations of scaled scores. • Step 7: compute cohort statistics such as percentages in levels.
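Steps 5 and 7 are straightforward to compute once cut-offs are chosen. A minimal sketch, with invented abilities and cut-off values:

```python
import numpy as np

# Invented scaled abilities (logits) and band cut-offs (step 3).
abilities = np.array([-1.2, -0.4, 0.1, 0.3, 0.8, 1.1, 1.6, 2.3])
cutoffs = [-0.5, 0.5, 1.5]                     # boundaries between bands 1|2|3|4

levels = np.digitize(abilities, cutoffs) + 1   # step 5: place students in bands
for band in range(1, len(cutoffs) + 2):        # step 7: cohort percentages
    pct = 100 * np.mean(levels == band)
    print(f"Band {band}: {pct:.0f}% of students")
```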
Invariance of Skills Descriptions • The “scale” must apply to all people in terms of relative difficulties of items. • To achieve this, all items must tap into the same construct.
IRT models • IRT models build on the notion of probability of success: the probability of success is a function of the difference between the person's ability θ and the item difficulty δ • Pr(X = 1) = f(θ − δ) • Different IRT models have different functional forms f(θ − δ).
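For the Rasch model, f is the logistic function, so Pr(X = 1) = exp(θ − δ) / (1 + exp(θ − δ)). A minimal sketch:

```python
import math

def rasch_probability(theta, delta):
    """Rasch model: Pr(X=1) = exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1 / (1 + math.exp(-(theta - delta)))

# A person whose ability equals the item's difficulty succeeds half the time.
print(rasch_probability(theta=1.0, delta=1.0))   # 0.5
print(rasch_probability(theta=2.0, delta=1.0))   # ~0.73: one logit above
```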
Some properties of the Rasch model • Item characteristic curves are "parallel" • This means that once we know (θ − δ), the difference between ability and difficulty, we can determine the probability of success for the person on the item • The ordering of item difficulties is the same for all people, whatever their ability • Not all items have parallel ICCs, so item construction and selection should favour items with parallel ICCs.
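A quick numeric illustration of this invariant ordering (the task names echo the earlier raw-score slide; the difficulty values are invented): whatever the ability, the items keep the same ordering of success probabilities:

```python
import math

def rasch_probability(theta, delta):
    # Same logistic form as the sketch above.
    return 1 / (1 + math.exp(-(theta - delta)))

# Invented difficulties for three tasks, easiest to hardest.
difficulties = {"single digit addition": -1.5,
                "multi-step arithmetic": 0.5,
                "word problems": 1.5}

for theta in (-1.0, 0.0, 2.0):                  # low, middle, high ability
    probs = {task: round(rasch_probability(theta, delta), 2)
             for task, delta in difficulties.items()}
    print(f"ability {theta:+.1f}: {probs}")     # the item ordering never changes
```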
Parallel Item Characteristic Curves • Curves do not cross each other.
An example where an item tests a different construct from the other items
IRT Statistics • Fit indices tell us whether items are tapping into the same construct. • Discrimination indices tell us whether an item can discriminate between low and high ability students. • Item characteristic curves (ICC) show pictorially the fit of the data to the model.
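A common discrimination index is the point-biserial correlation between the item score and the total test score. A minimal sketch with an invented response matrix (the usual corrected index excludes the item itself from the total; this sketch uses the simple uncorrected form):

```python
import numpy as np

# Invented 0/1 responses: rows = students, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
])
totals = responses.sum(axis=1)

# Point-biserial: correlation of each item score with the total score.
for item in range(responses.shape[1]):
    r = np.corrcoef(responses[:, item], totals)[0, 1]
    print(f"item {item + 1}: discrimination = {r:.2f}")
```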
Theoretical and empirical ICC - 1 • Reasonably good fit
Theoretical and empirical ICC - 2 • Fit is not so good. Item is more discriminating than expected.