Test Development and Analysis: Session Three, 28 May 2014. An Ideal Measurement. Margaret Wu
Introduction to Item Response Theory – overview of this session • What are the properties of an ideal measurement? • What tools can help us improve our test? • What are the problems with using raw scores and classical test theory? • Properties of the Rasch model
What are the purposes of constructing tests? To measure something that is usually unobservable. So we need to make sure that • our measures are accurate (reliability); • our measures are indeed tapping into what we intend to measure (validity); • there is a stable frame of reference from one test to another.
Properties of an Ideal Measurement • Scores we obtain are meaningful: what can each of these students do? • Scores are independent of the sample of items used: if a different set of items is used, we get the same results in terms of the placement of people on the scale. [Figure: three students, Ann, Bill and Cath, located at different points on a measurement scale.]
Using Raw Scores • Can raw scores provide the properties of an ideal measurement? • Distances between scores are not easily interpretable. • It is difficult to link item scores to person scores.
Equating raw scores - 2 [Figure: students A, B, C and D plotted by score on the easy test (horizontal axis, 0 to 100%) against score on the hard test (vertical axis, 0 to 100%).]
Link Raw Scores on Items and Persons
Task difficulties (percent of students succeeding)     Person (object) scores
word problems (25% succeed)                        ?   90%
arithmetic with vulgar fractions (50% succeed)     ?   70%
multi-step arithmetic (70% succeed)                ?   50%
single digit addition (90% succeed)                ?   25%
The question marks show the problem: a raw score alone does not tell us which tasks a person with that score can do.
Classical Test Theory • Focuses on • the total score on a test, not performance on each item • the consistency of the test (reliability) • Provides • an ordering of people (through total score) • an ordering of items (through % correct per item) • a norm-referenced interpretation. But the ordering is not quite on an interval scale: the interpretation of "distances" between people, or between items, is not sample-invariant.
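To make these quantities concrete, here is a minimal Python sketch (the response matrix is invented for illustration) computing total scores, item facility, and Cronbach's alpha, one common estimate of reliability:

```python
import numpy as np

# Invented 0/1 response matrix: rows = students, columns = items.
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
])

totals = responses.sum(axis=1)        # orders people by total score
facility = responses.mean(axis=0)     # % correct per item, orders items

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
k = responses.shape[1]
alpha = k / (k - 1) * (1 - responses.var(axis=0, ddof=1).sum()
                       / totals.var(ddof=1))

print("Total scores:  ", totals)
print("Item facility: ", facility)
print("Cronbach alpha:", round(alpha, 2))
```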
Item Response Theory • Item response theory helps us achieve the goals of constructing the "best" measurement. • IRT provides tools to assess the extent to which good measurement properties are achieved. • If item response data fit the IRT model, measurement is at its most powerful level: • person abilities and item difficulties are calibrated on the same scale; • meanings can be constructed to describe scores; • student scores are independent of the particular set of items in the test.
Latent Variables, Manifest Variables and a Sense of Direction [Figure: a latent variable (a bigger idea) linked to six manifest variables (little ideas, numbered 1 to 6), each also influenced by other stuff.]
IRT • IRT models give the probability of success of a person on an item. • IRT models are not deterministic, but probabilistic. • Given the item difficulty and the person ability, one can compute the probability of success for each person on each item.
Building a Model [Figure: empty axes for an item response model; the vertical axis shows probability of success from 0.0 to 1.0, the horizontal axis runs from very low to very high achievement.]
Imagine a middle difficulty task [Figure: probability of success (0.0 to 1.0) plotted against achievement (very low to very high) for a task of middle difficulty.]
Item Characteristic Curve [Figure: an S-shaped curve showing probability of success (0.0 to 1.0) rising with achievement from very low to very high.]
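The S-shaped curve can be sketched with a logistic function, the form the Rasch model uses (this plot is an illustration, not a figure from the session):

```python
import numpy as np
import matplotlib.pyplot as plt

ability = np.linspace(-4, 4, 200)   # achievement scale (logits)
difficulty = 0.0                    # a middle-difficulty item

# Probability of success as a logistic function of ability - difficulty.
p_success = 1 / (1 + np.exp(-(ability - difficulty)))

plt.plot(ability, p_success)
plt.xlabel("Achievement (very low to very high)")
plt.ylabel("Probability of success")
plt.ylim(0, 1)
plt.show()
```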
Variation in item difficulty [Figure: three item characteristic curves, for items 1, 2 and 3, located at different points along the achievement scale.]
Comparing Students and Items [Figure: tasks 1 to 6 ordered on a vertical difficulty scale from advanced knowledge (top) to basic knowledge (bottom), with the location of a student marked on the same scale.]
Student Ability and Item Difficulty [Figure: an item-person map; columns of X's show the distribution of student abilities and the numbers are item identifiers placed at their difficulties, both on the same logit scale (roughly 0.0 to 3.0 shown).]
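A map like this can be generated once abilities and difficulties are calibrated on the same scale. The following is a rough, illustrative sketch (the abilities and difficulty values are invented; the item numbers echo the original figure) that prints students as X's and items as numbers, band by band:

```python
import numpy as np

# Invented values, already calibrated on the same logit scale.
abilities = np.array([2.1, 1.4, 1.3, 0.9, 0.5, 0.2, -0.1, -0.4])
difficulties = {49: 3.1, 110: 2.6, 158: 2.2, 25: 1.9, 105: 1.2,
                107: 0.8, 282: 0.6, 31: 0.3, 44: -0.1}

# One row per half-logit band: X's for students, numbers for items.
for top in np.arange(3.5, -0.5, -0.5):
    low = top - 0.5
    xs = "X" * int(np.sum((abilities > low) & (abilities <= top)))
    items = " ".join(str(i) for i in difficulties if low < difficulties[i] <= top)
    print(f"{top:5.1f} | {xs:<8}| {items}")
```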
Comparing Students and Items [Figure: a vertical scale running from easy to difficult, with the location of a student marked.]
Comparing Students and Items [Figure: items 1 to 11 ordered on a vertical scale from easy to difficult, posing the question of where to locate a student.]
Sort correct and incorrect responses • It is easier to see where a student stands if we separate out the correct and incorrect responses: a KidMap! [Figure: items 1 to 11 ordered from easy to difficult, with the student's correct and incorrect responses shown in separate columns.]
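The KidMap idea is easy to sketch in code. In this illustrative example (responses and difficulties invented), one student's items are split into correct and incorrect lists, each ordered from hard to easy, so surprising responses stand out:

```python
# One student's 0/1 responses and the item difficulties (logits), invented.
responses = {1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0,
             7: 1, 8: 0, 9: 0, 10: 1, 11: 0}
difficulty = {1: -2.0, 2: -1.6, 3: -1.1, 4: -0.7, 5: -0.3, 6: 0.1,
              7: 0.4, 8: 0.9, 9: 1.3, 10: 1.8, 11: 2.2}

by_difficulty = sorted(responses, key=difficulty.get, reverse=True)
correct = [i for i in by_difficulty if responses[i] == 1]
incorrect = [i for i in by_difficulty if responses[i] == 0]

print("Correct (hard to easy):  ", correct)    # surprises near the top
print("Incorrect (hard to easy):", incorrect)  # surprises near the bottom
```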
Constructing Proficiency Scales • Step 1: carry out a skills audit of items • Step 2: locate the skills along the ability scale • Step 3: decide on band level cut-off values along the ability scale, and response probability. • Step 4: write summary descriptions of the skills for each band level • Step 5: calculate student abilities and place students in levels. • Step 6: decide on any transformations of scaled scores. • Step 7: compute cohort statistics such as percentages in levels.
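Steps 5 and 7 are straightforward to compute once cut-offs are chosen. A minimal sketch, with invented abilities and cut-off values:

```python
import numpy as np

# Invented scaled abilities (logits) and band cut-offs (step 3).
abilities = np.array([-1.2, -0.4, 0.1, 0.3, 0.8, 1.1, 1.6, 2.3])
cutoffs = [-0.5, 0.5, 1.5]                     # boundaries between bands 1|2|3|4

levels = np.digitize(abilities, cutoffs) + 1   # step 5: place students in bands
for band in range(1, len(cutoffs) + 2):        # step 7: cohort percentages
    pct = 100 * np.mean(levels == band)
    print(f"Band {band}: {pct:.0f}% of students")
```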
Invariance of Skills Descriptions • The “scale” must apply to all people in terms of relative difficulties of items. • To achieve this, all items must tap into the same construct.
IRT models • IRT models build on the notion of probability of success: the probability of success is a function of the difference between the person's ability θ and the item difficulty δ • Pr(X = 1) = f(θ − δ) • Different IRT models have different functional forms f(θ − δ).
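For the Rasch model, f is the logistic function, so Pr(X = 1) = exp(θ − δ) / (1 + exp(θ − δ)). A minimal sketch:

```python
import math

def rasch_probability(theta, delta):
    """Rasch model: Pr(X=1) = exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1 / (1 + math.exp(-(theta - delta)))

# A person whose ability equals the item's difficulty succeeds half the time.
print(rasch_probability(theta=1.0, delta=1.0))   # 0.5
print(rasch_probability(theta=2.0, delta=1.0))   # ~0.73: one logit above
```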
Some properties of the Rasch model • Item characteristic curves are "parallel" • This means that once we know (θ − δ), the difference between ability and difficulty, we can determine the probability of success for the person on the item • The ordering of item difficulties is the same for all people, whatever their ability • Not all items have parallel ICCs, so item construction and selection should favour items with parallel ICCs.
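A quick numeric illustration of this invariant ordering (the task names echo the earlier raw-score slide; the difficulty values are invented): whatever the ability, the items keep the same ordering of success probabilities:

```python
import math

def rasch_probability(theta, delta):
    # Same logistic form as the sketch above.
    return 1 / (1 + math.exp(-(theta - delta)))

# Invented difficulties for three tasks, easiest to hardest.
difficulties = {"single digit addition": -1.5,
                "multi-step arithmetic": 0.5,
                "word problems": 1.5}

for theta in (-1.0, 0.0, 2.0):                  # low, middle, high ability
    probs = {task: round(rasch_probability(theta, delta), 2)
             for task, delta in difficulties.items()}
    print(f"ability {theta:+.1f}: {probs}")     # the item ordering never changes
```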
Parallel Item Characteristic Curves • Curves do not cross each other.
An example where an item tests a different construct from the other items
IRT Statistics • Fit indices tell us whether items are tapping into the same construct. • Discrimination indices tell us whether an item can discriminate between low and high ability students. • Item characteristic curves (ICC) show pictorially the fit of the data to the model.
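A common discrimination index is the point-biserial correlation between the item score and the total test score. A minimal sketch with an invented response matrix (the usual corrected index excludes the item itself from the total; this sketch uses the simple uncorrected form):

```python
import numpy as np

# Invented 0/1 responses: rows = students, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
])
totals = responses.sum(axis=1)

# Point-biserial: correlation of each item score with the total score.
for item in range(responses.shape[1]):
    r = np.corrcoef(responses[:, item], totals)[0, 1]
    print(f"item {item + 1}: discrimination = {r:.2f}")
```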
Theoretical and empirical ICC - 1 • Reasonably good fit
Theoretical and empirical ICC - 2 • Fit is not so good. Item is more discriminating than expected.