
Test Development and Analysis

Test Development and Analysis. Session Three - 28 May, 2014. An Ideal Measurement. Margaret Wu. Introduction to Item Response Theory – overview of this session. What are the properties of an ideal measurement? What tools can help us improve our test?


Presentation Transcript


  1. Test Development and Analysis • Session Three - 28 May, 2014 • An Ideal Measurement • Margaret Wu

  2. Introduction to Item Response Theory – overview of this session • What are the properties of an ideal measurement? • What tools can help us improve our test? • What are the problems with using raw scores and classical test theory? • Properties of the Rasch model

  3. What are the purposes of constructing tests? To measure something that is usually unobservable. So, we need to make sure that • Our measures are accurate (reliability); • Our measures are indeed tapping into what we intend to measure (validity); • There is a stable frame of reference from one test to another.

  4. Properties of an Ideal Measurement • Scores we obtain are meaningful. • What can each of these students do? • Scores are independent of the sample of items used. • If a different set of items is used, we will get the same results, in terms of the placement of the people on a scale. [Figure: Cath, Ann and Bill placed along a scale.]

  5. Using Raw Scores • Can raw scores provide the properties of an ideal measurement? • Differences between raw scores are not easily interpretable as distances. • It is difficult to link item scores to person scores.

  6. Equating raw scores - 2 [Figure: score on the hard test (vertical axis, 0 to 100%) plotted against score on the easy test (horizontal axis, 0 to 100%), with points labelled A, B, C and D.]

  7. Link Raw Scores on Items and Persons
     Task                                | Task Difficulties | Object Scores
     word problems ?                     | 25%               | 90%
     arithmetic with vulgar fractions ?  | 50%               | 70%
     multi-step arithmetic ?             | 70%               | 50%
     single digit addition ?             | 90%               | 25%

  8. Classical Test Theory • Focuses on: total score on a test, not performance on each item; consistency of the test (reliability). • Provides: ordering of people (through total score); ordering of items (through % correct per item); norm-referenced interpretation. (But the ordering is not quite on an interval scale. That is, the interpretation of “distances” between people, or between items, is not sample invariant.)
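The two orderings that classical test theory provides can be sketched in a few lines of Python. This is a minimal illustration only: the response matrix is made up, and the function names `total_scores` and `percent_correct` are hypothetical, not taken from the session.

```python
# Hypothetical scored responses: one row per student, one column per item (1 = correct).
responses = {
    "Ann":  [1, 1, 1, 0],
    "Bill": [1, 1, 0, 0],
    "Cath": [1, 0, 0, 0],
}

def total_scores(data):
    """Total score per person: the quantity used to order people."""
    return {name: sum(row) for name, row in data.items()}

def percent_correct(data):
    """Percentage of persons answering each item correctly: used to order items."""
    rows = list(data.values())
    n_items = len(rows[0])
    return [100.0 * sum(row[i] for row in rows) / len(rows) for i in range(n_items)]

totals = total_scores(responses)
facility = percent_correct(responses)
print(totals)    # ordering of people
print(facility)  # ordering of items
```

Both results depend on which students and which items happen to be in the sample, which is the lack of sample invariance noted on the slide.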

  9. Item Response Theory • Item response theory helps us achieve the goals of constructing the “best” measurement. • IRT provides tools to assess the extent to which good measurement properties are achieved. • If item response data fit the IRT model, measurement is at its most powerful level. • Person abilities and item difficulties are calibrated on the same scale. • Meanings can be constructed to describe scores. • Student scores should be independent of the particular set of items in the test.

  10. Latent Variables, Manifest Variables and a Sense of Direction [Figure: a latent variable (“A Bigger Idea”) and “other stuff” linked to six numbered manifest variables (“Little Ideas”).]

  11. IRT • IRT models give the probability of success of a person on an item. • IRT models are not deterministic, but probabilistic. • Given the “item difficulty” and “person ability”, one can compute the probability of success for each person on each item
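As a rough sketch of computing the probability of success for each person on each item, the Python below assumes a logistic functional form (the Rasch-type form covered later in the session). The ability and difficulty values, and the name `p_success`, are hypothetical.

```python
import math

def p_success(ability, difficulty):
    """Probability of success under an assumed logistic (Rasch-type) form."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values on a common scale.
abilities = {"Ann": 1.0, "Bill": 0.0, "Cath": -1.0}
difficulties = {"item1": -1.0, "item2": 0.0, "item3": 1.0}

# One probability per person-item pair.
table = {
    person: {item: p_success(a, d) for item, d in difficulties.items()}
    for person, a in abilities.items()
}

for person, row in table.items():
    print(person, {item: round(p, 2) for item, p in row.items()})
```

Every entry lies strictly between 0 and 1, which is what makes the model probabilistic rather than deterministic: even a very able person is not certain to succeed on an easy item.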

  12. Building a Model [Figure: empty axes. Vertical axis: Probability of Success, from 0.0 to 1.0; horizontal axis: from very low achievement to very high achievement.]

  13. Imagine a middle difficulty task [Figure: the same axes, with markers plotted for a middle difficulty task.]

  14. Item Characteristic Curve [Figure: the same axes, showing the item characteristic curve.]

  15. Item Difficulty -- 1

  16. Variation in item difficulty [Figure: item characteristic curves labelled 1, 2 and 3.]

  17. Variation in item difficulty

  18. Comparing Students and Items [Figure: Task Difficulties. Tasks 1 to 6 ordered along a scale from advanced knowledge (task 1) to basic knowledge (task 6), with the location of a student marked on the same scale.]

  19. Student Ability | Item Difficulty [Figure: item-person map. Students (shown as X marks) are plotted on the left and item numbers on the right of a common vertical scale, shown here from 0.0 to 3.0.]

  20. Comparing Students and Items [Figure: items marked along a vertical scale from Easy to Difficult, with the location of a student indicated.]

  21. Comparing Students and Items [Figure: items 1 (Easy) to 11 (Difficult) on a vertical scale. Location of a student?]

  22. Comparing Students and Items [Figure: items 1 (Easy) to 11 (Difficult) on a vertical scale. Location of a student?]

  23. Comparing Students and Items [Figure: items 1 (Easy) to 11 (Difficult) on a vertical scale. Location of a student?]

  24. Sort correct and incorrect responses • Easier to see if we separate out the correct and incorrect responses: a KidMap! [Figure: items 1 (Easy) to 11 (Difficult) on a vertical scale, with correct and incorrect responses separated. Location of a student?]
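The idea of separating correct from incorrect responses along the difficulty scale can be sketched as follows. This is only an illustration: the item difficulties and the student's responses are invented, and `kidmap` is a hypothetical helper name.

```python
def kidmap(difficulties, responses):
    """Return (correct, incorrect) item lists, each ordered from hardest to easiest."""
    ordered = sorted(difficulties, key=difficulties.get, reverse=True)
    correct = [item for item in ordered if responses[item] == 1]
    incorrect = [item for item in ordered if responses[item] == 0]
    return correct, incorrect

# Hypothetical difficulties: item 1 is the easiest, item 11 the hardest.
difficulties = {i: -2.5 + 0.5 * (i - 1) for i in range(1, 12)}
# Hypothetical responses for one student (1 = correct, 0 = incorrect).
responses = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 1, 8: 0, 9: 0, 10: 0, 11: 0}

correct, incorrect = kidmap(difficulties, responses)

# Print a two-column map, hardest item at the top.
print("Correct | Incorrect")
for item in sorted(difficulties, key=difficulties.get, reverse=True):
    left = str(item) if responses[item] == 1 else ""
    right = str(item) if responses[item] == 0 else ""
    print(f"{left:>7} | {right}")
```

With the two columns side by side, the student's likely location is roughly where the correct column thins out and the incorrect column begins.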

  25. Constructing Proficiency Scales • Step 1: carry out a skills audit of items • Step 2: locate the skills along the ability scale • Step 3: decide on band level cut-off values along the ability scale, and response probability. • Step 4: write summary descriptions of the skills for each band level • Step 5: calculate student abilities and place students in levels. • Step 6: decide on any transformations of scaled scores. • Step 7: compute cohort statistics such as percentages in levels.
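Steps 5 and 7 above (placing students in levels and computing the percentage in each level) can be sketched in Python. The cut-off values, band names and student abilities below are all hypothetical, and the choice that an ability exactly on a cut-off falls into the higher band is an assumption of this sketch.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical band boundaries on the ability scale (Step 3).
cut_offs = [-1.0, 0.0, 1.0]
band_names = ["Band 1", "Band 2", "Band 3", "Band 4"]

def band_for(ability):
    """Map an ability estimate to a band; a value on a cut-off goes to the higher band."""
    return band_names[bisect_right(cut_offs, ability)]

# Hypothetical ability estimates (Step 5).
abilities = {"Ann": 1.4, "Bill": 0.2, "Cath": -0.6, "Dan": -1.3, "Eve": 0.7}

placement = {name: band_for(a) for name, a in abilities.items()}

# Cohort statistics: percentage of students in each band (Step 7).
counts = Counter(placement.values())
percentages = {band: 100.0 * counts[band] / len(abilities) for band in band_names}

print(placement)
print(percentages)
```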

  26. Invariance of Skills Descriptions • The “scale” must apply to all people in terms of relative difficulties of items. • To achieve this, all items must tap into the same construct.

  27. IRT models • Build on the notion of probability of success. The probability of success is a function of the difference between the ability θ and the item difficulty δ, that is, (θ - δ): • Pr(X=1) = f(θ - δ) • Different IRT models have different functional forms f(θ - δ).

  28. Rasch model (1-parameter IRT model) When
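For reference, the Rasch model is conventionally written in the form below. The symbols are assumptions of this note rather than the slide's own notation: θ stands for person ability and δ for item difficulty.

```latex
\Pr(X = 1 \mid \theta, \delta) = \frac{\exp(\theta - \delta)}{1 + \exp(\theta - \delta)}
```

In this form, a person whose ability equals the item difficulty (θ = δ) has a probability of success of exp(0) / (1 + exp(0)) = 0.5.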

  29. Some properties of the Rasch model • Item characteristic curves are “parallel”. • It means that once we know (θ - δ), the difference between ability θ and difficulty δ, we can determine the probability of success for the person on the item. • The ordering of item difficulties is the same for all people of differing abilities. • Not all items will have parallel ICCs, so test construction needs to select those items with parallel ICCs.
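The ordering property can be checked numerically. The sketch below assumes the standard logistic form of the Rasch model; the item difficulties, ability values and function names are hypothetical.

```python
import math

def rasch_p(theta, delta):
    """Rasch probability of success (standard logistic form)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# Hypothetical item difficulties and ability levels.
deltas = {"easy": -1.0, "medium": 0.0, "hard": 1.5}
thetas = [-3.0, -1.0, 0.0, 1.0, 3.0]

def item_order(theta):
    """Items ranked from most to least likely to be answered correctly at this ability."""
    return sorted(deltas, key=lambda item: rasch_p(theta, deltas[item]), reverse=True)

# The ranking of items is identical at every ability level.
orders = [item_order(t) for t in thetas]
same_order = all(order == orders[0] for order in orders)

# On the log-odds scale the gap between two items is constant: the curves are "parallel".
gaps = [logit(rasch_p(t, deltas["easy"])) - logit(rasch_p(t, deltas["hard"])) for t in thetas]

print(same_order)
print([round(g, 6) for g in gaps])
```

The constant log-odds gap equals the difference between the two item difficulties, whatever the person's ability.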

  30. Parallel Item Characteristic Curves • Curves do not cross each other.

  31. An example where an Item tests a different construct from other items

  32. IRT Statistics • Fit indices tell us whether items are tapping into the same construct. • Discrimination indices tell us whether an item can discriminate between low and high ability students. • Item characteristic curves (ICC) show pictorially the fit of the data to the model.

  33. Theoretical and empirical ICC - 1 • Reasonably good fit

  34. Theoretical and empirical ICC - 2 • Fit is not so good. Item is more discriminating than expected.
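One simple way to set an empirical ICC against a theoretical one is to group students by ability and compare each group's observed proportion correct with the model probability. The sketch below is one such approach under stated assumptions, not necessarily the method behind the plots in this session: the data, the item difficulty and the name `empirical_icc` are hypothetical, and the theoretical curve assumes the standard logistic Rasch form.

```python
import math

def rasch_p(theta, delta):
    """Rasch probability of success (standard logistic form)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def empirical_icc(abilities, scores, n_groups):
    """Split students into ability-ordered groups.

    Returns one (mean ability, observed proportion correct) pair per group;
    the last group absorbs any leftover students.
    """
    pairs = sorted(zip(abilities, scores))
    size = len(pairs) // n_groups
    points = []
    for g in range(n_groups):
        start = g * size
        end = len(pairs) if g == n_groups - 1 else start + size
        chunk = pairs[start:end]
        mean_theta = sum(t for t, _ in chunk) / len(chunk)
        proportion = sum(x for _, x in chunk) / len(chunk)
        points.append((mean_theta, proportion))
    return points

# Hypothetical ability estimates and scored responses to a single item.
abilities = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
scores = [0, 0, 1, 0, 1, 0, 1, 1, 1]
delta = 0.0  # hypothetical item difficulty

points = empirical_icc(abilities, scores, 3)
for mean_theta, observed in points:
    expected = rasch_p(mean_theta, delta)
    print(f"ability {mean_theta:+.2f}: observed {observed:.2f}, expected {expected:.2f}")
```

An empirical curve that rises more steeply than the theoretical one corresponds to the "more discriminating than expected" case on slide 34.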
