Computerized Adaptive Testing: What is it and How Does it Work?

Goals of this session • Learn about Computerized Adaptive Testing (CAT) • Review Item Response Theory (IRT) • Combining CAT with IRT • Pros and cons of CAT • Answer questions

Not to be confused with… Computerized Adaptive Testing: Not as cute, but far fewer hairballs.

PART I Introduction to CAT

Motivation for Understanding CAT • There are already operational assessments that use CAT • Some believe it will revolutionize classroom testing in the future • Interesting idea that speaks to potential of computers to have new uses in education • Item Response Theory is all over testing now

OK, so what is CAT? • A type of assessment where a question is displayed on a monitor • Students use mouse to select answer • Computer chooses next question based on previous responses • Next question is displayed on monitor, or else test ends

A graphical representation Questions chosen depend on prior responses

Analogy: A Game of 20 Questions • I am thinking of an object. You have 20 “yes-or-no” questions to figure it out. • Would you write out all your questions ahead of time? 1) Is it an animal? 2) Is it a vegetable? 3) Is it blue? 4) Is it red? 5) Is it bigger than a car? 6) Etc.

20 Questions, Continued • Isn’t it more effective to base your next question on previous answers? 1) Is it an animal? NO. 2) Is it a vegetable? YES. 3) Is it commonly found in a salad? YES. 4) Is it green? NO. 5) Would Bugs Bunny eat it? YES.

Same principle used in CAT • Computer keeps track of each student’s pattern of responses so far • As test progresses, learn more about individual student • Choose next question (item) to get maximal info about that particular student’s level of ability • Purpose of assessment: Get best possible information about students

Some items are more informative than others? Sure! • Some items are easier than others: 2 + 2 vs. 54389 + 34697 • Some items are more relevant than others: 3 + 7 vs. Academy Awards question • Some items are better at discerning proficient students from those who need improvement

Which is most informative? • Suppose we have only 2 types of students: “Advanced” and “Beginning” • Use the test to classify each student • Which item below is the best for this purpose?

Item 3 is the best • Item 1 is completely useless • Item 2 gives some information • Item 3 is all you need!

But wait… • Wouldn’t we choose Item 3 for ALL students? • If so, why customize a test for an individual student? • Answer: For some students, Item A is more informative. For others, Item B is more informative.

When is one item more informative than another? • Item A: 2 + 2 • Item B: (34 + 68) / 2 • If you’ve answered many difficult items correctly, Item A is a waste of time • If you’ve answered many easy items incorrectly, Item B is too hard • Thus, give Item B to high-performing students, Item A to low-performing students

Isn’t that unfair? • It seems like CAT penalizes students for performing well at start • If we give different items to different students, how can we compare their performances? • The above question arises whether we use CAT or not • Item Response Theory to the rescue!

Summary of Part I • CAT customizes assessment based on previous responses, as in 20 Questions • Certain items more informative than others • For some students, Item A is more informative; for others, Item B is • When we give different items to different students, we need a way to relate student performances (Item Response Theory)

PART II Review of Item Response Theory

Item Response Theory (IRT) • Quantifies the relation between examinees and test items • For each item, gives probability of correct response by ability level • Provides a means for describing characteristics of items, estimating ability of examinees • Places examinees on common scale when they have taken different items

The IRT Model: One item

Different items have different curves

Where did those curves come from? • In IRT, ability is denoted by θ • Probability of a correct response is $P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}$ (the three-parameter logistic model; some presentations include a scaling constant in the exponent) • Each item has its own values of a, b, and c. We know them from field testing • a is the “discrimination”: Related to the slope • b is the “difficulty”: Harder item, higher b • c is the “guessing parameter”: Chance of lucky guess
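A minimal sketch of one such curve, assuming the standard three-parameter logistic (3PL) form with the a, b, and c roles described above. The function name `p_correct` and the item values are hypothetical, chosen only for illustration.

```python
import math

def p_correct(theta, a, b, c):
    """Probability of a correct response at ability theta (assumed 3PL form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate slope, average difficulty, four-option multiple choice
item = {"a": 1.0, "b": 0.0, "c": 0.25}

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct(theta, **item), 3))
```

At θ = b the curve sits halfway between c and 1, and for very low θ it flattens out at c, which matches the description of c as the chance of a lucky guess.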

Effect of the a parameter • All curves shown have equal b and c parameters • Larger a increases the slope in the middle

Effect of the b parameter • All curves shown have equal a and c parameters • Larger b means harder item

Effect of the c parameter • All curves shown have equal a and b parameters • c is the left asymptote

Wait a minute • What do you mean by a student with an ability of 1.0? • Does an ability of 0.0 mean that a student has NO ability? • What if my student has a reading ability of -1.2? What in the world does that mean???

The ability scale • Ability is on an arbitrary scale that just happens to be centered around 0.0 • We use arbitrary scales all the time: • Fahrenheit • Celsius • Decibels • Nevertheless, need more “user-friendly” reporting: “scaled” scores on conventional scale like 200-300

Giving a score for each student • First assign an ability (θ) value to each student (say, -4 to 4) • Student is given the value of θ that is most consistent with his/her responses • The better he/she does on the test, the higher the value of θ that he/she receives • Computer converts the θ score to a scaled score • Report final score!

Assigning scores • Set of answers: (C,C,I,C,C,I,I,C,C,C,I,C,C) • We know which items were taken by each student: a, b, c parameters • If Student 1’s items were harder than Student 2’s, take into account through item parameters • Student 1: θ = 1.25, scaled score = 290 • Student 2: θ = 0.65, scaled score = 268 • Can compare students who took different items!!!
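One way to find the θ that is "most consistent" with a set of responses is maximum likelihood over a grid of θ values from -4 to 4. The sketch below assumes the 3PL curve and uses hypothetical item parameters (the slides do not list any). It applies the response pattern shown above to two item sets, one harder than the other; the function names are illustrative only.

```python
import math

def p_correct(theta, a, b, c):
    """Assumed 3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-probability of the observed responses at a given theta."""
    total = 0.0
    for (a, b, c), correct in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total

def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search estimate: the theta with the highest log-likelihood."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Response pattern from the slide: C = correct, I = incorrect
responses = [r == "C" for r in "CCICCIICCCICC"]

# Hypothetical (a, b, c) parameters: 13 items, the second set one unit harder
easy_items = [(1.0, -1.5 + 0.25 * i, 0.2) for i in range(13)]
hard_items = [(1.0, b + 1.0, c) for (_, b, c) in easy_items]

theta_easy = estimate_theta(easy_items, responses)
theta_hard = estimate_theta(hard_items, responses)
print(theta_easy, theta_hard)
```

The same pattern of right and wrong answers earns a higher θ when the items were harder, which is how the item parameters let students who took different items be compared.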

Summary of Part II • If you didn’t get all that, don’t worry • Just remember: • In IRT, different items have different curves (depending on a, b, c parameters) • IRT allows us to give scores on the same scale, even when students take different items • These features critical in CAT • So how do we choose which items to give?

PART III Combining CAT with IRT

CAT Reminder • CAT customizes assessment based on previous responses • For some students, Item A is more informative; for others, Item B is • With IRT, it’s OK to give different items to different students

Which item would you choose next? PREVIOUS RESPONSES: • 10 + 19 = ? Answered correctly. • 27 + 38 = ? Answered incorrectly. • 12 + 26 = ? Answered incorrectly. POSSIBLE ITEMS TO GIVE NEXT: • 18 + 9 = ? • 13 + 17 = ? • 14 + 20 = ?

Item selection to match ability/difficulty • Want to give items appropriate to ability • 2 + 2 is not informative for high-performing students; (34 + 68) / 2 is not informative for low-performing students • Student has taken 10 items, awaits 11th • Classic approach: Give item whose difficulty (b) is closest to current ability estimate (θ)
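The classic rule can be sketched in a few lines. The pool layout (dictionaries with `id` and `b` keys) and the function name are assumptions made for this example.

```python
def pick_closest_difficulty(theta_hat, pool, administered_ids):
    """Return the unused item whose difficulty b is nearest to theta_hat."""
    candidates = [item for item in pool if item["id"] not in administered_ids]
    if not candidates:
        return None
    return min(candidates, key=lambda item: abs(item["b"] - theta_hat))

# Hypothetical item pool
pool = [
    {"id": "easy", "b": -1.5},
    {"id": "medium-easy", "b": -1.0},
    {"id": "medium", "b": 0.0},
    {"id": "hard", "b": 1.0},
]

choice = pick_closest_difficulty(-1.2, pool, administered_ids=set())
print(choice["id"])
```

Items already administered are excluded so that the same question is never given twice.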

Which item is better for θ = -1.2? [Figure: an easier item and a harder item]

More complex item selection • Previous method: Match difficulty to ability • This criterion only uses b parameter and θ • Recall that a parameter is related to slope, c is guessing parameter • Shouldn’t we consider those when choosing next item?

Another item selection method • Ideal item: High value of a; value of b close to θ; low value of c • “Fisher Information” combines these factors into a single number • Choose item with highest Fisher Info
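A sketch of this selection rule, assuming the 3PL curve and its standard item information function, I(θ) = a² · (1 − P)/P · ((P − c)/(1 − c))². The two items below are hypothetical and are not the ones shown in the slides.

```python
import math

def p_correct(theta, a, b, c):
    """Assumed 3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Item information for the 3PL model at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical items: item 1 has a steep slope, item 2 matches theta exactly
items = {
    1: {"a": 2.0, "b": 0.2, "c": 0.1},
    2: {"a": 0.8, "b": 0.7, "c": 0.2},
}
theta_hat = 0.7

by_difficulty = min(items, key=lambda k: abs(items[k]["b"] - theta_hat))
by_information = max(items, key=lambda k: fisher_info(theta_hat, **items[k]))
print(by_difficulty, by_information)
```

With these made-up values the two rules disagree: the item with the closest b is not the most informative one, because the other item has a much higher a and a lower c.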

Game: Which item would you choose? • Suppose our current estimate of θ is 0.6

Results • If matching ability estimate (0.6) with difficulty, we would give Item 2 • If using Fisher Info, we would give Item 2

Round 2 • Suppose our current estimate of θ is 0.7

Round 2 Results • If matching ability estimate (0.7) with difficulty, we would give Item 2 • If using Fisher Info, we would give Item 1

Summary of Part III • Tailor items to be most informative about individual student’s ability • Do this by combining CAT with IRT • One method: Match difficulty with current estimate of θ • Another method: Take all parameters into account via Fisher Info
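Putting the pieces together, a whole adaptive test can be sketched as a loop: pick an item, record the response, update the estimate of θ, repeat. This is an illustrative sketch under several assumptions, not an operational design: the 3PL curve, a grid-search estimate of θ, a fixed test length as the stopping rule, a hypothetical item pool, and a simulated examinee.

```python
import math
import random

def p_correct(theta, a, b, c):
    """Assumed 3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Item information for the 3PL model at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=161):
    """Grid-search maximum-likelihood estimate of theta on [lo, hi]."""
    def loglik(t):
        total = 0.0
        for item, correct in zip(items, responses):
            p = p_correct(t, *item)
            total += math.log(p) if correct else math.log(1.0 - p)
        return total
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=loglik)

def run_cat(pool, answer_fn, test_length=10):
    """Administer up to test_length items, always choosing the most informative."""
    theta_hat = 0.0  # assumed starting point: the middle of the scale
    administered, responses = [], []
    remaining = list(pool)
    while remaining and len(administered) < test_length:
        item = max(remaining, key=lambda it: fisher_info(theta_hat, *it))
        remaining.remove(item)
        administered.append(item)
        responses.append(answer_fn(item))
        theta_hat = estimate_theta(administered, responses)
    return theta_hat, administered, responses

# Hypothetical pool of (a, b, c) items with difficulties from -3 to 3
pool = [(1.2, -3.0 + 0.25 * i, 0.2) for i in range(25)]

# Simulated examinee with a true ability of 1.0 (for demonstration only)
rng = random.Random(0)
simulated = lambda item: rng.random() < p_correct(1.0, *item)

theta_hat, administered, responses = run_cat(pool, simulated)
print(theta_hat, len(administered))
```

Because the grid is bounded, an examinee who answers everything correctly receives the top of the scale rather than an unbounded estimate.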

PART IV Practical Considerations

Problem: Content Balance • In operational testing, must balance content (e.g., math test of algebra, geometry, number sense) • What if all your most informative items come from the same content strand? • In practice, dozens of constraints for each CAT: Content, topics, enemies list, etc. • CAT solution: Pick most informative item among those “in play”
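The "in play" idea can be sketched as a filter applied before the information-based choice. The constraint set here is deliberately small and hypothetical: a per-strand quota and an enemies list (items that should not appear on the same test). The 3PL information function and the item layout are assumptions as well.

```python
import math

def fisher_info(theta, a, b, c):
    """Item information under an assumed 3PL model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def next_item(theta_hat, pool, administered_ids, strand_counts, strand_quota):
    """Most informative item among those that satisfy every constraint."""
    def in_play(item):
        if item["id"] in administered_ids:
            return False  # already given
        if strand_counts.get(item["strand"], 0) >= strand_quota[item["strand"]]:
            return False  # content strand is already full
        if item["enemies"] & administered_ids:
            return False  # conflicts with an item already given
        return True

    candidates = [item for item in pool if in_play(item)]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda item: fisher_info(theta_hat, item["a"], item["b"], item["c"]),
    )

# Hypothetical pool: the algebra item is the most informative at theta = 0
pool = [
    {"id": "A1", "strand": "algebra", "a": 2.0, "b": 0.0, "c": 0.2, "enemies": set()},
    {"id": "G1", "strand": "geometry", "a": 1.0, "b": 0.0, "c": 0.2, "enemies": set()},
    {"id": "G2", "strand": "geometry", "a": 1.5, "b": 0.0, "c": 0.2, "enemies": {"X9"}},
]
quota = {"algebra": 2, "geometry": 2}

# Algebra quota is already met and G2's enemy X9 was given, so G1 is chosen
chosen = next_item(0.0, pool, {"X9"}, {"algebra": 2}, quota)
print(chosen["id"])
```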

Problem: Test security • CAT administered on multiple occasions • Person A takes exam, memorizes items, tells Person B. Person B takes exam, benefits from Person A’s information • Different students, different items; however, some items more popular than others • CAT solution: Limit the number of times each item can be administered
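One simple way to limit how often an item is used is a cap on its exposure rate: an item drops out of play once it has appeared on more than a set share of the tests given so far. The sketch below assumes that approach; the 25% cap, the function names, and the counts are hypothetical, and operational programs may use more elaborate methods.

```python
def eligible_items(pool, exposure_counts, tests_given, max_rate=0.25):
    """Items whose exposure rate so far is below the cap."""
    if tests_given == 0:
        return list(pool)  # nothing has been exposed yet
    return [
        item for item in pool
        if exposure_counts.get(item["id"], 0) / tests_given < max_rate
    ]

def record_administration(exposure_counts, item):
    """Update the running count after an item is given."""
    exposure_counts[item["id"]] = exposure_counts.get(item["id"], 0) + 1

# Hypothetical usage: item P has appeared on 60 of 100 tests, so it is held back
pool = [{"id": "P"}, {"id": "Q"}, {"id": "R"}]
exposure_counts = {"P": 60, "Q": 10}

allowed = eligible_items(pool, exposure_counts, tests_given=100)
print([item["id"] for item in allowed])
```

The item selection rule (closest difficulty or highest Fisher Info) would then be applied only to the items this filter returns.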

CAT “Pros” • Convenient administration • Immediate scoring • Items maximally informative: Exams just as accurate, with shorter tests • Items at correct level: High-performing students not bored, low-performing students not overwhelmed

CAT “Cons” • Limited by technology • Potential bias against students with less computer experience • Content balance less exact than paper-and-pencil testing • Test security • Expensive

Final summary • Introduction to CAT: Benefits of giving different items to different students • Review of IRT • Using IRT to select items in a CAT • Pros and cons of CAT
