Evaluating the Technical Quality of Computerized Adaptive Tests

National Conference on Student Assessment, June 22, 2011, Orlando, FL

What is different about the adaptive context? • How do you conceptualize adaptive assessments? • How do you make the transition from fixed form thinking? • How can you evaluate the quality of these tests?

In the fixed form world…. • Test Blueprint + items = Test Form = Student Test Event • Percent correct is an indicator of difficulty • Commonly accepted criteria for acceptance

In the adaptive context… • Test Blueprint is a design for the student test event • Item pool + test structure + algorithm determine each test event • Variable linking block (all items) • P-values close to .5 • Metrics not as well-established.

Everything supports the test event

What’s going on here? • You are moving from the concept of a population responding to a form into the realm of a person responding to an individual item. • Indicators based on sets of people responding to sets of items may be uninformative • The scale representing the latent trait assumes greater importance.

Move from population-based thinking to responses to items • Forms are not linked to one another. The pool consists of items linked to the scale. Scores from non-parallel tests are expressed and interpreted on the scale. • Percent correct is not important in assessing ability. The test event establishes the difficulty of the items a student is getting right about half the time. The goal of the test session is to solve for theta. (Use the IRT equation with your favorite number of parameters.)
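To make "solve for theta" concrete, here is a minimal sketch, assuming dichotomous items. The function names (`p_correct`, `mle_theta_rasch`) and the search bounds are illustrative choices, not part of the presentation. With a = 1 and c = 0 the three-parameter form reduces to the Rasch model, where a student whose theta equals the item difficulty answers correctly half the time.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """3PL item response function; a=1, c=0 reduces to the Rasch/1PL form."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def mle_theta_rasch(responses, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Maximum-likelihood theta for Rasch items, found by bisection.

    The score function sum(u_i - P_i(theta)) is strictly decreasing in theta,
    so it has a single root when the response pattern is mixed.
    The bounds lo/hi are assumed wide enough to bracket that root.
    """
    if all(responses) or not any(responses):
        raise ValueError("MLE is undefined for all-correct or all-incorrect patterns")

    def score(theta):
        return sum(u - p_correct(theta, b) for u, b in zip(responses, difficulties))

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:   # observed score exceeds expected: theta is too low
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# One right on an easy item, one wrong on an equally hard item: theta lands midway.
theta_hat = mle_theta_rasch([1, 0], [-1.0, 1.0])
```

Bisection is used only because it is easy to read; operational programs typically use Newton-Raphson or a Bayesian estimator.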

Start with the Test Blueprint • What do you want every student to get? • Content – categories and proportions • Cognitive characteristics • Item types • How many items in each test event? • What are you going to report? For individuals? For groups? • Overall scores • Sub-scores • Achievement category

How do you evaluate pool adequacy? • Reckase – p-optimal pool evaluation. Analysis of “bins”. Satisfy some proportion of a fully informative pool. • It’s unrealistic to expect that every value of theta will have a maximally informative item. This method specifies a degree of optimality. • The p-optimal method can be used to evaluate existing pools or specify pool design.
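The "bins" idea can be illustrated with a loose sketch. This is not Reckase's procedure: the bin width, the target counts per bin, and the function names are all hypothetical, and in practice the targets would come from simulating the item selections an ideal pool would supply. The sketch only shows the bookkeeping: sort items into difficulty bins and report what proportion of the target an existing pool covers.

```python
import math
from collections import Counter

def bin_index(b, width=0.5):
    """Index of the difficulty bin that contains b (bin k spans [k*width, (k+1)*width))."""
    return math.floor(b / width)

def pool_coverage(pool_difficulties, target_counts, width=0.5):
    """Proportion of the target item counts that the pool can supply, bin by bin.

    target_counts maps bin index -> number of items wanted in that bin
    (hypothetical values; surplus items in one bin do not offset a shortage in another).
    """
    have = Counter(bin_index(b, width) for b in pool_difficulties)
    needed = sum(target_counts.values())
    met = sum(min(have.get(k, 0), n) for k, n in target_counts.items())
    return met / needed

pool = [-0.4, -0.1, 0.2, 0.3, 1.1]
targets = {-1: 2, 0: 2, 1: 1, 2: 1, 3: 2}
coverage = pool_coverage(pool, targets)   # 5 of the 8 targeted slots are filled
```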

How do you evaluate pool adequacy? Veldkamp & van der Linden – Shadow test method: 1. At every point in the test, assemble a test that meets the constraints and has maximum information at the current ability estimate. 2. Administer the item in the shadow test with maximum information. 3. Update the ability estimate. 4. Return all unused items to the pool. 5. Adjust the constraints to allow for the attributes of the item administered. 6. Repeat Steps 1-5 until the end of the test.

Adaptive Test Design – Algorithm • How will you guarantee that each student gets the material in your test design? • Item selection, scoring, domain sampling • How will you guarantee reliable scores and categories? • Overall scores • Sub-scores • Achievement category • How do you control for item exposure?

Adaptive test event - Start • Assumption: you have a calibrated item pool that supports your test purpose • What do you need to know about the examinee? • How will you choose the initial item? Jumping into the item pool

Adaptive test event – Finding Theta • Assumption: you have a response to the initial item • How do you estimate ability? • How do you estimate error? • How do you choose the next item? • How do you satisfy your test event design? Progressing through the item pool
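One possible set of answers to the first three questions is sketched below, assuming Rasch items, an EAP (posterior mean) ability estimate with a standard normal prior, the posterior standard deviation as the error estimate, and maximum-information item selection. These are common choices, not ones the presentation prescribes; the names and grid settings are illustrative.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap_estimate(responses, difficulties, n_points=81, lo=-4.0, hi=4.0):
    """EAP theta and posterior SD on a grid, with a standard normal prior."""
    grid = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)          # unnormalized N(0, 1) prior
        for u, b in zip(responses, difficulties):
            p = p_correct(theta, b)
            w *= p if u else (1.0 - p)              # likelihood of each response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def item_information(theta, b):
    """Fisher information of a Rasch item at theta: p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def next_item(theta, pool, administered):
    """Unused item id with maximum information at theta; pool maps id -> difficulty."""
    candidates = [i for i in pool if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, pool[i]))

pool = {"a": -1.0, "b": 0.1, "c": 2.0}
first = next_item(0.0, pool, set())                 # item closest to theta = 0
theta_hat, se = eap_estimate([1], [pool[first]])    # update after a correct answer
```

The sketch ignores the last question on the slide: a real selection step would also filter candidates by content constraints and exposure limits before maximizing information.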

Adaptive test event – Termination • What triggers the end of the test? • Number of items • Error threshold • Proctor termination • What is reported to the student at the end? High achiever getting out of the pool
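The three triggers can be combined into a single stopping rule. The sketch below is illustrative only: the threshold values and the `min_items` guard (which keeps a test from ending after a handful of items) are assumptions, not values from the presentation.

```python
def should_stop(n_administered, standard_error, max_items=40, se_target=0.30,
                min_items=10, proctor_ended=False):
    """Return (stop, reason) for the current state of a test event.

    All default thresholds are hypothetical placeholders.
    """
    if proctor_ended:
        return True, "proctor termination"
    if n_administered >= max_items:
        return True, "maximum number of items"
    if n_administered >= min_items and standard_error <= se_target:
        return True, "error threshold"
    return False, ""
```

For example, `should_stop(15, 0.25)` stops on the error threshold, while `should_stop(5, 0.25)` continues because the minimum length has not been reached.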

How do I know it’s a good test? • Classical reliability estimates depend on correlation among items. In CAT, inter-item correlation is low. This is an illustration of local independence. • In general CATs use the Marginal Reliability Coefficient (Samejima, 1977, 1994). This is based on analysis of the test information function over all values of theta. • In evaluating tests, it can be interpreted like coefficient alpha.
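One common way to compute a marginal reliability coefficient from test results is sketched below: the variance of the ability estimates minus the average error variance, divided by the variance of the ability estimates. Published definitions differ in detail (for example, in which variance goes in the denominator), so treat this as one form under stated assumptions rather than the definitive formula.

```python
from statistics import mean, pvariance

def marginal_reliability(theta_estimates, standard_errors):
    """(var(theta) - mean(SE^2)) / var(theta), using the population variance."""
    var_theta = pvariance(theta_estimates)
    mean_error_var = mean([se ** 2 for se in standard_errors])
    return (var_theta - mean_error_var) / var_theta

# Three examinees spread over the scale, each measured with SE = 0.3.
rho = marginal_reliability([-1.0, 0.0, 1.0], [0.3, 0.3, 0.3])
```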

How do I know it’s a good test before giving it to zillions of students? Simulation is your friend • Using the actual pool, test structure and algorithm, simulate student responses at interesting levels of theta. • Compare the test’s estimated thetas with true thetas. • Bias: Average difference • Fit: Root Mean Squared Error
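A toy version of such a simulation is sketched below. Everything specific in it is an assumption made for illustration: a Rasch pool of 200 evenly spaced items, a fixed 20-item test, EAP scoring with a standard normal prior, and selection of the unused item closest in difficulty to the current estimate. A real study would plug in the operational pool, test structure, and algorithm.

```python
import math
import random

def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

GRID = [-4.0 + 0.2 * i for i in range(41)]   # quadrature points for EAP

def eap(responses, difficulties):
    """Posterior mean of theta under a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for u, b in zip(responses, difficulties):
            p = p_correct(t, b)
            w *= p if u else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def simulate_test(true_theta, pool, test_length, rng):
    """Run one adaptive test event for a simulated examinee; return the final estimate."""
    used, responses, difficulties = set(), [], []
    estimate = 0.0
    for _ in range(test_length):
        # Rasch information peaks where difficulty is closest to the current estimate.
        item = min((i for i in range(len(pool)) if i not in used),
                   key=lambda i: abs(pool[i] - estimate))
        used.add(item)
        u = 1 if rng.random() < p_correct(true_theta, pool[item]) else 0
        responses.append(u)
        difficulties.append(pool[item])
        estimate = eap(responses, difficulties)
    return estimate

def bias_and_rmse(true_thetas, pool, test_length=20, reps=30, seed=1):
    """Bias is the average (estimate - true); RMSE is the root mean squared difference."""
    rng = random.Random(seed)
    errors = [simulate_test(t, pool, test_length, rng) - t
              for t in true_thetas for _ in range(reps)]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

pool = [-3.0 + 6.0 * i / 199 for i in range(200)]
bias, rmse = bias_and_rmse([-2, -1, 0, 1, 2], pool)
print(f"bias={bias:.3f}  rmse={rmse:.3f}")
```

Reporting bias and RMSE separately at each true theta, rather than pooled as here, shows where on the scale the pool or the algorithm is weakest.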

CAT depends on a calibrated bank • When items are used operationally, responses are gathered mostly from the examinees for whom the item is most informative (i.e., ability and difficulty are close) • Variance is low, so correlational indicators are not appropriate • P-values are around .5

Evaluating item technical quality • Calibration depends on common person link to scale • Expose to a representative sample • The trick is to get informative responses

Evaluating item technical quality • In calibration, the process is to find difficulty from responses of examinees with known abilities. • Look at a vector of p-values across the range of theta. • Evaluate the relationship between observed and expected p-values for your IRT model; may use chi-square or correlation of p to expected p. • What value of difficulty maximizes this relationship?
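The last two bullets can be sketched as a grid search, assuming a Rasch model and examinees already grouped by known theta. The chi-square form, the search range, and the function names are illustrative assumptions; "maximizing the relationship" is implemented here as minimizing the discrepancy between observed and expected p-values.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def chi_square(groups, b):
    """Discrepancy between observed and expected p-values for a candidate difficulty.

    groups: list of (theta, n_examinees, n_correct), one tuple per ability group.
    """
    total = 0.0
    for theta, n, correct in groups:
        expected = p_correct(theta, b)
        observed = correct / n
        total += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return total

def calibrate_difficulty(groups, lo=-4.0, hi=4.0, step=0.01):
    """Candidate difficulty on a grid that gives the smallest chi-square."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + step * i for i in range(n_steps + 1)]
    return min(candidates, key=lambda b: chi_square(groups, b))

# Synthetic check: observed p-values generated exactly from an item with b = 0.5.
groups = [(t, 1000, 1000 * p_correct(t, 0.5)) for t in (-1.0, 0.0, 1.0, 2.0)]
b_hat = calibrate_difficulty(groups)
```

Operational calibration software would usually maximize a likelihood rather than search a grid, and would also report the fit statistic at the chosen difficulty so that misfitting items can be flagged.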

Ask lots of questions. Keep pestering until understanding dawns. Thank you for your attention! Questions, comments? Contact: marty.mccall@nwea.org
