The Science and Art of Exam Development
Paul E. Jones, PhD, Thomson Prometric

What is validity and how do I know if my test has it?

Validity
"Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests." (APA Standards, 1999, p. 9)
A test may yield valid judgments about people…
• If it measures the domain it was defined to measure.
• If the test items have good measurement properties.
• If the test scores and the pass/fail decisions are reliable.
• If alternate forms of the test are on the same scale.
• If you apply defensible judgment criteria.
• If you allow enough time for competent (but not necessarily speedy) candidates to take the test.
• If it is presented to the candidate in a standardized fashion, without environmental distractions.
• If the test taker is not cheating and the test has not deteriorated.
Is this a Valid Test?
1. 4 - 3 = _____     6. 3 - 2 = _____
2. 9 - 2 = _____     7. 8 - 7 = _____
3. 4 - 4 = _____     8. 9 - 5 = _____
4. 7 - 6 = _____     9. 6 - 2 = _____
5. 5 - 1 = _____    10. 8 - 3 = _____
Validity = the Technical Quality of the Testing System
[Diagram: the testing system (Design, Item Bank)]
The Validity Argument is Part of the Testing System
[Diagram: documentation attached to the testing system (Design, Item Bank)]
A Testing System Begins with Design
[Diagram: the testing system (Design, Item Bank)]
Test Design Begins with Test Definition
• Test Title
• Credential Name
• Test Purpose ("This test will certify that the successful candidate has important knowledge and skills necessary to…")
• Intended Audience
• Candidate Preparation
• High-Level Knowledge and Skills Covered
• Products or Technologies Addressed
• Knowledge and Skills Assumed but Not Tested
• Knowledge and Skills Related to the Test but Not Tested
• Borderline Candidate Description
• Testing Methods
• Test Organization
• Test Stakeholders
• Other Information
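Hypothetically, a program might capture these definition fields in a structured record so that later artifacts (blueprints, forms, score reports) can refer back to them. The sketch below is illustrative only; the class and field names simply mirror the list above, and the example values are invented.

```python
from dataclasses import dataclass, field

@dataclass
class TestDefinition:
    """High-level test definition that drives all later design work."""
    title: str
    credential_name: str
    purpose: str                      # "This test will certify that..."
    intended_audience: str
    candidate_preparation: str
    skills_covered: list[str] = field(default_factory=list)
    skills_assumed_not_tested: list[str] = field(default_factory=list)
    borderline_candidate: str = ""    # description of the minimally qualified candidate
    testing_methods: list[str] = field(default_factory=list)

# Invented example values
definition = TestDefinition(
    title="Widget Administration Certification",
    credential_name="Certified Widget Administrator",
    purpose="This test will certify that the successful candidate can install, "
            "configure, and maintain the widget platform.",
    intended_audience="Administrators with about one year of hands-on experience",
    candidate_preparation="Instructor-led course or equivalent on-the-job experience",
    skills_covered=["Installation", "Configuration", "Troubleshooting"],
)
print(definition.title)
```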
Test Definition Leads to Practice Analysis
[Diagram: test objectives]
Once I have a blueprint, how do I develop appropriate exam items?
The Testing System
[Diagram: Design, Item Bank]
Creating Items
• Content characteristics (content options): Text, Graphics, Audio, Video, Simulations, Applications
• Response modes: Single M/C (choose one), Multiple M/C (choose many), Single P&C, Multiple P&C, Drag & Drop, Brief FR, Essay FR, Simulation/App Scoring
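One hedged way to picture how a content option and a response mode come together in a banked item record: the sketch below shows a single multiple-choice ("choose one") item with a text stem, linked to a blueprint objective. The field names and example item are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SingleMCItem:
    """One banked item: text content, a 'choose one' response mode, an objective link."""
    item_id: str
    objective_id: str          # blueprint objective this item measures
    stem: str                  # question text presented to the candidate
    options: list[str] = field(default_factory=list)
    key: int = 0               # index of the correct option

item = SingleMCItem(
    item_id="ITM-0001",                         # invented identifiers
    objective_id="OBJ-2.3",
    stem="Which command reloads the service configuration without downtime?",
    options=["restart", "reload", "refresh", "reset"],
    key=1,
)
print(item.options[item.key])   # -> "reload"
```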
Desirable Measurement Properties of Items
• Item-objective linkage
• Appropriate difficulty
• Discrimination (difficulty and discrimination are sketched in code below)
• Interpretability
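For the classical notions of "appropriate difficulty" and "discrimination", a common starting point is the item p-value and the corrected item-total (point-biserial) correlation. A minimal sketch, assuming a 0/1-scored response matrix; operational programs would use dedicated psychometric software.

```python
import numpy as np

def item_statistics(scores: np.ndarray) -> list[dict]:
    """Classical difficulty (p-value) and discrimination (corrected item-total
    correlation) for a 0/1-scored matrix of shape (candidates, items)."""
    stats = []
    for j in range(scores.shape[1]):
        item = scores[:, j]
        rest = scores.sum(axis=1) - item          # total score excluding this item
        p_value = item.mean()                     # proportion answering correctly
        discrimination = np.corrcoef(item, rest)[0, 1]
        stats.append({"item": j,
                      "difficulty": round(float(p_value), 3),
                      "discrimination": round(float(discrimination), 3)})
    return stats

# Tiny invented example: 6 candidates x 4 items
demo = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])
print(item_statistics(demo))
```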
Good Item Development Practices
• SME writers in a social environment
• Industry-accepted item writing principles
• Item banking tool
• Mentoring
• Rapid editing
• Group technical reviews
The Testing System
[Diagram: Design, Item Bank]
Classical Option Analysis: Good Item
[Table: for each answer option, n, proportion choosing, discrimination, and proportion choosing within score groups Q1–Q5]

Classical Option Analysis: Problem Item
[Table: the same statistics for an item with poor option-level behavior]
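A rough sketch of what a classical option (distractor) analysis like the tables above computes: for one item, the proportion of candidates choosing each option, overall and within each total-score quintile (Q1 = lowest fifth, Q5 = highest). The data and option labels below are invented.

```python
import numpy as np

def option_analysis(choices: np.ndarray, totals: np.ndarray, options=("A", "B", "C", "D")):
    """Proportion choosing each option, overall and by score quintile.
    A healthy key is chosen more often as ability rises; distractors fall off."""
    edges = np.quantile(totals, [0.2, 0.4, 0.6, 0.8])
    quintile = np.searchsorted(edges, totals, side="right")   # 0..4 per candidate
    report = {}
    for opt in options:
        picked = choices == opt
        by_q = [picked[quintile == q].mean() if (quintile == q).any() else 0.0
                for q in range(5)]
        report[opt] = {"n": int(picked.sum()),
                       "proportion": round(float(picked.mean()), 2),
                       "Q1..Q5": [round(float(p), 2) for p in by_q]}
    return report

# Invented data: option chosen by each candidate and their total test score
choices = np.array(list("BBABCBDBBABBCBBB"))
totals  = np.array([12, 15, 8, 14, 7, 18, 6, 20, 16, 9, 13, 17, 5, 19, 11, 10])
print(option_analysis(choices, totals))
```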
IRT Item Analysis: Difficulty and Discrimination
[Plot: item characteristic curves for three items with parameters
 a = 0.6, b = -1.5, c = 0.4; a = 1.2, b = -0.5, c = 0.1; a = 1.0, b = 1.0, c = 0.25]
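The a/b/c parameters on the slide suggest the three-parameter logistic (3PL) model, in which c is the lower asymptote (pseudo-guessing), b the difficulty, and a the discrimination. A minimal sketch evaluating the item response function at the slide's parameter values; the ability points are arbitrary.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """3PL item response function: probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Parameters taken from the slide
items = [dict(a=0.6, b=-1.5, c=0.40),
         dict(a=1.2, b=-0.5, c=0.10),
         dict(a=1.0, b=1.0, c=0.25)]

for theta in (-2.0, 0.0, 2.0):
    probs = [round(float(p_correct_3pl(theta, **it)), 2) for it in items]
    print(f"theta={theta:+.1f}  P(correct) per item: {probs}")
```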
The Testing System
[Diagram: Design, Item Bank]
Reliability “Reliability refers to the degree to which test scores are free from errors of measurement.” (APA Standards, 1985, p. 19)
How to Enhance Reliability When Assembling Test Forms
• Score reliability/generalizability (see the KR-20 sketch below)
  - Select items with good measurement properties.
  - Present enough items.
  - Target items at candidate ability level.
  - Sample items consistently from across the content domain (use a clearly defined test blueprint).
• Score dependability
  - Same as above.
  - Minimize differences in test difficulty.
• Pass/fail consistency
  - Select enough items.
  - Target items at the cut score.
  - Maintain the same score-distribution shape between forms.
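As one concrete handle on "score reliability", internal consistency for dichotomously scored forms is often estimated with KR-20 (equivalent to Cronbach's alpha for 0/1 items). A minimal sketch with an invented response matrix; this is a standard textbook formula, not the presenter's own procedure.

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """Kuder-Richardson 20: internal-consistency reliability for 0/1-scored items.
    scores has shape (candidates, items)."""
    k = scores.shape[1]                     # number of items
    p = scores.mean(axis=0)                 # item difficulties
    total_var = scores.sum(axis=1).var()    # variance of candidates' total scores
    return (k / (k - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

# Invented 6-candidate x 5-item response matrix
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
])
print(f"KR-20 = {kr20(scores):.2f}")
```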
Setting Cut Scores Why not just set the cut score at 75% correct?
Setting Cut Scores Why not just set the cut score so that 80% of the candidates pass?
The logic of criterion-based cut score setting
• Certain knowledge and skills are necessary for practice.
• The test measures an important subset of these knowledge and skills, and thus readiness for practice.
• The passing [cut] score is such that those who pass have a high enough level of mastery of the KSJs to be ready for practice [at the level defined in the test definition], while those who fail do not.
(Kane, Crooks, and Cohen, 1997)
The Main Goal in Setting Cut Scores Meeting the “Goldilocks Criteria” “We want the passing score to be neither too high nor too low, but at least approximately, just right.” Kane, Crooks, and Cohen, 1997, p. 8
Two General Approaches to Setting Cut Scores
• Test-Centered Approaches: Modified Angoff (see the sketch below), Bookmark
• Examinee-Centered Approaches: Borderline, Contrasting Groups
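In a modified Angoff study, each subject-matter expert judges, for every item, the probability that a borderline (minimally qualified) candidate would answer correctly; a recommended cut score is then derived from those judgments, typically by averaging. A minimal sketch with invented ratings.

```python
import numpy as np

# Rows = SME judges, columns = items; each cell is the judged probability that a
# borderline (minimally qualified) candidate answers the item correctly.
angoff_ratings = np.array([
    [0.60, 0.75, 0.40, 0.85, 0.55],
    [0.65, 0.70, 0.45, 0.80, 0.60],
    [0.55, 0.80, 0.35, 0.90, 0.50],
])

item_means = angoff_ratings.mean(axis=0)   # consensus expectation per item
raw_cut = item_means.sum()                 # expected raw score of a borderline candidate
percent_cut = raw_cut / angoff_ratings.shape[1] * 100

print(f"Recommended raw cut score: {raw_cut:.2f} of {angoff_ratings.shape[1]} items")
print(f"As a percentage: {percent_cut:.0f}%")
```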
The Testing System
[Diagram: Design, Item Bank]
Security of a Testing System
[Diagram: Design, Item Bank]
• Write more items!!!
• Create authentic items.
• Use isomorphs.
• Use Automated Item Generation.
• Use secure banking software and connectivity.
• Use in-person development.
Security of a Testing System
[Diagram: Design, Item Bank]
• Establish prerequisite qualifications.
• Use narrow testing windows.
• Establish test/retest restrictions.
• Use identity verification and biometrics.
• Require test takers to sign NDAs.
• Monitor test takers on site.
• Intervene if cheating is detected.
• Monitor individual test center performance.
• Track suspicious test takers over time.
Security of a Testing System
[Diagram: Design, Item Bank]
• Perform frequent detailed psychometric review.
• Restrict the use of items and test forms.
• Analyze response times (see the sketch below).
• Perform DRIFT analyses.
• Calibrate items efficiently.
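One simple way to act on "analyze response times" is to screen for candidates who answer implausibly fast yet score very high, a possible sign of item preknowledge. The thresholds, data, and function below are invented illustrations; real test-security forensics use far more sophisticated models.

```python
import numpy as np

def flag_fast_high_scorers(response_times, scores, min_seconds=10.0, min_pct_correct=0.85):
    """Flag candidates whose median per-item time is implausibly short while their
    percent correct is very high: a crude preknowledge screen with arbitrary thresholds."""
    median_time = np.median(response_times, axis=1)   # seconds per item, per candidate
    pct_correct = scores.mean(axis=1)
    return np.where((median_time < min_seconds) & (pct_correct > min_pct_correct))[0]

# Invented data: 4 candidates x 6 items (seconds per item, then 0/1 scores)
times = np.array([
    [45, 60, 30, 55, 40, 50],
    [ 6,  7,  5,  8,  6,  7],    # very fast...
    [35, 42, 28, 60, 33, 47],
    [ 9, 11, 10,  8, 12,  9],    # fast but low-scoring
])
scores = np.array([
    [1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],          # ...and perfect: flagged
    [0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 0],
])
print("Flagged candidate indices:", flag_fast_high_scorers(times, scores))
```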
Security of a Testing System
[Diagram: Design, Item Bank]
• Many unique fixed forms
• Linear on-the-fly testing (LOFT)
• Computerized adaptive testing (CAT; see the sketch below)
• Computerized mastery testing (CMT)
• Multi-stage testing (MST)
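Of the delivery models listed, computerized adaptive testing is the most algorithmic: after each response the ability estimate is updated, and the next item is chosen to be maximally informative at that estimate. A minimal sketch of the item-selection step under the 3PL model, using an invented four-item bank (the first three parameter sets echo the earlier IRT slide).

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return (a ** 2) * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_next_item(theta_hat, bank, administered):
    """CAT step: choose the unadministered item with maximum information at theta_hat."""
    best, best_info = None, -1.0
    for idx, (a, b, c) in enumerate(bank):
        if idx in administered:
            continue
        info = item_information(theta_hat, a, b, c)
        if info > best_info:
            best, best_info = idx, info
    return best

# Invented calibrated bank: (a, b, c) per item
bank = [(0.6, -1.5, 0.40), (1.2, -0.5, 0.10), (1.0, 1.0, 0.25), (1.4, 0.2, 0.15)]
print("Next item for theta = 0.0:", pick_next_item(0.0, bank, administered={1}))
```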