Fundamental Testing Assumptions Revisited: Examination Length and Number of Options. Karine Georges & Kelly Piasentin, Assessment Strategies Inc.
Overview • Credentialing organizations seek to balance many factors such as program validity and credibility with more tangible aspects such as costs and ease of development. Two such aspects are investigated: • Method to reduce the total number of test questions while retaining validity and reliability. • The effects of reducing the typical number of options from four (4) to three (3).
Part I Examination Length: A Case Study Karine Georges, MSc.
Case Study: Certification Program • Task (2007): determine whether the 180-item, 4-hour examinations could be shortened in light of a potential move to computer-based testing (CBT).
Validity and Examination Length • Content Validity: The number of items on an examination must be sufficient to ensure adequate representative coverage. • Face Validity: If shortened, perceptions of stakeholders need to be considered vis-a-vis comparable professions.
Examination Length and Reliability • What is an acceptable reliability index for credentialing? • “A reliability correlation coefficient should fall in the high .80s or above for longer examinations (e.g., 150 or more items)” (NOCA, 2004). • What is the range of reliability indices for the current 180-item certification examinations? • Average: .84 • Min: .78 • Max: .92
Examination Length and Practical Considerations • If reliability is related to examination length, why shorten the examination? Costs and efficiency: • Each item costs between $300 and $1,000 to develop (Vale, 2006). • Additional items are needed for safeguard purposes and for ancillary materials such as prep guides or readiness tests. • The client’s intention to move to CBT makes shorter examinations advantageous: seat time can be reduced and more candidates accommodated within the testing period.
Research Approaches • Two approaches: • Classical Test Theory (CTT) approach: examining the reliability coefficient using the Spearman-Brown formula. • Item Response Theory (IRT) approach: examining the item information function using empirical data.
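CTT Results for the Two Certification Programs • Spearman-Brown formula: $\rho_{xx}^{*} = \dfrac{N\rho_{xx}}{1 + (N-1)\rho_{xx}}$, where $\rho_{xx}$ is the reliability of the current examination, $N$ is the ratio of the new examination length to the current length, and $\rho_{xx}^{*}$ is the projected reliability. • Results show that the examinations can be shortened by 20-30 questions (or about 10%) while reliability remains above .80.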
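The Spearman-Brown projection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis: it uses the reported average reliability of .84 for the 180-item examinations, and the function name `spearman_brown` and the list of removal sizes are assumptions made for the example.

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Projected reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)


current_items = 180
current_reliability = 0.84  # average reported for the 180-item examinations

# Removal sizes are illustrative; 20 and 30 match the range quoted on the slide.
for removed in (20, 30, 45, 90):
    new_items = current_items - removed
    projected = spearman_brown(current_reliability, new_items / current_items)
    print(f"{new_items} items: projected reliability {projected:.3f}")
```

The projection assumes the removed items are parallel to those retained, which is the first limitation listed on the next slide.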
Limitations of CTT Results • General Limitations of Spearman Brown: • Assumption that examinations are exactly parallel • Only one value for a range of abilities • Largely impacted by cohort
IRT Approach: Item Information Curve • Research has shown that in higher stakes examinations with Pass/Fail decisions such as certification examinations, examinations can be shortened without impacting classification abilities (Schulz & Wang, 2001) • What would be the impact if the certification examinations had 10% fewer items? • How about 25% or 50%?
IRT - Item Information Curve • IRT models specify the probability of a discrete outcome such as a correct response to an item, in terms of person and item parameters. • Person parameter: ability of a candidate (theta) • Item parameters: a: Discrimination (slope) b: Difficulty (location) c: Guessing
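A minimal sketch of these parameters in code, assuming the three-parameter logistic (3PL) form, which the a/b/c parameters suggest but the slide does not name. The function names are hypothetical, and no scaling constant (such as D = 1.7) is applied.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float, c: float) -> float:
    """Item information at ability theta under the same assumed 3PL model."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item: moderate discrimination, average difficulty, 4-option guessing floor.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = p_correct(theta, a=1.0, b=0.0, c=0.25)
    info = item_information(theta, a=1.0, b=0.0, c=0.25)
    print(f"theta={theta:+.1f}  P(correct)={p:.3f}  information={info:.3f}")
```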
IRT - Test Information Curve • Item information curves sum to the test information curve. • The amount of information depends on the length of the examination and the quality of the items. • The pass/fail decision should be made where error is minimal (ideally where the passmark is located) and where levels of ability can be clearly differentiated.
IRT - Results and Implications • The examinations can be reduced by at least 10% without significantly impacting the pass/fail decision. • Other factors to take into consideration • Number of candidates • Robustness of item bank
Other Considerations • What about face validity? • How would an examination with 90 items be viewed by other professionals compared to a comparable examination of 180 items?
Other Certification Programs • Review of over 75 certification programs within the same profession. • Average number of items: 164 (typically between 150 and 175 items, including experimental items) • Minimum: 100 • Maximum: 250
Summary • Data suggest that the number of items can be reduced by 10% with minimal impact on the validity and reliability.
Part II How Many Options is Optimal in Multiple Choice Testing? Kelly Piasentin, PhD
Multiple Choice Testing • Most common format used in licensure and certification examinations • Consists of a stem (i.e., the question being asked) and a series of options to choose from (usually 4) • Example: Stem: In which state is the 2008 CLEAR conference being held? Options: Arkansas, Alaska, Arizona, Alabama
Advantages of Multiple Choice • Versatility • Efficiency • Scoring accuracy and economy • Reliability • Diagnosis • Control of difficulty • Amenable to item analysis
Disadvantages of Multiple Choice • Time consuming to write • Difficult to create effective distracters (i.e., options that are plausible, but incorrect)
Time Spent Writing MCQs • Sample of 75 Item Writers for 3 different licensing/certification examinations • Average time spent writing an MCQ: 52 minutes • Percentage of time spent writing:
Effort Spent Writing Distracters Of the 75 Item Writers… • 25% reported that it was difficult to write the 1st distracter • 40% reported that it was difficult to write the 2nd distracter • 75% reported that it was difficult to write the 3rd distracter
How many options should an MCQ have? • 4-option MCQs are widely used in standardized testing • But are 4 options ideal? • Some item-writing guidelines say, “develop as many options as feasible” (Haladyna & Downing, 1989) • More recently, “develop as many functional distractors as are feasible” (Haladyna, Downing, & Rodriguez, 2002) • Increasing emphasis on the quality of distracters as opposed to the quantity
Definition of a Functional Distracter “A functional distracter is one that has (a) a significant negative point-biserial correlation with the total test score, (b) a negative sloping item characteristic curve, and (c) a frequency of response greater than 5% for the total group.” Haladyna & Downing (1988)
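A rough screen for criteria (a) and (c) can be sketched as below. This is a simplified illustration under stated assumptions: it checks only that the point-biserial is negative (no significance test) and that the selection frequency exceeds 5%; criterion (b), the slope of the item characteristic curve, is not checked. The function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def point_biserial(indicator, scores):
    """Pearson correlation between a 0/1 indicator and total test scores."""
    mi, ms = mean(indicator), mean(scores)
    cov = mean((x - mi) * (y - ms) for x, y in zip(indicator, scores))
    sd = pstdev(indicator) * pstdev(scores)
    return cov / sd if sd else 0.0


def screen_distracter(choices, scores, option, min_freq=0.05):
    """Check frequency (> min_freq) and sign of the point-biserial for one option."""
    chose = [1 if c == option else 0 for c in choices]
    freq = mean(chose)
    r = point_biserial(chose, scores)
    return {"frequency": freq, "point_biserial": r, "passes_screen": freq > min_freq and r < 0}


# Toy data for one item keyed "A": 40 candidates' choices and total test scores.
choices = ["A"] * 26 + ["B"] * 8 + ["C"] * 5 + ["D"] * 1
scores = [32] * 13 + [28] * 13 + [20] * 8 + [22] * 5 + [25] * 1

for option in ("B", "C", "D"):
    print(option, screen_distracter(choices, scores, option))
```

In the toy data, option D correlates negatively with total score but is chosen by fewer than 5% of candidates, so it does not pass the screen.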
How does the number of options impact guessing? • With 4 options, candidates have a 25% chance of getting any one question correct simply by guessing • The probability is reduced to 20% if there are 5 options • The probability is increased to 33% if there are 3 options • But if a typical examination has 25 items, each with 3 options, the chance of getting at least 70% on the examination by pure blind guessing is 1 in 25,000 • So, do you get more bang for your buck by having more options?
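The chance of reaching a score by blind guessing can be checked with a binomial tail probability. The sketch below assumes independent guesses with equal probability per option and reads "at least 70%" of 25 items as 18 or more correct; both are assumptions, and the function name `p_at_least` is hypothetical. Under these assumptions the exact 3-option result is on the order of 1 in 10,000, so it may differ from the slide's figure depending on the rounding or approximation used there.

```python
from math import ceil, comb


def p_at_least(n_items: int, n_options: int, min_correct: int) -> float:
    """Probability of at least min_correct right answers by pure guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


n_items = 25
min_correct = ceil(0.70 * n_items)  # assumed reading of "at least 70%"

for n_options in (3, 4, 5):
    prob = p_at_least(n_items, n_options, min_correct)
    print(f"{n_options} options: about 1 in {round(1 / prob):,}")
```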
Are 4-option MCQs optimal? Factors to consider: • Time and cost it takes to develop distracters • Time it takes for candidates to complete the examination • Psychometric properties of examination • Item difficulty • Item discrimination • Test reliability (Coefficient alpha)
Arguments in favour of 3-options: • Less time is needed to develop two plausible distracters • More 3-option items can be administered without increasing testing time • Inclusion of additional high quality items per unit of time should improve test score reliability • Having fewer options decreases the likelihood of exposing additional aspects of the domain to candidates (e.g., context clues to other questions)
Data from a Licensing/Certification Examination • Number of MCQs: 235 • Number of candidates: 5,393 • Mean item difficulty: .721 • Mean discrimination index: .166 • Test reliability: .88 • Most chosen distracter: .167 • 2nd most chosen distracter: .077 • Least chosen distracter: .035
Reducing Examination Items to 3 Options What would be the effect on item difficulty, discrimination and reliability of reducing the items on the examination to 3 options if the least chosen distracter was: • Attributed to correct answer? • Attributed to 2nd least chosen distracter? • Randomly distributed to each of the other 3 choices?
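The three rescoring scenarios can be sketched for a single item as follows. This is an illustrative reconstruction, not the authors' procedure: the function names, the mode labels, and the toy response counts are hypothetical, and real data would be processed item by item before recomputing discrimination and reliability.

```python
import random
from collections import Counter


def collapse_least_chosen(choices, key, options, mode, rng=None):
    """Recode one item's responses after removing its least chosen distracter."""
    counts = Counter(choices)
    distracters = sorted((o for o in options if o != key), key=lambda o: (counts[o], o))
    least, second_least = distracters[0], distracters[1]
    remaining = [o for o in options if o != least]
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    recoded = []
    for c in choices:
        if c != least:
            recoded.append(c)
        elif mode == "to_correct":
            recoded.append(key)
        elif mode == "to_second_least":
            recoded.append(second_least)
        elif mode == "random":
            recoded.append(rng.choice(remaining))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return recoded


def difficulty(choices, key):
    """Item difficulty as the proportion of candidates answering correctly."""
    return sum(c == key for c in choices) / len(choices)


# Toy item keyed "A" with 100 hypothetical candidates.
choices = ["A"] * 72 + ["B"] * 17 + ["C"] * 8 + ["D"] * 3

for mode in ("to_correct", "to_second_least", "random"):
    recoded = collapse_least_chosen(choices, "A", "ABCD", mode)
    print(f"{mode}: difficulty {difficulty(recoded, 'A'):.3f}")
```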
Reducing Examination Items to 3 Options If least chosen attributed to correct answer: • Item difficulty: .752 • Mean discrimination index: .136 • Coefficient Alpha: .834
Reducing Examination Items to 3 Options If least chosen attributed to 2nd least chosen distracter: • Item difficulty: .720 • Mean discrimination index: .168 • Reliability: .881
Reducing Examination Items to 3 Options If least chosen distributed randomly to each of the other 3 choices: • Item difficulty .731 • Mean discrimination index: .158 • Reliability : .868
4 Options vs. 3 Options • Moving from 4 options to 3 options did not have a significant impact on average item difficulty, discrimination or test reliability.
Summary • Two primary benefits of using 3 options (as opposed to 4 options): • Faster item writing: better quality items; cost savings • Better testing: shorter test time; more questions in the same amount of time (potential for increased reliability)
Conclusion • These two presentations demonstrate that you can accrue some efficiencies from reducing test length and number of response options without compromising test validity. • Further research needed to confirm findings.
Contact Information • Assessment Strategies, 1400 Blair Place, Suite 210, Ottawa, ON K1J 9B8, Canada • Telephone: 613-237-0241 • Website: www.asinc.ca • Karine Georges, MSc: kgeorges@asinc.ca • Kelly Piasentin, PhD: kpiasentin@asinc.ca