How to Assess and Measure Competency Robert C. Shaw, Jr., PhD Program Director
Presentation Outline • Describe a program’s responsibilities • Assess appropriate content • Measure abilities as precisely as possible • Reference each cut score to a criterion
The validity claim • Our program is confident we can make valid inferences from an assessment because • we carefully selected and structured the content and • observed scores are reasonably precise • Weakness in either claim diminishes the validity argument
Define appropriate content What should we assess?
Information sources for content • Stakeholders’ Expectations • Certification Board’s Expectations
What should we assess? • A program should seek multiple opinions about program content • This may mean consulting more than one faculty member in the program • It could extend to survey results from several stakeholders • Those who hire your graduates • Those who graduated
Describe potential content • Define potential content by describing job behaviors or tasks • Interpret arterial blood gas (ABG) results • Determine the appropriate time to refer a patient for consultation from another service • Adjust mechanical ventilation settings to optimize oxygenation for a patient while minimizing the risk of pulmonary injury
Define terminal behaviors • Focus terminal assessments on end-product behavior you expect students to master • Insert a pulmonary artery catheter in a patient within a critical care setting using standard technique while minimizing risks of infection and lung involvement • Integrate pulmonary function testing results with patient history and other laboratory results to produce a diagnosis
Measure task criticality • Typically expressed by the interaction of an • importance/significance/risk measure and a • frequency/extent measure
Potential survey measurements • How important is the task to success? OR How significant is the task to safe and effective practice? • 4=Extremely 3=Very 2=Moderately 1=Minimally
Potential survey measurements • If this task is incorrectly performed, how severe is the risk? • 3=High 2=Moderate 1=Low OR 3=Potentially fatal 2=Likely to increase morbidity 1=Unlikely to have an adverse effect
Potential survey measurements • How frequently do you perform the task? • 3=Every week 2=A few times each year 1=Less than once a year OR 3=Very often 2=Occasionally 1=Infrequently
Potential survey measurements • Have you performed the task in the last year? • 1=Yes 0=No
What can we do with task measurements? • Norm-referenced approach • Rank order tasks from most to least critical • Start at the top and work down using available time • Criterion-referenced approach • Identify tasks that are sufficiently critical to ensure program coverage and competency assessment
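Both approaches are easy to operationalize once the survey measurements are in hand. A minimal sketch in Python, assuming criticality is computed as the product of the importance and frequency measures; the task names, ratings, and threshold below are illustrative, not from the presentation:

```python
# Illustrative survey results: importance (1-4) and frequency (1-3) per task.
tasks = {
    "Interpret ABG results":            {"importance": 4, "frequency": 3},
    "Insert pulmonary artery catheter": {"importance": 4, "frequency": 1},
    "Adjust mechanical ventilation":    {"importance": 3, "frequency": 3},
}

# Criticality expressed as the interaction (product) of the two measures.
criticality = {t: m["importance"] * m["frequency"] for t, m in tasks.items()}

# Norm-referenced: rank tasks from most to least critical, cover from the top down.
ranked = sorted(criticality.items(), key=lambda kv: kv[1], reverse=True)

# Criterion-referenced: keep every task at or above a chosen criticality threshold.
CUT = 6  # illustrative threshold
covered = [task for task, c in criticality.items() if c >= CUT]
```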
Select item type(s) for each assessment • Constructed response (e.g., short answer, essay, performance) • Short development time • Long scoring time • Scores have strong subjective characteristics • Selected response (e.g., true/false, matching, multiple-choice) • Long development time • Short scoring time • Scores have strong objective characteristics
High stakes terminal assessments should be standardized • Specify how the assessment should look before writing/selecting items • Test specifications ensure each assessment is similar, fair, and covers critical content
Test specifications and items • Each item should be linked to a task and a cognitive process level • It helps to store items in a database • A sophisticated database will permit additional layers of classification • Acute/chronic care • Age groups
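As an illustration of those classification layers, here is a hypothetical banked item record in Python; every field name and value is an assumption made for the sketch, not the schema of any particular product:

```python
# One banked item, linked to a task and a cognitive process level,
# with the additional classification layers mentioned above.
item = {
    "id": "ITEM-0042",
    "stem": "Which ventilator change best improves oxygenation while "
            "minimizing the risk of pulmonary injury?",
    "options": {"A": "...", "B": "...", "C": "..."},
    "key": "A",
    "task": "Adjust mechanical ventilation settings",
    "cognitive_level": "application",
    "care_type": "acute",   # acute/chronic classification layer
    "age_group": "adult",   # age-group classification layer
}
```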
Item banking software • FastTest $$$ • www.assess.com/frmSoftCat.htm • ExamView $$$ • www.pearsonncs.com/examview/ examview.htm • LXR*Test $$$ • www.lxrtest.com/
Measure abilities precisely Are we confident an assessment has yielded a sufficiently precise ability estimate?
Reliability • Theoretical premise • Observed scores are assumed to express true ability plus some measurement error • High reliability implies low measurement error
Reliability • Reliability indices are interpreted like R² values: they express the proportion of observed score variance that can be attributed to true score variance • How high is high enough? • A test score reliability value of at least .85 is characteristic of large-scale, standardized assessments; many exceed .90 • Sufficiently reliable test scores from a program-built test should show values of at least .60
Reliability • Reliability is an attribute of a set of test scores; it is not an attribute of a test • Therefore, a program should assess reliability for each group of examinees • KR-20 is appropriate for dichotomously scored (0, 1) items • Coefficient alpha works for polytomously (0, 1, …, n) scored items
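Both coefficients can be computed directly from a students-by-items score matrix. A minimal sketch using the standard KR-20 and coefficient alpha formulas (the function names are mine):

```python
import numpy as np

def kr20(scores):
    """KR-20 for a students x items matrix of dichotomous (0, 1) scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * (1 - p)).sum() / total_var)

def coefficient_alpha(scores):
    """Coefficient (Cronbach's) alpha; also handles polytomous (0..n) scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)
```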
Why are selected response items used for so many assessments? • Assuming the time to assess is constant, more responses can be elicited from students using selected response items • more items = • broader content coverage = • increased information = • enhanced measurement precision = • stronger validity • Scores are more strongly objective
Add items or options? • A program cannot go wrong by adding more items to an assessment • A program may only consume space and time by adding more options to multiple-choice items • There is growing evidence items with 3 options are optimal, particularly when doing so permits inclusion of more items on an assessment • Dr. Thomas Haladyna, Arizona State University
Up to a point, measurement precision and item quantity are directly related • [Chart: reliability plotted against item count; both curves rise with item count, with higher quality items yielding higher reliability than lower quality items throughout]
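The relationship pictured above is the one the Spearman-Brown prophecy formula formalizes; the formula is not named in the presentation, but it is the standard way to project reliability onto a lengthened test of comparable items:

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor,
    assuming the added items are of comparable quality."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a test whose scores show reliability .60:
print(round(spearman_brown(0.60, 2.0), 2))   # 0.75

# Lower quality items flatten the curve: doubling from .30 gains less ground.
print(round(spearman_brown(0.30, 2.0), 2))   # 0.46
```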
What encourages high item quality? • Write well • Clear, concise, accurate • Remove unnecessary information from the stimulus • Present nuanced choices that require a sophisticated mastery of material to correctly respond • Item review is another opportunity to seek multiple opinions
What encourages high item quality? • Avoid formats known to be flawed • D. All of the above • D. None of the above • Negative wording • All of the following are true EXCEPT • Which of the following is not true?
What encourages high item quality? • Apply quality improvement principles • Analyze item performance • Retain items that contribute to test score reliability • Change or discard items that fail to contribute or negatively affect reliability
Item analysis properties • Difficulty • p = proportion of students who correctly responded • Discrimination • rpb = correlation between item success and students’ test scores
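Both statistics fall out of a students-by-items score matrix in a few lines. A minimal sketch, assuming dichotomous scoring; it correlates each item with the total minus that item, a common correction, since correlating against a total that includes the item inflates rpb:

```python
import numpy as np

def item_statistics(scores):
    """Per-item difficulty (p) and discrimination (r_pb) for a
    students x items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)          # each student's test score
    p = scores.mean(axis=0)              # proportion responding correctly
    r_pb = np.array([
        np.corrcoef(scores[:, j], totals - scores[:, j])[0, 1]  # corrected total
        for j in range(scores.shape[1])
    ])
    return p, r_pb
```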
Item difficulty • [Chart: contribution to test score reliability plotted against p; the curve peaks for p values between 0.4 and 0.6 and falls toward zero at the extremes of 0.0 and 1.0]
Item discrimination • Because rpb values are correlations, values reflect one of three possibilities relative to reliability • Positive contribution • No contribution • Negative contribution
Using item parameters diagnostically • Relative to reliability contribution, item • p values provide magnitude information • rpb values provide magnitude and direction (+ or -) information
Using item parameters diagnostically • Difficulty and discrimination properties equally contribute to reliability • The best items show .30 < p < .70 AND rpb > .20 • The worst items exist at the difficulty extremes and show zero or negative discrimination
After diagnosing an item that shows a weak or negative reliability contribution • What should we do? • Observe option response frequencies and mean scores • Identify incorrect responses that attracted students with test scores equal to or greater than the average • Replace the offending option with a less attractive response • Rewrite the stem to clarify ambiguities OR • Discard the whole item and use a better one the next time
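The option-level diagnosis above can be scripted. A sketch, assuming you have each student's chosen option and total test score; it flags incorrect options that attracted students scoring at or above the test average:

```python
import numpy as np

def distractor_report(choices, total_scores, key):
    """Frequency and mean total score per option; flags incorrect options
    that attracted students scoring at or above the test average."""
    total_scores = np.asarray(total_scores, dtype=float)
    test_mean = total_scores.mean()
    for option in sorted(set(choices)):
        mask = np.array([c == option for c in choices])
        mean = total_scores[mask].mean()
        flag = "  <- revise" if option != key and mean >= test_mean else ""
        print(f"Option {option}: n={mask.sum():3d}  mean total={mean:5.1f}{flag}")
```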
Item analysis software • Iteman $$$ • www.assess.com/Software/iteman.htm • examSystem II $$$ • www.pearsonncs.com/examsystem/index.htm • LXR*Test $$$ • www.lxrtest.com/ • True Score II $$ • www.nine-patch.com/TSCDL.htm • Excel Templates $Free • www.eflclub.com/elvin/publications/2003/itemanalysis.html
Internal resources may be available • A large university with education, psychology, and/or statistics departments will likely have a system available for scoring items and providing analyses of test scores and items
Reference each cut score to a criterion Should we define and assess minimal competence for our program?
Cut points • Highly reliable test scores reveal differences between students’ abilities and can help accurately rank order students, which may be important to employers • However, the program is likely interested in assessing whether each student is sufficiently competent to safely and effectively practice • Such assessment concerns typically surface as students are about to graduate
Measuring minimal competence • A program should decide whether to create one large assessment with a single compensatory cut point OR • to set a separate cut for each content domain, a conjunctive model
Why do so many competency assessments use a single compensatory cut? • If a program selects the more rigorous conjunctive model, then each component test will produce its own set of scores, each with its own reliability • Each component must have a sufficient number of items or data points to be confident each student group’s test scores will show adequate reliability • Modules of fewer than 80–100 program-made items are unlikely to produce adequate reliability
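The two models reduce to simple decision rules. A minimal sketch; the domain names and cut values are illustrative:

```python
def compensatory_pass(domain_scores, overall_cut):
    """One cut on the total: strength in one domain can offset weakness in another."""
    return sum(domain_scores.values()) >= overall_cut

def conjunctive_pass(domain_scores, domain_cuts):
    """One cut per domain: the student must clear every one."""
    return all(domain_scores[d] >= cut for d, cut in domain_cuts.items())

scores = {"acute care": 72, "chronic care": 58, "diagnostics": 81}
print(compensatory_pass(scores, overall_cut=195))                      # True
print(conjunctive_pass(scores, {"acute care": 60, "chronic care": 60,
                                "diagnostics": 60}))                   # False
```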
Seek multiple opinions . . . again • Program faculty should define skills competent practitioners possess • This is a group activity • Each cut point should be linked to a definition of minimally competent practitioners
Performance assessments • Pick your spots • Ensure a sufficient quantity of information is collected • Standardize administration • Measure agreement between/among evaluators
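Agreement between evaluators can be quantified with a chance-corrected index; Cohen's kappa is used below as a common choice, though the presentation does not prescribe a statistic. A minimal sketch for two raters scoring the same students:

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same students."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Two evaluators rating ten performance checklist items pass/fail:
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))   # 0.47
```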
Summary • Collective opinions come closer to the truth than any one opinion about • appropriate assessment content, • item quality, and • justifiable cut scores • Unreliable scales have no utility
Thank you for the opportunity to share some details about measurement Questions?