
The Jointly Developed English BAT as a Multi-Stage, Computer Adaptive Test

Presentation Transcript


  1. The Jointly Developed English BAT as a Multi-Stage, Computer Adaptive Test BILC 2014 Conference Bruges, Belgium Ray Clifford

  2. Outline • BILC: Always a learning experience. • A review of some important characteristics of STANAG 6001. • Ramps and Stairs: Two contrasting approaches to test design and scoring. • Designing tests of multidimensional traits, such as language proficiency. • Results and conclusions.


  4. Glenn Fulcher Explained Models, Frameworks, and Test Specifications. Glenn Fulcher University of Leicester BILC Conference 2014

  5. Bart Deygers disambiguated rating procedures.

  6. We even learned on the tour. And where there is learning, there is testing!

  7. Learning on the Bruges Tour • Why was the tower built with a slant?

  8. Learning on the Bruges Tour • Why was the tower built with a slant? • Just because the builders were so inclined.

  9. Learning on the Bruges Tour • Why did we walk instead of taking a carriage?

  10. Learning on the Bruges Tour • Why did we walk instead of taking a carriage? • This is a serious meeting; no horsing around is allowed.

  11. Learning on the Bruges Tour • Why should one not walk on the grass?

  12. Learning on the Bruges Tour • Why should one not walk on the grass? • This is noisy grass; walking on it will disturb the neighbors.

  13. Learning on the Bruges Tour • Why is getting an Education important?

  14. Learning on the Bruges Tour • Why is getting an Education important? • It will reflect well on you.

  15. Learning on the Bruges Tour • Why are the baroque buildings in better condition than other buildings?

  16. Learning on the Bruges Tour • Why are the baroque buildings in better condition than other buildings? • The town believes: If it isn’t baroque, don’t fix it!

  17. Learning on the Bruges Tour • What does this feature of the hotel have to do with STANAG 6001 testing?

  18. Learning on the Bruges Tour • What does this feature of the hotel have to do with STANAG 6001 testing? • By the end of the presentation, you will know.

  19. Outline • BILC: Always a learning experience. • A review of some important characteristics of STANAG 6001. • Ramps and Stairs: Two contrasting approaches to test design and scoring. • Designing tests of multidimensional traits, such as language proficiency. • Results and conclusions.

  20. Some Characteristics of STANAG 6001 • Every base STANAG 6001 level description has 3 components: • Context (Content/Topics) • Communication Tasks/Functions. • Accuracy (and precision) expectations. • At every base level, each of these components is different from the descriptions in other levels. • The levels are not a single “scale”, but a hierarchy of Criterion-Referenced abilities.

  21. [A review item from Glenn Fulcher’s presentation yesterday] Mistaken Ideas • “…when we speak of ‘setting performance standards’ we are…referring to the…concrete activity of deriving cut points along a score scale” (Cizek and Bunch, 2007, p. 14). • [Figure: climbing the ladder]

  22. The STANAG 6001 Hierarchy Note: The ladders overlap.

  23. STANAG 6001 with Buckets • [Figure: three buckets, for Levels 1, 2, and 3; the blue arrows indicate the water (ability) observed at each level.] • Notes: The buckets may begin filling at the same time. Some Level 2 ability will develop before Level 1 is mastered. That is ok, because the buckets will still reach their full (mastery) state sequentially.

  24. Some Characteristics of STANAG 6001 • The levels describe real-life communication situations. • The levels are not linked to any curriculum. • It is expected that a speaker be able to speak (write) extemporaneously, using unrehearsed language. • The reader (listener) should be able to comprehend authentic language.

  25. Some Characteristics of STANAG 6001 • As with other Criterion-Referenced testing scenarios, STANAG 6001 includes three essential TCA components at each level. • Task • Condition • Accuracy • All three components are required to define a “measurable criterion”. • All specifications of a criterion must be met to earn a rating at that level.

  26. Some Characteristics of STANAG 6001 • The one-page summary, NATO STANAG 6001 Ed. 4 for Non-Specialists, shows this three-part TCA composition of the base levels quite clearly. • There is a reason why that summary does not include the “+ levels”. • Plus levels describe some frequently-occurring ability profiles of people whose performance does not fully satisfy the criterion at the next higher level.

  27. Outline • BILC: Always a learning experience. • A review of some important characteristics of STANAG 6001. • Ramps and Stairs: Two contrasting approaches to test design and scoring. • Designing tests of multidimensional traits, such as language proficiency. • Results and conclusions.

  28. Ramps Versus Steps • What does this feature of the hotel have to do with STANAG 6001 testing? • The steps represent a Criterion-Referenced (C-R) test design. • The ramp represents a Norm-Referenced (N-R) test design.

  29. Ramp: N-R Scoring Procedures • Produce a single score for the test. (And the error variance for all ability levels is included in that single score.) • Extrapolate or convert the score to estimate the equivalent “step height” using imprecise “standard setting” judgments. • The single score is compensatory, e.g. a person with one foot on Level 1 and one foot on Level 3 will earn a Level 2 rating.
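The compensatory effect in the last bullet can be shown with a tiny numeric sketch; the section scores below are invented for illustration:

```python
# A numeric sketch of compensatory (single-score) rating.
# The two section scores are invented for illustration.

def compensatory_rating(section_scores):
    """A single average score lets strength at one level offset weakness at another."""
    return sum(section_scores) / len(section_scores)

# "One foot on Level 1 and one foot on Level 3":
print(compensatory_rating([1, 3]))   # 2.0 -> reported as Level 2, a level
                                     # whose criterion was never demonstrated
```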

  30. Test Development Procedures: Norm-Referenced Tests • Create a table of test specifications. • Train item writers in item-writing techniques. • Develop a lot of items. • Test the items for difficulty, discrimination, and reliability by administering them to several hundred learners. • Use statistics to eliminate “bad” items. • Administer the resulting test. • Report results compared to other students.
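The item statistics in the fourth bullet can be sketched in a few lines; the response matrix below is invented, with difficulty computed as proportion correct and discrimination as the item-total point-biserial correlation:

```python
# A sketch of classical N-R item statistics on an invented response matrix.
from statistics import mean, pstdev

responses = [      # rows = test takers, columns = items (1 = correct)
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]

def difficulty(item):
    """p-value: proportion of test takers answering the item correctly."""
    return mean(row[item] for row in responses)

def discrimination(item):
    """Point-biserial correlation between item score and total score."""
    totals = [sum(row) for row in responses]
    items = [row[item] for row in responses]
    mi, mt = mean(items), mean(totals)
    cov = mean((i - mi) * (t - mt) for i, t in zip(items, totals))
    return cov / (pstdev(items) * pstdev(totals))

print(difficulty(0))       # 0.8 -- an easy item
print(discrimination(1))   # positive: high scorers tend to get this item right
```

Items with very low discrimination (or negative values) are the "bad" items a norm-referenced analysis would flag for removal.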

  31. Test Development Procedures: Norm-Referenced Tests (cont.) • Note: Relating a norm-referenced score to a multi-level set of criteria (such as the hierarchical set of STANAG 6001 criteria) presents formidable theoretical and practical challenges.

  32. A Traditional Method of Setting Cut Scores • [Figure: groups of “known” ability (Level 1, Level 2, and Level 3) are administered the N-R test to be calibrated.]

  33. The Results One Hopes For: • [Figure: each “known”-ability group’s scores fall in a distinct band on the test to be calibrated.]

  34. The Results One Always Gets (Some test takers score below and some score above their “known” ability.) • [Figure: the test scores received by the three “known”-ability groups overlap.]

  35. No matter where the cut scores are set, they are wrong for many test takers. • [Figure: no cut score cleanly separates the overlapping test scores received by the three “known”-ability groups.]
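A toy simulation, with invented scores, makes the slide's point concrete: once the groups' score distributions overlap, every possible cut score misclassifies someone:

```python
# Invented scores for two "known"-ability groups whose distributions overlap.
level1_scores = [40, 45, 50, 55, 62]   # some Level 1 readers score high...
level2_scores = [48, 56, 60, 66, 70]   # ...while some Level 2 readers score low

def misclassified(cut):
    """Errors at this cut: Level 1 takers at/above it plus Level 2 takers below it."""
    return (sum(s >= cut for s in level1_scores) +
            sum(s < cut for s in level2_scores))

# Try every candidate cut score across the observed range:
errors = {cut: misclassified(cut) for cut in range(40, 71)}
print(min(errors.values()))   # 2 -- even the best cut misplaces two test takers
```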

  36. Is there a better way than indirect extrapolation to assign proficiency levels? Would close adherence to the C-R proficiency scale characteristics improve testing accuracy?

  37. Major Steps in Creating C-R Reading Proficiency Tests • Define the construct to be tested. • Use each level’s Task, Conditions, and Accuracy criteria to establish test specifications for each level to be tested. • Train the item writers to develop C-R, TCA-aligned items targeting each level. • Test whether the C-R, TCA-aligned item sets cluster in difficulty by level and whether the clusters do not overlap.
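The final check above, that level-targeted item sets cluster in difficulty without overlapping, can be sketched as follows; the proportion-correct values are invented (harder items have lower proportion correct):

```python
# Invented item difficulties (proportion correct) grouped by target level.
item_difficulty = {
    1: [0.92, 0.88, 0.85, 0.90],
    2: [0.74, 0.70, 0.79, 0.68],
    3: [0.55, 0.48, 0.60, 0.52],
}

def clusters_are_ordered(by_level):
    """True if every higher level's items are harder than every lower level's."""
    levels = sorted(by_level)
    return all(min(by_level[lo]) > max(by_level[hi])
               for lo, hi in zip(levels, levels[1:]))

print(clusters_are_ordered(item_difficulty))  # True: the difficulty bands do not overlap
```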

  38. Steps: C-R Scoring Procedures • Calculate the person’s ability score for each step. (Only the error variance for one step is included in each of those scores.) • Use non-compensatory scoring to identify each person’s “floor and ceiling” ability. • The floor is the highest level where mastery of the TCA criterion is demonstrated. • The ceiling is the first level where the person’s performance does not fully meet the criterion.
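A minimal sketch of the non-compensatory "floor and ceiling" rule, assuming a simple per-level mastery decision; the mastery patterns below are invented (in the BAT, each decision would come from a level-specific, TCA-aligned item set):

```python
# Non-compensatory "floor and ceiling" scoring over invented mastery decisions.

def floor_and_ceiling(mastery):
    """mastery: dict mapping level -> True if the TCA criterion was fully met.

    Returns (floor, ceiling): floor is the highest level mastered with all
    lower levels also mastered; ceiling is the first level not fully met.
    """
    floor, ceiling = None, None
    for level in sorted(mastery):
        if mastery[level] and ceiling is None:
            floor = level            # mastery sustained so far
        elif ceiling is None:
            ceiling = level          # first unmet criterion
    return floor, ceiling

# Mastery at Levels 1 and 2, not yet at Level 3:
print(floor_and_ceiling({1: True, 2: True, 3: False}))   # (2, 3) -> rated 2
# Partial Level 3 ability cannot compensate for an unmet Level 2 criterion:
print(floor_and_ceiling({1: True, 2: False, 3: True}))   # (1, 2) -> rated 1
```

The second call is the key contrast with compensatory scoring: an averaged total would credit the Level 3 performance, while the non-compensatory rule does not.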

  39. Language Learning Considerations • Language learners do not completely master the communication tasks and topical domains of one proficiency level before they begin learning the skills described at the next higher level. • Usually, learners will have developed conceptual control or even partial control over the next higher proficiency level by the time they have attained sustained, consistent control over the lower level.

  40. Testing Considerations • Why “Floor” and “Ceiling” ratings are used: • Criterion-Referenced testing requires a separate score for each criterion. • C-R testing uses non-compensatory scoring (and a single overall score on a multi-level test is always a compensatory score). • Dual scores explain ability distinctions that would be regarded as error variance in multi-level tests that report only a total test score.

  41. Outline • BILC: Always a learning experience. • A review of some important characteristics of STANAG 6001. • Ramps and Stairs: Two contrasting approaches to test design and scoring. • Testing multidimensional traits, such as language proficiency. • Results and conclusions.

  42. To Test Multidimensional Traits • Define incremental states or stages of the trait (where each stage is more complex in dimensionality than the preceding stage). • Maintain strict alignment of the: • Theoretical construct model. • Test development model. • Psychometric scoring model. Luecht, R. (2003). Multistage complexity in language proficiency assessment: A framework for aligning theoretical perspectives, test development, and psychometrics. Foreign Language Annals, 36, 527-535.
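One way the three aligned models might meet in a multi-stage adaptive design can be sketched as a routing loop; the stage scores, entry stage, and mastery threshold below are all assumptions for illustration, not the BAT's actual parameters:

```python
# A sketch of multi-stage routing: each stage is a level-specific item set,
# and performance on the current stage routes the test taker up or down.
MASTERY_THRESHOLD = 0.7   # assumed proportion correct that counts as mastery

def route(stage_scores, start=2):
    """Route through level stages; return (floor, stages_visited)."""
    level = start
    visited = []
    while True:
        visited.append(level)
        mastered = stage_scores[level] >= MASTERY_THRESHOLD
        nxt = level + 1 if mastered else level - 1
        if nxt in visited or nxt not in stage_scores:
            break                    # converged: the floor/ceiling pair is bracketed
        level = nxt
    floor = max((lv for lv in visited
                 if stage_scores[lv] >= MASTERY_THRESHOLD), default=None)
    return floor, visited

# Enters at Stage 2, masters it, then fails Stage 3 -> floor of Level 2:
print(route({1: 0.9, 2: 0.8, 3: 0.4}))   # (2, [2, 3])
```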

  43. To Have a C-R Test • The following elements must be aligned: • The construct to be tested. • Test development specifications. • The scoring model. • If these elements are aligned, the test is legally defensible. Shrock, S. A. & Coscarelli, W. C. (2007). Criterion-referenced test development: Technical and legal guidelines for corporate training and certification (3rd ed.). San Francisco, CA: John Wiley and Sons.

  44. Outline • BILC: Always a learning experience. • A review of some important characteristics of STANAG 6001. • Ramps and Stairs: Two contrasting approaches to test design and scoring. • Designing tests of multidimensional traits, such as language proficiency. • Results and conclusions.

  45. Benchmark Advisory Tests • The BATs follow the Criterion-Referenced scoring model used by the human-adaptive Oral Proficiency Interview (OPI): • To earn a specific proficiency rating, the test taker has to satisfy all of the level-specific Task, Conditions, and Accuracy (TCA) criteria associated with that level. • Note 1: When researchers tried assigning ratings based on a total of component scores, they found that total scores didn’t accurately predict the Criterion-Referenced OPI ratings. • Note 2: The same non-alignment occurred when they used multiple-regression analyses.

  46. Why use Criterion-Referenced scoring rather than total scores? • Proficiency ratings are “criterion” ratings. • C-R ratings require non-compensatory rating judgments at each level. • All total and average scores, even when weighted, are compensatory scores. • “Floor and ceiling” score comparisons are needed to assign a final rating. • Raters can’t apply “floor and ceiling” rating criteria using a single or composite score.

  47. Why do Speaking tests work? • We defined a primary construct for each proficiency level, and a secondary construct: that the primary constructs form a hierarchy. • We converted these proficiency constructs into test specifications. • We created a test delivery system, the OPI, based on those test specifications. • We applied Criterion-Referenced (C-R), “floor and ceiling” scoring procedures.

  48. OPI Speaking Tests Work • The accuracy of the C-R approach is supported by inter-rater agreement statistics: • Pearson’s r = 0.978 • Cohen’s weighted kappa = 0.920 (See Foreign Language Annals, Vol. 36, No. 4, 2003, p. 512) • The construct of a difficulty hierarchy is supported by the fact that the “floor and ceiling” rating system does not produce “inversions” between the assigned ratings.
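The weighted kappa cited above rewards near-agreement on the ordinal level scale; a sketch of linearly weighted Cohen's kappa for two raters follows (the ratings are invented, not the study's data):

```python
# Linearly weighted Cohen's kappa for two raters over ordinal categories.

def weighted_kappa(r1, r2, categories):
    n, k = len(r1), len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # weight = 1 - |i - j| / (k - 1): full credit for exact agreement,
    # partial credit for ratings one level apart.
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    # Observed weighted agreement across the rated cases:
    po = sum(w[idx[a]][idx[b]] for a, b in zip(r1, r2)) / n
    # Expected weighted agreement from each rater's marginal frequencies:
    p1 = [sum(a == c for a in r1) / n for c in categories]
    p2 = [sum(b == c for b in r2) / n for c in categories]
    pe = sum(w[i][j] * p1[i] * p2[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)

rater1 = [1, 2, 2, 3, 1, 2]   # invented level ratings from two raters
rater2 = [1, 2, 3, 3, 1, 2]
print(round(weighted_kappa(rater1, rater2, [1, 2, 3]), 3))   # 0.8
```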

  49. We used these steps to create the reading BAT • Defined the overall construct, the level-specific criteria, and a secondary construct: that the levels form a hierarchy. • Converted the construct and criteria into test specifications. • Developed a test delivery system based on those multi-stage test specifications. • Applied Criterion-Referenced, “floor and ceiling” scoring procedures.

  50. The Construct to be Tested • Author purpose → Reading purpose: • Orient → Get necessary information • Inform → Learn • Evaluate → Evaluate and synthesize • Proficient reading: The active, automatic, far-transfer process of using one’s internalized language and culture expectancy system to efficiently comprehend an authentic text for the purpose for which it was written.
