NATO BAT Testing: The First 200. BILC Professional Seminar, 6 October 2009, Copenhagen, Denmark. Dr. Elvira Swender, ACTFL.
This Report • History of Benchmark Advisory Tests (BAT) • 2009 Administration of BAT in 4 Skills • BAT Scores • Comparing National Scores to Benchmark Scores • Observations
Why Benchmark Testing? • To provide an external measure against which nations can compare their national STANAG test results • To promote relative parity of scale interpretation and application across national testing programs • To standardize what is tested and how it is tested
BAT History • Launched as a volunteer, collaborative project • The BILC Test Working Group • 13 members from 8 nations • Contributions received from many other nations • The original goal was to develop a Reading test • Later awarded a competitive contract by ACT • December, 2006
BAT History (cont’d) • ACTFL working with BILC Working Group • To develop tests in 4 skill modalities. • Reading and Listening tests piloted and validated • Speaking and Writing tests developed • Testers and raters trained and certified • Test administration and reporting protocols developed • 200 BAT 4-skills tests allocated under the contract • Tests administered and rated • Scores reported to Nations
BAT Reading and Listening Tests • Internet-delivered and computer-scored • Criterion-referenced tests • Allow for direct application of the STANAG Proficiency Scale • Each proficiency level is tested separately • Test takers take all items for Levels 1, 2, and 3 • 20 texts at each level; one multiple-choice item per text • The proficiency rating is assigned based on two separate scores • “Floor”: sustained ability across a range of tasks and contexts specific to one level • “Ceiling”: non-sustained ability at the next higher proficiency level
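The floor/ceiling logic above can be illustrated with a minimal sketch. The slides do not give the actual cut scores or the rule for awarding a plus, so the two thresholds below and the function name `assign_rating` are assumptions chosen only for illustration; the one fixed input taken from the slide is the 20 items per level.

```python
# Hypothetical cut scores: the slides do not state the actual thresholds.
ITEMS_PER_LEVEL = 20          # from the slide: 20 texts per level, one item each
SUSTAINED_MIN_CORRECT = 14    # assumed: 14 of 20 correct counts as sustained
PLUS_MIN_CORRECT = 10         # assumed: 10 of 20 at the next level earns a plus


def assign_rating(correct_by_level):
    """Return a rating such as "1", "1+", or "3" from per-level correct counts.

    correct_by_level maps each tested level (1, 2, 3) to the number of
    items answered correctly out of ITEMS_PER_LEVEL.
    """
    # Floor: the highest level sustained without skipping a lower level.
    floor = 0
    for level in (1, 2, 3):
        if correct_by_level[level] >= SUSTAINED_MIN_CORRECT:
            floor = level
        else:
            break
    if floor == 3:
        return "3"  # no higher level is tested, so no ceiling check
    # Ceiling: partial (non-sustained) ability at the next higher level.
    ceiling_correct = correct_by_level[floor + 1]
    plus = "+" if ceiling_correct >= PLUS_MIN_CORRECT else ""
    return f"{floor}{plus}"


print(assign_rating({1: 18, 2: 15, 3: 6}))   # 2
print(assign_rating({1: 19, 2: 11, 3: 4}))   # 1+
```

Because each level is scored on its own, a strong result at Level 2 cannot compensate for a weak result at Level 1 in this sketch.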
BAT Speaking Test • Telephonic Oral Proficiency Interview • Goal is to produce a speech sample that best demonstrates the speaker’s highest level of spoken language ability across the tasks and contexts for the level • Interview consists of • Standardized structure of “level checks” and “probes” • NATO-specific role-play situation • Conducted and rated by one certified BAT-S Tester • Independently second-rated by a separate certified tester or rater • Ratings must agree exactly • Level and plus-level scores are assigned • Discrepancies are arbitrated
BAT Writing Test • Internet-delivered • Open constructed response • Four multi-level prompts • Prompts target tasks and contexts of STANAG levels 1, 2, and 3 • NATO-specific prompt • Rated by a minimum of two certified BAT-W Raters • Ratings must agree exactly • Level and plus-level scores are assigned • Discrepancies are arbitrated
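The double-rating rule shared by the Speaking and Writing tests (ratings must agree exactly, discrepancies are arbitrated) might be sketched as follows. The slides do not describe how arbitration is carried out, so the `arbitrate` callable and the name `final_rating` are hypothetical stand-ins.

```python
def final_rating(first, second, arbitrate=None):
    """Return the reported rating for one speaking or writing sample.

    first, second: the two independent ratings, e.g. "2" or "2+".
    arbitrate: hypothetical callable that takes both ratings and returns
               the arbitrated rating; only consulted when they differ.
    """
    if first == second:
        return first  # exact agreement, including the plus
    if arbitrate is None:
        raise ValueError("ratings disagree and no arbitrator was supplied")
    return arbitrate(first, second)


print(final_rating("2", "2"))                                # 2
print(final_rating("1+", "2", arbitrate=lambda a, b: "1+"))  # 1+
```

Note that "2" and "2+" count as a disagreement here, since the slides require exact agreement.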
2009 BAT Administration • Allocation to 11 Nations • 8 Nations have completed testing • Testing began in May 2009 • Tests administered by LTI, the ACTFL Testing Office
2009 BAT Administration • Each Nation has a customized client site • Request tests • View and print test schedules • Obtain test administration instructions, passwords, and test codes • Retrieve Ratings
Alignment of National Scores and BAT Scores • Nation codes: Black, White, Red, Blue, Maroon, Purple, Yellow • Percentages by skill, with counts in parentheses
Listening | Speaking | Reading | Writing
40% (5) | 29% (7) | – | –
64% (11) | 56% (18) | 92% (13) | 39% (18)
89% (18) | 83% (18) | 83% (18) | 50% (18)
85% (20) | 47% (19) | 55% (20) | 60% (20)
69% (16) | 47% (15) | 64% (14) | 50% (18)
8% (12) | – | 54% (13) | –
24% (17) | 0% (18) | 33% (18) | 0% (18)
Observations – Listening Scores • Exact agreement of BAT and National Scores is 58% (69 of the 119 Listening scores agree exactly) • When the scores disagree, the National score is HIGHER 88% of the time • In 8 cases (7%), the disagreement is across two levels (1 vs 3 and 2 vs 4)
Observations – Speaking Scores • Exact agreement of BAT and National Scores is 46% (53 of the 115 Speaking scores agree exactly) • When the scores disagree, the National score is HIGHER in all cases • In 6 cases (6%), the disagreement is across two levels (1 vs 3 and 2 vs 4)
Observations – Reading Scores • Exact agreement of BAT and National Scores is 62% (74 of the 119 Reading scores agree exactly) • When the scores disagree, the National score is HIGHER in 85% of the cases • In 2 cases, the disagreement is across two levels (1 vs 3)
Observations – Writing Scores • Exact agreement of BAT and National Scores is 38% (44 of the 115 Writing scores agree exactly) • When the scores disagree, the National score is HIGHER in all cases • In 15 cases, the disagreement is across two levels (1 vs 3 and 2 vs 4)
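The agreement percentages above follow directly from the reported counts, and the same three statistics can be computed from paired scores. The sketch below first reproduces the reported percentages, then shows a hypothetical helper, `compare`, applied to made-up score pairs; it assumes scores are whole base levels (plus ratings are ignored), which the slides do not specify.

```python
# Reported counts from the slides: (scores in exact agreement, total scores).
reported = {
    "Listening": (69, 119),
    "Speaking": (53, 115),
    "Reading": (74, 119),
    "Writing": (44, 115),
}
agreement_pct = {skill: round(100 * a / n) for skill, (a, n) in reported.items()}
print(agreement_pct)  # {'Listening': 58, 'Speaking': 46, 'Reading': 62, 'Writing': 38}


def compare(pairs):
    """Summarize (national, bat) score pairs given as whole base levels."""
    total = len(pairs)
    agree = sum(1 for national, bat in pairs if national == bat)
    national_higher = sum(1 for national, bat in pairs if national > bat)
    two_level_gap = sum(1 for national, bat in pairs if abs(national - bat) >= 2)
    return {
        "total": total,
        "agree": agree,
        "disagree": total - agree,
        "national_higher": national_higher,
        "two_level_gap": two_level_gap,
    }


# Made-up pairs for illustration only; these are not BAT data.
sample = [(2, 2), (3, 2), (3, 1), (2, 2), (1, 2)]
print(compare(sample))
```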
Accounting for Strictness or Leniency • Testing rehearsed rather than unrehearsed material • Performance vs proficiency • Inconsistencies in interpretation of the STANAG • When “plus” ratings are not used, the tendency to award the next higher level rating to a performance that is substantially better than a baseline performance
For Receptive Skills • Compensatory cut score setting • Lack of alignment of author purpose, text type, and reader task at level • Inadequate item response alternatives
For Productive Skills • Misalignment of test type and test purpose • Ex: list of discrete questions when goal is to measure spoken language proficiency • Inadequate tester/rater norming
Plus Ratings • Within the Level 1 Range: 60% of ratings are 1; 40% of ratings are 1+ • Within the Level 2 Range: 50% of ratings are 2; 50% of ratings are 2+
Profiles • Only 12 of 115 profiles (10%) were “flat” • 1 1 1 1 (8) • 2 2 2 2 (2) • 3 3 3 3 (2) • All remaining profiles are mixed
We are all wondering. What will the future bring?
Let’s hope it’s not the same kind of anxiety these early linguists experienced.