EALTA MILSIG: Standardising the assessment of writing across nations
Ülle Türk, Language Testing Unit, Estonian Defence Forces
STANAG 6001 testing conference, 7-9 July 2009, Zagreb, Croatia
Outline Background Aims of the project Procedure Standard setting Results Conclusions
Background: EALTA • EALTA = European Association for Language Testing and Assessment • Established in 2004 as a professional association for language testers in Europe. • Mission: to promote the understanding of theoretical principles of language testing and assessment, and the improvement and sharing of testing and assessment practices throughout Europe. • Annual conferences • Discussion lists • ealta-members@lists.lancs.ac.uk • specialist lists
Background: MILSIG • March 2008 – MILSIG mailing list established: ealta-mil@lists.lancs.ac.uk • EALTA conference in 2008: • a meeting of language testers working in the military • participating countries/institutions: Denmark, Estonia, Latvia, Lithuania, SHAPE, Slovenia, Sweden • agreement to co-operate in standardising writing assessment
Aims of the project • To select a number of sample scripts that • have been written in response to a variety of prompts • demonstrate English language proficiency at STANAG levels 1-3 (4) • could later be used as • benchmark performances in assessing writing and in rater training • sample performances for teachers and test takers • To study the possibility of carrying out standardisation via email.
Procedure and timeline • Each participating country/institution selects 4 scripts, including problem scripts, at levels 1-3 – end of May • Scripts are collected, coded and sent to all participants – middle of June • Scripts are marked following the procedures established in each country – end of September • STANAG level descriptors used • Weak, standard and strong performances at each level identified • Comments provided • Results analysed; decisions taken
Participants • Denmark (1) • Estonia (5) • Latvia (4) • Lithuania (3) • SHAPE (2) • Slovenia (5)
Council of Europe: A manual
Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR)
• Pilot version: September 2003
• Final version: January 2009
'Relating an examination or test to the CEFR can best be seen as a process of "building an argument" based on a theoretical rationale.' (p 9)
Standard setting procedures:
• Familiarisation
• Specification
• Standardisation training/benchmarking
• Standard setting
• Validation
Table 5.2 of the Manual: Time Management for Assessing Written Performance Samples
Familiarisation: Raters rating descriptors • Mean correlation: 0.89 (SD = 0.04) • Range: 0.83 (R14) to 0.98 (R05)
Task types and original ratings
• 27 scripts: 6 L1, 14 L2, 7 L3
• 12 letters: 3 L1, 8 L2, 1 L3
• 4 (+ 5) essays: 2 L1, 4 L2, 3 L3
• 1 report: L3
• 1 memorandum: L2
• A first draft of a lecture (2): 1 L2, 1 L3
• Paper for a newsletter (1): L1
• Paper/letter/essay (1): L3
Rating scripts • Task: • Use STANAG 6001 writing descriptors, NOT your own rating scale. • If the script was written for a STANAG 6001 test in your country/institution, which level would it be awarded? • Do you consider it a weak, standard or strong performance at the awarded level? • Why?
Analysis of ratings • Coding: • L1 weak = 1 • L1 standard = 2 • L1 strong = 3 • L2 weak = 4 • L2 standard = 5 • L2 strong = 6 • L3 weak = 7 • L3 standard = 8 • L3 strong = 9
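To make the coding concrete, here is a minimal sketch in Python of how a rater's judgement maps onto the 1-9 scale above. The function name and the idea of storing the scheme as a lookup table are illustrative assumptions; the project did not prescribe any software.

```python
# The 1-9 coding scheme from the slide above, stored as a lookup table.
CODES = {
    ("L1", "weak"): 1, ("L1", "standard"): 2, ("L1", "strong"): 3,
    ("L2", "weak"): 4, ("L2", "standard"): 5, ("L2", "strong"): 6,
    ("L3", "weak"): 7, ("L3", "standard"): 8, ("L3", "strong"): 9,
}

def encode(level: str, strength: str) -> int:
    """Turn a rater's (level, strength) judgement into its numeric code."""
    return CODES[(level, strength)]

print(encode("L2", "standard"))  # -> 5
```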
Scripts recoded • MILSIGPR_01–MILSIGPR_12a = MSP-01–MSP-12 • MILSIGPR_12b = MSP-13 • MILSIGPR_12c = MSP-14 • MILSIGPR_12d = MSP-15 • MILSIGPR_12e = MSP-16 • MILSIGPR_12f = MSP-17 • MILSIGPR_12g = MSP-18 • MILSIGPR_12h = MSP-19 • MILSIGPR_13 = MSP-20 • MILSIGPR_14 = MSP-21 • etc
Script ratings • Mean rating: 2.8-7.8 (SD: 0.00-1.47) • 1-3 (L1): 1 script (6 scripts) • 4-6 (L2): 24 scripts (12 scripts) • 7-9 (L3): 2 scripts (7 scripts) • 15 scripts (55.6%) – agreement on the level, though usually not on whether the performance was weak, standard or strong at that level
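As an illustration of how such per-script summaries can be derived from the coded ratings, here is a small Python sketch. The script IDs echo the examples discussed below, but the rating values and the aggregation method are hypothetical assumptions, not the project's documented procedure.

```python
import statistics

# Hypothetical coded ratings (1-9 scale) from five raters per script.
ratings = {
    "MSP-07": [4, 4, 5, 3, 4],
    "MSP-20": [7, 5, 6, 7, 6],
    "MSP-21": [8, 8, 9, 8, 8],
}

def level(code: int) -> int:
    """Map a 1-9 code back to its STANAG level (1-3)."""
    return (code - 1) // 3 + 1

for script, rs in ratings.items():
    mean = statistics.mean(rs)
    sd = statistics.pstdev(rs)
    agreed = len({level(r) for r in rs}) == 1  # all raters on the same level?
    print(f"{script}: mean={mean:.1f}, SD={sd:.2f}, level agreement={agreed}")
```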
Three examples • MILSIGPR_07 (MSP-07) • A lot of grammatical mistakes, spelling, very basic range. Not enough for Level 2. • MILSIGPR_13 (MSP-20) • task at level 3, but the writing is not coherent, very incorrect, sometimes difficult to understand the meaning and very uninteresting – getting even worse towards the end • MILSIGPR_14 (MSP-21) • well written with control of grammar, good vocabulary and abstract concepts and arguments clearly conveyed, the person might be able to write at a high level 3, but does not quite prove it here
Mean ratings for scripts Mean rating: 5.2 (SD = 1.44)
Correlations between country ratings N = 27; N = 23 All significant at the 0.01 level
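A sketch of how pairwise correlations between country ratings might be computed from the coded scores, aligned by script. The countries shown, the rating values, and the choice of Pearson's r via scipy are all assumptions for illustration; the slides do not say which correlation coefficient or software was used.

```python
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical 1-9 codes for the same seven scripts, one list per country.
country_ratings = {
    "Denmark":  [4, 5, 5, 6, 2, 7, 5],
    "Estonia":  [4, 5, 6, 6, 3, 7, 4],
    "Slovenia": [5, 5, 5, 7, 2, 8, 5],
}

# Correlate every pair of countries on their shared scripts.
for a, b in combinations(country_ratings, 2):
    r, p = pearsonr(country_ratings[a], country_ratings[b])
    print(f"{a} vs {b}: r = {r:.2f}, p = {p:.3f}")
```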
Conclusions • Such a project is indeed needed!
Way forward • 1 L1 script, 12 L2 scripts, 2 L3 scripts • Analysis of scripts: good benchmarks? • Collecting more scripts, particularly at L3 • Scripts based on a variety of task types • Did we start at the wrong end? • Looking at scripts that caused disagreement • Can we reach agreement? • What features make them problematic? • Expanding the circle to include more countries
References • EALTA website: http://www.ealta.eu.org • Council of Europe. 2009. Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR): http://www.coe.int/t/dg4/linguistic/Manuel1_EN.asp