Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring
EALTA, Athens, May 9-11, 2008
Claudia Harsch, IQB / Guido Martin, IEA DPC
Overview
• Background: standards-based assessment in Germany; here: writing in EFL
• Writing tasks and rating approach
• Feasibility Studies
- Feasibility Study I, May 2007: trial of scales and approach
- Feasibility Study II, June 2007: trial of holistic vs. analytic approach
• Pilot Study, July/August 2007: training
• Comparison FS II vs. summer training
Background: Assessing Educational Standards in Germany
• Evaluation of the Educational Standards for grades 9 and 10 by the IQB, Berlin
• In foreign languages, the standards are linked to the CEF, targeting A2 for the lower track and B1 for the middle track of secondary school
• Assessment of the "four skills": reading, listening, writing and speaking (under development)
• Tasks based on CEF levels A1 to C1; uni-level approach
Assessment of Writing Tasks
Criteria of assessment, each defined by descriptors based on the CEF, the Manual and Into Europe:
• task fulfilment
• organisation
• grammar
• vocabulary
• overall impression
Rating approach
• A uni-level approach: each task is graded in line with its specific target level
• Performances are graded on a below / pass / pass plus basis
• "Holistic approach": each criterion rating is the result of a weighted assessment of several descriptors (see the sketch below)
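The slides do not spell out the weighting scheme, so as an illustration only, compiling one criterion rating from weighted descriptor judgements might look like the following minimal Python sketch; the descriptor names and weights are hypothetical.

```python
# Minimal sketch: one criterion rating compiled from weighted descriptor
# judgements on the below / pass / pass plus scale. The descriptor names
# and weights are invented for illustration.

SCORE = {"below": 0, "pass": 1, "pass plus": 2}
LABEL = {0: "below", 1: "pass", 2: "pass plus"}

def criterion_rating(descriptor_ratings, weights):
    """Weighted mean of the descriptor scores, rounded to the nearest band."""
    total = sum(SCORE[r] * weights[d] for d, r in descriptor_ratings.items())
    return LABEL[round(total / sum(weights[d] for d in descriptor_ratings))]

# Hypothetical descriptors for the criterion "task fulfilment"
weights = {"content points": 2.0, "register": 1.0, "length": 0.5}
ratings = {"content points": "pass", "register": "pass plus", "length": "pass"}
print(criterion_rating(ratings, weights))  # -> pass
```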
Feasibility Study I, May 2007
Aims
• Trial the training and rating approach with student teachers
• Gain insight into the scales and criteria
• Get feedback on the accessibility of the handbooks, benchmarks and coding software
Procedure
• 2 tasks: A2 "Lost dog" / B1 "Keeper for a day"
• 6 raters: student teachers of English, proficient in writing English
• First training session (1 day): introduction to the CEF, scales and tasks
• Practice 1: 30 scripts per task (over 1 week)
• Second training session (1 day): evaluation and discussion of the practice results
• Practice 2: 28 scripts per task (over 1 week)
• Evaluation of the results in terms of rating reliability
Feasibility Study I, May 2007
Evaluation: Assessing Rater Reliability
Index used: percent agreement with mode (a minimal computation is sketched below)
• Measures the percentage of agreement with the value awarded most often, at the level of individual ratings
• Can be aggregated at the item (variable) and rater level
• Easily interpreted
• No assumptions about scale level
• No assumptions about value distributions
• No estimation errors
• Can be interpreted as a proxy for validity
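A minimal sketch of this index, assuming the usual reading of "agreement with the mode" (ties for the modal value are broken arbitrarily here; the study's tie-handling is not documented on the slides):

```python
from collections import Counter

def agreement_flags(ratings):
    """ratings: one script/item as {rater: value}. Flags each rating
    1 if it equals the modal value, else 0."""
    mode, _ = Counter(ratings.values()).most_common(1)[0]
    return {rater: int(value == mode) for rater, value in ratings.items()}

def percent_agreement(all_ratings, rater=None):
    """Aggregates the flags per rater if one is named (rater level),
    else pooled over all raters (item level)."""
    flags = []
    for ratings in all_ratings:
        f = agreement_flags(ratings)
        flags.extend([f[rater]] if rater is not None else f.values())
    return 100.0 * sum(flags) / len(flags)

# Hypothetical ratings by raters r1-r3 on two scripts for one item
data = [{"r1": "pass", "r2": "pass", "r3": "below"},
        {"r1": "pass plus", "r2": "pass plus", "r3": "pass plus"}]
print(percent_agreement(data))              # item level: 83.3...
print(percent_agreement(data, rater="r3"))  # rater level: 50.0
```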
Outcome Feasibility Study I, May 2007
[Charts omitted: reliability per item; reliability per rater and item]
Outcome Feasibility Study I, May 2007
• The approach appears feasible
• The scales seem usable and applicable
• BUT: we do not know what raters do at the sub-criterion level
• Need to further explore rating behaviour at the descriptor level => Feasibility Study II
Feasibility Study II, June 2007
Comparison:
• Holistic scores for the five criteria (as in FS I)
• Scoring each descriptor on its own and, in addition, scoring the criteria "holistically" (FS II)
Reasons behind this:
• With "below" / "pass" / "pass plus" in a uni-level approach targeting a specific population, ratings tend towards the "pass" value
• Similar criterion-level outcomes can therefore be achieved by purely random value distributions at the descriptor level (illustrated in the simulation below)
• Data on scoring each descriptor show whether raters interpret the descriptors uniformly before compiling them into the weighted overall criterion rating
• Reliable usage of the descriptors is a precondition for valid ratings at the criterion level
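To make the "purely random" point concrete, here is an illustrative simulation that is not part of the study: six raters assign descriptor values at random with a pass-heavy distribution, and the compiled criterion ratings (a simple rounded mean here, standing in for the weighted compilation) still show markedly higher agreement with the mode than the descriptor ratings do.

```python
# Illustrative simulation only: random, pass-heavy descriptor ratings still
# yield high criterion-level agreement, because deviations average out when
# descriptor scores are compiled into a criterion rating.
import random

random.seed(1)
N_RATERS, N_DESC, N_SCRIPTS = 6, 5, 200
VALUES, WEIGHTS = [0, 1, 2], [0.2, 0.6, 0.2]  # below / pass / pass plus

def mode_agreement(values):
    """Share of ratings that match the most frequent value."""
    return values.count(max(set(values), key=values.count)) / len(values)

desc_agree, crit_agree = [], []
for _ in range(N_SCRIPTS):
    scores = [[random.choices(VALUES, WEIGHTS)[0] for _ in range(N_DESC)]
              for _ in range(N_RATERS)]                  # raters x descriptors
    for d in range(N_DESC):                              # descriptor level
        desc_agree.append(mode_agreement([s[d] for s in scores]))
    crit = [round(sum(s) / N_DESC) for s in scores]      # criterion level
    crit_agree.append(mode_agreement(crit))

print(f"descriptor-level agreement: {sum(desc_agree) / len(desc_agree):.0%}")
print(f"criterion-level agreement:  {sum(crit_agree) / len(crit_agree):.0%}")
# typically prints roughly 65% vs. 95%
```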
Outcome Feasibility Study II, June 2007
• The fairly high agreement on criterion-level ratings is NOT the result of a uniform interpretation of the descriptors ...
• ... BUT rather results from deviations at the descriptor level cancelling each other out when the criterion ratings are compiled
• Rating holistic criteria by evaluating several pre-defined descriptors can only be valid if the descriptors are understood uniformly by all raters
• The descriptors need to be revised
• Training and assessment in the pilot study have to be conducted at the descriptor level in order to control rating behaviour
Background: Pilot Study
• Sample size: N = 2,932
• Number of items: listening 349, reading 391, writing 19 tasks
• n = 300-370 per item (M = 330)
• All Länder
• All school types
• 8th, 9th and 10th graders
Summer Training
• 13 raters, selected on the basis of English language proficiency, study background and the DPC coding test
• Challenge: piloting the tasks, rating approach and scales simultaneously
• First one-week seminar:
- Introduction to the CEF, scales and tasks
- Introduction to the rating procedures
- Introduction of the benchmarks
Summer Training
• 6 one-day sessions:
- Weekly practice
- Discussion and evaluation of the practice results
- Introduction of further tasks / levels
- Revision of the scale descriptors
• Five levels, 19 tasks: several levels and tasks had to be introduced simultaneously in order to control level and task interdependencies
• Three rounds of practice per task proved ideal:
1. Intro – practice
2. Feedback – practice
3. Feedback – practice
4. Evaluation of reliabilities
• …
Summer Training
• Second one-week seminar:
- Feedback on the last round of practice
- Addition of benchmarks for borderline cases
- Addition of detailed justifications for the benchmarks
- Finalisation of the scale descriptors
- Revision of the rating handbooks
Conclusion
Training concept for the future
• Materials are now prepared, so weekly seminars are no longer necessary
• Training and rating at the descriptor level
• Multiple one-day sessions, one per week, to give time for practice:
- Introduction
- Practice: 3 rounds per task proved ideal
- Feedback
Claudia Harsch
Phone: +49 (0)30 2093-5508
Fax: +49 (0)30 2093-5336
E-mail: Claudia.Harsch@IQB.hu-berlin.de
Website: www.IQB.hu-berlin.de
Address: Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany

Guido Martin
Phone: +49 (0)40 48 500 612
E-mail: guido.martin@iea-dpc.de
Website: www.iea-dpc.de
Address: IEA DPC, Mexikoring 37, D-22297 Hamburg, Germany