Investigating the impact of empirically-developed rating scales on the assessment of students' writing in the FPE Part I: Report Writing
A Presentation for PDD, Language Centre
May 9, 2012
by Farah Bahrouni (bahrouni@squ.edu.om)
Outline:
• Claim & Questions
• Study
• Methodology
• Data collection
• Analysis: FACETS & One-Way ANOVA
• Results
• Conclusion
• Implication & Significance
Claim
Analytic scales:
• recognize a priori that language learners may not learn language skills at the same pace and may display different levels in different skills at a given time (Hamp-Lyons, 1991; Weigle, 1994, 2002)
• diagnose students' strong and weak areas (Bachman, 1990, 2007; Bachman & Palmer, 1996; Fulcher, 2000, 2010, 2011; North, 2003)
• help teachers give constructive feedback (Alderson, 1991, 2007; Alderson et al., 1995)
• help teachers and students focus on weaknesses
• give a picture of how much of the curriculum/LOs a student has achieved
• help teachers reflect on their teaching = positive washback (Alderson, 1991, 2007; Alderson et al., 1995; Bachman, 1990, 2007; Bachman & Palmer, 1996; North, 2000, 2003; North & Schneider, 1998)
However, I argue that analytic scales as they are presented in the literature (see Bachman & Palmer, 1996, pp. 214-216, 275-280; Hamp-Lyons, 1992, cited in North, 2003, pp. 78-79; Hughey et al., 1983; Shohamy, 1992, p. 33, to mention only a few) are still holistic in nature, in terms both of construct definitions and of the band descriptors of the language components being assessed.
In a multi-cultural context such as ours, where more than 230 teachers from about 30 different countries work, such scales are still not enough to do away with idiosyncrasies in the assessment of writing. I am inclined to believe that more rigorous scales would leave minimal leeway for raters to call on their personal experience to interpret vague descriptors (Brindley, 1998).
Continuum of scoring methods (Hunter et al., 1996, p. 64):

Holistic approaches ← General impression scoring · Holistic scoring · Primary trait scoring · Analytic scoring · Atomistic scoring → Analytic approaches

The further to the right of this continuum a scale falls, the better it functions in a multi-cultural context and the more reliable the results it yields; hence this study.
Action: develop a new set of rating scales and compare it to the one currently in use.
Questions:
1) Which of the two sets of rating scales, the one currently in use or the newly-devised one, functions better in terms of the attributes listed on the next slide?
2) Are there any significant differences between the two sets of scales
   a. taken as whole sets (i.e. looking at the total mark, the sum of the scores from the four components)?
   b. in terms of categories?
3) Which of the two sets of rating scales yields less variation among raters?
2. Study
What are we after? A rating scale functioning properly is expected to show:
• significant discrimination between candidates, i.e. a good distribution of abilities (one-way ANOVA is used to show this)
• high inter-rater and intra-rater consistency: we look at the SD (descriptive statistics) and the rater separation ratio (FACETS); the lower these are the better, since big differences between raters are not welcome (the closer to 0, the better)
• few raters marking either inconsistently (misfits) or over-consistently (overfits, e.g. by overusing the central categories of the rating scale); FACETS fit measures should fall between 0.6 and 1.6
• measurement values increasing as the scale points get higher, so that a higher score means higher ability in the construct being assessed
• all points on the scale being used
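To make the first two checks concrete, here is a minimal sketch on simulated marks. The raters × candidates layout, the score range, and all numbers are assumptions for illustration, not the study's data, and FACETS itself works differently (many-facet Rasch measurement).

```python
# Minimal sketch of two of the checks above, on simulated marks.
# The layout and numbers are illustrative assumptions, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_raters, n_candidates = 5, 10
ability = rng.uniform(8, 18, n_candidates)            # true ability per candidate
noise = rng.normal(0, 1.5, (n_raters, n_candidates))  # rater noise
scores = np.clip(ability + noise, 0, 20)              # marks out of 20

# (1) Discrimination between candidates: one-way ANOVA with candidates
# as groups; a significant F means the scale separates ability levels.
groups = [scores[:, c] for c in range(n_candidates)]
f_stat, p_val = stats.f_oneway(*groups)
print(f"candidate discrimination: F = {f_stat:.2f}, p = {p_val:.4f}")

# (2) Inter-rater agreement: spread of the rater means; the smaller the
# SD, the closer the raters are to one another.
rater_means = scores.mean(axis=1)
print(f"SD of rater means = {rater_means.std(ddof=1):.2f}")
```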
2.1 Data collection (I)
• Studied the FPE LOs: identified 22 testable writing LOs
• Questionnaire on the writing features to be assessed, given to the CCs & programme teachers with a Likert scale
• Selection: items ranked at 3 & 4, confirmed with the CCs
• Scrutinized 65 students' live reports for extra features = 38 features in all
• QUAL data: teachers suggested the number of categories, what features go in each category, the number of levels/points on the scale for each category, and a description/definition of what each level/point means
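As a rough illustration of the Likert-based selection step, the fragment below keeps only features that teachers rated mostly at 3 or 4. The feature names, ratings, and the median-based rule are invented for illustration; the study's actual items came from the FPE LOs.

```python
# Hypothetical illustration of the selection rule "items ranked at 3 & 4":
# keep a feature when the teachers' median rating is 3 or above.
import statistics

ratings = {                      # invented 4-point Likert responses
    "topic sentences":  [4, 4, 3, 4, 3],
    "passive voice":    [2, 3, 2, 1, 2],
    "report structure": [4, 4, 4, 3, 4],
}

selected = [f for f, r in ratings.items() if statistics.median(r) >= 3]
print(selected)  # -> ['topic sentences', 'report structure']
```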
2.2 Data collection (II)
Write:
• construct definitions based on Bachman & Palmer's (1996) communicative approach
• definitions of performance levels based on the LOs, teachers' responses, and the 65 studied reports
Piloting: 7 teachers scored 10 samples twice with each set of scales (RS1 and RS2); in the analysis, 2 raters were discarded, leaving 5.
Analysis: FACETS + one-way ANOVA
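Since each pilot teacher scored the same 10 samples twice, a simple intra-rater index is the correlation between a rater's two rounds. A minimal sketch on simulated marks follows; the study itself used FACETS for this.

```python
# Sketch of an intra-rater consistency check: correlate one rater's two
# scoring rounds over the same 10 samples. All marks are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
round1 = rng.uniform(6, 18, 10)            # first-round marks
round2 = round1 + rng.normal(0, 1.0, 10)   # second round with some drift

r, p = stats.pearsonr(round1, round2)
print(f"intra-rater consistency: r = {r:.2f} (p = {p:.4f})")
```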
Results
1. Descriptive statistics
   1.1 Excel (Appendix 1)
   1.2 ANOVA (Appendix 2)
2. Category statistics from FACETS & ANOVA (Appendix 3)
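One way to operationalize question 3 ("which set yields less variation among raters?") is to take, for each sample, the SD of the raters' marks under each scale and compare the two distributions of SDs. This is my own hedged sketch on simulated data, not the study's actual RS1/RS2 scores or its analysis.

```python
# Sketch of a rater-variation comparison for question 3: per-sample SD
# across raters under each scale, then a t-test on those SDs.
# All marks are simulated, not the study's results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_raters, n_samples = 5, 10
rs1 = rng.normal(12, 2.5, (n_raters, n_samples))  # marks under scale RS1
rs2 = rng.normal(12, 1.5, (n_raters, n_samples))  # marks under scale RS2

sd1 = rs1.std(axis=0, ddof=1)   # per-sample rater disagreement, RS1
sd2 = rs2.std(axis=0, ddof=1)   # per-sample rater disagreement, RS2
t, p = stats.ttest_ind(sd1, sd2)
print(f"mean rater SD: RS1 = {sd1.mean():.2f}, RS2 = {sd2.mean():.2f}; p = {p:.4f}")
```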
3. Implication & significance
Analysis indicates that:
• RS2 functions more effectively than RS1 in most of the areas investigated in the study
• the optimum number of points/levels to have on the scale is 5
• the Language Use category should be divided into more sub-categories
• title and length (2 sub-categories of Content) are over-weighted (changes made in version 3, which still needs to be piloted)
For good results, RS2 needs to be used along with anchor papers representing the different levels. Teacher training would also help: research suggests that it can reduce, but not eliminate, raters' tendency towards overall severity or leniency in assessing performance (Lumley & McNamara, 1995; McNamara, 1996; McNamara & Adams, 1991; Weigle, 1998).
• Teachers' involvement in defining what they think should be assessed in students' writing, and in describing the levels of performance (what labels such as 'Excellent', 'Good', or 'Poor' stand for), helped them reach a more common understanding of the language aspects being assessed and a shared interpretation of the score descriptions.
• The rating scales I have developed are 'home-made': based on the LOs and tailored to the FPE, and therefore to the Language Centre's needs. With some adaptation, they can be extrapolated to essay writing.
• They can be generalised to any similar multi-cultural context to produce a less personalized, more institutionalized, and more objective assessment of students' writing performance.
REFERENCES
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (pp. 71-86). London and Basingstoke: Macmillan.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.
Brindley, G. (1998). Describing language development? Rating scales and SLA. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between Second Language Acquisition and Language Testing Research. Cambridge: Cambridge University Press.
Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.
Fulcher, G. (2010). Practical Language Testing. London: Hodder Education, an Hachette UK company.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1), 5-29.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing Second Language Writing in Academic Contexts (pp. 241-276). Norwood, NJ: Ablex.
Hunter, D. M., Jones, R. M., & Randhawa, B. S. (1996). The use of holistic versus analytic scoring for large-scale assessment of writing. The Canadian Journal of Program Evaluation, 11(2), 61-85.
North, B. (2000). The Development of a Common Framework Scale of Language Proficiency. Theoretical Studies in Second Language Acquisition. New York: P. Lang.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217-263.
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. University of California, Los Angeles.
Weigle, S. C. (2002). Assessing Writing. Cambridge: Cambridge University Press.

Thank you