
ESL essay raters’ cognitive processes



Presentation Transcript


  1. ESL essay raters’ cognitive processes Paula Winke and Hyojung Lim Michigan State University winke@msu.edu hyojung@msu.edu

  2. This is a study of rater behavior

  3. This is a study of rater behavior • How does a rater make scoring decisions? What does a rater pay attention to when rating?

  4. This is a study of rater behavior • Language testers need to know whether construct-irrelevant variation in scores stems from how raters approach and think about a rubric.

  5. This is a study of rater behavior • Empirical studies on raters’ cognitive processes are scarce (especially with analytic scoring), and findings are inconsistent.

  6. Previous findings • Raters focus on different features in essays when scoring and weight the scoring categories differently (Cumming et al., 2002; Eckes, 2008; Orr, 2002).

  7. Previous findings • Sometimes they consider external features that are not even described in the rubric (Barkaoui, 2010; Lumley, 2005; Vaughan, 1991).

  8. Previous findings • Raters may have different attentional foci when scoring, and their foci may depend on • the scale type (holistic vs. analytic), • the rater’s experience (expert vs. novice), • the rater’s L1 and even L2 background.

  9. The current study We’d like to know… • How raters cognitively process (i.e., use) an analytic rubric while rating ESL essays • Whether variability in processing (difference in rubric usage) is associated with lower inter-rater reliability

  10. Research Questions • To which parts of an analytic rubric do raters pay the most attention (measured as total fixation duration and visit count)? • Are inter-rater reliability statistics on the subcomponents of an analytic rubric related to the amount of attention paid to those subcomponents?

  11. Method • 9 raters, all ESL instructors in the same English-language program at a large, Midwestern university and native speakers of English. • Each rated 40 essays (4 prompts * 10 essays). • Analytic rating scale: Currently used at the language program; it is a modified version from Jacobs et al. (1981) – content, organization, vocabulary, language use, and mechanics • Tobii TX300 eye-tracker: The rubric was installed in the Tobii Studio program.

  12. [Screenshot of the analytic rubric, with columns for Content, Organization, Vocabulary, Language Use, and Mechanics]

  14. The data collection set-up [diagram labels: Rubric, Essay, Score; 64 cm viewing distance]

  15. Procedure

  16. The data

  17. Data Analysis • To quantify attention: total fixation duration (divided by the number of words in each category) and visit count • To observe the rating process: time to first fixation, gaze plots, and heat maps (Bax & Weir, 2012) • Inter-rater reliability: the intraclass correlation coefficient (ICC) and reliability adjusted by the Spearman-Brown prophecy formula • Statistics: the Kruskal-Wallis test with Mann-Whitney post hoc tests

  18. Results • In general, raters read the rubric from left to right, moving from content through organization, vocabulary, and language use to mechanics. Oftentimes (71 times, to be specific), mechanics was overlooked.
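
The "overlooked" figure is a count of rating events in which the mechanics area of the rubric received no fixations. A minimal sketch of how such a count could be derived, assuming a hypothetical per-rating table of fixation counts per rubric category (the field names and numbers are invented):

```python
# Hypothetical data: one dict per rating event,
# mapping rubric category -> fixations within that area of interest.
ratings = [
    {"content": 12, "organization": 9, "vocabulary": 5, "language_use": 4, "mechanics": 0},
    {"content": 10, "organization": 7, "vocabulary": 6, "language_use": 3, "mechanics": 2},
    {"content": 8,  "organization": 11, "vocabulary": 4, "language_use": 5, "mechanics": 0},
]

def times_overlooked(ratings, category):
    """Count rating events in which a category received no fixations."""
    return sum(1 for r in ratings if r[category] == 0)

print(times_overlooked(ratings, "mechanics"))  # -> 2 in this toy data
```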

  19. Results • Organization received the most attention (in terms of fixation duration and visit count) and showed the highest inter-rater reliability; raters attended least to, and agreed least on, mechanics. (chart labels: r = .75, r = .90)

  20. Results • From a qualitative review of the videos and heatmaps in comparison with each rater’s inter-rater reliability estimate, we believe that raters who agreed the most had common attentional foci, whereas those who agreed the least did not.

  21. Incongruous Raters • Raters 1 and 7 were the most incongruous, with the lowest inter-rater reliability for the total score (.45) and the second-lowest reliability for content (.36) and mechanics (.28). • Because the scores for Essay 2 had the largest standard deviation, we examined the heat maps for Essay 2 from Raters 1 and 7.

  22. Essay 2 Rater 1

  23. Essay 2 Rater 7

  24. Agreeing Raters • Raters 6 and 8 had the highest correlation coefficient for total scores (r = .79) as well as for the sub-scores for content (r = .75) and mechanics (r = .67). • Given that the scores for Essay 8 showed the smallest standard deviation, the heat maps for Essay 8 were compared between Raters 6 and 8.
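
The pairwise agreement figures on these slides are Pearson correlations between two raters' scores over the same essays. A self-contained sketch with hypothetical scores (the data are invented, not the study's):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two raters' scores for the same essays."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical total scores from two raters on five essays
rater_a = [70, 75, 68, 80, 72]
rater_b = [72, 74, 66, 81, 73]
print(round(pearson_r(rater_a, rater_b), 2))
```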

  25. Essay 8 Rater 6

  26. Essay 8 Rater 8

  27. Discussion • Raters’ attention and inter-rater reliability • More attention leads to higher inter-rater reliability with analytic scoring (in contrast, greater care and attention decrease reliability with holistic scoring; Wolfe, 1997). • Raters who showed higher inter-rater reliability showed similar reading patterns – reading a relatively large area of the rubric and sharing common patterns of attentional foci.

  28. Discussion • The effect of the layout • With an analytic scale, raters’ decision-making behaviors tend to operate within the scope of the given guidelines (Smith, 2000). • Part of the guidelines is the order of the categories. We think that raters gave the most attention to content and organization and the least attention to mechanics because of a primacy effect. • It has to do with rubric real estate.

  29. Discussion • In Lumley’s (2005) study, the conventions of presentation (spelling, punctuation, script layout) received the second most attention after content – more attention than organization and grammar. • In his study, the conventions of presentation came second after content in the rubric. • This may also be evidence of a primacy effect.

  30. Discussion • Raters may use the rubric mainly to justify or adjust scores for an essay they have already made decisions about. By the time they finished reading an essay, raters seemed to know where its quality would fall in the grid of the analytic rubric. • Raters who showed higher inter-rater agreement appeared to look through more descriptors across levels; those who didn’t seemed to stick to their initial judgment.

  31. Limitations & Future Directions • The eye-movement data don’t fully explain why raters paid more attention to certain categories or whether raters considered non-criterion features; analysis of our stimulated-recall interview data is needed. • We don’t know if there was any halo effect across essays in the rating process. • Information is lacking on how raters read the essays and how they went back and forth between the essays and the rating scale. • We have collected data for a second study in which both the rubric and the essay are on screen, and data for a third study investigating potential halo effects.

  32. Questions or comments? Paula Winke winke@msu.edu Hyojung Lim hyojung@msu.edu

  33. Notes on Essays • We assembled a stratified sample of 40 essays from prior ESL placement tests at a large Midwestern university. We culled four sets of 10 essays, each set from one of four scoring bands (64 and below, 65-69, 70-74, and 75 and above; see the supplemental material that accompanies the online version of this manuscript). We balanced the selection of the 40 essays equally across four prompts, with two to three essays at each score band responding to each of these prompts: • Do you think it is better for people to make their purchases online or to go shopping in stores and malls? Use specific details and examples to explain your answer. • Some people say that all international students who are studying English should have an American roommate for at least one year. What is your opinion on this topic? • Some employees have bosses that they really like working for, while others have bosses that they absolutely hate. What are the most important qualities of a good boss at work, and why? • If you had the choice, would you rather take a college course online or have the same class face to face with an instructor and classmates in a classroom? Use specific details and examples to explain your answer. • The length of student essays was limited to one page so that raters did not need to flip pages while rating. The order of the 10 essays within each prompt set was randomized, and the order of the four prompt sets was counterbalanced across raters. A packet of 40 copied essays was prepared for each rater, and raters were allowed to write on the essays while rating. Additionally, we selected two more essays for norming, drawn from the middle two score bands (65-74).

  34. Notes on Time to 1st Fixation • The mean rank is the result of the Kruskal-Wallis test.

  35. Eye fixation duration with number of words controlled • Note. Measurement units are seconds (e.g., 10.720 seconds). Mean ranks are the result of the Kruskal-Wallis test.
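
The mean ranks reported in these notes come from the Kruskal-Wallis procedure: pool all observations, assign average ranks (ties share the mean of their rank positions), then average the ranks within each group. A small sketch with toy fixation durations (the numbers are invented):

```python
def mean_ranks(groups):
    """Kruskal-Wallis-style mean ranks for a dict of group -> observations."""
    pooled = sorted(v for g in groups.values() for v in g)
    # Average rank for each distinct value (1-based ranks, ties averaged)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of rank positions i+1 .. j
        i = j
    return {name: sum(rank_of[v] for v in g) / len(g)
            for name, g in groups.items()}

# Toy fixation durations (seconds) per rubric category
durations = {
    "organization": [12.1, 10.7, 11.5],
    "mechanics": [2.3, 3.1, 2.8],
}
print(mean_ranks(durations))  # organization ranks highest in this toy data
```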
