Using an Error-Annotated Learner Corpus to Develop an ESL/EFL Error Correction System Na-Rae Han (University of Pittsburgh), Joel Tetreault (ETS), Soo-Hwa Lee (Chungdahm Learning, Inc.), Jin-Young Ha (Kangwon University) May 19, 2010, LREC 2010
Objective • A feedback tool for detecting and correcting preposition errors • I wait /for you. (<NULL,p>: omitted prep) • So I go to/ home quickly. (<p,NULL>: extraneous prep) • Adult give money at/on birthday. (<p1,p2>: selection error) • Why preposition errors? • Preposition usage is one of the most difficult aspects of English for non-native speakers • 18% of sentences from ESL essays contain a preposition error (Dalgish, 1985) • 8-10% of all prepositions in TOEFL essays are used incorrectly (Tetreault and Chodorow, 2008)
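The <s,c> notation pairs the student's preposition choice s with the correct preposition c. A minimal sketch of how the three error types fall out of that pair; the function name is illustrative, not from the paper:

```python
# Illustrative sketch of the <s, c> annotation scheme: s is the student's
# preposition choice, c the correct one; "NULL" marks an absent preposition.
def error_type(s: str, c: str) -> str:
    if s == c:
        return "non-error"
    if s == "NULL":
        return "omission"    # <NULL, p>: "I wait /for you."
    if c == "NULL":
        return "extraneous"  # <p, NULL>: "So I go to/ home quickly."
    return "selection"       # <p1, p2>: "Adult give money at/on birthday."

print(error_type("NULL", "for"))  # omission
print(error_type("at", "on"))     # selection
```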
Diagnosing L2 Errors • Statistical modeling on large corpora. But what kind? • General corpora composed of well-edited texts by native speakers ("native speaker corpora"): the currently dominant approach • Error-annotated learner corpora, consisting of texts written by ESL learners: our approach
Our Learner Corpus • Chungdahm English Learner Corpus • A collection of English essays written by Korean-speaking students of the Chungdahm Institute, operated in South Korea • 130,754,000 words in 861,481 essays, written on 1,545 prompts • Over 6.6 million error annotations in 4 categories: grammar, strategy, style, substance • Non-exhaustive error marking (more on this later)
The Preposition Data Set • The 11 "preposition" types: NULL, about, at, by, for, from, in, of, on, to, with; these represent 99% of student preposition error tokens in the data • The data set consists of 20.5 million words • 117,665 preposition errors • 1,104,752 preposition non-errors • Preposition error rate as marked in the data: 9.6%
Method • Cast error correction as a classification problem • Train an 11-way Maximum Entropy classifier on preposition events extracted from the Chungdahm corpus • A preposition annotation is represented as <s,c> (s: student's prep choice, c: correct preposition), where s and c range over: { NULL, about, at, by, for, from, in, of, on, to, with } • s≠c for prep errors; s=c for non-errors • A preposition event consists of: • Outcome (prediction target): c • Contextual features extracted from the immediate context surrounding the preposition token, including the student's original preposition choice (i.e., s)
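A minimal training sketch, assuming scikit-learn as a stand-in for the paper's Maximum Entropy toolkit (multinomial logistic regression is mathematically a MaxEnt model); the toy events and feature names are invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# The 11 outcome classes the classifier chooses among.
PREPS = ["NULL", "about", "at", "by", "for", "from", "in", "of",
         "on", "to", "with"]

# Toy events: (contextual features including the student's choice s,
# correct preposition c as the prediction target).
events = [
    ({"student_prep": "at", "L1": "money", "MOD": "give", "ARG": "birthday"}, "on"),
    ({"student_prep": "on", "L1": "money", "MOD": "give", "ARG": "birthday"}, "on"),
    ({"student_prep": "in", "L1": "interested", "MOD": "interested", "ARG": "science"}, "in"),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in events])
y = [c for _, c in events]

clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression = MaxEnt
clf.fit(X, y)

# Predict the most likely preposition for a new context.
print(clf.predict(vec.transform([{"student_prep": "at", "MOD": "give", "ARG": "birthday"}])))
```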
Preposition Context • Student prep choice + 3 words to left and right • MOD: head of the phrase modified by the prepositional phrase • ARG: noun argument of the preposition • MOD and ARG are identified using the Stanford Parser • Example text and annotation: see the extraction sketch below
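A hypothetical feature-extraction sketch for the ±3-word window, reusing the selection-error example from the objective slide. In the real system MOD and ARG come from a Stanford Parser analysis; here they are hard-coded to keep the sketch self-contained:

```python
def window_features(tokens, i, mod, arg):
    """Features for the preposition at position i: ±3-word window + MOD/ARG."""
    feats = {"student_prep": tokens[i], "MOD": mod, "ARG": arg}
    for k in range(1, 4):
        feats[f"L{k}"] = tokens[i - k] if i - k >= 0 else "<BOS>"
        feats[f"R{k}"] = tokens[i + k] if i + k < len(tokens) else "<EOS>"
    return feats

# "Adult give money at/on birthday." -- the student wrote "at" (position 3).
tokens = ["Adult", "give", "money", "at", "birthday", "."]
feats = window_features(tokens, 3, mod="give", arg="birthday")
print(feats)
```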
Event Representation • Each preposition occurrence is represented as an event • Outcome: in • Features: 24 in total (illustrated below)
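Continuing the illustration with the at/on example; only a representative subset of the paper's 24 features is shown:

```python
# One complete event: the feature dictionary (including the student's
# choice s) paired with the outcome c, i.e. the correct preposition.
event = {
    "outcome": "on",               # prediction target: correct preposition c
    "features": {
        "student_prep": "at",      # s: the student's original choice
        "L1": "money", "L2": "give", "L3": "Adult",
        "R1": "birthday", "R2": ".", "R3": "<EOS>",
        "MOD": "give", "ARG": "birthday",
    },
}
```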
Training and Testing • Training set: 978,000 events • The rest is set aside for evaluation and development • Creating an evaluation set for testing • Error annotation in the Chungdahm corpus is not exhaustive: many student errors are left unmarked by tutors • This necessitates creating a re-annotated evaluation set • 1,000 preposition contexts annotated by 3 trained annotators • Inter-annotator agreement: 0.860-0.910; kappa: 0.662-0.804
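For reference, a minimal sketch of how pairwise agreement and kappa can be computed, assuming scikit-learn's cohen_kappa_score; the annotator labels are invented:

```python
from sklearn.metrics import cohen_kappa_score

# Invented labels from two annotators over the same preposition contexts.
ann1 = ["on", "on", "in", "NULL", "at", "on", "for", "to"]
ann2 = ["on", "at", "in", "NULL", "at", "on", "for", "to"]

agreement = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)  # raw agreement
kappa = cohen_kappa_score(ann1, ann2)                            # chance-corrected
print(f"agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```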
Evaluation Results • 11-way classification: works as an error correction (multi-outcome decision) model, and can be backed off to an error detection (binary decision) model • Omission errors (I wait /for you.) *Error detection is trivial for this type • Extraneous prep errors (So I go to/ home quickly.) • Selection errors (Adult give money at/on birthday.)
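A sketch of the back-off from 11-way correction to binary detection, assuming the scikit-learn classifier from the earlier sketch: flag an error whenever the model's preferred preposition differs from the student's choice, gated by a confidence threshold (the 0.7 value is an assumption, not from the paper):

```python
def detect_error(clf, vec, feats, threshold=0.7):
    """Binary back-off: flag an error iff the model's best preposition
    differs from the student's choice with sufficient confidence."""
    probs = clf.predict_proba(vec.transform([feats]))[0]
    best = clf.classes_[probs.argmax()]
    is_error = best != feats["student_prep"] and probs.max() >= threshold
    return is_error, best  # (flag, suggested correction)
```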
Related Work • Chodorow et al. (2007) • Error detection model targeting 34 prepositions • Trained on San Jose Mercury News + Lexile data • 0.88 precision, 0.16 recall for detecting selection errors • Gamon et al. (2008) • Error detection and correction model of 13 prepositions • One classifier to determine whether a preposition/article should be present, another for the correct choice, plus an additional filter • Trained on MS Encarta data, tested on Chinese learner writing • 80% precision; recall not reported • Izumi et al. (2003, 2004) • Trained on the Standard Speaking Test Corpus (Japanese) • 56 speakers, 6,216 sentences • 25% precision and 7% recall on 13 grammatical error types
Comparison: Native-Corpus-Trained Models • Question: Will models trained on native-speaker-produced texts outperform our model? • The advantage of native corpora: they are plentiful, so we allowed these models larger training sets • Experimental setup: • Build models on native corpora, using varying training set sizes (1 million to 5 million events) • Data: the Lexile Corpus, 7th and 8th grade reading levels • A comparable feature set was employed
Learner Model vs. Native Models • Testing results on learner data (replacement errors only): • The learner model outperforms all native models • Native models: the performance gain from larger training sizes becomes insignificant beyond the 2-3 million mark
What Does This Prove? • Are the native models flawed? A bad feature set? • No. In-set testing (against held-out native text) shows performance levels comparable to those in published studies • Could some of the performance gap be due to genre differences? • Highly likely. However, 7th-8th grade reading materials were the closest match we could find to the student essays. • In sum: the native models' advantage of a larger training size does not outweigh the learner model's advantages: genre/text similarity and error annotation
Discussion: Learner language vs. native corpora • Modeling on native corpora: • Produces a one-size-fits-all model of "native" English • More generic and universally applicable? • Modeling on a learner corpus: • Produces a model specific to the particular learner language • Can it be applied to the language of other learner groups? • e.g., French-speaking or Japanese-speaking English learners? • Combining the two approaches: • A system with specific models for different L1 backgrounds • Plus a back-off "generic" model, built on native corpora (see the sketch below)
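A hypothetical sketch of that combined architecture: per-L1 models with a generic, native-corpus-trained fallback. The class and lookup scheme are assumptions, not from the paper:

```python
class PrepositionCorrector:
    """Per-L1 models with a generic, native-corpus-trained fallback."""

    def __init__(self, l1_models, generic_model):
        self.l1_models = l1_models    # e.g. {"Korean": korean_model, ...}
        self.generic = generic_model  # trained on native corpora

    def predict(self, features, l1=None):
        # Back off to the generic model when the learner's L1 is unknown
        # or no dedicated model exists for it.
        model = self.l1_models.get(l1, self.generic)
        return model.predict(features)
```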
Discussion: The Problem of Partial Error Annotation • The partial error annotation problem: • 57% of replacement errors and 85% of extraneous prepositions are left unmarked by Chungdahm tutors • The training data therefore includes conflicting evidence • Our model's low recall and high precision are a consequence: • The model assumes a lower-than-true error rate • The model has to reconcile conflicting sets of evidence • When the model does flag an error, it does so with high confidence and accuracy • Solution? Bootstrapping: relabeling of unannotated errors (see the sketch below)
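A hypothetical bootstrapping sketch: use the current high-precision model to relabel likely unmarked errors among the "non-error" training events, then retrain on the relabeled data (the 0.9 threshold is an assumption):

```python
def relabel(clf, vec, events, threshold=0.9):
    """Flip 'non-error' events that the model confidently contradicts."""
    relabeled = []
    for feats, c in events:
        if feats["student_prep"] == c:  # tutor left this token unmarked
            probs = clf.predict_proba(vec.transform([feats]))[0]
            best = clf.classes_[probs.argmax()]
            if best != c and probs.max() >= threshold:
                c = best  # treat as a missed error and relabel
        relabeled.append((feats, c))
    return relabeled  # retrain the classifier on this set and iterate
```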
Conclusions • As language instruction turns digital, more and more (partially) error-annotated learner corpora like the Chungdahm corpus will become available • Building a direct model of L2 errors, whenever such data is available, offers an advantage over models based on native corpora, despite any partial-annotation problem • Exhaustive annotation is not necessary: learner-corpus-trained models can outperform standard native-text-trained models built on much larger training data sets