Native Judgments of Non-Native Usage: Experiments in Preposition Error Detection Joel Tetreault [Educational Testing Service] Martin Chodorow [Hunter College of CUNY]
Preposition Cloze Examples • “We sat _____ the sunshine” • in • under • by • at • “…the force of gravity causes the sap to move _____ the underside of the stem.” • to • on • toward • onto • through
Motivation • The number of non-native speakers in English schools has risen over the past decade • Currently 300 million ESL learners in China alone! • Highlights the need for NLP tools to assist in language learning • Evaluation of NLP learner tools: • Important for development • Error annotation: time-consuming and costly • Usually only one rater • [Izumi ’03; Eeg-Olofsson ’02; Han ’06; Nagata ’06; Gamon ’08]
Objective • Problem • Single human annotation has been used as the gold standard • Sidesteps the issue of annotator reliability • Objective • Show that rating preposition usage is actually a very contentious task • Recommend an approach to make annotation more feasible
Experiments in Rater Reliability • Judgments of Native Usage • Difficulty of preposition selection with cloze and choice experiments • Judgments of Non-Native Usage • Double-annotate a large ESL corpus • Show that one rater can be unreliable and skew system results • Sampling Approach • Propose an approach to alleviate the cost and time associated with double annotation
Background • Uses the preposition error detection system developed in [Tetreault and Chodorow ’08] • Performance: • Native text: as high as 79% • TOEFL essays: P=84%, R=19% • State-of-the-art when compared with other methods: [Gamon ’08], [De Felice ’08] • Raters: • Two native speakers (East Coast US) • Ages: 26, 29 • Two years of experience with other NLP annotation
(1) Human Judgments of Native Usage • Is the task of preposition selection in native texts difficult for human raters? • Cloze Test: • Raters presented with 200 Encarta sentences, with one preposition replaced with a blank • Asked to fill in the blank with the best preposition • “We sat _____ the sunshine”
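A minimal sketch of how a cloze item like the one above could be prepared by blanking out a preposition; the preposition inventory and function name are illustrative, not taken from the paper.

```python
# Turn a sentence into a cloze item by blanking its first preposition.
# The preposition inventory here is a small illustrative subset.
import re

PREPOSITIONS = {"in", "on", "at", "by", "under", "to", "toward", "onto", "through"}

def make_cloze(sentence):
    """Replace the first preposition in the sentence with a blank
    and return (cloze_sentence, removed_preposition)."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        word = re.sub(r"\W", "", tok).lower()
        if word in PREPOSITIONS:
            removed = word
            tokens[i] = "_____"
            return " ".join(tokens), removed
    return sentence, None

print(make_cloze("We sat in the sunshine"))
# -> ('We sat _____ the sunshine', 'in')
```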
(1) Native Usage – Cloze Test * System’s mismatch suggestions were often not as good as Raters’
(1) Native Usage – Choice Test • Many contexts that license multiple prepositions • Using an exact match can underestimate performance • Choice Test: • Raters presented with 200 Encarta sentences • Asked to blindly choose between system’s and writer’s preposition • “We sat {in/under} the sunshine”
(1) Native Usage – Choice Test • Results: • Both Raters 1 & 2 considered the system’s preposition equal to or better than the writer’s 28% of the time • So a system that scores 75% under the exact-match metric is actually performing as high as 82% • 28% of the 25% mismatch rate = +7%
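A minimal sketch of this adjustment, using the figures quoted above (75% exact match, 28% of mismatches judged equal to or better than the writer's choice); the function itself is illustrative.

```python
# Adjusting exact-match accuracy for mismatches that raters judged acceptable.
def adjusted_accuracy(exact_match, equal_or_better_share):
    """exact_match: fraction of cases where the system's preposition == the writer's.
    equal_or_better_share: fraction of mismatches where raters judged the
    system's preposition equal to or better than the writer's."""
    mismatch = 1.0 - exact_match
    return exact_match + mismatch * equal_or_better_share

print(adjusted_accuracy(0.75, 0.28))  # -> 0.82 (75% + 28% of 25% = +7%)
```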
(2) Human Judgments of Non-Native Usage • Using one rater can be problematic • linguistic drift, age, location, fatigue, task difficulty • Question: is using only one rater reliable? • Experiment: • Two raters double-annotate TOEFL essays for preposition usage errors • Compute Agreement/Kappa measures • Evaluate system performance vs. two raters
Annotation: Error Targeting • Schemes target many different types of errors [Izumi ’03] [Granger ’03] • Problematic: • High cognitive load on rater to keep track of dozens of error types • Some contexts have several different errors (many different ways of correcting) • Can degrade reliability • Targeting one error type reduces the effects of these issues
Annotation Scheme • Annotators were presented with sentences from TOEFL essays, with each preposition flagged • Preposition Annotation: • Extraneous • Wrong Choice – an incorrect preposition is used (list substitution(s)) • OK – preposition is perfect for that context • Equal – preposition is perfect, but others are acceptable as well (list them) • Then mark confidence in the judgment (binary)
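A sketch of one possible record format for this annotation scheme; the class and field names, and the example sentence, are illustrative rather than taken from the paper.

```python
# Illustrative data structure for one annotated preposition context.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PrepJudgment(Enum):
    EXTRANEOUS = "extraneous"      # preposition should not be there at all
    WRONG_CHOICE = "wrong_choice"  # incorrect preposition; substitutions listed
    OK = "ok"                      # preposition is perfect for the context
    EQUAL = "equal"                # perfect, but other prepositions also acceptable

@dataclass
class PrepAnnotation:
    sentence: str
    preposition: str
    judgment: PrepJudgment
    alternatives: List[str] = field(default_factory=list)  # substitutions or equally good choices
    confident: bool = True  # binary confidence in the judgment

# Made-up example record, not from the TOEFL data.
example = PrepAnnotation(
    sentence="We sat at the sunshine",
    preposition="at",
    judgment=PrepJudgment.WRONG_CHOICE,
    alternatives=["in"],
    confident=True,
)
```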
Procedure • Raters given blocks of 500 preposition contexts • Took roughly 5 hours per block • After two blocks each, raters did an overlap set of ~100 contexts (1336 contexts total) • Every overlap set was adjudicated by two other human raters: • Sources of disagreement were discussed with original raters • Agreement and Kappa computed
How well do humans compare? • For all overlap segments: • OK and Equal are collapsed to OK • Agreement = 0.952 • Kappa = 0.630 • Kappa ranged from 0.411 to 0.786
Confusion Matrix: Rater 1 vs. Rater 2 [the matrix’s cell counts are not preserved in this export]
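A minimal sketch of how observed agreement and Cohen's kappa can be computed from a two-rater confusion matrix; the cell counts below are placeholders, since the slide's actual matrix is not preserved here.

```python
# Observed agreement and Cohen's kappa from a square confusion matrix.
# The counts are made-up placeholders; only the computation is the point.
def agreement_and_kappa(matrix):
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of contexts with identical labels.
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Expected agreement from each rater's marginal label distribution.
    p_e = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(len(matrix))
    )
    return p_o, (p_o - p_e) / (1 - p_e)

# Rows = Rater 1's labels, columns = Rater 2's labels (e.g., OK vs. Error).
placeholder_matrix = [
    [1200, 30],
    [35, 70],
]
p_o, kappa = agreement_and_kappa(placeholder_matrix)
print(f"agreement={p_o:.3f}, kappa={kappa:.3f}")
```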
Implications for System Evaluation • Comparing a system [Chodorow et al. ’07] to a single rater’s judgments can skew evaluation results • Test: 2 native speakers rated 2,000 prepositions from TOEFL essays: • Differences of 10% in precision and 5% in recall, depending on which rater is used as the gold standard
Implications of Using Multiple Raters • Advantages of multiple raters: • Can indicate the variability of system evaluation • Allows listing of more substitutions • Standard annotation with multiple annotators is problematic: • Expensive • Time-Consuming (training, adjudication) • Is there an approach that can make annotation more efficient?
(3) Sampling Approach • Sampling Approach: • Sample the system’s output classifications • Annotate a smaller, error-skewed corpus • Estimate rates of hits, false positives, and misses • Can calculate precision and recall • [Chart: preposition usage in learner corpora is heavily skewed, roughly 90% OK vs. 10% Error] • Problem: building an eval corpus of 1,000 errors can take 100 hrs!
Sampling Methodology • [Flow diagram] Learner Corpus → System • Contexts the system flags as Error form the “Error Sub-Corpus”; contexts the system accepts as OK form the “OK Sub-Corpus” • A Random Error Sample and a Random OK Sample are drawn from the two sub-corpora and combined into the Annotation Corpus
Sampling Methodology • Learner Corpus: 1,000 preposition contexts → System • Sys Flags Error: 100 → “Error Sub-Corpus”; Sys Accepts OK: 900 → “OK Sub-Corpus” • Random Error Sample: sample rate = 1.0 → 100 contexts; Random OK Sample: sample rate = 0.33 → 300 contexts • Annotation Corpus: 400 contexts • Annotating the Error sample: Hits = 70, FP = 30 • Annotating the OK sample: Both OK = 200, Misses = 100
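A minimal sketch of how precision and recall can be estimated from such a sampled annotation, using the worked numbers above; scaling the annotated counts back up by the inverse sample rates is my reading of the approach, so treat it as an illustration rather than the paper's exact procedure.

```python
# Estimating precision and recall from a sampled, error-skewed annotation.
# Counts come from the worked example above; the scaling logic is an
# illustrative reconstruction, not the paper's verbatim procedure.
def estimate_precision_recall(hits, false_positives, misses,
                              error_sample_rate, ok_sample_rate):
    # Scale annotated counts back up to the full sub-corpora.
    est_hits = hits / error_sample_rate
    est_fp = false_positives / error_sample_rate
    est_misses = misses / ok_sample_rate
    precision = est_hits / (est_hits + est_fp)
    recall = est_hits / (est_hits + est_misses)
    return precision, recall

p, r = estimate_precision_recall(
    hits=70, false_positives=30, misses=100,
    error_sample_rate=1.0,     # all 100 flagged contexts annotated
    ok_sample_rate=300 / 900,  # 300 of the 900 accepted contexts annotated
)
print(f"precision={p:.2f}, recall={r:.2f}")  # roughly 0.70 and 0.19
```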
Sampling Results • Two raters working in tandem on sampled corpus • Compare against standard annotation • Results: • Standard: P = 0.79, R = 0.18 • Sampling: P = 0.79, R = 0.16 • Related Work • [Chodorow & Leacock ’00] – usage of targeted words • Active Learning [Dagan & Engelson ’95] – finding the most informative training examples for ML
Summary • Are two or more annotators better than one? • Annotators vary in their judgments of usage errors • Evaluation based on a single annotator under- or over-estimates system performance • Value of multiple annotators: • Gives information about the range of system performance • Dependent on the number of annotators • Multiple acceptable prepositions per context are handled better • These issues are not unique to the preposition task: • Collocation kappa scores: 0.504 to 0.554
Summary • Sampling Approach: shown to be a good alternative to the exhaustive annotation approach • Advantages: • Less costly & time-consuming • Results are similar to exhaustive annotation • Avoids the fatigue problem • Drawbacks: • Less reliable estimate of recall • Hard to re-test the system • System comparison is difficult
Future Work • Do another sampling comparison to validate results • Leverage confidence annotations