Native Judgments of Non-Native Usage: Experiments in Preposition Error Detection Joel Tetreault [Educational Testing Service] Martin Chodorow [Hunter College of CUNY]
Preposition Cloze Examples • “We sat _____ the sunshine” • in • under • by • at • “…the force of gravity causes the sap to move _____ the underside of the stem.” • to • on • toward • onto • through
Motivation • The number of non-native speakers in English schools has risen over the past decade • Currently 300 million ESL learners in China alone! • Highlights the need for NLP tools to assist in language learning • Evaluation of NLP learner tools: • Important for development • Error annotation: time-consuming and costly • Usually only one rater • [Izumi ’03; Eeg-Olofsson ’02; Han ’06; Nagata ’06; Gamon ’08]
Objective • Problem • Single human annotation has been used as the gold standard • Sidesteps the issue of annotator reliability • Objective • Show that rating preposition usage is actually a very contentious task • Recommend an approach to make annotation more feasible
Experiments in Rater Reliability • Judgments of Native Usage • Difficulty of preposition selection with cloze and choice experiments • Judgments of Non-Native Usage • Double-annotate a large ESL corpus • Show that one rater can be unreliable and skew system results • Sampling Approach • Propose an approach to alleviate the cost and time associated with double annotation
Background • Uses the preposition error detection system developed in [Tetreault and Chodorow ’08] • Performance: • Native text: as high as 79% • TOEFL essays: P=84%, R=19% • State-of-the-art when compared with other methods: [Gamon ’08], [De Felice ’08] • Raters: • Two native speakers (East Coast US) • Ages: 26, 29 • Two years of experience with other NLP annotation
(1) Human Judgments of Native Usage • Is the task of preposition selection in native texts difficult for human raters? • Cloze Test: • Raters presented with 200 Encarta sentences, with one preposition replaced with a blank • Asked to fill in the blank with the best preposition • “We sat _____ the sunshine”
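A minimal sketch of how a cloze item like the one above could be prepared by blanking out a preposition; the preposition inventory and function name are illustrative, not taken from the paper.

```python
# Turn a sentence into a cloze item by blanking its first preposition.
# The preposition inventory here is a small illustrative subset.
import re

PREPOSITIONS = {"in", "on", "at", "by", "under", "to", "toward", "onto", "through"}

def make_cloze(sentence):
    """Replace the first preposition in the sentence with a blank
    and return (cloze_sentence, removed_preposition)."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        word = re.sub(r"\W", "", tok).lower()
        if word in PREPOSITIONS:
            removed = word
            tokens[i] = "_____"
            return " ".join(tokens), removed
    return sentence, None

print(make_cloze("We sat in the sunshine"))
# -> ('We sat _____ the sunshine', 'in')
```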
(1) Native Usage – Cloze Test * System’s mismatch suggestions were often not as good as Raters’
(1) Native Usage – Choice Test • Many contexts that license multiple prepositions • Using an exact match can underestimate performance • Choice Test: • Raters presented with 200 Encarta sentences • Asked to blindly choose between system’s and writer’s preposition • “We sat {in/under} the sunshine”
(1) Native Usage – Choice Test • Results: • Both Raters 1 & 2 considered the system’s preposition equal to or better than the writer’s 28% of the time • So a system that scores 75% under the exact-match metric is actually performing as high as 82% • 28% of the 25% mismatch rate = +7%
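A minimal sketch of this adjustment, using the figures quoted above (75% exact match, 28% of mismatches judged equal to or better than the writer's choice); the function itself is illustrative.

```python
# Adjusting exact-match accuracy for mismatches that raters judged acceptable.
def adjusted_accuracy(exact_match, equal_or_better_share):
    """exact_match: fraction of cases where the system's preposition == the writer's.
    equal_or_better_share: fraction of mismatches where raters judged the
    system's preposition equal to or better than the writer's."""
    mismatch = 1.0 - exact_match
    return exact_match + mismatch * equal_or_better_share

print(adjusted_accuracy(0.75, 0.28))  # -> 0.82 (75% + 28% of 25% = +7%)
```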
(2) Human Judgments of Non-Native Usage • Using one rater can be problematic • linguistic drift, age, location, fatigue, task difficulty • Question: is using only one rater reliable? • Experiment: • Two raters double-annotate TOEFL essays for preposition usage errors • Compute Agreement/Kappa measures • Evaluate system performance vs. two raters
Annotation: Error Targeting • Schemes target many different types of errors [Izumi ’03] [Granger ’03] • Problematic: • High cognitive load on rater to keep track of dozens of error types • Some contexts have several different errors (many different ways of correcting) • Can degrade reliability • Targeting one error type reduces the effects of these issues
Annotation Scheme • Annotators were presented with sentences from TOEFL essays, with each preposition flagged • Preposition Annotation: • Extraneous • Wrong Choice – an incorrect preposition is used (list substitution(s)) • OK – preposition is perfect for that context • Equal – preposition is perfect, but others are acceptable as well (list them) • Then mark confidence in the judgment (binary)
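A sketch of one possible record format for this annotation scheme; the class and field names, and the example sentence, are illustrative rather than taken from the paper.

```python
# Illustrative data structure for one annotated preposition context.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PrepJudgment(Enum):
    EXTRANEOUS = "extraneous"      # preposition should not be there at all
    WRONG_CHOICE = "wrong_choice"  # incorrect preposition; substitutions listed
    OK = "ok"                      # preposition is perfect for the context
    EQUAL = "equal"                # perfect, but other prepositions also acceptable

@dataclass
class PrepAnnotation:
    sentence: str
    preposition: str
    judgment: PrepJudgment
    alternatives: List[str] = field(default_factory=list)  # substitutions or equally good choices
    confident: bool = True  # binary confidence in the judgment

# Made-up example record, not from the TOEFL data.
example = PrepAnnotation(
    sentence="We sat at the sunshine",
    preposition="at",
    judgment=PrepJudgment.WRONG_CHOICE,
    alternatives=["in"],
    confident=True,
)
```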
Procedure • Raters given blocks of 500 preposition contexts • Took roughly 5 hours per block • After two blocks each, raters did an overlap set of ~100 contexts (1336 contexts total) • Every overlap set was adjudicated by two other human raters: • Sources of disagreement were discussed with original raters • Agreement and Kappa computed
How well do humans compare? • For all overlap segments: • OK and Equal are collapsed to OK • Agreement = 0.952 • Kappa = 0.630 • Kappa ranged from 0.411 to 0.786
Confusion Matrix: Rater 1 vs. Rater 2 [the matrix’s cell counts are not preserved in this export]
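A minimal sketch of how observed agreement and Cohen's kappa can be computed from a two-rater confusion matrix; the cell counts below are placeholders, since the slide's actual matrix is not preserved here.

```python
# Observed agreement and Cohen's kappa from a square confusion matrix.
# The counts are made-up placeholders; only the computation is the point.
def agreement_and_kappa(matrix):
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of contexts with identical labels.
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Expected agreement from each rater's marginal label distribution.
    p_e = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(len(matrix))
    )
    return p_o, (p_o - p_e) / (1 - p_e)

# Rows = Rater 1's labels, columns = Rater 2's labels (e.g., OK vs. Error).
placeholder_matrix = [
    [1200, 30],
    [35, 70],
]
p_o, kappa = agreement_and_kappa(placeholder_matrix)
print(f"agreement={p_o:.3f}, kappa={kappa:.3f}")
```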
Implications for System Evaluation • Comparing a system [Chodorow et al. ’07] to a single rater’s judgments can skew evaluation results • Test: 2 native speakers rated 2,000 prepositions from TOEFL essays: • Differences of 10% in precision and 5% in recall, depending on which rater is used as the gold standard
Implications of Using Multiple Raters • Advantages of multiple raters: • Can indicate the variability of system evaluation • Allows listing of more substitutions • Standard annotation with multiple annotators is problematic: • Expensive • Time-Consuming (training, adjudication) • Is there an approach that can make annotation more efficient?
(3) Sampling Approach • Sampling Approach: • Sample the system’s output classifications • Annotate a smaller, error-skewed corpus • Estimate rates of hits, false positives, and misses • Can calculate precision and recall • [Chart: preposition usage in learner corpora is heavily skewed, roughly 90% OK vs. 10% Error] • Problem: building an eval corpus of 1,000 errors can take 100 hrs!
Sampling Methodology • [Flow diagram] Learner Corpus → System • Contexts the system flags as Error form the “Error Sub-Corpus”; contexts the system accepts as OK form the “OK Sub-Corpus” • A Random Error Sample and a Random OK Sample are drawn from the two sub-corpora and combined into the Annotation Corpus
Sampling Methodology • Learner Corpus: 1,000 preposition contexts → System • Sys Flags Error: 100 → “Error Sub-Corpus”; Sys Accepts OK: 900 → “OK Sub-Corpus” • Random Error Sample: sample rate = 1.0 → 100 contexts; Random OK Sample: sample rate = 0.33 → 300 contexts • Annotation Corpus: 400 contexts • Annotating the Error sample: Hits = 70, FP = 30 • Annotating the OK sample: Both OK = 200, Misses = 100
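A minimal sketch of how precision and recall can be estimated from such a sampled annotation, using the worked numbers above; scaling the annotated counts back up by the inverse sample rates is my reading of the approach, so treat it as an illustration rather than the paper's exact procedure.

```python
# Estimating precision and recall from a sampled, error-skewed annotation.
# Counts come from the worked example above; the scaling logic is an
# illustrative reconstruction, not the paper's verbatim procedure.
def estimate_precision_recall(hits, false_positives, misses,
                              error_sample_rate, ok_sample_rate):
    # Scale annotated counts back up to the full sub-corpora.
    est_hits = hits / error_sample_rate
    est_fp = false_positives / error_sample_rate
    est_misses = misses / ok_sample_rate
    precision = est_hits / (est_hits + est_fp)
    recall = est_hits / (est_hits + est_misses)
    return precision, recall

p, r = estimate_precision_recall(
    hits=70, false_positives=30, misses=100,
    error_sample_rate=1.0,     # all 100 flagged contexts annotated
    ok_sample_rate=300 / 900,  # 300 of the 900 accepted contexts annotated
)
print(f"precision={p:.2f}, recall={r:.2f}")  # roughly 0.70 and 0.19
```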
Sampling Results • Two raters working in tandem on sampled corpus • Compare against standard annotation • Results: • Standard: P = 0.79, R = 0.18 • Sampling: P = 0.79, R = 0.16 • Related Work • [Chodorow & Leacock ’00] – usage of targeted words • Active Learning [Dagan & Engelson ’95] – finding the most informative training examples for ML
Summary • Are two or more annotators better than one? • Annotators vary in their judgments of usage errors • Evaluation based on a single annotator under- or over-estimates system performance • Value of multiple annotators: • Gives information about the range of system performance • Dependent on the number of annotators • Multiple acceptable prepositions per context are handled better • These issues are not unique to the preposition task: • Collocation kappa scores: 0.504 to 0.554
Summary • Sampling Approach: shown to be a good alternative to the exhaustive annotation approach • Advantages: • Less costly & time-consuming • Results are similar to exhaustive annotation • Avoids the fatigue problem • Drawbacks: • Less reliable estimate of recall • Hard to re-test the system • System comparison is difficult
Future Work • Do another sampling comparison to validate results • Leverage confidence annotations