A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors Joachim Wagner, Jennifer Foster, and Josef van Genabith 2007-07-26 National Centre for Language Technology, School of Computing, Dublin City University
Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work
Why Judge Grammaticality? • Grammar checking • Computer-assisted language learning (feedback, writing aid) • Automatic essay grading • Re-ranking computer-generated output (e.g. machine translation)
Why this Evaluation? • No agreed evaluation standard • Existing evaluations differ in what is evaluated, in the corpora used, and in error density and error types
Deep Approaches • Precision grammar • Aim to distinguish grammatical sentences from ungrammatical sentences • Grammar engineers • Increase coverage • Avoid overgeneration • For English: • ParGram / XLE (LFG) • English Resource Grammar / LKB (HPSG) • RASP (GPSG to DCG influenced by ANLT)
Shallow Approaches • Mostly target real-word spelling errors rather than grammar errors in general • Part-of-speech (POS) n-grams: raw frequency • Machine learning-based classifiers using features of the local context • Noisy channel model • N-gram similarity, POS tag set
Artificial Error Corpus • Error analysis of a real error corpus (small) → common grammatical errors → chosen error types • Automatic error creation modules applied to the BNC (big)
Common Grammatical Errors • 20,000-word corpus of ungrammatical English sentences • Sources: newspapers, academic papers, emails, … • Correction operators: substitute (48%), insert (24%), delete (17%), combination (11%) • Highlighted error types: agreement errors and real-word spelling errors
Chosen Error Types • Agreement: She steered Melissa around a corners. • Real-word: She could no comprehend. • Extra word: Was that in the summer in? • Missing word: What the subject?
Automatic Error Creation • Agreement: replace determiner, noun or verb • Real-word: replace according to pre-compiled list • Extra word: duplicate token or part-of-speech, or insert a random token • Missing word: delete token (likelihood based on part-of-speech)
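A minimal sketch of what such error-creation modules might look like; the function names, the tiny confusion list, and the simplifications are illustrative assumptions, not the authors' code:

```python
import random

# Illustrative confusion list for real-word errors ("not" -> "no" mirrors the
# slide example); the real pre-compiled list is not reproduced here.
REAL_WORD_CONFUSIONS = {"not": ["no"], "their": ["there"], "an": ["and"]}

def make_extra_word_error(tokens):
    """Duplicate a random token (a real module may also insert a random token
    or duplicate by part-of-speech)."""
    if not tokens:
        return tokens
    i = random.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

def make_missing_word_error(tokens):
    """Delete one token; the real module weights the deletion by part-of-speech."""
    if len(tokens) < 2:
        return tokens
    i = random.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def make_real_word_error(tokens):
    """Replace a token with a confusable real word from a pre-compiled list."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() in REAL_WORD_CONFUSIONS]
    if not candidates:
        return tokens
    i = random.choice(candidates)
    replacement = random.choice(REAL_WORD_CONFUSIONS[tokens[i].lower()])
    return tokens[:i] + [replacement] + tokens[i + 1:]

print(make_real_word_error("She could not comprehend .".split()))
```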
BNC Test Data (1) • BNC: 6.4 M sentences • 4.2 M sentences after removing speech, poems, captions and list items • Randomised and split into 10 sets of 420 K sentences each
BNC Test Data (2) • The error creation modules are applied to each of the 10 sets, producing parallel error corpora for agreement, real-word, extra word and missing word errors
BNC Test Data (3) • Mixed error type: ¼ of each of the four error types
BNC Test Data (4) • 5 error types (agreement, real-word, extra word, missing word, mixed) × 10 sets = 50 sets • Each set is 50:50 ungrammatical:grammatical
BNC Test Data (5) • Example: 1st cross-validation run for agreement errors • One set serves as test data, the remaining sets as training data (if required by the method)
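A minimal sketch of the cross-validation splits described above, assuming the 50:50 sets have already been built; the loader name is hypothetical:

```python
def cross_validation_runs(sets):
    """Yield (training, test) splits: each of the 10 sets serves once as test
    data; the other 9 are training data (used only by methods that need it)."""
    for i, test_set in enumerate(sets):
        training = [s for j, s in enumerate(sets) if j != i]
        yield training, test_set

# sets = load_sets("agreement")              # hypothetical loader for the 10 sets
# for training, test in cross_validation_runs(sets):
#     ...                                    # train (if needed) and evaluate
```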
Evaluation Measures • Accuracy on ungrammatical data: acc_ungram = (# correctly flagged as ungrammatical) / (# ungrammatical sentences) • Accuracy on grammatical data: acc_gram = (# correctly classified as grammatical) / (# grammatical sentences) • Both measures are independent of the error density of the test data
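A small sketch of the two measures, assuming parallel lists of predicted and gold labels:

```python
def accuracies(predicted, gold):
    """Return (acc_ungram, acc_gram) from parallel label lists containing
    "ungrammatical" / "grammatical"."""
    correct_u = sum(p == g == "ungrammatical" for p, g in zip(predicted, gold))
    correct_g = sum(p == g == "grammatical" for p, g in zip(predicted, gold))
    n_u = sum(g == "ungrammatical" for g in gold)
    n_g = sum(g == "grammatical" for g in gold)
    return correct_u / n_u, correct_g / n_g
```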
Overview of Methods • Basic methods: M1 (XLE output), M2 (POS n-gram information) • Decision tree methods: M3 (on XLE output), M4 (on POS n-gram information), M5 (on the combined feature set)
M1 Method 1: Precision Grammar • XLE with the English LFG • A fragment rule parses ungrammatical input; such output is marked with * • Signals used: starred output, zero parses, parser exceptions (time-out, memory)
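A sketch of how these signals could be turned into a decision, assuming a result record with the fields named on the slide; how parser exceptions are treated is an assumption here, not something the slide specifies:

```python
def classify_with_xle(result, exception_means_error=True):
    """Classify a sentence from XLE output statistics (field names illustrative)."""
    if result.exception is not None:             # time-out, out of memory, ...
        return "ungrammatical" if exception_means_error else "grammatical"
    if result.starred or result.n_parses == 0:   # fragment (*) parse or no parse
        return "ungrammatical"
    return "grammatical"
```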
M1 XLE Parsing • The first 60 K sentences of each of the 50 sets are parsed with XLE • 50 × 60 K = 3 M parse results
M2 Method 2: POS N-grams • Flag rare POS n-grams as errors • Rare: according to reference corpus • Parameters: n and frequency threshold • Tested n = 2, …, 7 on held-out data • Best: n = 5 and frequency threshold 4
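A minimal sketch of the M2 idea, with an illustrative reference table built from POS-tagged training data; whether the threshold comparison is strict is an assumption:

```python
from collections import Counter

def ngrams(tags, n):
    """All POS n-grams of a tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_reference_table(tagged_corpus, n=5):
    """Count POS n-gram frequencies over a reference corpus of tag sequences."""
    table = Counter()
    for tags in tagged_corpus:
        table.update(ngrams(tags, n))
    return table

def flag_rare_ngram(tags, table, n=5, threshold=4):
    """Flag a sentence whose rarest POS n-gram falls below the frequency
    threshold (n = 5, threshold = 4 as on the slide)."""
    grams = ngrams(tags, n)
    if not grams:
        return False  # sentence shorter than n: no evidence either way
    return min(table[g] for g in grams) < threshold
```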
M2 POS N-gram Information • Reference n-gram table built from the 9 training sets • For each test sentence, record the frequency of its rarest n-gram → 3 M frequency values • Repeated for n = 2, 3, …, 7
M3 Method 3: Decision Trees on XLE Output • Output statistics • Starredness (0 or 1) and parser exceptions (-1 = time-out, -2 = exceeded memory, …) • Number of optimal parses • Number of unoptimal parses • Duration of parsing • Number of subtrees • Number of words
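A sketch of how these output statistics could feed a decision tree; the specific learner used in the original work is not assumed, scikit-learn merely stands in for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

def xle_features(result):
    """Feature vector from the XLE output statistics listed above
    (field names are illustrative)."""
    return [
        result.star_or_exception,  # 0/1 starredness, or -1 (time-out), -2 (memory), ...
        result.n_optimal,          # number of optimal parses
        result.n_unoptimal,        # number of unoptimal parses
        result.duration,           # parsing time
        result.n_subtrees,         # number of subtrees
        result.n_words,            # sentence length
    ]

# X = [xle_features(r) for r in training_results]   # hypothetical training data
# y = [r.label for r in training_results]           # "U" or "G"
# tree = DecisionTreeClassifier().fit(X, y)
```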
M3 Decision Tree Example • if Star < 0 → U • else if Star >= 1 → U • else if Optimal < 5 → U • else → G (U = ungrammatical, G = grammatical)
M4 Method 4: Decision Trees on N-grams • Feature: frequency of the rarest n-gram in the sentence, for n = 2, …, 7 • Feature vector: 6 numbers
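A sketch of the M4 feature vector, reusing the illustrative ngrams() helper from the M2 sketch; the per-n reference tables are assumptions:

```python
def ngram_features(tags, tables):
    """Frequency of the sentence's rarest POS n-gram for each n = 2..7;
    `tables` maps n to a reference frequency table (see the M2 sketch)."""
    features = []
    for n in range(2, 8):
        grams = ngrams(tags, n)
        features.append(min((tables[n][g] for g in grams), default=0))
    return features  # 6 numbers
```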
M4 Decision Tree Example • if 5-gram < 4 → U • else if 7-gram >= 1 → G • else if 5-gram < 45 → U • else → G (values are frequencies of the rarest n-gram)
M5 Method 5: Decision Trees on Combined Feature Sets • Example tree: if Star < 0 → U • else if Star >= 1 → U • else if 5-gram < 4 → U • else → G
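A one-line sketch of the combined M5 feature vector, reusing the illustrative helpers from the M3 and M4 sketches:

```python
def combined_features(result, tags, tables):
    """M5 feature vector: XLE output statistics plus rarest-n-gram frequencies."""
    return xle_features(result) + ngram_features(tags, tables)
```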
XLE Parsing of the BNC • 600,000 grammatical sentences • 2.4 M ungrammatical sentences • Parse-testfile command • Parse-literally 1 • Max xle scratch storage 1,000 MB • Time-out 60 seconds • No skimming
Efficiency [chart: XLE parsing times on 10,000 grammatical BNC sentences; time-outs indicated]
XLE Coverage [chart: coverage on the 5 × 600 K test data sets]
Varying Training Error Density [charts: accuracy trade-offs for M3 (decision tree on XLE output) trained at error densities from 20% to 75%, compared with M1 (XLE)]
N-Grams and DT (M2 vs M4) [chart: M2 (n-gram) vs M4 (decision tree) trained at error densities of 25%, 50%, 67% and 75%]
Methods 1 to 4 [chart: M1 (XLE), M2 (n-gram), M3 and M4 (decision trees) compared]
Combined Method (M5) [chart: M5 trained at error densities from 10% to 90%]
All Methods [chart: M1 (XLE), M2 (n-gram), M3/M4 (decision trees) and M5 (combined) compared]