A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors Joachim Wagner, Jennifer Foster, and Josef van Genabith 2007-07-26 National Centre for Language Technology, School of Computing, Dublin City University
Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work
Why Judge Grammaticality? • Grammar checking • Computer-assisted language learning (feedback, writing aid) • Automatic essay grading • Re-ranking computer-generated output (e.g. machine translation)
Why this Evaluation? • No agreed evaluation standard • Existing evaluations differ in what is evaluated, in the corpora used, and in error density and error types
Deep Approaches • Precision grammar • Aim to distinguish grammatical sentences from ungrammatical sentences • Grammar engineers • Increase coverage • Avoid overgeneration • For English: • ParGram / XLE (LFG) • English Resource Grammar / LKB (HPSG) • RASP (GPSG to DCG influenced by ANLT)
Shallow Approaches • Mostly target real-word spelling errors rather than grammar errors in general • Part-of-speech (POS) n-grams: raw frequency • Machine learning-based classifiers using features of the local context • Noisy channel model • N-gram similarity, POS tag set
Artificial Error Corpus • Error analysis of a real error corpus (small) → common grammatical errors → chosen error types • Automatic error creation modules applied to the BNC (big)
Common Grammatical Errors • 20,000-word corpus of ungrammatical English sentences • Sources: newspapers, academic papers, emails, … • Correction operators: substitute (48%), insert (24%), delete (17%), combination (11%) • Highlighted error types: agreement errors and real-word spelling errors
Chosen Error Types • Agreement: She steered Melissa around a corners. • Real-word: She could no comprehend. • Extra word: Was that in the summer in? • Missing word: What the subject?
Automatic Error Creation • Agreement: replace determiner, noun or verb • Real-word: replace according to pre-compiled list • Extra word: duplicate token or part-of-speech, or insert a random token • Missing word: delete token (likelihood based on part-of-speech)
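A minimal sketch of what such error-creation modules might look like; the function names, the tiny confusion list, and the simplifications are illustrative assumptions, not the authors' code:

```python
import random

# Illustrative confusion list for real-word errors ("not" -> "no" mirrors the
# slide example); the real pre-compiled list is not reproduced here.
REAL_WORD_CONFUSIONS = {"not": ["no"], "their": ["there"], "an": ["and"]}

def make_extra_word_error(tokens):
    """Duplicate a random token (a real module may also insert a random token
    or duplicate by part-of-speech)."""
    if not tokens:
        return tokens
    i = random.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

def make_missing_word_error(tokens):
    """Delete one token; the real module weights the deletion by part-of-speech."""
    if len(tokens) < 2:
        return tokens
    i = random.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def make_real_word_error(tokens):
    """Replace a token with a confusable real word from a pre-compiled list."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() in REAL_WORD_CONFUSIONS]
    if not candidates:
        return tokens
    i = random.choice(candidates)
    replacement = random.choice(REAL_WORD_CONFUSIONS[tokens[i].lower()])
    return tokens[:i] + [replacement] + tokens[i + 1:]

print(make_real_word_error("She could not comprehend .".split()))
```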
BNC Test Data (1) • BNC: 6.4 M sentences • 4.2 M sentences after removing speech, poems, captions and list items • Randomised and split into 10 sets of 420 K sentences each
BNC Test Data (2) • The error creation modules are applied to each of the 10 sets, producing parallel error corpora for agreement, real-word, extra word and missing word errors
BNC Test Data (3) • Mixed error type: ¼ of each of the four error types
BNC Test Data (4) • 5 error types (agreement, real-word, extra word, missing word, mixed) × 10 sets = 50 sets • Each set is 50:50 ungrammatical:grammatical
BNC Test Data (5) • Example: 1st cross-validation run for agreement errors • One set serves as test data, the remaining sets as training data (if required by the method)
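A minimal sketch of the cross-validation splits described above, assuming the 50:50 sets have already been built; the loader name is hypothetical:

```python
def cross_validation_runs(sets):
    """Yield (training, test) splits: each of the 10 sets serves once as test
    data; the other 9 are training data (used only by methods that need it)."""
    for i, test_set in enumerate(sets):
        training = [s for j, s in enumerate(sets) if j != i]
        yield training, test_set

# sets = load_sets("agreement")              # hypothetical loader for the 10 sets
# for training, test in cross_validation_runs(sets):
#     ...                                    # train (if needed) and evaluate
```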
Evaluation Measures • Accuracy on ungrammatical data: acc_ungram = (# correctly flagged as ungrammatical) / (# ungrammatical sentences) • Accuracy on grammatical data: acc_gram = (# correctly classified as grammatical) / (# grammatical sentences) • Both measures are independent of the error density of the test data
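A small sketch of the two measures, assuming parallel lists of predicted and gold labels:

```python
def accuracies(predicted, gold):
    """Return (acc_ungram, acc_gram) from parallel label lists containing
    "ungrammatical" / "grammatical"."""
    correct_u = sum(p == g == "ungrammatical" for p, g in zip(predicted, gold))
    correct_g = sum(p == g == "grammatical" for p, g in zip(predicted, gold))
    n_u = sum(g == "ungrammatical" for g in gold)
    n_g = sum(g == "grammatical" for g in gold)
    return correct_u / n_u, correct_g / n_g
```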
Overview of Methods • Basic methods: M1 (XLE output), M2 (POS n-gram information) • Decision tree methods: M3 (on XLE output), M4 (on POS n-gram information), M5 (on the combined feature set)
M1 Method 1: Precision Grammar • XLE with the English LFG • A fragment rule parses ungrammatical input; such output is marked with * • Signals used: starred output, zero parses, parser exceptions (time-out, memory)
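A sketch of how these signals could be turned into a decision, assuming a result record with the fields named on the slide; how parser exceptions are treated is an assumption here, not something the slide specifies:

```python
def classify_with_xle(result, exception_means_error=True):
    """Classify a sentence from XLE output statistics (field names illustrative)."""
    if result.exception is not None:             # time-out, out of memory, ...
        return "ungrammatical" if exception_means_error else "grammatical"
    if result.starred or result.n_parses == 0:   # fragment (*) parse or no parse
        return "ungrammatical"
    return "grammatical"
```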
M1 XLE Parsing • The first 60 K sentences of each of the 50 sets are parsed with XLE • 50 × 60 K = 3 M parse results
M2 Method 2: POS N-grams • Flag rare POS n-grams as errors • Rare: according to reference corpus • Parameters: n and frequency threshold • Tested n = 2, …, 7 on held-out data • Best: n = 5 and frequency threshold 4
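A minimal sketch of the M2 idea, with an illustrative reference table built from POS-tagged training data; whether the threshold comparison is strict is an assumption:

```python
from collections import Counter

def ngrams(tags, n):
    """All POS n-grams of a tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_reference_table(tagged_corpus, n=5):
    """Count POS n-gram frequencies over a reference corpus of tag sequences."""
    table = Counter()
    for tags in tagged_corpus:
        table.update(ngrams(tags, n))
    return table

def flag_rare_ngram(tags, table, n=5, threshold=4):
    """Flag a sentence whose rarest POS n-gram falls below the frequency
    threshold (n = 5, threshold = 4 as on the slide)."""
    grams = ngrams(tags, n)
    if not grams:
        return False  # sentence shorter than n: no evidence either way
    return min(table[g] for g in grams) < threshold
```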
M2 POS N-gram Information • Reference n-gram table built from the 9 training sets • For each test sentence, record the frequency of its rarest n-gram → 3 M frequency values • Repeated for n = 2, 3, …, 7
M3 Method 3: Decision Trees on XLE Output • Output statistics • Starredness (0 or 1) and parser exceptions (-1 = time-out, -2 = exceeded memory, …) • Number of optimal parses • Number of unoptimal parses • Duration of parsing • Number of subtrees • Number of words
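A sketch of how these output statistics could feed a decision tree; the specific learner used in the original work is not assumed, scikit-learn merely stands in for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

def xle_features(result):
    """Feature vector from the XLE output statistics listed above
    (field names are illustrative)."""
    return [
        result.star_or_exception,  # 0/1 starredness, or -1 (time-out), -2 (memory), ...
        result.n_optimal,          # number of optimal parses
        result.n_unoptimal,        # number of unoptimal parses
        result.duration,           # parsing time
        result.n_subtrees,         # number of subtrees
        result.n_words,            # sentence length
    ]

# X = [xle_features(r) for r in training_results]   # hypothetical training data
# y = [r.label for r in training_results]           # "U" or "G"
# tree = DecisionTreeClassifier().fit(X, y)
```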
M3 Decision Tree Example • if Star < 0 → U • else if Star >= 1 → U • else if Optimal < 5 → U • else → G (U = ungrammatical, G = grammatical)
M4 Method 4: Decision Trees on N-grams • Feature: frequency of the rarest n-gram in the sentence, for n = 2, …, 7 • Feature vector: 6 numbers
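A sketch of the M4 feature vector, reusing the illustrative ngrams() helper from the M2 sketch; the per-n reference tables are assumptions:

```python
def ngram_features(tags, tables):
    """Frequency of the sentence's rarest POS n-gram for each n = 2..7;
    `tables` maps n to a reference frequency table (see the M2 sketch)."""
    features = []
    for n in range(2, 8):
        grams = ngrams(tags, n)
        features.append(min((tables[n][g] for g in grams), default=0))
    return features  # 6 numbers
```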
M4 Decision Tree Example • if 5-gram < 4 → U • else if 7-gram >= 1 → G • else if 5-gram < 45 → U • else → G (values are frequencies of the rarest n-gram)
M5 Method 5: Decision Trees on Combined Feature Sets • Example tree: if Star < 0 → U • else if Star >= 1 → U • else if 5-gram < 4 → U • else → G
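A one-line sketch of the combined M5 feature vector, reusing the illustrative helpers from the M3 and M4 sketches:

```python
def combined_features(result, tags, tables):
    """M5 feature vector: XLE output statistics plus rarest-n-gram frequencies."""
    return xle_features(result) + ngram_features(tags, tables)
```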
XLE Parsing of the BNC • 600,000 grammatical sentences • 2.4 M ungrammatical sentences • Parse-testfile command • Parse-literally 1 • Max xle scratch storage 1,000 MB • Time-out 60 seconds • No skimming
Efficiency [chart: XLE parsing times on 10,000 grammatical BNC sentences; time-outs indicated]
XLE Coverage [chart: coverage on the 5 × 600 K test data sets]
Varying Training Error Density [charts: accuracy trade-offs for M3 (decision tree on XLE output) trained at error densities from 20% to 75%, compared with M1 (XLE)]
N-Grams and DT (M2 vs M4) [chart: M2 (n-gram) vs M4 (decision tree) trained at error densities of 25%, 50%, 67% and 75%]
Methods 1 to 4 [chart: M1 (XLE), M2 (n-gram), M3 and M4 (decision trees) compared]
Combined Method (M5) [chart: M5 trained at error densities from 10% to 90%]
All Methods [chart: M1 (XLE), M2 (n-gram), M3/M4 (decision trees) and M5 (combined) compared]