
Joachim Wagner, Jennifer Foster, and Josef van Genabith 2007-07-26

A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors. National Centre for Language Technology, School of Computing, Dublin City University.


Presentation Transcript


  1. A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors • Joachim Wagner, Jennifer Foster, and Josef van Genabith • 2007-07-26 • National Centre for Language Technology, School of Computing, Dublin City University

  2. Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work

  3. Why Judge the Grammaticality? • Grammar checking • Computer-assisted language learning • Feedback • Writing aid • Automatic essay grading • Re-rank computer-generated output • Machine translation

  4. Why this Evaluation? • No agreed standard • Differences in • What is evaluated • Corpora • Error density • Error types

  5. Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work

  6. Deep Approaches • Precision grammar • Aim to distinguish grammatical sentences from ungrammatical sentences • Grammar engineers • Increase coverage • Avoid overgeneration • For English: • ParGram / XLE (LFG) • English Resource Grammar / LKB (HPSG) • RASP (GPSG to DCG influenced by ANLT)

  7. Shallow Approaches • Real-word spelling errors • vs grammar errors in general • Part-of-speech (POS) n-grams • Raw frequency • Machine learning-based classifier • Features of local context • Noisy channel model • N-gram similarity, POS tag set

  8. Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work

  9. Artificial Error Corpus • Real Error Corpus (Small) → Error Analysis → Common Grammatical Errors → Chosen Error Types → Automatic Error Creation Modules → Applied to BNC (Big)

  10. Common Grammatical Errors • 20,000 word corpus • Ungrammatical English sentences • Newspapers, academic papers, emails, … • Correction operators • Substitute (48 %) • Insert (24 %) • Delete (17 %) • Combination (11 %)

  11. Common Grammatical Errors • 20,000 word corpus • Ungrammatical English sentences • Newspapers, academic papers, emails, … • Correction operators • Substitute (48 %) • Insert (24 %) • Delete (17 %) • Combination (11 %) • Agreement errors • Real-word spelling errors

  12. Chosen Error Types • Agreement: She steered Melissa around a corners. • Real-word: She could no comprehend. • Extra word: Was that in the summer in? • Missing word: What the subject?

  13. Automatic Error Creation • Agreement: replace determiner, noun or verb • Real-word: replace according to pre-compiled list • Extra word: duplicate token or part-of-speech, or insert a random token • Missing word: delete token (likelihood based on part-of-speech)
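The error creation modules on slide 13 could look roughly like the sketch below. It is an illustration only, not the authors' implementation: the function names, the `delete_weights` table and the `confusions` list are hypothetical, and agreement errors are left out because they need morphological information that the slides do not describe.

```python
import random

def make_missing_word_error(tokens, pos_tags, delete_weights, rng):
    """Delete one token; the chance of picking a token depends on its POS tag.

    delete_weights is a hypothetical table mapping POS tag -> relative weight.
    """
    weights = [delete_weights.get(tag, 1.0) for tag in pos_tags]
    index = rng.choices(range(len(tokens)), weights=weights, k=1)[0]
    return tokens[:index] + tokens[index + 1:]

def make_extra_word_error(tokens, rng):
    """Duplicate one randomly chosen token in place (one of the slide's variants)."""
    index = rng.randrange(len(tokens))
    return tokens[:index + 1] + [tokens[index]] + tokens[index + 1:]

def make_real_word_error(tokens, confusions, rng):
    """Replace one token using a pre-compiled confusion list.

    confusions is a hypothetical dict: word -> list of real-word misspellings.
    Returns None when no token of the sentence is in the list.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok in confusions]
    if not candidates:
        return None
    index = rng.choice(candidates)
    replacement = rng.choice(confusions[tokens[index]])
    return tokens[:index] + [replacement] + tokens[index + 1:]

rng = random.Random(0)
print(make_real_word_error(["She", "could", "not", "comprehend", "."],
                           {"not": ["no"]}, rng))
```

Passing an explicit `random.Random` instance keeps the corpus creation reproducible, which matters when the same sets are reused across cross-validation runs.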

  14. Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work

  15. BNC Test Data (1) • BNC: 6.4 M sentences • 4.2 M sentences (no speech, poems, captions and list items) • Randomisation • 10 sets (1 to 10) with 420 K sentences each

  16. BNC Test Data (2) [Diagram: error creation (agreement, real-word, extra word, missing word) is applied to each of the 10 sets to build the error corpus]

  17. BNC Test Data (3) [Diagram: mixed error type sets 1 to 10, built with ¼ from each of the four error types]

  18. BNC Test Data (4) • 5 error types: agreement, real-word, extra word, missing word, mixed errors • 50 sets (sets 1 to 10 for each error type) • Each 50:50 ungrammatical:grammatical

  19. BNC Test Data (5) • Example: 1st cross-validation run for agreement errors • Test data: set 1 • Training data (if required by method): sets 2 to 10
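The randomised split into 10 sets and the cross-validation runs can be sketched as follows. This is a minimal illustration under assumptions: the helper names and the round-robin assignment after shuffling are hypothetical, and the slides do not say how the randomisation was actually done.

```python
import random

def split_into_sets(sentences, num_sets=10, seed=0):
    """Shuffle the sentences and deal them into num_sets sets of near-equal size."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_sets] for i in range(num_sets)]

def cross_validation_runs(num_sets=10):
    """Yield (test_set, training_sets) pairs using 1-based set numbers."""
    for test_set in range(1, num_sets + 1):
        training_sets = [s for s in range(1, num_sets + 1) if s != test_set]
        yield test_set, training_sets

# First run: set 1 is held out for testing, sets 2 to 10 are used for training.
first_test, first_training = next(cross_validation_runs())
print(first_test, first_training)
```

Methods that need no training (such as a hand-written grammar) would simply ignore the training sets in each run.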

  20. Evaluation Measures • Accuracy on ungrammatical data: acc_ungram = (# correctly flagged as ungrammatical) / (# ungrammatical sentences) • Accuracy on grammatical data: acc_gram = (# correctly classified as grammatical) / (# grammatical sentences) • Independent of error density of test data
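The two measures on slide 20 can be computed as in the sketch below. The label strings "U" and "G" and the function names are illustrative choices. The second function shows why reporting the two accuracies separately is independent of error density: overall accuracy for any assumed density is just their weighted average.

```python
def accuracy_by_class(gold, predicted):
    """Return (acc_gram, acc_ungram) for parallel lists of 'G'/'U' labels."""
    on_ungram = [p for g, p in zip(gold, predicted) if g == "U"]
    on_gram = [p for g, p in zip(gold, predicted) if g == "G"]
    acc_ungram = on_ungram.count("U") / len(on_ungram)
    acc_gram = on_gram.count("G") / len(on_gram)
    return acc_gram, acc_ungram

def overall_accuracy(acc_gram, acc_ungram, error_density):
    """Overall accuracy on data where error_density is the share of ungrammatical sentences."""
    return error_density * acc_ungram + (1.0 - error_density) * acc_gram

gold = ["U", "U", "G", "G"]
predicted = ["U", "G", "G", "G"]
acc_gram, acc_ungram = accuracy_by_class(gold, predicted)
print(acc_gram, acc_ungram, overall_accuracy(acc_gram, acc_ungram, 0.5))
```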

  21. Accuracy Graph

  22. Region of Improvement

  23. Region of Degradation

  24. Undecided

  25. Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work

  26. Overview of Methods • Basic methods: M1 (XLE output), M2 (POS n-gram information) • Decision tree methods: M3 (XLE output), M4 (POS n-gram information), M5 (both)

  27. M1 Method 1: Precision Grammar • XLE English LFG • Fragment rule • Parses ungrammatical input • Marked with * • Zero number of parses • Parser exceptions (time-out, memory)
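One way to read slide 27 is as a simple rule over the parser output: a sentence counts as ungrammatical when the parse is starred, when there are no parses, or when the parser hits an exception. The sketch below encodes that reading. The function name and arguments are hypothetical, and the numeric encoding (1 = starred, negative = exception) is borrowed from the feature description on slide 31 rather than stated for Method 1.

```python
def m1_classify(starredness, num_parses):
    """Classify a sentence from precision-grammar output.

    starredness: 1 if the parse is marked with *, 0 if not,
                 negative for a parser exception (assumed encoding).
    num_parses:  total number of parses returned.
    Returns 'U' (ungrammatical) or 'G' (grammatical).
    """
    if starredness < 0:   # time-out, memory or other parser exception
        return "U"
    if starredness == 1:  # only a fragment (starred) analysis was found
        return "U"
    if num_parses == 0:   # no analysis at all
        return "U"
    return "G"

print(m1_classify(0, 3), m1_classify(1, 2), m1_classify(-1, 0))
```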

  28. M1 XLE Parsing • First 60 K sentences of each of the 50 sets • XLE • 50 x 60 K = 3 M parse results

  29. M2 Method 2: POS N-grams • Flag rare POS n-grams as errors • Rare: according to reference corpus • Parameters: n and frequency threshold • Tested n = 2, …, 7 on held-out data • Best: n = 5 and frequency threshold 4
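A minimal sketch of the rare-n-gram idea on slide 29 follows. Several details are assumptions: the function names are hypothetical, "rare" is taken to mean a reference frequency strictly below the threshold, no sentence-boundary padding is used, and sentences shorter than n are never flagged.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """All contiguous POS n-grams of a tag sequence, as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_reference_table(tagged_corpus, n):
    """Count POS n-grams over a reference corpus (an iterable of tag sequences)."""
    table = Counter()
    for tags in tagged_corpus:
        table.update(pos_ngrams(tags, n))
    return table

def rarest_ngram_frequency(tags, table, n):
    """Reference frequency of the sentence's rarest n-gram, or None if it has none."""
    grams = pos_ngrams(tags, n)
    if not grams:
        return None
    return min(table[g] for g in grams)  # Counter returns 0 for unseen n-grams

def m2_classify(tags, table, n=5, threshold=4):
    """Flag the sentence as 'U' if its rarest n-gram is below the threshold."""
    freq = rarest_ngram_frequency(tags, table, n)
    if freq is None:
        return "G"  # assumption: too-short sentences are not flagged
    return "U" if freq < threshold else "G"

reference = [["DT", "NN", "VBZ"]] * 3
table = build_reference_table(reference, n=2)
print(m2_classify(["DT", "NN", "VBZ"], table, n=2, threshold=2))
print(m2_classify(["DT", "DT", "NN"], table, n=2, threshold=2))
```

The toy call uses bigrams and a threshold of 2 only to keep the example small; the slide reports n = 5 and threshold 4 as the best setting on held-out data.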

  30. M2 POS N-gram Information • Reference n-gram table built from 9 sets • Rarest n-gram looked up for each sentence • 3 M frequency values • Repeated for n = 2, 3, …, 7

  31. M3 Method 3: Decision Trees on XLE Output • Output statistics • Starredness (0 or 1) and parser exceptions (-1 = time-out, -2 = exceeded memory, …) • Number of optimal parses • Number of unoptimal parses • Duration of parsing • Number of subtrees • Number of words

  32. M3 Decision Tree Example • Star? < 0: U • Star? >= 0, then Star? >= 1: U • Star? < 1, then Optimal? < 5: U • Optimal? >= 5: G • U = ungrammatical, G = grammatical

  33. M4 Method 4: Decision Trees on N-grams • Frequency of rarest n-gram in sentence • N = 2, …, 7 • feature vector: 6 numbers
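The six-number feature vector on slide 33 can be assembled as sketched below. The function name and the `tables` layout (one frequency table per n) are hypothetical, and using 0 for sentences that are too short to contain an n-gram of a given order is an assumption.

```python
from collections import Counter

def ngram_feature_vector(tags, tables, orders=range(2, 8)):
    """Frequency of the rarest POS n-gram in the sentence for each n in orders.

    tables maps n -> Counter of reference POS n-gram frequencies.
    """
    features = []
    for n in orders:
        grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
        # default=0 covers sentences with fewer than n tags (assumed convention)
        features.append(min((tables[n][g] for g in grams), default=0))
    return features

tables = {n: Counter() for n in range(2, 8)}
tables[2][("DT", "NN")] = 7
print(ngram_feature_vector(["DT", "NN"], tables))
```

Each sentence then becomes one training or test instance for a decision tree learner, with the gold label (grammatical or ungrammatical) as the class.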

  34. M4 Decision Tree Example • 5-gram? < 4: U • 5-gram? >= 4, then 7-gram? >= 1: G • 7-gram? < 1, then 5-gram? < 45: U • 5-gram? >= 45: G

  35. M5 Method 5: Decision Trees on Combined Feature Sets • Example tree: Star? < 0: U • Star? >= 0, then Star? >= 1: U • Star? < 1, then 5-gram? < 4: U • 5-gram? >= 4: G
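Read from the root down, the example tree on slide 35 amounts to a few nested tests over one XLE feature and one n-gram feature. The sketch below writes that reading out by hand. The branch layout is reconstructed from the flattened slide text, so treat it as an interpretation. In the talk such trees are learned from training data, not hand-coded, and the function and argument names here are hypothetical.

```python
def m5_example_tree(starredness, rarest_5gram_frequency):
    """Hand-coded version of the example tree: returns 'U' or 'G'."""
    if starredness < 0:             # parser exception (time-out, memory, ...)
        return "U"
    if starredness >= 1:            # starred (fragment) parse
        return "U"
    if rarest_5gram_frequency < 4:  # rare POS 5-gram in the sentence
        return "U"
    return "G"

print(m5_example_tree(0, 10), m5_example_tree(0, 2), m5_example_tree(1, 10))
```

The tree shows how the combined method can overrule a clean parse: a sentence that XLE accepts without a star is still flagged when it contains a rare POS 5-gram.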

  36. Talk Outline • Motivation • Background • Artificial Error Corpus • Evaluation Procedure • Error Detection Methods • Results and Analysis • Conclusion and Future Work

  37. XLE Parsing of the BNC • 600,000 grammatical sentences • 2.4 M ungrammatical sentences • Parse-testfile command • Parse-literally 1 • Max xle scratch storage 1,000 MB • Time-out 60 seconds • No skimming

  38. Efficiency [Graph: 10,000 BNC sentences (grammatical), with the time-out marked]

  39. XLE Parse Results and Method 1

  40. XLE Coverage 5 x 600 K Test data

  41. Applying Decision Tree to XLE M1 M3

  42. Overall Accuracy for M1 and M3

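  43. Varying Training Error Density [Accuracy graph: M1, and M3 at training error densities of 20%, 25%, 33%, 40%, 50%, 60%, 67% and 75%]

  44. Varying Training Error Density [Accuracy graph: M1, and M3 at training error densities of 20%, 25%, 33%, 40%, 50%, 60%, 67% and 75%]

  45. Varying Training Error Density [Accuracy graph. Legend: M1 = XLE, M3 = with decision tree. Points: M1, M3 40%, M3 43%, M3 50%]

  46. Varying Training Error Density [Accuracy graph. Legend: M1 = XLE, M3 = with decision tree. Points: M1, M3 40%, M3 43%, M3 50%]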

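  47. N-Grams and DT (M2 vs M4) [Accuracy graph. Legend: M2 = N-gram, M4 = DT. Points: M2, M4 25%, M4 50%, M4 67%, M4 75%]

  48. Methods 1 to 4 [Accuracy graph. Legend: M1 = XLE, M2 = N-gram, M3/4 = DT. Points: M1, M2, M3 43%, M3 50%, M4 50%]

  49. Combined Method (M5) [Accuracy graph. Points: 10%, 20%, 25%, 50%, 67%, 75%, 80%, 90%]

  50. All Methods [Accuracy graph. Legend: M1 = XLE, M2 = N-gram, M3/4 = DT, M5 = comb. Points: M1, M2, M3 43%, M3 50%, M4, M5 45%, M5 50%]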
