
Detecting Missing Hyphens in Learner Text

Detecting Missing Hyphens in Learner Text. Aoife Cahill, Susanne Wolff, Nitin Madnani (Educational Testing Service); Martin Chodorow (Hunter College and the Graduate Center). ACL 2013.


Presentation Transcript


  1. Detecting Missing Hyphens in Learner Text Aoife Cahill, Susanne Wolff, Nitin Madnani Educational Testing Service Martin Chodorow Hunter College and the Graduate Center ACL 2013

  2. Outline • Introduction • Baselines • System Description • Evaluation • Conclusions

  3. Introduction Missing Hyphens: • (1) Schools may have more after school sports. • (2) I went to the dentist after school today. • (3) My father like play basketball with me.

  4. Outline Introduction Baselines System Description Evaluation Conclusions

  5. Baselines • The hyphenated form is listed in the Collins Dictionary • The hyphenated form occurs more than 1,000 times in Wikipedia • The probability of the hyphenated form, as estimated from Wikipedia, is greater than 0.66
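
The three baseline rules above can be sketched as simple lookups. This is only an illustration: the function names, the toy counts, and the way the probability is estimated (hyphenated count divided by hyphenated plus unhyphenated count) are assumptions, not details given on the slide.

```python
# Hypothetical sketch of the three baselines; all names and counts are made up.

def hyphen_probability(hyphenated_count, open_count):
    """Relative frequency of the hyphenated spelling (assumed estimator)."""
    total = hyphenated_count + open_count
    return hyphenated_count / total if total else 0.0

def baseline_dictionary(w1, w2, dictionary):
    """Baseline 1: the hyphenated form is a dictionary entry."""
    return f"{w1}-{w2}" in dictionary

def baseline_count(w1, w2, hyphenated_counts, min_count=1000):
    """Baseline 2: the hyphenated form occurs more than min_count times."""
    return hyphenated_counts.get(f"{w1}-{w2}", 0) > min_count

def baseline_probability(w1, w2, hyphenated_counts, open_counts, threshold=0.66):
    """Baseline 3: estimated probability of the hyphenated form exceeds threshold."""
    p = hyphen_probability(
        hyphenated_counts.get(f"{w1}-{w2}", 0),
        open_counts.get((w1, w2), 0),
    )
    return p > threshold

# Toy stand-ins for the dictionary and the Wikipedia counts.
dictionary = {"well-known"}
hyphenated_counts = {"well-known": 5000, "after-school": 800}
open_counts = {("well", "known"): 1000, ("after", "school"): 1200}
```

Note that none of these rules looks at the surrounding sentence, so each one gives the same answer for "after school sports" and "after school today".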

  6. Outline Introduction Baselines System Description Evaluation Conclusions

  7. System Description Learner text: Schools may have more after school sports.

  8. System Description Model: Logistic regression model Probability: Only predict a missing hyphen error when the probability of the prediction is >0.99
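
A minimal sketch of the high-confidence decision rule, assuming a binary logistic regression over sparse features. Only the 0.99 cutoff comes from the slide; the feature names, weights, and bias below are invented for illustration.

```python
import math

THRESHOLD = 0.99  # cutoff stated on the slide

def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def flag_missing_hyphen(features, weights, bias, threshold=THRESHOLD):
    """Report an error only when the model is very confident."""
    return predict_probability(features, weights, bias) > threshold

# Hypothetical weights and feature vectors (not from the paper).
weights = {"bigram=after_school": 4.0, "next_pos=NOUN": 3.0, "next_pos=ADV": -3.0}
bias = -1.5
before_noun = {"bigram=after_school": 1.0, "next_pos=NOUN": 1.0}    # "after school sports"
before_adverb = {"bigram=after_school": 1.0, "next_pos=ADV": 1.0}   # "after school today"
```

With these toy weights only the pre-noun context clears the cutoff. A threshold this high favors precision over recall, so few correct learner sentences get flagged.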

  9. System Description SJM-trained: - San Jose Mercury News corpus - For training, hyphenated words are automatically split (i.e. well-known becomes well known) - The training data contains 1% of the positive examples and 3% of the negative examples
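
The automatic splitting step might look like the sketch below. It assumes whitespace-tokenized input and labels each gap between adjacent output tokens; the function name and the label layout are hypothetical, and the 1% / 3% sampling is not shown.

```python
def make_training_examples(tokens):
    """Split hyphenated tokens and label each gap between adjacent output tokens.

    Returns (split_tokens, labels), where labels[i] is True when a hyphen was
    removed between split_tokens[i] and split_tokens[i + 1] (a positive example)
    and False for an ordinary word boundary (a negative example).
    """
    split_tokens, labels = [], []
    for token in tokens:
        parts = [p for p in token.split("-") if p]
        if not parts:
            continue  # skip tokens that consist only of hyphens
        if split_tokens:
            labels.append(False)  # boundary between two original tokens
        for i, part in enumerate(parts):
            if i > 0:
                labels.append(True)  # a hyphen used to sit here
            split_tokens.append(part)
    return split_tokens, labels

tokens, labels = make_training_examples(["a", "well-known", "author"])
```

Here `tokens` becomes `["a", "well", "known", "author"]` and `labels` becomes `[False, True, False]`, so the gap between "well" and "known" is the single positive example.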

  10. System Description Negative examples selected: Only contexts that occur more than 20 times are selected during training.
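
The frequency filter on negative examples can be sketched with a counter. The slide does not define "context", so this sketch assumes a context is a word bigram tuple; the function name is hypothetical.

```python
from collections import Counter

def select_frequent_negatives(negative_contexts, min_count=20):
    """Keep only negative examples whose context occurs more than min_count times."""
    counts = Counter(negative_contexts)
    return [context for context in negative_contexts if counts[context] > min_count]

# Toy data: one context seen 21 times, another seen exactly 20 times.
contexts = [("of", "the")] * 21 + [("in", "a")] * 20
kept = select_frequent_negatives(contexts)
```

With the strict "more than 20" reading, the context seen exactly 20 times is dropped and only the 21 occurrences of the first context remain.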

  11. System Description Wiki-revision-trained: - Wikipedia articles

  12. System Description

  13. System Description

  14. System Description Combined: - Combine both data sources

  15. Outline Introduction Baselines System Description Evaluation Conclusions

  16. Evaluation • Artificial Data: • - Brown corpus • - 24,243 sentences • - 2,072 hyphenated words

  17. Evaluation

  18. Evaluation

  19. Evaluation Evaluation 1 • Learner Text: • - CLC-FCE • - The corpus contains 1,244 exam scripts • - A total of 173 instances of missing hyphen errors

  20. Evaluation

  21. Evaluation

  22. Evaluation Of the 131 true positives on the learner data, 87 are cases of a single type, the word “make-up”.
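
As a quick check of the arithmetic behind this observation, the snippet below computes the share of true positives accounted for by the single type. It uses only the two counts on the slide; the variable names are illustrative.

```python
true_positives = 131   # true positives on the learner data (from the slide)
make_up_cases = 87     # of which are the single type "make-up" (from the slide)

share_single_type = make_up_cases / true_positives
other_cases = true_positives - make_up_cases

print(f"{share_single_type:.1%} of true positives are 'make-up'; {other_cases} are other types")
```

Roughly two thirds of the hits come from one word, so token-level scores on this data set say little about how well a system handles the variety of hyphenated types.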

  23. Evaluation Evaluation 2 • Learner Text: • - A data set of 1,000 student GRE and TOEFL essays • - Drawn from 295 prompts • - Ranged in length from 1 to 50 sentences • - Average of 378 words per essay

  24. Evaluation Learner Text (Cont.): - Manually inspect a random sample of 100 instances where each system detected a missing hyphen - Two native English speakers judged each instance - Using the Chicago Manual of Style as a guide - High agreement

  25. Evaluation

  26. Outline Introduction Baselines System Description Evaluation Conclusions

  27. Conclusions 1) Automatically detecting missing hyphen errors in learner text 2) The classifiers generally performed better than the baseline systems 3) Taking context into account when detecting the errors is important.
