
Presentation Transcript


  1. A large list of confusion sets for spellchecking assessed against a corpus of real-word errors. Jenny Pedler, Roger Mitton. LREC 2010

  2. Some real-word errors The sand-eel is the principle food for many birds and animals. Our teacher tort us to spell. Henley Regatta comes near the top of the English social calender.

  3. Spellchecker-induced real-word errors The Wine Bar Company is opening a chain of brassieres. The nightwatchman threw the switch and eliminated the backyard.

  4. Cupertino, California

  5. ... to encourage cooperation and ...

  6. ... to encourage cooperation and ...

  7. ... to encourage cooperation and ... Cupertino co-operation ....

  8. The original Cupertinos "reinforcing bilateral and multilateral Cupertino" "South Asian Association for regional Cupertino"

  9. Confusion sets {cite, sight, site} {form, from} {passed, past} {peace, piece} {principal, principle} {quiet, quite, quit} {their, there, they're} {weather, whether} {you're, your}
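
To make the confusion-set idea concrete, here is a minimal sketch (in Python, not the authors' code) of how a checker might store the sets listed above and look up the alternatives for any word it meets; the sets are the ones shown on slide 9.

    # Illustrative only: a few confusion sets from the slide, indexed so that
    # any member of a set maps to the other members (its confusables).
    CONFUSION_SETS = [
        {"cite", "sight", "site"},
        {"form", "from"},
        {"passed", "past"},
        {"peace", "piece"},
        {"principal", "principle"},
        {"quiet", "quite", "quit"},
        {"their", "there", "they're"},
        {"weather", "whether"},
        {"you're", "your"},
    ]

    # Build a lookup table: word -> set of its confusables.
    CONFUSABLES = {}
    for s in CONFUSION_SETS:
        for word in s:
            CONFUSABLES.setdefault(word, set()).update(s - {word})

    print(CONFUSABLES["quiet"])   # {'quite', 'quit'} (set order may vary)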

  10. He had quiet a young girl staying with him of 17 named Ethel Monticue.

  11. He had quiet [quite? quit?] a young girl staying with him of 17 named Ethel Monticue.

  12. The confusion-set approach has been demonstrated to work with (a) a short list of confusion sets, (b) artificial test data.
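
One common way of producing the kind of "artificial test data" slide 12 refers to is to take correct text and swap a word for one of its confusables. The sketch below shows that idea in Python; it is a hedged illustration of the general practice, not the procedure used in the studies the slide alludes to, and the small confusables table is invented for the example.

    import random

    # A tiny confusables table (word -> alternatives) used only for this example.
    CONFUSABLES = {
        "quite": ["quiet", "quit"],
        "weather": ["whether"],
        "from": ["form"],
    }

    def introduce_errors(sentence, confusables, rate=1.0, seed=0):
        """Turn correct text into text containing artificial real-word errors
        by replacing confusable words with an alternative from their set."""
        rng = random.Random(seed)
        out = []
        for token in sentence.split():
            key = token.lower().strip(".,;:!?")
            if key in confusables and rng.random() < rate:
                out.append(rng.choice(confusables[key]))
            else:
                out.append(token)
        return " ".join(out)

    print(introduce_errors("I wonder whether the weather will be quite good", CONFUSABLES))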

  13. To assess its potential for real, unrestricted text, we need: (1) a realistically-sized list of confusion sets, (2) a corpus of running text containing genuine real-word errors.

  14. A list of confusion sets
  • Tuned string-to-string edit-distance
  • ~ 6000 sets
  • Headword (confusables)
    • wright (right, write)
    • right (rite, write)
    • write (right, rite, writ)
  • Inflected forms
  • Proper nouns
  • Usage errors – e.g. <fewer, less>
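
The sketch below illustrates the general mechanism behind slide 14: proposing confusables for a headword by measuring string similarity against a word list. It uses plain Levenshtein distance as a stand-in for the tuned string-to-string measure the authors describe, and the toy dictionary and threshold are invented for illustration. Note how the untuned distance admits "night" but misses "rite" and "write", which is exactly the kind of behaviour tuning is meant to correct.

    def levenshtein(a, b):
        """Plain (untuned) edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def confusables_for(headword, dictionary, max_distance=1):
        """Propose confusables: dictionary words within max_distance edits."""
        return sorted(w for w in dictionary
                      if w != headword and levenshtein(headword, w) <= max_distance)

    toy_dictionary = {"right", "rite", "write", "writ", "wright", "night"}
    print(confusables_for("right", toy_dictionary))   # ['night', 'wright']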

  15. A corpus of real-word errors quit → quiet, quit → quite

  16. Corpus mark-up example The collation of the information was <ERR targ = really> relay </ERR> <ERR targ = quite> quit </ERR> easy to do.
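
As a rough illustration of consuming this mark-up, the sketch below extracts (error, target) pairs with a regular expression. It assumes unquoted targ values and no nested tags, which may not hold throughout the real corpus.

    import re

    # Illustrative parser for the <ERR targ = ...> mark-up shown on slide 16.
    ERR_PATTERN = re.compile(r"<ERR\s+targ\s*=\s*([^>]+?)\s*>\s*(.*?)\s*</ERR>")

    def extract_errors(text):
        """Return (written_error, intended_target) pairs found in marked-up text."""
        return [(err, targ) for targ, err in ERR_PATTERN.findall(text)]

    sample = ("The collation of the information was "
              "<ERR targ = really> relay </ERR> "
              "<ERR targ = quite> quit </ERR> easy to do.")
    print(extract_errors(sample))   # [('relay', 'really'), ('quit', 'quite')]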

  17. Corpus profile: Frequent errors

  18. Corpus profile: Homophone errors 14% of distinct error/target pairs

  19. Corpus profile: Simple errors

  20. How would our list cope with our corpus?

  21. Non-detectable/non-correctable

  22. Using the list for spellchecking
  • Rules based on surrounding context
    • May be unreliable
    • 25% of errors have another error within 2 words
    • 9% are another real-word error
  • Syntax-based methods
    • Easiest to implement
    • Shown to have good performance
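
To show what "rules based on surrounding context" might look like in the simplest possible form, here is a hedged Python sketch that flags a written word when one of its confusables fits the preceding word far better, judged by invented bigram counts. The syntax-based methods mentioned on the slide work from part-of-speech patterns rather than word counts; this is only a simplified stand-in.

    # Toy data: confusables and bigram counts invented for illustration.
    CONFUSABLES = {"quiet": {"quite", "quit"}, "quite": {"quiet", "quit"}}
    BIGRAM_COUNTS = {("had", "quite"): 120, ("had", "quiet"): 15, ("had", "quit"): 3}

    def flag_real_word_errors(tokens):
        """Yield (position, written_word, better_alternative) where a confusable
        is much more plausible after the preceding word than what was written."""
        for i in range(1, len(tokens)):
            word = tokens[i].lower()
            for alt in CONFUSABLES.get(word, ()):
                written = BIGRAM_COUNTS.get((tokens[i - 1].lower(), word), 0)
                candidate = BIGRAM_COUNTS.get((tokens[i - 1].lower(), alt), 0)
                if candidate > 2 * written:   # arbitrary illustrative threshold
                    yield i, tokens[i], alt

    sentence = "He had quiet a young girl staying with him".split()
    print(list(flag_real_word_errors(sentence)))   # [(2, 'quiet', 'quite')]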

  23. Syntax-based rules: potential

  24. Resources available for download www.dcs.bbk.ac.uk/~jenny/resources.html
