A large list of confusion sets for spellchecking assessed against a corpus of real-word errors
Jenny Pedler, Roger Mitton
LREC 2010
Some real-word errors:
• The sand-eel is the principle food for many birds and animals.
• Our teacher tort us to spell.
• Henley Regatta comes near the top of the English social calender.
Spellchecker-induced real-word errors:
• The Wine Bar Company is opening a chain of brassieres.
• The nightwatchman threw the switch and eliminated the backyard.
... to encourage cooperation and ...
Spellchecker suggestions: Cupertino, co-operation
The original Cupertinos:
• "reinforcing bilateral and multilateral Cupertino"
• "South Asian Association for regional Cupertino"
Confusion sets:
• {cite, sight, site}
• {form, from}
• {passed, past}
• {peace, piece}
• {principal, principle}
• {quiet, quite, quit}
• {their, there, they're}
• {weather, whether}
• {you're, your}
He had quiet a young girl staying with him of 17 named Ethel Monticue.
quiet: quite? quit?
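The lookup step in the example above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it only finds words that belong to a confusion set and lists the alternatives a checker would then have to weigh against the context. The names `CONFUSION_SETS`, `ALTERNATIVES` and `candidates` are hypothetical, and only a handful of the sets from the slide are included.

```python
# Illustrative subset of the confusion sets shown on the slide.
CONFUSION_SETS = [
    {"cite", "sight", "site"},
    {"form", "from"},
    {"passed", "past"},
    {"quiet", "quite", "quit"},
    {"their", "there", "they're"},
]

# Map every word to the other members of its confusion set.
ALTERNATIVES = {}
for cset in CONFUSION_SETS:
    for word in cset:
        ALTERNATIVES.setdefault(word, set()).update(cset - {word})


def candidates(sentence):
    """Yield (position, word, sorted alternatives) for each confusable word."""
    for i, token in enumerate(sentence.lower().split()):
        word = token.strip(".,;:!?\"")  # drop surrounding punctuation
        if word in ALTERNATIVES:
            yield i, word, sorted(ALTERNATIVES[word])


for hit in candidates("He had quiet a young girl staying with him"):
    print(hit)
```

Deciding which member of the set actually fits the sentence is the hard part and is not attempted here.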
The confusion-set approach has been demonstrated to work with (a) a short list of confusion sets, (b) artificial test data.
To assess its potential for real, unrestricted text, we need: (1) a realistically sized list of confusion sets, (2) a corpus of running text containing genuine real-word errors.
A list of confusion sets
• Tuned string-to-string edit-distance
• ~6000 sets
• Headword (confusables)
  • wright (right, write)
  • right (rite, write)
  • write (right, rite, writ)
• Inflected forms
• Proper nouns
• Usage errors – e.g. <fewer, less>
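As a rough sketch of how headword entries like those above might be generated, the Python below pairs each word in a vocabulary with its near neighbours. It uses plain Levenshtein distance as a stand-in; the slides do not say how the authors' distance measure was tuned, so both the unit costs and the `max_distance` threshold are assumptions, and `edit_distance` and `build_confusables` are hypothetical names.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (unit cost for insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_confusables(vocabulary, max_distance=1):
    """Map each headword to the other words within max_distance of it."""
    words = sorted(set(vocabulary))
    return {
        head: [w for w in words
               if w != head and edit_distance(head, w) <= max_distance]
        for head in words
    }


sets = build_confusables(["write", "writ", "rite", "right", "wright"])
print(sets["write"])   # neighbours of "write" under the untuned measure
```

With unit costs and a threshold of 1, "write" picks up "rite" and "writ" but not the homophone "right", which is one reason an untuned distance is unlikely to reproduce the list on the slide.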
A corpus of real-word errors
[Slide graphic; recoverable labels: quit, quiet, quit, quite]
Corpus mark-up example:
The collation of the information was <ERR targ = really> relay </ERR> <ERR targ = quite> quit </ERR> easy to do.
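A short Python sketch of how mark-up in this style could be read back is given below. It assumes the tags appear exactly as in the example (an unquoted `targ` value and no nested tags); the actual corpus files may differ. A regular expression is used because the unquoted attribute is not well-formed XML. `ERR_TAG`, `extract_errors` and `corrected` are hypothetical names.

```python
import re

# Matches <ERR targ = TARGET> ERROR </ERR> as written in the slide example.
ERR_TAG = re.compile(r"<ERR\s+targ\s*=\s*(.*?)\s*>\s*(.*?)\s*</ERR>")


def extract_errors(text):
    """Return (error, target) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ERR_TAG.finditer(text)]


def corrected(text):
    """Replace each marked-up error with its target word."""
    return ERR_TAG.sub(lambda m: m.group(1), text)


sample = ("The collation of the information was "
          "<ERR targ = really> relay </ERR> "
          "<ERR targ = quite> quit </ERR> easy to do.")

print(extract_errors(sample))
print(corrected(sample))
```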
Corpus profile: Homophone errors
14% of distinct error/target pairs
Using the list for spellchecking
• Rules based on surrounding context
  • May be unreliable
  • 25% of errors have another error within 2 words
  • 9% are another real-word error
• Syntax-based methods
  • Easiest to implement
  • Shown to have good performance
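The proximity figure above is the kind of statistic that can be computed directly from error positions in a text. The Python sketch below shows one way to do it, assuming errors are identified by their token index within a single text; how the authors actually counted is not stated in the slides, and `share_with_nearby_error` is a hypothetical name.

```python
def share_with_nearby_error(error_positions, window=2):
    """Fraction of errors that have another error within `window` tokens."""
    positions = sorted(error_positions)
    if not positions:
        return 0.0
    near = sum(
        1 for i, p in enumerate(positions)
        if any(abs(p - q) <= window
               for j, q in enumerate(positions) if j != i)
    )
    return near / len(positions)


# Token indices of three errors in one text: two adjacent, one isolated.
print(share_with_nearby_error([6, 7, 20]))
```

A high value matters for context-based rules because the words used as evidence for correcting one error may themselves be wrong.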
Resources available for download: www.dcs.bbk.ac.uk/~jenny/resources.html