Probabilistic Detection of Context-Sensitive Spelling Errors Johnny Bigert Royal Institute of Technology, Sweden johnny@kth.se
What? Context-Sensitive Spelling Errors • Example: Nice whether today. • All words found in dictionary • If context is considered, the spelling of whether is incorrect
Why? Why do we need detection of context-sensitive spelling errors? • These errors are quite frequent (reports of 16–40% of all errors) • Larger dictionaries result in more errors going undetected • They cannot be found by regular spell checkers!
Why not? What about proposing corrections for the errors? • An interesting topic, but not the topic of this article • Detection is imperative, correction is an aid
Related work? Are there no algorithms doing this already? • A full parser is perfect for the job • Drawbacks: • high accuracy is required • not available for many languages • manual labor is expensive • not robust
Related work? Are there no other algorithms? • Several other algorithms (e.g. Winnow) • Some do correction • Drawbacks: • They require a set of easily confused words • Normally, you don’t know your spelling errors beforehand
Why? What are the benefits of this algorithm? • Find any error • Avoid extensive manual work • Robustness
How? Prerequisites • We use PoS tag trigram frequencies from an annotated corpus • We are given a sentence, and apply a PoS tagger
How? Basic assumption • If any tag trigram frequency is low, that part is probably ungrammatical
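To make the basic assumption concrete, here is a minimal sketch of counting PoS tag trigrams in an annotated corpus and flagging the positions in a tagged sentence whose trigram frequency is low. The names (`corpus_tag_sequences`, `threshold`) are illustrative, not from the paper.

```python
from collections import Counter

def trigram_counts(corpus_tag_sequences):
    """Count PoS tag trigrams over an annotated corpus (a list of tag sequences)."""
    counts = Counter()
    for tags in corpus_tag_sequences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts

def suspicious_positions(sentence_tags, counts, threshold):
    """Positions whose tag trigram frequency falls below the threshold."""
    return [i for i in range(len(sentence_tags) - 2)
            if counts[tuple(sentence_tags[i:i + 3])] < threshold]
```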
But? But don’t you often encounter rare or unseen trigrams? • Yes, unfortunately • We modify the notion of frequency • Find and use other, “syntactically close” PoS trigrams
Close? What is the syntactic distance between two PoS tags? • A probability that one tag is replaceable by another • The replacement should retain grammaticality • Distances are extracted from the corpus • Unsupervised learning algorithm
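One way to sketch such a replaceability score is to compare the (left tag, right tag) context distributions of two tags over the corpus. The cosine similarity below is an illustrative stand-in for the paper’s unsupervised distance measure, not the published definition.

```python
from collections import Counter
import math

def context_profiles(corpus_tag_sequences):
    """For each tag, count the (left tag, right tag) contexts it occurs in."""
    profiles = {}
    for tags in corpus_tag_sequences:
        for i in range(1, len(tags) - 1):
            profiles.setdefault(tags[i], Counter())[(tags[i - 1], tags[i + 1])] += 1
    return profiles

def replace_prob(tag_a, tag_b, profiles):
    """Cosine similarity of context profiles as a proxy for P(tag_a replaceable by tag_b)."""
    pa, pb = profiles.get(tag_a, Counter()), profiles.get(tag_b, Counter())
    dot = sum(pa[c] * pb[c] for c in pa)
    norm = (math.sqrt(sum(v * v for v in pa.values()))
            * math.sqrt(sum(v * v for v in pb.values())))
    return dot / norm if norm else 0.0
```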
Then? The algorithm • We have a generalized PoS tag trigram frequency • If the frequency is below a threshold, the text is probably ungrammatical
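A sketch of how a generalized frequency could combine the previous pieces: sum the frequencies of syntactically close trigrams, weighted by the replacement probabilities of their tags. It reuses `counts`, `profiles`, and `replace_prob` from the sketches above and is an illustrative formulation, not the paper’s exact definition.

```python
def generalized_freq(trigram, counts, profiles, candidates_per_slot=5):
    """Replacement-weighted frequency over trigrams close to the observed one."""
    t1, t2, t3 = trigram
    total = 0.0
    for a, pa in closest_tags(t1, profiles, candidates_per_slot):
        for b, pb in closest_tags(t2, profiles, candidates_per_slot):
            for c, pc in closest_tags(t3, profiles, candidates_per_slot):
                total += pa * pb * pc * counts[(a, b, c)]
    return total

def closest_tags(tag, profiles, k):
    """The k tags most likely to replace `tag`, with their probabilities
    (the tag itself is included with probability 1)."""
    scored = [(other, replace_prob(tag, other, profiles)) for other in profiles]
    return sorted(scored, key=lambda x: -x[1])[:k]
```

A trigram is then flagged when `generalized_freq(...)` falls below the chosen threshold.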
Result? Summary so far • Unsupervised learning • Automatic algorithm • Detection of any error • No manual labor! • Alas, phrase boundaries cause problems
Phrases? What about phrases? • PoS tag trigrams overlapping two phrases are very productive • Rare phrases yield rare trigrams • Solution: transformations!
Transform? How do we transform a phrase? • Shallow parser • Transform phrases to their most common form • Normally, the head • Benefits: retains grammaticality, fewer rare trigrams, longer tagger scope
Example? Example of phrase transformation • Only [NP the paintings that are old ] are for sale • Only [NP the paintings ] are for sale
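A minimal sketch of this transformation, assuming a shallow parser has already produced the phrase spans; the span format and the choice of replacement tokens are illustrative.

```python
def transform_phrases(tokens, phrases):
    """phrases: list of (start, end, replacement_tokens) from a shallow parser."""
    out, i = [], 0
    while i < len(tokens):
        span = next((p for p in phrases if p[0] == i), None)
        if span:
            start, end, replacement = span
            out.extend(replacement)   # e.g. reduce the phrase to its head
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out

# "Only the paintings that are old are for sale" -> "Only the paintings are for sale"
tokens = "Only the paintings that are old are for sale".split()
np_span = (1, 6, ["the", "paintings"])   # NP reduced to its most common form
print(" ".join(transform_phrases(tokens, [np_span])))
```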
Then what? How do we use the transformations? • Apply the tagger to the transformed sentence • Run the first part of the algorithm again • If any transformation yields only trigrams with high frequency, the sentence is OK • Otherwise, probable error
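A sketch of this decision step, reusing the earlier sketches; `tag` stands in for whatever PoS tagger is applied to each variant, and the acceptance rule is a simplification.

```python
def sentence_ok(variants, tag, counts, profiles, threshold):
    """variants: the original sentence plus its phrase-transformed versions."""
    for tokens in variants:
        tags = tag(tokens)
        freqs = (generalized_freq(tuple(tags[i:i + 3]), counts, profiles)
                 for i in range(len(tags) - 2))
        if all(f >= threshold for f in freqs):
            return True   # some variant is fully plausible: accept the sentence
    return False          # every variant contains a rare trigram: probable error
```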
Result? Summary • Trigram part, fully automatic • Phrase part, could use machine learning of rules for shallow parser • Finds many difficult error types • Threshold determines precision/recall trade-off
Evaluation? Fully automatic evaluation • Introduce artificial context-sensitive spelling errors (using the software Missplel) • Automated evaluation procedure for 1, 2, 5, 10 and 20% misspelled words (using the software AutoEval)
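A generic stand-in for such an evaluation harness (it does not assume any API of Missplel or AutoEval): corrupt a fraction of the tokens with other same-length dictionary words and measure how many of the introduced errors the detector flags.

```python
import random

def introduce_errors(tokens, dictionary, rate, rng=None):
    """Replace roughly `rate` of the tokens with other same-length dictionary words."""
    rng = rng or random.Random(0)
    corrupted, error_positions = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            candidates = [w for w in dictionary if w != tok and len(w) == len(tok)]
            if candidates:
                corrupted[i] = rng.choice(candidates)
                error_positions.append(i)
    return corrupted, error_positions

def detection_recall(flagged_positions, error_positions):
    """Fraction of introduced errors that the detector flagged."""
    if not error_positions:
        return 1.0
    return sum(1 for p in error_positions if p in flagged_positions) / len(error_positions)
```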