Contemporary Spelling Correction: Decoding the Noisy Channel
Bob Carpenter, Alias-i, Inc.
carp@alias-i.com
Kinds of Spelling Mistakes: Typos
• Typos are wrong characters by mistake
• Insertions
  • “appellate” as “appellare”, “prejudice” as “prejudsice”
• Deletions
  • “plaintiff” as “paintiff”, “judgement” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
• Substitutions
  • “habeas” as “haceas”
• Transpositions
  • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”
  • “subpoena” as “subpeona”
  • “plaintiff” as “plaitniff”
Kinds of Spelling Mistakes: Brainos
• Brainos are wrong characters “on purpose”
• The kinds of mistakes found in lists of “common” misspellings
• Very common in general web queries
• Derive from pronunciation, from spelling, or from deeper semantic confusions
• English is particularly bad due to its irregularity
• Probably (?) common in other languages that import words
Brainos: Soundalikes
• Latinates
  • “subpoena” as “supena”, “judicata” as “judicada”, “voir” as “voire”
• Consonant Clusters & Flaps
  • “privilege” as “priveledge”, “rescission” as “recision”, “collateral” as “colaterall”, “latter” as “ladder”, “estoppel” as “estopple”, “withholding” as “witholding”
• Vowel Reductions
  • “collateral” as “collaterel”, “punitive” as “punative”
• Vowel Clusters
  • “respondeat” as “respondiat”, “lien” as “lein”, “habeas” as “habeeas”, “conveniens” as “convieniens”
• Marker Vowels
  • “foreclosure” as “forclosure”
• Multiples
  • “subpoena” as “supena” (two deletes)
Brainos: Confusions
• Substituting a more common, or just plain different, word
  • Names: “Opperman” as “Oppenheimer”; “Eisenstein” as “Einstein”
• Pronunciation Confusions
  • Transpositions: “preclusion” as “perclusion”, “meruit” as “meriut”
• Irregular word forms
  • “juries” as “jurys” or “jureys”; “men” as “mans”
  • English is particularly bad for this, too
• Tokenization issues
  • “AT&T” vs. “AT & T” vs. “A.T.&T.”, …
  • The correct variant (if unique) depends on the search engine’s notion of “word”
• Word Boundaries
  • “in camera” as “incamera”, “qui tam” as “quitam”, “injunction” as “in junction”, “foreclosure” as “for closure”, “dramshop” as “dram shop”
“Old School” Spelling Correction
• Damerau, 1964, “A technique for computer detection and correction of spelling errors”. Comms. ACM.
• One word (token) at a time
• Only looked at unknown words not in the dictionary
• Suggest “closest” alternatives (first best, or multiple in order)
• Closeness measured in number of edits (edit distance)
  • Deletions, insertions, substitutions, and sometimes transpositions
  • Often results in ties
• Good word game
  • With a 50-character alphabet and a 50-character query, there are 50^50 ≈ 10^84 alternatives
  • Can search the whole space in linear time using dynamic programming
• This technique lives on in many apps
  • Simple, fast, and only requires a word list
Edit Distance (Damerau/Levenshtein)
• Quadratic time, linear space algorithm
• E.g. D(“John”, “Jan”) = 2 [D(“John”, “Bob”) = 3]
  • Edits: match ‘J’, subst ‘a’ for ‘o’, delete ‘h’, match ‘n’
• Recurrence:
  score(I,J) = min(score(I-1,J-1) + match(I,J),
                   score(I-1,J) + delete(I),
                   score(I,J-1) + insert(J))
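Below is a minimal sketch of the dynamic program just described: quadratic time, with three linearly sized rows (the transposition edit needs to look back two rows). Unit edit costs and the class name EditDistance are assumptions for illustration, not Damerau’s original routine or any particular library’s API.

```java
// Damerau-Levenshtein distance via dynamic programming.
// Quadratic time; linear space (three rows of the DP matrix).
public final class EditDistance {

    public static int distance(String source, String target) {
        int m = source.length(), n = target.length();
        int[] prev2 = new int[n + 1]; // row I-2, for transpositions
        int[] prev = new int[n + 1];  // row I-1
        int[] curr = new int[n + 1];  // row I
        for (int j = 0; j <= n; ++j)
            prev[j] = j;              // cost of inserting j characters
        for (int i = 1; i <= m; ++i) {
            curr[0] = i;              // cost of deleting i characters
            for (int j = 1; j <= n; ++j) {
                int substOrMatch = prev[j - 1]
                    + (source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substOrMatch, Math.min(delete, insert));
                // transposition, e.g. "fraud" vs. "fruad"
                if (i > 1 && j > 1
                        && source.charAt(i - 1) == target.charAt(j - 2)
                        && source.charAt(i - 2) == target.charAt(j - 1))
                    curr[j] = Math.min(curr[j], prev2[j - 2] + 1);
            }
            int[] tmp = prev2; prev2 = prev; prev = curr; curr = tmp;
        }
        return prev[n]; // prev holds the last completed row after the swap
    }

    public static void main(String[] args) {
        System.out.println(distance("John", "Jan"));    // 2: subst + delete
        System.out.println(distance("fraud", "fruad")); // 1: one transposition
    }
}
```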
“Middle Aged” Spelling Correction
• Still look at single words not in a dictionary, plus a list of common misspellings
• Model Likely Edits
  • Whole words
    • “acceptable” as “acceptible”; “truant” as “truent”, etc.
  • Sound sequences
    • “ie” ↔ “ei”; “mm” ↔ “m”
  • Typos
    • Closeness on keyboard (depends on your keyboard; mixtures)
    • ‘q’ as ‘w’; ‘y’ as ‘u’ (substitutions)
    • ‘q’ as ‘qw’ or ‘wq’ (insertions)
  • Position in word
    • Edits most likely word-internally, next most likely at the end, least likely at the front
    • Psychology of reading left-to-right & early resolution
    • “plantiff” (mid) > “plaintff” (end) > “laintiff” (front)
“Contemporary” Spelling Correction
• Find the most likely intended query given the observed query
• Integrated Probabilistic Model
  • Model of query likelihood (source): P(query)
  • Model of edit likelihood (channel): P(realization | query)
• Shannon’s “Noisy Channel Model” (1940s):
  • Find the most likely query Q given realization R:
    ARGMAX_Q P(Q | R)                  [problem]
    = ARGMAX_Q P(R | Q) * P(Q) / P(R)  [Bayes’ rule]
    = ARGMAX_Q P(R | Q) * P(Q)         [P(R) constant in Q]
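As a hedged illustration of this decision rule, the sketch below scores each candidate query by log P(R|Q) + log P(Q) and keeps the argmax; P(R) is dropped because it is constant across candidates, and logs avoid numerical underflow. The class and method names are hypothetical stand-ins for a source model and a channel model.

```java
import java.util.List;

// Noisy-channel decoding: ARGMAX_Q P(R|Q) * P(Q), computed in log space.
abstract class NoisyChannelDecoder {
    abstract double queryLogProb(String query);                    // log P(Q), source model
    abstract double channelLogProb(String realized, String query); // log P(R|Q), edit model

    String decode(String realized, List<String> candidates) {
        String best = realized;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String q : candidates) {
            double score = channelLogProb(realized, q) + queryLogProb(q);
            if (score > bestScore) { bestScore = score; best = q; }
        }
        return best;
    }
}
```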
Simple Example of Correction
• Query Likelihood Model
  • P(“hte”) = 1/1,000,000
  • P(“the”) = 1/20
• Edit Likelihood Model
  • P(“hte” | “the”) = P(transpose(“th”)) * P(match(“e”)) = 1/500 * 99/100 = 99/50,000 ≈ 1/500
  • P(“hte” | “hte”) = P(match(“h”)) * P(match(“t”)) * P(match(“e”)) ≈ 1
• Therefore:
  • P(“hte” | “the”) * P(“the”) = 1/500 * 1/20 = 1/10,000
    >> P(“hte” | “hte”) * P(“hte”) = 1 * 1/1,000,000 = 1/1,000,000
General Approach Solves Several Problems
• Orders alternatives based on likelihood
  • First-best or ranked n-best alternatives
  • N-best is a tricky user-interface issue for web search
• Measures likelihood that the query is in error
  • Allows tuning of rejection thresholds for precision/recall
• Measures likelihood that the correction is correct
  • As posterior probability in the Bayesian model
• Principled balance of query vs. edit likelihoods
  • Empirical issue determined by measurable user behavior
  • E.g. word processors and web search are very different
• Suggests valid word substitutions in phrases
  • “pro bono” as “per bono”
  • “Peter principle” as “Peter principal”
  • Google, e.g.: “fodr” → “ford”, but “fodr baggins” → “frodo baggins”
Alias-i’s Approach
• Models fully retrainable per application
  • Out-of-the-box solutions not feasible
  • Tailored query and edit models based on user application behavior
  • Scalable to gigabytes without pruning, and to arbitrary amounts of data with selective pruning
• Character-level model for queries: P(query)
  • Generalizes to subphrases of unknown tokens
    • E.g. “likelihoods” flagged as error by PowerPoint
    • E.g. “likelihood” not flagged as error by PowerPoint
  • Or token-sensitive output (only output known words in the corpus)
  • Allows efficient search based on prefixes
• Flexible framework for edit likelihoods: P(realization | query)
  • Models likely substitutions in the domain
Source Language Models
• Character n-grams:
  P(c_0, …, c_{n-1}) = PROD_{i<n} P(c_i | c_0, …, c_{i-1})        [chain rule]
                     ≈ PROD_{i<n} P′(c_i | c_{i-n+1}, …, c_{i-1})  [n-gram]
• Generalized Witten-Bell smoothing (~ state of the art):
  P′(d | c C) = lambda(c C) * P_ML(d | c C) + (1 − lambda(c C)) * P′(d | C)
  • where d, c are characters and C is a sequence of characters,
  • P_ML is the maximum likelihood estimator,
  • the recursion grounds out in the uniform estimate, and
  • lambda(X) = count(X) / (count(X) + K * outcomes(X))   [in [0,1]]
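The sketch below is one plausible rendering of this estimator: counts per context, an outcome set per context for the Witten-Bell lambda, and recursion to shorter contexts grounded in a uniform estimate. The class name CharLm, the map-based storage, and the fixed alphabet size are illustrative assumptions, not LingPipe’s implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Witten-Bell smoothed character n-gram model, per the formulas above.
class CharLm {
    private final int maxContext;  // n-1 for an n-gram model
    private final double k;        // the hyperparameter K in lambda(X)
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> ngramCounts = new HashMap<>();
    private final Map<String, Set<Character>> outcomes = new HashMap<>();
    private static final int NUM_CHARS = 256; // alphabet size for the uniform base

    CharLm(int nGram, double k) { this.maxContext = nGram - 1; this.k = k; }

    void train(String text) {
        for (int i = 0; i < text.length(); ++i) {
            char d = text.charAt(i);
            for (int len = 0; len <= Math.min(maxContext, i); ++len) {
                String ctx = text.substring(i - len, i);
                contextCounts.merge(ctx, 1, Integer::sum);
                ngramCounts.merge(ctx + d, 1, Integer::sum);
                outcomes.computeIfAbsent(ctx, x -> new HashSet<>()).add(d);
            }
        }
    }

    // P'(d | context): lambda-weighted mix of the ML estimate and the
    // backoff, recursing to shorter contexts, grounded in uniform
    double prob(char d, String context) {
        if (context.length() > maxContext)
            context = context.substring(context.length() - maxContext);
        double backoff = context.isEmpty()
            ? 1.0 / NUM_CHARS
            : prob(d, context.substring(1)); // drop the oldest character
        int ctxCount = contextCounts.getOrDefault(context, 0);
        if (ctxCount == 0)
            return backoff; // unseen context: pure backoff
        double lambda = ctxCount / (ctxCount + k * outcomes.get(context).size());
        double ml = ngramCounts.getOrDefault(context + d, 0) / (double) ctxCount;
        return lambda * ml + (1.0 - lambda) * backoff;
    }
}
```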
Training Data for Query Model: P(query)
• Trained independently of the edit model
• Captures domain-specific features more than edits
• Appropriate text corpus matches the problem
  • Overall stats: “trt” → “tart” or “tort” (depends on domain)
  • Phrasal context: “linzer trt” vs. “trt reform”
  • Implicitly models the number of possible “hits” for a query
• Can train per field for complex queries
  • E.g. author, institution, MeSH term, abstract in MEDLINE
• Can retrain query models as new data arrives
• Training data must match use data
  • E.g. all caps, mixed case, etc.
  • May normalize queries plus training data
Training Data for Edit Model: P(realization|query)
1. No training data
  • A priori typos:
    • Characters near each other on the keyboard are likely typos (see the sketch below)
    • More careful typing near the beginning and end of a word
  • A priori brainos:
    • Vowel sequences confusable with other vowel sequences
    • Consonants that sound alike are easily confused (‘t’ vs. ‘d’, etc.)
    • Consonants likely doubled or undoubled in error
    • More common in unstressed syllables (approximately, later in the word)
2. Bootstrap raw query logs
  • Can do this step with a simpler model, such as ispell
  • Better with the first approximation model above (like EM)
  • Estimates rates of various errors and likely substitutions
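The sketch referenced above is one way to encode the a priori keyboard intuition: substitutions between adjacent keys get a higher probability than arbitrary substitutions. The adjacency test and all of the weights are illustrative guesses, not values estimated from data.

```java
import java.util.HashMap;
import java.util.Map;

// A priori typo channel: adjacent QWERTY keys are more likely substitutions.
class KeyboardChannel {
    private static final String[] ROWS = { "qwertyuiop", "asdfghjkl", "zxcvbnm" };
    private final Map<Character, int[]> pos = new HashMap<>();

    KeyboardChannel() {
        for (int r = 0; r < ROWS.length; ++r)
            for (int c = 0; c < ROWS[r].length(); ++c)
                pos.put(ROWS[r].charAt(c), new int[] { r, c });
    }

    boolean adjacent(char a, char b) {
        int[] pa = pos.get(a), pb = pos.get(b);
        return pa != null && pb != null
            && Math.abs(pa[0] - pb[0]) <= 1 && Math.abs(pa[1] - pb[1]) <= 1;
    }

    // Illustrative (made-up) log probabilities for the substitution edit
    double substLogProb(char intended, char typed) {
        if (intended == typed) return Math.log(0.99);          // match
        if (adjacent(intended, typed)) return Math.log(0.005); // near-miss typo
        return Math.log(0.0001);                               // anything else
    }
}
```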
Training Data for Edit Model: P(realization|query) (cont.)
3. Sample of Correct/Error Classified Queries
  • Better estimate of error edit rates (not specific errors)
  • Estimate likely insert/delete/substitute/transpose errors
  • Requires an unbiased sample of errors and correct queries
    • Search engines report 10-15% of queries have errors!!!
  • Need ~100 examples of each type of error on average
    • Requires an unbiased sample of errors (correct queries not necessary)
    • About 100 examples per character on average, or about 5K examples total, assuming 50 editable characters
  • We can find these using “active learning” or bootstrapping
    • Requires a best guess at the correction using a simpler method
Training Data for Edit Model: P(realization|query) (cont.)
4. Fully Supervised Learning
  • Same samples as in (3) above
  • Editor(s) provide corrections for the errors
  • Only a few days’ work with a halfway decent interface
  • Should use two editors on the same sample to cross-validate
    • Multiple editors also provide a bound on human performance
  • Almost always significantly better than bootstrap methods
Evaluating Accuracy: Correcting the Right Queries
• Need the labeled training data!
• Are we correcting the right queries?
• Confusion Matrix
  • True Positive (TP): error that is corrected
  • True Negative (TN): good query that is not corrected
  • False Positive (FP): good query that is corrected
  • False Negative (FN): error that is not corrected
• Performance Metrics (see the sketch below)
  • Precision = TP / (TP + FP): % of corrections that were errors
  • Specificity = TN / (TN + FP): % of good queries that are not corrected
  • Recall = TP / (TP + FN): % of errors that are corrected
  • Accuracy = (TP + TN) / (TP + TN + FP + FN): % of queries for which we do the right thing
• Can balance false alarms against missed corrections
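A minimal sketch of these metrics over a stream of (was-error, was-corrected) decisions follows; the class name is illustrative, and the formulas are exactly those above.

```java
// Confusion-matrix bookkeeping for the correction decision.
class CorrectionEval {
    private long tp, tn, fp, fn;

    void record(boolean wasError, boolean wasCorrected) {
        if (wasError && wasCorrected) ++tp;        // error, corrected
        else if (!wasError && !wasCorrected) ++tn; // good query, left alone
        else if (!wasError && wasCorrected) ++fp;  // good query, "corrected"
        else ++fn;                                 // error, missed
    }

    double precision()   { return tp / (double) (tp + fp); }
    double specificity() { return tn / (double) (tn + fp); }
    double recall()      { return tp / (double) (tp + fn); }
    double accuracy()    { return (tp + tn) / (double) (tp + tn + fp + fn); }
}
```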
Evaluating Accuracy: Returning the Proper Correction
• Correction Accuracy
  • % of corrections that were properly corrected
  • Combine with precision on the to-correct decision
• Overall Accuracy
  • % of queries that are TN, or TP with the right correction
Evaluating Accuracy: MSN Case Study
• Cucerzan and Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. Proc. EMNLP.
• 10-15% estimate of queries with errors
• Training by bootstrapping query logs (method 2)
• Scoring one human against another: 90%
• System accuracy against averaged humans: 82%
  • System accuracy on valid queries: 85%
  • System accuracy on queries with errors: 67%
• System accuracy with baseline edit model
  • 80% total; 83% on valid queries; 66% on queries with errors
• 8% lower estimates for auto-eval over sequential logs
• 5% higher estimate for “reasonable” vs. exact correction
• Good news
  • Web search is as hard as it gets: multi-topic and multi-lingual
Evaluating Efficiency
• May trade accuracy for efficiency along a receiver operating characteristic (ROC) curve
  • Smaller model size by tokens or characters
  • Smaller search space
  • Higher rejection threshold increases efficiency, reduces recall, and increases precision
• Standalone Server Deployment
  • Allows larger shared models in memory
  • Simple timeout robustness from the web server
• Models require concurrent-read/single-write (CRSW) synchronization (see the sketch below)
  • Any number of concurrent queries share the same model without blocking
  • No queries can run while the model is changing
• Correction may be done in parallel with search (not pure latency)
  • No need to evaluate the number of queries returned, though this may be combined post hoc with results for tighter rejection
• Should easily scale to requirements
  • 1 million queries in 8 hours on a single multiprocessor server
  • That’s 25-50 queries/second
  • LMs run at 2 million characters/second on a desktop
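One plausible rendering of the CRSW discipline is a standard read-write lock, as sketched below: any number of query threads share the read lock without blocking one another, while a model swap takes the exclusive write lock. SpellModel is a placeholder type, not an actual LingPipe class.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Concurrent-read/single-write access to a shared correction model.
class SpellServer {
    interface SpellModel { String bestCorrection(String query); } // placeholder

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private SpellModel model;

    SpellServer(SpellModel initialModel) { this.model = initialModel; }

    String correct(String query) {
        lock.readLock().lock();        // many readers proceed concurrently
        try { return model.bestCorrection(query); }
        finally { lock.readLock().unlock(); }
    }

    void swapModel(SpellModel newModel) {
        lock.writeLock().lock();       // writer excludes all readers
        try { model = newModel; }
        finally { lock.writeLock().unlock(); }
    }
}
```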
But wait, that’s not all for LingPipe 2.0
• Character- and token-level language models
• Ranked Terminology Discovery
  • Collocations within a corpus (chi-squared independence test)
  • “What’s new” across corpora (binomial t-test)
• Binary & Multiway Classification
  • Bayesian framework; language model implementations
  • Extensive probabilistic confusion matrix scoring
  • E.g. topic (which newsgroup, which section of a paper)
  • E.g. sentiment (positive or negative product review)
  • E.g. language (critical for multi-lingual applications)
  • E.g. de-duplication of message streams
  • E.g. spam detection
• Hierarchical Clustering
  • General framework; language model implementations
  • E.g. self-organizing web results
• Chunking (high-throughput Bayesian model)
  • E.g. named entities, noun phrases and verb phrases
• Implementations of standard evaluations and corpora
Design Standards
• Extensive use of standard patterns
  • E.g. corpus visitors, abstract adapters, factories for runtime-pluggable implementations
• Mostly immutable & final (efficiency, state stability & testability)
• Modules all support CRSW synchronization
• Highly Modular Interfaces
  • Allows implementation plug and play
  • Most interfaces have abstract adapters
  • E.g. SpellChecker interface, AbstractSpellChecker adapter with abstract edit model, and ConstantSpellChecker and ProbabilisticSpellChecker implementations (sketched below)
• Simple or Complex Tuning Parameterizations
  • Reasonable defaults
  • M.S./Ph.D.-level tuning options (popular for theses)
• Follows Sun’s coding standards
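A hedged sketch of the interface/adapter/implementation pattern named above; the method signatures and bodies are guesses for illustration, not LingPipe’s actual API.

```java
// Interface / abstract adapter / concrete implementation, as in the
// SpellChecker example above (names per the slide; signatures assumed).
interface SpellChecker {
    String didYouMean(String query);
}

abstract class AbstractSpellChecker implements SpellChecker {
    // subclasses supply the edit model; the candidate search is shared
    abstract double editCost(char intended, char observed);

    public String didYouMean(String query) {
        // shared candidate search would go here, scored with editCost(...)
        return query; // placeholder: no correction
    }
}

class ConstantSpellChecker extends AbstractSpellChecker {
    double editCost(char intended, char observed) {
        return intended == observed ? 0.0 : 1.0; // uniform edit costs
    }
}
```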
Engineering & Support Standards
• Active and responsive user group forum
• Tutorial examples of all modules
  • Most include industry-standard evaluations
• Thorough unit testing (JUnit)
  • More good examples of API usage
• Windows XP & Linux, for Java 1.4.2 and 1.5.0
• Profile-based tuning (JProfiler)
  • Speed, memory and disk access
• Full javadoc of the public/protected API
• Classes are shy about their privates as a rule
• Types are as specific as possible (many adapters)
• Integration at command-line, XML or API levels
Other Applications
• Case Restoration
  • Source: train on mixed-case data
  • Channel: case switching costs nothing; all other edits cost infinity (see the sketch below)
  • E.g. “LOUISE MCNALLY TEACHES AT POMPEU FABRA” becomes “Louise McNally teaches at Pompeu Fabra”
  • Useful for speech output or some old teletype feeds
  • Lita et al. 2003. tRuEcasIng. Proc. ACL ’03.
• Punctuation Restoration
  • Channel: punctuation insertion costs nothing; all other edits cost infinity
  • Also useful for speech output
• Chinese Tokenization (Bill Teahan)
  • Source: train on space-separated tokens
  • Channel: spaces insert free; all other edits cost infinity
  • Teahan et al. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics.
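As an illustration of how restrictive channels specialize the general model, the sketch below renders the case-restoration channel: a case switch (or exact match) is free and every other edit is impossible, so the source model alone chooses among casings. The class and method names are assumptions.

```java
// Case-restoration channel: case switching is free, all other edits forbidden.
class CaseChannel {
    double editCost(char intended, char observed) {
        if (Character.toLowerCase(intended) == Character.toLowerCase(observed))
            return 0.0;                      // match or case switch: free
        return Double.POSITIVE_INFINITY;     // any other edit: impossible
    }
}
```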
Decoding L33T-speak
• “L33T” is l33t-speak for “elite”
• Used by gamers (pwn4g3) and spammers (med|catiOn)
  • Substitute numbers (e.g. ‘E’ to ‘3’, ‘A’ to ‘4’, ‘O’ to ‘0’, ‘I’ to ‘1’)
  • Substitute punctuation (e.g. ‘/\’ for ‘A’, ‘|’ for ‘L’, ‘\/\/’ for ‘W’)
  • Some standard typos (e.g. ‘p’ for ‘o’)
  • De-duplicate or duplicate characters freely
  • Delete characters relatively freely
  • Insert/delete space or punctuation freely
  • Get creative
• Examples from my spam this week:
  • VàLIUM CíAL1SS ViÁGRRA; MACR0MEDIA, M1CR0S0FT, SYMANNTEC $20 EACH; univers.ty de-gree online; HOt penny pick fue|ed by high demand; Fwd; cials-tabs, 24 hour sale online; HOw 1s yOur health; Your C A R D D E B T can be wipe clean; Savvy players wOuld be wise tO l0ad up early; Im fed up of my Pa|n med|catiOn pr0b|em; Y0ur wIfe needs tO cOpe with the PaIn; End your gIrlfr1end's Med!ca| prOcedures n0w; C,E*L.E*B,R.E'X 2oo m'gg
• Piece of cake to correct (pwn4g3 = “ownage”, a popular taunt if you win); see the substitution sketch below
• More info: http://en.wikipedia.org/wiki/Leet
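As a toy illustration of the number and punctuation substitutions listed above, here is a greedy longest-match table decode; a real system would instead search over segmentations with the full noisy-channel model, and the table covers only the substitutions on this slide. (Note it yields “pwnage”; mapping ‘p’ back to ‘o’ for “ownage” is the separate typo channel.)

```java
import java.util.HashMap;
import java.util.Map;

// Greedy longest-match decode over a fixed l33t substitution table.
class LeetChannel {
    private static final Map<String, String> SUBS = new HashMap<>();
    static {
        SUBS.put("3", "e"); SUBS.put("4", "a"); SUBS.put("0", "o");
        SUBS.put("1", "i"); SUBS.put("|", "l");
        SUBS.put("/\\", "a");     // '/\' for 'A'
        SUBS.put("\\/\\/", "w");  // '\/\/' for 'W'
    }

    static String decode(String leet) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < leet.length()) {
            boolean matched = false;
            for (int len = Math.min(4, leet.length() - i); len >= 1 && !matched; --len) {
                String plain = SUBS.get(leet.substring(i, i + len));
                if (plain != null) { out.append(plain); i += len; matched = true; }
            }
            if (!matched) out.append(leet.charAt(i++)); // pass through unchanged
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("pwn4g3")); // "pwnage"
    }
}
```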