Natural Language Processing Conflation Algorithms NLP: Conflation Algorithms
Acknowledgements
• John Repici (2002) http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
• Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
• Jurafsky & Martin, appendix B, pp 833-836.
Conflation
COMPUTE, COMPUTER, COMPUTING, COMPUTES, COMPUTABILITY, COMPUTATION → COMPUT
Word Conflation Algorithms
• Morphological analysis versus conflation
• Notion of word class is application dependent
  • Genealogy: phonetic similarity
  • Information Retrieval: semantic similarity
• Soundex
• Porter
Problems with Names
• Names can be misspelt: Rossner
• The same name can be spelt in different ways: Kirkop; Chircop
• The same name appears differently in different cultures: Tchaikovsky; Chaicowski
• To solve this problem, we need phonetically oriented algorithms which can find similar-sounding terms and names.
• Just such a family of algorithms exists. They are called Soundexes, after the first patented version.
The Soundex Algorithm
• A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
• It is very handy for searching large databases.
• Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.
Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
• The first character of the word is retained as the first character of the Soundex code.
• In the rest of the word, the following letters are discarded: a, e, i, o, u, h, w and y.
• Remaining consonants are given a code number.
• If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").
Code Numbers
Soundex Algorithm: Example
The Soundex Algorithm uses the following steps to encode a word: [ROSNER]
• The first character of the word is retained as the first character of the Soundex code. [R]
• The following letters are discarded: a, e, i, o, u, h, w and y. [RSNR]
• Remaining consonants are given a code number. [R256]
• If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23"). [R256]
Soundex Algorithm 2
• The resulting code is modified so that it becomes exactly four characters long:
  • If it is shorter than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").
  • If it is longer than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").
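The encoding steps on the Soundex slides can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: it assumes the commonly published Soundex code table (1: B F P V; 2: C G J K Q S X Z; 3: D T; 4: L; 5: M N; 6: R), and it follows the slides literally, so letters are discarded before repeated codes are collapsed and the first letter takes no part in the collapsing. Some other Soundex variants treat those two cases differently. The names `CODES`, `DISCARDED` and `soundex` are illustrative.

```python
# Assumed code table: the commonly published Soundex grouping.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

DISCARDED = set("AEIOUHWY")


def soundex(word: str) -> str:
    """Encode a word with the simplified steps listed on the slides."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]

    # Discard vowels and h, w, y from the rest of the word.
    consonants = [ch for ch in rest if ch not in DISCARDED]

    # Code the remaining consonants, coding consecutive repeats only once.
    digits = []
    for ch in consonants:
        digit = CODES[ch]
        if not digits or digits[-1] != digit:
            digits.append(digit)

    # Pad with zeroes or truncate to exactly four characters.
    return (first + "".join(digits) + "000")[:4]


print(soundex("ROSNER"))   # R256, as in the worked example
print(soundex("Rossner"))  # the misspelling receives the same code
```

Because the first letter is kept as-is, spellings that differ in their first letter (Kirkop, Chircop) still receive different codes under this sketch.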
Uses for the Soundex Code
• Airline reservations: the Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
• U.S. Census: as noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
• Genealogy: the Soundex code is most often used to avoid obstacles when dealing with names that might have alternate spellings.
Improvements
• Preprocessing before applying the basic algorithm, e.g. identification of
  • DG with G
  • GH with H
  • GN with N (not 'ng')
  • KN with N
  • PH with F
• Question: where to stop?
• Question: how to evaluate?
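One way to sketch the preprocessing idea is a single left-to-right substitution pass over the name before it is encoded. The function name `preprocess` and the choice of a single regex pass (so that one replacement cannot trigger another) are assumptions made for illustration; the slide only lists the letter pairs.

```python
import re

# Letter pairs from the slide, each identified with a single letter.
DIGRAPHS = {"DG": "G", "GH": "H", "GN": "N", "KN": "N", "PH": "F"}
PATTERN = re.compile("|".join(DIGRAPHS))


def preprocess(name: str) -> str:
    """Replace each listed letter pair in one pass; 'NG' is left untouched."""
    return PATTERN.sub(lambda m: DIGRAPHS[m.group()], name.upper())


print(preprocess("Philip"))  # FILIP
print(preprocess("Knight"))  # NIHT
print(preprocess("Song"))    # SONG (unchanged: 'ng' is not 'gn')
```

The output of `preprocess` would then be passed to whatever Soundex encoder is in use.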
IR Applications
• Information Retrieval: Query → Relevant Documents
• "Bag of Terms" document model
• What is a single term?
Why Stemming is Necessary
• Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, …
• The performance of an IR system will be improved if all of these terms are conflated.
  • Fewer terms to worry about
  • More accurate statistics
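The effect on a bag-of-terms model can be illustrated with a toy example. The lookup table `STEM` below is a hypothetical stand-in for a real stemmer, and the sample document is made up; the sketch only shows how conflation shrinks the vocabulary and pools the counts.

```python
from collections import Counter

# Hypothetical conflation table standing in for a real stemmer.
STEM = {w: "comput" for w in
        ["compute", "computer", "computing", "computation", "computability"]}


def bag_of_terms(text: str, conflate: bool = False) -> Counter:
    """Count terms in a text, optionally conflating them first."""
    tokens = text.lower().split()
    if conflate:
        tokens = [STEM.get(t, t) for t in tokens]
    return Counter(tokens)


doc = "computer computing computation compute the computer"
raw = bag_of_terms(doc)
merged = bag_of_terms(doc, conflate=True)

print(len(raw), len(merged))   # 5 distinct terms before, 2 after
print(merged["comput"])        # 5 occurrences pooled under one term
```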
Issues
• Is a dictionary available?
  • Stems
  • Affixes
• Motivation: linguistic credibility or engineering performance?
• When to remove an affix versus when to leave it alone
• Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2", e.g. relate/relativity vs. radioactive/radioactivity
Consonants and Vowels
• A consonant is a letter other than a, e, i, o, u, and other than y preceded by a consonant. So y is a vowel in sky but a consonant in toy.
• If a letter is not a consonant it is a vowel.
• A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.
• For example, the word troubles maps to C V C V C.
• Any word or part of a word therefore has one of the following forms:
  • (CV)^n … C
  • (CV)^n … V
  • (VC)^n … C
  • (VC)^n … V
Measure
• All the above patterns can be replaced by the following regular expression: (C)(VC)^m(V), where the leading (C) and the trailing (V) are optional.
• m is called the measure of any word or word part.
  • m=0: tr, ee, tree, y, by
  • m=1: trouble, oats, trees, ivy
  • m=2: troubles, private
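The consonant/vowel definitions and the measure can be checked with a short Python sketch. The function names `is_consonant`, `cv_form` and `measure` are illustrative, and the input is assumed to be a lowercase alphabetic word.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under the definition above."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word: str) -> str:
    """Collapse runs of consonants and vowels into single C and V symbols."""
    form = ""
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not form.endswith(tag):
            form += tag
    return form


def measure(word: str) -> int:
    """m in (C)(VC)^m(V): the number of VC pairs in the collapsed form."""
    return cv_form(word).count("VC")


print(cv_form("troubles"), measure("troubles"))  # CVCVC 2
print(cv_form("tree"), measure("tree"))          # CV 0
print(cv_form("ivy"), measure("ivy"))            # VCV 1
```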
Rules
• Rules for removing a suffix are given in the form: (condition) S1 → S2
• i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
• Example rule: (m > 1) EMENT → (S2 is empty, so the suffix is deleted)
• Example: enlargement → enlarg
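A single rule of the form (condition) S1 → S2 can be sketched as a function that takes the condition as a predicate on the stem. The helper `apply_rule` and its signature are assumptions for illustration; `measure` is a compact version of the measure m, assuming lowercase input.

```python
def measure(stem: str) -> int:
    """Number of VC pairs in the collapsed consonant/vowel form of stem."""
    form, prev_cons = "", False
    for i, ch in enumerate(stem):
        # y counts as a consonant unless it follows a consonant.
        cons = ch not in "aeiou" and (ch != "y" or i == 0 or not prev_cons)
        tag = "C" if cons else "V"
        if not form.endswith(tag):
            form += tag
        prev_cons = cons
    return form.count("VC")


def apply_rule(word: str, s1: str, s2: str, condition) -> str:
    """Apply (condition) S1 -> S2; return the word unchanged if it does not fire."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


# (m > 1) EMENT ->
print(apply_rule("enlargement", "ement", "", lambda stem: measure(stem) > 1))  # enlarg
print(apply_rule("cement", "ement", "", lambda stem: measure(stem) > 1))       # cement
```

In the second call the stem is "c", whose measure is 0, so the condition fails and the word is left alone.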
Conditions
• *S - stem ends with s
• *Z - stem ends with z
• *T - stem ends with t
• *v* - stem contains a vowel
• *d - stem ends with a double consonant
• *o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
• In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
• Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
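Longest-suffix selection and Boolean conditions can be sketched as below. The three rules are taken from Porter's published rule set, but their measure conditions (m>0, m>1) are deliberately omitted to keep the sketch short, so it is a simplification rather than the real step. The names `ends_with`, `any_of`, `ALWAYS`, `RULES` and `apply_step` are hypothetical.

```python
def ends_with(letter):
    """Build a condition such as *S or *T."""
    return lambda stem: stem.endswith(letter)


def any_of(*conditions):
    """Boolean 'or' over conditions."""
    return lambda stem: any(cond(stem) for cond in conditions)


ALWAYS = lambda stem: True

# (S1, S2, condition); measure conditions omitted in this simplified sketch.
RULES = [
    ("ational", "ate", ALWAYS),
    ("tional", "tion", ALWAYS),
    ("ion", "", any_of(ends_with("s"), ends_with("t"))),  # (*S or *T)
]


def apply_step(word: str, rules) -> str:
    """Within one step, only the rule with the longest matching S1 is tried."""
    matching = [rule for rule in rules if word.endswith(rule[0])]
    if not matching:
        return word
    s1, s2, condition = max(matching, key=lambda rule: len(rule[0]))
    stem = word[: len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


print(apply_step("relational", RULES))   # relate (ATIONAL beats TIONAL)
print(apply_step("conditional", RULES))  # condition
print(apply_step("adoption", RULES))     # adopt (stem ends with t)
print(apply_step("opinion", RULES))      # opinion (condition fails)
```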
Organisation
• Step 1: Plurals and Third Person Singular Verbs (-s)
• Step 2: Verbal Past Tense and Progressive (-ed, -ing)
• Step 3: Y to I, Noun Inflections (fly/flies)
• Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
• Step 6: Derivational Morphology, Single Suffixes
• Step 7: Cleanup
Step 1: Plural Nouns and 3rd Person Singular Verbs
Step 2a: Verbal Past Tense and Progressive Forms
Step 2b: Cleanup (if the 2nd or 3rd rule of the last step succeeds)
Step 3: Y to I
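The first three steps can be sketched in Python. The rule tables for these slides are not reproduced in the text, so the rules below are an assumption: they are taken from Porter's (1980) published algorithm (where these steps are numbered 1a, 1b and 1c) and mapped onto the slide numbering. Input is assumed to be lowercase; all function names are illustrative.

```python
def is_consonant(w, i):
    if w[i] in "aeiou":
        return False
    if w[i] == "y":
        return i == 0 or not is_consonant(w, i - 1)
    return True

def measure(stem):
    form = ""
    for i in range(len(stem)):
        tag = "C" if is_consonant(stem, i) else "V"
        if not form.endswith(tag):
            form += tag
    return form.count("VC")

def contains_vowel(stem):                      # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):               # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                            # *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step1(word):
    # SSES -> SS, IES -> I, SS -> SS, S -> (empty); longest suffix first.
    for s1, s2 in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word

def step2b(stem):
    # Cleanup after -ed / -ing has been removed.
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step2a(word):
    # (m>0) EED -> EE, (*v*) ED -> (empty), (*v*) ING -> (empty)
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for s1 in ("ed", "ing"):
        if word.endswith(s1):
            stem = word[: -len(s1)]
            return step2b(stem) if contains_vowel(stem) else word
    return word

def step3(word):
    # (*v*) Y -> I
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

for w in ["caresses", "ponies", "hopping", "filing", "happy"]:
    print(w, "->", step3(step2a(step1(w))))
```

The remaining steps (derivational suffixes and final cleanup) follow the same pattern with longer rule tables.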
Porter Example
INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management
Porter Output
Summary
• Conflation serves different purposes.
• Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity.
• This can cause errors in the bag of words model.
• Soundex and Porter are very well established and easily available.