Human Language Technology Conflation Algorithms HLT: Conflation Algorithms
Acknowledgements
• John Repici (2002) http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
• Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
• Jurafsky & Martin, Appendix B, pp. 833-836.
Conflation
COMPUTE, COMPUTER, COMPUTING, COMPUTES, COMPUTABILITY, COMPUTATION → COMPUT
Types of Conflation Algorithm
• Stemming
  • Process based, e.g. affix stripping
• Lemmatisation
  • Attempt to map to the same lemma
  • POS dependent
• Morphological Analysis
  • Includes morpho-syntactic information
Word Conflation Algorithms
• Morphological analysis versus conflation
• Notion of word class used is application dependent
  • Genealogy: phonetic similarity
  • Information Retrieval: semantic similarity
• Based on written language (not phonetic transcription)
• Well-known algorithms
  • Soundex
  • Porter
Soundex: Problems with Names
• Names can be misspelt: Rossner
• The same name can be spelt in different ways: Kirkop; Chircop
• The same name appears differently in different cultures: Tchaikovsky; Chaicowski
• To solve this problem, we need phonetically oriented algorithms that can find similar-sounding terms and names.
• Just such a family of algorithms exists; they are called Soundexes, after the first patented version.
The Soundex Algorithm
• A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
• It is very handy for searching large databases.
• Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.
Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
• The first character of the word is retained as the first character of the Soundex code.
• In the remainder of the word, the following letters are discarded: a, e, i, o, u, h, w, and y.
• Remaining consonants are given a code number.
• If consonants having the same code number appear consecutively, the number is coded only once (e.g. "B233" becomes "B23").
Code Numbers
Standard Soundex assignments:
• 1: B, F, P, V
• 2: C, G, J, K, Q, S, X, Z
• 3: D, T
• 4: L
• 5: M, N
• 6: R
Soundex Algorithm: Example
The Soundex Algorithm uses the following steps to encode a word: [ROSNER]
• The first character of the word is retained as the first character of the Soundex code. [R]
• The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]
• Remaining consonants are given a code number. [R256]
• If consonants having the same code number appear consecutively, the number is coded only once (e.g. "B233" becomes "B23"). [R256]
Soundex Algorithm 2
• The resulting code is modified so that it becomes exactly four characters long:
  • If it is shorter than four characters, zeroes are added to the end (e.g. "B2" becomes "B200").
  • If it is longer than four characters, the code is truncated (e.g. "B2435" becomes "B243").
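The steps on the Soundex slides can be sketched in a few lines of Python. This is a minimal illustration that follows the steps exactly as stated here, not a reference implementation. It assumes the standard Soundex letter-to-digit assignment for the Code Numbers slide, and the names `CODES`, `DISCARDED` and `soundex` are chosen for this example only. Common Soundex variants treat letters separated by a vowel differently from this simplified version.

```python
# Standard Soundex letter-to-digit assignment (assumed for the "Code Numbers" slide).
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

# Letters dropped from everything after the first character.
DISCARDED = set("AEIOUHWY")


def soundex(word: str) -> str:
    """Encode a word following the simplified steps on the slides."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]

    # Discard vowels and h, w, y, then give each remaining consonant its digit.
    digits = [CODES[ch] for ch in rest if ch not in DISCARDED and ch in CODES]

    # Code a run of identical digits only once ("B233" becomes "B23").
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)

    # Pad with zeroes or truncate so the code is exactly four characters.
    code = first + "".join(collapsed)
    return (code + "000")[:4]


print(soundex("ROSNER"))   # R256, as in the worked example
print(soundex("Rossner"))  # R256, so the misspelling conflates with Rosner
```

Because the first letter is always retained, spellings that differ in their initial letter (Kirkop and Chircop, for instance) still receive different codes under this scheme.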
Uses for the Soundex Code
• Airline reservations: the Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
• U.S. Census: as noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
• Genealogy: the Soundex code is most often used to avoid problems when dealing with names that might have alternate spellings.
Improvements
• Preprocessing before applying the basic algorithm, e.g. identification of
  • DG with G
  • GH with H
  • GN with N (not 'ng')
  • KN with N
  • PH with F
• Question: where to stop?
• Question: how to evaluate?
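The preprocessing idea can be sketched as a small rewrite pass that runs before the basic encoding. The function name `preprocess` and the order in which the replacements are applied are assumptions made for this illustration; the slide lists only the letter pairs.

```python
# Digraph rewrites listed on the slide, applied in this (assumed) order.
REWRITES = [("DG", "G"), ("GH", "H"), ("GN", "N"), ("KN", "N"), ("PH", "F")]


def preprocess(name: str) -> str:
    """Normalise common digraphs before a Soundex-style encoding."""
    result = name.upper()
    for old, new in REWRITES:
        # Only the exact pair is replaced, so "NG" is left untouched.
        result = result.replace(old, new)
    return result


print(preprocess("Phillip"))  # FILLIP
print(preprocess("Knight"))   # NIHT
```

Each added rewrite makes more spellings collide, which is why the questions of where to stop and how to evaluate matter.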
IR Applications
• Information Retrieval: Query → Relevant Documents
• "Bag of Terms" document model
• What is a single term?
Why Stemming is Necessary
• Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, ...
• Performance of an IR system will be improved if all of these terms are conflated.
  • Fewer terms to worry about
  • More accurate statistics
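The effect on a bag-of-terms model can be shown with a toy example. The lookup table `TOY_STEMS` is a hypothetical stand-in for a real stemmer and covers only the words on this slide; `bag_of_terms` is likewise a name invented for this sketch.

```python
from collections import Counter

# Hypothetical hand-written stem table; a real system would use a stemmer.
TOY_STEMS = {
    "compute": "comput",
    "computer": "comput",
    "computing": "comput",
    "computation": "comput",
    "computability": "comput",
}


def bag_of_terms(text: str, conflate: bool = False) -> Counter:
    """Count terms in a text, optionally mapping each term to its stem."""
    terms = text.lower().split()
    if conflate:
        terms = [TOY_STEMS.get(term, term) for term in terms]
    return Counter(terms)


doc = "computer computing computation compute retrieval"
raw = bag_of_terms(doc)
conflated = bag_of_terms(doc, conflate=True)
print(len(raw), len(conflated))  # 5 distinct terms before, 2 after
print(conflated["comput"])       # 4 occurrences pooled under one term
```

Without conflation each variant has a count of one; with conflation the four variants pool into a single, better-supported statistic.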
Issues
• Is a dictionary available?
  • Stems
  • Affixes
• Motivation: linguistic credibility or engineering performance?
• When to remove an affix versus when to leave it alone
• Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2". Compare relate/relativity with radioactive/radioactivity.
Consonants and Vowels
• A consonant is a letter other than a, e, i, o, u, and other than y preceded by a consonant. So the y in sky is a vowel, whereas the y in toy is a consonant.
• If a letter is not a consonant it is a vowel.
• A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.
• For example, the word troubles maps to C V C V C.
• Any word or part of a word therefore has one of the following forms:
  CVCV...C
  CVCV...V
  VCVC...C
  VCVC...V
Measure
• All the above patterns can be replaced by the following regular expression: [C](VC)^m[V], where square brackets mark optional parts and (VC)^m means VC repeated m times.
• m is called the measure of any word or word part.
• m=0: tr, ee, tree, y, by
• m=1: trouble, oats, trees, ivy
• m=2: troubles, private
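The C/V form and the measure can be computed directly from the definitions. The sketch below follows Porter (1980), where y counts as a consonant unless it is preceded by a consonant; the function names `is_consonant`, `cv_form` and `measure` are chosen for this illustration.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word: str) -> str:
    """Collapse runs of consonants and vowels, e.g. 'troubles' -> 'CVCVC'."""
    word = word.lower()
    form = ""
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != symbol:
            form += symbol
    return form


def measure(word: str) -> int:
    """Number of VC pairs in the collapsed form: the m in [C](VC)^m[V]."""
    return cv_form(word).count("VC")


for w in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, cv_form(w), measure(w))
```

Running this reproduces the slide's groupings: tree and by have m=0, trouble and ivy have m=1, and troubles and private have m=2.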
Rules
• Rules for removing a suffix are given in the form (condition) S1 → S2
• i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
• Example rule: (m > 1) EMENT → (S2 is empty, so the suffix is deleted)
• Example: enlargement → enlarg
Conditions
• *S - stem ends with s
• *Z - stem ends with z
• *T - stem ends with t
• *v* - stem contains a vowel
• *d - stem ends with a double consonant
• *o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
• In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
• Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
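A rule of the form (condition) S1 → S2 and the stem conditions can be sketched as follows. This is an illustrative outline under stated assumptions, not the full stemmer: the helper names (`contains_vowel`, `ends_double_consonant`, `ends_cvc`, `apply_step`) and the one-rule list `RULES` are invented for this example, and only the EMENT rule from the Rules slide is included.

```python
def is_consonant(stem: str, i: int) -> bool:
    if stem[i] in "aeiou":
        return False
    if stem[i] == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem: str) -> bool:            # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:     # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem: str) -> bool:                  # *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


def apply_step(word: str, rules) -> str:
    """Apply one step: the rule with the longest matching S1 is the only candidate."""
    for condition, s1, s2 in sorted(rules, key=lambda r: len(r[1]), reverse=True):
        if word.endswith(s1):
            stem = word[: len(word) - len(s1)]
            return stem + s2 if condition(stem) else word
    return word


# (m > 1) EMENT -> (empty)
RULES = [(lambda stem: measure(stem) > 1, "ement", "")]

print(apply_step("enlargement", RULES))  # enlarg
print(apply_step("cement", RULES))       # cement (stem "c" has m = 0)
```

Boolean combinations such as (m>1 and (*S or *T)) fit the same shape: the condition is just a function of the stem, for example `lambda s: measure(s) > 1 and s[-1:] in ("s", "t")`.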
Organisation
• Step 1: Plurals and Third Person Singular Verbs (-s)
• Step 2: Verbal Past Tense and Progressive (-ed, -ing)
• Step 3: Y to I, Noun Inflections (fly/flies)
• Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
• Step 6: Derivational Morphology, Single Suffixes
• Step 7: Cleanup
Step 1: Plural Nouns and 3rd Person Singular Verbs
Step 2a: Verbal Past Tense and Progressive Forms
Step 2b: Cleanup (applied if the 2nd or 3rd rule of the last step succeeds)
Step 3: Y to I
Porter Example
INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management
Porter Output
Stemming Errors
• Under-stemming
  • the error of taking off too small a suffix
  • example: croulons → croulon
  • since croulons is a form of the verb crouler
• Over-stemming
  • the error of taking off too much
  • example: croûtons → croût
  • since croûtons is the plural of croûton
• Mis-stemming
  • taking off what looks like an ending, but is really part of the stem
  • example: reply → rep
Summary
• Conflation serves different purposes.
• Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity.
• This can cause errors in the bag-of-words model.
• Soundex and Porter are very well established and easily available.