Human Language Technology


  1. Human Language Technology Conflation Algorithms HLT: Conflation Algorithms

  2. Acknowledgements
     • John Repici (2002) http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
     • Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
     • Jurafsky & Martin, appendix B, pp. 833-836.

  3. Conflation
     • COMPUTE, COMPUTER, COMPUTING, COMPUTES, COMPUTABILITY, COMPUTATION → COMPUT

  4. Types of Conflation Algorithm
     • Stemming
       • Process based, e.g. affix stripping
     • Lemmatisation
       • Attempt to map to same lemma
       • POS dependent
     • Morphological Analysis
       • Includes morpho-syntactic information

  5. Word Conflation Algorithms
     • Morphological analysis versus conflation
     • Notion of word class used is application dependent
       • Genealogy: phonetic similarity
       • Information Retrieval: semantic similarity
     • Based on written language (not phonetic transcription)
     • Well known algorithms
       • Soundex
       • Porter

  6. Soundex: Problems with Names
     • Names can be misspelt: Rossner
     • The same name can be spelt in different ways: Kirkop; Chircop
     • The same name appears differently in different cultures: Tchaikovsky; Chaicowski
     • To solve this problem, we need phonetically oriented algorithms which can find similar sounding terms and names.
     • Just such a family of algorithms exists; they are called SoundExes, after the first patented version.

  7. The Soundex Algorithm
     • A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
     • It is very handy for searching large databases.
     • Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.

  8. Soundex Algorithm 1
     The Soundex Algorithm uses the following steps to encode a word:
     • The first character of the word is retained as the first character of the Soundex code.
     • The following letters are discarded: a, e, i, o, u, h, w, and y.
     • Remaining consonants are given a code number.
     • If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").

  9. Code Numbers
     [table not preserved in transcript]

  10. Soundex Algorithm: Example
      The Soundex Algorithm uses the following steps to encode a word: [ROSNER]
      • The first character of the word is retained as the first character of the Soundex code. [R]
      • The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]
      • Remaining consonants are given a code number. [R256]
      • If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23"). [R256]

  11. Soundex Algorithm 2
      • The resulting code is modified so that it becomes exactly four characters long:
        • If it is shorter than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").
        • If it is longer than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").
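
The steps on slides 8 to 11 can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the code table on slide 9 did not survive in this transcript, so the commonly published Soundex table is assumed (it agrees with the ROSNER → R256 example), and the function name `soundex` is chosen here for illustration. The sketch follows the slide order literally (discard, then code, then collapse), which differs from Soundex variants that let a vowel separate two equal codes.

```python
# Assumed letter-to-digit table (the commonly published Soundex table);
# the table from slide 9 is not available in the transcript.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word following the slide's steps in the order given."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    # Step 1: keep the first character.
    first = letters[0].upper()
    # Steps 2 and 3: a, e, i, o, u, h, w, y have no entry in CODES,
    # so they are discarded; the remaining consonants get their digit.
    digits = [CODES[c] for c in letters[1:] if c in CODES]
    # Step 4: consecutive equal digits are coded only once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: pad with zeroes or truncate to exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]


print(soundex("Rosner"))   # R256
print(soundex("Rossner"))  # R256
```

The misspelling Rossner from slide 6 receives the same code as Rosner, which is the conflation the algorithm is after. Kirkop and Chircop, by contrast, still differ in their retained first letter.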

  12. Uses for the Soundex Code
      • Airline reservations - The Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
      • U.S. Census - As is noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
      • Genealogy - In genealogy, the Soundex code is most often used to avoid problems when dealing with names that might have alternate spellings.

  13. Improvements
      • Preprocessing before applying the basic algorithm, e.g. identification of
        • DG with G
        • GH with H
        • GN with N (not 'ng')
        • KN with N
        • PH with F
      • Question: where to stop?
      • Question: how to evaluate?
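
The preprocessing idea can be illustrated with a short Python sketch. It assumes the substitutions are plain left-to-right string replacements applied in the order the slide lists them; the names `DIGRAPHS` and `preprocess` are hypothetical.

```python
# Substitutions from the slide, applied in the listed order (an assumption).
DIGRAPHS = [("DG", "G"), ("GH", "H"), ("GN", "N"), ("KN", "N"), ("PH", "F")]


def preprocess(name):
    """Rewrite selected letter pairs before Soundex coding."""
    result = name.upper()
    for old, new in DIGRAPHS:
        result = result.replace(old, new)
    return result


print(preprocess("Philip"))  # FILIP
print(preprocess("Knight"))  # NIHT
print(preprocess("King"))    # KING ('ng' is left alone)
```

Because each replacement sees the output of the previous one, the order of the list affects the result, which is one concrete form of the "where to stop?" question.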

  14. IR Applications
      • Information Retrieval: Query → Relevant Documents
      • "Bag of Terms" document model
      • What is a single term?

  15. Why Stemming is Necessary
      • Frequently we get collections of words of the following kind in the same document: compute, computer, computing, computation, computability, ...
      • Performance of an IR system will be improved if all of these terms are conflated.
        • Fewer terms to worry about
        • More accurate statistics
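
The effect on a bag-of-terms model can be shown with a toy Python example. The lookup table `STEM_OF` is a hypothetical stand-in for a real stemmer, and the sample text is invented for illustration.

```python
from collections import Counter

# Hypothetical lookup standing in for a stemmer: every listed form maps
# to the stem "comput" shown on slide 3.
STEM_OF = {w: "comput" for w in
           ("compute", "computer", "computing", "computation", "computability")}


def bag_of_terms(text, conflate=False):
    """Count terms in a text, optionally conflating known word forms."""
    terms = text.lower().split()
    if conflate:
        terms = [STEM_OF.get(t, t) for t in terms]
    return Counter(terms)


doc = "computer computing computation and computability"
raw = bag_of_terms(doc)
stemmed = bag_of_terms(doc, conflate=True)
print(len(raw), len(stemmed))  # 5 2
print(stemmed["comput"])       # 4
```

Five distinct terms with a count of one each become two terms, and the count for the conflated stem reflects how much of the document concerns the concept.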

  16. Issues
      • Is a dictionary available?
        • Stems
        • Affixes
      • Motivation: linguistic credibility or engineering performance?
      • When to remove an affix versus when to leave it alone
      • Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
        • Compare relate/relativity vs. radioactive/radioactivity

  17. Consonants and Vowels
      • A consonant is a letter other than a, e, i, o, u, and other than y preceded by a consonant. So the y in sky is a vowel, whereas the y in toy is a consonant.
      • If a letter is not a consonant, it is a vowel.
      • A sequence of consonants (cc..c) or vowels (vv..v) is represented by C or V respectively.
      • For example, the word troubles maps to C V C V C.
      • Any word or part of a word therefore has one of the following forms:
        • CVCV...C
        • CVCV...V
        • VCVC...C
        • VCVC...V
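
The mapping from a word to its C/V form can be sketched in Python as below. The function names `is_consonant` and `cv_form` are chosen here for illustration, and lowercase ASCII input is assumed.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the definition above."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel when preceded by a consonant, otherwise a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse runs of consonants and vowels into single C and V symbols."""
    word = word.lower()
    form = []
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != symbol:
            form.append(symbol)
    return "".join(form)


print(cv_form("troubles"))  # CVCVC
print(cv_form("sky"))       # CV
print(cv_form("toy"))       # CVC
```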

  18. Measure
      • All the above patterns can be replaced by the following regular expression: [C](VC)^m[V], where square brackets mark an optional part and (VC)^m denotes VC repeated m times.
      • m is called the measure of any word or word part.
      • m=0: tr, ee, tree, y, by
      • m=1: trouble, oats, trees, ivy
      • m=2: troubles, private
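
One way to compute the measure is to count how often a vowel run is followed by a consonant run, since each such transition corresponds to one VC pair. The Python sketch below takes that approach; the function names are illustrative and lowercase ASCII input is assumed.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (y counts only if not after a consonant)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m, the number of VC pairs in the word's C/V form."""
    word = word.lower()
    m = 0
    prev_vowel = False
    for i in range(len(word)):
        cons = is_consonant(word, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m


for w in ("tr", "ee", "tree", "y", "by", "trouble", "oats",
          "trees", "ivy", "troubles", "private"):
    print(w, measure(w))
```

Running the loop reproduces the grouping of examples listed on the slide.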

  19. Rules
      • Rules for removing a suffix are given in the form (condition) S1 → S2
      • i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
      • Example rule: (m > 1) EMENT → (S2 is empty, so the suffix is simply removed)
      • Example: enlargement → enlarg
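
A single rule of this form can be sketched in Python as a function that takes the two suffixes and a condition on the stem. The name `apply_rule` and its signature are hypothetical; the measure helper is repeated so that the example runs on its own.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-run to consonant-run transitions (Porter's m)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def apply_rule(word, s1, s2, condition=lambda stem: True):
    """Apply (condition) S1 -> S2; return the word unchanged if it fails."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


# (m > 1) EMENT -> (empty)
print(apply_rule("enlargement", "ement", "", lambda s: measure(s) > 1))  # enlarg
print(apply_rule("cement", "ement", "", lambda s: measure(s) > 1))       # cement
```

The stem "enlarg" has measure 2, so the suffix is removed; for "cement" the stem "c" has measure 0, so the condition blocks the rule.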

  20. Conditions
      • *S - stem ends with s
      • *Z - stem ends with z
      • *T - stem ends with t
      • *v* - stem contains a vowel
      • *d - stem ends with a double consonant
      • *o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
      • In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
      • Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
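
The stem conditions translate naturally into small predicates. The Python sketch below uses illustrative function names and assumes lowercase ASCII stems; Boolean combinations are then ordinary `and`/`or` expressions.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends cvc and the final c is not w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


def ends_s_or_t(stem):
    """(*S or *T): a Boolean combination of two letter conditions."""
    return stem.endswith("s") or stem.endswith("t")


print(ends_cvc("wil"), ends_cvc("hop"), ends_cvc("snow"))  # True True False
```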

  21. Organisation
      • Step 1: Plurals and Third Person Singular Verbs (-s)
      • Step 2: Verbal Past Tense and Progressive (-ed, -ing)
      • Step 3: Y to I, Noun Inflections (fly/flies)
      • Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
      • Step 6: Derivational Morphology, Single Suffixes
      • Step 7: Cleanup

  22. Step 1: Plural Nouns and 3rd Person Singular Verbs
      [table not preserved in transcript]
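
As an illustration of how one step selects a rule, the Python sketch below uses the four plural rules from Porter's published algorithm (SSES → SS, IES → I, SS → SS, S → empty). That these are the rules in the slide's missing table is an assumption, and the names `STEP1_RULES` and `step1` are hypothetical.

```python
# Plural rules from Porter (1980); assumed to match the slide's table.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the rule whose suffix is the longest match, if any."""
    by_length = sorted(STEP1_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for s1, s2 in by_length:
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step1(w))
```

The SS → SS rule changes nothing by itself, but because it is a longer match than S it protects words such as caress from losing their final s.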

  23. Step 2a: Verbal Past Tense and Progressive Forms
      [table not preserved in transcript]

  24. Step 2b: Cleanup (applied if the 2nd or 3rd rule of the last step succeeds)
      [table not preserved in transcript]

  25. Step 3: Y to I
      [table not preserved in transcript]

  28. Porter Example
      INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management

  29. Porter Output
      [output not preserved in transcript]

  30. Stemming Errors
      • Under-stemming
        • the error of taking off too small a suffix
        • croulons → croulon
        • since croulons is a form of the verb crouler
      • Over-stemming
        • the error of taking off too much
        • example: croûtons → croût
        • since croûtons is the plural of croûton
      • Mis-stemming
        • taking off what looks like an ending, but is really part of the stem
        • reply → rep

  31. Summary
      • Conflation serves different purposes.
      • Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity.
      • This can cause errors in the bag of words model.
      • Soundex and Porter are very well established and easily available.
