
Fuzzy Translation of Cross-Lingual Spelling Variants

This paper presents a two-step fuzzy translation technique for improving Cross-Lingual Information Retrieval (CLIR) performance by handling cross-lingual spelling variants. The technique utilizes transformation rule based translation (TRT) and fuzzy matching to translate intermediate forms into the target language.


Presentation Transcript


  1. Fuzzy Translation of Cross-Lingual Spelling Variants • Advisor: Dr. Hsu • Student: Sheng-Hsuan Wang • Department of Information Management • SIGIR’03

  2. Outline • Motivation • Objective • Introduction • Method & Data • Findings • Discussion & Conclusions

  3. Motivation • The limitation on CLIR performance. • Some terms not in translation dictionaries. • Fuzzy matching ~ n-gram method.

  4. Objective • Two-step fuzzy translation technique for cross-lingual spelling variants to improve the CLIR performance • Transformation rule based translation, TRT. • Translate the intermediate forms into a target language using fuzzy matching.

  5. Introduction • Technical terms and proper names are important text elements, but not generally found in electronic translation dictionaries utilized by MT and CLIR. • Non-identical translatable spelling variant forms, e.g., Chernobyl – Tshernobyl. • Similarity measure • N-gram • Fuzzy matching • Transliteration

  6. Introduction • This paper proposes transformation rule based translation (TRT). • Close to transliteration, but with no phonetic elements. • Suitable for cross-lingual spelling variants. • Example: Spanish embriologia => English embryology • Problem: how can such rules be found automatically? • Equivalent term pairs are extracted from a translation dictionary and aligned pairwise. • Edit distance.

  7. Introduction • Two-step fuzzy translation • Source words are translated into intermediate forms based on TRT, in order to render a source word more similar to its target equivalent. • The intermediate forms are translated into target language equivalents through approximate string matching, i.e. fuzzy matching, n-gram based matching.

  8. Method & Data - Overview • Diagram: Translation dictionary → TRT → Intermediate form → N-gram matching • Term pairs shown: (emcriologia, embryology) (emariolagia, embryology) (embrialagia, embryology) … • Translation strategies: high confidence factor (HCF) and low confidence factor (LCF) • Example: konvektio => convection, via the rules o – on (end), ko – co (beginning), ekt – ect (middle)

  9. Method & Data - TRT threshold • Diagram: Translation dictionary → Edit distance → Selection of proper terms and error values • Term pairs shown: (emcriologia, embryology) (emariolagia, embryology) (embrialagia, embryology) … and (embriologia, embryology) (embriolagia, embryology) (embrialagia, embryology) … • Error values: 0, the same character at the same position; 1, consonant-consonant or vowel-vowel substitution; 1, insertion or deletion of a character; 2, consonant-vowel or vowel-consonant substitution. • Minimum ED: the one transformation with the smallest sum of error values is selected. • Rule: on->oughn at middle position
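The error values on this slide can be read as the costs of a weighted edit distance. The following is only a sketch under that reading: the vowel set `VOWELS` and the function names are assumptions, since the slide does not say which characters count as vowels or how the sum is computed.

```python
# Assumption: the slide does not list the vowel set; "y" and the Nordic
# vowels are included here only for illustration.
VOWELS = set("aeiouyåäö")


def substitution_error(x: str, y: str) -> int:
    """Error value for aligning character x with character y."""
    if x == y:
        return 0  # same character at the same position
    # Same class (vowel-vowel or consonant-consonant) costs 1, mixed costs 2.
    return 1 if (x in VOWELS) == (y in VOWELS) else 2


def error_value(source: str, target: str) -> int:
    """Smallest sum of error values over all alignments of source and target."""
    prev = list(range(len(target) + 1))  # aligning "" with target prefixes
    for i, s in enumerate(source, 1):
        curr = [i]  # aligning a source prefix with "" means i deletions
        for j, t in enumerate(target, 1):
            curr.append(min(
                prev[j] + 1,                             # delete s
                curr[j - 1] + 1,                         # insert t
                prev[j - 1] + substitution_error(s, t),  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(error_value("embriologia", "embryology"))
```

With these assumed costs, the pair with the lowest `error_value` would be the one kept for rule generation.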

  10. Transformation Rule based Translation • Edit Distance • Automatic Generation of Rules • Extracting similar terms from a dictionary with edit distance threshold. • Selection of proper terms with the smallest sum of error values. • Generation of transformation rules • Context Information, Frequency, and Confidence Factor • Sample Rules

  11. Edit Distance • $ED(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$, computed with the recurrence $d[i, j] = \min\{d[i-1, j] + 1,\ d[i, j-1] + 1,\ d[i-1, j-1] + cost\}$, where $cost = 0$ if $A[i] = B[j]$ and $cost = 1$ if $A[i] \neq B[j]$.
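The recurrence on this slide is the standard unit-cost edit distance. A minimal Python sketch follows; the function name `edit_distance` is illustrative, and the base cases (distance to the empty string) are the usual ones, which the slide does not spell out.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed with the dynamic-programming recurrence."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of the prefix of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of the prefix of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1]


print(edit_distance("chernobyl", "tshernobyl"))
```

For the spelling variants from the introduction, Chernobyl and Tshernobyl differ by one substitution and one insertion.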

  12. A sample of Spanish-to-English rules

  13. Translation Resources • Multilingual medical dictionary by Andre Fairchild. • A Finnish list of medical terms (n=5970) • A Swedish list of medical terms (n=657) • Language pairs • Finnish-English • French-English • German-English • Spanish-English • Swedish-English

  14. Target Word List and Source Words • Target word list • The index of CLEF’s LA Times collection, which contains 189,000 words. • Source words • First source word list, 217 word tuples • 72 training word tuples, 145 test word tuples. • Second source word list • 126 test word tuples. • Experiment dataset • 5 languages × (145 + 126) words = 1,355 words

  15. N-gram Matching • Similarity measure SIM between the source and target words w1 and w2 (formula shown as an image on the original slide), where Ni refers to the set of n-grams derived from the word wi. • Digrams vs. trigrams • Trigrams generally performed worse than digrams, but sometimes gave better results.
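The similarity formula itself did not survive in the transcript. The sketch below assumes a common set-overlap form, the number of shared n-grams divided by the number of distinct n-grams in either word. That choice, the absence of padding characters, and the names `ngrams`, `sim`, and `rank_targets` are all assumptions made for illustration.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams of a word (no padding, by assumption)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def sim(w1: str, w2: str, n: int = 2) -> float:
    """Assumed overlap measure: shared n-grams over all distinct n-grams."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def rank_targets(source: str, targets: list, n: int = 2) -> list:
    """Target words sorted by decreasing similarity to the source word."""
    return sorted(targets, key=lambda t: sim(source, t, n), reverse=True)


# Hypothetical target word list, for illustration only.
targets = ["collection", "convection", "connection"]
print(rank_targets("konvektio", targets))
print(rank_targets("convection", targets))
```

Passing `n=3` switches the comparison from digrams to trigrams.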

  16. Translation Strategies - High confidence factor (HCF) strategy • A relatively high confidence factor threshold, 50%, to minimize the number of incorrect transformations. • Reading order • The location of the rules in source words: end, beginning, and middle. • The source string length: the longest first. • Confidence factor: the highest first. • Example: konvektio => convection, via o – on (end), ko – co (beginning), ekt – ect (middle)
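The reading order above can be sketched as a sort key followed by sequential rule application. This is an illustration only: the `Rule` type, the confidence values, and the choice to apply each rule once to its first match are assumptions, not details given on the slide.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str        # substring in the source-language word
    target: str        # replacement that moves the word toward the target language
    position: str      # "end", "beginning", or "middle"
    confidence: float  # confidence factor in percent (values below are made up)


POSITION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def apply_rule(word: str, rule: Rule) -> str:
    """Apply one rule once, or return the word unchanged if it does not match."""
    s, t = rule.source, rule.target
    if rule.position == "end":
        return word[:-len(s)] + t if word.endswith(s) else word
    if rule.position == "beginning":
        return t + word[len(s):] if word.startswith(s) else word
    # "middle": first occurrence that touches neither the start nor the end
    idx = word.find(s, 1)
    if idx != -1 and idx + len(s) < len(word):
        return word[:idx] + t + word[idx + len(s):]
    return word


def translate_hcf(word: str, rules: list, threshold: float = 50.0) -> str:
    """Intermediate form built from rules at or above the confidence threshold."""
    usable = [r for r in rules if r.confidence >= threshold]
    # Reading order: position, then longest source string, then highest confidence.
    usable.sort(key=lambda r: (POSITION_ORDER[r.position], -len(r.source), -r.confidence))
    for rule in usable:
        word = apply_rule(word, rule)
    return word


# Hypothetical rule set; only the three rule shapes come from the slide.
RULES = [
    Rule("ekt", "ect", "middle", 80.0),
    Rule("ko", "co", "beginning", 70.0),
    Rule("o", "on", "end", 60.0),
    Rule("i", "y", "middle", 20.0),  # below the 50% threshold, so filtered out
]
print(translate_hcf("konvektio", RULES))
```

In this example the three slide rules turn konvektio into konvektion, then convektion, then convection, so the intermediate form already equals the target word before n-gram matching.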

  17. Translation Strategies - Low confidence factor (LCF) strategy • A confidence factor threshold of 10% was used to filter out unreliable rules. • Even more intermediate forms were obtained, but some of them may result from incorrect transformations. • In both HCF and LCF, rules whose frequency was < 50 were removed.

  18. Evaluation • For each word, precision was calculated from the position of the correct equivalent (pce) in the ranked result list of n-gram matching. • When several words share the same SIM value: • Worst position precision: the last word of the set • Average position precision: the middle of the set of words
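The tie handling can be illustrated by computing both positions for a correct equivalent that shares its SIM value with other words. The slide does not give the precision formula, so this sketch stops at the positions; the function name and the example scores are hypothetical.

```python
def tied_positions(scores: dict, correct: str) -> tuple:
    """Worst and average 1-based rank of the correct word among tied words."""
    target = scores[correct]
    better = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target)  # includes the correct word
    worst = better + tied                # last place within the tied set
    average = better + (tied + 1) / 2    # middle of the tied set
    return worst, average


# Hypothetical SIM values, for illustration only.
scores = {"collection": 0.9, "convection": 0.8, "conviction": 0.8, "connection": 0.8}
print(tied_positions(scores, "convection"))
```

Here one word ranks strictly higher and three words are tied, so the worst position is 4 and the average position is 3.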

  19. Findings • Five test word types • Medical, biological, and chemical terms (Bio terms), n=90 • Place names, n=55 • Economics, n=31 • Technology, n=36 • Miscellaneous, n=59 • Five language pairs • Finnish-English • French-English • German-English • Spanish-English • Swedish-English

  20. Findings – 1/3

  21. Findings – 2/3

  22. Findings – 3/3

  23. Discussion & Conclusion • Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries. • This study proposed two-step fuzzy translation • Automatically generated transformation rules, TRT • Fuzzy matching • Two translation strategies were tested, HCF & LCF • Digram and trigram matching were tested in combination with TRT

  24. Discussion & Conclusion • Effectiveness of fuzzy translation depends on • The frequency of identical terms shared by a source and a target language. • The extent of variation in the spelling variants between a source and a target language. • Fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.

  25. Personal opinion • How can we apply these ideas in our lab? • TRT?
