This paper presents a two-step fuzzy translation technique for improving Cross-Lingual Information Retrieval (CLIR) performance by handling cross-lingual spelling variants. The technique utilizes transformation rule based translation (TRT) and fuzzy matching to translate intermediate forms into the target language.
Fuzzy Translation of Cross-Lingual Spelling Variants Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department of Information Management SIGIR’03
Outline • Motivation • Objective • Introduction • Method & Data • Findings • Discussion & Conclusions
Motivation • CLIR performance is limited. • Some terms are not found in translation dictionaries. • Fuzzy matching, e.g. the n-gram method.
Objective • Two-step fuzzy translation technique for cross-lingual spelling variants to improve the CLIR performance • Transformation rule based translation, TRT. • Translate the intermediate forms into a target language using fuzzy matching.
Introduction • Technical terms and proper names are important text elements, but not generally found in electronic translation dictionaries utilized by MT and CLIR. • Non-identical translatable spelling variant forms, e.g., Chernobyl – Tshernobyl. • Similarity measure • N-gram • Fuzzy matching • Transliteration
Introduction • This paper presents transformation rule based translation (TRT). • Close to transliteration, but without phonetic elements. • Suitable for cross-lingual spelling variants. • Example: Spanish embriologia => English embryology • Problem: how can such rules be found automatically? • Equivalent term pairs are extracted from a translation dictionary and aligned pairwise. • Edit distance.
Introduction • Two-step fuzzy translation • Source words are translated into intermediate forms based on TRT, in order to render a source word more similar to its target equivalent. • The intermediate forms are translated into target language equivalents through approximate string matching, i.e. fuzzy matching, n-gram based matching.
Method & Data - Overview • Translation dictionary: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), … • Translation strategies: high confidence factor (HCF), low confidence factor (LCF) • TRT => intermediate form => n-gram matching • Example: konvektio, with o – on (end), ko – co (beginning), ekt – ect (middle) => convection
Method & Data - TRT • Translation dictionary, edit distance threshold: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), … • Selection of proper terms: (embriologia, embryology), (embriolagia, embryology), (embrialagia, embryology), … • Error values • 0: the same character at the same position • 1: consonant-consonant or vowel-vowel substitution • 1: insertion or deletion of a character • 2: consonant-vowel or vowel-consonant substitution • Minimum ED: the one transformation with the smallest sum of error values is selected. • Rule: on->oughn at middle position
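The error values on this slide can be plugged into an ordinary dynamic-programming alignment. The following is a minimal sketch under stated assumptions, not the authors' implementation: the vowel inventory `VOWELS` and the function names `substitution_error` and `alignment_error` are hypothetical and chosen only for illustration.

```python
# Assumption: an illustrative vowel inventory; the slides do not list one.
VOWELS = set("aeiouyäöå")


def substitution_error(a: str, b: str) -> int:
    """Error value for aligning character a with character b."""
    if a == b:
        return 0  # same character at the same position
    same_class = (a in VOWELS) == (b in VOWELS)
    return 1 if same_class else 2  # 1 within a class, 2 across classes


def alignment_error(source: str, target: str) -> int:
    """Smallest sum of error values over all alignments of source and target."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i  # i deletions, error value 1 each
    for j in range(1, cols):
        d[0][j] = j  # j insertions, error value 1 each
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1]
                + substitution_error(source[i - 1], target[j - 1]),
            )
    return d[rows - 1][cols - 1]


print(alignment_error("embriologia", "embryology"))
```

For the slide's pair, the cheapest alignment substitutes i with y twice (vowel-vowel, 1 each) and deletes the final a (1).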
Transformation Rule based Translation • Edit Distance • Automatic Generation of Rules • Extracting similar terms from a dictionary with edit distance threshold. • Selection of proper terms with the smallest sum of error values. • Generation of transformation rules • Context Information, Frequency, and Confidence Factor • Sample Rules
Edit Distance • $ED(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$ • Recurrence: $d[i, j] = \min\{d[i-1, j] + 1,\ d[i, j-1] + 1,\ d[i-1, j-1] + cost\}$, where $cost = 0$ if $A[i] = B[j]$, and $cost = 1$ if $A[i] \neq B[j]$.
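The recurrence above can be computed row by row, and the result used to keep only dictionary pairs within an edit distance threshold. This is a sketch only: the names `edit_distance` and `similar_pairs`, the threshold value, and the sample pairs are assumptions for illustration; the slides do not state the threshold that was used.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the d[i, j] table."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # d[i, 0] = i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similar_pairs(pairs, max_distance):
    """Keep (source, target) pairs whose edit distance is within the threshold."""
    return [(s, t) for s, t in pairs if edit_distance(s, t) <= max_distance]


# Illustrative pairs; max_distance=3 is a hypothetical threshold.
pairs = [("embriologia", "embryology"), ("perro", "dog")]
print(similar_pairs(pairs, max_distance=3))
```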
Translation Resources • Multilingual medical dictionary by Andre Fairchild. • A Finnish list of medical terms (n=5970) • A Swedish list of medical terms (n=657) • Language pairs • Finnish-English • French-English • German-English • Spanish-English • Swedish-English
Target Word List and Source Words • Target word list • The index of CLEF’s LA Times collection, which contains 189,000 words. • Source words • First source word list, 217 word tuples • 72 training word tuples, 145 test word tuples. • Second source word list • 126 test word tuples. • Experiments dataset • 5 (languages) * (145 + 126) words = 1355 words
N-gram Matching • Similarity measure SIM(w1, w2) between the source and target words w1 and w2, where Ni refers to the set of n-grams derived from word wi. • Digrams vs. trigrams • Trigrams generally performed worse than digrams, but sometimes gave better results.
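A small sketch of n-gram matching follows. The slide's formula did not survive, so the code assumes a common set-overlap form, shared n-grams divided by all distinct n-grams of the two words; that form, the absence of padding characters, and the names `ngrams`, `sim`, and `rank` are assumptions rather than details confirmed by the slides.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of contiguous character n-grams of a word (no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def sim(w1: str, w2: str, n: int = 2) -> float:
    """Assumed overlap measure: |N1 & N2| / |N1 | N2|."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def rank(source: str, targets, n: int = 2):
    """Target words ordered by decreasing similarity to the source word."""
    return sorted(targets, key=lambda t: sim(source, t, n), reverse=True)


# Illustrative target list; n=2 gives digrams, n=3 trigrams.
print(rank("convektion", ["collection", "convection", "connection"]))
```

Switching `n` between 2 and 3 is how digram and trigram matching would be compared in this sketch.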
Translation Strategies - High confidence factor (HCF) strategy • A relatively high confidence factor threshold, 50%, is used to minimize the number of incorrect transformations. • Reading order • The location of the rules in source words: end, beginning, and middle. • The source string length: the longest first. • Confidence factor: the highest first. • Example: konvektio, with o – on (end), ko – co (beginning), ekt – ect (middle) => convection
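The reading order above can be sketched as a sort key followed by sequential rule application. This is an illustrative simplification, not the authors' code: the `Rule` record, the confidence factor values, and the choice to produce a single intermediate form by applying every matching rule once are all assumptions.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str    # substring in the source word (assumed non-empty)
    target: str    # replacement substring
    position: str  # "end", "beginning", or "middle"
    cf: float      # confidence factor in percent (values below are made up)


POSITION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def apply_rule(word: str, rule: Rule) -> str:
    """Apply one rule at its position; return the word unchanged if it does not match."""
    s, t = rule.source, rule.target
    if rule.position == "end":
        return word[:-len(s)] + t if word.endswith(s) else word
    if rule.position == "beginning":
        return t + word[len(s):] if word.startswith(s) else word
    # Middle: first occurrence that touches neither the start nor the end.
    idx = word.find(s, 1)
    if idx != -1 and idx + len(s) < len(word):
        return word[:idx] + t + word[idx + len(s):]
    return word


def to_intermediate(word: str, rules, min_cf: float = 50.0) -> str:
    """Apply rules in reading order: position, longest source first, highest CF first."""
    usable = [r for r in rules if r.cf >= min_cf]
    usable.sort(key=lambda r: (POSITION_ORDER[r.position], -len(r.source), -r.cf))
    for rule in usable:
        word = apply_rule(word, rule)
    return word


RULES = [
    Rule("ekt", "ect", "middle", 80.0),
    Rule("ko", "co", "beginning", 70.0),
    Rule("o", "on", "end", 60.0),
    Rule("v", "w", "middle", 20.0),  # low-CF rule, dropped at the 50% threshold
]
print(to_intermediate("konvektio", RULES))
```

With the 50% threshold the low-confidence rule is ignored; lowering `min_cf` lets it through and yields a different, possibly incorrect, intermediate form.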
Translation Strategies - Low confidence factor (LCF) strategy • A confidence factor threshold of 10% was used to filter out unreliable rules. • Even more intermediate forms were obtained, but some of them may be incorrect transformations. • In both HCF and LCF, the rules whose frequency was < 50 were removed.
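The two strategies differ only in the confidence factor threshold, which a small filter makes concrete. The rule records and their numbers below are hypothetical, and treating both thresholds as inclusive is an assumption.

```python
def select_rules(rules, cf_threshold, min_frequency=50):
    """Keep rules that are frequent enough and meet the confidence factor threshold."""
    return [
        r for r in rules
        if r["frequency"] >= min_frequency and r["cf"] >= cf_threshold
    ]


# Hypothetical rule records; CF is in percent and all numbers are made up.
RULE_TABLE = [
    {"rule": "ko -> co (beginning)", "cf": 72.0, "frequency": 120},
    {"rule": "o -> on (end)", "cf": 35.0, "frequency": 80},
    {"rule": "ekt -> ect (middle)", "cf": 90.0, "frequency": 12},
]

hcf_rules = select_rules(RULE_TABLE, cf_threshold=50.0)  # HCF strategy
lcf_rules = select_rules(RULE_TABLE, cf_threshold=10.0)  # LCF strategy
print(len(hcf_rules), len(lcf_rules))
```

Every rule that passes the HCF filter also passes the LCF filter, so LCF can only add rules, which is why it yields more intermediate forms.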
Evaluation • For each word, precision was calculated from the position of the correct equivalent (pce) in the ranked result list of n-gram matching. • Several words may share the same SIM value. • Worst position: the last word of the set • Average position precision: the middle of the set of words
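One way to read the tie handling above is sketched below: when the correct equivalent shares its SIM value with other words, its position is taken either as the last slot of the tied set or as the middle of that set. The function name `tie_positions` and the sample scores are hypothetical, and the precision formula itself is not reproduced because the slides do not give it.

```python
def tie_positions(ranked, correct):
    """Return (worst, average) 1-based positions of the correct word.

    ranked: list of (word, sim) pairs from n-gram matching.
    """
    sims = dict(ranked)
    target_sim = sims[correct]
    higher = sum(1 for _, s in ranked if s > target_sim)  # words ranked strictly above
    tied = sum(1 for _, s in ranked if s == target_sim)   # includes the correct word
    worst = higher + tied                # last word of the tied set
    average = higher + (tied + 1) / 2    # middle of the tied set
    return worst, average


# Made-up scores for illustration.
ranked = [("a", 0.9), ("b", 0.5), ("c", 0.5), ("d", 0.5), ("e", 0.1)]
print(tie_positions(ranked, "c"))
```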
Findings • Five test word types • Medical, biological, and chemical terms (Bio terms), n=90 • Place names, n=55 • Economics, n=31 • Technology, n=36 • Miscellaneous, n=59 • Five language pairs • Finnish-English • French-English • German-English • Spanish-English • Swedish-English
Discussion & Conclusion • Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries. • In this study, two-step fuzzy translation • Automatically generated transformation rules, TRT • Fuzzy matching • Two translation strategies were tested, HCF & LCF • Digram and trigram matching were tested in combination with TRT
Discussion & Conclusion • Effectiveness of fuzzy translation depends on • The frequency of identical terms shared by a source and a target language. • The extent of variation in the spelling variants between a source and a target language. • Fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.
Personal opinion • How can we apply these ideas in our lab? • TRT?