Fuzzy Translation of Cross-Lingual Spelling Variants

Fuzzy Translation of Cross-Lingual Spelling Variants Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department of Information Management SIGIR’03

Outline • Motivation • Objective • Introduction • Method & Data • Findings • Discussion & Conclusions

Motivation • The limitation on CLIR performance. • Some terms not in translation dictionaries. • Fuzzy matching ~ n-gram method.

Objective • Two-step fuzzy translation technique for cross-lingual spelling variants to improve the CLIR performance • Transformation rule based translation, TRT. • Translate the intermediate forms into a target language using fuzzy matching.

Introduction • Technical terms and proper names are important text elements, but not generally found in electronic translation dictionaries utilized by MT and CLIR. • Non-identical translatable spelling variant forms, e.g., Chernobyl – Tshernobyl. • Similarity measure • N-gram • Fuzzy matching • Transliteration

Introduction • This paper presents transformation rule based translation (TRT). • Close to transliteration, but uses no phonetic elements. • Suitable for cross-lingual spelling variants. • Example: Spanish embriologia => English embryology • Problem: how can such transformation rules be found automatically? • Equivalent term pairs are extracted from a translation dictionary and aligned pairwise. • Alignment uses edit distance.

Introduction • Two-step fuzzy translation • Source words are translated into intermediate forms based on TRT, in order to render a source word more similar to its target equivalent. • The intermediate forms are translated into target language equivalents through approximate string matching, i.e. fuzzy matching, n-gram based matching.

Method & Data - Overview • Diagram components: translation dictionary, TRT, translation strategies (high confidence factor, HCF; low confidence factor, LCF), intermediate form, n-gram matching. • Term pairs shown in the diagram: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), … • Example: konvektio => convection, using the rules o – on (end), ko – co (beginning), ekt – ect (middle).

Method & Data - TRT threshold • Term pairs from the translation dictionary are compared by edit distance, e.g. (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), … • Selection of proper terms, e.g. (embriologia, embryology), (embriolagia, embryology), (embrialagia, embryology), … • Error values: 0 for the same character at the same position; 1 for a consonant-consonant or vowel-vowel substitution; 1 for the insertion or deletion of a character; 2 for a consonant-vowel or vowel-consonant substitution. • Minimum ED: the one transformation with the smallest sum of error values is selected. • Rule: on -> oughn at middle position
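The error values on this slide can be read as a weighted edit distance. The sketch below is one possible reading, not the authors' implementation: the function names and the vowel set are assumptions, and input is assumed to be lowercase.

```python
# Assumption: which letters count as vowels (the slides do not say).
VOWELS = set("aeiouyäöå")


def substitution_cost(x: str, y: str) -> int:
    """Error value for aligning character x with character y."""
    if x == y:
        return 0  # same character at the same position
    same_class = (x in VOWELS) == (y in VOWELS)
    # 1 for consonant-consonant or vowel-vowel, 2 for a mixed substitution
    return 1 if same_class else 2


def alignment_error(source: str, target: str) -> int:
    """Smallest sum of error values over all alignments of the two words."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # i deletions, error value 1 each
    for j in range(1, n + 1):
        d[0][j] = j  # j insertions, error value 1 each
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + substitution_cost(source[i - 1], target[j - 1]),
            )
    return d[m][n]


print(alignment_error("embriologia", "embryology"))  # 3
```

For embriologia and embryology the cheapest alignment uses two vowel-vowel substitutions (i to y) and one deletion (the final a).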

Transformation Rule based Translation • Edit Distance • Automatic Generation of Rules • Extracting similar terms from a dictionary with edit distance threshold. • Selection of proper terms with the smallest sum of error values. • Generation of transformation rules • Context Information, Frequency, and Confidence Factor • Sample Rules

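Edit Distance • $ED(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$, the minimum total number of substitutions, insertions, and deletions needed to turn $A$ into $B$. • Recurrence: $d[i, j] = \min\{d[i-1, j] + 1,\; d[i, j-1] + 1,\; d[i-1, j-1] + cost\}$, where $cost = 0$ if $A[i] = B[j]$, and $cost = 1$ if $A[i] \neq B[j]$.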
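The recurrence can be filled in as a dynamic-programming table. This is a minimal sketch of the standard unit-cost edit distance; the function name is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between strings a and b."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i characters of a
    # and the first j characters of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[m][n]


print(edit_distance("Chernobyl", "Tshernobyl"))  # 2
```

The spelling variants Chernobyl and Tshernobyl from the Introduction differ by one substitution and one insertion.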

A sample of Spanish-to-English rules

Translation Resources • Multilingual medical dictionary by Andre Fairchild. • A Finnish list of medical terms (n=5970) • A Swedish list of medical terms (n=657) • Language pairs • Finnish-English • French-English • German-English • Spanish-English • Swedish-English

Target Word List and Source Words • Target word list • The index of CLEF’s LA Times collection, which contains 189,000 words. • Source words • First source word list: 217 word tuples (72 training word tuples, 145 test word tuples). • Second source word list: 126 test word tuples. • Experimental dataset • 5 languages × (145 + 126) words = 1,355 words

N-gram Matching • Similarity measure SIM(w1, w2) between the source word w1 and the target word w2, computed from N1 and N2, where Ni is the set of n-grams derived from the word wi. • Digrams vs. trigrams • Trigrams generally performed worse than digrams, but sometimes gave better results.
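The similarity formula itself is not reproduced in this transcript. The sketch below assumes a common choice, the shared n-grams divided by all distinct n-grams of the two words (set overlap), with no padding characters; the function names are hypothetical and the measure used in the paper may differ.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams of a word (digrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def sim(w1: str, w2: str, n: int = 2) -> float:
    """Assumed measure: |N1 & N2| / |N1 | N2|."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def rank_targets(source: str, targets: list, n: int = 2) -> list:
    """Target words paired with their SIM value, best match first."""
    scored = [(t, sim(source, t, n)) for t in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Match an intermediate form against a tiny, made-up target word list.
for word, score in rank_targets("convection", ["connection", "convection", "convention"]):
    print(word, round(score, 2))
```

Passing n=3 switches the same functions to trigram matching.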

Translation Strategies - High confidence factor (HCF) strategy • A relatively high confidence factor threshold, 50%, to minimize the number of incorrect transformations. • Reading order • The location of the rules in source words: end, beginning, and middle. • The source string length: the longest first. • Confidence factor: the highest first. • Example: konvektio => convection, via o – on (end), ko – co (beginning), ekt – ect (middle).
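The reading order can be sketched as a sort key followed by sequential rule application. This is an illustration under assumptions: the `Rule` type, the `apply_rules` function, and the confidence values are hypothetical, the rules are assumed to have passed the threshold already, and each matching rule is applied once, which may be looser than the paper's procedure.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str        # substring in the source-language word
    target: str        # replacement substring
    position: str      # "end", "beginning", or "middle"
    confidence: float  # confidence factor in percent


# Reading order from the slide: end, then beginning, then middle.
POSITION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def apply_rules(word: str, rules: list) -> str:
    """Turn a source word into an intermediate form."""
    ordered = sorted(
        rules,
        # position first, then longest source string, then highest confidence
        key=lambda r: (POSITION_ORDER[r.position], -len(r.source), -r.confidence),
    )
    for r in ordered:
        if r.position == "end" and word.endswith(r.source):
            word = word[: len(word) - len(r.source)] + r.target
        elif r.position == "beginning" and word.startswith(r.source):
            word = r.target + word[len(r.source):]
        elif r.position == "middle":
            idx = word.find(r.source, 1)  # must not start at the first character
            if idx != -1 and idx + len(r.source) < len(word):
                word = word[:idx] + r.target + word[idx + len(r.source):]
    return word


# Rules from the slide; the confidence values are made up for illustration.
rules = [
    Rule("ekt", "ect", "middle", 80.0),
    Rule("ko", "co", "beginning", 70.0),
    Rule("o", "on", "end", 60.0),
]
print(apply_rules("konvektio", rules))  # convection
```

In the two-step technique the resulting intermediate form is not the final answer; it is then matched against the target word list by n-gram matching.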

Translation Strategies - Low confidence factor (LCF) strategy • A confidence factor threshold of 10% was used to filter out unreliable rules. • More intermediate forms were obtained than with HCF, but some of them may result from incorrect transformations. • In both HCF and LCF, rules whose frequency was < 50 were removed.
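The two strategies differ only in the confidence threshold applied to the rule set. A minimal sketch, assuming a hypothetical `Rule` record and treating the thresholds as inclusive (the slides do not say whether a rule at exactly 50% or 10% is kept); the sample rules and their numbers are made up.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str
    target: str
    position: str
    frequency: int     # how often the rule was observed
    confidence: float  # confidence factor in percent


HCF_THRESHOLD = 50.0  # high confidence factor strategy
LCF_THRESHOLD = 10.0  # low confidence factor strategy
MIN_FREQUENCY = 50    # rules with frequency < 50 are removed in both


def select_rules(rules: list, min_confidence: float) -> list:
    """Keep rules that are frequent enough and meet the threshold."""
    return [
        r for r in rules
        if r.frequency >= MIN_FREQUENCY and r.confidence >= min_confidence
    ]


# Hypothetical rule statistics, for illustration only.
sample = [
    Rule("ko", "co", "beginning", 120, 72.0),
    Rule("o", "on", "end", 300, 18.0),
    Rule("ekt", "ect", "middle", 40, 90.0),  # too rare, dropped by both
]
print([r.source for r in select_rules(sample, HCF_THRESHOLD)])  # ['ko']
print([r.source for r in select_rules(sample, LCF_THRESHOLD)])  # ['ko', 'o']
```

Every rule kept by HCF is also kept by LCF, which is why LCF yields more intermediate forms.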

Evaluation • For each word, precision was calculated from the position of the correct equivalent (pce) in the ranked result list of n-gram matching. • Several words may share the same SIM value as the correct equivalent. • Worst position precision: the correct equivalent is taken to be the last word in the set of tied words. • Average position precision: the correct equivalent is taken to be in the middle of the set of tied words.
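Tie handling can be sketched as follows. The slides do not give the precision formula, so the code assumes precision = 1 / pce; the function names and the sample scores are hypothetical.

```python
def tie_positions(scored: list, correct: str) -> tuple:
    """Return (worst, average) 1-based positions of the correct word.

    scored is a list of (word, sim) pairs. Raises KeyError if the
    correct word is not in the list.
    """
    target_sim = dict(scored)[correct]
    higher = sum(1 for _, s in scored if s > target_sim)
    tied = sum(1 for _, s in scored if s == target_sim)  # includes the correct word
    worst = higher + tied                # last word of the tied set
    average = higher + (tied + 1) / 2    # middle of the tied set
    return worst, average


def precision(pce: float) -> float:
    """Assumed formula: reciprocal of the position of the correct equivalent."""
    return 1.0 / pce


# Made-up result list: three words tie with the correct equivalent "c".
scored = [("a", 0.9), ("b", 0.5), ("c", 0.5), ("d", 0.5), ("e", 0.1)]
worst, average = tie_positions(scored, "c")
print(worst, average)  # 4 3.0
print(precision(worst), precision(average))
```

Here one word scores higher and three words tie, so the tied set occupies positions 2 to 4: the worst position is 4 and the average position is 3.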

Findings • Five test word types • Medical, biological, and chemical terms (Bio terms), n=90 • Place names, n=55 • Economics, n=31 • Technology, n=36 • Miscellaneous, n=59 • Five language pairs • Finnish-English • French-English • German-English • Spanish-English • Swedish-English

Findings – 1/3

Discussion & Conclusion • Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries. • This study presented a two-step fuzzy translation technique • Automatically generated transformation rules, TRT • Fuzzy matching • Two translation strategies were tested, HCF & LCF • Digram and trigram matching were tested in combination with TRT

Discussion & Conclusion • Effectiveness of fuzzy translation depends on • The frequency of identical terms shared by a source and a target language. • The extent of variation in the spelling variants between a source and a target language. • Fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.

Personal opinion • How can we apply these ideas in our lab? • TRT?
