Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language

Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1, 2)
LREC 2008, 29 May 2008
(1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo
(2) School of Computer Science, University of Manchester / National Centre for Text Mining

Introduction
• Building bilingual lexicons via pivot languages
[Figure: the Chinese term 计步器 (jìbùqì) is linked by the C-E lexicon to the English terms “odometer” and “pedometer”, which the E-J lexicon links to the Japanese terms オドメーター (odomētā), ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), and 万歩計 (mampokei)]

Introduction
• Building bilingual lexicons via pivot languages: 计步器 (jìbùqì)
  • odometer: オドメーター (odomētā)
  • pedometer: ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), 万歩計 (mampokei)
(Image: Creative Commons Attribution-ShareAlike 2.0 License, by skippy13)

Advantages of the pivotal approach
• Construct a Japanese-Chinese lexicon from Japanese-English and English-Chinese lexicons through English terms
• J-E and E-C lexicons cover many more terms and domains than J-C lexicons
• J-C lexicons are especially scarce for technical terms, because technical terms are first written in English in most cases
• The pivotal approach could help us find J-C translation term pairs (semi-)automatically

Mismatch problem
• Exact matching cannot find a Chinese-Japanese term pair whose terms do not share an identical English translation.
• Is it possible to generate such a lexical item?

Merging Two Bilingual Lexicons
• “Exact merging” cannot merge pairs that do not share an identical English translation (the mismatch problem)
• Challenge: merging more terms
  • “Word-based merging”
  • “Alignment-based merging”
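A minimal sketch of exact merging and of the mismatch problem, in Python. The function name `exact_merge` and the toy lexicon entries are illustrative assumptions, not taken from the lexicons used in the talk.

```python
from collections import defaultdict


def exact_merge(src_pivot, pivot_tgt):
    """Pair source and target terms that share an identical pivot term."""
    pivot_to_tgt = defaultdict(set)
    for pivot, tgt in pivot_tgt:
        pivot_to_tgt[pivot].add(tgt)

    merged = set()
    for src, pivot in src_pivot:
        # .get() avoids creating empty entries for unseen pivot terms
        for tgt in pivot_to_tgt.get(pivot, ()):
            merged.add((src, tgt))
    return merged


# Toy entries (hypothetical): Chinese-English and English-Japanese term pairs
c_e = [("计步器", "pedometer"), ("计步器", "odometer"), ("全球变暖", "global heating")]
e_j = [("pedometer", "歩数計"), ("pedometer", "万歩計"),
       ("odometer", "オドメーター"), ("global warming", "地球温暖化")]

merged = exact_merge(c_e, e_j)
```

In this toy data 计步器 is merged through both "pedometer" and "odometer", while 全球变暖 gets no Japanese translation because "global heating" and "global warming" are not identical strings.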

Word-based merging
• Tokenize a term into word tokens, and
• Translate each word with the bilingual lexicon
• Example: 全球变暖 (quánqiú-biànnuǎn) → global heating → 地球温暖化 (chikyū-ondanka)
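The two steps can be sketched as follows. This is only one possible reading of the slide: it assumes the pivot (English) term is split on whitespace and that a word-level English-Japanese lexicon is available. The function name and the lexicon entries are hypothetical.

```python
from itertools import product


def word_based_translate(pivot_term, word_lexicon):
    """Translate a pivot term word by word and concatenate the results.

    Returns every combination of word translations, or an empty list
    when some word has no entry in the word-level lexicon.
    """
    options = []
    for word in pivot_term.split():
        candidates = word_lexicon.get(word)
        if not candidates:
            return []
        options.append(candidates)
    return ["".join(combo) for combo in product(*options)]


# Hypothetical word-level English-Japanese entries
e_j_words = {"global": ["地球"], "heating": ["温暖化", "加熱"]}

candidates = word_based_translate("global heating", e_j_words)
```

Because every combination is returned without a score, a ranking step is still needed to choose among the candidates.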

Alignment-based merging: Overview
• Align each word,
• Calculate word translation probabilities, and
• Translate each word using the probabilities
[Figure: word-alignment example linking 全球变暖, the English words “global”, “heating”, and “warming”, and 地球温暖化]

Alignment-based merging: Overview
[Figure: processing pipeline]
• C-E lexicon → C-E translation word pairs (with probabilities)
• J-E lexicon → J-E translation word pairs (with probabilities)
• Merging word pairs and re-calculating probabilities → J-C translation word pairs (with probabilities)
• Word-by-word translation (adding term frequencies on the Web) → Japanese translations of the C-E lexicon

Alignment-based merging
• Apply word alignment with GIZA++ (Och & Ney, 2003) to all term pairs
• Calculate word translation probabilities from co-occurrence frequencies, for both bilingual lexicons: source(f)-pivot(p) and pivot(p)-target(e)
• C(wp, wf; ap-f): co-occurrence frequency of wp and wf, which are aligned by GIZA++
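A minimal sketch of estimating word translation probabilities from aligned word pairs by relative frequency. The slide's formula is not preserved in this transcript, so the normalization direction used here, Pr(wf | wp) = C(wp, wf) / Σ C(wp, ·), is an assumption. The aligned links are toy data standing in for GIZA++ output.

```python
from collections import Counter, defaultdict


def word_translation_probs(aligned_links):
    """Relative-frequency estimate of Pr(w_f | w_p) from aligned word links.

    aligned_links: iterable of (w_p, w_f) pairs, one per alignment link.
    """
    cooc = Counter(aligned_links)      # C(w_p, w_f)
    totals = Counter()                 # sum over w_f of C(w_p, w_f)
    for (w_p, _), count in cooc.items():
        totals[w_p] += count

    probs = defaultdict(dict)
    for (w_p, w_f), count in cooc.items():
        probs[w_p][w_f] = count / totals[w_p]
    return probs


# Toy alignment links (hypothetical), pivot word first
links = [("global", "地球"), ("global", "地球"),
         ("global", "グローバル"), ("warming", "温暖化")]

probs = word_translation_probs(links)
```

Here "global" is aligned to 地球 in two of its three links, so its estimated probability is 2/3.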

Alignment-based merging • Calculate word translation probabilities from a target-language word to a source-language word (Utiyama & Isahara, 2007):
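A minimal sketch of merging the two probability tables through the pivot language. It assumes the usual pivot marginalization, Pr(wf | we) = Σ over wp of Pr(wf | wp) · Pr(wp | we); the exact formula on the slide is not preserved in this transcript. All names and probabilities below are illustrative.

```python
from collections import defaultdict


def merge_via_pivot(p_f_given_p, p_p_given_e):
    """Combine Pr(w_f | w_p) and Pr(w_p | w_e) into Pr(w_f | w_e).

    Both arguments are nested dicts: outer key is the conditioning word,
    inner key is the generated word, value is the probability.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for w_e, pivots in p_p_given_e.items():
        for w_p, prob_p in pivots.items():
            for w_f, prob_f in p_f_given_p.get(w_p, {}).items():
                merged[w_e][w_f] += prob_f * prob_p
    return {w_e: dict(inner) for w_e, inner in merged.items()}


# Hypothetical probabilities: Japanese target word -> English pivot -> Chinese source word
p_f_given_p = {"global": {"全球": 0.5, "全局": 0.5}, "earth": {"地球": 1.0}}
p_p_given_e = {"地球": {"global": 0.25, "earth": 0.75}}

p_f_given_e = merge_via_pivot(p_f_given_p, p_p_given_e)
```

The Japanese word 地球 reaches 全球 only through "global", so Pr(全球 | 地球) = 0.5 × 0.25 = 0.125 in this toy table.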

Alignment-based merging
• Calculate the translation probabilities (scores) based on the noisy channel model (Brown et al., 1990); the formula indexes the i-th word of we
• The language model p(we) is calculated from the number of Web search results (Google) for the term we
• p(we) ∝ (hit count of we)
• Generate the merged lexicon from the pairs whose translation probabilities are greater than zero:
• New_Lexicon = {(wf, we) | Pr(we|wf) > 0 and Pr(wf|we) > 0}
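A minimal sketch of noisy-channel scoring for candidate translations. It assumes a monotone word-by-word correspondence, so that the score is the hit count of the target term times the product of Pr(wf_i | we_i); the slide's formula is not preserved in this transcript, so this form is an assumption. The hit counts and probabilities are made-up illustrative values, not real search results.

```python
import math


def noisy_channel_score(src_words, tgt_words, p_f_given_e, hit_count):
    """Score a candidate: hit count of the target term times the channel probability.

    Assumes the i-th source word corresponds to the i-th target word.
    """
    if len(src_words) != len(tgt_words):
        return 0.0
    channel = math.prod(
        p_f_given_e.get(w_e, {}).get(w_f, 0.0)
        for w_f, w_e in zip(src_words, tgt_words)
    )
    return hit_count * channel


# Hypothetical word probabilities Pr(w_f | w_e) and Web hit counts
p_f_given_e = {"地球": {"全球": 0.5}, "温暖化": {"变暖": 0.8}, "加熱": {"变暖": 0.2}}
hits = {"地球温暖化": 1000, "地球加熱": 10}

source = ["全球", "变暖"]
candidates = {"地球温暖化": ["地球", "温暖化"], "地球加熱": ["地球", "加熱"]}
scores = {
    term: noisy_channel_score(source, words, p_f_given_e, hits[term])
    for term, words in candidates.items()
}
best = max(scores, key=scores.get)
new_lexicon = {("全球变暖", term) for term, s in scores.items() if s > 0}
```

Scores here are unnormalized because p(we) is only proportional to the hit count, so they are useful for ranking candidates rather than as probabilities.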

Experimental settings
• Lexicons used: bilingual lexicons that consist of technical terms
  • C-E: Wanfang Data E-C & C-E Science and Technology Dictionary
  • J-E: JST Machine Translation Dictionary
• Utilization ratio: with “exact merging,” we can translate about 22% of the Japanese (or Chinese) terms

Experimental results
• Utilization ratio
  • Alignment-based merging drastically improved the utilization ratio, and the size of the merged lexicon also increased
• Accuracy (by manual evaluation)
  • MRR: Mean Reciprocal Rank (Voorhees, 1999), the mean of the reciprocal ranks over all source terms
  • Prec1: precision of the highest-ranked terms
  • Prec10: proportion of source terms whose 10-best outputs include the correct one
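The three accuracy measures can be sketched as follows. The function name, the data layout, and the toy terms are assumptions; a source term with no correct output contributes zero to every measure.

```python
def evaluate(ranked_outputs, references, k=10):
    """Compute MRR, Prec1, and Prec-k over all source terms.

    ranked_outputs: dict mapping a source term to its ranked candidate list.
    references: dict mapping a source term to the set of correct translations.
    """
    rr_sum = 0.0
    top1 = 0
    topk = 0
    for src, cands in ranked_outputs.items():
        gold = references.get(src, set())
        # Rank (1-based) of the highest-ranked correct candidate, if any
        rank = next((i for i, c in enumerate(cands, 1) if c in gold), None)
        if rank is not None:
            rr_sum += 1.0 / rank
            top1 += rank == 1
            topk += rank <= k
    n = len(ranked_outputs)
    return {"mrr": rr_sum / n, "prec1": top1 / n, "prec_k": topk / n}


# Toy outputs: correct answers at ranks 1, 2, (none), and 4
outputs = {
    "s1": ["a", "x"],
    "s2": ["x", "b"],
    "s3": ["x", "y"],
    "s4": ["x", "y", "z", "d"],
}
gold = {"s1": {"a"}, "s2": {"b"}, "s3": {"c"}, "s4": {"d"}}

metrics = evaluate(outputs, gold)
```

With these toy ranks the reciprocal ranks are 1, 1/2, 0, and 1/4, so the MRR is 0.4375.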

Experimental results: Examples (1/2)
• A Chinese-to-Japanese example: “角膜实质炎” (jiǎomó-shízhì-yán, keratitis parenchymatosa)

Experimental results: Examples (2/2)
• A Japanese-to-Chinese example: “発育状態” (hatsuiku-jōtai, growth status)

Conclusion
• We proposed alignment-based merging of two bilingual lexicons via a pivot language
• Alignment-based merging achieved a utilization ratio of at least 75% in our experiments
• The precision remains low, at 0.14 (Japanese-to-Chinese) and 0.20 (Chinese-to-Japanese); a more sophisticated scoring method could improve it
• Future directions
  • Choose the correct translation by examining the context or semantic classes of the source and target terms
  • Evaluate a machine translation system with this lexicon integrated

Thank you for your attention
• Acknowledgments
  • MEXT, Japan
  • Japan Science and Technology Agency (JST), Japan
  • NICT, Japan
  • Wanfang Data, China

Experimental Results
• Our system generated at least one Japanese translation for 73.4% (385,509/525,259) of the items in the C-E lexicon
• Table labels: Chinese input term; Japanese reference translation; correct Japanese translations are highlighted
  • (infectious hepatitis virus, 感染性肝炎ウイルス)
  • (coliphage, 大腸菌ファージ)

Experimental Results
• Same characters, but the meanings are not identical
  • (acoustic delay line storage, 音響遅延線記憶装置)
  • (complement form, 補数形式)

Manual evaluation
• A human evaluator checked the translation results for 200 Chinese terms that the C-E lexicon classifies in the “Computer” category
  • Terms that could be translated into Japanese: 181 (90.5%)
  • Terms whose top-10 translations included the correct one: 135 (67.5%)
  • Terms whose top translation was correct: 73 (36.5%)
• MRR (mean reciprocal rank) = 0.466
  • The average of the inverse ranks of the highest-ranked correct translations

Manual evaluation
1. 数组元素 – array element – 配列元素
  • The Japanese translation is not used in real texts.
  • Possible solutions: strengthen the language model; adjust the weights of the features
2. 计算机化管理学会 – ICM – 特発性心筋障害
  • The Chinese means “Institution for Computerization Management”, and the Japanese means “Idiopathic Cardiomyopathy”
  • Possible solution: special treatment for acronyms
3. 信息量 – information content – 量
  • The Japanese translation dropped “information”
  • Possible solution: add parallel corpora for training
4. 转镜式激光束影像记录仪 – laser beam rotating mirror image recorder – (no Japanese translations)
  • All the English words seem to be common, but the system failed to generate Japanese translations (maybe because the score was below the threshold for searching hypotheses)

Conclusion
• We proposed a method that uses phrase-based SMT to construct a J-C lexicon from J-E and C-E lexicons.
• We obtained Japanese translations for 73.4% of the items in the C-E lexicon, outperforming “exact matching” (22.2%).
• 36.5% of the top Japanese translations were correct, and 67.5% of the top-10 Japanese translations included the correct one.
• This method could support the manual construction of bilingual dictionaries, and the lexicon could be used for MT.
• Future work
  • Parameter optimization of SMT using existing J-C lexicons
  • Chinese character similarity that considers the similarity between individual characters
  • More sophisticated reordering model (considering parts of speech)
  • Other translation directions (EJ, JC, EC)

Acquisition of Translation Pairs of Technical Terms
• Large-scale translation dictionaries (lexicons) of technical terms are required for translating technical documents
• Constructing such dictionaries requires experts who can deal with both languages
  • This is very costly
  • The dictionaries must keep up with the rapid increase of new terms
• Automatic acquisition of translation candidates for technical terms
  • Support for constructing the dictionary
  • Improvement of the performance of machine translation systems

J-E bilingual lexicon
• 527,206 translation pairs
• Numbers of distinct terms: 465,565 J terms, 509,259 E terms

C-E bilingual lexicon
• Wanfang Data E-C & C-E Science and Technology Dictionary
• 525,259 pairs

Construction of the C-J bilingual lexicon
• Attach Japanese translations to each lexical item of the C-E lexicon

Overview of constructing J-C lexicon
• We treat the C-E and J-E lexicons as parallel corpora and use them as training data for a J-C SMT system
• Applying an SMT approach to the C-E and J-E lexicons enables word/phrase-level merging in English
• We apply C-J phrase-based SMT to the Chinese terms in the C-E lexicon
• Statistical approaches seem to be effective because of the similarities in semantics and word order between Chinese and Japanese
• It is easy to introduce other clues, such as Chinese character similarity

Collecting J-E & C-E translation phrase pairs
• Apply morphological analyzers, and obtain word alignments with GIZA++ (Och and Ney, 2003) for the J-E and C-E lexicons
• Collect phrase pairs with the “grow-diag-final” method (using Moses; Koehn et al., 2007) and calculate the probabilities from relative frequencies
• Example from the J-E lexicon: ころがり 疲れ 寿命 – rolling fatigue life

Merging phrase pairs (Utiyama & Isahara, 2007)
• J-E & E-C phrases to J-C phrases
• Ze is a normalization factor

Features for learning of the log-linear model
• We employ the following features h1-h4 for the log-linear model:
  • h1: phrase translation probability, computed over the i-th phrase pairs of the translation
  • h2: 3-gram language model of the target language, where p(we) is a language model probability from other monolingual corpora
  • h3: phrase reordering penalty (Koehn et al., 2003)
  • h4: Chinese character similarity (Zhang et al., 2005)

Feature 3: Phrase reordering penalty (Koehn et al., 2003)
• The feature value is the sum of the penalties d defined by the following formula for the phrase pairs we, wf:
  d = – |a_i – b_{i-1} – 1|
• where a_i is the position of the first word of wf, and b_{i-1} is the position of the last word of the wf translated in the previous step
• Example with source words f1 … f8 and target words e1 … e6:
  d(e1 e2, f1 f2 f3) = 0
  d(e3, f8) = – |8 – 3 – 1| = – 4
  d(e4, f6 f7) = – |6 – 8 – 1| = – 3
  d(e5 e6, f4 f5) = – |4 – 7 – 1| = – 4
  h3(e1…e6, f1…f8) = – 11
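The worked example on this slide can be reproduced with a short sketch. The function name and the span representation are assumptions; the penalty for each phrase is minus the absolute value of (first source position of the phrase, minus last source position of the previous phrase, minus 1), with the previous position initialised to 0.

```python
def reordering_penalty(source_spans):
    """Sum the distortion penalties over phrases taken in target order.

    source_spans: list of (first, last) 1-based source positions, one per
    phrase pair, ordered by the target side.
    """
    total = 0
    prev_last = 0  # no source word has been translated before the first phrase
    for first, last in source_spans:
        total += -abs(first - prev_last - 1)
        prev_last = last
    return total


# Source spans from the slide: (f1 f2 f3), (f8), (f6 f7), (f4 f5)
spans = [(1, 3), (8, 8), (6, 7), (4, 5)]
h3 = reordering_penalty(spans)
```

A fully monotone translation, where each phrase starts right after the previous one ends, receives a penalty of 0.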

Feature 4: Chinese character similarity
• The Chinese and Japanese writing systems both use Chinese characters, and their similarity should be a powerful clue for deriving translation phrase pairs (Zhang et al., 2005)
• We define the feature value h4 between we and wf as follows:
• Differences between the Chinese and Japanese forms of characters are ignored
• Example: h4(万歩計, 计步器) = h4(万歩計, 計歩器) = h4(ABC, CBD) = 1 – 2 / 3 = 0.333
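The definition of h4 is not preserved in this transcript. The sketch below assumes one formula that is consistent with the example: 1 minus the character-level edit distance divided by the length of the longer term. The variant table `VARIANTS` is a two-entry toy mapping for this example only, not a real character-variant resource.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# Toy mapping from simplified Chinese forms to Japanese forms (hypothetical, incomplete)
VARIANTS = {"计": "計", "步": "歩"}


def normalize(term):
    """Replace characters by their mapped variant so that form differences are ignored."""
    return "".join(VARIANTS.get(ch, ch) for ch in term)


def char_similarity(w_e, w_f):
    """Assumed form of h4: 1 - edit distance / length of the longer term."""
    a, b = normalize(w_e), normalize(w_f)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


h4_abstract = char_similarity("ABC", "CBD")
h4_terms = char_similarity("万歩計", "计步器")
```

Under this assumption both calls give the same value, because 万歩計 and 計歩器 share only the middle character, just as ABC and CBD do.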
