Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language. Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1, 2). LREC 2008, 29 May 2008. (1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo. (2) School of Computer Science, University of Manchester / National Centre for Text Mining.
Introduction • Building bilingual lexicons via pivot languages • [Figure: the Chinese term 计步器 (jìbùqì) is linked through the C-E lexicon to the English terms “odometer” and “pedometer”, which the E-J lexicon links to the Japanese terms オドメーター (odomētā), ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), and 万歩計 (mampokei). Photo credit: Creative Commons Attribution ShareAlike 2.0 License, by skippy13.]
Advantages of the pivotal approach • Constructing a Japanese-Chinese (J-C) lexicon from Japanese-English (J-E) and English-Chinese (E-C) lexicons through English terms • J-E and E-C lexicons are well-supported for many terms and domains, compared to J-C lexicons • Especially for technical terms, there are few J-C lexicons because technical terms are in most cases first written in English • The pivotal approach could help us to (semi-)automatically find J-C translation term pairs
Mismatch problem • We cannot find a Chinese-Japanese term pair whose terms do not share an identical English translation. • Is it possible to generate the following lexical item?
Merging Two Bilingual Lexicons • “Exact merging” • Cannot merge pairs that do not share an identical English translation (the mismatch problem) • Challenges to merge more terms • “Word-based merging” • “Alignment-based merging”
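As an illustration of “exact merging”, the following minimal Python sketch joins two term lists on identical English terms. The function name `exact_merge` and the toy entries are assumptions made for this example, not the lexicons used in the talk.

```python
from collections import defaultdict

def exact_merge(je_pairs, ce_pairs):
    """Join (Japanese, English) and (Chinese, English) pairs on identical English terms."""
    english_to_japanese = defaultdict(set)
    for japanese, english in je_pairs:
        english_to_japanese[english].add(japanese)
    merged = set()
    for chinese, english in ce_pairs:
        # Only pairs sharing exactly the same English term are merged.
        for japanese in english_to_japanese.get(english, ()):
            merged.add((japanese, chinese))
    return merged

# Toy entries (hypothetical): the second pair is lost because the
# English sides differ ("global warming" vs. "global heating").
je = [("万歩計", "pedometer"), ("地球温暖化", "global warming")]
ce = [("计步器", "pedometer"), ("全球变暖", "global heating")]
print(exact_merge(je, ce))
```

The second toy pair shows the mismatch problem: both terms describe the same concept, but no J-C pair is produced because the English strings are not identical.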
Word-based merging • Tokenize a term into word tokens, and • Translate each word by the bilingual lexicon • [Figure: 全球变暖 (quánqiú-biànnuǎn), “global heating”, 地球 温暖化 (chikyū - ondanka)]
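One possible reading of word-based merging is sketched below in Python: the English term is split into words, each word is looked up in a word-level lexicon, and the per-word translations are concatenated. The function name, the whitespace tokenization, and the toy word lexicon are assumptions for illustration only.

```python
from itertools import product

def word_based_translate(english_term, word_lexicon):
    """Translate an English term word by word; return all combinations."""
    options = []
    for word in english_term.split():
        if word not in word_lexicon:
            return []  # one untranslatable word blocks the whole term
        options.append(word_lexicon[word])
    # Concatenate one translation choice per word, keeping word order.
    return ["".join(choice) for choice in product(*options)]

# Hypothetical English-to-Japanese word lexicon.
ej_words = {"global": ["地球"], "heating": ["温暖化"]}
print(word_based_translate("global heating", ej_words))
```

This sketch keeps the English word order, which is a simplification; it would not handle terms whose translation requires reordering.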
Alignment-based merging: Overview • Align each word, • Calculate word translation probabilities, and • Translate each word by the probabilities • [Figure: word alignments among the Chinese words 全球 and 变暖, the English words “global”, “warming”, and “heating”, and the Japanese words 地球 and 温暖化]
Alignment-based merging: Overview • [Flow diagram: the C-E lexicon yields C-E translation word pairs (with probabilities), and the J-E lexicon yields J-E translation word pairs (with probabilities); merging the word pairs and re-calculating the probabilities yields J-C translation word pairs (with probabilities); word-by-word translation, with term frequencies on the Web added, produces Japanese translations of the C-E lexicon. The diagram also carries the labels “phrase” and “phrase-based SMT”.]
Alignment-based merging • Apply word alignment (GIZA++) (Och & Ney, 2003) to all term pairs • Calculate word translation probabilities from co-occurrence frequencies • For both of the bilingual lexicons, source (f)-pivot (p) and pivot (p)-target (e) • C(wp, wf; ap-f): co-occurrence frequency of wp and wf, which are aligned by GIZA++
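The slide does not show the estimation formula itself. A common choice, assumed here, is the relative frequency p(wf | wp) = C(wp, wf) / Σ C(wp, wf'). The Python sketch below computes it from a list of aligned word pairs; the function name and the toy alignments are hypothetical, and the alignment step (GIZA++) is taken as already done.

```python
from collections import Counter, defaultdict

def translation_probs(aligned_pairs):
    """Estimate p(w_f | w_p) by relative frequency over aligned (w_p, w_f) pairs."""
    counts = Counter(aligned_pairs)            # C(w_p, w_f)
    totals = Counter()
    for (w_p, _), c in counts.items():
        totals[w_p] += c                       # sum over w_f of C(w_p, w_f)
    probs = defaultdict(dict)
    for (w_p, w_f), c in counts.items():
        probs[w_p][w_f] = c / totals[w_p]
    return probs

# Hypothetical alignment output: (pivot word, aligned word).
aligned = [("global", "地球")] * 3 + [("global", "グローバル"), ("warming", "温暖化")]
probs = translation_probs(aligned)
print(probs["global"])
```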
Alignment-based merging • Calculate word translation probabilities from a target-language word to a source-language word (Utiyama & Isahara, 2007):
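The formula on this slide is not present in the text. A usual pivot formulation, assumed here rather than taken from the slide, sums over pivot words: p(wf | we) = Σ over wp of p(wf | wp) · p(wp | we). The Python sketch below applies that sum to two nested probability tables; the function name and the toy numbers are hypothetical.

```python
def pivot_probs(p_f_given_p, p_p_given_e):
    """Combine p(w_f | w_p) and p(w_p | w_e) into p(w_f | w_e) by summing over pivot words."""
    merged = {}
    for w_e, pivots in p_p_given_e.items():
        row = {}
        for w_p, p_pe in pivots.items():
            for w_f, p_fp in p_f_given_p.get(w_p, {}).items():
                row[w_f] = row.get(w_f, 0.0) + p_fp * p_pe
        merged[w_e] = row
    return merged

# Toy tables (hypothetical values).
p_p_given_e = {"温暖化": {"warming": 0.6, "heating": 0.4}}
p_f_given_p = {"warming": {"变暖": 1.0}, "heating": {"变暖": 0.5, "加热": 0.5}}
merged = pivot_probs(p_f_given_p, p_p_given_e)
print(merged["温暖化"])
```

In this toy case 变暖 receives 0.6 · 1.0 + 0.4 · 0.5, so two different English words both contribute to the same merged pair, which is what exact merging cannot do.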
Alignment-based merging • Calculate the translation probabilities (scores) based on the noisy channel model (Brown et al., 1990); in the formula, the index i denotes the i-th word of we • The language model p(we) is calculated from the number of Web search results (Google) for the term we • p(we) ∝ (hit count of we) • Generate the merged lexicon from the pairs whose translation probabilities are greater than zero • New_Lexicon = {(wf, we) | Pr(we|wf) > 0 and Pr(wf|we) > 0}
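The scoring formula is missing from the slide text. The Python sketch below assumes the usual noisy channel form, score(we | wf) ∝ p(we) · Π p(wf_i | we_i), with p(we) replaced by a Web hit count and words paired by position. The function name, the position-wise pairing, and the numbers are assumptions for illustration.

```python
def noisy_channel_score(target_words, source_words, p_src_given_tgt, hit_count):
    """Unnormalized score: hit count of the target term times word translation probabilities."""
    if len(target_words) != len(source_words) or hit_count <= 0:
        return 0.0
    channel = 1.0
    for w_e, w_f in zip(target_words, source_words):
        # Unknown word pairs get probability zero, so the candidate is dropped.
        channel *= p_src_given_tgt.get(w_e, {}).get(w_f, 0.0)
    return hit_count * channel

# Hypothetical word probabilities p(w_f | w_e) and hit count.
p_table = {"地球": {"全球": 0.5}, "温暖化": {"变暖": 0.8}}
score = noisy_channel_score(["地球", "温暖化"], ["全球", "变暖"], p_table, hit_count=1000)
print(score)
```

A candidate enters the merged lexicon only if its score is greater than zero, which matches the New_Lexicon condition on the slide when applied in both directions.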
Experimental settings • Used lexicons: Bilingual lexicons that consist of technical terms • C-E: Wanfang Data E-C & C-E Science and Technology Dictionary • J-E: JST Machine Translation Dictionary • By “exact merging,” we can translate about 22% of Japanese (or Chinese) terms Utilization ratio
Experimental results • Utilization ratio • Alignment-based merging drastically improved the utilization ratio, and the size of the merged lexicon also increased • Accuracy (by manual evaluation) • MRR: Mean Reciprocal Rank (Voorhees, 1999), the mean of the reciprocal ranks over all source terms • Prec1: precision of the highest-ranked terms • Prec10: proportion of source terms whose 10-best outputs include the correct one
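The three accuracy measures can be sketched in Python as follows. The function name, the data layout (a ranked candidate list per source term plus a set of correct translations), and the toy data are assumptions; source terms with no correct candidate contribute zero to all three measures.

```python
def evaluate(ranked_outputs, gold, k=10):
    """Compute MRR, Prec1, and Prec-k over all source terms."""
    rr_sum = 0.0
    top1 = 0
    topk = 0
    for source, candidates in ranked_outputs.items():
        correct = gold.get(source, set())
        # 1-based rank of the highest-ranked correct candidate, if any.
        rank = next((i for i, c in enumerate(candidates, 1) if c in correct), None)
        if rank is None:
            continue
        rr_sum += 1.0 / rank
        top1 += rank == 1
        topk += rank <= k
    n = len(ranked_outputs)
    if n == 0:
        return {"MRR": 0.0, "Prec1": 0.0, "Prec10": 0.0}
    return {"MRR": rr_sum / n, "Prec1": top1 / n, "Prec10": topk / n}

# Toy data (hypothetical): correct answers at rank 1, rank 2, and not found.
outputs = {"s1": ["a", "b"], "s2": ["x", "y"], "s3": ["p"]}
gold = {"s1": {"a"}, "s2": {"y"}, "s3": {"q"}}
metrics = evaluate(outputs, gold)
print(metrics)
```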
Experimental results: Examples (1/2) • A Chinese-to-Japanese example of “角膜 实质 炎” (jiǎomó - shízhì - yán; keratitis parenchymatosa)
Experimental results: Examples (2/2) • A J-to-C example of “発育 状態” (hatsuiku - jōtai; growth status)
Conclusion • Alignment-based merging of two bilingual lexicons via a pivot language is proposed • The alignment-based merging achieved a utilization ratio of at least 75% in our experiments • The precision remains low, at 0.14 (Japanese-to-Chinese) and 0.20 (Chinese-to-Japanese); a more sophisticated scoring method could improve it • Future directions • To choose the correct translation by examining the context or semantic classes of source and target terms • To evaluate a machine translation system with this lexicon integrated
Thank you for your attention • Acknowledgments • MEXT, Japan • Japan Science and Technology Agency (JST), Japan • NICT, Japan • Wanfang Data, China
Experimental Results • Our system could generate at least one Japanese translation for 73.4% (385,509/525,259) of the items in the C-E lexicon • [Table: labeled “Chinese input term” and “Japanese reference translation”; correct Japanese translations are highlighted. Surviving examples: (infectious hepatitis virus, 感染性肝炎ウイルス), (coliphage, 大腸菌ファージ)]
Experimental Results • [Table, continued. Annotation: same character, but the meanings are not identical. Surviving examples: (acoustic delay line storage, 音響遅延線記憶装置), (complement form, 補数形式)]
Manual evaluation • A human evaluator checked the translation results of 200 Chinese terms classified in the category “Computer” by the C-E lexicon • Terms that could be translated into Japanese: 181 (90.5%) • Terms whose top-10 translations included the correct one: 135 (67.5%) • Terms whose top translation was correct: 73 (36.5%) • MRR (mean reciprocal rank) = 0.466 • The average of the inverse ranks of the highest-ranked correct translations
Manual evaluation
• 1. 数 组 元素 – array element – 配列 元素 • The Japanese translation is not used in real texts. • Possible solutions: • Strengthen the language model • Adjust the weights of the features
• 2. 计算机 化 管理 学会 – ICM – 特 発 性 心筋 障害 • The Chinese means “Institution for Computerization Management”, and the Japanese means “Idiopathic Cardiomyopathy” • Possible solutions: • Special treatment for acronyms
• 3. 信息量 – information content – 量 • The Japanese dropped the translation of “information” • Possible solutions: • Add parallel corpora for training
• 4. 转镜 式激 光束 影像 记录 仪 – laser beam rotating mirror image recorder – (No Japanese translations) • All the English words seem to be common, but the system failed to generate Japanese translations (maybe because the score was below the threshold for searching hypotheses)
Conclusion • We proposed a method that uses phrase-based SMT to construct a J-C lexicon from J-E and C-E lexicons. • We obtained J translations for 73.4% of the items in the C-E lexicon, outperforming “exact matching” (22.2%). • 36.5% of the top J translations were correct, and 67.5% of the top-10 J translations included the correct one. • This method could support the manual construction of bilingual dictionaries, and the resulting lexicon could be used for MT. • Future work • Parameter optimization of SMT by using existing J-C lexicons • Chinese character similarity considering each similarity between individual characters • More sophisticated reordering model (considering parts-of-speech) • Other translation directions (EJ, JC, EC)
Acquisition of Translation Pairs of Technical Terms • Large-scale translation dictionaries (lexicons) of technical terms are required for translating technical documents • For constructing such dictionaries, we must ask experts who can deal with both languages • This is very costly • We must keep up with the rapid increase of new terms • Automatic acquisition of translation candidates of technical terms • Support for constructing the dictionary • Improvement of the performance of machine translation systems
J-E bilingual lexicon • 527,206 translation pairs • Numbers of distinct terms: 465,565 J terms, 509,259 E terms
C-E bilingual lexicon • Wanfang Data E-C & C-E Science and Technology Dictionary • 525,259 pairs
Construction of the C-J bilingual lexicon • Attach Japanese translations for each lexical item of C-E lexicon
Overview of constructing J-C lexicon • We treat the C-E and J-E lexicons as parallel corpora and use them as training data for constructing a J-C SMT system • Word/phrase-level merging in English becomes possible by applying an SMT approach to the C-E and J-E lexicons • We apply C-J phrase-based SMT to the Chinese terms in the C-E lexicon • Statistical approaches seem to be effective because of the similarities in semantics and word order between C and J • It is easy to introduce other clues such as Chinese character similarity
Collecting J-E & C-E translation phrase pairs • Apply morphological analyzers, and obtain word alignments by GIZA++ (Och and Ney, 2003) for the J-E and C-E lexicons • Collect phrase pairs by the “grow-diag-final” method (using Moses; Koehn et al., 2007) and calculate the probabilities by relative frequencies • [Figure: J-E lexicon example aligning ころがり 疲れ 寿命 with “rolling fatigue life”]
Merging phrase pairs (Utiyama & Isahara, 2007) • J-E & E-C phrases to J-C phrases • (Ze is a normalization factor)
Features for learning of the log-linear model • We employ the following features h1-h4 for the log-linear model: • h1: Phrase translation prob., defined over the i-th phrase pairs of the translation • h2: 3-gram language model of the target language, where p(we) is a language model probability from other monolingual corpora • h3: Phrase reordering penalty (Koehn et al., 2003) • h4: Chinese character similarity (Zhang et al., 2005)
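A log-linear model scores a candidate as a weighted sum of its feature values, score = Σ λk · hk. The Python sketch below shows only that combination step; the feature values and the weights are hypothetical, and learning the weights is outside the sketch.

```python
def log_linear_score(features, weights):
    """Weighted sum of feature values: sum over k of lambda_k * h_k."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights and candidates; h1 and h2 are log probabilities here.
weights = {"h1": 1.0, "h2": 1.0, "h3": 1.0, "h4": 1.0}
candidates = {
    "candidate_a": {"h1": -1.0, "h2": -2.0, "h3": 0.0, "h4": 0.333},
    "candidate_b": {"h1": -0.5, "h2": -4.0, "h3": -1.0, "h4": 0.0},
}
best = max(candidates, key=lambda name: log_linear_score(candidates[name], weights))
print(best)
```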
Feature 3: Phrase reordering penalty (Koehn et al., 2003) • The feature value is the sum of penalties d defined by the following formula for the phrase pairs we, wf: d = – |ai – bi-1 – 1| • where ai is the position of the first word of wf and bi-1 is the position of the last word of wf translated in the previous step • Example (source words f1 … f8, target words e1 … e6): d(e1 e2, f1 f2 f3) = 0; d(e3, f8) = – |8 – 3 – 1| = – 4; d(e4, f6 f7) = – |6 – 8 – 1| = – 3; d(e5 e6, f4 f5) = – |4 – 7 – 1| = – 4; h3(e1…e6, f1…f8) = – 11
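The worked example on this slide can be reproduced with a few lines of Python. The sketch assumes each phrase is given as the (first, last) source positions it covers, listed in target order, and that the position before the first phrase is 0; the function name is hypothetical.

```python
def reordering_penalty(source_spans):
    """Sum of -|a_i - b_(i-1) - 1| over phrases, given 1-based (first, last) source spans."""
    total = 0
    prev_last = 0  # b_0: nothing translated yet
    for first, last in source_spans:
        total -= abs(first - prev_last - 1)
        prev_last = last
    return total

# Spans from the slide: f1-f3, f8, f6-f7, f4-f5 (in target order e1 ... e6).
spans = [(1, 3), (8, 8), (6, 7), (4, 5)]
print(reordering_penalty(spans))  # -11
```

A fully monotone translation, where each phrase starts right after the previous one ends, gets a penalty of 0.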
Feature 4: Chinese character similarity • Chinese and Japanese writing systems both have Chinese characters, and their similarity should be a powerful clue to derive the translation phrase pairs (Zhang et al., 2005) • We define the feature value h4 between we and wf as follows: • Differences of Chinese and Japanese forms of characters are ignored • Example: h4(万歩計, 计步器) = h4(万歩計, 計歩器) = h4(ABC, CBD) = 1 – 2 / 3 = 0.333
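The definition of h4 is missing from the slide text. The Python sketch below assumes one formula that is consistent with the example, namely 1 minus the character edit distance divided by the longer length; this is a guess, not the authors' stated definition. The small variant table used to ignore Chinese/Japanese character forms is also hypothetical and covers only the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical mapping from simplified Chinese forms to Japanese forms.
VARIANTS = {"计": "計", "步": "歩"}

def char_similarity(japanese, chinese):
    """Assumed h4: 1 - edit_distance / max length, after normalizing character forms."""
    normalized = "".join(VARIANTS.get(ch, ch) for ch in chinese)
    longest = max(len(japanese), len(normalized))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(japanese, normalized) / longest

print(round(char_similarity("万歩計", "计步器"), 3))
print(round(char_similarity("ABC", "CBD"), 3))
```

Other definitions, for example one based on position-wise mismatches, would give the same value on this example, so the choice of edit distance should be treated as illustrative.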