Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using KATAKANA matching Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department of Information Management Toshiba Corp. R&D Center
Outline • Motivation • Objective • Introduction • Background • Method • Simulations • Discussion • Conclusion
Motivation • Limitation of statistical approaches
Objective • Superiority of linguistic approaches
Introduction • A tool for extracting bilingual knowledge from noisy-parallel English-Japanese text • Dynamic programming • Phonetic similarities • Partial matching of English-Japanese • Extract a small reliable bilingual lexicon of anchor points • Establish further bilingual correspondences
Introduction • Types of bilingual knowledge acquisition from parallel corpora • Statistical • Internal distributional evidence of bilingual word pairs • Linguistic • External evidence provided by bilingual lexicons to establish anchor points between pairs of bilingual phrases
Background • The challenge of establishing a bilingual correspondence between English and Katakana • Information is lost when going from English to Katakana • `r' and `l', or `b' and `v', are not distinguished • Redundant vowel sounds appear when going from Katakana back to English • `fra' in “Frankfurt” • `フラ' transliterates as `fura'
Background • How previous research dealt with these problems • Transcribe both words into intermediate representations and match these. • The matching knowledge may be biased towards English pronunciation. • “Chirac” => “シラク”, where `シ' is pronounced `shi'.
Background • A neutral intermediate representation allows for partial matching • When two intermediate representations match above a certain threshold, the words are taken to be in a translation relation. • “パレスチナ” => “Palestine”, “Palestinian”, “Palestinians”
Method • NPT (Nearest Phonetic Transliteration) • Takes each Katakana word and converts it to a phonetic string representing all English spelling combinations of the word. • “ブルンジ”, which is “Burundi” in English • `ル' -> `rloue' • “ブルンジ” -> “buorlouenmgesdjgiou”
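The conversion step can be sketched as a per-kana lookup followed by concatenation. This is only an illustration: the slide gives the mapping for `ル' alone, and the other three entries below are an assumption obtained by splitting the example string “buorlouenmgesdjgiou” at the kana boundaries. The names `KANA_TO_LETTERS` and `npt_string` are hypothetical.

```python
# Hypothetical per-kana table. Only the entry for "ル" appears on the slide;
# the other three are inferred by splitting the slide's example string and
# may not match the authors' actual rules.
KANA_TO_LETTERS = {
    "ブ": "buo",
    "ル": "rloue",      # given on the slide
    "ン": "nm",
    "ジ": "gesdjgiou",
}


def npt_string(katakana: str) -> str:
    """Concatenate the candidate English letters for each kana in order.

    Raises KeyError for any kana missing from the toy table above.
    """
    return "".join(KANA_TO_LETTERS[kana] for kana in katakana)


if __name__ == "__main__":
    print(npt_string("ブルンジ"))
```

A real table would need an entry for every kana (and for combinations such as small-kana digraphs), which the slides do not list.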
Method – NPT_score • Example: e = “Burundi”, npt = “buorlouenmgesdjgiou” • npt: NPT string • e: English string • md: maximum depth • d: depth count • s: score
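The scoring procedure itself did not survive in this transcript, so the following is a guess at its general shape rather than the authors' algorithm. It reuses the slide's variable names (`npt`, `e`, `md`, `d`, `s`) and assumes that each English letter is searched for in the NPT string within at most `md` positions of the current point, and that the score is the fraction of English letters found. The default `md=5` is an arbitrary choice for the example.

```python
def npt_score(npt: str, e: str, md: int = 5) -> float:
    """Assumed scoring rule: fraction of letters of e found, in order,
    in npt when each search looks ahead at most md characters."""
    e = e.lower()
    if not e:
        return 0.0
    pos = 0      # current position in the NPT string
    s = 0        # number of English letters matched so far
    for ch in e:
        d = npt[pos:pos + md].find(ch)   # depth at which ch is found, or -1
        if d == -1:
            continue                     # letter not found within md: skip it
        s += 1
        pos += d + 1                     # resume just after the matched letter
    return s / len(e)


if __name__ == "__main__":
    npt = "buorlouenmgesdjgiou"
    print(npt_score(npt, "Burundi"))   # every letter is found in order
    print(npt_score(npt, "Mass"))      # no letter is found within the window
```

Under these assumptions a candidate would be accepted when its score exceeds a chosen threshold, which is what lets inflected or partial forms pass without an exact match.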
Method • Several heuristics save search time and detect substrings • Only English words whose first letter is in upper case are taken as candidate proper nouns. • Limit the minimum length of Katakana words available for matching. • “クリスマス” (=“Christmas”) and “Mass”
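The two heuristics can be sketched as simple filters. The regular expressions, the function names, and the minimum length of 3 are assumptions for illustration; the slides do not give the actual cut-off or tokenization.

```python
import re

# Runs of Katakana characters, including the long-vowel mark "ー".
KATAKANA_RUN = re.compile(r"[ァ-ヶー]+")
# English tokens whose first letter is upper case.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]*\b")


def english_candidates(text: str) -> list[str]:
    """Capitalized tokens, treated as candidate proper nouns."""
    return CAPITALIZED.findall(text)


def katakana_words(text: str, min_len: int = 3) -> list[str]:
    """Katakana runs at least min_len characters long (min_len is assumed)."""
    return [w for w in KATAKANA_RUN.findall(text) if len(w) >= min_len]


if __name__ == "__main__":
    print(english_candidates("The president of Burundi visited Tokyo."))
    print(katakana_words("ブルンジとパリ"))
```

Note that the capitalization filter also lets through sentence-initial words such as "The", so it narrows the candidate set without guaranteeing that every candidate is a proper noun.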
Simulations • Two corpora of English and Japanese headline newswire articles. • The test corpus had 150 aligned articles • 1730 English paragraphs • 771 Japanese paragraphs • 871 Katakana words • 9742 potential English proper nouns • About 65 comparisons for each Katakana word in each article (9742 / 150 ≈ 65).
Simulations • Baseline • Soundex algorithm • K&H • Convert the Katakana and the English word to a simplified disjunctive phonetic form. • Does not allow either partial matches or matching of substrings.
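For reference, the standard (American) Soundex code can be computed as below. This is the textbook algorithm, not necessarily the exact variant used as the baseline, and the slides do not say how Katakana words were romanized before coding.

```python
# Digit classes of standard Soundex; vowels, h, w and y carry no digit.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":        # h and w do not separate equal codes
            prev = code
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Burundi"):
        print(name, soundex(name))
```

Because the code is computed over the whole word and truncated to four characters, two words are compared only through these fixed-length codes.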
Results F-measure 81% 58% 39%
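The F-measure combines precision and recall into one number. The sketch below assumes the balanced form (equal weight on precision and recall); the slides do not state which weighting was used, and the function name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative inputs only, not figures from the experiment.
    print(f_measure(0.9, 0.6))
```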
Discussion • NPT yielded the best result overall. • A higher threshold gives higher precision. • K&H cannot handle partial matches, and its intermediate form may lose information. • Partial matching • Finding substrings • Identifying cognatively connected translation pairs • “インドネシア” => “Indonesia”, “Indonesian”, “Indonesians”, “Indonesias”
Conclusion • Back-transliterating from Katakana to English is unexpectedly difficult. • The set of matching rules is quite small and could be improved. • Future research • Induce the rules automatically from a corpus of examples.