Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus
Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi
University of Tokyo
2005/10/13, IJCNLP2005
Table of Contents
• Japanese Character Set
• About Word Segmentation
• Proposed Method
  • Method using a Japanese-English Dictionary
  • Method using a Huge Corpus and a Dictionary
  • Method using Relation in WOD
• Evaluation and Discussion
• Conclusion
Japanese Character Set
• Kanji (ideogram): 6,000
  • Nouns: 東京 (Tokyo), 大学 (university), 情報 (information) …
  • Stems of verbs/adjectives: 書く (write), 美しい (beautiful) …
• Hiragana (phonogram): 83
  • Function words: が, を, れる, られる …
  • Endings of verbs/adjectives: 書く (write), 美しい (beautiful) …
• Katakana (phonogram): 86
  • Loan words: コンピュータ (computer), ドイツ (Germany) …
Katakana Set (86 Characters)
[Chart: Katakana characters arranged by consonant (K, S, T, N, H, M, Y, R, W, Z, D, B, P) and vowel (a, i, u, e, o)]
Word Segmentation
• Kanji and Hiragana
  Ex. 彼は大学に通う (kare-wa-daigaku-ni-kayo-u): He / pp / Univ. / pp / goes
• Katakana
  Ex. エクストラバージンオリーブオイル (extra virgin olive oil)
  Ex. ジャパンカップサイクルロードレース (Japan cup cycle road race)
Why Is Katakana Word Segmentation Necessary?
• ○ トマトソース (tomato so-su, "tomato sauce"): a kind of "sauce", something to put on a dish
• ○ ホワイトソース (howaito so-su, "white sauce"): a kind of "sauce", something to put on a dish
• × リソース (riso-su, "resource"): ends in ソース but is not a kind of "sauce"
Similar Problem in German
• Lebensversicherungsgesellschaftsangestellter: "life insurance company employee"
• Donaudampfschleppschiffahrtgesellschaftskapitän: "captain of a Danube steam tow company"
Word Segmentation So Far
• Many studies on word segmentation
• No study targeting Katakana words
• How word segmentation has handled Katakana words so far:
  • Use a dictionary with some manually registered Katakana words
  • For unknown words, treat a whole continuous Katakana string as a single word
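Problem Setting
• Corpus (excerpt): ・・・あとは粉を付けてバターで焼いたムニエルや、白ワインで蒸し直したり、パン粉をまぶしてフライにしたり、ホワイトソースやトマトソースをかけたグラタンにもなります。・・・
  ("It also works as a meunière dusted with flour and fried in butter, re-steamed in white wine, breaded and deep-fried, or as a gratin topped with white sauce or tomato sauce.")
• Word-Occurrence data (WOD): ラーメン 28727, スープ 20808, ・・・, トマトソース 11641, ・・・, トマトスープ 8435, ・・・, トマト 7887, ソース 7570, ・・・
• Basic Vocabulary: ラーメン, スープ, トマト, ソース, ・・・
• Japanese-English Translation Information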
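The step from corpus to WOD can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: it assumes that a Katakana word candidate is a maximal run of Katakana letters plus the long-vowel mark, and the names `KATAKANA_RUN` and `build_wod` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a candidate is a maximal run of Katakana letters (U+30A1..U+30FA)
# plus the long-vowel mark (U+30FC). The slides do not give the exact class.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def build_wod(sentences):
    """Count how often each maximal Katakana run occurs in the sentences."""
    wod = Counter()
    for sentence in sentences:
        wod.update(KATAKANA_RUN.findall(sentence))
    return wod


# Two toy sentences; the first is taken from the corpus excerpt on the slide.
corpus = [
    "ホワイトソースやトマトソースをかけたグラタンにもなります。",
    "トマトソースとトマトスープを作る。",
]
wod = build_wod(corpus)
for word, freq in wod.most_common():
    print(freq, word)
```

Each unsegmented run is counted as one entry, which mirrors the baseline of treating a continuous Katakana string as a single word.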
Overview of the Method
[Diagram: Corpus, WOD, and Basic Vocabulary, with three methods]
• Dictionary: highly reliable
• English Corpus
• WOD Freq.
Method using Dictionary
• Segmentation using a JE dictionary
  Ex. トマトソース
  JE Dictionary: トマトソース = "tomato sauce", トマト = "tomato", ソース = "sauce", "source"
  "tomato sauce" = "tomato" + "sauce" → トマト : ソース
• Translation is one word → single-word
  Ex. サンドウィッチ = "sandwich"
• Entries of Japanese Dic. → single-word
  Ex. インゲン (= いんげん) (a kidney bean)
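The dictionary rules above can be sketched as follows. This is a hedged illustration under stated assumptions: `JE_DICT` is a toy dictionary built from the slide examples, and the function names (`segmentations`, `dictionary_segmentation`, `is_single_word`) are hypothetical. A split is accepted when the joined translations of its parts match a translation of the whole word.

```python
from itertools import product

# Toy JE dictionary; the entries come from the slide examples only.
JE_DICT = {
    "トマトソース": ["tomato sauce"],
    "トマト": ["tomato"],
    "ソース": ["sauce", "source"],
    "サンドウィッチ": ["sandwich"],
}


def segmentations(word, lexicon):
    """Return every split of `word` into two or more lexicon entries."""
    def split(rest):
        if not rest:
            return [[]]
        results = []
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in lexicon:
                results.extend([head] + tail for tail in split(rest[i:]))
        return results

    return [seg for seg in split(word) if len(seg) >= 2]


def dictionary_segmentation(word, lexicon):
    """Accept a split whose joined translations equal a translation of `word`."""
    whole = set(lexicon.get(word, []))
    for seg in segmentations(word, lexicon):
        for combo in product(*(lexicon[part] for part in seg)):
            if " ".join(combo) in whole:
                return seg
    return None


def is_single_word(word, lexicon):
    """Rule from the slide: a one-word English translation marks a single word."""
    return any(" " not in t for t in lexicon.get(word, []))


print(dictionary_segmentation("トマトソース", JE_DICT))
print(is_single_word("サンドウィッチ", JE_DICT))
```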
Overview of the Method
[Diagram: Corpus, WOD, and Basic Vocabulary, with three methods]
• Dictionary: highly reliable, low coverage
• English Corpus: high precision, more coverage
• WOD Freq.
Method using a Huge English Corpus
• All possible segmentations into Katakana words in the JE Dictionary
• Translation → possible English phrases
• Number of phrasal hits from a Web search engine
Ex. パセリソース (parsley sauce)
  (i) パセリ : ソース → parsley source: 554 hits; parsley sauce: 20600 hits ◎
  (ii) パセ : リソース → pase resource: 3 hits
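The selection among candidate phrases can be sketched as below. This is an illustration only: the search engine is replaced by a hypothetical lookup table `HITS` filled with the counts shown on the slide, `JE_DICT` is a toy dictionary, and `best_phrase` is a made-up name.

```python
from itertools import product

# Toy dictionary and hit counts, copied from the slide example.
JE_DICT = {
    "パセリ": ["parsley"],
    "ソース": ["sauce", "source"],
    "パセ": ["pase"],
    "リソース": ["resource"],
}
# Stand-in for a Web search engine: phrase -> number of phrasal hits.
HITS = {"parsley source": 554, "parsley sauce": 20600, "pase resource": 3}


def segmentations(word, lexicon):
    """Return every split of `word` into lexicon entries (one or more parts)."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        if word[:i] in lexicon:
            tails = segmentations(word[i:], lexicon)
            out.extend([word[:i]] + tail for tail in tails)
    return out


def best_phrase(word, lexicon, hits):
    """Pick the (segmentation, phrase, hit count) with the most hits."""
    best = None
    for seg in segmentations(word, lexicon):
        if len(seg) < 2:
            continue  # only compound candidates are of interest here
        for combo in product(*(lexicon[part] for part in seg)):
            phrase = " ".join(combo)
            count = hits.get(phrase, 0)
            if best is None or count > best[2]:
                best = (seg, phrase, count)
    return best


print(best_phrase("パセリソース", JE_DICT, HITS))
```

In a real setting the `hits` lookup would be a query against a large English corpus or search engine, which this sketch deliberately leaves out.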
Threshold for Hit Number
• Even an inappropriate segmentation and its nonsensical translation get some hits on the Web
  Ex. デミ : グラス → demi glass: 207 (correct: demi-glace)
  Ex. バン : バンジー → van bungee: 159 (correct: the Chinese dish "ban-ban-ji")
• The longer the Katakana word is, the more likely it is a compound, so the threshold decreases with length:
  Threshold = $C / N^{L}$
  L: the length of the Katakana word
  C: 400,000
  N: 2
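As a worked check of the threshold, the sketch below assumes that the formula reads C / N^L, that L is counted in characters, and that a phrase is accepted when its hit count exceeds the threshold. The slide does not state the comparison explicitly, so `accept` is a hypothetical helper.

```python
C = 400_000  # constant from the slide
N = 2        # base from the slide


def hit_threshold(katakana_word):
    """Threshold C / N**L, with L assumed to be the length in characters."""
    return C / N ** len(katakana_word)


def accept(katakana_word, hits):
    """Assumption: a segmentation is accepted when hits exceed the threshold."""
    return hits > hit_threshold(katakana_word)


for word, hits in [("デミグラス", 207), ("バンバンジー", 159), ("パセリソース", 20600)]:
    print(word, hit_threshold(word), accept(word, hits))
```

Under these assumptions the two inappropriate segmentations fall below their thresholds (12,500 and 6,250), while "parsley sauce" with 20,600 hits clears 6,250.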
Overview of the Method
[Diagram: Corpus, WOD, and Basic Vocabulary, with three methods]
• Dictionary: highly reliable, low coverage
• English Corpus: high precision, more coverage; depends on the JE dictionary and on natural English compounds
  • ハイビジョン (hai-bijyon): × high vision → 11,000 hits; ○ high definition → 5,450,000 hits
  • ペーパーテスト (pe-pa-tesuto): × paper test → 45,400 hits; ○ written test → 415,000 hits
• WOD Freq.: high recall
Method using Relation in WOD
• Try to find compounds based only on the information in a WOD
• Compare the geometric mean of the frequencies of the possible constituent words with the frequency of the original word
WOD: トースト 652, ガーリック 515, ガーリックトースト 159, スト 60, ガー 32, リック 9, トー 5
Ex. ガーリックトースト 159 (garlic toast)
  ガー : リック : トースト → $(32 \times 9 \times 652)^{1/3} = 57$
  ガーリック : トースト → $(515 \times 652)^{1/2} = 579$
  ガー : リック : トー : スト → $(32 \times 9 \times 5 \times 60)^{1/4} = 17$
  ガーリック : トー : スト → $(515 \times 5 \times 60)^{1/3} = 54$
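The geometric means in this example can be reproduced with a short sketch. The WOD counts are the ones on the slide; the helper names `candidate_splits` and `geometric_mean` are hypothetical.

```python
import math

# WOD frequencies copied from the slide.
WOD = {
    "トースト": 652,
    "ガーリック": 515,
    "ガーリックトースト": 159,
    "スト": 60,
    "ガー": 32,
    "リック": 9,
    "トー": 5,
}


def candidate_splits(word, wod):
    """Return every split of `word` into two or more WOD entries."""
    def split(rest):
        if not rest:
            return [[]]
        out = []
        for i in range(1, len(rest) + 1):
            if rest[:i] in wod:
                out.extend([rest[:i]] + tail for tail in split(rest[i:]))
        return out

    return [seg for seg in split(word) if len(seg) >= 2]


def geometric_mean(freqs):
    """n-th root of the product of n frequencies."""
    return math.prod(freqs) ** (1 / len(freqs))


means = {
    tuple(seg): geometric_mean([WOD[part] for part in seg])
    for seg in candidate_splits("ガーリックトースト", WOD)
}
for seg, value in means.items():
    print(" : ".join(seg), round(value))
```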
Threshold for Geometric Mean
$F_o < F_g'$, where $F_g' = F_g / (C / N^{l} + \alpha)$
$F_o$: Freq. of the original word
$F_g$: Geometric mean of freq. of constituents
$F_g'$: Modified geometric mean
$l$: Average length of constituents
C: 2,500
N: 4
α: 0.7
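A minimal sketch of the modified geometric mean follows. It assumes the formula reads Fg / (C / N^l + α), that l is the average constituent length in characters, and that Fo < Fg' is the condition for treating the word as a compound. The function names are hypothetical.

```python
import math

C = 2500     # constant from the slide
N = 4        # base from the slide
ALPHA = 0.7  # offset from the slide


def modified_geometric_mean(freqs, constituents):
    """Fg' = Fg / (C / N**l + alpha), l = average constituent length (chars)."""
    fg = math.prod(freqs) ** (1 / len(freqs))
    avg_len = sum(len(c) for c in constituents) / len(constituents)
    return fg / (C / N ** avg_len + ALPHA)


def is_compound(fo, freqs, constituents):
    """Assumption: the word is treated as a compound when Fo < Fg'."""
    return fo < modified_geometric_mean(freqs, constituents)


# Constituent frequencies taken from the WOD example: ガーリック 515, トースト 652.
value = modified_geometric_mean([515, 652], ["ガーリック", "トースト"])
print(round(value, 1))
```

Short constituents are penalized heavily because C / N^l grows quickly as l shrinks; the outcome for a given word depends on C, N, α, and on how l is measured.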
Experiments
• Data
  • 87K Katakana types in 5.8M sentences of newspaper articles (12-year volume)
  • 43K Katakana types in 2.8M sentences of cooking-related web pages
• Evaluation
  • 500-word test set for each data set, with the correct segmentation assigned manually
  • Automatic segmentation is compared with the gold-standard data → precision/recall
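One way such a comparison could be scored is sketched below. The slides do not define how precision and recall are counted, so this assumes word-level matching on character spans; `spans` and `precision_recall` are hypothetical names, and the two test words are toy data.

```python
def spans(segmentation):
    """Convert a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for word in segmentation:
        out.add((pos, pos + len(word)))
        pos += len(word)
    return out


def precision_recall(system, gold):
    """Score system output against gold, both: Katakana string -> word list.

    Assumption: a system word is correct when its span also appears in gold.
    """
    correct = sys_total = gold_total = 0
    for word, gold_seg in gold.items():
        s, g = spans(system[word]), spans(gold_seg)
        correct += len(s & g)
        sys_total += len(s)
        gold_total += len(g)
    return correct / sys_total, correct / gold_total


gold = {"トマトソース": ["トマト", "ソース"], "パスツール": ["パスツール"]}
system = {"トマトソース": ["トマト", "ソース"], "パスツール": ["パス", "ツール"]}
precision, recall = precision_recall(system, gold)
print(precision, recall)
```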
Experimental Results (2/2)
Katakana words: Freq. ≧ 10
Discussion (1/2)
• Precision: no entry in JE dictionary
  • Neologisms or very rare words
    × シュレッドチーズ (shred cheese) → シュ : レッド : チーズ
  • Proper nouns
    × パスツール (Pasteur) → パス : ツール
• Recall
  • Criteria for compounds
    プールサイド = poolside
  • No entry in JE dictionary
    シュガーローフ (sugar loaf)
Discussion (2/2)
• Context dependency
  • Segmentation
    タコスライス → タコス + ライス (tacos rice) or タコ + スライス (tako slice; tako = octopus)
  • Compound or not
    カラーリング (coloring) or カラー + リング (color ring)
Conclusion
• Segmentation of Japanese Katakana compounds
  • Dictionary
  • Huge English Corpus and JE-Dictionary
  • Relation in WOD
• Future plan
  • Integration with NE detection
  • Use of automatic transliteration