Learning Formulation and Transformation Rules for Multilingual Named Entities. Advisor: Dr. Hsu. Reporter: Chun Kai Chen. Authors: Hsin-Hsi Chen, Changhua Yang and Ying Lin. Proceedings of the ACL 2003.
Outline • Motivation • Objective • Introduction • Multilingual Named Entity Corpora • Rule Mining • Experimental Results • Conclusions • Personal Opinion
Motivation • Past work on multilingual named entities emphasizes transliteration issues • However, the transformation between named entities in different languages is not transliteration only; a name may be translated, transliterated, or both • Victoria Fall ⇔ 維多利亞瀑布 • Little Rocky Mountains ⇔ 小落磯山脈 • Kenmare ⇔ 康美爾 • East Chicago ⇔ 東芝加哥
Objective • Propose a method to extract • formulation rules of named entities for individual languages • transformation rules for mapping among languages • Apply the results to cross-language information retrieval (CLIR)
Introduction(1/3) • In the past, named entity extraction • mainly focused on general domains • was applied to various applications such as information retrieval and question answering
Introduction(2/3) • Most of the previous approaches • dealt with monolingual named entity extraction • Chen et al. (1998) extended it to cross-language information retrieval (CLIR) • A grapheme-based (字母, i.e., letter-based) model was proposed to compute the similarity between a Chinese transliterated name and an English name • Lin and Chen (2000) further classified the work into two directions • forward transliteration (Wan and Verspoor, 1998) • backward transliteration (Chen et al., 1998; Knight and Graehl, 1998) • and proposed a phoneme-based model
Introduction(3/3) • This paper studies • how the language and the named entity type affect the choice between translation and transliteration • We focus on the three more challenging named entity types only, i.e., • named people • named locations • named organizations
Multilingual Named Entity Corpora • NICT location name corpus • developed by the Ministry of Education of Taiwan in 1995 • each entry consists of three parts • foreign location name, Chinese transliteration/translation name, country name • (Victoria Fall, “維多利亞瀑布” (wei duo li ya pu bu), South Africa) • CNA personal name and organization corpora • are used by news reporters to unify the transliteration/translation of names in news stories
Rule Mining • Frequency-Based Approach with a Bilingual Dictionary • Keyword Extraction without a Bilingual Dictionary • Extraction of Transformation Rules • Extraction of Keywords at a Distance
Learning Formulation and Transformation Rules • [Overview figure with four panels] • Frequency-Based with a Bilingual Dictionary: decompose a name pair (Victoria Fall into Victoria, “維多利亞” and Fall, “瀑布”), count the frequency (TFIDF), and consult a dictionary (“Mountain” ⇔ “山”) • Keyword Extraction without a Bilingual Dictionary: for cases such as World Taiwanese Association ⇔ “世台會”, generate candidate segment pairs from (s6) {Aletschhorn Mountain, 阿利奇赫恩山} and (s7) {Catalan Mountain, 卡太蘭山}, count them, and obtain {Mountain, “山” (shan)} • Extraction of Transformation Rules: (s6’) γ Mountain ⇔ δ 山, (s7’) γ Mountain ⇔ δ 山, (s8’) γ Strait ⇔ δ 海峽, (s9’) γ, Strait of ⇔ δ 海峽 • Extraction of Keywords at a Distance: “American Civil Liberties Union” gives “American ∆ Liberties Union”, “American Civil ∆ Union”, “American ∆ Union”
Frequency-Based Approach with a Bilingual Dictionary • We postulate that • a transliterated term is usually an unknown word and is not listed in a lexicon • a translated term often appears in a lexicon • Under this postulation • a translated term (翻譯詞) occurs more often in a corpus • Fall, “瀑布” • a transliterated term (音譯詞) appears only a few times • Victoria, “維多利亞”
Frequency-based method(1/2) • A simple frequency-based method computes the frequencies of terms and uses them to tell apart the transliterated and translated parts of a named entity • Compute the frequency of each word in the foreign name list • Keep those words that • appear more often than a threshold • appear in a common foreign dictionary • These words form the candidate simple keywords • Mountain • Examine the foreign word list again • Cluster the Chinese name list • based on the foreign keywords • here a bilingual dictionary may be consulted • “Mountain” ⇔ “山”
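The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `min_count` threshold, the tiny `common_words` set (standing in for a common foreign dictionary), and the `bilingual` entries are all hypothetical, and the sketch assumes a word must pass both the frequency test and the dictionary test.

```python
from collections import Counter, defaultdict


def tokenize(name):
    """Split a foreign name into words, ignoring commas."""
    return name.replace(",", " ").split()


def simple_keywords(foreign_names, common_words, min_count=2):
    """Words that are frequent enough AND listed in a common dictionary.

    Requiring both conditions is an assumption of this sketch.
    """
    counts = Counter(w for name in foreign_names for w in tokenize(name))
    return {w for w, c in counts.items()
            if c >= min_count and w.lower() in common_words}


def cluster_by_keyword(pairs, keywords):
    """Group Chinese names by the foreign keyword found in their counterpart."""
    clusters = defaultdict(list)
    for foreign, chinese in pairs:
        for w in tokenize(foreign):
            if w in keywords:
                clusters[w].append(chinese)
    return dict(clusters)


def confirm_with_dictionary(clusters, bilingual):
    """Keep dictionary translations that actually occur in a cluster."""
    return {kw: [t for t in bilingual.get(kw, [])
                 if any(t in name for name in names)]
            for kw, names in clusters.items()}


pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
    ("Kenmare", "康美爾"),
]
common_words = {"mountain", "strait", "of"}          # hypothetical dictionary
bilingual = {"Mountain": ["山", "峰"], "Strait": ["海峽"]}  # hypothetical entries

keywords = simple_keywords([f for f, _ in pairs], common_words)
clusters = cluster_by_keyword(pairs, keywords)
confirmed = confirm_with_dictionary(clusters, bilingual)
print(keywords)
print(confirmed)
```

On this toy list, only “Mountain” and “Strait” pass both tests, and the dictionary lookup confirms “山” and “海峽” as their Chinese counterparts.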
Frequency-based method(2/2) • NICT location name corpus • River (河, he), Island (島, dao), Lake (湖, hu), Mountain (山, shan), Bay (灣, wan), Mountain (峰, feng), Peak (峰, feng) • “Mountain” ⇔ “山” (shan) and “峰” (feng) • “峰” (feng) ⇔ “Mountain” and “Peak” • CNA organization name corpus • Suffix • Association (協會, xie hui), University (大學, da xue) • Prefix • International (國際, guo ji), World (世界, shi jie), American (美國, mei guo)
Keyword Extraction without a Bilingual Dictionary (problem) • Abbreviation is commonly adopted in translation, and a dictionary-based approach can hardly capture this phenomenon • (World Taiwanese Association, “世台會”) • Here another approach, which needs no dictionary, is proposed
Keyword Extraction without a Bilingual Dictionary (process) • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • e1 = Aletschhorn and e2 = Mountain are each paired with every segment of s1 s2 … st • length 1: 阿, 利, 奇, 赫, 恩, 山 • length 2: 阿利, 利奇, 奇赫, 赫恩, 恩山 • length 3: 阿利奇, 利奇赫, 奇赫恩, 赫恩山 • length 4: 阿利奇赫, 利奇赫恩, 奇赫恩山 • length 5: 阿利奇赫恩, 利奇赫恩山 • length 6: 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • e1 = Catalan and e2 = Mountain are each paired with every segment of s1 s2 … st • length 1: 卡, 太, 蘭, 山 • length 2: 卡太, 太蘭, 蘭山 • length 3: 卡太蘭, 太蘭山 • length 4: 卡太蘭山 • {e, c} whose frequency > 2 are kept • {Mountain, “山” (shan)}
Keyword Extraction without a Bilingual Dictionary (algorithm) • {E_j, C_j} • E_j is a foreign named entity • C_j is a Chinese named entity • decompose the named entities • E_j • comprises m words w_1 · w_2 … w_m • a candidate segment e_{p,q} is defined as w_p … w_q • C_j • has n syllables s_1 · s_2 … s_n • a candidate segment c_{x,y} is defined as s_x … s_y • we can get pairs of {e_{p,q}, c_{x,y}} from {E_j, C_j} • group and count • the pairs collected from the multilingual named entity list • count the frequency of each pair • pairs with higher frequency denote significant segment pairs
Keyword Extraction without a Bilingual Dictionary (example) • Example • All the pairs {e, c} whose frequency > 2 are kept • {Mountain, “山” (shan)} and {Strait, “海峽” (hai xia)} appear twice • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • (s8) Cook Strait ⇔ 科克海峽 • (s9) Dover, Strait of ⇔多佛海峽
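The decompose, group and count procedure can be illustrated on (s6) to (s9) with a short Python sketch. It rests on several assumptions that are not in the slides: each Chinese character is treated as one syllable, commas are ignored when splitting the foreign name, a pair is counted once per entity pair, and the cut-off is `min_count=2` (the slides state "frequency > 2" but also say the retained pairs appear twice, so the exact threshold is unclear). All function names are hypothetical.

```python
from collections import Counter


def word_segments(words):
    """All contiguous word spans e_{p,q} = w_p ... w_q."""
    n = len(words)
    return {" ".join(words[p:q]) for p in range(n) for q in range(p + 1, n + 1)}


def syllable_segments(chinese):
    """All contiguous spans c_{x,y} = s_x ... s_y, one character per syllable."""
    n = len(chinese)
    return {chinese[x:y] for x in range(n) for y in range(x + 1, n + 1)}


def count_segment_pairs(entity_pairs):
    """Count how many entity pairs contain each candidate pair {e, c}."""
    counts = Counter()
    for foreign, chinese in entity_pairs:
        words = foreign.replace(",", " ").split()
        for e in word_segments(words):
            for c in syllable_segments(chinese):
                counts[(e, c)] += 1
    return counts


entity_pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),  # (s6)
    ("Catalan Mountain", "卡太蘭山"),          # (s7)
    ("Cook Strait", "科克海峽"),               # (s8)
    ("Dover, Strait of", "多佛海峽"),          # (s9)
]

counts = count_segment_pairs(entity_pairs)
min_count = 2  # assumed cut-off, see the note above
kept = {pair for pair, n in counts.items() if n >= min_count}
print(sorted(kept))
```

Besides {Mountain, 山} and {Strait, 海峽}, the kept set also contains the sub-segments {Strait, 海} and {Strait, 峽}, which is one form of the redundancy that still has to be eliminated.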
Keyword Extraction without a Bilingual Dictionary (problem) • Two issues have to be addressed • redundancy that may exist in the pairs of segments should be eliminated carefully • e may be translated to more than one synonym • “Association” ⇔ “協會” (xie hui) and “聯誼會” (lian yi hui) • A metric to deal with the above issues is proposed
Extraction of Transformation Rules • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 gives (s6’) γ Mountain ⇔ δ 山 • (s7) Catalan Mountain ⇔ 卡太蘭山 gives (s7’) γ Mountain ⇔ δ 山 • (s8) Cook Strait ⇔ 科克海峽 gives (s8’) γ Strait ⇔ δ 海峽 • (s9) Dover, Strait of ⇔ 多佛海峽 gives (s9’) γ, Strait of ⇔ δ 海峽 • In these rules γ and δ replace the non-keyword parts of the foreign and Chinese names • Chinese location name keyword • tends to be in the rightmost position • the remaining part is a transliterated name • Foreign location name keyword • tends either to be in the rightmost position or to be permuted with prepositions, a comma, and the transliterated part
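A minimal sketch of how the primed rules could be derived from (s6) to (s9), once keyword pairs are known, is given below. It is an illustration only: the helper names and the small `FUNCTION_WORDS` set (words kept verbatim in the pattern) are hypothetical, and the slides do not say how prepositions are recognised.

```python
import re
from collections import Counter

FUNCTION_WORDS = {"of", "the"}  # hypothetical list of words kept as-is


def tokenize(name):
    """Split a foreign name into words and commas."""
    return re.findall(r"[^\s,]+|,", name)


def foreign_pattern(name, keyword):
    """Replace each run of non-keyword, non-function words with γ."""
    out = []
    for tok in tokenize(name):
        if tok == keyword or tok == "," or tok.lower() in FUNCTION_WORDS:
            out.append(tok)
        elif not out or out[-1] != "γ":
            out.append("γ")
    return " ".join(out).replace(" ,", ",")


def chinese_pattern(name, keyword):
    """Replace the text around the last occurrence of the keyword with δ."""
    head, _, tail = name.rpartition(keyword)
    parts = (["δ"] if head else []) + [keyword] + (["δ"] if tail else [])
    return " ".join(parts)


def extract_rules(entity_pairs, keyword_pairs):
    """Count the transformation rules induced by each keyword pair."""
    rules = Counter()
    for foreign, chinese in entity_pairs:
        for alpha, beta in keyword_pairs:
            if alpha in tokenize(foreign) and beta in chinese:
                rules[(foreign_pattern(foreign, alpha),
                       chinese_pattern(chinese, beta))] += 1
    return rules


entity_pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),  # (s6)
    ("Catalan Mountain", "卡太蘭山"),          # (s7)
    ("Cook Strait", "科克海峽"),               # (s8)
    ("Dover, Strait of", "多佛海峽"),          # (s9)
]
keyword_pairs = [("Mountain", "山"), ("Strait", "海峽")]

rules = extract_rules(entity_pairs, keyword_pairs)
for (foreign_rule, chinese_rule), n in rules.items():
    print(f"{foreign_rule} ⇔ {chinese_rule} ({n})")
```

Counting identical patterns, as done here with a `Counter`, is one simple way to see which rule dominates for a given keyword pair.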
Extraction of Keywords at a Distance • (s12) and (s13) • the English compound keyword is separated and so is its Chinese counterpart • (s12) American Podiatric Medical Association ⇔ 美國足病醫療學會 • (s13) American Public Health Association ⇔ 美國公共衛生學會 • (s14) and (s15) • the English compound keyword is contiguous • but the corresponding Chinese translation is separated • (s14) American Society for Industrial Security ⇔ 美國工業安全協會 • (s15) American Society of Newspaper Editors ⇔ 美國報紙編輯人協會
Extraction of Keywords at a Distance • Introduce a symbol ∆ to cope with the distance issue • “American Civil Liberties Union” • “American ∆ Liberties Union” • “American Civil ∆ Union” • “American ∆ Union”
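The three ∆ patterns listed for “American Civil Liberties Union” can be reproduced by replacing every contiguous run of interior words with the gap symbol. The sketch below assumes, based only on that example, that the first and last words are always kept; `gap_patterns` is a hypothetical helper name.

```python
from collections import Counter


def gap_patterns(name, gap="∆"):
    """Replace each contiguous run of interior words with a gap symbol.

    Assumption of this sketch: the first and last words are never replaced.
    """
    words = name.split()
    patterns = []
    for i in range(1, len(words) - 1):        # first replaced word
        for j in range(i + 1, len(words)):    # first word kept after the gap
            patterns.append(" ".join(words[:i] + [gap] + words[j:]))
    return patterns


print(gap_patterns("American Civil Liberties Union"))

# Counting patterns over several names exposes compound keywords at a distance.
names = [
    "American Podiatric Medical Association",  # (s12)
    "American Public Health Association",      # (s13)
]
pattern_counts = Counter(p for name in names for p in gap_patterns(name))
print(pattern_counts.most_common(1))
```

For (s12) and (s13) the only pattern shared by both names is the one that keeps “American” and “Association” and replaces everything in between, which matches the separated compound keyword described for these examples.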
Experimental Analysis (corpus) • NICT location corpus • A total of 122 keyword pairs are identified • A total of 230 transformation rules are extracted • On average, a keyword pair corresponds to 1.89 transformation rules • CNA personal names • entries composed of more than one word: 100 / 50,586 • only a few keywords are extracted • De ⇔ 戴 (dai), La ⇔ 拉 (la), De La ⇔ 戴拉 (dai la), Du ⇔ 杜 (du), David ⇔ 大衛 (da wei) • CNA organization names • entries composed of more than one word: 12,885 / 14,658 • 5,229 keyword pairs are extracted • most of the keyword pairs are meaning translations
Experimental Analysis (classify) • We classify these keyword pairs into the following types • (1) Meaning translation • common location keywords • Bir ⇔ 井 (jing), Ain ⇔ 泉 (quan), Bahr ⇔ 河 (he), Cerro ⇔ 山 (shan) • direction (e.g., Central ⇔ 中 (zhong), East ⇔ 東 (dong), etc.) • size (e.g., Big ⇔ 大 (da)), length (e.g., Long ⇔ 長 (chang)) • color (e.g., Black ⇔ 黑 (hei), Blue ⇔ 藍 (lan), etc.) • the specificity of place or area • Crystal ⇔ 結晶 (jie jing), Diamond ⇔ 鑽石 (zuan shi) • (2) Phoneme transliteration keywords • Dera ⇔ 德拉 (de la), Monte ⇔ 蒙特 (meng te), Los ⇔ 洛斯 (luo si) • Elizabeth ⇔ 伊利莎白 (yi li sha bai), Edward ⇔ 愛德華 (ai de hua) • A total of 39 terms (31.97%) belong to this type • (3) Some keywords in type (1) are transliterated • Bay ⇔ 貝 (bei), Beach ⇔ 比奇 (bi qi) • A total of 14 such keywords (11.48%) are extracted
Experimental Results • NICT location corpus • A total of 122 keyword pairs are identified • A total of 230 transformation rules are extracted • On average, a keyword pair corresponds to 1.89 transformation rules • Keyword pair Mountain ⇔ 山 (shan), with α = Mountain and β = 山 • Four transformation rules • (1) γα ⇔ δβ (234) • (2) γ, α ⇔ δβ (45) • (3) γ, αγ ⇔ δβ (1) • (4) γαγ ⇔ δβ (1)
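The reported average can be checked with simple arithmetic. The snippet below also sums the four Mountain rules, assuming that the numbers in parentheses are occurrence counts; the slide does not label them, so the derived share is only indicative.

```python
keyword_pairs = 122
transformation_rules = 230

rules_per_pair = transformation_rules / keyword_pairs
print(f"{rules_per_pair:.2f}")  # 1.89, as reported on the slide

# Parenthesised numbers for Mountain ⇔ 山, assumed to be occurrence counts.
mountain_rules = {
    "γα ⇔ δβ": 234,
    "γ, α ⇔ δβ": 45,
    "γ, αγ ⇔ δβ": 1,
    "γαγ ⇔ δβ": 1,
}
total = sum(mountain_rules.values())
share_rule_1 = mountain_rules["γα ⇔ δβ"] / total
print(total, f"{share_rule_1:.1%}")
```

Under that reading, the plain suffix rule (1) accounts for the large majority of the Mountain cases, which agrees with the earlier observation that location keywords tend to sit in the rightmost position.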
Conclusion and Remarks • This paper proposes corpus-based approaches • to extract the formulation rules and the translation/transliteration rules among multilingual named entities • Two types of evaluation • partitioning the corpora into two parts, one for training and the other for testing • integrating our method into a cross-language information retrieval system • Further applications • will be explored in the future, and the methodology will be extended to other types of named entities
Personal Opinion • Drawback • Lacks an analysis of time complexity • Application • Construct Chinese-English rules and apply them to IR • Future Work • Address the transliterated/translated term issue