
Learning Formulation and Transformation Rules for Multilingual Named Entities

Learning Formulation and Transformation Rules for Multilingual Named Entities. Advisor : Dr. Hsu Reporter : Chun Kai Chen Author : Hsin-Hsi Chen, Changhua Yang and Ying Lin. Proceedings of the ACL 2003. Outline. Motivation Objective Introduction Multilingual Named Entity Corpora





Presentation Transcript


  1. Learning Formulation and Transformation Rules for Multilingual Named Entities • Advisor: Dr. Hsu • Reporter: Chun Kai Chen • Authors: Hsin-Hsi Chen, Changhua Yang and Ying Lin • Proceedings of the ACL 2003

  2. Outline • Motivation • Objective • Introduction • Multilingual Named Entity Corpora • Rule Mining • Experimental Results • Conclusions • Personal Opinion

  3. Motivation • Past work on multilingual named entities emphasizes transliteration issues • However, the transformation between named entities in different languages is not only transliteration • Victoria Fall - 維多利亞瀑布 • Little Rocky Mountains - 小落磯山脈 • Kenmare - 康美爾 • East Chicago - 東芝加哥

  4. Objective • Propose a method to extract • formulation rules of named entities for individual languages • transformation rules for mapping among languages • Apply the results to cross-language information retrieval (CLIR)

  5. Introduction (1/3) • In the past, named entity extraction • mainly focused on general domains • was employed in various applications such as information retrieval and question answering

  6. Introduction (2/3) • Most of the previous approaches • dealt with monolingual named entity extraction • Chen et al. (1998) extended it to cross-language information retrieval (CLIR) • A grapheme-based (字母) model was proposed to compute the similarity between a Chinese transliterated name and an English name • Lin and Chen (2000) further classified the work into two directions • forward transliteration (Wan and Verspoor, 1998) • backward transliteration (Chen et al., 1998; Knight and Graehl, 1998) • and proposed a phoneme-based model

  7. Introduction (3/3) • This paper studies • how the language and the named entity type affect the choice between translation and transliteration • We focus only on the three more challenging types of named entities, i.e., • named people • named locations • named organizations

  8. Multilingual Named Entity Corpora • NICT location name corpus • Developed by the Ministry of Education of Taiwan in 1995 • Each entry consists of three parts • Foreign location name, Chinese transliteration/translation name, country name • (Victoria Fall, “維多利亞瀑布” (wei duo li ya pu bu), South Africa) • CNA personal name and organization corpora • are used by news reporters to unify the transliteration/translation of names in news stories

  9. Rule Mining • Frequency-Based Approach with a Bilingual Dictionary • Keyword Extraction without a Bilingual Dictionary • Extraction of Transformation Rules • Extraction of Keywords at a Distance

  10. Learning Formulation and Transformation Rules (overview diagram) • Frequency-Based Approach with a Bilingual Dictionary • decompose “Victoria Fall” into {Victoria, “維多利亞”} and {Fall, “瀑布”} • count the frequency (TFIDF) • consult the dictionary, e.g., {Mountain, “山” (shan)}, giving “Mountain” ⇔ “山” • Keyword Extraction without a Bilingual Dictionary • example of an abbreviated name: World Taiwanese Association, “世台會” • generate candidates from (s6) {Catalan Mountain, 卡太蘭山} and (s7) {Aletschhorn Mountain, 阿利奇赫恩山} by pairing each foreign word (e1, e2) with every Chinese segment • Extraction of Transformation Rules • (s6’) γ mountain ⇔ δ 山 • (s7’) γ mountain ⇔ δ 山 • (s8’) γ Strait ⇔ δ 海峽 • (s9’) γ, Strait of ⇔ δ 海峽 • Extraction of Keywords at a Distance • “American Civil Liberties Union” • “American ∆ Liberties Union” • “American Civil ∆ Union” • “American ∆ Union”

  11. Frequency-Based Approach with a Bilingual Dictionary • We postulate that • a transliterated term is usually an unknown word and is not listed in a lexicon • a translated term often appears in a lexicon • Under this postulation • a translated term (翻譯詞) occurs more often in a corpus • Fall, “瀑布” • a transliterated term (音譯詞) appears only a few times • Victoria, “維多利亞”

  12. Frequency-based method (1/2) • A simple frequency-based method computes the frequencies of terms and uses them to tell apart the transliterated and translated parts of a named entity • Compute the frequency of each word in the foreign name list • Keep the words that • appear more often than a threshold • appear in a common foreign dictionary • these words form the candidates for simple keywords • Mountain • Examine the foreign word list again • Cluster the Chinese name list • based on the foreign keywords • here a bilingual dictionary may be consulted • “Mountain” ⇔ “山”
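The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the tiny `common_words` dictionary, and the threshold value of 1 are all hypothetical, and the four name pairs are the (s6) to (s9) examples from the slides.

```python
from collections import Counter


def keyword_candidates(foreign_names, common_words, threshold=1):
    """Keep words that occur more than `threshold` times and are dictionary words."""
    counts = Counter(
        word.strip(",").lower()
        for name in foreign_names
        for word in name.split()
    )
    return {w for w, c in counts.items() if c > threshold and w in common_words}


def cluster_by_keyword(pairs, keywords):
    """Group the Chinese names by the foreign keyword found in their counterpart."""
    clusters = {}
    for foreign, chinese in pairs:
        for word in foreign.split():
            key = word.strip(",").lower()
            if key in keywords:
                clusters.setdefault(key, []).append(chinese)
    return clusters


pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
# Hypothetical stand-in for a "common foreign dictionary".
common_words = {"mountain", "strait", "of"}

keywords = keyword_candidates([foreign for foreign, _ in pairs], common_words)
clusters = cluster_by_keyword(pairs, keywords)
```

Each cluster collects the Chinese names that share a foreign keyword; this is the point where a bilingual dictionary entry such as “Mountain” ⇔ “山” could be consulted to pick out the Chinese keyword.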

  13. Frequency-based method (2/2) • NICT location name corpus • River (河, he), Island (島, dao), Lake (湖, hu), Mountain (山, shan), Bay (灣, wan), Mountain (峰, feng), Peak (峰, feng) • “Mountain” ⇔ “山” (shan) and “峰” (feng) • “峰” (feng) ⇔ “Mountain” and “Peak” • CNA organization name corpus • Suffix • Association (協會, xie hui), University (大學, da xue) • Prefix • International (國際, guo ji), World (世界, shi jie), American (美國, mei guo)

  14. Keyword Extraction without a Bilingual Dictionary (problem) • Abbreviation is commonly adopted in translation, and a dictionary-based approach has difficulty capturing this phenomenon • (World Taiwanese Association, “世台會”) • Another approach, which needs no dictionary, is therefore proposed

  15. Keyword Extraction without a Bilingual Dictionary (process) • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • e1 = Aletschhorn and e2 = Mountain are each paired with every Chinese segment of s1s2…st • length 1: 阿, 利, 奇, 赫, 恩, 山 • length 2: 阿利, 利奇, 奇赫, 赫恩, 恩山 • length 3: 阿利奇, 利奇赫, 奇赫恩, 赫恩山 • length 4: 阿利奇赫, 利奇赫恩, 奇赫恩山 • length 5: 阿利奇赫恩, 利奇赫恩山 • length 6: 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • e1 = Catalan and e2 = Mountain are each paired with every Chinese segment of s1s2…st • length 1: 卡, 太, 蘭, 山 • length 2: 卡太, 太蘭, 蘭山 • length 3: 卡太蘭, 太蘭山 • length 4: 卡太蘭山 • {e, c} whose frequency > 2 are kept • {Mountain, “山” (shan)}

  16. Keyword Extraction without a Bilingual Dictionary (algorithm) • {Ej, Cj} • Ej is a foreign named entity • Cj is a Chinese named entity • decompose the named entities • Ej • comprises m words w1·w2…wm • a candidate segment e_{p,q} is defined as wp … wq • Cj • has n syllables s1·s2…sn • a candidate segment c_{x,y} is defined as sx … sy • we can get pairs of {e_{p,q}, c_{x,y}} from {Ej, Cj} • group and count • the pairs collected from the multilingual named entity list • count the frequency of each pair • pairs with higher frequency denote significant segment pairs
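A minimal sketch of this decompose, group, and count procedure follows. It rests on a few assumptions that the slides do not state: each Chinese character is treated as one syllable, a pair is counted once per named entity, and the function names are hypothetical.

```python
from collections import Counter


def segments(units):
    """All contiguous segments units[p..q] of a sequence, as tuples."""
    n = len(units)
    return [tuple(units[p:q]) for p in range(n) for q in range(p + 1, n + 1)]


def count_segment_pairs(entity_pairs):
    """Count every {e_pq, c_xy} pair over a list of (foreign, Chinese) names."""
    counts = Counter()
    for foreign, chinese in entity_pairs:
        words = [w.strip(",") for w in foreign.split()]
        syllables = list(chinese)  # assumption: one character = one syllable
        for e in set(segments(words)):
            for c in set(segments(syllables)):
                counts[(" ".join(e), "".join(c))] += 1
    return counts


entity_pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
counts = count_segment_pairs(entity_pairs)
frequent = {pair for pair, c in counts.items() if c >= 2}
```

On these four names, `frequent` contains (Mountain, 山) and (Strait, 海峽), but also the sub-segments (Strait, 海) and (Strait, 峽), which is the kind of redundancy between segment pairs that a later filtering step has to remove.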

  17. Keyword Extraction without a Bilingual Dictionary (example) • Example • All the pairs {e, c} whose frequency > 2 are kept • {Mountain, “山” (shan)} and {Strait, “海峽” (hai xia)} appear twice • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • (s8) Cook Strait ⇔ 科克海峽 • (s9) Dover, Strait of ⇔ 多佛海峽

  18. Keyword Extraction without a Bilingual Dictionary (problem) • Two issues have to be addressed • redundancy that may exist among the pairs of segments should be eliminated carefully • e may be translated to more than one synonym • “Association” ⇔ “協會” (xie hui) and “聯誼會” (lian yi hui) • A metric is proposed to deal with these issues

  19. Extraction of Transformation Rules • Name pairs • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • (s8) Cook Strait ⇔ 科克海峽 • (s9) Dover, Strait of ⇔ 多佛海峽 • Extracted rules • (s6’) γ mountain ⇔ δ 山 • (s7’) γ mountain ⇔ δ 山 • (s8’) γ Strait ⇔ δ 海峽 • (s9’) γ, Strait of ⇔ δ 海峽 • Chinese location name keyword • tends to be located at the rightmost position • the remaining part is a transliterated name • Foreign location name keyword • tends to be either located at the rightmost position, or permuted with some prepositions, a comma, and the transliterated part
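One way to picture how rules such as (s6’) to (s9’) arise from name pairs is the following sketch. It is an assumption-laden illustration rather than the paper's procedure: the keyword pairs are taken as given, the small `FUNCTION_WORDS` set is hypothetical, and the helper names are invented for this example. Every stretch of the foreign name that is not a keyword, function word, or comma is replaced by γ, and every non-keyword stretch of the Chinese name by δ.

```python
import re
from collections import Counter

# Hypothetical list of words kept verbatim in a rule (prepositions, articles).
FUNCTION_WORDS = {"of", "the"}


def abstract_foreign(name, keyword):
    """Replace the non-keyword, non-function-word parts of a foreign name by γ."""
    keyword_tokens = set(keyword.split())
    out = []
    for token in re.findall(r"[^\W\d_]+|,", name):
        if token == "," or token in keyword_tokens or token.lower() in FUNCTION_WORDS:
            out.append(token)
        elif not out or out[-1] != "γ":
            out.append("γ")  # collapse consecutive unknown words into one γ
    return " ".join(out).replace(" ,", ",")


def abstract_chinese(name, keyword):
    """Replace the non-keyword parts of a Chinese name by δ."""
    parts = re.split(f"({re.escape(keyword)})", name)
    return " ".join(p if p == keyword else "δ" for p in parts if p)


def mine_rules(name_pairs, keyword_pairs):
    """Count the abstract rule produced by each name pair and keyword pair."""
    rules = Counter()
    for foreign, chinese in name_pairs:
        for e_kw, c_kw in keyword_pairs:
            if e_kw in foreign and c_kw in chinese:
                lhs = abstract_foreign(foreign, e_kw)
                rhs = abstract_chinese(chinese, c_kw)
                rules[f"{lhs} ⇔ {rhs}"] += 1
    return rules


name_pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
keyword_pairs = [("Mountain", "山"), ("Strait", "海峽")]
rules = mine_rules(name_pairs, keyword_pairs)
```

Counting identical rule strings gives the per-rule frequencies, in the same spirit as the rule counts reported in the experimental results.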

  20. Extraction of Keywords at a Distance • (s12) and (s13) • the English compound keyword is separated, and so is its Chinese counterpart • (s14) and (s15) • the English compound keyword is connected • but the corresponding Chinese translation is separated • (s12) American Podiatric Medical Association ⇔ 美國足病醫療學會 • (s13) American Public Health Association ⇔ 美國公共衛生學會 • (s14) American Society for Industrial Security ⇔ 美國工業安全協會 • (s15) American Society of Newspaper Editors ⇔ 美國報紙編輯人協會

  21. Extraction of Keywords at a Distance • Introduce a symbol ∆ to cope with the distance issue • “American Civil Liberties Union”. • “American ∆ Liberties Union” • “American Civil ∆ Union” • “American ∆ Union”
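The three ∆ patterns on this slide can be reproduced by replacing one contiguous run of interior words with the gap symbol. The sketch below assumes that the first and last words are always kept, which matches the example but is not stated as a rule in the slides; the function name is hypothetical.

```python
def gap_patterns(name, gap="∆"):
    """Replace each contiguous run of interior words by a single gap symbol."""
    words = name.split()
    patterns = []
    for start in range(1, len(words) - 1):          # first word is always kept
        for end in range(start + 1, len(words)):    # last word is always kept
            patterns.append(" ".join(words[:start] + [gap] + words[end:]))
    return patterns


patterns = gap_patterns("American Civil Liberties Union")
```

These patterns can then be counted like ordinary candidate segments, so that a separated keyword such as “American ∆ Union” receives a frequency of its own.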

  22. Experimental Analysis (corpus) • NICT location corpus • A total of 122 keyword pairs are identified • A total of 230 transformation rules are extracted • On average, a keyword pair corresponds to 1.89 transformation rules • CNA personal names • Names composed of more than one word: 100 of 50,586 • Only a few keywords are extracted • De ⇔ 戴 (dai), La ⇔ 拉 (la), De La ⇔ 戴拉 (dai la), Du ⇔ 杜 (du), David ⇔ 大衛 (da wei) • CNA organization names • Names composed of more than one word: 12,885 of 14,658 • 5,229 keyword pairs are extracted • Most of the keyword pairs are meaning translations

  23. Experimental Analysis (classify) • We classify these keyword pairs into the following types • Meaning translation • common location keywords • Bir ⇔ 井 (jing), Ain ⇔ 泉 (quan), Bahr ⇔ 河 (he), Cerro ⇔ 山 (shan) • direction • Central ⇔ 中 (zhong), East ⇔ 東 (dong), etc. • size (e.g., Big ⇔ 大 (da)), length (e.g., Long ⇔ 長 (chang)) • color (e.g., Black ⇔ 黑 (hei), Blue ⇔ 藍 (lan), etc.) • the specificity of place or area • Crystal ⇔ 結晶 (jie jing), Diamond ⇔ 鑽石 (zuan shi) • Phoneme transliteration keywords • Dera ⇔ 德拉 (de la), Monte ⇔ 蒙特 (meng te), Los ⇔ 洛斯 (luo si) • Elizabeth ⇔ 伊利莎白 (yi li sha bai), Edward ⇔ 愛德華 (ai de hua) • A total of 39 terms (31.97%) belong to this type • Some keywords of the meaning-translation type (1) are transliterated instead • Bay ⇔ 貝 (bei), Beach ⇔ 比奇 (bi qi) • A total of 14 such keywords (11.48%) are extracted

  24. Experimental Results • NICT location corpus • Total 122 keyword pairs are identified • Total 230 transformation rules • On the average, a keyword pair corresponds to 1.89 transformation rules • keyword pair mountain ⇔ 山 (shan) • Four transformation rules • (1) γα ⇔ δβ (234) • (2) γ, α ⇔ δβ (45) • (3) γ, αγ ⇔ δβ (1) • (4) γαγ ⇔ δβ (1)
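As a quick consistency check, the reported averages and percentages for the NICT location corpus all follow from the 122 keyword pairs, and the four rule counts for mountain ⇔ 山 sum to 281 occurrences:

```latex
\[
\frac{230}{122} \approx 1.89, \qquad
\frac{39}{122} \approx 31.97\%, \qquad
\frac{14}{122} \approx 11.48\%, \qquad
234 + 45 + 1 + 1 = 281
\]
```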

  25. Application on CLIR

  26. Conclusion and Remarks • This paper proposes corpus-based approaches • that extract the formulation rules and the translation/transliteration rules among multilingual named entities • Two types of evaluation • partitioning the corpora into two parts, one for training and the other for testing • integrating the method into a cross-language information retrieval system • Further applications • will be explored in the future, and the methodology will be extended to other types of named entities

  27. Personal Opinion • Drawback • Lacks an analysis of time complexity • Application • Construct Chinese-English rules and apply them to IR • Future Work • Adopt the transliterated / translated term distinction
