Learning Formulation and Transformation Rules for Multilingual Named Entities

Advisor: Dr. Hsu • Reporter: Chun Kai Chen • Authors: Hsin-Hsi Chen, Changhua Yang and Ying Lin • Proceedings of the ACL 2003

Outline • Motivation • Objective • Introduction • Multilingual Named Entity Corpora • Rule Mining • Experimental Results • Conclusions • Personal Opinion

Motivation • Past work on multilingual named entities emphasizes transliteration issues • However, the transformation between named entities in different languages is not transliteration only • Victoria Fall - 維多利亞瀑布 • Little Rocky Mountains - 小落磯山脈 • Kenmare - 康美爾 • East Chicago - 東芝加哥

Objective • Propose a method to extract • formulation rules of named entities for individual languages • transformation rules for mapping among languages • Apply the results to cross-language information retrieval (CLIR)

Introduction(1/3) • In the past, named entity extraction • mainly focused on general domains • was applied to various applications such as information retrieval and question answering

Introduction(2/3) • Most of the previous approaches • dealt with monolingual named entity extraction • Chen et al. (1998) extended it to cross-language information retrieval (CLIR) • A grapheme-based (字母, letter-based) model was proposed to compute the similarity between a Chinese transliterated name and an English name • Lin and Chen (2000) further classified the works into two directions • forward transliteration (Wan and Verspoor, 1998) • backward transliteration (Chen et al., 1998; Knight and Graehl, 1998) • and proposed a phoneme-based model

Introduction(3/3) • This paper studies • how the language and the named entity type affect the choice between translation and transliteration • We focus only on the three more challenging named entity types, i.e., • named people • named locations • named organizations

Multilingual Named Entity Corpora • NICT location name corpus • Developed by the Ministry of Education of Taiwan in 1995 • Each entry consists of three parts • foreign location name, Chinese transliteration/translation name, country name • e.g. (Victoria Fall, “維多利亞瀑布” (wei duo li ya pu bu), South Africa) • CNA personal name and organization corpora • Used by news reporters to unify name transliteration/translation in news stories

Rule Mining • Frequency-Based Approach with a Bilingual Dictionary • Keyword Extraction without a Bilingual Dictionary • Extraction of Transformation Rules • Extraction of Keywords at a Distance

Learning Formulation and Transformation Rules [overview diagram of the four rule-mining steps] • Frequency-Based with a Bilingual Dictionary: decompose Victoria Fall into Victoria, “維多利亞” and Fall, “瀑布”; count the frequency (TFIDF); generate candidates; consult the dictionary, e.g. {Mountain, “山” (shan)}, giving “Mountain” ⇔ “山” • Keyword Extraction without a Bilingual Dictionary: pair each word e1, e2 of (s6) {Aletschhorn Mountain, 阿利奇赫恩山} and (s7) {Catalan Mountain, 卡太蘭山} with the candidate Chinese segments; covers abbreviations such as World Taiwanese Association, “世台會” • Extraction of Transformation Rules: (s6’) γ Mountain ⇔ δ 山, (s7’) γ Mountain ⇔ δ 山, (s8’) γ Strait ⇔ δ 海峽, (s9’) γ, Strait of ⇔ δ 海峽 • Extraction of Keywords at a Distance: “American Civil Liberties Union” gives “American ∆ Liberties Union”, “American Civil ∆ Union”, “American ∆ Union”

Frequency-Based Approach with a Bilingual Dictionary • We postulate that • a transliterated term is usually an unknown word and is not listed in a lexicon • a translated term often appears in a lexicon • Under this postulation • a translated term (翻譯詞) occurs more often in a corpus • Fall, “瀑布” • a transliterated term (音譯詞) appears only a few times • Victoria, “維多利亞”

Frequency-based method(1/2) • A simple frequency-based method computes the frequencies of terms and uses them to tell apart the transliterated and translated parts of a named entity • Compute the frequency of each word in the foreign name list • Keep those words that • appear more often than a threshold • appear in a common foreign dictionary • these words form the candidates of simple keywords • Mountain • Examine the foreign word list again • Cluster the Chinese name list • based on the foreign keywords • here a bilingual dictionary may be consulted • “Mountain” ⇔ “山”
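
The filtering step above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function name `simple_keyword_candidates`, the threshold `min_count`, the toy name list and the toy dictionary are all assumptions made for the example.

```python
from collections import Counter

def simple_keyword_candidates(foreign_names, dictionary_words, min_count=2):
    """Return words that are frequent in the name list AND found in a dictionary.

    min_count is a hypothetical threshold; the slides do not give a value.
    """
    counts = Counter(
        word.strip(",").lower()
        for name in foreign_names
        for word in name.split()
    )
    return sorted(
        word for word, n in counts.items()
        if n >= min_count and word in dictionary_words
    )

# Toy data: the location names are taken from the slides, the dictionary is made up.
names = [
    "Aletschhorn Mountain",
    "Catalan Mountain",
    "Cook Strait",
    "Dover, Strait of",
    "Victoria Fall",
]
dictionary = {"mountain", "strait", "fall", "of", "cook"}

print(simple_keyword_candidates(names, dictionary))
```

Words such as "Victoria" fail both tests (rare and not in the dictionary), which matches the postulate that transliterated parts are infrequent unknown words.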

Frequency-based method(2/2) • NICT location name corpus • River (河, he), Island (島, dao), Lake (湖, hu), Mountain (山, shan), Bay (灣, wan), Mountain (峰, feng), Peak (峰, feng) • “Mountain” ⇔ “山” (shan) and “峰” (feng) • “峰” (feng) ⇔ “Mountain” and “Peak” • CNA organization name corpus • Suffix • Association (協會, xie hui), University (大學, da xue) • Prefix • International (國際, guo ji), World (世界, shi jie), American (美國, mei guo)

Keyword Extraction without a Bilingual Dictionary (problem) • Abbreviation is commonly adopted in translation, and a dictionary-based approach can hardly capture this phenomenon • (World Taiwanese Association, “世台會”) • Another approach, which needs no dictionary, is therefore proposed

Keyword Extraction without a Bilingual Dictionary (process) • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • e1 = Aletschhorn and e2 = Mountain are each paired with every candidate segment s1s2 … st • 1 syllable: 阿, 利, 奇, 赫, 恩, 山 • 2 syllables: 阿利, 利奇, 奇赫, 赫恩, 恩山 • 3 syllables: 阿利奇, 利奇赫, 奇赫恩, 赫恩山 • 4 syllables: 阿利奇赫, 利奇赫恩, 奇赫恩山 • 5 syllables: 阿利奇赫恩, 利奇赫恩山 • 6 syllables: 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • e1 = Catalan and e2 = Mountain are each paired with every candidate segment s1s2 … st • 1 syllable: 卡, 太, 蘭, 山 • 2 syllables: 卡太, 太蘭, 蘭山 • 3 syllables: 卡太蘭, 太蘭山 • 4 syllables: 卡太蘭山 • {e, c} whose frequency > 2 are kept • {Mountain, “山” (shan)}

Keyword Extraction without a Bilingual Dictionary (algorithm) • For each pair {Ej, Cj} • Ej is a foreign named entity • Cj is a Chinese named entity • Decompose the named entities • Ej • comprises m words w1·w2…wm • a candidate segment ep,q is defined as wp … wq • Cj • has n syllables s1·s2…sn • a candidate segment cx,y is defined as sx … sy • we can get pairs of {ep,q, cx,y} from {Ej, Cj} • Group and count • the pairs collected from the multilingual named entity list • count the frequency of each pair • pairs with higher frequency denote significant segment pairs
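
The decompose, group and count steps can be illustrated with the short Python sketch below. It is a hedged illustration rather than the paper's code: the helper names `segments` and `count_segment_pairs` are hypothetical, one Chinese character is assumed to be one syllable, each pair is counted at most once per named entity, and the cut-off `min_count = 2` is an assumption chosen so that pairs seen twice survive.

```python
from collections import Counter

def segments(units):
    """All contiguous sub-sequences of a list of units, as tuples."""
    n = len(units)
    return [tuple(units[i:j]) for i in range(n) for j in range(i + 1, n + 1)]

def count_segment_pairs(name_pairs):
    """Count how many entity pairs {Ej, Cj} contain each segment pair {e, c}."""
    counts = Counter()
    for foreign, chinese in name_pairs:
        words = foreign.replace(",", " ").split()   # w1 ... wm
        syllables = list(chinese)                   # s1 ... sn (assumption: 1 char = 1 syllable)
        for e in set(segments(words)):
            for c in set(segments(syllables)):
                counts[(" ".join(e), "".join(c))] += 1
    return counts

# Location names (s6) to (s9) from the slides.
corpus = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]

min_count = 2  # hypothetical cut-off
counts = count_segment_pairs(corpus)
kept = {pair for pair, n in counts.items() if n >= min_count}
for pair in sorted(kept):
    print(pair, counts[pair])
```

On this toy corpus the kept set contains {Mountain, 山} and {Strait, 海峽}, but also the sub-segments {Strait, 海} and {Strait, 峽}, which illustrates why redundant segment pairs have to be filtered afterwards.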

Keyword Extraction without a Bilingual Dictionary (example) • Example • All the pairs {e, c} whose frequency > 2 are kept • {Mountain, “山” (shan)} and {Strait, “海峽” (hai xia)} appear twice • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • (s8) Cook Strait ⇔ 科克海峽 • (s9) Dover, Strait of ⇔ 多佛海峽

Keyword Extraction without a Bilingual Dictionary (problem) • Two issues have to be addressed • redundancy that may exist among the pairs of segments should be eliminated carefully • a segment e may be translated to more than one synonym • “Association” ⇔ “協會” (xie hui) and “聯誼會” (lian yi hui) • A metric to deal with the above issues is proposed

Extraction of Transformation Rules • (s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山 • (s7) Catalan Mountain ⇔ 卡太蘭山 • (s8) Cook Strait ⇔ 科克海峽 • (s9) Dover, Strait of ⇔ 多佛海峽 • Replacing the transliterated part with γ on the foreign side and δ on the Chinese side gives • (s6’) γ Mountain ⇔ δ 山 • (s7’) γ Mountain ⇔ δ 山 • (s8’) γ Strait ⇔ δ 海峽 • (s9’) γ, Strait of ⇔ δ 海峽 • Chinese location name keyword • tends to be located in the rightmost position • the remaining part is a transliterated name • Foreign location name keyword • tends to be either located in the rightmost position, or permuted with some prepositions, a comma, and the transliterated part
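
As a rough sketch of how rules like (s6’) to (s9’) could be derived once a keyword pair is known, the Python below abstracts everything except the keyword, commas and a few function words. The names `foreign_pattern`, `chinese_pattern`, `to_rule` and the `FUNCTION_WORDS` list are hypothetical and not taken from the paper.

```python
import re
from collections import Counter

# Assumption: a tiny, hypothetical list of function words kept verbatim in rules.
FUNCTION_WORDS = {"of", "the", "for"}

def foreign_pattern(name, keyword):
    """Replace every run of non-keyword, non-function words with γ."""
    out = []
    for tok in re.findall(r"[^\s,]+|,", name):
        if tok == ",":
            if out:
                out[-1] += ","  # keep the comma attached to the previous slot
        elif tok.lower() == keyword.lower() or tok.lower() in FUNCTION_WORDS:
            out.append(tok)
        elif not out or out[-1] != "γ":
            out.append("γ")
    return " ".join(out)

def chinese_pattern(name, keyword):
    """Replace the non-keyword part of the Chinese name with δ."""
    parts = name.split(keyword)
    return f" {keyword} ".join("δ" if p else "" for p in parts).strip()

def to_rule(foreign, chinese, keyword_pair):
    e_kw, c_kw = keyword_pair
    return f"{foreign_pattern(foreign, e_kw)} ⇔ {chinese_pattern(chinese, c_kw)}"

examples = [
    ("Aletschhorn Mountain", "阿利奇赫恩山", ("Mountain", "山")),
    ("Catalan Mountain", "卡太蘭山", ("Mountain", "山")),
    ("Cook Strait", "科克海峽", ("Strait", "海峽")),
    ("Dover, Strait of", "多佛海峽", ("Strait", "海峽")),
]

rules = Counter(to_rule(f, c, kw) for f, c, kw in examples)
for rule, n in rules.items():
    print(n, rule)
```

Counting identical rules, as done with `Counter` here, is one simple way to see which formulation of a keyword dominates in a corpus.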

Extraction of Keywords at a Distance • (s12) and (s13) • the English compound keyword is separated, and so is its Chinese counterpart • (s12) American Podiatric Medical Association ⇔ 美國足病醫療學會 • (s13) American Public Health Association ⇔ 美國公共衛生學會 • (s14) and (s15) • the English compound keyword is contiguous, but the corresponding Chinese translation is separated • (s14) American Society for Industrial Security ⇔ 美國工業安全協會 • (s15) American Society of Newspaper Editors ⇔ 美國報紙編輯人協會

Extraction of Keywords at a Distance • Introduce a symbol ∆, standing for the skipped words, to cope with the distance issue • “American Civil Liberties Union” yields • “American ∆ Liberties Union” • “American Civil ∆ Union” • “American ∆ Union”
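
The ∆ patterns listed above can be generated mechanically. The following Python sketch assumes, based only on the three examples on the slide, that the first and last words are always kept and that ∆ replaces one contiguous run of inner words; the function name `gapped_patterns` is hypothetical.

```python
from collections import Counter

def gapped_patterns(name, gap="∆"):
    """Replace each contiguous run of inner words with a single gap symbol."""
    words = name.split()
    patterns = []
    for start in range(1, len(words) - 1):        # first word is always kept
        for end in range(start + 1, len(words)):  # last word is always kept
            patterns.append(" ".join(words[:start] + [gap] + words[end:]))
    return patterns

print(gapped_patterns("American Civil Liberties Union"))

# Counting patterns over several names exposes keywords at a distance.
organizations = [
    "American Podiatric Medical Association",
    "American Public Health Association",
]
pattern_counts = Counter(p for org in organizations for p in gapped_patterns(org))
print(pattern_counts.most_common(1))
```

Across the two organization names, only the pattern "American ∆ Association" occurs twice, so the separated compound keyword surfaces through plain frequency counting.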

Experimental Analysis (corpus) • NICT location corpus • 122 keyword pairs are identified in total • 230 transformation rules in total • On average, a keyword pair corresponds to 1.89 transformation rules • CNA personal names • names composed of more than one word: 100 / 50,586 • only a few keywords are extracted • De ⇔ 戴 (dai), La ⇔ 拉 (la), De La ⇔ 戴拉 (dai la), Du ⇔ 杜 (du), David ⇔ 大衛 (da wei) • CNA organization names • names composed of more than one word: 12,885 / 14,658 • 5,229 keyword pairs are extracted • most of the keyword pairs are meaning translations

Experimental Analysis (classify) • We classify these keyword pairs into the following types • (1) Meaning translation • common location keywords • Bir ⇔ 井 (jing), Ain ⇔ 泉 (quan), Bahr ⇔ 河 (he), Cerro ⇔ 山 (shan) • direction (e.g., Central ⇔ 中 (zhong), East ⇔ 東 (dong), etc.) • size (e.g., Big ⇔ 大 (da)), length (e.g., Long ⇔ 長 (chang)) • color (e.g., Black ⇔ 黑 (hei), Blue ⇔ 藍 (lan), etc.) • the specificity of place or area • Crystal ⇔ 結晶, Diamond ⇔ 鑽石 (zuan shi) • (2) Phoneme transliteration keywords • Dera ⇔ 德拉 (de la), Monte ⇔ 蒙特 (meng te), Los ⇔ 洛斯 (luo si) • Elizabeth ⇔ 伊利莎白 (yi li sha bai), Edward ⇔ 愛德華 (ai de hua) • 39 terms in total belong to this type (31.97%) • (3) Some keywords in type (1) are transliterated • Bay ⇔ 貝 (bei), Beach ⇔ 比奇 (bi qi) • 14 such keywords in total (11.48%) are extracted

Experimental Results • NICT location corpus • 122 keyword pairs are identified in total • 230 transformation rules in total • On average, a keyword pair corresponds to 1.89 transformation rules • The keyword pair Mountain ⇔ 山 (shan) has four transformation rules, where α is the foreign keyword and β is the Chinese keyword • (1) γα ⇔ δβ (234) • (2) γ, α ⇔ δβ (45) • (3) γ, αγ ⇔ δβ (1) • (4) γαγ ⇔ δβ (1)
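
The ratios reported on the experimental slides can be re-derived from the raw counts. The short Python check below assumes that the 31.97% and 11.48% shares are taken relative to the 122 NICT keyword pairs; the slides do not state the denominator explicitly, although the arithmetic is consistent with that reading.

```python
keyword_pairs = 122          # NICT location corpus
transformation_rules = 230   # NICT location corpus
transliteration_keywords = 39
transliterated_type1_keywords = 14

# Average number of transformation rules per keyword pair.
rules_per_pair = round(transformation_rules / keyword_pairs, 2)

# Shares relative to all keyword pairs (assumed denominator: 122).
share_transliteration = round(transliteration_keywords / keyword_pairs * 100, 2)
share_type1_transliterated = round(transliterated_type1_keywords / keyword_pairs * 100, 2)

print(rules_per_pair, share_transliteration, share_type1_transliterated)
```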

Application on CLIR

Conclusion and Remarks • This paper proposes corpus-based approaches • to extract the formulation rules and the translation/transliteration rules among multilingual named entities • Two types of evaluation • partitioning the corpora into two parts, one for training and the other for testing • integrating our method into a cross-language information retrieval system • Further applications • will be explored in the future, and the methodology will be extended to other types of named entities

Personal Opinion • Drawback • Lacks an analysis of time complexity • Application • Construct Chinese-English rules and apply them to IR • Future Work • Address the transliterated / translated term issue
