Chapter 13 Chinese Information Extraction Technologies

自然語言處理實驗室 資訊工程學研究所臺灣大學 Natural Language Processing Lab. National Taiwan University Chapter 13 Chinese Information Extraction Technologies Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw

Outline • Introduction to Information Extraction (IE) • Chinese IE Technologies • Tagging Environment for Chinese IE • Applications • Summary

Introduction

Introduction • Information Extraction • the extraction or pulling out of pertinent information from large volumes of texts • Information Extraction System • an automated system to extract pertinent information from large volumes of text • Information Extraction Technologies • techniques used to automatically extract specified information from text (http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)

An Example in Air Vehicle Launch • Original Document • Named-Entity-Tagged Document • Equivalence Classes • Co-Reference Tagged Document

<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> 報紙報導 </DOCTYPE> <DOCSRC> 中國時報 </DOCSRC> <TEXT> 【本報綜合紐約、華盛頓十一日外電報導】在華盛頓宣布首度出售「刺針」肩射防空飛彈給南韓的第二天，美國與北韓今天在紐約恢復延擱已久的會談，這項預定三天的會談將以北韓的飛彈發展為重點，包括北韓準備部署射程可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈的報導。美國國務院發言人柏恩斯說：「在有關北韓飛彈擴散問題上，美方的確有多項關切之處。」美國官員也長期懷疑北韓正對伊朗和敘利亞輸出飛彈，並希望平壤加入禁止擴散此種武器的

國際公約。美國官員已知會北韓說，倘若北韓希望與美國建立正常的外交關係，就必須減少飛彈輸出。　這項有關北韓飛彈計劃的會談是雙方於一九九六年四月在德國柏林舉行的首度會談的後續談判。美國在該次會談中要求北韓停止生產、測試及出售飛彈給他國，尤其是敘利亞和伊朗兩國。美國副助理國務卿艾恩宏和北韓外交部對外事務局局長李衡哲分別為雙方的談判代表，會談預定在十三日結束。柏恩斯說：「美方非常關心所有北韓本身，或是北韓與中共、伊朗或其他國家的飛彈問題。我們認為就此與他們舉行會談是甚為重要。」　而為提昇南韓陸軍的自衛能力，美國於昨天宣布準備出售價值三億零七百萬美元的一千零六十五枚刺針飛彈與其他武器給南韓，它說，這項交易不會使朝鮮半島的緊張局勢惡化。

五角大廈說：「這項設備與支援的銷售不會影響該區基本軍事均勢。」國務院也表示全力支持此項包含兩百一十三座發射台、支援設備、零件與訓練的交易。柏恩斯說：「這項交易獲得政府內每一個人的全力支持，它符合我們在朝鮮半島的政策。」他強調：「我們的第一優先是防衛南韓。」　如果國會同意，這將是華府對南韓出售防空飛彈的第一筆交易。 </TEXT> </DOC>

<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> 報紙報導 </DOCTYPE> <DOCSRC> 中國時報 </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <ENAMEX TYPE="LOCATION">美</ENAMEX>擬售<ENAMEX TYPE="LOCATION">南韓</ENAMEX>1065枚刺針飛彈 </TITLE> <TEXT> 【本報綜合<ENAMEX TYPE="LOCATION">紐約</ENAMEX>、<ENAMEX TYPE="LOCATION">華盛頓</ENAMEX><TIMEX TYPE="DATE">十一日</TIMEX>外電報導】在<ENAMEX TYPE="LOCATION">華盛頓</ENAMEX>宣布首度出售「刺針」肩射防空飛彈給<ENAMEX TYPE="LOCATION">南韓</ENAMEX>的<TIMEX TYPE="DATE">第二天</TIMEX>，<ENAMEX TYPE="LOCATION">美國</ENAMEX>與<ENAMEX TYPE="LOCATION">北韓</ENAMEX><TIMEX TYPE="DATE">今天</TIMEX>在<ENAMEX TYPE="LOCATION">紐約</ENAMEX>恢復延擱已久的會談，這項預定三天的會談將以<ENAMEX TYPE="LOCATION">北韓</ENAMEX>的飛彈發展為重點，包括<ENAMEX TYPE="LOCATION">北韓</ENAMEX>準備部署射程可涵蓋幾乎<ENAMEX TYPE="LOCATION">日本</ENAMEX>全境的「蘆洞」一號長程飛彈的報導。

<ID="3">十一日 <ID="4" REF="3">今天 <ID="5" REF="3">出售「刺針」肩射防空飛彈給南韓的第二天 <ID="63">延擱已久的會談 <ID="66" REF="63">一九九六年四月在德國柏林舉行的首度會談的後續談判 <ID="65" REF="63">這項有關北韓飛彈計劃的會談 <ID="70" REF="65">會談 <ID="69" REF="65">會談 <ID="64" REF="63">這項預定三天的會談
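The equivalence classes on this slide follow from the ID/REF links: every mention whose REF chain leads to the same antecedent belongs to one class. A minimal sketch of that grouping, using union-find; the function names and the shortened sample string are illustrative assumptions, not part of the original tagging environment.

```python
import re

# One entry looks like <ID="4" REF="3">mention text (REF is optional).
ENTRY = re.compile(r'<ID="(\d+)"(?:\s+REF="(\d+)")?\s*>([^<]*)')

def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def coref_classes(markup):
    """Group mention strings into equivalence classes via their REF links."""
    parent, text = {}, {}
    for mention_id, ref, mention in ENTRY.findall(markup):
        text[mention_id] = mention.strip()
        parent.setdefault(mention_id, mention_id)
        if ref:
            parent.setdefault(ref, ref)
            parent[find(parent, mention_id)] = find(parent, ref)
    classes = {}
    for mention_id, mention in text.items():
        classes.setdefault(find(parent, mention_id), []).append(mention)
    return sorted(classes.values(), key=len, reverse=True)

# Shortened, hypothetical excerpt of the slide's mention list.
SAMPLE = ('<ID="3">十一日 <ID="4" REF="3" >今天 <ID="63" >延擱已久的會談 '
          '<ID="65" REF="63">這項有關北韓飛彈計劃的會談 <ID="70" REF="65">會談')

for cls in coref_classes(SAMPLE):
    print(cls)
```

Mention 70 refers to 65, which refers to 63, so all three land in the same class even though 70 never names 63 directly.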

<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> 報紙報導 </DOCTYPE> <DOCSRC> <COREF ID="1">中國時報</COREF> </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <COREF ID="6">美</COREF>擬售<COREF ID="23">南韓</COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN="刺針飛彈">1065枚刺針飛彈</COREF> </TITLE> <TEXT> 【<COREF ID="2" REF="1" TYPE="IDENT">本報</COREF>綜合<COREF ID="61">紐約</COREF>、<COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT">華盛頓</COREF><COREF ID="3">十一日</COREF>外電報導】在<COREF ID="7" REF="6" TYPE="IDENT">華盛頓</COREF>宣布首度<COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN="第二天">出售「刺針」肩射防空飛彈給<COREF ID="24" REF="23" TYPE="IDENT">南韓</COREF>的第二天</COREF>，<COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT">美國</COREF>與<COREF ID="29">北韓</COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT">今天</COREF>在<COREF ID="62" REF="61" TYPE="IDENT">紐約</COREF>恢復<COREF ID="63" MIN="會談">延擱已久的會談</COREF>，<COREF ID="64" REF="63" TYPE="IDENT" MIN="會談">這項預定三天的會談</COREF>將以<COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN="飛彈"><COREF ID="30" REF="29" TYPE="IDENT">北韓</COREF>的飛彈</COREF>發展為重點，包括<COREF ID="31" REF="29" TYPE="IDENT">北韓</COREF>準備部署射程可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈的報導。

IE Evaluation in MUC-7 (1998) • Named Entity Task [NE]: Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure • Multi-lingual Entity Task [MET]: NE task for Chinese and Japanese • Co-reference Task [CO]: Capture information on co-referring expressions: all mentions of a given entity, including those tagged in NE, TE tasks

IE Evaluation in MUC-7 (cont.) • Template Element Task [TE]: Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text • Template Relation Task [TR]: Extract relational information on employee_of, product_of, and location_of relations • Scenario Template Task [ST]: Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities involved in the event.
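The NE task output is SGML markup of the kind shown in the named-entity-tagged document earlier. A minimal sketch of reading such markup back into (type, string) pairs and recovering the untagged text; the regular expression assumes non-nested tags with a single TYPE attribute, which holds for the sample but is an assumption in general.

```python
import re

# Matches <ENAMEX TYPE="...">...</ENAMEX> and the TIMEX / NUMEX variants.
TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')

def extract_entities(tagged):
    """Return (TYPE, surface string) pairs in document order."""
    return [(m.group(2), m.group(3)) for m in TAG.finditer(tagged)]

def strip_tags(tagged):
    """Remove the NE markup and keep the original text."""
    return TAG.sub(lambda m: m.group(3), tagged)

SAMPLE = ('<ENAMEX TYPE="LOCATION">美</ENAMEX>擬售'
          '<ENAMEX TYPE="LOCATION">南韓</ENAMEX>1065枚刺針飛彈 '
          '<TIMEX TYPE="DATE">十一日</TIMEX>')

print(extract_entities(SAMPLE))
print(strip_tags(SAMPLE))
```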

Chinese IE Technologies • Segmentation • Named Entity Extraction • Part of Speech/Sense Tagging • Full/Partial Parsing • Co-Reference Resolution

Segmentation

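Segmentation • Problem • A Chinese sentence is composed of characters without word boundaries • 這名記者會說國語。 (unsegmented) • 這 / 名 / 記者 / 會 / 說 / 國語。 ("This reporter can speak Mandarin.") • 這 / 名 / 記者會 / 說 / 國語。 ("This press conference speaks Mandarin.") • Word Definition • A character string with an independent meaning and a specific syntactic function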

Segmentation • Standard • China【信息處理用現代漢語分詞規範】 • Implemented in 1988 • National standard in 1992 (GB/T13715-92) • Taiwan【資訊處理用中文分詞標準草案】 • Proposed by ROCLING in 1996 • National standard in 1999 (CNS14366)

Segmentation Strategies • Dictionary is an important resource • List “all” possible words • Find the most “plausible” path from a word lattice • 把他的確實行動作了分析 • 電子計算機是會計算題目的機器
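The two steps on this slide, listing all possible words and then picking the most plausible path through the word lattice, can be sketched as follows. The word frequencies are invented for illustration, and scoring a path by smoothed unigram log-probability is one assumed notion of "plausible"; the slide does not fix a scoring function.

```python
import math

def build_lattice(sentence, dictionary, max_len=5):
    """Map each start index to the end indices of all dictionary words
    beginning there; single characters are always allowed as a fallback."""
    n = len(sentence)
    return {
        i: [j for j in range(i + 1, min(n, i + max_len) + 1)
            if j == i + 1 or sentence[i:j] in dictionary]
        for i in range(n)
    }

def most_plausible_path(sentence, freq, max_len=5):
    """Dynamic programming over the lattice, maximizing the sum of
    log-probabilities of the words on the path."""
    total = sum(freq.values())
    lattice = build_lattice(sentence, freq, max_len)
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)   # (score, start of last word)
    best[0] = (0.0, 0)
    for i in range(n):
        for j in lattice[i]:
            # Crude add-one smoothing so unknown single characters get a score.
            logp = math.log((freq.get(sentence[i:j], 0) + 1) / (total + 1))
            if best[i][0] + logp > best[j][0]:
                best[j] = (best[i][0] + logp, i)
    words, j = [], n
    while j > 0:                         # follow back-pointers from the end
        i = best[j][1]
        words.append(sentence[i:j])
        j = i
    return words[::-1]

# Hypothetical frequencies, not taken from any real corpus.
TOY_FREQ = {"電子": 50, "計算": 80, "計算機": 40, "電子計算機": 30, "是": 500,
            "會": 300, "會計": 60, "題目": 70, "目的": 100, "的": 1000, "機器": 90}

print(most_plausible_path("電子計算機是會計算題目的機器", TOY_FREQ))
```

The lattice contains competing words such as 會計 and 目的; the path score decides against them because they force low-probability leftovers (算, 題).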

Segmentation Strategies (Continued) • Disambiguation: Select the best combination • Rule-based • Longest-word first: 台灣大學是有名的學府 • Longer words mask shorter ones (長詞遮蔽短詞): *這名記者會說國語。 • Delete the discontinuous fragments • Other heuristic rules: 2-3 words preference, ... • Parser • Statistics-based • Markov models, relaxation method, and so on
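The longest-word-first rule and its masking failure can be reproduced with a forward maximum-matching sketch. The small dictionary is an assumption made up for the two example sentences.

```python
def longest_word_first(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical dictionary covering the slide's examples.
TOY_DICT = {"台灣大學", "台灣", "大學", "有名", "學府",
            "記者", "記者會", "國語"}

print(longest_word_first("台灣大學是有名的學府", TOY_DICT))
print(longest_word_first("這名記者會說國語。", TOY_DICT))
```

The first sentence is segmented correctly because 台灣大學 wins over 台灣 + 大學. In the second, 記者會 masks the intended 記者 + 會, which is the error the starred example points at.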

Segmentation Strategies • Dictionary Coverage • Dictionary cannot cover “all” the words • solutions • Morphological rules • (semi-)automatic construction of dictionaries: automatic terminology extraction • Unknown word resolution

Morphological Rules • numeral + classifier + classifier • 一個個, 一條條 • date + time • 八十五年十月四日 • noun (or verb) prefix/suffix • 學生們 • special verbs • 丟丟看，吃吃看，寫寫看 • 高高興興，歡歡喜喜，漂漂亮亮，迷迷糊糊 • 打打球，跑跑步，寫寫字 • ...
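Several of these rules are regular enough to express as patterns with backreferences. A minimal sketch; the rule names, the numeral class, and the short classifier list are assumptions for illustration, and a real system would restrict the reduplicated characters with a lexicon.

```python
import re

HAN = r"[\u4e00-\u9fff]"                 # any CJK unified ideograph
NUM = "[一二三四五六七八九十百千零〇兩]"
CLASSIFIERS = "個條張本隻"                # toy classifier list (assumption)

RULES = [
    ("date",      re.compile(f"{NUM}+年{NUM}+月{NUM}+日")),
    ("AABB",      re.compile(f"({HAN})\\1({HAN})\\2")),     # 高高興興
    ("VV看",      re.compile(f"({HAN})\\1看")),              # 吃吃看
    ("num+cl+cl", re.compile(f"{NUM}([{CLASSIFIERS}])\\1")),  # 一個個
]

def find_morphological_words(text):
    """Return (rule name, matched string) for every rule that fires."""
    hits = []
    for name, pattern in RULES:
        hits.extend((name, m.group(0)) for m in pattern.finditer(text))
    return hits

for sample in ["一個個", "八十五年十月四日", "吃吃看", "高高興興"]:
    print(sample, find_morphological_words(sample))
```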

Term Extraction: n-gram Approach • Compute n-grams from a corpus • Select candidate terms • Successor variety • the successor variety rises sharply when a segment boundary is reached • Use i-grams and (i+1)-grams to select candidate terms of length i • Mutual Information • Significance Estimation Function
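Successor variety counts how many distinct characters follow a given string in the corpus. A minimal sketch on a four-phrase toy corpus (an assumption, drawn from words in the sample document): inside a term the count stays low, and it jumps once the string is a complete term.

```python
from collections import defaultdict

def successor_variety(corpus, max_n=4):
    """For every n-gram (n <= max_n), count the distinct characters
    that immediately follow it anywhere in the corpus."""
    followers = defaultdict(set)
    for sentence in corpus:
        for i in range(len(sentence)):
            for n in range(1, max_n + 1):
                if i + n < len(sentence):
                    followers[sentence[i:i + n]].add(sentence[i + n])
    return {gram: len(chars) for gram, chars in followers.items()}

TOY_CORPUS = ["飛彈發展", "飛彈輸出", "飛彈問題", "出售飛彈"]

variety = successor_variety(TOY_CORPUS)
print(variety["飛"], variety["飛彈"])
```

Here 飛 is always followed by 彈 (variety 1), while 飛彈 is followed by three different characters, which marks 飛彈 as a candidate term.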

Named Entity Extraction

Named Entity Extraction • Five basic components in a document • People, affairs, time, places, things • Major unknown words • Named Entities in MET2 • Names: people, organizations, locations • Number: monetary/percentage expressions • Time: date/time expressions

Named People Extraction • Chinese person names • Chinese person names are composed of surnames and names. • Most Chinese surnames are single character and some rare ones are two characters. • Most names are two characters and some rare ones are single characters (in Taiwan) • The length of Chinese person names ranges from 2 to 6 characters. • Transliterated person names • Transliterated person names denote foreigners. • The length of transliterated person names is not restricted to 2 to 6 characters.

Named People Extraction: Chinese Person Names • Extraction Strategies • baseline models: name-formulation statistics • Propose possible candidates. • context cues • Add extra scores to the candidates. • When a title appears before (after) a string, it is probably a person name. • Person names usually appear at the head or the tail of a sentence. • Persons may be accompanied by speech-act verbs like "發言", "說", "提出", etc. • cache: occurrences of named people • A candidate appearing more than once has a high tendency to be a person name.

Structure of Chinese Personal Names • Chinese surnames have the following three types • Single character like '趙', '錢', '孫', '李' • Two characters like '歐陽' and '上官' • Two surnames together like '蔣宋' • Most names have the following two types • Single character • Two characters

Training Data • Name-formulation statistics are trained from a corpus of 1 million person names in Taiwan. • Each entry contains surname, given name, and sex. • There are 489,305 male names and 509,110 female names. • A total of 598 surnames are retrieved from this 1-M corpus. • Surnames of very low frequency like “是”, “那”, etc., are removed to avoid false alarms. • Only 541 surnames are left, and they are used to trigger the person name extraction system.

Training Data • The probability that a Chinese character is the first character (the second character) of a name is computed for male and female names separately. • We compute the probabilities using separate training tables for female and male names. • Either the male score or the female score may be greater than the thresholds. • In some cases, the female score may be greater than the male score. • Thresholds are set so that 99% of the training data pass them.

Baseline Models: name-formulation statistics • Model 1. Single-character surname, e.g., '趙', '錢', '孫' and '李' • P(C1)*P(C2)*P(C3) using the training table for male > Threshold1 and P(C2)*P(C3) using the training table for male > Threshold2, or • P(C1)*P(C2)*P(C3) using the training table for female > Threshold3 and P(C2)*P(C3) using the training table for female > Threshold4 • Model 2. Two-character surname, e.g., '歐陽' and '上官' • P(C2)*P(C3) using the training table for male > Threshold2, or • P(C2)*P(C3) using the training table for female > Threshold4 • Model 3. Two surnames together, e.g., '蔣宋' • P(C12)*P(C2)*P(C3) using the training table for female > Threshold3, P(C2)*P(C3) using the training table for female > Threshold4, and P(C12)*P(C2)*P(C3) using the training table for female > P(C12)*P(C2)*P(C3) using the training table for male
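A minimal sketch of Model 1 for one gender table, assuming C1 is the surname and C2, C3 are the two characters of the given name. The four training names are invented, the function names are hypothetical, and the comparison uses >= rather than > so that the training item sitting exactly on the threshold still passes; with a real 1-million-name table the difference is negligible.

```python
import math
from collections import Counter

def train(names):
    """names: (surname, two-character given name) pairs for one gender."""
    n = len(names)
    c1 = Counter(s for s, _ in names)
    c2 = Counter(g[0] for _, g in names)
    c3 = Counter(g[1] for _, g in names)
    return {"C1": {k: v / n for k, v in c1.items()},
            "C2": {k: v / n for k, v in c2.items()},
            "C3": {k: v / n for k, v in c3.items()}}

def scores(table, surname, given):
    """Return (P(C1)*P(C2)*P(C3), P(C2)*P(C3)) under one gender table."""
    p23 = table["C2"].get(given[0], 0.0) * table["C3"].get(given[1], 0.0)
    return table["C1"].get(surname, 0.0) * p23, p23

def coverage_threshold(values, coverage=0.99):
    """Largest threshold that `coverage` of the values meet or exceed."""
    ranked = sorted(values, reverse=True)
    k = max(1, math.ceil(coverage * len(ranked) - 1e-9))
    return ranked[k - 1]

def fit(names):
    table = train(names)
    full, given = zip(*(scores(table, s, g) for s, g in names))
    return table, coverage_threshold(full), coverage_threshold(given)

def is_name_model1(candidate, model):
    """candidate: 3-character string, single-character surname first."""
    table, t_full, t_given = model
    full, given = scores(table, candidate[0], candidate[1:3])
    return full >= t_full and given >= t_given

# Hypothetical male training names.
MALE = [("陳", "志豪"), ("李", "志強"), ("陳", "正雄"), ("王", "宏昌")]
male_model = fit(MALE)

print(is_name_model1("王志雄", male_model))   # unseen combination of seen characters
print(is_name_model1("陳家世", male_model))   # 家, 世 never seen in given names
```

A full system would fit a second model on female names and accept a candidate if either gender's test succeeds, as the slide's "or" states.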

Cues from Character Levels • Gender • A married woman may add her husband's surname before her surname. That forms type 3 person names. • Because a surname may be considered as a name, the candidates with two surnames do not always belong to the type 3 person names. • The gender information helps us disambiguate this type of person names. • Some Chinese characters have high score for male and some for female. The following shows some examples. • Male : 豪、霸、宏、志、斌、彬、強、正、昌、輝、雄 • Female : 佩、月、玉、如、君、秀、佳、怡、芬、芳、女

Cues from Sentence Levels • Titles • When a title appears before (after) a candidate, it is probably a person name. It can help to decide the boundary of a name. • 總統陳水扁 vs. 總統向青年學子 ... • Mutual Information • Telling whether a word is a content word or a name is essential. • 陳家世清白，決不會犯法。 • When there exists a strong relationship between surrounding words, the candidate word has a high probability of being a content word. • Punctuation Marks • When a candidate is located at the end of a sentence, we give it an extra score. • Words on either side of a caesura mark (、) tend to be of the same type.

Cues from Passage/Document Level: Cache • A person name may appear more than once in a paragraph. • There are four cases when cache is used. • (1) C1C2C3 and C1C2C4 are both in the cache, and C1C2 is correct. • (2) C1C2C3 and C1C2C4 are both in the cache, and both are correct. • (3) C1C2C3 and C1C2 are both in the cache, and C1C2C3 is correct. • (4) C1C2C3 and C1C2 are both in the cache, and C1C2 is correct.

Cache • The problem in using the cache is case selection. • For every entry in the cache, we assign it a weight. • An entry with a clear right boundary has a high weight. • title and punctuation • The other entries are assigned a low weight. • The use of weights in case selection • high vs. high ==> case (2) • high vs. low or low vs. high ==> high is correct • low vs. low • check the score of the last character of the name part • 邱永漢, 邱永強 • 李鵬常, 李鵬及
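The weight-based case selection can be sketched as a small decision function. The slide does not say what happens after the last-character score is checked in the low-vs-low case; this sketch assumes that both entries are kept when both last characters score well as name characters, and that otherwise the shared prefix is taken as the name. The scores and the cutoff are invented.

```python
import os

def select_names(entry_a, entry_b, last_char_score, min_score=0.01):
    """entry_*: (candidate string, "high" or "low" weight).
    Returns the candidates judged to be correct person names."""
    (name_a, weight_a), (name_b, weight_b) = entry_a, entry_b
    if weight_a == "high" and weight_b == "high":
        return [name_a, name_b]              # case (2): both are correct
    if weight_a == "high":
        return [name_a]
    if weight_b == "high":
        return [name_b]
    # low vs. low: look at how name-like each final character is (assumption).
    if all(last_char_score.get(name[-1], 0.0) >= min_score
           for name in (name_a, name_b)):
        return [name_a, name_b]
    # Otherwise assume the true name is the part both candidates share.
    return [os.path.commonprefix([name_a, name_b])]

# Hypothetical probabilities of a character ending a given name.
LAST_CHAR = {"漢": 0.02, "強": 0.05}

print(select_names(("邱永漢", "low"), ("邱永強", "low"), LAST_CHAR))
print(select_names(("李鵬常", "low"), ("李鵬及", "low"), LAST_CHAR))
```

Under these assumptions 邱永漢 and 邱永強 are both kept, while 李鵬常 and 李鵬及 collapse to 李鵬 because 常 and 及 are unlikely name endings.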

Discussion • Some typical types of errors. • foreign names (e.g., 魏斯特, 艾琳達) • They are identified as proper nouns correctly, but are assigned wrong features. • About 20% of errors belong to this type. • rare surnames (e.g., 應, 伊, 鳳) or artists' stage names. • Nearly 14% of errors come from this type. • others • Other proper nouns (place names, organization names, etc.) • identification errors

Omitted Name Problem • Some texts omit the given name and leave only the surname. • 陳踢了王一腳 • Strategies • If the candidate has appeared earlier in the same paragraph, it is an omitted name. • If the candidate has a special title like “嫌、妻、老、女” or a general title like “立委、教授、...”, then it is an omitted name. • If two single characters have a very high probability of being surnames, and they appear around a caesura mark, then they are regarded as omitted names.

Transliterated Person Names • Challenging Issues • No special cue like surnames in Chinese person names to trigger the recognition system. • No restriction on the length of a transliterated person name. • No large scale transliterated personal name corpus • Ambiguity in classification. '華盛頓' may denote a city or a former American president.

Strategy (1) • Character Condition • When a foreign name is transliterated, the selection of homophones is restrictive. Richard Marx: 理查馬克斯 vs. 娌茶碼剋鷥 • A basic character set can be trained from a transliterated name corpus. • If all the characters in a string belong to this set, the string is regarded as a candidate.
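A minimal sketch of the character condition: collect the character set from known transliterated names, then propose every maximal run of such characters as a candidate. The four training names and the minimum length of 2 are assumptions for illustration.

```python
def train_char_set(transliterated_names):
    """The basic character set is every character seen in the training names."""
    return set("".join(transliterated_names))

def candidates(text, char_set, min_len=2):
    """Return maximal runs of characters that all belong to char_set."""
    found, run = [], ""
    for ch in text + "\0":            # sentinel flushes the final run
        if ch in char_set:
            run += ch
        else:
            if len(run) >= min_len:
                found.append(run)
            run = ""
    return found

# Hypothetical training corpus.
CHAR_SET = train_char_set(["理查", "馬克斯", "柏恩斯", "艾恩宏"])

print(candidates("美國國務院發言人柏恩斯說", CHAR_SET))
print(candidates("娌茶碼剋鷥", CHAR_SET))
```

The unusual homophones in 娌茶碼剋鷥 never occur in the training names, so no candidate is proposed for it.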

Strategy (2) • Syllable Condition • Some strings which meet the character condition do not look like transliterated names. • Syllable Sequence • Simplified Condition • (1) For each candidate, we check the syllable of the first (the last) character. • (2) If the syllable does not occur in that position in the training corpus, the character is deleted. • (3) The remaining characters are treated in a similar way.
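The simplified condition amounts to trimming a candidate from both ends until its first and last syllables are ones seen in those positions during training. A minimal sketch; the character-to-syllable table and the two syllable sets are toy assumptions, where a real system would use a pronunciation dictionary and corpus statistics.

```python
def trim_by_syllable(candidate, syllable_of, first_syllables, last_syllables):
    """Drop leading characters whose syllable never starts a training name,
    then trailing characters whose syllable never ends one."""
    chars = list(candidate)
    while chars and syllable_of.get(chars[0]) not in first_syllables:
        chars.pop(0)
    while chars and syllable_of.get(chars[-1]) not in last_syllables:
        chars.pop()
    return "".join(chars)

# Hypothetical pronunciation table and position statistics.
SYLLABLE_OF = {"克": "ke", "柏": "bo", "恩": "en", "斯": "si"}
FIRST_SYLLABLES = {"bo", "ai"}
LAST_SYLLABLES = {"si", "hong"}

print(trim_by_syllable("克柏恩斯", SYLLABLE_OF, FIRST_SYLLABLES, LAST_SYLLABLES))
```

Characters with no known syllable are treated as failing the condition and are trimmed as well.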

Strategy (3) • Frequency Condition • For each candidate which has only two characters, we compute the frequency of these two characters to see if it is larger than a threshold. • The threshold is determined in the same way as for the baseline model of Chinese person names.

Cues around Names • Cues within Transliterated Names • Character Condition • Syllable Condition • Frequency Condition • Cues around Transliterated Names • titles: the same as Chinese person names • name introducers: "叫", "叫作", "叫做", "名叫", and "尊稱" • special verbs: the same as Chinese person names • separator dot between name parts: first name․middle name․last name

Discussion • Some transliterated person names may be identified by the Chinese person name extraction system. • 魏斯特, 愛琳達 • Some nouns may look like transliterated person names. • popular brands of automobiles, e.g., '飛雅特' and '雪佛蘭' • Chinese proper nouns, e.g., '利多', '連拉' and '華隆' • Chinese person names, e.g., '朱士列' • Besides the above nouns, boundary errors affect the precision too. • (拉)瑞強森

Named Organization Extraction • A complete organization name can be divided into two parts: name and keyword. • Example: 台北市政府 • Many words can serve as names, but only some fixed words can serve as keywords. • Challenging Issues • (1) a keyword is usually a common content word. • (2) a keyword may appear in the abbreviated form. • (3) the keyword may be omitted completely.

Classification of Organization Names • Complete organization names • This type of organization name is usually composed of proper nouns and keywords. • Some organization names are very long, thus (left) boundary determination is difficult. • Some organization names with keywords are still ambiguous. • '聯合報' usually denotes a publication rather than an organization. • Incomplete organization names • These organization names often omit their keywords. • The abbreviated organization names may be ambiguous. • '兄弟' and '公牛' are famous sports teams in Taiwan and in the USA, respectively; however, they are also common content words.

Strategies • Keywords • A keyword shows not only the possibility of an occurrence of an organization name, but also its right boundary. • Prefix • Prefix is a good marker for possible left boundary. • Single-character words • If the character preceding a possible keyword is a single-character word, then the content word is not a keyword. • If the characters preceding a possible keyword cannot exist independently, they form a name part of an organization. • Words of at least two characters • The words to compose a name part usually have strong relationships.

Strategies • Parts of speech • The name part of an organization cannot extend beyond a transitive verb. • Numerals and classifiers are also helpful. • Cache • problem: when should a pattern be put into the cache? • Character set is incomplete. • n-gram model • It must consist of a name and an organization name keyword. • Its length must be greater than 2 words. • It does not cross any punctuation marks. • It must occur more often than a threshold.

Handcrafted Rules • OrganizationName → OrganizationName OrganizationNameKeyword, e.g., 聯合國部隊 • OrganizationName → CountryName OrganizationNameKeyword, e.g., 美國大使館 • OrganizationName → PersonName OrganizationNameKeyword, e.g., 羅慧夫基金會 • OrganizationName → CountryName OrganizationName, e.g., 美國國防部 • OrganizationName → LocationName OrganizationName, e.g., 伊利諾州州府 • OrganizationName → CountryName {D|DD} OrganizationNameKeyword, e.g., 中國國際廣播電台 • OrganizationName → PersonName {D|DD} OrganizationNameKeyword, e.g., 羅慧夫文教基金會 • OrganizationName → LocationName {D|DD} OrganizationNameKeyword, e.g., 台北國際廣播電台
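Rules of this shape can be applied to a sequence of tagged words by matching tag patterns and taking the longest match at each position. A minimal sketch; the tag names are invented, the input is assumed to be already segmented and tagged, and D is treated as an opaque tag for an intervening word (one or two occurrences) because the slide does not define it.

```python
HEADS = ("COUNTRY", "PERSON", "LOC")

# Right-hand sides of the rules as tag tuples.
RULES = [("ORG", "ORGKW"), ("COUNTRY", "ORG"), ("LOC", "ORG")]
RULES += [(head, "ORGKW") for head in ("COUNTRY", "PERSON")]
RULES += [(head,) + ("D",) * k + ("ORGKW",) for head in HEADS for k in (1, 2)]

def find_org_names(tokens):
    """tokens: list of (word, tag). Returns the organization names found,
    preferring the longest rule that matches at each position."""
    found, i = [], 0
    while i < len(tokens):
        best = 0
        for rule in RULES:
            tags = tuple(tag for _, tag in tokens[i:i + len(rule)])
            if tags == rule and len(rule) > best:
                best = len(rule)
        if best:
            found.append("".join(word for word, _ in tokens[i:i + best]))
            i += best
        else:
            i += 1
    return found

SENTENCE = [("美國", "COUNTRY"), ("大使館", "ORGKW"), ("與", "C"),
            ("台北", "LOC"), ("國際", "D"), ("廣播", "D"), ("電台", "ORGKW")]

print(find_org_names(SENTENCE))
```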

Discussion • Most errors result from organization names without keywords. • 金匯通復華大公投顧 • 兄弟太陽烈火 • Identification errors • Even if a keyword appears, it does not always signal an organization name. • 上市公司, 各國大學 • Incorrect left boundaries are also a problem. • 不為國安局, (基督)長老教會 • Ambiguities • 聯合報, 天下雜誌

Application of Gender Assignment • Anaphora resolution "問華德教授，他說那是正常的師生戀，既然雙方都是獨身男女，總不會不准談戀愛吧。至於後來趙靜雯去了那裡，為甚麼失蹤，他一概不知，並輕描淡寫的說：「也許加拿大不適合她，跑回臺灣去了。」" • The gender of a person name is useful for this problem. • The accuracy of gender assignment is 89%. • Co-Reference resolution

Named Location Extraction • A location name is composed of name and keyword parts. • Rules • LocationName → PersonName LocationNameKeyword • LocationName → LocationName LocationNameKeyword • Locative verbs like '來自', '前往', and so on, are introduced to treat location names without keywords. • Cache and n-gram models are also employed to extract location names.
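The locative-verb cue can be sketched as a trigger on segmented text: a word that directly follows a locative verb and is not a known common word becomes a location candidate. The verb list is limited to the two verbs named on the slide, and the "not in the dictionary" test is an assumption about how a candidate without a keyword is recognized.

```python
LOCATIVE_VERBS = ("來自", "前往")

def locations_after_locative_verbs(words, known_words):
    """words: a segmented sentence. Returns unknown words that directly
    follow a locative verb."""
    return [words[i + 1]
            for i in range(len(words) - 1)
            if words[i] in LOCATIVE_VERBS and words[i + 1] not in known_words]

# Hypothetical dictionary of ordinary words.
KNOWN = {"代表", "今天", "開會", "學校"}

print(locations_after_locative_verbs(["代表", "今天", "前往", "平壤", "開會"], KNOWN))
```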
