自然語言處理實驗室 資訊工程學 研究所 臺灣大學 Natural Language Processing Lab. National Taiwan University Chapter 13 Chinese Information Extraction Technologies Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw
Outline • Introduction to Information Extraction (IE) • Chinese IE Technologies • Tagging Environment for Chinese IE • Applications • Summary
Introduction • Information Extraction • the extraction or pulling out of pertinent information from large volumes of text • Information Extraction System • an automated system to extract pertinent information from large volumes of text • Information Extraction Technologies • techniques used to automatically extract specified information from text (http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)
An Example in Air Vehicle Launch • Original Document • Named-Entity-Tagged Document • Equivalence Classes • Co-Reference Tagged Document
Legend (colors in the original slide): red: location name; blue: date expression; green: organization name; purple: person name
<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> 報紙報導 </DOCTYPE> <DOCSRC> 中國時報 </DOCSRC> <TEXT> 【本報綜合紐約、華盛頓十一日外電報導】在華盛頓宣布首度出售「刺針」肩射防空飛彈給南韓的第二天,美國與北韓今天在紐約恢復延擱已久的會談,這項預定三天的會談將以北韓的飛彈發展為重點,包括北韓準備部署射程可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈的報導。 美國國務院發言人柏恩斯說:「在有關北韓飛彈擴散問題上,美方的確有多項關切之處。」美國官員也長期懷疑北韓正對伊朗和敘利亞輸出飛彈,並希望平壤加入禁止擴散此種武器的
國際公約。美國官員已知會北韓說,倘若北韓希望與美國建立正常的外交關係,就必須減少飛彈輸出。 這項有關北韓飛彈計劃的會談是雙方於一九九六年四月在德國柏林舉行的首度會談的後續談判。美國在該次會談中要求北韓停止生產、測試及出售飛彈給他國,尤其是敘利亞和伊朗兩國。 美國副助理國務卿艾恩宏和北韓外交部對外事務局局長李衡哲分別為雙方的談判代表,會談預定在十三日結束。 柏恩斯說:「美方非常關心所有北韓本身,或是北韓與中共、伊朗或其他國家的飛彈問題。我們認為就此與他們舉行會談是甚為重要。」 而為提昇南韓陸軍的自衛能力,美國於昨天宣布準備出售價值三億零七百萬美元的一千零六十五枚刺針飛彈與其他武器給南韓,它說,這項交易不會使朝鮮半島的緊張局勢惡化。
五角大廈說:「這項設備與支援的銷售不會影響該區基本軍事均勢。」 國務院也表示全力支持此項包含兩百一十三座發射台、支援設備、零件與訓練的交易。 柏恩斯說:「這項交易獲得政府內每一個人的全力支持,它符合我們在朝鮮半島的政策。」他強調:「我們的第一優先是防衛南韓。」 如果國會同意,這將是華府對南韓出售防空飛彈的第一筆交易。 </TEXT> </DOC>
<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> 報紙報導 </DOCTYPE> <DOCSRC> 中國時報 </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <ENAMEX TYPE="LOCATION">美</ENAMEX>擬售<ENAMEX TYPE="LOCATION">南韓</ENAMEX>1065枚刺針飛彈 </TITLE> <TEXT> 【本報綜合<ENAMEX TYPE="LOCATION">紐約</ENAMEX>、<ENAMEX TYPE="LOCATION">華盛頓</ENAMEX><TIMEX TYPE="DATE">十一日</TIMEX>外電報導】在<ENAMEX TYPE="LOCATION">華盛頓</ENAMEX>宣布首度出售「刺針」肩射防空飛彈 給<ENAMEX TYPE="LOCATION">南韓</ENAMEX>的<TIMEX TYPE="DATE">第二天</TIMEX>,<ENAMEX TYPE="LOCATION">美國</ENAMEX>與<ENAMEX TYPE="LOCATION">北韓</ENAMEX><TIMEX TYPE="DATE">今天</TIMEX>在<ENAMEX TYPE="LOCATION">紐約</ENAMEX>恢復延擱已久的會談,這項預定三天的會談將以<ENAMEX TYPE="LOCATION">北韓</ENAMEX>的飛彈發展為重點,包括<ENAMEX TYPE="LOCATION">北韓</ENAMEX>準備部署射程 可涵蓋幾乎<ENAMEX TYPE="LOCATION">日本</ENAMEX>全境的「蘆洞」一號長程飛彈 的報導。
Equivalence Classes • <ID="3">十一日 • <ID="4" REF="3">今天 • <ID="5" REF="3">出售「刺針」肩射防空飛彈給南韓的第二天 • <ID="63">延擱已久的會談 • <ID="66" REF="63">一九九六年四月在德國柏林舉行的首度會談的後續談判 • <ID="65" REF="63">這項有關北韓飛彈計劃的會談 • <ID="70" REF="65">會談 • <ID="69" REF="65">會談 • <ID="64" REF="63">這項預定三天的會談
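The ID/REF pairs above define chains: each mention points at an antecedent, and all mentions that reach the same root form one equivalence class. A minimal Python sketch of that grouping follows. The function name `equivalence_classes` and the `(id, ref)` tuple layout are illustrative assumptions, not part of the MUC tooling, and the sketch assumes the REF pointers contain no cycles.

```python
from collections import defaultdict

def equivalence_classes(links):
    """Group mention IDs into classes by following REF pointers to a root.

    links: iterable of (mention_id, ref_id), where ref_id is None for a
    mention that starts a chain. Assumes the REF pointers contain no cycles.
    """
    ref = dict(links)

    def root(mention_id):
        # Walk REF pointers until reaching a mention with no antecedent.
        while ref.get(mention_id) is not None:
            mention_id = ref[mention_id]
        return mention_id

    classes = defaultdict(list)
    for mention_id in ref:
        classes[root(mention_id)].append(mention_id)
    return {r: sorted(members) for r, members in classes.items()}

# ID/REF pairs transcribed from the slide above.
links = [(3, None), (4, 3), (5, 3), (63, None), (66, 63),
         (65, 63), (70, 65), (69, 65), (64, 63)]
classes = equivalence_classes(links)
print(classes)
```

Mentions 69 and 70 refer to 65, which in turn refers to 63, so all of them land in the class rooted at 63.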
<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> 報紙報導 </DOCTYPE> <DOCSRC> <COREF ID="1">中國時報</COREF> </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <COREF ID="6">美</COREF>擬售<COREF ID="23">南韓</COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN="刺針飛彈">1065枚刺針飛彈</COREF> </TITLE> <TEXT> 【<COREF ID="2" REF="1" TYPE="IDENT">本報</COREF>綜合<COREF ID="61">紐約</COREF>、<COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT">華盛頓</COREF><COREF ID="3">十一日</COREF>外電報導】在<COREF ID="7" REF="6" TYPE="IDENT">華盛頓</COREF>宣布首度<COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN="第二天">出售「刺針」肩射防空飛彈 給<COREF ID="24" REF="23" TYPE="IDENT">南韓</COREF>的第二天</COREF>,<COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT">美國</COREF>與<COREF ID="29">北韓</COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT">今天</COREF>在<COREF ID="62" REF="61" TYPE="IDENT">紐約</COREF>恢復<COREF ID="63" MIN="會談">延擱已久的會談</COREF>,<COREF ID="64" REF="63" TYPE="IDENT" MIN="會談">這項預定三天的會談</COREF>將以<COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN="飛彈"><COREF ID="30" REF="29" TYPE="IDENT">北韓</COREF>的飛彈</COREF>發展為重點,包括<COREF ID="31" REF="29" TYPE="IDENT">北韓</COREF>準備部署射程 可涵蓋幾乎日本全境的「蘆洞」一號長程飛彈 的報導。
IE Evaluation in MUC-7 (1998) • Named Entity Task [NE]: Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure • Multi-lingual Entity Task [MET]: NE task for Chinese and Japanese • Co-reference Task [CO]: Capture information on co-referring expressions: all mentions of a given entity, including those tagged in NE, TE tasks
IE Evaluation in MUC-7 (cont.) • Template Element Task [TE]: Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text • Template Relation Task [TR]: Extract relational information on employee_of, product_of, and location_of relations • Scenario Template Task [ST]: Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities involved in the event.
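For the NE task, the system output is text with SGML tags such as the ENAMEX and TIMEX elements in the tagged document shown earlier. A minimal sketch of reading those tags back out with a regular expression is given below. The helper name `extract_entities` is hypothetical, NUMEX is included because MUC uses it for monetary and percentage expressions, and the pattern assumes flat (non-nested) tags with a single TYPE attribute.

```python
import re

# Matches flat MUC-style NE tags: <ENAMEX TYPE="...">text</ENAMEX>, etc.
TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>')

def extract_entities(tagged_text):
    """Return (tag, type, surface string) triples in document order."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in TAG.finditer(tagged_text)]

# Fragment copied from the named-entity-tagged example document.
sample = ('【本報綜合<ENAMEX TYPE="LOCATION">紐約</ENAMEX>、'
          '<ENAMEX TYPE="LOCATION">華盛頓</ENAMEX>'
          '<TIMEX TYPE="DATE">十一日</TIMEX>外電報導】')
entities = extract_entities(sample)
for entity in entities:
    print(entity)
```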
Chinese IE Technologies • Segmentation • Named Entity Extraction • Part of Speech/Sense Tagging • Full/Partial Parsing • Co-Reference Resolution
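Segmentation • Problem • A Chinese sentence is a string of characters with no explicit word boundaries, so the same string may be segmented in more than one way • 這名記者會說國語。 • 這 名 記者 會 說 國語。 ("This reporter can speak Mandarin.") • 這 名 記者會 說 國語。 (reads 記者會 as "press conference", which is wrong here) • Word Definition • A character string with an independent meaning and a specific syntactic function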
Segmentation • Standard • China【信息處理用現代漢語分詞規範】 • Implemented in 1988 • National standard in 1992 (GB/T13715-92) • Taiwan【資訊處理用中文分詞標準草案】 • Proposed by ROCLING in 1996 • National standard in 1999 (CNS14366)
Segmentation Strategies • Dictionary is an important resource • List “all” possible words • Find the most “plausible” path from a word lattice • 把他的確實行動作了分析 • 電子計算機是會計算題目的機器 • [Figure: word lattices over the characters of the two example sentences]
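The lattice idea can be sketched as follows: at every character position, list each dictionary word that starts there, then read segmentations off as paths through the lattice. This is only an illustration under stated assumptions. The function names, the tiny dictionary, and the `max_len` cap are hypothetical, and single characters are always allowed so that every sentence has at least one path.

```python
def build_lattice(sentence, dictionary, max_len=4):
    """lattice[i] lists the candidate words that start at position i.

    Single characters are always kept so that a path always exists.
    """
    lattice = []
    for i in range(len(sentence)):
        words = [sentence[i]]
        for j in range(i + 2, min(i + max_len, len(sentence)) + 1):
            if sentence[i:j] in dictionary:
                words.append(sentence[i:j])
        lattice.append(words)
    return lattice

def all_segmentations(sentence, dictionary):
    """Enumerate every path through the word lattice."""
    lattice = build_lattice(sentence, dictionary)

    def walk(i):
        if i == len(sentence):
            yield []
            return
        for word in lattice[i]:
            for rest in walk(i + len(word)):
                yield [word] + rest

    return list(walk(0))

toy_dictionary = {"記者", "記者會", "國語"}  # hypothetical mini-dictionary
paths = all_segmentations("這名記者會說國語", toy_dictionary)
for path in paths:
    print(" ".join(path))
```

Enumerating all paths is exponential in the worst case, so a real system would score the lattice (by rules or statistics) instead of listing every path.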
Segmentation Strategies (Continued) • Disambiguation: Select the best combination • Rule-based • Longest-word first: 台灣大學 是 有名 的 學府 • Pitfall: a long word masks shorter words (長詞遮蔽短詞): *這 名 記者會 說 國語。 • Delete the discontinuous fragments • Other heuristic rules: 2-3 words preference, ... • Parser • Statistics-based • Markov models, relaxation method, and so on
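The longest-word-first rule is commonly implemented as forward maximum matching. The sketch below assumes that reading of the rule; the function name, the toy dictionaries, and the `max_len` cap are illustrative. It reproduces both the correct segmentation of the first example and the masking error of the second.

```python
def longest_word_first(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation that prefers the longest match."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest window first and shrink it; a single character
        # is always accepted as a fallback.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical mini-dictionaries for the two slide examples.
good = longest_word_first("台灣大學是有名的學府",
                          {"台灣", "大學", "台灣大學", "有名", "學府"})
bad = longest_word_first("這名記者會說國語", {"記者", "記者會", "國語"})
print(" ".join(good))
print(" ".join(bad))
```

In the second sentence the greedy rule picks 記者會 because it is longer than 記者, which is exactly the masking problem noted above.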
Segmentation Strategies • Dictionary Coverage • Dictionary cannot cover “all” the words • solutions • Morphological rules • (semi-)automatic construction of dictionaries: automatic terminology extraction • Unknown word resolution
Morphological Rules • numeral + classifier+classifier • 一個個, 一條條 • date + time • 八十五年十月四日 • noun (or verb) prefix/suffix • 學生們 • special verbs • 丟丟 看,吃吃 看,寫寫 看 • 高高興興,歡歡喜喜,漂漂亮亮,迷迷糊糊 • 打打球,跑跑步,寫寫字 • ...
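Several of these rules are surface patterns and can be approximated with regular expressions. The sketch below is an assumption-laden illustration, not the authors' rule set: the numeral list `NUM` is a small hand-picked sample, the rule names are invented labels, and the prefix/suffix rule (e.g., 學生們) is deliberately left out because it needs a lexicon.

```python
import re

NUM = "一二三四五六七八九十百千兩"  # small illustrative numeral set

RULES = [
    ("numeral+classifier+classifier", re.compile(rf"[{NUM}](.)\1")),
    ("date", re.compile(rf"[{NUM}]+年[{NUM}]+月[{NUM}]+日")),
    ("AABB reduplication", re.compile(r"(.)\1(.)\2")),
    ("AAB reduplicated verb", re.compile(r"(.)\1.")),
]

def match_rule(token):
    """Return the name of the first rule that matches the whole token."""
    for name, pattern in RULES:
        if pattern.fullmatch(token):
            return name
    return None

for token in ["一個個", "八十五年十月四日", "高高興興", "打打球", "學生們"]:
    print(token, match_rule(token))
```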
Term Extraction: n-gram Approach • Compute n-grams from a corpus • Select candidate terms • Successor variety • the successor variety will sharply increase until a segment boundary is reached • Use i-grams and (i+1)-grams to select candidate terms of length i • Mutual Information • Significance Estimation Function
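Two of the statistics named above can be illustrated on a toy corpus. The slide gives no formulas, so this sketch assumes the usual definitions: successor variety is the number of distinct characters that follow a prefix, and mutual information is the pointwise form log2(P(xy) / (P(x)P(y))) over adjacent characters. The corpus and function names are hypothetical.

```python
import math
from collections import Counter

def successor_variety(corpus, prefix):
    """Count the distinct characters that follow `prefix` in the corpus."""
    followers = set()
    for text in corpus:
        start = text.find(prefix)
        while start != -1:
            end = start + len(prefix)
            if end < len(text):
                followers.add(text[end])
            start = text.find(prefix, start + 1)
    return len(followers)

def pointwise_mi(corpus, x, y):
    """log2(P(xy) / (P(x) * P(y))) estimated from character counts."""
    chars = Counter(c for text in corpus for c in text)
    pairs = Counter(text[i:i + 2] for text in corpus
                    for i in range(len(text) - 1))
    if pairs[x + y] == 0:
        return float("-inf")
    p_xy = pairs[x + y] / sum(pairs.values())
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    return math.log2(p_xy / (p_x * p_y))

toy_corpus = ["飛彈發展", "飛彈輸出", "飛彈問題", "飛機起飛"]  # made-up data
print(successor_variety(toy_corpus, "飛"))    # followers: 彈, 機
print(successor_variety(toy_corpus, "飛彈"))  # followers: 發, 輸, 問
print(pointwise_mi(toy_corpus, "飛", "彈"))
```

The variety rises from 2 after 飛 to 3 after 飛彈, which is the kind of jump the successor-variety criterion treats as a segment boundary.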
Named Entities Extraction • Five basic components in a document • People, affairs, time, places, things • Major unknown words • Named Entities in MET2 • Names: people, organizations, locations • Number: monetary/percentage expressions • Time: date/time expressions
Named People Extraction • Chinese person names • Chinese person names are composed of surnames and names. • Most Chinese surnames are single character and some rare ones are two characters. • Most names are two characters and some rare ones are single characters (in Taiwan) • The length of Chinese person names ranges from 2 to 6 characters. • Transliterated person names • Transliterated person names denote foreigners. • The length of transliterated person names is not restricted to 2 to 6 characters.
Named People Extraction: Chinese Person Names • Extraction Strategies • baseline models: name-formulation statistics • Propose possible candidates. • context cues • Add extra scores to the candidates. • When a title appears before (after) a string, the string is probably a person name. • Person names usually appear at the head or the tail of a sentence. • Persons may be accompanied by speech-act verbs such as "發言" (speak), "說" (say), and "提出" (propose). • cache: occurrences of named people • A candidate appearing more than once is more likely to be a person name.
Structure of Chinese Personal Names • Chinese surnames have the following three types • Single character like '趙', '錢', '孫', '李' • Two characters like '歐陽' and '上官' • Two surnames together like '蔣宋' • Most names have the following two types • Single character • Two characters
Training Data • Name-formulation statistics are trained on a corpus of about 1 million person names in Taiwan. • Each entry contains a surname, a name, and the person's sex. • There are 489,305 male names and 509,110 female names. • A total of 598 surnames are retrieved from this 1-M corpus. • Surnames of very low frequency, such as “是” and “那”, are removed to avoid false alarms. • The remaining 541 surnames are used to trigger the person name extraction system.
Training Data • The probability that a Chinese character is the first (or the second) character of a name is computed separately for male and female names. • The probabilities are looked up in separate training tables for female and male names. • A candidate is accepted when either its male score or its female score exceeds the corresponding thresholds. • In some cases, the female score may be greater than the male score. • Thresholds are set so that 99% of the training data pass them.
Baseline Models: name-formulation statistics • Model 1. Single-character surname, e.g., ‘趙’, ‘錢’, ‘孫’ and ‘李’ • P(C1)*P(C2)*P(C3) using the training table for male > Threshold1 and P(C2)*P(C3) using the training table for male > Threshold2, or • P(C1)*P(C2)*P(C3) using the training table for female > Threshold3 and P(C2)*P(C3) using the training table for female > Threshold4 • Model 2. Two-character surname, e.g., ‘歐陽’ and ‘上官’ • P(C2)*P(C3) using the training table for male > Threshold2, or • P(C2)*P(C3) using the training table for female > Threshold4 • Model 3. Two surnames together, e.g., ‘蔣宋’ • P(C12)*P(C2)*P(C3) using the training table for female > Threshold3, P(C2)*P(C3) using the training table for female > Threshold4, and P(C12)*P(C2)*P(C3) using the training table for female > P(C12)*P(C2)*P(C3) using the training table for male
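Model 1 can be read as two threshold tests per gender table, joined by "or". The sketch below follows that reading for a three-character candidate C1C2C3. Everything concrete in it is an assumption for illustration: the table layout (`surname`, `first`, `second`), the probabilities, and the threshold values are invented, not the trained values from the 1-M corpus.

```python
def passes(c1, c2, c3, table, t_full, t_name):
    """Both products must exceed their thresholds for one gender table."""
    p1 = table["surname"].get(c1, 0.0)
    p2 = table["first"].get(c2, 0.0)
    p3 = table["second"].get(c3, 0.0)
    return p1 * p2 * p3 > t_full and p2 * p3 > t_name

def model1(candidate, male, female, thresholds):
    """Model 1: accept if the male test or the female test succeeds."""
    c1, c2, c3 = candidate
    return (passes(c1, c2, c3, male, thresholds[1], thresholds[2])
            or passes(c1, c2, c3, female, thresholds[3], thresholds[4]))

# Invented toy tables and thresholds, for illustration only.
male = {"surname": {"陳": 0.10}, "first": {"志": 0.05},
        "second": {"強": 0.04}}
female = {"surname": {"陳": 0.10}, "first": {"怡": 0.05},
          "second": {"君": 0.04}}
thresholds = {1: 1e-6, 2: 1e-4, 3: 1e-6, 4: 1e-4}

print(model1("陳志強", male, female, thresholds))
print(model1("陳怡君", male, female, thresholds))
print(model1("陳的是", male, female, thresholds))
```

Characters missing from a table get probability 0 here, so such candidates always fail; a real system would need a smoothing policy, which the slides do not specify.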
Cues from Character Levels • Gender • A married woman may add her husband's surname before her surname. That forms type 3 person names. • Because a surname may be considered as a name, the candidates with two surnames do not always belong to the type 3 person names. • The gender information helps us disambiguate this type of person names. • Some Chinese characters have high score for male and some for female. The following shows some examples. • Male : 豪、霸、宏、志、斌、彬、強、正、昌、輝、雄 • Female : 佩、月、玉、如、君、秀、佳、怡、芬、芳、女
Cues from Sentence Levels • Titles • When a title appears before (after) a candidate, the candidate is probably a person name. The title also helps to decide the boundary of a name. • 總統陳水扁 vs. 總統向青年學子 ... • Mutual Information • It is essential to tell whether a word is a content word or a name. • 陳家世清白,決不會犯法。 • When there is a strong relationship between the surrounding words, the candidate word is more likely to be a content word. • Punctuation Marks • When a candidate is located at the end of a sentence, we give it an extra score. • Words on either side of a caesura mark (、) tend to be of the same type.
Cues from Passage/Document Level: Cache • A person name may appear more than once in a paragraph. • There are four cases when cache is used. • (1) C1C2C3 and C1C2C4 are both in the cache, and C1C2 is correct. • (2) C1C2C3 and C1C2C4 are both in the cache, and both are correct. • (3) C1C2C3 and C1C2 are both in the cache, and C1C2C3 is correct. • (4) C1C2C3 and C1C2 are both in the cache, and C1C2 is correct.
Cache • The problem in using the cache is case selection. • Every entry in the cache is assigned a weight. • An entry with a clear right boundary gets a high weight. • title and punctuation • The other entries are assigned a low weight. • The use of weight in case selection • high vs. high ==> case (2) • high vs. low or low vs. high ==> the high-weight entry is correct • low vs. low • check the score of the last character of the name part • 邱永漢 邱永強 • 李鵬常 李鵬及
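The weight-based case selection can be sketched as a small decision function for two cache entries that share their first two characters. The slide does not say how the last-character score is thresholded in the low vs. low case, so the `threshold` parameter, the score table, and the fallback to the shared two-character prefix are assumptions made only for this illustration.

```python
def select_case(entry_a, entry_b, last_char_score, threshold=0.01):
    """Pick the correct name(s) from two cache entries C1C2C3 / C1C2C4.

    Each entry is (string, weight) with weight "high" or "low".
    """
    (a, weight_a), (b, weight_b) = entry_a, entry_b
    if weight_a == "high" and weight_b == "high":
        return [a, b]                      # case (2): both are names
    if weight_a == "high":
        return [a]                         # high beats low
    if weight_b == "high":
        return [b]
    # low vs. low: look at how name-like the last characters are
    # (assumed rule: both must pass, else fall back to the shared C1C2).
    if all(last_char_score.get(s[-1], 0.0) > threshold for s in (a, b)):
        return [a, b]
    return [a[:2]]                         # case (1): C1C2 is the name

scores = {"漢": 0.02, "強": 0.04, "常": 0.001, "及": 0.0005}  # invented
print(select_case(("邱永漢", "low"), ("邱永強", "low"), scores))
print(select_case(("李鵬常", "low"), ("李鵬及", "low"), scores))
```

With these made-up scores, 邱永漢 and 邱永強 are kept as two distinct names, while 李鵬常 and 李鵬及 collapse to 李鵬 because 常 and 及 are unlikely name characters.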
Discussion • Some typical types of errors. • foreign names (e.g., 魏斯特, 艾琳達) • They are identified as proper nouns correctly, but are assigned wrong features. • About 20% of errors belong to this type. • rare surnames (e.g., 應, 伊, 鳳) or artists' stage names. • Nearly 14% of errors come from this type. • others • Other proper nouns (place names, organization names, etc.) • identification errors
Omitted Name Problem • Some texts omit the name part and leave only the surname. • 陳踢了王一腳 ("Chen kicked Wang.") • Strategies • If the candidate has appeared earlier in the same paragraph, it is an omitted name. • If the candidate has a special title like “嫌、妻、老、女” or a general title like “立委、教授、...”, then it is an omitted name. • If two single characters both have a very high probability of being surnames and they appear around a caesura mark, they are regarded as omitted names.
Transliterated Person Names • Challenging Issues • No special cue like surnames in Chinese person names to trigger the recognition system. • No restriction on the length of a transliterated person name. • No large scale transliterated personal name corpus • Ambiguity in classification. '華盛頓' may denote a city or a former American president.
Strategy (1) • Character Condition • When a foreign name is transliterated, the selection of homophones is restrictive. Richard Marx: 理查馬克斯 vs. 娌茶碼剋鷥 • A basic character set can be trained from a transliterated name corpus. • If all the characters in a string belong to this set, the string is regarded as a candidate.
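The character condition amounts to collecting the characters seen in known transliterated names and then proposing maximal runs of such characters as candidates. A minimal sketch under that assumption follows. The four training names are taken from examples elsewhere in these slides and stand in for a real corpus; the function names and `min_len` are hypothetical.

```python
def train_charset(transliterated_names):
    """Basic character set: every character seen in the training names."""
    return set("".join(transliterated_names))

def character_condition(text, charset, min_len=2):
    """Return maximal runs of characters that all belong to the set."""
    runs, current = [], ""
    for ch in text:
        if ch in charset:
            current += ch
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = ""
    if len(current) >= min_len:
        runs.append(current)
    return runs

# Tiny stand-in for a transliterated-name corpus.
charset = train_charset(["理查", "馬克斯", "柏恩斯", "艾琳達"])
print(character_condition("歌手理查馬克斯說", charset))
print(character_condition("娌茶碼剋鷥", charset))
```

Rare homophones such as 娌茶碼剋鷥 are rejected because none of their characters occur in the training names.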
Strategy (2) • Syllable Condition • Some strings that meet the character condition do not look like transliterated names. • Syllable Sequence • Simplified Condition • (1) For each candidate, we check the syllable of the first (the last) character. • (2) If the syllable does not occur in the training corpus, the character is deleted. • (3) The remaining characters are treated in a similar way.
Strategy (3) • Frequency Condition • For each candidate that has only two characters, we compute the frequency of these two characters to see if it is larger than a threshold. • The threshold is determined in a similar way to that of the baseline model for Chinese person names.
Cues around Names • Cues within Transliterated Names • Character Condition • Syllable Condition • Frequency Condition • Cues around Transliterated Names • titles: the same as Chinese person names • name introducers: "叫", "叫作", "叫做", "名叫", and "尊稱" • special verbs: the same as Chinese person names • first name․middle name ․last name
Discussion • Some transliterated person names may be identified by the Chinese person name extraction system. • 魏斯特 愛琳達 • Some nouns may look like transliterated person names. • popular brands of automobiles, e.g., '飛雅特' and '雪佛蘭' • Chinese proper nouns, e.g., '利多', '連拉' and '華隆' • Chinese person names, e.g., '朱士列' • Besides the above nouns, the boundary errors affect the precision too. • (拉)瑞強森
Named Organization Extraction • A complete organization name can be divided into two parts: name and keyword. • Example: 台北市政府 • Many words can serve as names, but only some fixed words can serve as keywords. • Challenging Issues • (1) a keyword is usually a common content word. • (2) a keyword may appear in the abbreviated form. • (3) the keyword may be omitted completely.
Classification of Organization Names • Complete organization names • This type of organization name is usually composed of proper nouns and keywords. • Some organization names are very long, so (left) boundary determination is difficult. • Some organization names with keywords are still ambiguous. • '聯合報' usually denotes a publication rather than an organization. • Incomplete organization names • These organization names often omit their keywords. • The abbreviated organization names may be ambiguous. • '兄弟' and '公牛' are famous sports teams in Taiwan and in the USA, respectively; however, they are also common content words.
Strategies • Keywords • A keyword shows not only the possibility of an occurrence of an organization name, but also its right boundary. • Prefix • Prefix is a good marker for possible left boundary. • Single-character words • If the character preceding a possible keyword is a single-character word, then the content word is not a keyword. • If the characters preceding a possible keyword cannot exist independently, they form a name part of an organization. • Words of at least two characters • The words to compose a name part usually have strong relationships.
Strategies • Parts of speech • The name part of an organization cannot extend beyond a transitive verb. • Numerals and classifiers are also helpful. • Cache • problem: when should a pattern be put into cache? • Character set is incomplete. • n-gram model • It must consist of a name and an organization name keyword. • Its length must be greater than 2 words. • It does not cross any punctuation marks. • It must occur more often than a threshold.
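The four n-gram conditions can be applied as a filter over segmented word n-grams before anything is cached. The sketch below is one possible reading: it assumes the keyword is the last word of the n-gram, uses a small punctuation set, and takes `min_count` as the frequency threshold. The data and names are made up for illustration.

```python
from collections import Counter

PUNCTUATION = set(",。、;:「」")

def cacheable_patterns(ngrams, keywords, min_count=2):
    """Keep n-grams that satisfy all four conditions from the slide."""
    counts = Counter(tuple(gram) for gram in ngrams)
    kept = []
    for gram, count in counts.items():
        ends_with_keyword = gram[-1] in keywords        # name + keyword
        long_enough = len(gram) > 2                     # more than 2 words
        no_punct = not any(w in PUNCTUATION for w in gram)
        frequent = count > min_count                    # above threshold
        if ends_with_keyword and long_enough and no_punct and frequent:
            kept.append(gram)
    return kept

# Invented observations of word n-grams in a segmented corpus.
observed = ([("中國", "國際", "廣播電台")] * 3
            + [("台北", "國際", "廣播電台")] * 1
            + [("美國", ",", "大使館")] * 5
            + [("美國", "大使館")] * 4)
patterns = cacheable_patterns(observed, {"廣播電台", "大使館"})
print(patterns)
```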
Handcrafted Rules • OrganizationName → OrganizationName OrganizationNameKeyword, e.g., 聯合國 部隊 • OrganizationName → CountryName OrganizationNameKeyword, e.g., 美國 大使館 • OrganizationName → PersonName OrganizationNameKeyword, e.g., 羅慧夫 基金會 • OrganizationName → CountryName OrganizationName, e.g., 美國 國防部 • OrganizationName → LocationName OrganizationName, e.g., 伊利諾州 州府 • OrganizationName → CountryName {D|DD} OrganizationNameKeyword, e.g., 中國 國際 廣播電台 • OrganizationName → PersonName {D|D} OrganizationNameKeyword, e.g., 羅慧夫 文教 基金會 • OrganizationName → LocationName {D|D} OrganizationNameKeyword, e.g., 台北 國際 廣播電台
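Rules of this shape can be applied to a sequence of already-typed words by merging adjacent pairs whose types match a rule's right-hand side. The sketch below covers only the five two-element rules; the rules with the optional {D|...} slot are omitted because the slides do not define D. The `(word, type)` token layout, the single left-to-right pass, and the function name are assumptions for illustration.

```python
# Right-hand sides of the two-element rules; each rewrites to OrganizationName.
PAIR_RULES = {
    ("OrganizationName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationNameKeyword"),
    ("PersonName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationName"),
    ("LocationName", "OrganizationName"),
}

def apply_rules(tokens):
    """One left-to-right pass that merges matching adjacent pairs.

    tokens: list of (word, type) pairs from earlier processing steps.
    """
    merged, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and (tokens[i][1], tokens[i + 1][1]) in PAIR_RULES):
            merged.append((tokens[i][0] + tokens[i + 1][0],
                           "OrganizationName"))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

embassy = apply_rules([("美國", "CountryName"),
                       ("大使館", "OrganizationNameKeyword")])
foundation = apply_rules([("羅慧夫", "PersonName"),
                          ("基金會", "OrganizationNameKeyword"),
                          ("說", "Verb")])
print(embassy)
print(foundation)
```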
Discussion • Most errors result from organization names without keywords. • 金匯通 復華 大公投顧 • 兄弟 太陽 烈火 • Identification errors • Even when a keyword appears, the string is not always an organization name. • 上市公司 各國大學 • Wrong left boundaries are also a problem. • 不為國安局 (基督)長老教會 • Ambiguities • 聯合報 天下雜誌
Application of Gender Assignment • Anaphora resolution "問華德教授,他說那是正常的師生戀,既然雙方都是獨身男女,總不會不准談戀愛吧。至於後來趙靜雯去了那裡,為甚麼失蹤,他一概不知,並輕描淡寫的說:「也許加拿大不適合她,跑回臺灣去了。」" • Gender of a person name is useful for this problem. • The correct rate for gender assignment is 89%. • Co-Reference resolution
Named Location Extraction • A location name is composed of name and keyword parts. • Rules • LocationName → PersonName LocationNameKeyword • LocationName → LocationName LocationNameKeyword • Locative verbs like '來自', '前往', and so on, are introduced to treat location names without keywords. • Cache and n-gram models are also employed to extract location names.