Causality Knowledge Extraction Based on a Single Sentence from Thai Textual Data
Chaveevan Pechsiri, Dhurakij Pundit University
Assoc. Prof. Dr. Asanee Kawtrakul, NAiST Laboratory, Kasetsart University
SNLP 2007, 14 December 2007
Outline
• Motivation
• Introduction
• Related work
• Crucial Problems
• System Overview
• Evaluation
• Conclusion
Specialty Research Unit in Natural Language Processing & Intelligent Information System Technology, Department of Computer Engineering, Kasetsart University
Motivation
• Most knowledge is scattered throughout text.
• Rather than reading a huge number of reports, we need an automatic knowledge extraction system that obtains causality knowledge from text for problem diagnosis, decision support, or question answering systems.
Introduction
• What is knowledge?
  • “Knowledge is the awareness and understanding of facts, truths or information gained in the form of experience or learning.” (Wikipedia, 2006)
  • “The information, understanding, and skills that you gain through education or experience” (Oxford Advanced Learner’s Dictionary, 2000)
• Knowledge types (Jana Trnková and Wolfgang Theilmann, 2004)
  • Orientation knowledge (“know what a topic is about”)
  • Action knowledge (“know how”)
  • Explanation knowledge (“know why something is the way it is”)
  • Reference knowledge (“know where to find additional information”)
Introduction
• What is causality?
  • Causality refers to the set of all particular "causal" or "cause-and-effect" relations (Wikipedia: http://en.wikipedia.org/wiki/Main_Page)
  • The relationship between something that happens and the reason for it happening (Oxford Advanced Learner’s Dictionary, 2000)
Causality Knowledge
• Inter-causal EDU (20%)
  • If aphids suck sap from a plant, its leaves will turn yellow and its flowers will start to drop.
  • Plant leaves shrink because aphids destroy the plant.
• Intra-causal EDU (7%)
  • An earthquake generates a tsunami. (NP1 V NP2)
  • Bird flu is caused by the virus ‘H5N1’. (NP1 cue NP2)
  • Leaves have black spots from bacteria. (NP1 V NP2 Prep NP3)
Related Work
Causal Verb (linking verb)
Crucial Problems
• How to identify causality within one sentence
• Implicit noun phrases (zero anaphora)
How to identify causality
• By using a causal verb (linking verb)
  • List of causal verbs from Girju, 2002 (1%): ราในถั่วผลิตอัลฟาทอกซิน / Fungus in peanut produces alpha toxin.
  • Cue phrase set (Chang and Choi, 2004) (2%): ไวรัส H5N1 เป็นสาเหตุให้เกิดโรคไข้หวัดนก / Bird flu is caused by the virus ‘H5N1’.
  • General verb + information + preposition phrase
  • Verb + preposition phrase (4%)
Causal Verb
• General verb + information + preposition phrase
  • General verb = {เป็น/be, มี/have, ได้รับ/get, …}
  • Information = {แผล/scar, จุด/spot, รอย/mark, ขีด/scratch, ตำหนิ/defect, โรค/disease, …}
  • Preposition = {from, with}
• Sentence pattern: “NP1 Verb [NP2] Prep NP3”
• For example, เป็น/be + โรค/disease = get a disease: A kid gets a disease from the virus ‘H5N1’.
How to identify causality
Sentence pattern: “NP1 Verb [NP2] Prep NP3”
Examples:
1. “พืช/Plant เป็น/is โรค/disease จาก/from เชื้อรา/fungi”
2. “โรค/Disease เกิด/occurs จาก/from ไวรัส/virus”
3. “เด็ก/Kid ตาย/dies ด้วย/with โรคไข้หวัดนก/the bird flu disease”
4. “เด็ก/Kid ได้รับ/gets เชื้อ/disease จาก/from การสัมผัสไก่ติดเชื้อไข้หวัดนก/touching the infected chicken”
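The sentence pattern above can be sketched as a small matcher. This is only an illustration, not the authors' implementation: the chunk labels ("NP", "V", "PREP"), the `match_pattern` name, and the use of English glosses as input are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical (chunk_label, text) pairs for one EDU; the labels are assumed, not from the slides.
Chunk = tuple[str, str]

CAUSAL_PREPS = {"from", "with"}  # the preposition set listed on the slides


def match_pattern(chunks: list[Chunk]) -> Optional[dict[str, Optional[str]]]:
    """Return the slots of 'NP1 Verb [NP2] Prep NP3', or None if the EDU does not fit."""
    labels = [label for label, _ in chunks]
    texts = [text for _, text in chunks]
    if labels == ["NP", "V", "NP", "PREP", "NP"]:
        np1, verb, np2, prep, np3 = texts
    elif labels == ["NP", "V", "PREP", "NP"]:
        # NP2 is optional in the pattern
        np1, verb, prep, np3 = texts
        np2 = None
    else:
        return None
    if prep not in CAUSAL_PREPS:
        return None
    return {"NP1": np1, "Verb": verb, "NP2": np2, "Prep": prep, "NP3": np3}


edu = [("NP", "plant"), ("V", "is"), ("NP", "disease"), ("PREP", "from"), ("NP", "fungi")]
slots = match_pattern(edu)
```

A match only marks the EDU as a candidate; whether it actually expresses causality still has to be decided from the slot values.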
Problems of using causal verb
• Verb ambiguity
• Causality:
  • “ใบพืช/Plant leaf มี/has จุดสีน้ำตาล/brown spots จาก/from เชื้อรา/fungi”
  • “คนไข้/The patient ตาย/dies ด้วย/with โรคมะเร็ง/cancer”
• Non-causality:
  • “ใบพืช/Plant leaf มี/has จุดสีน้ำตาล/brown spots จาก/from โคนใบ/the leaf base”
  • “คนไข้/The patient ตาย/dies ด้วย/with ความสงสัย/suspicion”
Zero Anaphora Problem
• For example:
  “โรคไข้หวัดนก/The bird flu disease เป็น/is โรคที่สำคัญโรคหนึ่ง/an important disease. Ø เกิด/occurs จาก/from ไวรัส H5N1/H5N1 virus.”
  where Ø is the zero anaphora (= bird flu disease).
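To make the zero anaphora problem concrete, here is a deliberately simplified sketch that fills an omitted NP1 with the most recent explicit NP1. This is an assumed stand-in for illustration only; it is not the heuristic rule the authors cite, and the `resolve_zero_subject` name and dictionary layout are hypothetical.

```python
def resolve_zero_subject(edus):
    """Fill a missing NP1 with the most recent explicit NP1 (simplifying assumption)."""
    resolved = []
    last_subject = None
    for edu in edus:
        edu = dict(edu)  # copy so the caller's data is not modified
        if edu.get("NP1") is None:
            edu["NP1"] = last_subject
        else:
            last_subject = edu["NP1"]
        resolved.append(edu)
    return resolved


edus = [
    {"NP1": "bird flu disease", "Verb": "is", "NP3": "an important disease"},
    {"NP1": None, "Verb": "occurs", "Prep": "from", "NP3": "H5N1 virus"},
]
filled = resolve_zero_subject(edus)
```

After this step the second EDU reads "bird flu disease occurs from H5N1 virus" and can be matched against the sentence pattern like any other EDU.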
System Overview
Diagram components: Text; Corpus Preparation; resources (WordNet, Lexitron, Plant encyclopedia); Causality learning; Learnt model; Causality extraction; Cause-effect relation; Knowledge base
Corpus Preparation
• Word segmentation (Sudprasert and Kawtrakul, 2003)
• Named entity determination (Chanlekha and Kawtrakul, 2004)
• EDU segmentation (Charoensuk et al., 2005)
• An EDU (Elementary Discourse Unit) is a minimal building block of a discourse tree (Mann and Thompson, 1988, p. 244), e.g. a simple sentence or a clause.
Corpus Preparation
• Manual feature annotation for learning (with reference to WordNet, the Plant encyclopedia, and the Lexitron dictionary)
<EDU type=causality>
<NP1 concept=‘plant organ#1’>ใบพืช</NP1>
<Verb =‘have’> มี</Verb>
<NP2 concept=‘symptom#1’> จุดสีน้ำตาล</NP2>
<Preposition =‘from’>จาก</Preposition>
<NP3 concept=‘fungus#1’> เชื้อรา</NP3>
</EDU>
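One way to hold such an annotated EDU in memory is a small record with the five features and the class label. The `AnnotatedEDU` class, its field names, and the `"none"` filler for a missing NP2 are assumptions for illustration; the slides do not describe the internal data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedEDU:
    """Hypothetical in-memory form of one annotated EDU."""
    np1: str             # concept of NP1, e.g. 'plant organ#1'
    verb: str            # verb value, e.g. 'have'
    np2: Optional[str]   # concept of NP2; optional in the sentence pattern
    prep: str            # preposition, e.g. 'from'
    np3: str             # concept of NP3, e.g. 'fungus#1'
    label: str           # 'causality' or 'non-causality'

    def features(self) -> tuple:
        """Return the five features in a fixed order for a learner."""
        return (self.np1, self.verb, self.np2 or "none", self.prep, self.np3)


example = AnnotatedEDU("plant organ#1", "have", "symptom#1", "from", "fungus#1", "causality")
```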
Causality Learning
• ID3 (Mitchell T.M., 1997)
• SVM (Cristianini and Shawe-Taylor, 2000)
ID3
ID3 uses a statistical property called information gain, defined in terms of entropy, to measure how well a given attribute A (e.g. NP1, Verb, NP2, Preposition, NP3) separates the collected examples S according to their target classification:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
where Entropy(S) specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (Charniak E., 1993), c is the number of different values of the target attribute, p_i is the proportion of S belonging to class i, Values(A) is the set of possible values of A, and S_v is the subset of S for which A has value v.
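The entropy and information gain measures can be computed in a few lines. The sketch below uses the standard definitions; the four toy examples are invented for illustration (loosely modeled on the ambiguity examples in this deck) and are not the authors' corpus.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target="label"):
    """Entropy of the whole set minus the weighted entropy after splitting on `attribute`."""
    base = entropy([ex[target] for ex in examples])
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder


# Toy data (assumed): the verb alone is ambiguous, NP3 separates the classes.
toy = [
    {"Verb": "have", "NP3": "fungi", "label": "causality"},
    {"Verb": "have", "NP3": "leaf base", "label": "non-causality"},
    {"Verb": "die", "NP3": "cancer", "label": "causality"},
    {"Verb": "die", "NP3": "suspicion", "label": "non-causality"},
]
gain_verb = information_gain(toy, "Verb")
gain_np3 = information_gain(toy, "NP3")
```

On this toy set NP3 carries all the information and Verb none, which is why a tree learner would split on NP3 first.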
ID3
Decision tree (figure) rooted at NP3:
• NP3 = pathogen → prep = from → verb = be → Causality
• NP3 = food poisoning → prep = with → verb = have → Causality
• NP3 = contraction → prep = from → verb = infect → Causality
ID3
• Rule mining using the Weka tool
ID3
• Rule generalization and verification
• Some rules share the same general concept and can be combined into one rule, as in the following example:
R1: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=fungi> THEN causality
R2: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=bacteria> THEN causality
R3: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=pathogen> THEN causality
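A rough sketch of this kind of generalization, assuming that R1 and R2 are folded into the more general R3 because fungi and bacteria are both kinds of pathogen. The `HYPERNYM` map and the `generalize` function are hypothetical; the slides do not say how the concept hierarchy is consulted.

```python
# Hypothetical concept hierarchy (child -> parent), assumed for this example only.
HYPERNYM = {"fungi": "pathogen", "bacteria": "pathogen"}


def generalize(rules):
    """Lift each rule's NP3 to its parent concept (if it has one) and drop duplicates."""
    merged = []
    for np1, verb, np2, prep, np3 in rules:
        general = (np1, verb, np2, prep, HYPERNYM.get(np3, np3))
        if general not in merged:
            merged.append(general)
    return merged


# Rules as (NP1, Verb, NP2, Prep, NP3) tuples; '*' means any value.
rules = [
    ("*", "be", "*", "from", "fungi"),     # R1
    ("*", "be", "*", "from", "bacteria"),  # R2
    ("*", "be", "*", "from", "pathogen"),  # R3
]
combined = generalize(rules)
```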
ID3
• Verifying rules
• The testing corpus of 2000 EDUs from the agricultural and health news domains contains 102 EDUs of the specified sentence pattern, of which only 87 EDUs express causality within the 20 causal verb rules.
SVM
The linear function f(x) of the input x = (x_1 … x_n) assigns x to the positive class if f(x) ≥ 0, and otherwise to the negative class (f(x) < 0):
$$f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{n} w_i x_i + b$$
where each x_i is one of the five features (NP1, Verb, NP2, Preposition, and NP3) of the specified sentence pattern from the annotated corpus, w is the weight vector, and b is the bias.
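The decision rule of a linear classifier can be sketched as follows. The weights, bias, and the numeric encoding of the features are invented for illustration; the slides do not specify how the categorical features are turned into numbers, so that step is left out here.

```python
def decision_value(weights, bias, x):
    """Linear function f(x) = <w . x> + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Positive class (causality) if f(x) >= 0, otherwise negative class."""
    return "causality" if decision_value(weights, bias, x) >= 0 else "non-causality"


# Toy two-dimensional example (assumed values, not a learnt model).
w = [1.0, -2.0]
b = -0.5
label_a = classify(w, b, [1.0, 0.0])
label_b = classify(w, b, [0.0, 1.0])
```

Here `w` and `b` stand for the learnt weight vector and bias; in practice they come out of training rather than being set by hand.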
Causality Extraction
• Causality identification
  • Use the causal verb rules from ID3
  • Use the weight vectors with the bias from SVM
• Solving zero anaphora
  • Use the heuristic rule (Ching-Long Yeh and Chris Mellish, 1997)
Evaluation
• 2000 EDUs from agricultural and health news are used for training, and another 2000 EDUs for testing, evaluated in terms of precision and recall.
• The results are then judged by experts using max-win voting.
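Precision and recall against an expert-voted gold standard can be computed as below. Reading "max-win voting" as a simple majority vote among experts is an assumption, and the toy votes and predictions are invented; they are not the reported results.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most experts (a tie goes to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]


def precision_recall(predicted, gold):
    """Precision and recall of boolean 'is causality' predictions against gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (assumed): three experts judge four EDUs; the system predicts causality or not.
expert_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [True, False, True],
]
gold = [majority_vote(v) for v in expert_votes]
predicted = [True, True, False, True]
precision, recall = precision_recall(predicted, gold)
```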
Evaluation
Discussion
• The precision of extraction with SVM is higher than with ID3 because ID3 depends on feature occurrences, which do not affect SVM.
• The 73% recall could be increased by using a larger corpus.
Conclusion
Our model will be very beneficial for causal question answering and for causal generalization in knowledge discovery.
Future work
• Knowledge generalization
Thank you