A Survey on Information Extraction from Documents Using Structures of Sentences. Chikayama Taura Lab. M1 Mitsuharu Kurita. Introduction. Current search systems are based on two assumptions: users send words, not sentences, and the aim is to find documents that are related to the query words.
A Survey on Information Extraction from Documents Using Structures of Sentences • Chikayama Taura Lab. • M1 Mitsuharu Kurita
Introduction • Current search systems are based on 2 assumptions • Users send words, not sentences • The aim is to find documents that are related to the query words • We unconsciously learn to select words that are likely to appear near the target information • In some cases this clue doesn't work well
Introduction • For more convenient access to information • Analysis of the details of the question (to identify the target information) • Analysis of the information in the retrieved documents (to find the requested information) • → Information Extraction
Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion
Information Extraction • What is Information Extraction? • A task in natural language processing • Extracts information from texts rather than retrieving documents • Originated with the Message Understanding Conference (MUC) • A competition in IE among research groups • Set a new information extraction task for each conference held between 1987 and 1997
MUC Competition • An example of a MUC task • MUC-3 terrorism domain • Input: news articles (some of which report terrorist events) • Output: the instances involved in each incident
MUC Competition • Pattern matching or linguistic analysis • At that time (1987-1997), advanced natural language processing was difficult to apply in practice • Therefore, most competitors adopted pattern matching to find instances
Example of Pattern Matching • CIRCUS [92 Lehnert et al.] • Each pattern consists of a "trigger word" and a "linguistic pattern" • Example sentence: "The mayor was kidnapped by terrorists." • Pattern: kidnap-passive • Trigger: "kidnap" • Linguistic pattern: "<subject> passive-verb" • Variable: "target" • How the pattern fires: "kidnap" activates the pattern • "was kidnapped" is a passive verb phrase • The subject "mayor" is the target
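The three steps of the kidnap-passive example can be sketched in a few lines of Python. This is only a toy illustration, not the CIRCUS implementation: the regular expression stands in for real syntactic analysis, and the names `PASSIVE` and `match_kidnap_passive` are hypothetical.

```python
import re

# Toy stand-in for the "<subject> passive-verb" linguistic pattern.
# Assumption: the subject is everything before "was/were/is/are <verb>ed".
PASSIVE = re.compile(
    r"^(?:(?:the|a|an)\s+)?(?P<subject>[\w ]+?)\s+(?:was|were|is|are)\s+(?P<verb>\w+ed)\b",
    re.IGNORECASE,
)

def match_kidnap_passive(sentence):
    """Return {'target': subject} if the kidnap-passive pattern fires, else None."""
    # Step 1: the trigger word must occur for the pattern to be activated.
    if "kidnap" not in sentence.lower():
        return None
    # Step 2: the sentence must contain a passive verb phrase built on the trigger.
    m = PASSIVE.match(sentence.strip())
    if m is None or not m.group("verb").lower().startswith("kidnap"):
        return None
    # Step 3: the subject of the passive verb fills the "target" variable.
    return {"target": m.group("subject").strip()}

print(match_kidnap_passive("The mayor was kidnapped by terrorists."))
print(match_kidnap_passive("Terrorists kidnapped the mayor."))
```

The second call returns `None` because the active-voice sentence contains the trigger but not the passive construction, which is why a separate pattern would be needed for it.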
Problems of Pattern Matching • Creating patterns takes a huge amount of time • In many cases, they were handwritten • Patterns depend heavily on the target domain • It is difficult to adapt them to a new task • → Automatic construction of patterns
The Earliest Automatic Pattern Generation • AutoSlog [93 Riloff et al.] • Creates the patterns for CIRCUS automatically • Training data: articles in which the target words are tagged • Created 1237 patterns from 1500 tagged texts • Only 450 of them were judged to be valid by a human • Example: from "The mayor was kidnapped by terrorists." (tagged target: "mayor") • Pattern: kidnap-passive • Trigger: "kidnap" • Linguistic pattern: "<subject> passive-verb" • Variable: "target"
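A minimal sketch of how a pattern might be proposed from one tagged sentence follows. It covers a single hypothetical heuristic (tagged phrase followed by a passive verb) and is not AutoSlog's actual rule set; `propose_pattern` is an illustrative name, and the trigger is kept as the surface form of the verb because no stemmer is assumed.

```python
import re

def propose_pattern(sentence, target):
    """Propose a '<subject> passive-verb' pattern from one tagged sentence.

    `target` is the tagged noun phrase. Returns None when the single
    heuristic implemented here does not apply.
    """
    parts = sentence.split(target, 1)
    if len(parts) != 2:          # the tagged phrase is not in the sentence
        return None
    # Heuristic: the tagged phrase is immediately followed by "was/were <verb>ed".
    m = re.match(r"\s+(?:was|were)\s+(\w+ed)\b", parts[1])
    if m is None:
        return None
    return {
        "trigger": m.group(1).lower(),   # surface form, not stemmed
        "linguistic_pattern": "<subject> passive-verb",
        "variable": "target",
    }

print(propose_pattern("The mayor was kidnapped by terrorists.", "mayor"))
```

A heuristic this shallow also fires on sentences where the verb is irrelevant to the domain, which is consistent with the slide's point that proposed patterns still needed human validation.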
Recently it has become possible to use deeper linguistic analysis • Some studies are addressing new IE tasks using these linguistic resources and machine learning approaches
Sentence Structures • Dependency structure • Describes modification relations between words • Each sentence forms a tree • Predicate-argument structure • Describes the semantic relations between predicates and their arguments • Each sentence forms a graph
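To make the "tree" property concrete, a dependency structure can be stored as a map from each word to its head. The analysis below is hand-written for illustration (an assumption, not the output of any parser), and `is_tree` and `children` are hypothetical helper names.

```python
# Hand-written (assumed) dependency analysis of
# "The mayor was kidnapped by terrorists."; each word points to its head.
heads = {
    "The": "mayor",
    "mayor": "kidnapped",
    "was": "kidnapped",
    "kidnapped": None,       # the root has no head
    "by": "kidnapped",
    "terrorists": "by",
}

def is_tree(heads):
    """True if there is exactly one root and every word reaches it without a cycle."""
    roots = [w for w, h in heads.items() if h is None]
    if len(roots) != 1:
        return False
    for word in heads:
        seen = set()
        while word is not None:
            if word in seen:         # revisiting a word means a cycle
                return False
            seen.add(word)
            word = heads[word]
    return True

def children(heads, head):
    """Words that directly modify `head`."""
    return [w for w, h in heads.items() if h == head]

print(is_tree(heads))
print(children(heads, "kidnapped"))
```

A single-head map like this cannot express a predicate-argument structure in which one word is an argument of several predicates, which is one way to see why that representation is a graph rather than a tree.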
Difficulties in Using Structured Data • Most machine learning algorithms handle data as feature vectors • It is difficult to express structured data (e.g. trees, graphs) as vectors • Ways to use sentence structures for IE • Frequent substructures • Shortest paths between 2 words • Applying the kernel method for structured data
IE with Subgraphs of Sentence Structures • On-Demand Information Extraction (ODIE) [06 Sekine et al.] • Creates extraction patterns on demand and extracts information with them • Pipeline (from the diagram): query → article database → relevant articles → dependency analyzer → dependency trees → frequent subtree mining → subtree patterns → table of information
Experimental Results • Generated patterns • Found patterns for the query "merger and acquisition" (M&A) • Extracted information • For the query "acquire, acquisition, merger, buy, purchase" • <COM1> <agree to buy> <COM2> <for MNY> • <COM1> <will acquire> <COM2> <for MNY> • <a MNY merger> <of COM1> <and COM2>
Experimental Results • Very quick construction of patterns • In MUC, participants were allowed one month • ODIE takes only a few minutes to return the result • No training corpus is needed • ODIE learns extraction patterns from the data • Information about recurring events can be extracted well • Merger and acquisition • Nobel prize winners
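The frequent-substructure idea can be illustrated with a deliberately simplified miner. Instead of general subtrees, this sketch counts only downward label paths (head to dependent chains) and keeps those that occur in at least `min_support` trees. The trees are hand-made, with COM1, COM2 and MNY standing for entity placeholders as on the slides; none of this is the algorithm used in ODIE itself.

```python
from collections import Counter

def downward_paths(tree, max_len=3):
    """Yield every downward label path (a tuple) with at most max_len nodes."""
    def from_node(node, depth):
        label, kids = node
        yield (label,)
        if depth > 1:
            for kid in kids:
                for path in from_node(kid, depth - 1):
                    yield (label,) + path
    yield from from_node(tree, max_len)      # paths starting at this node
    for kid in tree[1]:                      # paths starting further down
        yield from downward_paths(kid, max_len)

def frequent_paths(trees, min_support=2, max_len=3):
    """Paths of two or more nodes that occur in at least min_support trees."""
    support = Counter()
    for tree in trees:
        support.update(set(downward_paths(tree, max_len)))  # count once per tree
    return {p: c for p, c in support.items() if c >= min_support and len(p) > 1}

# Each tree is (label, [children]); hand-written toy dependency trees.
trees = [
    ("acquire", [("COM1", []), ("COM2", []), ("for", [("MNY", [])])]),
    ("buy",     [("COM1", []), ("COM2", []), ("for", [("MNY", [])])]),
    ("acquire", [("COM1", []), ("COM2", [])]),
]
for path, count in sorted(frequent_paths(trees).items()):
    print(path, count)
```

Counting support once per tree rather than once per occurrence keeps a single long article from dominating the pattern set.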
IE with Shortest Path between Words • Extraction of interacting protein pairs [06 Yakushiji et al.] • Extract the interacting protein pairs from biomedical articles • Focus on the shortest path between 2 protein names in the predicate-argument structure • Discriminate with a Support Vector Machine (SVM) • Example sentence: "Entity1 is interacted with a hydrophilic loop region of Entity2." (figure: its predicate-argument graph with nodes entity1, be, interact, with, a, hydrophilic, loop, region, of, entity2)
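Finding the shortest path between the two protein names is a plain breadth-first search once the structure is stored as a graph. The edge list below is a hand-made guess at the graph for the example sentence (an assumption, since the figure is not preserved), and `shortest_path` is an illustrative helper.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph given as (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    previous = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None                       # the two words are not connected

# Assumed relations for
# "Entity1 is interacted with a hydrophilic loop region of Entity2."
edges = [
    ("interact", "entity1"), ("be", "interact"), ("interact", "with"),
    ("with", "region"), ("region", "a"), ("region", "hydrophilic"),
    ("region", "loop"), ("region", "of"), ("of", "entity2"),
]
print(shortest_path(edges, "entity1", "entity2"))
```

Under these assumed edges the path skips the modifiers "a", "hydrophilic" and "loop", leaving only the words that connect the two entities.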
Pattern Generation • Variation of patterns • The extracted patterns alone are not sufficient • Divide the patterns and combine them into new patterns • (figure: the pattern "X protein interact with region of Y" divided into parts labeled Main, Prep, Entity, Entity, ………)
Pattern Generation • Validation of patterns • Some of these patterns are inappropriate • Each pattern is scored by its adequacy on the training data • Feature vector • TP: true positive • FP: false positive
Support Vector Machine (SVM) • 2-class linear classifier • Divides the data space with a hyperplane • Margin maximization
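For reference, margin maximization can be written as the standard hard-margin optimization problem. The notation is introduced here and is not from the slides: training pairs (x_i, y_i) with labels y_i in {-1, +1}, weight vector w, and bias b, assuming the data are linearly separable.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n
```

The separating hyperplane is w · x + b = 0 and the margin width is 2 / ||w||, so minimizing ||w|| maximizes the margin.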
Experimental Results • Learning: AImed corpus (225 abstracts of biomedical papers, annotated with protein names and interactions) • Extraction: MEDLINE (14 million titles and 8 million abstracts) • Extracted data • 7775 protein pairs • 64.0% precision • 83.8% recall
IE with the Kernel Method on Sentence Structures • Kernel method • e.g. SVM • Data are used only in the form of dot products • If you can calculate the dot product directly, you do not have to calculate the vector • Furthermore, you can use other functions as long as they meet some conditions • (figure: raw data, vector space, classifier, kernel function)
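The claim that the dot product can be computed without building the vector is easy to check on a small case. The sketch below uses the degree-2 polynomial kernel on 2-dimensional inputs, a textbook example chosen here for illustration and not taken from the slides; `phi`, `dot` and `poly_kernel` are hypothetical names.

```python
import math

def phi(x):
    """Explicit feature map whose dot product equals (x . y) ** 2 for 2-d inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Same value as dot(phi(x), phi(y)), computed without building phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y))            # kernel evaluated in the input space
print(dot(phi(x), phi(y)))          # dot product in the mapped space
```

Both prints agree up to floating-point rounding. For trees and graphs the explicit map would enumerate substructures, which is exactly the step the kernel avoids.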
Relation Extraction • Relation Extraction with Tree Kernel [04 Culotta et al.] • Classify the relation between 2 entities • 5 entity types (person, organization, geo-political-entity, location, facility) • 5 major types of relations (at, near, part, role, social) • Classify the smallest subtree of the dependency tree that includes both entities
Tree Kernel • Represents the similarity between 2 tree-shaped data items • Calculated as the sum of the similarities of nodes • Flowchart: (1) Start: enqueue the root node pair • (2) Is the queue empty? If yes, return the similarity and end • (3) If no, dequeue a node pair and add its similarity • (4) Find all child node sequence pairs whose nodes have common main features and enqueue the child node pairs • (5) Go back to step 2
Calculation of Tree Kernel • Features of nodes (figure: feature table with the main features marked) • The similarity between two nodes is defined as the number of common features (excluding the main features)
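Calculation of Tree Kernel • X and X' denote nodes whose main features are common • (figure: one tree with root A, nodes B, C, D below it, and E on the lowest level; a second tree with root A', nodes B', C', F', D' below it, and E' on the lowest level; the node pairs compared step by step are A/A', B/B', C/C', D/D', the sequence B C with B' C', and E/E')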
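The queue-based procedure can be sketched as follows. This is a simplified reading of the flowchart, not the kernel from the cited paper: children are paired whenever their main features are equal, rather than enumerating child sequences, and the feature sets are invented for illustration. The example also assumes that E and E' are children of D and D', which the flattened figure does not confirm.

```python
from collections import deque

def node(main, feats=(), children=()):
    """A tree node: one main feature, a set of other features, and children."""
    return {"main": main, "feats": set(feats), "children": list(children)}

def tree_kernel(root_a, root_b):
    """Sum node similarities over all aligned node pairs (simplified sketch)."""
    if root_a["main"] != root_b["main"]:
        return 0
    total = 0
    queue = deque([(root_a, root_b)])            # enqueue the root node pair
    while queue:                                 # until the queue is empty
        a, b = queue.popleft()                   # dequeue a node pair
        total += len(a["feats"] & b["feats"])    # add the number of common features
        for ca in a["children"]:                 # enqueue child pairs whose
            for cb in b["children"]:             # main features are common
                if ca["main"] == cb["main"]:
                    queue.append((ca, cb))
    return total

# Trees shaped like the slide's figure, with made-up features.
t1 = node("A", {"verb", "past"}, [
    node("B", {"noun", "person"}),
    node("C", {"noun"}),
    node("D", {"prep"}, [node("E", {"noun", "loc"})]),
])
t2 = node("A", {"verb", "present"}, [
    node("B", {"noun", "person"}),
    node("C", {"noun", "org"}),
    node("F", {"adv"}),
    node("D", {"prep"}, [node("E", {"noun", "loc"})]),
])
print(tree_kernel(t1, t2))   # 1 + 2 + 1 + 1 + 2 = 7
```

Node F has no counterpart in the first tree, so it never enters the queue and contributes nothing to the sum.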
Experimental Results • Data set: ACE corpus • 800 annotated documents (gathered from newspapers and broadcasts) • 5 entity types (person, organization, geo-political-entity, location, facility) • 5 major types of relations (at, near, part, role, social)
Conclusion • Overview of Information Extraction • The aim of information extraction • Recent trend toward using deep linguistic resources • Ways to use sentence structures for IE • Difficulties of using structured data in machine learning • Three different approaches to exploit them