A Survey on Information Extraction from Documents Using Structures of Sentences

Chikayama Taura Lab. M1 Mitsuharu Kurita

Introduction • Current search systems are based on 2 assumptions • Users send words, not sentences • The aim is finding documents that are related to the query words • We unconsciously learn to select words that are likely to appear near the target information • In some cases this clue does not work well

Introduction • For more convenient access to the information • Analysis of the detail of question • To know the target information • Analysis of the information in retrieved documents • To find the requested information Information Extraction

Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

Information Extraction • What is Information Extraction? • A kind of task in natural language processing • Addresses extraction of information from texts • Not retrieval of the documents themselves • Originated with an international conference named MUC • Message Understanding Conference (MUC) • Competition of IE among research groups • Set information extraction tasks at each conference between 1987 and 1997

MUC Competition • An example of a MUC task • MUC-3 terrorism domain • Input: news articles (some of them describe terrorism events) • Output: the instances involved in each incident

MUC Competition • Pattern matching or linguistic analysis • At that time (1987-1997), it was difficult to use advanced natural language processing • Therefore, most competitors adopted pattern matching to find instances

Example of Pattern Matching • CIRCUS [92 Lehnert et al.] • Each pattern consists of a “trigger word” and a “linguistic pattern” • Example sentence: “The mayor was kidnapped by terrorists.” • Pattern: kidnap-passive • Trigger: “kidnap” • Linguistic pattern: “<subject> passive-verb” • Variable: “target” • “kidnap” activates the pattern • “was kidnapped” is a passive verb phrase • The subject “mayor” is the target
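The trigger-plus-linguistic-pattern idea above can be sketched in a few lines of Python. This is only an illustration, not the CIRCUS implementation: the `Pattern` class, the `apply_pattern` function, and the `PASSIVE` regular expression are hypothetical names, and the regular expression is a crude stand-in for a real linguistic analyzer.

```python
import re
from dataclasses import dataclass


@dataclass
class Pattern:
    name: str      # e.g. "kidnap-passive"
    trigger: str   # stem of the trigger word
    variable: str  # slot that the subject fills


# Assumption: "<subject> was/were <verb>ed ..." is treated as a passive clause.
# A real system would use a parser instead of this regular expression.
PASSIVE = re.compile(
    r"^(?:the |a |an )?(?P<subject>[\w ]+?) (?:was|were) (?P<verb>\w+ed)\b",
    re.IGNORECASE,
)


def apply_pattern(pattern, sentence):
    """Return {variable: subject} if the pattern fires on the sentence, else None."""
    match = PASSIVE.match(sentence)
    if match is None:
        return None  # no passive verb phrase found
    if not match.group("verb").lower().startswith(pattern.trigger):
        return None  # the trigger word does not activate the pattern
    return {pattern.variable: match.group("subject")}


kidnap_passive = Pattern(name="kidnap-passive", trigger="kidnap", variable="target")
result = apply_pattern(kidnap_passive, "The mayor was kidnapped by terrorists.")
```

Here `result` maps the variable "target" to the subject of the passive clause. An active sentence such as "Terrorists kidnapped the mayor." does not fire this pattern and would need a separate pattern of its own, which hints at why handwritten pattern sets grow large.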

Problems of Pattern Matching • It takes a huge amount of time to create patterns • In many cases, they were handwritten • Patterns depend heavily on the target domain • It is difficult to adapt them to a new task • Solution: automatic construction of patterns

The Earliest Automatic Pattern Generation • AutoSlog [93 Riloff et al.] • Creates the patterns for CIRCUS automatically • Training data: articles tagged with the target words • Created 1237 patterns from 1500 tagged texts • Only 450 of them were judged to be valid by a human • Example sentence: “The mayor was kidnapped by terrorists.” • Pattern: kidnap-passive • Trigger: “kidnap” • Linguistic pattern: “<subject> passive-verb” • Variable: “target”

Recently it has become possible to use deeper linguistic analysis • Some studies address new IE tasks using these linguistic resources and machine learning approaches

Sentence Structures • Dependency Structure • Describes modification relations between words • One sentence makes up a tree structure • Predicate-Argument structure • Describes the semantic relations between predicate and argument • One sentence makes up a graph structure

Difficulties in Using Structured Data • Most machine learning algorithms deal with data as feature vectors • It is difficult to express structured data (e.g. trees, graphs) as vectors • The ways to use sentence structures for IE • Frequent substructures • Shortest paths between 2 words • Applying the kernel method for structured data

IE with Subgraphs of Sentence Structures • On-Demand Information Extraction [06 Sekine et al.] • Creates extraction patterns on demand and extracts information with them • Pipeline (figure): query → article database → relevant articles → dependency analyzer → dependency trees → frequent subtree mining → subtree patterns → table of information
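To make the frequent-substructure step more concrete, here is a heavily simplified sketch. It is not the mining algorithm used by ODIE: it only counts the smallest possible subtrees (single head-dependent edges), whereas real frequent subtree mining also finds larger subtrees. The tree encoding `(label, children)`, the function names, and the toy trees are all assumptions made for illustration.

```python
from collections import Counter


def edges(tree):
    """Yield (head, dependent) label pairs from a (label, children) tree."""
    label, children = tree
    for child in children:
        yield (label, child[0])
        yield from edges(child)


def frequent_edges(trees, min_support):
    """Keep the edges that occur in at least min_support trees."""
    support = Counter()
    for tree in trees:
        support.update(set(edges(tree)))  # count each edge once per tree
    return {edge: n for edge, n in support.items() if n >= min_support}


# Toy dependency trees (invented), with entities replaced by their types.
trees = [
    ("buy", [("COM1", []), ("COM2", []), ("for", [("MNY", [])])]),
    ("acquire", [("COM1", []), ("COM2", []), ("for", [("MNY", [])])]),
    ("buy", [("COM1", []), ("COM2", [])]),
]
patterns = frequent_edges(trees, min_support=2)
```

With a minimum support of 2, only the edges shared by at least two trees survive, for example the edge from "for" to "MNY". Replacing entity mentions by type labels such as COM1 or MNY before counting is what lets different sentences share a substructure.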

Experimental Results • Generated patterns • Found patterns for a query “merger and acquisition” (M&A) • Extracted Information • For the query “acquire, acquisition, merger, buy, purchase” • Example patterns: <COM1> <agree to buy> <COM2> <for MNY>; <COM1> <will acquire> <COM2> <for MNY>; <a MNY merger> <of COM1> <and COM2>

Experimental Results • Very quick construction of patterns • In MUC, participants were allowed one month • ODIE takes only a few minutes to return the result • No training corpus is needed • ODIE learns extraction patterns from the data • Information about recurring events can be extracted well • Merger and acquisition • Nobel prize winners

IE with Shortest Path between Words • Extraction of interacting protein pairs [06 Yakushiji et al.] • Extract the interacting protein pairs from biomedical articles • Focus on the shortest path between 2 protein names on the predicate-argument structure • Discriminate with a Support Vector Machine (SVM) • Example sentence: “Entity1 is interacted with a hydrophilic loop region of Entity2.” • [Figure: predicate-argument graph over the words entity1, be, interact, with, a, hydrophilic, loop, region, of, entity2]
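The shortest-path step can be sketched with a breadth-first search over an undirected graph. The edge list below is a hand-made guess at the relations in the example sentence, not the output of a real predicate-argument parser, and the function name `shortest_path` is hypothetical.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over an undirected graph given as word pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two words are not connected


# Assumed relations for "Entity1 is interacted with a hydrophilic loop region of Entity2."
EDGES = [
    ("be", "interact"), ("interact", "entity1"), ("interact", "with"),
    ("with", "region"), ("region", "a"), ("region", "hydrophilic"),
    ("region", "loop"), ("region", "of"), ("of", "entity2"),
]
path = shortest_path(EDGES, "entity1", "entity2")
```

Under these assumed edges the path keeps only the words that connect the two proteins (interact, with, region, of) and drops modifiers such as "hydrophilic" and "loop", which is the point of using the shortest path as the pattern.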

Pattern Generation • Variation of Patterns • The extracted patterns alone are not enough • Divide the patterns and combine them into new patterns • [Figure: the pattern “X protein interact with region of Y” divided into parts labelled Main, Prep, Entity, Entity]

Pattern Generation • Validation of patterns • Some of these patterns are inappropriate • Each pattern is scored by its adequacy on the training data • Feature vector (TP: True Positive, FP: False Positive)

Support Vector Machine (SVM) • 2-class linear classifier • Divides the data space with a hyperplane • Margin maximization
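For reference, margin maximization is usually written as the following optimization problem. This is the standard textbook hard-margin formulation, not something taken from the slides; the symbols $w$, $b$, $x_i$, $y_i$ and $n$ are introduced here for illustration, with $y_i \in \{-1, +1\}$ the class label of training example $x_i$.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n

% Separating hyperplane: w^{\top} x + b = 0
% Decision rule:         f(x) = \operatorname{sign}\bigl(w^{\top} x + b\bigr)
% Margin width:          2 / \lVert w \rVert
```

Minimizing $\lVert w \rVert$ is the same as maximizing the margin width $2 / \lVert w \rVert$, which is why the objective is stated as a minimization.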

Experimental Results • Learning • AImed corpus • 225 abstracts of biomedical papers • Annotated with protein names and interactions • Extraction • MEDLINE • 14 million titles and 8 million abstracts • Extracted data • 7775 protein pairs • 64.0% precision • 83.8% recall

IE with the Kernel Method on Sentence Structures • Kernel Method • e.g. SVM • Data are used only in the form of dot products • If you can calculate the dot product directly, you do not have to calculate the vectors • Furthermore, you can use other functions as long as they meet some conditions • [Figure: raw data, vector space, classifier, kernel function]
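A small numeric check may help with the claim that the dot product can be calculated without building the vectors. The sketch below uses the quadratic kernel on 2-dimensional inputs, a standard textbook example that does not come from the slides; the function names are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit feature vector whose dot product equals the quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, y):
    """Same value as dot(feature_map(x), feature_map(y)), without the mapping."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(feature_map(x), feature_map(y))  # goes through the vector space
implicit = quadratic_kernel(x, y)               # works on the raw data only
```

Both routes give the same number up to floating-point rounding, but the kernel never constructs the feature vectors. For structured data such as trees, where an explicit vector is hard to write down, only the second route is practical.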

Relation Extraction • Relation Extraction with Tree Kernel [04 Culotta et al.] • Classify the relation between 2 entities • 5 entity types (person, organization, geo-political-entity, location, facility) • 5 major types of relations (at, near, part, role, social) • Classify the smallest subtree of the dependency tree that includes both entities

Tree Kernel • Represents the similarity between 2 tree-shaped data • Calculated as the sum of the similarities of nodes • Flowchart: Start → Enqueue the root node pair → Is the queue empty? • Yes: Return the similarity → End • No: Dequeue a node pair → Add the similarity → Find all child node sequence pairs whose nodes have common main features → Enqueue the child node pairs → back to “Is the queue empty?”

Calculation of Tree Kernel • Features of nodes • Some features of each node are designated as main features • The similarity between nodes is defined as the number of common features (except the main features)
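The queue-based procedure and the node similarity above can be sketched as follows. This is a deliberate simplification rather than the kernel of Culotta et al.: it pairs every two children whose main features match instead of enumerating child sequences, and it uses no decay factor. The `Node` class, its field names, and the example features are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    main: tuple       # main features, e.g. (part of speech, entity type)
    other: frozenset  # remaining features, used for the similarity
    children: list = field(default_factory=list)


def node_similarity(a, b):
    """Number of common features, not counting the main features."""
    return len(a.other & b.other)


def tree_similarity(root_a, root_b):
    """Sum node similarities over node pairs whose main features match."""
    if root_a.main != root_b.main:
        return 0
    total = 0
    queue = deque([(root_a, root_b)])
    while queue:
        a, b = queue.popleft()
        total += node_similarity(a, b)
        for child_a in a.children:
            for child_b in b.children:
                if child_a.main == child_b.main:  # simplified child matching
                    queue.append((child_a, child_b))
    return total


tree_1 = Node(("V",), frozenset({"kidnap", "past"}), [
    Node(("N", "person"), frozenset({"mayor", "singular"})),
])
tree_2 = Node(("V",), frozenset({"kidnap", "present"}), [
    Node(("N", "person"), frozenset({"governor", "singular"})),
    Node(("N", "organization"), frozenset({"singular"})),
])
similarity = tree_similarity(tree_1, tree_2)
```

In this toy case the roots share one feature ("kidnap") and the two person nodes share one ("singular"), while the organization node is never compared because its main features match nothing in the first tree.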

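Calculation of Tree Kernel • X and X’ denote the nodes whose main features are common • [Figure: two trees, one with nodes A, B, C, D, E and one with nodes A’, B’, C’, D’, E’, F’, followed by the matched pairs (A, A’), (B, B’), (C, C’), (D, D’), (E, E’) enumerated step by step]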

Experimental Results • Data set: ACE corpus • 800 annotated documents (gathered from newspapers and broadcasts) • 5 entity types (person, organization, geo-political-entity, location, facility) • 5 major types of relations (at, near, part, role, social)

Conclusion • Overview of Information Extraction • The aim of information extraction • Recent trend toward using deep linguistic resources • The ways to use sentence structures for IE • Difficulties of using structured data in machine learning • Three different approaches to exploit them
