
A Survey on Information Extraction from Documents Using Structures of Sentences

A Survey on Information Extraction from Documents Using Structures of Sentences. Chikayama Taura Lab. M1 Mitsuharu Kurita. Introduction. Current search systems are based on 2 assumptions: users send words, not sentences, and the aim is to find documents that are related to the query words.



  1. A Survey on Information Extraction from Documents Using Structures of Sentences • Chikayama Taura Lab. • M1 Mitsuharu Kurita

  2. Introduction • Current search systems are based on 2 assumptions • Users send words, not sentences • The aim is to find documents that are related to the query words • We unconsciously learn to select words that will appear near the target information • In some cases this clue doesn’t work well

  3. Introduction • For more convenient access to information • Analysis of the details of the question • To know the target information • Analysis of the information in the retrieved documents • To find the requested information • → Information Extraction

  4. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  5. Information Extraction • What is Information Extraction? • A kind of task in natural language processing • Addresses extraction of information from texts • Not retrieval of the documents themselves • Originated with an international conference named MUC • Message Understanding Conference (MUC) • Competition in IE among research groups • Set information extraction tasks at each conference held between 1987 and 1997

  6. MUC Competition • An example of a MUC task • MUC-3 terrorism domain • Input: news articles (some of them include terrorism events) • Output: the instances involved in each incident

  7. MUC Competition • Pattern matching or linguistic analysis • At that time (1987-1997), it was difficult to use advanced natural language processing • Therefore, most competitors adopted pattern matching to find instances

  8. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  9. Example of Pattern Matching • CIRCUS [92 Lehnert et al.] • Each pattern consists of a “trigger word” and a “linguistic pattern” • Example sentence: “The mayor was kidnapped by terrorists.” • Pattern: kidnap-passive • Trigger: “kidnap” • Linguistic pattern: “<subject> passive-verb” • Variable: “target” • “kidnap” activates the pattern • “was kidnapped” is a passive verb phrase • The subject “mayor” is the target
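
The trigger-plus-linguistic-pattern idea on this slide can be illustrated with a small sketch. This is not the CIRCUS implementation: the `Pattern` class, the `match_passive` function and the regular expression standing in for real passive-voice detection are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pattern:
    name: str     # e.g. "kidnap-passive"
    trigger: str  # stem of the trigger word
    slot: str     # variable filled by the subject, e.g. "target"


def match_passive(pattern: Pattern, sentence: str) -> Optional[dict]:
    """Fill the slot if the sentence looks like '<subject> was <trigger>...ed'.

    Hypothetical, heavily simplified: a regex replaces the syntactic
    analysis a real system would use to find the subject and the passive verb.
    """
    regex = re.compile(
        r"^(?:the\s+|a\s+|an\s+)?(?P<subject>[\w\s]+?)\s+(?:was|were|is|are)\s+"
        + re.escape(pattern.trigger) + r"\w*ed\b",
        re.IGNORECASE,
    )
    m = regex.search(sentence)
    if m is None:
        return None  # trigger absent, or not in a passive construction
    return {pattern.slot: m.group("subject")}


kidnap_passive = Pattern(name="kidnap-passive", trigger="kidnap", slot="target")
passive_result = match_passive(kidnap_passive, "The mayor was kidnapped by terrorists.")
active_result = match_passive(kidnap_passive, "Terrorists kidnapped the mayor.")
```

In this sketch the passive sentence fills the “target” slot with “mayor”, while the active sentence does not activate the passive pattern at all.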

  10. Problems of Pattern Matching • It takes a huge amount of time to create patterns • In many cases, they were handwritten • Patterns depend heavily on the target domain • It is difficult to adapt them to a new task • → Automatic construction of patterns

  11. The Earliest Automatic Pattern Generation • AutoSlog [93 Riloff et al.] • Creates the patterns for CIRCUS automatically • Training data: articles in which the target words are tagged • Created 1237 patterns from 1500 tagged texts • Only 450 of them were judged to be valid by a human • Example sentence: “The mayor was kidnapped by terrorists.” • Pattern: kidnap-passive • Trigger: “kidnap” • Linguistic pattern: “<subject> passive-verb” • Variable: “target”

  12. Recently it has become possible to use deeper linguistic analysis • Some studies are addressing new IE tasks using these linguistic resources and machine learning approaches

  13. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  14. Sentence Structures • Dependency Structure • Describes modification relations between words • One sentence makes up a tree structure • Predicate-Argument structure • Describes the semantic relations between predicate and argument • One sentence makes up a graph structure
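
A minimal sketch of the two structures, using the hand-made example sentence “John wants to leave” (the sentence, the head assignments and the ARG labels are illustrative assumptions, not analyzer output). It shows why the dependency structure is a tree (every word has one head) while the predicate-argument structure is a graph (“John” is an argument of two predicates).

```python
from collections import Counter

# Dependency structure: each word points to exactly one head (None marks the root).
dependency_heads = {"wants": None, "John": "wants", "leave": "wants", "to": "leave"}

# Predicate-argument structure: (predicate, role, argument) edges.
predicate_arguments = [
    ("wants", "ARG1", "John"),
    ("wants", "ARG2", "leave"),
    ("leave", "ARG1", "John"),
]


def is_tree(heads):
    """True if there is exactly one root and every word reaches it without a cycle."""
    roots = [word for word, head in heads.items() if head is None]
    if len(roots) != 1:
        return False
    for word in heads:
        seen = set()
        while heads[word] is not None:
            if word in seen:
                return False  # walked in a circle, so this is not a tree
            seen.add(word)
            word = heads[word]
    return True


# Number of predicates each word is an argument of.
incoming = Counter(argument for _, _, argument in predicate_arguments)
```

Here `is_tree(dependency_heads)` holds, whereas `incoming["John"]` is 2, which a tree could not represent.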

  15. Difficulties in Using Structured Data • Most machine learning algorithms treat data as feature vectors • It is difficult to express structured data (e.g. trees, graphs) as vectors • Ways to use sentence structures for IE • Frequent substructures • Shortest paths between 2 words • Applying the kernel method for structured data

  16. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  17. IE with Subgraph of Sentence Structures • On-Demand Information Extraction [06 Sekine et al.] • Creates extraction patterns on demand and extracts information with them • Pipeline (figure): query → Article database → Relevant articles → Dependency analyzer → Dependency trees → Frequent subtree mining → Subtree patterns → Table of information
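
The “frequent subtree mining” step can be sketched as follows. As a loudly stated simplification, this sketch mines frequent downward paths rather than general subtrees, and the tree encoding, the function names and the three toy trees are assumptions for illustration, not the method or data of the cited work.

```python
from collections import Counter

# A tree node is (label, [children]); labels are words or entity classes.


def downward_paths(tree, max_len=3):
    """Yield every downward label path of 1..max_len nodes starting at any node."""

    def from_node(node, remaining):
        label, children = node
        yield (label,)
        if remaining > 1:
            for child in children:
                for rest in from_node(child, remaining - 1):
                    yield (label,) + rest

    yield from from_node(tree, max_len)
    for child in tree[1]:
        yield from downward_paths(child, max_len)


def frequent_paths(trees, min_support=2, max_len=3):
    """Keep multi-node paths that occur in at least min_support trees."""
    support = Counter()
    for tree in trees:
        support.update(set(downward_paths(tree, max_len)))  # count once per tree
    return {p: c for p, c in support.items() if c >= min_support and len(p) > 1}


# Toy dependency trees for sentences about acquisitions (hypothetical).
trees = [
    ("acquire", [("COM1", []), ("COM2", []), ("for", [("MNY", [])])]),
    ("acquire", [("COM1", []), ("COM2", [])]),
    ("buy", [("COM1", []), ("COM2", []), ("for", [("MNY", [])])]),
]
patterns = frequent_paths(trees, min_support=2)
```

With these toy trees, the paths that survive are acquire/COM1, acquire/COM2 and for/MNY, each supported by two trees.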

  18. Experimental Results • Generated patterns • Found patterns for the query “merger and acquisition” (M&A) • Extracted information • For the query “acquire, acquisition, merger, buy, purchase” • Example patterns: <COM1> <agree to buy> <COM2> <for MNY>; <COM1> <will acquire> <COM2> <for MNY>; <a MNY merger> <of COM1> <and COM2>

  19. Experimental Results • Very quick construction of patterns • In MUC, participants were allowed one month • ODIE takes only a few minutes to return the result • No training corpus is needed • ODIE learns extraction patterns from the data • Information about recurring events can be extracted well • Merger and acquisition • Nobel prize winners

  20. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  21. IE with Shortest Path between Words • Extraction of interacting protein pairs [06 Yakushiji et al.] • Extract the interacting protein pairs from biomedical articles • Focus on the shortest path between 2 protein names on the predicate-argument structure • Discriminate with a Support Vector Machine (SVM) • Example sentence: “Entity1 is interacted with a hydrophilic loop region of Entity2.” (figure: predicate-argument graph over the nodes entity1, be, interact, with, a, hydrophilic, loop, region, of, entity2)
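
Finding the shortest path between two entity nodes can be sketched with a breadth-first search. The edge list below is a hypothetical, undirected approximation of the graph for the example sentence (the exact edges of the original figure are not recoverable), and `shortest_path` is an illustrative name.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over an undirected graph given as (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two words are not connected


# Hypothetical edges for "Entity1 is interacted with a hydrophilic loop region of Entity2."
edges = [
    ("entity1", "interact"), ("be", "interact"), ("interact", "with"),
    ("with", "region"), ("a", "region"), ("hydrophilic", "region"),
    ("loop", "region"), ("region", "of"), ("of", "entity2"),
]
path = shortest_path(edges, "entity1", "entity2")
```

Under these assumed edges the path runs entity1, interact, with, region, of, entity2; modifiers such as “hydrophilic” and “loop” fall off the path, which is what makes it usable as a compact pattern.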

  22. Pattern Generation • Variation of patterns • The extracted patterns alone are not enough • Divide the patterns and combine them into new patterns • (figure: the pattern “X protein interact with region of Y” divided into Entity, Main, Prep and Entity parts)

  23. Pattern Generation • Validation of patterns • Some of these patterns are inappropriate • Each pattern is scored by how well it fits the training data • Feature vector • TP: True Positive, FP: False Positive

  24. Support Vector Machine (SVM) • 2-class linear classifier • Divides the data space with a hyperplane • Margin maximization
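
To make “margin maximization” concrete, the sketch below compares two hyperplanes that both separate the same toy data and measures the margin of each. It does not train an SVM; `geometric_margin` and the four data points are assumptions for illustration.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Two classes separated along the first axis (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, +1, +1]

centered = geometric_margin((1.0, 0.0), -1.5, points, labels)    # hyperplane x1 = 1.5
off_center = geometric_margin((1.0, 0.0), -0.5, points, labels)  # hyperplane x1 = 0.5
```

Both hyperplanes classify the toy data correctly, but the centered one keeps every point at distance 1.5 while the other comes within 0.5 of the negative class; an SVM prefers the hyperplane with the larger margin.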

  25. Experimental Results • Learning • AImed corpus • 225 abstracts of biomedical papers • Annotated with protein names and interactions • Extraction • MEDLINE • 14 million titles and 8 million abstracts • Extracted data • 7775 protein pairs • 64.0% precision • 83.8% recall

  26. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  27. IE with the Kernel Method on Sentence Structures • Kernel method • e.g. SVM • Data are used only in the form of dot products • If you can calculate the dot product directly, you do not have to calculate the vector • Furthermore, you can use other functions as long as they meet some conditions • (figure: raw data, kernel function, vector space, classifier)
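
A small numeric check of the claim that the dot product can be computed without building the vector. The quadratic kernel on 2-D inputs is chosen here only as a familiar example; it is not taken from the slides, and the function names are illustrative.

```python
import math


def quadratic_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without building any feature vector."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def explicit_features(x):
    """The feature map for 2-D inputs whose dot product equals quadratic_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = quadratic_kernel(x, y)                            # stays in the input space
via_vectors = dot(explicit_features(x), explicit_features(y))  # goes through the vector space
```

Both routes give the same value up to floating-point rounding. For trees and graphs the explicit vector would be impractically large, which is why a kernel defined directly on the structure is attractive.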

  28. Relation Extraction • Relation Extraction with Tree Kernel [04 Culotta et al.] • Classify the relation between 2 entities • 5 entity types (person, organization, geo-political-entity, location, facility) • 5 major types of relations (at, near, part, role, social) • Classify the smallest subtree of the dependency tree that includes both entities

  29. Tree Kernel • Represents the similarity between 2 tree-structured data items • Calculated as the sum of the similarities of nodes • Procedure (flowchart): 1. Enqueue the root node pair 2. If the queue is empty, return the similarity and end 3. Otherwise dequeue a node pair and add its similarity 4. Find all child node sequence pairs whose main features are common, enqueue those child node pairs, and go back to step 2

  30. Calculation of Tree Kernel • Features of nodes • The similarity between nodes is defined as the number of common features (except the main features) • The main features are used to decide whether two nodes match

  31. Calculation of Tree Kernel • X and X’ denote nodes whose main features are common • (figure: two trees, one with nodes A, B, C, D, E and one with nodes A’, B’, C’, D’, E’, F’, followed by the matching child node sequence pairs found under A and A’)
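
The queue-based calculation can be sketched as below. This is a simplified stand-in, not the kernel of the cited paper: it pairs individual children whose main features match instead of enumerating child node sequences, and the `Node` class, the feature values and the two toy trees are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    main: tuple        # main features, used only to decide whether nodes match
    other: frozenset   # remaining features, used to measure similarity
    children: list = field(default_factory=list)


def matches(a, b):
    return a.main == b.main


def similarity(a, b):
    """Number of common features, excluding the main features."""
    return len(a.other & b.other)


def tree_kernel(root_a, root_b):
    """Sum node similarities over matched node pairs (simplified sketch)."""
    if not matches(root_a, root_b):
        return 0
    total = 0
    queue = deque([(root_a, root_b)])        # enqueue the root node pair
    while queue:                             # until the queue is empty
        a, b = queue.popleft()               # dequeue a node pair
        total += similarity(a, b)            # add the similarity
        for child_a in a.children:           # enqueue matching child pairs
            for child_b in b.children:
                if matches(child_a, child_b):
                    queue.append((child_a, child_b))
    return total


# Toy trees: the roots A match, the children B match, C and F do not.
tree_1 = Node(("A",), frozenset({"noun", "person"}), [
    Node(("B",), frozenset({"verb"})),
    Node(("C",), frozenset({"noun", "org"})),
])
tree_2 = Node(("A",), frozenset({"noun", "location"}), [
    Node(("B",), frozenset({"verb"})),
    Node(("F",), frozenset({"noun"})),
])
score = tree_kernel(tree_1, tree_2)
```

For the toy trees the roots share one feature and the matched B children share one feature, so the score is 2; trees whose roots do not match score 0.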

  32. Experimental Results • Data set: ACE corpus • 800 annotated documents (gathered from newspapers and broadcasts) • 5 entity types (person, organization, geo-political-entity, location, facility) • 5 major types of relations (at, near, part, role, social)

  33. Outline • Introduction • Overview of Information Extraction (IE) • IE with pattern matching • IE with sentence structures • Frequent substructure • Shortest path between 2 words • Applying the kernel method for structured data • Conclusion

  34. Conclusion • Overview of Information Extraction • The aim of information extraction • Recent movement to use deep linguistic resources • Ways to use sentence structures for IE • Difficulties of using structured data in machine learning • Three different approaches to exploit them
