Semantic Relation Extraction for Linking Named Entities to Biomedical Databases


  1. Semantic Relation Extraction for Linking Named Entities to Biomedical Databases Presenter: Lê Hoàng Quỳnh Knowledge of Technology Laboratory, UET, VNU Hanoi, February 18th, 2012 lhquynh@gmail.com

  2. Main contents
     • Motivation and purpose
     • Some approaches: the pros and cons
     • Discussion and Proposal
     • Conclusion

  3. Motivation and purpose
     “… developing a state of the art named entities tagger for full open source biomedical texts …”
     • Deploying various named entity recognizers to see which works best
     • Linking each named entity to its appropriate identifier in public databases

  4. Motivation and purpose (cont’)
     Which named entities do we focus on?
     • Phenotype descriptions
     • Disease names
     • Gene names
     • Chemical names

  5. Motivation and purpose (cont’)
     Ontology =
     • Concept/Class
     • Term/Individual
     • Relation/Property

  6. Motivation and purpose (cont’) The Biocaster Multilingual Ontology biocaster.org

  7. Motivation and purpose (cont’)
     • How to link the named entities to unique identifiers in a biomedical database?
     • What is the difference between “linking” and “filling”?
     • Method?
       • Clustering
       • Semantic relation extraction [LTB11]
       • …
     [LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.

  8. Motivation and purpose (cont’)
     Semantic relation extraction
     • Semantic relation extraction is the task of extracting the underlying relations between two terms, as expressed by words or phrases [Gir08]
     • Due to the unique patterns of biomedical relations, techniques designed for extracting relations from general text may not be suitable for the biomedical domain
     [Gir08] Girju R. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.

  9. Motivation and purpose (cont’)
     Which kinds of semantic relation do we focus on?
     • Hyponymy
     • Synonymy
     • Cause/effect
     • Indicate/hasSymptom
     • Treat
     • …
     • Entity:
       • Phenotype descriptions
       • Disease names
       • Gene names
       • Chemical names

  10. Motivation and purpose (cont’)
     Which kinds of semantic relation do we focus on?
     • Hyponymy
     • Synonymy
     • Cause/effect
     • Indicate/hasSymptom
     • Treat
     • …
     • Entity:
       • Phenotype descriptions
       • Disease names
       • Gene names
       • Chemical names

  11. Some approaches
     Three groups of existing methods:
     • Pattern-based extraction relies on the occurrence of term pairs in the same context and uses the words in that context to identify the relation
     • Distributional clustering uses the contexts in which terms occur individually and attempts to group semantically related terms based on the similarity of these contexts
     • Term variation is based on the form of the term and uses similarities between terms to identify which of them are semantically related

  12. Some approaches (cont’)
     Distributional clustering:
     • Consider the contexts that a term tends to occur in, then apply clustering to work out which terms are most "similar"
     • Using this methodology, one can find classes of words that are similar in meaning
     For example, taking the nouns that occur with the verb "fire" gives the following classes:
     • Gun, Missile, Weapon
     • Shot, Bullet, Rocket, Missile
     • Officer, Aide, Chief, Manager
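
The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the (verb, noun) observations are made up, the context of a noun is simply the set of verbs it appears with, similarity is Jaccard overlap, and clustering is plain single-link grouping with a hypothetical threshold of 0.5. The methods surveyed in the talk may use different context features and clustering algorithms.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (verb, object-noun) observations; not taken from a real corpus.
PAIRS = [
    ("fire", "gun"), ("fire", "missile"), ("fire", "officer"),
    ("load", "gun"), ("load", "missile"), ("load", "rocket"),
    ("launch", "missile"), ("launch", "rocket"),
    ("hire", "officer"), ("hire", "manager"),
    ("promote", "officer"), ("promote", "manager"),
]


def context_sets(observations):
    """Map each noun to the set of verbs (its contexts) it was seen with."""
    contexts = defaultdict(set)
    for verb, noun in observations:
        contexts[noun].add(verb)
    return contexts


def jaccard(a, b):
    """Overlap of two context sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b)


def cluster(observations, threshold=0.5):
    """Single-link clustering: nouns whose contexts overlap enough are merged."""
    contexts = context_sets(observations)
    parent = {noun: noun for noun in contexts}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in combinations(sorted(contexts), 2):
        if jaccard(contexts[a], contexts[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for noun in contexts:
        groups[find(noun)].add(noun)
    return sorted(sorted(group) for group in groups.values())


print(cluster(PAIRS))
```

Note that "officer" also occurs with "fire", yet it ends up with "manager" rather than with the weapons, because its overall context distribution is closer to that of "manager". The terms never need to co-occur in one sentence.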

  13. Some approaches (cont’)
     Distributional clustering:
     • Pros:
       • Does not require that the terms occur in the same sentence or even in the same document
       • Generally has a higher recall than pattern-based methods
     • Cons:
       • Requires a mathematical approach to determine the clusters of terms that have a similar distribution of contexts
       • It is very difficult to work out the nature of the relationship between the terms from the clusters alone
       • Not suitable for extracting specific relationships such as "X is a causal agent of Y"

  14. Some approaches (cont’)
     Term variation:
     • Looking at the form of the actual term and using the similarity of the words in it to deduce whether the terms are related. For example: "cancer of the mouth" and "mouth cancer"
     • Jacquemin [Jac99] defines three main ways in which term variation occurs:
       • Syntactic variations
       • Morpho-syntactic variations
       • Semantic variations
     [Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
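
A minimal sketch of the simplest case, syntactic variation, assuming that two terms are variants when they contain the same content words in any order. The stopword list and the function names are hypothetical; morpho-syntactic and semantic variations would need stemming and a lexical resource, which this sketch does not attempt.

```python
import re

# Assumption: a tiny, hand-picked list of function words to ignore.
STOPWORDS = {"of", "the", "a", "an", "in"}


def normalize(term):
    """Reduce a term to the set of its lowercase content words."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return frozenset(token for token in tokens if token not in STOPWORDS)


def is_variant(term_a, term_b):
    """True when both terms consist of the same content words."""
    return normalize(term_a) == normalize(term_b)


print(is_variant("cancer of the mouth", "mouth cancer"))
print(is_variant("mouth cancer", "lung cancer"))
```

The first call reports a match and the second does not, which also illustrates the limitation noted for this family of methods: terms with no surface similarity can never be related this way.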

  15. Some approaches (cont’)
     Term variation:
     • Pros:
       • Often has very high precision
       • Strongest for finding whether two terms are synonymous
       • Can prove useful for some other cases as well
     • Cons:
       • Cannot help to identify relationships between terms with no surface similarity

  16. Some approaches (cont’)
     Pattern-based extraction involves finding the terms in the same sentence and in some "pattern" that is suggestive of a particular relation.
     • Hearst [Hea92] used patterns to extract terms that exhibit the hyponymy relation
     • Her approach involved noting that such terms often occur near each other in stereotypical patterns
       “Some kinds of flu, such as bird flu are …”
       Pattern: noun phrase - "such as" - noun phrase
       hyponym("bird flu", "flu")
     • Method for developing these patterns:
       • Decide on a lexical relationship
       • Collect a set of term pairs known to have this relationship and a corpus that contains these pairs
       • Find the places where these terms co-occur
       • Find commonalities and hypothesize a pattern
       • Use this pattern to find more term pairs and repeat the process
     [Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.
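
The "such as" pattern can be approximated with a regular expression. This is a rough sketch, not Hearst's implementation: it assumes the hypernym is the single word before "such as" and that the hyponym runs up to the next punctuation mark, because no noun-phrase chunker is used. The function name and the example sentence's punctuation are assumptions.

```python
import re

# Crude stand-in for "noun phrase - such as - noun phrase":
# one word before "such as", then everything up to the next , . or ;
SUCH_AS = re.compile(r"(?P<hypernym>\w+),? such as (?P<hyponym>[^,.;]+)")


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    return [
        (match.group("hyponym").strip(), match.group("hypernym"))
        for match in SUCH_AS.finditer(sentence)
    ]


print(extract_hyponyms("Some kinds of flu, such as bird flu, are dangerous."))
```

Without the comma after "bird flu" the hyponym would swallow the following words, which is exactly the term-boundary problem that pattern-based methods have to solve with part-of-speech information.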

  17. Some approaches (cont’)
     Pattern-based extraction
     • Pros:
       • Simple
       • Patterns have the advantage that they can be specialised for different relationships
       • Can be used for various languages
     • Cons:
       • The method was manual
       • There was no way to provide a strong comparison between the effectiveness of the different patterns, which perhaps led to the inclusion of a relatively "weak" pattern
       • It is not clear how to automatically generate patterns that are specific to a given relationship and domain
       • As patterns rely on finding the two terms in the same context, recall is limited, and ambiguity in the text can cause errors in the extractions
       • Problem of identifying the boundaries of the terms

  18. Some approaches (cont’)
     McCrae’s approach [Mcc09] for synonymy and hyponymy relations
     • Starts with the most general pattern, that is, the pattern consisting only of wildcards
     • Develops a more specific pattern by replacing wildcards with terms from some corpus (full text, Chap. 3.1)
     [Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctoral thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies, September 2009.

  19. Some approaches (cont’)
     McCrae’s approach:
     • Problem of identifying a term’s boundary
       entity = (NN|JJ|NNS|NNP|FW|NNPS|JJR)* (NN|NNS|NNP|NNPS)
       NN: a singular noun
       NNS: a plural noun
       NNP: a proper noun
       NNPS: a pluralised proper noun
       JJ: an adjective
       FW: a foreign word
       JJR: an adjective in comparative form
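
The entity expression above can be applied to a part-of-speech-tagged sentence as follows. This is a sketch under assumptions: the input is a list of (token, tag) pairs produced by some external tagger, the scan is greedy from left to right, and the function name and the example tagging are hypothetical.

```python
# Tag sets taken from the entity expression on the slide.
MODIFIER_TAGS = {"NN", "JJ", "NNS", "NNP", "FW", "NNPS", "JJR"}
HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def find_entities(tagged):
    """Return token spans matching (modifier)* (head), scanning greedily."""
    entities = []
    i = 0
    while i < len(tagged):
        # Extend over the longest run of tags allowed inside an entity.
        j = i
        while j < len(tagged) and tagged[j][1] in MODIFIER_TAGS:
            j += 1
        # Back off until the span ends on a noun tag, as the expression requires.
        end = j
        while end > i and tagged[end - 1][1] not in HEAD_TAGS:
            end -= 1
        if end > i:
            entities.append(" ".join(token for token, _ in tagged[i:end]))
            i = end
        else:
            i = max(j, i + 1)
    return entities


# Hypothetical tagger output for one sentence.
sentence = [
    ("acute", "JJ"), ("dengue", "NN"), ("fever", "NN"),
    ("is", "VBZ"), ("a", "DT"), ("viral", "JJ"), ("disease", "NN"),
]
print(find_entities(sentence))
```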

  20. Some approaches (cont’)
     McCrae’s approach:
     • Covering every possible variation of the patterns makes the search space far too large to be tractable
     • It is necessary to find a way to cover this search space more efficiently:
       • prioritizing "better" patterns
       • skipping patterns that are too similar to existing patterns

  21. Some approaches (cont’)
     McCrae’s approach:
     • Rule definition:
       *1 * such as *2 → Rule: :- name() words(1,1) "such" "as" name()
     • Simplifying the rules
     • Match-set (Chap. 3.2.1 in full text)
       :- words(1,2) name() words(0,1) words(2,3) "literal" name()
       Simplified form: :- words(1,1) name() words(2,4) "literal" name()
     • Join-set and alignment (Chap. 3.2.2 in full text)
       :- "a" name() "b" "c" "d" name()
       :- words(,1) name() words(2,3) "c" name()
       Alignment on these rules: {(2, 2), (4, 4), (6, 5)}
       The alignment-to-join conversion: :- words(,1) name() words(2,3) "c" words(0,1) name() words(0,0)
       Simplified form: :- name() words(2,3) "c" words(0,1) name()
     • Classification
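
One reading of the simplification step that reproduces both simplified forms shown on this slide is sketched below. It is an interpretation, not the thesis's definition: adjacent word gaps are merged by adding their bounds, a gap at either edge of the rule is tightened to its minimum length, and empty gaps are dropped. The tuple encoding of rules is hypothetical, and the missing lower bound in words(,1) is assumed to be 0.

```python
def simplify(rule):
    """Simplify a rule given as a list of tuples:
    ("words", lo, hi) for a gap of lo..hi words, ("name",) for an entity slot,
    ("lit", text) for a literal token.
    """
    # 1. Merge adjacent gaps: words(a,b) words(c,d) becomes words(a+c, b+d).
    merged = []
    for element in rule:
        if element[0] == "words" and merged and merged[-1][0] == "words":
            _, lo, hi = merged.pop()
            merged.append(("words", lo + element[1], hi + element[2]))
        else:
            merged.append(element)

    # 2. A gap at the start or end only needs its minimum length,
    #    assuming the rule may match anywhere inside a sentence.
    for index in (0, -1):
        if merged and merged[index][0] == "words":
            lo = merged[index][1]
            merged[index] = ("words", lo, lo)

    # 3. Drop gaps that can only match zero words.
    return [element for element in merged if element != ("words", 0, 0)]


match_set_rule = [
    ("words", 1, 2), ("name",), ("words", 0, 1), ("words", 2, 3),
    ("lit", "literal"), ("name",),
]
join_rule = [
    ("words", 0, 1), ("name",), ("words", 2, 3), ("lit", "c"),
    ("words", 0, 1), ("name",), ("words", 0, 0),
]
print(simplify(match_set_rule))
print(simplify(join_rule))
```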

  22. Some approaches (cont’) Mccrae’s approach: Results

  23. Some approaches (cont’) Mccrae’s approach: Results

  24. Some approaches (cont’)
     An approach that utilizes the Web [SNR08] [TNN10]
     • RDF describes the Semantic Web using RDF statements, which are triples of the form <Subject, Property, Object>
     • Query the search engines with lexico-syntactic patterns to retrieve relevant information
     • The “seed” patterns are initially handcrafted but can be progressively learnt
     • Extract relations from snippets
     [SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3, March 2008.
     [TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. IALP 2010: 170-173, Harbin, Heilongjiang, China, December 28-30, 2010.
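
The steps on this slide can be sketched as follows, with several assumptions: the two seed patterns, the property name "causedBy" and all function names are hypothetical, no real search engine is called, and the snippet is supplied by hand. It only shows how a seed pattern turns into a quoted query and how a returned snippet turns into a <Subject, Property, Object> triple.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """An RDF-style statement <Subject, Property, Object>."""
    subject: str
    prop: str
    obj: str


# Hypothetical handcrafted seed patterns for a causal relation.
SEED_PATTERNS = ["{disease} is caused by", "causes of {disease} include"]


def build_queries(disease):
    """Turn each seed pattern into an exact-phrase search query."""
    return ['"' + pattern.format(disease=disease) + '"' for pattern in SEED_PATTERNS]


def extract_from_snippet(disease, snippet):
    """Extract causal triples from one search-result snippet."""
    triples = []
    for pattern in SEED_PATTERNS:
        prefix = re.escape(pattern.format(disease=disease))
        # Take the text after the pattern, up to the next punctuation mark.
        match = re.search(prefix + r"\s+(?:the |a |an )?([^,.;]+)", snippet, re.IGNORECASE)
        if match:
            triples.append(Triple(disease, "causedBy", match.group(1).strip()))
    return triples


print(build_queries("malaria"))
print(extract_from_snippet(
    "malaria",
    "Malaria is caused by Plasmodium parasites, which are spread by mosquitoes.",
))
```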

  25. Some approaches (cont’)
     • [SNR08] focuses on discovering causal relationships between a disease and a biological entity
     • Application: augmenting ontologies
     • Purpose: given a disease → discover the likely causes of this disease

  26. Some approaches (cont’) Approaches summary and evaluation

  27. Some approaches (cont’)
     What about using machine learning?
     • Using CRF [BDS08]:
       • Extracts both the existence of a relation and its type
       • Uses two types of CRF
     • Using kernel-based learning [LZL08]:
       • Relation detection: a binary classification of true and false relations
       • Relation classification: a 4-class classification of the four relation types
     [BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207, doi:10.1186/1471-2105-9-207.
     [LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008.
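
The detection-then-classification split can be illustrated with a small pipeline. This is only a structural sketch: the two functions below are keyword-based stand-ins for trained models, and the cue phrases and labels are hypothetical, chosen from the relation kinds listed earlier in the talk rather than from the four types used in [LZL08].

```python
from typing import Callable, Optional


def make_pipeline(detect: Callable[[str], bool], classify: Callable[[str], str]):
    """Chain a binary relation detector with a multi-class relation classifier."""
    def predict(sentence: str) -> Optional[str]:
        if not detect(sentence):
            return None  # stage 1: no relation present
        return classify(sentence)  # stage 2: which relation it is
    return predict


# Stand-ins for trained models; a real system would learn these from data.
CUES = {"causes": "cause", "treats": "treat", "is a kind of": "hyponymy"}


def toy_detect(sentence):
    return any(cue in sentence for cue in CUES)


def toy_classify(sentence):
    for cue, label in CUES.items():
        if cue in sentence:
            return label
    return "other"


predict = make_pipeline(toy_detect, toy_classify)
print(predict("Aspirin treats headache"))
print(predict("The patient went home"))
```

Keeping the two stages separate lets each be trained and evaluated on its own, at the cost of detection errors propagating into classification.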

  28. Discussion and Proposal
     Challenges
     • Language complexity
       • Requires good pre-processing (POS-tagging, chunking, NER, etc.)
       • …
     • Techniques designed for extracting relations from general text may not be suitable for the biomedical domain
       • Lack of tools and data
       • …
     • It is unlikely that the extracted relations will match the structure of the ontology

  29. Discussion and Proposal
     Challenges
     • Modifiers: the inclusion of an adjective modifier in a term. For example: "acute headache" and "headache"; "mental retardation"
     • Granularity: terms are nearly always used synonymously but have slight differences in their meaning. For example: "HIV-1" is the most common strain of "HIV", but "HIV-2" is less easily transmitted and mostly confined to a small area of West Africa
     • Property: two terms refer to the same thing but with a slightly different property. For example: "dengue shock syndrome" is a late-stage development of "dengue fever"

  30. Discussion and Proposal (cont’)
     Compromises
     • Determine the type of relationship or not
     • Binary classification or multi-label classification
     • 1 or 2 classifiers
     • Pattern-based extraction, distributional clustering or term variation
     • Using machine learning or not
     • …

  31. Discussion and Proposal (cont’)
     Proposal
     • Only deal with intra-sentence relations!
     • 2 classifiers
     • Pattern-based extraction and term variation
     • Semi-supervised learning
     • There is still no strong definition of, or training resource for, phenotype and disease → need to work on this using available resources such as the Human Phenotype Ontology and the CALBC data set from the EBI shared task 2011

  32. Discussion and Proposal (cont’) What about the model?

  33. Conclusion & Future Work
     • Purpose: hyponymy, synonymy and causal relation extraction for phenotype descriptions, disease names, gene names and chemical names
     • Improve the method (using semantic patterns and term variation, bootstrapping techniques, etc.)
     • Explore data and ontology
     • Review “linking to ontology”
     • Propose a model
     • Try to use other available resources

  34. References
     [LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.
     [TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. IALP 2010: 170-173, Harbin, Heilongjiang, China, December 28-30, 2010.
     [Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctoral thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies, September 2009.
     [BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207, doi:10.1186/1471-2105-9-207.
     [Gir08] Girju R. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.
     [LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008.
     [SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3, March 2008.
     [Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
     [Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.
     [Bio] http://biocaster.org

  35. Thank you for your attention!
