Semantic Relation Extraction for Linking Named Entities to Biomedical Databases

Presenter: Lê Hoàng Quỳnh, Knowledge of Technology Laboratory, UET, VNU. Hanoi, February 18th, 2012. lhquynh@gmail.com

Main contents • Motivation and purpose • Some approaches: the pros and cons • Discussion and Proposal • Conclusion

Motivation and purpose “… developing a state of the art named entities tagger for full open source biomedical texts …” • Deploying various named entity recognizers to see which works best • Linking each named entity to its appropriate identifier in public databases

Motivation and purpose (cont’) Which named entities do we focus on? • Phenotype descriptions • Disease names • Gene names • Chemical names

Motivation and purpose (cont’) Ontology = • Concept/Class • Term/Individual • Relation/Property

Motivation and purpose (cont’) The Biocaster Multilingual Ontology biocaster.org

Motivation and purpose (cont’) • How to link the named entities to unique identifiers in a biomedical database? • What is the difference between “linking” and “filling”? • Method? • Clustering • Semantic relation extraction [LTB11] • … [LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.

Motivation and purpose (cont’) Semantic relation extraction • Relation extraction is the task of extracting the underlying relations between two terms, as expressed by words or phrases [Gir08] • Due to the unique patterns of biomedical relations, techniques designed for extracting relations from general text may not be suitable for the biomedical domain [Gir08] Girju R, “Semantic relation extraction and its applications”, ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008

Motivation and purpose (cont’) Which kinds of semantic relations do we focus on? • Hyponymy • Synonymy • Cause/effect • Indicate/hasSymptom • Treat • … • Entity: • Phenotype descriptions • Disease names • Gene names • Chemical names

Some approaches Three groups of existing methods: • Pattern-based extraction relies on the occurrence of term pairs in the same contexts and uses the words in the context to identify the relation • Distributional clustering uses the contexts that terms occur in individually and attempts to group semantically related elements based on similarities of these contexts • Term variation is based on the form of the term and uses similarities between terms to identify which are semantically related

Some approaches (cont’) Distributional clustering: • Consider the contexts that a term tends to occur in, then apply clustering to work out which terms are most "similar" • With this methodology, classes of words that are similar in meaning can be found. For example, using the verb "fire" yields the following classes of nouns: • Gun, Missile, Weapon • Shot, Bullet, Rocket, Missile • Officer, Aide, Chief, Manager
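To make the idea concrete, here is a minimal sketch of distributional clustering on toy data. The (verb, noun) observations, the cosine similarity measure, the 0.5 threshold and the greedy grouping are all assumptions chosen for illustration; they are not taken from the work summarised on the slide, which may use a different clustering method.

```python
from collections import Counter
from math import sqrt

# Toy (verb, object-noun) observations, invented for illustration only.
OBSERVATIONS = [
    ("fire", "gun"), ("load", "gun"), ("aim", "gun"),
    ("fire", "missile"), ("load", "missile"), ("aim", "missile"),
    ("fire", "officer"), ("hire", "officer"), ("promote", "officer"),
    ("fire", "manager"), ("hire", "manager"), ("promote", "manager"),
]


def context_vectors(observations):
    """Map each noun to a Counter of the verbs it was observed with."""
    vectors = {}
    for verb, noun in observations:
        vectors.setdefault(noun, Counter())[verb] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy grouping: a noun joins the first cluster that contains a
    member whose context vector is at least `threshold` similar."""
    clusters = []
    for noun in sorted(vectors):
        for group in clusters:
            if any(cosine(vectors[noun], vectors[m]) >= threshold for m in group):
                group.append(noun)
                break
        else:
            clusters.append([noun])
    return clusters


print(cluster(context_vectors(OBSERVATIONS)))
```

All four nouns share the context "fire", yet the remaining contexts separate weapons from people. Note that the output says nothing about what relation holds inside each group.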

Some approaches (cont’) Distributional clustering: • Pros: • Distributional clustering does not require that the terms occur in the same sentence or even in the same document • Generally has a higher recall than pattern-based methods • Cons: • This method requires a mathematical approach to determine the clusters of terms which have a similar distribution of contexts • It is very difficult to work out the nature of the relationship between the terms from distributional clustering alone • Distributional clustering is not suitable for extracting specific relationships such as "X is a causal agent of Y"

Some approaches (cont’) Term variation: • Looking at the form of the actual term and using the similarity of the words in it to deduce if the terms are related. For example: "cancer of the mouth" and "mouth cancer" • Jacquemin [Jac99] defines three main ways that term variation occurs: • Syntactic Variations • Morpho-syntactic Variations • Semantic Variations [Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
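The "cancer of the mouth" / "mouth cancer" example can be illustrated with a very small sketch. It assumes a hand-picked stopword list and uses Jaccard overlap of the remaining word sets as the similarity measure; both are illustrative choices, not Jacquemin's method, and they only cover simple word-order variation.

```python
# Function words ignored when comparing terms (an assumed, minimal list).
STOPWORDS = {"of", "the", "a", "an", "in"}


def content_tokens(term):
    """Lower-case the term and keep only its non-stopword tokens."""
    return {tok for tok in term.lower().split() if tok not in STOPWORDS}


def jaccard(term_a, term_b):
    """Jaccard overlap of the content-word sets of two terms."""
    a, b = content_tokens(term_a), content_tokens(term_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(jaccard("cancer of the mouth", "mouth cancer"))
print(jaccard("mouth cancer", "lung cancer"))
```

The first pair shares all content words, so it scores as a likely variant; the second pair shares only the head word. Terms with no words in common always score zero, which is the limitation of the approach noted in the cons.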

Some approaches (cont’) Term variation: • Pros: • Often has very high precision • Strongest for finding if two terms are synonymous • Can prove useful for some other cases as well • Cons: • Cannot help to identify relationships between terms with no similarity

Some approaches (cont’) Pattern-based extraction involves finding the terms in the same sentence and in some "pattern" that is suggestive of a particular relation. • Hearst [Hea92] used patterns to extract terms that exhibit the hyponymy relation • Her approach involved noting that such terms often occur near each other in stereotypical patterns: "Some kinds of flu, such as bird flu are …" Pattern: noun phrase - "such as" - noun phrase, giving hyponym("bird flu", "flu") • Method for developing these patterns: • Decide on a lexical relationship • Collect a set of term pairs known to have this relationship and a corpus which contains these pairs • Find the places where these terms co-occur • Find commonalities and hypothesize a pattern • Use this pattern to find more term pairs and repeat the process [Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539-545, 1992.
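The "such as" pattern above can be sketched in a few lines. This sketch assumes that noun phrases have already been marked with square brackets by an upstream chunker (the bracket notation is a convention invented here), so the regular expression only has to recognise "NP such as NP".

```python
import re

# "[NP], such as [NP]": group 1 is the hypernym, group 2 the hyponym.
# The bracketed noun-phrase markup is an assumed pre-processing step.
SUCH_AS = re.compile(r"\[([^\]]+)\],?\s+such\s+as\s+\[([^\]]+)\]")


def extract_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(chunked_sentence)]


sentence = "Some kinds of [flu], such as [bird flu] are ..."
print(extract_hyponyms(sentence))  # [('bird flu', 'flu')]
```

Leaving noun-phrase detection to a separate step keeps the pattern itself trivial, but it also means extraction quality depends directly on how well term boundaries were identified.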

Some approaches (cont’) Pattern-based extraction • Pros: • Simple • Patterns have the advantage that they can be specialised for different relationships • Can be used for various languages • Cons: • The method is manual • There is no way to provide a strong comparison between the effectiveness of the different patterns, which may lead to the inclusion of a relatively "weak" pattern • It is not clear how to automatically generate patterns which are specific to a given relationship and domain • As patterns rely on finding the two terms in the same context, recall is limited, and ambiguity in the text can cause errors in the extractions • Problem of identifying the boundaries of the terms

Some approaches (cont’) McCrae’s approach [Mcc09] for synonymy and hyponymy relations • Starts with the most general pattern, that is, the pattern consisting of only wild cards • Develops a more specific pattern by replacing wild cards with terms from some corpus (full text chap. 3.1) [Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctoral thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies. September 2009

Some approaches (cont’) McCrae’s approach: • Problem of identifying a term’s boundary: entity = (NN|JJ|NNS|NNP|FW|NNPS|JJR)* (NN|NNS|NNP|NNPS) NN: A singular noun NNS: A plural noun NNP: A proper noun NNPS: A pluralised proper noun JJ: An adjective FW: A foreign word JJR: An adjective in comparative form
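A minimal sketch of how the entity pattern above could be applied to POS-tagged text. It assumes the input is already tokenised and tagged with Penn Treebank tags (the tags in the example are assigned by hand), and the function name `find_entities` is made up for this illustration.

```python
# Tags allowed anywhere in an entity, and tags allowed in its final position,
# following entity = (NN|JJ|NNS|NNP|FW|NNPS|JJR)* (NN|NNS|NNP|NNPS).
MODIFIER_TAGS = {"NN", "JJ", "NNS", "NNP", "FW", "NNPS", "JJR"}
HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def find_entities(tagged):
    """Return the longest token spans whose tag sequence matches the pattern.

    `tagged` is a list of (token, pos_tag) pairs.
    """
    entities = []
    i = 0
    while i < len(tagged):
        # Extend over a maximal run of allowed tags starting at i.
        j = i
        while j < len(tagged) and tagged[j][1] in MODIFIER_TAGS:
            j += 1
        # The span must end in a noun, so trim trailing non-head tags.
        end = j
        while end > i and tagged[end - 1][1] not in HEAD_TAGS:
            end -= 1
        if end > i:
            entities.append(" ".join(tok for tok, _ in tagged[i:end]))
        i = max(j, i + 1)
    return entities


tagged_sentence = [
    ("acute", "JJ"), ("headache", "NN"), ("is", "VBZ"),
    ("a", "DT"), ("common", "JJ"), ("symptom", "NN"),
]
print(find_entities(tagged_sentence))  # ['acute headache', 'common symptom']
```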

Some approaches (cont’) McCrae’s approach: • Covering every possible variation of the patterns makes the search space far too large to be tractable • It is necessary to find a way to cover this search space more efficiently: • prioritizing "better" patterns • skipping those patterns which are too similar to existing patterns

Some approaches (cont’) McCrae’s approach:
• Rule definition: Pattern: *1 * such as *2; Rule: :- name() words(1,1) "such" "as" name()
• Simplifying the rules
• Match-set (Chap. 3.2.1 in full text)
  :- words(1,2) name() words(0,1) words(2,3) "literal" name()
  Simplified form: :- words(1,1) name() words(2,4) "literal" name()
• Join-set and alignment (Chap. 3.2.2 in full text)
  :- "a" name() "b" "c" "d" name()
  :- words(,1) name() words(2,3) "c" name()
  Alignment on these rules: {(2, 2), (4, 4), (6, 5)}
  The alignment-to-join conversion: :- words(,1) name() words(2,3) "c" words(0,1) name() words(0,0)
  Simplified form: :- name() words(2,3) "c" words(0,1) name()
• Classification
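Part of the rule simplification can be sketched as follows. The tuple representation of rule elements and the function names are assumptions made for this illustration. The sketch covers only two steps visible in the examples: merging adjacent words(a,b) wild cards by adding their bounds, and dropping empty words(0,0) elements. It does not reproduce the treatment of the leading wild card shown in the simplified forms.

```python
def simplify(rule):
    """Merge adjacent ("words", lo, hi) elements and drop ("words", 0, 0).

    A rule is a list of tuples: ("words", lo, hi), ("name",) or ("lit", text).
    """
    merged = []
    for elem in rule:
        if elem[0] == "words":
            _, lo, hi = elem
            if merged and merged[-1][0] == "words":
                # Two consecutive wild cards match lo1+lo2 .. hi1+hi2 words.
                _, prev_lo, prev_hi = merged.pop()
                lo, hi = prev_lo + lo, prev_hi + hi
            merged.append(("words", lo, hi))
        else:
            merged.append(elem)
    return [e for e in merged if e != ("words", 0, 0)]


def render(rule):
    """Print a rule in the ':- ...' notation used on the slide."""
    parts = []
    for elem in rule:
        if elem[0] == "words":
            parts.append(f"words({elem[1]},{elem[2]})")
        elif elem[0] == "name":
            parts.append("name()")
        else:
            parts.append(f'"{elem[1]}"')
    return ":- " + " ".join(parts)


rule = [
    ("words", 1, 2), ("name",), ("words", 0, 1), ("words", 2, 3),
    ("lit", "literal"), ("name",),
]
print(render(simplify(rule)))
```

Here words(0,1) followed by words(2,3) becomes words(2,4), as in the match-set example.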

Some approaches (cont’) McCrae’s approach: Results

Some approaches (cont’) Approach by utilizing the Web [SNR08] [TNN10] • RDF describes a Semantic Web using RDF statements, which are triples of the form <Subject, Property, Object> • Query the search engines with lexico-syntactic patterns to retrieve relevant information • The "seed" patterns are initially handcrafted but can be progressively learnt • Extract relations from snippets [SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3. March 2008. [TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le (2010). "Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations", IALP 2010: 170-173, Harbin, Heilongjiang, China; December 28-30, 2010

Some approaches (cont’) • [SNR08] focuses on discovering causal relationships between a disease and a biological entity • Application: augmenting ontologies • Purpose: given a disease, discover the likely causes of this disease
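The query-then-extract loop can be sketched without calling any real search engine. Everything below is hypothetical: the seed pattern, the property name `causedBy`, the function names, and the rule that takes the single word after the pattern as the object. The snippet is supplied by hand where a real system would use search results.

```python
import re

# One handcrafted seed pattern per property (assumed for illustration).
SEED_PATTERNS = {"causedBy": "{disease} is caused by"}


def build_queries(disease):
    """Turn each seed pattern into a quoted phrase query for a search engine."""
    return {
        prop: f'"{template.format(disease=disease)}"'
        for prop, template in SEED_PATTERNS.items()
    }


def extract_triples(disease, snippets):
    """Extract <Subject, Property, Object> triples from text snippets.

    The object is simply the first word after the pattern, skipping a
    leading determiner; real term boundaries need proper chunking.
    """
    triples = []
    for prop, template in SEED_PATTERNS.items():
        phrase = re.escape(template.format(disease=disease))
        regex = re.compile(phrase + r"\s+(?:the\s+|an?\s+)?([\w-]+)", re.IGNORECASE)
        for snippet in snippets:
            for match in regex.finditer(snippet):
                triples.append((disease, prop, match.group(1)))
    return triples


print(build_queries("malaria"))
print(extract_triples("malaria", ["Malaria is caused by Plasmodium parasites."]))
```

New patterns learnt from the extracted pairs could be added to `SEED_PATTERNS` to continue the bootstrapping described above.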

Some approaches (cont’) Approaches summary and evaluation

Some approaches (cont’) What about using machine learning? • Using CRF [BDS08]: • Extracts both the existence of a relation and its type • Uses two types of CRF • Using kernel-based learning [LZL08]: • Relation detection: a binary classification of true and false relations • Relation classification: a 4-class classification of the four relation types [BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207 doi:10.1186/1471-2105-9-207 [LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008

Discussion and Proposal Challenges • Language complexity • Requires good pre-processing (POS-tagging, chunking, NER, etc.) • … • Techniques designed for extracting relations from general text may not be suitable for the biomedical domain • Lack of tools, data • … • It is unlikely that the extracted relations will match the structure of the ontology

Discussion and Proposal Challenges • Modifiers: The inclusion of an adjective modifier in a term. For example: "acute headache" and "headache"; "mental retardation" • Granularity: Terms are nearly always used synonymously but have slight differences in their meaning. For example: "HIV-1" is the most common strain of "HIV", but "HIV-2" is less easily transmitted and mostly confined to a small area of West Africa • Property: Two terms refer to the same thing but with a slightly different property. For example: "dengue shock syndrome" is a late stage development of "dengue fever"

Discussion and Proposal (cont’) Compromises • Identify the type of relationship, or only detect whether a relationship exists • Binary classification or multi-label classification • 1 or 2 classifiers • Pattern-based extraction, distributional clustering or term variation • Using machine learning or not • …

Discussion and Proposal (cont’) Proposal • Only deal with intra-sentence relations • 2 classifiers • Pattern-based extraction and term variation • Semi-supervised learning • There is still no strong definition of, or training resource for, phenotype and disease, so we need to work on this using available resources such as the Human Phenotype Ontology and the CALBC data set from the EBI shared task 2011

Discussion and Proposal (cont’) What about the model?

Conclusion & Future Work • Purpose: Hyponymy, synonymy and causal relation extraction for phenotype descriptions, disease names, gene names and chemical names • Improve the method (using semantic patterns and term variation, bootstrapping techniques, etc.) • Explore data and ontology • Review "linking to ontology" • Propose a model • Try to use other available resources

References
[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.
[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le (2010). "Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations", IALP 2010: 170-173, Harbin, Heilongjiang, China; December 28-30, 2010
[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctoral thesis, Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies. September 2009
[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207 doi:10.1186/1471-2105-9-207
[Gir08] Girju R, “Semantic relation extraction and its applications”, ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008
[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008
[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3. March 2008.
[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348, 1999.
[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539-545, 1992.
[Bio] http://biocaster.org

Thank you for your attention!
