Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts G. Melli, M. Ester, A. Sarkar Dec. 6, 2007 http://www.gabormelli.com/2007/2007_MultiNaryBio_Melli_Presentation.ppt
Introduction • We propose a method for detecting n-ary relations that may span multiple sentences. • The motivation is to support the semi-automated population of subcellular localizations in db.psort.org. • Organism / Protein / Location • We cast each document as a text graph and use machine learning to detect relation patterns in the graph.
Here is the relevant passage • True relation case: (V. cholerae, TcpC, outer membrane) • Current algorithms are restricted to the detection of binary relations within one sentence, e.g. (TcpC, outer membrane).
Challenge #1 • A significant number of the relation cases (~40%) span multiple sentences. • Proposed solution: • Create a text graph for the entire document • The graph can contain a superset of the information used by the current single-sentence binary-relation approaches (Jiang and Zhai, 2007; Zhou et al., 2007).
[Figure: document text graph with entity mentions labelled ORG, PROT, and LOC (e.g. “pilus”)] • Automated Markup • Syntactic analysis • End-of-sentence detection • Part-of-speech tagging • Parse tree • Semantic analysis • Named-entity recognition • Coreference resolution
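The text-graph idea from the two slides above can be sketched concretely. Below is a minimal illustration (not the exact TeGRR graph construction), assuming tokenization, named-entity recognition, and coreference resolution have already been run upstream; the node and edge labels are placeholders:

```python
# Minimal sketch of a document-level text graph (illustrative only).
# Nodes are tokens; edges link adjacent tokens, connect consecutive
# sentences, and join coreferent mentions, so a path between two entity
# mentions can cross sentence boundaries.
import networkx as nx

def build_text_graph(sentences, coref_pairs):
    """sentences: list of token lists; coref_pairs: list of ((sent, tok), (sent, tok))."""
    g = nx.Graph()
    for s, tokens in enumerate(sentences):
        for t, token in enumerate(tokens):
            g.add_node((s, t), text=token)
            if t > 0:                                   # adjacency within a sentence
                g.add_edge((s, t - 1), (s, t), kind="adjacent")
        if s > 0:                                       # link consecutive sentences
            g.add_edge((s - 1, len(sentences[s - 1]) - 1), (s, 0), kind="sentence_boundary")
    for a, b in coref_pairs:                            # coreference edges
        g.add_edge(a, b, kind="coref")
    return g

sents = [["TcpC", "is", "located", "in", "the", "outer", "membrane", "."],
         ["The", "protein", "is", "expressed", "by", "V.", "cholerae", "."]]
g = build_text_graph(sents, coref_pairs=[((0, 0), (1, 1))])   # "TcpC" ~ "the protein"
print(nx.shortest_path(g, (1, 6), (0, 6)))                    # cholerae -> ... -> membrane
```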
Challenge #2 • An n-ary Relation • The task involves three entity mentions: Organism, Protein, Subcellular Location. • Current approaches are designed for detecting mentions with two entities. • Proposed solution • Create a feature vector that contains the information for all three pairings (see the sketch below).
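A hedged sketch of this idea: one case vector is assembled by concatenating pairwise feature maps for the (Organism, Protein), (Protein, Location), and (Organism, Location) pairings. The helper and feature names below are illustrative, not the paper's:

```python
def pair_features(doc, mention_a, mention_b):
    # Placeholder for whatever pairwise features a binary-relation
    # extractor would compute (path length, n-grams along the path, ...).
    return {"token_distance": abs(mention_a["start"] - mention_b["start"])}

def nary_case_vector(doc, org, prot, loc):
    """Concatenate pairwise features for the three entity pairings."""
    vector = {}
    pairings = {"org_prot": (org, prot), "prot_loc": (prot, loc), "org_loc": (org, loc)}
    for name, (a, b) in pairings.items():
        for feat, value in pair_features(doc, a, b).items():
            vector[f"{name}:{feat}"] = value            # prefix keeps the pairings distinct
    return vector

org, prot, loc = {"start": 14}, {"start": 0}, {"start": 5}
print(nary_case_vector(None, org, prot, loc))
```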
PPLRE v1.4 Data Set • 540 true and 4,769 false curated relation cases drawn from 843 research paper abstracts. • 267 of the 540 true relation cases (~49%) span multiple sentences. • Data available at koch.pathogenomics.ca/pplre/
Performance Results • Tested against two baselines that were tuned to this task: YSRL and Zparser. • TeGRR achieved the highest F-score (by significantly increasing the Recall). • Results are 5-fold cross-validated.
Research Directions • Actively grow the PSORTdb curated set • Qualifying the Certainty of a Case • E.g. label cases with: “experiment”, “hypothesized”, “assumed”, and “False”. • Ontology-constrained predictions • E.g. Gram-positive bacteria do not have a periplasm, so do not predict periplasm. • Application to other tasks
Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts G. Melli, M. Ester, A. Sarkar Dec. 6, 2007 http://www.gabormelli.com/2007/2007_MultiNaryBio_Melli_Presentation.ppt
Extra Slides for Questions
Shortened Reference List
M. Craven and J. Kumlien. (1999). Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proc. of the International Conference on Intelligent Systems for Molecular Biology.
K. Fundel, R. Kuffner, and R. Zimmer. (2007). RelEx -- Relation Extraction Using Dependency Parse Trees. Bioinformatics, 23(3).
J. Jiang and C. Zhai. (2007). A Systematic Exploration of the Feature Space for Relation Extraction. In Proc. of NAACL/HLT-2007.
Y. Liu, Z. Shi, and A. Sarkar. (2007). Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. In Proc. of NAACL/HLT-2007.
Z. Shi, A. Sarkar, and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In Proc. of NAACL/HLT-2007.
M. Skounakis, M. Craven, and S. Ray. (2003). Hierarchical Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2003.
M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction Using a Convolution Tree Kernel. In Proc. of NAACL/HLT-2006.
Relation Case Generation • Input: (D, R): a text document D and a set of semantic relations R, each with a arguments. • Output: (C): a set of unlabelled semantic relation cases. • Method (see the sketch below): • Identify all e entity mentions Ei in D. • Create every combination of a entity mentions from the e mentions in the document (without replacement). • For intrasentential semantic relation detection and classification tasks, limit the entity mentions to those from the same sentence. • For typed semantic relation detection and classification tasks, limit the combinations to those where the semantic class of each entity mention Ei matches the semantic class of its corresponding relation argument Ai.
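A minimal sketch of this procedure, assuming each mention carries a semantic class and a sentence index, and that the relation arguments declare the class they accept (the dictionary layout is an assumption for illustration):

```python
from itertools import permutations

def generate_cases(mentions, relation_args, same_sentence=False):
    """mentions: dicts with 'class' and 'sentence'; relation_args: ordered list of
    required argument classes.  Returns every typed, unlabelled relation case."""
    cases = []
    for combo in permutations(mentions, len(relation_args)):   # without replacement
        # typed task: each mention's class must match its argument slot
        if any(m["class"] != arg for m, arg in zip(combo, relation_args)):
            continue
        # intrasentential task: all mentions must come from the same sentence
        if same_sentence and len({m["sentence"] for m in combo}) > 1:
            continue
        cases.append(combo)
    return cases

mentions = [{"text": "V. cholerae", "class": "ORGANISM", "sentence": 1},
            {"text": "TcpC", "class": "PROTEIN", "sentence": 0},
            {"text": "outer membrane", "class": "LOCATION", "sentence": 0}]
print(generate_cases(mentions, ["ORGANISM", "PROTEIN", "LOCATION"]))
```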
Naïve Baseline Algorithms • Predict True: always predicts “True” regardless of the contents of the relation case. • Attains the maximum Recall achievable by any algorithm on the task. • Attains the maximum F1 of any naïve algorithm. • The most commonly used naïve baseline.
Prediction Outcome Labels • true positive (tp) • predicted to have the label True and whose label is indeed True . • false positive (fp) • predicted to have the label True but whose label is instead False . • true negative (tn) • predicted to have the label False and whose label is indeed False . • false negative (fn) • predicted to have the label False and whose label is instead True .
Performance Metrics • Precision (P): probability that a test case that is predicted to have label True is tp. • Recall (R): probability that a True test case will be tp. • F-measure (F1): Harmonic mean of the Precision and Recall estimates.
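These definitions map directly to code. A minimal sketch, using the Predict-True baseline from the earlier slide and the PPLRE class balance (540 true / 4,769 false cases) as the example:

```python
def precision_recall_f1(gold, predicted):
    """gold, predicted: parallel lists of booleans (True = relation holds)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Predict-True baseline on the PPLRE class balance (540 true, 4,769 false cases):
gold = [True] * 540 + [False] * 4769
predicted = [True] * len(gold)                 # always predict "True"
print(precision_recall_f1(gold, predicted))    # maximal recall, low precision
```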
Token-based Features: “Protein1 is a Location1 ...” • Token Distance • 2 intervening tokens • Token Sequence(s) • Unigrams • Bigrams
Token-based Features (cont.) • Token Part-of-Speech Role Sequences
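A sketch of these token-level features for a single entity pair, mirroring the “Protein1 is a Location1” example above; the feature names are illustrative:

```python
def token_features(tokens, pos_tags, start, end):
    """tokens / pos_tags: parallel lists for the sentence; start / end: indices of
    the two entity mentions.  Returns token distance, intervening unigrams and
    bigrams, and the part-of-speech sequence between the mentions."""
    between = tokens[start + 1:end]
    pos_between = pos_tags[start + 1:end]
    feats = {"token_distance": len(between)}             # e.g. 2 intervening tokens
    for u in between:                                     # unigram features
        feats[f"uni:{u}"] = 1
    for a, b in zip(between, between[1:]):                # bigram features
        feats[f"bi:{a}_{b}"] = 1
    feats["pos_seq"] = "_".join(pos_between)              # POS role sequence
    return feats

tokens = ["Protein1", "is", "a", "Location1"]
pos    = ["NNP", "VBZ", "DT", "NNP"]
print(token_features(tokens, pos, start=0, end=3))
# {'token_distance': 2, 'uni:is': 1, 'uni:a': 1, 'bi:is_a': 1, 'pos_seq': 'VBZ_DT'}
```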
Additional Features/Knowledge • Expose additional features that can identify the more esoteric ways of expressing a relation. • Features from outside of the “shortest path”. • Challenge: past open-ended attempts have reduced performance (Jiang and Zhai, 2007). • (Zhou et al., 2007) add heuristics for five common situations. • Use domain-specific background knowledge (see the sketch below). • E.g. Gram-positive bacteria (such as M. tuberculosis) do not have a periplasm, so do not predict periplasm.
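A hedged sketch of the ontology-constrained filtering idea; the organism-class table below is only an illustrative fragment of the background knowledge, not PSORTdb's actual ontology:

```python
# Illustrative fragment of background knowledge: compartments that do not
# exist for a given class of organism (not PSORTdb's actual ontology).
DISALLOWED_LOCATIONS = {
    "gram_positive": {"periplasm", "outer membrane"},
}

def ontology_filter(predictions, organism_class):
    """Drop predicted (organism, protein, location) triples whose location
    is impossible for the organism's class."""
    blocked = DISALLOWED_LOCATIONS.get(organism_class, set())
    return [p for p in predictions if p[2].lower() not in blocked]

preds = [("M. tuberculosis", "EsxA", "periplasm"),
         ("M. tuberculosis", "EsxA", "cytoplasm")]
print(ontology_filter(preds, "gram_positive"))   # the periplasm prediction is removed
```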
Challenge: Qualifying the Certainty of a Relation Case • It would be useful to qualify the certainty that can be assigned to a relation mention. • E.g. in the news domain, distinguish relation mentions based on first-hand information from those based on hearsay. • Idea: add an additional label to each relation case that qualifies the certainty of the statement. E.g. in the PPLRE task, label cases with: “directly validated”, “indirectly validated”, “hypothesized”, and “assumed”.