This research paper explores reranking approaches to improve the extraction of protein-protein interactions from biomedical literature. The authors propose a Hidden Vector State (HVS) model and conduct experiments to demonstrate the effectiveness of reranking techniques. The results show a 4% relative improvement in F-measure.
Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature
Deyu Zhou, Yulan He and Chee Keong Kwoh
School of Computer Engineering, Nanyang Technological University, Singapore
30 August 2007
Outline
• Protein-protein interactions (PPIs) extraction
• Hidden Vector State (HVS) model for PPIs extraction
• Reranking approaches
• Experimental results
• Conclusions
Protein-Protein Interactions Extraction
Example sentence: Spc97p interacts with Spc98 and Tub4 in the two-hybrid system
Extracted interactions (Protein, Interact, Protein):
• Spc97p interact Spc98
• Spc97p interact Tub4
Existing Approaches
• Parsing-based
• Pattern matching
• Statistical methods
Approaches range from simple to complicated.
Semantic Parser
For each candidate word string $W_n$, we need to compute the most likely set of embedded concepts:
$\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)$
where $P(C)$ is the semantic model and $P(W_n \mid C)$ is the lexical model.
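The decomposition above can be illustrated with a minimal sketch. The candidate concept sequences and every probability value below are hypothetical and invented purely for illustration; they do not come from the presentation. The sketch only shows that decoding amounts to picking the candidate that maximizes the product of the semantic and lexical terms (summed here in log space).

```python
import math

# Hypothetical toy values, invented for illustration only.
# SEMANTIC[c] plays the role of P(C); LEXICAL[c] plays the role of P(W_n | C)
# for one fixed, already-observed word string W_n.
SEMANTIC = {
    ("PROTEIN", "INTERACT", "PROTEIN"): 0.6,
    ("PROTEIN", "DUMMY", "PROTEIN"): 0.4,
}
LEXICAL = {
    ("PROTEIN", "INTERACT", "PROTEIN"): 0.05,
    ("PROTEIN", "DUMMY", "PROTEIN"): 0.01,
}


def decode(semantic, lexical):
    """Return the candidate C maximizing log P(C) + log P(W_n | C)."""
    def log_score(c):
        return math.log(semantic[c]) + math.log(lexical[c])
    return max(semantic, key=log_score)


best = decode(SEMANTIC, LEXICAL)
print(best)
```

In a real parser the candidate set is far too large to enumerate, which is why the following slides discuss model structures that allow efficient search.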
P(C) P(Wn|C): We could use a simple finite-state tagger. It can be robustly trained using EM, but the model is too weak to represent embeddings in natural language.
P(C) P(Wn|C): Perhaps use some form of hierarchical HMM in which each state is a terminal or a nested HMM. When using EM, however, such models rarely converge on good solutions and, in practice, direct maximum-likelihood estimation from "tree-bank" data is needed to train them.
P(C) P(Wn|C): The HVS model is an HMM in which the states correspond to the stack of a push-down automaton with a bounded stack size. This is a very convenient framework for applying constraints.
HVS model
Transition constraints:
• finite stack depth $D$
• push only one non-terminal semantic concept onto the stack at each step
$\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)$
The model is defined by three simple probability tables.
Parsing with the HVS model
1) Pop $n_t$ elements from the previous stack state (here $n_t = 1$): $P(n_t \mid C_{t-1})$
2) Push one pre-terminal semantic concept onto the stack: $P(C_t[1] \mid C_t[2..D_t])$
3) Generate the next word: $P(W_t \mid C_t)$
Stack states shown for the fragment "… with Spc98 and Tub4 …":
• INTERACT PROTEIN SS
• DUMMY INTERACT PROTEIN SS
• PROTEIN INTERACT PROTEIN SS
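The three parsing steps can be sketched as a single stack transition plus a product of three table lookups. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the tuple representation with the top of the stack at index 0, the depth bound of 4, the order in which the slide's stack states follow one another, and all probability values are hypothetical.

```python
import math


def hvs_step(stack, n_pop, concept, max_depth):
    """Apply one transition: pop n_pop elements, then push exactly one concept.

    The stack is a tuple whose index 0 is the top of the stack (an assumption).
    """
    if not 0 <= n_pop <= len(stack):
        raise ValueError("cannot pop more elements than the stack holds")
    new_stack = (concept,) + stack[n_pop:]
    if len(new_stack) > max_depth:
        raise ValueError("stack depth bound exceeded")
    return new_stack


def step_log_prob(prev, n_pop, new, word, p_pop, p_push, p_word):
    """Log of P(n_t | C_{t-1}) * P(C_t[1] | C_t[2..D_t]) * P(W_t | C_t)."""
    return (math.log(p_pop[(n_pop, prev)])
            + math.log(p_push[(new[0], new[1:])])
            + math.log(p_word[(word, new)]))


# Assumed reading of the slide: previous state, pop one element, push PROTEIN.
prev = ("DUMMY", "INTERACT", "PROTEIN", "SS")
new = hvs_step(prev, n_pop=1, concept="PROTEIN", max_depth=4)

# Hypothetical probability tables holding only the entries needed here.
p_pop = {(1, prev): 0.5}
p_push = {("PROTEIN", ("INTERACT", "PROTEIN", "SS")): 0.4}
p_word = {("Tub4", new): 0.1}

log_p = step_log_prob(prev, 1, new, "Tub4", p_pop, p_push, p_word)
print(new, math.exp(log_p))
```

Representing each state as an immutable tuple keeps states hashable, so they can serve directly as keys in the three probability tables.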
Train using EM and apply constraints
• Inputs: training text and its abstract semantic annotation
• Example training text: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system
• Example abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )
• Diagram components: constraints, data, parse statistics, EM parameter estimation, HVS model parameters
• Limit the forward-backward search to only include states which are consistent with the constraints
Reranking Methodology
• Reranking approaches attempt to improve upon an existing probabilistic parser by reranking the output of the parser.
• Reranking has benefited applications such as named-entity extraction, semantic parsing and semantic labeling.
• Here, reranking is applied to the parses generated by the HVS model for protein-protein interaction extraction.
Reranking approaches
• Features for reranking: suppose sentence $S_i$ has its corresponding parse set $C_i = \{C_{ij},\ j = 1, \ldots, N\}$
  • Parsing information
  • Structure information
  • Complexity information
Reranking approaches
The score of each candidate parse is defined using one of three models:
• Log-linear regression model
• Neural network
• Support vector machines
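As an illustration of the reranking idea, the sketch below scores each candidate parse of one sentence with a weighted sum of features, as a log-linear model would, and sorts the candidates by that score. The feature names, feature values and weights are all hypothetical; the slides name only the three feature groups (parsing, structure and complexity information), and in practice the weights would be learned from training data. A neural network or a support vector machine would take the place of the `score` function.

```python
def score(features, weights):
    """Linear score w . f(parse), the quantity a log-linear model ranks by."""
    return sum(weights[name] * value for name, value in features.items())


def rerank(candidates, weights):
    """Return the candidate parses sorted by score, best first."""
    return sorted(candidates,
                  key=lambda cand: score(cand["features"], weights),
                  reverse=True)


# Hypothetical N-best list for one sentence, in the parser's original order.
candidates = [
    {"parse": "A", "features": {"parse_log_prob": -10.0, "structure": 0.0, "complexity": 5.0}},
    {"parse": "B", "features": {"parse_log_prob": -11.0, "structure": 1.0, "complexity": 3.0}},
]
# Hypothetical weights, one per feature.
weights = {"parse_log_prob": 1.0, "structure": 2.0, "complexity": -0.5}

ranked = rerank(candidates, weights)
print([cand["parse"] for cand in ranked])
```

With these made-up numbers, parse B overtakes parse A even though the parser itself assigned A the higher probability, which is the effect reranking is meant to achieve.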
Experiments
• Setup
  • Corpus I
    • comprises 300 abstracts randomly retrieved from the GENIA corpus
    • GENIA is a collection of research abstracts selected from the search results of the MEDLINE database with the keywords (MeSH terms) "human, blood cells and transcription factors"
    • split into two parts:
      • Part I contains 1500 sentences (training data)
      • Part II consists of 1000 sentences (test data)
Experimental Results Figure 1: F-measure vs number of candidate parses.
Experimental Results (cont’d) Table 3: Results based on the interaction category.
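The results are reported as F-measure, and the improvement from reranking is stated as a relative gain. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall) and shows how a relative improvement differs from an absolute one. The numeric values are hypothetical placeholders, not figures from the presentation.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


# Hypothetical scores chosen only to make the arithmetic easy to follow.
baseline_f = f_measure(0.50, 0.50)   # 0.50
reranked_f = 0.52                    # placeholder reranked F-measure
gain = relative_improvement(baseline_f, reranked_f)
print(f"absolute gain: {reranked_f - baseline_f:.2f}, relative gain: {gain:.0%}")
```

Here an absolute gain of 0.02 on a baseline of 0.50 corresponds to a 4% relative gain.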
Conclusions
• Three reranking methods were presented for the HVS model in the application of extracting protein-protein interactions from biomedical literature.
• Experimental results show that a 4% relative improvement in F-measure can be obtained through reranking of the semantic parse results.
• Incorporating other semantic or syntactic information might give further gains.