Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech
BioText Project Goals • Provide flexible, intelligent access to information for use in biosciences applications. • Focus on • Textual Information from Journal Articles • Tightly integrated with other resources • Ontologies • Record-based databases
Project Team • Project Leaders: • PI: Marti Hearst • Co-PI: Adam Arkin • Computational Linguistics • Barbara Rosario (graduated) • Preslav Nakov • Database Research • Ariel Schwartz • Gaurav Bhalotia (graduated) • User Interface / IR • Rowena Luk • Dr. Emilia Stoica • Bioscience • Dr. TingTing Zhang • Janice Hamer Supported primarily by NSF DBI-0317510 and a gift from Genentech
BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface
The Nature of Bioscience Text • Claim: Bioscience semantics are simultaneously easier and harder than general text. • Easier: • Fewer subtleties • Fewer ambiguities • “Systematic” meanings • Harder: • Enormous terminology • Complex sentence structure
Two tasks • Relationship Extraction: • Identify the several semantic relations that can occur between two entities (in this case, protein names) in bioscience text. • Entity extraction: • Related problem: identify the entities
The Approach • Data: MEDLINE abstracts and titles • Graphical models • Combine in one framework both relation and entity extraction • Both static and dynamic models • Simple discriminative approach: • Neural network • Lexical, syntactic and semantic features
Protein-Protein interactions • Tasks: • Given sentences from a paper (Paper ID) and/or citation sentences that cite Paper ID • Predict the interaction type given in the HIV database for Paper ID • Extract the proteins involved • 10-way classification problem
Models (protein-protein interactions) • Dynamic graphical model • Naïve Bayes
Evaluation • Evaluation at document level • All (sentences from papers + citations) • Papers (only sentences from papers) • Citations (only citation sentences) • “Trigger word” approach • List of keywords (e.g., for inhibits: “inhibitor”, “inhibition”, “inhibit”, etc.) • If a keyword is present, assign the corresponding interaction
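The “trigger word” approach above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the `TRIGGERS` table, the function names, the substring matching, and the majority vote used to get a document-level label are all assumptions. Only the three keywords for “inhibits” come from the slide.

```python
from collections import Counter

# Hypothetical keyword table. Only the "inhibits" keywords appear on the slide;
# the "binds" entry is an invented example of a second interaction type.
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "binds": ["bind", "binding", "bound"],
}


def trigger_word_label(sentence, triggers=TRIGGERS):
    """Return the first interaction whose keyword occurs in the sentence, else None."""
    text = sentence.lower()
    for interaction, keywords in triggers.items():
        # Plain substring matching: "inhibit" also matches "inhibits", "inhibited".
        if any(keyword in text for keyword in keywords):
            return interaction
    return None


def label_document(sentences, triggers=TRIGGERS):
    """Assumed document-level rule: majority vote over the sentence labels."""
    labels = [trigger_word_label(s, triggers) for s in sentences]
    votes = Counter(label for label in labels if label is not None)
    return votes.most_common(1)[0][0] if votes else None
```

For example, `label_document(["Tat inhibits X.", "Inhibition of Y by Tat.", "Tat binds Z."])` assigns "inhibits", since two of the three sentences contain an “inhibits” keyword.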
Results • Accuracies on interaction classification (Roles hidden)
Results: confusion matrix For All. Overall accuracy: 60.5%
Hiding the protein names • Replaced protein names with tokens PROT_NAME • Selective CXCR4 antagonism by Tat • Selective PROT_NAME antagonism by PROT_NAME
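A minimal sketch of this masking step, assuming the protein names in a sentence are already known (for instance from the database record). The function name `mask_proteins` and the whole-word regular-expression matching are illustrative choices, not something stated in the slides.

```python
import re


def mask_proteins(sentence, protein_names, token="PROT_NAME"):
    """Replace every whole-word occurrence of a known protein name with a token."""
    if not protein_names:
        return sentence
    # Try longer names first so a short name cannot clobber part of a longer one.
    names = sorted(protein_names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    return re.sub(pattern, token, sentence)
```

Applied to the slide's example, `mask_proteins("Selective CXCR4 antagonism by Tat", ["Tat", "CXCR4"])` yields "Selective PROT_NAME antagonism by PROT_NAME".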
Protein extraction • (Protein name tagging, role extraction) • The identification of all the proteins present in the sentence that are involved in the interaction • These results suggest that Tat - induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex. • Tat might regulate the phosphorylation of the RNA polymerase II carboxyl - terminal domain in pre - initiation complexes by activating CDK7
Protein extraction: results No dictionary used
Conclusions of protein-protein interaction project • Encouraging results for the automatic classification of protein-protein interactions • Use of an existing database for gathering labeled data • Use of citations
BioScience Researchers • Read A LOT! • Cite A LOT! • Curate A LOT! • Are interested in specific relations, e.g.: • What is the role of this protein in that pathway? • Show me articles in which a comparison between two values is significant.
A discovery is made … A paper is written …
That paper is cited … and cited … and cited … … as the evidence for some fact(s) F.
Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!
Citances • Nearly every statement in a bioscience journal article is backed up with a cite. • It is quite common for papers to be cited 30-100 times. • The text around the citation tends to state biological facts. (Call these citances.) • Different citances will state the same facts in different ways … • … so can we use these for creating models of language expressing semantic relations?
Using Citances • Potential uses of citation sentences (citances) • creation of training and testing data for semantic analysis, • synonym set creation, • database curation, • document summarization, • and information retrieval generally. • Some preliminary results: • Citances to a document align well with a hand-built curation. • Citances are good candidates for paraphrase creation.
Issues for Processing Citances • Text span • Identification of the appropriate phrase, clause, or sentence that constructs a citance. • Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”). • Grouping citances by topic • Citances that cite the same document should be grouped by the facts they state. • Normalizing or paraphrasing citances • For IR, summarization, learning synonyms, relation extraction, question answering, and machine translation.
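One of the issues above, mapping grouped citations such as “[22-25]” back to individual references, can be illustrated with a small sketch. It assumes numeric bracketed markers with commas and hyphenated ranges; the function name `expand_citation_group` is hypothetical, and real journals use many other citation styles that this does not handle.

```python
import re


def expand_citation_group(marker):
    """Expand a marker such as "[22-25]" or "[3, 7-9]" into reference numbers."""
    numbers = []
    for part in marker.strip("[]").split(","):
        part = part.strip()
        if not part:
            continue
        # A range is two integers joined by a hyphen or an en dash.
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))
        else:
            # Raises ValueError for non-numeric markers (e.g. author-year styles).
            numbers.append(int(part))
    return numbers
```

With this, a citance ending in “[22-25]” would be attached to references 22, 23, 24, and 25 rather than to a single unresolved group.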
Sample Sentences • NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death. • Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis. • The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal. • In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.
Their Paraphrases • NGF withdrawal induces Bim. • Nerve growth factor withdrawal induces the expression of Bim. • Bim has been shown to be upregulated following nerve growth factor withdrawal. • Bim implicated in apoptosis caused by nerve growth factor deprivation. They all paraphrase: Bim is induced after NGF withdrawal.
Paraphrase Creation Algorithm 1. Extract the sentences that cite the target. 2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize. 3. Dependency parse (MiniPar). 4. For each parse, for each pair of NEs of interest: (i) extract the path between them; (ii) create a paraphrase from the path. 5. Rank the candidates for a given pair of NEs. 6. Select only the ones above a threshold. 7. Generalize.
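Step 4 of the algorithm above (extracting the path between two NEs in a dependency parse and turning it into a paraphrase) might look roughly like this. The sketch does not call MiniPar; it assumes the parse is already available as a hypothetical `heads` dictionary mapping each token to its head (with the root mapped to `None`), that the parse is a tree, and that tokens are unique strings. Joining the path tokens in path order is a simplification of paraphrase creation.

```python
def path_to_root(token, heads):
    """Follow head links from a token up to the root of the parse tree."""
    path = [token]
    while heads.get(token) is not None:
        token = heads[token]
        path.append(token)
    return path


def dependency_path(a, b, heads):
    """Return the tokens on the tree path from a to b, or None if unconnected."""
    up_a = path_to_root(a, heads)
    up_b = path_to_root(b, heads)
    ancestors_of_b = set(up_b)
    for i, node in enumerate(up_a):
        if node in ancestors_of_b:
            # node is the lowest common ancestor: go up from a, then down to b.
            j = up_b.index(node)
            return up_a[: i + 1] + up_b[:j][::-1]
    return None


def paraphrase_from_path(path):
    """Simplified paraphrase: the path tokens joined in path order."""
    return " ".join(path)


# Toy parse of "NGF withdrawal induces Bim" (hand-written, not MiniPar output).
heads = {"NGF": "withdrawal", "withdrawal": "induces", "Bim": "induces", "induces": None}
path = dependency_path("NGF", "Bim", heads)
print(paraphrase_from_path(path))
```

Ranking, thresholding, and generalization (steps 5 to 7) are not covered here, since the slides do not say how they are computed.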
Relevant Papers • Citances: Citation Sentences for Semantic Analysis of Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti Hearst, in the SIGIR'04 workshop on Search and Discovery in Bioinformatics. • Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti Hearst, in ACL 2004. • The Descent of Hierarchy, and Selection in Relational Semantics, Barbara Rosario, Marti Hearst, and Charles Fillmore, in ACL 2002.
Thank you! Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu