Semantic Relation Detection in Bioscience Text


Presentation Transcript


  1. Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech

  2. BioText Project Goals • Provide flexible, intelligent access to information for use in biosciences applications. • Focus on • Textual Information from Journal Articles • Tightly integrated with other resources • Ontologies • Record-based databases

  3. Project Team • Project Leaders: • PI: Marti Hearst • Co-PI: Adam Arkin • Computational Linguistics • Barbara Rosario (graduated) • Preslav Nakov • Database Research • Ariel Schwartz • Gaurav Bhalotia (graduated) • User Interface / IR • Rowena Luk • Dr. Emilia Stoica • Bioscience • Dr. TingTing Zhang • Janice Hamer • Supported primarily by NSF DBI-0317510 and a gift from Genentech

  4. BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

  5. The Nature of Bioscience Text • Claim: Bioscience semantics are simultaneously easier and harder than general text. • Easier: fewer subtleties, fewer ambiguities, “systematic” meanings • Harder: enormous terminology, complex sentence structure

  6. Entity-Entity Relation Recognition

  7. Two tasks • Relationship Extraction: • Identify the several semantic relations that can occur between two entities (in this case, protein names) in bioscience text. • Entity extraction: • Related problem: identify the entities

  8. The Approach • Data: MEDLINE abstracts and titles • Graphical models • Combine in one framework both relation and entity extraction • Both static and dynamic models • Simple discriminative approach: • Neural network • Lexical, syntactic and semantic features

  9. Protein-Protein interactions • Tasks: • Given sentences from Paper ID, and/or citation sentences to ID • Predict the interaction type given in the HIV database for Paper ID • Extract the proteins involved • 10-way classification problem

  10. Models for Protein-Protein Interactions • Dynamic graphical model • Naïve Bayes
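
For readers unfamiliar with the Naïve Bayes model named on this slide, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It is not the project's model: the slides do not specify the features or the smoothing used, so the bag-of-words tokens, the `NaiveBayes` class, and the `alpha` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bags of tokens (illustrative sketch only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant (assumed, not from the slides)

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one interaction label per doc
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in tokens:
                if w in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((counts[w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
model = NaiveBayes().fit(
    [["tat", "inhibits", "cdk9"], ["tat", "binds", "cxcr4"]],
    ["inhibits", "binds"],
)
prediction = model.predict(["protein", "inhibits"])
```

With ten interaction types, as in the task on the previous slide, the same code applies unchanged; only the label set grows.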

  11. Graphical Models

  12. Evaluation • Evaluation at document level • All (sentences from papers + citations) • Papers (only sentences from papers) • Citations (only citation sentences) • “Trigger word” approach (baseline) • List of keywords (e.g., for inhibits: “inhibitor”, “inhibition”, “inhibit”, etc.) • If a keyword is present: assign the corresponding interaction
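
The “trigger word” approach on this slide can be sketched as a simple keyword lookup. This is a minimal illustration, not the evaluation code used in the project: only the keywords for “inhibits” come from the slide, while the `binds` entry, the prefix matching, and the function name `trigger_word_label` are assumptions.

```python
import re

# Only the "inhibits" keywords appear on the slide; "binds" is a made-up example entry.
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "binds": ["bind"],
}


def trigger_word_label(sentence, triggers=TRIGGERS):
    """Return the first interaction with a keyword in the sentence, or None."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for interaction, keywords in triggers.items():
        for token in tokens:
            # Prefix match so that "inhibits" and "inhibited" also fire (an assumption).
            if any(token.startswith(k) for k in keywords):
                return interaction
    return None


label = trigger_word_label("Tat inhibits CDK9 activity.")
```

A sentence containing no keyword gets no label, which is one reason such a baseline tends to have limited coverage.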

  13. Results • Accuracies on interaction classification (Roles hidden)

  14. Results: confusion matrix For All. Overall accuracy: 60.5%

  15. Hiding the protein names • Replaced protein names with tokens PROT_NAME • Selective CXCR4 antagonism by Tat • Selective PROT_NAME antagonism by PROT_NAME
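
The replacement step on this slide can be sketched as follows, assuming the protein names in each sentence are already known (for example from the database entry for the paper). The function name `mask_protein_names` and the whole-word, case-sensitive matching are assumptions; the slides do not say how the substitution was implemented.

```python
import re


def mask_protein_names(sentence, protein_names, token="PROT_NAME"):
    """Replace each known protein name in the sentence with a placeholder token."""
    if not protein_names:
        return sentence
    # Try longer names first so a short name cannot split a longer one.
    names = sorted(protein_names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(token, sentence)


masked = mask_protein_names("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
```

Matching is case-sensitive and on word boundaries, so the protein name “Tat” is not confused with the word “that”.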

  16. Results with no protein names

  17. Protein extraction • (Protein name tagging, role extraction) • The identification of all the proteins present in the sentence that are involved in the interaction • These results suggest that Tat - induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex. • Tat might regulate the phosphorylation of the RNA polymerase II carboxyl - terminal domain in pre - initiation complexes by activating CDK7

  18. Protein extraction: results No dictionary used

  19. Conclusions of protein-protein interaction project • Encouraging results for the automatic classification of protein-protein interactions • Use of an existing database for gathering labeled data • Use of citations

  20. Acquiring Labeled Data using Citances

  21. BioScience Researchers • Read A LOT! • Cite A LOT! • Curate A LOT! • Are interested in specific relations, e.g.: • What is the role of this protein in that pathway? • Show me articles in which a comparison between two values is significant.

  22. Acquiring Labeled Data using Citances

  23. A discovery is made … A paper is written …

  24. That paper is cited … and cited … and cited … … as the evidence for some fact(s) F.

  25. Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!

  26. Citances • Nearly every statement in a bioscience journal article is backed up with a cite. • It is quite common for papers to be cited 30-100 times. • The text around the citation tends to state biological facts. (Call these citances.) • Different citances will state the same facts in different ways … • … so can we use these for creating models of language expressing semantic relations?

  27. Using Citances • Potential uses of citation sentences (citances) • creation of training and testing data for semantic analysis, • synonym set creation, • database curation, • document summarization, • and information retrieval generally. • Some preliminary results: • Citances to a document align well with a hand-built curation. • Citances are good candidates for paraphrase creation.

  28. Issues for Processing Citances • Text span • Identification of the appropriate phrase, clause, or sentence that constructs a citance. • Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”). • Grouping citances by topic • Citances that cite the same document should be grouped by the facts they state. • Normalizing or paraphrasing citances • For IR, summarization, learning synonyms, relation extraction, question answering, and machine translation.
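
The mapping of grouped citations such as “[22-25]” to individual references can be sketched as below. This is a minimal illustration under the assumption that markers are bracketed, comma-separated numbers or numeric ranges; real articles use many other citation styles, and `expand_citation_group` is a hypothetical helper name.

```python
import re


def expand_citation_group(marker):
    """Expand a marker such as "[22-25]" or "[3, 7-9]" into reference numbers."""
    inner = marker.strip().strip("[]")
    numbers = []
    for part in inner.split(","):
        part = part.strip()
        if not part:
            continue
        # A range is two integers joined by a hyphen or an en dash.
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(int(part))
    return numbers


refs = expand_citation_group("[22-25]")
```

Each expanded number can then be linked to the cited document, so that the surrounding sentence counts as a citance for all four references.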

  29. Early results: Paraphrase Creation from Citances

  30. Sample Sentences • NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death. • Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis. • The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal. • In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.

  31. Their Paraphrases • NGF withdrawal induces Bim. • Nerve growth factor withdrawal induces the expression of Bim. • Bim has been shown to be upregulated following nerve growth factor withdrawal. • Bim implicated in apoptosis caused by nerve growth factor deprivation. • They all paraphrase: Bim is induced after NGF withdrawal.

  32. Paraphrase Creation Algorithm • 1. Extract the sentences that cite the target. • 2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize. • 3. Dependency parse (MiniPar). • 4. For each parse, and for each pair of NEs of interest: (i) extract the path between them; (ii) create a paraphrase from the path. • 5. Rank the candidates for a given pair of NEs. • 6. Select only the ones above a threshold. • 7. Generalize.
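
Step 4 of the algorithm (extract the dependency path between two NEs, then create a paraphrase from it) can be sketched as below. This is an illustration under several assumptions: the parse is written by hand as a map from each token index to its head rather than produced by MiniPar, and the paraphrase is formed simply by reading the words on the path in sentence order, which may differ from what the authors did. The function names are hypothetical.

```python
def path_to_root(node, heads):
    """Return the token indices from `node` up to the root of the parse."""
    path = [node]
    while heads.get(node) is not None:
        node = heads[node]
        path.append(node)
    return path


def dependency_path(a, b, heads):
    """Token indices on the tree path from a to b, via their lowest common ancestor."""
    up = path_to_root(a, heads)
    down = path_to_root(b, heads)
    position_in_down = {node: i for i, node in enumerate(down)}
    for i, node in enumerate(up):
        if node in position_in_down:
            return up[: i + 1] + down[: position_in_down[node]][::-1]
    return None  # the two tokens are not in the same tree


def paraphrase_from_path(path, words):
    """Read the words on the path in their original sentence order."""
    return " ".join(words[i] for i in sorted(path))


# First sample sentence (truncated), with a hand-made parse: token index -> head index.
words = ["NGF", "withdrawal", "from", "sympathetic", "neurons", "induces", "Bim"]
heads = {0: 1, 1: 5, 2: 1, 3: 4, 4: 2, 5: None, 6: 5}

path = dependency_path(0, 6, heads)  # from "NGF" to "Bim"
paraphrase = paraphrase_from_path(path, words)
```

On this toy parse the path skips the modifier “from sympathetic neurons”, which is how a shorter paraphrase can fall out of a longer citance.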

  33. Relevant Papers • Citances: Citation Sentences for Semantic Analysis of Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti Hearst, in the SIGIR'04 workshop on Search and Discovery in Bioinformatics.   • Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti Hearst, in ACL 2004.   • The Descent of Hierarchy, and Selection in Relational Semantics, Barbara Rosario, Marti Hearst, and Charles Fillmore, in ACL 2002.

  34. Thank you! Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu
