Citances: Citation Sentences for Semantic Analysis of Bioscience Text. Preslav I. Nakov, Ariel S. Schwartz, and Marti A. Hearst. Computer Science Division and SIMS, University of California, Berkeley (http://biotext.berkeley.edu). Supported by NSF DBI-0317510 and a gift from Genentech.
Overview • We propose the use of the text of the sentences surrounding citations as an important tool for semantic interpretation of bioscience text. • We hypothesize several different uses of citation sentences (which we call citances), including • the creation of training and testing data for semantic analysis (especially for entity and relation recognition), • synonym set creation, • database curation, • document summarization, • and information retrieval generally. • We illustrate some of these ideas, showing that the citances pointing to one document in particular align well with what a human curator extracted by hand. • We also show preliminary results on the problem of normalizing the different ways that the same concepts are expressed within a set of citances, using and improving on existing techniques in automatic paraphrase generation.
Motivation for using Citances in Bioscience Text • We are interested in utilizing the large volume of available bioscience text when designing information extraction and retrieval tools. • While the amount of available text is growing rapidly, only a few small annotated corpora exist for the bioscience domain. • Full text (as opposed to abstracts) is becoming more available, providing new opportunities for automatic text processing. • Citances provide an opportunity for coping with the scarcity of annotated data. They essentially provide a semi-annotated corpus for free.
The Nature of Citances in Bioscience Literature • Citations are particularly abundant in the biosciences. • Nearly every statement is backed up with at least one citation. • It is quite common for papers in the bioscience domain to be cited by 30-100 other papers. • Citances tend to state known biological facts with reference to the original papers that discovered them. • The cited facts are typically stated more concisely in the citing papers than in the original papers. • Because the same facts are repeatedly stated in different ways in different papers, statistical models can be trained on existing citances to identify similar facts in unseen text.
Examples of Citances “The genetic data presented here clearly show that the Eiger-induced small eye phenotype depends strongly on the JNK signaling pathway. In mammals, it has been demonstrated that the JNK pathway is essential for the execution of stress-induced cell death. JNK3, a JNK isoform that is selectively expressed in the nervous system, is required for neuronal cell death caused by excitotoxic stress (Yang et al., 1997). Embryonic fibroblasts from mouse deficient for both JNK1 and JNK2 are resistant to UV-stimulated apoptosis (Tournier et al., 2000). Whitfield et al. (2001) have shown that Bim acts downstream of the JNK pathway in NGF-deprivation-induced neuronal cell death. One possible downstream mechanism of the JNK pathway to induce cell death may be transcriptional upregulation of Bim. However, our results suggest the possibility that Eiger-induced cell death signaling may be independent of downstream jun expression, similar to the observation that the effect of UV to cause cell death does not require new gene expression (Tournier et al., 2000). The JNK signaling also mediates heat shock-induced cell death, the execution of which is caspase independent (Gabai et al., 2000). Furthermore, overexpression of the EDA receptor or TAJ/TROY, a member of the TNF receptor superfamily that exhibits extensive homology to the EDA receptor, results in the activation of the JNK pathway and caspase-independent cell death (Eby et al., 2000; Kumar et al., 2001). In some cases, JNK-induced cell death is mediated by the release of mitochondrial apoptogenic factors (Tournier et al., 2000). Recently, it has been shown that cancer cell death induced by TRAIL, a mammalian TNF superfamily ligand, requires mitochondrial release of Smac (Deng et al., 2002). One possible mechanism of Eiger-induced cell death may be JNK-mediated release of mitochondrial caspase-independent cell death factors. 
In fact, the Drosophila genome also encodes homologs of such molecules: AIF, endo G and HtrA2.” (Igaki et al., EMBO J. 2002 June; 21 (12): 3009–3018)
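Illustrating Diagram [Figure not reproduced: text fragments ending in citation markers (…[17], …[12], …[17], …[42], …[23], …[27], …[9], …[16], …[7]) shown together with Fact 1, Fact 2, …, Fact n.]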
A Source for Unannotated Comparable Corpora • Comparable corpora are a useful resource for the development of NLP tools for question answering and summarization. • Most domains outside of news do not contain many articles discussing the same events, but bioscience citances have some of the requisite characteristics in that they include redundancies that allow identification of comparable sentences. • We later demonstrate the use of citances as comparable corpora for automatic paraphrase extraction.
Summarization of the Target Papers • The set of citances that refer to a specific paper can be viewed as an indication of the important facts in the paper as seen by the scientific community in that field. • This is an excellent resource for summarization. In fact, we believe that a paper that is cited enough times can be summarized using only the citances pointing to it. • Instead of showing the user all the citances pointing to a paper (as is done in CiteSeer and in Nanba et al. (2000)), we propose to first cluster related citances, and then display to the user only a summary of each cluster. • The facts expressed by each cluster can be extracted and stored in a database. • This could facilitate answering advanced queries on facts, such as “retrieve all documents that describe which genes upregulate gene G”.
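The clustering step proposed above can be sketched as follows. This is a minimal illustration only: the slides do not say which clustering method is intended, so the greedy single-pass grouping, the Jaccard word-overlap measure, the `threshold` value, and the function names are all assumptions made for the example.

```python
import re

def tokens(text):
    """Lower-case word set of a citance (a deliberately crude representation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_citances(citances, threshold=0.3):
    """Greedy single-pass clustering (an assumed method, not the authors').

    Each citance joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []
    for citance in citances:
        toks = tokens(citance)
        for cluster in clusters:
            if jaccard(toks, tokens(cluster[0])) >= threshold:
                cluster.append(citance)
                break
        else:
            clusters.append([citance])
    return clusters

if __name__ == "__main__":
    demo = [
        "JNK3 is required for neuronal cell death",
        "neuronal cell death requires JNK3",
        "Bim acts downstream of the JNK pathway",
    ]
    for group in cluster_citances(demo):
        print(group)
```

A real system would likely need a representation that is robust to paraphrase, which is the normalization problem raised elsewhere in these slides; plain word overlap is only a starting point.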
Synonym Identification and Disambiguation • Bioscience literature is rife with abbreviations and synonyms. • Citances referring to the same article may allow synonyms to be identified and recorded. • A collection of related citances can help disambiguate terms with multiple meanings, since in some of the citances an unambiguous form of the term might be present.
Entity Recognition and Relation Extraction • Citances give us a way to build a model of many of the different ways to express a relationship type R between entities of type A and B. • We can seed learning algorithms with several examples using concepts that are semantically similar to A and similar to B, for which relation R is known to hold. • Then we can train a model to recognize this kind of relation in situations where the relation is not known. • Since the results may extend to sentences that are not citances as well, citance-based corpora should provide a good collection for building NLP tools for recognizing entities and relations in unseen text.
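To make the seeding idea concrete, here is a rough sketch that swaps the trained model for plain pattern harvesting: it collects the words found between seed pairs known to stand in relation R, then looks for the same wording between other words. The function names, the seed pair, and the `GeneX`/`GeneY` sentence are hypothetical and exist only for illustration.

```python
import re

def harvest_patterns(sentences, seed_pairs):
    """Collect the text found between known (A, B) seed pairs.

    Stand-in for model training: each harvested string is one observed
    way of expressing the relation.
    """
    patterns = set()
    for sentence in sentences:
        for a, b in seed_pairs:
            match = re.search(
                re.escape(a) + r"\s+(.+?)\s+" + re.escape(b), sentence
            )
            if match:
                patterns.add(match.group(1))
    return patterns

def apply_patterns(sentence, patterns):
    """Return (left, right) word pairs joined by a harvested pattern."""
    hits = []
    for pattern in patterns:
        regex = r"(\w+)\s+" + re.escape(pattern) + r"\s+(\w+)"
        for match in re.finditer(regex, sentence):
            hits.append((match.group(1), match.group(2)))
    return hits

if __name__ == "__main__":
    # Seed sentence simplified from the example citance; seed pair assumed.
    seeds = [("Bim", "JNK")]
    known = ["Bim acts downstream of JNK in neuronal cell death."]
    patterns = harvest_patterns(known, seeds)
    # Hypothetical unseen sentence with placeholder entity names.
    print(apply_patterns("GeneX acts downstream of GeneY.", patterns))
```

Exact string patterns generalize poorly, which is one reason the slides argue for learning from many differently worded citances rather than from a handful of templates.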
Targets for Curation • We hypothesize that citances contain the most important information expressed in the cited document, and therefore contain the information that curators would want to make use of. • We have found support for this hypothesis with two sample papers being used by a cancer researcher who is recording information about the process of apoptosis.
Improved Citation Indexes for Information Retrieval • Citation indexes can be improved by combining methods that use citances’ context (e.g., Mercer and Di Marco (2004)) with methods that use citances’ content (e.g., Bradshaw (2003)). • For example, indexing terms can be taken from citances referring to a target paper, weighting them by both their relative frequency and the type of citation they appear in.
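One possible reading of the weighting scheme above, as a small sketch. The citation type labels, their weights, the stopword list, and the fallback weight are invented for illustration; they are not taken from Mercer and Di Marco (2004) or Bradshaw (2003).

```python
import re
from collections import Counter

# Hypothetical citation types and weights (assumptions for this example).
TYPE_WEIGHTS = {"supports": 1.0, "uses": 0.8, "background": 0.5}
DEFAULT_WEIGHT = 0.5

STOPWORDS = {"the", "of", "a", "an", "in", "is", "for", "and", "to", "by", "that"}

def tokenize(text):
    """Lower-case word tokens with a tiny stopword list removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def index_terms(citances):
    """Score index terms for one target paper.

    citances: list of (text, citation_type) pairs that cite the paper.
    Each term occurrence contributes the weight of its citation type;
    scores are divided by the total token count (relative frequency).
    """
    weighted = Counter()
    total = 0
    for text, citation_type in citances:
        terms = tokenize(text)
        total += len(terms)
        weight = TYPE_WEIGHTS.get(citation_type, DEFAULT_WEIGHT)
        for term in terms:
            weighted[term] += weight
    if total == 0:
        return {}
    return {term: score / total for term, score in weighted.items()}

if __name__ == "__main__":
    scores = index_terms([
        ("JNK3 is required for neuronal cell death", "supports"),
        ("JNK3 expression in neurons", "background"),
    ])
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{score:.4f}")
```

Here a term that recurs across citances of a highly weighted type rises to the top, while terms seen only in background mentions score lower.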
Related Work • Traditional citation analysis dates back to the 1960s (Garfield) and includes: • Citation categorization, • Context analysis, • Citer motivation. • Citation indexing systems, such as ISI’s SCI and CiteSeer. • Mercer and Di Marco (2004) propose to improve citation indexing using citation types. • Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.
Related Work (cont.) • Teufel and Moens (2002) identify citances to improve summarization of the citing paper. They give lower weight to citances as candidate sentences for summarization. • Nanba et al. (2000) use citances as features for classifying papers into topics. • A field related to citation indexing is the use of the link structure and anchor text of Web pages. • Applications include: IR, classification, Web crawlers, and summarization. • See the full paper for references.
Issues for Processing Citances • Text span • Identification of the appropriate phrase, clause, or sentence that constitutes a citance. • Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”). • Grouping citances by topic • Citances that cite the same document should be grouped by the facts they state. • Normalizing or paraphrasing citances • For IR, summarization, learning synonyms, relation extraction, question answering, and machine translation.
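The mapping of grouped citations such as “[22-25]” can be illustrated with a short sketch. It assumes numeric, square-bracketed markers only; author-year citations such as “(Tournier et al., 2000)” would need separate handling. The function names are made up for the example.

```python
import re

# Bracketed numeric markers such as "[7]", "[3, 9]" or "[22-25]".
MARKER = re.compile(r"\[(\d+(?:\s*[-–,]\s*\d+)*)\]")

def expand_marker(body):
    """Expand the inside of one marker, e.g. "22-25, 30" -> [22, 23, 24, 25, 30]."""
    refs = []
    for part in body.split(","):
        bounds = [int(n) for n in re.split(r"[-–]", part)]
        if len(bounds) == 1:
            refs.append(bounds[0])
        else:
            # A range: include both end points.
            refs.extend(range(bounds[0], bounds[-1] + 1))
    return refs

def cited_references(sentence):
    """Return every reference number cited in a sentence, in order of appearance."""
    refs = []
    for match in MARKER.finditer(sentence):
        refs.extend(expand_marker(match.group(1)))
    return refs

if __name__ == "__main__":
    print(cited_references("This pathway has been studied widely [22-25] and reviewed [7, 9]."))
```

Once a range is expanded, the same sentence can be attached as a citance to each of the referenced papers.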
Conclusions • We have motivated and discussed the potentially enormous role that the use of sentences surrounding citations, or citances, can have for automated analysis of bioscience literature. • In work not yet reported, we have found that citances align very well with rich information being curated by hand by a molecular biologist, and suspect they will be equally useful for other curation tasks. • We also hypothesize that they will be a gold mine of data for training algorithms to perform semantic analysis of bioscience text, and will improve the results of querying the bioscience literature. • Much work must be done before citances can be put to full use. • We have demonstrated some initial results in paraphrasing citances that discuss the same topic, but more work remains to be done to improve results, and to group similar citances together. • In future work, we plan to thoroughly explore the possibilities surrounding the analysis and use of citances for bioscience text analysis.