Mining the Biomedical Literature for Genic Information. BioNLP ’08, June 19, 2008 Catalina O. Tudor, K. Vijay-Shanker, Carl J. Schmidt University of Delaware.
Mining the Biomedical Literature for Genic Information BioNLP ’08, June 19, 2008 Catalina O. Tudor, K. Vijay-Shanker, Carl J. Schmidt University of Delaware Presenter: Catalina O. Tudor
User Scenario – Groucho and PubMed • User: “Groucho? I want to know more about this gene…” • Search Engine: PubMed • Result: 270 abstracts retrieved
User Scenario – Groucho and eGIFT • User: “Groucho? I want to know more about this gene…” • Web Application: eGIFT • Source: PubMed • Result: most relevant terms associated with the given gene • Key Terms for Groucho • Processes: segmentation, neurogenesis, embryonic development, ... • Descriptors: enhancer, corepressor, ... • Domains: WD40, eh1, WRPW, basic helix-loop-helix, ... • Genes: Hairy, AES • All sentences for Groucho containing segmentation • 1. The Groucho protein interacts with Hairy-related transcription factors to regulate segmentation, neurogenesis and sex determination. (PMID 8892234) • 2. The Drosophila protein Groucho is involved in embryonic segmentation and neural development, and is implicated in the Notch signal transduction pathway. (PMID 8713081) • ...
What does eGIFT provide? • Two types of users • Scientists trying to quickly find information about a gene • Annotators trying to quickly locate textual evidence describing gene functions • Key Terms provide an overall picture about a given gene • eGIFT allows users to identify the set of documents for a topic relevant to the gene of interest
Overall Approach of eGIFT • Retrieve abstracts from PubMed • Background Set: all abstracts mentioning “gene” or “protein” • Query Set: all abstracts mentioning a given gene • Refine Query Set • Group morphologically related words • Calculate term scores and identify key terms • Categorize key terms using controlled vocabularies • Link sentences and abstracts to a specific key term
Retrieve abstracts • Background Set • all abstracts mentioning “gene” or “protein” in the title • PubMed query: (gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) AND hasabstract[text] • 639,211 abstracts retrieved • Query Set • all abstracts mentioning the given gene's name, symbol, or synonyms • Compare information from the Query Set against general information from the Background Set to determine the most specific information in the Query Set • Compare background and query frequencies of terms to identify statistically interesting cases
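The two retrieval queries can be sketched as follows. This is a minimal illustration, not eGIFT's actual code: the Background Set query string is taken from the slide, while the helper names, the quoting of gene names, the `hasabstract[text]` filter on the Query Set, the synonym "Gro", and the use of the NCBI E-utilities `esearch` endpoint are all assumptions. The sketch only builds the request URL and does not contact PubMed.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed retrieval route, not stated in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Background Set query, as given on the slide.
BACKGROUND_QUERY = (
    "(gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) "
    "AND hasabstract[text]"
)


def gene_query(names):
    """Hypothetical Query Set query: OR together name, symbol and synonyms."""
    alternatives = " OR ".join(f'"{name}"' for name in names)
    # Restricting to records with abstracts is an assumption made here.
    return f"({alternatives}) AND hasabstract[text]"


def esearch_url(term, retmax=100):
    """Build (but do not send) an esearch request for the given query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# "Gro" is used only as an illustrative synonym.
query_url = esearch_url(gene_query(["Groucho", "Gro"]))
background_url = esearch_url(BACKGROUND_QUERY)
```

Keeping query construction separate from the network call makes it easy to inspect the exact search string before any abstracts are downloaded.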
Refine Query Set • Query Set = all abstracts mentioning the given gene • The Query Set contains two types of abstracts • About Set: abstracts that focus on the given gene • Extra Set: abstracts that focus on other topics but happen to mention the gene • Heuristics for identifying an About abstract • the gene name occurs in the title, the first sentence, or the last sentence • the gene name occurs 3+ times in the abstract
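The About/Extra split can be sketched as a small classifier. This is a hedged illustration: it assumes that either heuristic alone is enough to mark an abstract as About, and it uses a naive sentence splitter and case-insensitive whole-word matching, none of which is specified in the slides. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def is_about(title, abstract, gene_names):
    """Return True if the abstract looks like it focuses on the gene."""
    pattern = re.compile(
        "|".join(r"\b" + re.escape(name) + r"\b" for name in gene_names),
        re.IGNORECASE,
    )
    # Heuristic 1: gene name in the title, first sentence, or last sentence.
    if pattern.search(title):
        return True
    sentences = split_sentences(abstract)
    if sentences and (pattern.search(sentences[0]) or pattern.search(sentences[-1])):
        return True
    # Heuristic 2: gene name occurs three or more times in the abstract.
    return len(pattern.findall(abstract)) >= 3


def split_query_set(query_set, gene_names):
    """query_set: list of (title, abstract) pairs. Returns (about, extra)."""
    about, extra = [], []
    for title, abstract in query_set:
        (about if is_about(title, abstract, gene_names) else extra).append((title, abstract))
    return about, extra
```

An abstract that names the gene once, mid-text, in a paper about something else falls into the Extra Set, while an abstract with the gene in its title is classified as About.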
Refine Query Set – About Set example Multiple RTK pathways downregulate Groucho-mediated repression in Drosophila embryogenesis. RTK pathways establish cell fates in a wide range of developmental processes. However, how the pathway effector MAPK coordinately regulates the expression of multiple target genes is not fully understood. We have previously shown that the EGFR RTK pathway causes phosphorylation and downregulation of Groucho, a global co-repressor that is widely used by many developmentally important repressors for silencing their various targets. Here, we use specific antibodies that reveal the dynamics of Groucho phosphorylation by MAPK, and show that Groucho is phosphorylated in response to several RTK pathways during Drosophila embryogenesis. Focusing on the regulation of terminal patterning by the Torso RTK pathway, we demonstrate that attenuation of Groucho's repressor function via phosphorylation is essential for the transcriptional output of the pathway and for terminal cell specification. Importantly, Groucho is phosphorylated by an efficient mechanism that does not alter its subcellular localisation or decrease its stability; rather, modified Groucho endures long after MAPK activation has terminated. We propose that phosphorylation of Groucho provides a widespread, long-term mechanism by which RTK signals control target gene expression. PMID - 18216172
Refine Query Set – Extra Set example Engrailed defines the position of dorsal di-mesencephalic boundary by repressing diencephalic fate. Regionalization of a simple neural tube is a fundamental event during the development of central nervous system. To analyze in vivo the molecular mechanisms underlying the development of mesencephalon, we ectopically expressed Engrailed, which is expressed in developing mesencephalon, in the brain of chick embryos by in ovo electroporation. Misexpression of Engrailed caused a rostral shift of the di-mesencephalic boundary, and caused transformation of dorsal diencephalon into tectum, a derivative of dorsal mesencephalon. Ectopic Engrailed rapidly repressed Pax-6, a marker for diencephalon, which preceded the induction of mesencephalon-related genes such as Pax-2, Pax-5, Fgf8, Wnt-1 and EphrinA2. In contrast, a mutant Engrailed, En-2(F51rE), bearing mutation in EH1 domain, which has been shown to interact with a co-repressor, Groucho, did not show the phenotype induced by wild-type Engrailed. Furthermore, VP16-Engrailed chimeric protein, the dominant positive form of Engrailed, caused caudal shift of di-mesencephalic boundary and ectopic Pax-6 expression in mesencephalon. These data suggest that (1) Engrailed defines the position of dorsal di-mesencephalic boundary by directly repressing diencephalic fate, and (2) Engrailed positively regulates the expression of mesencephalon-related genes by repressing the expression of their negative regulator(s). PMID - 10529429
Group morphologically related words – example • The Drosophila Groucho transcriptional corepressor protein has been shown to interact with the DNA-binding bHLH domain of Enhancer of split, Hairy and Deadpan proteins. • Groucho acts as a co-repressor for several Drosophila DNA binding transcriptional repressors. • Dorsal represses transcription by recruiting the co-repressor Groucho. • The results indicate that FoxD3 recruitment of Groucho corepressors is essential for the transcriptional repression of target genes and induction of mesoderm in Xenopus. • corepressor = {corepressor, corepressors, co-repressor, …} • transcription repress = {transcriptional repressors, transcriptional repression, …}
Group morphologically related words • Unigram example • recruit = {recruit, recruits, recruited, recruitment, recruiting, recruitments} • Bigram example • transcript repress = {transcriptional repressor, transcriptional repressors, transcriptional repression, transcriptional repressions, transcription repression, transcription repressions} • Reasons for grouping morphologically related words • textual variants are scattered through the text and, counted independently, each looks infrequent • grouping helps the whole family stand out • grouping prevents a very infrequent variant from becoming a key term on its own
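The unigram grouping above can be sketched with a deliberately crude suffix stripper. This is an assumption-laden stand-in: the slides do not say which morphological analysis eGIFT uses, and `crude_stem`, its suffix list, and the minimum stem length are hypothetical choices that only need to cover the slide's unigram examples. Bigram families would need a fuller stemmer.

```python
from collections import defaultdict

# Hypothetical suffix list, longest first; a real system would use a proper stemmer.
SUFFIXES = ("ments", "ment", "ing", "ed", "s")


def crude_stem(word):
    """Lowercase, drop hyphens, and strip at most one known suffix."""
    w = word.lower().replace("-", "")
    for suffix in SUFFIXES:
        # Keep at least four characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w


def group_variants(tokens):
    """Map each stem to the set of surface forms that share it."""
    families = defaultdict(set)
    for token in tokens:
        families[crude_stem(token)].add(token.lower())
    return dict(families)


def family_document_counts(abstract_tokens):
    """Count, per stem, the number of abstracts containing any variant.

    abstract_tokens is a list of token lists, one list per abstract.
    """
    counts = defaultdict(int)
    for tokens in abstract_tokens:
        for stem in {crude_stem(t) for t in tokens}:
            counts[stem] += 1
    return dict(counts)
```

Counting documents per family rather than per surface form is what lets scattered variants such as "recruits" and "recruitment" contribute to a single term.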
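Calculate term scores • Calculate Normalized Frequencies: $f_{tq} = dc_{tq} / N_q$ and $f_{tb} = dc_{tb} / N_b$ • $dc_{tq}$ = document count of term t in Query Set • $N_q$ = total number of abstracts in Query Set • $dc_{tb}$ = document count of term t in Background Set • $N_b$ = total number of abstracts in Background Set • Calculate Score: $s_t = (f_{tq} - f_{tb}) \cdot \ln(1 / f_{tb})$ • $s_t$ = score of term t • $f_t$ = frequency of term t • Example: segmentation, $f_{tb} = 0.0012$, $f_{tq} = 0.13$: difference 0.13, score 0.874 • Example: these, $f_{tb} = 0.47$, $f_{tq} = 0.60$: difference 0.13, score 0.098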
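The scoring step can be checked numerically with a short sketch. It assumes the score is the difference of the normalized frequencies weighted by the natural log of the inverse background frequency, $s_t = (f_{tq} - f_{tb}) \cdot \ln(1 / f_{tb})$. That form reproduces the slide's value of 0.098 for "these" and gives about 0.87 for "segmentation" from the rounded frequencies shown, close to the 0.874 on the slide. The function names are hypothetical.

```python
import math


def normalized_frequency(doc_count, total_docs):
    """Fraction of abstracts in a set that contain the term."""
    return doc_count / total_docs


def term_score(f_tq, f_tb):
    """Assumed score: (f_tq - f_tb) * ln(1 / f_tb).

    f_tq and f_tb are the normalized frequencies of the term in the
    Query Set and the Background Set; f_tb must be greater than zero.
    """
    return (f_tq - f_tb) * math.log(1.0 / f_tb)


# Frequencies as displayed on the slide (already rounded).
score_these = term_score(0.60, 0.47)
score_segmentation = term_score(0.13, 0.0012)
```

Both terms have the same frequency difference of about 0.13, so the log factor does the separating: it is small for a word like "these" that is common everywhere, and large for a word like "segmentation" that is rare in the Background Set.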
Other scoring methods • Pearson’s Chi-Square • Favors only highly infrequent terms (bigrams are ranked high) • Drops very frequent terms, even when they are much more frequent in the Query Set • Z-score • Performance is highly dependent on the way the Background Set is grouped • Other methods considered • Ratio of frequencies • Tf-Idf • Mutual Information
Link sentences to key terms • eGIFT allows users to see every sentence mentioning a particular key term in the gene’s Query Set • By reading the key term in context, the user gets a better appreciation of the relationship between the key term and the gene • From the sentences, users can choose which abstracts to read • Sentences can be saved in gene-specific files (e.g. for annotation)
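Sentence linking can be sketched as a scan over the Query Set for any surface variant of a key term. This is a minimal illustration under stated assumptions: the naive sentence splitter, whole-word case-insensitive matching, and the function names are hypothetical, and the slides do not describe how eGIFT does this internally. The two example sentences are the ones shown for Groucho and "segmentation"; the second sentence in each toy abstract is filler added for the demonstration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def link_sentences(abstracts, variants):
    """Return (pmid, sentence) pairs for sentences containing any variant.

    abstracts maps a PMID string to abstract text; variants lists the
    surface forms grouped under one key term.
    """
    pattern = re.compile(
        "|".join(r"\b" + re.escape(v) + r"\b" for v in variants),
        re.IGNORECASE,
    )
    hits = []
    for pmid, text in abstracts.items():
        for sentence in split_sentences(text):
            if pattern.search(sentence):
                hits.append((pmid, sentence))
    return hits


toy_query_set = {
    "8892234": "The Groucho protein interacts with Hairy-related transcription "
               "factors to regulate segmentation, neurogenesis and sex determination. "
               "Filler sentence without the key term.",
    "8713081": "The Drosophila protein Groucho is involved in embryonic segmentation "
               "and neural development, and is implicated in the Notch signal "
               "transduction pathway. Filler sentence without the key term.",
}
segmentation_hits = link_sentences(toy_query_set, ["segmentation"])
```

Keeping the PMID alongside each sentence is what lets a user jump from a sentence of interest to the abstract it came from.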
Related Work • Andrade and Valencia (1998) • Liu et al. (2004) • e-LiSe (Gladki et al., 2008) • MedEvi (Kim et al., 2008) • Anne O’Tate (Smalheiser et al., 2008) • XplorMed (Perez-Iratxeta et al., 2003) • Shatkay and Wilbur (2000) • How the approaches differ from eGIFT • Keywords for a protein family; Z-score; Background divided by literature for individual families • Keyword detection (not necessarily genes); Z-score; more general background set than ours, grouped randomly • Keyword detection (some just nouns); more general background set than ours • From kernel document to Query Set of on-topic documents; Background Set contains off-topic documents; score is ratio of normalized frequencies
Distinguishing Features of eGIFT • Background Set is specific for genes • About Set yields better results than the entire Query Set • Bigrams in addition to unigrams • Morphological grouping gives “textual concepts” • New scoring mechanism • Going beyond key terms • Categories of key terms (for interface purposes) • Retrieval of sentences containing a specific key term
Future Work • Evaluation • comparison with other systems • Named Entity Recognition • extend unigrams and bigrams to full length names • Using other subsets of Query Set • currently, eGIFT uses the About Set to compute key terms • different kinds of information can be obtained from variants of Extra Set and other subsets
The End http://dinah.cis.udel.edu/tudor/eGIFT