Integrating Scientific Literature With Large Scale Gene Expression Analysis. Patrick Glenisson. December 21st 2004. Overview. Genes & microarrays Gene expression data analysis Text mining in biology: principles Text mining in practice: TXTGate Combining text and gene expression data
Integrating Scientific Literature With Large Scale Gene Expression Analysis
Patrick Glenisson
December 21st 2004
Overview
• Genes & microarrays
• Gene expression data analysis
• Text mining in biology: principles
• Text mining in practice: TXTGate
• Combining text and gene expression data
• Conclusion
DNA, genes, proteins and cells Genes and Microarrays
Genes are expressed and regulated Genes and Microarrays
Microarrays measure gene expression
[Figure: laser excitation of the array yields a gene expression measurement matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), accompanied by gene annotations and sample annotations]
Representing expression information
Conditions in which expression occurs
• Gene expression experiments are complex:
  • Too verbose to include in a scientific publication
  • Too important to compromise on reproducibility
  • Too valuable for post-genome research to have them scattered around on various websites
• What level of detail is necessary for reproducibility / data mining?
• Hence, a standard for reporting on microarray (MA) experiments
  • As a guideline for databases hosting expression compendia
Storing gene expression data Genes and Microarrays
MIAME standard
• Minimum Information About a Microarray Experiment
• Internationally proposed standard
• Published in Dec 2001 by the international consortium MGED
• Prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data
• Some hurdles:
  • Significant overhead in filling out the questionnaire
  • Scooping of leads (!)
  • Proprietary information about probe sequences
  • Query-enabled vs. comparable (cf. Affy vs. cDNA)
Impression on MIAME’s content Genes and Microarrays
Dissemination of gene expression data
[Figure: publications and repositories]
Questions asked with microarrays
• Fundamental
  • Functional roles of genes (and transcriptional regulation)
  • Genetic network reconstruction
• Clinical
  • Correlation of genes with a given disease
  • Diagnosis of disease stage in patients
• Pharmacological
  • Toxicological drug response assessment
Gene expression data analysis
Clustering
[Figure: expression data (genes by conditions C1, C2, C3) is turned into a gene-by-gene distance matrix, which is then clustered with hierarchical clustering or k-means]
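The pipeline on this slide (expression data, then distance matrix, then clustering) can be sketched in a few lines. This is a minimal illustration only: the Euclidean distance, the toy expression values, and the deterministic "first k profiles" initialization are assumptions made for the example, not choices stated in the talk.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles):
    """Gene-by-gene distance matrix."""
    return [[euclidean(p, q) for q in profiles] for p in profiles]

def kmeans(profiles, k, n_iter=20):
    """Plain k-means; centroids start at the first k profiles (illustration only)."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: rows are genes G1..G4, columns are conditions C1..C3.
expression = [
    [2.0, 2.1, 1.9],
    [-1.0, -1.2, -0.9],
    [2.2, 1.8, 2.0],
    [-0.8, -1.1, -1.0],
]
dist = distance_matrix(expression)
labels = kmeans(expression, k=2)
print(labels)
```

Hierarchical clustering would start from the same `dist` matrix and repeatedly merge the two closest groups instead of iterating over centroids.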
Cluster validation
Optimal number of clusters? How to define 'optimal'?
• Data-centered statistical scores
  • Coherence vs. separation of clusters (e.g., silhouette)
  • Stability of a cluster solution when leaving out data
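As an example of a data-centered score, the silhouette contrasts how close a point is to its own cluster (coherence) with how close it is to the nearest other cluster (separation). The sketch below assumes Euclidean distance and made-up 2-D points; it scores singleton clusters as 0, which is a common convention rather than something the slide specifies.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette width over all points; requires at least two clusters."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [euclidean(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(euclidean(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [[0, 0], [0, 1], [5, 5], [5, 6]]
good = silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that mix the two groups
print(good, bad)
```

Computing this score for several values of k and picking the maximum is one way to make 'optimal' concrete.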
Cluster validation – stability method Genes and Microarrays
Cluster validation
Optimal number of clusters? How to define 'optimal'?
• Data-centered statistical scores
• Knowledge-based scores
  • Enrichment of GO annotations in clusters
  • Literature-based scoring
Cluster validation
Optimal number of clusters? How to define 'optimal'?
• Data-centered statistical scores
• Knowledge-based scores
• Motif-based
  • DNA patterns in regulatory regions of gene groups
[Figure: a gene with regulatory DNA patterns (motifs)]
DNA patterns in expression clusters
• 'Significant' occurrences of known motifs in a cluster
[Figure: gene clusters (1, 2, 3, ..) scored against motifs (A, B, C, ..) give a cluster-by-motif matrix (motif enrichment matrix) with entries -log(p-value), which is summarized into the M-score]
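One way to fill such a cluster-by-motif matrix is sketched below. The slide only says that entries are -log(p-value) for 'significant' motif occurrences; the hypergeometric test, the base-10 logarithm, and the gene, cluster, and motif names here are assumptions chosen for illustration.

```python
import math

def hypergeom_tail(N, K, n, x):
    """P(X >= x) when drawing n genes from N, of which K carry the motif."""
    total = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(x, min(K, n) + 1)) / total

def enrichment_matrix(all_genes, clusters, motifs):
    """Return {cluster: {motif: -log10(p)}} for every cluster/motif pair."""
    N = len(all_genes)
    matrix = {}
    for cname, cgenes in clusters.items():
        matrix[cname] = {}
        for mname, mgenes in motifs.items():
            overlap = len(cgenes & mgenes)
            p = hypergeom_tail(N, len(mgenes), len(cgenes), overlap)
            matrix[cname][mname] = -math.log10(p)
    return matrix

# Hypothetical toy data: 20 genes, two clusters, one motif.
all_genes = {f"g{i}" for i in range(20)}
clusters = {
    "cluster1": {"g0", "g1", "g2", "g3", "g4"},
    "cluster2": {"g5", "g6", "g7", "g8", "g9"},
}
motifs = {"motifA": {"g0", "g1", "g2", "g3", "g10"}}

matrix = enrichment_matrix(all_genes, clusters, motifs)
print(matrix)
```

In this toy case motifA is concentrated in cluster1, so that entry is large, while the cluster2 entry is zero because no cluster2 gene carries the motif.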
Cluster-by-motif matrix
• M-score for the entire clustering solution
• One-shot estimate of the 'biological relevance'
[Figure: matrix of motifs against clusters]
M-score
• A motif is less interesting when it (significantly) occurs in many clusters.
• A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant.
• A 'too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.
M-score validation
[Figure: M-score as a function of k]
• Optimal k in yeast cell cycle expression data
  • Original studies by Tavazoie et al. used k=30
  • Overestimation confirmed by analyses of
    • De Smet et al. (AQBC)
    • Gibbons et al. (GO-based scoring)
• A simplification of reality
  • No absolute quantification of biological relevance
• Useful tool when experimenting with
  • Multiple clustering methods
  • Multiple parameterizations
• To economize on biological validations
Problem setting
• Given a set of documents,
• compute a representation, called index,
• to retrieve, summarize, classify or cluster them
<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
Text Mining: principles
Problem setting
• Given a set of genes (and their literature),
• compute a representation, called gene index,
• to retrieve, summarize, classify or cluster them
<1 0 0 1 0 1>
<1 1 0 0 0 1>
<0 0 0 1 1 0>
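The 0/1 vectors above can be read as term-presence indicators: one position per vocabulary term, set to 1 when the term occurs in a gene's literature. The sketch below builds such a binary gene index; the gene names and text snippets are invented toy data, and whitespace tokenization is a simplifying assumption.

```python
def build_vocabulary(literature):
    """Sorted list of all distinct lowercase tokens across the gene texts."""
    return sorted({tok for text in literature.values()
                   for tok in text.lower().split()})

def binary_index(literature):
    """Map each gene to a 0/1 vector over the shared vocabulary."""
    vocab = build_vocabulary(literature)
    index = {}
    for gene, text in literature.items():
        present = set(text.lower().split())
        index[gene] = [1 if term in present else 0 for term in vocab]
    return vocab, index

# Hypothetical toy literature, one short text per gene.
literature = {
    "geneA": "angiogenesis growth factor",
    "geneB": "angiogenesis tumor suppressor",
    "geneC": "cell cycle",
}
vocab, index = binary_index(literature)
print(vocab)
for gene, vector in index.items():
    print(gene, vector)
```

Once every gene is a vector over the same vocabulary, retrieval, classification and clustering reduce to operations on these vectors.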
Vector space model
[Figure: a gene represented as a vector over vocabulary terms T1, T2, T3]
• Document processing
  • Remove punctuation & grammatical structure ('bag of words')
• Define a vocabulary
  • Identify multi-word terms (phrases), e.g., tumor suppressor
  • Eliminate low-content words (stopwords), e.g., and, thus, gene, ...
  • Map words with the same meaning (synonyms)
  • Strip plurals, conjugations, ... (stemming)
• Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
• Compute the index of textual resources
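The processing steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the stopword list is tiny, "stemming" is a crude plural stripper, phrase and synonym handling are omitted, the weighting is raw term frequency times log(N/df), and the gene names and texts are invented.

```python
import math
import re
from collections import Counter

STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}

def tokenize(text):
    """Bag of words: lowercase, drop punctuation, strip plurals, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return [w for w in words if w not in STOPWORDS]

def tfidf_index(docs):
    """Return (vocabulary, {name: tf-idf vector}) for a dict of texts."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    vocab = sorted({t for toks in tokenized.values() for t in toks})
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    n = len(docs)
    index = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        index[name] = [tf[t] * math.log(n / df[t]) for t in vocab]
    return vocab, index

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical toy literature, one short text per gene.
docs = {
    "geneA": "Tumor suppressor genes regulate cell cycle arrest.",
    "geneB": "Cell cycle checkpoints and tumor growth.",
    "geneC": "Heparin binding, growth factor activity.",
}
vocab, index = tfidf_index(docs)
sim_ab = cosine(index["geneA"], index["geneB"])
sim_ac = cosine(index["geneA"], index["geneC"])
print(vocab)
print(sim_ab, sim_ac)
```

In the toy data geneA and geneB share cell-cycle vocabulary and end up closer in text space than geneA and geneC, which share no terms.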
Validity of gene index
Genes that are functionally related should be close in text space:
• Text-based coherence score
• Modeled with respect to a background distribution obtained through random and permuted gene groups
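A coherence score of this kind could look like the sketch below. The slide does not give the exact definition, so everything specific here is an assumption: coherence is taken as the mean pairwise cosine similarity within a gene group, the background is built from randomly sampled groups of the same size, and the gene vectors are invented toy values.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the genes in a group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p_value(group, index, n_random=1000, seed=0):
    """Compare a group's coherence against random groups of the same size."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_random + 1)

# Hypothetical gene index: g1..g3 share vocabulary, g4..g6 do not.
index = {
    "g1": [1.0, 0.0, 0.0],
    "g2": [0.9, 0.1, 0.0],
    "g3": [1.0, 0.1, 0.0],
    "g4": [0.0, 1.0, 0.0],
    "g5": [0.0, 0.0, 1.0],
    "g6": [0.0, 1.0, 1.0],
}
related = ["g1", "g2", "g3"]
observed, p_value = empirical_p_value(related, index)
mixed = coherence(["g1", "g4", "g5"], index)
print(observed, p_value, mixed)
```

A functionally related group should score well above most random groups, which shows up as a small empirical p-value.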
Validity of gene index
• "Simple word vector representations are competitive also in terms of classification task with respect to more elaborate approaches .."
• .. despite unaddressed issues such as
  • phrases
  • homonyms
  • neglected grammatical structure
A. Seewald: Ranking for BioMinT: Investigating performance, local search and homonymy recognition. >> www.biomint.org
Motivation 1
GO
• cell proliferation
• heparin binding
• growth factor activity
GeneRIF
• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.
• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
"Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database" (M. Gerstein, 2001)
TXTGate - a platform to profile groups of genes
Motivation 2
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• A number of structured vocabularies have already arisen:
  • Gene Ontology (GO)
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO
  • GOA@EBI
Motivation 3 (Figure courtesy: S. Van Vooren) TXTGate - a platform to profile groups of genes
Development of text mining platform
• A platform that offers multiple 'views' on the vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications
• Incorporates term-based indices ..
• .. and uses them as a starting point
  • to explore the text through the eyes of different domain vocabularies
  • to link out to other resources by query building, or
  • to sub-cluster genes based on text
Illustration: sub-clustering Eisen et al. (1998) Genes and Microarrays
Illustration: profiling Chaussabel et al. (2003) Genes and Microarrays
TXTGate: towards closing the KD loop
[Figure: loop connecting Profile, Other vocabulary, and Distance matrix & Clustering]
TXTGate – a case study
• Gene modules over various expression data sets
• Reported two sub-modules of the TCA cycle
• Two 'new' genes, ACN9 & CAT8, in module 2
• Visualize with BioLayout / LGL