Integrating Scientific Literature With Large Scale Gene Expression Analysis
PhD defense, Patrick Glenisson
Promotor: Prof. Bart De Moor
June 11th 2004
Overview
• Genes & microarrays
• Gene expression data analysis (cluster analysis, M-score)
• Text mining in biology: principles (literature analysis)
• Text mining in practice: TXTGate
• Combining text and gene expression data (integrated clustering)
• Conclusion
DNA, genes, proteins and cells Genes and Microarrays
Genes are expressed and regulated Genes and Microarrays
Microarrays measure gene expression
[Figure: laser excitation of the array yields gene expression measurements for genes (G1, G2, G3, ...) under conditions (C1, C2, C3, ...), accompanied by gene annotations and sample annotations]
Representing expression information
[Figure: conditions in which expression occurs]
• Gene expression experiments are complex:
  • Too verbose to include in a scientific publication
  • Too important to compromise on reproducibility
  • Too valuable for post-genome research to be scattered across various websites
• Hence the need for a standard for reporting on microarray (MA) experiments
  • Also a guideline for databases hosting expression compendia
MIAME standard
• Minimum Information About a Microarray Experiment
• Internationally proposed standard
• Published in Dec 2001 by the international consortium MGED
• Some prominent journals (Nature, Lancet, EMBO, Cell) require MIAME-compliant submissions of data
• Some hurdles:
  • Significant overhead in filling out the questionnaire
  • Scooping of leads (!)
  • Proprietary information about probe sequences
Questions asked with microarrays
• Fundamental
  • Functional roles of genes (and transcriptional regulation)
  • Genetic network reconstruction
• Clinical
  • Correlation of genes with a given disease
  • Diagnosis of disease stage in patients
• Pharmacological
  • Toxicological drug response assessment
Gene expression data analysis
Clustering
[Figure: expression data (genes × conditions C1, C2, C3), a gene-by-gene distance matrix, and clustering by hierarchical clustering or k-means]
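The path from expression data to a distance matrix to clusters can be sketched in a few lines. This is only an illustration under stated assumptions: the toy expression values are made up, Euclidean distance and average linkage are choices made here (the slide does not name a distance or linkage), and k-means is not shown.

```python
import math

def distance_matrix(profiles):
    """Pairwise Euclidean distances between expression profiles (one row per gene)."""
    n = len(profiles)
    return [[math.dist(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]

def average_linkage(dist, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all cross-cluster gene pairs.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy data (made up): 5 genes measured under conditions C1..C3.
expression = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [2.0, 2.1, 1.9],
    [2.1, 2.0, 2.0],
    [5.0, 0.0, 5.0],
]
dist = distance_matrix(expression)
clusters = average_linkage(dist, k=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Stopping the merging at different values of k corresponds to cutting the cluster tree at different heights.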
Cluster validation
Optimal number of clusters? Define `optimal'?
• Data-centered statistical scores
  • Coherence vs separation of clusters
  • Stability of a cluster solution when leaving out data
• Knowledge-based scores
  • Enrichment of GO annotations in clusters
  • Literature-based scoring
• Motif-based
  • DNA patterns (motifs) in regulatory regions of gene groups
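As one possible instance of a data-centered score that trades off coherence against separation, the sketch below computes the mean silhouette width. The silhouette is a stand-in chosen here for illustration, not a score named in the slides; the toy points and labels are made up, and the function assumes at least two clusters.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width; requires at least two distinct cluster labels."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            # Singleton cluster: the silhouette is conventionally set to 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # coherence: mean distance within own cluster
        b = min(                 # separation: mean distance to nearest other cluster
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (made up): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the obvious grouping
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

Scanning such a score over several values of k is one way to compare cluster solutions without external knowledge.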
DNA patterns in expression clusters
• Significant occurrences of known motifs in a cluster
[Figure: gene clusters 1, 2, 3, ... are scored against motifs A, B, C, ... in a cluster-by-motif (motif enrichment) matrix whose entries are -log(p-value); the matrix is summarized by the M-score]
Cluster-by-motif matrix
• M-score for the entire clustering solution
• One-shot estimate of the `biological relevance'
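A cluster-by-motif matrix of -log(p-value) entries could be filled in as sketched below. The hypergeometric test is an assumption made here for illustration (the slides do not state which enrichment test was used), and the cluster memberships and motif names are made up. The M-score itself is not reproduced, since its formula is not given in the slides.

```python
import math

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the motif."""
    total = math.comb(N, n)
    upper = min(n, K)
    tail = sum(math.comb(K, x) * math.comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total

def enrichment_matrix(clusters, motif_genes, n_genes):
    """Rows: clusters, columns: motifs, entries: -log10(p-value)."""
    matrix = []
    for members in clusters:
        row = []
        for genes_with_motif in motif_genes.values():
            hits = len(set(members) & genes_with_motif)
            p = hypergeom_pvalue(hits, len(members), len(genes_with_motif), n_genes)
            row.append(-math.log10(p) if p < 1.0 else 0.0)
        matrix.append(row)
    return matrix

# Toy data (made up): 10 genes, 3 clusters, 2 hypothetical motifs A and B.
clusters = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
motif_genes = {"A": {0, 1, 2}, "B": {3, 7}}
matrix = enrichment_matrix(clusters, motif_genes, n_genes=10)
for row in matrix:
    print([round(v, 2) for v in row])
```

In this toy example motif A is concentrated in the first cluster, so that entry is large while motif B, spread over two clusters, scores low everywhere.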
M-score
• A motif is less interesting when it (significantly) occurs in many clusters.
• A cluster that contains a large portion of (significant) motifs is less likely to be biologically relevant.
• A `too large' number of clusters is less likely to reflect the true biological diversity underlying the experiment.
M-score validation
[Figure: M-score as a function of k]
• Optimal k in yeast cell cycle expression data
  • Original studies by Tavazoie et al. used k=30
  • Overestimation confirmed by analyses of
    • De Smet et al. (AQBC)
    • Gibbons et al. (GO-based scoring)
• A simplification of reality
  • No absolute quantification of biological relevance
• Useful tool when experimenting with
  • Multiple clustering methods
  • Multiple parameterizations
• To economize on biological validations
Problem setting
• Given a set of documents,
• compute a representation, called index,
• to retrieve, summarize, classify or cluster them
Example index vectors: <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>
Text Mining: principles
Problem setting
• Given a set of genes (and their literature),
• compute a representation, called gene index,
• to retrieve, summarize, classify or cluster them
Example gene index vectors: <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>
Vector space model
[Figure: a gene represented as a vector in the space spanned by the vocabulary terms T1, T2, T3]
• Document processing
  • Remove punctuation & grammatical structure (`bag of words')
• Define a vocabulary
  • Identify multi-word terms, e.g., tumor suppressor (phrases)
  • Eliminate words with low content, e.g., and, thus, gene, ... (stopwords)
  • Map words with the same meaning (synonyms)
  • Strip plurals, conjugations, ... (stemming)
• Define weighting scheme and/or transformations (tf-idf, SVD, ...)
• Compute index of textual resources
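The processing steps above can be sketched as a small tf-idf indexer. This is a simplified illustration, not the indexing pipeline used in the thesis: the stopword list, the naive plural stripping (a stand-in for a real stemmer), the gene identifiers and the texts are all made up, and phrase detection and synonym mapping are left out.

```python
import math
from collections import Counter

STOPWORDS = {"and", "thus", "gene", "the", "of", "in", "is", "a"}  # illustrative only

def tokenize(text):
    """Bag of words: lowercase, strip punctuation, drop stopwords, strip plural 's'."""
    words = "".join(ch.lower() if ch.isalnum() else " " for ch in text).split()
    tokens = []
    for w in words:
        if w in STOPWORDS:
            continue
        # Naive stand-in for a real stemmer: drop a trailing 's' on longer words.
        stem = w[:-1] if w.endswith("s") and len(w) > 3 else w
        if stem not in STOPWORDS:
            tokens.append(stem)
    return tokens

def tfidf_index(docs):
    """Return (vocabulary, {doc_id: tf-idf vector}) for a dict of raw texts."""
    bags = {d: Counter(tokenize(t)) for d, t in docs.items()}
    vocab = sorted(set().union(*bags.values()))
    n_docs = len(docs)
    df = {term: sum(term in bag for bag in bags.values()) for term in vocab}
    index = {
        d: [bag[term] * math.log(n_docs / df[term]) for term in vocab]
        for d, bag in bags.items()
    }
    return vocab, index

# Toy literature (made up), keyed by hypothetical gene identifiers.
docs = {
    "geneA": "Growth factors regulate cell proliferation.",
    "geneB": "The growth factor binds heparin.",
    "geneC": "Kinase activity in the cell cycle.",
}
vocab, index = tfidf_index(docs)
print(vocab)
print(index["geneA"])
```

Each gene ends up as one vector over the shared vocabulary, which is the form the distance and clustering steps expect.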
Validity of gene index
Genes that are functionally related should be close in text space.
• Text-based coherence score
  • Modeled with respect to a background distribution obtained through random and permuted gene groups
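One way to score a gene group against a random background is sketched below. Mean pairwise cosine similarity is an assumed coherence measure, since the exact score is not spelled out in the slides, and only random same-size groups are drawn (permuted groups are not shown). The gene names and text vectors are made up.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two text vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the text profiles in a gene group."""
    pairs = [(g, h) for i, g in enumerate(group) for h in group[i + 1:]]
    return sum(cosine(index[g], index[h]) for g, h in pairs) / len(pairs)

def empirical_pvalue(group, index, n_random=1000, seed=0):
    """Fraction of random same-size gene groups at least as coherent as `group`."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    return (hits + 1) / (n_random + 1)

# Toy gene index (made up): g1..g3 share vocabulary, g4..g6 do not.
index = {
    "g1": [1.0, 0.0, 0.0],
    "g2": [0.9, 0.1, 0.0],
    "g3": [1.0, 0.1, 0.0],
    "g4": [0.0, 1.0, 0.0],
    "g5": [0.0, 0.0, 1.0],
    "g6": [0.0, 1.0, 1.0],
}
group = ["g1", "g2", "g3"]
score = coherence(group, index)
p_value = empirical_pvalue(group, index)
print(score, p_value)
```

A small empirical p-value indicates that the group is more coherent in text space than most random groups of the same size.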
Motivation 1
GO
• cell proliferation
• heparin binding
• growth factor activity
GeneRIF
• 12133521: VEGF is associated with the development and prognosis of colorectal cancer.
• 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
• 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
“Until now it has been largely overlooked that there is little difference between retrieving a MEDLINE abstract and downloading an entry from a biological database” (M. Gerstein, 2001)
TXTGate - a platform to profile groups of genes
Motivation 2
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• A number of structured vocabularies have already arisen:
  • Gene Ontology (GO)
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO
  • GOA@EBI
Motivation 3 (Figure courtesy: S. Van Vooren) TXTGate - a platform to profile groups of genes
TXTGate
[Figure: TXTGate workflow with the components profile, other vocabulary, and distance matrix & clustering]
TXTGate – a case study
• Gene modules over various expression data sets
• Reported two sub-modules of the TCA cycle
• Two `new' genes, ACN9 & CAT8, in module 2
Problem setting
“How can we analyze data in an integrated fashion to extract more information than solely from expression data?”
Fusion of text and expression data
Integration of text and data
• In each information space
  • Appropriate preprocessing
  • Choice of distance measures
Integration of text and data
• Combine data:
  • The weight reflects the confidence attributed to either of the two data types
  • In the case of distances, it can be seen as a scaling constant between the norms of the data and text representations
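A minimal sketch of such a combination, assuming a simple convex weighting of the two distance matrices: the slide's own formula did not survive in this transcript, so the linear form, the function name `combine_distances` and the parameter `weight` are illustrative assumptions, and the numbers are made up.

```python
def combine_distances(d_expr, d_text, weight):
    """Convex combination of two equally sized distance matrices.

    `weight` is the confidence placed in the expression data;
    (1 - weight) goes to the text data. Assumed form, for illustration only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]

# Toy distance matrices (made up) for two genes.
d_expr = [[0.0, 4.0], [4.0, 0.0]]
d_text = [[0.0, 1.0], [1.0, 0.0]]
combined = combine_distances(d_expr, d_text, weight=0.5)
print(combined)  # [[0.0, 2.5], [2.5, 0.0]]
```

Because the two matrices live on different scales, the same weight can mean very different things from one data set to the next.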
Integration of text and data
• However, the differing distributions of distances introduce a bias (scaling problem)
• Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure)
[Figure: distance histograms for expression and for text]
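The idea of replacing raw distances by scale-free quantities before combining them can be sketched as follows. The slides only mention an omnibus procedure; Fisher's method and the rank-based empirical p-values used here are assumptions chosen for illustration, and the distance matrices are made up.

```python
import math
from bisect import bisect_right

def empirical_pvalues(dist):
    """Map each off-diagonal distance to the fraction of distances <= it."""
    n = len(dist)
    pool = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return [
        [bisect_right(pool, dist[i][j]) / len(pool) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]

def fisher_combine(p1, p2):
    """Fisher's omnibus for two p-values.

    The statistic -2*ln(p1*p2) is chi-square with 4 degrees of freedom,
    whose upper tail has the closed form p1*p2*(1 - ln(p1*p2)).
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def omnibus_distance(d_expr, d_text):
    """Combined, scale-free dissimilarity between every pair of genes."""
    p_expr, p_text = empirical_pvalues(d_expr), empirical_pvalues(d_text)
    n = len(d_expr)
    return [
        [fisher_combine(p_expr[i][j], p_text[i][j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]

# Toy distance matrices (made up) on very different scales.
d_expr = [[0, 1, 5], [1, 0, 9], [5, 9, 0]]
d_text = [[0, 0.1, 0.3], [0.1, 0, 0.2], [0.3, 0.2, 0]]
combined = omnibus_distance(d_expr, d_text)
print(combined)
```

Because only the ranks of the distances enter, rescaling either input matrix leaves the combined matrix unchanged.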
Overview
[Figure: analysis pipeline with the components meta-clustering, clustering and M-score]
Integration improves M-score
Optimal k?
[Figure: M-score at various cutoffs k of the cluster tree, for integrated clustering versus expression data only]
A look inside the integration
[Figure: text profile and expression profile; strong reinforcement]
Contributions
• Representation of a gene expression experiment
  • MIAME
  • Laboratory Information Management System v. at the VIB MicroArray Facility
• Gene expression analysis
  • Iterative clustering to determine optimal k
  • M-score
• Text-based gene representation
  • To represent functional information about genes
  • To score gene groups based on literature
  • To cluster genes based on literature
• TXTGate text mining application
  • To profile, in a flexible and interactive manner, gene groups from different `views'
• Integration of text and expression data in clustering
Conclusion
Future work
• Semantically-oriented text mining representations
  • Algorithm-based:
    • Improved phrases (word co-locations)
    • Latent Semantic Indexing
    • Concept clustering, bi-clustering
  • Knowledge-based:
    • Gene Ontology distance in a taxonomy
    • Basic natural language processing + statistics = shallow parsing
• Advanced ways of integrating data
  • Combine link information with term information
  • Ways to determine