Explore the evolution of bioinformatics databases, from EpoDB to PlasmoDB and EPConDB, highlighting features, limitations, and future prospects in gene expression analysis. Learn about high-throughput technologies, RNA abundance databases, genomic sequence analysis, and microarray gene expression data.
Expression Databases: Past, Present, and Future
Chris Stoeckert, Ph.D.
Penn Center for Bioinformatics
Dec 20, 2001
ELGAP: Bioinformatics/Computational Biology
Expression Databases: Past
EpoDB: A Prototype Database for the Analysis of Genes Expressed During Vertebrate Erythropoiesis
• Retrieve all adult, mammalian β-globin gene transcription units.
• Retrieve the proximal promoter regions from all genes with a significant change in expression between the BFU stage and erythroblast stage.
• Display the predicted transcription factor binding sites in the proximal promoter region of the mouse β-globin gene.
Stoeckert, Salas, Brunk, Overton (1999) Nucl. Acids Res. 26:288
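The second example query above combines an expression filter with sequence retrieval. A minimal sketch of that combination follows. It is not EpoDB's interface (EpoDB was Prolog-based); the `Gene` record, the fold-change cutoff standing in for "significant change", the 200-base promoter length, and the toy sequences are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Gene:
    """Hypothetical record; the field names are illustrative, not EpoDB's."""
    name: str
    sequence: str                 # genomic sequence containing the gene
    tss: int                      # 0-based index of the transcription start site
    expression: Dict[str, float]  # stage name -> expression level (arbitrary units)


def proximal_promoter(gene: Gene, length: int = 200) -> str:
    """Return up to `length` bases immediately upstream of the start site."""
    start = max(0, gene.tss - length)
    return gene.sequence[start:gene.tss]


def changed_between(gene: Gene, stage_a: str, stage_b: str,
                    min_fold: float = 2.0) -> bool:
    """Fold-change cutoff used here as a stand-in for 'significant change'."""
    a = gene.expression[stage_a]
    b = gene.expression[stage_b]
    if a <= 0 or b <= 0:
        return False
    ratio = b / a
    return ratio >= min_fold or ratio <= 1.0 / min_fold


# Toy data, made up for this sketch.
genes = [
    Gene("geneA", "ACGT" * 100, 300, {"BFU": 1.0, "erythroblast": 8.0}),
    Gene("geneB", "TTGA" * 100, 250, {"BFU": 5.0, "erythroblast": 5.5}),
]

# Promoters of genes whose expression changes between the two stages.
promoters = {
    g.name: proximal_promoter(g)
    for g in genes
    if changed_between(g, "BFU", "erythroblast")
}
```

Only `geneA` passes the filter in this toy data, so `promoters` holds a single 200-base sequence.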
EpoDB Highlights
• Reference sequences
• “gene ontology”
• Specified sequence retrieval
• All vertebrates
• Controlled vocabularies
• Gene names and family names
• Experiment descriptions
• Gene warehouse
• GenBank, SWISS-PROT, Transfac, literature
• But no ESTs!
EpoDB Limitations
• Not scalable
• Manual triage of keyword-selected entries
• Manual selection of reference genes
• Did not handle data from high-throughput technologies
• No ESTs
• Could not represent microarrays, SAGE
• Could not make use of unannotated genomic sequence
• Difficult to administer
• Prolog-based
Expression Databases: Current
• Databases: PlasmoDB, AllGenes, EPConDB
• GUS: sequence & annotation; gene index (ESTs and mRNAs)
• RAD: microarray expression data; experimental annotation
• Relational DB (Oracle) with Perl object layer
GUS: Genomics Unified Schema
Schema diagram, by data type:
• Genomic Sequence: genes, gene models; STSs, repeats, etc.; cross-species analysis
• Transcribed Sequence (DOTS): characterize transcripts; RH mapping; library analysis; cross-species analysis
• Transcript Expression (RAD, RNA Abundance DB): arrays; SAGE; conditions
• Protein Sequence: domains; function; structure; cross-species analysis
• Pathways, Networks: representation; reconstruction
• Special Features: ownership; protection; algorithm; evidence; similarity; versioning
• Controlled vocabs.: GO; species; tissue; dev. stage
• Other diagram labels: "free text", "under development"
Davidson et al. IBM Systems Journal 2001
Clusters vs. Contig Assemblies
• UniGene: BLAST: clusters of ESTs & mRNAs
• Transcribed Sequences (DOTS): CAP4 (Paracel): consensus sequences
• Alternative splicing
• Paralogs
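To make the "cluster" side of this comparison concrete, here is a minimal sketch of single-linkage clustering: any two sequences linked by a pairwise similarity hit, directly or through intermediates, end up in the same cluster. It does not run BLAST or CAP4 and is not the UniGene or DOTS pipeline; the sequence IDs and the hit list are made up, and `cluster` is a hypothetical helper built on a standard union-find structure.

```python
from typing import Dict, Iterable, List, Tuple


def cluster(ids: Iterable[str], hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group sequence IDs into connected components of the hit graph."""
    ids = list(ids)
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Toy input: pairs of sequences assumed to share a significant similarity hit.
seq_ids = ["est1", "est2", "est3", "mrna1", "est4"]
hits = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]

clusters = cluster(seq_ids, hits)
```

A cluster only records membership. A contig assembly goes one step further and derives consensus sequences from the members, which is why the slide lists alternative splicing and paralogs under the assembly side: members of one cluster may assemble into more than one consensus.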
Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5
Crabtree et al. Genome Research 2001
[Figure panels: Fingerprint Map; Chr. 5 RH Map]
Identify shared TF binding sites
• Genomic alignment and comparative sequence analysis
• TESS (Transcription Element Search Software)
• RAD
• GUS
• EST clustering and assembly
RAD: RNA Abundance Database
Stoeckert et al. Bioinformatics 2001
• Schema components: Experiment, Raw Data, Platform, Metadata, Processed Data, Algorithm
• Compliant with the MGED standards
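One way to picture how the six components named on this slide could relate is sketched below. This is not the RAD schema (RAD is a relational database); the class fields, the `scale_to_mean` normalization, and the toy values are assumptions chosen only to show processed data pointing back to the experiment and to the algorithm that produced it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Platform:
    name: str
    element_ids: List[str]        # spots or tags measured on this platform


@dataclass
class Algorithm:
    name: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class Experiment:
    name: str
    platform: Platform
    metadata: Dict[str, str]      # e.g. sample and condition annotation
    raw_data: Dict[str, float]    # element id -> raw intensity


@dataclass
class ProcessedData:
    experiment: Experiment        # where the values came from
    algorithm: Algorithm          # how they were derived
    values: Dict[str, float]


def scale_to_mean(experiment: Experiment, target_mean: float = 1.0) -> ProcessedData:
    """Hypothetical processing step: rescale raw values to a target mean."""
    raw = experiment.raw_data
    factor = target_mean / (sum(raw.values()) / len(raw))
    algorithm = Algorithm("scale_to_mean", {"target_mean": target_mean})
    scaled = {element: value * factor for element, value in raw.items()}
    return ProcessedData(experiment, algorithm, scaled)


# Toy records, made up for this sketch.
platform = Platform("toy_array", ["spot1", "spot2"])
experiment = Experiment(
    "toy_experiment",
    platform,
    {"tissue": "pancreas"},
    {"spot1": 100.0, "spot2": 300.0},
)
processed = scale_to_mean(experiment)
```

Keeping the algorithm and its parameters alongside the processed values means the raw-to-processed step can be repeated or audited later.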
Microarray Gene Expression Database group (MGED)
Nature Genetics 29:365-371, 2001
http://www.mged.org
http://plasmodb.org Bahl et al. Nucl. Acids. Res 2002
Functional Genomics of the Developing Endocrine Pancreas
NIDDK Consortium
www.cbil.upenn.edu/EPConDB
EPConDB: Content and Features
• Pancreas clone sets
• Panc Chip Clone sets 1.0, 1.5, 2.0
• Transcripts found in consortium libraries
• Novel transcripts discovered from consortium libraries
• Microarray results
• Using Incyte’s GEM (genome-wide survey)
• Using Panc Chip
• Genes expressed in pancreas
• Can combine with sequence queries: function, chromosomal location, keyword, accession, libraries
• Pathways
Expression Databases: Future
• Databases: StemCellDB II; EpoDB: The Next Generation
• GUS: sequence & annotation; gene index (ESTs and mRNAs)
• RAD: high-throughput expression data
• “BOD”: additional DB for Biomaterial Ontology; sample description and manipulation
Building a Microarray Ontology http://www.cbil.upenn.edu/Ontology/Build_Ontology2.html
Acknowledgements
• CBIL: Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Li Li, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug, Fidel Salas, Juergen Haas, Chris Overton
• Annotation collaborators: Nikolay Kolchanov; Alexey Katohkin
• EPConDB collaborators: Klaus Kaestner; Marie Scearce; Doug Melton, Harvard; Alan Permutt, Wash. U
• Comparative Sequence Analysis collaborators: Maja Bucan; Shaying Zhao; Whitehead/MIT Center for Genome Research
• PlasmoDB collaborators: David Roos; Martin Fraunholz; Jesse Kissinger; Jules Milgram; Ross Koppel, Monash U.; Malarial Genome Sequencing Consortium
• Ontology collaborators: MGED Ontology Working Group; Helen Parkinson, EBI
http://www.cbil.upenn.edu
Assembled Transcripts
Human:
• About 3 million human EST and mRNA sequences used
• Combined into 797,028 assemblies
• Cluster into 150,006 “genes”
• Can identify a protein for 76,771 genes
• And predict a function for 24,127 genes
Mouse:
• About 2 million mouse EST and mRNA sequences used
• Combined into 355,770 assemblies
• Cluster into 74,024 “genes”
• Can identify a protein for 34,008 genes
• And predict a function for 15,403 genes
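The counts on this slide are easier to compare as ratios. The short calculation below uses only the numbers given above; the dictionary keys and the `summarize` helper are naming choices made for this sketch.

```python
# Counts copied from the slide.
counts = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def summarize(c):
    """Derive per-gene ratios from one species' counts."""
    return {
        "assemblies_per_gene": c["assemblies"] / c["genes"],
        "fraction_with_protein": c["with_protein"] / c["genes"],
        "fraction_with_function": c["with_function"] / c["genes"],
    }


summary = {species: summarize(c) for species, c in counts.items()}

for species, s in summary.items():
    print(
        f"{species}: {s['assemblies_per_gene']:.1f} assemblies per gene, "
        f"{s['fraction_with_protein']:.1%} with a protein, "
        f"{s['fraction_with_function']:.1%} with a predicted function"
    )
```

By these counts, roughly half of the “genes” in each species have an identifiable protein (about 51% human, 46% mouse), and a predicted function is available for about 16% of human and 21% of mouse “genes”.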