Expression Databases: Past, Present, and Future

Explore the evolution of bioinformatics databases, from EpoDB to PlasmoDB and EPConDB, highlighting features, limitations, and future prospects in gene expression analysis. Learn about high-throughput technologies, RNA abundance databases, genomic sequence analysis, and microarray gene expression data.

Presentation Transcript


  1. Expression Databases: Past, Present, and Future • Chris Stoeckert, Ph.D. • Penn Center for Bioinformatics • Dec 20, 2001 • ELGAP: Bioinformatics/Computational Biology

  2. Expression Databases: Past • EpoDB: A Prototype Database for the Analysis of Genes Expressed During Vertebrate Erythropoiesis • Retrieve all adult, mammalian β-globin gene transcription units. • Retrieve the proximal promoter regions from all genes with a significant change in expression between the BFU stage and erythroblast stage. • Display the predicted transcription factor binding sites in the proximal promoter region of the mouse β-globin gene. Stoeckert, Salas, Brunk, Overton (1999) Nucl. Acids Res. 26:288
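
The first two example queries on this slide can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the slides do not describe EpoDB's schema (the real system was Prolog-based), so the `Gene` record, the rows in `GENES`, the expression numbers, and the fold-change threshold standing in for "significant change" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    # Hypothetical record layout, not EpoDB's actual schema.
    name: str
    family: str        # controlled-vocabulary family name
    taxon_class: str   # e.g. "Mammalia"
    stage: str         # developmental stage, e.g. "adult"
    expression: dict   # cell stage -> relative expression (made-up values)


# Illustrative placeholder rows only; names and values are not real data.
GENES = [
    Gene("mouse-beta-major", "beta-globin", "Mammalia", "adult",
         {"BFU": 1.0, "erythroblast": 40.0}),
    Gene("human-beta", "beta-globin", "Mammalia", "adult",
         {"BFU": 2.0, "erythroblast": 55.0}),
    Gene("human-epsilon", "beta-globin", "Mammalia", "embryonic",
         {"BFU": 1.0, "erythroblast": 1.2}),
    Gene("chicken-beta-A", "beta-globin", "Aves", "adult",
         {"BFU": 1.5, "erythroblast": 30.0}),
]


def adult_mammalian_beta_globins(genes):
    """Query 1: filter on exact controlled-vocabulary terms."""
    return [g.name for g in genes
            if g.family == "beta-globin"
            and g.taxon_class == "Mammalia"
            and g.stage == "adult"]


def changed_between(genes, stage_a, stage_b, min_fold=2.0):
    """Query 2 (first half): genes whose expression differs by at least
    min_fold between two stages. A plain fold-change cutoff is an
    assumption; the slide does not say how significance was judged."""
    selected = []
    for g in genes:
        low, high = sorted((g.expression[stage_a], g.expression[stage_b]))
        if low > 0 and high / low >= min_fold:
            selected.append(g.name)
    return selected


print(adult_mammalian_beta_globins(GENES))
print(changed_between(GENES, "BFU", "erythroblast"))
```

Exact-match filters like these only work when family, species, and stage are drawn from controlled vocabularies, which is one of the EpoDB features listed on the next slide.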

  3. EpoDB Highlights • Reference sequences • “gene ontology” • Specified sequence retrieval • All vertebrates • Controlled vocabularies • Gene names and family names • Experiment descriptions • Gene warehouse • GenBank, SWISS-PROT, Transfac, literature • But no ESTs!

  4. http://www.cbil.upenn.edu/EpoDB

  5. EpoDB limitations • Not scalable • Manual triage of keyword-selected entries • Manual selection of reference genes • Did not handle data from high throughput technologies • No ESTs • Could not represent microarrays, SAGE • Could not make use of unannotated genomic sequence • Difficult to administer • Prolog-based

  6. Expression Databases: Current • PlasmoDB • AllGenes • EPConDB • Sequence & annotation • Gene index (ESTs and mRNAs) • Microarray expression data • Experimental annotation • Relational DB (Oracle) with Perl object layer • GUS • RAD

  7. GUS: Genomics Unified Schema (Davidson et al. IBM Systems Journal 2001)
      Genomic Sequence • Genes, gene models • STSs, repeats, etc • Cross-species analysis
      Transcribed Sequence (DOTS) • Characterize transcripts • RH mapping • Library analysis • Cross-species analysis
      Transcript Expression (RAD: RNA Abundance DB) • Arrays • SAGE • Conditions
      Protein Sequence • Domains • Function • Structure • Cross-species analysis
      Pathways, Networks • Representation • Reconstruction
      Special Features • Ownership • Protection • Algorithm • Evidence • Similarity • Versioning
      Controlled vocabs., free text • GO • Species • Tissue • Dev. Stage
      Diagram label: "under development"

  8. Clusters vs. Contig Assemblies • UniGene • Transcribed Sequences (DOTS) • CAP4 (Paracel): Consensus Sequences (alternative splicing, paralogs) • BLAST: Clusters of ESTs & mRNAs
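
As a rough illustration of the "clusters" half of this slide, the sketch below groups sequences transitively from pairwise similarity hits using a union-find structure. It is a generic single-linkage grouping, not the BLAST or CAP4 pipeline itself; the sequence identifiers and hit pairs are hypothetical, and the assembly of each cluster into consensus sequences is not modeled.

```python
def cluster_by_hits(ids, hits):
    """Group sequence ids into clusters, where any pairwise hit
    links two sequences into the same cluster (single linkage)."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical ids and hit pairs, e.g. as parsed from a similarity search.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]
clusters = cluster_by_hits(ids, hits)
print(clusters)
```

A cluster built this way is only a bag of related sequences. An assembly step would go further and produce one or more consensus sequences per cluster, which is where the slide's notes on alternative splicing and paralogs come in.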

  9. http://www.allgenes.org

  10. AllGenes “Erythroblast” Query

  11. AllGenes Enhancements: Annotated Entries

  12. AllGenes Enhancements: Genomic Data

  13. Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5 (Crabtree et al. Genome Research 2001) • Fingerprint Map • Chr. 5 • RH Map

  14. Identify shared TF binding sites • Genomic alignment and comparative sequence analysis • TESS (Transcription Element Search Software) • RAD • GUS • EST clustering and assembly

  15. RAD: RNA Abundance Database (Stoeckert et al. Bioinformatics 2001) • Experiment • Raw Data • Platform • Metadata • Processed Data • Algorithm • Compliant with the MGED standards

  16. Microarray Gene Expression Database group (MGED) Nature Genetics 29:365-371, 2001 http://www.mged.org

  17. http://plasmodb.org Bahl et al. Nucl. Acids. Res 2002

  18. Functional Genomics of the Developing Endocrine Pancreas NIDDK Consortium www.cbil.upenn.edu/EPConDB

  19. EPConDB: Content and Features • Pancreas clone sets • Panc Chip Clone sets 1.0, 1.5, 2.0 • Transcripts found in consortium libraries • Novel transcripts discovered from consortium libraries • Microarray results • Using Incyte’s GEM (genome-wide survey) • Using Panc Chip • Genes expressed in pancreas • Can combine with sequence queries: function, chromosomal location, keyword, accession, libraries • Pathways

  20. EPConDB Pathway query

  21. EPConDB Boolean Query

  22. EPConDB History Query

  23. Expression Databases: Future • StemCellDB II EpoDB The Next Generation • High throughput expression data • Sequence & annotation • Gene index (ESTs and mRNAs) • Sample description and manipulation • Additional DB for Biomaterial Ontology • GUS • RAD • “BOD”

  24. Building a Microarray Ontology http://www.cbil.upenn.edu/Ontology/Build_Ontology2.html

  25. Example of Internal Terms

  26. Example of External Terms

  27. Example of Combined Internal and External: Treatment

  28. Acknowledgements (http://www.cbil.upenn.edu)
      CBIL: Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Li Li, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug, Fidel Salas, Juergen Haas, Chris Overton
      Annotation collaborators: Nikolay Kolchanov, Alexey Katohkin
      EPConDB collaborators: Klaus Kaestner, Marie Scearce, Doug Melton (Harvard), Alan Permutt (Wash. U)
      Comparative Sequence Analysis collaborators: Maja Bucan, Shaying Zhao, Whitehead/MIT Center for Genome Research
      PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram, Ross Koppel (Monash U.), Malarial Genome Sequencing Consortium
      Ontology collaborators: MGED Ontology Working Group, Helen Parkinson (EBI)

  29. Assembled Transcripts
      Human: about 3 million EST and mRNA sequences used • Combined into 797,028 assemblies • Cluster into 150,006 “genes” • Can identify a protein for 76,771 genes • And predict a function for 24,127 genes
      Mouse: about 2 million EST and mRNA sequences used • Combined into 355,770 assemblies • Cluster into 74,024 “genes” • Can identify a protein for 34,008 genes • And predict a function for 15,403 genes
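
To put the counts on this slide in proportion, the short sketch below derives a few ratios from them. The input numbers are copied from the slide; the ratios, the dictionary layout, and the function name are additions for illustration and are not stated in the presentation.

```python
# Counts copied from the slide; everything computed from them is derived here.
COUNTS = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def summarize(counts):
    """Express assemblies, protein hits, and function predictions
    relative to the number of gene clusters."""
    genes = counts["genes"]
    return {
        "assemblies_per_gene": round(counts["assemblies"] / genes, 2),
        "pct_with_protein": round(100 * counts["with_protein"] / genes, 1),
        "pct_with_function": round(100 * counts["with_function"] / genes, 1),
    }


SUMMARY = {species: summarize(c) for species, c in COUNTS.items()}
for species, stats in SUMMARY.items():
    print(species, stats)
```

On these figures, a protein could be identified for roughly half of the human gene clusters (about 51%) and about 46% of the mouse clusters, while a function was predicted for about 16% and 21% respectively.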
