
From EpoDB to EPConDB: Adventures in Gene Expression Databases

Explore EpoDB, a prototype database for analyzing genes expressed during vertebrate erythropoiesis, and the systems that followed it through EPConDB. Example queries retrieve adult β-globin gene transcription units, proximal promoter regions of genes with significant expression changes, and predicted transcription factor binding sites in the mouse β-globin gene. The databases integrate gene structure, function, regulation, and expression.



Presentation Transcript


  1. From EpoDB to EPConDB: Adventures in Gene Expression Databases. Chris Stoeckert, Ph.D., Computational Biology and Informatics Laboratory

  2. EpoDB: A Prototype Database for the Analysis of Genes Expressed During Vertebrate Erythropoiesis. Example queries: • Retrieve all adult, mammalian β-globin gene transcription units. • Retrieve the proximal promoter regions from all genes with a significant change in expression between the BFU stage and erythroblast stage. • Display the predicted transcription factor binding sites in the proximal promoter region of the mouse β-globin gene. Stoeckert, Salas, Brunk, Overton (1999) Nucl. Acids Res. 26:288
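The promoter-retrieval query can be illustrated with a minimal sketch. It assumes a transcription unit is stored as a sequence plus a transcription start site (TSS) landmark. The `TranscriptionUnit` class, the `proximal_promoter` function, the toy sequence, and the default window of 200 bases are all hypothetical; the slides do not give EpoDB's actual schema or promoter length, and EpoDB itself was Prolog-based rather than Python.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionUnit:
    """Hypothetical stand-in for an EpoDB transcription unit record."""
    gene: str
    sequence: str  # genomic sequence, 5'->3' on the coding strand
    tss: int       # 0-based index of the transcription start site landmark


def proximal_promoter(unit: TranscriptionUnit, upstream: int = 200) -> str:
    """Return up to `upstream` bases immediately 5' of the TSS.

    The window size is an assumed parameter, not a value from the slides.
    """
    start = max(0, unit.tss - upstream)  # clamp at the start of the sequence
    return unit.sequence[start:unit.tss]


# Toy record; the sequence is made up for illustration only.
unit = TranscriptionUnit(gene="toy-globin", sequence="GGCCAATCTATAAAAGGACATTTG", tss=16)
print(proximal_promoter(unit, upstream=10))
```

Storing the TSS as an explicit landmark is what makes "specified sequence retrieval" a simple slice relative to a named position rather than a search through free-text annotation.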

  3. EpoDB: A Data Warehouse Formed by Information Integration. Sources feeding the TRANSCRIPTION UNIT (EpoDB): • GENE STRUCTURE: GenBank • GENE FUNCTION: Swiss-Prot • GENE REGULATION: Transfac, TRRD • GENE EXPRESSION: GERD

  4. EpoDB Transcription Unit

  5. Populating EpoDB
     Database   GenBank   Swiss-Prot   TRRD   GERD
     Total      7819      2381         171    65
     Yes        3715      1241         80     65

  6. EpoDB Highlights • Reference sequences • “gene ontology” • Specified sequence retrieval • All vertebrates • Controlled vocabularies • Gene names and family names • Experiment descriptions • Gene warehouse • GenBank, SWISS-PROT, Transfac, literature • But no ESTs!

  7. http://www.cbil.upenn.edu/EpoDB

  8. EpoDB Gene Landmark Query

  9. EpoDB limitations • Not scalable • Manual triage of keyword-selected entries • Manual selection of reference genes • Did not handle data from high throughput technologies • No ESTs • Could not represent microarrays, SAGE • Could not make use of unannotated genomic sequence • Difficult to administer • Prolog-based

  10. Genomics to Functional Genomics Sequence level analysis: How do we (automatically?) annotate the human genome? Systems level analysis: How do we elucidate mechanisms of cellular development and differentiation? Our answer: Build databases.

  11. Chronology of CBIL Systems (1995 to 2001). Tracks: Gene Expression, Sequence Annotation. Systems: EpoDB, GAIA, ParaDB, DoTS, EPConDB, PlasmoDB, AllGenes. Platform: Relational DB (Oracle) with Perl object layer (GUS, RAD)

  12. CBIL Project Architecture. GUS: sequence & annotation; gene index (ESTs and mRNAs). RAD: microarray expression data; experimental annotation. Platform: Relational DB (Oracle) with Perl object layer

  13. GUS: Genomics Unified Schema (Davidson et al. IBM Systems Journal 2001). Genomic Sequence: • Genes, gene models • STSs, repeats, etc • Cross-species analysis. Transcribed Sequence (DOTS): • Characterize transcripts • RH mapping • Library analysis • Cross-species analysis. Transcript Expression (RAD, RNA Abundance DB): • Arrays • SAGE • Conditions. Protein Sequence: • Domains • Function • Structure • Cross-species analysis. Pathways, Networks: • Representation • Reconstruction. Special Features: • Ownership • Protection • Algorithm • Evidence • Similarity • Versioning. Controlled vocabs. and free text: • GO • Species • Tissue • Dev. Stage. Parts of the diagram are labeled "under development".

  14. GUS Object View Gene Genomic Sequence Gene Instance Gene Feature NA Feature NA Sequence RNA RNA Sequence RNA Instance RNA Feature Protein Protein Sequence Protein Instance Protein Feature AA Sequence AA Feature

  15. Clusters vs. Contig Assemblies. UniGene: BLAST, giving clusters of ESTs & mRNAs. Transcribed Sequences (DOTS): CAP4 (Paracel), giving consensus sequences - Alternative splicing - Paralogs
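To make the "cluster" side of this contrast concrete, here is a minimal sketch of similarity-based clustering: any two sequences joined by a qualifying BLAST hit end up in the same cluster (single linkage, implemented with union-find). The function name, the sequence IDs, and the use of transitive closure are assumptions for illustration; the slides do not specify how UniGene or DoTS build their clusters.

```python
def cluster_by_hits(seq_ids, hit_pairs):
    """Group sequences into clusters by transitive closure over pairwise hits.

    seq_ids:   iterable of sequence identifiers (ESTs and mRNAs)
    hit_pairs: iterable of (id_a, id_b) pairs that passed a similarity filter
    Returns a sorted list of sorted clusters.
    """
    seq_ids = list(seq_ids)
    parent = {s: s for s in seq_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hit_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers and hits.
ids = ["est1", "est2", "est3", "mrna1", "est4"]
pairs = [("est1", "est2"), ("est2", "mrna1")]
print(cluster_by_hits(ids, pairs))
```

A cluster built this way is only a set of member IDs. An assembler such as CAP4 additionally produces a consensus sequence for each contig, which is the distinction the slide draws.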

  16. Incremental Updates of DoTS Sequences • Incoming sequences (EST/mRNA): make quality (remove vector, polyA, NNNs), giving “Quality” sequences (AssemblySequence) • Block with RepeatMasker, giving blocked sequences • Assign to DOTS consensus sequences (blastn at 40 bp length, 92% identity) • Cluster incoming sequences that are not covered by a consensus sequence, giving “Unassembled” clusters • Assemble DOTS consensus sequences and incoming sequences with CAP4 (initially reassemble CAP4 assemblies: consensus sequences and new) • Calculate new DOTS consensus sequence using weighted consensus sequence(s) and new CAP4 assembly, giving new consensus sequences • Update GUS database
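The assignment step can be sketched as a filter over blastn hits using the thresholds stated on the slide (alignment length of at least 40 bp, at least 92% identity). Everything else here is an assumption: the `Hit` record, the `DT.n` style identifiers, and the rule of keeping only the single best-scoring consensus per incoming sequence are hypothetical, and the real pipeline may assign a sequence to more than one consensus.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical summary of one blastn hit against a consensus sequence."""
    query: str           # incoming EST/mRNA identifier
    consensus: str       # DOTS consensus sequence identifier
    align_len: int       # alignment length in bp
    pct_identity: float  # percent identity over the alignment


MIN_LEN = 40         # threshold from the slide
MIN_IDENTITY = 92.0  # threshold from the slide


def assign_incoming(incoming, hits):
    """Split incoming sequences into assigned and unassigned.

    Returns (assigned, unassigned): a dict mapping query -> consensus for
    sequences with a qualifying hit, and a list of queries with none.
    The unassigned list is what would go on to the clustering step.
    """
    best = {}
    for h in hits:
        if h.align_len < MIN_LEN or h.pct_identity < MIN_IDENTITY:
            continue  # hit does not meet the stated thresholds
        cur = best.get(h.query)
        # Assumed tie-break: prefer higher identity, then longer alignment.
        if cur is None or (h.pct_identity, h.align_len) > (cur.pct_identity, cur.align_len):
            best[h.query] = h
    assigned = {q: best[q].consensus for q in incoming if q in best}
    unassigned = [q for q in incoming if q not in best]
    return assigned, unassigned


hits = [
    Hit("est1", "DT.1", 120, 98.5),
    Hit("est2", "DT.1", 35, 99.0),   # too short
    Hit("est3", "DT.2", 80, 90.0),   # identity too low
    Hit("est1", "DT.2", 60, 93.0),   # qualifies, but weaker than the DT.1 hit
]
assigned, unassigned = assign_incoming(["est1", "est2", "est3"], hits)
print(assigned, unassigned)
```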

  17. Predicting Gene Ontology Functions

  18. Assembled Transcripts. Human: about 3 million EST and mRNA sequences used; combined into 797,028 assemblies; clustered into 150,006 “genes”; a protein can be identified for 76,771 genes and a function predicted for 24,127 genes. Mouse: about 2 million EST and mRNA sequences used; combined into 355,770 assemblies; clustered into 74,024 “genes”; a protein can be identified for 34,008 genes and a function predicted for 15,403 genes.
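The counts on this slide are easier to compare as ratios. The short calculation below uses only the numbers given above; the derived percentages are not stated in the presentation, and the dictionary keys are illustrative names.

```python
# Counts taken directly from the slide.
counts = {
    "human": {"assemblies": 797_028, "genes": 150_006,
              "with_protein": 76_771, "with_function": 24_127},
    "mouse": {"assemblies": 355_770, "genes": 74_024,
              "with_protein": 34_008, "with_function": 15_403},
}


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


summary = {
    species: {
        "assemblies_per_gene": round(c["assemblies"] / c["genes"], 2),
        "protein_pct": pct(c["with_protein"], c["genes"]),
        "function_pct": pct(c["with_function"], c["genes"]),
    }
    for species, c in counts.items()
}

for species, row in summary.items():
    print(species, row)
```

By this calculation roughly half of the clustered "genes" have an identifiable protein, and about a sixth (human) to a fifth (mouse) have a predicted function.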

  19. Assembly Validation • Alignment to Genomic Sequence via Blast/sim4. • preliminary data look good • Assembly consistency (Assemblies provide potential SNPs)

  20. Bridging Fingerprint Contigs and RH Maps on Mouse Chromosome 5 Crabtree et al. Genome Research 2001 Fingerprint Map Chr. 5 RH Map

  21. Analyses built on GUS and RAD: • Identify shared TF binding sites • Genomic alignment and comparative sequence analysis • TESS (Transcription Element Search Software) • PROM-REC (Promoter recognition) • EST clustering and assembly

  22. RAD: RNA Abundance Database (Stoeckert et al. Bioinformatics 2001). Compliant with the MGED standards. Components: • Experiment • Raw Data • Platform • Metadata • Processed Data • Algorithm

  23. Microarray Gene Expression Database group (MGED). International effort on microarray data standards: • Developing standards for storing and communicating microarray-based gene expression data • Defining the minimal information required to ensure reproducibility and verifiability of results and to facilitate data exchange (MIAME, MAGEML-MAGEDOM) • Collecting (and where needed creating) controlled vocabularies/ontologies • Developing standards for data comparison and normalization. http://www.mged.org

  24. Query RAD by Sample or by Experiment

  25. Different Views of GUS/RAD. Focused annotation of specific organisms and biological systems. Organisms: Human, Mouse, Plasmodium falciparum. Biological systems: Endocrine pancreas, CNS, Hematopoiesis. (*not drawn to scale*)

  26. AllGenes

  27. AllGenes “Erythroblast” Query

  28. AllGenes Enhancements: Annotated Entries

  29. AllGenes Enhancements: Genomic Data

  30. http://plasmodb.org (Plasmodium Genome Consortium, Nucl. Acids Res. 2001)

  31. Functional Genomics of the Developing Endocrine Pancreas • cDNA libraries from pancreatic tissue • Consortium libraries • Novel genes • Relevant dbEST libraries • Microarray studies on pancreatic tissue • Genome-wide survey for genes expressed • Pancreas chip • Validated sequences of interest • Novel sequences from libraries

  32. www.cbil.upenn.edu/EPConDB

  33. EPConDB: Content and Features • Pancreas clone sets • Panc Chip Clone sets 1.0, 1.5, 2.0 • Transcripts found in consortium libraries • Novel transcripts discovered from consortium libraries • Microarray results • Using Incyte’s GEM (genome-wide survey) • Using Panc Chip • Genes expressed in pancreas • AllGenes queries: function, chromosomal location, name, accession • Pathways

  34. EPConDB Pathway query

  35. EPConDB Boolean Query

  36. EPConDB History Query

  37. EPConDB: Future Developments • Add more microarray results • Provide tools for microarray analysis • Provide genomic alignments • Provide tools for analysis of (putative) promoters

  38. Microarray Analysis: Xcluster Xcluster provided by Gavin Sherlock

  39. Microarray Analysis: R statistics SMA R package from Terry Speed’s group

  40. Microarray Analysis: PaGE Manduchi et al. Bioinformatics 2000

  41. Future EPConDB Query Result

  42. Microarray Analysis: Data download

  43. Summary • EpoDB provides high quality genes for sequence analysis • But is limited in scope • AllGenes provides the entire transcriptome for a wide variety of human and mouse tissues • Needs to provide high quality genes • PlasmoDB provides the entire Plasmodium genome. • Integrating EST, SAGE, and microarray data • EPConDB provides integration of EST and microarray gene expression data for a specific system • Will provide microarray analysis

  44. Acknowledgements. CBIL: Chris Stoeckert, Vladimir Babenko, Brian Brunk, Jonathan Crabtree, Sharon Diskin, Greg Grant, Yuri Kondrakhin, Georgi Kostov, Phil Le, Li Li, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Shannon McWeeney, Debbie Pinney, Angel Pizarro, Jonathan Schug, Fidel Salas, Juergen Haas, Chris Overton. Annotation collaborators: Nikolay Kolchanov, Alexey Katohkin. EPConDB collaborators: Klaus Kaestner, Marie Scearce, Doug Melton (Harvard), Alan Permutt (Wash. U). Comparative Sequence Analysis collaborators: Maja Bucan, Shaying Zhao, Whitehead/MIT Center for Genome Research. PlasmoDB collaborators: David Roos, Martin Fraunholz, Jesse Kissinger, Jules Milgram, Ross Koppel (Monash U.), Malarial Genome Sequencing Consortium (Sanger Centre, Stanford U., TIGR/NMRC).

  45. http://www.cbil.upenn.edu

  46. EPConDB Architecture. GUS: sequence & annotation; gene index (ESTs and mRNAs). RAD: microarray expression data; experimental annotation. Platform: Relational DB (Oracle) with Perl object layer

  47. RAD Multiple labs Multiple biological systems Multiple platforms Expressed genes? Differentially-expressed genes? Co-regulated genes? Gene pathways?

  48. Fetal Globin Gene Analysis (figure labels: Embryonic, Fetal; species shown: Rabbit, Tarsier, Lemur, Capuchin, Chimpanzee, Gorilla, Orangutan, Human)
