Bioinformatics. Biomolecular databases. Contents. Examples of biological databases Nucleic sequences: Genbank, EMBL, and DDBJ Protein sequences: UniProt The Gene Ontology (GO) project Issues and perspectives for biological databases. Biomolecular Databases.
Bioinformatics: Biomolecular databases
Jacques van Helden, Jacques.van-Helden@univ-amu.fr
Université d’Aix-Marseille, France. Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/
Former address (1999-2011): Université Libre de Bruxelles, Belgium. Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/
Contents
• Examples of biological databases
• Nucleic sequences: GenBank, EMBL, and DDBJ
• Protein sequences: UniProt
• The Gene Ontology (GO) project
• Issues and perspectives for biological databases
Biomolecular Databases Examples of biomolecular databases
Examples of biomolecular databases
• Sequence and structure databases
  • Protein sequences (UniProt)
  • DNA sequences (EMBL, GenBank, DDBJ)
  • 3D structures (PDB)
  • Structural motifs (CATH)
  • Sequence motifs (PROSITE, PRODOM)
• Genome sequences and annotations
  • Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
  • Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
• Molecular functions
  • Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
  • Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
  • Transport (YTPdb)
• Biological processes
  • Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
  • Signal transduction pathways (CSNdb, Transpath)
  • Protein-protein interactions (DIP, BIND, MINT)
  • Gene networks (GeneNet, FlyNets)
Databases of databases
• There are hundreds of databases related to molecular biology and biochemistry, and new ones are created every year.
• Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
  • http://nar.oupjournals.org/
  • 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
• The same journal maintains a database of databases: the Molecular Biology Database Collection.
  • http://www.oxfordjournals.org/nar/database/c/
• Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
  • http://srs.ebi.ac.uk/
Biomolecular Databases Nucleic sequence databases: GenBank, EMBL, and DDBJ
Nucleic sequence databases (Okubo et al. (2006) NAR 34: D6-D9)
• Before accepting an article that deals with a sequence, scientific journals require the sequence to have been deposited in a reference database.
• There are 3 main repositories for nucleic acid sequences: GenBank, EMBL, and DDBJ.
• Sequences deposited in any one of these 3 databases are automatically synchronized with the other 2.
The sequencing pace
• Nucleic sequences
  • GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
    • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
    • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
• Entire genomes
  • GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
  • http://www.genomesonline.org/gold_statistics.htm
• Protein sequences
  • Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
  • UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
  • http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Adapted from Didier Gonze
Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                       entries       nucleotides
---------------------------------------------------------------------
CON:Constructed                           7,236,371   359,112,791,043
EST:Expressed Sequence Tag               73,715,376    40,997,082,803
GSS:Genome Sequence Scan                 34,528,104    21,985,922,905
HTC:High Throughput cDNA sequencing         491,770       594,229,662
HTG:High Throughput Genome sequencing       152,599    25,159,746,658
PAT:Patents                              24,364,832    12,117,896,594
STD:Standard                             13,920,617    37,665,112,606
STS:Sequence Tagged Site                  1,322,570       636,037,867
TSA:Transcriptome Shotgun Assembly        8,085,693     5,663,938,279
WGS:Whole Genome Shotgun                 88,288,431   305,661,696,545
                                        -----------   ---------------
Total                                   252,106,363   450,481,663,919

Division                                    entries       nucleotides
---------------------------------------------------------------------
ENV:Environmental Samples                30,908,230    14,420,391,278
FUN:Fungi                                 6,522,586    11,614,472,226
HUM:Human                                32,094,500    38,072,362,804
INV:Invertebrates                        31,907,138    52,527,673,643
MAM:Other Mammals                        40,012,731   145,678,620,711
MUS:Mus musculus                         11,745,671    19,701,637,499
PHG:Bacteriophage                             8,511        85,549,111
PLN:Plants                               52,428,994    55,570,452,118
PRO:Prokaryotes                           2,808,489    28,807,572,238
ROD:Rodents                               6,554,012    33,326,106,733
SYN:Synthetic                             4,045,013       782,174,055
TGN:Transgenic                              285,307       849,743,891
UNC:Unclassified                          8,617,225     4,957,442,673
VRL:Viruses                               1,358,528     1,518,575,082
VRT:Other Vertebrates                    22,809,428    42,568,889,857
                                        -----------   ---------------
Total                                   252,106,363   450,481,663,919
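The totals in the release notes can be cross-checked against the per-class rows. The sketch below copies the class figures quoted above into a Python dictionary (the variable names are illustrative, not part of any EMBL tool). With these figures, the entry total counts every class, whereas the nucleotide total matches the sum of the classes only when CON is left out; a plausible reason, stated here as an assumption, is that constructed entries are assembled from sequences already counted in other classes.

```python
# Class table from EMBL release 113 as quoted above: class -> (entries, nucleotides).
embl_classes = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}

# Totals as printed in the release notes.
reported_entries = 252_106_363
reported_nucleotides = 450_481_663_919

# Sum of entries over all classes, CON included.
total_entries = sum(entries for entries, _ in embl_classes.values())

# Sum of nucleotides over all classes except CON.
nucleotides_without_con = sum(
    nucleotides
    for name, (_, nucleotides) in embl_classes.items()
    if name != "CON"
)

print("entries match:", total_entries == reported_entries)
print("nucleotides match (CON excluded):",
      nucleotides_without_con == reported_nucleotides)
```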
The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/
Size of the nucleic sequence databases • Summary of database contents for the 3 main databases of nucleic sequences. • Source: NAR database issue January 2006.
Biomolecular Databases UniProt: protein sequences and functional annotations
UniProt - the Universal Protein Resource http://www.uniprot.org/
[Figures: number of entries (polypeptides) in Swiss-Prot, http://www.expasy.org/sprot/relnotes/relstat.html; taxonomic distribution of the sequences; distribution within Eukaryotes]
• Database content (Sept 2012)
  • UniProtKB:
    • 24,532,088 entries
    • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
  • UniProtKB/Swiss-Prot section (reviewed):
    • 537,505 entries
    • Annotation by experts
    • High information content
    • Many references to the literature
    • Good reliability of the information
  • The rest (about 98% of the entries, given the counts above)
    • Automatic annotation by sequence similarity.
• Features
  • The most comprehensive protein database in the world.
  • A huge team: >100 annotators + developers.
  • Annotation by experts: annotators specialize in different types of proteins or organisms.
  • Recognized worldwide as an essential resource.
• References
  • Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
  • The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
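The share of expert-reviewed entries follows directly from the two entry counts quoted on this slide. A minimal sketch of the arithmetic (variable names are illustrative):

```python
# Entry counts quoted above for September 2012.
uniprotkb_entries = 24_532_088   # all of UniProtKB
swissprot_entries = 537_505      # reviewed Swiss-Prot section

# Fraction of entries annotated by experts, and the remainder
# that relies on automatic annotation.
reviewed_share = swissprot_entries / uniprotkb_entries
unreviewed_share = 1 - reviewed_share

print(f"Reviewed (Swiss-Prot): {reviewed_share:.1%}")
print(f"Automatically annotated: {unreviewed_share:.1%}")
```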
UniProt example - Human Pax-6 protein - Header: name and synonyms
UniProt example - Human Pax-6 protein - Human-based annotation by specialists
UniProt example - Human Pax-6 protein - Structured annotation: keywords and Gene Ontology terms
UniProt example - Human Pax-6 protein - Protein interactions; alternative products
UniProt example - Human Pax-6 protein - Detailed description of regions, variations, and secondary structure
UniProt example - Human Pax-6 protein - References to original publications
UniProt example - Human Pax-6 protein - Cross-references to many databases (fragment shown)
EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/ Human gene Pax6 aligned with vertebrate genomes
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/ Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/ Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex
EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/
Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/
Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/
Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/
Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/
Prosite - aligned sequences and logo http://www.expasy.ch/prosite/
• Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
• The sequence logo (below) indicates the level of conservation of each residue in each column of the alignment.
• Note the 6 cysteines, characteristic of this domain.
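One common way to quantify the per-column conservation that a sequence logo displays is the information content: the maximum entropy for the alphabet (log2 of 20 for amino acids) minus the Shannon entropy of the residues observed in the column. The sketch below is a simplified illustration, not the procedure used by Prosite: it ignores gaps, small-sample corrections, and background frequencies, and the toy alignment is hypothetical.

```python
import math
from collections import Counter

def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    residue frequencies observed in the column.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

# Hypothetical toy alignment: columns 1 and 4 are invariant cysteines.
alignment = ["CAKC", "CSRC", "CTKC", "CGRC"]

# Transpose the alignment so that each string holds one column.
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
info = [column_information(col) for col in columns]

for position, bits in enumerate(info, start=1):
    print(f"column {position}: {bits:.2f} bits")
```

Fully conserved columns reach the maximum of log2(20) bits, which is why invariant cysteines appear as the tallest letters in a logo.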
Prosite - Example of profile matrix http://www.expasy.ch/prosite/
Prosite - Example of sequence logo http://www.expasy.ch/prosite/
Prosite - Example of domain signature http://www.expasy.ch/prosite/
• The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
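Because a signature is a string pattern, it can be translated into a regular expression and searched in a sequence. The sketch below assumes the usual PROSITE pattern conventions (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repetition, "<" and ">" for the sequence ends). The function name, the pattern, and the sequence are hypothetical illustrations, not taken from a real Prosite entry, and rarer syntax variants are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repetition suffix such as (2) or (5,12).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # excluded residues
        else:
            regex = core                      # single residue or [ABC] set
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)

# Hypothetical pattern and sequence, for illustration only.
signature = "C-x(2)-C-x(6,8)-[RK]-{P}."
regex = prosite_to_regex(signature)
hit = re.search(regex, "MACAACDEFGHIKRAW")
print(regex)
print(hit.group(0) if hit else "no match")
```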
PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
CATH - Protein Structure Classification http://www.cathdb.info/
• CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
  • Class (C),
  • Architecture (A),
  • Topology (T),
  • Homologous superfamily (H).
• The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
• References
  • Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
  • Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
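The four levels map naturally onto a dotted numeric identifier of the form C.A.T.H. The sketch below shows one way to split such an identifier and list its ancestors in the hierarchy. The class and function names are hypothetical, and the identifier is used only as an example of the format.

```python
from typing import List, NamedTuple

class CathNode(NamedTuple):
    """One node of the CATH hierarchy, addressed by its four levels."""
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_id(cath_id: str) -> CathNode:
    """Split a dotted C.A.T.H identifier into its four numeric levels."""
    parts = cath_id.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated levels, got {cath_id!r}")
    return CathNode(*(int(part) for part in parts))

def lineage(node: CathNode) -> List[str]:
    """Identifiers from the class level down to the superfamily level."""
    levels = [str(level) for level in node]
    return [".".join(levels[: depth + 1]) for depth in range(len(levels))]

# Example identifier, used here only to illustrate the format.
node = parse_cath_id("3.40.50.300")
path = lineage(node)
print(node)
print(path)
```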
CATH - Protein Structure Classification http://www.cathdb.info/
InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/ • “A database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences.”
InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)
Biomolecular Databases The Gene Ontology (GO) database
Ontology definition
• Ontologie: partie de la métaphysique qui s'intéresse à l'être en tant qu'être, indépendamment de ses déterminations particulières
• Ontology: the part of metaphysics that deals with being as being, independently of its particular determinations
Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993