Protein sequence databases
http://education.expasy.org/cours/Murcia2011/
Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
Menu
• Introduction
• Nucleic acid sequence databases
  • ENA, GenBank, DDBJ
• Protein sequence databases
  • UniProt databases (UniProtKB)
  • NCBI protein databases
  • Other databases (Ensembl, IPI, CCDS, …)
Indispensable for bioinformatic studies
• Databases (free access on the web)
• Software tools
• Servers
What is a database?
• A collection of related data, which are
  • structured
  • searchable
  • updated periodically
  • cross-referenced
• Also includes the associated tools needed for access/query, download, etc.
Why biological databases?
• Exponential growth in biological data.
• Data (genomic sequences, protein sequences, 3D structures, 2D gel electrophoresis, MS analysis, microarrays, publications…) are no longer published in a conventional manner, but submitted directly to databases.
• Essential tools for biological research.
The NAR Online Molecular Biology Database Collection in 2011
A total of 1’330 databases
http://nar.oxfordjournals.org/content/38/suppl_1
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• 3D structure
• Mutation/polymorphism
• Protein domain/family
• Metabolism/Pathways
• Bibliography
• ‘Others’ (protein-protein interaction, microarrays…)
Categories of databases for Life Sciences
• Sequences (DNA, protein)
  • DNA/RNA: EMBL/GenBank/DDBJ
  • Protein: UniProtKB, NCBInr
• Genomics
  • OMIM, FlyBase
• 3D structure
  • PDB
• Mutation/polymorphism
  • dbSNP
• Protein domain/family
  • InterPro
• Metabolism/Pathways
  • KEGG
• Bibliography
  • PubMed
• ‘Others’ (protein-protein interaction, microarrays…)
[Figure labels: DNA sequences, protein sequences, microarray expression data, human genome gene annotation, macromolecular structure data]
Proliferation of databases
• Which contains the highest-quality data?
• Which is comprehensive?
• Which is up to date?
• Which is redundant?
• Which is indexed (allows complex queries)?
• Which Web server responds most quickly?
• …?
Awareness of the content and usage of knowledge resources is a pre-requisite to do any type of « serious » research in the field of molecular life sciences (AMB, 2007)
Where can we find…
• A video -> YouTube
• Info on S. Hawking -> Wikipedia
• A book -> Amazon
• A friend -> Facebook
Usually only one server.
• DNA sequence -> EMBL
• Protein sequence -> UniProtKB, RefSeq…
Several different servers give access to the ‘same’ database.
Servers
• ‘Any computer (…) serving out applications or services can technically be called a server.’ (Wikipedia)
EBI: http://www.ebi.ac.uk/
NCBI: http://www.ncbi.nlm.nih.gov/
ExPASy: http://expasy.org
UniProt: www.uniprot.org
How to find a database?
• Beware: not all servers give access to the latest version of a database. It is important to know the ‘home server’ of a given database.
• ExPASy life sciences directory -> ‘home’ server links (www.expasy.org/alinks.html)
• Google (http://www.google.com) (not always linked to the ‘home’ server)
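To make the ‘home server’ idea concrete, here is a minimal Python sketch that builds a download URL for a UniProtKB entry on its home server and parses a UniProtKB-style FASTA header. The URL pattern (https://rest.uniprot.org/uniprotkb/<accession>.fasta) is an assumption based on the current UniProt website and is not part of the original slides; the helper names are illustrative, and the header is abbreviated.

```python
def uniprot_fasta_url(accession: str) -> str:
    """Build a FASTA download URL on the UniProt home server (assumed URL pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_uniprot_header(header: str) -> dict:
    """Split a header of the form '>db|ACCESSION|ENTRY_NAME description'."""
    if not header.startswith(">"):
        raise ValueError("FASTA header lines start with '>'")
    ident, _, description = header[1:].partition(" ")
    db, accession, entry_name = ident.split("|")
    return {
        "db": db,  # 'sp' = Swiss-Prot section, 'tr' = TrEMBL section
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Abbreviated header for the p53 entry used as an example in this course.
header = ">sp|P04637|P53_HUMAN Cellular tumor antigen p53"
record = parse_uniprot_header(header)
print(record["accession"], "->", uniprot_fasta_url(record["accession"]))
```

Whatever server the record is finally downloaded from, the accession number in the header identifies the entry and can be used to go back to the home server.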
http://www.expasy.org/
http://www.expasy.org/links.html
The same data on different servers… (UniProt, NCBI)
http://srs.dna.affrc.go.jp/srs8/srs?-id+1QexuT1Yn4Di0xF+[uniprot_swissprot-AccNumber:P16855]+-e
Proteins… proteins
Protein sequences are the fundamental determinants of biological structure and function. http://www.ncbi.nlm.nih.gov/protein
Protein sequence databases are essential for…
• Identification of proteins by proteomics
  -> completeness, sequence quality
  ‘Producing large protein lists is not the end point in proteomics’ -> extract knowledge
• Similarity searches, BLAST (functional prediction)
  -> sequence quality (no redundancy)
• Training datasets (prediction tools, PTM, etc.)
  -> sequence and annotation quality
• Creation of DNA chips for mRNA expression studies
  -> completeness (complete proteome), sequence quality
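The ‘no redundancy’ requirement can be illustrated with a small Python sketch that collapses records with identical sequences into one entry while keeping all source identifiers. The identifiers and sequences are toy data invented for this illustration, and exact-match grouping by checksum is a simplifying assumption; real non-redundant collections follow their own procedures.

```python
import hashlib
from collections import defaultdict


def collapse_identical(records):
    """Group (identifier, sequence) pairs by exact sequence identity.

    Returns a dict mapping a sequence checksum to the sequence and the
    sorted list of identifiers that share it.
    """
    groups = defaultdict(lambda: {"sequence": None, "ids": []})
    for identifier, sequence in records:
        normalized = sequence.strip().upper()
        checksum = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        groups[checksum]["sequence"] = normalized
        groups[checksum]["ids"].append(identifier)
    for group in groups.values():
        group["ids"].sort()
    return dict(groups)


# Toy records (hypothetical identifiers and sequences).
records = [
    ("dbA:0001", "MKTAYIAKQR"),
    ("dbB:X17", "mktayiakqr"),  # same sequence, different source and letter case
    ("dbC:42", "MKTAYIAKQL"),   # differs by one residue, so it stays separate
]
nonredundant = collapse_identical(records)
print(len(records), "records ->", len(nonredundant), "unique sequences")
```

Keeping the list of source identifiers for each unique sequence preserves the link back to the original databases, which matters for the annotation-quality points above.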
? RefSeq, PRF, TrEMBL, GenPept, TPA, UniProtKB, (IPI), Swiss-Prot, UniParc, Ensembl, (PIR), PDB, UniMES, CCDS, NCBInr
These identifiers all point to the same sequence of TP53 (p53)!
• P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676, JT0436, etc.
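As a rough illustration of how such identifiers can be told apart, the following Python sketch guesses the source database from the shape of the identifier. The UniProtKB pattern follows the published accession number format; the other patterns are simplified assumptions made for this example, and anything they do not cover is reported as unrecognized rather than guessed.

```python
import re

# (label, pattern) pairs, tried in order. Only a few identifier families are covered.
ID_PATTERNS = [
    ("UniProtKB accession",
     re.compile(r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("RefSeq protein", re.compile(r"^[NXY]P_\d+(\.\d+)?$")),
    ("Ensembl human gene", re.compile(r"^ENSG\d{11}$")),
    ("CCDS", re.compile(r"^CCDS\d+(\.\d+)?$")),
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}(\.\d+)?$")),
]


def classify(identifier: str) -> str:
    """Return the label of the first matching pattern, or 'unrecognized'."""
    for label, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return label
    return "unrecognized"


for identifier in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                   "UPI000002ED67", "IPI00025087", "XP_001172091", "JT0436"]:
    print(f"{identifier:18s} {classify(identifier)}")
```

Recognizing the identifier type is only a first step: deciding that two identifiers refer to the same protein still requires the cross-references maintained by the databases themselves.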
A HUPO test sample study reveals common problems in mass spectrometry–based proteomics (PubMed 19448641, 2009)
• A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides).
• Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).
• Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…
Protein sequence origin…
Protein sequence origin
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs)
-> It is important to know where the protein sequence comes from… (sequencing & gene prediction quality)!
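Since almost all protein sequences are obtained by translating nucleotide sequences, a minimal Python sketch of that translation step may help. It assumes the standard genetic code, a coding sequence that is already in frame, and no introns; real annotation pipelines also have to predict the gene structure, which is where many errors arise. The input sequence is a toy example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # prints MAIVMGR
```

A single wrong base or a mispredicted exon boundary in the nucleotide sequence changes the resulting protein sequence, which is why the origin of a protein entry matters.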
New challenge
• Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)
… ~ 5’000 ongoing genome sequencing projects
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
~ 50-100 genomes/month + ~ 2’500 viral genomes => Total ~ 5’000 genomes
… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms
Metagenomics: study of genetic material recovered directly from environmental samples
• Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses
• Whale fall (AAFZ00000000.1)
• Soil, sand beach, New York air, …
• Human fluids, mouse gut (millions of bacteria within the human body)
• Water treatment industry…
• Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
[Image: Venter’s Sorcerer II]
… ~ 2500 genomes sequenced (single organism, varying sizes)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
… personal human genomes
New-generation sequencers: Illumina, 25 billion bp/day
3’000’000’000 $ (public consortium, 2000)
300’000’000 $ (Celera, 2000)
70’000’000 $ (diploid, 2007)
2010
2’000’000 $ (2007)
http://www.youtube.com/watch?v=mVZI7NBgcWM
… 2700 genomes in 2010, 30’000 genomes in 2011?
But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the ‘blue eye’ allele…
apoE gene (Ensembl genome browser)
New projects
• 1000 Genomes (first publication, October 2010)
• Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…)
• International Cancer Genome Consortium (www.icgc.org): for each of the most common cancers, the genomes of 500 patients with cancer and 500 healthy individuals are sequenced…
How many protein-coding genes in the end?
Peabody museum exhibition on the Tree of Life http://www.peabody.yale.edu/exhibits/treeoflife/
190’500’025’042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
• 20 million bacteria/archaea x 4’000 genes
• 1 million protists x 6’000 genes
• 5 million insects x 14’000 genes
• 2 million fungi x 6’000 genes
• 0.5 million plants x 20’000 genes
• 0.5 million molluscs, worms, arachnids, etc. x 20’000 genes
• 0.1 million vertebrates x 25’000 genes
The calculation: 2x10^7 x 4000 + 1x10^6 x 6000 + 5x10^6 x 14000 + 2x10^6 x 6000 + 5x10^5 x 20000 + 5x10^5 x 20000 + 1x10^5 x 25000 + 20000 (Craig Venter) + 42 (Douglas Adams) + …
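The calculation on this slide can be reproduced with a few lines of Python. This is only a sketch that re-adds the terms listed on the slide; the variable names are illustrative. The listed terms add up to 190,500,020,042, which is 5’000 less than the headline figure, so the remainder presumably belongs to the unspecified trailing ‘…’.

```python
# (group, number of species, genes per species), as listed on the slide
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]
EXTRAS = 20_000 + 42  # the 'Craig Venter' and 'Douglas Adams' terms

species_total = sum(n_species for _, n_species, _ in ESTIMATES)
gene_total = sum(n_species * n_genes for _, n_species, n_genes in ESTIMATES)

print(f"species: {species_total:,}")             # species: 29,100,000
print(f"genes:   {gene_total:,}")                # genes:   190,500,000,000
print(f"with extras: {gene_total + EXTRAS:,}")   # with extras: 190,500,020,042
```

The species total of about 29 million is consistent with the first estimate of roughly 30 million species.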
About 190 billion proteins (?)
About 13.0 million ‘known’ protein sequences in 2011 (from ~300’000 species)
• More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
• Less than 1 % come from direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from… (sequencing & gene prediction quality)!
The ideal life of a sequence…
cDNAs, ESTs, genes, genomes, … -> Nucleic acid sequence databases -> Protein sequence databases
Menu
• Introduction
• Nucleic acid sequence databases
  • ENA, GenBank, DDBJ
• Protein sequence databases
  • UniProt databases (UniProtKB)
  • NCBI protein databases