Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics

Protein sequence databases:dissemination of protein knowledge http://education.expasy.org/cours/UniProt/ Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics

Menu • Introduction • Nucleic acid sequence databases • ENA, GenBank, DDBJ • Protein sequence databases • UniProt databases (UniProtKB) • NCBI protein databases (NCBInr, RefSeq…)

Protein sequences are the fundamental determinants of biological structure and function. http://www.ncbi.nlm.nih.gov/protein

ChallengeFlood of data -> need to be stored, curated and made available for analysis and knowledge discovery

Challenge (1)Manydifferentproteinsequencedatabases PRF RefSeq TrEMBLGenpept TPA UniProtKB (IPI) Swiss-Prot UniParc Ensembl (PIR) PDB UniMES CCDS NCBInr

Challenge (1bis) Differentproteinsequencedatabases : manyidentifiers for the sameproteinsequence • These identifiers are all pointing to the same TP53 protein sequence (p53) ! • P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676 , JT0436 , etc.

A HUPO test sample study reveals common problems in mass spectrometry–based proteomics PubMed 19448641 (2009) • A single mass spectrometry experiment can identified up to about 4000 proteins (15’000 peptides) • Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (search with different protein databases = could get different results). • Only 7 labs (on 27) were able to identify the 20 human proteins present in a sample, mainly due to the fact that the search engines used cannot distinguish among different identifiers for the same protein…

Challenge (3)(protein) sequenceannotation Nucleic Acids Res. 2010 ; 38(Database issue): D633–D639. ‘Examining links from the perspective of PubMed, we found that only a small fraction of published articles are linked to human genes (Entrez Gene).’

Journals do not (SHOULD NOT) accept a paperdealingwith a nucleicacidsequence if the ENA/GenBank/DDBJ AC numberis not available… ‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’ http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq …not the case for proteinsequences !!! no more the case for a lot of genomes !!!

Proteinsequenceorigin…

More than 99 % of the proteinsequences are derivedfrom the translation of nucleotidesequences (genomes and/or cDNAs) sequencingquality codingsequence (CDS) annotation accuracy genepredictionquality

… ~ 2500 genomes sequenced (single organism, varying sizes, including virus) … ~ 5’000 ongoing genome sequencing projects

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat ~ 50-100 genomes/month + ~2’500 viral genomes => Total ~ 5’000 genomes

… ~ 2500 genomes sequenced (single organism, varying sizes, including virus) … ~ 5’000 ongoing genome sequencing projects … cDNAs sequencing projects (ESTs or cDNAs) … metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms,

Metagenomicsstudy of genetic material recovered directly from environmental samples • Global OceanSampling (C. Venter) 1ml sea water: 1 mo bacteria and 10 mo virus • Whale fall (AAFZ00000000.1) • Soil, sandbeach, New-York air, … • Humanfluids, mouse gut (millions of bacteriawithinhuman body) • Water treatmentindustry… • Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi Venter’s Sorcerer II

… ~ 2500 genomes sequenced (single organism, varying sizes) … ~ 5’000 ongoing genome sequencing projects … cDNAs sequencing projects (ESTs or cDNAs) … metagenome sequencing projects … personalhumangenomes new generationsequencers : Illumina: 25 billions of bp /day;

3’000’000’000 $ (public consortium, 2000) 300’000’000 $ (Celera, 2000) 70’000’000 $ (diploid, 2007) 2010 2’000’000 $ (2007) http://www.youtube.com/watch?v=mVZI7NBgcWM …2700 genomes in 2010, 30’000 genomes in 2011 ?

But…we known now that his apoE allele is the one associated with increased risk for Alzheimer and that he has the ‘blue eye’ allele…

apoE gene (Ensembl genome browser)

New projects (homo sapiens) • 1000 genomes (first publication, October 2010) • Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…) • International cancer genome consortium (www.icgc.org). They look at the most common cancers and for each they sequence the genome of 500 patients with cancer and 500 healthy individuals…. How to definethehumanproteome ??? Whichsequences ???

How many proteins-coding genes at the end?

190‘500'025'042 1st estimate: ~30 million species (1.8 million named) 2nd estimate: 20 million bacteria/archea x 4'000 genes 1 million protists x 6'000 genes 5 million insects x 14'000 genes 2 million fungi x 6'000 genes 0.5 million plants x 20'000 genes 0.5 million molluscs, worms, arachnids, etc. x 20'000 genes 0.1 million vertebrates x 25'000 genes The calculation: 2x107x4000+1x106x6000+5x106x14000+2x106x6000+5x105x20000+5x105x20000+1x105x25000 +20000 (Craig Venter)+ 42(Douglas Adam) + …

About 190 billions of proteins (?) About 14.0 millions of ‘known’ proteinsequences in 2011 (from ~300’000 species) More than 99 % of the proteinsequences are derivedfrom the translation of nucleotidesequences Lessthan 1 % direct proteinsequencing (Edman, MS/MS…) -> It is important that protein database users know where the protein sequence comes from…

The ideallife of a sequence … cDNAs, ESTs, genes, genomes, … Nucleicacidsequencedatabases Proteinsequencedatabases

Menu • Introduction • Nucleic acid sequence databases • ENA/GenBank, DDBJ • Protein sequence databases • UniProt databases (UniProtKB) • NCBI protein databases

ENA (EMBL-Bank) GenBankDDBJDNA Data Bank of Japanarchive of primarysequence data and corresponding annotation submitted by the laboratoriesthatdid the sequencing. EuropeanNucleotide Archive

ENA/GenBank/DDBJ http://www.insdc.org/

ENA/GenBank/DDBJ • Serve asarchives : ‘nothing goes out’ • Contain all public sequences derived from: • Genomeprojects (> 80 % of entries) • Sequencingcenters (cDNAs, ESTs…) • Individualscientists ( 15 % of entries) • Patent offices (i.e. European Patent Office, EPO) • Currently: ~210x106 sequences, ~320x109bp; • Sequencesfrom > 300’000 differentspecies;

Archivaldatabases: • Can beveryredundant for someloci • Sequence records are owned by the original submitter and can not bealterered by a third party (except TPA)

accession number taxonomy references Cross-references

CDS CoDing Sequence (proposed by submitters) CDS annotation (Prediction or experimentally determined) sequence

The hectic life of a sequence … Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genes, genomes, … with or without annotated CDS provided by authors ENA, GenBank, DDBJ CDS CoDing Sequence portion of DNA/RNA translated into protein (from Met to STOP) Experimentally proved or derived from gene prediction !!! not so well documented !!!

CoDing SequenceAlignment between a mRNA and a genomic sequence exon CONTIG --------------------------------------------------------------------------------------CGANGGCCTATCAACAATGAAAGGTCGAAACCTG Genomic AGCTACAAACAGATCCTTGATAATTGTCGTTGATTTTACTTTATCCTAAATTTATCTCAAAAATGTTGAAATTCAGATTCGTCAAGCGAGGGCCTATCAACAATG-AAGGTCGAAACCTG *** ************ ** * ************** CONTIG CGTTTACTCCGGATACAAGATCCACCCAGGACACGGNAAAGAGACTTGTCCGTACTGACGGAAAG------------------------------------------------------- Genomic CGTTTACTCCGGATACAAGATCCACCCAGGACACGG-AAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTA ************************************ **************************** CONTIG ------------------------------------------------------------------------------------------------------------------------ Genomic TATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTT CONTIG ----------------------------------------------GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC Genomic GAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGAC ************************************************************************** CONTIG TGTCCTCTACAGAATCAAGAACAAGAAG---------------------------------------------GGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC Genomic TGTCCTCTACAGAATCAAGAACAAGAAGGTACTTGAGATCCTTAAACGCAGTTGAAAATTGGTAATTTTACAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTC **************************** *********************************************** CONTIG CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA Genomic CGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAAGGA ************************************************************************************************************************ CONTIG TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTNCCAACAAG----------------------------------------------------------------------------- Genomic TGCCAACAAGGCTGTCCGTGCCGCCAAGGCTGCTGCCAACAAGGTAAACTTTCTACAATATTTATTATAAACTTTAGCATGCTGTTAGAGCTTGTAAGGTATATGTGATTTTACGAGTGT ********************************** ******** CONTIG -------------------------------------------------------------------------------------------------------------------GNAAA Genomic GTTATTTGAAGCTGTAATATCAATAAGCATGTCTCGTGTGAAGTCCGACAATTTACCATATGCATGAAATTTAAAAACAAGTTAATTTTGTCAATTCTTTATCATTGGTTTTCAGGAAAA **** CONTIG GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATNTNAAGACTGCTGCTCCNCGTGTCGGNGGAAANCGATAAACGTTCTCGGNCCCGTTATTGTAATAAATTTTGTTGAC Genomic GAAGGCCTCTCAGCCAAAGACCCAGCAAAAGACCGCCAAGAATGTGAAGACTGCTGCTCCACGTGTCGGAGGAAAGCGATAAACGTTCTCGGTCCCGTTATTGTAATAAATTTTGTTGAC ******************************************* * ************** ******** ***** **** * *********** *************************** CONTIG C----------------------------------------------------------------------------------------------------------------------- Genomic CGTTAAAGTTTTAATGCAAGACATCCAACAAGAAAAGTATTCTCAAATTATTATTTTAACAGAACTATCCGAATCTGTTCATTTGAGTTTGTTTAGAATGAGGACTCTTCGAATAGCCCA * exon intron exon intron exon intron exon

CDS provided by the submitters The first Met ! CDS translation provided by ENA

mRNAs and theircorresponding CDS annotation (from EMBL/GenBank/DDBJ) UCSC: human EPO contig 3’ 5’

mRNAs and theircorresponding CDS annotation (from EMBL/GenBank/DDBJ)

Very rarely done…

Complete genome (submitted) but only ~ 2,000 CDS/proteins available !

…annotated CDS in UniProtKB http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

From nucleic acid to amino acid sequences databases….

no CDS Gene prediction RefSeq, Ensembl The hectic life of a protein sequence … Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, … Nucleic acid databases ENA, GenBank, DDBJ …if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries) RefSeq, Ensembl and other Protein sequence databases

Why doing things in a simple way, when you can do it in a very complex one ?

The hectic life of a sequence … Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, … Scientific publications derived sequences ENA, GenBank, DDBJ CoDing Sequences provided by submitters CoDing Sequences provided by submitters and gene prediction TrEMBLGenpept RefSeq PRF TPA UniProtKB (IPI) Swiss-Prot UniParc Ensembl (PIR) PDB CCDS UniMES + all ‘species’ specific databases (EcoGene, TAIR, …)

Major ‘general’ protein sequence database ‘sources’ Integrated resources ‘cross-references’ PDB PRF PIR TPA UniProtKB: Swiss-Prot + TrEMBL NCBI-nr: Swiss-Prot + TrEMBL+ GenPept + PIR + PDB + PRF + RefSeq + TPA Resources kept separated not complete !!! (only entries createdbefore 2007 ?) UniProtKB/Swiss-Prot: manuallyannotatedproteinsequences(12’000 species) UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundantwithSwiss-Prot(300’000 species) GenPept: submitted CDS (GenBank); redundantwithSwiss-Prot(300’000 species ?) PIR: Protein Information Ressource; archive since 2003; integratedintoUniProtKB PDB: ProteinDatabank: 3D data and associatedsequences PRF: ProteinResearchFoundation journal scan of ‘published’ peptide sequences RefSeq: Reference Sequence for DNA, RNA, protein + geneprediction+ somemanual annotation (11’000 species) TPA: Third part annotation

Look for EPO (homo sapiens) Swiss-Prot TrEMBL www.uniprot.org

Look for EPO (homo sapiens) Swiss-Prot GenPept Swiss-Prot RefSeq RefSeq GenPept

Menu • Introduction • Nucleic acid sequence databases • ENA-Bank/GenBank, DDBJ • Protein sequence databases • UniProt databases (UniProtKB) • NCBI protein databases

UniProt UniProt consortium EBI : EuropeanBioinformatics Institute (UK) SIB : Swiss Institute of Bioinformatics (CH) PIR : Protein information resource (US)

Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics