
Protein sequence databases http://education.expasy.org/cours/Murcia2011/

Protein sequence databases. http://education.expasy.org/cours/Murcia2011/ Marie-Claude.Blatter@isb-sib.ch, Swiss-Prot group, Geneva, SIB Swiss Institute of Bioinformatics. Menu: Introduction; Nucleic acid sequence databases (ENA, GenBank, DDBJ); Protein sequence databases


Presentation Transcript


  1. Protein sequence databases http://education.expasy.org/cours/Murcia2011/ Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein Sequence Databases

  2. Menu • Introduction • Nucleic acid sequence databases • ENA, GenBank, DDBJ • Protein sequence databases • UniProt databases (UniProtKB) • NCBI protein databases • Other databases (Ensembl, IPI, CCDS, …)

  3. Menu • Introduction • Nucleic acid sequence databases • ENA, GenBank, DDBJ • Protein sequence databases • UniProt databases (UniProtKB) • NCBI protein databases

  4. Indispensable for bioinformatic studies • Databases (free access on the web) • Software tools • Servers

  5. What is a database? • A collection of related data, which are • structured • searchable • updated periodically • cross-referenced • Also includes the associated tools necessary for access/query, download, etc.

  6. Why biological databases? • Exponential growth in biological data. • Data (genomic sequences, protein sequences, 3D structures, 2D gel electrophoresis, MS analysis, microarrays, publications…) are no longer published in a conventional manner, but directly submitted to databases. • Essential tools for biological research.

  7. The NAR online Molecular Biology Database Collection in 2011: a total of 1’330 databases http://nar.oxfordjournals.org/content/38/suppl_1

  8. Categories of databases for Life Sciences • Sequences (DNA, protein) • Genomics • 3D structure • Mutation/polymorphism • Protein domain/family • Metabolism/Pathways • Bibliography • ‘Others’ (protein-protein interaction, microarrays…)

  9. Categories of databases for Life Sciences • Sequences (DNA, protein): DNA/RNA: EMBL/GenBank/DDBJ; protein: UniProtKB, NCBInr • Genomics: OMIM, FlyBase • 3D structure: PDB • Mutation/polymorphism: dbSNP • Protein domain/family: InterPro • Metabolism/Pathways: KEGG • Bibliography: PubMed • ‘Others’ (protein-protein interaction, microarrays…)

  10. Protein Sequence Databases

  11. DNA sequences • Protein sequences • Microarray expression data • Human genome • Gene annotation • Macromolecular structure data

  12. Proliferation of databases • Which contains the highest-quality data? • Which is comprehensive? • Which is up to date? • Which is redundant? • Which is indexed (allows complex queries)? • Which Web server responds most quickly? • …?

  13. Awareness of the content and usage of knowledge resources is a pre-requisite to do any type of « serious » research in the field of molecular life sciences (AMB, 2007)

  14. Where can we find… • A video -> YouTube • Info on S. Hawking -> Wikipedia • A book -> Amazon • A friend -> Facebook (usually only one server) • DNA sequence -> EMBL • Protein sequence -> UniProtKB, RefSeq… (several different servers give access to the ‘same’ database)

  15. Servers • ‘Any computer (…) serving out applications or services can technically be called a server.’ (Wikipedia)

  16. EBI: http://www.ebi.ac.uk/

  17. NCBI: http://www.ncbi.nlm.nih.gov/

  18. ExPASy: http://expasy.org

  19. UniProt: www.uniprot.org

  20. How to find a database? • Beware: not all servers give access to the latest version of a database. It is important to know the ‘home server’ for a given database. • ExPASy life sciences directory -> ‘home’ server links (www.expasy.org/alinks.html) • Google (http://www.google.com) (not always linked to the ‘home’ server)

  21. http://www.expasy.org/

  22. http://www.expasy.org/links.html

  23. Protein Sequence Databases

  24. The same data on different servers… • UniProt • NCBI

  25. http://srs.dna.affrc.go.jp/srs8/srs?-id+1QexuT1Yn4Di0xF+[uniprot_swissprot-AccNumber:P16855]+-e

  26. Proteins…proteins

  27. Protein sequences are the fundamental determinants of biological structure and function. http://www.ncbi.nlm.nih.gov/protein

  28. Protein sequence databases are essential for… • Identification of proteins by proteomics -> completeness, sequence quality (‘producing large protein lists is not the end point in Proteomics’ -> extract knowledge) • Similarity searches, BLAST (functional prediction) -> sequence quality (no redundancy) • Training datasets (prediction tools, PTM, etc.) -> sequence and annotation quality • Creation of DNA chips for mRNA expression studies -> completeness (complete proteome), sequence quality

  29. ? • RefSeq • PRF • TrEMBL • GenPept • TPA • UniProtKB • (IPI) • Swiss-Prot • UniParc • Ensembl • (PIR) • PDB • UniMES • CCDS • NCBInr

  30. These identifiers all point to the same sequence of TP53 (p53)! • P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676, JT0436, etc.
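
The identifiers on slide 30 come from different resources, and the shape of each identifier usually hints at its source. The following is a minimal Python sketch of how such a list could be sorted by origin. The function name `classify_identifier` and the simplified regular expressions are illustrative assumptions, not official validators; identifiers that match none of the patterns are reported as unknown rather than guessed.

```python
import re

# Simplified, illustrative patterns (assumptions, not official validators).
PATTERNS = [
    ("RefSeq protein", re.compile(r"[NXY]P_\d+")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}")),
    (
        "UniProtKB accession",
        re.compile(
            r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
            r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
        ),
    ),
]


def classify_identifier(identifier):
    """Return the name of the first pattern that matches the whole identifier."""
    for source, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return source
    return "unknown"


if __name__ == "__main__":
    slide_ids = [
        "P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
        "UPI000002ED67", "IPI00025087", "HIT000320921",
        "XP_001172091", "DD954676", "JT0436",
    ]
    for identifier in slide_ids:
        print(f"{identifier:18s} {classify_identifier(identifier)}")
```

Pattern matching only guesses where an identifier comes from; it does not show that two identifiers refer to the same sequence. That requires the cross-references maintained by the databases themselves.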

  31. A HUPO test sample study reveals common problems in mass spectrometry–based proteomics (PubMed 19448641, 2009) • A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides) • Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results). • Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…
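
The identifier problem described on slide 31 can be illustrated with a small Python sketch: the same protein reported under several identifiers inflates a naive count until the identifiers are collapsed to one key. The mapping table below is a hand-written, hypothetical example that reuses three of the TP53 identifiers from slide 30; in practice the mapping would come from the cross-references of a database, and the function name `count_distinct_proteins` is an assumption made for illustration.

```python
# Hypothetical, hand-written mapping for illustration only; a real table
# would be built from database cross-references.
ID_TO_PROTEIN = {
    "P04637": "TP53",       # UniProtKB accession
    "NP_000537": "TP53",    # RefSeq protein
    "IPI00025087": "TP53",  # IPI
}


def count_distinct_proteins(hits, id_to_protein):
    """Count proteins after collapsing synonymous identifiers.

    Identifiers missing from the mapping are kept as they are, so unknown
    entries are never merged by accident.
    """
    return len({id_to_protein.get(hit, hit) for hit in hits})


if __name__ == "__main__":
    hits = ["P04637", "NP_000537", "IPI00025087"]
    print("naive count:    ", len(set(hits)))
    print("collapsed count:", count_distinct_proteins(hits, ID_TO_PROTEIN))
```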

  32. Protein sequence origin…

  33. Protein sequence origin • More than 99% of protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs) -> It is important to know where the protein sequence comes from (sequencing and gene prediction quality)!
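
Since almost all protein sequences are obtained by translating nucleotide sequences, a minimal Python sketch of that translation step may help. It assumes the standard genetic code, a coding sequence that is already in frame, and no introns; the function name `translate` and the toy input are illustrative. Real annotation pipelines also depend on gene prediction, which is exactly where the quality issues mentioned on the slide arise.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA coding sequence up to the first stop codon.

    Codons containing unexpected characters become 'X'; trailing bases that
    do not form a full codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # toy sequence: Met, Ala, stop
```

A single sequencing error or a wrongly predicted exon boundary changes the reading frame and therefore every downstream amino acid, which is why the origin of a translated sequence matters.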

  34. Flood of data: example with the genome sequences…

  35. New challenge • Flood of data -> needs to be stored, curated and made available for analysis and knowledge discovery

  36. • ~2500 genomes sequenced (single organisms, varying sizes, including viruses) • ~5’000 ongoing genome sequencing projects

  37. http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat ~50-100 genomes/month + ~2’500 viral genomes => Total ~5’000 genomes

  38. • ~2500 genomes sequenced (single organisms, varying sizes, including viruses) • ~5’000 ongoing genome sequencing projects • cDNA sequencing projects (ESTs or cDNAs) • Metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms

  39. Metagenomics: study of genetic material recovered directly from environmental samples • Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses • Whale fall (AAFZ00000000.1) • Soil, sand beach, New York air, … • Human fluids, mouse gut (millions of bacteria within the human body) • Water treatment industry… • Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi • Venter’s Sorcerer II

  40. • ~2500 genomes sequenced (single organisms, varying sizes) • ~5’000 ongoing genome sequencing projects • cDNA sequencing projects (ESTs or cDNAs) • Metagenome sequencing projects • Personal human genomes • New-generation sequencers: Illumina: 25 billion bp/day

  41. 3’000’000’000 $ (public consortium, 2000) • 300’000’000 $ (Celera, 2000) • 70’000’000 $ (diploid, 2007) • 2010 • 2’000’000 $ (2007) • http://www.youtube.com/watch?v=mVZI7NBgcWM • … 2700 genomes in 2010, 30’000 genomes in 2011?

  42. But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer’s disease and that he has the ‘blue eye’ allele…

  43. apoE gene (Ensembl genome browser)

  44. New projects • 1000 Genomes (first publication, October 2010) • Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…) • International Cancer Genome Consortium (www.icgc.org): they look at the most common cancers and, for each, sequence the genomes of 500 patients with cancer and 500 healthy individuals…

  45. How many protein-coding genes in the end?

  46. Peabody Museum exhibition on the Tree of Life http://www.peabody.yale.edu/exhibits/treeoflife/

  47. 190'500'025'042 • 1st estimate: ~30 million species (1.8 million named) • 2nd estimate: 20 million bacteria/archaea × 4'000 genes; 1 million protists × 6'000 genes; 5 million insects × 14'000 genes; 2 million fungi × 6'000 genes; 0.5 million plants × 20'000 genes; 0.5 million molluscs, worms, arachnids, etc. × 20'000 genes; 0.1 million vertebrates × 25'000 genes • The calculation: 2×10^7×4000 + 1×10^6×6000 + 5×10^6×14000 + 2×10^6×6000 + 5×10^5×20000 + 5×10^5×20000 + 1×10^5×25000 + 20000 (Craig Venter) + 42 (Douglas Adams) + …
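
The calculation on slide 47 can be checked with a few lines of Python. This is only a sketch of the arithmetic using the figures given on the slide; the variable and function names are illustrative. The terms that are spelled out sum to 190,500,020,042, which is 5,000 less than the headline figure, so the remainder presumably sits in the elided "…" term.

```python
# (group, number of species, genes per species), as listed on the slide
ESTIMATES = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 1_000_000, 6_000),
    ("insects", 5_000_000, 14_000),
    ("fungi", 2_000_000, 6_000),
    ("plants", 500_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 500_000, 20_000),
    ("vertebrates", 100_000, 25_000),
]

# The two extra terms named on the slide (Craig Venter, Douglas Adams).
EXTRAS = 20_000 + 42


def total_genes(estimates, extras=0):
    """Sum species x genes-per-species over all groups, plus extra terms."""
    return sum(species * genes for _, species, genes in estimates) + extras


if __name__ == "__main__":
    total = total_genes(ESTIMATES, EXTRAS)
    print(f"{total:,}")  # 190,500,020,042
```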

  48. About 190 billion proteins (?) • About 13.0 million ‘known’ protein sequences in 2011 (from ~300’000 species) • More than 99% of protein sequences are derived from the translation of nucleotide sequences • Less than 1% come from direct protein sequencing (Edman, MS/MS…) -> It is important that users know where the protein sequence comes from (sequencing and gene prediction quality)!
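
To put the two figures on slide 48 side by side, the short Python sketch below computes the share of the estimated protein universe that was covered by known sequences in 2011. Both inputs are the rough estimates from the slide, so the result is an order of magnitude at best; the function name is illustrative.

```python
def known_fraction(known_sequences, estimated_total):
    """Fraction of the estimated protein universe with a known sequence."""
    return known_sequences / estimated_total


if __name__ == "__main__":
    # Slide figures: ~13.0 million known sequences, ~190 billion proteins.
    fraction = known_fraction(13_000_000, 190_000_000_000)
    print(f"{fraction:.4%}")  # 0.0068%
```

In other words, under these estimates roughly one protein in 15,000 had a sequence in the databases.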

  49. The ideal life of a sequence… • cDNAs, ESTs, genes, genomes, … • Nucleic acid sequence databases • Protein sequence databases

  50. Menu • Introduction • Nucleic acid sequence databases • ENA, GenBank, DDBJ • Protein sequence databases • UniProt databases (UniProtKB) • NCBI protein databases
