The UniProt knowledgebase (www.uniprot.org): a hub of integrated protein data
http://education.expasy.org/cours/Prague2011/
Marie-Claude.Blatter@isb-sib.ch, Swiss-Prot group, Geneva, SIB Swiss Institute of Bioinformatics
Protein sequences
• > 180 billion ‘different’ proteins on Earth (∑ N species × M genes)
• > 16.0 million ‘known’ protein sequences in 2011
• More than 99% of the protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA)
• Less than 1% come from direct protein sequencing (Edman, MS/MS…)
Data, knowledge: protein sequence, functional information
UniProt consortium
• EBI: European Bioinformatics Institute (UK)
• SIB: Swiss Institute of Bioinformatics (CH)
• PIR: Protein Information Resource (US)
UniProt databases
UniParc: protein sequence archive (EMBL-ENA equivalent at the protein level)
• Each entry contains a protein sequence, taxonomic information and cross-links to the other databases where the sequence is found (active or not)
• No annotation
• You can: query, Blast, download
• ~28 million entries
UniProt databases
UniRef: 3 sets of protein sequence clusters with 100, 90 and 50% identity; useful to speed up sequence similarity searches (BLAST)
• You can: query, Blast, download
• UniRef100: 14 million entries; UniRef90: 9 million entries; UniRef50: 4 million entries
UniProt databases
UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS))
• You can: download
• 10 million entries, included in UniParc
UniProt databases The central piece
UniProtKB: an encyclopedia on proteins
Composed of 2 sections:
• UniProtKB/TrEMBL: unreviewed, automatically annotated
• UniProtKB/Swiss-Prot: reviewed, manually annotated
Released every 4 weeks
UniProtKB
Origin of protein sequences
UniProtKB protein sequences are mainly derived from:
• INSDC (translated submitted coding sequences, CDS)
• Ensembl (gene prediction) and RefSeq sequences
• Sequences of PDB structures
• Direct submission or sequences scanned from the literature (includes direct protein sequencing)
Notes:
- UniProt does not do any gene prediction
- Most non-germline immunoglobulins, T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB, but stored in UniParc
- Data from the PIR database have been integrated in UniProtKB since 2003
Figure labels: 85% and 15%
EMBL to TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references; automated annotation
TrEMBL to Swiss-Prot: manual annotation of the sequence and associated biological information
UniProtKB/TrEMBL: unreviewed, automatic annotation, released every 4 weeks
UniProtKB/TrEMBL entry (www.uniprot.org): one protein sequence, one species
• Protein and gene names
• Taxonomic information
• Automated annotation: function, subcellular location, catalytic activity, sequence similarities…
• Automated annotation: transmembrane domains, signal peptide…
• References
• Automated annotation: keywords and Gene Ontology
• Cross-references to over 125 databases
UniProtKB/TrEMBL: automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged automatically.
Biological information: sources of annotation
• Provided by the submitter (EMBL, PDB, TAIR…)
• From automated annotation: automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule)
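The merging of 100% identical sequences from the same organism can be illustrated with a minimal sketch. This is not the UniProt pipeline: the record layout and the field names `source_id`, `taxon_id` and `sequence` are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def merge_identical(records):
    """Group records that share the same organism and the exact same sequence.

    Two identical strings necessarily have the same length, so the key
    (taxon_id, sequence) covers the "same length, same organism" condition.
    """
    merged = defaultdict(list)
    for rec in records:
        merged[(rec["taxon_id"], rec["sequence"])].append(rec["source_id"])
    return [
        {"taxon_id": taxon, "sequence": seq, "source_ids": sorted(ids)}
        for (taxon, seq), ids in merged.items()
    ]

# Hypothetical translated CDS records (identifiers are made up).
records = [
    {"source_id": "CDS_A", "taxon_id": 9606, "sequence": "MSKEK"},
    {"source_id": "CDS_B", "taxon_id": 9606, "sequence": "MSKEK"},   # same organism, same sequence
    {"source_id": "CDS_C", "taxon_id": 10090, "sequence": "MSKEK"},  # same sequence, other organism
    {"source_id": "CDS_D", "taxon_id": 9606, "sequence": "MSKEKF"},  # one residue longer
]
entries = merge_identical(records)
```

Only CDS_A and CDS_B end up in the same entry; the identical sequence from another organism stays separate.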
UniProtKB/TrEMBL
Example of fully automatic annotation: SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on the C4.5 decision tree algorithm.
• One annotation, one rule.
• High stringency: a rule requires 99% or greater estimated precision to generate annotation (tested on UniProtKB/Swiss-Prot).
• Rules are produced, updated and validated at each release.
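The 99% precision requirement can be sketched as a simple acceptance test against reviewed entries. This is only an illustration of the stringency check, not the SAAS implementation: it does not build a C4.5 decision tree, and the rule layout, the signature names `SIG_A` / `SIG_B` and the toy data are all hypothetical.

```python
def estimated_precision(rule, reviewed_entries):
    """Fraction of entries matched by the rule that carry the rule's annotation."""
    matched = [e for e in reviewed_entries if rule["condition"](e)]
    if not matched:
        return 0.0
    correct = sum(1 for e in matched if rule["annotation"] in e["annotations"])
    return correct / len(matched)

def accept_rule(rule, reviewed_entries, threshold=0.99):
    """Keep a rule only if its estimated precision reaches the threshold."""
    return estimated_precision(rule, reviewed_entries) >= threshold

# Toy "reviewed" set: 100 entries with SIG_A, all annotated 'Cytoplasm';
# 100 entries with SIG_B, of which only 98 are annotated 'Nucleus'.
reviewed = (
    [{"signatures": {"SIG_A"}, "annotations": {"Cytoplasm"}} for _ in range(100)]
    + [{"signatures": {"SIG_B"}, "annotations": {"Nucleus"}} for _ in range(98)]
    + [{"signatures": {"SIG_B"}, "annotations": set()} for _ in range(2)]
)

rule_a = {"condition": lambda e: "SIG_A" in e["signatures"], "annotation": "Cytoplasm"}
rule_b = {"condition": lambda e: "SIG_B" in e["signatures"], "annotation": "Nucleus"}

accepted_a = accept_rule(rule_a, reviewed)  # precision 100/100
accepted_b = accept_rule(rule_b, reviewed)  # precision 98/100, below the threshold
```

Each rule carries exactly one annotation, mirroring the "one annotation, one rule" principle.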
UniProtKB/Swiss-Prot: reviewed, manually annotated, released every 4 weeks
UniProtKB/Swiss-Prot entry (www.uniprot.org): one protein sequence, one gene, one species
• Protein and gene names
• Taxonomic information
• Manual annotation: function, subcellular location, catalytic activity, disease, tissue specificity, pathway…
• Manual annotation: post-translational modifications, variants, transmembrane domains, signal peptide…
• References
• Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…
• Manual annotation: keywords and Gene Ontology
• Cross-references to over 125 databases
Example sequence:
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
UniProtKB/Swiss-Prot: manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation…)
UniProtKB/Swiss-Prot 1- Protein sequence curation
UniProtKB/Swiss-Prot: a gene-centric view of the protein space
• 1 entry <-> 1 gene (1 species)
• The displayed protein sequence: …canonical, representative, consensus…
• + alternative sequences (described within the entry)
What is the current status?
• At least 20% of Swiss-Prot entries required a minimal amount of curation effort to obtain the “correct” sequence.
• Typical problems:
- unsolved conflicts
- uncorrected initiation sites
- frameshifts
- wrong gene prediction
- other ‘problems’
UCSC genome browser: examples of CDS annotation submitted to INSDC…
UniProtKB/Swiss-Prot 2- Biological data curation
Extract literature information and protein sequence analysis: maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (PROSITE, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the source of each annotation
General annotation (Comments) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org
Human protein manual annotation: some statistics (June 2011)
Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org
Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two
Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
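Because experimental and predicted annotations are kept apart, a query like this one reduces to filtering on both the annotation value and its evidence status. The sketch below works on toy in-memory records; the classes `Annotation` and `Entry`, the boolean `experimental` flag and the accessions are hypothetical and do not reflect the actual UniProtKB data model or query syntax.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    value: str
    experimental: bool  # False stands for a non-experimental qualifier

@dataclass
class Entry:
    accession: str
    locations: list = field(default_factory=list)
    modified_residues: list = field(default_factory=list)

def has_experimental(annotations, keyword):
    """True if at least one experimentally proven annotation mentions the keyword."""
    return any(a.experimental and keyword in a.value.lower() for a in annotations)

def cytoplasmic_phosphoserine(entries):
    """Accessions with an experimental cytoplasm location AND an experimental phosphoserine."""
    return [
        e.accession
        for e in entries
        if has_experimental(e.locations, "cytoplasm")
        and has_experimental(e.modified_residues, "phosphoserine")
    ]

entries = [
    Entry("ENTRY_1", [Annotation("Cytoplasm", True)], [Annotation("Phosphoserine", True)]),
    Entry("ENTRY_2", [Annotation("Cytoplasm", True)], [Annotation("Phosphoserine", False)]),
    Entry("ENTRY_3", [Annotation("Cytoplasm", False)], [Annotation("Phosphoserine", True)]),
    Entry("ENTRY_4", [Annotation("Nucleus", True)], [Annotation("Phosphoserine", True)]),
]
hits = cytoplasmic_phosphoserine(entries)
```

Only ENTRY_1 satisfies both conditions with experimental evidence; the predicted annotations in ENTRY_2 and ENTRY_3 are filtered out.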
‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the type of evidence for the existence of a given protein.
• Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58%)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria
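The five qualifiers form an ordered scale, which can be modelled as a small enumeration. This is a sketch for illustration only: the class name, the member names, the helper `with_evidence_up_to` and the accessions are hypothetical, while the numeric levels follow the list above.

```python
from enum import IntEnum

class ProteinExistence(IntEnum):
    PROTEIN_LEVEL = 1     # evidence at protein level
    TRANSCRIPT_LEVEL = 2  # evidence at transcript level
    HOMOLOGY = 3          # inferred from homology
    PREDICTED = 4
    UNCERTAIN = 5

def with_evidence_up_to(entries, level):
    """Accessions whose existence evidence is at least as strong as `level`.

    A lower number means stronger evidence, hence the `<=` comparison.
    """
    return sorted(acc for acc, pe in entries.items() if pe <= level)

# Hypothetical accessions mapped to their protein existence level.
entries = {
    "ENTRY_1": ProteinExistence.PROTEIN_LEVEL,
    "ENTRY_2": ProteinExistence.TRANSCRIPT_LEVEL,
    "ENTRY_3": ProteinExistence.HOMOLOGY,
    "ENTRY_4": ProteinExistence.UNCERTAIN,
}
supported = with_evidence_up_to(entries, ProteinExistence.TRANSCRIPT_LEVEL)
```

Here `supported` keeps only the entries with evidence at protein or transcript level.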
UniProtKB Additional information can be found in the cross-references (to more than 140 databases)
UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
• Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
• Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
• Sequence: EMBL, IPI, PIR, RefSeq, UniGene
• Proteomic: PeptideAtlas, PRIDE, ProMEX
• Polymorphism: dbSNP
• Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
• Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
• Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
• Ontologies: GO
• 2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
• Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
• 3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
• PPI: DIP, IntAct, MINT, STRING
• Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
• PTM: GlycoSuiteDB, PhosphoSite, PhosSite
• Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
The UniProt web site (www.uniprot.org)
• Powerful search engine, Google-like and easy to use, which also supports very directed field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve, ID mapping
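Since the URL of a result page reflects the query, a script can construct such a URL and bookmark or fetch it. The sketch below only builds the URL and does not contact the site. The path `/uniprot/` and the parameter names `query` and `format` are assumptions about the URL scheme and may not match the current web site; the free-text query is just an example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(query, fmt="tab", base="http://www.uniprot.org/uniprot/"):
    """Return a bookmarkable search URL (assumed scheme: ?query=...&format=...)."""
    return base + "?" + urlencode({"query": query, "format": fmt})

# Free-text query; field-specific syntax is deliberately left out of this sketch.
url = build_search_url("nucleus AND human")

# The query can be read back from the URL, which is what makes it bookmarkable.
params = parse_qs(urlparse(url).query)
```

The same string could be passed to any HTTP client, once the actual URL scheme has been checked against the site documentation.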
Search
A very powerful text search tool with autocompletion and refinement options for finding UniProt entries and documentation by biological information
Find all human proteins located in the nucleus
The search interface guides users with helpful suggestions and hints
Advanced Search
A very powerful search tool, to be used when you know in which entry section the information is stored
Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)