Protein Sequence Database: UniProt Jennifer McDowall
Overview
• The UniProt databases
• UniProt/SwissProt annotation
• UniProt/TrEMBL automatic annotation
• Using the uniprot.org website
• Computational access
1) The UniProt databases
Source of protein sequence data
• Protein sequencing is rare
• Most protein sequence is derived from nucleotide data
[Diagram: large-scale sequencing projects, individual scientists and patent offices submit nucleotide sequencing results to the nucleotide sequence database, from which protein sequence is derived; protein sequencing results are submitted to the protein sequence database.]
Protein sequence is mainly derived data
• Submit DNA sequence: ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT
• Predicted start, predicted stop and predicted splice sites may not have direct evidence
• Transcribe: derived mRNA sequence AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
• Translate: derived protein sequence MRSDECCCAMSC
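The transcribe and translate steps can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is already a spliced coding strand that starts at the start codon; the function names and the toy sequence are made up for the example, and the difficult part described on the slide (predicting start, stop and splice sites) is not modelled at all.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Turn a coding-strand DNA string into mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence, for illustration only.
mrna = transcribe("ATGCGTAGTGATGAATGCTGCTGTGCGATGAGCTGC")
protein = translate(mrna)
print(mrna)
print(protein)
```

Every residue in the output depends on the predicted reading frame, so a wrong start site or splice junction changes the derived protein even when the DNA itself is correct.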
How to find the information you need?
High quality protein sequence
• Non-redundant data
• Splice isoforms, disease variants, PTMs
• Sequence archiving essential
Protein identification
• Stable identifiers
• Consistent nomenclature
Protein annotation
• Information: protein function, biological processes, molecular interactions, pathways
UniProt
• Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD
• Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database
• http://www.uniprot.org/
Where does the data come from?
[Diagram built up over three slides. Labels:]
• Sequence sources: ENA (exchange data daily), PDB, RefSeq, Ensembl, VEGA, patents, model organisms, more…
• UniParc: history of sequences
• UniMES: metagenomic & environmental
• UniProtKB/TrEMBL: taxonomy known
• UniProtKB/SwissProt: manual annotation, remove redundancy, high quality annotation
• UniRef clusters, UniMES clusters
4 components of UniProt
UniParc
• Complete history of sequences (no annotation)
• Cross-links to external sequence sources
UniProtKB
• Swiss-Prot: non-redundant, manual annotation
• TrEMBL: redundant, automatic annotation
UniMES
• Sequences from metagenomic projects
UniRef
• Combines sequences (speed searching)
• UniRef100, UniRef90, UniRef50
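The idea of combining sequences at the 100% level can be sketched as grouping records by identical sequence. This is only an illustration under stated assumptions: the record layout, the function name and the accessions are hypothetical, and the lower thresholds (UniRef90, UniRef50) require real sequence comparison, which is not attempted here.

```python
from collections import defaultdict


def cluster_identical(records):
    """Group (accession, sequence) pairs whose sequences are exactly equal."""
    clusters = defaultdict(list)
    for accession, sequence in records:
        # Normalise case and whitespace so formatting differences
        # do not split a cluster.
        key = "".join(sequence.split()).upper()
        clusters[key].append(accession)
    return dict(clusters)


# Hypothetical accessions and toy sequences, for illustration only.
records = [
    ("ACC_0001", "MRSDECCCAMSC"),
    ("ACC_0002", "mrsdecccamsc"),
    ("ACC_0003", "MRSDECCCAMSA"),
]
clusters = cluster_identical(records)
for sequence, members in clusters.items():
    print(sequence, members)
```

Searching one representative per cluster instead of every member is what makes the search faster, while the member list keeps the link back to each original entry.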
Browsing a UniParc entry
• Accession
• Download data
• List of databases containing sequence
• Deleted entries identified (greyed out)
• Navigate to individual entries
• Sequence
Browsing a UniProtKB/SwissProt entry
• Download data
• Names (synonyms) and taxonomy
• Protein attributes
• Annotation
• Ontologies
• Protein interactions
• Splice variants
• Sequence features
• Sequence
• References
• Navigate to external data sources, e.g. Ensembl
• General information
Browsing a UniRef90 entry
Faster and more sensitive sequence search with no loss of information
• Status (SwissProt and/or TrEMBL)
• Cluster name
• List of entries in cluster
• Taxonomy of each entry
• % identity of sequences in cluster
Taxonomic distribution of species
All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%)
Within Eukaryota: Other mammals (27%), Viridiplantae (18%), Fungi (18%), Homo (12%), Other Vertebrata (10%), Other (8%), Insecta (5%), Nematoda (2%)
SwissProt – most represented species Mainly model organisms
Protein Existence tag
!! Not sequence validation !!
Protein existence level (Total):
• Evidence at protein level: 13%
• Evidence at transcript level: 12%
• Inferred from homology: 70%
• Predicted: 5%
• Uncertain (mainly TrEMBL): -
Protein existence categories
!! Not sequence validation !!
Protein existence level (Human):
• Evidence at protein level: 59%
• Evidence at transcript level: 37.5%
• Inferred from homology: 1%
• Predicted: 0.5%
• Uncertain (mainly TrEMBL): 2%
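One place the existence level surfaces in practice is the FASTA header of a UniProtKB entry, where it appears as a `PE=` field numbered 1 to 5 in the order listed above. The sketch below assumes the current header layout (`>db|accession|entry_name description ... PE=n SV=n`, with `sp` for Swiss-Prot and `tr` for TrEMBL); the function name is made up, and the example header is illustrative rather than an authoritative copy of a live entry.

```python
import re

# Protein existence levels, in the order used on the slide.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumed header layout: >db|accession|entry_name description ... PE=n SV=n
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s.*\bPE=([1-5])\b")


def parse_header(header):
    """Extract section, accession, entry name and existence level."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError("not a recognised UniProtKB FASTA header")
    db, accession, entry_name, pe = match.groups()
    return {
        "section": "Swiss-Prot" if db == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
        "existence": PE_LEVELS[int(pe)],
    }


# Illustrative header in UniProt's FASTA style.
example = (
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
    "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
)
info = parse_header(example)
print(info)
```

As the slide stresses, the tag describes evidence that the protein exists; it says nothing about whether the stored sequence is correct.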
2) UniProtKB/SwissProt annotation
Annotation sources for UniProtKB
• Manual curation: literature-based annotation, sequence analysis
• Automated annotation: InterPro classification, signal prediction, transmembrane prediction, other predictions
Some data sources for annotation:
• GO: functional info
• PRIDE: protein identification data
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
• Protein classification
Features of UniProtKB
• Splice variants
• Sequence features
• Sequence
• Ontologies
• Annotations
• References
• Nomenclature
A wealth of external links
125 links!
• Organism-specific DBs: DictyBase, AGD, EchoBASE, CGD, EcoGene, CTD, euHCVdb, CYGD, FlyBase, HGNC, GeneCards, HPA, GeneFarm, MGI, Gramene, MIM, H-InvDB, RGD, LegioList, SGD, Leproma, TAIR, ListiList, ZFIN, MaizeGDB, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, SagaList, SubtiList, TubercuList, WormBase, WormPep, Xenbase, GeneDB_Spombe, ArachnoServer, BuruList
• Enzyme & pathway DBs: BioCyc, BRENDA, Reactome, Pathway_Interaction_DB
• Proteomic DBs: PeptideAtlas, PRIDE, ProMEX
• Genome annotation DBs: Ensembl, KEGG, GeneID, NMPDR, VectorBase, UCSC, GenomeReviews, TIGR
• Family and domain DBs: Gene3D, PIRSF, HAMAP, PRINTS, InterPro, ProDom, PANTHER, PROSITE, Pfam, TIGRFAMs, SMART
• Phylogenomic DBs: HOGENOM, OMA, HOVERGEN, PhylomeDB, InParanoid, OrthoDB
• Polymorphism DBs: dbSNP
• Ontologies: GO
• 2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
• 3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR
• Gene expression DBs: ArrayExpress, Bgee, GermOnline, CleanEx, Genevestigator
• Protein-protein interaction DBs: DIP, IntAct, STRING
• PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite
• Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene
• Protein family/group DBs: CAZy, MEROPS, PeroxiBase, REBASE, PptaseDB, TCDB
• Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
SwissProt manual annotation
Protein sequence
• Merge available CDS (coding sequence)
• Annotate sequence discrepancies
• Report sequencing errors...
Biological information
• Extract literature information
• Orthologue data propagation
• Protein sequence analysis...
Problem #1: sequence correction
~20% of Swiss-Prot entries required correction
Typical problems:
• Unsolved conflicts (sequencing errors)
• Erroneous gene model predictions
• Wrong initiation sites
• Frameshifts...
Sequence quality from genome projects
• Drosophila: well-curated; 1.8% of gene models incorrect
• Arabidopsis: annotated when sequenced, but no update; 19.5% of gene models incorrect
• Tetraodon nigroviridis: automatic run through (no manual intervention); >90% of gene models incorrect
Sequence curation: sequencing errors
Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines
Problem #2: proteome complexity
1 SwissProt entry = 1 gene (1 species)
• Genome: ~20,000 human protein-coding genes
• Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...)
• Proteome: >1,000,000 human proteins (post-translational modification)
Annotation of sequence differences
Merging entries
Because of:
• Errors: erroneous gene model predictions; sequence errors
• Natural variation: polymorphisms; alternative start sites; alternative splicing
Multiple entries for the same protein exist in TrEMBL (redundancy).
Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.
Example Multiple alignment of the end of the available GCR sequences: Annotation of the sequence differences (protein diversity):
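The position-by-position comparison behind annotating sequence differences can be sketched as follows. This is a minimal illustration, assuming the two sequences have already been aligned to the same length with '-' marking gaps; the function name and both sequences are made up and are not the GCR data from the slide. Deciding whether a difference is an error, a polymorphism or a splice variant remains the curator's job.

```python
def sequence_differences(reference, other):
    """Return (position, reference_residue, other_residue) per mismatch.

    Positions are 1-based. Both inputs must already be aligned to the
    same length; '-' stands for a gap.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(reference, other), start=1)
        if a != b
    ]


# Toy, made-up aligned sequences, for illustration only.
ref = "MRSDECCCAMSC"
alt = "MRSNECC-AMSC"
diffs = sequence_differences(ref, alt)
for position, a, b in diffs:
    print(f"{position}: {a} -> {b}")
```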
Sequence curation Alternative Splicing
Sequence curation Identification of amino acid variants ....and of PTMs ....and also
Sequence curation Domain annotation Binding sites
Sources of annotated information
UniProtKB/SwissProt gathers information from multiple sources:
• Publications (literature/PubMed)
• Protein prediction tools (Prosite, Anabelle)
• Contact with experts
• Other databases
• Nomenclature committees
Nomenclature Synonyms useful for literature searching
Nomenclature Provides synonyms and cleavage products of bifunctional proteins
Annotation comments >30 comment fields Controlled vocabularies used whenever possible…
Disease association
• Mendelian Inheritance in Man provides information on genetic disease associations
• Pharmacogenomics database
Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…
Sequence annotation (Features) Feature (e.g. domain) highlighted on sequence
Gene Ontology
1. Biological Process: a commonly recognized series of events
• Cell division
• Mitosis
• Organelle fission
2. Molecular Function: an elemental activity or task or job
• Protein kinase activity
• Insulin binding
• Insulin receptor activity
3. Cellular Component: where a gene product is located
• Mitochondrion
• Mitochondrial matrix
• Mitochondrial membrane
Gene Ontology Annotation for human Rhodopsin:
Imported annotation Binary interactions are taken from the database Interactors of human p53