Proteomics Resources at the EBI
Sandra Orchard, EMBL-EBI
What do protein scientists require?
1. Protein identification: a high-quality, non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs, to act as a reference set. Stable identifiers and sequence archiving are essential.
2. Protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources.
3. Reference data sets: comparative datasets for comparing tissue specificity patterns and normal/disease protein sets.
Where do we go from here?
• Sequence similarity programs run against UniProt
What is UniProt?
• Based on the original work on PIR, Swiss-Prot and TrEMBL
• Funded mainly by NIH
• Collaboration between EBI, SIB and PIR
[Figure: UniProt data sources and data flow. Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, FlyBase, WormBase, Patent Data, Sub/Peptide Data. Resources shown: UniParc, UniProtKB, UniSave, UniMes, UniRef 100, UniRef 90, UniRef 50, IPI, Proteome Sets.]
UniProtKB (www.uniprot.org)
• UniProt Knowledgebase: aims to describe in a single record all protein products derived from a certain gene from a certain species
• 2 sections:
• UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)
• UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)
What does UniProtKB give you?
1. Curated protein sequences: correction of frameshifts, premature stop sites, incorrect initiator methionines… Stable identifiers, with archiving and versioning
2. Consistent nomenclature, plus synonyms
3. Identification of splice variants and/or alternative promoter usage. Stable identifiers, with archiving and versioning
What does UniProtKB give you?
4. Identification of variants (at amino acid level) and of PTMs; where known, the consequence is given. Stable identifiers, with archiving and versioning
5. Annotation of experimental data from the literature in 27 defined fields. Increasing use of controlled vocabularies, without loss of detail
What does UniProtKB give you?
6. Extensive cross-referencing: a central portal to a wealth of external resources, with 81 external databases cross-referenced to UniProtKB
1. Sequence curation, stable identifiers, versioning and archiving www.ebi.ac.uk/uniprot/unisave
Sequence curation, stable identifiers, versioning and archiving
• For example: erroneous gene model predictions, frameshifts, premature stop codons, readthroughs, erroneous initiator methionines…
4. Identification of variants (at amino acid level) and of PTMs, and also:
• Domain annotation
• Binding sites
• Splice variants
• Experimental mutations
• Sequence conflicts
5. Annotation of literature experimental data in 27 defined fields.
Controlled vocabularies are used whenever possible, but the ability to further describe each specific situation is retained
Disease-specific annotation added to human entries, with supporting cross-referencing
6. Extensive cross-referencing, a central portal to a wealth of external resources, plus additional annotation (Gene Ontology)
InterPro – defines protein family membership and enables domain annotation
UniProtKB/TrEMBL
• Redundant: only 100% identical sequences are merged
• Automated clean-up of annotation from the original nucleotide sequence entry
• Additional value added by using automatic annotation
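The merging rule above (only 100% identical sequences are merged) can be sketched in a few lines. This is a minimal illustration, not the UniProt production pipeline: the record layout, the function name `merge_identical`, and the choice to merge only within the same organism are all assumptions made for the example.

```python
from collections import defaultdict

def merge_identical(records):
    """Group records whose sequences are 100% identical.

    `records` is an iterable of (source_id, organism, sequence) tuples.
    This layout is hypothetical and chosen only for the example.
    """
    groups = defaultdict(list)
    for source_id, organism, sequence in records:
        # Normalise case and whitespace so that formatting differences
        # do not make identical sequences look different.
        # Assumption: sequences are merged only within one organism.
        key = (organism, "".join(sequence.split()).upper())
        groups[key].append(source_id)
    # One merged entry per distinct (organism, sequence) pair.
    return [
        {"organism": org, "sequence": seq, "sources": sorted(ids)}
        for (org, seq), ids in groups.items()
    ]

records = [
    ("nuc_A", "Homo sapiens", "MKTAYIAKQR"),
    ("nuc_B", "Homo sapiens", "mktayiakqr"),  # identical after normalisation
    ("nuc_C", "Homo sapiens", "MKTAYIAKQK"),  # one residue differs: kept apart
]
merged = merge_identical(records)
```

Because the key is the exact sequence, a single-residue difference keeps two entries apart, which is why this section remains redundant.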
Automatic Annotation
• Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot
• Identifies all members of this family using pattern/motif/HMMs in InterPro
• Transfers common annotation to related family members in TrEMBL
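The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual annotation system: entries are plain dictionaries, "common annotation" is taken to mean annotation shared by every reviewed family member, and the identifiers (`SIG_KINASE`, `P1`, `T1`, and so on) are made up.

```python
def transfer_common_annotation(reviewed, unreviewed, signature):
    """Transfer annotation shared by a reviewed family to unreviewed members.

    Each entry is a dict with "id", "signatures" (set) and "annotation" (set).
    This structure is hypothetical and used only for the example.
    """
    # Step 1: reviewed entries that belong to the family (carry the signature).
    family = [e for e in reviewed if signature in e["signatures"]]
    if not family:
        return {}
    # Annotation common to every reviewed member of the family.
    common = set.intersection(*(e["annotation"] for e in family))
    # Steps 2 and 3: find unreviewed members by signature, then add
    # the common annotation to whatever they already have.
    return {
        e["id"]: e["annotation"] | common
        for e in unreviewed
        if signature in e["signatures"]
    }

reviewed = [
    {"id": "P1", "signatures": {"SIG_KINASE"},
     "annotation": {"ATP-binding", "Kinase", "Nucleus"}},
    {"id": "P2", "signatures": {"SIG_KINASE"},
     "annotation": {"ATP-binding", "Kinase", "Membrane"}},
]
unreviewed = [
    {"id": "T1", "signatures": {"SIG_KINASE"}, "annotation": set()},
    {"id": "T2", "signatures": {"SIG_OTHER"}, "annotation": set()},
]
result = transfer_common_annotation(reviewed, unreviewed, "SIG_KINASE")
```

Only annotation shared by both reviewed entries reaches `T1`; the location terms, which differ between `P1` and `P2`, are not transferred.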
[Figure: Protein Sequence Characterisation. Labels: BLAST; basic information; more sequences; conserved signatures. Caption: build up consensus sequences of families, domains, motifs or sites.]
Finding Conserved Signatures
• Pattern
• Fingerprint
• Sequence clustering
• Profile
• HMM
[Scale shown alongside the list: simplest (limited) to more information]
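To make the first entry in the list concrete, a pattern can be treated as a regular expression over the sequence. The sketch below assumes PROSITE-style pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors). The function name is hypothetical, the example pattern is a C2H2 zinc-finger-style pattern used purely for illustration, and anchors written inside brackets are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # '<' anchors the match at the N-terminus, '>' at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as x(3) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
# Made-up sequence constructed to contain one match.
sequence = "MKCAACAAALAAAAAAAAHAAAH"
match = re.search(regex, sequence)
```

A pattern either matches or it does not, which is why the slide places it at the "simplest (limited)" end of the scale; profiles and HMMs instead score how well a sequence fits.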
Foundations of InterPro
[Figure: integration of signatures into InterPro, with manual curation.]
Integration Process
[Figure: four cases for integrating PROSITE and PFAM signatures into InterPro entries (IPR000001, IPR000002), with percentage values (100, 50) shown on the signatures.]
1) Same positions, same protein hits
2) Same positions, different protein hits
3) Different positions, same protein hits
4) Different positions
Signature Relationships
1) Parent - Child (subgroup of more closely related proteins)
• Applies to domains and families
• Children: no proteins in common
[Figure: parent signature (*) Protein kinase with children Serine kinase and Tyrosine kinase; signatures from PFAM, SMART and PROSITE, with values of 100, 75 and 25 shown.]
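The parent-child idea can be sketched with sets of matched proteins. This is an illustrative simplification and not InterPro's curation rule: it assumes a child's hits form a strict subset of the parent's hits and that sibling children share no proteins. The function names and protein identifiers are made up.

```python
from itertools import combinations

def is_child(child_hits, parent_hits):
    """True if the child's protein hits are a strict subset of the parent's."""
    return child_hits < parent_hits

def children_disjoint(children):
    """True if no two child signatures have a protein in common."""
    return all(a.isdisjoint(b) for a, b in combinations(children.values(), 2))

# Hypothetical protein hits for each signature.
protein_kinase = {"P1", "P2", "P3", "P4"}
children = {
    "Serine kinase": {"P1", "P2", "P3"},
    "Tyrosine kinase": {"P4"},
}

all_children_valid = all(is_child(hits, protein_kinase) for hits in children.values())
no_overlap = children_disjoint(children)
```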
Signature Relationships
2) Contains - Found in (describes domain composition)
• Both families and domains can contain domains
[Figure: a receptor family (PFAM) contains an N-terminal domain and a C-terminal domain (SMART and PROSITE); the domains are found in the receptor family (PFAM).]
Structural Representation in InterPro
[Figure: InterPro sequence-structure comparison. Labels: PDB sequence; MSD residue-by-residue mapping; UniProt amino acid position.]
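A residue-by-residue mapping of this kind can be pictured as a lookup from a PDB residue number to a UniProt amino acid position. The sketch below is a hypothetical illustration only: it assumes the mapping is stored as contiguous aligned segments, and the segment values are invented rather than taken from MSD.

```python
def pdb_to_uniprot(pdb_residue, segments):
    """Map a PDB residue number to a UniProt position.

    `segments` is a list of (pdb_start, pdb_end, uniprot_start) tuples,
    one per contiguous aligned stretch (hypothetical layout).
    Returns None when the residue falls outside every segment.
    """
    for pdb_start, pdb_end, uniprot_start in segments:
        if pdb_start <= pdb_residue <= pdb_end:
            # Same offset from the start of the segment on both sides.
            return uniprot_start + (pdb_residue - pdb_start)
    return None

# Invented example: two aligned stretches with an unmapped gap between them.
segments = [(1, 50, 101), (60, 90, 151)]

first = pdb_to_uniprot(1, segments)
inside = pdb_to_uniprot(75, segments)
gap = pdb_to_uniprot(55, segments)
```

Keeping the mapping per segment rather than as a single offset allows for gaps and unobserved residues in the structure.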
Structural Representation
• PDB structures displayed as striped patterns
• Structural classification in CATH and SCOP
• Homology models from Swiss-Model and ModBase
Sequence-Structure Display
• Signatures predictive of protein annotation
• Structural data for specific proteins