1 / 34

Classification of protein and domain families

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Domain structures. Domain structure predictions. Classification of protein and domain families. Structure to function. Sequence to function. H. A. T. C. Fold Group (1100).

aviva
Download Presentation

Classification of protein and domain families

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences Domain structures Domain structure predictions Classification of protein and domain families Structure to function Sequence to function

  2. H A T C Fold Group (1100) 40,000 domain entries Homologous Superfamily (2100) Gene3D Sequence Family ~100,000 domains of known structure in CATH ~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated

  3. Gene3D:Domain structure annotations in genome sequences ~5 million protein sequences from 560 completed genomes and UniProt ~ 2 million domain sequences assigned to CATH superfamilies scan against library of HMM models and sequences for CATH Pfam NewFam superfamilies

  4. Gene3D (1) Cluster ~5 million sequences into protein superfamilies (2) Map domains onto the sequences using HMM technology (CATH & Pfam domains) >200,000 protein superfamilies ~10,000 domain superfamilies (2100 of known structure)

  5. Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP HMM prediction threading prediction

  6. Annotation levels for an average genome 100% many belonging to small species specific families many predicted to be transmembrane 50% predicted to belong to structural superfamilies using HMM or threading techniques 0

  7. 100 80 60 Percentage of domain sequences 40 20 0 0 1000 2000 3000 4000 5000 6000 Families ordered by size Target selection strategy for PSI-2 Adam Godzik JCSG, Andras Fiser – NYSGC, Burkhard Rost - NESG unknown structure (BIG -Pfam) known structure (CATH - MEGA)

  8. Correlation of sequence and structural variability of CATH families with the number of different functional groups Structural Diversity Population in genomes (x 1000)

  9. Structural diversity in the CATH Domain Superfamily P-loop hydrolases Cutinase Cocaine esterase Acetylcholinesterase

  10. Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences Domain structures Domain structure predictions Sequence to function

  11. Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC Levels) highly variable families Number of sequences Number of families Sequence identity threshold for 90% conservation

  12. N-Fold Increase in Functional Annotation for Sequences in Gene3D N-fold increase in coverage family specific thresholds general thresholds

  13. Gene3D Get an XML version of this page Link to UniProt Links to different levels in the Gene3D protein family Link to InterPro Links to GO Links to KEGG Links to CATH/Pfam “S” - indicates you can search the term against Gene3D Functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB

  14. Functional annotation of structures using EC, GO, KEGG, FunCat resources Non-PSI PDBs PSI PDBs 0 terms 1 term 2 terms 3 terms 4 terms

  15. Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins Tree Determinants - Valencia Evolutionary Trace - Lichtarge Funshift – Sonnhammer SCI-PHY – Sjolander

  16. Methods exploiting information on sequence conserved residue positions multiple sequence alignment of relatives from functional group 1 = highly conserved Structural model Score conservation for each position in the alignment using an entropy measure 0 = unconserved Putative functional site Scorecons –Thornton Protein Keys – Sander

  17. GEMMA: Compares sequence profiles (HMMs) between subfamilies sequence subfamily 80% seq. id) Superfamily of known structure (CATH) putative structure-function group clusters sequence relatives predicted to have similar structures/functions even at low levels of sequence identity

  18. GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark Purity (high is best) Edit distance (low) VI distance (low is best) Deviation from no. singletons (low)

  19. Coverage of superfamily (%) experimental annotations inherit functions at 60% seq. id. inherit functions by GEMMA Functional annotation coverage using different strategies

  20. Protein interactions and gene networks • Gene3D Biominer Methods • Phylotuner: Correlation of domain occurrence profiles • GOSS:Semantic Similarity calculation between protein pairs. • CODA: Domain fusion analysis. • HiPPI: homology inheritance of protein-protein physical interaction data. • GECO: Correlation of gene expression data

  21. Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences Domain structures Domain structure predictions Structure to function

  22. Methods for Assessing Structural Novelty CATHEDRAL – structure comparison Redfern et al. PLOS comp. biol. 2007

  23. Aminoacyl tRNA synthetases Gln-hydrolyzing synthases DNA-binding, stress-related Nucleotidyl-transferases Argininosuccinate lyases Structural clusters in the Aminoacyl tRNA synthetases – like family structure similarity score

  24. Galectin binding superfamily 2.60.120.200 1bkzA00 1dypA00

  25.  Aminoacyl tRNA synthetases – like  Identifying functional groups in domain superfamilies Deoxyribodi- pyrimidine photo-lyases Electron transfer flavoprotein Nucleotidylyl- transferases AA tRNA synthetase, Class I 1n3lA01 1dnpA00 1ej2A00 1o97D01

  26. Exploiting 3D Templates to Represent Functional Relatives JESS – Thornton GASP - Babbitt SPASM – Kleywegt PINTS – Russell DRESPAT - Sarawagi pvSOAR – Joachimiak

  27. SITESEER: Match 3-residue templates and assess relevance of hits by looking at residues within the local environment Laskowski and Thornton green and purple – identical residues; orange and white – similar residues

  28. FLORA:3D templates for functional groups From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.

  29. FLORA:3D templates for functional groups localFLORA globalFLORA multiple sites single site

  30. FLORA:Performance in recognising functionally related homologues Benchmark of 36 diverse enzyme groups (from 12 families)

  31. Performance of FLORA Benchmarked on 36 large enzyme families

  32. FLORA: 3D Templates for Structure-Function Groups in Domain Families 1q77A00 Unknown function MCSG 1o97D01 Electron transfer flavoprotein 1dnpA01 Deoxyribo- dipyrimidine photo-lyases 1n3lA01 AA tRNA synthetases 1ej2A00 Nucleotidylyl- transferases

  33. Fold and structural motifs n-residue templates Sequence scans Sequence search vs PDB Enzyme active sites SSM fold search Sequence search vs Uniprot Ligand binding sites Surface clefts Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc) Residue conservation DNA binding sites DNA-binding HTH motifs Superfamily HMM library Reverse templates Nest analysis Gene neighbours http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/

  34. Function Prediction for Proteins of ‘Putative’ or Unknown Function * Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’ structural data provides relatively more information for proteins about which there is less knowledge these predictions need to be experimentally validated

More Related