Exploiting Structural and Comparative Genomics to Reveal Protein Functions
• Predicting domain structure families and their domain contexts
• Exploring how structural divergence in domain families correlates with functional change
• Predicting domain relatives likely to have significantly different structures and functions
CATH: Domain families of known structure
Gene3D: Protein families and domain annotations for completed genomes
Congratulations Swiss-Prot - 20 Years!! Thanks to Amos, Rolf and the Swiss-Prot Team!!!!
CATH hierarchy (Orengo and Thornton, 1994), 86,000 domains
• C: Class (3)
• A: Architecture (36)
• T: Topology or Fold (1100)
• H: Homologous superfamily (2100)
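The four levels of the hierarchy correspond to the four dot-separated numbers in a CATH node identifier such as 3.40.50.720. The sketch below splits such an identifier into its levels and reports the deepest level two domains share. The names `CathCode`, `parse_cath_code` and `deepest_shared_level` are illustrative only and are not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    cls: int           # C: class
    architecture: int  # A: architecture
    topology: int      # T: topology (fold)
    superfamily: int   # H: homologous superfamily


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted identifier 'C.A.T.H' into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> str:
    """Return 'C', 'A', 'T' or 'H' for the deepest shared level, '' if none."""
    shared = ""
    for level, x, y in zip("CATH", a, b):
        if x != y:
            break
        shared = level
    return shared
```

Two domains whose codes agree down to the third number share a fold but belong to different homologous superfamilies.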
Gene3D: Domain annotations in genome sequences
• >2 million protein sequences from 300 completed genomes and UniProt
• Scan against library of HMM models: ~2100 CATH, ~8300 Pfam
• Assign domains to CATH and Pfam superfamilies
• Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs
Gene3D: Domain annotations in genome sequences
• DomainFinder: structural domains from CATH take precedence
• UniProt sequence (N to C terminus) with assigned domains: CATH-1, Pfam-1, Pfam-2, NewFam
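One simple way to picture the precedence rule is a greedy pass over the HMM hits on a sequence: CATH hits are considered first, then Pfam hits, and a hit is kept only if it does not overlap a domain already assigned. This is a sketch under that assumption, not the DomainFinder algorithm itself; the `Hit` fields, the E-value tie-break and the strict no-overlap rule are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    family: str
    source: str    # "CATH" or "Pfam"
    start: int     # 1-based, inclusive
    end: int       # inclusive
    evalue: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(hits: List[Hit]) -> List[Hit]:
    """Greedy assignment: CATH before Pfam, then best E-value first."""
    priority = {"CATH": 0, "Pfam": 1}  # assumed precedence order
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (priority[h.source], h.evalue)):
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)
```

Regions left uncovered after this pass would be the candidates for NewFam clustering.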
Domain families ranked by size (number of domain sequences)
Figure: percentage of all domain family sequences in UniProt (y-axis) against rank by family size (x-axis). Series: CATH superfamilies of known structure; Pfam families of unknown structure; NewFam of unknown structure (>50,000 families).
• >90% of domain sequences in UniProt can be assigned to ~7000 domain families
• The 100 largest families of known structure account for 30% of domain sequences in UniProt
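Both statements are readings of a cumulative coverage curve over families ranked by size. A minimal sketch of that calculation, assuming only a list of per-family sequence counts (the function names and the toy counts in the usage note are illustrative, not Gene3D data):

```python
from typing import Sequence


def families_needed(sizes: Sequence[int], target_fraction: float) -> int:
    """Smallest number of largest families that covers the target fraction."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("no sequences to cover")
    covered = 0
    for rank, size in enumerate(sorted(sizes, reverse=True), start=1):
        covered += size
        if covered >= target_fraction * total:
            return rank
    return len(sizes)


def coverage_of_top(sizes: Sequence[int], n: int) -> float:
    """Fraction of all sequences that fall in the n largest families."""
    ranked = sorted(sizes, reverse=True)
    return sum(ranked[:n]) / sum(ranked)
```

For toy counts `[50, 30, 10, 5, 5]`, the two largest families cover 80% of sequences and three families are needed to pass 85%.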
Correlation of sequence and structural variability of CATH families with the number of different functional groups
Figure axes: structural diversity; population in genomes.
Some superfamilies show great structural diversity
Gabrielle Reeves, J. Mol. Biol. (2006)
• Multiple structural alignment by CORA allows identification of consensus secondary structures and secondary structure embellishments (2DSEC algorithm)
• In 117 superfamilies, relatives have expanded by 2-fold or more
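The idea of separating a consensus core from embellishments can be illustrated on aligned secondary-structure strings. The sketch below is not the 2DSEC algorithm; it assumes each relative is given as a string over H (helix), E (strand) and - (gap or coil) taken from a multiple structural alignment, and it assumes a 75% occupancy threshold for calling a column part of the consensus.

```python
from typing import List


def consensus_columns(aligned_ss: List[str], min_fraction: float = 0.75) -> List[bool]:
    """Mark columns where at least min_fraction of relatives are in a helix or strand."""
    if not aligned_ss:
        raise ValueError("need at least one aligned string")
    length = len(aligned_ss[0])
    if any(len(s) != length for s in aligned_ss):
        raise ValueError("aligned strings must have equal length")
    n = len(aligned_ss)
    return [
        sum(1 for s in aligned_ss if s[i] in "HE") / n >= min_fraction
        for i in range(length)
    ]


def count_embellishments(ss: str, consensus: List[bool]) -> int:
    """Count contiguous runs of helix/strand residues lying outside the consensus."""
    count = 0
    in_run = False
    for state, is_consensus in zip(ss, consensus):
        extra = state in "HE" and not is_consensus
        if extra and not in_run:
            count += 1
        in_run = extra
    return count
```

Note that with this simple rule an extension of a consensus helix beyond the consensus columns is also counted as an embellishment.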
Structural embellishments can modify the active site Galectin binding superfamily
Structural embellishments can modulate domain interactions
• Glucose 6-phosphate dehydrogenase (side orientation)
• Dihydrodipicolinate reductase (face orientation)
Additional secondary structures shown at (a) are involved in subunit interactions
Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
ATP Grasp superfamily: biotin carboxylase (shown as a dimer) and D-alanine-D-alanine ligase
Secondary structure insertions are distributed along the chain but aggregate in 3D
Figure: frequency (%) against size of insertion (number of secondary structures, 1 to 12). Bars with an indel frequency below 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06% and 0.02%.
• 85% of insertions comprise only 1 or 2 secondary structures
• For ~70% of domains analysed, 80% of the secondary structure embellishments are co-located in 3D with 3 or more other embellishments
• In 80% of domains, 1 or more embellishments contact other domains or subunits
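The co-location statistic can be sketched as a neighbour count over embellishment positions. The slides do not say how co-location was measured, so this sketch assumes each embellishment is reduced to a centroid in Ångström coordinates and that two embellishments are co-located when their centroids lie within a cutoff (10 Å here, an arbitrary illustrative value).

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def colocated_fraction(
    centroids: Sequence[Point],
    cutoff: float = 10.0,      # assumed distance cutoff in Angstroms
    min_neighbours: int = 3,   # "3 or more other embellishments"
) -> float:
    """Fraction of embellishments with at least min_neighbours others within cutoff."""
    if not centroids:
        return 0.0
    hits = 0
    for i, a in enumerate(centroids):
        neighbours = sum(
            1
            for j, b in enumerate(centroids)
            if i != j and math.dist(a, b) <= cutoff
        )
        if neighbours >= min_neighbours:
            hits += 1
    return hits / len(centroids)
```

A domain would count towards the ~70% figure if this fraction reaches 0.8 for its embellishments.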
Many structurally diverse superfamilies adopt folds with these regular layered architectures:
• 3 Layer Alpha/Beta Sandwich
• 2 Layer Alpha/Beta
• Alpha/Beta Barrel
• 2 Layer Beta Sandwich
GEMMA – GEne Model and Model Annotation
Algorithm for Predicting Sequence Homologues with Similar Structures and Functions
• Each structural superfamily is divided into subfamilies of close sequence relatives predicted to have similar functions (>=60% sequence identity)
• The largest 100 CATH families have more than 20,000 subfamilies
GEMMA – Predicting Functional Groups in CATH Superfamilies
Each structural superfamily contains subfamilies of close relatives predicted to have similar function (>60% identity).
• Build multiple sequence alignments for each subfamily
• Cluster subfamilies predicted to have similar functions into functional groups
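The clustering step can be pictured as merging any two subfamilies whose profile comparison scores above a threshold. The slides do not give GEMMA's scoring or merging rule, so the sketch below assumes a precomputed pairwise similarity score in [0, 1] and plain single-linkage merging via union-find; the subfamily names and scores in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def merge_subfamilies(
    names: Sequence[str],
    scores: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Single-linkage grouping: link two subfamilies if their score >= threshold."""
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

With scores `{("sf1", "sf15"): 0.8, ("sf1", "sf22"): 0.4, ("sf15", "sf22"): 0.3}` and a threshold of 0.6, `sf1` and `sf15` form one functional group and `sf22` stays on its own.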
ATP Grasp family: 192 subfamilies
Subfamilies compared in the figure:
• Pyruvate phosphate dikinase (subfamily 1)
• Pyruvate phosphate dikinase (subfamily 15)
• Succinyl-CoA synthetase (subfamily 22)
Pairwise scores shown in the figure:
• SSAP score = 68.69, PSS score = 0.375
• SSAP score = 93.01, PSS score = 0.827
• SSAP score = 68.32, PSS score = 0.333
Subfamily profiles coloured by residue conservation (red = high, blue = low), scored with Scorecons (Valdar and Thornton, Profunc)
Profiles aligned using profile-profile comparison (MAFFT)
Pyruvate phosphate dikinase vs Pyruvate phosphate dikinase: many fully conserved positions (6/7 positions are fully conserved). Equivalent functions.
Subfamily profiles coloured by residue conservation (red = high, blue = low), scored with Scorecons (Valdar and Thornton, Profunc)
Profiles aligned using profile-profile comparison (MAFFT)
Succinyl-CoA synthetase vs Pyruvate phosphate dikinase. Figure labels: "Fully conserved positions"; "No fully conserved positions". Different functions.
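The comparison on these two slides comes down to asking which alignment columns are fully conserved in both subfamilies. A crude stand-in for that check is sketched below; it is not Scorecons, which produces graded conservation scores. It assumes the two subfamily alignments have already been placed in a common column numbering by the profile-profile alignment, and it treats a column as fully conserved only when every sequence carries the same non-gap residue.

```python
from typing import Dict, List


def fully_conserved(alignment: List[str]) -> Dict[int, str]:
    """Map column index to residue for columns identical across all sequences."""
    conserved: Dict[int, str] = {}
    for i, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved[i] = column[0]
    return conserved


def shared_conserved(aln_a: List[str], aln_b: List[str]) -> List[int]:
    """Columns fully conserved in both alignments with the same residue."""
    a = fully_conserved(aln_a)
    b = fully_conserved(aln_b)
    return sorted(i for i in a if b.get(i) == a[i])
```

Many shared columns would point towards equivalent functions, and none towards different functions, in the spirit of the two examples above.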
Performance in Merging Subfamilies into Functional Groups
Figure axes: number of functional groups predicted; error rate.
10 experimentally identified enzyme functions in this family
GEMMA – Predicting Functional Groups in CATH Superfamilies
• Subfamilies of close relatives predicted to have similar function (>60% identity) are merged into functional groups within each structural superfamily
• Benchmarked on 12 large enzyme families in CATH
• 6-10 fold reduction in the number of functional subfamilies
Summary
• More than half the domains in UniProt can be assigned to families of known structure
• Analysis of some very large structural families revealed how secondary structure insertions can modulate functions
• Functional groups can be identified in diverse families by comparing multiple features (e.g. residue conservation, predicted secondary structure)
CATH and Gene3D: Lesley Greene, Stathis Sidderis, Russell Marsden, Ian Sillitoe, Sarah Addou, Juan Ranea, Tony Lewis, Dave Lee, Ollie Redfern, Alison Cuff, Mark Dibley, Ilhem Diboun, Adam Reid, Corin Yeats, Tim Dallman
http://www.biochem.ucl.ac.uk/bsm/cath_new
Funding: MRC, Wellcome Trust, NIH, EU (Biosapiens, Embrace, Enfin), BBSRC