domain database

The CATH domain database and associated resources - DHS, Gene3D. How do we determine domain boundaries? How do we identify fold groups and evolutionary superfamilies? What is the distribution of the CATH domain families in the PDB and in the genomes?

Presentation Transcript


  1. The CATH domain database and associated resources - DHS, Gene3D. How do we determine domain boundaries? How do we identify fold groups and evolutionary superfamilies? What is the distribution of the CATH domain families in the PDB and in the genomes? CATH domain database (Orengo & Thornton 1994): Class, Architecture, Topology or Fold Group, Homologous Superfamily

  2. Multidomain proteins: ~20,000 chains from Protein Databank (PDB); ~50,000 domains in CATH structure database; ~40% of the entries in CATH are multidomain

  3. Domains are important evolutionary units: analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

  4. Carboxypeptidase A (2ctc); Carboxypeptidase G2 (1cg2A). ~30% of multidomains in CATH are discontinuous

  5. Algorithms for Recognising Domain Boundaries. DETECTIVE (Swindells 1995): each domain should have a recognisable hydrophobic core. DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones. PUU (Holm & Sander, 1994): parser for protein folding units, with maximal interaction within domains and minimal interaction between domains. Consensus is sought between the three methods; on average this occurs about 20% of the time
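
The contact criterion quoted for DOMAK on this slide can be illustrated with a small sketch. This is not the DOMAK implementation: the function names, the contact list format and the rule of picking the cut with the fewest external contacts are all assumptions made for illustration, and only continuous two-domain cuts are tried (the slides note that many domains are discontinuous).

```python
def count_contacts(contacts, cut):
    """Count contacts for the split [0, cut) versus [cut, n).

    contacts is a list of (i, j) residue index pairs that are in contact
    (a hypothetical input format, not taken from the slides).
    """
    internal_a = internal_b = external = 0
    for i, j in contacts:
        if i < cut and j < cut:
            internal_a += 1
        elif i >= cut and j >= cut:
            internal_b += 1
        else:
            external += 1
    return internal_a, internal_b, external


def best_continuous_cut(n_residues, contacts):
    """Return the cut with the fewest external contacts, or None.

    A cut is only considered if both halves make more internal
    contacts than external ones (the criterion quoted on the slide).
    """
    best = None
    for cut in range(1, n_residues):
        a, b, ext = count_contacts(contacts, cut)
        if a > ext and b > ext and (best is None or ext < best[1]):
            best = (cut, ext)
    return None if best is None else best[0]


# Toy chain of 10 residues: two compact halves joined by one contact.
toy_contacts = [
    (0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (0, 4), (1, 3), (0, 3), (1, 4),
    (5, 6), (6, 7), (7, 8), (8, 9), (5, 7), (5, 9), (6, 8), (5, 8), (6, 9),
    (4, 5),
]
print(best_continuous_cut(10, toy_contacts))
```

In the toy chain the two halves are joined by a single contact, so the cut between residues 4 and 5 is the one selected.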

  6. 74% Close homologues 29% 21% Twilight zone 4% Midnight zone 11% Homologues/analogues

  7. Algorithms for Recognising Homologues. Sequence Based Methods: close homologues - BLAST (Altschul et al.), SSEARCH (Smith & Waterman); remote homologues - SAM-T99 (Karplus et al.). Structure Based Methods: close & remote homologues - CATHEDRAL (Harrison, Thornton & Orengo), SSAP (Taylor & Orengo), CORA (Orengo)

  8. 74% Close homologues SSEARCH 29% 21% Twilight zone HMMs, SSAP 4% Midnight zone CATHEDRAL, SSAP 11% Homologues/analogues CATHEDRAL, SSAP

  9. Hidden Markov Models (HMMs): SAM-T99 (Karplus Group), SAMOSA (Orengo Group). [Diagram: query sequence scanned against the non-redundant GenBank database to give hits.] These methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)

  10. Percentage of PDB structures classified in CATH by different methods over the last 2 years. [Pie chart segments: near-identical (SSEARCH); close homologues (>30%) (SSEARCH); remote homologues (<30%) (HMMs); remote homologues (SSAP), 8.6; analogues (SSAP), 1.9; novel folds. Chart values: 2.0, 1.9, 8.6, 7.6, 20.7, 59.2]

  11. Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years. [Pie chart segments: near-identical (SSEARCH); novel folds; analogues (SSAP); close homologues (>30%) (SSEARCH); remote homologues (SSAP); remote homologues (<30%) (HMMs). Chart values: 7.7, 11.8, 8.0, 22.0, 28.4, 22.0]

  12. Structure Based Algorithms for Recognising Homologues. CATHEDRAL: pairwise alignment, secondary structure comparison. SSAP: pairwise alignment, residue comparison. CORA: multiple alignment, residue comparison

  13. 74% Close homologues SSEARCH 29% 21% Twilight zone HMMs 4% Midnight zone CATHEDRAL, SSAP 11% Homologues/analogues CATHEDRAL, SSAP

  14. Structure is much more highly conserved than sequence. Heat labile enterotoxin compared with cholera toxin: structure similarity (SSAP) score 97, sequence identity 79%. Heat labile enterotoxin compared with pertussis toxin: structure similarity (SSAP) score 81, sequence identity 12%

  15. Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families. [Plot: structure similarity (SSAP) score against sequence identity (%), for pairs with the same function and with different function]

  16. Residue insertions in the loops connecting secondary structures • Shifts in the orientations of secondary structures

  17. Structural variation in the P-loop Hydrolase Superfamily

  18. Structural variation in the Galectin Binding Superfamily

  19. Fast Structure Comparison Method (CATHEDRAL), Andrew Harrison et al., JMB, 2002: ignore the variable loop regions and only compare the secondary structures; derive vectors through secondary structure elements; compare closest approach distances and vector orientations using graph theory

  20. Vectors a and b through two secondary structures, with closest approach distance d: $a \cdot b = |a||b| \cos\theta$, plus dihedral angle and chirality
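
As a worked example of the dot-product relation on this slide, the angle between two secondary structure vectors can be recovered from a . b = |a||b| cos(theta). The function name and the use of plain 3-tuples for vectors are assumptions for illustration; the dihedral angle and chirality terms are not computed here.

```python
import math


def angle_between(a, b):
    """Angle in degrees between 3D vectors a and b, from a.b = |a||b|cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


# Two perpendicular axis vectors, e.g. a helix lying across a strand.
print(angle_between((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```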

  21. CATHEDRAL: CATH's Existing Domain Recognition ALgorithm. Compares graphs of proteins: each node is a secondary structure (e.g. a helix, H) and each edge is labelled with the distance d, two angles and the chirality

  22. Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif. [Figure: two protein graphs and their overlap graph; here the overlap graph has a structural motif of 3 secondary structures]

  23. Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph. In this example the common graph contains 5 nodes. 1000 times faster than residue based methods (e.g. SSAP)
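
A minimal sketch of the graph comparison step, assuming a simplified setup: secondary structures are matched by type, and an edge in the correspondence graph means two matched pairs have similar inter-element distances. The slides describe edges that also carry angles and chirality, which are omitted here. The function names, the tolerance and the toy data are hypothetical. The clique search is the basic Bron-Kerbosch recursion without pivoting.

```python
from itertools import combinations


def correspondence_graph(sse1, dist1, sse2, dist2, tol=1.0):
    """Nodes pair same-type elements; edges join pairs with matching distances."""
    nodes = [(i, j) for i, t1 in enumerate(sse1)
             for j, t2 in enumerate(sse2) if t1 == t2]
    adjacency = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l and abs(dist1[i][k] - dist2[j][l]) <= tol:
            adjacency[(i, j)].add((k, l))
            adjacency[(k, l)].add((i, j))
    return adjacency


def bron_kerbosch(adjacency):
    """Return all maximal cliques of a graph given as {node: set(neighbours)}."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), set(adjacency), set())
    return cliques


def largest_common_motif(adjacency):
    """Largest clique = largest set of mutually consistent element pairs."""
    return max(bron_kerbosch(adjacency), key=len)


# Toy proteins: element types (H = helix, E = strand) and distance matrices.
sse_a = ["H", "E", "H"]
dist_a = [[0, 10, 15], [10, 0, 8], [15, 8, 0]]
sse_b = ["H", "E", "H", "E"]
dist_b = [[0, 10.5, 15.5, 30], [10.5, 0, 8.2, 25],
          [15.5, 8.2, 0, 20], [30, 25, 20, 0]]

graph = correspondence_graph(sse_a, dist_a, sse_b, dist_b)
print(sorted(largest_common_motif(graph)))
```

In the toy data the first three elements of the second protein reproduce the geometry of the first protein, so the largest clique pairs element 0 with 0, 1 with 1 and 2 with 2.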

  24. Performance

  25. Statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures. $\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein1}} \cdot \text{size}_{\text{protein2}})^{1/2}}$

  26. Statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures. $\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein1}} \cdot \text{size}_{\text{protein2}})^{1/2}}$
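
A short worked example of the size normalisation, reading the slide's score as the common graph size divided by the square root of the product of the two protein graph sizes. That reading of the flattened formula, and the function name, are assumptions.

```python
import math


def normalised_score(common_size, size1, size2):
    """Common graph size scaled by the geometric mean of the two graph sizes."""
    return common_size / math.sqrt(size1 * size2)


# A 3-element common motif between graphs of 4 and 9 secondary structures.
print(normalised_score(3, 4, 9))
```

Under this reading, two identical graphs score 1, and a small motif shared by two large proteins scores much lower than the same motif shared by two small ones.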

  27. Scores for unrelated structures exhibit an extreme value distribution: $F = A e^{-b \cdot \text{score}}$, so $\log F = \log A - b \cdot \text{score}$. This allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
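
The log-linear relation on this slide suggests a simple way to estimate A and b: fit a straight line to log F against score for a background set of unrelated comparisons. The sketch below takes F(s) to be the fraction of background scores at or above s, which is an assumption, as are the function names and the least-squares fitting procedure; it is not presented as the procedure used for CATHEDRAL. Background scores are assumed to be distinct.

```python
import math


def fit_log_linear_tail(background_scores):
    """Least-squares fit of log F = log A - b * score; returns (A, b)."""
    ordered = sorted(background_scores)
    n = len(ordered)
    xs = ordered
    # Fraction of background scores greater than or equal to each score.
    ys = [math.log((n - i) / n) for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def p_value(score, a, b):
    """Probability of an unrelated pair reaching this score, capped at 1."""
    return min(1.0, a * math.exp(-b * score))


def e_value(score, a, b, library_size):
    """Expected number of chance hits at this score in a library scan."""
    return p_value(score, a, b) * library_size


# Synthetic background constructed so that F = exp(-2 * score) holds exactly.
n_background = 50
background = [-math.log((n_background - i) / n_background) / 2.0
              for i in range(n_background)]
A, b = fit_log_linear_tail(background)
print(round(A, 3), round(b, 3))
```

The E-value here is simply the P-value multiplied by the number of library structures scanned.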

  28. Using CATHEDRAL to Identify Domain Boundaries. Graph based secondary structure comparison is very fast: 1000 times faster than residue based methods. New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches. 85-90% of domains in new multi-domain structures have relatives in CATH

  29. CATHEDRAL on a multi-domain structure: secondary structure match by graph, followed by SSAP residue alignment. [Figure: residues in the new multi-domain structure aligned against residues in CATH domain family 1 (Fold A) and residues in CATH domain family 2 (Fold B)]

  30. SSAP (Taylor & Orengo, J. Mol. Biol. 1989): residue based structure comparison method using dynamic programming. Scores range from 0-100. [Figure: residues in protein A against residues in protein B]

  31. CATHEDRAL One third of known multi-domain structures are discontinuous

  32. Reasons for Structural Similarity. Divergence: similarity arises due to divergent evolution from a common ancestor; structure is much more highly conserved than sequence. Convergence: similarity due to there being a limited number of ways of packing helices and strands in 3D space

  33. CATH domain structure database (Orengo & Thornton 1994): Class, Architecture, Topology or Fold Group, Homologous Superfamily. ~50,000 domains in PDB; ~1500 domain superfamilies in CATH

  34. CATH domain database: Class (3); Architecture (~36); Topology or Fold (~810); ~50,000 domains

  35. CATH: Topology or Fold Group (~810); Homologous Superfamily (Domain Family) (~1500); Sequence Family (35%, 60%, 95%). 40,000 domain entries; ~50,000 domain entries

  36. DHS Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/dhs Description of structural and functional characteristics for each superfamily

  37. DHS Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/dhs Description of structural and functional characteristics for each superfamily

  38. Variation in Secondary Structures Across Superfamily DHS:Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/dhs

  39. Functional annotations from GO, EC, COGs, KEGG DHS:Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/dhs

  40. Multiple structure alignments with conserved residues highlighted DHS:Dictionary of Homologous superfamilies http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

  41. Population of CATH Families and Structural Groups. ~50,000 structural domains. S: ~4000 sequence families (35%), cluster proteins with similar sequences. H: ~1,500 homologous superfamilies, cluster proteins with similar structures and functions. T: ~810 fold groups, cluster proteins with similar structures. A: ~36 architectures. C: 3 major protein classes

  42. CATH: nearly one third of the superfamilies belong to <10 fold groups. [Figure labels: Arc repressor-like, Up-down, Rossmann, SH3-like, OB fold, Immunoglobulin, Jelly Roll, Alpha-beta plait, TIM barrel]

  43. CATH numbering scheme, e.g. 2.40.50.100. Class: 2 (Mainly beta). Architecture: 40 (Barrel). Topology: 50 (OB Fold). Homology: 100 (Heat labile enterotoxin superfamily)
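
The four-part numbering on this slide maps naturally onto a small record type. The sketch below parses a code such as 2.40.50.100 into its Class, Architecture, Topology and Homologous superfamily numbers. The class and function names are hypothetical and are not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One number per level of the C.A.T.H hierarchy."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

    def prefix(self, depth: int) -> str:
        """Dotted code truncated to the first `depth` levels (1 to 4)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        return ".".join(str(n) for n in self[:depth])


def parse_cath_code(code: str) -> CathCode:
    """Parse a code such as '2.40.50.100' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


enterotoxin = parse_cath_code("2.40.50.100")
print(enterotoxin.prefix(3))  # fold group level of the slide's example
```

Truncating to three levels gives the fold group (2.40.50, the OB fold in the slide's example), so two domains share a fold when their three-level prefixes match.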

  44. CATH http://www.biochem.ucl.ac.uk/bsm/cath CATH domain structure database

  45. CATH http://www.biochem.ucl.ac.uk/bsm/cath CATH class level

  46. CATH http://www.biochem.ucl.ac.uk/bsm/cath CATH architecture level

  47. CATH http://www.biochem.ucl.ac.uk/bsm/cath CATH Topology or fold group level

  48. CATH http://www.biochem.ucl.ac.uk/bsm/cath CATH homologous superfamilies in each fold group
