The evolution of domain superfamilies from a structural and functional perspective. Oliver Redfern, CATH-GENE3D group, Dept. Structural and Molecular Biology, University College London, UK
The CATH and Gene3D Domain and protein family resources (sequence, structure, function) • Classifying domain superfamilies • The impact of structural divergence on function • Predicting protein function from structure • [Figure labels: Domain structures; Domain structure predictions; Homologous Superfamily; Function 1; Function 2]
Why domains? • Unit of evolution • ~2000 domain superfamilies (have we found them all?) • 10,000s of different domain combinations (37,000 already) • Domain-based function annotation can allow functional predictions for novel domain combinations
Other domain databases • Domain structures grouped by superfamily; links to sequences through Gene3D • Domain structures grouped by superfamily; links to sequences through SUPERFAMILY • Domain sequences grouped into families • Integration of domain families from Pfam, SCOP, CATH etc. for sequence databases
The domain structure database I • Class • Architecture • Topology • Homologous Superfamily • e.g. 2.40.50.100 (toxin) superfamily
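The four-level code on this slide can be unpacked programmatically. The following is a minimal sketch, assuming only that a CATH code is four dot-separated integers in the order Class, Architecture, Topology, Homologous Superfamily; the names `CathCode`, `parse_cath_code` and `deepest_shared_level` are illustrative and not part of any CATH software.

```python
from typing import NamedTuple, Optional


class CathCode(NamedTuple):
    """One node in the C.A.T.H hierarchy (illustrative container)."""
    class_: int
    architecture: int
    topology: int
    superfamily: int


LEVELS = ["Class", "Architecture", "Topology", "Homologous Superfamily"]


def parse_cath_code(code: str) -> CathCode:
    """Split a code such as '2.40.50.100' into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Return the name of the deepest level at which two codes still agree."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared
```

Two domains whose codes differ only in the last number share a topology (fold) but are not classified as homologues; `deepest_shared_level` makes that comparison explicit.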
The CATH pipeline: Flow Chart • PDB • Split PDB into chains • Split chain into CATH domains (DomChop) • Assign domain to superfamily (HomCheck)
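The first step of the flow chart (splitting a PDB entry into chains) can be sketched as below. This is an illustration only, not the CATH pipeline code: it relies on the fixed-column PDB format, in which the chain identifier sits in column 22 of ATOM/HETATM records. The domain chopping and homology checking steps are not modelled, and `toy_atom_line` is a hypothetical helper that builds truncated records for the example.

```python
from collections import defaultdict


def split_into_chains(pdb_text: str) -> dict[str, list[str]]:
    """Group ATOM/HETATM records by chain identifier (column 22, index 21)."""
    chains: dict[str, list[str]] = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    return dict(chains)


def toy_atom_line(serial: int, chain: str, res_seq: int) -> str:
    """Build a truncated C-alpha ATOM record (no coordinates), for illustration."""
    return f"ATOM  {serial:>5}  CA  ALA {chain}{res_seq:>4}"


toy_entry = "\n".join(
    [toy_atom_line(1, "A", 1), toy_atom_line(2, "A", 2), toy_atom_line(3, "B", 1)]
)
chains = split_into_chains(toy_entry)
```

Each value in `chains` is the list of records for one chain, which a downstream step could then chop into domains.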
The domain structure database II • ~114,000 domain structures • ~2200 superfamilies
How do we define a “domain”? • Unit of evolution • Hydrophobic core • Compact unit, with few contacts with other domains
Multi-domain proteins • ~40% of structures in the PDB comprise more than one domain (i.e. multi-domain) • ~60-80% of genes are thought to code for multi-domain proteins
Algorithms for recognising domain boundaries • DETECTIVE (Swindells, 1995): each domain should have a recognisable hydrophobic core. • DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones. • PUU (Holm & Sander, 1994): parser for protein folding units; maximal interaction within domains and minimal interaction between domains. • CATHEDRAL (Redfern and Orengo, 2007): structure comparison algorithm which uses alignment to known structural domains.
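The contact-based idea shared by DOMAK and PUU (many contacts within a domain, few between domains) can be illustrated with a toy scorer. This is a simplified sketch, not an implementation of either published method: it assumes one coordinate per residue, a single boundary along the chain, an 8 Å contact cutoff, and a score of internal contacts divided by (external contacts + 1). All function names and parameter values are assumptions made for the example.

```python
import math
from itertools import combinations

Coord = tuple[float, float, float]


def internal_contacts(coords: list[Coord], idx: range, cutoff: float) -> int:
    """Count residue pairs inside one segment that lie within the cutoff."""
    return sum(
        1 for i, j in combinations(idx, 2)
        if math.dist(coords[i], coords[j]) <= cutoff
    )


def external_contacts(coords: list[Coord], a: range, b: range, cutoff: float) -> int:
    """Count residue pairs spanning the two segments that lie within the cutoff."""
    return sum(
        1 for i in a for j in b
        if math.dist(coords[i], coords[j]) <= cutoff
    )


def split_score(coords: list[Coord], boundary: int, cutoff: float = 8.0) -> float:
    """Score a two-domain split at `boundary`: higher means a cleaner split."""
    a, b = range(0, boundary), range(boundary, len(coords))
    internal = internal_contacts(coords, a, cutoff) + internal_contacts(coords, b, cutoff)
    external = external_contacts(coords, a, b, cutoff)
    return internal / (external + 1)  # +1 avoids division by zero


def best_boundary(coords: list[Coord], min_len: int = 3, cutoff: float = 8.0) -> int:
    """Try every boundary that leaves at least `min_len` residues on each side."""
    candidates = range(min_len, len(coords) - min_len + 1)
    if not candidates:
        raise ValueError("chain too short for the requested minimum domain length")
    return max(candidates, key=lambda b: split_score(coords, b, cutoff))
```

Real domains are often discontinuous in sequence, which a single-boundary scan cannot represent; the sketch only shows how the internal versus external contact criterion turns into a number.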
How do we define a “superfamily”? • Related through a common ancestor • Evidence from sequence, structural, and/or functional similarity • Example with <15% sequence identity: 1dnpA01 (Deoxyribo-dipyrimidine photo-lyases) and 1o97D01 (Electron transfer flavoprotein)
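The "<15% sequence identity" figure refers to a pairwise alignment. As a rough sketch of how such a number is computed, the function below takes two already-aligned sequences and reports identity over columns where neither has a gap. The choice of denominator is an assumption; other conventions (shorter sequence length, full alignment length) give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns without a gap in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)
```

At identities this low, sequence comparison alone rarely detects the relationship, which is why structural evidence is used to group such domains into one superfamily.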
Detecting homology using structure: CATHEDRAL • CATH's Existing Domain Recognition Algorithm • Rapid graph theory secondary structure filter • Double dynamic programming for accurate residue alignment • Redfern et al. PLOS Comp. Biol. (2007)
CATHEDRAL vs. other structure comparison methods • CATHEDRAL method for structural comparison • [Plot: coverage vs. error] • Redfern et al. PLOS Comp. Biol. 2007
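A coverage versus error curve of the kind plotted on this slide can be built from a ranked list of hits. The sketch below assumes each hit is a `(score, is_true_homologue)` pair with higher scores being better, defines coverage as true positives over all true relationships, and defines error as the fraction of accepted hits that are false. These definitions are assumptions for illustration; the benchmark in the cited paper may use different ones.

```python
def coverage_error_curve(
    hits: list[tuple[float, bool]], total_true: int
) -> list[tuple[float, float, float]]:
    """Return (score threshold, coverage, error) after accepting each ranked hit."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    true_pos = false_pos = 0
    curve = []
    for score, is_homologue in ranked:
        if is_homologue:
            true_pos += 1
        else:
            false_pos += 1
        coverage = true_pos / total_true
        error = false_pos / (true_pos + false_pos)
        curve.append((score, coverage, error))
    return curve


# Toy data: four hits, and four true relationships exist in total.
toy_hits = [(9.0, True), (7.5, True), (6.0, False), (4.0, True)]
toy_curve = coverage_error_curve(toy_hits, total_true=4)
```

A method is better when its curve reaches higher coverage before the error starts to rise.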
Advantages of other popular structure comparison methods • Combinatorial Extension (CE): fast; linked to PDB • Dali: accurate, “industry standard” • FatCat: allows for flexible alignment • Vast/MSDFold: secondary structure based; fast and linked directly to PDB/MSD
Sequence-based homology recognition methods • PSI-BLAST • HMM scans (HMMer, SAM-T, PRC) • Needleman-Wunsch • Pfam scan
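Of the methods listed, Needleman-Wunsch is compact enough to show in full. The sketch below is a textbook global alignment with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, whereas real protein alignments would use a substitution matrix such as BLOSUM62 and usually affine gaps.

```python
def needleman_wunsch(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> tuple[int, str, str]:
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Profile-based methods such as PSI-BLAST and HMM scans extend this pairwise idea by scoring a query against position-specific models of a whole family, which is what makes them more sensitive for remote homologues.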
Expanding CATH with sequence relatives from genomes • Library of HMMs built for representative sequences from each CATH domain superfamily • Protein sequences from genomes are scanned against the CATH HMM library to assign domains to CATH superfamilies • Up to 60% of sequences in completed genomes can be assigned to CATH domain superfamilies
Are all superfamilies equally populated? • [Plots: CATH domain structures in the PDB; CATH domain sequences in the genomes] • The largest 100 superfamilies account for more than half the sequences of known structure in the genomes
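The statement about the largest 100 superfamilies is a cumulative share over a ranked size distribution. A minimal sketch of that calculation follows; the superfamily names and counts are invented toy values, not CATH or Gene3D data.

```python
def top_n_share(domain_counts: dict[str, int], n: int) -> float:
    """Fraction of all domain sequences that fall in the n largest superfamilies."""
    sizes = sorted(domain_counts.values(), reverse=True)
    total = sum(sizes)
    return sum(sizes[:n]) / total if total else 0.0


# Hypothetical counts of domain sequences per superfamily (toy values).
toy_counts = {"sfA": 500, "sfB": 300, "sfC": 100, "sfD": 60, "sfE": 40}
share_of_top_two = top_n_share(toy_counts, 2)
```

Applied to real per-superfamily counts with `n=100`, the same function would reproduce the kind of figure quoted on the slide.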
Why is the distribution of superfamilies uneven? • Functionality: certain families expand with genome size (e.g. metabolic genes, Ig domains) • Designability: stable folds are compatible with more sequences • Stochastic effects: large families just got bigger • Goldstein, Curr Op Structural Biology 2008
Correlation of sequence and structural variability of CATH-Gene3D families with the number of different functional groups
Domain structure embellishments in the P-loop Hydrolase Superfamily • [Figure: fold spin plot]
2DSEC algorithm • Some superfamilies show great structural diversity • Multiple structural alignment allows identification of consensus secondary structures and secondary structure embellishments • In 117 superfamilies, relatives expanded by 2-fold or more • Gabrielle Reeves, J. Mol. Biol. (2006)
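The idea of separating consensus secondary structure from embellishments can be sketched with a per-column vote over aligned secondary structure strings. This is not the 2DSEC algorithm itself, whose details are not given on the slide. It assumes one string per relative using H (helix), E (strand), C (coil) and `-` (gap), and an arbitrary 75% agreement threshold.

```python
from collections import Counter


def consensus_secondary_structure(aligned_ss: list[str], min_fraction: float = 0.75) -> str:
    """Mark a column H or E if enough relatives agree; otherwise mark it C."""
    n_relatives = len(aligned_ss)
    consensus = []
    for col in range(len(aligned_ss[0])):
        counts = Counter(s[col] for s in aligned_ss if s[col] != "-")
        state, freq = counts.most_common(1)[0] if counts else ("C", 0)
        if state in "HE" and freq / n_relatives >= min_fraction:
            consensus.append(state)
        else:
            consensus.append("C")
    return "".join(consensus)


def embellishment_count(ss: str, consensus: str) -> int:
    """Count positions where a relative has H or E outside the consensus elements."""
    return sum(1 for s, c in zip(ss, consensus) if s in "HE" and c == "C")


# Toy alignment of four relatives; the third has an extended helix.
toy_alignment = ["HHHHCCEEEE", "HHHHCCEEEE", "HHHHHHEEEE", "HHHCCCEEEE"]
toy_consensus = consensus_secondary_structure(toy_alignment)
```

Relatives with many such extra positions are the ones the slide describes as structurally embellished relative to the common core.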
Conservation of binding site region I • Arginyl-tRNA synthetase (1f7uA01): ligand binding site • Pantetheine-phosphate adenylyltransferase: ligand binding site
Conservation of binding site region II • NH3-dependent NAD+ synthetase: deamido-NAD, ATP • ATP sulfurylase: sulfate, ATP • Tyrosyl-tRNA synthetase: L-tyrosine, ATP
Methods to predict function from structure • Which bit of function are you interested in? • Diversity of structural data (apo-, holo-, non-cognate ligands). • Different similarity cut-offs for different functions/families?
Using pre-defined binding site templates • Ligand/Catalytic prediction: SiteEngines, PDBSiteScan, MSDSite, Catalytic Site Atlas, Evolutionary Trace
Automatic binding site templates • e.g. GASP, DRESPAT, PINTS, FLORA, Reverse templates • Reverse templates: first explode the structure into 3-residue fragments (templates) • Match the reverse templates and assess the relevance of hits by looking at sequence conservation within the local environment • [Figure key: green and purple, identical residues; orange and white, similar residues] • Laskowski and Thornton (2005)
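The "explode into 3-residue fragments" step can be pictured as enumerating residue triples that sit close together in space. The sketch below is a loose illustration under stated assumptions, not the published reverse-template procedure: each residue is reduced to a single point, a triple is kept when all three pairwise distances are within a 10 Å cutoff, and the residue labels and coordinates are invented.

```python
import math
from itertools import combinations

# (one-letter residue name, residue number, representative coordinate)
Residue = tuple[str, int, tuple[float, float, float]]


def three_residue_fragments(residues: list[Residue], max_dist: float = 10.0) -> list[tuple[str, ...]]:
    """List residue triples whose members are all within max_dist of each other."""
    fragments = []
    for trio in combinations(residues, 3):
        if all(math.dist(p[2], q[2]) <= max_dist for p, q in combinations(trio, 2)):
            fragments.append(tuple(f"{name}{number}" for name, number, _ in trio))
    return fragments


# Toy structure: three residues close together and one far away.
toy_residues: list[Residue] = [
    ("H", 57, (0.0, 0.0, 0.0)),
    ("D", 102, (3.0, 0.0, 0.0)),
    ("S", 195, (0.0, 4.0, 0.0)),
    ("G", 300, (40.0, 0.0, 0.0)),
]
toy_fragments = three_residue_fragments(toy_residues)
```

Each fragment would then be searched against known structures, with hits ranked by how well the surrounding residues are conserved, as described on the slide.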
Methods for analysing ligand binding • Surface comparison: SURF’S UP, pvSOAR, Consurf. • Mapping sequence conservation: Evolutionary trace, many other methods
Can characterising structure-function families help with function prediction? • 1q77A00: unknown function • 1o97D01: Electron transfer flavoprotein • 1dnpA01: Deoxyribo-dipyrimidine photo-lyases • 1ej2A00: Nucleotidylyl-transferases • 1n3lA01: AA tRNA synthetases
FLORA: Collate functional groups • Functional group A, Functional group B • Align domains within FSG • Determine FSG-specific positions
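One simple reading of "group-specific positions" is alignment columns that are conserved within each functional group but differ between groups. The sketch below implements only that sequence-level toy criterion; FLORA itself works on structural features, and its actual selection rule is not given on the slide. The aligned sequences are invented.

```python
def group_specific_positions(group_a: list[str], group_b: list[str], gap: str = "-") -> list[int]:
    """Columns invariant within each group, gap-free, and different between groups."""
    positions = []
    for col in range(len(group_a[0])):
        residues_a = {seq[col] for seq in group_a}
        residues_b = {seq[col] for seq in group_b}
        conserved = len(residues_a) == 1 and len(residues_b) == 1
        if conserved and residues_a != residues_b and gap not in residues_a | residues_b:
            positions.append(col)
    return positions


# Toy aligned sequences for two hypothetical functional groups.
toy_group_a = ["GKSTL", "GKSAL"]
toy_group_b = ["GDSTV", "GDSAV"]
toy_positions = group_specific_positions(toy_group_a, toy_group_b)
```

Positions found this way are candidates for the features that distinguish one functional group from another within the same superfamily.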
FLORA: Extracting enzyme family-specific vectors • Comparing unclassified structures to templates: score similarity over all template vectors
FLORA: Performance of FLORA compared to structure comparison • [Plot: coverage vs. error]
Useful methods of function prediction from structure • PROFUNC: several methods (e.g. BLAST, MSDFold, template methods) • PROKNOW: annotation with Gene Ontology terms • FLORA: direct link to CATH
Summary • CATH decomposes PDB structures into their component domains and classifies domains into superfamilies. • There are some very large superfamilies, which are structurally and functionally diverse and dominate the genomes. • Structural data can help us understand how different protein functions have evolved.
Practical • http://www.cathdb.info • Click on Documentation at the top, then Tutorials, then “Combining structural and functional analysis”