Centre for Integrative Bioinformatics VU. Lecture 14: Protein domains, function and associated prediction. Introduction to Bioinformatics
Functional Genomics – Systems Biology [Figure: levels of study: Genome, Expressome, Proteome, Metabolome (metabolomics, fluxomics)]
Experimental • Structural genomics • Functional genomics • Protein-protein interaction • Metabolic pathways • Expression data
Issues when elucidating function experimentally • Typically done through knock-out experiments • Partial information (indirect interactions), with the missing steps filled in afterwards • Negative results (elements that have been shown not to interact, enzymes missing in an organism) • Putative interactions resulting from computational analyses
Protein function categories • Catalysis (enzymes) • Binding – transport (active/passive) • Protein-DNA/RNA binding (e.g. histones, transcription factors) • Protein-protein interactions (e.g. antibody-lysozyme), experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening • Protein-fatty acid binding (e.g. apolipoproteins) • Protein – small molecules (drug interaction, structure decoding) • Structural component (e.g. crystallin) • Regulation • Signalling • Transcription regulation • Immune system • Motor proteins (actin/myosin)
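Catalytic properties of enzymes
Michaelis-Menten equation: $V = \dfrac{V_{max} \times [S]}{K_m + [S]}$
Reaction scheme: $E + S \rightleftharpoons ES \xrightarrow{k_{cat}} E + P$
• E = enzyme • S = substrate • ES = enzyme-substrate complex • P = product • Km = Michaelis constant (the substrate concentration at which V = Vmax/2) • kcat = catalytic rate constant (turnover number) • kcat/Km = specificity constant (useful for comparison)
[Plot: reaction rate V (moles/s) against substrate concentration [S]; the curve levels off at Vmax and passes through Vmax/2 at [S] = Km]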
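The rate law above is easy to evaluate numerically. The following is a minimal sketch, not part of the original lecture; the function names and the example values for Vmax, Km and kcat are hypothetical.

```python
def michaelis_menten_rate(substrate, vmax, km):
    """Reaction rate V for a given substrate concentration [S]."""
    if substrate < 0 or km <= 0:
        raise ValueError("substrate must be >= 0 and km must be > 0")
    return vmax * substrate / (km + substrate)


def specificity_constant(kcat, km):
    """kcat/Km, used to compare enzymes or substrates."""
    return kcat / km


# Hypothetical values: Vmax = 10 (moles/s), Km = 2 (same units as [S]).
half_max = michaelis_menten_rate(2.0, vmax=10.0, km=2.0)     # [S] = Km
near_max = michaelis_menten_rate(2000.0, vmax=10.0, km=2.0)  # [S] >> Km
print(half_max, near_max)
```

At [S] = Km the rate is exactly half of Vmax, and for [S] much larger than Km the rate approaches Vmax, which is the saturation behaviour shown in the plot.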
Protein interaction domains http://pawsonlab.mshri.on.ca/html/domains.html
Energy difference upon binding Examples of functionally important protein interactions include: • Protein – protein (pathway analysis); • Protein – small molecules (drug interaction, structure decoding); • Protein – peptides, DNA/RNA The change in Gibbs free energy upon protein-ligand binding can be monitored and is expressed by the following equation: $\Delta G = \Delta H - T \Delta S$ (H = enthalpy, S = entropy, T = temperature)
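As a worked illustration of the free-energy relation above, the sketch below computes the free-energy change from an enthalpy change, an entropy change and a temperature. The function name and the numerical values are hypothetical and chosen only to show the sign convention: a negative result means binding is favourable.

```python
def gibbs_free_energy_change(delta_h, temperature, delta_s):
    """Free-energy change = delta_h - temperature * delta_s.

    Units assumed here: delta_h in J/mol, temperature in K,
    delta_s in J/(mol K); the result is in J/mol.
    """
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
delta_g = gibbs_free_energy_change(delta_h=-50_000.0, temperature=298.15, delta_s=-100.0)
print(round(delta_g / 1000.0, 1), "kJ/mol")
```

With these made-up numbers the entropy term costs about 29.8 kJ/mol, so the net free-energy change is roughly -20.2 kJ/mol and binding remains favourable.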
Protein function • Many proteins combine functions • Some immunoglobulin structures are thought to have more than 100 different functions (and active/binding sites) • Alternative splicing can generate (partially) alternative structures
Protein function & Interaction Active site / binding cleft Shape complementarity
Protein function evolution: Chymotrypsin. From a simple ancestral active site for cutting protein chains... to a more elaborate active site with four different features, all helping to optimise proteolysis (cleavage). Gene duplication has resulted in a two-domain protein.
Chymotrypsin: Protein function evolution. The active site lies between the two domains. It consists of residues on the same two loops (the first between beta-strands 3 and 4, the second between beta-strands 5 and 6) of each of the two barrel domains. Four features of the active site are indicated in the figure: • the substrate specificity pocket • main-chain substrate binding • the oxyanion hole (white) • the catalytic triad. Chymotrypsin cleaves peptides at the carboxyl side of tyrosine, tryptophan, and phenylalanine because these three amino acids have large aromatic side chains that fit the substrate specificity pocket.
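The cleavage rule stated above can be turned into a small in-silico digest. This is a minimal sketch that assumes cleavage after every F, W or Y and ignores any secondary preferences of the real enzyme; the function name and the example peptide are hypothetical.

```python
def chymotrypsin_fragments(sequence, sites="FWY"):
    """Split a one-letter protein sequence after each F, W or Y.

    The cut is made on the carboxyl side of the aromatic residue, so that
    residue ends the fragment. No cut is made after the last residue.
    """
    fragments = []
    start = 0
    for index, residue in enumerate(sequence):
        if residue in sites and index < len(sequence) - 1:
            fragments.append(sequence[start:index + 1])
            start = index + 1
    fragments.append(sequence[start:])
    return fragments


print(chymotrypsin_fragments("AAFGGYKW"))  # hypothetical peptide
```

For the made-up peptide AAFGGYKW the sketch cuts after F and after Y, giving the fragments AAF, GGY and KW.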
How to infer function • Experiment • Deduction from sequence • Multiple sequence alignment – conservation patterns • Homology searching • Deduction from structure • Threading • Structure-structure comparison • Homology modelling
A domain is a: • Compact, semi-independent unit (Richardson, 1981). • Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). • Recurring functional and evolutionary module (Bork, 1992). • “Nature is a tinkerer and not an inventor” (Jacob, 1977). • Smallest unit of function
Delineating domains is essential for: • Obtaining high resolution structures (x-ray but particularly NMR – size of proteins) • Sequence analysis • Multiple sequence alignment methods • Prediction algorithms (SS, Class, secondary/tertiary structure) • Fold recognition and threading • Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method) • Structural/functional genomics • Cross genome comparative analysis
Domain connectivity linker
Structural domain organisation can be nasty… Pyruvate kinase (phosphotransferase) • β-barrel regulatory domain • α/β-barrel catalytic substrate-binding domain • α/β nucleotide-binding domain • 1 continuous + 2 discontinuous domains
Domain size • The size of individual structural domains varies widely • from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998) • the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995) • with an average of about 100 residues (Islam et al., 1995). • Small domains (less than 40 residues) are often stabilised by metal ions or disulphide bonds. • Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).
Domain characteristics Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that ‘Nature is a tinkerer and not an inventor’ (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).
Protein function evolution-Gene (domain) duplication - Active site Chymotrypsin
Pyruvate phosphate dikinase • 3-domain protein • Two domains catalyse 2-step reaction A → B → C • Third so-called ‘swivelling domain’ actively brings intermediate enzymatic product (B) over 45 Å from one active site to the other
The DEATH Domain • Present in a variety of Eukaryotic proteins involved with cell death. • Six helices enclose a tightly packed hydrophobic core. • Some DEATH domains form homotypic and heterotypic dimers. http://www.mshri.on.ca/pawson
Detecting Structural Domains • A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983). • Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997). • Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).
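The idea that a domain has more interactions within itself than with the rest of the structure can be illustrated with a toy calculation. The sketch below is not any of the cited methods: it assumes a single continuous boundary, an 8 Å contact cutoff on Cα-like coordinates, and hypothetical function names, and it simply picks the chain position at which the fewest contacts cross the cut.

```python
import math


def contact_pairs(coords, cutoff=8.0):
    """All residue pairs (i, j), i < j, closer than the cutoff (assumed 8 A)."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) <= cutoff]


def isolation_score(coords, boundary, cutoff=8.0):
    """Fraction of contacts crossing the cut before residue `boundary`.

    Lower values mean the two halves are more isolated from each other.
    """
    pairs = contact_pairs(coords, cutoff)
    if not pairs:
        return 0.0
    crossing = sum(1 for i, j in pairs if i < boundary <= j)
    return crossing / len(pairs)


def best_boundary(coords, min_len=3, cutoff=8.0):
    """Boundary with the lowest isolation score, keeping both parts >= min_len."""
    candidates = range(min_len, len(coords) - min_len + 1)
    return min(candidates, key=lambda b: isolation_score(coords, b, cutoff))


# Toy data: two compact 5-residue blobs placed 30 A apart.
blob = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
toy_coords = blob + [(x + 30, y, z) for x, y, z in blob]
print(best_boundary(toy_coords))
```

A single cut of this kind cannot describe discontinuous domains, which is one reason real assignment methods are considerably more involved.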
Detecting Structural Domains • However, approaches encounter problems when faced with discontinuous or highly associated domains and many definitions will require manual interpretation. • Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).
Detecting Domains using Sequence only • Even more difficult than prediction from structure!
SnapDRAGON: integrating protein multiple sequence alignment, secondary and tertiary structure prediction in order to predict structural domain boundaries in sequence data • Richard A. George • George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
Protein structure hierarchical levels • PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH • SECONDARY STRUCTURE (helices, strands) • TERTIARY STRUCTURE (fold) • QUATERNARY STRUCTURE
SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George) • Input: Multiple sequence alignment (MSA) and predicted secondary structure • Generate 100 DRAGON 3D models for the protein structure associated with the MSA • Assign domain boundaries to each of the 3D models (Taylor, 1999) • Sum proposed boundary positions over the 100 models along the length of the sequence, and smooth boundaries using a weighted window George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.
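The last step of the protocol, summing the boundary positions proposed by the individual models and smoothing them with a weighted window, can be sketched as follows. The triangular window weights, the function name and the toy input are assumptions made for illustration; the published method may use a different window.

```python
def summed_and_smoothed_boundaries(boundaries_per_model, seq_len,
                                   weights=(1, 2, 3, 2, 1)):
    """Count boundary votes per residue and smooth them with a weighted window.

    boundaries_per_model: one list of boundary positions (0-based) per 3D model.
    weights: assumed triangular window, centred on the residue being smoothed.
    """
    counts = [0] * seq_len
    for boundaries in boundaries_per_model:
        for position in boundaries:
            counts[position] += 1

    half = len(weights) // 2
    smoothed = []
    for i in range(seq_len):
        total = 0.0
        norm = 0.0
        for k, weight in enumerate(weights):
            j = i + k - half
            if 0 <= j < seq_len:  # truncate the window at the chain ends
                total += weight * counts[j]
                norm += weight
        smoothed.append(total / norm)
    return counts, smoothed


# Toy input: five models instead of 100, each proposing one boundary.
models = [[10], [11], [10], [9], [25]]
counts, profile = summed_and_smoothed_boundaries(models, seq_len=30)
predicted = max(range(len(profile)), key=profile.__getitem__)
print(predicted)
```

Smoothing lets nearby votes (positions 9, 10 and 11 in the toy input) reinforce each other, so a cluster of similar predictions outweighs an isolated one.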
SnapDRAGON [Figure: pipeline from the multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE), via folds generated by DRAGON and boundary recognition (Taylor, 1999), to summed and smoothed boundaries]
SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George) • Input: Multiple sequence alignment (MSA) • Sequence searches using PSI-BLAST (Altschul et al., 1997) • followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992) • and alignment by PRALINE (Heringa, 1999) • Input: Predicted secondary structure • PREDATOR secondary structure prediction program
Domain prediction using DRAGON: Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994) • Folded protein models based on the requirement that (conserved) hydrophobic residues cluster together. • First construct a random high-dimensional Cα distance matrix. • Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
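The core distance geometry step, recovering 3D coordinates from a matrix of desired distances, can be sketched with classical metric-matrix embedding. This is explicitly not DRAGON itself (which projects gradually from a high-dimensional space and uses alignment-derived restraints); it is a minimal pure-Python illustration with hypothetical function names, using power iteration to find the three leading eigenvectors.

```python
import math
import random


def gram_from_distances(dist):
    """Double-centre the squared distances to obtain the metric (Gram) matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Leading eigenvalue and eigenvector by power iteration."""
    n = len(matrix)
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
                for i in range(n))
    return value, vec


def embed(dist, dims=3):
    """Coordinates whose pairwise distances approximate the target matrix."""
    gram = gram_from_distances(dist)
    n = len(gram)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        value, vec = top_eigenpair(gram, seed=d)
        scale = math.sqrt(max(value, 0.0))  # clamp: inexact targets can go negative
        for i in range(n):
            coords[i][d] = scale * vec[i]
        # Deflate so the next pass finds the next eigenvector.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords


# Toy target matrix taken from five known points.
points = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (0, 0, 5), (1, 1, 1)]
target = [[math.dist(p, q) for q in points] for p in points]
model = embed(target)
worst = max(abs(math.dist(model[i], model[j]) - target[i][j])
            for i in range(5) for j in range(5))
print(worst < 1e-6)
```

The recovered coordinates match the originals only up to rotation, translation and reflection, since a distance matrix carries no information about absolute orientation or handedness.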
SNAPDRAGON: Domain boundary prediction protocol using sequence information alone (Richard George) • Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA • DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together • (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand). • It first constructs a random high-dimensional Cα (and pseudo Cβ) distance matrix • Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure) DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN
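How a predicted secondary structure string can yield distance estimates is shown in the sketch below. It assumes idealised geometry (a rise of about 1.5 Å per residue in an α-helix and about 3.3 Å per residue in a β-strand); the function names are hypothetical and the actual restraints used by DRAGON may be derived differently.

```python
from itertools import groupby

# Assumed idealised rise per residue along the segment axis, in angstroms.
RISE_PER_RESIDUE = {"H": 1.5, "E": 3.3}


def ss_segments(ss):
    """(state, first, last) for each helix (H) or strand (E) run, 0-based."""
    segments = []
    start = 0
    for state, run in groupby(ss):
        length = len(list(run))
        if state in RISE_PER_RESIDUE:
            segments.append((state, start, start + length - 1))
        start += length
    return segments


def end_to_end_targets(ss):
    """Estimated distance between the first and last residue of each segment."""
    return {(first, last): (last - first) * RISE_PER_RESIDUE[state]
            for state, first, last in ss_segments(ss)}


print(end_to_end_targets("CCHHHCCEEE"))
```

For the string CCHHHCCEEE this gives one helix (residues 2 to 4) and one strand (residues 7 to 9); the strand spans roughly twice the distance of the helix for the same number of residues, which is the kind of information that can be fed into a target distance matrix.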
[Figure: the input data (multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE) define an N × N target matrix; 100 randomised initial N × N Cα distance matrices lead to 100 predictions] • The Cα distance matrix is divided into smaller clusters. • Separately, each cluster is embedded into a local centroid. • The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
[Figure: Lysozyme (4lzm), PDB structure and DRAGON model]
[Figure: Methyltransferase (1sfe), PDB structure and DRAGON model]
[Figure: Phosphatase (2hhm-A), PDB structure and DRAGON model]