Lecture 9.2: Homology and Structural Similarity (What to do when you have no structure ...) Boris Steipe boris.steipe@utoronto.ca http://biochemistry.utoronto.ca/steipe Departments of Biochemistry and Molecular and Medical Genetics, Program in Proteomics and Bioinformatics, University of Toronto (This lecture is based in part on a lecture held by Chris Hogue, Toronto, for CBW in 2002)
Concepts
• Domains are folding units, functional units and units of inheritance.
• Homologous domains have similar structure.
• Structural similarity can be measured and similar domains can be retrieved from databases.
• Detection of similar folds can provide mechanistic explanations.
• Threading methods can sometimes find similar folds.
• Ab initio predictions of structure are highly experimental.
Concept 1: Domains are folding units, functional units, and units of inheritance.
Domains as units of inheritance - the PH domain story Dotlet - A dotplot of Pleckstrin (p47) against itself reveals similarity between the N- and C-terminus!
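The idea behind a self-dotplot can be sketched in a few lines. This is a minimal illustration only, not Dotlet's algorithm: it scores windows by plain residue identity rather than with a substitution matrix, and the function name, the window parameters and the toy sequence (a 10-residue segment repeated after a short linker) are assumptions made for the example.

```python
def dotplot(seq_a, seq_b, window=5, min_matches=4):
    """Return the set of (i, j) window start positions at which seq_a and
    seq_b share at least `min_matches` identical residues within `window`."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(dots, n_rows, n_cols):
    """Draw the dot matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(n_cols))
        for i in range(n_rows)
    )


# Hypothetical toy sequence with an internal repeat (not pleckstrin).
toy = "MKTWAYLQRS" + "GGSGG" + "MKTWAYLQRS"
dots = dotplot(toy, toy)
print(render(dots, len(toy) - 4, len(toy) - 4))
```

In the printed matrix the main diagonal is the trivial self-match; the shorter off-diagonal lines are the signal of an internal repeat, which is what the pleckstrin dotplot shows for its two PH domains.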
Domains as units of inheritance - the PH domain story
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 100
# Identity:      31/100 (31.0%)
# Similarity:    48/100 (48.0%)
# Gaps:           6/100 ( 6.0%)

  6 IREGYLVKKGSVFNTWKPMWVVLLEDG--IEFYKKKSDNSPKGMIPLKGS  53
    |::|.|:|:|.....||....:|.||.  :.:|.......|.|.|.|:|.
245 IKQGCLLKQGHRRKNWKVRKFILREDPAYLHYYDPAGAEDPLGAIHLRGC 294

 54 TLTSPCQDFGKRMF----VFKITTTKQQDHFFQAAFLEERDAWVRDINKA  99
    .:||...:...|..    :|:|.|..:..:|.|||..:||..|::.|..|
295 VVTSVESNSNGRKSEEENLFEIITADEVHYFLQAATPKERTEWIKAIQMA 344

Emboss - Optimal sequence alignment: 31% identity over ~100 amino acids.
Domains as units of inheritance - the PH domain story
[Figure: the two aligned segments marked on human p47, N- to C-terminus]
Overlapping alignments may define domain boundaries! We can search a database with this knowledge ...
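The summary figures in the EMBOSS header can be recomputed directly from the two aligned rows. The sketch below is only a check of the arithmetic, assuming that identity and gaps are counted per alignment column and divided by the alignment length (which matches the 31/100 and 6/100 reported above); the function name `alignment_stats` is made up for this example, and the similarity count is left out because it needs the substitution matrix.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    length = len(row_a)
    identity = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    gaps = sum(1 for a, b in zip(row_a, row_b) if a == gap or b == gap)
    return {
        "length": length,
        "identity": identity,
        "gaps": gaps,
        "percent_identity": 100.0 * identity / length,
    }


# The two rows of the pleckstrin alignment shown above (residues 6-99 and 245-344).
n_term = ("IREGYLVKKGSVFNTWKPMWVVLLEDG--IEFYKKKSDNSPKGMIPLKGS"
          "TLTSPCQDFGKRMF----VFKITTTKQQDHFFQAAFLEERDAWVRDINKA")
c_term = ("IKQGCLLKQGHRRKNWKVRKFILREDPAYLHYYDPAGAEDPLGAIHLRGC"
          "VVTSVESNSNGRKSEEENLFEIITADEVHYFLQAATPKERTEWIKAIQMA")

stats = alignment_stats(n_term, c_term)
print(stats)
```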
Domains as units of inheritance - the PH domain story [Figure: human p47, N- to C-terminus] Hits are smoothly bounded and extend over the entire domain. 486 hits ... etc.
Domains as units of inheritance - the PH domain story in contrast ... [Figure: human p47, N- to C-terminus] Hits extend over the entire domain. PSI Blast would be difficult ... (Yeast only, for clarity)
Concept 2: Homologous domains have similar structure.
Homologous domains have similar structures 1PLS/2DYN: 23% ID 1PLS - PH domain (Human pleckstrin) 2DYN - PH domain (Human dynamin)
Homology and Structural Similarity Proteins that diverge in evolution maintain their global fold ! Russell et al. (1997) J Mol Biol 269: 423-439
Concept 3: Structural similarity can be measured and similar domains can be retrieved from databases.
RMSD metric To calculate the RMSD, a pairwise correspondence of points has to be defined first.
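Once the correspondence is fixed, the coordinate RMSD is the square root of the mean squared distance between paired points. A minimal sketch, assuming each structure is given as a list of (x, y, z) tuples in corresponding order (for example C-alpha positions of aligned residues); the name `rmsd_coord` is borrowed from the notation of the next slide, and the toy coordinates are invented.

```python
import math


def rmsd_coord(points_a, points_b):
    """RMSD between two point sets; element i of A is paired with element i of B."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point sets of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(points_a, points_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(points_a))


# Toy example: B is A shifted by one unit along z, so every pair is 1.0 apart.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd_coord(a, b))
```

Note that this value depends on how the two structures happen to be placed in space, which is why a superposition step is needed before the number means anything.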
RMSDopt
$$\mathrm{RMSD}_{opt} = \min_{R,\,T}\ \mathrm{RMSD}_{coord}\big(A,\ R\,(B - T)\big) = \mathrm{RMSD}_{coord}\big(A,\ R_s\,(B - T_s)\big)$$
The translation vector $T_s$ and the rotation matrix $R_s$ define the optimal superposition of the vector set B on A. An analytic solution of the superposition problem is available, but not straightforward (it involves an eigenvalue problem).
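One well-known analytic route is the quaternion formulation: after moving both centroids to the origin, the largest eigenvalue of a symmetric 4x4 matrix built from the coordinate cross-products gives the minimal residual directly. The sketch below follows that route as an illustration, not necessarily the method the lecture had in mind. It uses plain power iteration for the eigenvalue (a simplification chosen to stay within the standard library) and returns only the optimal RMSD; the corresponding eigenvector would encode the rotation as a unit quaternion. All function names are assumptions.

```python
import math


def centered(points):
    """Translate a point set so that its centroid sits at the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]


def largest_eigenvalue(matrix, iterations=2000):
    """Largest eigenvalue of a symmetric matrix by shifted power iteration."""
    size = len(matrix)
    # The Frobenius norm bounds every eigenvalue, so the shifted matrix is
    # positive semi-definite and the iteration converges on the largest one.
    shift = math.sqrt(sum(v * v for row in matrix for v in row))
    vec = [1.0 / (k + 1) for k in range(size)]
    for _ in range(iterations):
        nxt = [sum(matrix[r][c] * vec[c] for c in range(size)) + shift * vec[r]
               for r in range(size)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return 0.0
        vec = [v / norm for v in nxt]
    mv = [sum(matrix[r][c] * vec[c] for c in range(size)) for r in range(size)]
    return sum(vec[r] * mv[r] for r in range(size))  # Rayleigh quotient


def rmsd_opt(points_a, points_b):
    """Minimal RMSD over all rigid-body placements of B onto A."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point sets of equal length")
    a, b = centered(points_a), centered(points_b)
    # s[j][k] = sum over pairs of (coordinate j of b) * (coordinate k of a)
    s = [[sum(q[j] * p[k] for p, q in zip(a, b)) for k in range(3)] for j in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    key = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    lam = largest_eigenvalue(key)
    ga = sum(c * c for p in a for c in p)
    gb = sum(c * c for p in b for c in p)
    return math.sqrt(max(0.0, (ga + gb - 2.0 * lam) / len(a)))


# Toy check: B is A rotated by 90 degrees about z and then translated.
struct_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
struct_b = [(5.0, -3.0, 2.0), (5.0, -2.0, 2.0), (3.0, -3.0, 2.0), (5.0, -3.0, 5.0)]
print(rmsd_opt(struct_a, struct_b))  # close to zero
```

Production code would normally use a linear algebra library (an SVD or a proper eigensolver) instead of power iteration, which converges slowly when the two largest eigenvalues are close.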
Superposition in practice
Prealigned structures
• VAST (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)
• FSSP (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
• Homstrad (http://www-cryst.bioc.cam.ac.uk/~homstrad/)

                       60        70        80        90       100
1dro  (  32 )  wdkVyMaAkAG-------rIsFykd-qkgyk----------snpelTfrg
1btn  (  23 )  whnVyCvin-------nqeMgFykd-aksaa----------sg--ipYhs
1pls  (  21 )  wkpmwVVLle-------dgIeFykk-ksdn---------------spk--
1fgya ( 281 )  wkrrwFiLTd-------ncLyYFey-ttdk---------------epr--
1faoa ( 181 )  wktrwFtLhr-------neLkYfkd-qmsp---------------epi--
1qqga (  25 )  mhkrFFVLraaseaggparLEyYen-ekkwr----------hkssapk--
1bak  ( 576 )  wqrryFyLfp-------nrlewrge----------------geap-----
1dyna (  30 )  skeYwFvLta-------enLsWykd-deek---------------ekk--
1dbha ( 456 )  kherhIFLFd--------gLICCksnhgqprl--------pgasnaeyrL
1b55a (  25 )  fkkrlFlLtv-------hkLsYyeydfe--r----------grrgskk--
1mai  (  37 )  rreRfYkLqe-----dcktIwqesr-kv-----------------mrspe
1fhoa (  25 )  pKlRyVfLfr-------nkimFtEqd---ast--------s---ppsyth
1foea (1288 )  ePeLaAfVFk-------tAVVLVykdgskqkkklvgshrlsiyeewdpfr
               bbbbbb bbbbb
Superposition in practice
Web services
• VAST (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)
• CE (http://cl.sdsc.edu/ce.html)
• LGA (http://predictioncenter.llnl.gov/local/lga/lga.html)
• Prosup (http://lore.came.sbg.ac.at:8080/CAME/CAME_EXTERN/PROSUP/) (Note: Click on "Rasmol" on the results page to return the alignment)
Usability and reliability of these services are variable. "Intelligent" algorithms can superimpose structures without requiring the user to define the correspondence between residues. The downside is that the user cannot define correspondences.
Superposition in practice - locally installed
Many molecular modeling programs have superposition features:
• DeepView (http://ca.expasy.org/spdbv/)
• MolMol (http://www.mol.biol.ethz.ch/wuthrich/software/molmol/)
• O (http://alpha2.bmc.uu.se/~alwyn/o_related.html)
• WhatIf (http://www.cmbi.kun.nl/whatif/)
When is RMSD misleading ? Rigid body movement of domains or subdomains ... ?
Internal coordinates as an alternative to superposition [Figure: points a, b, c in one structure and a', b', c' in the other, compared as pairs (a,a'), (b,b'), (c,c')]
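One common form of internal coordinates is the matrix of distances between points within each structure. Because these distances do not change under rotation or translation, two structures can be compared without superimposing them. The sketch below computes the RMS difference between corresponding intramolecular distances; the function names are assumptions, and this is an illustration of the principle rather than the scoring used by any particular server.

```python
import math


def distance_matrix(points):
    """All pairwise distances within one structure."""
    return [[math.dist(p, q) for q in points] for p in points]


def distance_rmsd(points_a, points_b):
    """RMS difference between corresponding intramolecular distances."""
    n = len(points_a)
    if n != len(points_b) or n < 2:
        raise ValueError("need two point sets of equal length >= 2")
    da, db = distance_matrix(points_a), distance_matrix(points_b)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (da[i][j] - db[i][j]) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


# Toy check: B is A rotated by 90 degrees about z and translated.
# No superposition is performed, yet the score is zero.
shape_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
shape_b = [(5.0, -3.0, 2.0), (5.0, -2.0, 2.0), (3.0, -3.0, 2.0), (5.0, -3.0, 5.0)]
print(distance_rmsd(shape_a, shape_b))
```

Two caveats apply: a distance matrix cannot tell a structure from its mirror image, and the individual entries `da[i][j] - db[i][j]` are often more informative than the single summary number, since they show which parts of the structure moved relative to each other.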
... and FSSP The prealigned fold-tree
Workflow: MMDB ... Open http://www.ncbi.nlm.nih.gov/ enter your search term ...
Workflow: MMDB ... Choose "Structure" ...
Workflow: MMDB ... Choose your protein of interest ...
Concept 4: Detection of similar folds can provide mechanistic explanations.
Protein Modules Modular interactions between biomolecules are responsible for the inner workings of the cell. The human genome encodes far more modular interacting proteins than classical enzymes; this has been apparent since the S. cerevisiae genome. Pawson & Lin
ANK3 BH1 C1 C2 14-3-3 ARM CARD Death DED EFH EH EVH FYVE PDZ PH PTB SAM SH2 SH3 WD40 WW Protein Domains – an alphabet of functional modules
Workflow for domain architectures Starting from a citation ...
... link to domain architecture ... (from the CDD database - incl. SMART and Pfam)
Protein structure prediction What to do when no structure is known and no homologues are found ?
Three Paths to Protein Structure Prediction
• Homology Modeling
• Threading (Fold recognition)
• Ab initio prediction
Concept 5: Threading methods can sometimes find similar folds.
Fold recognition ("Threading") [Figure: query sequences threaded onto a template structure]
Threading Database Search
• Premise is that most sequences match some 3-D structure that is already known (1/2)
• Given a database of known 3-D protein folds:
  • align the test sequence to each known protein
    • in real 3-D coordinate space (slow but exact)
    • in parameterized 1-D space (fast but approximate)
  • optimize some scoring function
  • sort out best sequence-structure alignment
  • assess alignments - statistically significant?
Threading Statistics
• Z score (sequence composition correction)
  • number of standard deviations the found alignment is off from the mode of a randomized version of the structure or profile
• P value (sequence length correction)
  • Shuffle the sequence - make a distribution of random threads…
  • Is the unscrambled thread any better than a randomly optimized sequence…
• Z score of Z scores
• Look for P values as a criterion for choosing a threading method...
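The shuffling idea can be sketched generically. In the example below the threading score is a stand-in: `toy_score` simply counts hydrophobic residues at a made-up set of "buried" template positions, and both it and the query sequence are hypothetical. The sketch also measures the offset from the mean of the shuffled scores, a common simplification, rather than from the mode mentioned on the slide.

```python
import random
import statistics


def shuffle_z_score(sequence, score_fn, n_shuffles=200, seed=0):
    """Z score of the real sequence against composition-preserving shuffles."""
    rng = random.Random(seed)
    observed = score_fn(sequence)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, scrambled order
        background.append(score_fn("".join(residues)))
    mean = statistics.fmean(background)
    sd = statistics.pstdev(background)
    if sd == 0.0:
        raise ValueError("shuffled scores have no spread; Z score undefined")
    return (observed - mean) / sd


# Hypothetical stand-in for a threading score: hydrophobic residues
# placed at template positions assumed to be buried.
BURIED_POSITIONS = (0, 1, 2, 3, 4, 5)


def toy_score(seq):
    return sum(1 for i in BURIED_POSITIONS if seq[i] in "AVILMFWC")


query = "VILMFWDEKRDEKRDEKRDEKR"  # invented sequence, not a real protein
z = shuffle_z_score(query, toy_score)
print(round(z, 2))
```

Because shuffling keeps the amino acid composition fixed, a high Z score here says that the order of residues, not just their composition, fits the template.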
Database Searching...
• Sensitivity
  • High sensitivity implies finding all possible true positive matches in the database
• Specificity
  • High specificity implies finding no false positive matches in the search.
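These quantities are simple ratios of the four outcome counts of a search. The sketch below uses the standard definitions and also reports precision (the fraction of reported hits that are true), since a statement about false positives among the hits is really a statement about precision. The counts in the example are invented purely for illustration.

```python
def search_metrics(tp, fp, fn, tn):
    """Standard ratios from true/false positive and negative counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true matches found / all true matches
        "specificity": ratio(tn, tn + fp),  # non-matches rejected / all non-matches
        "precision": ratio(tp, tp + fp),    # true matches / all reported hits
    }


# Invented example: a database of 1000 folds contains 10 true relatives;
# the search reports 10 hits, of which 1 is a true relative.
metrics = search_metrics(tp=1, fp=9, fn=9, tn=981)
print(metrics)
```

In this made-up case the specificity is still above 0.99 because true negatives dominate the database, even though nine out of ten reported hits are wrong. For ranked hit lists, precision is therefore often the more telling number.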
Threading as a Database Search Method
• Has INCREDIBLY poor sensitivity
  • 10-20% on a good day
• Has INCREDIBLY poor specificity
  • 90% of hits are false positives
• So...
Interpret Threading Accordingly...
• In a ranked list of 10 matches, expect that only one might be correct
• Expect that none may be correct
• Expect that the top ranked hit is a false positive...
How then does Threading find things?
• If there is a true positive in a threading search hit list - People find it ...
• It is most often found by FUNCTIONAL similarity.
  • Similar enzymatic mechanisms
  • Motifs, DART ...
  • Similar roles, cellular distributions ...