400 likes | 599 Views
Protein homology detection by HMM–HMM comparison Johannes Söding. A topic in Sequence analysis Presented by: Giriprasad Sridhara giri@UDel.edu CISC 841 Spring 2006 APR 20 2006. Organization of presentation. Introduction Theory Results Conclusion. Introduction. Paper Details:
E N D
Protein homology detectionby HMM–HMM comparisonJohannes Söding A topic in Sequence analysis Presented by: Giriprasad Sridhara giri@UDel.edu CISC 841 Spring 2006 APR 20 2006
Organization of presentation • Introduction • Theory • Results • Conclusion
Introduction • Paper Details: • Bioinformatics journal • Vol. 21 no. 7 2005, pages 951–960 • Author Details • Dr. Johannes Söding • Department of Protein Evolution, MaxPlanckInstitute for Developmental Biology, Spemannstrasse 35, D72076 Tübingen, Germany
Introduction • Tool Details: • A tool HHPred has been developed. • Described in Nucleic Acid Research, 2005, Vol 33 • A web server is available at http://www.protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred
Introduction • Central theme in bioinformatics: • Homology and sequence alignment • Issues: • Problem of finding a close homolog with known function or structure which would allow to make inferences about the protein under observation. • New and highly sensitive methods could detect and align remotely homologous sequences that provide information about the protein’s function, structure or evolution.
Introduction • Methods (Tools) of homology detection: (In increasing order of sensitivity) • Sequence - Sequence • BLAST • FASTA • Profile - Sequence • PSIBLAST • More sensitive since it uses a sequence profile • Profile – Profile • COMPASS • PROF_SIM • Profile - HMM • HMMER • HMM-HMM • HHPred
Introduction • Sequence profiles • Built from a multiple alignment of homologous sequences • Contains more information about the sequence family than a single sequence. • Helps to distinguish between • conserved and non-conserved positions • Conserved are important for defining members of the family • Non-conserved are variable among the members of the family. • Describe exactly • what variation in amino acids is possible at each position • Done by recording the probability for the occurrence of each amino acid along the multiple alignment.
Introduction • Profile Hidden Markov Models (Profile HMMs) • Similar to simple sequence profiles • have amino acid frequencies as in the columns of a MSA • Also have position specific probabilities for inserts and deletions along the alignment • logarithms of these probabilities =position specific gap penalties • Perform better than sequence profiles in the detection of homologs and in the quality of alignments • Why higher sensitivity? • Position specific gap penalties penalize chance hits much more than true positives • which tend to have insertions or deletions at the same positions as the sequences from which the HMM was built.
Introduction • Pictorial representation of profile HMM With M, I and D states.
Theory • Align 2 HMM by maximizing a score • Score is log-sum-of-odds score. • What does a path through the 2 HMMs Represent? • A sequence co-emitted by both HMMs • How do we find this path? • Use dynamic programing (Viterbi) • Find path that maximizes log-sum-of-odds score
Theory • Advantages of HMM-HMM • Improves both sensitivity and alignment quality • Calibrate the score for additional sensitivity • Use scoring correlation function • Use secondary structure information • Even sequences that are distantly homologous will have similar secondary structures. • This can help distinguish real homologs from chance hits • Biologically, secondary structures diverge more slowly than sequences • This knowledge is utilized.
Theory • Additionally to enhance homology detection • Score secondary structure • Use other available additional information (like confidence – term covered later on in the slides) • Tool HHPred • Homology detection & structure prediction • Novelty • HMM-HMM comparison • Scores secondary structure • Reliability measured by • Probability of each match being a true positive • Used since e-values reported by most tools can be inaccurate
Theory (Log-sum-of-odds score) • Defined as • Numerator • probability that x1,…xL is co-emitted by both HMMs along the alignment path • Denominator • Standard null model probability • Summation • Runs over all sequences of L residues that can be emitted along the alignment path by both HMMs
Theory • How do we apply Viterbi algorithm? • Denote • 2 HMMs p and q • Probability of emitting amino acid a in match state i or j is qi(a) and pj(a) • Trans prob = qi(X, X’) and pj(Y,Y’) • X or Y can belong to {M, I or D} • f(a) = fixed background frequency • Let Xk and Yk be states in q and p in the k’th column of the alignment of q with p. • i(k) and j(k) be the corresponding columns from q and p. • qk(l)P (a) and pk(l)P(a) = emission prob from q and p.
Theory • Ρtr is the product of all transition probabilities for the path through p and q • qk(l)P (a) = qi(k) (a) for Xk = M • qk(l)P (a) = f (a) for Xk = I
Theory • Column score properties: • Positive when 2 distributions are similar • Negative otherwise • Insert states have vanishing column score • Completely non-conserved, pj(a) = f(a) • 1/f(a) • Weight factor to co-emission probability • For a rare amino acid • f(a) will be low 1/f(a) will be high • Weight of rarer amino acids increases in the score calculation as compared to common amino acids.
Theory • Pair-wise alignment of HMMs • Allowed transitions • Dynamic programing matrices for Viterbi
Theory • We use 5 DP matrices S xy one for each pair state XY belonging to {MM, MI, IM, DG, GD} • SMM (i, j) = Saa(qi,pj) + max { SMM(i-1,j-1) + log[q i-1(M,M) p j-1(M,M)], SMI(i-1,j-1) + log[q i-1(M,M) p j-1(I,M)] SIM(i-1,j-1) + log[q i-1(I,M) p j-1(M,M)] SDG(i-1,j-1) + log[q i-1(D,M) p j-1(M,M)] SGD(i-1,j-1) + log[q i-1(M,M) p j-1(D,M)] }
Theory • SMI (i, j) = max { SMM(i-1,j) + log[q i-1(M,M) p j-1(M,I)], SMI(i-1,j) + log[q i-1(M,M) p j-1(I,I)] } • SDG (i, j) = max { SMM(i-1,j) + log[q i-1(M,D)], SDG(i-1,j) + log[q i-1(D,D) } • Initialize SMM(I,0) = 0 = SMM(0,j) • S LSO = max over last row, col of S MM • Trace back from this cell.
Theory • Scoring correlations • Clustering • In an alignment of 2 homologous HMMs • Expect high column scores in • Clusters along the sequence • In an alignment of non-homologous HMMs • Do not Expect any clustering. • The above can help • Differentiate homologous and non-homologous alignments • If l’th pair state of optimum path aligns columns i(l) of q and j(l) of p • Sl = SAA(qi(l), pj(l)) iff l’th pair state = MM, else 0. • Auto-correlation function
Theory • Scoring correlations • Auto-correlation function describes correlation of Sl at a fixed sequence separation d • Expect • if 2 HMMs are homologous • A Positive g(d) for small d. • Add a correction factor • wcorr is found empirically to be 0.1 • The correction factor is added after the best alignment is found.
Theory • Scoring secondary structure • Allows to score predicted secondary structure against • Another predicted secondary structure • Or a known secondary structure • Predicted secondary structure vs. known secondary structure. • DSSP used to assign 1 of 7 states of observed secondary structure • PSIPRED used to predict secondary structure states, H, E or C. • Predict secondary structure of all domains in SCOP (filtered to twilight zone) • Compare the PSIPRED predictions with DSSP • Get the count of combination of (σ;ρ,c). • σ belongs to {H,E,C,G,B,S,T} • ρ belongs to {H,E,C} • c belongs to {0,1,…,9}
Theory • Scoring secondary structure • Derive 10 3*7 substitution matrices (one for each confidence value) Mss(σ;ρ,c) = log (P (σ;ρ,c)/P(σ)P(ρ,c)) • Let • Column i of HMM q have pred sec struct ρiq and confidence value ciq • Column j of HMM p have known sec struct σjp (Note: known sec struct secondary structure of seed seq of alignment) • Define • Sss(q I p j) = wss Mss(σjp;ρiq ciq) • Empirically Wss is 1/7. • This score is added to amino acid column score Saa(qi, pj) for use in the Viterbi algorithm.
Theory • Scoring secondary structure (predicted vs predicted) • The above matrix informs • How much more probable is it to get the predictions ρiq ciq and ρjp cjp for a pair of aligned homologous residues than to get them independently of each other. • Sss(q I p j) = wss Mss(ρiq ciq ρjp cjp) • Empirically Wss is 1/7. • This score is added to amino acid column score Saa(qi, pj) for use in the Viterbi algorithm.
Results and Discussion • All-against-All comparison with the following similarity search tools: • Sequence-Sequence • BLAST • Profile-Sequence • PSI-BLAST • HMM-Sequence • HMMER • Profile-Profile • COMPASS • PROF_SIM • Test • Input below the twilight zone • Ability to detect remote homologs • Ability to give high-quality alignments.
Results and Discussion • Different versions of tool used for better juxtaposition of results • HHSearch 0 • Simple profile-profile comparison • Gap opening penalty = -3.5, Gap Extension = -0.2 • Above used instead of transition prob log • HHSearch 1 • Basic HMM-HMM version • HHSearch 2 • Version 1 + inclusion of correlation score • HHSearch 3 • Version 2 + usage of predicted vs predicted secondary structure • HHSearch 4 • Version 3 + usage of predicted vs known secondary structure
Results and Discussion • SCOP (structural classification of proteins) database with filtering for twilight zone used. • Detection of homologs: • Domain in SCOP • Family or superfamily or fold or class • Pair of domains are homologous • If they are members of the same super family • Domains from different classes are classified as non-homologous • We present a chart of TP vs FP • TP homologous pairs • FP non-homologous pairs.
Results and Discussion • The figure shows classical sensitivity in the benchmark test.
Results and Discussion • Alternative definition of TP and FP • A pair is a TP • If the domains belong to same SCOP super-family • Or if the seq based alignment gives structural alignment with a “maxSub” score of at least 0.1 • A pair is a FP • If it is from different classes and has 0 MaxSub score • What is MaxSub score? • Informally • Defined such that a value > 0 occurs very rarely by chance • It tells what fraction of the query residues can be superposed structurally with the aligned residues from the other structure. • Formally • Weighted number of aligned pairs that can be superimposed with a maximum distance per pair of 3.5 Angstrom units/number of residues in the query sequence • Pairs with 0 Angstorm deviation wieght 1 • Pairs with 3.5 Angstorm deviation wieght 0.5
Results and Discussion • Plot of TP vs FP with new definition of TP and FP
Results and Discussion • Observation • More sensitive tools which use secondary structure (HHSearch 3, 4) improve • Reason • Reclassification of “harder to detect” ones as TP helps the more sensitive tools, since they would detect these.
Results and Discussion(Alignment quality) • Sequence alignment assessed by • Looking at the spatial distances between aligned pair of residues • upon superposition of the 3D structures • 2 scores used. • maxSub score • Drawback • Does not penalize over-prediction • Developer’s score • S Dev = N correct/min (Lq, Lp) • N Correct = No of residue pairs that are present in the max subset identified by maxSub • Lq and Lp = No of residues in the 2 sequences to be aligned. • Modeler’s score • S Mod = N correct / L ali • L Ali = No of aligned residue pairs in the seq alignment. • Does not penalize under-prediction. • Balanced score • S balanced = (S dev + S mod) / 2 • Penalizes both under and over prediction
Results and Discussion • HHSearch3 performs the best • Family level • Aligns 58% of all pairs with balanced score >= 0.3 • 1.23 times more than COMPASS • 1.28 times more than PROF_SIM • 1.34 times more than HMMER • 1.57 times more than PSI_BLAST • 4.4 times more than BLAST • Super family level • Aligns 27% of all pairs with balanced score >= 0.3 • 1.7 times more than COMPASS • 1.9 times more than PROF_SIM • 2.2 times more than HMMER • 2.9 times more than PSI_BLAST • 14 times more than BLAST
Results and Discussion • HHSearch3 performs the best • Fold level • Aligns 4.5% of all pairs with balanced score >= 0.3 • 3.3 times more than COMPASS • 6.0 times more than PROF_SIM • 7.3 times more than HMMER • 9.4 times more than PSI_BLAST • 63 times more than BLAST • Actually 4.5% at fold level is a lot • Pairs aligned at fold level are deemed non-homologous by SCOP • So we do not expect any good alignments at all
Conclusion • A generalization of HMM – Sequence alignment • Pairwise alignment of profile HMMs • Algorithm to maximize log-sum-of-odds score • Generalization of log-odds score • Increased sensitivity of 5-10% • Due to derivation of novel correlation score • Statistical methods for • Scoring predicted vs known secondary structure • Predicted vs predicted secondary structure • Uses confidence values of secondary structure prediction
Conclusion • HHPred • New tool based on the research paper • Benchmarking • With 5 other homology detection tools • Dataset in twilight zone • Results • Improvement in • Sensitivity • Alignment quality
Thank you. Have a nice day!