
Protein Multiple Alignment

Protein Multiple Alignment. by Konstantin Davydov. Papers. MUSCLE: a multiple sequence alignment method with reduced time and space complexity by Robert C Edgar


Presentation Transcript


  1. Protein Multiple Alignment by Konstantin Davydov

  2. Papers • MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar • ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou

  3. Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion

  4. Introduction • What is multiple protein alignment? Given N sequences of amino acids x1, x2, …, xN: insert gaps in each xi so that • All sequences have the same length • Score of the global map is maximum • Example: ACCTGCA and ACTTCAA aligned as ACCTGCA-- and AC--TTCAA

  5. Introduction • Motivation • Phylogenetic tree estimation • Secondary structure prediction • Identification of critical regions

  6. Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion

  7. Background • Aligning two sequences • Example: ACCTGCA and ACTTCAA aligned as ACCTGCA-- and AC--TTCAA

  8. Background [Figure: axes labeled x, y and z, shown with the sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC]

  9. Background • Unfortunately, this can get very expensive • Aligning N sequences of length L requires a matrix of size L^N, where each square in the matrix has 2^N - 1 neighbors • This gives a total time complexity of O(2^N L^N)
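To get a feel for how fast this grows, the following minimal sketch evaluates the cell and neighbor counts from the slide for a few values of N. The function name and the choice L = 100 are illustrative assumptions, not part of the original presentation.

```python
def exact_msa_cost(num_sequences, length):
    """Return (cells, neighbors_per_cell, total_steps) for exact N-way dynamic programming."""
    cells = length ** num_sequences          # matrix of size L^N
    neighbors = 2 ** num_sequences - 1       # each cell has 2^N - 1 neighbors
    return cells, neighbors, cells * neighbors


# Illustrative values only: sequences of length 100.
for n in (2, 3, 5, 10):
    cells, neighbors, steps = exact_msa_cost(n, 100)
    print(f"N={n:2d}: cells={cells:.1e}  neighbors={neighbors}  steps={steps:.1e}")
```

Even for a handful of sequences the step count becomes impractical, which is why the heuristic methods on the following slides are used instead.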

  10. Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion

  11. MUSCLE

  12. MUSCLE • Basic Strategy: A progressive alignment is built, to which horizontal refinement is applied • Three stages • At the end of each stage, a multiple alignment is available and the algorithm can be terminated

  13. Three Stages • Draft Progressive • Improved Progressive • Refinement

  14. Stage 1: Draft Progressive • Similarity Measure • Calculated using k-mer counting • Example: in ACCATGCGAATGGTCCACAATG, the k-mer ATG scores 3 and CCA scores 2
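The k-mer counts on this slide can be reproduced with a short sketch. `kmer_counts` simply counts overlapping k-mers; `kmer_similarity` is a hypothetical way to turn shared counts into a pairwise similarity, and its normalization is an assumption rather than something stated in the slides.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_similarity(a, b, k=3):
    """Shared k-mers divided by the k-mer count of the shorter sequence (assumed normalization)."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / (min(len(a), len(b)) - k + 1)


counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])
```

Because no alignment is needed to count k-mers, this kind of measure is cheap enough to compute for every pair of input sequences.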

  15. Stage 1: Draft Progressive • Distance estimate • Based on the similarities, construct a triangular distance matrix.

          X1    X2    X3    X4
      X1  X     0.5   0.7   0.3
      X2  X     X     0.2   0.8
      X3  X     X     X     0.6
      X4  X     X     X     X

  16. Stage 1: Draft Progressive • Tree construction • From the distance matrix we construct a tree [Figure: the distance matrix from slide 15 next to the tree built from it, with leaves X1, X4, X2, X3]
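One way to turn the distance matrix into a tree is average-linkage (UPGMA) clustering. The slide does not name the clustering method, so treat the choice of UPGMA, the function name `upgma`, and the nested-tuple tree representation as assumptions made for this sketch.

```python
from itertools import combinations


def average_distance(cluster_a, cluster_b, dist):
    """Mean leaf-to-leaf distance between two clusters (average linkage)."""
    leaves_a, leaves_b = cluster_a[1], cluster_b[1]
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))


def upgma(names, dist):
    """Repeatedly join the two closest clusters; returns the tree as nested tuples."""
    clusters = [(name, [name]) for name in names]   # (subtree, leaves under it)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(pair[0], pair[1], dist))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Upper triangle of the matrix shown on slide 15.
dist = {
    frozenset(("X1", "X2")): 0.5, frozenset(("X1", "X3")): 0.7,
    frozenset(("X1", "X4")): 0.3, frozenset(("X2", "X3")): 0.2,
    frozenset(("X2", "X4")): 0.8, frozenset(("X3", "X4")): 0.6,
}
print(upgma(["X1", "X2", "X3", "X4"], dist))
```

With these distances the closest pair, X2 and X3, is joined first, then X1 and X4, which matches the leaf grouping visible in the slide's figure.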

  17. Stage 1: Draft Progressive

  18. Stage 1: Draft Progressive • Progressive alignment • A progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root. [Figure: tree with leaves X1, X4, X2, X3 and the alignment of X1, X2, X3, X4 at its root]

  19. Stage 2: Improved Progressive • Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated. [Figure: distance matrix and tree over X1, X2, X3, X4]

  20. Stage 2: Improved Progressive • Similarity Measure • Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment [Figure: multiple alignment with rows TCC--AA, TCA--GA, TCA--AA, G--ATAC, T--CTGC, from which the pair TCC--AA and TCA--AA is taken]
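A minimal sketch of fractional identity for two rows of the current multiple alignment follows. It assumes that only columns where both rows carry a residue are compared; the slides do not say how gapped columns are treated, so that convention and the function name are assumptions.

```python
def fractional_identity(row_a, row_b):
    """Fraction of identical residues among columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Two rows taken from the example alignment on this slide.
print(fractional_identity("TCC--AA", "TCA--AA"))
```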

  21. Stage 2: Improved Progressive • Tree construction • A tree is constructed by computing a Kimura distance matrix and applying a clustering method to it [Figure: distance matrix over X1, X2, X3, X4]
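The slides do not spell out the Kimura distance. The sketch below assumes the commonly cited Kimura correction for protein sequences, d = -ln(1 - D - D^2/5), where D is the fraction of differing residues (one minus the fractional identity). The function name and the handling of very divergent pairs are assumptions.

```python
import math


def kimura_distance(identity):
    """Kimura-corrected distance from a fractional identity in [0, 1]."""
    d = 1.0 - identity                      # observed fraction of differences
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        return float("inf")                 # correction is undefined for very divergent pairs
    return -math.log(arg)


for identity in (1.0, 0.8, 0.5):
    print(identity, kimura_distance(identity))
```

The correction grows faster than the raw difference D, reflecting that multiple substitutions at the same site are hidden in the observed identity.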

  22. Stage 2: Improved Progressive • Tree comparison • The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed

  23. Stage 2: Improved Progressive • Progressive alignment • A new progressive alignment is built [Figure: new tree over X1, X2, X3, X4 with the new alignment at its root]

  24. Stage 3: Refinement • Performs iterative refinement

  25. Stage 3: Refinement • Choice of bipartition • An edge is removed from the tree, dividing the sequences into two disjoint subsets [Figure: tree with leaves X1, X3, X2, X4, X5 before and after the edge is removed]
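The bipartitions available to the refinement stage can be enumerated by cutting each edge of the tree in turn. In this sketch the tree is a nested tuple, and its shape is a hypothetical one chosen to match the leaf order X1, X3, X2, X4, X5 in the figure; the real topology is not recoverable from the transcript.

```python
def leaves(tree):
    """Return the leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def bipartitions(tree):
    """Cut every edge once; each cut yields (leaves below the edge, all other leaves)."""
    all_leaves = frozenset(leaves(tree))
    seen, result = set(), []

    def visit(node, is_root):
        if not is_root:
            inside = frozenset(leaves(node))
            outside = all_leaves - inside
            key = frozenset((inside, outside))
            if key not in seen:              # the two root edges give the same split
                seen.add(key)
                result.append((inside, outside))
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return result


tree = ((("X1", "X3"), "X2"), ("X4", "X5"))   # hypothetical topology
for inside, outside in bipartitions(tree):
    print(sorted(inside), "|", sorted(outside))
```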

  26. Stage 3: Refinement • Profile Extraction • The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed [Figure: rows TCC--AA and TCA--AA are extracted as the profile TCCAA, TCAAA once their all-gap columns are removed; rows TCA--GA, G--ATAC, T--CTGC form the other profile unchanged]
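Profile extraction amounts to selecting the rows of one subset and dropping every column that contains only gaps. The sketch below does this for the slide's example. The assignment of labels X1 to X5 to rows is my reading of the garbled figure and should be treated as an assumption, as should the function name.

```python
def extract_profile(alignment, subset):
    """Select the subset's rows and remove columns that consist of gaps only."""
    rows = [alignment[name] for name in subset]
    kept = [col for col in zip(*rows) if any(ch != "-" for ch in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(subset)}


alignment = {              # label-to-row mapping is assumed
    "X1": "TCC--AA",
    "X2": "TCA--GA",
    "X3": "TCA--AA",
    "X4": "G--ATAC",
    "X5": "T--CTGC",
}
print(extract_profile(alignment, ["X1", "X3"]))
print(extract_profile(alignment, ["X2", "X4", "X5"]))
```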

  27. Stage 3: Refinement • Re-alignment • The two profiles are then realigned with each other using profile-profile alignment. [Figure: the profiles TCCAA, TCAAA and TCA--GA, G--ATAC, T--CTGC are realigned to give T--CCAA, T--CAAA, TCA--GA, G--ATAC, T--CTGC]

  28. Stage 3: Refinement • Accept/Reject • The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded. [Figure: new alignment T--CCAA, T--CAAA, TCA--GA, G--ATAC, T--CTGC versus old alignment TCC--AA, TCA--AA, TCA--GA, G--ATAC, T--CTGC]
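The accept/reject rule can be sketched with a toy sum-of-pairs score. The scoring values (match 1, mismatch 0, residue-versus-gap -1) and both function names are assumptions chosen for illustration; the slides do not specify the objective function used.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=0, gap=-1):
    """Score every pair of symbols in every column; gap-gap pairs are ignored."""
    score = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


def accept_if_better(old_rows, new_rows, score=sum_of_pairs):
    """Keep the new alignment only if it scores strictly higher than the old one."""
    return new_rows if score(new_rows) > score(old_rows) else old_rows


old = ["TCC--AA", "TCA--AA", "TCA--GA", "G--ATAC", "T--CTGC"]
new = ["T--CCAA", "T--CAAA", "TCA--GA", "G--ATAC", "T--CTGC"]
print(sum_of_pairs(old), sum_of_pairs(new))
print(accept_if_better(old, new))
```

Which of the two alignments wins depends entirely on the scoring scheme, so the printed result here says nothing about what the actual method would decide.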

  29. MUSCLE Review • Performance • For alignment of N sequences of length L • Space complexity: O(N^2 + L^2) • Time complexity: O(N^4 + NL^2) • Time complexity without refinement: O(N^3 + NL^2)

  30. Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion

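  31. Hidden Markov Models (HMMs) [Figure: pair HMM with a match state M that emits an aligned pair of residues from X and Y, a state I that emits a residue of X against a gap, and a state J that emits a residue of Y against a gap. Example: X: AGCC-AGC, Y: -GCCCAGT, state path: IMMMJMMM]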

  32. Pairwise Alignment • Viterbi Algorithm • Picks the single most likely alignment • However: the most likely alignment is not necessarily the most accurate • Alternative: find the alignment of maximum expected accuracy

  33. Lazy Teacher Analogy • 10 students take a 10-question true-false quiz • How do you make the answer key? • Viterbi Approach: Use the answer sheet of the best student • MEA Approach: Weighted majority vote [Figure: ten answer sheets, each showing the student's grade (B, A-, A, A-, A, B-, B+, B+, B-, C) and answer to question 4 (F, T, F, F, F, F, F, T, F, F)]

  34. Viterbi vs MEA • Viterbi • Picks the alignment with the highest chance of being completely correct • Maximum Expected Accuracy • Picks the alignment with the highest expected number of correct predictions
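The lazy teacher analogy can be made concrete with a toy sketch. The three students, their answers, and the numeric weights standing in for grades are all made-up values for illustration; only the two strategies come from the slides.

```python
def best_student_key(sheets, weights):
    """Viterbi-style: copy the whole answer sheet of the highest-weighted student."""
    best = max(sheets, key=lambda student: weights[student])
    return sheets[best]


def weighted_vote_key(sheets, weights):
    """MEA-style: decide each question separately by weighted majority vote."""
    num_questions = len(next(iter(sheets.values())))
    key = []
    for q in range(num_questions):
        true_weight = sum(weights[s] for s, answers in sheets.items() if answers[q])
        false_weight = sum(weights[s] for s, answers in sheets.items() if not answers[q])
        key.append(true_weight > false_weight)
    return key


# Hypothetical data: three students, three questions.
sheets = {
    "ann": [True, False, True],
    "bob": [True, True, False],
    "cy": [False, True, True],
}
weights = {"ann": 0.9, "bob": 0.8, "cy": 0.7}
print(best_student_key(sheets, weights))
print(weighted_vote_key(sheets, weights))
```

The two keys can differ on individual questions: the vote overrules the best student wherever the other students jointly outweigh them.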

  35. ProbCons • Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment. • Uses Maximum Expected Accuracy instead of the Viterbi alignment. • 5 steps

  36. Notation • Given N sequences • S = {s1, s2, … sN} • a* is the optimal alignment

  37. ProbCons • Step 1: Computation of posterior-probability matrices • Step 2: Computation of expected accuracies • Step 3: Probabilistic consistency transformation • Step 4: Computation of guide tree • Step 5: Progressive alignment • Post-processing step: Iterative refinement

  38. Step 1: Computation of posterior-probability matrices • For every pair of sequences x, y ∈ S, compute the matrix Pxy • Pxy(i, j) = P(xi~yj ∈ a* | x, y), which is the probability that xi and yj are paired in a*

  39. Step 2: Computation of expected accuracies • For a pairwise alignment a between x and y, define the accuracy as: accuracy(a, a*) = (number of correctly predicted matches) / (length of the shorter sequence)

  40. Step 2: Computation of expected accuracies (continued) • MEA alignment is found by finding the highest summing path through the matrix • Mxy[i, j] = P(xi is aligned to yj | x, y)
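The highest-summing path through the posterior matrix can be found with a Needleman-Wunsch-style recursion in which gaps cost nothing. The sketch below is a minimal version under that assumption. The function name and the small example matrix are hypothetical, and dividing the path sum by the shorter length follows the accuracy definition on the previous slide.

```python
def mea_alignment(posterior):
    """Return (best path sum, matched index pairs) for a posterior matrix M[i][j]."""
    n = len(posterior)
    m = len(posterior[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + posterior[i - 1][j - 1],  # pair xi with yj
                             best[i - 1][j],                               # leave xi unpaired
                             best[i][j - 1])                               # leave yj unpaired
    # Trace back, preferring gap moves on ties so zero-probability pairs are not reported.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    return best[n][m], pairs


# Hypothetical posteriors for a 2-residue x and a 3-residue y.
posterior = [[0.9, 0.1, 0.0],
             [0.1, 0.2, 0.7]]
total, pairs = mea_alignment(posterior)
print(pairs, total / min(len(posterior), len(posterior[0])))
```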

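  41. Consistency [Figure: residue xi of sequence x, residue zk of sequence z, and residues yj and yj' of sequence y, illustrating how alignments through z bear on whether xi pairs with yj or with yj']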

  42. Step 3: Probabilistic consistency transformation • Re-estimate the match quality scores P(xi~yj ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates similarity of x and y to other sequences from S into the x-y comparison: P(xi~yj ∈ a* | x, y) → P(xi~yj ∈ a* | x, y, z)

  43. Step 3: Probabilistic consistency transformation (continued)

  44. Step 3: Probabilistic consistency transformation (continued) • Since most of the values of Pxz and Pzy will be very small, we ignore all the entries in which the value is smaller than some threshold w. • Use sparse matrix multiplication • May be repeated
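The transcript does not preserve the transformation formula. The sketch below assumes the form given in the ProbCons paper, P'xy = (1/|S|) Σ_z Pxz · Pzy, with Pxx taken as the identity matrix, and applies the threshold w before multiplying. It uses dense lists for brevity rather than a real sparse representation; all function names and the example values are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def threshold(m, w):
    """Zero out entries smaller than w, mimicking the sparsification step."""
    return [[v if v >= w else 0.0 for v in row] for row in m]


def consistency_transform(names, lengths, posteriors, w=0.01):
    """Re-estimate each stored Pxy as the average over z of Pxz times Pzy."""
    def lookup(x, y):
        if x == y:   # assumed convention: a sequence aligns to itself with certainty
            return [[1.0 if i == j else 0.0 for j in range(lengths[x])]
                    for i in range(lengths[x])]
        m = posteriors[(x, y)] if (x, y) in posteriors else transpose(posteriors[(y, x)])
        return threshold(m, w)

    updated = {}
    for x, y in posteriors:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            product = matmul(lookup(x, z), lookup(z, y))
            total = [[t + p for t, p in zip(t_row, p_row)]
                     for t_row, p_row in zip(total, product)]
        updated[(x, y)] = [[v / len(names) for v in row] for row in total]
    return updated


# Hypothetical one-residue sequences so the arithmetic is easy to follow.
names = ["x", "y", "z"]
lengths = {"x": 1, "y": 1, "z": 1}
posteriors = {("x", "y"): [[0.5]], ("x", "z"): [[0.8]], ("z", "y"): [[0.9]]}
print(consistency_transform(names, lengths, posteriors)[("x", "y")])
```

Here strong x-z and z-y posteriors raise the x-y estimate above its original value, which is the effect the consistency step is meant to have.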

  45. Step 4: Computation of guide tree • Use E(x,y) as a measure of similarity • Define similarity of two clusters by the sum-of-pairs [Figure: similarity matrix over X1, X2, X3, X4]

  46. Step 5: Progressive alignment • Align sequence groups hierarchically according to the order specified in the guide tree. • Alignments are scored using sum-of-pairs scoring function. • Aligned residues are scored according to the match quality scores P(xi~yj a* | x, y) • Gap penalties are set to 0.

  47. Post-processing step: iterative refinement • Much like in MUSCLE • Randomly partition alignment into two groups of sequences and realign • May be repeated [Figure: sequences X1 to X5 split into two groups]

  48. ProbCons overview • ProbCons demonstrated dramatic improvements in alignment accuracy • Longer running time • Doesn’t use protein-specific alignment information, so can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.

  49. Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion

  50. Conclusion • MUSCLE demonstrated poor accuracy, but very short running time. • ProbCons demonstrated dramatic improvements in alignment accuracy but is much slower than MUSCLE.
