Protein Multiple Alignment by Konstantin Davydov
Papers • MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C Edgar • ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion
Introduction • What is multiple protein alignment? Given N sequences of amino acids x1, x2 … xN: Insert gaps in each of the xi so that • All sequences have the same length • Score of the global map is maximum • Example: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA
Introduction • Motivation • Phylogenetic tree estimation • Secondary structure prediction • Identification of critical regions
Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion
Background • Aligning two sequences • Example: ACCTGCA and ACTTCAA become ACCTGCA-- and AC--TTCAA
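Aligning two sequences is usually done with dynamic programming. The block below is a minimal sketch of Needleman-Wunsch global alignment; the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions and are not taken from the slides.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_x, aligned_y, score)."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for x[i-1] against y[j-1].
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), score[n][m]


aligned_x, aligned_y, total = needleman_wunsch("ACCTGCA", "ACTTCAA")
print(aligned_x)
print(aligned_y)
```

The alignment printed depends on the assumed scores, so it need not match the one drawn on the slide.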
Background • Example sequences: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC [Figure: alignment diagram with axes x, y, z]
Background • Unfortunately, this can get very expensive • Aligning N sequences of length L requires a matrix of size L^N, where each square in the matrix has 2^N - 1 neighbors • This gives a total time complexity of O(2^N L^N)
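To get a feel for how quickly this grows, the short sketch below evaluates the cell count L^N and the work estimate (2^N - 1) L^N for a few values of N. The function names are hypothetical and only illustrate the arithmetic.

```python
def dp_cells(n_seqs, length):
    """Number of cells in the N-dimensional dynamic-programming matrix."""
    return length ** n_seqs


def dp_work(n_seqs, length):
    """Cells times neighbors per cell: (2^N - 1) * L^N."""
    return (2 ** n_seqs - 1) * length ** n_seqs


for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100), dp_work(n, 100))
```

With L = 100, two sequences need 10,000 cells, while ten sequences already need 10^20.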
Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion
MUSCLE • Basic Strategy: A progressive alignment is built, to which horizontal refinement is applied • Three stages • At the end of each stage, a multiple alignment is available and the algorithm can be terminated
Three Stages • Draft Progressive • Improved Progressive • Refinement
Stage 1: Draft Progressive • Similarity Measure • Calculated using k-mer counting. • Example: in ACCATGCGAATGGTCCACAATG, the k-mer ATG scores 3 and the k-mer CCA scores 2
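The k-mer counts on this slide can be reproduced with a few lines of Python. The `kmer_similarity` function is an assumption for illustration (shared k-mers divided by the number of k-mers in the shorter sequence); the slides do not state the exact formula.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k=3):
    """Assumed similarity: fraction of k-mers shared by x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(cx[kmer], cy[kmer]) for kmer in cx)
    denom = min(len(x), len(y)) - k + 1
    return shared / denom if denom > 0 else 0.0


counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])  # 3 2
```

No alignment is needed to compute this measure, which is what makes the draft stage fast.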
Stage 1: Draft Progressive • Distance estimate • Based on the similarities, construct a triangular distance matrix.
      X1    X2    X3    X4
X1    X     0.5   0.7   0.3
X2    X     X     0.2   0.8
X3    X     X     X     0.6
X4    X     X     X     X
Stage 1: Draft Progressive • Tree construction • From the distance matrix we construct a tree [Figure: the distance matrix (X1-X2 0.5, X1-X3 0.7, X1-X4 0.3, X2-X3 0.2, X2-X4 0.8, X3-X4 0.6) and the resulting tree, which joins X1 with X4 and X2 with X3]
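One way to turn the distance matrix into a tree is average-linkage (UPGMA) clustering. The slide does not name the clustering method, so UPGMA is an assumption here; the sketch uses the distances shown on the slide and returns the tree as nested tuples.

```python
from itertools import combinations


def upgma(names, dist):
    """dist maps frozenset({a, b}) to a distance; returns a nested-tuple tree."""
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for c in clusters:
            # Size-weighted average distance to the merged cluster.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


distances = {
    frozenset(("X1", "X2")): 0.5, frozenset(("X1", "X3")): 0.7,
    frozenset(("X1", "X4")): 0.3, frozenset(("X2", "X3")): 0.2,
    frozenset(("X2", "X4")): 0.8, frozenset(("X3", "X4")): 0.6,
}
tree = upgma(["X1", "X2", "X3", "X4"], distances)
print(tree)
```

With these distances X2 and X3 merge first (0.2), then X1 and X4 (0.3), which agrees with the grouping in the slide's figure.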
Stage 1: Draft Progressive • Progressive alignment • A progressive alignment is built by following the branching order of the tree. This yields a multiple alignment of all input sequences at the root. [Figure: tree joining X1 with X4 and X2 with X3; the root holds the alignment of X1, X2, X3, X4]
Stage 2: Improved Progressive • Attempts to improve the tree and uses it to build a new progressive alignment. This stage may be iterated.
Stage 2: Improved Progressive • Similarity Measure • Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment [Figure: a pair of rows taken from the current multiple alignment TCC--AA, TCA--AA, TCA--GA, G--ATAC, T--CTGC]
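A minimal sketch of fractional identity for two rows of a multiple alignment follows. It assumes identity is measured over the columns where both rows carry a residue; other conventions exist and the slides do not say which one is used.

```python
def fractional_identity(row_a, row_b):
    """Fraction of identical residues over columns where neither row has a gap."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    return sum(a == b for a, b in aligned) / len(aligned)


print(fractional_identity("TCC--AA", "TCA--AA"))  # 0.8
```

Here five columns hold residues in both rows and four of them match.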
Stage 2: Improved Progressive • Tree construction • A tree is constructed by computing a Kimura distance matrix and applying a clustering method to it
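The slide names the Kimura distance without giving a formula. The sketch below assumes the usual Kimura protein-distance correction, d = -ln(1 - D - D^2/5), where D = 1 - fractional identity. Returning infinity when the logarithm is undefined is a choice made for this sketch only.

```python
import math


def kimura_distance(identity):
    """Assumed Kimura correction applied to a fractional identity in [0, 1]."""
    d_obs = 1.0 - identity
    inner = 1.0 - d_obs - d_obs * d_obs / 5.0
    if inner <= 0.0:
        return float("inf")  # too divergent for the formula; sketch-only fallback
    return abs(-math.log(inner))  # abs() avoids printing -0.0 for identical rows


print(round(kimura_distance(0.8), 4))
```

The correction stretches large observed differences, since multiple substitutions at one site are undercounted by raw identity.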
Stage 2: Improved Progressive • Tree comparison • The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed
Stage 2: Improved Progressive • Progressive alignment • A new progressive alignment is built [Figure: new tree over X1, X2, X3, X4 with the new alignment at the root]
Stage 3: Refinement • Performs iterative refinement
Stage 3: Refinement • Choice of bipartition • An edge is removed from the tree, dividing the sequences into two disjoint subsets [Figure: tree over X1, X2, X3, X4, X5 before and after removing an edge]
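Removing an edge can be sketched on a tree stored as nested tuples: every non-root node defines one edge, and cutting it separates the leaves below the node from all other leaves. The tree shape used here is hypothetical, not the one from the slide's figure.

```python
def leaves(tree):
    """Set of leaf names under a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """List each distinct split (inside, outside) obtained by cutting one edge."""
    all_leaves = leaves(tree)
    seen, splits = set(), []

    def visit(node, is_root):
        if not is_root:
            inside = leaves(node)
            outside = all_leaves - inside
            key = frozenset([inside, outside])
            if key not in seen:  # the two edges at the root give the same split
                seen.add(key)
                splits.append((inside, outside))
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return splits


example_tree = (("X1", "X2"), ("X3", ("X4", "X5")))  # hypothetical shape
for inside, outside in bipartitions(example_tree):
    print(sorted(inside), sorted(outside))
```

For five leaves this yields seven splits, one per edge of the unrooted tree.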
Stage 3: Refinement • Profile Extraction • The multiple alignment of each subset is extracted from current multiple alignment. Columns made up of indels only are removed • Example: from the alignment X1 TCC--AA, X2 TCA--AA, X3 TCA--GA, X4 G--ATAC, X5 T--CTGC, the subset X1, X2 becomes TCCAA, TCAAA, while the subset X3, X4, X5 stays TCA--GA, G--ATAC, T--CTGC
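Profile extraction amounts to selecting rows and dropping columns that contain only gaps. The sketch below reproduces the slide's example; the assignment of labels X1 to X5 to the rows is read off the figure and should be treated as an assumption.

```python
def extract_profile(alignment, subset):
    """Return the rows in subset with all-gap columns removed."""
    rows = [alignment[name] for name in subset]
    keep = [c for c in range(len(rows[0])) if any(r[c] != "-" for r in rows)]
    return {name: "".join(alignment[name][c] for c in keep) for name in subset}


current = {
    "X1": "TCC--AA", "X2": "TCA--AA", "X3": "TCA--GA",
    "X4": "G--ATAC", "X5": "T--CTGC",
}
print(extract_profile(current, ["X1", "X2"]))
print(extract_profile(current, ["X3", "X4", "X5"]))
```

The first call removes the two gap-only columns; the second leaves its rows untouched because every column holds at least one residue.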
Stage 3: Refinement • Re-alignment • The two profiles are then realigned with each other using profile-profile alignment. • Example: the profile TCCAA, TCAAA is realigned against TCA--GA, G--ATAC, T--CTGC, giving T--CCAA, T--CAAA, TCA--GA, G--ATAC, T--CTGC
Stage 3: Refinement • Accept/Reject • The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded. • Example: new alignment T--CCAA, T--CAAA, TCA--GA, G--ATAC, T--CTGC or old alignment TCC--AA, TCA--AA, TCA--GA, G--ATAC, T--CTGC
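The accept/reject rule can be sketched with any alignment score. The block below uses a toy sum-of-pairs score (match +1, mismatch -1, residue against gap -1, gap against gap 0); these values are assumptions for illustration and are not MUSCLE's actual scoring function.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-1):
    """Toy sum-of-pairs score over all pairs of rows and all columns."""
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        for a, b in zip(row_a, row_b):
            if a == "-" and b == "-":
                continue  # gap against gap contributes nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def accept_if_better(old, new, score=sp_score):
    """Keep the new alignment only if it scores strictly higher."""
    return new if score(new) > score(old) else old


old = ["TCC--AA", "TCA--AA", "TCA--GA", "G--ATAC", "T--CTGC"]
new = ["T--CCAA", "T--CAAA", "TCA--GA", "G--ATAC", "T--CTGC"]
print(sp_score(old), sp_score(new))
print(accept_if_better(old, new))
```

Because only strict improvements are accepted, the score never decreases across refinement iterations.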
MUSCLE Review • Performance • For alignment of N sequences of length L • Space complexity: O(N^2 + L^2) • Time complexity: O(N^4 + NL^2) • Time complexity without refinement: O(N^3 + NL^2)
Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion
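Hidden Markov Models (HMMs) • States: M (a residue of X aligned to a residue of Y), I (a residue of X aligned to a gap), J (a residue of Y aligned to a gap) • Example: X: AGCC-AGC, Y: -GCCCAGT, state path: IMMMJMMM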
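The correspondence between an alignment and its state path can be checked mechanically. This sketch maps each column of the slide's example to M, I or J, following the state labels shown in the figure.

```python
def state_path(row_x, row_y):
    """Translate a pairwise alignment into its M/I/J state sequence."""
    states = []
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be gap against gap")
        if a != "-" and b != "-":
            states.append("M")  # both sequences emit a residue
        elif b == "-":
            states.append("I")  # only X emits a residue
        else:
            states.append("J")  # only Y emits a residue
    return "".join(states)


print(state_path("AGCC-AGC", "-GCCCAGT"))  # IMMMJMMM
```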
Pairwise Alignment • Viterbi Algorithm • Picks the single most likely alignment • However: the most likely alignment is not necessarily the most accurate • Alternative: find the alignment of maximum expected accuracy
Lazy Teacher Analogy • 10 students take a 10-question true-false quiz • How do you make the answer key? • Viterbi Approach: Use the answer sheet of the best student • MEA Approach: Weighted majority vote [Figure: the ten students' answers to question 4 (eight F, two T) with their grades: B, A-, A, A-, A, B-, B+, B+, B-, C]
Viterbi vs MEA • Viterbi • Picks the alignment with the highest chance of being completely correct • Maximum Expected Accuracy • Picks the alignment with the highest expected number of correct predictions
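The lazy-teacher analogy translates directly into code. In this sketch the answer sheets and the integer weights (standing in for how much each student is trusted) are made-up values chosen only to show the two approaches disagreeing.

```python
def best_student_key(sheets, weights):
    """Viterbi-style: copy the whole sheet of the most trusted student."""
    best = max(range(len(sheets)), key=lambda i: weights[i])
    return list(sheets[best])


def weighted_vote_key(sheets, weights):
    """MEA-style: decide each question by weighted majority vote."""
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for s, w in zip(sheets, weights) if s[q])
        false_weight = sum(w for s, w in zip(sheets, weights) if not s[q])
        key.append(true_weight > false_weight)
    return key


sheets = [
    [True, False, True],   # most trusted student
    [True, True, False],
    [True, True, False],
]
weights = [5, 3, 3]  # hypothetical trust levels
print(best_student_key(sheets, weights))   # [True, False, True]
print(weighted_vote_key(sheets, weights))  # [True, True, False]
```

The best student is outvoted on questions 2 and 3, which is the situation where the two keys differ.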
ProbCons • Basic Strategy: Uses Hidden Markov Models (HMM) to predict the probability of an alignment. • Uses Maximum Expected Accuracy instead of the Viterbi alignment. • 5 steps
Notation • Given N sequences • S = {s1, s2, … sN} • a* is the optimal alignment
ProbCons • Step 1: Computation of posterior-probability matrices • Step 2: Computation of expected accuracies • Step 3: Probabilistic consistency transformation • Step 4: Computation of guide tree • Step 5: Progressive alignment • Post-processing step: Iterative refinement
Step 1: Computation of posterior-probability matrices • For every pair of sequences x, y ∈ S, compute the matrix Pxy • Pxy(i, j) = P(xi~yj ∈ a* | x, y), which is the probability that xi and yj are paired in a*
Step 2: Computation of expected accuracies • For a pairwise alignment a between x and y, define the accuracy as: accuracy(a, a*) = (# of correctly predicted matches) / (length of shorter sequence)
Step 2: Computation of expected accuracies (continued) • MEA alignment is found by finding the highest summing path through the matrix • Mxy[i, j] = P(xi is aligned to yj | x, y)
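Finding the highest-summing path is a dynamic program much like pairwise alignment, but with posterior probabilities as match scores and no gap penalty. The sketch below assumes the posterior matrix is given as a list of rows; the numbers in the example are invented.

```python
def mea_alignment(post):
    """Return (best sum, matched index pairs) for a posterior matrix."""
    n, m = len(post), len(post[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((best[i - 1][j - 1] + post[i - 1][j - 1], "match"))
            if i > 0:
                options.append((best[i - 1][j], "skip_x"))
            if j > 0:
                options.append((best[i][j - 1], "skip_y"))
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])

    # Follow the stored moves back from the corner to list matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif step == "skip_x":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


total, pairs = mea_alignment([[0.9, 0.1], [0.2, 0.8]])
print(pairs)  # [(0, 0), (1, 1)]
```

Dividing the returned sum by the length of the shorter sequence gives the expected accuracy of that alignment.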
Consistency • [Figure: residue xi of sequence x, residues yj and yj' of sequence y, and residue zk of a third sequence z]
Step 3: Probabilistic consistency transformation • Re-estimate the match quality scores P(xi~yj ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates similarity of x and y to other sequences from S into the x-y comparison: P(xi~yj ∈ a* | x, y) → P(xi~yj ∈ a* | x, y, z)
Step 3: Probabilistic consistency transformation (continued)
Step 3: Probabilistic consistency transformation (continued) • Since most of the values of Pxz and Pzy will be very small, we ignore all the entries in which the value is smaller than some threshold w. • Use sparse matrix multiplication • May be repeated
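The thresholded matrix product can be sketched in plain Python. The exact formula did not survive on the slides, so this block assumes the transformation averages the product Pxz Pzy over every sequence z in S, with the identity matrix used when z equals x or y; the names and the data layout are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul_thresholded(a, b, w):
    """Matrix product that skips entries smaller than the threshold w."""
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i in range(len(a)):
        for k in range(len(b)):
            if a[i][k] < w:
                continue
            for j in range(len(b[0])):
                if b[k][j] >= w:
                    out[i][j] += a[i][k] * b[k][j]
    return out


def consistency_transform(post, lengths, x, y, w=0.01):
    """Assumed update: average of Pxz * Pzy over all sequences z.

    post maps ordered pairs (a, b) to the posterior matrix Pab;
    lengths maps each sequence name to its length.
    """
    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        p_xz = identity(lengths[x]) if z == x else post[(x, z)]
        p_zy = identity(lengths[y]) if z == y else post[(z, y)]
        product = matmul_thresholded(p_xz, p_zy, w)
        for i in range(lengths[x]):
            for j in range(lengths[y]):
                total[i][j] += product[i][j]
    return [[v / len(lengths) for v in row] for row in total]


lengths = {"x": 1, "y": 1, "z": 1}
post = {("x", "y"): [[0.5]], ("x", "z"): [[1.0]], ("z", "y"): [[1.0]]}
print(consistency_transform(post, lengths, "x", "y"))
```

In this toy case the support from z raises the x-y score from 0.5 to 2/3. A real implementation would store only the entries above w rather than scanning dense matrices.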
Step 4: Computation of guide tree • Use E(x,y) as a measure of similarity • Define similarity of two clusters by the sum-of-pairs
Step 5: Progressive alignment • Align sequence groups hierarchically according to the order specified in the guide tree. • Alignments are scored using sum-of-pairs scoring function. • Aligned residues are scored according to the match quality scores P(xi~yj ∈ a* | x, y) • Gap penalties are set to 0.
Post-processing step: iterative refinement • Much like in MUSCLE • Randomly partition alignment into two groups of sequences and realign • May be repeated
ProbCons overview • ProbCons demonstrated dramatic improvements in alignment accuracy • Longer running time • Doesn’t use protein-specific alignment information, so can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm.
Outline • Introduction • Background • MUSCLE • ProbCons • Conclusion
Conclusion • MUSCLE demonstrated poor accuracy, but very short running time. • ProbCons demonstrated dramatic improvements in alignment accuracy but is much slower than MUSCLE.