Multiple Sequence Alignment

Multiple Sequence Alignment
S1 = AGGTC, S2 = GTTCG, S3 = TGAAC
[Figure: two possible alignments of S1, S2, S3]

Multiple Sequence Alignment (cont.)
Input: sequences S1, S2, …, Sk over the same alphabet
Output: gapped sequences S’1, S’2, …, S’k of equal length
• |S’1| = |S’2| = … = |S’k|
• Removing the spaces from S’i yields Si
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

Multiple Sequence Alignment: Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1 (a gap aligned with a gap scores 0)
SP score: (-4) + (-3) + (-5) = -12
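The SP score of the example can be checked with a short sketch. The function names `pair_score` and `sp_score` are illustrative, and the sketch assumes that a gap facing a gap scores 0, which is what makes the three pairwise scores add up to -12.

```python
from itertools import combinations

def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score the pairwise alignment induced on two rows, column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue              # assumption: gap vs gap contributes nothing
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows):
    """Sum-of-pairs score: add pair_score over all unordered pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print([pair_score(a, b) for a, b in combinations(rows, 2)])  # [-3, -4, -5]
print(sp_score(rows))                                        # -12
```

The pairs are listed in the order (1,2), (1,3), (2,3), so the individual scores appear in a different order than on the slide, but the sum is the same.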

Multiple Sequence Alignment: Complexity
• Given k strings of length n, a generalization of the pairwise DP algorithm finds an optimal SP alignment:
  • Instead of a 2-dimensional table we have a k-dimensional table
  • Each dimension has length n+1
  • Each entry depends on 2^k - 1 adjacent entries
Complexity: O(2^k n^k)
The problem is known to be NP-hard (no polynomial-time algorithm is known)
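To see where the 2^k - 1 comes from, the sketch below enumerates the entries that one cell of the k-dimensional table depends on: each is obtained by stepping back by 0 or 1 in every dimension, excluding the all-zero step. The function name `predecessors` is illustrative.

```python
from itertools import product

def predecessors(cell):
    """Cells a DP entry depends on: subtract every nonzero 0/1 offset vector."""
    result = []
    for offset in product((0, 1), repeat=len(cell)):
        if not any(offset):
            continue                      # the all-zero offset is the cell itself
        prev = tuple(c - d for c, d in zip(cell, offset))
        if min(prev) >= 0:                # stay inside the table
            result.append(prev)
    return result

k, n = 3, 5
print(len(predecessors((n,) * k)))        # 7, i.e. 2**3 - 1
print((n + 1) ** k)                       # 216 table entries for k = 3, n = 5
```

Cells on the boundary of the table have fewer predecessors, since offsets that leave the table are dropped.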

Multiple Sequence Alignment: Approximation Algorithm
• We use cost instead of score
  • Find an alignment of minimal cost
• Assumption: the cost function δ is a distance function
  • δ(x,x) = 0
  • δ(x,y) = δ(y,x) ≥ 0
  • δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)
  • (e.g. the cost of a mismatch ≤ the cost of two indels)
D(S,T): the cost of a minimum-cost global alignment between S and T
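The three conditions on δ can be checked mechanically for a small alphabet. The sketch below is illustrative only: `is_distance`, `unit` and `costly_mismatch` are made-up names, and the gap symbol is treated as one more letter.

```python
from itertools import product

def is_distance(symbols, delta):
    """Check identity, symmetry/non-negativity and the triangle inequality."""
    if any(delta(x, x) != 0 for x in symbols):
        return False
    for x, y in product(symbols, repeat=2):
        if delta(x, y) != delta(y, x) or delta(x, y) < 0:
            return False
    for x, y, z in product(symbols, repeat=3):
        if delta(x, y) + delta(y, z) < delta(x, z):
            return False
    return True

symbols = "ACGT-"

def unit(x, y):
    """Mismatch and indel both cost 1."""
    return 0 if x == y else 1

def costly_mismatch(x, y):
    """Indel costs 1 but mismatch costs 3, more than two indels."""
    if x == y:
        return 0
    return 1 if "-" in (x, y) else 3

print(is_distance(symbols, unit))             # True
print(is_distance(symbols, costly_mismatch))  # False
```

The second cost function fails because δ(A,-) + δ(-,C) = 2 is smaller than δ(A,C) = 3, which is exactly the case the mismatch/indel remark rules out.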

Multiple Sequence Alignment: Approximation Algorithm
• The ‘star’ algorithm:
  • Input: Γ, a set of k strings S1, …, Sk
  • Find the string S’ ∈ Γ (the center) that minimizes Σ_{S ∈ Γ} D(S’,S)
  • Denote S1 = S’ and the rest of the strings S2, …, Sk
  • Iteratively add S2, …, Sk to the alignment as follows:
    • Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1
    • Align Si to S’1 to produce S’i and S’’1, aligned
    • Adjust S’2, …, S’i-1 by adding spaces wherever spaces were added to S’’1
    • Replace S’1 by S’’1
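The steps of the star algorithm can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it uses unit costs (mismatch = indel = 1, δ(-,-) = 0), and the names `delta`, `align` and `star_alignment` are illustrative.

```python
def delta(x, y):
    """Assumed unit cost: 0 for equal symbols (including gap vs gap), else 1."""
    return 0 if x == y else 1

def align(s, t):
    """Minimum-cost global alignment of s (may already contain gaps) and t.
    Returns (cost, gapped_s, gapped_t, is_new); is_new[c] is True when
    column c holds a gap newly inserted into s."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], "-"),
                          D[i][j - 1] + delta("-", t[j - 1]))
    a, b, is_new = [], [], []
    i, j = n, m
    while i > 0 or j > 0:                      # trace one optimal path back
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); is_new.append(False)
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], "-"):
            a.append(s[i - 1]); b.append("-"); is_new.append(False)
            i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); is_new.append(True)
            j -= 1
    return D[n][m], "".join(a[::-1]), "".join(b[::-1]), is_new[::-1]

def star_alignment(seqs):
    """Return (order, rows): rows[0] is the gapped center, rows[j] aligns seqs[order[j]]."""
    k = len(seqs)
    dist = [[align(a, b)[0] for b in seqs] for a in seqs]
    center = min(range(k), key=lambda i: sum(dist[i]))   # minimizes the sum of D
    order = [center] + [i for i in range(k) if i != center]
    rows = [seqs[center]]
    for i in order[1:]:
        _, new_center, new_row, is_new = align(rows[0], seqs[i])
        adjusted = []
        for row in rows[1:]:                   # add spaces where the center gained them
            chars = iter(row)
            adjusted.append("".join("-" if fresh else next(chars) for fresh in is_new))
        rows = [new_center] + adjusted + [new_row]
    return order, rows

order, rows = star_alignment(["AGGTC", "GTTCG", "TGAAC"])
print("\n".join(rows))
```

Because a gap in the center aligned with a new gap costs 0, the cost that the final alignment induces on the pair (center, Si) stays equal to D(S1,Si) after later sequences are added.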

Multiple Sequence Alignment: Approximation Algorithm
• Time analysis:
  • Choosing S1: run DP for all sequence pairs, O(k^2 n^2)
  • Adding Si to the alignment: run DP for Si and S’1, O(i·n^2)
    • (In the i-th stage the length of S’1 can be up to i·n)
  • Total complexity: O(k^2 n^2) + Σ_{i=2..k} O(i·n^2) = O(k^2 n^2)

Multiple Sequence Alignment: Approximation Algorithm
• Approximation ratio:
  • M*: an optimal alignment
  • M: the alignment produced by this algorithm
  • d(i,j): the distance M induces on the pair Si, Sj
For all i: d(1,i) = D(S1,Si)
(we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)

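Multiple Sequence Alignment: Approximation Algorithm
Approximation ratio, with d*(i,j) denoting the distance M* induces on the pair Si, Sj:
• Triangle inequality: 2·cost(M) = Σ_{i} Σ_{j≠i} d(i,j) ≤ Σ_{i} Σ_{j≠i} [d(1,i) + d(1,j)] = 2(k-1) Σ_{l=2..k} D(S1,Sl)
• Definition of S1: 2·cost(M*) = Σ_{i} Σ_{j≠i} d*(i,j) ≥ Σ_{i} Σ_{j≠i} D(Si,Sj) ≥ k Σ_{l=2..k} D(S1,Sl)
• Approximation ratio: cost(M) / cost(M*) ≤ 2(k-1)/k < 2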

Multiple Sequence Alignment: Reminder
• The ‘star’ algorithm:
  • Input: Γ, a set of k strings S1, …, Sk
  • Find the string S1 (the center) that minimizes Σ_{S ∈ Γ} D(S1,S)
  • Iteratively add S2, …, Sk to the alignment
  • Finds an MA costing at most twice the optimal cost
Problem: a conventional MA does not correctly model evolutionary relationships

Tree Alignment
• Input: X, a set of sequences; T, a phylogenetic tree on X (leaves labeled by X)
• Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal
• How do we label internal vertices?
  • Sequences
  • Profiles (multiple alignments)
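The objective being minimized can be made concrete with a small sketch for sequence-labeled nodes. It assumes the cost of an edge is the pairwise alignment cost D between the labels at its two ends, uses unit-cost edit distance for D, and represents the tree as a list of (father, daughter) edges; the names `edit_distance`, `tree_cost` and the node names are illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, standing in for the alignment cost D (an assumption)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),   # match or mismatch
                           prev[j] + 1,              # gap in t
                           cur[j - 1] + 1))          # gap in s
        prev = cur
    return prev[-1]

def tree_cost(edges, labels, D=edit_distance):
    """Sum of D(label(u), label(v)) over all edges (u, v) of the tree."""
    return sum(D(labels[u], labels[v]) for u, v in edges)

edges = [("root", "x"), ("root", "c"), ("x", "a"), ("x", "b")]
labels = {"a": "ACGT", "b": "ACGA", "c": "TCGT",   # leaves
          "x": "ACGT", "root": "ACGT"}             # one candidate internal labeling
print(tree_cost(edges, labels))  # 2
```

Tree alignment asks for the labels of `x` and `root` that make this sum as small as possible; the labeling shown is just one candidate.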

Profile Alignment
A G G T - C -
- G - T T C G
T G - A A C -
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of Σ (and the gap symbol) at position i.
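A profile can be built directly from the rows of an alignment. The sketch below uses a three-row alignment of AGGTC, GTTCG and TGAAC as illustrative input; the function name `profile` and the dictionary-of-lists layout are assumptions made for the example.

```python
from collections import Counter

def profile(rows, alphabet="ACGT"):
    """Return {symbol: [frequency in column 0, column 1, ...]} including the gap."""
    symbols = alphabet + "-"
    k = len(rows)
    table = {s: [] for s in symbols}
    for column in zip(*rows):            # walk the alignment column by column
        counts = Counter(column)
        for s in symbols:
            table[s].append(counts[s] / k)
    return table

rows = ["AGGT-C-",
        "-G-TTCG",
        "TG-AAC-"]
table = profile(rows)
for symbol, freqs in table.items():
    print(symbol, [round(f, 2) for f in freqs])
```

Each column of the table sums to 1, so the (|Σ|+1) × n table has 5 rows and 7 columns here.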

Profile Alignment
• Aligning a sequence to a profile:
  • Matching a letter to a position: weighted average of scores
  • Indels: introducing new columns gets special consideration
• (the same goes for aligning two profiles)
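Matching a letter to a profile position can be sketched as a weighted average: each symbol in the column contributes its cost against the letter, weighted by its frequency. The names `column_cost` and `unit` are illustrative, a unit cost function is assumed, and the special handling of new columns for indels is not covered.

```python
def unit(x, y):
    """Assumed unit cost: 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1

def column_cost(letter, column, delta=unit):
    """Weighted average of delta(letter, y) over the symbols y in a profile column."""
    return sum(p * delta(letter, y) for y, p in column.items())

# A column in which A, T and the gap each appear in one third of the rows.
column = {"A": 1 / 3, "T": 1 / 3, "-": 1 / 3}
print(column_cost("A", column))   # about 0.667
print(column_cost("C", column))   # 1.0 up to rounding, since C matches nothing
```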

Clustal Algorithm
• Iteratively constructs MAs for intermediate nodes
• At each point holds profiles for all leaves
• Chooses the closest pair of neighbors
  • neighbors: have a common father in T
  • distance: cost of an optimal (pairwise) alignment
• Aligns the two profiles to get the ‘father profile’
• Replaces the two leaves with their father
• Analysis:
  • Initialization: O(k^2) alignments
  • k-1 iterations
  • Iteration i involves k-i-1 new pairwise alignments
ClustalW: a more advanced version in which sequences/profiles are weighted

Lifted Tree Alignments
Lifted tree alignment: each internal node is labeled by one of the labels of its daughters
Internal nodes are sequences and not profiles
[Figure: example tree over S1, …, S6 with lifted labels on its internal nodes]
We’ll show:
• A DP algorithm for optimal lifted tree alignment
• An optimal lifted alignment is a 2-approximation of the optimal tree alignment

Lifted Tree Alignments: Algorithm
Input: X, a set of sequences; T, a phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal
Basic principle: calculate for every node v in T and every sequence S in X:
d(v,S): the optimal cost of v’s subtree when v is labeled by S
The cost of the optimal tree is min_{S ∈ X} d(r,S), where r is the root of T

Lifted Tree Alignments: Algorithm
d(v,S): the optimal cost of v’s subtree when v is labeled by S
Initialization: for a leaf v labeled Sv: d(v,Sv) = 0
Recurrence: for an internal node v with daughters u1, …, ul, where S labels a leaf under ui: d(v,S) = d(ui,S) + Σ_{j≠i} min_T [D(S,T) + d(uj,T)], with T ranging over the leaf labels under uj
Correctness: check the optimal-substructure property
Complexity:
• O(k^2) pairwise alignments: O(n^2 k^2)
• k-1 iterations
• For an internal node v: O(k_v^2) work; over all nodes O(k^2 depth(T)) = O(k^3)
• Total: O(k^2 (n^2 + depth(T)))
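The DP over d(v,S) can be sketched as below. This is a sketch under assumptions rather than the lecture's exact formulation: the tree is given as nested tuples with strings at the leaves, unit-cost edit distance stands in for D, and a node labeled S is required to pass S down to the daughter whose subtree contains S (edge cost 0), which is what "lifted" demands. The names `edit_distance` and `lifted_table` are illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, standing in for the alignment cost D (an assumption)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lifted_table(tree, D=edit_distance):
    """Return {S: d(v, S)} for the root v of `tree`.
    `tree` is either a leaf (a str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return {tree: 0}                       # initialization: d(v, Sv) = 0
    child_tables = [lifted_table(child, D) for child in tree]
    table = {}
    for i, own in enumerate(child_tables):
        for S, cost_S in own.items():
            total = cost_S                     # daughter i keeps label S, edge cost 0
            for j, other in enumerate(child_tables):
                if j != i:                     # best label T for every other daughter
                    total += min(D(S, T) + c for T, c in other.items())
            table[S] = min(total, table.get(S, float("inf")))
    return table

tree = (("ACGT", "ACGA"), ("TTTT", "ACGT"))
table = lifted_table(tree)
print(table)                # {'ACGT': 4, 'ACGA': 5, 'TTTT': 7}
print(min(table.values()))  # 4, the cost of the optimal lifted tree
```

For clarity the sketch recomputes pairwise distances on demand; precomputing all O(k^2) of them once, as the complexity analysis assumes, would avoid the repeated work.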

Lifted Tree Alignments: Approximation analysis
• Claim: an optimal LTA 2-approximates general tree alignments
• We’ll show the construction of an LTA that costs at most twice the optimal TA with sequence-labeled nodes
  • (? can be generalized for profile-labeled nodes ?)
• Notation:
  • T*: optimal TA labels
  • Sv*: label of node v in T*
  • TL: our constructed LTA
  • SvL: label of node v in TL

Lifted Tree Alignments: Approximation analysis
• Construction:
  • We label the nodes bottom-up
  • For a node v with daughters u1, …, ul we choose the label (from Su1L, …, SulL) closest to Sv*
• We need to show: D(TL) ≤ 2D(T*)

Lifted Tree Alignments: Approximation analysis
• Analysis:
  • Some edges in TL have cost 0
  • Consider an edge (v,u) of cost > 0:
    • Si: label of the father (v)
    • Sj: label of the daughter (u)
    • P(v,u): the path in T* from v to the leaf labeled by Sj
  • D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)
             ≤ 2D(Sj,Sv*)              (choice of i)
             ≤ 2D(P(v,u))              (triangle inequality)

Lifted Tree Alignments: Approximation analysis
• D(Si,Sj) ≤ 2D(P(v,u))
• If (v,u) and (v’,u’) are two different edges with cost > 0 in TL, then P(v,u) and P(v’,u’) are edge-disjoint
• Summing over all edges of cost > 0 therefore gives D(TL) ≤ 2D(T*). Q.E.D.
• Final remarks:
  • The lifted tree alignment TL is only conceptual (we don’t have T*)
  • An optimal LTA cannot cost more than TL
  • In the case of profile-labeled nodes: the construction and analysis still hold as long as the cost is a distance function
