Multiple Sequence Alignment

badrani

Presentation Transcript


  1. Multiple Sequence Alignment
     S1 = AGGTC
     S2 = GTTCG
     S3 = TGAAC
     [Figure: two possible alignments of S1, S2 and S3]

  2. Multiple Sequence Alignment (cont)
     Input: sequences S1, S2, …, Sk over the same alphabet.
     Output: gapped sequences S’1, S’2, …, S’k of equal length:
     • |S’1| = |S’2| = … = |S’k|
     • Removing the spaces from S’i yields Si.
     The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

  3. Multiple Sequence Alignment: Example
     Consider the following alignment:
       AC-CDB-
       -C-ADBD
       A-BCDAD
     Scoring scheme: match = 0, mismatch/indel = -1.
     SP score: (-4) + (-3) + (-5) = -12
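The SP computation in the example can be checked with a short sketch. The function names are illustrative, and the handling of a column in which both rows hold a gap (it is skipped, since it vanishes from the induced pairwise alignment) is an assumption consistent with δ(-,-) = 0 used later in the slides.

```python
from itertools import combinations

def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score the pairwise alignment induced by two gapped rows."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap opposite gap: dropped from the induced alignment
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows, **scoring):
    """Sum-of-pairs score: add up the induced score of every pair of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))

rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```

The three pairwise terms are -3 (rows 1 and 2), -4 (rows 1 and 3) and -5 (rows 2 and 3), matching the total on the slide.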

  4. Multiple Sequence Alignment: Complexity
     • Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
       • Instead of a 2-dimensional table we have a k-dimensional table.
       • Each dimension is of length n+1.
       • Each entry depends on 2^k - 1 adjacent entries.
     Complexity: O(2^k n^k)
     This problem is known to be NP-hard (no polynomial-time algorithm is known).
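To see where the 2^k - 1 factor comes from, the sketch below enumerates the entries that one cell of the k-dimensional table depends on: every coordinate either stays or steps back by one, and the all-zero step is excluded. The helper names are hypothetical and the code only counts dependencies; it does not fill the table.

```python
from itertools import product

def predecessors(cell):
    """Yield the table cells that a k-dimensional DP entry depends on."""
    k = len(cell)
    for step in product((0, 1), repeat=k):
        if not any(step):
            continue  # the all-zero step is the cell itself
        prev = tuple(c - s for c, s in zip(cell, step))
        if min(prev) >= 0:  # stay inside the table
            yield prev

def table_size(n, k):
    """Number of entries in a k-dimensional table with sides of length n+1."""
    return (n + 1) ** k

print(len(list(predecessors((2, 2, 2)))))  # 7 = 2**3 - 1
print(table_size(5, 3))                    # 216
```

Cells on the boundary of the table have fewer predecessors, which is why 2^k - 1 is an upper bound per entry.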

  5. Multiple Sequence Alignment: Approximation Algorithm
     • We use cost instead of score.
     • Find an alignment of minimal cost.
     • Assumption: the cost function δ is a distance function:
       • δ(x,x) = 0
       • δ(x,y) = δ(y,x) ≥ 0
       • δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)
       • (e.g. cost of a mismatch ≤ cost of two indels)
     D(S,T): cost of a minimum global alignment between S and T.
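A minimal sketch of D(S,T) under one concrete choice of δ: matches cost 0, mismatches and indels cost 1. These unit costs are an assumption chosen because they satisfy the distance-function conditions above; the slides do not fix particular values.

```python
def alignment_cost(s, t, mismatch=1, indel=1):
    """Cost D(s, t) of a minimum-cost global alignment (matches cost 0)."""
    # prev[j] holds the cost of aligning the processed prefix of s with t[:j].
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]

s1, s2, s3 = "AGGTC", "GTTCG", "TGAAC"
# D inherits symmetry and the triangle inequality from delta.
print(alignment_cost(s1, s2) == alignment_cost(s2, s1))
print(alignment_cost(s1, s3) <= alignment_cost(s1, s2) + alignment_cost(s2, s3))
```

Only two rows of the table are kept because the sketch returns the cost alone; recovering the alignment itself needs the full table and a traceback.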

  6. Multiple Sequence Alignment: Approximation Algorithm
     • The ‘star’ algorithm:
       • Input: Γ, a set of k strings S1, …, Sk.
       • Find the string S’ (the center) that minimizes Σ_{S ∈ Γ} D(S’, S).
       • Denote S1 = S’ and the rest of the strings as S2, …, Sk.
       • Iteratively add S2, …, Sk to the alignment as follows:
         • Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1.
         • Align Si to S’1 to produce S’i and S’’1, aligned.
         • Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1.
         • Replace S’1 by S’’1.
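The steps of the star procedure can be sketched as follows. This is an illustrative implementation under assumed unit costs (mismatch = indel = 1, δ(-,-) = 0); the function names and the tie-breaking in the traceback are choices made here, not taken from the slides.

```python
def align(center, seq, mismatch=1, indel=1):
    """Align seq to a (possibly gapped) center row.

    Returns (cost, new_center, new_seq, is_new), where is_new[c] is True
    when column c holds a space newly inserted into the center.
    """
    def cost(a, b):
        if a == b:                      # includes delta('-', '-') = 0
            return 0
        return indel if "-" in (a, b) else mismatch

    n, m = len(center), len(seq)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(center[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(center[i - 1], seq[j - 1]),
                          D[i - 1][j] + cost(center[i - 1], "-"),
                          D[i][j - 1] + indel)

    cols = []                           # (center char, seq char, is_new)
    i, j = n, m
    while i > 0 or j > 0:               # traceback from the last cell
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(center[i - 1], seq[j - 1]):
            cols.append((center[i - 1], seq[j - 1], False))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(center[i - 1], "-"):
            cols.append((center[i - 1], "-", False))
            i -= 1
        else:
            cols.append(("-", seq[j - 1], True))
            j -= 1
    cols.reverse()
    return (D[n][m], "".join(c[0] for c in cols),
            "".join(c[1] for c in cols), [c[2] for c in cols])


def star_alignment(seqs, mismatch=1, indel=1):
    """Return gapped rows, center first, following the 'star' procedure."""
    def total(c):
        return sum(align(c, s, mismatch, indel)[0] for s in seqs)

    center = min(seqs, key=total)       # minimizes the sum of D(center, S)
    others = list(seqs)
    others.remove(center)
    rows = [center]
    for s in others:
        _, new_center, new_s, is_new = align(rows[0], s, mismatch, indel)
        updated = []
        for row in rows[1:]:            # copy the center's new spaces
            chars = iter(row)
            updated.append("".join("-" if new else next(chars) for new in is_new))
        rows = [new_center] + updated + [new_s]
    return rows

for row in star_alignment(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

All returned rows have the same length, and removing the spaces from each row gives back one of the input strings.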

  7. Multiple Sequence Alignment: Approximation Algorithm
     • Time analysis:
       • Choosing S1: execute DP for all sequence pairs: O(k^2 n^2).
       • Adding Si to the alignment: execute DP for Si and S’1: O(i·n^2).
         (In the i-th stage the length of S’1 can be up to i·n.)
       • Total complexity (summing the two steps above): O(k^2 n^2).

  8. Multiple Sequence Alignment: Approximation Algorithm
     • Approximation ratio:
       • M*: the optimal alignment
       • M: the alignment produced by this algorithm
       • d(i,j): the distance M induces on the pair Si, Sj
     For all i: d(1,i) = D(S1,Si)
     (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)

  9. Multiple Sequence Alignment: Approximation Algorithm
     Approximation ratio: D(M) is bounded from above using the triangle inequality, and D(M*) is bounded from below using the definition of S1 (the center).
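The equations of this slide did not survive in the transcript. The following is a reconstruction of the standard argument from the definitions on the previous slides, not a copy of the original. The symbol d*(i,j), the distance that M* induces on Si, Sj, is notation introduced here.

```latex
% Upper bound on the cost of M (triangle inequality, then d(1,i) = D(S_1,S_i)):
2\,D(M) = \sum_{i=1}^{k} \sum_{j \ne i} d(i,j)
        \le \sum_{i=1}^{k} \sum_{j \ne i} \bigl[ d(1,i) + d(1,j) \bigr]
        = 2(k-1) \sum_{i=2}^{k} D(S_1, S_i)

% Lower bound on the cost of M* (D is optimal per pair, then definition of S_1):
2\,D(M^*) = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j)
          \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j)
          \ge k \sum_{j=2}^{k} D(S_1, S_j)

% Dividing the two bounds:
\frac{D(M)}{D(M^*)} \le \frac{2(k-1)}{k} < 2
```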

  10. Multiple Sequence Alignment: Reminder
      S1 = AGGTC
      S2 = GTTCG
      S3 = TGAAC
      [Figure: two possible alignments of S1, S2 and S3]

  11. Multiple Sequence Alignment: Reminder
      Input: sequences S1, S2, …, Sk over the same alphabet.
      Output: gapped sequences S’1, S’2, …, S’k of equal length:
      • |S’1| = |S’2| = … = |S’k|
      • Removing the spaces from S’i yields Si.
      The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

  12. Multiple Sequence Alignment: Reminder
      • The ‘star’ algorithm:
        • Input: Γ, a set of k strings S1, …, Sk.
        • Find the string S1 (the center) that minimizes Σ_{S ∈ Γ} D(S1, S).
        • Iteratively add S2, …, Sk to the alignment.
        • Finds an MA costing at most twice the optimal cost!
      Problem: conventional MA does not correctly model evolutionary relationships.

  13. Tree Alignment
      • Input:
        • X: a set of sequences
        • T: a phylogenetic tree on X (leaves labeled by X)
      • Output: labels on the internal vertices of T, such that the sum of the costs of all edges of T is minimal.
      • How do we label internal vertices?
        • Sequences
        • Profiles (multiple alignments)

  14. Profile Alignment
      A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of Σ (and gap) in that position.
      [Figure: an example multiple alignment and its profile]
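A minimal sketch of building such a table. The three rows below are one valid alignment of S1, S2, S3 from slide 1, chosen here for illustration; the DNA alphabet and the dictionary layout are likewise assumptions.

```python
from collections import Counter

def profile(rows, alphabet="ACGT"):
    """Return {symbol: [frequency in column 0, column 1, ...]}, gap included."""
    symbols = alphabet + "-"
    k = len(rows)
    table = {s: [] for s in symbols}
    for column in zip(*rows):           # walk the alignment column by column
        counts = Counter(column)
        for s in symbols:
            table[s].append(counts[s] / k)
    return table

rows = ["AGGT-C-",
        "-G-TTCG",
        "TG-AAC-"]
p = profile(rows)
print(len(p), len(p["A"]))  # 5 rows (|alphabet| + 1), 7 columns
print(p["G"][1])            # 1.0: the second column is all G
```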

  15. Profile Alignment
      • Aligning a sequence to a profile:
        • Matching a letter to a position: weighted average of scores
        • Indels: introducing new columns gets special consideration
        • (same goes for aligning two profiles)
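The "weighted average" for matching a letter to a profile position can be sketched as below. The sketch works with costs rather than scores and assumes unit mismatch and indel costs; the slides do not specify the weights or the treatment of gaps in the column, so both are illustrative choices.

```python
def column_cost(letter, column, mismatch=1, indel=1):
    """Expected cost of placing `letter` against one profile column.

    `column` maps each symbol (including '-') to its frequency there.
    """
    def delta(a, b):
        if a == b:
            return 0
        return indel if "-" in (a, b) else mismatch

    # Average delta(letter, symbol), weighted by the symbol's frequency.
    return sum(freq * delta(letter, sym) for sym, freq in column.items())

column = {"A": 0.5, "C": 0.25, "-": 0.25}
print(column_cost("A", column))  # 0.5
print(column_cost("G", column))  # 1.0
```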

  16. Clustal Algorithm
      • Iteratively constructs an MA for intermediate nodes.
      • At each point holds profiles for all leaves.
      • Chooses the closest pair of neighbors:
        • neighbors: have a common father in T
        • distance: cost of the optimal (pairwise) alignment
      • Aligns the two profiles to get the ‘father-profile’.
      • Replaces the two leaves with their father.
      • Analysis:
        • Initialization: O(k^2) alignments
        • k-1 iterations
        • Iteration i involves k-i-1 new pairwise alignments.
      ClustalW: a more advanced version. Sequences/profiles are weighted.

  17. Lifted Tree Alignments
      Lifted tree alignment: each internal node is labeled by one of the labels of its daughters.
      Internal nodes are sequences and not profiles.
      Example: [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      We’ll show:
      • A DP algorithm for optimal lifted tree alignment
      • The optimal lifted alignment is a 2-approximation of the optimal tree alignment

  18. Lifted Tree Alignments: Algorithm
      [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      Input:
      • X: a set of sequences
      • T: a phylogenetic tree on X (leaves labeled by X)
      Output: lifted labels on the internal vertices of T, such that the sum of the costs of all edges of T is minimal.
      Basic principle: calculate, for every node v in T and sequence S in X:
      d(v,S): the optimal cost of v’s subtree when v is labeled by S.
      The cost of the optimal tree is min over S in X of d(r,S), where r is the root of T.

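  19. Lifted Tree Alignments: Algorithm
      [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      d(v,S): the optimal cost of v’s subtree when v is labeled by S.
      Initialization: for a leaf v labeled Sv: d(v,Sv) = 0, and d(v,S) = ∞ for S ≠ Sv.
      Recurrence: for an internal node v with daughters u1, …, ul, and S labeling a leaf in v’s subtree: d(v,S) = Σ_{j=1..l} min over S’ of [ D(S,S’) + d(uj,S’) ].
      Correctness: check for the suboptimal solution property.
      Complexity:
      • O(k^2) pairwise alignments: O(n^2 k^2)
      • k-1 iterations
      • For an internal node v: O(kv^2) work, which sums to O(k^2 depth(T)) = O(k^3)
      • Total: O(k^2 (n^2 + depth(T)))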
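The DP over the tree can be sketched as follows. The tree encoding (a leaf is a string, an internal node is a tuple of subtrees), the unit-cost D, and the way the lifted constraint is enforced (the daughter whose subtree contains S is itself labeled S, at edge cost 0) are assumptions made for this illustration.

```python
def alignment_cost(s, t, mismatch=1, indel=1):
    """Pairwise cost D(s, t) with unit mismatch and indel costs (assumed)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]

def lifted_costs(tree, D):
    """Return {S: d(v, S)} for the root v of `tree`.

    Only labels S that occur at a leaf below v get an entry; all other
    labels are implicitly at infinite cost.
    """
    if isinstance(tree, str):
        return {tree: 0}                # leaf: d(v, Sv) = 0
    child_tables = [lifted_costs(child, D) for child in tree]
    table = {}
    for own in child_tables:
        for S, below in own.items():    # S is lifted from this daughter
            total = below
            for other in child_tables:
                if other is not own:    # remaining daughters pick their best label
                    total += min(D(S, T) + c for T, c in other.items())
            table[S] = min(total, table.get(S, float("inf")))
    return table

tree = (("A", "AA"), "AAA")
costs = lifted_costs(tree, alignment_cost)
print(costs)                # {'A': 3, 'AA': 2, 'AAA': 2}
print(min(costs.values()))  # 2: cost of the optimal lifted tree
```

For larger inputs the O(k^2) pairwise costs would be computed once and looked up, rather than recomputed inside the recursion as in this sketch.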

  20. Lifted Tree Alignments: Approximation analysis
      [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      • Claim: the optimal LTA 2-approximates general tree alignments.
      • We’ll show the construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes.
      • (? can be generalized for profile-labeled nodes ?)
      • Notations:
        • T*: the optimal TA labels
        • Sv*: the label of node v in T*
        • TL: our constructed LTA
        • SvL: the label of node v in TL

  21. Lifted Tree Alignments: Approximation analysis
      [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      • Construction:
        • We label the nodes bottom-up.
        • For a node v with daughters u1, …, ul, we choose the label (from Su1L, …, SulL) closest to Sv*.
      • We need to show: D(TL) ≤ 2D(T*)

  22. Lifted Tree Alignments: Approximation analysis
      [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      • Analysis:
        • Some edges in TL have cost 0.
        • Consider the edges (v,u) of cost > 0:
          • Si: the label of the father (v)
          • Sj: the label of the daughter (u)
          • P(v,u): the path in T* from v to the leaf labeled by Sj
          • D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)
                     ≤ 2D(Sj,Sv*)              (choice of i)
                     ≤ 2D(P(v,u))              (triangle inequality)

  23. Lifted Tree Alignments: Approximation analysis
      [Figure: a tree whose nodes are labeled by sequences among S1, …, S6]
      • D(Si,Sj) ≤ 2D(P(v,u))
      • If (v,u) and (v’,u’) are two different edges with cost > 0 in TL, then P(v,u) and P(v’,u’) are edge-disjoint. Hence D(TL) ≤ 2D(T*). Q.E.D.
      • Final remarks:
        • The lifted tree alignment TL is only conceptual (we don’t have T*).
        • The optimal LTA cannot cost more than TL.
        • In the case of profile-labeled nodes: the construction and analysis are OK when the cost is still a distance function.
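Written out, the final summation step reads as follows. This is a sketch assembled from the inequalities on the last two slides; E_+(T^L), the set of edges of TL with cost > 0, is notation introduced here, and D(P) denotes the sum of edge costs along a path P in T*.

```latex
D(T^L) = \sum_{(v,u) \in E_+(T^L)} D(S_i, S_j)
       \le \sum_{(v,u) \in E_+(T^L)} 2\, D\bigl(P(v,u)\bigr)
       \le 2\, D(T^*)
% The last step holds because the paths P(v,u) are pairwise edge-disjoint
% in T*, so each edge of T* is counted at most once.
```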
