Aligning Alignments

Aligning Alignments Soni Mukherjee 11/11/04

Pairwise Alignment • Given two sequences, find their optimal alignment • Score = (#matches) * m - (#mismatches) * s - (#gaps) * d • Optimal alignment is the alignment with the maximum score

Dynamic Programming • We want to align x1…xm and y1…yn • D(i,j) = optimal score of aligning x1…xi and y1…yj • Solution is D(m, n)

Dynamic Programming
C--GCCTAG-CT--AG
CT-GC-TAT-CTTTAG
Three possible cases for computing D(i,j):
1. xi aligns to yj
   x1……xi-1 xi
   y1……yj-1 yj
   D(i,j) = D(i-1, j-1) + m if xi = yj, or D(i-1, j-1) - s otherwise
2. xi aligns to a gap
   x1……xi-1 xi
   y1……yj   -
   D(i,j) = D(i-1, j) - d
3. yj aligns to a gap
   x1……xi   -
   y1……yj-1 yj
   D(i,j) = D(i, j-1) - d

Dynamic Programming
• Inductive assumption: D(i-1, j-1), D(i-1, j) and D(i, j-1) are optimal
• D(i, j) = max {
    D(i-1, j-1) + s(xi, yj),
    D(i-1, j) - d,
    D(i, j-1) - d }
  where s(xi, yj) = m if xi = yj; -s otherwise

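Dynamic Programming • Matrix D [Figure: one cell of matrix D; the diagonal move into the cell is labelled +s(X[i],Y[j]) and the two straight moves are each labelled -d]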

Needleman-Wunsch • Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences [Figure: matrix with y1…yN along one axis and x1…xM along the other]
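The recurrence for D(i, j) can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `needleman_wunsch`, the default scores m=1, s=1, d=2, and the boundary values D(i, 0) = -i*d and D(0, j) = -j*d are assumptions (the slides do not state the initialization). It returns only the score D(M, N), not the alignment itself.

```python
def needleman_wunsch(x, y, m=1, s=1, d=2):
    """Return the optimal global alignment score of x and y (linear gaps)."""
    rows, cols = len(x) + 1, len(y) + 1
    D = [[0] * cols for _ in range(rows)]
    # Assumed boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        D[i][0] = -i * d
    for j in range(1, cols):
        D[0][j] = -j * d
    for i in range(1, rows):
        for j in range(1, cols):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(
                D[i - 1][j - 1] + sub,  # xi aligns to yj
                D[i - 1][j] - d,        # xi aligns to a gap
                D[i][j - 1] - d,        # yj aligns to a gap
            )
    return D[-1][-1]


print(needleman_wunsch("ACGT", "AGT"))
```

Recovering the alignment itself would additionally need a traceback from (M, N) to (0, 0), which is omitted here to keep the sketch short.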

Scoring Gaps More Accurately
• Linear gap model: Gap of length n incurs penalty p(n) = n*d
• Convex gap model: For all n, p(n+1) - p(n) < p(n) - p(n-1)
• D(i, j) = max {
    D(i-1, j-1) + s(xi, yj),
    max over k = 0…i-1 of D(k, j) - p(i-k),
    max over k = 0…j-1 of D(i, k) - p(j-k) }
• Running time = O(N^3)
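To make the cubic cost concrete, here is a hedged sketch of the general-gap recurrence. The name `align_general_gap`, the callable parameter `p`, and the default scores are illustrative assumptions. Each cell scans every earlier cell in its row and column, which is where the extra factor of N comes from.

```python
def align_general_gap(x, y, p, m=1, s=1):
    """Optimal global score when a gap of length n costs p(n)."""
    rows, cols = len(x) + 1, len(y) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                sub = m if x[i - 1] == y[j - 1] else -s
                candidates.append(D[i - 1][j - 1] + sub)
            # A gap that ends at (i, j) may have started at any earlier k.
            candidates.extend(D[k][j] - p(i - k) for k in range(i))
            candidates.extend(D[i][k] - p(j - k) for k in range(j))
            D[i][j] = max(candidates)
    return D[-1][-1]


# Example penalty (an assumption for illustration): p(n) = 2 + n.
print(align_general_gap("ACGT", "AT", lambda n: 2 + n))
```

With a single matrix, two adjacent gaps in the same sequence are scored as p(a) + p(b); for penalties such as the one above this is never cheaper than one gap of length a + b, so the maximum still picks the merged gap.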

Affine Gaps
• p(n) = d + n*e
  d = gap open penalty
  e = gap extend penalty
• Now we need three matrices:
  D(i, j) = score of alignment x1…xi to y1…yj if xi aligns to yj
  H(i, j) = score of alignment x1…xi to y1…yj if yj aligns to a gap
  V(i, j) = score of alignment x1…xi to y1…yj if xi aligns to a gap

Needleman-Wunsch with Affine Gaps
• D(i,j) = max { D(i-1, j-1) + s(xi, yj), H(i-1, j-1) + s(xi, yj), V(i-1, j-1) + s(xi, yj) }
• H(i,j) = max { D(i, j-1) - d, H(i, j-1) - e, V(i, j-1) - d }
• V(i,j) = max { D(i-1, j) - d, H(i-1, j) - d, V(i-1, j) - e }
• Running time = O(MN)
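A minimal sketch of the three-matrix recurrence follows. The function name `affine_align`, the default scores, and the boundary conditions are assumptions, since the slides give only the inner recurrences. Note that, as written, these recurrences charge d for the first position of a gap and e for each further position, so a gap of length n costs d + (n-1)*e in this sketch.

```python
NEG_INF = float("-inf")


def affine_align(x, y, m=1, s=1, d=3, e=1):
    """Optimal global score using the D/H/V matrices for affine gaps."""
    M, N = len(x), len(y)
    D = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    H = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    V = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary conditions (not given on the slides).
    D[0][0] = 0
    for j in range(1, N + 1):
        H[0][j] = -d - (j - 1) * e  # y1..yj all aligned to gaps
    for i in range(1, M + 1):
        V[i][0] = -d - (i - 1) * e  # x1..xi all aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1], H[i - 1][j - 1], V[i - 1][j - 1]) + sub
            H[i][j] = max(D[i][j - 1] - d, H[i][j - 1] - e, V[i][j - 1] - d)
            V[i][j] = max(D[i - 1][j] - d, H[i - 1][j] - d, V[i - 1][j] - e)
    return max(D[M][N], H[M][N], V[M][N])


print(affine_align("ACGT", "AT"))
```

Each cell does a constant amount of work across the three matrices, which matches the O(MN) running time stated above.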

Affine Gaps
• Essentially, when there is a gap, the algorithm looks back one column to determine whether this gap opens a new gap or continues a previous one:
  - x          z x          z x
  y -          y -          - -
  Starts       Starts       Continues
  new gap      new gap      old gap

Multiple Sequence Alignment • Given N sequences x1, x2,…, xN, insert gaps in each sequence xi such that: • All sequences have the same length L • Global score is maximum • Motivation: • Faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve pairwise alignments

Induced Pairwise Alignments
• Multiple alignment:
  x: AC_GCGG_C
  y: AC_GC_GAG
  z: GCCGC_GAG
• Induces three pairwise alignments:
  x: ACGCGG_C     x: AC_GCGG_C     y: AC_GCGAG
  y: ACGC_GAG     z: GCCGC_GAG     z: GCCGCGAG
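An induced pairwise alignment is obtained by keeping two rows of the multiple alignment and dropping every column in which both rows have a gap. The sketch below shows this projection on the x, y, z example; the function name `induced_pairwise` and the dictionary `msa` are illustrative names, not part of the slides.

```python
def induced_pairwise(row_a, row_b, gap="_"):
    """Project two rows of a multiple alignment onto a pairwise alignment."""
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]  # drop gap-vs-gap columns
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


msa = {"x": "AC_GCGG_C", "y": "AC_GC_GAG", "z": "GCCGC_GAG"}
for first, second in [("x", "y"), ("x", "z"), ("y", "z")]:
    a, b = induced_pairwise(msa[first], msa[second])
    print(f"{first}: {a}\n{second}: {b}\n")
```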

Sum of Pairs • Sum of Pairs score of a multiple alignment is the sum of the scores of all induced pairwise alignments: S(m) = Σ over k<l of s(mk, ml), where s(mk, ml) = score of induced alignment (k, l)
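As a hedged illustration, the sum-of-pairs score can be computed directly from the rows of a multiple alignment. The sketch assumes the linear scoring scheme from the pairwise slides (match m, mismatch s, gap d), with gap-vs-gap columns contributing nothing because they vanish from the induced alignment. The names `pair_score` and `sum_of_pairs` and the default values are assumptions.

```python
from itertools import combinations


def pair_score(row_a, row_b, m=1, s=1, d=2, gap="_"):
    """Score the pairwise alignment induced by two rows (linear gaps)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # column disappears from the induced alignment
        if a == gap or b == gap:
            total -= d
        elif a == b:
            total += m
        else:
            total -= s
    return total


def sum_of_pairs(rows, **scores):
    """Sum the induced pairwise scores over all pairs k < l."""
    return sum(pair_score(a, b, **scores) for a, b in combinations(rows, 2))


print(sum_of_pairs(["AC_GCGG_C", "AC_GC_GAG", "GCCGC_GAG"]))
```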

Multidimensional Dynamic Programming
• Example in 3-D (3 sequences)
• 7 neighbors per cell
F(i,j,k) = max {
  F(i-1,j-1,k-1) + S(xi, xj, xk),
  F(i-1,j-1,k  ) + S(xi, xj, - ),
  F(i-1,j  ,k-1) + S(xi, - , xk),
  F(i-1,j  ,k  ) + S(xi, - , - ),
  F(i  ,j-1,k-1) + S( - , xj, xk),
  F(i  ,j-1,k  ) + S( - , xj, - ),
  F(i  ,j  ,k-1) + S( - , - , xk) }

Multidimensional Dynamic Programming • L = length of each sequence • N = number of sequences • Size of matrix = L^N • Neighbors per cell = 2^N - 1 • Running time = O(2^N L^N)
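The neighbor count follows from a simple observation: each of the N sequences either advances by one letter or contributes a gap, and the all-gap choice is excluded. A small sketch (the function name `predecessor_offsets` is an assumption) enumerates these index offsets:

```python
from itertools import product


def predecessor_offsets(n):
    """All 0/1 offset vectors for n sequences, excluding the all-zero one."""
    return [offset for offset in product((0, 1), repeat=n) if any(offset)]


for n in (2, 3, 4):
    print(n, len(predecessor_offsets(n)))
```

For N = 3 this yields the 7 predecessors of F(i,j,k) listed in the 3-D example.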

Progressive Alignment • Align two of the sequences xi and xj • Fix that alignment • Align a third sequence/alignment to the alignment xixj • Repeat until all sequences are aligned

Progressive Alignment • When evolutionary tree is known: • Align closest first, in order of the tree: • Align (x, y) • Align (w, z) • Align (xy, wz) [Figure: evolutionary tree with leaves x, y, z, w]

Aligning three sequences: Multidimensional Dynamic Programming vs. Progressive Alignment [Figure: two panels with axes X, Y and Z]

Sequence vs Alignment • Score at each entry adds the score of aligning the letter of y to the column of the alignment xz [Figure: matrix with y1…yN along one axis and the aligned rows x1…xM and z1…zL along the other]

Example
• ith letter of y: A
• jth column of xz:
  x: -
  z: A
• D(i, j) = max {
    D(i-1, j-1) - d + s(A, A),
    D(i-1, j) - d - d,
    D(i, j-1) + 0 - d }
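The example generalizes to a sequence-vs-alignment recurrence with linear gaps, where each move is scored by summing the letter against every row of the column. The sketch below is an illustration under assumptions: the names `pair` and `seq_vs_alignment`, the default scores, and the boundary initialization are not from the slides, and a gap against a gap scores 0 as in the third case of the example.

```python
GAP = "-"


def pair(a, b, m, s, d):
    """Score one symbol against another; gap against gap costs nothing."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -d
    return m if a == b else -s


def seq_vs_alignment(y, rows, m=1, s=1, d=2):
    """Optimal score of aligning sequence y to a fixed alignment (linear gaps)."""
    k, n = len(rows), len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n)]
    empty = [GAP] * k  # a column of gaps inserted into the alignment

    def col(ch, column):
        return sum(pair(ch, c, m, s, d) for c in column)

    D = [[0] * (n + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        D[i][0] = D[i - 1][0] + col(y[i - 1], empty)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + col(GAP, columns[j - 1])
    for i in range(1, len(y) + 1):
        for j in range(1, n + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + col(y[i - 1], columns[j - 1]),  # letter vs column
                D[i - 1][j] + col(y[i - 1], empty),               # letter vs gaps
                D[i][j - 1] + col(GAP, columns[j - 1]),           # gap vs column
            )
    return D[-1][-1]


# The slide's example: y = "A" against the single column (x: -, z: A).
print(seq_vs_alignment("A", ["-", "A"]))
```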

Affine Gaps
• ith letter of y matched with jth column of xz
• (j-1)th column of xz gapped
  y: - A
  x: - -
  z: A A
• This induces the yx alignment:
  y: - A
  x: - -

Affine Gaps
• Recall for pairwise alignment, there were three cases when determining whether a gap starts or continues:
  - x          z x          z x
  y -          y -          - -
  Starts       Starts       Continues
  new gap      new gap      old gap
• When aligning a sequence and an alignment, a fourth case arises:
  - x
  - -
  Starts or continues a gap???

Aligning Alignments (John D. Kececioglu and Weiqing Zhang, 1998) • Optimistic and pessimistic gap counts for sequence vs alignment • Exact gap counts for sequence vs alignment

Sequence vs Alignment • A = a1 … am is a sequence of length m • B is a multiple alignment of length n of k sequences • represented by a k x n matrix • each entry bij is either a letter or gap

Optimistic and Pessimistic Gap Counts
• When we have
  - x
  - -
• Optimistic gap count assumes that this continues a previous gap
• Pessimistic gap count assumes this starts a new gap
• Running time = O(kmn)

Exact Gap Counts
• Recall matrices:
  D(i, j) = score of alignment a1…ai to b1…bj if ai aligns to bj
  H(i, j) = score of alignment a1…ai to b1…bj if bj aligns to a gap
  V(i, j) = score of alignment a1…ai to b1…bj if ai aligns to a gap
• Only ways to get
  - x
  - -
  are the cases HH, HV, and HD, generalized as HX

Exact Gap Counts • Three possibilities: • … DH…HX • … VH…HX • H………HX • Is bij the first character in its row encountered during the run? • Algorithm with lots of matrices runs in O(kn^2 + kmn + mn^2)

Comparison
• Sequence vs Alignment: only three types of paths can cause
  … - - x
  … - - -
• Alignment vs Alignment: any path can cause
  … - - x
  … - - -

Aligning Alignments Exactly (John Kececioglu and Dean Starrett, 2003) • Aligning two alignments is NP-complete • Exact algorithm • Time and space complexity • Pruning • Results

NP-Completeness • Reduction from the Maximum Cut Problem • Still NP-complete if: • Strings are of length at most 5 • Every row has at most 3 gaps • At most 1 gap in the interior of each string

Exact Algorithm
• Sufficient to know relative order of the rightmost element in the row for each pair:
  x: - A
  y: - -
• If x's rightmost element is to the right of y's rightmost element, this is an extension
• Otherwise, it is a startup
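The rightmost-element rule can be illustrated on two rows whose earlier columns are already fixed. This is only a sketch of the rule itself, not of the exact algorithm, which tracks this relative order inside the dynamic program. The names `rightmost_letter` and `classify_gap` are assumptions, as is the choice to treat a gap with no earlier letters in either row as a startup.

```python
def rightmost_letter(row, before, gap="-"):
    """Index of the last non-gap symbol in row[:before], or -1 if none."""
    for pos in range(before - 1, -1, -1):
        if row[pos] != gap:
            return pos
    return -1


def classify_gap(row_x, row_y, col, gap="-"):
    """Classify the gap in row_y at column col, where row_x has a letter.

    The gap extends an earlier gap of y (relative to x) exactly when x's
    rightmost earlier letter lies to the right of y's rightmost earlier letter.
    """
    rx = rightmost_letter(row_x, col, gap)
    ry = rightmost_letter(row_y, col, gap)
    return "extension" if rx > ry else "startup"


print(classify_gap("AA-A", "T---", 3))
print(classify_gap("A-A", "T--", 2))
```

In the first call the induced pairwise alignment is AAA over T--, so the final gap continues one already open; in the second it is AA over T-, so the gap is new.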

Shapes
A: -AGGCTATCACCTGACCTCCAGG
B: TAG-CTATCAC--GACCGC----
C: CAG-CTATCAC--GACCGC----
D: CAGCCTATCACC-GAACGCCA--
