Aligning Alignments Soni Mukherjee 11/11/04
Pairwise Alignment • Given two sequences, find their optimal alignment • Score = (#matches) * m - (#mismatches) * s - (#gaps) * d • Optimal alignment is the alignment with the maximum score
Dynamic Programming • We want to align x1…xM and y1…yN • D(i,j) = optimal score of aligning x1…xi and y1…yj • Solution is D(M, N)
Dynamic Programming • Three possible cases for computing D(i,j):
1. xi aligns to yj
   x1……xi-1 xi
   y1……yj-1 yj
$$D(i,j) = D(i-1,j-1) + \begin{cases} m & \text{if } x_i = y_j \\ -s & \text{otherwise} \end{cases}$$
2. xi aligns to a gap
   x1……xi-1 xi
   y1……yj   -
$$D(i,j) = D(i-1,j) - d$$
3. yj aligns to a gap
   x1……xi   -
   y1……yj-1 yj
$$D(i,j) = D(i,j-1) - d$$
Example alignment:
   C--GCCTAG-CT--AG
   CT-GC-TAT-CTTTAG
Dynamic Programming • Inductive assumption: D(i-1, j-1), D(i-1, j) and D(i, j-1) are optimal • Then:
$$D(i,j) = \max \begin{cases} D(i-1,j-1) + s(x_i,y_j) \\ D(i-1,j) - d \\ D(i,j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise
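The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `needleman_wunsch`, the default values of `m`, `s`, `d`, and the boundary rows (a prefix aligned entirely against gaps costs `d` per position) are assumptions that the slides do not state.

```python
def needleman_wunsch(x, y, m=1, s=1, d=2):
    """Optimal global alignment score of x and y under linear gap costs.

    m = match reward, s = mismatch penalty, d = per-position gap penalty
    (default values are illustrative assumptions).
    """
    M, N = len(x), len(y)
    # D[i][j] = best score for aligning x[:i] with y[:j]
    D = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        D[i][0] = -i * d
    for j in range(1, N + 1):
        D[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1] + sub,  # xi aligns to yj
                          D[i - 1][j] - d,        # xi aligns to a gap
                          D[i][j - 1] - d)        # yj aligns to a gap
    return D[M][N]


print(needleman_wunsch("ACGT", "AGT"))
```

Only the score is returned here; recovering the alignment itself would need a traceback over `D`, which is omitted to keep the sketch short.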
Dynamic Programming • Matrix D [Figure: cell (i, j) of matrix D with three incoming arrows: diagonal labeled +s(X[i],Y[j]), and two labeled -d]
Needleman-Wunsch • Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences [Figure: dynamic programming matrix with y1…yN along one axis and x1…xM along the other]
Scoring Gaps More Accurately • Linear gap model: Gap of length n incurs penalty p(n) = n*d • Convex gap model: For all n, p(n+1) - p(n) < p(n) - p(n-1)
$$D(i,j) = \max \begin{cases} D(i-1,j-1) + s(x_i,y_j) \\ \max_{k=0,\dots,i-1} D(k,j) - p(i-k) \\ \max_{k=0,\dots,j-1} D(i,k) - p(j-k) \end{cases}$$
• Running time = $O(N^3)$
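A direct transcription of the general-gap recurrence shows where the cubic running time comes from: every cell scans a whole row segment and a whole column segment. The sketch below is illustrative only; the name `align_general_gap`, the default scores, and the convention that `D(0,0) = 0` with all other cells filled by the recurrence are assumptions.

```python
def align_general_gap(x, y, p, m=1, s=1):
    """Global alignment score where a gap of length n costs p(n).

    p is any penalty function; m and s are assumed match/mismatch scores.
    Each cell scans O(N) predecessors, giving O(N^3) time overall.
    """
    M, N = len(x), len(y)
    NEG = float("-inf")
    D = [[NEG] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0
    for i in range(M + 1):
        for j in range(N + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                sub = m if x[i - 1] == y[j - 1] else -s
                best = D[i - 1][j - 1] + sub
            # Gap of length i-k: x[k..i-1] aligned against gaps.
            for k in range(i):
                best = max(best, D[k][j] - p(i - k))
            # Gap of length j-k: y[k..j-1] aligned against gaps.
            for k in range(j):
                best = max(best, D[i][k] - p(j - k))
            D[i][j] = best
    return D[M][N]


# With a linear penalty this reduces to the simple recurrence.
print(align_general_gap("ACGT", "AGT", lambda n: 2 * n))
# A penalty with a fixed opening cost favors one long gap over two short ones.
print(align_general_gap("AAAT", "AT", lambda n: 3 + n))
```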
Affine Gaps • p(n) = d + n*e, where d = gap open penalty and e = gap extend penalty [Figure: gap penalty diagram labeled d and e] • Now we need three matrices:
D(i, j) = score of alignment x1…xi to y1…yj if xi aligns to yj
H(i, j) = score of alignment x1…xi to y1…yj if yj aligns to a gap
V(i, j) = score of alignment x1…xi to y1…yj if xi aligns to a gap
Needleman-Wunsch with Affine Gaps
$$D(i,j) = \max \begin{cases} D(i-1,j-1) + s(x_i,y_j) \\ H(i-1,j-1) + s(x_i,y_j) \\ V(i-1,j-1) + s(x_i,y_j) \end{cases}$$
$$H(i,j) = \max \begin{cases} D(i,j-1) - d \\ H(i,j-1) - e \\ V(i,j-1) - d \end{cases}$$
$$V(i,j) = \max \begin{cases} D(i-1,j) - d \\ H(i-1,j) - d \\ V(i-1,j) - e \end{cases}$$
• Running time = $O(MN)$
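The three recurrences can be sketched as follows. The function name `affine_align`, the default scores, and the boundary values are assumptions, since the slides give only the interior recurrences. One point worth noting: as written, the recurrences charge d for the first gapped position and e for each further one, so a gap of length n costs d + (n-1)*e in this sketch.

```python
NEG = float("-inf")


def affine_align(x, y, m=1, s=1, d=3, e=1):
    """Global alignment score with gap-open d and gap-extend e.

    D: xi aligned to yj; H: yj aligned to a gap; V: xi aligned to a gap.
    Boundary rows and columns below are assumptions, not from the slides.
    """
    M, N = len(x), len(y)
    D = [[NEG] * (N + 1) for _ in range(M + 1)]
    H = [[NEG] * (N + 1) for _ in range(M + 1)]
    V = [[NEG] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0
    for j in range(1, N + 1):
        H[0][j] = -d - (j - 1) * e   # y[:j] entirely against gaps
    for i in range(1, M + 1):
        V[i][0] = -d - (i - 1) * e   # x[:i] entirely against gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1], H[i - 1][j - 1], V[i - 1][j - 1]) + sub
            H[i][j] = max(D[i][j - 1] - d, H[i][j - 1] - e, V[i][j - 1] - d)
            V[i][j] = max(D[i - 1][j] - d, H[i - 1][j] - d, V[i - 1][j] - e)
    return max(D[M][N], H[M][N], V[M][N])


print(affine_align("AAAT", "AT"))
```

Setting `d == e` makes every gapped position cost the same, which recovers the linear model and is a convenient sanity check.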
Affine Gaps • Essentially, when there is a gap, the algorithm looks back one column to determine whether this position opens a new gap or continues a previous one:
   - x        z x        z x
   y -        y -        - -
   Starts     Starts     Continues
   new gap    new gap    old gap
Multiple Sequence Alignment • Given N sequences x1, x2,…, xN, insert gaps in each sequence xi such that: • All sequences have the same length L • Global score is maximum • Motivation: • Faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve pairwise alignments
Induced Pairwise Alignments • Multiple alignment:
   x: AC_GCGG_C
   y: AC_GC_GAG
   z: GCCGC_GAG
• Induces three pairwise alignments (columns where both rows have a gap are dropped):
   x: ACGCGG_C     x: AC_GCGG_C     y: AC_GCGAG
   y: ACGC_GAG     z: GCCGC_GAG     z: GCCGCGAG
Sum of Pairs • Sum of Pairs score of a multiple alignment is the sum of the scores of all induced pairwise alignments:
$$S(m) = \sum_{k<l} s(m_k, m_l)$$
where $s(m_k, m_l)$ = score of induced alignment (k, l)
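As a worked example, the sum-of-pairs score of the three-row alignment from the previous slide can be computed by inducing each pairwise alignment and scoring it. The helper names (`induced_pair`, `pair_score`, `sum_of_pairs`) and the linear scoring parameters (m=1, s=1, d=2) are assumptions chosen for illustration.

```python
from itertools import combinations


def induced_pair(row_a, row_b, gap="_"):
    """Drop columns where both rows have a gap."""
    cols = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def pair_score(row_a, row_b, m=1, s=1, d=2, gap="_"):
    """Linear-gap score of one pairwise alignment (assumed parameters)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            score -= d
        elif a == b:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, m=1, s=1, d=2, gap="_"):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b, gap), m, s, d, gap)
               for a, b in combinations(rows, 2))


rows = ["AC_GCGG_C", "AC_GC_GAG", "GCCGC_GAG"]
print(induced_pair(rows[0], rows[1]))  # the induced (x, y) alignment
print(sum_of_pairs(rows))
```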
Multidimensional Dynamic Programming • Example in 3-D (3 sequences) • 7 neighbors per cell
$$F(i,j,k) = \max \begin{cases} F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\ F(i-1,j-1,k) + S(x_i, x_j, -) \\ F(i-1,j,k-1) + S(x_i, -, x_k) \\ F(i-1,j,k) + S(x_i, -, -) \\ F(i,j-1,k-1) + S(-, x_j, x_k) \\ F(i,j-1,k) + S(-, x_j, -) \\ F(i,j,k-1) + S(-, -, x_k) \end{cases}$$
Multidimensional Dynamic Programming • L = length of each sequence • N = number of sequences • Size of matrix = $L^N$ • Neighbors per cell = $2^N - 1$ • Running time = $O(2^N L^N)$
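The neighbor count can be checked by enumerating predecessor offsets: each sequence either advances by one letter (offset 1) or receives a gap (offset 0), and the all-zero offset is excluded. The function names below are hypothetical, and `cell_updates` is only a rough operation count, not a timing model.

```python
from itertools import product


def predecessor_offsets(n):
    """All non-zero 0/1 offset vectors for an n-dimensional DP cell."""
    return [o for o in product((0, 1), repeat=n) if any(o)]


def cell_updates(L, n):
    """Rough count of neighbor lookups: (2^n - 1) per cell, L^n cells."""
    return (2 ** n - 1) * L ** n


print(len(predecessor_offsets(3)))  # number of neighbors for 3 sequences
print(cell_updates(100, 3))         # lookups for three sequences of length 100
```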
Progressive Alignment • Align two of the sequences xi and xj • Fix that alignment • Align a third sequence/alignment to the alignment xixj • Repeat until all sequences are aligned
Progressive Alignment • When evolutionary tree is known: • Align closest first, in order of the tree: • Align (x, y) • Align (w, z) • Align (xy, wz) [Figure: evolutionary tree with leaves x, y, z, w]
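The order of merges follows from a post-order walk of the tree. The sketch below only derives that order; it does not perform any alignment. Representing the tree as nested tuples of sequence names, and the function name `merge_order`, are assumptions made for illustration.

```python
def merge_order(tree, steps=None):
    """Return (label, steps): steps lists the pairs to align, children first.

    `tree` is either a sequence name (str) or a pair (left, right) of subtrees.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return tree, steps
    left, _ = merge_order(tree[0], steps)
    right, _ = merge_order(tree[1], steps)
    steps.append((left, right))   # align the two child alignments
    return left + right, steps


label, steps = merge_order((("x", "y"), ("w", "z")))
for a, b in steps:
    print(f"Align ({a}, {b})")
```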
Aligning three sequences [Figure: two panels, "Multidimensional Dynamic Programming" and "Progressive Alignment", with axes labeled X, Y, Z]
Sequence vs Alignment • Score at each entry adds score of aligning the column in y to the column in the alignment xz [Figure: dynamic programming matrix with the alignment rows x1…xM and z1…zL along one axis and y1…yN along the other]
Example • ith letter of y: A • jth column of xz:
   x: -
   z: A
$$D(i,j) = \max \begin{cases} D(i-1,j-1) - d + s(A, A) \\ D(i-1,j) - d - d \\ D(i,j-1) + 0 - d \end{cases}$$
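The example generalizes to a column score that sums, over the rows of the alignment, the pairwise score of the letter of y against each row entry, with gap-versus-gap contributing 0. The sketch below uses linear gap costs only; the names `column_score` and `align_sequence_to_alignment`, the default parameters, and the boundary handling are assumptions. Scores internal to the existing alignment are constant and are left out.

```python
def column_score(letter, column, m=1, s=1, d=2, gap="-"):
    """Sum-of-pairs score of one letter (or a gap) against an alignment column."""
    total = 0
    for c in column:
        if letter == gap and c == gap:
            continue              # gap against gap contributes 0
        if letter == gap or c == gap:
            total -= d
        elif letter == c:
            total += m
        else:
            total -= s
    return total


def align_sequence_to_alignment(y, rows, m=1, s=1, d=2, gap="-"):
    """Best score of aligning sequence y to a fixed alignment (linear gaps)."""
    cols = list(zip(*rows))
    all_gaps = (gap,) * len(rows)
    D = [[0] * (len(cols) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        D[i][0] = D[i - 1][0] + column_score(y[i - 1], all_gaps, m, s, d, gap)
    for j in range(1, len(cols) + 1):
        D[0][j] = D[0][j - 1] + column_score(gap, cols[j - 1], m, s, d, gap)
    for i in range(1, len(y) + 1):
        for j in range(1, len(cols) + 1):
            D[i][j] = max(
                # letter of y against column j
                D[i - 1][j - 1] + column_score(y[i - 1], cols[j - 1], m, s, d, gap),
                # letter of y against a new all-gap column
                D[i - 1][j] + column_score(y[i - 1], all_gaps, m, s, d, gap),
                # column j against a gap in y
                D[i][j - 1] + column_score(gap, cols[j - 1], m, s, d, gap),
            )
    return D[len(y)][len(cols)]


# The three terms of the slide's example, with m=1 and d=2:
print(column_score("A", ("-", "A")))   # -d + s(A, A)
print(column_score("A", ("-", "-")))   # -d - d
print(column_score("-", ("-", "A")))   # 0 - d
```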
Affine Gaps • ith letter of y matched with jth column of xz • (j-1)th column of xz gapped
   y: - A
   x: - -
   z: A A
• This induces the yx alignment:
   y: - A
   x: - -
Affine Gaps • Recall for pairwise alignment, there were three cases when determining whether a gap starts or continues a gap:
   - x        z x        z x
   y -        y -        - -
   Starts     Starts     Continues
   new gap    new gap    old gap
• When aligning a sequence and an alignment, a fourth case arises:
   - x
   - -
   Starts or continues a gap???
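The ambiguity of the fourth case can be seen by counting gap openings in an induced pairwise alignment: the same two final columns give different counts depending on what came earlier. The function name `count_gap_opens` and the sample rows are illustrative assumptions.

```python
def count_gap_opens(row_a, row_b, gap="-"):
    """Count gap openings in the pairwise alignment induced by two rows.

    Columns where both rows are gaps vanish from the induced alignment,
    so a gap run may continue across them.
    """
    opens = 0
    prev = None  # row gapped in the previous surviving column: "a", "b", or None
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue
        cur = "a" if a == gap else "b" if b == gap else None
        if cur is not None and cur != prev:
            opens += 1
        prev = cur
    return opens


# Both pairs end with the columns (-, -) then (x, -), yet they differ:
print(count_gap_opens("z-x", "---"))  # the last column continues an old gap
print(count_gap_opens("--x", "y--"))  # the last column starts a new gap
```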
Aligning Alignments • John D. Kececioglu and Weiqing Zhang, 1998 • Optimistic and pessimistic gap counts for sequence vs alignment • Exact gap counts for sequence vs alignment
Sequence vs Alignment • A = a1 … am is a sequence of length m • B is a multiple alignment of length n of k sequences • represented by a k x n matrix • each entry bij is either a letter or gap
Optimistic and Pessimistic Gap Counts • When we have
   - x
   - -
• Optimistic gap count assumes that this continues a previous gap • Pessimistic gap count assumes this starts a new gap • Running time = O(kmn)
Exact Gap Counts • Recall matrices:
D(i, j) = score of alignment a1…ai to b1…bj if ai aligns to bj
H(i, j) = score of alignment a1…ai to b1…bj if bj aligns to a gap
V(i, j) = score of alignment a1…ai to b1…bj if ai aligns to a gap
• The only ways to get
   - x
   - -
are the cases HH, HV, and HD, generalized as HX
Exact Gap Counts • Three possibilities: • … DH…HX • … VH…HX • H………HX • Is bij the first character in its row encountered during the run? • Algorithm with lots of matrices runs in $O(kn^2 + kmn + mn^2)$
Comparison • Sequence vs Alignment: only three types of paths can cause
   … - - x
   … - - -
• Alignment vs Alignment: any path can cause
   … - - x
   … - - -
Aligning Alignments Exactly • John Kececioglu and Dean Starrett, 2003 • Aligning two alignments is NP-complete • Exact algorithm • Time and space complexity • Pruning • Results
NP-Completeness • Reduction from the Maximum Cut Problem • Still NP-complete if: • Strings are of length at most 5 • Every row has at most 3 gaps • At most 1 gap in the interior of each string
Exact Algorithm • Sufficient to know relative order of the rightmost element in the row for each pair:
   x: - A
   y: - -
• If x’s rightmost element is to the right of y’s rightmost element, this is an extension • Otherwise, it is a startup
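The rule above can be written as a small check on two rows. This is a sketch of the stated rule only, not of the exact algorithm; the names `rightmost_letter` and `classify_gap`, the use of -1 for "no letter yet", and the sample rows are assumptions.

```python
def rightmost_letter(row, before, gap="-"):
    """Index of the last non-gap entry in row[:before], or -1 if none."""
    for c in range(before - 1, -1, -1):
        if row[c] != gap:
            return c
    return -1


def classify_gap(row_x, row_y, col, gap="-"):
    """Classify the gap in row_y at column col, where row_x has a letter.

    Extension if x's rightmost earlier letter lies to the right of y's;
    otherwise the gap is a startup.
    """
    if row_x[col] == gap or row_y[col] != gap:
        raise ValueError("expected a letter in row_x and a gap in row_y")
    rx = rightmost_letter(row_x, col, gap)
    ry = rightmost_letter(row_y, col, gap)
    return "extension" if rx > ry else "startup"


print(classify_gap("z-A", "---", 2))  # y has been gapped since x's letter z
print(classify_gap("--A", "y--", 2))  # y's letter is the more recent one
```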
Shapes
   A: -AGGCTATCACCTGACCTCCAGG
   B: TAG-CTATCAC--GACCGC----
   C: CAG-CTATCAC--GACCGC----
   D: CAGCCTATCACC-GAACGCCA--