Aligning Alignments Exactly. By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona. Appeared in the 8th ACM RECOMB, 2004. Presented by Jie Meng.
Background • Definition • Hardness • An Exponential time algorithm
Alignments • Given two (DNA or protein) sequences, an alignment places one above the other, inserting spaces so that similar parts line up as closely as possible, for example:
    AT-C-TCGCT
    -TG-ATG-AT
• There are four kinds of aligned columns: match, mismatch, insertion, and deletion.
Scoring Alignments • There are four types of aligned columns: • Match: Score_match = 0. • Mismatch: Score_mismatch ≥ 0. • Insertion: Score_insertion ≥ 0. • Deletion: Score_deletion ≥ 0. • The score of an alignment is the sum of the scores of its aligned columns. • The goal is to minimize the score.
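The column-sum scoring above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the example rows, the single shared `indel` cost for insertions and deletions, and the numeric costs (mismatch 1, indel 2) are all assumptions chosen for the example.

```python
GAP = "-"

def column_score(a, b, mismatch=1, indel=2):
    """Score one aligned column of a pairwise alignment (assumed costs)."""
    if a == GAP and b == GAP:
        raise ValueError("a pairwise alignment has no all-space column")
    if a == GAP or b == GAP:
        return indel          # insertion or deletion
    return 0 if a == b else mismatch

def alignment_score(row_a, row_b, mismatch=1, indel=2):
    """Sum of the column scores; lower is better."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must have the same number of columns")
    return sum(column_score(a, b, mismatch, indel)
               for a, b in zip(row_a, row_b))

# Illustrative rows: 3 matches, 2 mismatches, 5 insertions/deletions.
print(alignment_score("AT-C-TCGCT", "-TG-ATG-AT"))
```

Using one cost for both insertions and deletions keeps the sketch short; separate costs would only need one more parameter.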
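Gap-cost • We can refine the per-space score indel into a gap-open cost open and a gap-extension cost extension, so that a gap of x consecutive spaces costs open + x*extension instead of x*indel. • For example, the gap of four spaces in the first row below costs open + 4*extension:
    AT----CGCTTCAT
    -TGCAT--AT-----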
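As a small worked illustration of the gap-cost formula, the sketch below charges open + x*extension for each maximal run of x spaces in a single row. The function names and the default costs (open 2, extension 1, indel 2) are assumptions for the example only.

```python
import re

def gap_cost(x, gap_open=2, gap_ext=1):
    """Cost of one gap of x consecutive spaces: open + x * extension."""
    return gap_open + x * gap_ext if x > 0 else 0

def row_gap_cost(row, gap_open=2, gap_ext=1):
    """Total gap cost of a row: every maximal run of '-' is one gap."""
    return sum(gap_cost(len(run.group()), gap_open, gap_ext)
               for run in re.finditer(r"-+", row))

def row_linear_cost(row, indel=2):
    """Cost of the same row when each space is charged indel on its own."""
    return row.count("-") * indel

row = "AT----CGCTTCAT"                 # one gap of four spaces
print(row_gap_cost(row), row_linear_cost(row))
```

With these assumed costs the single gap of four spaces is cheaper under the open/extension model than under the per-space model, which is the usual motivation for preferring a few long gaps to many short ones.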
Multiple Alignments • In general we also need to compare more than two sequences and find their similarities. • Multiple alignment generalizes the idea of alignment to many sequences, for example:
    AT-C-TCGAT
    -TGCAT--AT
    ATCCA-CGCT
Sum-of-Pairs (SP) Score • Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the scores of the pairwise alignments induced on each pair of rows. • For the multiple alignment
    AT-C-TCGAT
    -TGCAT--AT
    ATCCA-CGCT
the SP score is the sum of the scores of the three induced pairwise alignments:
    AT-C-TCGAT       AT-C-TCGAT       -TGCAT--AT
    -TGCAT--AT   +   ATCCA-CGCT   +   ATCCA-CGCT
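The SP score can be computed directly from its definition. The sketch below is illustrative only: the helper names and the costs (mismatch 1, one shared indel cost of 2, no gap-open cost) are assumptions. Columns in which both rows of a pair hold a space are dropped before the pair is scored.

```python
from itertools import combinations

GAP = "-"

def induced(row_a, row_b):
    """Pairwise alignment induced by two rows: drop columns that are all spaces."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == GAP and b == GAP)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

def pair_score(row_a, row_b, mismatch=1, indel=2):
    """Score of the induced pairwise alignment (assumed costs, no gap-open cost)."""
    a_ind, b_ind = induced(row_a, row_b)
    total = 0
    for a, b in zip(a_ind, b_ind):
        if a == GAP or b == GAP:
            total += indel
        elif a != b:
            total += mismatch
    return total

def sp_score(rows, mismatch=1, indel=2):
    """Sum of pair_score over every unordered pair of rows."""
    return sum(pair_score(a, b, mismatch, indel)
               for a, b in combinations(rows, 2))

rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))
```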
BAD NEWS • Multiple alignment is NP-hard. • One approach is to approximate the optimal value. • Progressive alignment. • A problem arises naturally: Aligning Alignments.
Aligning Alignments • Let S be a collection of strings s1, s2, s3, …, sk over an alphabet Σ. • An alignment of S is a matrix A with k rows such that: i) each entry is either a letter or a space; ii) no column consists entirely of spaces; iii) reading across row i and removing the spaces yields the string si. • As before, there are three types of column score: match, mismatch (substitution), and insertion/deletion (a letter against a space).
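The three conditions in the definition translate into a short checker. This is a sketch under assumptions: rows are Python strings, a space is written as `-`, and the function name `is_alignment` is made up for the example.

```python
GAP = "-"

def is_alignment(matrix, strings, alphabet):
    """Check conditions i) to iii) for `matrix` being an alignment of `strings`."""
    if len(matrix) != len(strings):
        return False
    if not matrix:
        return True
    width = len(matrix[0])
    if any(len(row) != width for row in matrix):
        return False                       # not a matrix
    # i) each entry is either a letter of the alphabet or a space
    if any(ch != GAP and ch not in alphabet for row in matrix for ch in row):
        return False
    # ii) no column is all spaces
    if any(all(row[c] == GAP for row in matrix) for c in range(width)):
        return False
    # iii) row i with its spaces removed spells s_i
    return all(row.replace(GAP, "") == s for row, s in zip(matrix, strings))

A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
S = ["ATCTCGAT", "TGCATAT", "ATCCACGCT"]
print(is_alignment(A, S, set("ACGT")))
```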
Aligning Alignments • Given two alignments, A with k rows and N columns and B with l rows and M columns, we want to align the columns of A with the columns of B, for example:
    A:  AT-C-TCGAT      B:  CT-ATTGGAT
        -TGCAT--AT          -TTAT-G--T
        ATCCA-CGAT          CTTA-GGGAT
Aligning Alignments • In other words, we treat the columns of A and B as single letters, just as when aligning two sequences. • For example, aligning
    A:  CT      B:  AT
        GT          -T
        -T          GT
can give
    A:  C-T     B:  -AT
        G-T         --T
        --T         -GT
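Aligning Alignments • The score function is still sum-of-pairs, namely the sum, over each row Ai' of A and each row Bj' of B in the merged alignment, of the score of the pairwise alignment of Ai' and Bj'. • Note that the pairwise alignment of Ai' and Bj' may contain columns with a space in both sequences; we simply remove such columns before scoring:
    Ai': a----aa-a
    Bj': aaa-a-a-a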
Aligning Alignments • Without gap-open costs (every space is charged independently), aligning alignments is solvable in polynomial time. • We can apply dynamic programming just as for aligning two sequences; the only difference is that we align columns instead of letters.
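The column-level dynamic programming can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: every space costs `indel` on its own (no gap-open cost), only pairs made of one row from A and one row from B are scored (pairs inside A or inside B are unaffected by how the columns are interleaved), and the names and costs are hypothetical.

```python
GAP = "-"

def pair_cost(x, y, mismatch=1, indel=2):
    """Cost of one letter/space of A against one letter/space of B."""
    if x == GAP and y == GAP:
        return 0              # such a column vanishes from the induced pair
    if x == GAP or y == GAP:
        return indel
    return 0 if x == y else mismatch

def column_cost(col_a, col_b, mismatch=1, indel=2):
    """Sum-of-pairs cost of placing a column of A against a column of B."""
    return sum(pair_cost(x, y, mismatch, indel) for x in col_a for y in col_b)

def align_alignments(rows_a, rows_b, mismatch=1, indel=2):
    """Minimum cross-pair SP cost of aligning the columns of A and B."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    gap_a, gap_b = (GAP,) * len(rows_a), (GAP,) * len(rows_b)
    n, m = len(cols_a), len(cols_b)

    def cost(ca, cb):
        return column_cost(ca, cb, mismatch, indel)

    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # columns of A against all-space columns
        D[i][0] = D[i - 1][0] + cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):            # columns of B against all-space columns
        D[0][j] = D[0][j - 1] + cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + cost(cols_a[i - 1], cols_b[j - 1]),
                D[i - 1][j] + cost(cols_a[i - 1], gap_b),
                D[i][j - 1] + cost(gap_a, cols_b[j - 1]),
            )
    return D[n][m]

print(align_alignments(["CT", "GT", "-T"], ["AT", "-T", "GT"]))
```

The table has (N+1)(M+1) cells and each cell compares k*l pairs of entries, so this sketch runs in O(N*M*k*l) time.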
Aligning Alignments • With gap costs (open + extension), this problem is NP-complete. • The proof is a reduction from MAX-CUT. • MAX-CUT: given a graph G = (V, E) and an integer c, is there a partition of V into L and R (V = L ∪ R, L ∩ R = ∅) such that the size of the cut is at least c? • The cut is the set of edges with one endpoint in L and the other in R.
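To make the MAX-CUT decision problem concrete, here is a small sketch. The function names are hypothetical, and the decision procedure simply tries every subset L of V, which takes exponential time and is meant only to illustrate the definition on tiny graphs.

```python
from itertools import combinations

def cut_size(edges, left):
    """Number of edges with exactly one endpoint in `left` (the set L)."""
    left = set(left)
    return sum((u in left) != (v in left) for u, v in edges)

def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by brute force: is there a partition with cut size >= c?"""
    vertices = list(vertices)
    for r in range(len(vertices) + 1):
        for left in combinations(vertices, r):
            if cut_size(edges, left) >= c:
                return True
    return False

# A triangle: any partition cuts at most 2 of its 3 edges.
triangle = [(1, 2), (2, 3), (1, 3)]
print(cut_size(triangle, {1}), has_cut_of_size([1, 2, 3], triangle, 3))
```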
NP-hardness • Given an instance of MAX-CUT, a graph G = (V, E) with V = {v1, v2, …, vn} and E = {e1, e2, …, em}, and an integer c, • we construct two multiple alignments A and B over the alphabet {0,1}: both A and B have m edge rows and k dummy rows, and each edge row corresponds to an edge; A has 2n columns, with every two consecutive columns corresponding to a vertex; B has 3n columns, with every three consecutive columns corresponding to a vertex.
NP-hardness • The dummy rows of A are (0-)^n; the dummy rows of B are (0--)^n. • Edge rows of A: the row for edge e = (vi, vj) has the substring "-1" in the columns for vi and in the columns for vj, and spaces elsewhere. • Edge rows of B: the row for edge e = (vi, vj) with i < j has the substring "010" in the columns for vi and the substring "-10" in the columns for vj.
NP-hardness • We set the match score to 0, the mismatch score to 1, the gap-open cost to 2, and the gap-extension cost to 1, and ask whether there is an alignment of A and B with score less than d - c. • This gives an instance of Aligning Alignments.
HOMEWORK4 • Given a set of multiple alignments {A1, A2, …, An}, where each Ai is a multiple alignment of ki sequences, consider aligning them all at once without gap costs, using the method of this paper, i.e. aligning columns. Is this problem hard or easy? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.
Exact Algorithm • The basic idea is still dynamic programming. • We must remember extra information in the form of a set S, called a shape: for each row of the multiple alignment, we record the column of its right-most letter.
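Following the description on this slide, the shape of a (partial) alignment can be read off row by row. The sketch below is a simplified illustration: the function name, the 0-based column indices, and the use of -1 for a row that has no letter yet are assumptions, and the paper's own representation of shapes may differ.

```python
GAP = "-"

def shape(rows):
    """For each row, the 0-based column of its right-most letter (-1 if none)."""
    return tuple(
        max((c for c, ch in enumerate(row) if ch != GAP), default=-1)
        for row in rows
    )

# Rows of A followed by rows of B, after two columns have been placed.
prefix = ["C-", "G-", "--",    # rows of A
          "-A", "--", "-G"]    # rows of B
print(shape(prefix))
```

Two partial alignments with the same shape start the same gaps when the next columns are appended, which is why the dynamic program needs to remember only the shape and not the whole prefix.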
Exact Algorithm • S(i, j)=
Exact Algorithm • C(i,j,t)=min • where g(A[i], B[j], s) is the total number of gaps initiated by appending columns A[i] and B[j] to an alignment that ends in shape s.
Exact Algorithm • The optimum value is • The problem is that the number of shapes may be very large, so in the worst case the time and space complexity is exponential.
Any Questions? 423B jmeng@cs.tamu.edu