
Multiple Sequence Alignment

By Yuan Li. Many foundational problems in molecular biology are NP-hard: Multiple Sequence Alignment, Phylogeny Construction, DNA Sequencing (Shortest Common Superstring), RNA Structure Crossing Alignment, and K-means Clustering.


Presentation Transcript


  1. Multiple Sequence Alignment By Yuan Li

  2. Multiple Sequence Alignment • Many foundational problems in molecular biology are NP-hard • Multiple Sequence Alignment • Phylogeny Construction • DNA Sequencing (Shortest Common Superstring) • RNA Structure Crossing Alignment • K-means Clustering

  3. Multiple Sequence Alignment • A sequence alignment of three or more biological sequences, generally protein, DNA, or RNA • The input sequences are assumed to share a lineage and a common ancestor • From an MSA, sequence homology can be inferred and phylogenetic analysis can be conducted • An MSA can be used to assess sequence conservation of protein domains and DNA primary/secondary/tertiary structures

  4. Pairwise Alignment • Mutations: substitution, insertion, deletion • Input: two sequences, s1 and s2 • Output: the least number of mutations needed to convert s1 into s2, which is also the distance d(s1, s2) between s1 and s2 • Example: • S1 = AAGG-TGC • S2 = A-GTATCC • d(s1, s2) = 4
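
The distance on this slide can be computed with the standard dynamic program for edit distance. The sketch below assumes unit cost for each substitution, insertion, and deletion, which is consistent with the slide's example but is not stated there explicitly; the function name `edit_distance` is illustrative.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance: each substitution, insertion, or deletion costs 1."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j - 1] + (a != b),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,             # delete a from s1
                curr[j - 1] + 1,         # insert b into s1
            ))
        prev = curr
    return prev[-1]


# The slide's example with the gap symbols removed.
print(edit_distance("AAGGTGC", "AGTATCC"))  # 4
```

The table is filled row by row, so the running time is O(|s1| * |s2|) and only two rows are kept in memory.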

  5. Multiple Sequence Alignment • Input: a set of n sequences, {s1, s2, ..., sn} • Output: an n × L matrix (one row per sequence, with gaps inserted) that is optimal under a chosen criterion • Input: GTAAC, TAAC, GTAC • Output: • GTAAC • -TAAC • GTA-C • Criteria: sum-of-pairs score, star alignment, tree alignment
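
As an illustration of the sum-of-pairs criterion, the following sketch scores the alignment matrix from this slide. It assumes a simple cost model that the slides do not spell out: every column where two rows differ (including a gap against a letter) costs 1, and identical symbols cost 0. The function name `sum_of_pairs` is illustrative.

```python
from itertools import combinations


def sum_of_pairs(rows: list[str]) -> int:
    """Sum, over all pairs of rows, of the number of columns where they differ."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for a, b in combinations(rows, 2):
        # Assumed cost model: 1 per differing column, 0 otherwise.
        total += sum(x != y for x, y in zip(a, b))
    return total


alignment = ["GTAAC", "-TAAC", "GTA-C"]
print(sum_of_pairs(alignment))  # 4
```

Here rows 1 and 2 differ in one column, rows 1 and 3 in one column, and rows 2 and 3 in two columns, for a total of 4.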

  6. Star Align - Optimization • Input: a set of strings S = {s1, s2, ..., sn} • Output: an optimal string c such that the sum of the distances d(c, si), for 1 <= i <= n, is minimum.

  7. Star Align - Decision • Input: a set of strings S = {s1, s2, ..., sn} and an integer k • Question: Is there a string c such that the sum of the distances d(c, si), for 1 <= i <= n, is at most k?

  8. NPC Problem • 1) It is a decision problem • 2) It is in the set NP • Given a string c, the sum of the distances between c and every string in S can be calculated in polynomial time, which verifies the answer • 3) Vertex Cover reduces to it • Given ins(VC), an arbitrary instance of VC, construct an instance of star align, ins(SA) • Prove that ins(VC) is true iff ins(SA) is true
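
The NP-membership argument in point 2 amounts to a polynomial-time certificate check. A minimal sketch, assuming unit-cost edit distance as the distance d (the helper names `edit_distance` and `verify_star_align` are illustrative, not from the slides):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance computed row by row."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j - 1] + (a != b), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def verify_star_align(c: str, strings: list[str], k: int) -> bool:
    """Check the certificate c: is the sum of d(c, s_i) at most k?"""
    return sum(edit_distance(c, s) for s in strings) <= k


S = ["GTAAC", "TAAC", "GTAC"]
print(verify_star_align("GTAAC", S, 2))  # True: distances are 0 + 1 + 1
print(verify_star_align("GTAAC", S, 1))  # False for this particular c
```

Each distance takes time proportional to the product of the two string lengths, so the whole check is polynomial in the size of the instance and the certificate.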

  9. Reduction • Vertex Cover • A graph (V,E) • |V| = n, |E| = m • Minimum cover, v' • Star Alignment • A set of strings, S • An optimal string, c = DDCDD

  10. Construction Idea • Define three types of components • Base Component = {E,G} • Selection Component = {E,S(i,j)} • Ground Component = {G} • Construction • vertex → {E,G} • edge(Vi,Vj) → {E,S(i,j)}

  11. Definition • Padding, P • 0^s 1^s 0^s, with s >= n+1 • 0..0 1..1 0..0 • Block1, B1 (vertex position = 1) • P1P, i.e. 0..0 1..1 0..0 1 0..0 1..1 0..0 • Block0, B0 (vertex position = 0) • P0P, i.e. 0..0 1..1 0..0 0 0..0 1..1 0..0 • String for vertex i, Vi • (B0)^(i-1) B1 (B0)^(n-i)

  12. Definition • Delimiter String, D • 1111111...111111, of length |Vi| • Cover String, C • (B1|B0)^n • Base String, c = DDCDD • Enforcing String, E = DD (B1)^n DD • Ground String, G = DD (B0)^n DD • Selection String, S(i,j) = Vi D Vj
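
The definitions on slides 11 and 12 translate directly into string builders. The sketch below follows those definitions; the function names and the choice s = n + 1 (the smallest value the slide allows) are assumptions made for illustration.

```python
def padding(s: int) -> str:
    """P = 0^s 1^s 0^s."""
    return "0" * s + "1" * s + "0" * s


def block(bit: str, s: int) -> str:
    """B1 = P1P when bit is '1', B0 = P0P when bit is '0'."""
    return padding(s) + bit + padding(s)


def vertex_string(i: int, n: int, s: int) -> str:
    """Vi = (B0)^(i-1) B1 (B0)^(n-i), with vertices numbered from 1."""
    return block("0", s) * (i - 1) + block("1", s) + block("0", s) * (n - i)


def delimiter(n: int, s: int) -> str:
    """D = 1...1 with the same length as any Vi."""
    return "1" * len(vertex_string(1, n, s))


def enforcing(n: int, s: int) -> str:
    """E = DD (B1)^n DD."""
    d = delimiter(n, s)
    return d + d + block("1", s) * n + d + d


def ground(n: int, s: int) -> str:
    """G = DD (B0)^n DD."""
    d = delimiter(n, s)
    return d + d + block("0", s) * n + d + d


def selection(i: int, j: int, n: int, s: int) -> str:
    """S(i,j) = Vi D Vj."""
    return vertex_string(i, n, s) + delimiter(n, s) + vertex_string(j, n, s)


n = 3
s = n + 1  # assumed: smallest padding parameter permitted by s >= n+1
print(len(block("1", s)), len(vertex_string(2, n, s)), len(enforcing(n, s)))
```

Every block has length 6s + 1, so Vi and D have length n(6s + 1), E and G have length 5n(6s + 1), and S(i,j) has length 3n(6s + 1); all are polynomial in n.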

  13. Comparison

  14. Base Component • Base Component {E,G} • Count: n; for each vertex, construct a base component {E, G} • E = DD (B1)^n DD • G = DD (B0)^n DD • Lemma • The only optimal alignment of E and G is the direct match • If d(E,x) + d(G,x) < d(E,G) + 1, then x is a base string of the form DDCDD.

  15. Selection Component • Selection Component {E, S(i,j)} • Count: m; for each edge(vi,vj), construct a selection component {E, S(i,j)} • E = DD (B1)^n DD, votes 1 in all vertex positions. • S(i,j) = Vi D Vj, votes 0 in all vertex positions except i or j, so that either vertex i or vertex j is part of the vertex cover • The two alignments of S(i,j) against c = DDCDD: • D D | C | D D • Vi D | Vj | • | Vi | D Vj

  16. Ground Component • Ground Component {G} • Count: 1; only one ground component is constructed • G = DD (B0)^n DD • c = DDCDD • d(G, c) aligns the vertex positions of G (all 0) against the vertex positions of c (each 0 or 1) • ....0....0...0...0...0...0... • ....?....?...?...?...?...?... • G penalizes each 1 in the vertex positions, so that the sum of d(c, si) is minimum <--> the size of the vertex cover v' is minimum.
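
The penalty that G imposes can be checked on a small case. Under the direct, gap-free alignment, G and a base string c = DDCDD differ exactly in the vertex positions where C contains a 1, so the mismatch count equals the size of the chosen vertex set. The sketch below assumes s = n + 1 and uses illustrative helper names; it compares position by position rather than computing a full edit distance.

```python
def blocks(n: int) -> tuple[str, str, str]:
    """Return (B0, B1, D) for n vertices, assuming s = n + 1."""
    s = n + 1
    pad = "0" * s + "1" * s + "0" * s
    b0, b1 = pad + "0" + pad, pad + "1" + pad
    return b0, b1, "1" * (n * len(b0))


def ground_string(n: int) -> str:
    """G = DD (B0)^n DD."""
    b0, _, d = blocks(n)
    return d + d + b0 * n + d + d


def base_string(n: int, cover: set[int]) -> str:
    """c = DDCDD, where C has B1 at the vertices in `cover` and B0 elsewhere."""
    b0, b1, d = blocks(n)
    c_part = "".join(b1 if i in cover else b0 for i in range(1, n + 1))
    return d + d + c_part + d + d


def mismatches(a: str, b: str) -> int:
    """Number of differing positions under the direct (gap-free) alignment."""
    if len(a) != len(b):
        raise ValueError("direct alignment needs equal lengths")
    return sum(x != y for x, y in zip(a, b))


print(mismatches(ground_string(4), base_string(4, {1, 3})))  # 2
```

Adding a vertex to the set flips one more vertex position from 0 to 1 and raises the count by one, which is the sense in which G favors small covers.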

  17. Component • Base component {E,G} → c = DDCDD • Selection component {E, S(i,j)} → c <--> vertex cover • Ground component {G} → minimum cover
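
Putting the three component types together, the star-align instance for a graph can be assembled as below. This is a sketch of the construction as the slides describe it (n base components, m selection components, one ground component), with s = n + 1 assumed and `build_instance` an illustrative name; it does not compute the threshold k, which the slides do not give.

```python
def build_instance(n: int, edges: list[tuple[int, int]]) -> list[str]:
    """Return the multiset of strings for a graph on vertices 1..n."""
    s = n + 1  # assumed padding parameter, the smallest allowed by s >= n+1
    pad = "0" * s + "1" * s + "0" * s
    b0, b1 = pad + "0" + pad, pad + "1" + pad

    def v(i: int) -> str:
        # Vi = (B0)^(i-1) B1 (B0)^(n-i)
        return b0 * (i - 1) + b1 + b0 * (n - i)

    d = "1" * len(v(1))          # delimiter D, same length as Vi
    e = d + d + b1 * n + d + d   # enforcing string E
    g = d + d + b0 * n + d + d   # ground string G

    strings: list[str] = []
    for _ in range(n):           # one base component {E, G} per vertex
        strings += [e, g]
    for i, j in edges:           # one selection component {E, S(i,j)} per edge
        strings += [e, v(i) + d + v(j)]
    strings.append(g)            # the single ground component {G}
    return strings


triangle = build_instance(3, [(1, 2), (2, 3), (1, 3)])
print(len(triangle))  # 13 strings: 2n + 2m + 1 with n = m = 3
```

The instance contains 2n + 2m + 1 strings, each of length O(n^2), so the construction runs in polynomial time.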

  18. Conclusion • Vertex Cover is an NP-complete problem • Vertex Cover can be transformed to Star Alignment in polynomial time • Therefore Star Alignment, which is in NP, is also an NP-complete problem

  19. Reference • Isaac Elias, "Settling the intractability of multiple alignment," in Proc. of the 14th Ann. Int. Symp. on Algorithms and Computation (ISAAC), 2003, pp. 352-363
