DNA Sequencing Problem Lin Zhou
Contents • Background Information • Formal Definition • NP-Completeness • Conclusions
Background Information • DNA sequences are too long to be sequenced in one pass (billions of base pairs) • Shear the DNA into millions of small fragments • Read 500–700 nucleotides at a time from the small fragments (Sanger method)
Background Information • Task: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)
Formal Definition • Shortest Common Superstring (SCSS) • Given: Strings s1, s2, …, sn • Question: Find a string T that contains all strings s1, s2, …, sn as substrings, such that the length of T is minimized
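To make the definition concrete, here is a minimal brute-force sketch for very small inputs. It is not an algorithm from the slides: the function names `overlap` and `shortest_common_superstring` are made up for illustration, and the approach (try every ordering and merge neighbours at their longest overlap) takes factorial time, so it only serves to illustrate what "shortest superstring" means.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(strings):
    """Exhaustive search; only practical for a handful of short strings."""
    unique = list(dict.fromkeys(strings))  # drop duplicates, keep order
    # A string contained in another one never needs its own slot.
    core = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not core:
        return ""
    best = None
    for order in permutations(core):
        merged = order[0]
        for s in order[1:]:
            merged += s[overlap(merged, s):]  # append only the new part
        if best is None or len(merged) < len(best):
            best = merged
    return best


reads = ["ACGT", "GTCA", "CATT"]  # toy reads, invented for this example
print(shortest_common_superstring(reads))
```

The three toy reads overlap as ACGT/GTCA/CATT, so the search merges them in that order.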
Formal Definition as Decision Problem • Shortest Common Superstring (SCSS) • Given: Set of strings S = {s1, s2, …, sn} and integer k • Question: Does there exist a string T such that every si ∈ S is a substring of T, and |T| ≤ k?
NP-Completeness • To show NP-Completeness, we need to show that • the SCSS problem is in NP • another NP-complete problem is reducible to the SCSS problem in polynomial time
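NP-Completeness • SCSS Problem is in NP • To show: a proposed solution to SCSS can be verified by a deterministic machine in polynomial time • Given a candidate superstring T for a “yes” instance, we can check that |T| is within the bound k and that T contains every si as a substring; each substring check takes time linear in |T| + |si| with a linear-time string-matching algorithm, so the whole check is polynomial. • Thus the problem is in NP.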
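The membership argument can be sketched as a certificate checker. This is an illustrative sketch, not code from the slides: the name `verify_superstring` is hypothetical, and the bound is read as "length at most k", which is an assumption about how the decision problem is stated.

```python
def verify_superstring(strings, k, candidate):
    """Check a proposed superstring (the certificate) for an SCSS instance.

    Assumption: the length bound is interpreted as len(candidate) <= k.
    """
    if len(candidate) > k:
        return False
    # One substring test per input string; the number of tests is linear
    # in the number of strings, so the total work is polynomial.
    return all(s in candidate for s in strings)


reads = ["ACGT", "GTCA", "CATT"]  # toy reads, invented for this example
print(verify_superstring(reads, 8, "ACGTCATT"))
print(verify_superstring(reads, 7, "ACGTCATT"))
```

The checker only confirms a given certificate; it says nothing about how to find one, which is the hard part.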
NP-Completeness • Another NP-C problem is reducible to SCSS in polynomial time • Which problem to choose? • Minimum Set Cover (MSC) • Hamiltonian Path (HP)
MSC → SCSS? • Attempt: generate a string from each set? • SCSS only exploits prefix/suffix overlaps between strings • A set cover can match an element anywhere in a set (in fact, the elements of a set have no order) • “No” instances of SCSS constructed from MSC may not correspond to “No” instances of MSC • Not feasible
HP → SCSS • Hamiltonian Path Problem (Directed): • Given: Graph G = (V, E), where V = {v1, v2, …, vn} is the set of vertices and E = {e1, e2, …, em} is the set of directed edges. • Question: Is there a path that visits every vertex in V exactly once? • How to transform?
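For reference, the directed Hamiltonian Path question can be stated as a tiny exhaustive check. This is only a sketch to pin down the definition: `has_hamiltonian_path` is a hypothetical helper, and trying all vertex orderings takes factorial time.

```python
from itertools import permutations


def has_hamiltonian_path(vertices, edges):
    """True if some ordering of all vertices follows directed edges only."""
    edge_set = set(edges)
    return any(
        all((a, b) in edge_set for a, b in zip(order, order[1:]))
        for order in permutations(vertices)
    )


# Toy graphs, invented for this example.
print(has_hamiltonian_path([1, 2, 3], [(1, 2), (2, 3)]))
print(has_hamiltonian_path([1, 2, 3], [(1, 2), (1, 3)]))
```

The first toy graph contains the path 1, 2, 3; the second has no edge leaving vertex 2 or vertex 3, so no ordering works.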
HP → SCSS • For each vertex vi in V: • For each outgoing edge and its end vertex (ej, vj), • we generate the strings v'i vj v'i and vj v'i vj+1, where v'i is the complement of vi and vj+1 is the end vertex of the next outgoing edge ej+1 (wrapping around to the first edge after the last). • E.g., for a vertex v1 with edges (e3, v3), (e4, v4), (e5, v5) we need to generate: • v'1v3v'1, v3v'1v4, v'1v4v'1, v4v'1v5, v'1v5v'1, v5v'1v3 • Create connector strings vi#v'i. • For the start and end vertices v1 and vn, we create ^#v1 and vn#$ • There is a HP iff there is a superstring of length 2m+3n
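The per-vertex string generation can be sketched as follows. This is an illustrative sketch under stated assumptions: each symbol (such as v3 or v'1) is modelled as one token in a tuple, the "next" edge wraps around cyclically as in the slide's example, and `vertex_strings` is a hypothetical name. Connector and start/end strings are left out.

```python
def vertex_strings(i, successors):
    """Strings generated for vertex v_i, given the end vertices of its
    outgoing edges in a fixed order. Each string is a tuple of 3 symbols."""
    bar = f"v'{i}"  # the complement symbol of v_i
    strings = []
    for idx, j in enumerate(successors):
        nxt = successors[(idx + 1) % len(successors)]  # cyclic "next edge"
        strings.append((bar, f"v{j}", bar))          # v'i vj v'i
        strings.append((f"v{j}", bar, f"v{nxt}"))    # vj v'i vj+1
    return strings


# The slide's example: v1 with outgoing edges to v3, v4, v5.
for s in vertex_strings(1, [3, 4, 5]):
    print("".join(s))
```

A vertex with x outgoing edges yields 2x strings, each three symbols long.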
HP → SCSS • Given v'1v3v'1, v3v'1v4, v'1v4v'1, v4v'1v5, v'1v5v'1, v5v'1v3 • For v1 to connect to its successor in the Hamiltonian Path: • v3: v1#v'1v3v'1v4v'1v5v'1v3#v'3… • v4: v1#v'1v4v'1v5v'1v3v'1v4#v'4… • v5: v1#v'1v5v'1v3v'1v4v'1v5#v'5… • This configuration forces any vertex with x outgoing edges to contribute a substring of length 2x+2 that starts with the vertex's complement (v'1 here) and ends with one neighboring vertex, which is the next vertex on the Hamiltonian Path.
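The rotation behind these three cases can be sketched in a few lines. This is a sketch under assumptions: symbols are modelled as list tokens, `vertex_block` is a hypothetical helper, and only the block between the two '#' connectors is built.

```python
def vertex_block(i, successors, target):
    """Block of symbols for vertex v_i that ends in v_target, obtained by
    starting the cyclic chain of strings at the edge leading to target."""
    k = successors.index(target)
    order = successors[k:] + successors[:k]  # rotate so target comes first
    bar = f"v'{i}"
    block = []
    for j in order:
        block += [bar, f"v{j}"]
    block += [bar, f"v{target}"]  # the final string closes the cycle
    return block


def contains(block, triple):
    """True if the 3-symbol string occurs contiguously in the block."""
    return any(tuple(block[p:p + 3]) == triple for p in range(len(block) - 2))


block = vertex_block(1, [3, 4, 5], 4)
print("".join(block), len(block))
```

For x = 3 outgoing edges the block has 2x+2 = 8 symbols, and every one of the six generated strings appears in it as a contiguous run.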
HP → SCSS • v1#v'1v3v'1v4v'1v5v'1v3#v'3… • If we also count the leading ‘v1’ and the ‘#’, then on average each vertex needs 2x+3 characters in the superstring, where x is the number of outgoing edges from that vertex. • Summing over all vertices, the out-degrees add up to m, so with m edges and n vertices the resulting shortest superstring has 2m+3n characters.
HP → SCSS • If there is a superstring of length 2m+3n, the definition of a superstring guarantees that each vertex is visited at least once. • If any vertex were visited more than once, the length would be > 2m+3n. Thus the path visits each vertex exactly once. • Conversely, if there is a HP in the graph, we can generate the superstring by tracing the path, and the generated superstring has length 2m+3n.
Thank you! Questions and answers