
On the k-Closest Substring and k-Consensus Pattern Problems

Yishan Jiao, Jingyi Xu (Institute of Computing Technology, Chinese Academy of Sciences); Ming Li (University of Waterloo). July 5, 2004.



  1. On the k-Closest Substring and k-Consensus Pattern Problems Yishan Jiao, Jingyi Xu Institute of Computing Technology Chinese Academy of Sciences Ming Li University of Waterloo July 5, 2004

  2. Outline • Motivation & background • Our contributions • A PTAS for the k-Closest Substring Problem • The NP-hardness of (2-ε)-approximation of the HRC problem • A PTAS for the k-Consensus Pattern Problem • Conclusion

  3. Motivation • Given n protein sequences, find each “conserved” region separately: • Red/blue regions are different conserved regions, or motifs. • They don’t have to be exactly the same. • They match with higher scores than other regions. • [Figure: N sequences with conserved regions of length L marked in red and blue]

  4. Focused problem • k-Closest Substring Problem (k-CSS) • A special case when k = 2
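
The formal statement of k-CSS did not survive on this slide. The sketch below assumes the usual formulation: given n strings and a length L, choose k center strings of length L so that every input string has some length-L substring close to some center, and minimize the largest such Hamming distance. The function names (`kcss_radius`, `brute_force_kcss`) are illustrative, and the exhaustive search is only a reference for tiny instances, not the PTAS discussed in the talk.

```python
from itertools import combinations_with_replacement, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def substrings(s, L):
    """All length-L substrings of s (assumes len(s) >= L)."""
    return [s[i:i + L] for i in range(len(s) - L + 1)]


def kcss_radius(strings, centers, L):
    """Assumed k-CSS objective: for each input string take its best
    (substring, center) pair, then report the worst case over all strings."""
    return max(
        min(hamming(c, t) for c in centers for t in substrings(s, L))
        for s in strings
    )


def brute_force_kcss(strings, k, L, alphabet):
    """Try every multiset of k centers over the alphabet.
    Exponential in L and k; usable only on toy inputs."""
    candidates = ["".join(p) for p in product(alphabet, repeat=L)]
    best_radius, best_centers = None, None
    for centers in combinations_with_replacement(candidates, k):
        r = kcss_radius(strings, centers, L)
        if best_radius is None or r < best_radius:
            best_radius, best_centers = r, centers
    return best_radius, best_centers
```

For example, `brute_force_kcss(["aab", "bba"], 2, 2, "ab")` finds two centers with radius 0, while a single center `"aa"` leaves radius 1 on the same input.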

  5. 2-KCSS • [Figure: a set S of strings, each with a marked substring of length L]

  6. Related work • The k-Closest Substring problem specializes to the Closest Substring problem when k = 1 and to the Hamming Radius k-clustering problem (HRC) when L = m. • The Closest Substring problem specializes to the Closest String problem when L = m. • HRC is the Hamming counterpart of the geometric k-center problem. • Closest Substring problem: a PTAS; M. Li et al., JACM 49(2):157-171, 2002. • Hamming Radius O(1)-clustering problem (O(1)-HRC): an RPTAS; doctoral dissertation, J. Jansson, 2003.

  7. Outline • Motivation & background • Our contributions • A PTAS for the k-Closest Substring Problem • The NP-hardness of (2-ε)-approximation of the HRC problem • A PTAS for the k-Consensus Pattern Problem • Conclusion

  8. The PTAS for k-CSS • Difficulties: • How to choose n closest substrings? • How to partition the strings into k sets accordingly? • Method: • Extend the random sampling strategy of [M. Li et al., JACM 49(2):157-171, 2002]. • Construct a function h to approximate the Hamming distance. • Result: • A PTAS for O(1)-CSS.

  9. P-Q decomposition • [Figure: the L positions divided into P and Q, with R marked]

  10. P-Q decomposition

  11. Random sampling strategy • R1 (R2): randomly pick O(log(mn)) positions from P1 (P2)

  12. Random sampling strategy • h approximates the Hamming distance well.
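
The exact definition of h is not preserved in the transcript. As a generic illustration of the sampling idea only, the sketch below estimates the Hamming distance restricted to a position set by counting mismatches on a small uniform sample and rescaling. The name `sampled_distance`, the sampling with replacement, and the rescaling step are assumptions made for this example, not the construction used in the talk.

```python
import random


def hamming_on(a, b, positions):
    """Mismatches between a and b counted only at the given positions."""
    return sum(a[i] != b[i] for i in positions)


def sampled_distance(a, b, positions, sample_size, rng):
    """Estimate hamming_on(a, b, positions) from a uniform random sample.

    Positions are drawn with replacement (an assumption of this sketch),
    and the sample mismatch count is scaled up to the full position set.
    """
    positions = list(positions)
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not positions:
        return 0.0
    sample = [rng.choice(positions) for _ in range(sample_size)]
    return hamming_on(a, b, sample) * len(positions) / sample_size
```

A call such as `sampled_distance(s, t, P, sample_size, random.Random(0))` returns an unbiased estimate of the distance on P; the slide's choice of O(log(mn)) sampled positions is what keeps the number of candidate samples polynomial.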

  13. Scheme of PTAS

  14. Scheme of PTAS 5. Get final approximating center strings

  15. Sum up:

  16. Outline • Motivation & background • Our contributions • A PTAS for the k-Closest Substring Problem • The NP-hardness of (2-ε)-approximation of the HRC problem • A PTAS for the k-Consensus Pattern Problem • Conclusion

  17. The NP-hardness of (2-ε)-approximation of the HRC problem • Main ideas: • Take any instance G=(V,E) of the Vertex Cover Problem, with |V|=n and |E|=m'. • Construct an instance <S,k> of the Hamming radius k-clustering problem that has a k-clustering with maximum cluster radius not exceeding 2 if and only if G has a vertex cover with k-m' vertices.

  18. Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.

  19. We can prove: • Given k  2m', k-m' vertices in V can cover E if and only if there is a k-clustering of S with maximum cluster radius equal to 2. • If there were a polynomial-time algorithm for the Hamming radius k-clustering problem with an approximation factor less than 2, the exact vertex cover number of any instance G could be computed in polynomial time. • This is a contradiction unless P = NP.

  20. Outline • Motivation & background • Our contributions • A PTAS for the k-Closest Substring Problem • The NP-hardness of (2-ε)-approximation of the HRC problem • A PTAS for the k-Consensus Pattern Problem • Conclusion

  21. Conclusion • A nice combination of a combinatorial argument (the P-Q decomposition) with the random sampling strategy to solve the k-CSS problem. • An alternative and direct proof of the NP-hardness of (2-ε)-approximation of the HRC problem.

  22. Contact Us • Authors • Yishan Jiao, Jingyi Xu : {jys,xjy}@ict.ac.cn • Bioinformatics lab, Institute of Computing Technology, Chinese Academy of Sciences • Ming Li: mli@uwaterloo.ca • University of Waterloo

  23. Thank You!

  24. Outline • Motivation & background • Our contributions • The PTAS for the k-Closest Substring Problem • The NP-hardness of (2-ε)-approximation of the HRC problem • The PTAS for the k-Consensus Pattern Problem • Conclusion

  25. Deterministic PTAS for O(1)-Consensus Pattern problem (1) • k-Consensus Pattern problem • Most related work: • The Hamming O(1)-median clustering problem • This is the O(1)-Consensus Pattern problem when L = m. • An RPTAS; R. Ostrovsky et al., JACM 49(2):139-156, 2002 • The Consensus Pattern problem • This is the k-Consensus Pattern problem when k = 1. • A PTAS; M. Li et al., STOC’99. • We give a deterministic PTAS for the O(1)-Consensus Pattern Problem, with proof.

  26. DPTAS for O(1)-CP 1 • Outline: 1.Suppose in the optimal solution: ({c1,c2}, {t1,t2,…,tn}, {C1,C2}) C1,C2: instances of Consensus Pattern problem 2.Trying all possibilities, get and satisfying Lemma 3 in M.Li et al., STOC’99.

  27. DPTAS for O(1)-CP 2 • Outline: • 3. Get c1’,c2’ • c1’: the column-wise majority string of • c2’: the column-wise majority string of • 4.Partition each into C1’,C2’ as follows: • otherwise • 5.Get closest substrings (tl’) in T1’,T2’ satisfying

  28. DPTAS for O(1)-CP 3 • Outline: • 6. Get a good approximate solution, where c1”, c2” are the column-wise majority strings of all strings in T1’, T2’ respectively. • 7. Conclusion: • Output a solution in polynomial time with total cost at most
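
The outline relies on the column-wise majority string of a set of equal-length strings. A minimal sketch of that operation, together with the sum-of-distances cost it minimizes, is given below. The helper names are illustrative, and ties are broken in favor of the symbol seen first in a column, which is an assumption of this sketch rather than something the slides specify.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def majority_string(strings):
    """Column-wise majority string of a non-empty list of equal-length strings.

    Each output position holds the most frequent symbol in that column;
    ties go to the symbol that appears first in the column.
    """
    if not strings:
        raise ValueError("need at least one string")
    if len({len(s) for s in strings}) != 1:
        raise ValueError("strings must have equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))


def consensus_cost(center, strings):
    """Sum of Hamming distances from the center to every string."""
    return sum(hamming(center, t) for t in strings)
```

For a fixed set of substrings, the majority string minimizes `consensus_cost` column by column, which is why steps 3 and 6 recompute it after each re-partition. On `["acgt", "acga", "tcga"]` it yields `"acga"` with cost 2.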

  29. PTAS for 2-Consensus Pattern problem

  30. Definition of PTAS • A family of approximation algorithms {A_ε}_{ε>0} for a problem P is called a polynomial-time approximation scheme (PTAS) if each algorithm A_ε is a (1+ε)-approximation algorithm and its running time is polynomial in the size of the input for every fixed ε.

  31. Vertex-cover problem • Vertex cover: given an undirected graph G=(V,E), a subset V' ⊆ V such that if (u,v) ∈ E, then u ∈ V' or v ∈ V' (or both). • Size of a vertex cover: the number of vertices in it. • Vertex-cover problem: find a vertex cover of minimum size.
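
The definition above can be checked directly in code. The following sketch tests whether a vertex set is a cover and finds a minimum cover by exhaustive search; the function names are illustrative, and the search is exponential, so it is meant only to make the definition concrete on small graphs.

```python
from itertools import combinations


def is_vertex_cover(edges, cover):
    """True if every edge (u, v) has at least one endpoint in cover."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def min_vertex_cover(vertices, edges):
    """Smallest vertex cover by trying subsets in order of increasing size.

    Exponential time; intended only for tiny graphs.
    """
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, candidate):
                return set(candidate)
    return set(vertices)
```

On the path 1-2-3, for instance, `{2}` is a cover of size 1 while `{1}` is not a cover, because the edge (2, 3) is left uncovered.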

  32. Vertex-cover problem • The vertex-cover problem is NP-complete (see section 34.5.2). • Vertex-cover belongs to NP. • Vertex-cover is NP-hard (CLIQUE ≤_P VERTEX-COVER). • Reduce a CLIQUE instance <G,k>, where G=<V,E>, to the vertex-cover instance <G',|V|-k>, where G'=<V,E'> and E'={(u,v): u,v ∈ V, u ≠ v and (u,v) ∉ E}. • So we look for an approximation algorithm.
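
The reduction on this slide maps a CLIQUE instance to a vertex-cover instance on the complement graph. A short sketch follows; the function names are illustrative, and edges are treated as unordered pairs of distinct vertices.

```python
from itertools import combinations


def complement_edges(vertices, edges):
    """Edges of the complement graph: all pairs u != v not joined in G."""
    present = {frozenset(e) for e in edges}
    return [
        (u, v)
        for u, v in combinations(vertices, 2)
        if frozenset((u, v)) not in present
    ]


def clique_to_vertex_cover(vertices, edges, k):
    """Map the CLIQUE instance <G, k> to the vertex-cover instance <G', |V| - k>."""
    return vertices, complement_edges(vertices, edges), len(vertices) - k
```

G has a clique of size k exactly when the complement graph has a vertex cover of size |V| - k: the vertices outside the clique touch every complement edge. With a triangle on {1, 2, 3} plus the edge (3, 4) and k = 3, the complement has edges (1, 4) and (2, 4), and the single vertex 4 covers both.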

  33. Conclusion for the approximation solution • Outline • Get a good approximate solution where • 10. Conclusion: • Outputs (c1”, c2”) in polynomial time, satisfying with high probability: • Can be derandomized by the standard method [MR95]. • Extension to the k=O(1) case: trivial

  34. PTAS for 2-CSS

  35. Notation

  36. P-Q decomposition • [Figure: the L positions divided into P and Q, with R marked]
