On the k-Closest Substring and k-Consensus Pattern Problems. Yishan Jiao, Jingyi Xu (Institute of Computing Technology, Chinese Academy of Sciences); Ming Li (University of Waterloo). July 5, 2004.
Outline • Motivation & background • Our contributions • A PTAS for the k-Closest Substring problem • The NP-hardness of (2-ε)-approximation of the HRC problem • A PTAS for the k-Consensus Pattern problem • Conclusion
Motivation • Given n protein sequences, find the “conserved” regions in each of them: • Red/blue regions are different conserved regions, or motifs. • They don’t have to be exactly the same. • They match with higher scores than other regions. [Figure: n sequences, each with conserved regions of length L highlighted in red and blue]
Focused problem • The k-Closest Substring problem (k-CSS) • A special case: k = 2
[Figure: the 2-CSS special case, showing a set S of strings with substrings of length L marked]
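The formal definition on this slide was a figure that did not survive. As a hedged sketch, assuming the usual formulation (given a set of strings and a length L, choose k center strings of length L so that the largest distance from any string's best-matching length-L substring to its nearest center is as small as possible), the objective value of a candidate solution could be evaluated like this. The function names are illustrative and do not come from the talk.

```python
def hamming(a, b):
    """Number of positions where the equal-length strings a and b differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def substrings(s, length):
    """All substrings of s with the given length."""
    return [s[i:i + length] for i in range(len(s) - length + 1)]


def kcss_radius(strings, centers):
    """Objective value of a candidate k-CSS solution (assumed formulation).

    Each string picks its best length-L substring and its nearest center;
    the radius is the worst such distance over all strings.
    """
    length = len(centers[0])
    return max(
        min(hamming(c, t) for c in centers for t in substrings(s, length))
        for s in strings
    )
```

With k = 1 this reduces to the Closest Substring objective; adding a second center can only lower (or keep) the radius.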
Related work • The k-Closest Substring problem becomes the Closest Substring problem when k = 1, and the Hamming Radius k-clustering problem (HRC) when L = m. • The Closest Substring problem with L = m is the Closest String problem. • HRC is the Hamming counterpart of the geometric k-center problem. • Closest Substring problem: • A PTAS; M. Li et al., JACM 49(2):157-171, 2002 • Hamming Radius O(1)-clustering problem (O(1)-HRC): • An RPTAS for the Hamming Radius O(1)-clustering problem; doctoral dissertation, J. Jansson, 2003.
The PTAS for k-CSS • Difficulties: • How to choose the n closest substrings? • How to partition the strings into k sets accordingly? • Method: • Extend the random sampling strategy of [M. Li et al., JACM 49(2):157-171, 2002]. • Construct a function h to approximate the Hamming distance. • Result: • A PTAS for O(1)-CSS, i.e. k-CSS with k = O(1).
[Figure: P-Q decomposition of the L positions, with regions labeled P, Q and R]
Random sampling strategy • R1 (R2): randomly pick O(log(mn)) positions from P1 (P2).
Random sampling strategy • h approximates the Hamming distance well.
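The exact definition of h and the sampling sizes are not recoverable from the slides. The sketch below only illustrates the general idea behind such a strategy: the Hamming distance restricted to a set of positions can be estimated from a small random sample of those positions and scaled back up. The function name and parameters are assumptions made for illustration.

```python
import random


def sampled_hamming(a, b, positions, sample_size, rng=random):
    """Estimate the Hamming distance of a and b restricted to `positions`.

    Draws `sample_size` positions without replacement, counts mismatches
    there, and scales the count up to the full position set.
    """
    sample = rng.sample(positions, sample_size)
    mismatches = sum(a[i] != b[i] for i in sample)
    return mismatches * len(positions) / sample_size
```

When the sample covers every position the estimate is exact; smaller samples trade accuracy for speed, which is why the sample size has to grow with log(mn) for the estimate to hold for all candidate substrings at once.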
Scheme of the PTAS • 5. Get the final approximate center strings.
The NP-hardness of (2-ε)-approximation of the HRC problem • Main ideas: • Given any instance G=(V,E) of the Vertex Cover problem, with |V|=n and |E|=m', • construct an instance <S,k> of the Hamming radius k-clustering problem that has a k-clustering with maximum cluster radius not exceeding 2 if and only if G has a vertex cover with k-m' vertices.
Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.
We can prove: • Given k ≤ 2m', k-m' vertices in V can cover E if and only if there is a k-clustering of S with maximum cluster radius equal to 2. • If there were a polynomial-time algorithm for the Hamming radius k-clustering problem with an approximation factor less than 2, the exact vertex cover number of any instance G could be computed in polynomial time. • This is a contradiction unless P = NP.
Conclusion • A nice combination of a combinatorial argument (P-Q decomposition) with the random sampling strategy for solving the k-CSS problem. • An alternative and direct proof of the NP-hardness of (2-ε)-approximation of the HRC problem.
Contact Us • Authors • Yishan Jiao, Jingyi Xu : {jys,xjy}@ict.ac.cn • Bioinformatics lab, Institute of Computing Technology, Chinese Academy of Sciences • Ming Li: mli@uwaterloo.ca • University of Waterloo
Deterministic PTAS for the O(1)-Consensus Pattern problem (1) • k-Consensus Pattern problem • Most related work: • The Hamming O(1)-median clustering problem • The O(1)-Consensus Pattern problem when L = m. • An RPTAS; R. Ostrovsky et al., JACM 49(2):139-156, 2002 • The Consensus Pattern problem • The k-Consensus Pattern problem when k = 1. • A PTAS; M. Li et al., STOC’99. • We give a deterministic PTAS for the O(1)-Consensus Pattern problem, together with a proof.
DPTAS for O(1)-CP (1) • Outline: • 1. Suppose the optimal solution is ({c1,c2}, {t1,t2,…,tn}, {C1,C2}), where C1 and C2 are instances of the Consensus Pattern problem. • 2. Trying all possibilities, get … and … satisfying Lemma 3 in M. Li et al., STOC’99.
DPTAS for O(1)-CP (2) • Outline: • 3. Get c1’, c2’ • c1’: the column-wise majority string of … • c2’: the column-wise majority string of … • 4. Partition each … into C1’, C2’ as follows: … • … otherwise • 5. Get closest substrings (tl’) in T1’, T2’ satisfying …
DPTAS for O(1)-CP (3) • Outline: • 6. Get a good approximate solution …, where c1” and c2” are the column-wise majority strings of all strings in T1’ and T2’, respectively. • 7. Conclusion: • Output a solution in polynomial time with total cost at most …
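The outline relies repeatedly on the column-wise majority string of a set of equal-length strings. A minimal sketch of that building block follows, together with the sum-of-distances cost it minimizes for a fixed set of substrings. The function names are illustrative, and the sum-of-Hamming-distances cost is assumed here as the Consensus Pattern objective because the slide's own formula was lost.

```python
from collections import Counter


def majority_string(strings):
    """Column-wise majority string of equal-length strings.

    For each column, pick a most frequent letter (ties broken arbitrarily).
    """
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*strings)
    )


def consensus_cost(strings, center):
    """Sum of Hamming distances from each string to the center."""
    return sum(
        sum(x != y for x, y in zip(s, center)) for s in strings
    )
```

For a fixed set of substrings, the majority string minimizes this cost column by column, which is why steps 3 and 6 can recompute centers this way once the substrings and their partition are chosen.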
Definition of PTAS • A family of approximation algorithms {A_ε}, ε > 0, for a problem P is called a polynomial-time approximation scheme (PTAS) if each algorithm A_ε is a (1+ε)-approximation algorithm and its running time is polynomial in the size of the input for any fixed ε.
Vertex-cover problem • Vertex cover: given an undirected graph G=(V,E), a subset V' ⊆ V such that if (u,v) ∈ E, then u ∈ V' or v ∈ V' (or both). • Size of a vertex cover: the number of vertices in it. • Vertex-cover problem: find a vertex cover of minimum size.
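The definition above translates directly into a check, and a brute-force search makes the optimization version concrete. This is a minimal sketch for small graphs only; the exhaustive search is exponential and is not meant as a practical algorithm.

```python
from itertools import combinations


def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def min_vertex_cover(vertices, edges):
    """Smallest vertex cover, found by trying subsets in order of size."""
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, candidate):
                return set(candidate)
```

On the path 1-2-3-4, for example, no single vertex touches all three edges, so the minimum cover has two vertices.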
Vertex-cover problem • The vertex-cover problem is NP-complete. (See section 34.5.2.) • Vertex-cover belongs to NP. • Vertex-cover is NP-hard (CLIQUE ≤_P VERTEX-COVER). • Reduce a CLIQUE instance <G,k>, where G=<V,E>, to the vertex-cover instance <G',|V|-k>, where G'=<V,E'> and E'={(u,v): u,v ∈ V, u ≠ v and (u,v) ∉ E}. • So we look for an approximation algorithm.
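The reduction on this slide builds the complement graph and changes the target size from k to |V|-k. A short sketch of that construction is given below; the function name and the edge-list representation are choices made for illustration.

```python
from itertools import combinations


def clique_to_vertex_cover(vertices, edges, k):
    """Map a CLIQUE instance <G, k> to a vertex-cover instance <G', |V| - k>.

    G' has the same vertices as G and exactly the edges that G lacks.
    """
    present = {frozenset(edge) for edge in edges}
    complement_edges = [
        (u, v)
        for u, v in combinations(vertices, 2)
        if frozenset((u, v)) not in present
    ]
    return vertices, complement_edges, len(vertices) - k
```

G has a clique of size k exactly when G' has a vertex cover of size |V|-k: the vertices outside the clique cover every edge of G', because no complement edge joins two clique vertices.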
Conclusion for the approximation solution • Outline • Get a good approximate solution … • 10. Conclusion: • Output (c1”, c2”) in polynomial time, satisfying … with high probability. • Can be derandomized by standard methods [MR95]. • Extension to the k = O(1) case: trivial.