Longest Common Rigid Subsequence Bin Ma and Kaizhong Zhang Department of Computer Science University of Western Ontario Ontario, Canada.
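(Rigid) Subsequence • Subsequence: CPM is a subsequence of COMBINATORIALPATTERNMATCHING. • Rigid Subsequence: a subsequence together with the distances between its consecutive letters. Numbering the positions of COMBINATORIALPATTERNMATCHING as 0123456789012345678901234567, the letters C, P, M sit at positions 0, 13, 20, which gives the rigid subsequence CPM, (13,7).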
Common (Rigid) Subsequence • Longest Common Subsequence (LCS) of "combinatorial pattern matching" and "longest common rigid subsequence": comnienc • Longest Common Rigid Subsequence (LCRS) of the same two strings: comni, (1,1,3,5)
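For comparison with the rigid variant, the classical LCS of two strings can be computed with the textbook dynamic program. This is a standard sketch rather than anything specific to the slides; the function name `lcs` is an assumption, and when several longest common subsequences exist it returns just one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b (O(len(a) * len(b)))."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the top-left corner to recover one optimal answer.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs("combinatorial pattern matching", "longest common rigid subsequence"))
```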
Previous Results • LCS and LCRS of two strings: • polynomial time solvable • LCS of many strings: • Cannot be approximated within ratio n^δ, for some constant δ > 0, in polynomial time unless P = NP (Jiang and Li 1995, SIAM J. Comput.). • For random instances, a simple greedy algorithm gives an almost optimal solution with only a small error. • LCRS of many strings: • Exponential time algorithms. • Our CPM paper addresses the time complexity of this case.
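One way to see why LCRS of two strings is polynomial: a common rigid subsequence fixes a single shift between the two strings, so it is enough to try every shift and count matching positions. The sketch below illustrates that observation; it is not the method given in the slides, and the name `lcrs_two` and the (letters, gaps) return format are assumptions.

```python
def lcrs_two(s, t):
    """Return (letters, gaps) of a longest common rigid subsequence of s and t.

    For each shift, s[i] is aligned with t[i + shift]; every matching position
    under one shift belongs to the same rigid pattern. Runs in O(len(s) * len(t)).
    """
    best = ("", ())
    for shift in range(-(len(s) - 1), len(t)):
        matches = [
            i
            for i in range(len(s))
            if 0 <= i + shift < len(t) and s[i] == t[i + shift]
        ]
        if len(matches) > len(best[0]):
            letters = "".join(s[i] for i in matches)
            gaps = tuple(b - a for a, b in zip(matches, matches[1:]))
            best = (letters, gaps)
    return best


print(lcrs_two("abcde", "xaxcxe"))
```

With more than two strings there is one shift per string, and the number of shift combinations grows exponentially with the number of strings, so this direct enumeration stops being practical.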
Motivation in Bioinformatics • In biochemistry, a motif is a recurring pattern in DNA/protein sequences. • Example: a protein motif (SH3 domain binding motif) from J. Biological Chemistry 269:24034-9. • Many motifs can be found in the PROSITE database on ExPASy.
Motivation • Rigoutsos and Floratos proposed the following problem (Bioinformatics 14:55-67, 1998). • Given n strings and a positive number K, find a longest “rigid pattern” (rigid subsequence) that occurs in at least K of the n strings. • When K=n, this is LCRS. • Exponential time algorithms were studied. • NP-hardness was unknown.
Our Results • LCRS is MAX-SNP hard. • Therefore, Rigoutsos and Floratos’ problem is also MAX-SNP hard. • For random instances, there is an algorithm that solves LCRS with quasi-polynomial average running time. • The algorithm also works for Rigoutsos and Floratos’ problem with simple modifications.
MAX-SNP hard • L-reduction from Max-Cut • [Figure: the constructed strings, with segments labeled edge, vertex, and delimiter]
The construction of each edge • Each edge is encoded by the three strings aaa, aba, bab. • There are three possible configurations in an ungapped alignment; they contribute 0, 1, and 1 respectively. • [Figure: the three alignments of aaa, aba, bab]
The Algorithm • Let S_i be the set of length-i common rigid subsequences. • We only need to prove that the sets S_i are small on average for random inputs.
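The slides do not spell out how the sets S_i are computed. One plausible reading is a level-by-level enumeration, where each pattern in S_i is extended by one letter and kept only if it still occurs in every input string. The sketch below follows that assumption; the names `occurs` and `common_rigid_levels` and the (offset, letter) pattern encoding are hypothetical.

```python
def occurs(pattern, s):
    """pattern is a tuple of (offset, letter) pairs whose first offset is 0."""
    span = pattern[-1][0]
    return any(
        all(s[start + off] == ch for off, ch in pattern)
        for start in range(len(s) - span)
    )


def common_rigid_levels(strings):
    """Return [S_1, S_2, ...], where S_i holds the length-i common rigid subsequences.

    Candidates are read off occurrences in the first string, so they occur
    there by construction and only need checking against the other strings.
    """
    first = strings[0]
    level = {((0, ch),) for ch in set(first) if all(ch in s for s in strings)}
    levels = []
    while level:
        levels.append(level)
        nxt = set()
        for pat in level:
            span = pat[-1][0]
            for start in range(len(first) - span):
                if not all(first[start + off] == ch for off, ch in pat):
                    continue
                # Extend this occurrence by one letter to its right.
                for q in range(start + span + 1, len(first)):
                    cand = pat + ((q - start, first[q]),)
                    if cand not in nxt and all(occurs(cand, s) for s in strings[1:]):
                        nxt.add(cand)
        level = nxt
    return levels


levels = common_rigid_levels(["abcde", "xaxcxe", "azcze"])
print([len(level) for level in levels])  # sizes of S_1, S_2, ...
print(levels[-1])                        # the longest common rigid subsequences
```

The work done is roughly proportional to the total size of the levels, which is why the size of each S_i is the quantity to bound in the analysis.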
Sketch of Proof • For each rigid subsequence in S_i, bound the probability that it occurs in one random string of length n. • From this, bound the probability that it occurs in every input string. • Multiply by the total number of length-i rigid subsequences. • The analysis is split into two cases: i <= 2 log n and i > 2 log n.
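The formulas from this slide did not survive, so the following is only a union-bound sketch of the three steps under stated assumptions, not the authors' exact bounds. The symbols are assumptions: m input strings, each of length n, with letters drawn independently and uniformly from an alphabet of size σ.

```latex
% Fixed rigid subsequence P with i letters: at most n start positions,
% each matching with probability \sigma^{-i}.
\Pr[P \text{ occurs in one random string}] \le n\,\sigma^{-i}

% The m input strings are independent.
\Pr[P \text{ occurs in every input string}] \le \left(n\,\sigma^{-i}\right)^{m}

% Number of length-i rigid subsequences that fit in a string of length n:
% \binom{n-1}{i-1} choices of offsets times \sigma^{i} choices of letters.
\mathbb{E}\,|S_i| \le \binom{n-1}{i-1}\,\sigma^{i}\left(n\,\sigma^{-i}\right)^{m}
```

The last line follows from linearity of expectation over all candidate patterns. The case split on i relative to 2 log n mentioned on the slide is not reproduced here.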
Acknowledgement • Supported by NSERC, PREA and CRC.