A Polynomial Space and Polynomial Delay Algorithm for Enumeration of Maximal Motifs in a Sequence. Hiroki Arimura (Hokkaido University) Takeaki Uno (National Institute of Informatics).
This work is partly supported by MEXT Grant-in-Aid for Specially Promoted Research "Semi-structured Data Mining", 2005-2007, and Cooperative Fund by National Institute of Informatics, 2005.
Our problem: Maximal Motif Enumeration
• Alphabet Σ = {A, B, ...} and the wildcard "o"
• An input string s in Σ*
• An integer q ≥ 0 (quorum)
• Motif with wildcards: a string x in (Σ ∪ {o})* starting and ending with a constant symbol in Σ
• Matching: x occurs at position pos of s if every constant symbol of x equals the symbol of s aligned with it; the wildcard matches any symbol
• Location list: the list L(x) = {pos1, ..., posm} of the positions of x in s
• A motif must be frequent: |L(x)| ≥ q
• The problem: enumerate all maximal motifs in an input sequence without duplicates, for the class of repeated motifs with wildcards
• Example (q = 3): s = ABCABRRABRABCABABRABBC, positions 0 to 21; the motif ABoAB is marked at pos = 0, pos = 7, and pos = 15
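The matching rule and the location list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, positions are 0-based, "o" is assumed never to occur in the alphabet, and the input is a toy string chosen here rather than the slide's example.

```python
WILDCARD = "o"  # assumption: "o" is reserved and never appears in the alphabet


def matches_at(motif: str, s: str, pos: int) -> bool:
    """True if every constant of motif equals the aligned symbol of s at pos."""
    if pos < 0 or pos + len(motif) > len(s):
        return False
    return all(m == WILDCARD or m == s[pos + i] for i, m in enumerate(motif))


def location_list(motif: str, s: str) -> list[int]:
    """L(x): all 0-based positions of s where the motif occurs."""
    return [p for p in range(len(s) - len(motif) + 1) if matches_at(motif, s, p)]


def is_frequent(motif: str, s: str, q: int) -> bool:
    """A motif is frequent if |L(x)| >= q."""
    return len(location_list(motif, s)) >= q


toy = "ABCABDAB"  # hypothetical toy input, not the slide's string
print(location_list("ABoAB", toy))  # [0, 3]
print(location_list("AB", toy))     # [0, 3, 6]
print(is_frequent("AB", toy, 3))    # True
```

The scan is the plain O(n · |x|) check per motif; it is meant to make the definitions concrete, not to be efficient.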
Our problem: Maximal Motif Enumeration
• Motif x and its location list L(x)
• A maximal motif: a representative motif x that is not properly contained in any other motif y with the same location list under displacement
• That is, there exists no motif y such that (1) x is properly contained in y and (2) L(x) = L(y) + d for some (possibly negative) integer d
• Example, s = ABCABRRABRABCABABRABBC: BoAB (pos = 1, 8, 16) is non-maximal because it is contained in ABoAB (pos = 0, 7, 15), which is maximal
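The two conditions of the definition can be checked mechanically. The sketch below assumes that "x is contained in y" means that x, shifted by some offset d, agrees with y on every constant symbol of x; the helper name is hypothetical, and the location lists are the ones marked on the slide.

```python
WILDCARD = "o"  # assumption: "o" is reserved and never appears in the alphabet


def containment_offsets(x: str, y: str) -> list[int]:
    """Offsets d at which every constant of x coincides with the same constant of y."""
    return [d for d in range(len(y) - len(x) + 1)
            if all(a == WILDCARD or a == y[d + i] for i, a in enumerate(x))]


# Condition (1): BoAB is contained in ABoAB at displacement d = 1.
print(containment_offsets("BoAB", "ABoAB"))  # [1]

# Condition (2): the location lists marked on the slide differ exactly by d.
L_x, L_y, d = [1, 8, 16], [0, 7, 15], 1
print(L_x == [p + d for p in L_y])  # True
```

Both conditions hold, so BoAB is not maximal: ABoAB carries the same occurrences with one more constant symbol.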
Our problem: Maximal Motif Enumeration
• The problem: enumerate all maximal motifs in an input sequence without duplicates, for the class of repeated motifs with wildcards
• Input: a string s in Σ* and an integer q ≥ 0 (quorum)
• Task: Enumerate all maximal motifs without duplicates
• Example: s = ABCABRRABRABCABABRABBC (positions 0 to 21), q = 3
• Solutions (as listed on the slide): ABRAB, AB, BoABoAB, ε, ABoAB, B, BoABoooooB, BoAB, BC, BoooooB
Why maximal motifs?
• In real datasets, the number of maximal motifs is much smaller than the number of all motifs, yet the maximal motifs still contain the complete information
• Succinct representation for all (frequent) motifs
How many solutions?
• Number of solutions
  • Th 1: There exist 2^Θ(n) maximal motifs in s in general.
• Succinctness of maximal motifs
  • Th 2: There exists an infinite series of input strings (s_n)_n such that the numbers of (frequent) motifs and of maximal (frequent) motifs in s_n are 2^Ω(n) and O(n), respectively.
  • By reduction from the maximal bi-clique enumeration problem (closed set enumeration)
  • From this theorem, a naive generate-and-test algorithm based on frequent motif enumeration is not efficient for maximal motifs: it may generate exponentially many frequent motifs to output only O(n) maximal ones.
• Hardness of counting
  • Fact 3 (not included in the paper): The counting version of maximal motif enumeration is #P-complete.
  • By the reduction used for Th 2 and the #P-completeness of the maximal bi-clique enumeration problem
Classes of Enumeration Algorithms
• No known output-polynomial time algorithms exist for maximal motif enumeration
• Output-polynomial (OUT-POLY): total time is poly(Input, Output)
• Polynomial-time enumeration (POLY-ENUM): amortized delay is poly(Input), that is, total time is Output · poly(Input)
• Polynomial-delay (POLY-DELAY): maximum delay is poly(Input)
• Polynomial-space (POLY-SPACE)
[Figure: timeline of an enumeration run from input to output, labeled with the input size, the output size, the delay D, and the total time T]
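The three time classes can be written side by side. In this sketch N denotes the input size and M the number of outputs (labels chosen here for illustration), while T is the total time and D the maximum delay as on the slide.

```latex
\begin{align*}
\text{OUT-POLY:}   &\quad T \le \mathrm{poly}(N, M) \\
\text{POLY-ENUM:}  &\quad T \le M \cdot \mathrm{poly}(N) \\
\text{POLY-DELAY:} &\quad D \le \mathrm{poly}(N)
\end{align*}
```

If the delay bound also covers the time before the first output and after the last one, then T ≤ (M + 1) · D, so polynomial delay implies polynomial-time enumeration, which in turn implies output-polynomial time.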
Related works
• An approach based on a basis of maximal motifs: generating the set M of maximal motifs from a small subset B
• Parida et al. [SODA'00]
  • The basis of irredundant motifs BI
  • Claimed that the size of BI is at most linear in n = |s| for any quorum q ≥ 0 [Th 1, Parida SODA'00]; the claim was later shown to be false by [Pisanti et al. MFCS'03]
• Parida et al. [CPM'00]
  • Output-polynomial time enumeration of flexible motifs, relying on the claim of [Parida et al. SODA'00]
• Pisanti et al. [MFCS'03] (Pelfrene et al. [CPM'03])
  • The basis of tiling motifs BT
  • Showed that the sizes of BI and BT are Θ(n^(q-1)) [Pisanti et al., MFCS'03]
• It is not known whether there exists any output-polynomial time algorithm for enumerating all maximal motifs from an input sequence s.
Main result
• Th 8: There exists an algorithm that enumerates all maximal motifs in an input sequence with quorum q in O(n²m) delay and O(km) space without duplicates.
  • n: the length of the input sequence s
  • k: the length of a longest maximal motif (k = O(n))
  • m: the length of the location list L(x) of a motif x (m = O(n))
• Corollary 2: The maximal motif enumeration problem is solvable in polynomial space and polynomial delay in the input size n.
• This seems to be the first output-polynomial time result for the maximal motif enumeration problem.
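The step from Th 8 to Corollary 2 is a substitution of the stated bounds k = O(n) and m = O(n):

```latex
\begin{align*}
\text{delay:} &\quad O(n^{2} m) \subseteq O(n^{2} \cdot n) = O(n^{3}) \\
\text{space:} &\quad O(k m) \subseteq O(n \cdot n) = O(n^{2})
\end{align*}
```

Both bounds are polynomial in n alone and do not depend on the number of maximal motifs that are output.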
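Basic Idea: Incremental Generation by Back-tracking
[Figure: search tree rooted at ⊥ over the motifs B, AB, BC, BoAB, ABoAB, ABRAB, BoooooB, BoABoAB, BoABoooooB, BCoB, and BCA; several nodes are labeled "maximal", and annotations near BCoB and BCA read "no maximal motifs exist"]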
Difficulties in enumerating all maximal motifs
• How to test the maximality of a generated motif?
• How to define a tree-shaped search route over all maximal motifs?
• How to perform depth-first search on the tree with polynomial space and delay?
[Figure: fragment of the search tree with the nodes BC, ABoAB, BCoB, ABRAB, BCA]
The closure Clo(x) of a motif x
• Procedure Closure(Q) := Merge(S(Q))
• STEP 1: Compute the location list of Q: L(Q) = {d1, ..., dm}
• STEP 2: Align the copies of the input sequence at the occurrence positions: S(Q) = {s - d1, ..., s - dm}
• STEP 3: Compute the common letter at each position (a wildcard where the copies disagree); return R = Merge(S(Q))
[Figure: copies of the input sequence (AB|BCABRABRABCABABRA..., ABBCA|BRABRABCABABRA..., ABBCABRA|BRABCABABRA...) aligned at the occurrences of Q and merged into the closure Clo(Q); slide labels: Q = BAB, Clo(Q) = BABAB]
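The three steps can be sketched directly in Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, positions are 0-based, "o" is assumed to be outside the alphabet, the motif is assumed to occur at least once, and the input is a toy string chosen here. The merge also looks to the left of each occurrence, which is what allows a closure to grow by a displacement.

```python
WILDCARD = "o"  # assumption: "o" is reserved and never appears in the alphabet


def location_list(motif: str, s: str) -> list[int]:
    """STEP 1: all 0-based positions where the motif occurs in s."""
    return [p for p in range(len(s) - len(motif) + 1)
            if all(m == WILDCARD or m == s[p + i] for i, m in enumerate(motif))]


def closure(motif: str, s: str) -> str:
    """STEPS 2-3: align s at every occurrence and keep the common letters."""
    occ = location_list(motif, s)        # assumed non-empty
    lo = -min(occ)                       # leftmost offset valid for all copies
    hi = len(s) - max(occ)               # one past the rightmost valid offset
    merged = []
    for j in range(lo, hi):
        letters = {s[d + j] for d in occ}
        merged.append(letters.pop() if len(letters) == 1 else WILDCARD)
    # A motif starts and ends with a constant, so trim outer wildcards.
    return "".join(merged).strip(WILDCARD)


def is_maximal(motif: str, s: str) -> bool:
    """Under this sketch, a motif is maximal iff it equals its own closure."""
    return closure(motif, s) == motif


toy = "ABCABDAB"  # hypothetical toy input, not the slide's string
print(closure("BoAB", toy))      # ABoAB
print(is_maximal("ABoAB", toy))  # True
```

On the toy string, BoAB occurs at positions 1 and 4; both occurrences are preceded by A, so the merge extends the motif to the left, mirroring the BoAB versus ABoAB example from the problem slides.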
How to define a tree-shaped search route?
• Assign to each maximal motif y the unique parent Pa(y)
• The parent of a maximal motif Q: Pa(Q) = Clo(Q[1..kmin]), where kmin = core_i(Q) - 1 and core_i(Q) is the core index of Q
• Lemma: The parent relation Pa(·) defines a spanning tree over all maximal motifs, which serves as the tree-shaped search route.
Prefix-Preserving Closure Extension
• Given a parent, compute all of its children
• Input: the parent maximal motif X and its core index k = core_i(X)
  • Core index: the length of the shortest prefix p of X that has the same location list, L(X) = L(p)
• Method: for every index i = k+1, ..., n and every letter c ∈ Σ = {c1, ..., cs}, do the following:
  • Q := X⟨i := c⟩ (X with the letter c placed at index i)
  • Compute the closure R := Clo(Q)
  • Check that the prefix is preserved: X[1..i-1] = R[1..i-1]
  • If the check succeeds, return the motif R
Algorithm MAXMOTIF
• A polynomial-space, polynomial-delay algorithm for maximal motifs
• Based on depth-first search using PPC-extension for maximal motifs
Comparison to the previous methods
• Our method (MAXMOTIF)
  • Depth-first search of a tree
  • Memory efficient, quick, simple
  • Memory = (depth of tree) × (length of location list)
• Previous method
  • Breadth-first search of a DAG
  • Large memory footprint
  • Memory proportional to the output size
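For contrast, here is a naive baseline that is useful for cross-checking results on tiny inputs. It is neither MAXMOTIF nor any of the cited earlier methods: it refines occurrence sets one letter at a time, merges each set into a motif, and removes duplicates with in-memory sets, so its memory grows with the number of occurrence sets visited and its running time can be exponential. All names are hypothetical, positions are 0-based, "o" is assumed to be outside the alphabet, and the empty motif is not reported.

```python
WILDCARD = "o"  # assumption: "o" is reserved and never appears in the alphabet


def merge(s: str, occ: tuple) -> str:
    """Common letters of the copies of s aligned at the positions in occ."""
    lo, hi = -min(occ), len(s) - max(occ)
    out = []
    for j in range(lo, hi):
        letters = {s[p + j] for p in occ}
        out.append(letters.pop() if len(letters) == 1 else WILDCARD)
    return "".join(out).strip(WILDCARD)


def maximal_motifs_baseline(s: str, q: int) -> set:
    """All non-empty maximal motifs with at least q occurrences (naive)."""
    n = len(s)
    found, seen = set(), set()
    # Start from the occurrence set of every single letter.
    stack = [tuple(p for p in range(n) if s[p] == c) for c in sorted(set(s))]
    while stack:
        occ = stack.pop()
        if len(occ) < q or occ in seen:
            continue
        seen.add(occ)                    # duplicate removal costs memory
        found.add(merge(s, occ))
        # Refine by fixing one more letter c at a relative offset j > 0.
        for j in range(1, n - occ[0]):
            for c in {s[p + j] for p in occ if p + j < n}:
                sub = tuple(p for p in occ if p + j < n and s[p + j] == c)
                if len(sub) < len(occ):
                    stack.append(sub)
    return found


toy = "ABCABDAB"  # hypothetical toy input, not the slide's string
print(sorted(maximal_motifs_baseline(toy, 2)))  # ['AB', 'ABoAB']
```

The `seen` and `found` sets are exactly what the depth-first search with PPC-extension avoids: MAXMOTIF keeps only the current search path, which is where the (depth of tree) × (length of location list) memory bound comes from.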
Conclusion
• Maximal motif enumeration problem
• Difficulties in maximal motif enumeration
• A polynomial-space, polynomial-delay algorithm MAXMOTIF
  • Enumerates all maximal motifs x in an input string of length n in O(n²m) delay and O(lm) space without duplicates, where m = |L(x)| and l = |x|
• Output-polynomial time enumerability of the problem
• Future research
  • Extension of the algorithm to maximal motif problems over combinatorial objects such as trees and graphs [Arimura and Uno, ILP2005]