On Minimizing Pattern Splitting in Multi-track String Matching
Kjell Lemström and Veli Mäkinen
Department of Computer Science, University of Helsinki
Minimum splitting problem
• We study the following problem. Given a pattern string P and K parallel text strings T^k, 1 ≤ k ≤ K, find the smallest integer κ > 0 such that P can be split into κ pieces P = P_1 ⋯ P_κ, where each piece P_i has an occurrence in some text track and these partial occurrences retain the order.
[Figure: pattern P split into pieces that occur on text tracks T^1, T^2, T^3.]
Motivation
• Music information retrieval.
• Text tracks represent different instruments.
• Finding split pattern occurrences allows the query melody to jump between instruments.
• Useful in Query-by-Humming applications, where the pattern is monophonic and the music in the database is polyphonic.
Minimum splitting problem...
• We study different versions of the problem:
  - The gap between the occurrences of two consecutive pattern pieces is limited by α.
  - The length of each piece must be ≥ γ.
  - Transposition-invariant occurrences: there is an occurrence if the pattern is found with a constant c added to each character.
A splitting with κ = 4 and transposition c = 2:
P_4 = 4 7 8 5, P_4 + c = (4+c) (7+c) (8+c) (5+c) = 6 9 10 7
[Figure: pieces P_1, P_2, P_3, P_4 (κ = 4) of P placed on tracks T^1, T^2, T^3, with a gap α and a piece length γ marked.]
Parallel texts assumption
• To represent the different tracks as parallel strings, we need to add empty characters to align the tracks.
• Therefore it makes more sense to consider splittings where jumps over empty characters are not counted.
        4-6--7---3--9
T = ... -5--784--2-8- ...
        3-3-453-8--8-
P = 464538289
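A minimal sketch of how skipping empty characters can enter the computation. It assumes that "-" marks an empty character, that gaps between pieces are unlimited, and that a dense O(mKn) table is acceptable (it is not the sparse evaluation discussed on the later slides). The names `EMPTY` and `min_pieces_skipping_empty` are illustrative only.

```python
from math import inf

EMPTY = "-"  # assumed marker for an empty (padding) character


def min_pieces_skipping_empty(pattern, tracks):
    """Fewest pieces needed to match pattern across aligned tracks, or None."""
    n = len(tracks[0])
    # prev_col[k][j]: nearest column before j that holds a real character in track k
    prev_col = []
    for track in tracks:
        last, cols = None, []
        for j, ch in enumerate(track):
            cols.append(last)
            if ch != EMPTY:
                last = j
        prev_col.append(cols)
    # prev[k][j]: fewest jumps for the pattern prefix read so far, ending at tracks[k][j]
    prev = [[0 if ch == pattern[0] else inf for ch in track] for track in tracks]
    for p in pattern[1:]:
        # best_before[j]: minimum of prev over all tracks and all columns < j
        best_before = [inf] * (n + 1)
        for j in range(n):
            best_before[j + 1] = min(best_before[j], min(row[j] for row in prev))
        cur = [[inf] * n for _ in tracks]
        for k, track in enumerate(tracks):
            for j, ch in enumerate(track):
                if ch != p:
                    continue
                q = prev_col[k][j]  # empty characters in between are skipped
                stay = prev[k][q] if q is not None else inf  # continue the piece
                cur[k][j] = min(stay, 1 + best_before[j])    # or start a new piece
        prev = cur
    best = min(min(row) for row in prev)
    return None if best == inf else best + 1
```

On the three tracks of this slide with P = 464538289, the sketch returns 4, corresponding to the pieces 46 | 4538 | 28 | 9.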
Related work
• All related work assumes that the texts are parallel.
• The exact search (α = 0), when the number of splits is not minimized, can be cast as a subset matching problem.
  - Running time O(Kn log²(Kn)) can be achieved using an algorithm of Cole and Hariharan, 2002.
  - O((Kn+mn)⌈|Σ|/w⌉) can be achieved using bit-parallelism (w is the machine word length), see Iliopoulos and Kurokawa, 2002.
Related work...
• Lemström and Tarhio, 2003, have developed an efficient filter and a checking algorithm for the transposition-invariant version of the exact search problem on multi-track texts.
Summary of results
• Let M = {(i,j,k) | p_i = t^k_j} be the set of matching character pairs, where 1 ≤ i ≤ m, 1 ≤ j ≤ n, and 1 ≤ k ≤ K. For simplicity, let us assume that the alphabet is Σ = {1,2,...,Kn+m}, and m, K < n.
• The minimum splitting problem with α > 0, and with or without the parallel text assumption, can be solved in O(m+Kn+|M|) time.
  - Corollary: the transposition-invariant splitting problem can be solved in O(mKn) time.
Summary of results...
• Let (i,j,k)(i+1,j+1,k) ⋯ (i+l-1,j+l-1,k) be a maximal sequence of points in M, i.e. a maximal (diagonal) line segment of M. Let S be the set of all maximal line segments of M.
• The minimum splitting problem can be solved in O(m²+Kn+|S| log n) time.
• The minimum splitting problem with α > 0 can be solved in O(m²+Kn+|R| log n) time, where |R| ≤ min(|S|², |M|).
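To make the relation between M and its maximal line segments concrete, here is a brute-force sketch for a single track (K = 1). It is only an illustration of the definition, not the construction given on the later slides; the function name `maximal_segments` is hypothetical.

```python
def maximal_segments(pattern, text):
    """Return maximal diagonal runs of matches as (i, j, length), 1-based."""
    m, n = len(pattern), len(text)
    segments = []
    for i in range(m):
        for j in range(n):
            if pattern[i] != text[j]:
                continue
            # A segment starts here only if the diagonal predecessor is not a match.
            if i > 0 and j > 0 and pattern[i - 1] == text[j - 1]:
                continue
            length = 0
            while (i + length < m and j + length < n
                   and pattern[i + length] == text[j + length]):
                length += 1
            segments.append((i + 1, j + 1, length))
    return segments
```

Every point of M lies on exactly one maximal segment, so the segment lengths sum to |M|; for P = abcab and T = cabz there are 5 matching pairs but only 2 segments.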
Summary of results...
• The minimum splitting problem with α = 0 can be solved in O(m²+κKn) time, where κ is a given threshold.
O(|M|) algorithm
• The idea is to compute an m × n × K matrix sparsely, so that each computed cell d_{i,j,k} stores the minimum splitting needed between P_{1...i} and the text tracks up to t^k_j. The recurrence is:
[Recurrence formula: figure not preserved.]
O(|M|) algorithm...
• Initializing d_{1,j,k} = 0 for each (1,j,k) ∈ M, we have that κ = 1 + min{d_{m,j,k} | (m,j,k) ∈ M}.
• It is easy to construct M so that diagonally consecutive elements are linked to enable constant time evaluation of line (1) of the recurrence.
• Evaluating M column by column, with rows from bottom to top, we can maintain the minimum value at each row to enable constant time evaluation of line (2) of the recurrence.
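The evaluation order described above can be sketched as follows. Since the recurrence itself is not preserved here, the sketch assumes the natural two-case form suggested by the text: either continue the current piece along the diagonal, or start a new piece after any match of the previous row in an earlier column. It also assumes unlimited gaps and no empty characters. The dictionary `d`, the array `row_min`, and the function name are illustrative, and the per-column sort is a simplification that costs an extra logarithmic factor.

```python
from math import inf


def min_splitting_sparse(pattern, tracks):
    """Minimum number of pieces computed only at matching cells, or None."""
    m, n = len(pattern), len(tracks[0])
    # rows_of[c]: 1-based pattern rows whose character is c
    rows_of = {}
    for i in range(1, m + 1):
        rows_of.setdefault(pattern[i - 1], []).append(i)
    d = {}                     # d[(i, j, k)] for (i, j, k) in M
    row_min = [inf] * (m + 1)  # row_min[i]: min of d in row i over columns seen so far
    for j in range(1, n + 1):
        column = [(i, k)
                  for k, track in enumerate(tracks, start=1)
                  for i in rows_of.get(track[j - 1], [])]
        # Rows from bottom (m) to top (1): row_min[i - 1] still excludes column j
        # at the moment row i is evaluated.
        column.sort(reverse=True)
        for i, k in column:
            if i == 1:
                value = 0
            else:
                diagonal = d.get((i - 1, j - 1, k), inf)        # same piece continues
                value = min(diagonal, 1 + row_min[i - 1])       # or a new piece starts
            d[(i, j, k)] = value
            row_min[i] = min(row_min[i], value)
    return None if row_min[m] == inf else 1 + row_min[m]
```

The bottom-to-top order matters: for P = aa and a single track "a", evaluating row 1 first would let row 2 reuse the match in the same column and wrongly report an occurrence.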
O(|M|) algorithm...
• To solve the case α > 0, we use a technique from Crochemore et al., 2002: keep sliding window minima at each row during column-by-column evaluation.
• Min-deques (Gajewska and Tarjan, 1986) support constant time access to the minimum value in a list as well as insertion to the tail and deletion from the head of the list.
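A minimal sketch of the queue operations listed above, using the common monotone-queue technique rather than the Gajewska and Tarjan structure itself. Insertion is amortized constant time; the class and function names are illustrative.

```python
from collections import deque


class MinQueue:
    """FIFO queue with access to the current minimum."""

    def __init__(self):
        self._items = deque()  # all queued values, oldest first
        self._mins = deque()   # non-decreasing candidates for the minimum

    def push_tail(self, value):
        # Larger candidates at the tail can never be the minimum again.
        while self._mins and self._mins[-1] > value:
            self._mins.pop()
        self._mins.append(value)
        self._items.append(value)

    def pop_head(self):
        value = self._items.popleft()
        if self._mins[0] == value:
            self._mins.popleft()
        return value

    def minimum(self):
        return self._mins[0]


def sliding_minima(values, width):
    """Minimum of the last `width` values after each insertion."""
    queue, result = MinQueue(), []
    for index, value in enumerate(values):
        queue.push_tail(value)
        if index >= width:
            queue.pop_head()
        result.append(queue.minimum())
    return result
```

In the splitting computation, one such window per row would hold the values of the previous α + 1 columns, so that the minimum over the allowed gap range is available in constant time.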
O(|M|) algorithm...
• Each step of the algorithm takes constant amortized time. Thus the overall running time is O(|M|).
Transposition-invariance
• In Navarro et al., 2003, the following connection between sparse dynamic programming and transposition-invariance was given.
• Lemma: Let d(P,T) be a distance between strings P and T such that its value is determined by the set M = {(i,j) | p_i = t_j}. If an algorithm computes d(P,T) in O(|M| f(m,n)) time, then the transposition-invariant distance can be computed in O(mn f(m,n)) time.
Transposition-invariance...
• In our problem, the relevant match sets for transposition-invariant computation are the non-empty M_c = {(i,j,k) | p_i + c = t^k_j} for c ∈ [-∞, ∞].
• We can construct them all in O(mKn) time with pointers between diagonally consecutive elements in each set.
• For each set we need O(|M_c|) time computation, which is O(mKn) overall.
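The O(mKn) total follows because every triple (i,j,k) belongs to exactly one set M_c, namely the one with c = t^k_j - p_i. A small sketch, assuming integer characters (e.g. pitches) and no empty characters; the function name is illustrative and the diagonal pointers are omitted.

```python
from collections import defaultdict


def transposition_match_sets(pattern, tracks):
    """Group all (i, j, k) triples (1-based) by the transposition c = t - p."""
    sets = defaultdict(list)
    for i, p in enumerate(pattern, start=1):
        for k, track in enumerate(tracks, start=1):
            for j, t in enumerate(track, start=1):
                sets[t - p].append((i, j, k))
    return dict(sets)
```

For the piece 4 7 8 5 against a track fragment 6 9 10 7, the set M_2 contains the full diagonal (1,1,1), (2,2,1), (3,3,1), (4,4,1), and the sizes of all sets sum to 4 · 4 = 16.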
Line segment algorithms
• We will now show how to solve the minimum splitting problem doing computation only at the endpoints of the line segments of S.
• After that, we give the construction of S and the solution to the case α = 0.
• In the sequel, we assume a single-track text for simplicity.
Interpretation as a minimum jump distance
• We denote the two endpoints of a line segment S by start(S), end(S) ∈ M.
• Let the minimum jump distance d((i,j)) to a point (i,j) on a line segment of S be the minimum number of horizontal jumps (from (i'-1,j'') ∈ S'' to (i',j') ∈ S', j' < j, where S'' and S' are line segments of S) needed for traversing through line segments of S from row 1 to (i,j).
• Then d_{i,j} = 1 + d((i,j)), where d_{i,j} denotes the minimum splitting up to (i,j).
Interpretation as a minimum jump distance...
• Lemma: The minimum jump distance d(end(S)) equals d(start(S)). Let us denote this value d(S).
• Idea of the algorithm: Traverse the endpoints of the line segments row by row. Keep the active segments (those intersecting the previous row) in a balanced binary search tree with the diagonal numbers as the keys. Maintain subtree minima of d(S) values to answer range minimum queries [-∞, j-i).
Interpretation as a minimum jump distance...
• The required operations on the binary search tree can be supported in O(log n) time.
• Thus, the algorithm works in O(|S| log n) time.
Minimum splitting with α > 0
• One can prove that it is enough to recompute the values of line segments only in their intersections with the so-called α-greedy paths.
• With some care in the implementation, one gets the time bound O(m²+Kn+|R| log n), where |R| ≤ min(|S|², |M|).
Constructing S
• We will give a more general algorithm that constructs the set S_γ, i.e., the set of maximal line segments of length at least γ.
• Let Prefix(A,B) denote the length of the longest common prefix of strings A and B.
• Let MaxPrefix(j) be max{Prefix(P_{i...m}, T_{j...n}) | 1 ≤ i ≤ m} and H(j) some index i giving the maximum.
Example: A = aaabbbb, B = aaabcbb; the longest common prefix is aaab, so Prefix(A,B) = 4.
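The two definitions can be written down directly. This is a naive sketch for checking small cases by hand, not an efficient implementation; the function names are illustrative, and positions i and j are 1-based as on the slide.

```python
def prefix(a, b):
    """Length of the longest common prefix of a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def max_prefix(pattern, text, j):
    """Return (MaxPrefix(j), H(j)): the best length and one pattern index achieving it."""
    best_len, best_i = -1, None
    for i in range(1, len(pattern) + 1):
        length = prefix(pattern[i - 1:], text[j - 1:])
        if length > best_len:
            best_len, best_i = length, i
    return best_len, best_i
```

For P = abcab and T = xcabz, the text suffix starting at j = 2 is cabz, which is best matched by the pattern suffix starting at i = 3 with a common prefix of length 3.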
Constructing S...
• Let Jump(i,j) denote Prefix(P_{i...m}, T_{j...n}).
• Lemma (Ukkonen and Wood, 1993): Jump(i,j) = min(MaxPrefix(j), Prefix(P_{i...m}, P_{H(j)...m})).
• Ukkonen and Wood show how to allow constant time access to any Jump(i,j) value after O(m²+n) time preprocessing.
• Observation: If we manage to call Jump(i,j) only at points (i,j) = start(S), S ∈ S_γ, we have an O(|S_γ|) construction algorithm for S_γ.
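The lemma can be checked by brute force on small inputs. The sketch below compares the direct definition of Jump(i,j) with the right-hand side of the lemma; it recomputes everything naively and does not implement the constant-time access of Ukkonen and Wood. Function names are illustrative, with 1-based i and j.

```python
def prefix(a, b):
    """Length of the longest common prefix of a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def jump_direct(pattern, text, i, j):
    """Jump(i, j) straight from the definition."""
    return prefix(pattern[i - 1:], text[j - 1:])


def jump_via_lemma(pattern, text, i, j):
    """Jump(i, j) as min(MaxPrefix(j), Prefix(P[i..m], P[H(j)..m]))."""
    # max() over (length, index) pairs yields MaxPrefix(j) and one valid H(j).
    max_len, h = max((prefix(pattern[x - 1:], text[j - 1:]), x)
                     for x in range(1, len(pattern) + 1))
    return min(max_len, prefix(pattern[i - 1:], pattern[h - 1:]))
```

The point of the lemma is that the second argument of the minimum depends only on the pattern, so it can be tabulated in O(m²) before the text is read.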
Constructing S...
• To find points (i,j) = start(S), S ∈ S_γ, we construct the suffix array A of P.
• We make a copy A_s of A for each distinct character s of P. Then we remove from each A_s the suffixes i such that p_{i-1} = s.
• Now, if we query T_{j...j+γ-1} from the suffix array A_s where s = t_{j-1} (or from A if A_s does not exist), the resulting positions of P give the line segments that start at column j.
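A simplified sketch of the query step. It assumes a naively built suffix array and, instead of keeping one filtered copy A_s per character, filters the query result by the preceding character afterwards; this gives the same positions but not the stated time bound. Indices are 0-based and all function names are illustrative.

```python
def suffix_array(s):
    """Suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def occurrences(s, sa, query):
    """Sorted start positions of query in s, via two binary searches on sa."""
    q = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose q-character prefix is >= query
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + q] < query:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first suffix whose q-character prefix is > query
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + q] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def segment_starts(pattern, text, j, gamma):
    """Pattern positions i where a maximal segment of length >= gamma starts at column j."""
    query = text[j:j + gamma]
    if len(query) < gamma:
        return []
    sa = suffix_array(pattern)
    return [i for i in occurrences(pattern, sa, query)
            if i == 0 or j == 0 or pattern[i - 1] != text[j - 1]]
```

With P = abcab, γ = 2 and T = cabz, the query ab at column 1 matches pattern positions 0 and 3, but position 3 is discarded because both ab occurrences are preceded by c there, so that diagonal started one column earlier.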
Constructing S...
• If we associate all suffix arrays with LCP values, the overall complexity of constructing S_γ is O(m²+Kn(γ+log m)+|S_γ|).
• Using suffix trees instead gives a bound O(m2|P|+Knγ+|S_γ|).
• A more direct approach gives O(m²+Kn+|S|) for the case γ = 1.
Minimum splitting with α = 0
• Fact: Let there be a splitting of the pattern into κ pieces, starting at position j of the multi-track text, without gaps between the partial occurrences. Then there is an equally good occurrence that can be found as follows: Select the track T^k whose jth suffix has the longest common prefix, say of length l, with the pattern. Iterate the same algorithm from position j+l with pattern suffix P_{l+1...m}, until a splitting into κ pieces is found.
Minimum splitting with α = 0...
• In the above algorithm, we need κ queries to Jump(i,j) for each track at each position j.
• Thus, after O(m²+Kn) preprocessing for Jump(i,j) queries, the problem can be solved in O(κKn) time.
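The greedy procedure can be sketched as below, assuming no empty characters and using a naive longest-common-prefix scan in place of the constant-time Jump(i,j) queries, so the sketch does not reach the O(κKn) bound. The function names and the `max_pieces` parameter (playing the role of the threshold κ) are illustrative; columns are 0-based.

```python
def prefix(a, b):
    """Length of the longest common prefix of a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def greedy_pieces_from(pattern, tracks, j, max_pieces):
    """Pieces used by the greedy gap-free split starting at column j, or None."""
    i, column, pieces = 0, j, 0
    while i < len(pattern):
        if pieces == max_pieces:
            return None  # threshold exceeded
        # Pick the track that extends the occurrence furthest from this column.
        longest = max(prefix(pattern[i:], track[column:]) for track in tracks)
        if longest == 0:
            return None  # no track continues the occurrence here
        i += longest
        column += longest
        pieces += 1
    return pieces


def min_pieces_no_gaps(pattern, tracks, max_pieces):
    """Best gap-free split over all start columns, or None if none uses <= max_pieces."""
    best = None
    for j in range(len(tracks[0])):
        pieces = greedy_pieces_from(pattern, tracks, j, max_pieces)
        if pieces is not None and (best is None or pieces < best):
            best = pieces
    return best
```

For example, P = abcd over the tracks abxx and xxcd is found at column 0 with two pieces, and is rejected when the threshold is 1.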
Implementation
• We implemented the O(|M|) time algorithm, with the aforementioned skipping of empty characters.
• Instead of using min-deques to support sliding window minima computation, we used a modification of the linear time construction of Cartesian trees (simple to implement and fast in practice).
• The algorithm is plugged into the C-BRAHMS music search engine.
http://www.cs.helsinki.fi/group/cbrahms/demoengine/
Extension and open problems
• The O(|M|) and O(|S| log n) algorithms can be extended to the case where the cost is the sum of the lengths of the gaps between the partial occurrences.
• Open: Computation in the case of the γ restriction on the lengths of the partial occurrences.
• Open: Can one achieve O(m+Kn+|S_γ|) time for constructing S_γ?