Efficient algorithms for (δ,γ,α)-matching
Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland, kfredrik@cs.joensuu.fi
Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland, sgrabow@kis.p.lodz.pl
PSC, Prague, August 2006
Problem setting
String matching in its classic form: given a text T = t_0 t_1 ... t_{n-1} and a pattern P = p_0 p_1 ... p_{m-1} over a finite alphabet Σ of size σ, report all occurrences of P in T.
This simple problem variant (exact matching) is not very useful for many applications. For example, it is of little relevance to music information retrieval (MIR) and molecular biology.
Several approximate matching models have thus been developed...
K. Fredriksson & Sz. Grabowski, Efficient algorithms for (δ,γ,α)-matching
Models & applications: music information retrieval
We allow classes of characters: the classes are continuous intervals (of equal width, 2δ+1, for all pattern positions). This corresponds to handling small distortions of the melody (singer / whistler unskilled or under the influence...).
There is a limit γ (< mδ) on the sum of the individual errors.
Gaps are also allowed, in order to skip ornamentation (esp. in classical music). We assume all gaps are in the range [0, α].
Transposition invariance (future work, hopefully...): the key of the melody can be arbitrary, i.e. everything can be shifted up or down by a fixed value.
Problem we consider here: (δ,γ,α)-matching
Two symbols a, b ∈ Σ delta-match (we write a =δ b) iff |a - b| ≤ δ.
We say that a pattern P (δ,γ,α)-matches the text substring t_{i_0} t_{i_1} ... t_{i_{m-1}} if
p_j =δ t_{i_j} for j ∈ {0 ... m-1},
where 0 < i_{j+1} - i_j ≤ α+1,
and Σ_{j=0}^{m-1} |p_j - t_{i_j}| ≤ γ.
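The definition above can be checked directly for one candidate sequence of text positions. The following is a minimal sketch, not code from the talk: the function name `is_dga_occurrence` and the representation of symbols as integers (e.g. MIDI pitches) are assumptions made for illustration.

```python
def is_dga_occurrence(pattern, text, positions, delta, gamma, alpha):
    """Check the three conditions of the definition for one candidate alignment.

    positions[j] is the (hypothetical) text index i_j aligned with pattern[j];
    all indices are assumed to be valid indices into text.
    """
    if len(positions) != len(pattern):
        return False
    # Individual errors |p_j - t_{i_j}|.
    errors = [abs(p - text[i]) for p, i in zip(pattern, positions)]
    # delta condition: every aligned pair must delta-match.
    delta_ok = all(e <= delta for e in errors)
    # alpha condition: 0 < i_{j+1} - i_j <= alpha + 1 (at most alpha skipped symbols).
    gaps_ok = all(0 < b - a <= alpha + 1 for a, b in zip(positions, positions[1:]))
    # gamma condition: the sum of the individual errors is bounded.
    return delta_ok and gaps_ok and sum(errors) <= gamma
```

For example, with pattern [60, 62, 64] and text [60, 61, 62, 70, 64, 65], the positions [0, 2, 4] form an occurrence for δ = 1, γ = 2, α = 1, but not for α = 0, because each gap skips one text symbol.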
Previous work on similar models
(δ,α)-matching:
Crochemore et al., 2002: O(mn) time (worst, avg, and best case).
Cantone et al., 2005a: also O(mn) in every case, finding not only the end positions of the occurrences but also all the matching sequences.
Cantone et al., 2005b: O(n) on avg (for constant α), retaining O(mn) in the worst case.
Navarro & Raffinot, 2003; Cantone et al., 2005b: nondeterministic finite automaton with O(nmα/w) worst-case time.
Along these lines:
Fredriksson & Grabowski, 2006: a more compact automaton with O(nm log(α)/w) worst-case time.
Fredriksson & Grabowski, 2006: a bit-parallel alg with O(nδ + (n/w)m) worst-case time.
Surprisingly little work specifically on the (δ,γ,α)-matching problem...
Crochemore et al., 2002: dynamic programming alg, runs in O(mn) worst-case time. Uses a min-queue.
Of course, a brute-force DP alg is also possible: O(mnα) time, but it may be faster in practice than the more sophisticated alg above (as α is usually small).
Our contributions
We improve the basic dynamic programming based algorithm to run in O(nαδ/σ) average time.
We propose a simple sparse DP alg with O(n) avg time and O(min(mn, |M|α)) worst-case time, where M = { (i,j) | p_i =δ t_j }.
We develop a bit-parallel algorithm that runs in O(nδ + mn log γ / w) worst-case time. Its avg time complexity is close to O(n log γ · α(δ/σ) / w + n), assuming small α.
Basic dynamic programming
Let us have a matrix D, with each cell (i, j) corresponding to the search state of the pattern prefix p_0 ... p_i in text T.
More precisely, a γ-bounded value of D_{i,j} denotes that p_0 ... p_i matches T at the end position j.
Brute-force computation takes O(mnα) time and O(n) space (it is enough to store only the current and previous row).
We can also proceed column-wise: same time but O(αm) space instead.
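The row-wise brute-force computation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the recurrence D_{i,j} = |p_i - t_j| + min over the α+1 preceding cells of row i-1 is our reading of the slide, "no match" is represented by infinity, and the name `dga_match_dp` is hypothetical.

```python
INF = float("inf")

def dga_match_dp(pattern, text, delta, gamma, alpha):
    """Return the end positions j at which pattern (delta,gamma,alpha)-matches text.

    Row-wise brute force: O(m n alpha) time, O(n) space (two rows).
    """
    n = len(text)
    # Row 0: p_0 may start anywhere it delta-matches (and respects gamma).
    prev = []
    for t in text:
        err = abs(pattern[0] - t)
        prev.append(err if err <= delta and err <= gamma else INF)
    for i in range(1, len(pattern)):
        cur = [INF] * n
        for j in range(i, n):
            err = abs(pattern[i] - text[j])
            if err > delta:
                continue
            # Best prefix ending at j' with 0 < j - j' <= alpha + 1.
            best = min(prev[max(0, j - alpha - 1):j], default=INF)
            if best + err <= gamma:
                cur[j] = best + err
        prev = cur
    return [j for j, d in enumerate(prev) if d <= gamma]
```

With pattern [60, 62, 64], text [60, 61, 62, 70, 64, 65], δ = 1 and γ = 2, this reports the end position 4 for α = 1 and nothing for α = 0.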
Cut-off trick for improving the avg time (Ukkonen, 1985; Cantone et al., 2005)
Usually, calculating all the matrix cells is overkill.
Observation: if D_{i...m-1, j-α...j} > γ then D_{i+1...m-1, j+1} > γ.
Read: it’s not so easy to get out of a ‘dead zone’.
DP-CO, cont’d
The avg time is O(n (αδ/σ)^2). (Pessimistic analysis: we weren’t able to take the γ restriction into account.)
The worst case remains O(mnα), but as in (Crochemore et al., 2002) it can be improved to O(mn). The difference is that we handle m queues as we proceed column-wise.
Simple algorithm (ingenious name, eh?)
In a few words: the naïve brute-force DP algorithm, but applied only locally.
We work on lists L_i, corresponding to individual rows.
We start with L_0 = { j | t_j =δ p_0 } (obtained in O(n) time).
For i = 1 ... m-1:
L_i = { j | t_j =δ p_i AND D_{i-1,j’} + |p_i - t_j| ≤ γ AND 0 < j - j’ ≤ α+1 for some j’ in L_{i-1} }
We put each j into L_i only once (if several j’ can cause it, we choose the one that minimizes the new D_{i,j}).
Obtaining list L_i takes O(α|L_{i-1}|) time.
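The list-based scheme above can be sketched as follows. This is a hedged illustration, not the authors' implementation: each list L_i is held in a dictionary mapping a text position j to its best error sum D_{i,j}, and the function name `simple_sparse_match` is an assumption.

```python
INF = float("inf")

def simple_sparse_match(pattern, text, delta, gamma, alpha):
    """Sparse DP: only positions that still carry a live prefix match are stored."""
    n = len(text)
    # L_0: all positions where t_j delta-matches p_0 (O(n) scan).
    cur = {}
    for j, t in enumerate(text):
        err = abs(pattern[0] - t)
        if err <= delta and err <= gamma:
            cur[j] = err
    for p in pattern[1:]:
        nxt = {}
        # Each entry of L_{i-1} looks at the alpha+1 positions following it,
        # so building L_i costs O(alpha * |L_{i-1}|).
        for j_prev, d in cur.items():
            for j in range(j_prev + 1, min(j_prev + alpha + 2, n)):
                err = abs(p - text[j])
                # Keep j once, with the smallest error sum over all j_prev.
                if err <= delta and d + err <= gamma and d + err < nxt.get(j, INF):
                    nxt[j] = d + err
        cur = nxt
    return sorted(cur)
```

On the example pattern [60, 62, 64] and text [60, 61, 62, 70, 64, 65] with δ = 1, γ = 2, the lists shrink from two entries to one, and the single end position 4 is reported for α = 1.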
Simple algorithm, cont’d
Complexity
In the worst case, all the lists have total length |M|, which implies O(|M|α) worst-case time.
But: (i) on average this is much better, (ii) we can somewhat improve the worst case.
Simple algorithm, cont’d
Average case analysis
The length of list L_0 is O(n δ/σ) on avg.
Hence L_1 is computed in O(n αδ/σ) avg time.
But its avg length is only O(n · δ/σ · αδ/σ).
...
In general, computing L_i takes O(n (αδ/σ)^i) avg time.
The total time is the sum of m such terms.
Note that α, δ, σ are fixed for a given problem instance. In other words, αδ/σ can be considered a constant.
If the constant α(2δ+1)/σ is less than 1, we have a geometric series with O(n) sum.
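The geometric-series step can be written out as follows. This is a sketch of the argument, with the shorthand c introduced here for the slide's constant α(2δ+1)/σ and assumed to be below 1.

```latex
\[
  \underbrace{O(n)}_{\text{building } L_0}
  \;+\; \sum_{i=1}^{m-1} O\!\left(n\,c^{\,i}\right)
  \;\le\; O(n) + O\!\left(n \sum_{i=1}^{\infty} c^{\,i}\right)
  \;=\; O(n) + O\!\left(\frac{n\,c}{1-c}\right)
  \;=\; O(n),
  \qquad c = \frac{\alpha(2\delta+1)}{\sigma} < 1 .
\]
```

Since c does not depend on n or m, the factor c/(1-c) is a constant, which is why the bound does not grow with the pattern length.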
Simple algorithm, cont’d
Improving the worst case
Idea: avoid brute-force handling of overlapping windows of size α+1.
We make use of a min-queue (Gajewska & Tarjan, 1986), similarly to the concept from (Crochemore et al., 2002).
The queue always keeps up to α+1 integers, namely the error sums corresponding to the sliding window area in the previous row.
For each processed cell, 0 or 1 values are inserted at the front of the queue (O(1) time) and from 0 to α+1 values are deleted from the tail. But we can’t remove more than we’ve inserted. Hence O(1) amortized cost per cell.
This improves the worst-case time complexity to O(min(mn, |M|α)).
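The amortized O(1) argument can be illustrated with a sliding-window minimum over one row. This sketch uses a monotone deque as a stand-in for the min-queue of Gajewska & Tarjan; it is not their data structure verbatim, and the name `sliding_window_minima` is hypothetical. In the DP, `values` would be the previous row and `width` would be α+1.

```python
from collections import deque

def sliding_window_minima(values, width):
    """For each k, return min(values[max(0, k - width + 1) : k + 1]).

    Every index is appended once and removed at most once,
    so the total work is O(len(values)), i.e. O(1) amortized per cell.
    """
    window = deque()  # indices; their values increase from left to right
    minima = []
    for k, v in enumerate(values):
        # Values no smaller than the new one can never be a future minimum.
        while window and values[window[-1]] >= v:
            window.pop()
        window.append(k)
        # Drop the oldest index once it slides out of the window.
        if window[0] <= k - width:
            window.popleft()
        minima.append(values[window[0]])
    return minima
```

The design point is that no removal happens without a matching earlier insertion, which is exactly the "we can't remove more than we've inserted" observation.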
Bit-parallelism technique (in stringology)
Baeza-Yates (1989) noticed that CPU registers are usually longer than 1 bit... And he made use of this fact.
In O(1) time we can perform operations like logical and (&), or (|), shifts (<<, >>) etc. on a whole machine word (usually 32 or 64 bits).
Nowadays, bit-parallelism is a very popular technique in string matching algorithms, in theory and in practice.
It is also useful for many approximate matching variants.
Bit-parallel dynamic programming
Modified DP alg: let the cells of D be chunks of O(log γ) bits. We’ll be able to compute O(w / log γ) cells in parallel.
More precisely, each cell will use l + 1 bits, where l = ⌈log_2(2γ+1)⌉.
Error sum zero will be encoded as 2^{l-1} - (γ+1); γ+1 (the lowest ‘illegal’ value) will thus be 2^{l-1} (an old trick, e.g., Fredriksson & Navarro, 2004; Crochemore et al., 2005).
This representation can solve 3 issues:
(i) checking in parallel if some counters exceed γ,
(ii) parallel handling of counter overflows,
(iii) computing pairwise minima over two sets of counters in parallel.
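The biased-counter idea can be sketched with Python integers standing in for machine words. The field width chosen below, `(gamma + 1).bit_length() + 1` bits, is an assumption of this sketch and need not equal the l + 1 bits of the slide; it is picked so that a saturated counter plus one error value of at most γ+1 never carries into the neighbouring field. The name `add_errors_packed` is hypothetical.

```python
def add_errors_packed(sums, errors, gamma):
    """Add per-cell errors to gamma-bounded error sums with one word addition.

    Returns the new sums, where any value above gamma is reported as gamma + 1.
    """
    count = len(sums)
    bits = (gamma + 1).bit_length() + 1   # field width (assumption of this sketch)
    top = 1 << (bits - 1)                 # top bit of a field: "sum exceeds gamma"
    bias = top - (gamma + 1)              # stored encoding of error sum zero

    def pack(values):
        word = 0
        for k, v in enumerate(values):
            word |= v << (k * bits)
        return word

    high = pack([top] * count)
    counters = pack([min(s, gamma + 1) + bias for s in sums])
    addends = pack([min(e, gamma + 1) for e in errors])

    total = counters + addends            # all cells at once; no carry crosses a field
    over = total & high                   # (i) which counters exceed gamma
    low = over - (over >> (bits - 1))     # low bits of exactly those fields
    total &= ~low                         # (ii) saturate them at gamma + 1

    field = (1 << bits) - 1
    return [((total >> (k * bits)) & field) - bias for k in range(count)]
```

For γ = 4 the sketch uses 4-bit fields with bias 3, so the sums [0, 3, 4, 5] plus the errors [1, 2, 0, 5] give [1, 5, 4, 5], where 5 = γ+1 marks a dead cell. Parallel minima (issue iii) are not shown here.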
BP-DP, cont’d
Tiling the DP matrix with C × 1 vectors, where C = w / (l+1) (C = 8 in the figure).
The dark gray cell of the current tile depends on the light gray cells of the two tiles in the previous row (α = 4).
We are in row i. Thanks to preprocessing, we know the delta-errors between all chars in the current tile (C cells) and P[i].
Problem: how to calculate the new values of D_{i,*}?
BP-DP, cont’d
Solution #1. Naïve shifts (chunk by chunk) and minimizations, with an O(α) factor.
Solution #2. Similar, but with a halving technique: first shift by α/2 counter positions, then by α/4 etc., performing the minimization at each step. It yields an O(log α) time factor.
Solution #3. Use a precomputed function. This is the one we choose, as it gives O(1) time for an O(w)-bit chunk (in practice some w’, e.g. w’ = w/4).
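The logarithmic idea behind Solution #2 can be illustrated on a plain list of counters. Note the hedge: this sketch grows the covered window by doubling rather than shifting by α/2, α/4, ..., so it is a related variant and not the slide's exact schedule; each list comprehension stands in for one "shift plus parallel minimum" step on packed counters. The name `windowed_min_by_doubling` is hypothetical.

```python
def windowed_min_by_doubling(values, width):
    """For each k, return min(values[max(0, k - width + 1) : k + 1]).

    Uses O(log width) shift-and-minimize rounds instead of width - 1.
    """
    cur = list(values)
    size = 1                               # cur[k] currently covers `size` cells ending at k
    while size < width:
        step = min(size, width - size)     # never overshoot the target width
        # "Shift by `step` positions" and take the element-wise minimum.
        cur = [min(cur[k], cur[k - step]) if k >= step else cur[k]
               for k in range(len(cur))]
        size += step
    return cur
```

With `width` = α+1 this yields, for every cell, the minimum over the α+1 cells of the previous row that end at it.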
Pre-empting the computation in the BP-DP search
The cut-off trick can again be used, with some modification, since now we calculate C cells in parallel. (Read: the picture at slide 9 will be less jagged and the trick is somewhat less efficient here.)
Avg search time is (upper bound estimation, maybe not tight): O((n/C) · αδ/σ + n).
How to find minima in parallel for the O(w / log γ)-sized chunks
Precomputing as usual (ugly...) or an old trick (Paul & Simon, 1980).
Preprocessing in BP-DP
Preprocessing is simple. We build a helper bit-matrix V such that V_{i,j} = |p_i - t_j| if p_i =δ t_j, and γ+1 otherwise.
Note that the number of rows in V can be reduced to the number of unique symbols in P (why store identical rows more than once?), which is σ_P. We call this terse representation V’.
First we fill V’ with γ+1 values in O((n/C) · σ_P) time.
Then we scan T and set values 0..δ in at most 2δ+1 rows of V’ (those that δ-match the current char from T).
Worst-case time of the latter phase: O(nδ). Less on avg.
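The two preprocessing phases can be sketched as follows. For readability this illustration stores V’ as plain lists, one per distinct pattern symbol, instead of packing C counters per machine word as the talk describes; the function name `build_error_table` is an assumption.

```python
def build_error_table(pattern, text, delta, gamma):
    """Return V' as a dict: distinct pattern symbol -> row of per-position errors."""
    # Phase 1: one row per distinct symbol of P, filled with the 'illegal' value gamma+1.
    rows = {c: [gamma + 1] * len(text) for c in set(pattern)}
    # Phase 2: each text symbol touches at most 2*delta + 1 rows.
    for j, t in enumerate(text):
        for c in range(t - delta, t + delta + 1):
            row = rows.get(c)
            if row is not None:       # c does not occur in P: nothing to update
                row[j] = abs(c - t)
    return rows
```

For pattern [60, 62, 60], text [61, 62, 70], δ = 1 and γ = 4, only two rows are built: the row for symbol 60 is [1, 5, 5] and the row for symbol 62 is [1, 0, 5].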
Lazy preprocessing
Note that in the previous scheme (with cut-off) the avg search time may be as low as O(n), while the preprocessing is typically superlinear (even if not by much).
To avoid costly preprocessing in the case when the search will be fast (i.e. the cut-off trick works efficiently), we can interleave the preprocessing and search phases.
This leads to O((n/C) · αδ/σ + n) avg preprocessing time (pessimistic analysis), i.e. it matches the avg search time.
Multiple patterns
The bit-parallel alg has a relatively high preprocessing cost: O(nδ + σ_P · n / (w / log γ)) in the worst case.
If we are about to search for r patterns, the search time is multiplied by r, but the good news is that the preprocessing cost grows much more mildly: to O(nδ + σ_P · n / (w / log γ) + rm), where σ_P is now the number of distinct symbols in the whole pattern set.
Practical (well-known) trick for r patterns, if r is small compared to σ/δ: superimpose the patterns (then verify).
Test methodology
All algorithms implemented in C, compiled with icc 9.0.
Test machine: P4 2.4 GHz, 512 MB, running GNU/Linux 2.4.20.
Avg times reported over 100 trials (randomly extracted patterns).
Text files:
1. Concatenation of 7543 music pieces (MIDI, stripped of everything except pitch values), totalling 1.8 MB. Alphabet: [0..127] range, but far from random: only 55 values actually occur, and the 6 most frequent symbols alone cover ~50% of the whole text.
2. Uniformly random data in the 0..127 range.
Compared algorithms
BP Cut-off: bit-parallel dynamic programming with cut-off (without the lazy preprocessing).
BP Filter: the (δ,α)-matching version of BP Cut-off (Fredriksson & Grabowski, 2006) used as a filter, with DP-CO used for verifications.
DP Cut-off: dynamic programming with cut-off.
Simple: simple sparse DP (in the O(|M|α) worst-case time version).
Experimental results, MIDI: δ = 1, γ = 4, α = 1 (plot not reproduced)
Experimental results, MIDI: δ = 4, γ = 16, α = 2 (plot not reproduced)
Experimental results, random: δ = 4, γ = 16, α = 2 (plot not reproduced)
Conclusions
Bit-parallelism works well also for the (δ,γ,α) search problem...
...But it works even better if regions of text where matches cannot be extended are quickly discarded.
Still, BP-DP for (δ,γ,α) disappoints compared to BP-DP for (δ,α) used as a filter. (Problem: the γ counters need many bits...)
The consistently best alg in the tests was a simple heuristic (called Simple alg). Fortunately, it doesn’t have competitive worst-case time.
Future plans
Research on extended models: most importantly with transposition invariance.
Some purely theoretical variants (e.g., better complexity for large α).
Injecting compression to represent bit vectors more succinctly and thus speed up the search?
Can we replace the log γ factor in the bit-parallel alg with log δ? (Hint: in each step we increase the counters by at most δ only.)