Horspool Algorithm. Source: Practical fast searching in strings, R. NIGEL HORSPOOL. Advisor: Prof. R. C. T. Lee. Speaker: H. M. Chen.
Definition of String Matching Problem
• Given a pattern string P of length m and a text string T of length n, we would like to know whether there exists an occurrence of P in T.
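As a baseline for the problem statement above, here is a minimal brute-force sketch in Python. The function name `occurs` is a hypothetical helper chosen for illustration; it is not part of Horspool's method and simply tries every alignment of P against T.

```python
def occurs(pattern: str, text: str) -> bool:
    """Return True if pattern occurs somewhere in text (brute force)."""
    m, n = len(pattern), len(text)
    # Try every alignment pos = 0 .. n-m of the pattern against the text.
    return any(text[pos:pos + m] == pattern for pos in range(n - m + 1))


print(occurs("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))  # True
```

This baseline costs up to m comparisons per alignment and always moves one position at a time; the rest of the slides show how Horspool's shift avoids many of these alignments.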
Rule 2: Character Matching Rule
• For any character X in T, find the nearest X in P that lies to the left of X in T.
Suffix search
[Figure: text window (labels α, β) aligned with the pattern (label σ); the compared part is marked "match".]
• For each position of the window, we compare its last character (β) with the last character of the pattern.
• If they match, we scan the window backward against the pattern until we either find the pattern or fail on a text character.
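Suffix search
[Figure: safe shift. The pattern is moved to the right until a β in the pattern lines up with β in the text; the part of the pattern to the right of that β contains no β.]
• Then, whether or not there was a match, we shift the window so that the nearest β in the pattern to the left of its last position is aligned with β in the text. If the pattern has no such β, the window moves past β entirely. Note that β is the last character of the previous window.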
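The two suffix-search slides can be sketched as a single window step. This is only an illustration under stated assumptions: `window_step` is a hypothetical name, indices are 0-based, and the shift is recomputed by scanning the pattern each time instead of using a precomputed table.

```python
def window_step(pattern: str, text: str, pos: int):
    """Check the window text[pos:pos+m]; return (matched, safe_shift)."""
    m = len(pattern)
    beta = text[pos + m - 1]              # last character of the window
    j = m - 1
    # Scan the window backward against the pattern.
    while j >= 0 and text[pos + j] == pattern[j]:
        j -= 1
    matched = j < 0
    # Rightmost beta among the first m-1 pattern characters (-1 if absent).
    k = pattern.rfind(beta, 0, m - 1)
    shift = m - 1 - k                     # equals m when beta is absent
    return matched, shift


T = "GCATCGCAGAGAGTATACAGTACG"
print(window_step("GCAGAGAG", T, 0))  # (False, 1)
print(window_step("GCAGAGAG", T, 5))  # (True, 2)
```

The shift depends only on β, not on where the backward scan stopped, which is why it can be tabulated once per pattern.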
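Preprocessing phase: HpBc table
The value HpBc[c] for a character c of the alphabet is the distance from the rightmost occurrence of c among the first m-1 characters of the pattern to the last position of the pattern. If c does not occur there, the value is m.
Example:
T : GCATCGCAGAGAGTATACAGTACG
P : G C A G A G A G
    7 6 5 4 3 2 1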
Pseudo code
Horspool (P = p_1 p_2 … p_m, T = t_1 t_2 … t_n)
  Preprocessing
    For c ∈ Σ Do d[c] ← m
    For j ∈ 1…m-1 Do d[p_j] ← m - j
  Searching
    pos ← 0
    While pos ≤ n-m Do
      j ← m
      While j > 0 And t_{pos+j} = p_j Do j ← j-1
      If j = 0 Then report an occurrence at pos+1
      pos ← pos + d[t_{pos+m}]
    End of while
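The pseudo code above can be sketched in Python as follows. The assumptions are ours: the function name `horspool` is hypothetical, strings are indexed from 0 (so an occurrence reported at pos+1 in the pseudo code is returned here as pos), and the table d is a dictionary whose missing entries default to m.

```python
def horspool(pattern: str, text: str) -> list:
    """Return the 0-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Preprocessing: d[c] = m - j for the rightmost j in 1..m-1 with p_j = c.
    d = {}
    for j in range(m - 1):                # 0-based j covers p_1 .. p_{m-1}
        d[pattern[j]] = m - 1 - j

    # Searching
    occurrences = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                        # backward scan of the window
        if j < 0:
            occurrences.append(pos)
        # Shift by the last character of the window; default is m.
        pos += d.get(text[pos + m - 1], m)
    return occurrences


print(horspool("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))  # [5]
```

Using a dictionary with a default of m avoids initializing an entry for every character of Σ, which matters for large alphabets; an array indexed by character code would follow the pseudo code more literally.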
Preprocessing phase
Example:
T : GCATCGCAGAGAGTATACAGTACG
P : GCAGAGAG
Step 1: For c ∈ Σ Do d[c] ← m
  c ∈ {A, C, G, T}
  d[A] = 8, d[C] = 8, d[G] = 8, d[T] = 8
Step 2: For j ∈ 1…m-1 Do d[p_j] ← m - j
  d[A] = 1, d[C] = 6, d[G] = 2, d[T] = 8
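The two preprocessing steps can be checked with a short Python sketch. The helper name `shift_table` and the explicit `alphabet` argument are assumptions made for illustration; j is kept 1-based to mirror the slide.

```python
def shift_table(pattern: str, alphabet: str) -> dict:
    """Build the Horspool shift table d for the given alphabet."""
    m = len(pattern)
    # Step 1: every character starts with the maximal shift m.
    d = {c: m for c in alphabet}
    # Step 2: for j = 1 .. m-1 (1-based), later j overwrite earlier ones,
    # so each character keeps the value of its rightmost occurrence.
    for j in range(1, m):
        d[pattern[j - 1]] = m - j
    return d


print(shift_table("GCAGAGAG", "ACGT"))  # {'A': 1, 'C': 6, 'G': 2, 'T': 8}
```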
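Example (1/3)
T : GCATCGCAGAGAGTATACAGTACG (positions 0 to 23), P : GCAGAGAG
Shift table d: A = 1, C = 6, G = 2, * = 8 (* stands for any other character)
Shift rule: pos ← pos + d[t_{pos+m}]. The text is indexed from 0 in this example, so the last character of the window is t_{pos+7}.

pos = 0
GCATCGCAGAGAGTATACAGTACG
GCAGAGAG
pos ← 0 + d[t_{0+7}] = 0 + d[A] = 1

pos = 1
GCATCGCAGAGAGTATACAGTACG
 GCAGAGAG
pos ← 1 + d[t_{1+7}] = 1 + d[G] = 3

pos = 3
GCATCGCAGAGAGTATACAGTACG
   GCAGAGAG
pos ← 3 + d[t_{3+7}] = 3 + d[G] = 5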
Example (2/3)
T : GCATCGCAGAGAGTATACAGTACG (positions 0 to 23), P : GCAGAGAG
Shift table d: A = 1, C = 6, G = 2, * = 8

pos = 5
GCATCGCAGAGAGTATACAGTACG
     GCAGAGAG
While j > 0 And t_{pos+j} = p_j Do j ← j-1
If j = 0 Then report an occurrence at pos+1
Every character of the window matches, so j reaches 0 and an occurrence is reported.
pos ← 5 + d[t_{5+7}] = 5 + d[G] = 7

pos = 7
GCATCGCAGAGAGTATACAGTACG
       GCAGAGAG
pos ← 7 + d[t_{7+7}] = 7 + d[A] = 8

pos = 8
GCATCGCAGAGAGTATACAGTACG
        GCAGAGAG
pos ← 8 + d[t_{8+7}] = 8 + d[T] = 16
Example (3/3)
T : GCATCGCAGAGAGTATACAGTACG (positions 0 to 23), P : GCAGAGAG
Shift table d: A = 1, C = 6, G = 2, * = 8

pos = 16
GCATCGCAGAGAGTATACAGTACG
                GCAGAGAG
pos ← 16 + d[t_{16+7}] = 16 + d[G] = 18
Now pos > n-m (18 > 24-8 = 16), so we jump out of the While loop.
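The sequence of window positions in this three-slide example can be reproduced with a small tracing sketch. The function name `trace` and the tuple layout are assumptions for illustration; positions are 0-based as on the slides.

```python
def trace(pattern: str, text: str) -> list:
    """Return (pos, last_window_char, shift, matched) for every window."""
    m, n = len(pattern), len(text)
    # Later occurrences overwrite earlier ones, so the rightmost one wins.
    d = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    steps = []
    pos = 0
    while pos <= n - m:
        last = text[pos + m - 1]
        shift = d.get(last, m)
        matched = text[pos:pos + m] == pattern
        steps.append((pos, last, shift, matched))
        pos += shift
    return steps


for step in trace("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"):
    print(step)
```

For the text and pattern above, the windows start at positions 0, 1, 3, 5, 7, 8 and 16, and only the window at position 5 matches.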
Example (1/2)
T : AGATACGATATATAC (positions 0 to 14), P : ATATA

pos = 0
AGATACGATATATAC
ATATA
The last characters match, but the backward scan fails on G ≠ T. We shift by d[A] = 2.

pos = 2
AGATACGATATATAC
  ATATA
G ≠ A, so we shift by d[G] = 5.
Example (2/2)
T : AGATACGATATATAC (positions 0 to 14), P : ATATA
Shift table d: A = 2, T = 1, * = 5

pos = 7
AGATACGATATATAC
       ATATA
We verify the window backward and find an occurrence. We then shift by reusing the last character of the window, d[A] = 2.

pos = 9
AGATACGATATATAC
         ATATA
We find the pattern again. We shift by the last character of the window, d[A] = 2. Then pos = 11 > n-m = 10 and the search stops.
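To redraw the alignments of this example, a small sketch can return the pattern indented under the text for each window. The helper name `alignments` is hypothetical and the output format is our own choice, not the layout of the original slides.

```python
def alignments(pattern: str, text: str) -> list:
    """Return (pos, pattern indented by pos spaces) for every window."""
    m, n = len(pattern), len(text)
    d = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    rows = []
    pos = 0
    while pos <= n - m:
        rows.append((pos, " " * pos + pattern))
        pos += d.get(text[pos + m - 1], m)   # shift by last window character
    return rows


text = "AGATACGATATATAC"
print(text)
for pos, row in alignments("ATATA", text):
    print(row, f"(pos = {pos})")
```

The windows at positions 7 and 9 overlap, which shows that the shift d[A] = 2 does not skip overlapping occurrences.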
Time complexity
• The preprocessing phase takes O(m + σ) time and O(σ) space.
• The searching phase takes O(mn) time in the worst case.
• The average number of comparisons for one text character is between 1/σ and 2/(σ+1).
(σ is the size of the alphabet Σ.)
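The gap between the worst case and a favorable case can be illustrated by counting character comparisons. This is a sketch under our own assumptions: `count_comparisons` is a hypothetical helper, and the two inputs are hand-picked extremes, not a measurement of the average case quoted above.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count text/pattern character comparisons made by Horspool's search."""
    m, n = len(pattern), len(text)
    d = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break                      # mismatch ends the backward scan
            j -= 1
        pos += d.get(text[pos + m - 1], m)
    return comparisons


text = "a" * 20
print(count_comparisons("baaa", text))  # 68: m comparisons in each of 17 windows
print(count_comparisons("bbbb", text))  # 5: one comparison in each of 5 windows
```

With the pattern baaa every window is scanned completely and the shift is only 1, which gives the O(mn) behavior. With bbbb the first comparison fails and the window jumps by m, so only about n/m characters of the text are inspected.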
References
• AHO, A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical Computer Science, Volume A, Algorithms and complexity, J. van Leeuwen ed., Chapter 5, pp 255-300, Elsevier, Amsterdam.
• BAEZA-YATES, R.A., RÉGNIER, M., 1992, Average running time of the Boyer-Moore-Horspool algorithm, Theoretical Computer Science 92(1):19-31.
• BEAUQUIER, D., BERSTEL, J., CHRÉTIENNE, P., 1992, Éléments d'algorithmique, Chapter 10, pp 337-377, Masson, Paris.
• CROCHEMORE, M., HANCART, C., 1999, Pattern Matching in Strings, in Algorithms and Theory of Computation Handbook, M.J. Atallah ed., Chapter 11, pp 11-1--11-28, CRC Press Inc., Boca Raton, FL.
• HANCART, C., 1993, Analyse exacte et en moyenne d'algorithmes de recherche d'un motif dans un texte, Ph. D. Thesis, University Paris 7, France.
• HORSPOOL, R.N., 1980, Practical fast searching in strings, Software - Practice & Experience 10(6):501-506.
• LECROQ, T., 1995, Experimental results on string matching algorithms, Software - Practice & Experience 25(7):727-765.
• STEPHEN, G.A., 1994, String Searching Algorithms, World Scientific.