Fundamentals of Algorithms, Lecture 8: String Matching Algorithms

Lecturer: Prof. Gu Naijie (顾乃杰)
Affiliation: School of Computer Science and Technology
Term: 2016-2017 (Fall)

Outline
• The Naive Algorithm (Brute Force)
• The Knuth-Morris-Pratt Algorithm
• The SHIFT-OR Algorithm
• The Boyer-Moore Algorithm
• The Boyer-Moore-Horspool Algorithm
• The Karp-Rabin Algorithm
• Conclusion
These lecture notes draw on the following String Searching Algorithm course materials, with thanks:
• Lecture notes by Prof. Huang San-Yi (黃三益), National Sun Yat-sen University, Taiwan, China
• Princeton University
• Kevin Wayne
• Theory of Algorithms
• COS 42
Department of Computer Science & Technology

8.0 The string matching problem
• String-matching problem:
  • Find one occurrence of a pattern in a text;
  • Find all the occurrences of a pattern in a text.
• Applications require two kinds of solutions, depending on which string, the pattern or the text, is given first.
• Algorithms based on automata or on combinatorial properties of strings are commonly used to preprocess the pattern and solve the first kind of problem.
• Indexes realized by trees or automata are used in the second kind of solution.

String matching and its applications
• Some applications:
  • Word processors.
  • Virus scanning.
  • Text information retrieval systems (Lexis, Nexis).
  • Digital libraries.
  • Natural language processing.
  • Specialized databases.
  • Computational molecular biology.
  • Web search engines.
  • Bioinformatics.

String matching example (figure)
Search pattern: n e e d l e
Search text: n n e e n l e d e n e e n e e d l e n l d
Successful search: the pattern occurs in the text starting at position 13.

Common terminology and definitions
• Parameters. Let T denote the text and P the pattern.
• n: the length of the text.
• m: the length of the pattern (string).
• Typically, n >> m.
  • e.g., n = 1 million, m = 1 hundred
• σ: the size of the alphabet.
• ∑: the alphabet.
• Cn: the expected number of comparisons performed by an algorithm while searching for the pattern in a text of length n.

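Overview of string matching algorithms
• The string matching algorithms presented in current textbooks share one basic principle:
  1. Scan the text with a window whose size equals the pattern length;
  2. First align the pattern with the left end of the text;
  3. Compare the characters of the pattern with the corresponding characters of the text; this is called an attempt;
  4. After each successful match or each mismatch, shift the window to the right;
  5. Repeat steps 3 and 4 until the right end of the window goes beyond the right end of the text.
• This method is called the sliding window mechanism.
• When comparing the current window of the text with the pattern, the comparison can run from left to right, from right to left, or even in some specific order.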
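The sliding window mechanism above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `sliding_window_search` and the `order` parameter are hypothetical, positions are 0-based, and the window always shifts by one position, which is the simplest possible shift rule rather than the rule of any particular algorithm in this lecture.

```python
def sliding_window_search(text, pattern, order=None):
    """Return the 0-based start positions of all matches.

    `order` (a hypothetical parameter) lists the pattern indices in the
    order they are compared within one attempt; the default is left to right.
    """
    n, m = len(text), len(pattern)
    if order is None:
        order = range(m)  # left-to-right comparison
    matches = []
    # Each value of `start` is one attempt; the window then shifts right by 1.
    for start in range(n - m + 1):
        if all(text[start + k] == pattern[k] for k in order):
            matches.append(start)
    return matches


if __name__ == "__main__":
    text = "nneenledeneeneedlenld"
    # Left to right and right to left find the same occurrences.
    print(sliding_window_search(text, "needle"))
    print(sliding_window_search(text, "needle", order=range(5, -1, -1)))
```

The comparison order changes only how quickly a mismatch is detected inside one attempt, not which positions match.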

Overview of string matching algorithms (continued)
• From left to right
  • Karp and Rabin
  • Knuth, Morris and Pratt
• From right to left
  • Boyer and Moore
  • Horspool
• In any order
  • Brute Force Algorithms (Naive Algorithm)

8.1 The Brute Force algorithm
• Brute force: check for the pattern starting at every text position, trying to match each substring of length m in the text against the pattern.
• Analysis of brute force:
  • Running time depends on the pattern and the text.
    • It can be slow when the strings repeat themselves.
  • Worst case: O(mn) comparisons.
    • Too slow when m and n are large.

Brute Force pseudocode 1
Brute-Force-1(T, P)
  i = 0;
  while i ≤ n-m do
    j = 0;                       // left-to-right scan of P
    while j < m and P[j+1] = T[i+j+1] do
      j = j+1;
    if j = m then
      Report_match_at_position(i+1);   // the window starts at T[i+1]
    i = i+1;
  return.

Brute Force pseudocode 2 (arrays indexed from 1)
char text[], pat[];
int n, m;
{
  int i, j, k, lim;
  lim = n-m+1;
  for (i = 1; i <= lim; i++) {          /* search */
    k = i;
    for (j = 1; j <= m && text[k] == pat[j]; j++)
      k++;
    if (j > m)
      Report_match_at_position(i);      /* the window starts at text[i] */
  }
}
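A runnable counterpart of the brute force pseudocode, as a minimal Python sketch. The function name `brute_force_search` is an assumption made for illustration; Python strings are 0-based, so the shift i is converted to the 1-based position used on the slides when a match is reported.

```python
def brute_force_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):          # i is the 0-based shift of the window
        j = 0
        # Left-to-right scan of the pattern against the current window.
        while j < m and pattern[j] == text[i + j]:
            j += 1
        if j == m:                      # all m characters matched
            positions.append(i + 1)     # report using 1-based indexing
    return positions


if __name__ == "__main__":
    print(brute_force_search("nneenledeneeneedlenld", "needle"))
```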

Brute Force string matching example (figure): the search pattern n e e d l e is aligned with the search text n n e e n l e d e n e e n e e d l e n l d and shifted right one position at a time.

Brute Force string matching example (figure, continued): the window keeps shifting right until it goes beyond the right end of the text; stop!

Analysis of the Brute Force algorithm
Search pattern: a a a a a b
Search text: a long run of a's ending in a single b (figure: the pattern is shifted right one position at a time, and every attempt fails only at the final b).
• Analysis of brute force:
  • Running time depends on the pattern and the text.
    • It can be slow when the strings repeat themselves.
  • Worst case: about mn comparisons.
    • Too slow when m and n are large.
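The worst case can be checked by counting comparisons. The sketch below is an illustration only: `count_comparisons` is a hypothetical helper, and the text of twenty a's followed by one b is an assumed stand-in for the text in the figure. Every attempt compares all m characters, so the total is m(n - m + 1).

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by brute force search."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                   # mismatch ends this attempt
    return comparisons


if __name__ == "__main__":
    text = "a" * 20 + "b"               # assumed example text, n = 21
    pattern = "aaaaab"                  # m = 6
    n, m = len(text), len(pattern)
    # Each of the n - m + 1 attempts needs all m comparisons.
    print(count_comparisons(text, pattern), m * (n - m + 1))
```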

How To Save Comparisons
Search pattern: a a a a a b
Search text: a long run of a's ending in a single b (figure).
• How to avoid recomputation?
  • Pre-analyze the search pattern.
  • Ex: suppose that the first 5 characters of the pattern are all a's.
    • If T[1..5] matches P[1..5] then T[2..5] matches P[1..4].
    • No need to check i = 2, j = 1, 2, 3, 4.
    • Saves 4 comparisons.
  • Need better ideas in general.

Homework 32.1 • Page 579 : 32.1-2, 32.1-4.

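8.2 The KMP algorithm
• Idea of KMP: after a failed attempt, make full use of the information already obtained about the partial match, so as to avoid redundant tests and speed up matching.
• During matching, the current pointer into the text only moves right (increases); it never moves back to the left.
• While testing the segment T[i+1..i+m] of the text, suppose we obtain a partial match: P[1..j] = T[i+1..i+j], but P[j+1] ≠ T[i+j+1].
• Let k < j be the largest integer such that P[1..k] = P[(j-k+1)..j] (that is, the length of the longest proper suffix of P[1..j] that is also a proper prefix).
• Definition: Next[j+1] = max{ k+1 | P[1..k] is a suffix of P[1..j], j > k ≥ 0 }
• The values of Next depend only on the pattern, not on the text, so Next[1..m] can be precomputed before matching.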
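The definition of Next can be evaluated literally, which is slow but useful for checking a faster implementation. The following is a sketch under assumptions: the name `next_by_definition` is hypothetical, Next[1] is taken to be 0 (the set in the definition is empty for j = 0), and the result is returned as a Python list holding Next[1], ..., Next[m].

```python
def next_by_definition(pattern):
    """Compute Next[1..m] straight from the definition (cubic time)."""
    m = len(pattern)
    nxt = [0] * (m + 1)                 # index 0 unused; Next[1] = 0 by convention
    for j in range(1, m):               # this iteration fills in Next[j+1]
        prefix = pattern[:j]            # P[1..j]
        best = 0
        for k in range(j):              # candidates 0 <= k < j
            if prefix.endswith(pattern[:k]):   # is P[1..k] a suffix of P[1..j]?
                best = k
        nxt[j + 1] = best + 1
    return nxt[1:]


if __name__ == "__main__":
    print(next_by_definition("neenee"))
```

For example, with P = neenee and j = 5, the longest proper suffix of "neene" that is also a prefix is "ne" (k = 2), so Next[6] = 3.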

The Next function in the KMP algorithm
Next(P[1..m])
  j ← 0;
  m ← Length(P);
  For i ← 1 to m do
    Next[i] ← j;
    While j > 0 and P[i] ≠ P[j] do
      j ← Next[j];
    j ← j + 1;
Running time of the algorithm: O(m)
• i - j ≥ 0, and i - j never decreases;
• the number of times i - j stays unchanged is at most m;
• i - j ≤ m, so the number of times it increases is at most m.
The book Introduction to Algorithms defines a similar function π[i]; the two functions are related by: π[i] = Next[i+1] - 1
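A Python sketch of the Next computation, together with a check of the stated relation to π. The names `compute_next` and `prefix_function` are assumptions for illustration; `prefix_function` follows the usual textbook formulation of π and is not taken from these slides. The pattern is padded with one dummy character so that the 1-based indices of the pseudocode carry over unchanged.

```python
def compute_next(pattern):
    """Return [Next[1], ..., Next[m]] following the slide's pseudocode."""
    m = len(pattern)
    p = " " + pattern                   # p[1..m] is the pattern; p[0] is never read
    nxt = [0] * (m + 1)
    j = 0
    for i in range(1, m + 1):
        nxt[i] = j
        while j > 0 and p[i] != p[j]:
            j = nxt[j]
        j += 1
    return nxt[1:]


def prefix_function(pattern):
    """Return [pi[1], ..., pi[m]], the textbook prefix function."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


if __name__ == "__main__":
    nxt = compute_next("neenee")
    pi = prefix_function("neenee")
    print(nxt, pi)
    # pi[i] = Next[i+1] - 1 for i = 1 .. m-1 (lists are 0-based in Python).
    print(all(pi[i - 1] == nxt[i] - 1 for i in range(1, len(nxt))))
```

Since only Next[1..m] is computed, the relation can be checked for i = 1, ..., m - 1.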

KMP pseudocode
KMP(T, P)
  j ← 1;
  For i ← 1 to n do
    while j > 0 and T[i] ≠ P[j] do
      j ← Next[j];
    if j = m then            // found a successful match
      return (i-m+1);
    j ← j + 1;
  return (none)
Running time of the algorithm: O(n)
• i - j ≥ 0, and i - j never decreases;
• the number of times i - j stays unchanged is at most n;
• i - j ≤ n, so the number of times it increases is at most n.
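The KMP pseudocode translates to Python almost line by line. This is a minimal sketch, assuming a non-empty pattern; the names `kmp_search` and `compute_next` are hypothetical, and both strings are padded with a dummy character at index 0 to keep the 1-based indexing of the slides. It returns the 1-based position of the first match, or `None`.

```python
def compute_next(pattern):
    """Return a list nxt with nxt[1..m] = Next[1..m] (nxt[0] is unused)."""
    m = len(pattern)
    p = " " + pattern
    nxt = [0] * (m + 1)
    j = 0
    for i in range(1, m + 1):
        nxt[i] = j
        while j > 0 and p[i] != p[j]:
            j = nxt[j]
        j += 1
    return nxt


def kmp_search(text, pattern):
    """Return the 1-based start of the first occurrence, or None."""
    n, m = len(text), len(pattern)
    if m == 0:
        return None                     # assumption: empty patterns are rejected
    t, p = " " + text, " " + pattern    # shift to 1-based indexing
    nxt = compute_next(pattern)
    j = 1
    for i in range(1, n + 1):
        while j > 0 and t[i] != p[j]:
            j = nxt[j]                  # fall back without moving i
        if j == m:                      # here t[i] == p[m]: full match
            return i - m + 1
        j += 1
    return None


if __name__ == "__main__":
    print(kmp_search("nneeneldeneeneedlenld", "neenee"))
```

The text pointer i never moves backwards, which is what makes the search phase linear in n.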


KMP example: computing Next for the pattern
Search pattern P:  n e e n e e
Next values:       0 1 1 1 2 3
Search text T: n n e e n e l d e n e e n e e d l e n l d
Trace of Next(P), showing the value of j at the moment Next[i] ← j is executed:
  i = 1, j = 0, Next[1] = 0
  i = 2, j = 1, Next[2] = 1
  i = 3, j = 1, Next[3] = 1
  i = 4, j = 1, Next[4] = 1
  i = 5, j = 2, Next[5] = 2
  i = 6, j = 3, Next[6] = 3

KMP example: matching trace
Search pattern P = n e e n e e, with Next = 0 1 1 1 2 3
Search text T = n n e e n e l d e n e e n e e d l e n l d
Matching starts at the left end of the text T.
• j = 1, i = 1: P[1] = T[1], so j = j + 1, i = i + 1; next compare P[2] and T[2].
• j = 2, i = 2: P[2] ≠ T[2], so j = Next[2] = 1; next compare P[1] and T[2].
• j = 1, i = 2: P[1] = T[2], so j = 2, i = 3; next compare P[2] and T[3].
• j = 2, i = 3: P[2] = T[3], so j = 3, i = 4; next compare P[3] and T[4].
• (The comparisons at i = 4, 5, 6 also succeed, giving j = 6, i = 7.)
• j = 6, i = 7: P[6] ≠ T[7], so j = Next[6] = 3; next compare P[3] and T[7].
• j = 3, i = 7: P[3] ≠ T[7], so j = Next[3] = 1; next compare P[1] and T[7].
• j = 1, i = 7: P[1] ≠ T[7], so j = Next[1] = 0; the while loop ends, j is incremented back to 1 and i advances to 8; next compare P[1] and T[8].
• j = 1, i = 8: P[1] ≠ T[8], so j = Next[1] = 0, then j = 1, i = 9; next compare P[1] and T[9].
• j = 1, i = 9: P[1] ≠ T[9], so j = Next[1] = 0, then j = 1, i = 10; next compare P[1] and T[10].
• j = 1, i = 10: P[1] = T[10], so j = 2, i = 11; next compare P[2] and T[11].
• j = 2, i = 11: P[2] = T[11], so j = 3, i = 12; next compare P[3] and T[12].
• j = 3, i = 12: P[3] = T[12], so j = 4, i = 13; next compare P[4] and T[13].
• (The comparisons at i = 13 and 14 also succeed, giving j = 6, i = 15.)
• j = 6, i = 15: P[6] = T[15] and j = m, so a match is found; return i - m + 1 = 10. The algorithm ends.

Summary of KMP
• KMP summary:
  • Build an FSA from the pattern.
  • Run the FSA on the text.
  • O(m + n) worst-case string search.
  • Good efficiency for patterns and texts with much repetition.
    • binary files
    • graphics formats
  • Less useful for text strings.
  • On-line algorithm.
    • virus scanning
    • Internet spying
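The on-line property can be illustrated with a generator that consumes the text one character at a time and never looks back. This sketch goes slightly beyond the slides and rests on assumptions: the name `kmp_stream` is hypothetical, and to keep searching after a match it uses one extra fallback value, Next[m+1], taken as the value of j when the Next loop finishes. It yields the 1-based start of every occurrence, including overlapping ones.

```python
def kmp_stream(chars, pattern):
    """Yield 1-based start positions of pattern in a stream of characters."""
    m = len(pattern)
    if m == 0:
        return                          # assumption: empty patterns yield nothing
    p = " " + pattern                   # 1-based indexing as on the slides
    nxt = [0] * (m + 2)
    j = 0
    for i in range(1, m + 1):
        nxt[i] = j
        while j > 0 and p[i] != p[j]:
            j = nxt[j]
        j += 1
    nxt[m + 1] = j                      # extension: fallback after a full match

    j = 1
    for i, c in enumerate(chars, start=1):
        while j > 0 and c != p[j]:
            j = nxt[j]
        if j == m:
            yield i - m + 1
            j = nxt[m + 1]              # keep going; overlaps are allowed
        else:
            j += 1


if __name__ == "__main__":
    # Any iterable of characters works, e.g. a file read in small chunks.
    print(list(kmp_stream(iter("neeneenee"), "neenee")))
```

Because only the current character and the state j are needed, no buffer of earlier text has to be kept.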

History of the KMP algorithm
• History of KMP:
  • Inspired by a theorem of Cook that says an O(m + n) algorithm should be possible.
  • Discovered in 1976 independently by two groups.
    • Knuth-Pratt.
    • Morris was a hacker trying to build an editor.
      • The annoying problem was that you needed a buffer when performing text search.
  • Resolved theoretical and practical problems.
    • A surprise when it was discovered.
    • In hindsight, it seems like the right algorithm.

Homework 32.4 • Page 593 : 32.4-1, 32.4-2, 32.4-5.

8.3 The Shift-Or algorithm
• uses bitwise techniques;
• efficient if the pattern length is no longer than the memory-word size of the machine;
• preprocessing phase in O(m + σ) time and space complexity;
• searching phase in O(n) time complexity (independent of the alphabet size σ and the pattern length);
• adapts easily to approximate string matching.

Idea of the Shift-Or algorithm
• Let R be a bit array of size m.
• Vector $R_j$ is the value of the array R after text character T[j] has been processed (see the figure on the next slide).
• It contains information about all matches of prefixes of P that end at position j in the text; for 1 ≤ i ≤ m:
  $R_j[i] = 0$ if $P[1..i] = T[j-i+1..j]$, and $R_j[i] = 1$ otherwise.
Note: in some books the arrays used in this algorithm are indexed from 0, not from 1!

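Idea of the Shift-Or algorithm (continued)
Figure: the text T with position j marked, and the vector $R_j$ with one bit per prefix of P ending at position j. Example values shown:
  i = 1    P[1]       $R_j[1] = 1$
  i = 2    P[1..2]    $R_j[2] = 0$
  i = 3    P[1..3]    $R_j[3] = 1$
  i = m    P[1..m]    $R_j[m] = 0$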

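Idea of the Shift-Or algorithm (continued)
• The vector $R_{j+1}$ can be computed from $R_j$ as follows.
• For each $R_j[i] = 0$:
  $R_{j+1}[i+1] = 0$ if $P[i+1] = T[j+1]$, and $R_{j+1}[i+1] = 1$ otherwise,
  and
  $R_{j+1}[1] = 0$ if $P[1] = T[j+1]$, and $R_{j+1}[1] = 1$ otherwise.
• If $R_{j+1}[m] = 0$ then a complete match can be reported.
• The transition from $R_j$ to $R_{j+1}$ can be computed in two steps.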

Idea of the Shift-Or algorithm (continued)
• Step 1: For each c in ∑, let $S_c$ be a bit array of size m such that, for 1 ≤ i ≤ m, $S_c[i] = 0$ iff P[i] = c.
  Example: let ∑ = {a, b, c, d} be the alphabet and ababc the pattern. Then:
  $S_a[5..1] = (11010)_2$, $S_b[5..1] = (10101)_2$, $S_c[5..1] = (01111)_2$, $S_d[5..1] = (11111)_2$.
• The array $S_c$ marks the positions of the character c in the pattern P. Each $S_c$ can be precomputed before the search.
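The masks of Step 1 can be computed as integers, as in the following sketch. The function name `build_masks` is an assumption for illustration; bit i-1 of each integer (counting from the least significant bit) stores S_c[i], so printing the integer in binary gives exactly the S_c[m..1] notation of the example.

```python
def build_masks(pattern, alphabet):
    """Return a dict mapping each character c to the integer encoding S_c."""
    m = len(pattern)
    all_ones = (1 << m) - 1             # S_c for a character absent from P
    masks = {c: all_ones for c in alphabet}
    for i, c in enumerate(pattern):     # Python index i is position i+1 of P
        # Clear bit i: S_c[i+1] = 0 because P[i+1] = c.
        masks[c] = masks.get(c, all_ones) & ~(1 << i)
    return masks


if __name__ == "__main__":
    masks = build_masks("ababc", "abcd")
    for c in "abcd":
        print(c, format(masks[c], "05b"))
```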

Idea of the Shift-Or algorithm (continued)
• Step 2: The computation of $R_{j+1}$ reduces to two operations, shift and or:
  $R_{j+1} = \mathrm{SHIFT}(R_j) \;\mathrm{OR}\; S_{T[j+1]}$
• Assuming that the pattern length is no longer than the memory-word size of the machine, the space and time complexity of the preprocessing phase is O(m + σ).
• The time complexity of the searching phase is O(n), thus independent of the alphabet size and the pattern length.
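Putting the two steps together gives a complete search. This is a sketch under assumptions rather than a reference implementation: the name `shift_or_search` is hypothetical, R is stored in a Python integer with bit i-1 holding R[i], and R starts as all ones (no prefix matched yet). Python integers have arbitrary precision, so the word-size restriction mentioned above does not show up here; in C the pattern would have to fit in one machine word for each step to be a constant-time operation.

```python
def shift_or_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []                       # assumption: empty patterns are rejected
    all_ones = (1 << m) - 1

    # Step 1: masks S_c, with bit i cleared iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)

    # Step 2: R_{j+1} = SHIFT(R_j) OR S_{T[j+1]}.
    state = all_ones
    hits = []
    for j, c in enumerate(text, start=1):
        # The left shift brings in a 0 at bit 0, i.e. "the empty prefix matches".
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:   # R_j[m] = 0: complete match ends at j
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("abababcab", "ababc"))
```

Characters of the text that do not occur in the pattern simply use the all-ones mask, which resets every partial match.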
