
Fundamentals of Algorithms, Lecture 8: String Matching Algorithms

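Fundamentals of Algorithms, Lecture 8: String Matching Algorithms. Instructor: Prof. Gu Naijie (顾乃杰). Department: School of Computer Science and Technology. Term: 2016-2017 (Fall). Main topics: the Naive (Brute Force) algorithm, the Knuth-Morris-Pratt algorithm, the Shift-Or algorithm, the Boyer-Moore algorithm, the Boyer-Moore-Horspool algorithm, and the Karp-Rabin algorithm.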


Presentation Transcript


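  1. Fundamentals of Algorithms, Lecture 8: String Matching Algorithms • Instructor: Prof. Gu Naijie (顾乃杰) • Department: School of Computer Science and Technology • Term: 2016-2017 (Fall)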

  2. Main Topics • The Naive Algorithm (Brute Force) • The Knuth-Morris-Pratt Algorithm • The Shift-Or Algorithm • The Boyer-Moore Algorithm • The Boyer-Moore-Horspool Algorithm • The Karp-Rabin Algorithm • Conclusion. These lecture notes draw on the following String Searching Algorithm course materials, with thanks: • the lecture notes of Prof. 黃三益 (Huang San-Yi), National Sun Yat-sen University, Taiwan, China • Princeton University • Kevin Wayne • Theory of Algorithms • COS 42 Department of Computer Science & Technology

  3. 8.0 The String-Matching Problem • String-matching problem: • Find one occurrence of a pattern in a text; • Find all occurrences of a pattern in a text. • Applications require two kinds of solutions, depending on which string, the pattern or the text, is given first. • When the pattern is given first, algorithms based on automata or on combinatorial properties of strings are commonly used to preprocess the pattern. • When the text is given first, indexes realized by trees or automata are used.

  4. String Matching and Its Applications • Some applications: • Word processors. • Virus scanning. • Text information retrieval systems (Lexis, Nexis). • Digital libraries. • Natural language processing. • Specialized databases. • Computational molecular biology. • Web search engines. • Bioinformatics.

  5. String Matching Example (figure): search pattern "needle"; search text "nneenledeneeneedlenld"; the search is successful.

  6. Common Terminology and Definitions • Parameters. Let T denote the text and P the pattern. • n: the length of the text. • m: the length of the pattern. • Typically, n >> m (e.g., n = 1 million, m = 100). • σ: the size of the alphabet. • ∑: the alphabet. • Cn: the expected number of comparisons performed by an algorithm while searching for the pattern in a text of length n.

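  7. Overview of String Matching Algorithms • The string matching algorithms presented in current textbooks share the following basic principle: 1. Scan the text with a window whose size equals the pattern length; 2. First align the pattern with the left end of the text; 3. Compare the pattern with the corresponding characters of the text (one such pass is called an attempt); 4. After each complete match or each mismatch, shift the window to the right; 5. Repeat steps 3 and 4 until the right end of the window passes the right end of the text. • This approach is called the sliding window mechanism. • When the current window of the text is compared with the pattern, the characters can be examined from left to right, from right to left, or even in some specific order.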

  8. Overview of String Matching Algorithms (continued) • From left to right: • Karp and Rabin • Knuth, Morris and Pratt • From right to left: • Boyer and Moore • Horspool • In any order: • Brute Force Algorithm (Naive Algorithm)

  9. 8.1 The Brute Force Algorithm • Brute force: check for the pattern starting at every text position, trying to match each substring of length m in the text against the pattern. • Analysis of brute force: • Running time depends on the pattern and the text. • It can be slow when the strings repeat themselves. • Worst case: O(mn) comparisons. • Too slow when m and n are large.

  10. Brute Force Pseudocode 1
      Brute-Force-1(T, P)
        i = 0;
        while i ≤ n-m do
          j = 0;                                 // left-to-right scan of P
          while j < m and P[j+1] = T[i+j+1] do
            j = j+1;
          if j = m then
            Report_match_at_position(i+1);       // P matches T[i+1..i+m]
          i = i+1;
        return.

  11. Brute Force Pseudocode 2 (arrays indexed from 1)
      char text[], pat[];
      int n, m;
      {
        int i, j, k, lim;
        lim = n-m+1;
        for (i = 1; i <= lim; i++) {             /* search */
          k = i;
          for (j = 1; j <= m && text[k] == pat[j]; j++)
            k++;
          if (j > m)
            Report_match_at_position(i);         /* pat matches text[i..i+m-1] */
        }
      }
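The two brute force listings can be sketched in runnable form as follows. This is only an illustrative Python version: the function name `brute_force_search` is an assumption, the code uses 0-based string indexing internally, and it reports 1-based positions to stay consistent with the slides. The sample text and pattern are the ones from the string matching example on slide 5.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 1-based position at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):            # i is the 0-based shift of the window
        j = 0
        # Left-to-right scan of the pattern against the current window.
        while j < m and pattern[j] == text[i + j]:
            j += 1
        if j == m:                        # the whole pattern matched
            matches.append(i + 1)         # convert the shift to a 1-based position
    return matches


print(brute_force_search("nneenledeneeneedlenld", "needle"))  # prints [13]
```

The window only ever moves one position at a time, which is what makes the method simple but potentially slow.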

  12. Brute Force String Matching Example (figure): the search pattern "needle" is aligned with the search text "nneenledeneeneedlenld" and shifted right one position per attempt.

  13. Brute Force String Matching Example, continued (figure): the window keeps shifting right until it passes the right end of the text, at which point the search stops.

  14. Analysis of the Brute Force Algorithm • Example (figure): search pattern "aaaaab" against a search text consisting of a long run of a's followed by b; every attempt compares the whole pattern before failing or succeeding. • Analysis of brute force: • Running time depends on the pattern and the text. • It can be slow when the strings repeat themselves. • Worst case: about mn comparisons. • Too slow when m and n are large.
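To make the worst case concrete, the following sketch counts character comparisons for a text of the same shape as the example. The helper name `count_comparisons` and the choice of 20 a's are assumptions made for illustration only; they are not taken from the slide.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons made by brute force search."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):            # one attempt per window position
        for j in range(m):
            comparisons += 1
            if pattern[j] != text[i + j]:
                break                     # mismatch ends this attempt
    return comparisons


text = "a" * 20 + "b"                     # assumed length, for illustration
pattern = "aaaaab"
n, m = len(text), len(pattern)
print(count_comparisons(text, pattern), m * (n - m + 1))  # prints 96 96
```

Each of the n - m + 1 attempts scans all m pattern characters, so the count equals m(n - m + 1), which is on the order of mn when n >> m.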

  15. How to Save Comparisons • Example (figure): search pattern "aaaaab", search text a long run of a's followed by b. • How to avoid recomputation? • Pre-analyze the search pattern. • Ex: suppose that the first 5 characters of the pattern are all a's. • If T[1..5] matches P[1..5], then T[2..5] matches P[1..4], • so there is no need to check i = 2, j = 1, 2, 3, 4; • this saves 4 comparisons. • Better ideas are needed in general.

  16. Homework 32.1 • Page 579 : 32.1-2, 32.1-4.

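  17. 8.2 The KMP Algorithm • Idea: after a failed attempt, make full use of what the partial match has already revealed, so that redundant comparisons are avoided and matching is sped up. • During matching, the current pointer into the text only moves right (increases); it never backs up. • While testing the segment T[i+1..i+m], suppose we obtain a partial match: P[1..j] = T[i+1..i+j], but P[j+1] ≠ T[i+j+1]. • Let k be the largest integer with k < j such that P[1..k] = P[(j-k+1)..j] (that is, the length of the longest proper suffix of P[1..j] that is also a proper prefix of it). • Definition: Next[j+1] = max{ k+1 | P[1..k] is a suffix of P[1..j], j > k ≥ 0 }. • The values of Next depend only on the pattern, not on the text, so Next[1..m] can be computed before matching begins.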

  18. The Next Function in the KMP Algorithm
      Next(P[1..m])
        j ← 0;
        m ← Length(P);
        For i ← 1 to m do
          Next[i] ← j;
          While j > 0 and P[i] ≠ P[j] do
            j ← Next[j];
          j ← j + 1;
      Running time: O(m), because: 1. i - j ≥ 0, and i - j is monotonically non-decreasing; 2. i - j stays unchanged at most m times; 3. i - j ≤ m, so it increases at most m times.
      The book Introduction to Algorithms (《算法导论》) defines a similar function π[i]; the two functions are related by π[i] = Next[i+1] - 1.
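The Next computation can be sketched in Python as below. The function name `compute_next` is an assumption; the list has an unused slot at index 0 so that `nxt[i]` lines up with Next[i] on the slide, and the pattern "neenee" is the one used in the KMP example later in the lecture.

```python
def compute_next(pattern: str) -> list[int]:
    """Return a list nxt where nxt[i] equals Next[i] for i = 1..m (nxt[0] is unused)."""
    m = len(pattern)
    nxt = [0] * (m + 1)
    j = 0
    for i in range(1, m + 1):
        nxt[i] = j
        # pattern[i - 1] is P[i] and pattern[j - 1] is P[j] in 1-based notation.
        while j > 0 and pattern[i - 1] != pattern[j - 1]:
            j = nxt[j]                    # fall back to a shorter matching prefix
        j += 1
    return nxt


print(compute_next("neenee")[1:])  # prints [0, 1, 1, 1, 2, 3]
```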

  19. KMP Pseudocode
      KMP(T, P)
        j ← 1;
        For i ← 1 to n do
          while j > 0 and T[i] ≠ P[j] do
            j ← Next[j];
          if j = m then                          // a successful match is found
            return (i-m+1);
          j ← j + 1;
        return (none)
      Running time: O(n), because: 1. i - j ≥ 0, and i - j is monotonically non-decreasing; 2. i - j stays unchanged at most n times; 3. i - j ≤ n, so it increases at most n times.
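A minimal Python sketch of the search loop follows. The name `kmp_search` and the early return for an empty pattern are assumptions added for illustration; the Next table is rebuilt inline so that the block is self-contained. As in the pseudocode, only the first match is returned, as a 1-based position.

```python
def kmp_search(text: str, pattern: str):
    """Return the 1-based position of the first match, or None if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return None                       # assumption: an empty pattern is not searched

    # Preprocessing: nxt[i] equals Next[i] for i = 1..m (index 0 is unused).
    nxt = [0] * (m + 1)
    j = 0
    for i in range(1, m + 1):
        nxt[i] = j
        while j > 0 and pattern[i - 1] != pattern[j - 1]:
            j = nxt[j]
        j += 1

    # Search: i only moves forward; j falls back through nxt on a mismatch.
    j = 1
    for i in range(1, n + 1):
        while j > 0 and text[i - 1] != pattern[j - 1]:
            j = nxt[j]
        if j == m:                        # P[1..m] matches T[i-m+1..i]
            return i - m + 1
        j += 1
    return None


print(kmp_search("nneeneldeneeneedlenld", "neenee"))  # prints 10
```

Because the text index never decreases, the text can be consumed as a stream without buffering earlier characters.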

  22. KMP Example: computing Next for the search pattern "neenee" (search text: "nneeneldeneeneedlenld") • Running Next(P) gives, at the moment each Next[i] is assigned: • i = 1, j = 0, Next[1] = 0 • i = 2, j = 1, Next[2] = 1 • i = 3, j = 1, Next[3] = 1 • i = 4, j = 1, Next[4] = 1 • i = 5, j = 2, Next[5] = 2 • i = 6, j = 3, Next[6] = 3

  23. KMP Example: matching starts at the left end of the text T = "nneeneldeneeneedlenld", with pattern P = "neenee" and Next = (0, 1, 1, 1, 2, 3). The steps below follow KMP(T, P).

  24. KMP Example: j = 1, i = 1, P[1] = T[1], so j = j + 1, i = i + 1; compare P[j] with T[i].

  25. KMP Example: j = 2, i = 2, P[2] ≠ T[2], so j = Next[2] = 1; compare P[j] with T[i].

  26. KMP Example: j = 1, i = 2, P[1] = T[2], so j = j + 1, i = i + 1; compare P[2] with T[3].

  27. KMP Example: j = 2, i = 3, P[2] = T[3], so j = j + 1, i = i + 1; compare P[3] with T[4].

  28. KMP Example: j = 3, i = 4, P[3] = T[4], so j = j + 1, i = i + 1; compare P[4] with T[5].

  29. KMP Example: j = 4, i = 5, P[4] = T[5], so j = j + 1, i = i + 1; compare P[5] with T[6].

  30. KMP Example: j = 5, i = 6, P[5] = T[6], so j = j + 1, i = i + 1; compare P[6] with T[7].

  31. KMP Example: j = 6, i = 7, P[6] ≠ T[7], so j = Next[6] = 3; compare P[3] with T[7].

  32. KMP Example: j = 3, i = 7, P[3] ≠ T[7], so j = Next[3] = 1; compare P[1] with T[7].

  33. KMP Example: j = 1, i = 7, P[1] ≠ T[7], so j = Next[1] = 0, then j = 1 and i = 8; compare P[1] with T[8].

  34. KMP Example: j = 1, i = 8, P[1] ≠ T[8], so j = Next[1] = 0, then j = 1 and i = 9; compare P[1] with T[9].

  35. KMP Example: j = 1, i = 9, P[1] ≠ T[9], so j = Next[1] = 0, then j = 1 and i = 10; compare P[1] with T[10].

  36. KMP Example: j = 1, i = 10, P[1] = T[10], so j = 2, i = 11; compare P[2] with T[11].

  37. KMP Example: j = 2, i = 11, P[2] = T[11], so j = 3, i = 12; compare P[3] with T[12].

  38. KMP Example: j = 3, i = 12, P[3] = T[12], so j = 4, i = 13; compare P[4] with T[13].

  39. KMP Example: j = 4, i = 13, P[4] = T[13], so j = 5, i = 14; compare P[5] with T[14].

  40. KMP Example: j = 5, i = 14, P[5] = T[14], so j = 6, i = 15; compare P[6] with T[15].

  41. KMP Example: j = 6, i = 15, P[6] = T[15], and j = m, so a match is found; return i - m + 1 = 10. The algorithm terminates.

  42. Summary of KMP • Build an FSA (finite-state automaton) from the pattern. • Run the FSA on the text. • O(m + n) worst-case string search. • Good efficiency for patterns and texts with much repetition: • binary files • graphics formats. • Less useful for ordinary text strings. • On-line algorithm: • virus scanning • Internet spying.

  43. History of the KMP Algorithm • Inspired by a theorem of Cook that says an O(m + n) algorithm should be possible. • Discovered in 1976 independently by two groups: • Knuth and Pratt; • Morris, a hacker trying to build an editor, who faced the annoying problem that a buffer was needed when performing text search. • Resolved both theoretical and practical problems. • A surprise when it was discovered. • In hindsight, it seems like the right algorithm.

  44. Homework 32.4 • Page 593 : 32.4-1, 32.4-2, 32.4-5.

  45. 8.3 The Shift-Or Algorithm • Uses bitwise techniques; • efficient if the pattern length is no longer than the memory-word size of the machine; • preprocessing phase in O(m + σ) time and space complexity; • searching phase in O(n) time complexity (independent of the alphabet size σ and the pattern length); • adapts easily to approximate string matching.

  46. The Idea of the Shift-Or Algorithm • Let R be a bit array of size m. • Vector Rj is the value of the array R after text character T[j] has been processed (see the figure on the next slide). • It contains information about all matches of prefixes of P that end at position j in the text: for 1 ≤ i ≤ m, Rj[i] = 0 if P[1..i] = T[j-i+1..j], and Rj[i] = 1 otherwise. • Note: in some books the arrays in this algorithm are indexed from 0 rather than from 1!

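  47. The Idea of the Shift-Or Algorithm, continued (figure): the column vector Rj is drawn beside position j of the text T; row i corresponds to the prefix P[1..i], and the sample entries shown are Rj[1] = 1, Rj[2] = 0, Rj[3] = 1, Rj[m] = 0.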

  48. The Idea of the Shift-Or Algorithm (continued) • The vector Rj+1 can be computed from Rj as follows. • For each Rj[i] = 0: Rj+1[i+1] = 0 if P[i+1] = T[j+1], and 1 otherwise; and Rj+1[1] = 0 if P[1] = T[j+1], and 1 otherwise. • If Rj+1[m] = 0, a complete match can be reported. • The transition from Rj to Rj+1 can be computed in two steps.

  49. The Idea of the Shift-Or Algorithm (continued) • Step 1: For each c in ∑, let Sc be a bit array of size m such that, for 1 ≤ i ≤ m, Sc[i] = 0 iff P[i] = c. • Example: let ∑ = {a, b, c, d} be the alphabet and ababc the pattern. Then Sa[5..1] = (11010)2, Sb[5..1] = (10101)2, Sc[5..1] = (01111)2, Sd[5..1] = (11111)2. • The array Sc records the positions of the character c in the pattern P. Each Sc can be precomputed before the search.
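The tables Sc of Step 1 can be sketched as integer bit masks. In this illustrative Python version the function name `build_masks` is an assumption, and bit i-1 of each integer (counting from the least significant bit) stands for position i of the pattern, so printing the mask in binary gives the Sc[m..1] order used in the example.

```python
def build_masks(pattern: str, alphabet: str) -> dict[str, int]:
    """Return a mask per character: bit i-1 is 0 iff P[i] equals that character."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = {c: all_ones for c in alphabet}
    for i, ch in enumerate(pattern):      # i is 0-based, so it addresses bit i
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    return masks


masks = build_masks("ababc", "abcd")
for c in "abcd":
    print(c, format(masks[c], "05b"))
# prints:
# a 11010
# b 10101
# c 01111
# d 11111
```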

  50. The Idea of the Shift-Or Algorithm (continued) • Step 2: The computation of Rj+1 reduces to two operations, shift and or: Rj+1 = SHIFT(Rj) OR S_T[j+1], where S_T[j+1] is the table Sc for the character c = T[j+1]. • Assuming that the pattern length is no longer than the memory-word size of the machine, the space and time complexity of the preprocessing phase is O(m + σ). • The time complexity of the searching phase is O(n), and is thus independent of the alphabet size and the pattern length.
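Putting the two steps together, a minimal Shift-Or search might look like the sketch below. The name `shift_or_search` is an assumption, and Python integers are unbounded, so the word-size restriction mentioned above does not show up here; in a language with fixed-width words the pattern would have to fit in one word. Characters that do not occur in the pattern get the all-ones mask.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    """Return every 1-based position at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []                         # assumption: an empty pattern is not searched
    all_ones = (1 << m) - 1

    # Step 1: masks[c] has bit i-1 equal to 0 iff P[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    state = all_ones                      # R_0: no prefix has been matched yet
    hit_bit = 1 << (m - 1)                # bit that represents R[m]
    matches = []
    for j, ch in enumerate(text, start=1):
        # Step 2: shift (a 0 enters at bit 0, i.e. R[1]) and then or with the mask.
        state = ((state << 1) & all_ones) | masks.get(ch, all_ones)
        if (state & hit_bit) == 0:        # R_j[m] = 0 means a match ends at j
            matches.append(j - m + 1)
    return matches


print(shift_or_search("nneeneldeneeneedlenld", "neenee"))  # prints [10]
```

Each text character costs one shift, one or, and one test, regardless of m, which is where the O(n) search bound comes from.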
