
String Algorithms

String Algorithms. David Kauchak, cs302, Spring 2012. Strings: let Σ be an alphabet, e.g. Σ = { , a, b, c, …, z}. A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ. ‘this is a string’ ∈ Σ*, ‘this is also a string’ ∈ Σ*, ‘1234’ ∉ Σ*. String operations.



  1. String Algorithms David Kauchak cs302 Spring 2012

  2. Strings • Let Σ be an alphabet, e.g. Σ = { , a, b, c, …, z} • A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ • ‘this is a string’ ∈ Σ* • ‘this is also a string’ ∈ Σ* • ‘1234’ ∉ Σ*
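
As an illustration of the definition above, here is a minimal Python sketch of a membership test for Σ*. The names `SIGMA` and `in_sigma_star` are hypothetical, and the alphabet is assumed to be the space character plus the lowercase letters a to z, as in the slide's example.

```python
import string

# Assumed alphabet from the slide's example: space plus lowercase a-z.
SIGMA = set(" " + string.ascii_lowercase)


def in_sigma_star(s, alphabet=SIGMA):
    """Return True if every symbol of s is drawn from the alphabet.

    The empty string is a member of Sigma* (a sequence of 0 symbols).
    """
    return all(ch in alphabet for ch in s)


print(in_sigma_star("this is a string"))  # all symbols are in SIGMA
print(in_sigma_star("1234"))              # digits are not in SIGMA
```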

  3. String operations • Given strings s1 of length n and s2 of length m • Equality: is s1 = s2? (case sensitive or insensitive) • Running time • O(min(n, m)), i.e. linear in the length of the shorter string • ‘this is a string’ = ‘this is a string’ • ‘this is a string’ ≠ ‘this is another string’ • ‘this is a string’ =? ‘THIS IS A STRING’

  4. String operations • Concatenate (append): create string s1s2 • Running time (assuming we generate a new string) • Θ(n+m) • ‘this is a ’ . ‘string’ → ‘this is a string’

  5. String operations • Substitute: Exchange all occurrences of a particular character with another character • Running time • Θ(n) • Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’ • Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

  6. String operations • Length: return the number of characters/symbols in the string • Running time • O(1) or Θ(n) depending on implementation • Length(‘this is a string’) → 16 • Length(‘this is another string’) → 22

  7. String operations • Prefix: Get the first j characters in the string • Running time • Θ(j) • Suffix: Get the last j characters in the string • Running time • Θ(j) • Prefix(‘this is a string’, 4) → ‘this’ • Suffix(‘this is a string’, 6) → ‘string’

  8. String operations • Substring – Get the characters between i and j inclusive • Running time • Θ(j – i + 1) • Prefix: Prefix(S, i) = Substring(S, 1, i) • Suffix: Suffix(S, i) = Substring(S, n – i + 1, n), where n = Length(S) • Substring(‘this is a string’, 4, 8) → ‘s is ’
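
The operations on the preceding slides can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the code assumes the slides' convention of 1-based, inclusive positions, which has to be translated to Python's 0-based slicing.

```python
def prefix(s, j):
    """First j characters of s."""
    return s[:j]


def suffix(s, j):
    """Last j characters of s (len(s) - j avoids the s[-0:] pitfall)."""
    return s[len(s) - j:]


def substring(s, i, j):
    """Characters i..j inclusive, using 1-based positions as on the slides."""
    return s[i - 1:j]


def substitute(s, old, new):
    """Replace every occurrence of the character old with new."""
    return s.replace(old, new)


text = "this is a string"
print(prefix(text, 4))             # this
print(suffix(text, 6))             # string
print(repr(substring(text, 4, 8)))  # 's is '
print(substitute("banana", "a", "o"))  # bonono
```

Under these conventions, `prefix(s, i)` equals `substring(s, 1, i)` and `suffix(s, i)` equals `substring(s, len(s) - i + 1, len(s))`.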

  9. Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Insertion: ABACED → ABACCED (insert ‘C’) → DABACCED (insert ‘D’)

  10. Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 Deletion: ABACED

  11. Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Deletion: ABACED → BACED (delete ‘A’)

  12. Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Deletion: ABACED → BACED (delete ‘A’) → BACE (delete ‘D’)

  13. Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2 • Substitution: ABACED → ABADED (sub ‘D’ for ‘C’) → ABADES (sub ‘S’ for ‘D’)

  14. Edit distance examples Edit(Kitten, Mitten) = 1 • Operations: • Sub ‘M’ for ‘K’ → Mitten

  15. Edit distance examples Edit(Happy, Hilly) = 3 • Operations: • Sub ‘i’ for ‘a’ → Hippy • Sub ‘l’ for ‘p’ → Hilpy • Sub ‘l’ for ‘p’ → Hilly

  16. Edit distance examples Edit(Banana, Car) = 5 • Operations: • Delete ‘B’ → anana • Delete ‘a’ → nana • Delete ‘n’ → naa • Sub ‘C’ for ‘n’ → Caa • Sub ‘r’ for ‘a’ → Car

  17. Edit distance examples Edit(Simple, Apple) = 3 • Operations: • Delete ‘S’ → imple • Sub ‘A’ for ‘i’ → Ample • Sub ‘p’ for ‘m’ → Apple

  18. Edit distance Why might this be useful?

  19. Is edit distance symmetric? • that is, is Edit(s1, s2) = Edit(s2, s1)? • Why? • sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’ • delete ‘i’ → insert ‘i’ • insert ‘i’ → delete ‘i’ • Edit(Simple, Apple) =? Edit(Apple, Simple)

  20. Calculating edit distance X = A B C B D A B Y = B D C A B A Ideas?

  21. Calculating edit distance X = A B C B D A ? Y = B D C A B ? After all of the operations, X needs to equal Y

  22. Calculating edit distance X = A B C B D A ? Y = B D C A B ? Operations: Insert Delete Substitute

  23. Insert X = A B C B D A ? Y = B D C A B ?

  24. Insert X = A B C B D A ? Y = B D C A B ? • Edit(X, Y) = 1 + Edit(X[1…n], Y[1…m−1])

  25. Delete X = A B C B D A ? Y = B D C A B ?

  26. Delete X = A B C B D A ? Y = B D C A B ? • Edit(X, Y) = 1 + Edit(X[1…n−1], Y[1…m])

  27. Substitution X = A B C B D A ? Y = B D C A B ?

  28. Substitution X = A B C B D A ? Y = B D C A B ? • Edit(X, Y) = 1 + Edit(X[1…n−1], Y[1…m−1])

  29. Anything else? X = A B C B D A ? Y = B D C A B ?

  30. Equal X = A B C B D A ? Y = B D C A B ?

  31. Equal X = A B C B D A ? Y = B D C A B ? • Edit(X, Y) = Edit(X[1…n−1], Y[1…m−1])

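  32. Combining results • Insert: Edit(X, Y) = 1 + Edit(X[1…n], Y[1…m−1]) • Delete: Edit(X, Y) = 1 + Edit(X[1…n−1], Y[1…m]) • Substitute: Edit(X, Y) = 1 + Edit(X[1…n−1], Y[1…m−1]) • Equal: Edit(X, Y) = Edit(X[1…n−1], Y[1…m−1])

  33. Combining results • Edit(X, Y) = min of: • 1 + Edit(X[1…n], Y[1…m−1]) (insert) • 1 + Edit(X[1…n−1], Y[1…m]) (delete) • Diff(x_n, y_m) + Edit(X[1…n−1], Y[1…m−1]) (substitute or equal), where Diff(x_n, y_m) = 0 if x_n = y_m and 1 otherwise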
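
A direct way to see how the four cases (insert, delete, substitute, equal) combine is to write them as a recursive function. The sketch below follows the standard Levenshtein recurrence and is not taken from the slides; the function name is hypothetical, and memoization via `functools.lru_cache` is an added assumption so that each pair of prefix lengths is solved only once.

```python
from functools import lru_cache


def edit_distance_recursive(x, y):
    """Edit distance between x and y via the recurrence on prefixes."""

    @lru_cache(maxsize=None)
    def edit(i, j):
        # edit(i, j) is the distance between x[:i] and y[:j].
        if i == 0:
            return j  # insert the j remaining characters of y
        if j == 0:
            return i  # delete the i remaining characters of x
        if x[i - 1] == y[j - 1]:
            return edit(i - 1, j - 1)  # equal: no cost for the last character
        return 1 + min(
            edit(i, j - 1),      # insert y[j-1] at the end of x[:i]
            edit(i - 1, j),      # delete x[i-1]
            edit(i - 1, j - 1),  # substitute y[j-1] for x[i-1]
        )

    return edit(len(x), len(y))


print(edit_distance_recursive("Kitten", "Mitten"))  # 1
print(edit_distance_recursive("Simple", "Apple"))   # 3
```

The recursion depth grows with the string lengths, so this form is only suitable for short inputs.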

  34. Running time Θ(nm)
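
The Θ(nm) bound comes from filling in one table entry per pair of prefix lengths, each in constant time. A minimal bottom-up sketch in Python, assuming unit cost for every operation (the function name `edit_distance` is hypothetical):

```python
def edit_distance(x, y):
    """Bottom-up edit distance; fills an (n+1) x (m+1) table."""
    n, m = len(x), len(y)
    # d[i][j] holds the edit distance between x[:i] and y[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of x[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # equal
            else:
                d[i][j] = 1 + min(
                    d[i][j - 1],      # insert
                    d[i - 1][j],      # delete
                    d[i - 1][j - 1],  # substitute
                )
    return d[n][m]


# Symmetry check from the earlier slide.
print(edit_distance("Simple", "Apple"), edit_distance("Apple", "Simple"))  # 3 3
```

The two nested loops visit each of the n·m interior cells exactly once, which matches the stated running time.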

  35. Variants • Only include insertions and deletions • What does this do to substitutions? • Include swaps, i.e. swapping two adjacent characters counts as one edit • Weight insertion, deletion and substitution differently • Weight specific character insertion, deletion and substitutions differently • Length normalize the edit distance
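
Several of these variants can be explored by giving each operation its own cost. The following is a sketch under assumptions: the function name and the parameters `ins_cost`, `del_cost` and `sub_cost` are hypothetical, and costs are assumed to be non-negative. Setting `sub_cost=2` models the insertion/deletion-only variant, since a substitution then costs the same as a delete followed by an insert.

```python
def weighted_edit_distance(x, y, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance with separate (non-negative) costs per operation."""
    n, m = len(x), len(y)
    # d[i][j] is the cheapest way to turn x[:i] into y[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            d[i][j] = min(
                d[i][j - 1] + ins_cost,                      # insert
                d[i - 1][j] + del_cost,                      # delete
                d[i - 1][j - 1] + (0 if same else sub_cost),  # substitute/equal
            )
    return d[n][m]


print(weighted_edit_distance("Kitten", "Mitten"))              # 1
print(weighted_edit_distance("Kitten", "Mitten", sub_cost=2))  # 2
```

With insertions and deletions only, every substitution in the unit-cost solution turns into two edits, which is why Edit(Kitten, Mitten) rises from 1 to 2.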

  36. String matching Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA

  37. String matching Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA

  38. Uses • grep/egrep • search • find • java.lang.String.contains()

  39. Naive implementation
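
The slide's pseudocode is not preserved in this transcript. One plausible naive matcher, sketched in Python under that caveat, tries every starting position in S and compares the length-m window against P. The function name is hypothetical and positions are reported 0-based.

```python
def naive_match(p, s):
    """Return all 0-based positions where pattern p occurs in string s."""
    m, n = len(p), len(s)
    # There are n - m + 1 candidate starting positions (none if m > n).
    return [i for i in range(n - m + 1) if s[i:i + m] == p]


print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```

Note that overlapping occurrences are reported: the matches at positions 7 and 9 share the character at index 9.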

  40. Is it correct?

  41. Running time? • What is the cost of the equality check? • Best case: O(1) • Worst case: O(m)

  42. Running time? • Best case • Θ(n) – when the first character of the pattern does not occur in the string • Worst case • O((n-m+1)m)
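
To make the best and worst cases concrete, the sketch below counts character comparisons in a naive matcher whose equality check stops at the first mismatch. The function name and the counting scheme are assumptions made for illustration, not something the slides specify.

```python
def naive_match_count(p, s):
    """Naive matching that also counts character comparisons."""
    m, n = len(p), len(s)
    matches, comparisons = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if s[i + j] != p[j]:
                break  # equality check stops at the first mismatch
            j += 1
        if j == m:
            matches.append(i)
    return matches, comparisons


# Worst case: every window matches, so each check costs m comparisons.
print(naive_match_count("AAAA", "A" * 13)[1])  # 40 = (13 - 4 + 1) * 4
# Best case: the first pattern character never occurs in the string.
print(naive_match_count("BAAA", "A" * 13)[1])  # 10 = 13 - 4 + 1
```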

  43. Worst case P = AAAA S = AAAAAAAAAAAAA

  44. Worst case P = AAAA S = AAAAAAAAAAAAA

  45. Worst case P = AAAA S = AAAAAAAAAAAAA

  46. Worst case P = AAAA S = AAAAAAAAAAAAA repeated work!

  47. Worst case P = AAAA S = AAAAAAAAAAAAA Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’

  48. Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB

  49. Patterns Which of these patterns will have that problem? P = ABAB If the pattern has a suffix that is also a prefix then we will have this problem P = ABDC P = BAA P = ABBCDDCAABB
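
The condition on this slide can be checked mechanically: look for a proper prefix of the pattern that is also a suffix. The helper below is a hypothetical illustration using a simple quadratic scan rather than an optimized method.

```python
def longest_border(p):
    """Longest proper prefix of p that is also a suffix ('' if none)."""
    # Try candidate lengths from longest to shortest, excluding p itself.
    for k in range(len(p) - 1, 0, -1):
        if p[:k] == p[-k:]:
            return p[:k]
    return ""


for pattern in ["ABAB", "ABDC", "BAA", "ABBCDDCAABB", "AAAA"]:
    print(pattern, repr(longest_border(pattern)))
```

For the four patterns on the slide, this reports a shared prefix/suffix for ABAB (‘AB’) and ABBCDDCAABB (‘ABB’), and none for ABDC and BAA.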

  50. Finite State Automata (FSA) • An FSA is defined by 5 components • Q is the set of states … [Figure: states q0, q1, q2, …, qn]
