Martin Kay, Stanford University
String Search 1
Naive Search (1)

    naive_search(Pattern, Text, 1) :-
        append(Pattern, _, Text).
    naive_search(Pattern, [_ | Text], N) :-
        naive_search(Pattern, Text, N0),
        N is N0+1.

    | ?- naive_search("is", "mississippi", N).
    N = 2 ? ;
    N = 5 ? ;
    no
    | ?-
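The same idea can be sketched outside Prolog. The following Python version is an illustration only, not part of the slides: the function name is reused for convenience, and it tests the pattern as a prefix of every suffix of the text, reporting 1-based positions as the query above does.

```python
def naive_search(pattern, text):
    """Yield each 1-based position where pattern is a prefix of the remaining text."""
    for start in range(len(text) + 1):
        # Same test as append(Pattern, _, Text) on the suffix starting at `start`.
        if text.startswith(pattern, start):
            yield start + 1


print(list(naive_search("is", "mississippi")))  # [2, 5]
```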
pref: A Prefix Predicate

    % Make an entry in the database every time the predicate is called.
    pref(P, T) :-
        assert(stat(T, P)),
        fail.
    pref([], _).
    pref([H | P], [H | T]) :-
        pref(P, T).
Search using pref

    naive_search1(Pattern, Text, 1) :-
        pref(Pattern, Text).
    naive_search1(Pattern, [_ | Text], N) :-
        naive_search1(Pattern, Text, N0),
        N is N0+1.

    | ?- naive_search1([i,s], [m,i,s,s,i,s,s,i,p,p,i], N).
    N = 2 ? ;
    N = 5 ? ;
    no
    | ?-
The Statistics

    | ?- listing(stat).
    stat([m,i,s,s,i,s,s,i,p,p,i], [i,s]).
    stat([i,s,s,i,s,s,i,p,p,i], [i,s]).
    stat([s,s,i,s,s,i,p,p,i], [s]).
    stat([s,i,s,s,i,p,p,i], []).
    stat([s,s,i,s,s,i,p,p,i], [i,s]).
    stat([s,i,s,s,i,p,p,i], [i,s]).
    stat([i,s,s,i,p,p,i], [i,s]).
    stat([s,s,i,p,p,i], [s]).
    stat([s,i,p,p,i], []).
    stat([s,s,i,p,p,i], [i,s]).
    stat([s,i,p,p,i], [i,s]).
    stat([i,p,p,i], [i,s]).
    stat([p,p,i], [s]).
    stat([p,p,i], [i,s]).
    stat([p,i], [i,s]).
    stat([i], [i,s]).
    stat([], [s]).
    stat([], [i,s]).

11 Alignments, 18 Entries
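The bookkeeping can be reproduced with a small Python sketch. It is an assumption-laden transliteration rather than the slide's code: a list named `stats` (hypothetical) stands in for the `stat/2` facts, and strings stand in for Prolog lists.

```python
def pref(pattern, text, stats):
    """Prefix test that records every call, like the assert(stat(T, P)) clause."""
    stats.append((text, pattern))
    if not pattern:
        return True
    if text and pattern[0] == text[0]:
        return pref(pattern[1:], text[1:], stats)
    return False


def naive_search1(pattern, text):
    """Return (1-based match positions, log of all pref calls)."""
    stats, positions = [], []
    for start in range(len(text) + 1):  # every suffix, including the empty one
        if pref(pattern, text[start:], stats):
            positions.append(start + 1)
    return positions, stats


positions, stats = naive_search1("is", "mississippi")
print(positions, len(stats))  # [2, 5] 18
```

The log has one entry per call, so its length is the number of prefix tests the naive search performs on this input.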
Observe:
If the pattern “mississippi” matched part of the way, we can move over all the characters matched, because none of them can be an “m”, which is what we need to start a new match.

    Text:     m i s s i o n a r y . . . .
    Pattern:  m i s s i s s i p p i
                        ^ mismatch

There is no “m” among the matched characters after the first, so move the pattern to the mismatch position, or maybe even beyond it.
Observe further:

    Text:     p e r p e n d i c u l a r . . .
    Pattern:  p e r p e t r a t e
                        ^ mismatch
    So try:         p e r p e t r a t e

The “p e” matched just before the mismatch is a prefix of the pattern, so try the alignment that starts the pattern there.
Observe yet further:

    Text:     p e r p e t u a l . . . . .
    Pattern:  p e r p e t r a t e
                          ^ mismatch
    Move to:              p e r p e t r a t e

No (shorter) prefix of the pattern ends just before the mismatch, so move the pattern to the mismatch position.
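The three observations amount to one rule: after some characters have matched, shift by the matched length minus the length of the longest proper prefix of the matched part that is also its suffix. The Python sketch below illustrates this rule by brute force; the function names are hypothetical and it is not the efficient method.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0


def shift_after_mismatch(pattern, matched):
    """Distance the pattern can move after `matched` characters matched."""
    return max(1, matched - longest_border(pattern[:matched]))


print(shift_after_mismatch("mississippi", 5))  # 5: no "m" in "issi"
print(shift_after_mismatch("perpetrate", 5))   # 3: "pe" is a prefix
print(shift_after_mismatch("perpetrate", 6))   # 6: no prefix ends at "perpet"
```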
Overlaps

Search for
    a b a c a b a d a b a c a b a
in the text
    a b a b a c a b a d a b a c a b a d a b a c a b a b a
(Figure: the pattern drawn at successive alignments under the text.)

Déjà vu

Search for
    a b a c a b a d a b a c a b a
in the text
    a b a b a c a b a d a b a c a b a d a b a c a b a b a
(Figure: the pattern drawn at successive alignments under the text.)
On-line search

We have seen this much of the text so far:
    c a c a
We are looking for the pattern cacao.
We have some number (0 or more) of searches in progress and are waiting for the next character to see which ones continue, and maybe to start a new one.
(Figure: the partial matches of cacao currently in progress against c a c a.)
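A minimal Python sketch of this on-line scheme follows. It is an illustration under assumptions: each search in progress is represented by a pointer into the pattern (the number of characters matched so far), the pattern is non-empty, and the function name and return shape are hypothetical.

```python
def online_search(pattern, text):
    """Return (0-based match starts, pointer sets before each character, final set)."""
    active = [0]                      # a fresh search is always ready to start
    matches, trace = [], []
    for i, ch in enumerate(text):
        trace.append(list(active))
        survivors = [0]               # a new '0' is introduced on the left
        for p in active:
            if pattern[p] == ch:      # this search continues over ch
                if p + 1 == len(pattern):
                    matches.append(i + 1 - len(pattern))
                else:
                    survivors.append(p + 1)
        active = survivors
    return matches, trace, active


print(online_search("cacao", "caca")[2])  # [0, 2, 4]
```

After reading "caca", three searches are in progress: one that has matched four characters, one that has matched two, and the fresh one.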
Search for
    a b a c a b a d a b a c a b a
in the text
    a b a b a c a b a d a b a c a b a d a b a c a b a b a

     i  char  pointers
     0  a     [0]
     1  b     [0, 1]
     2  a     [0, 2]
     3  b     [0, 1, 3]
     4  a     [0, 2]
     5  c     [0, 1, 3]
     6  a     [0, 4]
     7  b     [0, 1, 5]
     8  a     [0, 2, 6]
     9  d     [0, 1, 3, 7]
    10  a     [0, 8]
    11  b     [0, 1, 9]
    12  a     [0, 2, 10]
    13  c     [0, 1, 3, 11]
    14  a     [0, 4, 12]
    15  b     [0, 1, 5, 13]
    16  a     [0, 2, 6, 14]    result 2
    17  d     [0, 1, 3, 7]
    18  a     [0, 8]
    19  b     [0, 1, 9]
    20  a     [0, 2, 10]
    21  c     [0, 1, 3, 11]
    22  a     [0, 4, 12]
    23  b     [0, 1, 5, 13]
    24  a     [0, 2, 6, 14]    result 10
    25  b     [0, 1, 3, 7]
    26  a     [0, 2]

• The rightmost pointer always moves.
• Other pointers move if they can do so over the same character.
• A new ‘0’ is introduced on the left.
• A pointer in a given position always has pointers in the same set of positions to its left.
These are properties of the pattern only. Therefore they can be cached or precompiled.

If this (the rightmost pointer) matches ... then so will these (the pointers to its left).
So try these only if this fails!
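That the pointers to the left of a given pointer depend on the pattern alone can be checked directly. The Python sketch below is a brute-force illustration with a hypothetical function name: a pointer at q accompanies a pointer at p exactly when the first q characters of the pattern are a suffix of its first p characters.

```python
def pointers_left_of(pattern, p):
    """Positions q < p such that pattern[:q] is a suffix of pattern[:p]."""
    return [q for q in range(p) if pattern[:p].endswith(pattern[:q])]


pattern = "abacabadabacaba"
print(pointers_left_of(pattern, 14))  # [0, 2, 6]
print(pointers_left_of(pattern, 7))   # [0, 1, 3]
```

These agree with the table rows whose rightmost pointers are 14 and 7, and no text was consulted to compute them.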
The failure function

    char  pointers
    a     [0]
    b     [0, 1]
    a     [0, 2]
    b     [0, 1, 3]
    a     [0, 2]
    c     [0, 1, 3]
    a     [0, 4]
    b     [0, 1, 5]
    a     [0, 2, 6]
    d     [0, 1, 3, 7]
    a     [0, 8]
    b     [0, 1, 9]
    a     [0, 2, 10]
    c     [0, 1, 3, 11]
    a     [0, 4, 12]

    position:  0  1  2  3  4  5  6  7  8  9 10 11 12 ...
    pattern:   a  b  a  c  a  b  a  d  a  b  a  c  a ...
    failure:      0  0  1  0  1  2  3  0  1  2  3  4 ...
The Failure Function

    -1  0  0  0  1  2  3  4  5
     a  b  c  a  b  c  a  b  c

(Figure: the pattern a b c a b c a b c repeated at successive shifted alignments.)
The Failure Function

    -1  0  0  1  0  1  2  3  0  1  2  3  4  5  6
     a  b  a  c  a  b  a  d  a  b  a  c  a  b  a

(Figure: the pattern a b a c a b a d a b a c a b a repeated at successive shifted alignments.)
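Both tables can be reproduced straight from the definition. The Python sketch below is a deliberately slow illustration, assuming that the entry at position i is the length of the longest proper prefix of the first i pattern characters that is also their suffix, with -1 at position 0; the function name is hypothetical.

```python
def failure_table(pattern):
    """table[i] = longest proper prefix of pattern[:i] that is also its suffix."""
    table = [-1]                      # nothing to fall back to at position 0
    for i in range(1, len(pattern)):
        prefix = pattern[:i]
        table.append(max(k for k in range(i) if prefix.endswith(prefix[:k])))
    return table


print(failure_table("abcabcabc"))
# [-1, 0, 0, 0, 1, 2, 3, 4, 5]
print(failure_table("abacabadabacaba"))
# [-1, 0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6]
```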
Substring, Prefix, Suffix
• Part of a string S (even if it covers the whole of S) is a substring of S.
• If it includes the first (last) character of S, it is a prefix (suffix) of S.
• If it does not cover the whole of S, it is a proper substring (prefix, suffix) of S.
Example: S = ababac
Some substrings: ababac, ab, b, bab, ac, ε; only ababac is not proper
Some prefixes: ababac, a, aba, ε; only ababac is not proper
Some suffixes: ababac, abac, c, ε; only ababac is not proper
ε is the empty string
Borders
• If B is a proper prefix and a proper suffix of a string S, it is a border of S.
• Note: ε is a border of every non-empty string.
Examples:
abcabcabc has borders abc, abcabc, ε
abacabadabacaba has borders abacaba, aba, a, ε
Borders

    -1  0  0  0  1  2  3  4  5
     a  b  c  a  b  c  a  b  c

(Figure: the pattern a b c a b c a b c repeated at successive shifted alignments.)
border in Prolog

    border(Pattern, Border) :-
        append([_ | _], Border, Pattern),   % Border is a proper suffix
        append(Border, _, Pattern).         % Border is also a prefix
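For comparison, the same definition can be written as a short Python sketch. The function name is an assumption, and the empty string is rendered as `''`.

```python
def borders(s):
    """All borders of s, longest first: proper prefixes that are also suffixes."""
    return [s[:k] for k in range(len(s) - 1, -1, -1) if s.endswith(s[:k])]


print(borders("abcabcabc"))        # ['abcabc', 'abc', '']
print(borders("abacabadabacaba"))  # ['abacaba', 'aba', 'a', '']
```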
Borders in Linear Time

Borders at position i+1 extend borders at position i.

     a  b  a  c  a  b  a  d  a  b
    -1  0  0  1  0  1  2  3  0  1

    % Base case: there is no border to fall back to at position 0.
    border(0, _, -1).
    border(I, Pattern, Q) :-
        I > 0,
        J is I-1,
        border(J, Pattern, P),
        nth0(J, Pattern, C),
        extend(C, P, Pattern, Q).

    extend(_, -1, _, 0).
    extend(C, P, Pattern, Q) :-
        nth0(P, Pattern, C), !,
        Q is P+1.
    extend(C, P0, Pattern, R) :-
        border(P0, Pattern, Q),
        extend(C, Q, Pattern, R).
Building A Table

Here border/3 still recomputes shorter borders recursively; make_table/1 records the results.

    border(0, _, -1).
    border(I, Pattern, Q) :-
        I > 0,
        J is I-1,
        border(J, Pattern, P),
        nth0(J, Pattern, C),
        extend(C, P, Pattern, Q).

    extend(_, -1, _, 0).
    extend(C, P, Pattern, Q) :-
        nth0(P, Pattern, C), !,
        Q is P+1.
    extend(C, P0, Pattern, R) :-
        border(P0, Pattern, Q),
        extend(C, Q, Pattern, R).

    make_table(Pattern) :-
        retractall(border_table(_, _)),
        assert(border_table(0, -1)),    % -1, as in the failure function
        assert(border_table(1, 0)),
        length(Pattern, PL),
        make_table(Pattern, 2, PL).

    make_table(_, I, N) :-
        I>N, !.
    make_table(Pattern, I, N) :-
        border(I, Pattern, K),
        assert(border_table(I, K)),
        J is I+1,
        make_table(Pattern, J, N).
Building A Table

Now border/3 and extend/4 look up shorter borders in border_table/2 instead of recomputing them.

    border(I, Pattern, Q) :-
        J is I-1,
        border_table(J, P),
        nth0(J, Pattern, C),
        extend(C, P, Pattern, Q).

    extend(_, -1, _, 0).
    extend(C, P, Pattern, Q) :-
        nth0(P, Pattern, C), !,
        Q is P+1.
    extend(C, P0, Pattern, R) :-
        border_table(P0, Q),
        extend(C, Q, Pattern, R).

    make_table(Pattern) :-
        retractall(border_table(_, _)),
        % Entry 0 must be -1; with 0, extend/4 would look up entry 0 forever.
        assert(border_table(0, -1)),
        assert(border_table(1, 0)),
        length(Pattern, PL),
        make_table(Pattern, 2, PL).

    make_table(_, I, N) :-
        I>N, !.
    make_table(Pattern, I, N) :-
        border(I, Pattern, K),
        assert(border_table(I, K)),
        J is I+1,
        make_table(Pattern, J, N).
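The table construction can also be sketched in Python. This is an illustrative transliteration, not the slide's code: a list replaces the `border_table/2` facts, entry 0 is taken to be -1 as in the failure-function slides, and the `while` loop plays the role of `extend/4`.

```python
def border_table(pattern):
    """table[i] = longest border length of pattern[:i], for i = 0..len(pattern)."""
    table = [-1] * (len(pattern) + 1)
    for i in range(1, len(pattern) + 1):
        c = pattern[i - 1]            # character that extends the shorter prefix
        p = table[i - 1]              # longest border of the shorter prefix
        while p >= 0 and pattern[p] != c:
            p = table[p]              # fall back to the next shorter border
        table[i] = p + 1              # p == -1 gives 0, like extend(_, -1, _, 0)
    return table


print(border_table("abacabadabacaba"))
# [-1, 0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7]
```

Each fallback shortens the current border, and the border grows by at most one per position, so the total work is linear in the pattern length.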
Searching

    % common_prefix/3 and advance/3 are not defined on the slides.
    search(Pattern, Text, N) :-
        % Build the table
        make_table(Pattern),
        % Entry 0 must be -1, so that a mismatch on the first character
        % (CPL = 0) advances the text by 0-(-1) = 1 rather than by 0.
        retract(border_table(0, _)),
        assert(border_table(0, -1)),
        length(Pattern, PL),
        % Do the search
        search(Pattern, PL, Text, N).

    search(Pattern, PL, Text, N) :-
        common_prefix(Pattern, Text, CPL),
        search(CPL, Pattern, PL, Text, N).

    search(CPL, _, CPL, _, 0).
    search(CPL, Pattern, PL, Text0, N) :-
        border_table(CPL, BL),
        M is CPL-BL,
        advance(Text0, M, Text),
        search(Pattern, PL, Text, N0),
        N is N0+M.
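The whole search can be sketched in Python as follows. This is an illustration under assumptions: `common_prefix_len` is a hypothetical stand-in for `common_prefix/3`, the shift `cpl - table[cpl]` corresponds to `M is CPL-BL`, and entry 0 of the table is taken to be -1 so that the shift is never zero.

```python
def border_table(pattern):
    """table[i] = longest border length of pattern[:i]; table[0] = -1."""
    table = [-1] * (len(pattern) + 1)
    for i in range(1, len(pattern) + 1):
        p = table[i - 1]
        while p >= 0 and pattern[p] != pattern[i - 1]:
            p = table[p]
        table[i] = p + 1
    return table


def common_prefix_len(pattern, text, start):
    """Number of leading pattern characters that match text from `start`."""
    n = 0
    while (n < len(pattern) and start + n < len(text)
           and pattern[n] == text[start + n]):
        n += 1
    return n


def search(pattern, text):
    """Return the 0-based offsets of all occurrences of pattern in text."""
    table = border_table(pattern)
    offsets, start = [], 0
    while start + len(pattern) <= len(text):
        cpl = common_prefix_len(pattern, text, start)
        if cpl == len(pattern):
            offsets.append(start)
        start += cpl - table[cpl]     # always at least 1, since table[i] < i
    return offsets


print(search("abacabadabacaba", "ababacabadabacabadabacababa"))  # [2, 10]
```

Like the Prolog version, this sketch recomputes the common prefix from the start of the pattern after each shift; resuming the comparison after the border would avoid re-reading those characters.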
Reference

Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, June 1977.