Speeding up two string-matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W. Algorithmica, Vol. 12, 1994, pp. 247-267. Advisor: Prof. R. C. T. Lee. Speaker: Kuei-hao Chen.
Problem Definition • Input : A text T and a pattern P. • Output : Find all occurrences of P in T
Rule 1: The Suffix to Prefix Rule • For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.
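Basic Ideas Open a window W of size |P| in the text. [Figure: window W of length |P| over the text T, with the pattern P below it] • Find the longest suffix of W that is also a prefix of the pattern. Case 1: The suffix is the whole window W. Match!
Case 2: The suffix is shorter than W. We shift W so that this suffix is aligned with the prefix of P that it matches. [Figure: W before and after the shift] Case 3: If there is no such suffix, we move W by the length |P|. [Figure: W shifted by |P|]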
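The three cases can be sketched directly from the definition. This is a minimal illustration, not the RF algorithm itself: it recomputes the longest suffix naively for every window, and the names `longest_proper_suffix_prefix` and `find_all` are made up for this sketch. Only proper suffixes are used for the shift so that the window always moves forward after a match.

```python
def longest_proper_suffix_prefix(window, pattern):
    """Length of the longest proper suffix of `window` that is a prefix of `pattern`."""
    for k in range(len(window) - 1, 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def find_all(text, pattern):
    """Report every start position of `pattern` in `text` using the window shifts."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    hits, pos = [], 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        if window == pattern:              # Case 1: the whole window matches
            hits.append(pos)
        y = longest_proper_suffix_prefix(window, pattern)
        pos += m - y                       # Case 2 when y > 0, Case 3 when y == 0
    return hits


print(find_all("aaaa", "aa"))  # overlapping occurrences are reported as well
```

No occurrence is skipped: an occurrence starting s positions to the right of the window (0 < s < |P|) makes the last |P| - s characters of the window a prefix of P, and the shift never exceeds the smallest such s.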
Preprocessing phase • T=GCATCGGCGAGAGTATACAGTACG • P=GCAGAGAG • L(S): the set containing all prefixes of the pattern. We construct the suffix automaton of P. [Figure: suffix automaton with states 0 to 8]
Preprocessing: Construct a Suffix Tree. Let $P^R$ denote the reversal of P. [Figure: suffix tree for $P^R$]
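Example 1 [Figure: window W, pattern P, and the suffix tree for the reversed pattern $P^R$] We want to find the longest suffix of W which is equal to a prefix of P. We find that ACG (a prefix of the reversed window $W^R$, that is, a suffix of W read from right to left) is a suffix of $P^R$ (a prefix of P read from right to left). Thus ACG is the longest suffix of W which is equal to a prefix of P.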
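Example 2 [Figure: window W, pattern P, and the suffix tree for the reversed pattern $P^R$] We find that GAC is the longest prefix of the reversed window $W^R$ (thus the longest suffix of W) which is equal to a substring of $P^R$. But GAC is not a suffix of $P^R$, and GACA is not a suffix of $P^R$ either.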
Luckily, a prefix of GACG, namely G, is also a suffix of the reversed pattern $P^R$. G can be found by finding the lowest common ancestor of G and GACG. Thus G is the longest prefix of the reversed window $W^R$ (suffix of W) which is equal to a suffix of $P^R$ (prefix of P).
Let X be the longest prefix of the reversed window $W^R$ (suffix of W) which is equal to a substring of the reversed pattern $P^R$, but not a suffix of $P^R$. Let Y be the longest prefix of X (a suffix of W) which is equal to a suffix of $P^R$ (prefix of P). Then Y is the longest suffix of W equal to a prefix of P.
Z is a suffix of the reversed pattern $P^R$ which can be found in the suffix tree of $P^R$. Y may not exist. If it exists, it must be in the suffix tree of $P^R$ and must have been found before X is found, because Y is a prefix of X.
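The relation between X and Y can be illustrated with a small sketch that scans one window from right to left. It is only an approximation of the slides' method: a plain set of all substrings of P stands in for the suffix tree, X is taken to be the longest suffix of W that is a substring of P, and the function name `scan_window` is hypothetical.

```python
def scan_window(window, pattern):
    """Return (X, Y) for one window.

    X: longest suffix of `window` that is a substring of `pattern`.
    Y: longest suffix of `window` that is a prefix of `pattern`
       (always a suffix of X, so it is met before the scan reaches X).
    """
    m = len(pattern)
    # Stand-in for the suffix tree: every non-empty substring of the pattern.
    factors = {pattern[i:j] for i in range(m) for j in range(i + 1, m + 1)}
    x_len = 0
    y_len = 0
    while x_len < len(window):
        candidate = window[len(window) - x_len - 1:]
        if candidate not in factors:
            break                      # the scan stops; X has been found
        x_len += 1
        if pattern.startswith(candidate):
            y_len = x_len              # remember the latest prefix seen so far
    return window[len(window) - x_len:], window[len(window) - y_len:]


print(scan_window("xxbca", "abcab"))   # X is "bca", Y is "a"
```

When no suffix of the window is a prefix of the pattern, Y comes back as the empty string, which corresponds to the case where Y does not exist.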
Preprocessing phase: the worst-case time complexity is O(m). • Searching phase: the worst-case time complexity is O(mn). • In the average case, however, the search takes $O(n \log_r m / m)$ time, where r is the size of the alphabet, as shown in this paper.
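The gap between the worst case and a favourable case can be seen by counting inspected text characters. The sketch below is an assumption-laden stand-in: it uses a set of all substrings of P instead of a suffix automaton (so its preprocessing is not O(m)), and `rf_search` is a name chosen for this example.

```python
def rf_search(text, pattern):
    """Return (match positions, number of text characters inspected)."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # Stand-in for the suffix automaton: every non-empty substring of the pattern.
    factors = {pattern[i:j] for i in range(m) for j in range(i + 1, m + 1)}
    hits, inspected, pos = [], 0, 0
    while pos + m <= n:
        end = pos + m
        k = 0          # length of the suffix of the window scanned so far
        y = 0          # longest proper suffix of the window that is a prefix of P
        while k < m:
            inspected += 1             # one more text character is read
            if text[end - k - 1:end] not in factors:
                break
            k += 1
            if pattern.startswith(text[end - k:end]):
                if k == m:
                    hits.append(pos)   # the whole window equals the pattern
                else:
                    y = k
        pos += m - y
    return hits, inspected


# Every window is scanned completely and shifted by one: about m * n inspections.
print(rf_search("a" * 20, "a" * 5)[1])
# Every window stops after one character and is shifted by m: about n / m inspections.
print(rf_search("b" * 20, "a" * 5)[1])
```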
For the average-case analysis of the RF algorithm, assume that the text is a random sequence over an alphabet of size r and that m is sufficiently large. This assumption is reasonable. Let m=16, r=4.
Theorem. The expected time of the RF algorithm is $O(n \log_r m / m)$. Proof. Note that r>1, and . A pattern of length m has no more than m substrings of any fixed length. Thus, there are at most m substrings with length .
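The counting step in the proof can be checked numerically. The sketch below assumes a uniformly random word over an alphabet of size r; the substring length used on the slide is not recoverable here, so `length` is left as a free parameter, and `substring_hit_bound` is a hypothetical helper name.

```python
from fractions import Fraction


def substring_hit_bound(pattern, alphabet_size, length):
    """Return (number of distinct substrings of that length, bound m / r**length).

    A random word of the given length is a substring of the pattern with
    probability (distinct substrings) / r**length, which is at most m / r**length.
    """
    m = len(pattern)
    distinct = {pattern[i:i + length] for i in range(m - length + 1)}
    bound = Fraction(m, alphabet_size ** length)
    return len(distinct), bound


count, bound = substring_hit_bound("GCAGAGAG", 4, 3)
print(count, bound)  # 4 distinct substrings of length 3, bound 1/8
```

The bound shrinks geometrically in `length`, which is why a long scan inside a random window is unlikely.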
Let Li be the length of the shift in the ith attempt of the RF algorithm, and let Xi and Yi be the X and the Y of the ith attempt, respectively. Let Si be the length of the longest prefix of the reversed window $W^R$ which appears in the reversed pattern $P^R$ in the ith attempt. That is, Si=|Xi|. Let Ai=|Yi|, so that Ai ≤ Si because Yi is a prefix of Xi.
Let us call the ith shift long if and only if and short otherwise. (It implies that Li is long if .)
When at least new symbols are read in the current attempt, with probability at most characters of the suffix of the window match a substring of P, which causes a long shift.
We divide all attempts into phases. Each phase ends on the first long shift. In other words, there is exactly one long shift in each phase.
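The division into phases can be sketched as a simple grouping of the shift lengths. The threshold that separates long from short shifts is not recoverable from this transcript, so it appears below as the hypothetical parameter `long_threshold`; the function name is likewise made up.

```python
def split_into_phases(shifts, long_threshold):
    """Group consecutive shifts into phases; each phase ends on its first long shift.

    Returns (complete phases, trailing shifts that have not yet met a long shift).
    """
    phases, current = [], []
    for shift in shifts:
        current.append(shift)
        if shift >= long_threshold:    # a long shift closes the phase
            phases.append(current)
            current = []
    return phases, current


print(split_into_phases([2, 3, 12, 1, 9], long_threshold=8))
```

Every complete phase returned this way contains exactly one long shift, namely its last element.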
There are two main ideas in the paper: • The number of all phases is $O(n/m)$. • The expected number of comparisons in each phase is $O(\log_r m)$. • We shall discuss these two ideas in the next slides.
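The number of all phases is $O(n/m)$. Every phase ends with a long shift, whose length is at least a constant fraction of m, and the shifts together cannot exceed the text length n. Hence the number of all phases is $O(n/m)$.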
Next, we calculate the expected number of comparisons in each phase. Claim 1: Assume that Li and Li+1 are both short. Then . That is, Li+2 is the end of a phase. Proof. Suppose Li and Li+1 < , then the pattern is of the form where , w, .
Note that Yi denotes the longest suffix of the window Wi which is equal to a prefix of the pattern, where Wi is a window of the text of length m in the ith attempt. Let Bi be the set of new symbols to read in the ith attempt. Note that the pattern is of the form $v(wv)^k s z$. Then , , .
Let Bi+1 be because there exists an overlap between Yi and Yi+1, and
Example: T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a,z=dd. , Then , When P shifts Li+1, the overlap of Yi and Yi+1 is
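The attempts on this example can be traced with a short sketch. It assumes the shift rule Li = m - Ai with Ai taken over proper suffixes of the window, uses a set of all substrings of P in place of the suffix tree, and `rf_trace` is a name invented for the illustration.

```python
def rf_trace(text, pattern):
    """List (window start, S, A, shift) for every attempt on `text`.

    S: length of the longest suffix of the window that is a substring of P.
    A: length of the longest proper suffix of the window that is a prefix of P.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    factors = {pattern[i:j] for i in range(m) for j in range(i + 1, m + 1)}
    trace, pos = [], 0
    while pos + m <= len(text):
        end = pos + m
        s = 0
        a = 0
        while s < m and text[end - s - 1:end] in factors:
            s += 1
            if s < m and pattern.startswith(text[end - s:end]):
                a = s
        shift = m - a
        trace.append((pos, s, a, shift))
        pos += shift
    return trace


for attempt in rf_trace("bbcabcabcabcabcadc", "cabcabcabcadd"):
    print(attempt)
```

On this text the trace shows two small shifts (2 and 3) followed by a shift of 12, which matches the situation discussed in Claim 1 where two consecutive short shifts are followed by a long one.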
Without loss of generality we can assume that is a minimal period of . . If there exists a word such that , then because is a minimal period of . Hence,
Example: P=abcabcabcabcabcabcabbc, w=cabc,v=ab, s=b,z=c. w1v1 is a minimal period of P.
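The decompositions given in the examples can be checked mechanically. The sketch below assumes that the pattern has the form v(wv)^k s z; this form is inferred from the examples on the slides rather than quoted from them, and the helper names are hypothetical. It also computes the smallest period of the periodic part v(wv)^k.

```python
def compose(v, w, k, s, z):
    """Build the word v (wv)^k s z."""
    return v + (w + v) * k + s + z


def find_k(pattern, v, w, s, z):
    """Return the exponent k for which v (wv)^k s z equals `pattern`, or None."""
    for k in range(len(pattern) + 1):
        if compose(v, w, k, s, z) == pattern:
            return k
    return None


def smallest_period(word):
    """Smallest p >= 1 such that word[i] == word[i + p] for every valid i."""
    for p in range(1, len(word) + 1):
        if all(word[i] == word[i + p] for i in range(len(word) - p)):
            return p
    return None


print(find_k("cabcabcabcadd", v="c", w="ab", s="a", z="dd"))
print(find_k("abcabcabcabcabcabcabbc", v="ab", w="cabc", s="b", z="c"))
print(smallest_period("ab" + "cabcab" * 3))
```

In the second example wv = cabcab has length 6 while the periodic part has smallest period 3, which illustrates why a shorter period w1v1 can be chosen.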
We can also assume (eventually changing wv and k) that and sz do not have a common prefix. We may therefore obtain a new fragment s1z1 such that
A suffix of the read part of the text is of the form , and we have at least C=min(Li+1, Li) new symbols to read in the (i+2)th attempt. Let e be a random word of length C to be read part of the text such that .
Note that If |Bi|>|Bi+1|, then , otherwise, , .
We give an example when |Bi|>|Bi+1|. T=bbbaaaaaaaacda, P=aaaaaaaabc, w=a, v=a, s=a, z=bc.
We give another example for the other case, |Bi| ≤ |Bi+1|. T=bbcabcabcabcabcadc, P=cabcabcabcadd, w=ab, v=c, s=a, z=dd.
It is easy to see that if w1v1s1e is a substring of , then y must be either equal to pref(z1) if , or otherwise.
In other words, by the above condition, if , w1v1s1e would only appear to the end of P. Therefore, e=pref(z1). otherwise, w1v1s1e may appear to any position of P. Therefore,
Note that The probability that reading e new symbols leads to a long (longer than Li+Li+1 which is less than ) substring of the pattern is no greater than .
By Claim 1, the assumptions say that when the (k-1)th and (k-2)th shifts are both short, the kth shift is long with probability . It implies that the kth shift of the phase is short with probability for
Let F be the random variable which is the number of short shifts in the phase. What can we say about the probability distribution of F?
By claim 1, we know when (k-2)th and (k-1)th are both short, .
Let G be the random variable counting the comparisons made in the phase, and let L be the number of comparisons made in the long shift of the phase. Then The problem is how to find L.
For the number of comparison of a long shift of the phase, we know and . Note that Si is the length of the substring of the pattern that is matched in Wi. Hence,
For the expected number of comparison of each phase, we have
According to the above discussion, there are $O(n/m)$ phases in the algorithm and the expected number of comparisons in each phase is $O(\log_r m)$. Therefore, the expected time of the RF algorithm is $O(n \log_r m / m)$.
In this paper, the authors use X to analyze the average case of the RF algorithm, where X is the longest suffix of W which is equal to a substring of P. In fact, the main idea of the RF algorithm is to find Y, not X. Therefore, we may re-analyze the expected length of Yi. Note that the shift is Li=m-|Yi|=m-Ai. If Ai is small, Li is large. We expect Ai to be very small.
Given a window Wi of T in the ith attempt and a pattern P, the expected length of the longest suffix of Wi equal to a prefix of P is …..(1) …..(2)
We randomly generate some texts and patterns using Knuth’s random generating function in the first experiment.
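An experiment of this kind can be sketched as a small Monte Carlo estimate of the expected Ai. This is not the setup used in the slides: it relies on Python's `random` module rather than Knuth's generator, draws the window and the pattern independently and uniformly, and restricts Ai to proper suffixes so that Li = m - Ai is always a positive shift. All names are hypothetical.

```python
import random


def longest_proper_suffix_prefix(window, pattern):
    """Length of the longest proper suffix of `window` that is a prefix of `pattern`."""
    for k in range(len(window) - 1, 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def estimate_mean_overlap(m, r, trials, seed=0):
    """Estimate the mean of Ai for random windows and patterns of length m.

    The alphabet consists of the first r lowercase letters (r <= 26).
    """
    rng = random.Random(seed)          # fixed seed keeps the run reproducible
    alphabet = [chr(ord("a") + i) for i in range(r)]
    total = 0
    for _ in range(trials):
        pattern = "".join(rng.choice(alphabet) for _ in range(m))
        window = "".join(rng.choice(alphabet) for _ in range(m))
        total += longest_proper_suffix_prefix(window, pattern)
    return total / trials


mean_a = estimate_mean_overlap(m=16, r=4, trials=2000)
print("estimated mean Ai:", mean_a)
print("estimated mean shift m - Ai:", 16 - mean_a)
```

With a one-letter alphabet every window overlaps the pattern in m - 1 characters, which gives a quick sanity check of the estimator.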