Approximate string matching using factor automata
Jan Holub and Borivoj Melichar, Theoretical Computer Science vol. 249, p. 305-311
Speaker: L. C. Chen. Advisor: R. C. T. Lee.
Problem
• DL(P, X), the edit distance between strings P and X, is the minimum number of edit operations (substitution, insertion and deletion) needed to convert P into X.
• Given a text T of length n, a pattern P of length m, and an integer k with k ≤ m ≤ n, approximate string matching is the problem of finding every substring X of T whose edit distance DL(P, X) from the pattern P is less than or equal to k.
An example of Edit Distance
To convert P = abcde into T = bcfeg:
• Delete a: P1 = bcde
• Substitute d with f: P2 = bcfe
• Insert g: bcfeg = T
Three edit operations are used and no shorter sequence exists, so DL(P, T) = 3.
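The edit distance in the example can be checked with the standard dynamic-programming recurrence. This is a minimal sketch; the function name `levenshtein` is chosen here for illustration and does not come from the slides.

```python
def levenshtein(p, x):
    """Minimum number of substitutions, insertions and deletions turning p into x."""
    # prev[j] holds the distance between the prefix of p read so far and x[:j].
    prev = list(range(len(x) + 1))
    for i, pc in enumerate(p, start=1):
        cur = [i]  # turning p[:i] into the empty string takes i deletions
        for j, xc in enumerate(x, start=1):
            cur.append(min(
                prev[j] + 1,               # delete pc
                cur[j - 1] + 1,            # insert xc
                prev[j - 1] + (pc != xc),  # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]


print(levenshtein("abcde", "bcfeg"))
```

The table is filled row by row and only the previous row is kept, so the sketch uses memory proportional to the length of x.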
Basic definition
• Fac(T): the set of all substrings (factors) of the text T.
• A nondeterministic finite automaton (NFA) is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states.
• M(Fac(T)): the factor automaton, which accepts Fac(T).
Factor automaton
The factor automaton M(Fac(T)) is a deterministic finite automaton (DFA) that accepts all substrings of the given text T.
T = aabbabd
Fac(T) = {a, b, d, aa, ab, bb, ba, bd, aab, abb, bba, bab, abd, aabb, abba, bbab, babd, aabba, abbab, bbabd, aabbab, abbabd, aabbabd}
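One way to obtain a deterministic automaton that accepts exactly the factors of T is the well-known online suffix-automaton construction with every state treated as final. The sketch below uses that construction as an assumption; the slides do not say which construction the authors use, and the names `build_factor_automaton` and `is_factor` are illustrative.

```python
def build_factor_automaton(text):
    """Return the transition table of a DFA accepting every factor of text.

    State 0 is the initial state and every state is treated as final.
    """
    trans = [{}]   # trans[s][ch] = target state
    link = [-1]    # suffix links, used only during construction
    length = [0]   # length of the longest word leading to each state
    last = 0
    for ch in text:
        cur = len(trans)
        trans.append({})
        length.append(length[last] + 1)
        link.append(0)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p != -1:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter words.
                clone = len(trans)
                trans.append(dict(trans[q]))
                length.append(length[p] + 1)
                link.append(link[q])
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans


def is_factor(trans, word):
    """Follow the transitions; the word is a factor if no transition is missing."""
    state = 0
    for ch in word:
        if ch not in trans[state]:
            return False
        state = trans[state][ch]
    return True


automaton = build_factor_automaton("aabbabd")
print(is_factor(automaton, "bbab"), is_factor(automaton, "dab"))
```

Because the automaton is deterministic, checking a word takes one transition per character, independent of the length of T.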
A suffix tree can also be used to recognize all substrings of T = aabbabd.
P = bab, k = 1. The finite automaton M(Lk(P)) accepts Lk(P).
Lk(P) = {ab, bb, ba, aab, bab, dab, bbb, bdb, baa, bad, bbab, bdab, baab, badb}.
State labels in the figure: one matched, 0 errors; three matched, 0 errors; one matched, one error.
Examples of strings recognized by M(Lk(P)) for P = bab, k = 1:
• Recognize ab (the first b of P is deleted)
• Recognize aab (the first b of P is replaced by a)
• Recognize bbab (one b is inserted)
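The listed language Lk(P) can be reproduced by simulating an NFA whose states are pairs (matched characters, errors). The transition structure below is an assumption inferred from the listed set rather than read from the figure: an insertion is allowed only strictly inside the pattern and only for a symbol that differs from the next pattern character. The names `accepts_lk` and `language` are illustrative.

```python
from itertools import product


def accepts_lk(pattern, k, word):
    """Simulate the assumed NFA with states (i, e): i matched characters, e errors."""
    m = len(pattern)

    def closure(states):
        # Deletion: an epsilon move that skips pattern[i] at the cost of one error.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    current = closure({(0, 0)})
    for ch in word:
        nxt = set()
        for i, e in current:
            if i == m:
                continue  # nothing may follow a complete match
            if ch == pattern[i]:
                nxt.add((i + 1, e))          # match
            elif e < k:
                nxt.add((i + 1, e + 1))      # substitution
                if i > 0:
                    nxt.add((i, e + 1))      # insertion inside the pattern
        current = closure(nxt)
    return any(i == m for i, _ in current)


def language(pattern, k, alphabet):
    """Enumerate every accepted word; none can be longer than len(pattern) + k."""
    limit = len(pattern) + k
    words = ("".join(w) for n in range(limit + 1) for w in product(alphabet, repeat=n))
    return {w for w in words if accepts_lk(pattern, k, w)}


print(sorted(language("bab", 1, "abd"), key=lambda w: (len(w), w)))
```

Under these assumptions, strings such as babd or abab are rejected even though their edit distance from bab is 1, which agrees with the set listed on the slide.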
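Definition
• Let M1 = (Q1, Σ, δ1, q01, F1) and M2 = (Q2, Σ, δ2, q02, F2) be finite automata. An automaton for the intersection of M1 and M2 is the automaton M = (Q1 × Q2, Σ, δ, (q01, q02), F1 × F2), where δ((p, q), a) = δ1(p, a) × δ2(q, a) for all p ∈ Q1, q ∈ Q2 and a ∈ Σ.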
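The usual product construction for the intersection of two finite automata can be sketched as follows. The sketch assumes ε-free automata given as hypothetical triples (start, finals, delta), where delta maps (state, symbol) to a set of states, and it builds only the product states reachable from the start pair. The two small automata at the end are toy inputs, not taken from the slides.

```python
from collections import deque


def intersect(m1, m2):
    """Product automaton accepting the words accepted by both m1 and m2."""
    start1, finals1, delta1 = m1
    start2, finals2, delta2 = m2
    start = (start1, start2)
    delta, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        p, q = queue.popleft()
        if p in finals1 and q in finals2:
            finals.add((p, q))
        for (state, symbol), targets1 in delta1.items():
            if state != p:
                continue
            # Both components must be able to read the same symbol.
            for r in targets1:
                for s in delta2.get((q, symbol), set()):
                    delta.setdefault(((p, q), symbol), set()).add((r, s))
                    if (r, s) not in seen:
                        seen.add((r, s))
                        queue.append((r, s))
    return start, finals, delta


def accepts(nfa, word):
    """Run an epsilon-free NFA on word by tracking the set of active states."""
    start, finals, delta = nfa
    current = {start}
    for ch in word:
        current = {t for s in current for t in delta.get((s, ch), set())}
    return bool(current & finals)


# Toy inputs: words ending in b, and words with an even number of a's.
ends_in_b = (0, {1}, {(0, "a"): {0}, (0, "b"): {1}, (1, "a"): {0}, (1, "b"): {1}})
even_a = ("e", {"e"}, {("e", "a"): {"o"}, ("o", "a"): {"e"},
                       ("e", "b"): {"e"}, ("o", "b"): {"o"}})
both = intersect(ends_in_b, even_a)
print(accepts(both, "aab"), accepts(both, "ab"))
```

Restricting the construction to reachable pairs keeps the result no larger than the full Cartesian product of the two state sets, and often much smaller.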
T = aabbabd, P = bab, k = 1
Intersection of M(Lk(P)) and M(Fac(T)).
Solutions: {ba, bab, bb, bbab, aab, ab}. (All end in state {3,0} or {3,1}.)
Intersection: strings found in T for P = bab, with their edit distances
• DL(P, ba) = 1
• DL(P, bab) = 0
• DL(P, bb) = 1
• DL(P, bbab) = 1
• DL(P, aab) = 1
• DL(P, ab) = 1
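The intersection for T = aabbabd, P = bab, k = 1 can be reproduced by running an approximate-matching NFA along every factor of the text, which amounts to exploring the product of a trivial factor automaton (the current text position) with the NFA. This is a sketch under assumptions: the NFA transitions (no insertion before the first or after the last pattern character, inserted symbol different from the next pattern character) are inferred from the listed Lk(P), and the function name `approximate_factors` is illustrative.

```python
def approximate_factors(text, pattern, k):
    """Return the factors of text accepted by the assumed NFA for Lk(pattern)."""
    m = len(pattern)

    def closure(states):
        # Deletion: skip pattern[i] without reading input, one more error.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    def step(states, ch):
        nxt = set()
        for i, e in states:
            if i == m:
                continue
            if ch == pattern[i]:
                nxt.add((i + 1, e))          # match
            elif e < k:
                nxt.add((i + 1, e + 1))      # substitution
                if i > 0:
                    nxt.add((i, e + 1))      # insertion inside the pattern
        return closure(nxt)

    found = set()
    for start in range(len(text)):
        current = closure({(0, 0)})
        for end in range(start, len(text)):
            current = step(current, text[end])
            if not current:
                break  # no longer factor starting here can be accepted
            if any(i == m for i, _ in current):
                found.add(text[start:end + 1])
    return found


print(sorted(approximate_factors("aabbabd", "bab", 1)))
```

Every accepted factor ends with the NFA in a state whose first component is 3, that is, with all three pattern characters accounted for and 0 or 1 errors.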
Lemma • The number of states of the automaton is always lower than ….