Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.
Problem Definition • The Indexing Problem: • Input: A text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k (the algorithm presented here handles k = 1). • Output: All occurrences of P in T with at most k mismatches.
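As a point of reference for the problem statement, here is a minimal brute-force sketch in Python. It is not the indexing algorithm of the paper; it simply checks every window of T, and the function name `hamming_occurrences` is a hypothetical choice made for this illustration.

```python
def hamming_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Return 1-based start positions where pattern matches text
    with at most k mismatches (brute force, O(n*m) time)."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already has too many errors
        if mismatches <= k:
            hits.append(start + 1)  # the slides number text positions from 1
    return hits


print(hamming_occurrences("actgacctcagctta", "ctaa", 1))  # [2, 7, 12]
```

The indexing approach on the following slides aims to answer the same question without scanning the whole text for every query.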
Main idea • The algorithm builds a suffix tree of the text T and a suffix tree of the reversed text TR (which acts as a prefix tree of T). For each j = 1, 2, …, m, it looks up the reversed prefix Pj-1 … P1 in the tree of TR and the suffix Pj+1 … Pm in the tree of T. If both occur in T with exactly one text position between them, P matches T there with at most one error, located at pattern position j.
Processing • 1. Construct a suffix tree ST of the text string T and a suffix tree STR of the string TR, where TR = tn … t1 is the reversed text.
Ex: T=AGCAGAT TR=TAGACGA
Processing • 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.
Ex: T=AGCAGAT TR=TAGACGA
Processing • 3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf vl and rightmost leaf vr in the linked list.
Ex: T=AGCAGAT TR=TAGACGA
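Steps 2 and 3 can be pictured without drawing the tree: the left-to-right leaf list is the list of suffixes in lexicographic order, and the leaves under a node v form one contiguous interval of that list. The sketch below assumes a naive sort in place of a real suffix tree; `leaf_order` and `leaf_interval` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def leaf_order(text: str) -> list[int]:
    """1-based suffix start positions in lexicographic order,
    i.e. the leaves of the suffix tree read from left to right."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def leaf_interval(text: str, leaves: list[int], prefix: str) -> tuple[int, int]:
    """Half-open interval [lo, hi) of leaves whose suffix starts with prefix.
    It plays the role of the pointers (vl, vr) stored at node v."""
    # Truncating sorted suffixes to a fixed length keeps them sorted.
    keys = [text[i - 1:i - 1 + len(prefix)] for i in leaves]
    return bisect_left(keys, prefix), bisect_right(keys, prefix)


T = "AGCAGAT"
leaves = leaf_order(T)
print(leaves)                     # [4, 1, 6, 3, 5, 2, 7]
lo, hi = leaf_interval(T, leaves, "AG")
print(leaves[lo:hi])              # [4, 1]
```

A real suffix tree gives the same interval in time proportional to the length of the searched string; the sort here is only for illustration.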
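Processing • 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in STR by n – i + 3, where i is the starting position of the leaf's suffix in TR. (That suffix is the reversed prefix of T ending at position n – i + 1; skipping one position for the mismatch, the rest of the pattern must start at position n – i + 3.)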
Ex: T=AGCAGAT TR=TAGACGA
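The offset in n – i + 3 can be checked mechanically. The following sketch, with hypothetical helper names, confirms for the slide's text that the suffix of TR starting at i is the reversed prefix of T ending at n – i + 1, so the label is that end position plus 2 (one step for the mismatch, one more for the start of the remaining suffix).

```python
def reversed_leaf_label(n: int, i: int) -> int:
    """Label of the STR leaf whose suffix starts at position i of TR."""
    return n - i + 3


def labels_by_start(text: str) -> list[int]:
    """Labels for i = 1..n, verifying the prefix correspondence on the way."""
    n = len(text)
    rev = text[::-1]
    labels = []
    for i in range(1, n + 1):
        prefix_end = n - i + 1
        # Suffix of TR from i, read backwards, is the prefix T[1..prefix_end].
        assert rev[i - 1:][::-1] == text[:prefix_end]
        # Mismatch sits at prefix_end + 1, the rest resumes at prefix_end + 2.
        assert reversed_leaf_label(n, i) == prefix_end + 2
        labels.append(reversed_leaf_label(n, i))
    return labels


print(labels_by_start("AGCAGAT"))  # [9, 8, 7, 6, 5, 4, 3]
```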
Query Processing • For j = 1, …, m do • 1. Find node v, the location of Pj+1 … Pm in ST, if such a node exists. • 2. Find node w, the location of Pj-1 … P1 in STR, if such a node exists. • 3. If both v and w exist, let V[vl … vr] and W[wl … wr] be the leaf values under v and w, and compute their intersection I. For every i ∈ I, P matches T with at most one error at Ti-j … Ti-j+m-1 (the possible mismatch is at Ti-1).
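The query loop can be sketched end to end in Python. This is an illustration under stated assumptions, not the paper's implementation: sorted suffix lists stand in for the two suffix trees, a plain set intersection stands in for the range query, and the empty suffix at position n + 1 is included (as an end marker would provide) so that a mismatch at the first or last position of an occurrence is not lost. A common leaf value i is the text position right after the mismatch, so the occurrence starts at i – j. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def sorted_leaves(s: str) -> list[int]:
    """1-based suffix starts of s in lexicographic order, including the
    empty suffix at len(s) + 1 (assumption: models an end marker)."""
    return sorted(range(1, len(s) + 2), key=lambda i: s[i - 1:])


def leaves_with_prefix(s: str, leaves: list[int], prefix: str) -> list[int]:
    """Leaves under the node reached by spelling prefix from the root."""
    keys = [s[i - 1:i - 1 + len(prefix)] for i in leaves]
    return leaves[bisect_left(keys, prefix):bisect_right(keys, prefix)]


def one_mismatch_query(text: str, pattern: str) -> list[int]:
    """1-based starts of occurrences of pattern with at most one mismatch."""
    n, m = len(text), len(pattern)
    rev = text[::-1]
    st, st_r = sorted_leaves(text), sorted_leaves(rev)
    starts = set()
    for j in range(1, m + 1):
        # V: starts of P[j+1..m] in T.
        V = set(leaves_with_prefix(text, st, pattern[j:]))
        # W: occurrences of P[j-1..1] in TR, relabelled as n - i + 3.
        W = {n - i + 3
             for i in leaves_with_prefix(rev, st_r, pattern[:j - 1][::-1])}
        starts.update(i - j for i in V & W)
    return sorted(starts)


print(one_mismatch_query("actgacctcagctta", "ctaa"))  # [2, 7, 12]
```

Exact occurrences are also returned by this sketch, since "at most one mismatch" includes zero; the set removes the duplicates that arise when the same start is found for several values of j.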
Example Ex: T=actgacctcagctta P=ctaa k=1
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of T
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa Suffix Tree of TR
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa j=1 v=Pj+1…Pm=taa w=Pj-1…P1=ε V[vl…vr]=∅ (taa does not occur in T) W[wl…wr]={3,12,…,14} I=∅
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa j=2 v=Pj+1…Pm=aa w=Pj-1…P1=c V[vl…vr]=∅ (aa does not occur in T) W[wl…wr]={4,8,9,14,11} I=∅
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa j=3 v=Pj+1…Pm=a w=Pj-1…P1=tc V[vl…vr]={15,5,1,10} W[wl…wr]={5,10,15} I={5,10,15}
When j=3, the intersection of V[vl…vr]={15,5,1,10} and W[wl…wr]={5,10,15} is I={5,10,15}. Therefore P matches with one error at Ti-j…Ti-j+m-1 for all i ∈ I: T2…T5 (ctga), T7…T10 (ctca), T12…T15 (ctta). T=actgacctcagctta P=ctaa
Ex: T=actgacctcagctta TR=attcgactccagtca P=ctaa j=4 v=Pj+1…Pm=ε w=Pj-1…P1=atc V[vl…vr]={15,5,…,13} W[wl…wr]=∅ (atc does not occur in TR) I=∅
Range Query Problem • In step 3 of the query processing, given nodes v and w, we want the leaf values that appear both in the interval [vl … vr] and in the interval [wl … wr], where the four endpoints are the leaf pointers set in step 3 of the processing. This is an instance of the range query problem.
Problem Definition of Range Query • Input: Let V=[v1,v2 … vn] and W=[w1,w2 … wn] be two permutation arrays, where n is the number of elements. Four constants i,j,k and l, where both i+k < n and j+l < n. • Output: Find the intersection of elements of V[i … i+k] and W[j … j+l].
Example: V=[8,5,1,4,3,7,6,2] W=[3,6,4,7,2,1,5,8] i=3, k=3, j=2, l=4 Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]
Preprocessing • [Figure: V and W drawn against axes labeled 1 to 8.] • The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.
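The range query can be read geometrically: each value x becomes the point (position of x in V, position of x in W), and the query asks for the points inside a rectangle. The sketch below is a simple linear scan that assumes this reading; it is not Overmars' data structure, and `range_intersection` is a hypothetical name. Reading V[i … i+k] inclusively, the ranges V[v3 … v6] and W[w2 … w6] of the example correspond to i=3, k=3, j=2, l=4.

```python
def range_intersection(V: list[int], W: list[int],
                       i: int, k: int, j: int, l: int) -> list[int]:
    """Values common to V[i..i+k] and W[j..j+l] (1-based, inclusive)."""
    # Position of every value in W: the second coordinate of its grid point.
    pos_in_w = {value: idx for idx, value in enumerate(W, start=1)}
    # Scan the V range and keep the points whose W coordinate is in range.
    return sorted(x for x in V[i - 1:i + k] if j <= pos_in_w[x] <= j + l)


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
print(range_intersection(V, W, 3, 3, 2, 4))  # [1, 4, 7]
```

The scan costs time proportional to the length of the V range, whereas a dedicated range-searching structure is meant to make the cost depend on the number of reported points.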
Time Complexity of Range Query Problem • By using Overmars' algorithm [O88], the range query problem can be solved with preprocessing time [bound missing] and query time [bound missing], where k is the number of points in the range. [O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.
Time Complexity • For the indexing problem, the preprocessing time is [bound missing] and the query can be implemented in [bound missing], where tocc is the number of occurrences of the pattern in the text with one error.
Problem Definition • The Dictionary Matching Problem • Input: • 1. A dictionary P = {p1, …, ps}, where pi, i = 1, …, s, are patterns over alphabet Σ; the dictionary size is the sum of the lengths of all the dictionary patterns. • 2. A text T of length n over alphabet Σ. • 3. An integer k. • Output: • All occurrences of any dictionary pattern in T with at most k mismatches.
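Main idea • The algorithm builds a suffix tree of D, the concatenation of all dictionary patterns, and a suffix tree of the reversed string DR (which acts as a prefix tree). For each text position j = 1, 2, …, n, it follows the text suffix tj+1 … tn in the tree of D and the reversed text prefix tj-1 … t1 in the tree of DR. If some pattern has a suffix matched on one side and the complementary prefix matched on the other, that pattern occurs in T with at most one error, located at text position j.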
Processing • 1. Construct a suffix tree SD of string D and suffix tree SDR of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.
Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
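The two strings of step 1 can be built as follows. This sketch follows the layout shown in the example, where DR lists the reversed patterns in reverse order, each followed by the separator; that is the literal reversal of D with its leading separator moved to the end. The function names and the choice of `$` as the separator argument are assumptions made for illustration.

```python
def build_d(patterns: list[str], sep: str = "$") -> str:
    """Concatenate the patterns, with a separator after each one."""
    return "".join(p + sep for p in patterns)


def build_dr(patterns: list[str], sep: str = "$") -> str:
    """Reversed patterns in reverse order, each followed by a separator."""
    return "".join(p[::-1] + sep for p in reversed(patterns))


patterns = ["tca", "gctga", "gca"]
print(build_d(patterns))   # tca$gctga$gca$
print(build_dr(patterns))  # acg$agtcg$act$
```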
Processing • 2. Modify suffix trees SD and SDR, respectively, as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ', where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,w) with σ and (w,v) with $σ'.
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Preprocessing • 3. Scan suffix tree SD, respectively SDR, and modify it as follows. For each vertex v consider the associated string L(v), i.e., the string spelled from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this, note that all the relevant suffixes share the prefix L(v)$. So, go to the edge (v,w) whose label begins with $, if such an edge exists, and scan the subtree rooted at w to find all relevant suffixes.
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=TCA$GCTGA$GCA$ DR=ACG$AGTCG$ACT$
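The marking of step 3 can be illustrated without a tree: a dictionary keyed by the string L(v) plays the role of the labeled vertices, and each entry lists the locations in D where a pattern suffix equal to L(v) starts. This is a minimal sketch for the SD side only; `suffix_marks` is a hypothetical name, and locations are counted from 1 in D.

```python
from collections import defaultdict


def suffix_marks(patterns: list[str], sep: str = "$") -> dict[str, list[int]]:
    """Map each string s to the 1-based locations in D of the pattern
    suffixes equal to s (the marks a vertex v with L(v) = s would carry)."""
    marks = defaultdict(list)
    offset = 0  # number of characters of D before the current pattern
    for p in patterns:
        for q in range(len(p)):
            marks[p[q:]].append(offset + q + 1)
        offset += len(p) + len(sep)
    return dict(marks)


marks = suffix_marks(["tca", "gctga", "gca"])
print(marks["ca"])  # [2, 12]
print(marks["a"])   # [3, 9, 13]
```

The marks for SDR would be obtained the same way from the reversed patterns, giving the locations of pattern prefixes.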
Query Processing • For j = 1, …, n do • 1. Find node v, the location of the longest prefix of tj+1 … tn in SD. • 2. Find node w, the location of the longest prefix of tj-1 … t1 in SDR. • 3. Find the intersection of the markings of the nodes on the path from the root to v in SD and on the path from the root to w in SDR. Each common marking identifies a dictionary pattern that occurs in T with at most one error, located at tj.
Example T=acagccga P={tca,gctga,gca} k=1
Suffix Tree of D (SD) Example: P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga
Suffix Tree of DR (SDR) Example: P={tca,gctga,gca} D=tca$gctga$gca$ DR=acg$agtcg$act$ T=acagccga
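The dictionary query can be sketched for this example by treating every text position j as the possible error and intersecting two sets of markings: pattern suffixes that continue to the right of j, and pattern prefixes that end just to the left of j. The code below is a brute-force stand-in that uses direct string comparison instead of the marked trees SD and SDR; the function name and the (start, pattern) output format are assumptions made for illustration.

```python
def dictionary_one_mismatch(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """(1-based start, pattern) for every occurrence with at most one mismatch."""
    n = len(text)
    hits = set()
    for j in range(1, n + 1):       # text position treated as the error
        right = text[j:]            # t_{j+1} ... t_n
        left = text[:j - 1][::-1]   # t_{j-1} ... t_1
        # (pattern index, q): the pattern suffix after position q starts `right`.
        sufs = {(idx, q) for idx, p in enumerate(patterns)
                for q in range(1, len(p) + 1) if right.startswith(p[q:])}
        # (pattern index, q): the reversed pattern prefix before q starts `left`.
        pres = {(idx, q) for idx, p in enumerate(patterns)
                for q in range(1, len(p) + 1) if left.startswith(p[:q - 1][::-1])}
        for idx, q in sufs & pres:
            # Pattern position q is aligned with text position j.
            hits.add((j - q + 1, patterns[idx]))
    return sorted(hits)


print(dictionary_one_mismatch("acagccga", ["tca", "gctga", "gca"]))
# [(1, 'gca'), (1, 'tca'), (4, 'gca'), (4, 'gctga')]
```

In the tree-based version, the two sets are read off the markings stored on the root-to-v and root-to-w paths rather than recomputed for every j.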