Advisor: Prof. R. C. T. Lee Speaker: L. C. Chen

Finding approximate palindromes in strings
Pattern Recognition, vol. 35, pp. 2581-2591, 2002
Alexandre H. L. Porto and Valmir C. Barbosa

Definition
• S: a string of n characters.
• S[i]: the ith character in S.
• S[i..j]: the substring of S whose first and last characters are S[i] and S[j].
• SR: the reverse of S. Example: S = abcab, SR = bacba.

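Definition
• An even (odd) palindrome is a string of the form SRS (SRaS), where a is a single character. Thus abaccaba is an even palindrome because abac is the reverse of caba.
• c: the center of a palindrome S[i…j] in S; for an even palindrome, c is the position of the last character of its left half. Example: in S = abaccaba, S[2…7] = baccab is an even palindrome with c = 4.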

Edit distance
• In edit distance, there are three types of differences between two strings X and Y:
• Insertion: a symbol of Y is missing in X at a corresponding position. Example: X = A-T, Y = AGT.
• Substitution: symbols at corresponding positions are distinct. Example: X = ACC, Y = TCC.
• Deletion: a symbol of X is missing in Y at a corresponding position. Example: X = GCA, Y = G-A.

• ED(A, B) denotes the edit distance between two strings A and B, i.e. the minimum number of substitutions, insertions and deletions of characters needed to transform B into A.
• Example: the alignment A = abcab-a, B = cb-abbc contains 1 insertion, 2 substitutions and 1 deletion.
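The operation counts quoted for the alignment above can be reproduced with a short Python sketch. It assumes the gap is written as an ASCII "-" and follows the slide's convention that a gap in the first string is an insertion and a gap in the second is a deletion; the function name count_edit_ops is hypothetical.

```python
def count_edit_ops(x_aligned: str, y_aligned: str, gap: str = "-"):
    """Return (insertions, substitutions, deletions) for a gapped alignment of X and Y."""
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned strings must have equal length")
    insertions = substitutions = deletions = 0
    for x, y in zip(x_aligned, y_aligned):
        if x == gap and y == gap:
            continue                 # a column of two gaps carries no operation
        if x == gap:
            insertions += 1          # a symbol of Y is missing in X
        elif y == gap:
            deletions += 1           # a symbol of X is missing in Y
        elif x != y:
            substitutions += 1       # distinct symbols at the same position
    return insertions, substitutions, deletions


print(count_edit_ops("abcab-a", "cb-abbc"))  # (1, 2, 1)
```

The counts describe this particular alignment only; the sketch does not search for the alignment with the fewest operations.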

Approximate palindromes
• An approximate palindrome with error up to k: a string of the form XS (XaS) such that ED(XR, S) ≦ k, i.e. the reverse of the left half is within k edits of the right half.
• An approximate palindrome is maximal if no other approximate palindrome for the same c and k has strictly greater size, or the same size but strictly fewer errors.

To simplify our discussion, we only discuss even approximate palindromes here.
• Example: S = aabaabcd and k = 1. At c = 3, abaa (substitute b with a) and aabaa (delete b) are even approximate palindromes, and aabaa is a maximal approximate palindrome.
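As a quick check of the example above, the following Python sketch tests whether two halves around a center form an even approximate palindrome, by comparing the reversed left half with the right half. The helper names and the two-row edit distance routine are illustrative assumptions, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # substitution or match
                           prev[j] + 1,               # drop a character of a
                           cur[j - 1] + 1))           # drop a character of b
        prev = cur
    return prev[-1]


def is_even_approx_palindrome(left: str, right: str, k: int) -> bool:
    """True if reversing `left` gives a string within k edits of `right`."""
    return edit_distance(left[::-1], right) <= k


# S = aabaabcd, c = 3, k = 1
print(is_even_approx_palindrome("ab", "aa", 1))   # abaa  -> True
print(is_even_approx_palindrome("aab", "aa", 1))  # aabaa -> True
```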

Problem
• Given a string T of size n, we want to find all maximal approximate palindromes in T with up to k errors.
• For each c, we find the largest i’ and j’ such that ED(T[c+1…i’], TR[1…j’]) ≦ k, where T[c+1…i’] is a prefix of T[c+1…n] and TR[1…j’] is a prefix of TR[1…c], the reverse of T[1…c].

• Let S2 = TR[1…c] and S1 = T[c+1…n], where 1 ≦ c ≦ n.
• In the dynamic programming approach, we construct a matrix D of size (n’+1) × (m’+1), where Di,j is the minimum edit distance between S1[1…i] and S2[1…j], and n’ and m’ are the lengths of S1 and S2 respectively.
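A minimal Python sketch of the dynamic programming matrix described above, with D[i][j] holding the edit distance between the first i characters of S1 and the first j characters of S2. The function name edit_distance_table is an assumption made for illustration.

```python
def edit_distance_table(s1: str, s2: str):
    """Build the (n'+1) x (m'+1) matrix D, indexed as D[i][j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # S1[1..i] against the empty string
    for j in range(m + 1):
        D[0][j] = j                      # the empty string against S2[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # substitution or match
                D[i - 1][j] + 1,                             # step from column i-1
                D[i][j - 1] + 1,                             # step from row j-1
            )
    return D


D = edit_distance_table("abc", "bc")
print(D[3][2])  # 1: abc and bc differ by one edit
```

Filling the whole matrix costs O(n’m’) time for each center c, which is the cost the diagonal method is meant to avoid.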

Example: T = dbcaabac and k = 2.
• At c = 3, S2 = TR[1…3] = cbd and S1 = T[4…8] = aabac.
• In the table, ↖ denotes a substitution or a match, ↑ a deletion and ← an insertion.
• We can find that the maximal approximate palindrome is bcaab.
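The example above can be checked by brute force: fill the whole table for one center and pick the cell with value at most k that maximizes i + j, breaking ties by fewer errors. This is a sketch of the plain dynamic programming baseline, not the faster method of the paper; the function name and the 0-based slicing are assumptions.

```python
def brute_force_maximal_palindrome(T: str, c: int, k: int):
    """Return (palindrome, errors) for the even center after the first c characters of T."""
    s1 = T[c:]           # T[c+1..n] in the slides' 1-based notation
    s2 = T[:c][::-1]     # reverse of T[1..c]
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    best_i, best_j = 0, 0    # the empty palindrome always qualifies
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] <= k:
                # Larger size wins; for equal size, fewer errors wins.
                if (i + j, -D[i][j]) > (best_i + best_j, -D[best_i][best_j]):
                    best_i, best_j = i, j
    return T[c - best_j:c + best_i], D[best_i][best_j]


print(brute_force_maximal_palindrome("dbcaabac", 3, 2))  # ('bcaab', 2)
```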

How can we compute the table faster?
• In this paper, the method in [LV89] (L. Y. Huang) was used.

• We shall heavily use the concept of diagonal.
• Diagonal d is defined as all of the Di,j’s where d = i - j.
• The diagonal property: Di,j - Di-1,j-1 = 0 or 1. It means that the values along a diagonal never decrease. [U85]
• Example table, with i indexing abc and j indexing bc:

        i    0  1  2  3
                a  b  c
  j=0        0  1  2  3
  j=1  b     1  1  1  2
  j=2  c     2  2  2  1

  Diagonal 0 holds D0,0, D1,1, D2,2 = 0, 1, 2. Diagonal 2 holds D2,0, D3,1 = 2, 2.

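Consider Diagonal d = 0. Let us find the largest j, if it exists, such that (i, j) is on Diagonal d (i - j = d) and Di,j = 0. Let us now label all of these locations.

S1 = gggtcta (indexed by i), S2 = gttc (indexed by j). Cells not yet computed are shown as "."; on Diagonal 0, D1,1 = 0.

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  .  .  .  .  .  .
  j=2  t     2  .  .  .  .  .  .  .
  j=3  t     3  .  .  .  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .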

Having found the above locations (i, j) where Di,j = 0, we can further find the largest j, if it exists, such that (i, j) is on Diagonal d and Di,j = 1.
• To do this, we use the following observation: each element in Diagonal d can only influence elements in Diagonals d-1, d and d+1.

Let us consider any location (i, j) on Diagonal d. Di,j can only be influenced by three neighbors:
• Di-1,j-1 on Diagonal d (substitution),
• Di,j-1 on Diagonal d+1 (deletion),
• Di-1,j on Diagonal d-1 (insertion).
Thus, we conclude that we only need to consider Diagonals d-1, d and d+1 for each Di,j.

Observe the following two strings T1 and T2. If i and j are the largest i and j such that ED(T1[1…i], T2[1…j]) = k and T1[i+1] ≠ T2[j+1], then ED(A1+x, A2+y) = k+1, where A1 = T1[1…i], A2 = T2[1…j], x = T1[i+1] and y = T2[j+1].

• Consider T1 = abcd and T2 = cbde. ED(T1[1…i], T2[1…j]) = 2. The largest such i and j are 2 and 3 respectively, and T1[i+1] = c ≠ T2[j+1] = e. Thus ED(ab+c, cbd+e) = 2+1 = 3.

Based upon the above discussion, on a diagonal d, we can find the largest i and j such that Di,j = e.
• How can we find the largest row containing a value smaller than or equal to k?
• We let Ld,e denote the largest row j such that Di,j is on Diagonal d (i - j = d) and Di,j = e ≦ k.

Let Ld,e denote the largest row j such that Di,j is on Diagonal d (i - j = d) and Di,j = e ≦ k. Based upon this definition, e is the edit distance between S1[1…i] and S2[1…j], where i and j are the largest such indices, and S2[j+1] ≠ S1[i+1].
• Example: S1 = gggtcta, S2 = gttc. At d = 0: L0,0 = 1, L0,1 = 2, L0,2 = 3 and L0,3 = 4.

How can we compute the value of Ld,e?
• We define rowd,e = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)], where the three terms correspond to substitution, insertion and deletion respectively.
• Ld,e = rowd,e + t, where t = the length of the longest common prefix of S1[d+rowd,e+1…n’] and S2[rowd,e+1…m’]. If t = 0, it means that S1[d+rowd,e+1] ≠ S2[rowd,e+1].

Consider D3,2. L1,1 = 1: the largest j on d = 1 for Di,j = 1 is j = 1. In this case, d = 1 and e = 2. Ld,e-1 = L1,1 = 1, Ld-1,e-1 = L0,1 = 2 and Ld+1,e-1 = L2,1 = 0. Thus rowd,e = row1,2 = max(L1,1 + 1, L0,1, L2,1 + 1) = max(1+1, 2, 0+1) = max(2, 2, 1) = 2.
(Figure: the table with Diagonals d = 0, d = 1 and d = 2 marked.)

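Example: e = 1, d = -1, with S1 = gggtcta and S2 = gttc.
• How to compute L-1,1?
• row-1,1 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[(L-1,0 + 1), (L-2,0), (L0,0 + 1)] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2.
• Since S1[d+rowd,e+1] = S1[-1+2+1] = S1[2] = g ≠ S2[rowd,e+1] = S2[2+1] = t, L-1,1 = row-1,1 + 0 = 2.

Table so far (cells not yet computed are shown as "."); D1,2 = 1 lies on Diagonal -1:

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  .  .  .  .  .  .
  j=2  t     2  1  .  .  .  .  .  .
  j=3  t     3  .  .  .  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .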

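Example: e = 2, d = 1, with S1 = gggtcta and S2 = gttc.
• How to compute L1,2?
• row1,2 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[(L1,1 + 1), (L0,1), (L2,1 + 1)] = max[1+1, 2, 0+1] = max[2, 2, 1] = 2.
• Since the length of the longest common prefix of S1[d+row1,2+1…n’] = S1[4…7] = tcta and S2[row1,2+1…m’] = S2[3…4] = tc is 2, L1,2 = row1,2 + 2 = 4.

Table so far (cells not yet computed are shown as "."); D3,2 = D4,3 = D5,4 = 2 lie on Diagonal 1:

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  1  .  .  .  .  .
  j=2  t     2  1  1  2  .  .  .  .
  j=3  t     3  2  2  2  2  .  .  .
  j=4  c     4  .  .  .  .  2  .  .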

• Ld,e = rowd,e + t, where t = the length of the longest common prefix of S1[d+rowd,e+1…n’] and S2[rowd,e+1…m’].
• How can we compute t? In this paper, LCA (lowest common ancestor) is used.

Consider two substrings T1 = A1 S1 x and T2 = A2 S2 y. If ED(A1, A2) = k and S1 = S2, then ED(A1+S1, A2+S2) = k.

When we find ED(A1, A2) = k, we want to determine whether the longest common prefix S of the remaining parts B1 and B2 exists. This paper uses LCA (lowest common ancestor) to find S.

To find such S, if it exists, we may concatenate S1 and S2 into a new string. The suffixes S1’ and S2’ of this new string then have S as a common prefix.

Let us concatenate S1 = gggtcta and S2 = gttc into the new string gggtctagttc. Consider D3,2: the substring after ggg is tctagttc = S1’, and the substring after gt in S2 is tc = S2’. Note that S2’ and S1’ have a common prefix of length 2. Thus we have D3,2 = D4,3 = D5,4 = 2, all on Diagonal d = 1.

S1 = gggtcta, S2 = gttc. Let us concatenate S1 and S2 into the new string gggtctagttc, and then construct its suffix tree. The substring after ggg is tctagttc = S1’. The substring after gt in S2 is tc = S2’. Note that the lowest common ancestor of S2’ and S1’ in the suffix tree spells tc, of length 2.
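The quantity obtained from the suffix tree is the longest common prefix of two suffixes of the concatenated string. The Python sketch below computes the same value by a plain character scan, which is a simpler stand-in and not the suffix tree with LCA queries used in the paper: it costs time proportional to the answer rather than constant time. The function name and 0-based offsets are assumptions. The scan is capped so that the suffix starting inside S1 is never matched past the end of S1.

```python
def lcp_via_concatenation(s1: str, s2: str, p: int, q: int) -> int:
    """Length of the longest common prefix of s1[p:] and s2[q:] (0-based offsets)."""
    text = s1 + s2
    a = p                 # start of the suffix S1' inside the concatenation
    b = len(s1) + q       # start of the suffix S2' inside the concatenation
    limit = min(len(s1) - p, len(s2) - q)   # keep S1' inside s1 and S2' inside s2
    t = 0
    while t < limit and text[a + t] == text[b + t]:
        t += 1
    return t


# S1' = tctagttc (after ggg), S2' = tc (after gt)
print(lcp_via_concatenation("gggtcta", "gttc", 3, 2))  # 2
```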

Algorithm
Initialization
  for all d, 1 ≦ d ≦ k+1, d ＞ e: Ld,e = -1.
  for all d, -(k+1) ≦ d ≦ -1: Ld,|d|-1 = -1, Ld,|d|-2 = |d|-2.
  for all e, -1 ≦ e ≦ k: Ln’+1,e = -1.
Find L0,0 = the length of the longest common prefix of S1 and S2.
For e = 1 to k do
  For d = -e to e do
    rowd,e = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)]
    rowd,e = min(rowd,e, m’)
    if rowd,e < m’ and rowd,e + d < n’ then
      find t = the length of the longest common prefix of S1[d+rowd,e+1…n’] and S2[rowd,e+1…m’];
      rowd,e = rowd,e + t;
    Ld,e = rowd,e.
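The loop above can be sketched in Python as follows. This is an illustrative version under stated assumptions rather than the paper's implementation: unreachable entries are treated as minus infinity instead of the initialization values listed in the algorithm, the longest common prefix is found by a direct scan instead of suffix tree and LCA queries, and both function names are hypothetical.

```python
def furthest_rows(s1: str, s2: str, k: int):
    """L[(d, e)]: furthest row j on diagonal d = i - j reachable with e errors."""
    n, m = len(s1), len(s2)
    NEG = float("-inf")
    L = {}

    def extend(d, row):
        # Slide along diagonal d while the next characters match (the LCP step).
        while row < m and row + d < n and s1[row + d] == s2[row]:
            row += 1
        return row

    L[(0, 0)] = extend(0, 0)
    for e in range(1, k + 1):
        for d in range(max(-e, -m), min(e, n) + 1):
            row = max(L.get((d, e - 1), NEG) + 1,      # substitution
                      L.get((d - 1, e - 1), NEG),      # insertion
                      L.get((d + 1, e - 1), NEG) + 1)  # deletion
            row = min(row, m, n - d)                   # stay inside the table
            L[(d, e)] = extend(d, row)
    return L


def maximal_palindrome_by_diagonals(T: str, c: int, k: int):
    """Return (palindrome, errors) for the even center after the first c characters."""
    s1, s2 = T[c:], T[:c][::-1]
    L = furthest_rows(s1, s2, k)
    best = None
    for e in range(k + 1):                 # ascending e keeps the fewest errors on ties
        for d in range(-e, e + 1):
            if (d, e) in L:
                size = 2 * L[(d, e)] + d   # i + j with i = j + d
                if best is None or size > best[0]:
                    best = (size, d, e)
    _, d, e = best
    j = L[(d, e)]
    return T[c - j:c + j + d], e


print(maximal_palindrome_by_diagonals("cttggggtcta", 4, 2))  # ('cttggggtc', 2)
```

For S1 = gggtcta and S2 = gttc this yields L0,0 = 1, L-1,1 = 2, L0,1 = 2, L1,1 = 1, L1,2 = 4 and L2,2 = 2.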

Example: T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = TR[1…4] = gttc and S1 = T[5…11] = gggtcta. S1 is indexed by i = 1…7 and S2 by j = 1…4.

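• At d = 0, find the largest j such that S2[1…j] is equal to S1[1…i], then we set the value of L0,0 = j.
• S2[1] = S1[1] but S2[2] ≠ S1[2], so L0,0 = 1.

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  .  .  .  .  .  .
  j=2  t     2  .  .  .  .  .  .  .
  j=3  t     3  .  .  .  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .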

e = 1, d = -1
• row-1,1 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[0, 0, 2] = 2.
• The length of the longest common prefix of ggtctagttc and tc is 0.
• L-1,1 = 2

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  .  .  .  .  .  .
  j=2  t     2  1  .  .  .  .  .  .
  j=3  t     3  .  .  .  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .

The length of LCA of ggtctagttc and tc is 0.

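e = 1, d = 0
• row0,1 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = max[2, 0, 1] = 2.
• The length of the longest common prefix of gtctagttc and tc is 0.
• L0,1 = 2

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  .  .  .  .  .  .
  j=2  t     2  1  1  .  .  .  .  .
  j=3  t     3  .  .  .  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .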

The length of LCA of gtctagttc and tc is 0.

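e = 1, d = 1
• row1,1 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = 1.
• The length of the longest common prefix of gtctagttc and ttc is 0.
• L1,1 = 1

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  1  .  .  .  .  .
  j=2  t     2  1  1  .  .  .  .  .
  j=3  t     3  .  .  .  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .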

The length of LCA of gtctagttc and ttc is 0.

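e = 2, d = 1
• row1,2 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = 2

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  1  .  .  .  .  .
  j=2  t     2  1  1  2  .  .  .  .
  j=3  t     3  2  2  2  .  .  .  .
  j=4  c     4  .  .  .  .  .  .  .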

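e = 2, d = 1
• We find that the longest common prefix of tc (S2’) and tctagttc (S1’) is tc.
• L1,2 = row1,2 + 2 = 2 + 2 = 4

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  1  .  .  .  .  .
  j=2  t     2  1  1  2  .  .  .  .
  j=3  t     3  2  2  2  2  .  .  .
  j=4  c     4  .  .  .  .  2  .  .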

The length of LCA of tctagttc and tc is 2.

e = 2, d = 2
• row2,2 = max[(Ld,e-1 + 1), (Ld-1,e-1), (Ld+1,e-1 + 1)] = 1
• We find that the length of the longest common prefix of ttc (S2’) and tctagttc (S1’) is 1.
• L2,2 = row2,2 + 1 = 1 + 1 = 2

Table so far (cells not yet computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  1  2  .  .  .  .
  j=2  t     2  1  1  2  2  .  .  .
  j=3  t     3  2  2  2  2  .  .  .
  j=4  c     4  .  .  .  .  2  .  .

The length of LCA of ttc and tctagttc is 1.

T = cttggggtcta and k = 2. At c = 4, T[1…4] = cttg, S2 = TR[1…4] = gttc and S1 = T[5…11] = gggtcta. cttggggtc is the maximal approximate palindrome.

Final table (cells not computed are shown as "."):

        i    0  1  2  3  4  5  6  7
                g  g  g  t  c  t  a
  j=0        0  1  2  3  4  5  6  7
  j=1  g     1  0  1  2  .  .  .  .
  j=2  t     2  1  1  2  2  .  .  .
  j=3  t     3  2  2  2  2  .  .  .
  j=4  c     4  .  .  .  .  2  .  .

References
• [U85] E. Ukkonen, Finding approximate patterns in strings, Journal of Algorithms, vol. 6, 1985, pp. 132-137.
• [LV89] G. Landau and U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, vol. 10, 1989, pp. 157-169.
