450 likes | 581 Views
A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrics. By Cruchemor, Landau and Ziv-ukelson. Abstract.
E N D
A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrics By Cruchemor, Landau and Ziv-ukelson
Abstract We present an O(n²/log n) algorithm for computing the optimal global alignment value of two strings,of size n each, over a constant alphabet.the algorithm is even faster when the sequence is compressible. In fact, for most texts, the complexity of the algorithm is actually O(hn²/log n). Let A and B be two run-length encoded strings of encoded lengths m’ and n’, respectively. we will show an algorithm for comparing A and B in O(m’n+n’m) complexity, using any distance or similarity schemes with additive gap.
Definition 1 – global alignment problem: given a scoring matrix over the alphabet , the similarity of two strings A and B is defined as the value maxV of the alignment of A and B, that maximizes the total alignment value. * The score value maxV is called the optimal global alignment value of A and B. * A description of a maxV – scoring transformation of A into B is called a global alignment trace.
Global alignment via dynamic programming Reminder: The classical dynamic programming algorithm creates the alignment graph, row by row in a left to right order : V(I,j)= max [V(I,j-1) + (,Bj), V(I-1,j) + (Ai,), V(I-1,j-1) + (Ai,Bj)] Complexity: O(n²).
a a c g a c g a 6 7 3 4 1 5 2 8 0 c 1 t 2 a 3 c 4 g 5 a 6 g 7 a 8 B Alignment graph A Alignment graph for comparing the strings - A=“ctacgaga” and B=“aacgacga”.
Scoring matrix Scoring matrix , specifies the substitution score for each pair of characters and the insertion/deletion scores for each character from the alphabet. A gap is the result of the deletion of one or more consecutive characters in one of the sequences. Additive gap scores assign a constant weight to each of the consecutive characters. The algorithm assume a scoring scheme with additive gap.
A block partition of the alignment graph based on LZ LZ Main idea: based on the idea of self reference. while scanning text, the text seen so far is parsed into phrases, where each phrase is the longest matching phrase seen previously plus one character. Each phrase is encoded as an index to it’s prefix, plus the extra character.
LZ example Example: S=“aacgacga” is divided into four phrases: a, ac, g, acg. The trie for S: a 0 g 1 c 3 2 g 4
LZ continuous Theorem (Lempel and Ziv): given a sequence S of size n over a constant alphabet. The maximal number of distinct phrases in S is O(n/log n). A tighter upper bound will be O(hn/log n) where h is the entropy factor – a real number, 0 < h 1, which measures the amount of disorder in a sequence. Entropy is small when there is a lot of order (I.e. the sequence is repetitive) and large when there is a lot of disorder.
Trie for B g a 0 c 1 3 g 2 4 Trie for A a t 0 g c 3 2 1 5 g 4 LZ – partitioned alignment graph 3 4 2 1 0 1 2 3 4 5
Computing the optimal global similarity value • Let Xa and Yb denote phrases in A and B respectively , obtained by extending a previous phrase X/Y with character a/b respectively. • block G corresponds to the comparison of Xa and Yb. • input borderI – the left and top borders of G. • output border O – the bottom and right borders of G. • we define the following three prefix blocks of G: 1. The left prefix of G denotes the block comparing phrase Xa of A and phrase Y of B. 2. The diagonal prefix of G denotes the block comparing phrase X of A and phrase Y of B. 3. The top prefix of G denotes the block comparing phrase X of A and phrase Yb of B.
Trie for B g a 0 c 1 3 g 2 a c g 4 a Trie for A g a t 0 g c 3 2 1 5 g a c a c a c g 4 a a a g Computing block G 3 2 4 1 0 1 2 3 4 5 Graph G for block (5,4) diagonal prefix (3,2) top prefix (3,4) left prefix (5,2)
a c g a c a a g g left prefix G a c g a g Note that the graph for the left prefix of G is identical to the sub-graph of G containing al columns but the last one. More specifically, both the structure and the weights of edges are identical, but the weights to be assigned during the computation may vary according to input border values. Similarly for the top and diagonal prefix graphs. The only new cell in G, which doesn’t appear in any of it’s prefix blocks is the cell for comparing a and b. This cell consists of one new vertex and three new edges.
The work for each block consists of two stages: 1. Encoding: study the structure of G and represent it in an efficient way. 2. Propagation: given I and the encoding of G, compute O for G. The structure of G is encoded by computing weights of optimal path connecting each entry of its input border with each entry of its output border. DIST matrix is used. DIST[i,j] stores the weight of the optimal path from entry i of the input border of G to entry j of its output border. OUT matrix – merges the information from input row I and DIST. given I and DIST, OUT[r,j]=Ir + DIST[r,j]. Vertex Oj of O is the maximum of column j of OUT. I/O propagation Across G
a c g 2 4 3 5 5 a 1 4 g I 0 3 1 2 0 G O DIST and OUT matrix example Block G – subsequences “acg”, “ag” DIST matrix 0 -1 -2 -3 -1 -1 -2 -1 -3 -2 0 0 1 -1 -3 -2 -2 0 -2 -2 -2 0 -1 -1 -2 -1 0 OUT matrix 0 I (input borders) I0=1 I1=2 I2=3 I3=2 I4=1 I5=3 1 0 -1 -2 -- 1 1 0 1 -1 - 1 3 3 4 2 0 -12 0 0 2 0 0 -13 -13 -1 1 0 0 -14-14-14 1 2 3 I0 + DIST[0,j] O (output borders) O0 O1 O2 O3 O4 O5 1 3 3 4 2 3
0 -1 -2 -3 -1 -1 -2 -1 -3 -2 0 0 1 -1 -3 -2 -2 0 -2 -2 -2 0 -1 -1 -2 -1 0 the rectangle problem Since paths in an alignment graph can only assume a left-to-right, top-to-bottom direction, connection between some input border vertices and some output vertices are impossible. Therefore, the matrices are missing both a lower-left triangle and upper-right triangle. For the next stage, OUT matrix have to be a full, rectangular matrix.
Addressing the rectangle problem The undefined entries of OUT can be complemented in constant time each, as follows: 1. The missing upper-right triangle can be completed by setting all entries to -. 2. Let k denote the maximal absolute value of a score in . The missing lower-left triangle entries will be setting to: -(n+i+1)*k. where n is the shortest path in the original alignment graph. Lemma : complementing the undefined entries as described above preserves the concave total monotonicity condition of out, and does not introduce new row-maxima.
DIST and OUT properties Definition: a matrix M[0…m,0…n] is Monge if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n: 1. Convex condition: M[a,c] + M[b,d] M[b,c] + M[a,d] for all a<b and c<d. 2. Concave condition: M[a,c] + M[b,d] M[b,c] + M[a,d] for all a<b and c<d. Previous researches observed that DIST matrices are Monge arrays. Since DIST is Monge, so is OUT, which is a DIST with constants added to its rows. An important property of Monge arrays is that of being totally monotone. c e d a b
a b o d c DIST and OUT properties continuous Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n: 1. Convex condition: M[a,c] M[b,c] M[a,d] M[b,d] for all a<b and c<d. 2. Concave condition: M[a,c] M[b,c] M[a,d] M[b,d] for all a<b and c<d. Both DIST and OUT are totally monotone by the concave condition. The concave condition If b-c is better than a-c, than b-d is better than a-d.
Computing O An algorithm called SMAWK can compute in O(n) time all row and column maximum of an n x n totally monotone matrix, by querying only O(n) elements of the array. Hence we can use SMAWK on OUT to compute output row O. clearly, if both the full DIST and I are available, than computing an element of OUT is O(1) work. Thus, there is no need to compute and store the full matrix OUT but only the entries of OUT which are needed by SMAWK.
Computing O In order to efficiently utilize Monge and Monotonicity properties, we need to address two problems: 1. SMAWK is intended for a full, rectangular matrix – already explained. 2. How to efficiently compute DIST and represent it in a format which allows direct access to its entries.
a c g a g Incremental update of the new DIST information for G As shown, in block G, there is only one “new” cell, in its lowest – rightmost corner. When processing G, we compute the scores of t (t= |I|) new optimal paths, leading from input border i to the new vertex (lr,lc). This values correspond to column lc of the DIST matrix for G and can be computed as follows:
Incremental update of the new DIST information for G The weight of the optimal path from entry Ii to (lr,lc) is equal to the maximum among: 1. Entry [i] of column lc –1 of the DIST for the left prefix of G, plus the weight of the horizontal edge leading into (lr,lc). a c g Left prefix DIST[i , lc –1] + edge a g
Incremental update of the new DIST information for G The weight of the optimal path from entry Ii to (lr,lc) is equal to the maximum among: 1. Entry [i] of column lc –1 of the DIST for the left prefix of G, plus the weight of the horizontal edge leading into (lr,lc). 2. Entry [i] of column lc –1 of the DIST for the diagonal prefix of G, plus the weight of the diagonal edge leading into (lr,lc). a c g a diagonal prefix DIST[i , lc –1] + edge g
top prefix DIST[i , lc] + edge Incremental update of the new DIST information for G The weight of the optimal path from entry Ii to (lr,lc) is equal to the maximum among: 1. Entry [i] of column lc –1 of the DIST for the left prefix of G, plus the weight of the horizontal edge leading into (lr,lc). 2. Entry [i] of column lc –1 of the DIST for the diagonal prefix of G, plus the weight of the diagonal edge leading into (lr,lc). 3. Entry [i] of column lc of the DIST for the top prefix of G, plus the weight of the vertical edge leading into (lr,lc). a c g a g
DIST(5,4) (Block G) 0 -1 -2 -3 -1 -1 -2 -1 -3 -2 0 -3 0 1 -1 -2 -2 -2 0 -2 -1 -2 0 -1 0 -2 -1 Maintaining direct access to DIST columns As shown, for each block only one new DIST column has been computed and stored. All other columns besides column lc of the DIST for G need to be obtained from G’s prefix ancestor blocks. Therefore, before the execution of SMAWK begins, a vector with pointers to all columns of the DIST for G is constructed. This vector can be deleted after the computation for G. Column lc is set to the newly constructed DIST vector for G. All columns of indices smaller than lc are obtained via lc recursive calls to left prefix blocs of G. All columns of indices greater than lc are obtained via lc recursive calls to top prefix blocs of G.
Trie for B Trie for A a g a t 0 0 c g 1 3 c 3 2 g 2 1 5 g 4 4 Querying a prefix block and obtaining its DIST column in constant time Each block will be identified by a pair of numbers, composed of the serial numbers for its corresponding phrases in the LZ tries for A and B. For example: Comparison of “ag” and “acg” – (5,4). The left prefix of G can be identified in constant time as follows: The first number identical to the serial number of Xa, and the second corresponding to the serial number of y, which is the direct ancestor of Yb in the trie for B.
Trie for A Trie for B a g a t 0 0 c g 1 3 c 3 2 g 1 2 5 g 4 4 Querying a prefix block and obtaining its DIST column in constant time Each block will be identified by a pair of numbers, composed of the serial numbers for its corresponding phrases in the LZ tries for A and B. For example: Comparison of “ag” and “acg” – (5,4). The left prefix of G can be identified in constant time as follows: The first number identical to the serial number of Xa, and the second corresponding to the serial number of y, which is the direct ancestor of Yb in the trie for B. Example: left prefix of G (5,4) will be: (5,2). Similarly, Top prefix (3,4) can be identified in constant time.
Trie for B Trie for A a g a t 0 0 c g 1 3 c 3 2 g 2 1 5 g 4 4 -3 -1 1 0 0 -2 Another data structured to be constructed is a Block Table, containing an entry for each partitioned block of the alignment graph. The entry for each block points to the start of its new DIST column, and can directly accessed via the block’s phrase number index pair. Block table New DIST column for (5,4)
Trie for B Trie for A a g a t 0 0 c g 1 3 c 3 2 g 2 1 5 g DIST(5,4) (Block G) 4 4 0 -1 -2 -3 -1 -1 -2 -1 -3 -2 0 -3 0 1 -1 -2 -2 -2 0 -2 -1 -2 0 -1 0 -2 -1 Maintaining direct access scheme Block Table
Assuming |A|=|B|=n. LZ algorithm – O(n). Number of phrases in both A and B – O(hn/log n). Number of blocks in the alignment graph – O(h²n²/(log n)²) For each block: 1. Updating the encoding structure for G – the new DIST column for G can be constructed in o(t) (linear with the size of its I/O borders). 2. Maintaining direct access to DIST columns – the vector with pointers to columns of the DIST for G can be set in O(t). 3. Propagating I/O values across the block – SMAWK in O(t). Total complexity – since the work for every block is linear with its I/O borders, the total complexity is linear with the total size of the borders of the blocks. In the alignment graph there are O(hn/log n) rows of size |B|, and O(hn/log n) columns of size |A|. Therefore, the total complexity is O(hn²/(log n)). complexity
Global similarity optimal alignment trace recovery The recovery of an optimal global alignment trace between A and B goes back from vertex (n,n) to vertex (0,0).for each block crossed, the internal alignment trace is reported. In order to support the recovery of block crossing paths in linear time, the following information for a given block G is required: 1. During propagation stage, input border entry i, which is the source of the highest scoring path to output border entry j, is saved for each output entry j.
a c g 2 4 3 5 5 a 1 4 g 0 3 G 1 2 0 O 2. During encoding, an additional O(t) sized vector of pointers called ancestors vector, is computed for G. the value of ancestor[lc] is set to G. all columns of indices smaller than lc are obtained via lc recursive calls to left prefix blocks of G. columns of indices greater than lc are obtained via lc recursive calls to top prefix blocks of G. Ancestor[3]=G Ancestor[2]=“ag” x “ac”
a c g a g G 2 4 3 5 5 1 4 0 3 1 2 0 O 2. During encoding, an additional O(t) sized vector of pointers called ancestors vector, is computed for G. the value of ancestor[lc] is set to G. all columns of indices smaller than lc are obtained via lc recursive calls to left prefix blocks of G. columns of indices greater than lc are obtained via lc recursive calls to top prefix blocks of G. Ancestor[3]=G Ancestor[4]=“a” x “acg” Ancestor[2]=“ag” x “ac” Ancestor[5]=“ ” x “acg” Ancestor[1]=“ag” x “a” Ancestor[0]= “ag” x “ “
3. During encoding, G’s new vertex is annotated with an additional O(t) sized vector of pointers called direction. These pointers are set during the DIST column computation, according to the direction of the last edge in the optimal path from I[i] to the new vertex. Given that the optimal path enters through entry j of the output border of G,The trace-back of the part of the path going through G has two stages: • destination and origin initialization: fetching I[i] which was stored as the origin for the highest path to G’s O[j]. Entry i is the destination for the trace-back. In addition, the ancestor prefix P pointed by ancestor[j] is fetched. The edge recovery begins in block P.
• trace-back the part of the path contained in P, from entry j on P’s output (the new vertex of P), to entry i on its input border. This is done by backtracking a dynasty of prefix ancestor blocks internal to P, using the direction vector computed for each of the traversed blocks. If direction[I] of the traversed block specifies a horizontal edge, than the trace-back retreats to the left prefix of P, and an “insertion” operation is reported in the alignment trace. Correspondingly, “substitution” and “deletion” are reported when backtracking to diagonal and top prefix blocks. The recovery continues until the full optimal alignment trace is recovered.
a c g 2 4 3 5 a 5 1 4 g I 0 G 3 1 2 0 O a a c a a Trace-back example Trace-back in G – j=3 Weight of optimal path = 1 (from I[2] to O[3]) Stage 1: Entry I[2] is the destination. Ancestor[3] is fetched (which is G) Stage 2: Direction[2]= diagonal Diagonal prefix is fetched Alignment trace: - substitution Direction[1]=horizontal Left prefix is fetched - insertion - substitution Direction[1]=diagonal Done
Trace complexity The vectors direction and ancestor, and the input border entry I, can be computed during the encoding and propagation stages in O(t). The first stage in the trace-back can be done in constant time. Each edge in The recovery stage can be computed in constant time (including accessing to prefix blocks and its direction vectors). The highest scoring global alignment between strings A and B can be recovered in additional time linear in its size, using the O(hn²/log n) of the encoding/propagation..
Comparing run length encoded strings A string S is run-length encoded if it is described as an ordered sequence of pairs (,i): aaabbcccaa = (a,3),(b,2),(c,3),(a,2) or a³b²c³a² Former algorithms achieved O(m’n+n’m) complexity for computing edit distance between two run-length strings. This algorithm runs at the same complexity and is suitable to any similarity scoring scheme with additive gap.
Let m and n be the lengths of two run-length encoded strings X and Y, of encoded length m’ and n’, respectively. The alignment graph is partitioned into blocks, each block consists of two runs – one of X and one of Y. therefore the alignment graph is partitioned into m’ n’ blocks. The algorithm also propagates accumulated scores from the left-upper borders to the bottom-right borders of each block.
Consider the block R for comparing i of X and j of Y. An edge in R could be assigned one of three values: D(diagonal), H (horizontal) and V (vertical) according to scoring scheme. h – the difference in row index between entry i on the input border and entry j on the output border of R. w – the difference in column index between entry i on the input border and entry j on the output border of R. We can compute DIST[i,j] in constant time, given h and w, and the values D, H and V.
Computing DIST[i,j] • H + V D. clearly, the optimal path can use all possible diagonal edges and only than the minimal H and V edges. DIST[i,j] obtains one of three values: 1. If w = h, DIST[i,j]= D x h. 2. If w > h, DIST[i,j]= D x h + H x (w - h). 3. If w < h, DIST[i,j]= D x w + V x (h - w). • H + V >D. in this case an optimal path never uses any diagonal edge. DIST[i,j]= H x w + Vx h. Therefore DIST[i,j] can be easily computed in constant time given a scoring scheme.
Correctness and complexity Block R consists of X and Y. The values D,H and V are the same for all edges in R, and can be find in the scoring scheme in constant time. The O vector for each block is computed using SMAWK. Vector I for block R can be obtained from the O vectors for the blocks above R and the block in its left. Any value OUT[i,j]= I[i] + DIST[i,j], needed by SMAWK, can be computed in constant time. Since The work for each block is linear with its I/O borders, the total time complexity is linear with the total size of the borders of the blocks – O(m’n + n’m).
Note that by computing only the relevant DIST entries (and relevant OUT entries) in constant time, and “on the fly”, these values do not need to be stored, and so we can improve space complexity.