Fast and Practical Algorithms for Computing Runs
Gang Chen – McMaster, Ontario, CAN
Simon J. Puglisi – RMIT, Melbourne, AUS
Bill Smyth – McMaster, Ontario, CAN
CPM, UWO, July 11, 2007
Overview
• I won’t talk much about runs!
• Lempel-Ziv (LZ) Factorization
• How to compute LZ with SA & LCP
• Suffix Array & LCP Array Basics (again!)
• Two different methods for LZ factorization
  • CPS1 and CPS2
  • Various space-time trade-offs
• Experimental comparison to other approaches
LZ Factorization (Defn)
• The LZ factorization LZx of a string x[1..n] is a factorization x = w1w2...wk such that each wj, j ∈ 1..k, is either:
  • a letter that does not occur in w1w2...wj-1; or
  • the longest substring that occurs at least twice in w1w2...wj.
• This is the LZ-77 parsing of the input string
• Also known as the S-factorization (Crochemore)
LZ Factorization (Ex)

        1 2 3 4 5 6 7 8
x   =   a b a a b a b a

wj          a      b      a      aba    ba
(POS,LEN)   (1,0)  (2,0)  (1,1)  (1,3)  (2,2) … or (5,2)

• POS = Position of some previous occurrence
• LEN = Factor length
• Convention: LEN = 0 if the factor is a new letter
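The definition and example can be checked with a brute-force sketch. This is only an illustration of the definition, not one of the algorithms in this talk: the function name `lz_factorize_naive` is made up here, it takes quadratic time or worse, and it assumes the leftmost previous occurrence is the one reported as POS.

```python
def lz_factorize_naive(x):
    """Return the LZ factorization of x as 1-based (POS, LEN) pairs.

    Brute force, for illustration only.
    """
    n = len(x)
    factors = []
    i = 0  # 0-based start of the next factor
    while i < n:
        best_len, best_pos = 0, i
        for j in range(i):  # every earlier start position is a candidate
            k = 0
            # The earlier occurrence is allowed to overlap the factor itself.
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            if k > best_len:  # strict '>' keeps the leftmost best match
                best_len, best_pos = k, j
        factors.append((best_pos + 1, best_len))
        i += max(best_len, 1)  # a new letter (LEN = 0) still consumes one character
    return factors


print(lz_factorize_naive("abaababa"))
# [(1, 0), (2, 0), (1, 1), (1, 3), (2, 2)]
```

For the last factor the sketch reports (2,2); (5,2) from the slide is an equally valid previous occurrence.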
Applications of LZ Factorization
LZ factorization is the computational bottleneck in numerous string processing algorithms:
• Computing all runs (Kolpakov & Kucherov)
• Repeats with fixed gap (Kolpakov & Kucherov… again)
• Branching repeats (Gusfield & Stoye)
• Sequence alignment (Crochemore et al.)
• Local periods (Duval et al.)
• Data compression (Lempel & Ziv, many others)
• Et cetera…
Computing LZ
• “Traditional” method is to use a suffix tree
  • Can be computed as a by-product of Ukkonen’s online suffix tree construction algorithm, OR
  • During a bottom-up traversal of the whole tree
• SA/LCP interval tree (Abouelhoda et al. 2004)
  • Essentially simulates a bottom-up traversal of the suffix tree on the SA/LCP combination
• Both of these approaches use lots of space.
The ubiquitous Suffix Array
• Sort the n suffixes of x[1..n] into lexicographic order
• Store the offsets in an array

        1 2 3 4 5 6 7 8
x   =   a b a a b a b a

Suffixes in string order      After SORT
1 abaababa                    8 a
2 baababa                     3 aababa
3 aababa                      6 aba
4 ababa                       1 abaababa
5 baba                        4 ababa
6 aba                         7 ba
7 ba                          2 baababa
8 a                           5 baba
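A minimal sketch of the definition, assuming nothing more than a comparison sort over the suffixes themselves. The helper name `suffix_array_naive` is hypothetical, and this is far slower than the dedicated suffix array construction algorithms used in practice; it only reproduces the table above.

```python
def suffix_array_naive(x):
    """Return the suffix array of x using 1-based offsets, as on the slide."""
    n = len(x)
    # Sort the start positions by the suffix that begins at each of them.
    return sorted(range(1, n + 1), key=lambda i: x[i - 1:])


print(suffix_array_naive("abaababa"))
# [8, 3, 6, 1, 4, 7, 2, 5]
```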
LCP Array
• Many SA algorithms rely on an additional table: the LCP (longest common prefix) array
• Can be computed in O(n) time (Kasai et al. 1999)
• Several practical improvements: space consumption reduced from 13n to 9n (Manzini 2004)
• The LCP array stores the length of the longest common prefix between suffixes SA[i] and SA[i-1]

SA   LCP   suffix
8    0     a
3    1     aababa
6    1     aba
1    3     abaababa
4    3     ababa
7    0     ba
2    2     baababa
5    2     baba
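The linear-time LCP computation can be sketched as follows. This follows the general idea of the Kasai et al. algorithm (walk the suffixes in string order and reuse the previous match length minus one); the function name, the 0-based indexing, and the naive suffix sort in the usage lines are assumptions made for the illustration.

```python
def lcp_kasai(x, sa):
    """LCP array for a 0-based suffix array sa; lcp[0] is 0 by convention."""
    n = len(x)
    rank = [0] * n  # rank[i] = index of suffix i within sa
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0  # current match length, carried over between iterations
    for i in range(n):  # suffixes in string order
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and x[i + h] == x[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


x = "abaababa"
sa = sorted(range(len(x)), key=lambda i: x[i:])  # naive 0-based suffix array
print(lcp_kasai(x, sa))
# [0, 1, 1, 3, 3, 0, 2, 2]
```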
Computing LZ with the SA
• First “family” of LZ algorithms we call CPS1
• CPS1 algorithms compute arrays POS and LEN
• These arrays give us the factor information for every position (which is more than we require)
• Also, LEN is a permutation of LCP

        1 2 3 4 5 6 7 8
x   =   a b a a b a b a
POS =   1 2 1 1 2 1 2 1
LEN =   0 0 1 3 2 3 2 1
LCP =   0 1 1 3 3 0 2 2   (in SA order)
CPS1: LZ from SA & LCP
• POS and LEN are computed in a straight left-to-right traversal of the SA/LCP arrays
• We “ascend” the LCP array, saving indexes on the stack until LCP values decrease
• Backtrack using the stack to locate the rightmost i1 < i2 with LCP[i1] < LCP[i2]
• As we go, set the larger position with equal LCP to point leftwards to the smaller one
• 14 lines of C code!
• x, SA, LCP, POS, LEN arrays → 17n bytes + stack (n for x, 4n for each integer array)
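The 14 lines of C are not reproduced in these slides, so the following is not the CPS1 code. It is a related stack-based sketch that produces the same LEN array, using a different, well-known observation: the longest previous match for text position i is found at one of the two nearest entries in SA order (one on each side) that hold a smaller text position. All names are hypothetical, the suffix array is built naively, and match lengths are found by direct character comparison rather than from the LCP array, so the worst case is quadratic.

```python
def longest_previous_factors(x):
    """Return 1-based arrays (POS, LEN) for every position of x."""
    n = len(x)
    sa = sorted(range(n), key=lambda i: x[i:])  # naive suffix array, 0-based

    def match(i, j):
        """Length of the longest common prefix of x[i:] and x[j:]."""
        k = 0
        while i + k < n and j + k < n and x[i + k] == x[j + k]:
            k += 1
        return k

    # psv[k] / nsv[k]: text position stored at the nearest SA index before /
    # after k whose value is smaller than sa[k], or -1 if there is none.
    psv, nsv = [-1] * n, [-1] * n
    stack = []  # SA indexes whose stored positions increase from bottom to top
    for k in range(n):
        while stack and sa[stack[-1]] > sa[k]:
            nsv[stack.pop()] = sa[k]
        if stack:
            psv[k] = sa[stack[-1]]
        stack.append(k)

    pos = [i + 1 for i in range(n)]  # default: a new letter points at itself
    length = [0] * n
    for k in range(n):
        i = sa[k]
        for j in (psv[k], nsv[k]):
            if j >= 0:
                m = match(i, j)
                if m > length[i]:
                    length[i], pos[i] = m, j + 1
    return pos, length


pos, length = longest_previous_factors("abaababa")
print(length)
# [0, 0, 1, 3, 2, 3, 2, 1]
```

The POS values returned by this sketch are valid previous occurrences but are not guaranteed to be the same ones a CPS1 implementation would report.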
Overwrite LCP with POS
• Once POS[SA[i]] has been assigned, SA[i] and LCP[i] are no longer accessed…
• Reuse the space:
  • Leave SA[i] as is
  • Assign LCP[i] = POS[SA[i]]
  • Store LEN separately as before
• After the traversal of SA/LCP is complete, permute the SA and “LCP” arrays in place into string order by following all cycles
• POS array no longer needed → 13n bytes + stack
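The in-place permutation step can be sketched as below. This is one possible way to follow the cycles, written under the assumption of 0-based indexes; the function name is hypothetical and the sketch overwrites the SA with the identity permutation as a side effect.

```python
def permute_into_string_order(sa, values):
    """Rearrange values in place so that values[i] belongs to text position i.

    On entry, values[k] belongs to text position sa[k] (0-based).
    On exit, sa has been overwritten with the identity permutation.
    """
    for k in range(len(sa)):
        # Follow the cycle through k; every swap settles one more entry.
        while sa[k] != k:
            j = sa[k]
            values[k], values[j] = values[j], values[k]
            sa[k], sa[j] = sa[j], sa[k]


sa = [7, 2, 5, 0, 3, 6, 1, 4]            # 0-based SA of "abaababa"
pos_in_sa_order = [1, 1, 1, 1, 1, 2, 2, 2]  # POS[SA[i]] as stored in the "LCP" array
permute_into_string_order(sa, pos_in_sa_order)
print(pos_in_sa_order)
# [1, 2, 1, 1, 2, 1, 2, 1]
```

Each swap places one value in its final slot, so the total work is linear even though the loop is nested.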
Eliminate the LEN Array
• Given POS[i] = p, LEN[i] = longestmatch(x[p..n], x[i..n])
• Compute only the POS values
• Permute them into the POS array (as on the last slide)
• Compute LEN values only for factors in the parsing
• The sum of the factor lengths required for the parsing is n, so this is still O(n) time
• LEN array no longer needed → 9n bytes + stack
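A short sketch of this idea: given only POS, the factor lengths are recovered by direct character comparison, and only at positions where a factor starts. The function name is hypothetical, and the sketch assumes the convention that POS[i] = i marks a new letter (as in the POS row shown earlier).

```python
def factors_from_pos(x, pos):
    """Return the LZ factors of x as 1-based (POS, LEN) pairs.

    pos[i - 1] holds POS[i]; POS[i] == i is taken to mean "new letter".
    """
    n = len(x)
    factors = []
    i = 0  # 0-based start of the next factor
    while i < n:
        p = pos[i] - 1
        k = 0
        if p != i:
            # longestmatch(x[p..n], x[i..n]); p < i, so only i + k needs a bound check
            while i + k < n and x[p + k] == x[i + k]:
                k += 1
        factors.append((p + 1, k))
        i += max(k, 1)  # skip the positions covered by this factor
    return factors


print(factors_from_pos("abaababa", [1, 2, 1, 1, 2, 1, 2, 1]))
# [(1, 0), (2, 0), (1, 1), (1, 3), (2, 2)]
```

Each factor costs LEN + 1 character comparisons, and the factor lengths sum to at most n, which is where the O(n) bound on the slide comes from.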
CPS2: LZ without LCP
• LCP computation is slow (though linear) and requires extra space: can we drop it?
• Use the SA to search for the longest previous match at each position in the factorization
• The problem: we don’t want just any match; we want a match to the left.
• When do we stop the search?
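LZ without LCP (cont…)

        1 2 3 4 5 6 7 8
x   =   a b a a b a b a

i    SA[i]   suffix
1    8       a
2    3       aababa
3    6       aba
4    1       abaababa
5    4       ababa
6    7       ba
7    2       baababa
8    5       baba

Search for the factor starting at position 4 (suffix ababa):
• match “a”: RangeMinSA(1,5) = 1, Length = 1
• match “ab”: RangeMinSA(3,5) = 1, Length = 2
• match “aba”: RangeMinSA(3,5) = 1, Length = 3
• match “abab”: RangeMinSA(5,5) = 4, which is not to the left of position 4, so stop with Length = 3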
LZ without LCP (cont…)
• Use two binary searches to refine the range
  • Incremental use of Manber and Myers search
  • Could use other search algorithms (such as the FM-index)
• Preprocess the SA for fast RMQ queries
  • RMQSA(i,j) returns the minimum value in SA[i..j]
  • Fast implementation of RMQ requires n bytes
• O(n log n) time, ~6n bytes space
  • n single-character searches
  • Each search takes O(log n) time
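The search loop can be sketched as follows. This is an illustration of the idea rather than the CPS2 implementation: all names are hypothetical, the suffix array is built with a naive sort, and the range-minimum structure is a plain sparse table (O(n log n) words) standing in for the n-byte RMQ mentioned on the slide.

```python
def cps2_sketch(x):
    """LZ factors of x as 1-based (POS, LEN) pairs, using only SA + RMQ."""
    n = len(x)
    sa = sorted(range(n), key=lambda i: x[i:])  # naive suffix array, 0-based

    # Sparse table: table[j][k] = min(sa[k : k + 2**j]).
    table = [sa[:]]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        table.append([min(prev[k], prev[k + half])
                      for k in range(n - (1 << j) + 1)])
        j += 1

    def range_min(lo, hi):
        """Minimum of sa[lo..hi], both ends inclusive."""
        j = (hi - lo + 1).bit_length() - 1
        return min(table[j][lo], table[j][hi - (1 << j) + 1])

    def char_at(k, d):
        """Character d of the suffix at SA index k; '' sorts first past the end."""
        return x[sa[k] + d] if sa[k] + d < n else ""

    def refine(lo, hi, d, c):
        """Narrow SA range [lo, hi] to the suffixes whose character d equals c."""
        a, b = lo, hi + 1
        while a < b:  # first index whose character d is >= c
            m = (a + b) // 2
            if char_at(m, d) < c:
                a = m + 1
            else:
                b = m
        new_lo, b = a, hi + 1
        while a < b:  # first index whose character d is > c
            m = (a + b) // 2
            if char_at(m, d) <= c:
                a = m + 1
            else:
                b = m
        return new_lo, a - 1

    factors = []
    i = 0
    while i < n:
        lo, hi, length, pos = 0, n - 1, 0, i
        while i + length < n:
            new_lo, new_hi = refine(lo, hi, length, x[i + length])
            # The range is never empty: suffix i itself always lies inside it.
            p = range_min(new_lo, new_hi)
            if p >= i:  # no occurrence to the left: stop the search
                break
            lo, hi, pos = new_lo, new_hi, p
            length += 1
        factors.append((pos + 1, length))
        i += max(length, 1)
    return factors


print(cps2_sketch("abaababa"))
# [(1, 0), (2, 0), (1, 1), (1, 3), (2, 2)]
```

The stopping rule answers the question from the previous slide: the search ends as soon as the smallest text position in the current SA range is no longer to the left of the factor start.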
Experiments
Implemented the CPS algorithms and raced them against:
• Kolpakov and Kucherov’s implementation
  • Computes factors during online construction of the suffix tree (Ukkonen’s algorithm)
  • Tuned specifically for DNA strings
• Abouelhoda et al.’s approach
  • Uses SA and LCP, computes POS and LEN
Conclusions
• KK (Kolpakov & Kucherov) remains the fastest algorithm on DNA
• CPS1 (13n) is consistently fastest on larger alphabets (notably faster than AKO, the approach of Abouelhoda et al.)
• CPS1 (9n) provides a nice space-time trade-off
• CPS2 is most suitable if memory is tight
Future Work
• Computing the LCP array is a burden
  • Can we speed it up?
  • Compute it during SA construction?
• How easily do these algorithms map to compressed SAs?
  • Overwriting SA/LCP is difficult in that setting
• Can LZ be computed efficiently without using SA/LCP or a suffix tree?
• Can we compute the rightmost previous POS instead of the leftmost? (Veli Makinen 7-9-2007)