This study explores alternative algorithms for Lyndon factorization, including Duval's algorithm and variations using LF-skip and run-length encoding. The algorithms are analyzed for efficiency and their application in string compression.
Alternative Algorithms for Lyndon Factorization Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio Aalto University, Finland
Lyndon Word • Given two strings w and w′, w′ is a rotation of w if w=uv and w′=vu, for some strings u and v. • A string is a Lyndon word if it is lexicographically (alphabetically) smaller than all its proper rotations.
Lyndon Word • Example: w = ab, w′ = ba, where u = a and v = b. • w is lexicographically smaller than its only proper rotation w′. • Hence w is a Lyndon word.
Examples of Lyndon words • Lyndon words • a • ab • aabab • Non-Lyndon words • ba • abaac • abcaac
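The definition can be checked directly by comparing a word with each of its proper rotations. The following is a minimal Python sketch, not part of the original slides. The function name is_lyndon is an illustrative choice, and the sketch assumes that Python's built-in string comparison matches the intended alphabetical order (true for lowercase ASCII letters).

```python
def is_lyndon(w: str) -> bool:
    """Return True if w is strictly smaller than all of its proper rotations."""
    if not w:
        # Assumption made here: the empty string is not treated as a Lyndon word.
        return False
    # w[i:] + w[:i] is the rotation vu for the split w = uv with |u| = i.
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


if __name__ == "__main__":
    for word in ["a", "ab", "aabab", "ba", "abaac", "abcaac"]:
        print(word, is_lyndon(word))
```

For the six example words it reports a, ab and aabab as Lyndon words and ba, abaac and abcaac as non-Lyndon. This brute-force test takes quadratic time and is meant only to mirror the definition.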
Lyndon factorization • A word w can be factorized as w = w_1 w_2 … w_m such that each factor w_i is a Lyndon word. • Every string has a unique factorization into Lyndon words in which the sequence of factors is non-increasing with respect to lexicographical order: w_1 ≥ w_2 ≥ … ≥ w_m. • The Lyndon factorization plays an important role in a recent method for sorting the suffixes of a text.
Examples of Lyndon factorization • abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc • aacaacaacaad -> aacaacaacaad (the string is itself a Lyndon word) • abacabab -> abac ab ab
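A proposed factorization can be validated against the two defining conditions: every factor is a Lyndon word, and the factors are lexicographically non-increasing. The sketch below is an illustration added here, with hypothetical helper names is_lyndon and is_lyndon_factorization. It uses a brute-force rotation test and assumes lowercase ASCII input, so that Python's string order is the alphabetical one.

```python
def is_lyndon(w: str) -> bool:
    """Brute-force test: w is strictly smaller than every proper rotation."""
    return bool(w) and all(w < w[i:] + w[:i] for i in range(1, len(w)))


def is_lyndon_factorization(text: str, factors: list) -> bool:
    """Check that `factors` is the Lyndon factorization of `text`."""
    return (
        "".join(factors) == text                                # factors spell the text
        and all(is_lyndon(f) for f in factors)                  # each factor is Lyndon
        and all(a >= b for a, b in zip(factors, factors[1:]))   # non-increasing order
    )


if __name__ == "__main__":
    print(is_lyndon_factorization("abacabab", ["abac", "ab", "ab"]))
    # Splitting the same text as ab | ac | ab | ab fails, because ab < ac.
    print(is_lyndon_factorization("abacabab", ["ab", "ac", "ab", "ab"]))
```

Because the factorization is unique, any split that passes both conditions is the Lyndon factorization of the text.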
Duval’s algorithm • To compute the Lyndon factorization of a word w, the algorithm finds the longest prefix w_1 of w = w_1 w′ that is a Lyndon word and then restarts the process recursively from w′. • Non-empty prefixes of Lyndon words are all of the form (uv)^k u, where uv is a Lyndon word, v is non-empty and k ≥ 1. • Duval’s algorithm computes the factorization in a single left-to-right parse.
Computing Lyndon factorization for T=aabaabaaac • For the string T = aabaabaaac, the parsed prefix P = T[1..i] is a prefix of a Lyndon word and is equal to (uv)^k u for some strings u, v and integer k. • There are then two cases, depending on the next symbol to be read: either P followed by that symbol is still a prefix of a Lyndon word, or it is not.
Computing Lyndon factorization for T=aabaabaaac • For i = 3 we have P = aab, with u the empty string, v = aab and k = 1. • The next symbol to read is 'a', and aaba is still a prefix of a Lyndon word. The next iteration then starts with P = aaba.
Computing Lyndon factorization for T=aabaabaaac • For i = 6, P = aabaab, that is, P = (uv)^k u with u the empty string, v = aab and k = 2. • The next symbol to read is 'a'; after reading 'aaa', the algorithm finds that aabaabaaa is not a prefix of a Lyndon word. • The output is two copies of aab, and the next iteration starts on the suffix aaac of T with P = a.
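The left-to-right parse traced above can be sketched in Python as follows. This is the standard textbook formulation of Duval's algorithm, not code from the presentation. The name duval and the 0-based indices i, j, k are choices made here (the slides index T from 1).

```python
def duval(s: str) -> list:
    """Return the Lyndon factorization of s as a list of factors."""
    factors = []
    n = len(s)
    i = 0                      # start of the part of s not yet factorized
    while i < n:
        j, k = i + 1, i        # s[j] is the next symbol; s[k] is the symbol it is compared with
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i          # s[i:j+1] is itself a Lyndon word: period becomes j + 1 - i
            else:
                k += 1         # s[i:j+1] still has the form (uv)^k u: keep the period
            j += 1
        # Here s[i:j] = (uv)^k u with |uv| = j - k; output every full copy of uv.
        period = j - k
        while i <= k:
            factors.append(s[i:i + period])
            i += period
    return factors


if __name__ == "__main__":
    print(duval("aabaabaaac"))  # ['aab', 'aab', 'aaac']
```

On T = aabaabaaac the first pass stops when 'a' is compared against 'b' after reading aabaabaa, outputs aab twice, and restarts on aaac, matching the trace on the slides.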
Variations of Duval’s algorithm • The first variation is the LF-skip algorithm. • The second variation is for strings compressed with run-length encoding.
LF-skip algorithm • The algorithm can skip a significant portion of the characters of the string if the string contains runs of the smallest character. • Let w be a word over an alphabet Σ with Lyndon factorization CFL(w) = w_1, w_2, …, w_m.
LF-skip algorithm • Let c be the smallest symbol in Σ. • Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of w_i. • If the last symbol of w is not c, then c^k is a prefix of each of w_i, w_{i+1}, …, w_m. • This property is used to devise an algorithm for Lyndon factorization that skips symbols.
LF-skip algorithm • Consider the alphabet {a, b, c, …} and assume that the last character of the text is not a. • Let w_i start with aaad. Then the length-4 prefix of w_{i+1} belongs to the set P = {aaaa, aaab, aaac, aaad}. • To skip characters, we search for occurrences of P with an algorithm that is sublinear on average (e.g. SBNDM). • Example:
aaadxxxxxxxxxxxaaac
    ---^---^--^^+++
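The construction of the pattern set P and the search for candidate factor starts can be sketched as below. This is an illustration only: it uses a plain linear scan in place of SBNDM, so it shows which positions are candidates but not the skipping behaviour itself. The function names and the explicit alphabet argument are assumptions made here. The sketch also assumes that the given factor prefix contains at least one symbol other than the smallest one.

```python
def candidate_patterns(factor_prefix: str, alphabet: str) -> list:
    """Build P = { c^r x : x <= y } from a factor prefix of the form c^r y,
    where c is the smallest symbol of the alphabet."""
    c = min(alphabet)
    r = len(factor_prefix) - len(factor_prefix.lstrip(c))  # length of the leading run of c
    y = factor_prefix[r]                                   # first symbol after the run
    return [c * r + x for x in sorted(alphabet) if x <= y]


def candidate_positions(text: str, patterns: list, start: int) -> list:
    """Positions >= start where some pattern occurs (naive scan, not SBNDM)."""
    m = len(patterns[0])
    pattern_set = set(patterns)
    return [i for i in range(start, len(text) - m + 1) if text[i:i + m] in pattern_set]


if __name__ == "__main__":
    text = "aaadxxxxxxxxxxxaaac"
    patterns = candidate_patterns("aaad", "abcdx")
    print(patterns)                                  # ['aaaa', 'aaab', 'aaac', 'aaad']
    print(candidate_positions(text, patterns, 1))    # start of the occurrence of aaac
```

Each reported position is only a candidate for the start of a later factor. A full algorithm still has to compare the text at that position with the current factor before emitting anything.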
Run Length Encoding • Run-length encoding (RLE) is a very simple form of data compression in which runs of symbols are stored as a single data value. • Given string: aaaaaabbbccaaabbbccbbbbbaaa • RLE: a6b3c2a3b3c2b5a3
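A run-length encoder for the example above can be sketched in a few lines of Python using itertools.groupby from the standard library. The names rle_encode and rle_to_text are illustrative, and the textual form (symbol followed by run length) simply mirrors the notation on the slide.

```python
from itertools import groupby


def rle_encode(s: str) -> list:
    """Return the runs of s as (symbol, length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def rle_to_text(runs: list) -> str:
    """Render runs in the slide's notation, e.g. [('a', 6), ('b', 3)] -> 'a6b3'."""
    return "".join(f"{symbol}{length}" for symbol, length in runs)


if __name__ == "__main__":
    runs = rle_encode("aaaaaabbbccaaabbbccbbbbbaaa")
    print(rle_to_text(runs))  # a6b3c2a3b3c2b5a3
```

The list of pairs has one entry per run, so its length is the quantity ρ used later in the complexity bound.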
Lyndon factorization of RLE string • The second variation is for strings compressed with run-length encoding. • It is preferable when strings are already stored in RLE.
Lyndon factorization of RLE string • The algorithm is based on Duval’s original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE. • A run of length t in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to t unit-length factors.
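The run property can be checked on concrete factorizations with a short Python sketch. The function name runs_respect_factors is hypothetical, and the factor lists are supplied by hand here rather than computed. The check walks over the runs of the concatenated text and tests the two alternatives stated above.

```python
from itertools import groupby


def runs_respect_factors(factors: list) -> bool:
    """True if every run of the text lies inside one factor
    or is split into unit-length factors only."""
    text = "".join(factors)
    # factor_of[p] = index of the factor that covers text position p
    factor_of = [idx for idx, factor in enumerate(factors) for _ in factor]
    pos = 0
    for _, group in groupby(text):
        t = len(list(group))                       # length of the current run
        covering = set(factor_of[pos:pos + t])     # factors touched by the run
        in_one_factor = len(covering) == 1
        unit_factors = all(len(factors[idx]) == 1 for idx in covering)
        if not (in_one_factor or unit_factors):
            return False
        pos += t
    return True


if __name__ == "__main__":
    print(runs_respect_factors(["aab", "aab", "aaac"]))      # runs stay inside factors
    print(runs_respect_factors(["c", "c", "b", "a", "a"]))   # runs become unit-length factors
    print(runs_respect_factors(["ab", "b"]))                 # not a Lyndon factorization
```

The last call splits abb as ab | b, which cuts the run bb across a factor of length 2. That is consistent with the property, since ab | b is not non-increasing and abb is itself a Lyndon word.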
Computing Lyndon factorization from RLE for T=aabaabaaac • For the string T = aabaabaaac, the parsed prefix P = T[1..i] is a prefix of a Lyndon word and is equal to (uv)^k u for some strings u, v and integer k. • The RLE algorithm works similarly, except that runs are read instead of single symbols.
Computing Lyndon factorization from RLE for T=aabaabaaac • For i = 3, P = aab. The next run to be read is 'aa', and aabaa is still a prefix of a Lyndon word. The next iteration then starts with P = aabaa. • For i = 6, P = aabaab. The next run to be read is 'aaa', and aabaabaaa is not a prefix of a Lyndon word. • The next iteration starts on the suffix aaac of T with P = aaa.
Complexity • Given a run-length encoded string R of length ρ, the algorithm computes the Lyndon factorization of R in O(ρ) time. • It is preferable to Duval’s algorithm when the strings are stored or maintained in run-length encoding.
Experimental results • The LF-skip algorithm was compared with Duval’s algorithm on various texts. • LF-skip gave a significant speed-up over Duval’s algorithm. • The following table shows the speed-ups for random texts of 5 MB with various alphabet sizes.
Conclusion • Two variations of Duval’s algorithm for computing the Lyndon factorization of a string are presented. • The first algorithm, LF-skip, skips a significant portion of the characters. • Experimental results show that it is considerably faster than Duval’s original algorithm. • The second algorithm is for strings compressed with run-length encoding; it computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time.