Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio Aalto University, Finland

Lyndon Word • Given two strings w and w′, w′ is a rotation of w if w=uv and w′=vu, for some strings u and v. • A string is a Lyndon word if it is lexicographically (alphabetically) smaller than all its proper rotations.

Lyndon Word • w=ab, w′=ba where u=a, v=b. • w is lexicographically smaller than its only proper rotation w′. • Therefore w is a Lyndon word.

Examples of Lyndon words • Lyndon words • a • ab • aabab • Non-Lyndon words • ba • abaac • abcaac
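
The definition can be checked directly by comparing a word with each of its proper rotations. The following is a minimal sketch, not taken from the slides; the function names `rotations` and `is_lyndon` are illustrative assumptions, and the quadratic-time check is meant only to make the definition concrete.

```python
def rotations(w):
    """Return all proper rotations vu of w = uv (both u and v non-empty)."""
    return [w[i:] + w[:i] for i in range(1, len(w))]


def is_lyndon(w):
    """A non-empty word is Lyndon if it is strictly smaller than every proper rotation."""
    return len(w) > 0 and all(w < r for r in rotations(w))


if __name__ == "__main__":
    # The six example words from the slide above.
    for word in ["a", "ab", "aabab", "ba", "abaac", "abcaac"]:
        print(word, is_lyndon(word))
```

Python compares strings lexicographically by code point, which matches alphabetical order for lowercase ASCII letters.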

Lyndon factorization • A word w can be factorized as w = w0 w1 w2 … wm-1 such that each factor is a Lyndon word. • Every string has a unique factorization into Lyndon words in which the sequence of factors is non-increasing with respect to lexicographical order. • The Lyndon factorization plays a role in a recent method for sorting the suffixes of a text.

Examples of Lyndon factorization • abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc • aacaacaacaad -> aacaacaacaad (a single factor) • abacabab -> abac ab ab

Duval’s algorithm • To factorize a word w, the algorithm computes the longest prefix w1 of w = w1w′ that is a Lyndon word and then restarts the process from w′. • Non-empty prefixes of Lyndon words are all of the form (uv)^k u, where uv is a Lyndon word, u may be empty, and k ≥ 1. • Duval’s algorithm computes the factorization with a single left-to-right parse.
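
The left-to-right parse can be sketched as follows. This is the commonly used formulation of Duval's algorithm, written here as an illustration rather than as the authors' code; the function name `duval` and the index names `i`, `j`, `k` are assumptions. The gap `j - k` is the length of the Lyndon word uv, so the parsed prefix `s[i:j]` always has the form (uv)^k u.

```python
def duval(s):
    """Return the Lyndon factorization of s as a list of factors."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        # s[i:j] is the parsed prefix; s[k] is the symbol that s[j] is compared with.
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # the whole parsed prefix becomes one longer Lyndon word
            else:
                k += 1       # the current period continues
            j += 1
        # The prefix is (uv)^k u with |uv| = j - k: output every full copy of uv.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


if __name__ == "__main__":
    for text in ["aabaabaaac", "abacabab"]:
        print(text, "->", " ".join(duval(text)))
```

The inner loop stops either at the end of the input or at the first symbol that is smaller than the one it is compared with, which is exactly the point where the parsed prefix can no longer be extended to a prefix of a Lyndon word.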

Computing Lyndon factorization for T=aabaabaaac • For the string T=aabaabaaac, the parsed prefix P=T[1…i] is a prefix of a Lyndon word and therefore equals (uv)^k u for some strings u, v and integer k. • There are then two cases, depending on the next symbol to be read: either P extended by that symbol is still a prefix of a Lyndon word, or it is not.

Computing Lyndon factorization for T=aabaabaaac • For i = 3, P = aab, with u = empty string, v = aab and k = 1. • The next symbol to read is 'a' and aaba is still a prefix of a Lyndon word. The next iteration then starts with P = aaba.

Computing Lyndon factorization for T=aabaabaaac • For i = 6, P = aabaab, which is (uv)^k u with u = empty string, v = aab and k = 2. • The next symbols to read are 'aaa'; after reading them, aabaabaaa is found not to be a prefix of a Lyndon word. • The output is aab twice, and the next iteration starts on the suffix aaac of T with P = a.

Variations of Duval’s algorithm • The first variation is the LF-skip algorithm, which can skip part of the input. • The second variation is for strings compressed with run-length encoding.

LF-skip algorithm • The algorithm is able to skip a significant portion of the characters of the string if the string contains runs of the smallest character. • Let w be a word over an alphabet Σ with Lyndon factorization CFL(w) = w1, w2, ..., wm.

LF-skip algorithm • Let c be the smallest symbol in Σ. • Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of wi. • If the last symbol of w is not c, then c^k is also a prefix of each of wi+1, ..., wm. • This property is used to devise an algorithm for Lyndon factorization that skips symbols.

LF-skip algorithm • Consider the alphabet {a,b,c,…} and assume that the last character of the text is not a. • Let wi start with aaad. Then the length-4 prefix of wi+1 belongs to the set P = {aaaa,aaab,aaac,aaad}. • To skip characters, we search for occurrences of P with an algorithm that is sublinear on average (e.g. SBNDM). • aaadxxxxxxxxxxxaaac ---^---^--^^+++
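
The candidate set P and the search for its next occurrence can be sketched as below. This only illustrates the idea on the slide: the helper names `candidate_prefixes` and `next_candidate` are hypothetical, and Python's `re` module stands in for SBNDM, so the scan here is not sublinear. The full LF-skip algorithm is not reproduced.

```python
import re


def candidate_prefixes(prefix, alphabet):
    """Given the prefix c^k x of w_i, list every string c^k y with y <= x."""
    run, last = prefix[:-1], prefix[-1]
    return [run + y for y in sorted(alphabet) if y <= last]


def next_candidate(text, patterns, start):
    """Return the leftmost position >= start where some pattern occurs, or -1."""
    regex = re.compile("|".join(re.escape(p) for p in patterns))
    match = regex.search(text, start)
    return match.start() if match else -1


if __name__ == "__main__":
    # The alphabet is assumed to be {a, b, c, d, x} for this example text.
    patterns = candidate_prefixes("aaad", "abcdx")
    print(patterns)
    print(next_candidate("aaadxxxxxxxxxxxaaac", patterns, 1))
```

In the example text, every position between the first factor prefix aaad and the occurrence of aaac can be ruled out as the start of a later factor, which is what allows a skipping pattern matcher to jump over them.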

Run Length Encoding • Run-length encoding (RLE) is a very simple form of data compression in which each run of identical symbols is stored as a single symbol together with the length of the run. • Given string: aaaaaabbbccaaabbbccbbbbbaaa • RLE: a6b3c2a3b3c2b5a3
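
A minimal sketch of the encoding, assuming runs are represented as (symbol, length) pairs; the function names `rle_encode`, `rle_decode` and `rle_to_text` are illustrative and not part of the presentation.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs):
    """Expand a list of (symbol, run length) pairs back into the original string."""
    return "".join(ch * length for ch, length in runs)


def rle_to_text(runs):
    """Render runs in the compact notation used on the slide, e.g. a6b3c2."""
    return "".join(f"{ch}{length}" for ch, length in runs)


if __name__ == "__main__":
    text = "aaaaaabbbccaaabbbccbbbbbaaa"
    runs = rle_encode(text)
    print(rle_to_text(runs))
    print(rle_decode(runs) == text)
```

The length ρ of the encoding is the number of runs (8 for the slide's example), which can be much smaller than the length of the original string (27 here).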

Lyndon factorization of RLE string • The second variation is for strings compressed with run-length encoding. • It targets strings that are stored in RLE form.

Lyndon factorization of RLE string • The algorithm is based on Duval’s original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE. • A run of length t in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to t unit-length factors.
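
The run property can be observed on small inputs with the sketch below. It is not the RLE-based algorithm itself: it factorizes the uncompressed string with a compact, standard formulation of Duval's algorithm and then inspects each run. The names `duval` and `runs_respect_factors` are assumptions made for this illustration.

```python
from itertools import groupby


def duval(s):
    """Standard formulation of Duval's algorithm; returns the list of Lyndon factors."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def runs_respect_factors(s):
    """Check that each run lies inside one factor or splits into unit-length factors."""
    bounds, pos = set(), 0
    for factor in duval(s):
        bounds.add((pos, pos + len(factor)))   # half-open interval of this factor
        pos += len(factor)
    pos = 0
    for _, group in groupby(s):
        start, end = pos, pos + len(list(group))
        inside_one = any(b <= start and end <= e for b, e in bounds)
        all_units = all((p, p + 1) in bounds for p in range(start, end))
        if not (inside_one or all_units):
            return False
        pos = end
    return True


if __name__ == "__main__":
    for text in ["aabaabaaac", "bbba", "ccab"]:
        print(text, duval(text), runs_respect_factors(text))
```

For bbba, for instance, the run bbb corresponds to three unit-length factors b, b, b, while in aabaabaaac every run lies inside a single factor.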

Computing Lyndon factorization from RLE for T=aabaabaaac • For the string T=aabaabaaac, the parsed prefix P=T[1…i] is a prefix of a Lyndon word and therefore equals (uv)^k u for some strings u, v and integer k. • The RLE algorithm works similarly, except that it reads whole runs instead of single symbols.

Computing Lyndon factorization from RLE for T=aabaabaaac • For i = 3, P = aab. The next run to be read is 'aa' and aabaa is still a prefix of a Lyndon word. The next iteration then starts with P = aabaa. • For i = 6, P = aabaab. The next run to be read is 'aaa' and aabaabaaa is not a prefix of a Lyndon word. • The next iteration starts on the suffix aaac of T with P = aaa.

Complexity • Given a run-length encoded string R of length ρ, the algorithm computes the Lyndon factorization of R in O(ρ) time. • It is preferable to Duval’s algorithm when the strings are stored or maintained in run-length encoding.

Experimental results • The LF-skip algorithm and Duval’s algorithm were compared on various texts. • LF-skip gave a significant speed-up over Duval’s algorithm. • The following table shows the speed-ups for random texts of 5 MB with various alphabet sizes.

Speed-up of LF-skip

Conclusion • Two variations of Duval’s algorithm for computing the Lyndon factorization of a string are presented. • The first algorithm is designed to skip a significant portion of the characters. • Experimental results show that this algorithm is considerably faster than Duval’s original algorithm. • The second algorithm is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time.

THANK YOU
