Efficient LZ78 factorization of grammar compressed text. Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda. Kyushu University, Japan. Outline: Background; LZ78 Factorization; Straight Line Programs (SLP); Algorithms; LZ78 factorization using suffix trees; SLP to LZ78; Improvements.
SPIRE 2012 @ Cartagena, Colombia
Efficient LZ78 factorization of grammar compressed text
Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Kyushu University, Japan

Outline
• Background
• LZ78 Factorization
• Straight Line Programs (SLP)
• Algorithms
• LZ78 factorization using suffix trees
• SLP to LZ78
• Improvements

Background
Compressed String Processing (CSP)
• Compress a string for storage, but don't decompress all of it when using it!
• Can be faster than processing the uncompressed text, by exploiting regularities identified by compression.
• Regard compression as a generic preprocessing step!
Figure: a big string is stored as a compressed representation, which is processed directly for pattern matching, edit distance, pattern mining, etc.
This work: LZ78 factorization of grammar compressed strings.

LZ78 Factorization [Ziv & Lempel '78]
The LZ78 factorization of a string S is a factorization S = f_1 f_2 ... f_m, where f_i is the longest prefix of f_i ... f_m such that f_i = f_j c for some 0 ≤ j < i and some character c (let f_0 = ε).
Example: S = alabaralalabarda$
• f_1 = a = (0,a)
• f_2 = l = (0,l)
• f_3 = ab = (1,b)
• f_4 = ar = (1,r)
• f_5 = al = (1,l)
• f_6 = ala = (5,a)
• f_7 = b = (0,b)
• f_8 = ard = (4,d)
• f_9 = a$ = (1,$)
Figure: LZ78 trie of S, with root 0 and one node per factor (nodes 1 to 9).
The LZ78 trie is built in O(N log σ) time and O(m) space.
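The definition can be tried out with a short sketch. This is a plain trie-based factorizer on an uncompressed string, not the SLP-based algorithm of this talk; the function name `lz78_factorize` and the dictionary-based trie are illustrative assumptions, and the hash-based trie gives expected rather than worst-case bounds.

```python
def lz78_factorize(s):
    """Return the LZ78 factorization of s as (j, c) pairs, meaning f_i = f_j + c."""
    trie = {}        # (parent node id, character) -> child node id
    factors = []
    node = 0         # current trie node; 0 is the root (f_0 = empty string)
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            node = child                     # keep extending the current match
        else:
            factors.append((node, ch))       # new factor f_i = f_node + ch
            trie[(node, ch)] = len(factors)  # the new node id equals the factor index i
            node = 0                         # restart from the root
    if node:
        # The text ended inside a match: report the incomplete last factor
        # with an empty character (a unique terminator such as '$' avoids this).
        factors.append((node, ""))
    return factors


if __name__ == "__main__":
    for j, c in lz78_factorize("alabaralalabarda$"):
        print(j, c)
```

Node ids coincide with factor indices, so the returned pairs can be read directly as edges of the LZ78 trie.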

Straight Line Programs
Straight Line Program (SLP)
• A CFG in Chomsky normal form that derives a single string.
• Can efficiently model the outputs of many compression algorithms: REPAIR, SEQUITUR, LZ78, etc.
Example SLP, n = 7:
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
Figure: derivation tree of X7, whose leaves spell S = aabaababaabab.
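A minimal sketch of how such an SLP might be represented, assuming a dictionary that maps each variable index to either a single character or a pair of smaller indices. The names `SLP`, `lengths` and `expand` are illustrative, not taken from the talk.

```python
# Example SLP from the slide: a terminal rule is a character,
# a binary rule is a pair (left index, right index).
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """Compute |X_i| for every variable bottom-up, without expanding any string."""
    size = {}
    for i in sorted(slp):  # assumes each rule refers only to smaller indices
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def expand(slp, i):
    """Fully decompress X_i. The output can be exponentially long in n; for checking only."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


if __name__ == "__main__":
    print(expand(SLP, 7), lengths(SLP)[7])
```

Computing lengths bottom-up is what lets later steps reason about positions in S without decompressing it.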

Problem: SLP to LZ78
Input: SLP
X1 = a, X2 = b, X3 = X1 X2, X4 = X1 X3, X5 = X4 X3, X6 = X4 X5, X7 = X6 X5
Output: LZ78 factorization (trie)
Figure: LZ78 trie of the derived string, with root 0 and nodes 1 to 6.
Why "re-compress" a compressed representation?
• Convert the representation: some CSP algorithms require a specific compression.
• Re-compress an SLP modified by ad-hoc edits: dynamic compressed texts.
• Compute the Normalized Compression Distance [Li et al. 2004]: clustering & classification without decompression, using C_LZ78(x), C_LZ78(y), C_LZ78(xy) obtained from SLPs of x and y.
Computer scientists make sleeping files walk in their sleep!
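The slide does not spell out the distance itself. In its commonly stated form, NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}. The sketch below assumes C is the number of LZ78 factors and works on plain strings for illustration only; it does not perform the SLP-based computation discussed here, and the names `lz78_size` and `ncd` are hypothetical.

```python
def lz78_size(s):
    """Number of LZ78 factors of s (an incomplete last factor counts as one)."""
    trie, node, count = {}, 0, 0
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is None:
            count += 1
            trie[(node, ch)] = count  # new trie node for the new factor
            node = 0
        else:
            node = nxt
    return count + (1 if node else 0)


def ncd(x, y):
    """Normalized compression distance with C = number of LZ78 factors (assumption)."""
    cx, cy, cxy = lz78_size(x), lz78_size(y), lz78_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    print(ncd("abab", "abab"), ncd("abab", "cdcd"))
```

Smaller values suggest that x helps to compress y; with such a crude size measure and short inputs the numbers are only indicative.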

Our Results
Algorithms to compute LZ78 from SLP
Notation:
• N : length of uncompressed string S
• σ : alphabet size
• n : size of SLP representing S
• L : length of longest LZ78 factor
• m : # of LZ78 factors (O(N/log N) for constant σ)
• N_α = N - α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP

LZ78 Factorization using a Suffix Tree

Suffix Tree & LZ78
The LZ78 trie can be superimposed on the suffix tree.
S = aabaababaabab (positions 1 to 13)
Figure: the LZ78 trie of S (nodes 0 to 6) shown next to the suffix tree of S, with the trie nodes marked on the suffix tree.

LZ78 Factorization on Suffix Tree
S = aabaababaabab (positions 1 to 13), current position i
• Build the LZ78 trie on top of the suffix tree ST. Nodes corresponding to the LZ78 trie are marked.
• The next factor is a prefix of S[i:N]. Find the node in ST corresponding to S[i:N].
• Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook '92].
• Make a new node of the LZ78 trie on ST: O(1) time by a level ancestor (la) query on ST [Berkman & Vishkin '94].
• Compute the next position: i ← i + |f_i|.
• LZ78 factorization in O(m) time, given a suffix tree preprocessed for nma & la queries.
Figure: suffix tree of S with the marked LZ78 trie nodes 0 to 6.

SLP to LZ78

Our algorithm: SLP to LZ78
Key Observation
For any string of length N, the length of any LZ78 factor f_i satisfies:
|f_i| ≤ c_N = (2N + 1/4)^(1/2) - 1/2 = O(N^(1/2))
Main Idea
• We only need a suffix tree that contains all distinct substrings of S with length at most c_N.
• Build a GST from a set of substrings of S that contain all distinct length-c_N substrings of S.
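One way to see where the bound comes from (a sketch of the standard counting argument, not taken from the slide): a factor of length k has the form f_j c, so its prefixes of lengths 1, ..., k - 1 must all have appeared as earlier factors, and together with the factor itself they occupy disjoint parts of S.

```latex
\[
  \sum_{\ell=1}^{k} \ell \;=\; \frac{k(k+1)}{2} \;\le\; N
  \quad\Longrightarrow\quad
  k^{2} + k - 2N \le 0
  \quad\Longrightarrow\quad
  k \;\le\; \sqrt{2N + \tfrac{1}{4}} - \tfrac{1}{2} \;=\; c_N \;=\; O\!\left(N^{1/2}\right).
\]
```

For N = 13 this gives a value of about 4.62, so no factor is longer than 4 characters.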

Important Concept: Stabbing
X_i stabs an interval [u:v] of S when it is the shortest variable in the derivation tree whose derived occurrence covers the interval (any interval is stabbed by a unique variable).
e.g.: aaba at [9:12] is stabbed by X5.
SLP: X1 = a, X2 = b, X3 = X1 X2, X4 = X1 X3, X5 = X4 X3, X6 = X4 X5, X7 = X6 X5
Figure: derivation tree of X7 over S = aabaababaabab (positions 1 to 13).
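A small sketch of how the stabbing variable of an interval could be located by walking down the derivation tree. The dictionary representation of the SLP and the names `lengths` and `stabbing_variable` are assumptions made for illustration.

```python
# Example SLP: a terminal rule is a character, a binary rule is a pair of indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """|X_i| for every variable, computed bottom-up (rules refer to smaller indices)."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def stabbing_variable(slp, root, u, v):
    """Index of the variable that stabs S[u:v] (1-based, inclusive positions)."""
    size = lengths(slp)
    i, start = root, 1              # this occurrence of X_i derives S[start : start + |X_i| - 1]
    while not isinstance(slp[i], str):
        left, right = slp[i]
        mid = start + size[left]    # first position derived by the right child
        if v < mid:                 # interval lies entirely inside the left child
            i = left
        elif u >= mid:              # interval lies entirely inside the right child
            i, start = right, mid
        else:                       # interval crosses the boundary, so X_i stabs it
            return i
    return i                        # a single position is stabbed by its terminal variable


if __name__ == "__main__":
    print(stabbing_variable(SLP, 7, 9, 12))
```

The walk visits one node per level, so it takes time proportional to the height of the derivation tree.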

Substrings stabbed by X_i
All length-q substrings stabbed by X_i = X_l(i) X_r(i) are contained in a string t_i(q) of length at most 2(q - 1): the last q - 1 characters of X_l(i) followed by the first q - 1 characters of X_r(i).
Any length-q substring of S is stabbed by some unique variable X_i, and is therefore a substring of some t_i(q).
• { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n } contains all distinct length-c_N substrings of S.

LZ78 Factorization from SLP
Algorithm:
• Compute { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }.
• Build a generalized suffix tree (GST) for the strings { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }.
• Run the LZ78 factorization algorithm using the GST.
O(n c_N) time/space

Example
• N = 13, c_N = 4, n = 7
• { t5(4), t6(4), t7(4) } = { aabab, aabaab, babaab }
Figure: derivation tree of X7 over S = aabaababaabab (positions 1 to 13).
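The strings t_i(q) in this example can be reproduced without decompressing S, by keeping only the first and last q - 1 characters of every variable. The sketch below assumes q ≥ 2 and a dictionary representation of the SLP; the name `stab_strings` is illustrative.

```python
# Example SLP: a terminal rule is a character, a binary rule is a pair of indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def stab_strings(slp, q):
    """Return {i: t_i(q)} for every binary variable X_i with |X_i| >= q (assumes q >= 2)."""
    k = q - 1
    size, pre, suf, t = {}, {}, {}, {}
    for i in sorted(slp):                      # rules refer only to smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            size[i], pre[i], suf[i] = 1, rule, rule
        else:
            left, right = rule
            size[i] = size[left] + size[right]
            pre[i] = (pre[left] + pre[right])[:k]    # first min(k, |X_i|) characters
            suf[i] = (suf[left] + suf[right])[-k:]   # last min(k, |X_i|) characters
            if size[i] >= q:
                # suffix of the left child followed by prefix of the right child
                t[i] = suf[left] + pre[right]
    return t


if __name__ == "__main__":
    print(stab_strings(SLP, 4))
```

Each variable is handled with strings of length at most 2(q - 1), so the whole pass takes O(nq) time and space.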

GST & LZ78 Factors
The LZ78 trie superimposed on the GST of { t5(4), t6(4), t7(4) }.
S = aabaababaabab (positions 1 to 13)
Figure: the LZ78 trie of S (nodes 0 to 6) shown next to the GST of { t5(4), t6(4), t7(4) }, whose positions are numbered 1 to 17 across t5(4), t6(4), t7(4).

LZ78 Factorization on GST
S = aabaababaabab (positions 1 to 13), current position i, c_N = 4
• The next factor is a prefix of S[i:N]. Find the node in the GST corresponding to S[i:N].
• Find the longest prefix of S[i:N] in the LZ78 trie.
  • O(log N) time w/ random access on SLP [Bille et al. 2011]
• Make a new node for the LZ78 trie on the GST.
  • O(1) time w/ dynamic nma queries
• Compute the next position: i ← i + |f_i|.
  • O(1) time w/ dynamic nma queries
• LZ78 factorization can be computed in O(m log N) time, given a GST preprocessed for nma & la queries, and an SLP preprocessed for random access queries.
Figure (three animation steps): the GST of { t5(4), t6(4), t7(4) } with LZ78 trie nodes 0 to 3 marked one at a time, next to the derivation tree of X7.

Summary of Basic Algorithm
Running time: O(n c_N + m log N), where c_N = O(N^(1/2)).
Extreme cases:
• If the string is compressible, n = O(log N), m = O(N^(1/2)), so O(n c_N + m log N) = O(N^(1/2) log N) = o(N).
• If the string is not compressible, n, m = O(N) and O(n c_N + m log N) = O(N^1.5).
Can we do better than just reverting to decompress & process?
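Spelled out term by term, using only the bounds stated on this slide, the two extreme cases work out as follows.

```latex
\begin{align*}
\text{compressible: }
  & n = O(\log N),\quad m = O(N^{1/2}),\quad c_N = O(N^{1/2}) \\
  & n c_N + m \log N
    = O(N^{1/2}\log N) + O(N^{1/2}\log N)
    = O(N^{1/2}\log N) = o(N) \\[4pt]
\text{not compressible: }
  & n, m = O(N),\quad c_N = O(N^{1/2}) \\
  & n c_N + m \log N
    = O(N \cdot N^{1/2}) + O(N \log N)
    = O(N^{1.5})
\end{align*}
```

In the second case the n c_N term dominates, which is the term the two improvements that follow aim to reduce.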

(1) Improving the n c_N term to nL ≤ n c_N
Let L denote the length of the longest LZ78 factor of S.
• We built the GST for distinct substrings of length at most c_N, but actually we only need substrings of length at most L.
• However, L is not known beforehand.
• Doubling technique:
  • Assume L = 2 and run the algorithm.
  • If the LZ78 trie expands beyond the GST, set L ← 2 × L, rebuild the GST and the LZ78 trie, and continue.
• Total time complexity for rebuilds: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L)
Before: O(n c_N + m log N) time, O(n c_N + m) space
After: O(nL + m log N) time, O(nL + m) space
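The rebuild cost is a geometric sum. A short derivation, assuming the guess is doubled until it first reaches L, so that the last guess is less than 2L:

```latex
\[
  \sum_{i=1}^{\lceil \log_2 L \rceil} O\!\left(n 2^{i} + m\right)
  \;=\; O\!\Big(n \sum_{i=1}^{\lceil \log_2 L \rceil} 2^{i}\Big) + O(m \log L)
  \;=\; O\!\left(n \cdot 2^{\lceil \log_2 L \rceil + 1}\right) + O(m \log L)
  \;=\; O(nL + m \log L),
\]
since $2^{\lceil \log_2 L \rceil} < 2L$.
```

The geometric sum is dominated by its last term, so rebuilding from scratch at every doubling costs only a constant factor more than building once with the correct L.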

(2) Improving the n c_N term to N_α ≤ N
We can replace the GST with the suffix tree of a trie for q = c_N.
Lemma [Goto et al. CPM 2012]
Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size N_α = N - α(q) ≤ N, where
α(q) = Σ_{i : |X_i| ≥ q} (vOcc(X_i) - 1)(|t_i(q)| - (q - 1)) ≥ 0
vOcc(X_i) : # of times X_i occurs in the derivation tree
The trie can be computed in time linear in its size.
Lemma [Shibuya 2003]
The suffix tree of a reverse trie can be constructed in linear time.
N_α = O(n c_N)
Before: O(n c_N + m log N) time, O(n c_N + m) space
After: O(N_α + m log N) time, O(N_α + m) space
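The quantity α(q) can be evaluated directly from the SLP. The sketch below assumes q ≥ 2 and a dictionary representation of the SLP, and uses |t_i(q)| = min(q - 1, |X_l(i)|) + min(q - 1, |X_r(i)|); the names `v_occ` and `alpha` are illustrative.

```python
# Example SLP: a terminal rule is a character, a binary rule is a pair of indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def v_occ(slp, root):
    """Number of occurrences of every variable in the derivation tree of X_root."""
    occ = {i: 0 for i in slp}
    occ[root] = 1
    for i in sorted(slp, reverse=True):   # parents have larger indices than children
        rule = slp[i]
        if not isinstance(rule, str):
            for child in rule:
                occ[child] += occ[i]
    return occ


def alpha(slp, root, q):
    """alpha(q) = sum over |X_i| >= q of (vOcc(X_i) - 1) * (|t_i(q)| - (q - 1)); assumes q >= 2."""
    occ, size, total = v_occ(slp, root), {}, 0
    for i in sorted(slp):
        rule = slp[i]
        if isinstance(rule, str):
            size[i] = 1
            continue
        left, right = rule
        size[i] = size[left] + size[right]
        if size[i] >= q:
            t_len = min(q - 1, size[left]) + min(q - 1, size[right])
            total += (occ[i] - 1) * (t_len - (q - 1))
    return total


if __name__ == "__main__":
    n_total = 13                           # length of S for the example SLP
    print(alpha(SLP, 7, 4), n_total - alpha(SLP, 7, 4))
```

Only variables that occur more than once in the derivation tree contribute, which matches the reading of α as redundancy captured by the SLP.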

Example: Trie of size N_α for q = 4
Figure: derivation tree of X7 over S = aabaababaabab, and the trie that aggregates the strings t_i(q).
• Σ|t_i(q)| : 17
• Text size: 13
• Trie size: 11
We can aggregate all t_i(q) into a trie of size at most the text size.

Summary
• Showed an algorithm for SLP → LZ78 factorization
  • at least as fast as naïve decompress & process
  • better when the string is compressible
Notation:
• N : length of uncompressed string S
• σ : alphabet size
• n : size of SLP representing S
• L : length of longest LZ78 factor
• N_α = N - α(c_N) ≤ N
• m : # of LZ78 factors (O(N/log N) for constant σ)