
Efficient LZ78 factorization of grammar compressed text

Efficient LZ78 factorization of grammar compressed text. Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda (Kyushu University, Japan). Outline: Background; LZ78 Factorization; Straight Line Programs (SLP); Algorithms; LZ78 factorization using suffix trees; SLP to LZ78; Improvements.


Presentation Transcript


  1. SPIRE 2012 @ Cartagena, Colombia
     Efficient LZ78 factorization of grammar compressed text
     Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
     Kyushu University, Japan

  2. Outline
     • Background
     • LZ78 Factorization
     • Straight Line Programs (SLP)
     • Algorithms
     • LZ78 factorization using suffix trees
     • SLP to LZ78
     • Improvements

  3. Background
     Compressed String Processing (CSP)
     • Compress a string for storage, but don't decompress all of it when using it!
     • Can be faster than processing the uncompressed text, by exploiting regularities identified by compression.
     • Regard compression as generic preprocessing!
     [Figure: a BIG string is turned into a compressed representation, which is processed directly for pattern matching, edit distance, pattern mining, etc.]
     This work: LZ78 factorization of grammar compressed strings

  4. LZ78 Factorization [Ziv & Lempel '78]
     The LZ78 factorization of a string S is a factorization S = f1 f2 ... fm, where fi is the longest prefix of fi ... fm such that fi = fj c for some 0 ≤ j < i and some character c (let f0 = ε).
     Example: S = alabaralalabarda$
       f1 = (0,a), f2 = (0,l), f3 = (1,b), f4 = (1,r), f5 = (1,l), f6 = (5,a), f7 = (0,b), f8 = (4,d), f9 = (1,$)
     [Figure: LZ78 trie of S, rooted at node 0, with one node per factor, numbered 1 to 9]
     The LZ78 trie can be built in O(N log σ) time and O(m) space.
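The definition above can be checked with a minimal sketch of the standard trie-based procedure. The function name `lz78_factorize` and the dictionary-based trie are illustrative choices, not taken from the slides; a hash map gives expected O(1) time per character rather than the O(log σ) of a balanced child structure.

```python
def lz78_factorize(s):
    """Return the LZ78 factors of s as (j, c) pairs, meaning f_i = f_j + c."""
    trie = {}        # (parent node id, character) -> child node id
    factors = []
    node = 0         # node 0 is the root, i.e. f_0 = empty string
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current match
        else:
            factors.append((node, ch))        # longest match plus one new character
            trie[(node, ch)] = len(factors)   # the new node is numbered like its factor
            node = 0
    if node != 0:
        # The input ended inside a match; this cannot happen when s ends
        # with a unique terminator such as '$'.
        factors.append((node, ""))
    return factors


if __name__ == "__main__":
    for i, (j, c) in enumerate(lz78_factorize("alabaralalabarda$"), start=1):
        print(f"f{i} = ({j},{c})")
```

Running it on the slide's example string lists the nine factors, including f8 = (4,d), since "ard" extends f4 = "ar".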

  5. Straight Line Programs
     A Straight Line Program (SLP) is:
     • a CFG in Chomsky normal form that derives a single string;
     • able to efficiently model the outputs of many compression algorithms: Re-Pair, SEQUITUR, LZ78, etc.
     Example SLP, n = 7:
       X1 = a
       X2 = b
       X3 = X1 X2
       X4 = X1 X3
       X5 = X4 X3
       X6 = X4 X5
       X7 = X6 X5
     [Figure: derivation tree of X7, whose leaves spell S = aabaababaabab]
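As a small illustration, the example SLP can be written down and expanded as follows. The dictionary encoding and the helper names `lengths` and `expand` are assumptions made for this sketch; it also assumes every rule refers only to smaller indices, as in the example.

```python
# Example SLP from the slide: terminals are strings, other rules are (left, right) pairs.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """Compute |X_i| for every variable without expanding any string."""
    size = {}
    for i in sorted(slp):            # children always have smaller indices
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def expand(slp, i):
    """Fully decompress X_i (for checking only: O(|X_i|) time and space)."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


if __name__ == "__main__":
    print(expand(SLP, 7), lengths(SLP)[7])
```

Note that `lengths` touches each rule once, so it runs in O(n) time even when the derived string is exponentially longer than the SLP.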

  6. Problem: SLP to LZ78
     Input: SLP
       X1 = a, X2 = b, X3 = X1 X2, X4 = X1 X3, X5 = X4 X3, X6 = X4 X5, X7 = X6 X5
     Output: LZ78 factorization (trie)
     [Figure: LZ78 trie of S = aabaababaabab, with nodes 0 to 6]
     Why "re-compress" a compressed representation?
     • Convert the representation: some CSP algorithms require a specific compression scheme.
     • Re-compress an SLP modified by ad-hoc edits: dynamic compressed texts.
     • Compute the Normalized Compression Distance [Li et al. 2004]: clustering & classification without decompression, using C_LZ78(x), C_LZ78(y), C_LZ78(xy) computed from SLPs of x and y.
     Computer Scientists Make Sleeping Files Walk in their Sleep!
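For reference, the Normalized Compression Distance is commonly defined as NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)). The sketch below assumes that C_LZ78 is measured as the number of LZ78 factors, which the slide does not state, and it works on plain strings, whereas the talk's point is to obtain the same counts directly from SLPs. The function names are hypothetical.

```python
def lz78_factor_count(s):
    """Number of LZ78 factors of s, counting a trailing partial factor if any."""
    trie, node, count = {}, 0, 0
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is None:
            count += 1                   # a new factor ends at this character
            trie[(node, ch)] = count
            node = 0
        else:
            node = nxt
    return count + (1 if node != 0 else 0)


def ncd(x, y, c=lz78_factor_count):
    """Normalized Compression Distance, with compressed size measured by c.

    Assumes x and y are non-empty, so the denominator is positive.
    """
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    print(ncd("aabaababaabab", "aabaababaabab"))
    print(ncd("aabaababaabab", "alabaralalabarda$"))
```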

  7. Our Results
     Algorithms to compute LZ78 from SLP
     N : length of uncompressed string S
     σ : alphabet size
     n : size of SLP representing S
     L : length of longest LZ78 factor
     Nα = N – α ≤ N
     m : # of LZ78 factors (O(N / log N) for constant σ)
     • α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP

  8. LZ78 Factorization using a Suffix Tree

  9. Suffix Tree & LZ78
     The LZ78 trie can be superimposed on the suffix tree.
     S = aabaababaabab (positions 1 to 13)
     [Figure: LZ78 trie of S (nodes 0 to 6) beside the suffix tree of S, with the trie nodes marked on the suffix tree]

  10. LZ78 Factorization on Suffix Tree
      S = aabaababaabab (positions 1 to 13); i is the start position of the next factor.
      • Build the LZ78 trie on top of the suffix tree ST. Nodes corresponding to the LZ78 trie are marked.
      • The next factor is a prefix of S[i:N]. Find the node in ST corresponding to S[i:N].
      • Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook '92].
      • Make a new node of the LZ78 trie on ST: O(1) time by level ancestor (la) queries on ST [Berkman & Vishkin '94].
      • Compute the next position i ← i + |fi|.
      • LZ78 factorization in O(m) time, given a suffix tree preprocessed for nma & la queries.
      [Figure: suffix tree of S with the LZ78 trie nodes 0 to 6 marked]

  11. SLP to LZ78

  12. Our algorithm: SLP to LZ78
      Key Observation
      For any string of length N, the length of any LZ78 factor fi satisfies |fi| ≤ cN = (2N + 1/4)^(1/2) – 1/2 = O(N^(1/2)).
      Main Idea
      • We only need a suffix tree that contains all distinct substrings of S with length at most cN.
      → Build a generalized suffix tree (GST) from a set of substrings of S that together contain all distinct length-cN substrings of S.
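The slide states the bound without derivation; the following is a sketch of the usual counting argument. If |fi| = k, then fi = fj c with |fj| = k − 1, and fj in turn extends an earlier factor of length k − 2, and so on, so factors of every length 1, 2, ..., k occupy disjoint positions of S:

```latex
\underbrace{1 + 2 + \cdots + k}_{\text{lengths of } f_i \text{ and its trie ancestors}}
  = \frac{k(k+1)}{2} \le N
\;\Longrightarrow\; k^2 + k - 2N \le 0
\;\Longrightarrow\; k \le \frac{-1 + \sqrt{1 + 8N}}{2}
  = \sqrt{2N + \tfrac14} - \tfrac12 = c_N .
```

For N = 13 this gives √26.25 − 0.5 ≈ 4.62, so no factor is longer than 4, which agrees with the value cN = 4 used in the example on slide 16.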

  13. Important Concept: Stabbing
      Xi stabs an interval [u:v] of S when it is the shortest variable in the derivation tree whose derived substring contains the interval (any interval is stabbed by a unique variable).
      e.g.: aaba at [9:12] is stabbed by X5
      X1 = a, X2 = b, X3 = X1 X2, X4 = X1 X3, X5 = X4 X3, X6 = X4 X5, X7 = X6 X5
      [Figure: derivation tree of X7 over S = aabaababaabab (positions 1 to 13), with the interval [9:12] under X5]
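One way to find the stabbing variable is to walk down from the root while the interval fits entirely inside one child. This is a minimal sketch under assumptions: the dictionary encoding of the SLP and the name `stabbing_variable` are illustrative, positions are 1-based and inclusive as on the slide, and 1 ≤ u ≤ v ≤ N is assumed.

```python
# Example SLP from the slides: terminals are strings, other rules are (left, right) pairs.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """Compute |X_i| for every variable (rules refer only to smaller indices)."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def stabbing_variable(slp, root, u, v):
    """Return (i, start): X_i stabs [u:v], and this occurrence of X_i begins at start."""
    size = lengths(slp)
    i, start = root, 1
    while not isinstance(slp[i], str):
        left, right = slp[i]
        mid = start + size[left]       # first position covered by the right child
        if v < mid:
            i = left                   # interval lies entirely in the left child
        elif u >= mid:
            i, start = right, mid      # interval lies entirely in the right child
        else:
            break                      # interval crosses the boundary: X_i stabs it
    return i, start


if __name__ == "__main__":
    print(stabbing_variable(SLP, 7, 9, 12))
```

The walk takes time proportional to the height of the derivation tree; the random-access structure cited later in the slides [Bille et al. 2011] is what brings this kind of query down to O(log N).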

  14. Substrings stabbed by Xi
      • All length-q substrings stabbed by Xi = Xl(i) Xr(i) are contained in a string ti(q) of length at most 2(q – 1): the last q – 1 characters of Xl(i) followed by the first q – 1 characters of Xr(i).
      • Any length-q substring of S is stabbed by some unique variable Xi, and therefore is a substring of some ti(q).
      • { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n } will contain all distinct length-cN substrings of S.
      [Figure: Xi split into Xl(i) and Xr(i), with ti(q) spanning q – 1 characters on each side of the boundary]
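The strings ti(q) can be computed bottom-up by keeping, for each variable, only its prefix and suffix of length at most q − 1. The sketch below is one possible way to do this in O(nq) time; the encoding of the SLP and the name `boundary_strings` are assumptions for illustration, and q ≥ 2 is assumed.

```python
# Example SLP from the slides: terminals are strings, other rules are (left, right) pairs.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def boundary_strings(slp, q):
    """Return {i: t_i(q)} for every variable X_i with |X_i| >= q (assumes q >= 2)."""
    k = q - 1
    size, pre, suf, t = {}, {}, {}, {}
    for i in sorted(slp):                        # children have smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            size[i], pre[i], suf[i] = 1, rule, rule
            continue
        l, r = rule
        size[i] = size[l] + size[r]
        pre[i] = (pre[l] + pre[r])[:k]           # first min(k, |X_i|) characters
        suf[i] = (suf[l] + suf[r])[-k:]          # last min(k, |X_i|) characters
        if size[i] >= q:
            t[i] = suf[l] + pre[r]               # window around the l/r boundary
    return t


if __name__ == "__main__":
    print(boundary_strings(SLP, 4))
```

For the example SLP and q = 4 this yields t5 = aabab, t6 = aabaab and t7 = babaab, and every length-4 substring of S = aabaababaabab occurs in at least one of them.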

  15. LZ78 Factorization from SLP
      Algorithm:
      • Compute { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }.
      • Build the generalized suffix tree (GST) for the strings { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }.
      • Run the LZ78 factorization algorithm using the GST.
      O(n cN) time/space

  16. Example
      • N = 13, cN = 4, n = 7
      • { t5(4), t6(4), t7(4) } = { aabab, aabaab, babaab }
      [Figure: derivation tree of X7 over S = aabaababaabab (positions 1 to 13)]

  17. GST & LZ78 Factors
      The LZ78 trie superimposed on the GST of { t5(4), t6(4), t7(4) }.
      S = aabaababaabab (positions 1 to 13)
      GST input (positions 1 to 17): t5(4) = aabab, t6(4) = aabaab, t7(4) = babaab
      [Figure: LZ78 trie of S (nodes 0 to 6) beside the GST of { t5(4), t6(4), t7(4) }, with the trie nodes marked on the GST]

  18. LZ78 Factorization on GST
      cN = 4; GST input: t5(4), t6(4), t7(4) (positions 1 to 17); i is the start position of the next factor in S.
      • The next factor is a prefix of S[i:N]. Find the node in the GST corresponding to S[i:N]: O(log N) time with random access on the SLP [Bille et al. 2011].
      • Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time with dynamic nma queries.
      • Make a new node for the LZ78 trie on the GST: O(1) time with dynamic nma queries.
      • Compute the next position i ← i + |fi|.
      [Figure: GST with LZ78 trie nodes 0 and 1 marked, beside the derivation tree of X7 over S = aabaababaabab]

  19. LZ78 Factorization on GST (continued)
      [Figure: same steps as the previous slide, with LZ78 trie node 2 now also marked on the GST]

  20. LZ78 Factorization on GST (continued)
      [Figure: same steps as the previous two slides, with LZ78 trie nodes 2 and 3 now also marked on the GST]
      • LZ78 factorization can be computed in O(m log N) time, given a GST preprocessed for nma & la queries and an SLP preprocessed for random access queries.

  21. Summary of Basic Algorithm
      Running time: O(n cN + m log N), where cN = O(N^(1/2)).
      Extreme cases:
      • If the string is compressible, n = O(log N) and m = O(N^(1/2)), so O(n cN + m log N) = O(N^(1/2) log N) = o(N).
      • If the string is not compressible, n, m = O(N) and O(n cN + m log N) = O(N^1.5).
      Can we do better than just reverting to decompress & process?

  22. (1) Improving the n cN term to nL ≤ n cN
      Let L denote the length of the longest LZ78 factor of S.
      • We built the GST for distinct substrings of length at most cN, but actually we only need substrings of length at most L.
      • However, L is not known beforehand…
      • Doubling technique:
        • Assume L = 2 and run the algorithm.
        • If the LZ78 trie expands beyond the GST, set L ← 2 × L, rebuild the GST and the LZ78 trie, and continue.
        • Total time complexity for rebuilds: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L).
      O(n cN + m log N) time, O(n cN + m) space
      → O(nL + m log N) time, O(nL + m) space
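The rebuild cost quoted for the doubling technique is a geometric sum. Spelled out, with the number of rounds taken as ⌈log₂ L⌉ (the slide writes log L):

```latex
\sum_{i=1}^{\lceil \log_2 L \rceil} O\!\left(n 2^{i} + m\right)
  = O\!\left(n \sum_{i=1}^{\lceil \log_2 L \rceil} 2^{i} + m \lceil \log_2 L \rceil\right)
  = O\!\left(n \cdot 2^{\lceil \log_2 L \rceil + 1} + m \log L\right)
  = O(nL + m \log L),
\qquad \text{since } 2^{\lceil \log_2 L \rceil} < 2L .
```

The last round dominates the GST cost, so guessing L by doubling costs only a constant factor more than knowing L in advance, plus the m log L term for rebuilding the LZ78 trie in each round.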

  23. (2) Improving the n cN term to Nα ≤ N
      We can replace the GST with the suffix tree of a trie for q = cN.
      Lemma [Goto et al. CPM 2012]
      Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size Nα = N – α(q) ≤ N, where
        α(q) = Σ_{i : |Xi| ≥ q} (vOcc(Xi) – 1)(|ti(q)| – (q – 1)) ≥ 0
        vOcc(Xi) : # of times Xi occurs in the derivation tree
      The trie can be computed in time linear in its size.
      Lemma [Shibuya 2003]
      The suffix tree of a reverse trie can be constructed in linear time.
      O(n cN + m log N) time, O(n cN + m) space
      → O(Nα + m log N) time, O(Nα + m) space, where Nα = O(n cN)
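The quantity α(q) can be evaluated directly from the SLP. The sketch below is illustrative only: the SLP encoding and the names `v_occ` and `alpha` are assumptions, every variable is assumed to be reachable from the root, and |ti(q)| is taken as min(q − 1, |Xl(i)|) + min(q − 1, |Xr(i)|).

```python
# Example SLP from the slides: terminals are strings, other rules are (left, right) pairs.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """Compute |X_i| for every variable (rules refer only to smaller indices)."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def v_occ(slp, root):
    """Count how often each variable occurs in the derivation tree of X_root."""
    occ = {i: 0 for i in slp}
    occ[root] = 1
    for i in sorted(slp, reverse=True):          # parents before children
        rule = slp[i]
        if not isinstance(rule, str):
            occ[rule[0]] += occ[i]
            occ[rule[1]] += occ[i]
    return occ


def alpha(slp, root, q):
    """Evaluate alpha(q) = sum over |X_i| >= q of (vOcc(X_i) - 1)(|t_i(q)| - (q - 1))."""
    size, occ = lengths(slp), v_occ(slp, root)
    total = 0
    for i, rule in slp.items():
        if isinstance(rule, str) or size[i] < q:
            continue
        l, r = rule
        t_len = min(q - 1, size[l]) + min(q - 1, size[r])
        total += (occ[i] - 1) * (t_len - (q - 1))
    return total


if __name__ == "__main__":
    a = alpha(SLP, 7, 4)
    print(a, lengths(SLP)[7] - a)
```

For the example SLP with q = 4, only X5 occurs more than once (vOcc(X5) = 2, |t5(4)| = 5), so α(4) = 2 and Nα = 13 − 2 = 11, matching the trie size in the example on slide 24.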

  24. Example: Trie of size Nα for q = 4
      [Figure: derivation tree of X7 over S = aabaababaabab, and the trie built from t5(4), t6(4), t7(4)]
      Σ|ti(q)| : 17
      Text size: 13
      Trie size: 11
      We can aggregate all ti(q) into a trie of size at most the text size.

  25. Summary
      • Showed an algorithm for SLP → LZ78 factorization that is
        • at least as fast as naïve decompress & process
        • better when the string is compressible
      N : length of uncompressed string S
      σ : alphabet size
      n : size of SLP representing S
      L : length of longest LZ78 factor
      Nα = N – α(cN) ≤ N
      m : # of LZ78 factors (O(N / log N) for constant σ)
