Reducing the Space Requirement of LZ-index

Diego Arroyuelo (1), Gonzalo Navarro (1), and Kunihiko Sadakane (2). (1) Dept. of Computer Science, Univ. of Chile. (2) Dept. of Computer Science and Communication Engineering, Kyushu Univ. Barcelona, July 7, 2006.

Presentation Transcript

  1. Reducing the Space Requirement of LZ-index • Diego Arroyuelo (1), Gonzalo Navarro (1), and Kunihiko Sadakane (2) • (1) Dept. of Computer Science, Univ. of Chile • (2) Dept. of Computer Science and Communication Engineering, Kyushu Univ. • Barcelona – July 7, 2006

  2. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  3. Problem definition • The full-text search problem: find the occ occurrences of a pattern P[1…m] in a text T[1…u] • To provide fast access to T while requiring little space, we use compressed full-text self-indexes, which: • replace T and in addition give indexed access to it, and • take space proportional to the compressed text size (O(uH_k(T)) bits) • H_k(T) is the k-th order empirical entropy of T: H_k(T) ≤ H_{k-1}(T) ≤ … ≤ H_0(T) ≤ log σ (σ is the alphabet size) • Main motivation: to store the indexes of very large texts entirely in main memory
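
The entropy chain on this slide can be checked numerically. The sketch below is not from the slides: it uses the standard definition H_k(T) = (1/u) Σ_w |T_w| H_0(T_w), where T_w collects the symbols that follow each length-k context w in T. The function names `h0` and `hk` and the sample text are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def h0(symbols):
    """Zero-order empirical entropy (bits per symbol) of a sequence."""
    u = len(symbols)
    if u == 0:
        return 0.0
    return sum(c / u * log2(u / c) for c in Counter(symbols).values())

def hk(text, k):
    """k-th order empirical entropy of text (bits per symbol)."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)  # context w -> symbols that follow w in text
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

text = "alabar_a_la_alabarda_para_apalabrarla$"  # assumed sample text
sigma = len(set(text))
print([round(hk(text, k), 3) for k in range(4)], round(log2(sigma), 3))
```

The printed values are non-increasing in k and bounded by log σ, which is why a space bound of the form O(uH_k(T)) gets stronger as k grows.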

  4. Our Results • LZ-index [Navarro, 2004]: • Space: 4uH_k(T) + o(u log σ) bits, k = o(log_σ u) • Reporting: O(m^3 log σ + (m+occ) log u) • Displaying: O(l log σ) • The main drawback of LZ-index is the factor 4 in the space complexity • LZ-index is faster to report and to display (very important for a self-index!) • Our results: • Space: (2+ε)uH_k(T) + o(u log σ) bits, for any constant 0 < ε < 1 • Reporting: O(m^2 log m + (m+occ) log u) • Displaying: O(l / log_σ u) (optimal) • But also: (1+ε)uH_k(T) + o(u log σ) bits with O(m^2) average-case time, for m ≥ 2 log_σ u

  5. Our results in context • Our data structures: • Size O(uH_k(T)) bits • O(log u) time per occurrence reported, if σ = Θ(polylog(u)) • There are competing schemes requiring the same or better complexity for reporting • The case σ = Θ(polylog(u)) represents moderate-size alphabets and is very common in practice, but does not fit in competing schemes

  6. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  7. The LZ-index (a review) • [Figure: the LZ78 parsing of T and the four components of the index: LZTrie, RevTrie, Range, and Node] • We don’t need to store the text!

  8. Succinct representation of the data structures • Assume n is the number of phrases in the LZ78 parsing of T • LZTrie: • par: the balanced parentheses representation of LZTrie (2n + o(n) bits) • lets: the symbols labelling the arcs of LZTrie, in preorder (n log σ bits) • ids: the phrase identifiers in preorder (n log n bits) • RevTrie: • rpar: the balanced parentheses representation of RevTrie (4n + o(n) bits) • rids: the phrase identifiers in preorder (n log n bits) • Node: an array requiring n log(2n) = n log n + n bits • Range: implemented using [Chazelle, 1988], requiring n log n (1 + o(1)) bits
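
To make n concrete, here is a minimal sketch of LZ78 parsing. It is not the authors' code. Each new phrase is the longest previously seen phrase plus one symbol, so a phrase can be stored as a (parent id, symbol) pair, which is exactly an arc of LZTrie. The sample text is an assumption: it is consistent with the symbol row shown on slide 21, but the transcript never states it.

```python
def lz78_parse(text):
    """LZ78-parse text into phrases stored as (parent_id, symbol); id 0 is the empty phrase.

    Assumes text ends with a unique terminator, so the last phrase is complete.
    """
    trie = {}      # (parent_id, symbol) -> phrase id, i.e. the arcs of LZTrie
    phrases = []   # phrases[k - 1] describes phrase k
    node = 0
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]          # keep extending the current match
        else:
            phrases.append((node, ch))       # new phrase = matched phrase + ch
            trie[(node, ch)] = len(phrases)
            node = 0
    if node != 0:
        raise ValueError("text must end with a unique terminator")
    return phrases

def phrase_string(phrases, k):
    """Spell out phrase k by walking parent pointers up to the root."""
    out = []
    while k != 0:
        k, ch = phrases[k - 1]
        out.append(ch)
    return "".join(reversed(out))

parsed = lz78_parse("alabar_a_la_alabarda_para_apalabrarla$")
print([phrase_string(parsed, k) for k in range(1, len(parsed) + 1)])
```

For this text the parse has n = 17 phrases. Every n log n-bit structure on the slide (ids, rids, Node, Range) stores one entry per phrase.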

  9. Succinct representation of the data structures • We have four n log n-bit terms • As n log n = uH_k(T) + o(u log σ), for k = o(log_σ u), • the LZ-index requires 4n log n (1 + o(1)) = 4uH_k(T) + o(u log σ) bits, for k = o(log_σ u) • To reduce the space requirement we must reduce the number of n log n-bit terms in the index

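  10. Search Algorithm • Occurrences of Type 1: P lies within a single phrase • Occurrences of Type 2: P spans two consecutive phrases • Occurrences of Type 3: P spans three or more phrases • Reporting time: O(m^3 log σ + (m+occ) log n) • [Figure: consecutive phrases B_{k-1} B_k … B_l B_{l+1}]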
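
The three occurrence types depend only on how many phrases an occurrence overlaps. The brute-force sketch below is for illustration and is not the index's search algorithm: it scans the text directly and classifies each occurrence by phrase boundaries. The phrase list is an assumed example (the LZ78 parse of alabar_a_la_alabarda_para_apalabrarla$) and `occurrence_types` is a hypothetical helper.

```python
from bisect import bisect_right

def occurrence_types(phrases, pattern):
    """Return (text position, type) for every occurrence of pattern.

    Type 1: inside one phrase; type 2: spans two phrases; type 3: three or more.
    """
    text = "".join(phrases)
    starts, pos = [], 0
    for p in phrases:                 # starting position of each phrase
        starts.append(pos)
        pos += len(p)
    result = []
    i = text.find(pattern)
    while i != -1:
        first = bisect_right(starts, i) - 1                     # phrase of first symbol
        last = bisect_right(starts, i + len(pattern) - 1) - 1   # phrase of last symbol
        result.append((i, min(last - first + 1, 3)))
        i = text.find(pattern, i + 1)
    return result

PHRASES = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard",
           "a_p", "ara", "_ap", "al", "abr", "arl", "a$"]
print(occurrence_types(PHRASES, "ala"))
print(occurrence_types(PHRASES, "ab"))
```

In this example "ala" has one type 3 occurrence (over phrases a, l, ab) and two type 2 occurrences, while every occurrence of "ab" is of type 1.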

  11. Solving Occurrences of Type 1 • By the LZ78 parsing, P is a suffix of the shortest possible phrases containing P • [Figure: LZTrie, showing the shortest possible LZ78 phrases containing P and the subtrees containing occurrences of type 1]

  12. Solving Occurrences of Type 1 • As P is a suffix of such phrases, P^r (the reverse of P) is a prefix of the corresponding reverse phrases • We need the Reverse Trie (RevTrie) to solve this problem • [Figure: P^r located in RevTrie, and the corresponding phrases ending with P in LZTrie]

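  13. Solving Occurrences of Type 2 • Search for [x, y] × [x’, y’] in Range • For every pair (k, k+1) found, report k • [Figure: P is split as P_1 P_2; P_1^r is located in RevTrie and P_2 in LZTrie, yielding the preorder intervals [x, y] and [x’, y’]]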
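
The two-dimensional search on this slide can be sketched as follows. This is a simplification under stated assumptions: sorted lists of phrase strings stand in for the preorder of LZTrie and of RevTrie, the intervals are found by a linear scan, and the rectangle query is a scan over all points instead of the Chazelle structure. The phrase list is the same assumed example parse, and all function names are hypothetical.

```python
def prefix_interval(sorted_keys, prefix):
    """Inclusive interval of positions whose key starts with prefix, or None."""
    hits = [i for i, key in enumerate(sorted_keys) if key.startswith(prefix)]
    return (hits[0], hits[-1]) if hits else None

def type2_occurrences(phrases, pattern):
    """Phrase numbers k (1-based) such that pattern spans phrases k and k+1."""
    n = len(phrases)
    ids = range(1, n + 1)
    lz_order = sorted(ids, key=lambda k: phrases[k - 1])         # stands in for LZTrie preorder
    rev_order = sorted(ids, key=lambda k: phrases[k - 1][::-1])  # stands in for RevTrie preorder
    lz_rank = {k: r for r, k in enumerate(lz_order)}
    rev_rank = {k: r for r, k in enumerate(rev_order)}
    lz_keys = [phrases[k - 1] for k in lz_order]
    rev_keys = [phrases[k - 1][::-1] for k in rev_order]
    # One point per pair of consecutive phrases (k, k+1).
    points = [(rev_rank[k], lz_rank[k + 1], k) for k in range(1, n)]
    found = []
    for split in range(1, len(pattern)):
        p1, p2 = pattern[:split], pattern[split:]
        rev_iv = prefix_interval(rev_keys, p1[::-1])   # phrases ending with P_1
        lz_iv = prefix_interval(lz_keys, p2)           # phrases starting with P_2
        if rev_iv is None or lz_iv is None:
            continue
        (x, y), (x2, y2) = rev_iv, lz_iv
        found.extend(k for rx, lx, k in points if x <= rx <= y and x2 <= lx <= y2)
    return sorted(found)

PHRASES = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard",
           "a_p", "ara", "_ap", "al", "abr", "arl", "a$"]
print(type2_occurrences(PHRASES, "ala"))
```

Each of the m-1 ways to split P contributes one rectangle query. Keeping Range is what gives the worst-case bound for this step.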

  14. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  15. LZ-index as a Navigation Scheme • In practice Range is replaced by RNode (phrase id → RevTrie node) • Occurrences of type 2: • We have no worst-case guarantees at search time • Average time for type 2 occurrences: O(n/σ^{m/2}) (O(1) for m ≥ 2 log_σ n) • [Figure: P_1^r in RevTrie and P_2 in LZTrie, connected through Node and RNode]

  16. Original Navigation Scheme • When we replace Range by RNode, we get a “navigation” scheme • But the scheme is redundant… • We study how to reduce the redundancy in the LZ-index

  17. Alternative Navigation Scheme • Inverse permutations are represented with the technique of Munro et al. • Space requirement: (2+ε)uH_k + o(u log σ) bits • The search algorithm remains the same… • O(m^2) (average case), for m ≥ 2 log_σ n
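
The idea behind storing a permutation and getting its inverse almost for free can be sketched as follows. This is a simplified rendering of the shortcut technique of Munro et al., not the authors' implementation. On every cycle longer than t, every t-th element keeps a pointer t steps back, so the predecessor of any element is reached in about t steps. With t = 1/ε the shortcuts cost roughly εn log n bits. The dictionary used here is an assumption for readability; the actual structure marks shortcut positions in a bit vector.

```python
def build_shortcuts(perm, t):
    """On each cycle longer than t, give every t-th element a pointer t steps back."""
    back = {}
    seen = [False] * len(perm)
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle, j = [], start
        while not seen[j]:            # collect the cycle containing start
            seen[j] = True
            cycle.append(j)
            j = perm[j]
        if len(cycle) > t:
            for idx in range(0, len(cycle), t):
                back[cycle[idx]] = cycle[idx - t]   # negative index wraps around the cycle
    return back

def inverse(perm, back, i):
    """Return j with perm[j] == i, using about t applications of perm plus one shortcut."""
    j = i
    while perm[j] != i:
        if j in back:
            j = back[j]               # jump behind i, then walk forward to its predecessor
            while perm[j] != i:
                j = perm[j]
            return j
        j = perm[j]
    return j

perm = [(7 * i + 3) % 50 for i in range(50)]   # an arbitrary permutation of 0..49
back = build_shortcuts(perm, 4)
print(len(back), inverse(perm, back, 10))
```

Storing ids explicitly and answering the inverse mapping through such shortcuts is the kind of saving that removes an n log n-bit array from the navigation scheme.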

  18. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  19. Suffix Links in RevTrie • Can we reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits? • Can we reduce the space requirement while retaining worst-case guarantees in the search process? • We are going to compress the R mapping

  20. Suffix Links in RevTrie • Definition 1: We define function φ as a suffix link in RevTrie: φ(i) = R^{-1}(parent_LZ(R[i])) • If we follow a suffix link in RevTrie, we are “going to the parent” in LZTrie • [Figure: RevTrie node i, spelling a·x^r, and its suffix link φ(i), spelling x^r; LZTrie node R[i], spelling x·a, and its parent, spelling x]
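
Definition 1 can be tried out on a small example. The sketch below rests on several assumptions that the slides do not state: the phrase list is the LZ78 parse of alabar_a_la_alabarda_para_apalabrarla$, symbols are ordered as $ < a < b < d < l < p < r < _, only trie nodes that correspond to phrases (plus the root) are numbered, and preorder is simulated by sorting strings. Under those assumptions R maps a RevTrie rank to an LZTrie rank, and φ(i) is the RevTrie rank of the phrase obtained by dropping the last symbol of phrase i.

```python
ORDER = "$abdlpr_"   # assumed symbol order

def sort_key(s):
    return [ORDER.index(c) for c in s]

def suffix_links(phrases):
    """Return (R, phi) over RevTrie ranks, with the root at rank 0."""
    strings = [""] + list(phrases)                 # id 0 is the empty phrase (root)
    ids = range(len(strings))
    lz_order = sorted(ids, key=lambda k: sort_key(strings[k]))
    rev_order = sorted(ids, key=lambda k: sort_key(strings[k][::-1]))
    lz_pos = {k: r for r, k in enumerate(lz_order)}
    rev_pos = {k: r for r, k in enumerate(rev_order)}
    id_of = {s: k for k, s in enumerate(strings)}
    R = [lz_pos[k] for k in rev_order]             # RevTrie rank -> LZTrie rank
    R_inv = [rev_pos[k] for k in lz_order]         # LZTrie rank -> RevTrie rank
    # LZTrie parent of phrase x·a is phrase x; the root is treated as its own parent.
    parent_lz = [lz_pos[id_of[strings[k][:-1]]] for k in lz_order]
    phi = [R_inv[parent_lz[R[i]]] for i in ids]    # Definition 1
    return R, phi

PHRASES = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard",
           "a_p", "ara", "_ap", "al", "abr", "arl", "a$"]
R, phi = suffix_links(PHRASES)
print(phi)
```

With these assumptions the computed φ agrees with the φ row of the example table on the following slide, which suggests the slide uses this text and symbol order.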

  21. Suffix Links in RevTrie • Example (figure): R[11] = ?
      i:     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
      φ(i):  0  2  0  9 14 16  2  3 14  0  2 14  5 17  2  6  0  2
      L[i]:     $  a  a  a  a  b  b  d  l  l  l  p  p  r  r  _  _

  22. Suffix Links in RevTrie • We can compute R using φ • But, what is the difference in space requirement? (both R and φ require, in principle, n log n bits) • We can prove the following lemma for function φ

  23. Suffix Links in RevTrie • We replace the n log n-bit representation of R by a representation of φ requiring nH_0(lets) + O(n log log σ) + O(σ log σ) + n + o(n) bits • To compute R in O(1/ε) time we store εn values of R, requiring εn log n extra bits • R^{-1} can be computed in O(1/ε^2) time

  24. Suffix Links in RevTrie • Yes, we can reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits

  25. Suffix Links in RevTrie • We can add Range to get worst-case guarantees in the search process, requiring n log n extra bits • Yes, we can reduce the space requirement of LZ-index to (2+ε)uH_k + o(u log σ) bits, retaining worst-case guarantees at search time

  26. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  27. xbw LZ-index • The xbw transform [Ferragina et al., 2005] is a succinct tree representation requiring 2n log σ + O(n) bits and allowing operations: • parent (O(1) time) • child(x, i) (O(1) time) • child(x, a) (O(1) time) • Subpath queries (O(m) time) • As we can perform prefix and suffix searching, we can do the work of both LZTrie and RevTrie only with xbw! • [Figure: subpath search with string P]

  28. xbw LZ-index • In principle: (3+ε)uH_k(T) + o(u log σ) bits • [Figure: balanced parentheses LZTrie (()()())()(()())(()) with ids; xbw LZTrie with S_last and S_α; Range; Pos and Pos^{-1} mapping between preorder positions and xbw positions]

  29. xbw LZ-index • (2+ε)uH_k(T) + o(u log σ) bits • We store one out of O(1/ε) values of Pos • [Figure: balanced parentheses LZTrie (()()())()(()())(()) with ids; xbw LZTrie with S_last and S_α; Pos[i] obtained from the sampled array Pos’ and φ]

  30. xbw LZ-index • Occurrences of Type 1: using the xbw (subpath search with P^r), and then mapping to the parentheses LZTrie • Occurrences of Type 2: subpath search for P_1^r and search (using child from the root) for P_2 • Then use the corresponding xbw and preorder ranges to search in Range • Occurrences of Type 3: mostly as with the original LZ-index • Occurrences of Type 2 can be solved as Occurrences of Type 3 (we don’t need Range!) • We have achieved Theorems 1 and 2 with radically different means!

  31. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  32. Displaying text substrings • The approach of [Sadakane and Grossi, 2006] to display any text substring of length Θ(log_σ u) in constant time can be adapted to our indexes

  33. Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions

  34. Conclusions • We have studied the reduction of the space requirement of LZ-index • Two different approaches: the navigation scheme, and xbw + balanced parentheses LZTrie • In either case we achieve (2+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u) • The search time is improved to O(m^2 log m + (m+occ) log n) (worst case)

  35. Conclusions • We also define indexes requiring (1+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u) • O(m^2) average-case time if m ≥ 2 log_σ n • The time to display a context of length l around any text position is also improved to the optimal O(l / log_σ u) • We also remove some restrictions of the original LZ-index (see the paper)

  36. Questions? Contact darroyue@dcc.uchile.cl

  37. Thanks! Contact darroyue@dcc.uchile.cl
