Reducing the Space Requirement of LZ-index. Diego Arroyuelo (1), Gonzalo Navarro (1), and Kunihiko Sadakane (2). (1) Dept. of Computer Science, Univ. of Chile. (2) Dept. of Computer Science and Communication Engineering, Kyushu Univ. Barcelona, July 7, 2006.
Outline • Introduction • The LZ-index (A Review) • LZ-index as a Navigation Scheme • Suffix-Links in the Reverse Trie • xbw LZ-index • Displaying Text Substrings • Conclusions
Problem definition • The full-text search problem: to find the occ occurrences of a pattern P[1…m] in a text T[1…u] over an alphabet of size σ • To provide fast access to T while requiring little space, we use compressed full-text self-indexes, which: • replace T and in addition give indexed access to it, and • take space proportional to the compressed text size (O(uH_k(T)) bits) • H_k(T) is the k-th order empirical entropy of T: H_k(T) ≤ H_{k-1}(T) ≤ … ≤ H_0(T) ≤ log σ • Main motivation: to store the indexes of very large texts entirely in main memory
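The entropy chain on this slide can be checked numerically. The sketch below is not from the talk. It assumes the usual definition of H_k as the frequency-weighted H_0 of the symbols that follow each length-k context. The function names `h0` and `hk` and the sample text are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(seq).values())

def hk(text, k):
    """k-th order empirical entropy (assumed definition): H_0 of the symbols
    following each length-k context, weighted by how often the context occurs."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

T = "alabar_a_la_alabarda$"          # hypothetical sample text
sigma = len(set(T))                  # alphabet size
values = [hk(T, k) for k in range(4)]
```

On any input, `values` should be non-increasing in k and bounded above by `log2(sigma)`, which is the inequality stated on the slide.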
Our Results
LZ-index [Navarro, 2004] • Space: 4uH_k(T) + o(u log σ) bits, k = o(log_σ u) • Reporting: O(m^3 log σ + (m+occ) log u) • Displaying: O(l log σ), where l is the length of the displayed substring
Our results • Space: (2+ε)uH_k(T) + o(u log σ) bits, for any constant 0 < ε < 1 • Reporting: O(m^2 log m + (m+occ) log u) • Displaying: O(l / log_σ u) (optimal) • But also: (1+ε)uH_k(T) + o(u log σ) bits with O(m^2) search time (average case), for m ≥ 2 log_σ u
• The main drawback of LZ-index is the factor 4 in the space complexity • LZ-index is faster to report and to display (very important for a self-index!)
Our results in context • Our data structures: • Size O(uH_k(T)) bits • O(log u) time per occurrence reported, if σ = Θ(polylog(u)) • There are competing schemes requiring the same or better complexity for reporting • The case σ = Θ(polylog(u)) represents moderate-size alphabets and is very common in practice, but does not fit in competing schemes
The LZ-index (a review) • Built on the LZ78 parsing of T • Data structures: LZTrie, RevTrie, Range, and Node • We don’t need to store the text!
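As a reminder of what the LZ78 parsing produces, here is a minimal sketch. It is not from the talk. It assumes the text ends with a unique terminator such as `$`, so the last phrase is complete. The function name `lz78_parse` and the sample text are illustrative assumptions.

```python
def lz78_parse(text):
    """LZ78 parsing: each new phrase is the longest previous phrase plus one symbol.
    Returns (phrase_id, parent_id, symbol) triples; id 0 is the empty phrase (trie root).
    Assumes text ends with a unique terminator, so no phrase is left unfinished."""
    trie = {}                      # (parent_id, symbol) -> phrase_id
    triples = []
    node = 0
    for ch in text:
        child = trie.get((node, ch))
        if child is None:
            new_id = len(triples) + 1
            trie[(node, ch)] = new_id
            triples.append((new_id, node, ch))
            node = 0               # start the next phrase at the root
        else:
            node = child           # keep extending the current phrase
    return triples

def phrase_strings(triples):
    """Expand the triples into the actual phrase strings."""
    strings = {0: ""}
    for pid, parent, ch in triples:
        strings[pid] = strings[parent] + ch
    return [strings[pid] for pid, _, _ in triples]

T = "alabar_a_la_alabarda$"        # hypothetical sample text
phrases = phrase_strings(lz78_parse(T))
```

Each phrase is stored as a (parent, symbol) pair, which is why the set of phrases forms a trie (the LZTrie) and why every prefix of a phrase is itself a phrase.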
Succinct representation of the data structures • Assume n is the number of phrases in the LZ78 parsing of T • LZTrie: • par: the balanced parentheses representation of LZTrie (2n + o(n) bits) • lets: the symbols labelling the arcs of LZTrie, in preorder (n log σ bits) • ids: the phrase identifiers in preorder (n log n bits) • RevTrie: • rpar: the balanced parentheses representation of RevTrie (4n + o(n) bits) • rids: the phrase identifiers in preorder (n log n bits) • Node: an array requiring n log(2n) = n log n + n bits • Range: implemented using [Chazelle, 1988], requiring n log n (1 + o(1)) bits
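To make the three LZTrie arrays concrete, the sketch below builds par, lets and ids for a small text with a preorder traversal. It is an illustration only, not the authors' implementation. It assumes children are visited in symbol order and stores the parentheses as a plain string instead of a succinct bit vector. `lztrie_arrays` and the sample text are hypothetical.

```python
def lztrie_arrays(text):
    """Build the LZ78 trie of text and return (par, lets, ids) in preorder.
    par: balanced parentheses string, lets: arc labels, ids: phrase identifiers.
    Assumption: children are ordered by symbol; the root has id 0 and an empty label."""
    children = {0: {}}                  # node id -> {symbol: child id}
    node = 0
    for ch in text:
        nxt = children[node].get(ch)
        if nxt is None:
            new_id = len(children)
            children[node][ch] = new_id
            children[new_id] = {}
            node = 0
        else:
            node = nxt
    par, lets, ids = [], [], []

    def visit(v, symbol):
        par.append("(")                 # opening parenthesis marks the preorder visit
        lets.append(symbol)
        ids.append(v)
        for s in sorted(children[v]):
            visit(children[v][s], s)
        par.append(")")

    visit(0, "")
    return "".join(par), lets, ids

par, lets, ids = lztrie_arrays("alabar_a_la_alabarda$")
```

With n+1 nodes, `par` has 2(n+1) symbols, while `lets` and `ids` have one entry per node, which mirrors the 2n + o(n), n log σ and n log n bit terms on the slide.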
Succinct representation of the data structures • We have four n log n-bit terms • As n log n = uH_k(T) + o(u log σ), for k = o(log_σ u), the LZ-index requires 4n log n (1 + o(1)) = 4uH_k(T) + o(u log σ) bits • To reduce the space requirement we must reduce the number of n log n-bit terms in the index
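Search Algorithm • Occurrences of Type 1: P lies within a single LZ78 phrase • Occurrences of Type 2: P spans two consecutive phrases • Occurrences of Type 3: P spans three or more consecutive phrases B_{k-1} B_k … B_l B_{l+1} • Reporting time: O(m^3 log σ + (m+occ) log n)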
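The three occurrence types follow the usual LZ-index classification: an occurrence is of type 1, 2 or 3 when it overlaps one, two, or at least three consecutive phrases. The brute-force sketch below only illustrates that classification. It scans the text directly, which the index itself never does. All names and the sample text are assumptions.

```python
from bisect import bisect_right

def lz78_phrases(text):
    """LZ78 phrases of text as strings (a trailing unfinished phrase is kept as is)."""
    trie, phrases, node, start = {}, [], 0, 0
    for i, ch in enumerate(text):
        nxt = trie.get((node, ch))
        if nxt is None:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append(text[start:i + 1])
            node, start = 0, i + 1
        else:
            node = nxt
    if start < len(text):
        phrases.append(text[start:])
    return phrases

def classify_occurrences(text, pattern):
    """Map each occurrence position of pattern to type 1, 2 or 3."""
    starts, pos = [], 0
    for p in lz78_phrases(text):
        starts.append(pos)               # text position where each phrase begins
        pos += len(p)
    result = {1: [], 2: [], 3: []}
    i = text.find(pattern)
    while i != -1:
        first = bisect_right(starts, i) - 1                     # phrase holding the first symbol
        last = bisect_right(starts, i + len(pattern) - 1) - 1   # phrase holding the last symbol
        result[min(last - first + 1, 3)].append(i)
        i = text.find(pattern, i + 1)
    return result

T = "alabar_a_la_alabarda$"              # hypothetical sample text
by_type = classify_occurrences(T, "ala")
```

For this text the parsing is a|l|ab|ar|_|a_|la|_a|lab|ard|a$, so the occurrence of "ala" at position 0 overlaps three phrases and the one at position 12 overlaps two.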
Solving Occurrences of Type 1 • Consider the shortest possible LZ78 phrases containing P • By LZ78, P is a suffix of such phrases • [Figure: LZTrie; the subtrees rooted at these phrases contain the occurrences of type 1]
Solving Occurrences of Type 1 • As P is a suffix of such phrases, P^r (the reverse of P) is a prefix of the corresponding reverse phrases • We need the Reverse Trie (RevTrie) to solve this problem • [Figure: P^r in RevTrie and the corresponding nodes for P in LZTrie]
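The type 1 argument can be sketched with plain sorted lists standing in for the tries. A sorted list of reversed phrases plays the role of RevTrie, and a prefix test plays the role of the LZTrie subtree. This is a simplified illustration under those assumptions, not the index's actual data structures. The function names and the sample text are hypothetical.

```python
from bisect import bisect_left

def lz78_phrases(text):
    """LZ78 phrases of text as strings (a trailing unfinished phrase is kept as is)."""
    trie, phrases, node, start = {}, [], 0, 0
    for i, ch in enumerate(text):
        nxt = trie.get((node, ch))
        if nxt is None:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append(text[start:i + 1])
            node, start = 0, i + 1
        else:
            node = nxt
    if start < len(text):
        phrases.append(text[start:])
    return phrases

def type1_occurrences(text, pattern):
    """Text positions of the occurrences of pattern lying inside a single phrase."""
    phrases = lz78_phrases(text)
    starts, pos = [], 0
    for p in phrases:
        starts.append(pos)
        pos += len(p)
    # Stand-in for RevTrie: reversed phrases in sorted order; those having the
    # reversed pattern as a prefix form one contiguous run.
    rev_sorted = sorted(p[::-1] for p in phrases)
    pr = pattern[::-1]
    i = bisect_left(rev_sorted, pr)
    enders = set()                       # phrases that have pattern as a suffix
    while i < len(rev_sorted) and rev_sorted[i].startswith(pr):
        enders.add(rev_sorted[i][::-1])
        i += 1
    # Stand-in for the LZTrie subtree: every phrase extending an "ender"
    # contains the pattern, ending where the ender ends.
    occ = []
    for s in enders:
        for j, p in enumerate(phrases):
            if p.startswith(s):
                occ.append(starts[j] + len(s) - len(pattern))
    return sorted(occ)

T = "alabar_a_la_alabarda$"              # hypothetical sample text
occ_la = type1_occurrences(T, "la")
```

Here "la" is a suffix of the phrase "la" only, and the phrases in its subtree are "la" and "lab", which gives the two type 1 occurrences.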
Solving Occurrences of Type 2 • Split P = P1 P2; search for P1^r in RevTrie (preorder range [x, y]) and for P2 in LZTrie (preorder range [x', y']) • Search for [x, y] × [x', y'] in Range • For every pair (k, k+1) found, report k
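A naive version of the type 2 search is sketched below. Ranks in two sorted lists stand in for the RevTrie and LZTrie preorders, and a linear scan over the points (rank of phrase k in the reversed order, rank of phrase k+1 in the forward order) stands in for the Range structure of [Chazelle, 1988]. It assumes all phrases are distinct (text ending in a unique terminator). All names and the sample text are illustrative assumptions.

```python
def lz78_phrases(text):
    """LZ78 phrases of text as strings; assumes a unique final terminator."""
    trie, phrases, node, start = {}, [], 0, 0
    for i, ch in enumerate(text):
        nxt = trie.get((node, ch))
        if nxt is None:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append(text[start:i + 1])
            node, start = 0, i + 1
        else:
            node = nxt
    return phrases

def type2_occurrences(text, pattern):
    """Text positions of occurrences of pattern spanning exactly two consecutive phrases."""
    phrases = lz78_phrases(text)
    n = len(phrases)
    starts, pos = [], 0
    for p in phrases:
        starts.append(pos)
        pos += len(p)
    lz_order = sorted(range(n), key=lambda k: phrases[k])          # stand-in for LZTrie preorder
    rev_order = sorted(range(n), key=lambda k: phrases[k][::-1])   # stand-in for RevTrie preorder
    lz_rank = {k: r for r, k in enumerate(lz_order)}
    rev_rank = {k: r for r, k in enumerate(rev_order)}
    points = [(rev_rank[k], lz_rank[k + 1], k) for k in range(n - 1)]
    occ = []
    for cut in range(1, len(pattern)):
        p1r, p2 = pattern[:cut][::-1], pattern[cut:]
        xs = [r for r, k in enumerate(rev_order) if phrases[k][::-1].startswith(p1r)]
        ys = [r for r, k in enumerate(lz_order) if phrases[k].startswith(p2)]
        if not xs or not ys:
            continue
        x, y, x2, y2 = xs[0], xs[-1], ys[0], ys[-1]    # matches are contiguous rank ranges
        for rx, ry, k in points:                       # naive 2D range search
            if x <= rx <= y and x2 <= ry <= y2:
                occ.append(starts[k + 1] - cut)
    return sorted(occ)

T = "alabar_a_la_alabarda$"              # hypothetical sample text
occ_ala = type2_occurrences(T, "ala")
```

A real Range structure answers each two-dimensional query in time roughly logarithmic in n per reported point instead of scanning all n points.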
LZ-index as a Navigation Scheme • In practice Range is replaced by RNode (phrase id → RevTrie node) • Occurrences of type 2: • We have no worst-case guarantees at search time • Average time for type 2 occs: O(n/σ^{m/2}) (O(1) for m ≥ 2 log_σ n) • [Figure: P1^r in RevTrie and P2 in LZTrie, connected through Node and RNode]
Original Navigation Scheme • When we replace Range by RNode, we get a “navigation” scheme • But the scheme is redundant… • We study how to reduce the redundancy in the LZ-index
Alternative Navigation Scheme • Inverse permutations represented with the structure of Munro et al. • Space requirement: (2+ε)uH_k(T) + o(u log σ) bits • The search algorithm remains the same: O(m^2) (average case), for m ≥ 2 log_σ n
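The idea behind representing an inverse permutation in little extra space can be sketched as follows. Keep the permutation itself and add one backward shortcut every t steps along each cycle, so that the inverse is found in O(t) steps. This is a simplified illustration of the cycle-shortcut technique, with t playing the role of 1/ε. It is not the exact structure used in the index, and `build_shortcuts` and `inverse` are hypothetical names.

```python
def build_shortcuts(perm, t):
    """For every cycle longer than t, mark every t-th element and store a pointer
    from each marked element back to the previous marked one (about n/t pointers)."""
    n = len(perm)
    back = {}
    seen = [False] * n
    for s in range(n):
        if seen[s]:
            continue
        cycle, j = [], s
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = perm[j]
        if len(cycle) > t:
            marks = list(range(0, len(cycle), t))
            for a, b in zip(marks, marks[1:] + [marks[0]]):
                back[cycle[b]] = cycle[a]
    return back

def inverse(perm, back, i):
    """Return j with perm[j] == i: walk forward to the next marked element,
    take its shortcut once, then walk forward again until reaching the predecessor of i."""
    j, jumped = i, False
    while perm[j] != i:
        if not jumped and j in back:
            j, jumped = back[j], True
        else:
            j = perm[j]
    return j

perm = [2, 0, 3, 1, 5, 4, 6]             # cycles: (0 2 3 1), (4 5), (6)
back = build_shortcuts(perm, 2)
inv = [inverse(perm, back, i) for i in range(len(perm))]
```

Storing one shortcut per t elements costs about (n/t) log n bits on top of the permutation, which is the kind of time/space trade-off behind the ε term in the space bound.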
Suffix Links in RevTrie • Can we reduce the space requirement of LZ-index to (1+ε)uH_k(T) + o(u log σ) bits? • Can we reduce the space requirement while retaining worst-case guarantees in the search process? • We are going to compress the R mapping
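Suffix Links in RevTrie • Definition 1: We define function φ as a suffix link in RevTrie: φ(i) = R^{-1}(parent_LZ(R[i])) • If we follow a suffix link in RevTrie, we are “going to the parent” in LZTrie • [Figure: RevTrie node i, reached by a·x^r, maps through R[i] to the LZTrie node for x·a; its LZTrie parent x corresponds to RevTrie node φ(i), reached by x^r]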
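Definition 1 can be tried out on a toy example. The sketch below ranks phrases by their reversed strings as a stand-in for RevTrie preorder (ignoring any empty RevTrie nodes) and by their forward strings as a stand-in for LZTrie preorder, then evaluates R^{-1}(parent_LZ(R[i])) literally. It assumes all phrases are distinct. The names and the sample text are illustrative, and φ(0) = 0 is set by convention for the root.

```python
def lz78_phrases(text):
    """LZ78 phrases of text as strings; assumes a unique final terminator."""
    trie, phrases, node, start = {}, [], 0, 0
    for i, ch in enumerate(text):
        nxt = trie.get((node, ch))
        if nxt is None:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append(text[start:i + 1])
            node, start = 0, i + 1
        else:
            node = nxt
    return phrases

def suffix_links(text):
    """Return (rev_sorted, R, phi) for the phrases of text, root included at rank 0."""
    phrases = [""] + lz78_phrases(text)
    lz_sorted = sorted(phrases)                      # stand-in for LZTrie preorder
    rev_sorted = sorted(p[::-1] for p in phrases)    # stand-in for RevTrie preorder
    lz_rank = {p: r for r, p in enumerate(lz_sorted)}
    R = [lz_rank[r[::-1]] for r in rev_sorted]       # RevTrie rank -> LZTrie rank
    R_inv = [0] * len(R)
    for i, p in enumerate(R):
        R_inv[p] = i
    # The LZTrie parent of a phrase is the phrase without its last symbol.
    parent = [lz_rank[p[:-1]] for p in lz_sorted]
    phi = [R_inv[parent[R[i]]] for i in range(len(R))]
    return rev_sorted, R, phi

rev_sorted, R, phi = suffix_links("alabar_a_la_alabarda$")   # hypothetical sample text
L = [r[0] for r in rev_sorted[1:]]   # first symbol of each reversed phrase
```

Following φ removes the first symbol of the reversed phrase, which is what makes it a suffix link. The array `L` of first symbols comes out sorted.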
Suffix Links in RevTrie • Example query: R[11] = ?
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
φ(i)  0  2  0  9 14 16  2  3 14  0  2 14  5 17  2  6  0  2
L        $  a  a  a  a  b  b  d  l  l  l  p  p  r  r  _  _
Suffix Links in RevTrie • We can compute R using φ • But what is the difference in space requirement? (both R and φ require, in principle, n log n bits) • We can prove the following lemma for function φ
Suffix Links in RevTrie • We replace the n log n-bit representation of R by a representation of φ requiring nH_0(lets) + O(n log log σ) + O(σ log σ) + n + o(n) bits • To compute R in O(1/ε) time we store εn values of R, requiring εn log n extra bits • R^{-1} can be computed in O(1/ε^2) time
Suffix Links in RevTrie • Yes, we can reduce the space requirement of LZ-index to (1+ε)uH_k(T) + o(u log σ) bits
Suffix Links in RevTrie • We can add Range to get worst-case guarantees in the search process, requiring n log n extra bits • Yes, we can reduce the space requirement of LZ-index to (2+ε)uH_k(T) + o(u log σ) bits, retaining worst-case guarantees at search time
xbw LZ-index • The xbw transform [Ferragina et al., 2005] is a succinct tree representation requiring 2n log σ + O(n) bits and allowing operations: • parent (O(1) time) • child(x, i) (O(1) time) • child(x, a) (O(1) time) • Subpath queries (O(m) time) • As we can perform prefix and suffix searching, we can do the work of both LZTrie and RevTrie only with xbw! • [Figure: subpath search with string P]
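xbw LZ-index • [Figure: xbw LZTrie (arrays S_last and S_α), indexed by xbw positions; balanced parentheses LZTrie, e.g. (()()())()(()())(()), with ids, indexed by preorder positions; Pos and Pos^{-1} map between xbw positions and preorder positions; Range] • In principle: (3+ε)uH_k(T) + o(u log σ) bits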
xbw LZ-index • [Figure: balanced parentheses LZTrie (()()())()(()())(()) with ids; xbw LZTrie with S_last and S_α; Pos[i] obtained from the sampled array Pos' and φ] • We store one out of O(1/ε) values of Pos • Space: (2+ε)uH_k(T) + o(u log σ) bits
xbw LZ-index • Occurrences of Type 1: using the xbw (subpath search with P^r), and then mapping to the parentheses LZTrie • Occurrences of Type 2: subpath search for P1^r and search (using child from the root) for P2; then use the corresponding xbw and preorder ranges to search in Range • Occurrences of Type 3: mostly as with the original LZ-index • Occurrences of Type 2 can be solved as Occurrences of Type 3 (we don’t need Range!) • We have achieved Theorems 1 and 2 with radically different means!
Displaying text substrings • The approach of [Sadakane and Grossi, 2006] to display any text substring of length Θ(log_σ u) in constant time can be adapted to our indexes
Conclusions • We have studied the reduction of the space requirement of LZ-index • Two different approaches: the navigational scheme, and xbw + bp (balanced parentheses) LZTrie • In either case we achieve (2+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u) • The search time is improved to O(m^2 log m + (m+occ) log n) (worst case)
Conclusions • We also define indexes requiring (1+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u) • O(m^2) average-case search time if m ≥ 2 log_σ n • The time to display a context of length l around any text position is also improved to the optimal O(l / log_σ u) • We also remove some restrictions of the original LZ-index (see the paper)
Questions? Contact darroyue@dcc.uchile.cl
Thanks! Contact darroyue@dcc.uchile.cl