Compressed Index for a Dynamic Collection of Texts
H.W. Chan, W.K. Hon, T.W. Lam
The University of Hong Kong
Problem Definition
• Given a collection $L = \{T_1, T_2, \ldots, T_k\}$ of texts of total length n over an alphabet Σ
• We want to build an index for L such that, given any pattern P, the occurrences of P in each $T_i$ can be found quickly
• The index should also support fast insertion of a text $T_i$ into L and fast deletion of $T_i$ from L
Two Basic Tools: CSA, FM-index
• Definition 1: The main component of the CSA for a text T is a function Ψ such that $\Psi[i] = SA^{-1}[SA[i] + 1]$, where SA[i] is the i-th entry of the suffix array and $SA^{-1}$ is the inverse of SA
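Definition 1 can be made concrete with a small sketch. It is not the authors' construction: it builds the suffix array by naive sorting, uses 0-based indices (the slide is 1-based), and assumes the text ends with a unique smallest sentinel `$`, with the position after the last character wrapping around to 0. The names `suffix_array` and `psi_function` are illustrative.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_function(text):
    """Return (sa, psi) with psi[i] = SA^{-1}[SA[i] + 1], 0-based.

    Assumption: text ends with a unique smallest sentinel, and the
    position after the last character wraps around to position 0.
    """
    n = len(text)
    sa = suffix_array(text)
    inverse = [0] * n                  # inverse[p] = rank of the suffix starting at p
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    psi = [inverse[(sa[i] + 1) % n] for i in range(n)]
    return sa, psi
```

For `banana$` this gives SA = [6, 5, 3, 1, 0, 4, 2] and Ψ = [4, 0, 5, 6, 3, 1, 2]. Within the ranks of suffixes that start with the same character (ranks 1 to 3 for `a`, ranks 5 to 6 for `n`), the Ψ values are increasing.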
Two Basic Tools: CSA, FM-index
• Definition 2: The FM-index of T is based on the Burrows-Wheeler array of T, an array of characters denoted by BWT such that $BWT[i] = T[SA[i] - 1]$. The main component of the FM-index is a set of |Σ| functions $count_c$, one for every $c \in \Sigma$, such that $count_c[i]$ = number of occurrences of c in BWT[1…i]
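A minimal sketch of Definition 2, under the same assumptions as a textbook presentation rather than the authors' code: naive suffix sorting, 0-based indices, and a unique smallest sentinel `$` at the end of the text. When SA[i] is 0, the preceding character is taken cyclically to be the sentinel. The name `bwt_and_counts` is illustrative, and `counts[c][i]` here covers the 0-based prefix `bwt[0..i]` inclusive.

```python
def bwt_and_counts(text):
    """Return (bwt, counts) where counts[c][i] = number of c in bwt[0..i]."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    # text[p - 1] with p == 0 is text[-1], i.e. the sentinel (cyclic convention).
    bwt = "".join(text[p - 1] for p in sa)

    alphabet = sorted(set(text))
    counts = {c: [] for c in alphabet}
    running = {c: 0 for c in alphabet}
    for ch in bwt:
        running[ch] += 1
        for c in alphabet:
            counts[c].append(running[c])   # each count_c is non-decreasing
    return bwt, counts
```

For `banana$` the BWT is `annb$aa`, and the counts for `a` are [1, 1, 1, 1, 1, 2, 3]. Storing one full array per character as above takes far more space than a compressed index would; the sketch only illustrates the definition.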
Our Index
• Our index is a dynamic version of CSA + FM-index for the concatenated text $T_1 T_2 \ldots T_k$
• We exploit a property shared by Ψ and count: each is essentially made up of a few sequences of increasing values
Our Index
• Maintaining a dynamic CSA and FM-index reduces to maintaining dynamic sequences of increasing values
• Observation 3: A balanced search tree is well suited to a dynamic sequence
• Observation 4: Difference encoding of increasing values can save space
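Observation 4 can be illustrated with a short sketch of difference encoding. The Elias gamma code length used to estimate the space is an assumption chosen for illustration (one common variable-length code); the slides do not say which code the index uses. The function names are illustrative.

```python
from itertools import accumulate


def delta_encode(values):
    """Replace a non-decreasing sequence by its gaps (first gap is taken from 0)."""
    gaps, prev = [], 0
    for v in values:
        if v < prev:
            raise ValueError("sequence must be non-decreasing and non-negative")
        gaps.append(v - prev)
        prev = v
    return gaps


def delta_decode(gaps):
    """Recover the original values as running sums of the gaps."""
    return list(accumulate(gaps))


def gamma_code_length(x):
    """Bits used by the Elias gamma code for an integer x >= 1 (illustrative choice)."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    return 2 * x.bit_length() - 1
```

For the values [3, 7, 8, 15, 16] the gaps are [3, 4, 1, 7, 1], which need 15 bits in total under the gamma code. A sequence with zero gaps (such as a count function) would need each gap shifted by one before gamma coding.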
Our Index
• Combining Observations 3 and 4 gives a Differential Balanced Search Tree, which handles the values in the dynamic CSA and FM-index
• Drawback: computing Ψ and count is slowed down by an O(log n) factor
• Pattern matching: $O(|P| \log n + occ \log^2 n)$ time
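The idea of combining a search tree with difference encoding can be sketched as follows. This is not the authors' data structure: it substitutes a treap (balanced only in expectation) for the balanced search tree, stores one gap per node rather than compressed blocks, and supports insertion only. Each node keeps its subtree size and subtree gap sum, so a value is recovered by summing gaps along one root-to-node path, which is where an O(log n) slowdown comes from. All names (`Node`, `DeltaSequence`, and so on) are hypothetical.

```python
import random


class Node:
    """Treap node holding one gap of the difference-encoded sequence."""
    def __init__(self, gap):
        self.gap = gap                # value minus its predecessor (first value minus 0)
        self.prio = random.random()   # random heap priority (expected balance)
        self.left = self.right = None
        self.size = 1                 # number of nodes in this subtree
        self.total = gap              # sum of gaps in this subtree


def _size(t):
    return t.size if t is not None else 0


def _total(t):
    return t.total if t is not None else 0


def _pull(t):
    """Recompute the size and gap sum of t from its children."""
    t.size = 1 + _size(t.left) + _size(t.right)
    t.total = t.gap + _total(t.left) + _total(t.right)
    return t


def _merge(a, b):
    """Concatenate treaps a and b (all of a precedes all of b)."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = _merge(a.right, b)
        return _pull(a)
    b.left = _merge(a, b.left)
    return _pull(b)


def _split(t, k):
    """Split t into (first k nodes, remaining nodes)."""
    if t is None:
        return None, None
    if k <= _size(t.left):
        a, b = _split(t.left, k)
        t.left = b
        return a, _pull(t)
    a, b = _split(t.right, k - _size(t.left) - 1)
    t.right = a
    return _pull(t), b


class DeltaSequence:
    """Non-decreasing sequence of non-negative integers stored as gaps."""
    def __init__(self):
        self.root = None

    def __len__(self):
        return _size(self.root)

    def value_at(self, i):
        """i-th smallest value (0-based): sum the gaps on one downward path."""
        t, acc = self.root, 0
        while t is not None:
            if i < _size(t.left):
                t = t.left
            elif i == _size(t.left):
                return acc + _total(t.left) + t.gap
            else:
                acc += _total(t.left) + t.gap
                i -= _size(t.left) + 1
                t = t.right
        raise IndexError("index out of range")

    def rank(self, v):
        """Number of stored values that are <= v."""
        t, acc, count = self.root, 0, 0
        while t is not None:
            here = acc + _total(t.left) + t.gap   # value stored at node t
            if here <= v:
                count += _size(t.left) + 1
                acc, t = here, t.right
            else:
                t = t.left
        return count

    def insert(self, v):
        """Insert v: add one gap and shrink the successor's gap accordingly."""
        k = self.rank(v)
        prev = self.value_at(k - 1) if k > 0 else 0
        left, right = _split(self.root, k)
        if right is not None:
            first, rest = _split(right, 1)
            first.gap -= v - prev                 # successor is now measured from v
            right = _merge(_pull(first), rest)
        self.root = _merge(_merge(left, Node(v - prev)), right)
```

Inserting 15, 3, 8, 16, 7 in that order and reading the values back with `value_at` yields [3, 7, 8, 15, 16]. Only the gap of the inserted value and the gap of its successor change, which is the property that makes difference encoding compatible with updates.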
Insertion & Deletion (sketch idea)
• Insertion corresponds to finding update points in the increasing sequences of Ψ and count
• To insert a text T into L, there are O(|T|) such update points
• Update points can be found by simulating a pattern matching query of T against L
• Total time: O(|T| log n)
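The pattern matching query that insertion simulates is, in a static FM-index, the standard backward search. The sketch below shows only that static search on a single text, as background; it does not perform insertion and is not the authors' dynamic structure. It assumes a unique smallest sentinel `$`, 0-based indices, and plain arrays for the count functions. The names `build_fm`, `first` and `occ` are illustrative.

```python
def build_fm(text):
    """Build (sa, first, occ) for a text ending with a unique smallest sentinel."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[p - 1] for p in sa)
    alphabet = sorted(set(text))

    # first[c] = number of characters in text that are strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)

    # occ[c][i] = number of occurrences of c in bwt[0:i] (prefix of length i)
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def backward_search(pattern, first, occ, n):
    """Return the half-open suffix array range [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, n
    for c in reversed(pattern):           # process the pattern from its last character
        if c not in first:
            return 0, 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0                   # empty range: no occurrence
    return lo, hi
```

For `banana$` and the pattern `ana`, the search returns the range [2, 4), whose suffix array entries are the text positions 3 and 1. Each step uses two count lookups, so if a count lookup costs O(log n), matching a pattern P costs O(|P| log n) before occurrences are reported.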
Insertion & Deletion (sketch idea)
• Deletion reverses the insertion process
• Update points can be found by querying Ψ iteratively, instead of simulating a pattern matching query
• Total time: O(|T| log n)
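Querying Ψ iteratively can be pictured with a small sketch: starting from the rank of a text's first suffix, each application of Ψ moves to the rank of the next shorter suffix, so |T| steps visit every rank belonging to that text. The sketch only enumerates those ranks on a static, single-text example; it assumes the starting rank is already known and does not remove anything or adjust ranks after a removal. The name `ranks_via_psi` is illustrative.

```python
def ranks_via_psi(psi, start_rank, length):
    """Follow psi from start_rank for `length` steps, collecting the visited ranks."""
    ranks = []
    r = start_rank
    for _ in range(length):
        ranks.append(r)
        r = psi[r]                 # rank of the suffix that starts one position later
    return ranks


def build_psi(text):
    """Naive psi for a text ending with a unique smallest sentinel (wraps at the end)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return inverse, [inverse[(sa[i] + 1) % n] for i in range(n)]
```

For `banana$`, the whole text has rank 4, and following Ψ from there visits the ranks [4, 3, 6, 2, 5, 1, 0], which are the ranks of the suffixes starting at positions 0 through 6.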
Conclusion, Progress & Future Work
• In the literature, there is a dual problem called Dictionary Management, which maintains a collection of patterns such that, when a text T is given later, all occurrences of each pattern in T are reported in one query. Fast insertion/deletion of patterns is also required
• O(n) bits: some progress …
Conclusion, Progress & Future Work
• There is another problem called Dynamic Text, which maintains a single text T, and when a pattern P is given later, it supports finding all occurrences of P in T. The text T is subject to insertion/deletion of substrings
• O(n log n) bits: Sahinalp & Vishkin, FOCS’96
• O(n) bits: ??