Compressed Index for a Dynamic Collection of Texts

H.W. Chan, W.K. Hon, T.W. Lam • The University of Hong Kong

Problem Definition • Given a collection L = {T1, T2, …, Tk} of texts of total length n over an alphabet Σ • We want to create an index for L such that, given any pattern P, the occurrences of P in each Ti can be found quickly • The index should also support fast insertion of a text Ti into L and deletion of a text Ti from L

Previous Work & Our Result

Two Basic Tools: CSA, FM-index • Definition 1: The main component of the CSA for a text T is a function Ψ such that Ψ[i] = SA^{-1}[SA[i] + 1], where SA[i] is the i-th entry of the suffix array and SA^{-1} is the inverse of SA
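To make Definition 1 concrete, here is a minimal sketch that builds SA, its inverse, and Ψ for a small text. It is only an illustration: the suffix array is built by naive sorting, indices are 0-based (the slide uses 1-based), the text is assumed to end with a unique terminator "$", and the wrap-around used for the last suffix is a common convention assumed here, not something the slide states. The function names are hypothetical.

```python
def suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a demo only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_permutation(sa):
    # inv[p] = rank of the suffix starting at text position p.
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return inv


def psi_array(text):
    # 0-based version of Psi[i] = SA^-1[SA[i] + 1].
    # Assumption: the suffix at the last position wraps around to position 0.
    sa = suffix_array(text)
    inv = inverse_permutation(sa)
    n = len(text)
    return [inv[(sa[i] + 1) % n] for i in range(n)]


if __name__ == "__main__":
    print(suffix_array("banana$"))
    print(psi_array("banana$"))
```

Each entry Ψ[i] is the rank of the suffix obtained by dropping the first character of the rank-i suffix, so following Ψ walks through the text from left to right.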

Two Basic Tools: CSA, FM-index • Definition 2: The FM-index of T is based on the Burrows-Wheeler array of T, an array of characters denoted by BWT, such that BWT[i] = T[SA[i] - 1]. The main component of the FM-index is a set of |Σ| functions count_c, one for every c ∈ Σ, such that count_c[i] = number of occurrences of c in BWT[1…i]
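Definition 2 can be sketched in a few lines. The code below is an illustration under stated assumptions: 0-based indices, a text ending in a unique terminator "$", a naive suffix array, and a wrap-around for the suffix that starts at position 0 (the slide does not say how that case is handled). The count tables are stored explicitly, which is not space-efficient; the helper names are hypothetical.

```python
def suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a demo only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    # BWT[i] = T[SA[i] - 1]; for SA[i] == 0, Python's text[-1] gives the
    # last character, which implements the assumed wrap-around.
    return "".join(text[p - 1] for p in suffix_array(text))


def count_tables(bwt_str):
    # tables[c][i] = number of occurrences of c among the first i
    # characters of the BWT (so tables[c][0] == 0).
    tables = {c: [0] for c in sorted(set(bwt_str))}
    for ch in bwt_str:
        for c, column in tables.items():
            column.append(column[-1] + (1 if c == ch else 0))
    return tables


if __name__ == "__main__":
    b = bwt("banana$")
    print(b)
    print(count_tables(b)["a"])
```

Every count table is a non-decreasing sequence, which is the property the index relies on later.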

Our Index • Our index is a dynamic version of CSA + FM-index for the concatenated text T1T2…Tk • We exploit the property that both Ψ and count essentially consist of a small number of sequences of increasing values.
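The increasing-sequence property of Ψ can be checked on a small example. The sketch below groups the Ψ values by the first character of the corresponding suffix; within each group the values increase, because suffixes that share a first character are ordered by the rest of the suffix. It assumes a single text with a unique terminator "$", 0-based indices, and a wrap-around for the last suffix; the function name is hypothetical.

```python
def psi_runs_by_first_char(text):
    n = len(text)
    # Naive suffix array and its inverse (demo only).
    sa = sorted(range(n), key=lambda i: text[i:])
    inv = [0] * n
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    # Psi in 0-based form; the modulo is an assumed wrap-around convention.
    psi = [inv[(p + 1) % n] for p in sa]
    # Collect Psi values per first character, in rank order.
    runs = {}
    for rank, pos in enumerate(sa):
        runs.setdefault(text[pos], []).append(psi[rank])
    return runs


if __name__ == "__main__":
    for ch, run in psi_runs_by_first_char("banana$").items():
        print(repr(ch), run, "increasing:", run == sorted(run))
```

So Ψ splits into at most |Σ| increasing sequences, one per character.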

Our Index • Maintaining a dynamic CSA and FM-index reduces to maintaining dynamic sequences of increasing values • Observation 3: A balanced search tree is good for a dynamic sequence • Observation 4: Difference encoding of increasing values can save space
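Observation 4 can be illustrated with a short sketch: an increasing sequence is stored as the gaps between consecutive values, and the values are recovered by prefix sums. The bit count below is only a rough indicator (it ignores the overhead of delimiting variable-length codes), and all names are hypothetical.

```python
from itertools import accumulate


def to_gaps(values):
    # Store each value as the difference from its predecessor
    # (the first value is stored relative to 0).
    gaps, prev = [], 0
    for v in values:
        gaps.append(v - prev)
        prev = v
    return gaps


def from_gaps(gaps):
    # Prefix sums of the gaps give back the original values.
    return list(accumulate(gaps))


def bit_cost(numbers):
    # Rough cost: sum of binary lengths, at least 1 bit per number.
    return sum(max(1, x.bit_length()) for x in numbers)


if __name__ == "__main__":
    values = [1000, 1003, 1004, 1010]
    gaps = to_gaps(values)
    print(gaps, bit_cost(values), bit_cost(gaps))
```

When the values are dense, the gaps are small and need far fewer bits than the values themselves.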

Our Index • Combining Observations 3 and 4 gives a Differential Balanced Search Tree that handles the values in the dynamic CSA and FM-index • Drawback: computation of Ψ and count is slowed down by an O(log n) factor • Pattern matching: O(|P| log n + occ log^2 n) time, where occ is the number of occurrences of P
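The idea of combining a search tree with difference encoding can be sketched as follows. This is not the authors' data structure: a treap (a randomized search tree with expected logarithmic depth) stands in for the balanced search tree, each node holds a single gap rather than a block of encoded gaps, and all names are hypothetical. Each node also keeps the sum of the gaps in its subtree, so the i-th value is recovered as a prefix sum in one root-to-node walk.

```python
import random


class Node:
    def __init__(self, gap):
        self.gap = gap                 # difference to the previous value
        self.prio = random.random()    # treap priority
        self.left = self.right = None
        self.size = 1                  # number of nodes in the subtree
        self.total = gap               # sum of gaps in the subtree


def size(n):
    return n.size if n else 0


def total(n):
    return n.total if n else 0


def pull(n):
    n.size = 1 + size(n.left) + size(n.right)
    n.total = n.gap + total(n.left) + total(n.right)
    return n


def merge(a, b):
    if a is None or b is None:
        return a or b
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return pull(a)
    b.left = merge(a, b.left)
    return pull(b)


def split(n, v):
    # Left part: elements whose value (prefix sum of gaps) is < v.
    if n is None:
        return None, None
    here = total(n.left) + n.gap
    if here < v:
        l, r = split(n.right, v - here)
        n.right = l
        return pull(n), r
    l, r = split(n.left, v)
    n.left = r
    return l, pull(n)


def insert(root, v):
    left, right = split(root, v)
    d = v - total(left)                # gap from the predecessor to v
    node = right
    while node is not None:            # successor's gap shrinks by d
        node.total -= d
        if node.left is None:
            node.gap -= d
            break
        node = node.left
    return merge(merge(left, Node(d)), right)


def value_at(root, i):
    # i-th smallest value (0-based), recovered as a prefix sum of gaps.
    acc, node = 0, root
    while node is not None:
        ls = size(node.left)
        if i < ls:
            node = node.left
        elif i == ls:
            return acc + total(node.left) + node.gap
        else:
            acc += total(node.left) + node.gap
            i -= ls + 1
            node = node.right
    raise IndexError(i)


if __name__ == "__main__":
    root = None
    for v in [10, 4, 7, 25, 1]:
        root = insert(root, v)
    print([value_at(root, i) for i in range(size(root))])
```

Reading one value costs a tree descent instead of an array lookup, which matches the O(log n) slowdown the slide lists as a drawback.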

Insertion & Deletion (sketch idea) • Insertion corresponds to finding update points in the increasing sequences of Ψ and count • To insert a text T into L, there are O(|T|) such update points • Update points can be found by simulating a pattern matching query of T against L • Total time: O(|T| log n)
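The pattern matching query mentioned above is, in the static FM-index setting, a backward search over the count tables. The sketch below shows that static procedure only; it is not the dynamic insertion algorithm, and it assumes a single text ending in a unique "$", 0-based half-open rank ranges, and explicitly stored tables. The names fm_index, first and backward_search are hypothetical.

```python
def fm_index(text):
    # Naive suffix array and BWT (demo only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[p - 1] for p in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that are smaller than c.
    first, seen = {}, 0
    for c in alphabet:
        first[c] = seen
        seen += text.count(c)
    # count[c][i] = occurrences of c among the first i BWT characters.
    count = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            count[c].append(count[c][-1] + (1 if c == ch else 0))
    return sa, first, count


def backward_search(pattern, first, count, n):
    # Process the pattern right to left, narrowing the range [lo, hi)
    # of suffix ranks whose suffixes start with the processed part.
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0, 0
        lo = first[ch] + count[ch][lo]
        hi = first[ch] + count[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


if __name__ == "__main__":
    text = "banana$"
    sa, first, count = fm_index(text)
    lo, hi = backward_search("ana", first, count, len(text))
    print(lo, hi, sorted(sa[lo:hi]))
```

Each step performs two count lookups, so a search for P makes O(|P|) lookups; in a dynamic index each lookup would go through the tree rather than a plain array.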

Insertion & Deletion (sketch idea) • Deletion reverses the insertion process • Update points can be found by querying Ψ iteratively, instead of simulating a pattern matching query • Total time: O(|T| log n)
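To see why iterating Ψ is enough to locate every suffix of a text, consider this small sketch on a single static text. Starting from the rank of the suffix at position 0 and applying Ψ repeatedly visits the ranks of all suffixes in text order. It assumes the starting rank is known, a unique terminator "$", 0-based indices, and a wrap-around Ψ; it does not perform any deletion, and the function name is hypothetical.

```python
def ranks_in_text_order(text):
    n = len(text)
    # Naive suffix array and its inverse (demo only).
    sa = sorted(range(n), key=lambda i: text[i:])
    inv = [0] * n
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    psi = [inv[(p + 1) % n] for p in sa]
    # Start at the rank of the whole text and follow Psi n - 1 times.
    rank = inv[0]
    visited = [rank]
    for _ in range(n - 1):
        rank = psi[rank]
        visited.append(rank)
    return sa, visited


if __name__ == "__main__":
    sa, visited = ranks_in_text_order("banana$")
    print(visited)
    print([sa[r] for r in visited])
```

The visited ranks are the positions in the sorted order that belong to the text, which is the kind of information a deletion needs.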

Conclusion, Progress & Future Work • In the literature, there is a dual problem called Dictionary Management, which maintains a collection of patterns such that, when a text T is given later, all occurrences of each pattern in T are reported in one query. Fast insertion/deletion of patterns is also required • O(n) bits: some progress …

Conclusion, Progress & Future Work • There is another problem called Dynamic Text, which maintains a single text T and, when a pattern P is given later, supports finding all occurrences of P in T. The text T is subject to insertion/deletion of substrings. • O(n log n) bits: Sahinalp & Vishkin, FOCS’96 • O(n) bits: ??
