Succinct Representations of Dynamic Strings
Meng He and J. Ian Munro, University of Waterloo
Background: Succinct Data Structures
• What are succinct data structures? (Jacobson 1989)
  • Representing data structures using, ideally, the information-theoretic minimum space
  • Supporting efficient navigational operations
• Why succinct data structures?
  • Large data sets in modern applications: textual, genomic, spatial or geometric
Strings: Definitions
• Notation
  • Alphabet: [σ] = {1, 2, …, σ}
  • String: S[1..n]
• Operations
  • access(i): S[i]
  • rank(α, i): number of occurrences of α in S[1..i]
  • select(α, i): position of the i-th occurrence of α in S
Strings: An Example
• S = a a b a c c c d a d d a b b b c
• string_access(8) = d
• string_rank(a, 8) = 3
• string_select(b, 3) = 14
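The three queries can be checked against the example with a naive linear-time sketch. This only illustrates the definitions; it is not the succinct structure of the talk. The function names are illustrative, and positions are 1-based as on the slides.

```python
def access(s, i):
    """Return S[i], using the slides' 1-based positions."""
    return s[i - 1]


def rank(s, alpha, i):
    """Count the occurrences of alpha in S[1..i]."""
    return s[:i].count(alpha)


def select(s, alpha, i):
    """Return the 1-based position of the i-th occurrence of alpha, or None."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == alpha:
            seen += 1
            if seen == i:
                return pos
    return None


S = "aabacccdaddabbbc"  # the example string from the slide
print(access(S, 8), rank(S, "a", 8), select(S, "b", 3))  # d 3 14
```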
Succinct Representations of Strings
• Information-theoretic minimum: n lg σ bits
• Succinct representation (Grossi et al. 2003)
  • Space: $nH_0 + o(n)\cdot\lg\sigma$ bits
  • Time: O(lg σ)
• There are many more results.
• The case σ = 2 (bit vector) is even more fundamental!
  • Jacobson 1989
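To make the two space terms concrete, the sketch below compares n lg σ with n·H_0 for the example string from the previous slide, where H_0 is the zeroth-order empirical entropy. Taking σ to be the number of distinct characters in the string is an assumption made for illustration, as is the helper name.

```python
import math
from collections import Counter


def zero_order_entropy(s):
    """Zeroth-order empirical entropy H_0 of s, in bits per character."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


S = "aabacccdaddabbbc"
n = len(S)
sigma = len(set(S))                      # assumption: alphabet = characters that occur

plain_bits = n * math.log2(sigma)        # the n lg(sigma) term
entropy_bits = n * zero_order_entropy(S) # the n H_0 term

print(plain_bits, entropy_bits)
```

Because the four characters occur with slightly different frequencies, n·H_0 comes out a little below n lg σ; the gap grows as the distribution becomes more skewed.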
Applications of Strings and Bit Vectors
• Ordinal trees on n nodes
  • Standard approach: 3n lg n bits
  • Succinct data structures: 2n + o(n) bits (Jacobson 1989, Munro & Raman 1997, Benoit et al. 1999, …)
• Full-text indexes for a text string from $[\sigma]^n$
  • Suffix trees can use as much as 4n lg n to 6n lg n bits!
  • Succinct data structures: n lg σ + o(n lg σ) bits (Grossi et al. 2003, González and Navarro 2009, …)
• Labeled trees, planar graphs, binary relations, permutations, functions, …
Our Problem: Dynamic Strings
• Motivation: in many applications, data are also updated frequently
• For strings, we also consider the following update operations:
  • insert(α, i): inserts character α between S[i-1] and S[i]
  • delete(i): deletes S[i] from S
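The semantics of the two updates can be pinned down with a list-backed sketch. It is a reference model only, with linear-time operations; the class name is hypothetical and positions are 1-based as on the slides.

```python
class NaiveDynamicString:
    """Reference model of a dynamic string; every operation is O(n)."""

    def __init__(self, s=""):
        self.chars = list(s)

    def insert(self, alpha, i):
        """Insert alpha between S[i-1] and S[i]; afterwards S[i] == alpha."""
        self.chars.insert(i - 1, alpha)

    def delete(self, i):
        """Delete S[i]."""
        del self.chars[i - 1]

    def access(self, i):
        return self.chars[i - 1]

    def rank(self, alpha, i):
        return self.chars[:i].count(alpha)

    def __str__(self):
        return "".join(self.chars)


s = NaiveDynamicString("aabac")
s.insert("d", 3)       # string becomes "aadbac"
print(s, s.access(3), s.rank("a", 4))
s.delete(3)            # back to "aabac"
print(s)
```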
Comparisons
• [Table comparing operation times. Bounds listed: $O\left(\lg n\left(\frac{\lg\sigma}{\lg\lg n}+1\right)\right)$ and $O\left(\frac{\lg n}{\lg\lg n}\left(\frac{\lg\sigma}{\lg\lg n}+1\right)\right)$, with an entry marked amortized, and $\frac{\lg n}{\lg\lg n}$.]
• For the special cases in which σ = polylog(n) or σ = 2 (bit vector!), our results also improve previous results
Searchable Partial Sums
• Data
  • A sequence Q of n nonnegative integers
• Operations
  • sum(i): Q[1] + Q[2] + … + Q[i]
  • search(x): the smallest i such that sum(i) ≥ x
  • update(i, δ): Q[i] ← Q[i] + δ
• Raman et al. 2001
  • Assumptions: $|Q| = O(\lg^{\varepsilon} n)$, |δ| ≤ lg n
  • Space: $O(\lg^{1+\varepsilon} n)$ bits, with a universal table of size $O(n^{\varepsilon'})$ bits
  • Operations: O(1) time
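The three operations can be sketched with a Fenwick (binary indexed) tree. This is a swapped-in textbook technique with O(lg n) time per operation, not the constant-time table-lookup structure of Raman et al.; the class name is hypothetical, and entries are assumed to stay nonnegative after every update.

```python
class FenwickPartialSums:
    """Searchable partial sums over nonnegative integers, 1-based positions."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values, start=1):
            self.update(i, v)

    def update(self, i, delta):
        """Q[i] <- Q[i] + delta (caller keeps Q[i] nonnegative)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def sum(self, i):
        """Q[1] + ... + Q[i]."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, x):
        """Smallest i with sum(i) >= x, or None if x exceeds the total."""
        pos, remaining = 0, x
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            # Descend while the prefix ending at nxt is still strictly below x.
            if nxt <= self.n and self.tree[nxt] < remaining:
                pos = nxt
                remaining -= self.tree[nxt]
            step >>= 1
        return pos + 1 if pos + 1 <= self.n else None


q = FenwickPartialSums([3, 0, 5, 2])
print(q.sum(3), q.search(4))   # 8 3
q.update(2, 4)                 # Q becomes [3, 4, 5, 2]
print(q.search(4))             # 2
```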
Collections of Searchable Partial Sums
• Data
  • d sequences of k-bit nonnegative integers, each of length n
• Operations
  • sum, search, update: supported on each sequence
  • insert, delete: applied simultaneously at the same position of all the sequences, but only 0's can be inserted or deleted
• González and Navarro 2009 (CSPSI)
• Example (figure): 8 2 9 5 11 9 0 7 3 6 1 5 3 12 4 0 0 0 5 12 0 3 1 19 0 4 2 8 3 5 4 1 0; sum(2, 5) = 25; insert(6); delete(6)
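A list-of-lists sketch shows how the CSPSI operations interact. It is a naive reference model with linear-time operations. It assumes the first argument of sum, search and update selects the sequence, as sum(2, 5) on the slide suggests. The class name and the sample data are made up for illustration.

```python
class NaiveCSPSI:
    """d equal-length sequences of nonnegative integers, 1-based indices."""

    def __init__(self, sequences):
        if len({len(seq) for seq in sequences}) > 1:
            raise ValueError("all sequences must have the same length")
        self.seqs = [list(seq) for seq in sequences]

    def sum(self, j, i):
        """Sum of the first i entries of sequence j."""
        return sum(self.seqs[j - 1][:i])

    def search(self, j, x):
        """Smallest i with sum(j, i) >= x, or None."""
        running = 0
        for pos, v in enumerate(self.seqs[j - 1], start=1):
            running += v
            if running >= x:
                return pos
        return None

    def update(self, j, i, delta):
        self.seqs[j - 1][i - 1] += delta

    def insert(self, i):
        """Insert a 0 at position i of every sequence."""
        for seq in self.seqs:
            seq.insert(i - 1, 0)

    def delete(self, i):
        """Delete position i of every sequence; all entries there must be 0."""
        if any(seq[i - 1] != 0 for seq in self.seqs):
            raise ValueError("only 0's can be deleted")
        for seq in self.seqs:
            del seq[i - 1]


c = NaiveCSPSI([[2, 0, 7], [1, 5, 3]])   # hypothetical data
c.insert(2)                              # rows: [2, 0, 0, 7] and [1, 0, 5, 3]
print(c.sum(2, 3), c.search(1, 3))
```

Restricting insert and delete to zeros keeps every prefix sum to the right of the position unchanged, so a structure maintaining the sums only has to shift positions.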
Our Results on CSPSI
• Assumptions
  • $d = O(\lg^{\eta} n)$
  • |δ| ≤ lg n
• Space
  • O(kdn + w) bits, where w is the word size
  • Buffer: O(n lg n) bits
• Time
  • All operations: $O\left(\frac{\lg n}{\lg\lg n}\right)$
Data Structures for Dynamic Strings over a Small Alphabet of Size $O(\lg^{1/2} n)$
• Main data structure: a B-tree constructed over S
• Leaf
  • Each leaf stores a superblock of at most 2L bits, which encodes a substring of S ($L = \frac{\lg^2 n}{\lg\lg n}$)
  • For each character, its numbers of occurrences in the successive superblocks form an integer sequence
  • Maintain these sequences, one per character of the alphabet, in a CSPSI structure E
• Internal node v ($\lg^{1/2} n \le \mathrm{degree}(v) \le 2\lg^{1/2} n$)
  • U(v): U(v)[i] = number of leaves of the subtree rooted at the i-th child of v
  • I(v): I(v)[i] = number of characters stored in the subtree rooted at the i-th child of v
Supporting Queries
• rank(α, i)
  • Perform a top-down traversal with the help of the I(v)'s
  • Locate the superblock, j, containing S[i] with the help of the U(v)'s
  • Perform a sum(α, j) operation on E to count the number of occurrences of α in superblocks 1, 2, …, j-1
  • Read superblock j in blocks of size (lg n)/2 bits
• The support for access and select is similar
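The shape of the rank query (count whole superblocks first, then scan inside one superblock) can be sketched in a static, flattened form. The B-tree traversal is replaced by integer division over fixed-length blocks, and the CSPSI structure E by per-character prefix-count arrays. Both are simplifying assumptions, as are the class name and the block length of 4.

```python
from itertools import accumulate


class BlockedRank:
    """Static two-level rank: per-block character counts plus an in-block scan."""

    def __init__(self, s, block_len=4):
        self.n = len(s)
        self.block_len = block_len
        self.blocks = [s[k:k + block_len] for k in range(0, len(s), block_len)]
        # prefix[a][j] = occurrences of a in blocks 0 .. j-1 (stand-in for E)
        self.prefix = {
            a: [0] + list(accumulate(b.count(a) for b in self.blocks))
            for a in set(s)
        }

    def rank(self, alpha, i):
        """Occurrences of alpha in S[1..i] (1-based, i clamped to [0, n])."""
        i = max(0, min(i, self.n))
        if i == 0 or alpha not in self.prefix:
            return 0
        j, offset = divmod(i - 1, self.block_len)   # block holding S[i]
        before = self.prefix[alpha][j]              # whole blocks preceding it
        inside = self.blocks[j][:offset + 1].count(alpha)
        return before + inside


r = BlockedRank("aabacccdaddabbbc")
print(r.rank("a", 8))   # 3
```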
Insert, Delete and Deamortization
• Supporting insert and delete requires traversing and updating the B-tree and updating E
• It is, however, much more complicated
  • Merging and splitting B-tree nodes
  • Deamortization
Succinct Global Rebuilding
• A key technique for deamortizing operations on B-trees is global rebuilding (Overmars and van Leeuwen 1981)
• Global rebuilding
  • Rebuild the B-tree once the number of update operations performed exceeds half the initial length of the string
  • A new copy and an old copy of the B-tree: more space
  • A buffer of O(n lg n) bits is required
• Succinct global rebuilding
  • Only one copy of the data: no duplication
  • During rebuilding, queries and updates are performed on either the new part or the old part
  • No buffer required
Putting Everything Together
• Dynamic strings over an alphabet of size $O(\lg^{1/2} n)$
  • Space: $nH_0 + o(n)\cdot\lg\sigma$ bits
  • Time: $O\left(\frac{\lg n}{\lg\lg n}\right)$
• This can be extended to general alphabets using wavelet trees
  • Space: $nH_0 + o(n)\cdot\lg\sigma$ bits
  • Time: $O\left(\frac{\lg n}{\lg\lg n}\left(\frac{\lg\sigma}{\lg\lg n}+1\right)\right)$
• When σ = polylog(n) or σ = 2 (bit vectors)
  • Space: $nH_0 + o(n)\cdot\lg\sigma$ bits
  • Time: $O\left(\frac{\lg n}{\lg\lg n}\right)$
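The wavelet-tree reduction can be illustrated with a small static sketch: a query over a large alphabet becomes one query per level on a sequence over a smaller alphabet. The version below is a plain binary wavelet tree whose bit vectors are Python lists with linear-time counting. It is an assumption-laden simplification: it is neither dynamic nor succinct, and its branching factor may differ from the one used in the talk. The class name is hypothetical.

```python
class WaveletTree:
    """Static binary wavelet tree supporting rank (1-based prefix lengths)."""

    def __init__(self, s, alphabet=None):
        self.alphabet = sorted(set(s)) if alphabet is None else alphabet
        self.n = len(s)
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            low = set(self.alphabet[:mid])
            # Bit 0: character goes to the left child; bit 1: to the right child.
            self.bits = [0 if ch in low else 1 for ch in s]
            self.left = WaveletTree([c for c in s if c in low], self.alphabet[:mid])
            self.right = WaveletTree([c for c in s if c not in low], self.alphabet[mid:])

    def rank(self, alpha, i):
        """Occurrences of alpha in S[1..i]."""
        if alpha not in self.alphabet:
            return 0
        i = max(0, min(i, self.n))
        if self.left is None:          # leaf: every character here equals alpha
            return i
        ones = sum(self.bits[:i])      # naive bit-vector rank
        mid = len(self.alphabet) // 2
        if alpha in self.alphabet[:mid]:
            return self.left.rank(alpha, i - ones)
        return self.right.rank(alpha, ones)


wt = WaveletTree("aabacccdaddabbbc")
print(wt.rank("a", 8))   # 3
```

Each level answers one rank on a bit vector, so replacing the naive `sum(self.bits[:i])` with a faster bit-vector structure speeds up every level at once.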
Applications
• Dynamic text collections
  • Data: a collection of text strings
  • Operations
    • Pattern search
    • Display a substring
    • Insert/delete a text string
• Compressed construction of full-text indexes
  • Working space: $nH_k + o(n)\cdot\lg\sigma$ bits
  • Time: $O\left(\frac{n\lg n}{\lg\lg n}\left(\frac{\lg\sigma}{\lg\lg n}+1\right)\right)$
Conclusions
• We designed a succinct representation of dynamic strings that provides more efficient operations than previous results
• This structure can be directly applied to improve previous results on text indexing
• We expect our results to play an important role in the design of dynamic succinct data structures
• We expect succinct global rebuilding to be useful for the deamortization of algorithms on dynamic succinct data structures