Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts

Sunho Lee and Kunsoo Park, Seoul National University

Contents • Introduction • Rank/select problem • Relations to compressed full-text indices • Dynamic rank-select structure • Extensions of the structure • For a large alphabet text • For a run-length encoded text

Rank-select problem • For a given text T over σ-size alphabet, our structures support: • rankT(c, i): gives the number of character c’s up to position i in T • selectT(c, k): gives the position of the k-th c • E.g. T=acabbc • rankT(‘a’, 5) = 2 • selectT(‘a’, 2) = 3

Rank-select problem • Our structures support additional update operations • insertT(c, i): inserts character c between T[i] and T[i+1] • deleteT(i): deletes T[i] from T • E.g. T = acabbc → aababc • rankT(‘a’, 5) = 2 → rankT(‘a’, 5) = 3 • selectT(‘a’, 2) = 3 → selectT(‘a’, 2) = 2
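The four operations can be pinned down with a naive reference version. This is a minimal sketch on plain Python strings with 1-based positions, taking O(n) time per call rather than the bounds discussed in these slides. The function names and the particular update sequence (one delete followed by one insert) are assumptions chosen only to reproduce the example strings.

```python
def rank(text, c, i):
    """Number of occurrences of c in text[1..i] (1-based, inclusive)."""
    return text[:i].count(c)


def select(text, c, k):
    """1-based position of the k-th occurrence of c, or None if there are fewer than k."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return None


def insert(text, c, i):
    """Insert c between text[i] and text[i+1] (1-based)."""
    return text[:i] + c + text[i:]


def delete(text, i):
    """Delete text[i] (1-based)."""
    return text[:i - 1] + text[i:]


T = "acabbc"
# One possible update sequence (an assumption) that turns acabbc into aababc.
T2 = insert(delete(T, 2), "a", 3)
```

Such a reference version is mainly useful as a test oracle for the block-based structures described later.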

Why rank-select problem? • In compressed full-text index • Rank-select structures are built on Burrows-Wheeler Transform (BWT) • Rank: backward search (Ferragina & Manzini) • Select: Psi-function in CSA (Grossi & Vitter) • Dynamic BWT • Index for a collection of texts (Chan, Hon & Lam) • Add or remove a text from the collection


Example of select on BWT • T = mississippi$ • Psi function: the order of the suffix at the next position • E.g. Psi[4] = 11, the order of ‘ssippi$’ • Duality between Psi-function and BWT (Hon, Sadakane & Sung) • BWT[i] = T[SA[i] - 1] • Psi[i] = selectBWT(C[i], i - F[C[i]]) • C[i] = T[SA[i]]: the first character of the i-th suffix • F[c]: the number of characters x < c in T
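The duality can be checked numerically on the example text. The sketch below builds the suffix array by naive sorting (an assumption made for brevity, not how a compressed index would do it), treats T as cyclic so that the suffix starting at position 1 has ‘$’ as its BWT character, and compares Psi computed through select on the BWT with Psi computed directly from the suffix array. All function names are illustrative.

```python
def select(text, c, k):
    """1-based position of the k-th occurrence of c in text, or None."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return None


def build_index(t):
    """Return (SA, BWT, C, F) for t, all conceptually 1-based."""
    n = len(t)
    # SA[i]: starting position of the i-th smallest suffix (naive sort).
    sa = sorted(range(1, n + 1), key=lambda p: t[p - 1:])
    # BWT[i] = T[SA[i] - 1]; index -1 wraps to the last character when SA[i] = 1.
    bwt = "".join(t[p - 2] for p in sa)
    # C[i] = T[SA[i]], the first character of the i-th suffix.
    first = [t[p - 1] for p in sa]
    # F[c]: number of characters in t that are smaller than c.
    F = {c: sum(1 for x in t if x < c) for c in set(t)}
    return sa, bwt, first, F


def psi_by_select(i, bwt, first, F):
    """Psi[i] = select_BWT(C[i], i - F[C[i]])."""
    c = first[i - 1]
    return select(bwt, c, i - F[c])


def psi_direct(i, sa):
    """Order of the suffix starting one position after SA[i] (cyclic)."""
    n = len(sa)
    order = {p: r for r, p in enumerate(sa, start=1)}
    return order[sa[i - 1] % n + 1]


sa, bwt, first, F = build_index("mississippi$")
```

With this index, `psi_by_select(4, bwt, first, F)` follows the suffix ‘issippi$’ to ‘ssippi$’.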

Our results • Dynamic rank-select on texts over a small alphabet (σ < log n) • Improve the binary-alphabet version by Makinen & Navarro • O(log n) time and nlogσ + o(nlogσ) bits • Dynamic rank-select for a large alphabet (σ < n) • Use wavelet trees to extend our small-alphabet structure • O(log n logσ / loglog n) time and nlogσ + o(nlogσ) bits • Application to RLE texts

Static rank-select

Dynamic rank-select

Dynamic rank-select preliminary • We assume the RAM model with: • Word size w = Θ(log n) bits • +, -, *, / and bitwise operations in O(1) time • We process a word-size text of Θ(log n / log σ) characters in O(1) time

Dynamic rank-select preliminary • Partition of text • Blocks of sizes from ½ log n words to 2 log n words • Bit vector representation, I • Gives block number b and offset r for position i • Employ binary rank-select by Makinen & Navarro: O(log n) time & O(n) bits • E.g. T = babc abab abca, I = 1000 1000 1000 • b = rankI(‘1’, 10) = 3 • r = 10 - selectI(‘1’, 3) + 1 = 2
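The mapping from a text position to a block number and an offset can be sketched as follows. Here I is a plain Python string of ‘0’/‘1’ characters and the binary rank and select are naive scans, which is an assumption for illustration; the slides use a dynamic binary structure for I instead. The name `locate` is hypothetical.

```python
def rank1(bits, i):
    """Number of '1's in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


def select1(bits, k):
    """1-based position of the k-th '1' in bits, or None."""
    ones = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            ones += 1
            if ones == k:
                return pos
    return None


def locate(I, i):
    """Return (block number b, offset r) of text position i."""
    b = rank1(I, i)               # a '1' marks the first character of each block
    r = i - select1(I, b) + 1     # offset inside block b, starting from 1
    return b, r


I = "1000" "1000" "1000"          # T = babc abab abca
```

For the example text, `locate(I, 10)` yields block 3 and offset 2.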

Dynamic rank-select preliminary • Over-block/in-block operation • rankT(c, i): • rank-overT(c, b): The number of c’s before the b-th block • rankTb(c, r): The number of c’s up to position r in Tb • E.g. T = babc abab abca, I = 1000 1000 1000 • rankT(‘a’, 10) = rank-overT(‘a’, 3) + rankT3(‘a’, 2) = 3 + 1 = 4

Dynamic rank-select preliminary • Over-block/in-block operation • selectT(c, k): • select-overT(c,k): The block number containing the k-th c • selectTb(c,k’): The offset of the k’-th c in Tb • Update operation • In-block update: change the text itself • Over-block update: change the statistics of the text
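The decomposition of rank and select into an over-block part and an in-block part can be sketched on a plain list of block strings. This is a simplified model under stated assumptions: blocks are Python strings, the over-block counts are recomputed by scanning instead of being read from a dedicated structure, and all function names are illustrative.

```python
def rank_over(blocks, c, b):
    """Number of c's in the blocks before the b-th block (1-based)."""
    return sum(blk.count(c) for blk in blocks[:b - 1])


def rank_in_block(block, c, r):
    """Number of c's in block[1..r]."""
    return block[:r].count(c)


def locate(blocks, i):
    """Block number and in-block offset of text position i."""
    start = 0
    for b, blk in enumerate(blocks, start=1):
        if i <= start + len(blk):
            return b, i - start
        start += len(blk)
    raise IndexError("position out of range")


def rank(blocks, c, i):
    b, r = locate(blocks, i)
    return rank_over(blocks, c, b) + rank_in_block(blocks[b - 1], c, r)


def select(blocks, c, k):
    """Position of the k-th c: find its block, then its offset in that block."""
    start = 0
    for blk in blocks:
        cnt = blk.count(c)
        if k <= cnt:                      # the k-th c lies in this block
            seen = 0
            for off, ch in enumerate(blk, start=1):
                seen += ch == c
                if seen == k:
                    return start + off
        k -= cnt                          # skip the c's of this block
        start += len(blk)
    return None


blocks = ["babc", "abab", "abca"]
```

An in-block update would change one entry of `blocks`, while an over-block update would change the counts that `rank_over` and `select` rely on.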


Over-block structures • Sorted character-block pair • Character-block pair (T[i], b): T[i] in the b-th block • Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung) • E.g. T = babc abab abca • Pairs in text order: (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3) • Sorted pairs: (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)

Over-block structures • Differential encoding of sorted pairs • A bit vector B of O(n) bits • For each distinct pair: • 1s: the difference of block numbers from the previous pair, in unary • 0s: the number of identical pairs, in unary • E.g. T = … babc abab bbbb abcc … • … (c,5)(c,8)(c,8) … → … 11111011100 …

Over-block structures • Differential encoding of sorted pairs • A bit vector B of O(n) bits • For each distinct pair: • 1s: the difference of block numbers from the previous pair, in unary • 0s: the number of identical pairs, in unary • E.g. T = babc abab abca • B = 10100100 10010010 10110 (one group each for ‘a’, ‘b’ and ‘c’; the middle group is the ‘b’ group)
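The differential encoding can be reproduced with a few lines. The sketch below emits one bit string per character group, writing the block-number difference in unary 1s followed by one 0 per occurrence. The function name and the representation of B as a list of strings are assumptions for readability; the slides store B as a single bit vector.

```python
def encode_groups(blocks, alphabet):
    """Return the group of B for each character of alphabet, in order."""
    groups = []
    for c in alphabet:
        bits = []
        pending_ones = 0                  # blocks passed since the last emitted pair
        for blk in blocks:
            pending_ones += 1
            cnt = blk.count(c)
            if cnt:
                bits.append("1" * pending_ones)   # difference of block numbers
                bits.append("0" * cnt)            # number of identical pairs (c, b)
                pending_ones = 0
        groups.append("".join(bits))
    return groups


B = encode_groups(["babc", "abab", "abca"], "abc")

# Eight blocks whose last four are babc abab bbbb abcc; the first four
# blocks are hypothetical filler without any 'c'.
long_blocks = ["abab"] * 4 + ["babc", "abab", "bbbb", "abcc"]
c_group = encode_groups(long_blocks, "c")[0]
```

Note that a block without any occurrence of c contributes only to the next run of 1s, which is why the ‘c’ group of the short example contains the substring 11.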

Over-block rank-select • rank-overT(c, b): • Find the position of the b-th ‘1’ in the group of c • Count the ‘0’s in the group of c up to that position • E.g. T = babc abab abca, B = 10100100 10010010 10110 • rank-overT(‘b’, 3): count ‘0’s up to the 3rd ‘1’ in the ‘b’ group 10010010, which gives 4
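The two steps of rank-over translate directly into a scan over one character group. This is a linear-time sketch on a Python string; the structure in the slides answers the same question with binary rank and select on B. Returning the total number of 0s when the group has fewer than b 1s is an assumption made here to cover trailing blocks that do not contain c.

```python
def rank_over(group, b):
    """Number of '0's before the b-th '1' of a character group of B."""
    ones = 0
    zeros = 0
    for bit in group:
        if bit == "1":
            ones += 1
            if ones == b:
                return zeros      # every '0' seen so far belongs to blocks 1..b-1
        else:
            zeros += 1
    return zeros                  # fewer than b '1's: all occurrences precede block b


a_group, b_group, c_group = "10100100", "10010010", "10110"
```

For T = babc abab abca, `rank_over(b_group, 3)` counts the b’s in the first two blocks.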

Over-block updates • If the number of blocks is fixed • Insert or delete 0s at the b-th block in I and B • Rank-select remains correct • E.g. T = babc abab abca → babc aabaaabb abca • I = 1000 1000 1000 → 1000 10000000 1000 • B = 10100100 10010010 10110 → 10100000100 100100010 10110
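When the number of blocks does not change, inserting a character c into block b amounts to adding one 0 to I and one 0 to the group of c in B. The sketch below places the new 0 directly after the b-th 1, which is one valid position among several. It assumes that the b-th 1 exists in the bit string, so appending to trailing blocks that lack c is not covered, and the helper name is hypothetical.

```python
def insert_zero_after_kth_one(bits, k):
    """Return bits with one '0' inserted right after its k-th '1'."""
    ones = 0
    for pos, bit in enumerate(bits):
        if bit == "1":
            ones += 1
            if ones == k:
                return bits[:pos + 1] + "0" + bits[pos + 1:]
    raise ValueError("bit string has fewer than k ones")


I = "100010001000"                       # T = babc abab abca
groups = {"a": "10100100", "b": "10010010", "c": "10110"}

# Grow block 2 from abab to aabaaabb: three a's and one b are inserted.
for ch in "aaab":
    I = insert_zero_after_kth_one(I, 2)
    groups[ch] = insert_zero_after_kth_one(groups[ch], 2)
```

Deleting a character is the mirror image: one 0 is removed from I and from the group of the deleted character.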

Over-block updates • If the number of blocks is changing • Split or merge the b-th block in I and B • Call O(σ) queries on B, one per character group → amortized (σ < log n) • E.g. T = babc aabaaabb abca → babc aaba aabb abca • I = 1000 10000000 1000 → 1000 1000 1000 1000 • B = 10100000100 100100010 10110 → 101000100100 10010100010 101110

In-block structures • We use the same hierarchy as Makinen & Navarro: word, sub-block and block • Rank/select on a word-size text w • Convert w to a bit vector representing the occurrences of c • E.g. w = abaacbab, mask = bbbbbbbb (log σ bits per character) • w XOR mask = x0xxx0x0, where x is a nonzero field → 01000101 in binary • O(1) time rank-select using lookup tables of o(n) bits
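The mask trick can be imitated with Python integers. In this sketch each character occupies a fixed 2-bit field (an assumption that fits an alphabet of at most four symbols), the word is XORed with a word made of copies of c, and every all-zero field is turned into a 1 bit. The field-by-field loop and the popcount are stand-ins for the constant-time table lookups mentioned in the slide; the names `CODE`, `pack`, `occurrence_bits` and `word_rank` are illustrative.

```python
BITS = 2                                  # bits per character (assumed, sigma <= 4)
CODE = {"a": 0, "b": 1, "c": 2}           # hypothetical character codes


def pack(word):
    """Pack the characters of word into one integer, first character highest."""
    x = 0
    for ch in word:
        x = (x << BITS) | CODE[ch]
    return x


def occurrence_bits(word, c):
    """Bit vector (as an int) with a 1 where word has character c."""
    n = len(word)
    x = pack(word) ^ pack(c * n)          # a field becomes 0 exactly where word[j] == c
    field_mask = (1 << BITS) - 1
    bits = 0
    for j in range(n):
        field = (x >> ((n - 1 - j) * BITS)) & field_mask
        bits = (bits << 1) | (1 if field == 0 else 0)
    return bits


def word_rank(word, c, r):
    """Number of c's among the first r characters of word."""
    bits = occurrence_bits(word, c)
    return bin(bits >> (len(word) - r)).count("1")


w = "abaacbab"
```

Here `occurrence_bits(w, "b")` corresponds to the bit vector 01000101 of the slide.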

In-block structures • Linked list over sub-blocks • A block contains ½log n to 2log n words • A sub-block contains √log n words • One extra sub-block is a buffer for updates • Red-black tree over blocks • Leaf node: pointer to block, list of sub-blocks • Internal node: the number of blocks in its subtree

In-block rank-select • rankTb(c, r) in O(log n) time • Traverse the tree to find the b-th block • Scan the b-th block of Θ(log n) words • [Figure: tree over blocks with node counters 5, 3, 2, 2 and leaf blocks ab, ba, bc]

In-block updates • Update words in the list in O(log n) time • Process carry characters using the extra space in a block • [Figure: tree over blocks with node counters 5, 3, 2, 2 and leaf blocks ab, bc, ab, c]

In-block updates • Split or merge a block whose size falls outside the allowed range • Update tree nodes from leaf to root • [Figure: tree over blocks before and after a block split; the node counters 5, 3, 2, 2 become 6, 4, 2, 2, 2 over leaf blocks ab, bc, ac, ba, bc]

Extension of our structure • Dynamic rank-select on plain texts over a large alphabet, σ < n • Use k-ary wavelet trees • O(log n logσ /loglog n) time & nlogσ + O(nlogσ /loglog n) bits • Application to run-length encoded texts • Start from RLFM (Makinen & Navarro) • Support dynamic BWT
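The idea of reducing large-alphabet rank to rank on smaller sequences can be illustrated with a wavelet tree. The sketch below is a plain binary, static wavelet tree with bit lists and linear-time bit counting, which is a deliberate simplification: the slides use k-ary wavelet trees whose nodes are dynamic small-alphabet rank-select structures. The class name and layout are assumptions.

```python
class WaveletTree:
    """Binary wavelet tree supporting rank(c, i) with 1-based, inclusive i."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.bits = None                       # None marks a leaf
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2
        left_chars = set(self.alphabet[:mid])
        # 0: character goes to the left child, 1: to the right child.
        self.bits = [0 if ch in left_chars else 1 for ch in text]
        self.left = WaveletTree([ch for ch in text if ch in left_chars],
                                self.alphabet[:mid])
        self.right = WaveletTree([ch for ch in text if ch not in left_chars],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        if c not in self.alphabet:
            return 0
        if self.bits is None:
            return i                           # leaf holds only copies of c
        mid = len(self.alphabet) // 2
        bit = 0 if c in self.alphabet[:mid] else 1
        j = self.bits[:i].count(bit)           # binary rank at this node
        child = self.left if bit == 0 else self.right
        return child.rank(c, j)


wt = WaveletTree("mississippi$")
```

Each query visits one node per level, so with k-ary nodes the number of levels drops from log σ to log σ / log k.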

Application to RLE • Run-Length Encoding (RLE) of T • Character of runs: text T’ • Length of runs: bit vector L • E.g. T = aaabbaacccc → T’ = abac, L = 10010101000 • RLE of BWT (Makinen & Navarro) • Run-Length based FM-index • The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
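The split of T into run heads and run lengths can be sketched with `itertools.groupby`. The function name is an assumption, and L is returned as a Python string in which a 1 marks the first position of each run.

```python
from itertools import groupby


def run_length_encode(t):
    """Return (T', L): the run heads and the run-start bit vector of t."""
    heads = []
    lengths = []
    for ch, run in groupby(t):
        n = len(list(run))
        heads.append(ch)
        lengths.append("1" + "0" * (n - 1))   # a run of length n is 1 followed by n-1 zeros
    return "".join(heads), "".join(lengths)


t_prime, L = run_length_encode("aaabbaacccc")
```

L has exactly as many bits as T has characters, and T’ has one character per 1 in L.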

Application to RLE • Assume rank/select on L and T’ • Total size of structure: O(n + n’logσ) • Operation time: O(log n + log n logσ/loglog n) • Some additional vectors • Sorted length vector: L’ • Frequency table F’: count characters in T’ • E.g. T = bb aa bbbb cc aaa, T’ = babca, L = 10 10 1000 10 100 • Runs sorted by character: aa aaa bb bbbb cc → L’ = 10 100 10 1000 10 • F’ = 001 001 01
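How these vectors combine into rank on the original text can be sketched as follows: find the run containing position i with rank on L, count the runs of c up to that run with rank on T’, add up the lengths of the complete runs of c using select on L’, and add the partial run if it is a run of c. The code is a naive model under stated assumptions: all vectors are Python strings, the frequency table F’ is replaced by a dictionary `C` giving the number of runs of smaller characters, and `select1` returns a sentinel one past the end when the requested 1 does not exist.

```python
from itertools import groupby


def rank1(bits, i):
    return bits[:i].count("1")


def select1(bits, k):
    """1-based position of the k-th '1'; len(bits) + 1 if it does not exist."""
    ones = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            ones += 1
            if ones == k:
                return pos
    return len(bits) + 1


def unary(n):
    return "1" + "0" * (n - 1)


def build_rle(t):
    runs = [(ch, len(list(g))) for ch, g in groupby(t)]
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join(unary(n) for _, n in runs)
    by_char = sorted(runs, key=lambda run: run[0])      # stable sort by run head
    L_sorted = "".join(unary(n) for _, n in by_char)    # L'
    C = {c: sum(1 for ch, _ in runs if ch < c) for c in set(t)}
    return t_prime, L, L_sorted, C


def rank_rle(index, c, i):
    t_prime, L, L_sorted, C = index
    if c not in C or i <= 0:
        return 0
    j = rank1(L, i)                        # run containing position i
    m = t_prime[:j].count(c)               # runs of c among the first j runs
    if m == 0:
        return 0
    if t_prime[j - 1] == c:                # position i lies inside a run of c
        full, partial = m - 1, i - select1(L, j) + 1
    else:
        full, partial = m, 0
    start = select1(L_sorted, C[c] + 1)            # first run of c in L'
    end = select1(L_sorted, C[c] + full + 1)       # run after the complete ones
    return (end - start) + partial


index = build_rle("bbaabbbbccaaa")
```

In the structure of the slides the counts on T’ and the selects on L and L’ would each be answered by a dynamic rank-select structure rather than by scanning.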

Conclusion • The rank-select structure is an essential ingredient of compressed full-text indices • We propose a dynamic rank-select structure for a small alphabet and its large-alphabet version • Our structures can be applied to indices that use the BWT, such as RLFM and the index for a collection of texts
