Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts

Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts. Sunho Lee and Kunsoo Park Seoul National Univ. Contents. Introduction Rank/select problem Relations to compressed full-text indices Dynamic rank-select structure Extensions of the structure


Presentation Transcript


  1. Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts Sunho Lee and Kunsoo Park Seoul National Univ.

  2. Contents • Introduction • Rank/select problem • Relations to compressed full-text indices • Dynamic rank-select structure • Extensions of the structure • For a large alphabet text • For a run-length encoded text

  3. Rank-select problem • For a given text T over σ-size alphabet, our structures support: • rankT(c, i): gives the number of character c’s up to position i in T • selectT(c, k): gives the position of the k-th c • E.g. T=acabbc • rankT(‘a’, 5) = 2 • selectT(‘a’, 2) = 3

  4. Rank-select problem • Our structures support additional update operations • insertT(c, i): inserts character c between T[i] and T[i+1] • deleteT(i): deletes T[i] from T • E.g. T = acabbc → aababc • rankT(‘a’, 5) = 2 → rankT(‘a’, 5) = 3 • selectT(‘a’, 2) = 3 → selectT(‘a’, 2) = 2
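The four operations on slides 3 and 4 can be pinned down with a naive reference implementation. This is only a sketch for checking the examples: every operation here costs O(n), unlike the structures in the talk, and the particular delete/insert sequence used to reach aababc is an assumption, since the slide does not say which updates were applied.

```python
# Naive reference versions of rank, select, insert and delete (1-based positions).
# Illustrative only: each call scans or copies the whole string.

def rank(text, c, i):
    """Number of occurrences of c in text[1..i]."""
    return text[:i].count(c)

def select(text, c, k):
    """Position of the k-th occurrence of c, or None if there are fewer than k."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return None

def insert(text, c, i):
    """Insert c between text[i] and text[i+1]."""
    return text[:i] + c + text[i:]

def delete(text, i):
    """Delete text[i]."""
    return text[:i - 1] + text[i:]

before = "acabbc"
print(rank(before, "a", 5), select(before, "a", 2))   # 2 3

# Assumed update sequence: delete the 'c' at position 2, then insert an 'a'
# after position 3.
after = insert(delete(before, 2), "a", 3)
print(after, rank(after, "a", 5), select(after, "a", 2))   # aababc 3 2
```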

  5. Why rank-select problem? • In compressed full-text index • Rank-select structures are built on Burrows-Wheeler Transform (BWT) • Rank: backward search (Ferragina & Manzini) • Select: Psi-function in CSA (Grossi & Vitter) • Dynamic BWT • Index for a collection of texts (Chan, Hon & Lam) • Add or remove a text from the collection

  6. Example of select on BWT • T = mississippi$ • Psi function: the order of the suffix at the next position • E.g. Psi[4] = 11, the order of ‘ssippi$’

  7. Example of select on BWT • T = mississippi$ • Psi function: the order of the suffix at the next position • E.g. Psi[4] = 11, the order of ‘ssippi$’ • Duality between Psi-function and BWT (Hon, Sadakane & Sung) • BWT[i] = T[SA[i] – 1] • Psi[i] = selectBWT(C[i], i – F[C[i]]) • C[i]: T[SA[i]], the first character of the i-th suffix • F[c]: the number of characters x in T with x < c
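The duality on slide 7 can be checked directly on T = mississippi$. The sketch below builds the suffix array by naive sorting (an assumption made for brevity, not how an index would be built) and compares Psi computed from its definition with Psi computed through select on the BWT. The helper names are illustrative.

```python
# Check Psi[i] = select_BWT(C[i], i - F[C[i]]) on a small text (1-based indices).

def suffix_array(t):
    """1-based start positions of the suffixes of t in lexicographic order."""
    return [p + 1 for p in sorted(range(len(t)), key=lambda p: t[p:])]

def select(s, c, k):
    """1-based position of the k-th occurrence of c in s."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")

T = "mississippi$"
n = len(T)
SA = suffix_array(T)                      # SA[i - 1] holds the slide's SA[i]
# BWT[i] = T[SA[i] - 1]; the suffix starting at position 1 wraps around to '$'.
BWT = "".join(T[(s - 2) % n] for s in SA)

def F(c):
    """Number of characters in T that are smaller than c."""
    return sum(1 for x in T if x < c)

def psi_by_select(i):
    c = T[SA[i - 1] - 1]                  # C[i] = T[SA[i]]
    return select(BWT, c, i - F(c))

def psi_direct(i):
    nxt = SA[i - 1] % n + 1               # next text position, wrapping at the end
    return SA.index(nxt) + 1              # order of the suffix that starts there

print(BWT)                                # ipssm$pissii
print(psi_direct(4), psi_by_select(4))    # 11 11
```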

  8. Our results • Dynamic rank-select on texts over a small alphabet (σ < log n) • Improve the binary-alphabet version by Makinen & Navarro • O(log n) time and nlogσ + o(nlogσ) bits • Dynamic rank-select for a large alphabet (σ < n) • Use wavelet trees to extend our small-alphabet structure • O(log n logσ / loglog n) time and nlogσ + o(nlogσ) bits • Application to RLE texts

  9. Static rank-select

  10. Dynamic rank-select

  11. Dynamic rank-select preliminary • We assume the RAM model with: • Word size w = Θ(log n) bits • +, -, *, / and bitwise operations in O(1) time • We process a word-size text of Θ(log n / log σ) characters in O(1) time

  12. Dynamic rank-select preliminary • Partition of text • Blocks of sizes from ½ log n words to 2 log n words • Bit vector representation, I • Gives block number b and offset r for position i • Employ binary rank-select by Makinen & Navarro: O(log n) time & O(n) bits • E.g. • T = babc abab abca • I = 1000 1000 1000 • b = rankI(‘1’, 10) = 3 • r = 10 - selectI(‘1’, 3) + 1 = 2
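The mapping from a text position to a block number and an offset can be sketched as follows. The bit vector I is held as a plain string and scanned linearly, which is an assumption for illustration. The talk uses the dynamic binary rank-select structure of Makinen & Navarro for this step.

```python
# Block number b and in-block offset r of position i, read off the bit vector I.
# I has a '1' at the first position of every block (1-based positions).

def rank1(bits, i):
    """Number of '1's in bits[1..i]."""
    return bits[:i].count("1")

def select1(bits, k):
    """1-based position of the k-th '1' (assumes it exists)."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1

def locate(I, i):
    b = rank1(I, i)                 # block containing position i
    r = i - select1(I, b) + 1       # offset of i inside that block
    return b, r

I = "1000" "1000" "1000"            # T = babc abab abca
print(locate(I, 10))                # (3, 2)
```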

  13. Dynamic rank-select preliminary • Over-block/in-block operation • rankT(c, i): • rank-overT(c, b): The number of c’s before the b-th block • rankTb(c, r): The number of c’s up to position r in Tb • E.g. • T = babc abab abca • I = 1000 1000 1000 • rankT(‘a’, 10) = rank-overT(‘a’, 3) + rankT3(‘a’, 2) = 3 + 1 = 4
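The split of rank into an over-block part and an in-block part can be sketched with the blocks held as a list of strings. The list and the linear scans are assumptions that stand in for the real structures, so only the decomposition itself is illustrated.

```python
# rank_T(c, i) = rank_over(c, b) + rank_in_block(c, r), with blocks as strings.

def rank_over(blocks, c, b):
    """Number of c's in the blocks before the b-th block."""
    return sum(blk.count(c) for blk in blocks[:b - 1])

def rank_in_block(block, c, r):
    """Number of c's up to offset r inside one block."""
    return block[:r].count(c)

def rank(blocks, c, i):
    # Walk the block lengths to find (b, r); this stands in for the vector I.
    b, r = 1, i
    for blk in blocks:
        if r <= len(blk):
            break
        r -= len(blk)
        b += 1
    return rank_over(blocks, c, b) + rank_in_block(blocks[b - 1], c, r)

blocks = ["babc", "abab", "abca"]
print(rank_over(blocks, "a", 3), rank_in_block(blocks[2], "a", 2))   # 3 1
print(rank(blocks, "a", 10))                                         # 4
```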

  14. Dynamic rank-select preliminary • Over-block/in-block operation • selectT(c, k): • select-overT(c,k): The block number containing the k-th c • selectTb(c,k’): The offset of the k’-th c in Tb • Update operation • In-block update: change the text itself • Over-block update: change the statistics of the text
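Select decomposes in the same way as rank. The sketch below again assumes blocks stored as plain strings and uses linear scans, so it shows only how select-over and the in-block select fit together, not their actual cost.

```python
# select_T(c, k): find the block holding the k-th c, then its offset in that block.

def select_over(blocks, c, k):
    """Return (b, k2): the block number holding the k-th c and its rank k2 inside it."""
    for b, blk in enumerate(blocks, start=1):
        count = blk.count(c)
        if k <= count:
            return b, k
        k -= count
    raise ValueError("fewer than k occurrences")

def select_in_block(block, c, k):
    """1-based offset of the k-th c inside one block."""
    seen = 0
    for offset, ch in enumerate(block, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return offset
    raise ValueError("fewer than k occurrences in block")

def select(blocks, c, k):
    b, k2 = select_over(blocks, c, k)
    start = sum(len(blk) for blk in blocks[:b - 1])   # characters before block b
    return start + select_in_block(blocks[b - 1], c, k2)

blocks = ["babc", "abab", "abca"]
print(select_over(blocks, "a", 4))   # (3, 1)
print(select(blocks, "a", 4))        # 9
```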

  15. Over-block structures • Sorted character-block pair • Character-block pair (T[i], b): T[i] in the b-th block • E.g. T = babc abab abca (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)

  16. Over-block structures • Sorted character-block pair • Character-block pair (T[i], b): T[i] in the b-th block • Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung) • E.g. T = babc abab abca • Pairs: (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3) • Sorted: (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)

  17. Over-block structures • Differential encoding of sorted pairs • A bit vector B of O(n) bits • For each distinct pair: • 1s: the difference in block number from the previous pair, in unary • 0s: the number of copies of the same pair • E.g. • T = … babc abab bbbb abcc … • … (c,5)(c,8)(c,8) … → … 11111011100 …

  18. Over-block structures • Differential encoding of sorted pairs • A bit vector B of O(n) bits • For each distinct pair: • 1s: the difference in block number from the previous pair, in unary • 0s: the number of copies of the same pair • E.g. • T = babc abab abca • B = 10100100 (‘a’ group) 10010010 (‘b’ group) 10110 (‘c’ group)
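The differential encoding can be reproduced with a short sketch. It sorts the character-block pairs and, within each character's group, writes the block-number difference in unary followed by one 0 per pair. Keeping the groups in a dictionary keyed by character is an assumption for readability, since the talk stores a single bit vector B.

```python
# Build the groups of B from the sorted character-block pairs.

def char_block_pairs(blocks):
    """(character, block number) for every text position, in text order."""
    return [(ch, b) for b, blk in enumerate(blocks, start=1) for ch in blk]

def encode_groups(blocks):
    groups = {}      # character -> list of bit strings
    last_block = {}  # character -> block number of the previous pair in its group
    for ch, b in sorted(char_block_pairs(blocks)):
        prev = last_block.get(ch, 0)
        # '1' * (b - prev) encodes the block difference, '0' marks this pair.
        groups.setdefault(ch, []).append("1" * (b - prev) + "0")
        last_block[ch] = b
    return {ch: "".join(bits) for ch, bits in groups.items()}

groups = encode_groups(["babc", "abab", "abca"])
print(groups)                     # {'a': '10100100', 'b': '10010010', 'c': '10110'}
print("".join(groups.values()))   # B as one bit vector
```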

  19. Over-block rank-select • rank-overT(c, b): • Find the position of the b-th ‘1’ in the group of c • Count ‘0’s representing c up to the position • E.g. • T = babc abab abca • B = 10100100 10010010 10110 • rank-overT(‘b’, 3): count ‘0’s up to the 3rd ‘1’ in the ‘b’ group 10010010, which gives 4
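The two steps of rank-over can be sketched on the groups from the example. The groups are hard-coded from the slide, and the linear scan is an assumption that replaces the binary rank-select queries on B used in the talk.

```python
# rank-over(c, b): count the '0's that precede the b-th '1' in c's group of B.

def rank_over_from_bits(group_bits, b):
    ones = 0
    zeros = 0
    for bit in group_bits:
        if bit == "1":
            ones += 1
            if ones == b:
                return zeros
        else:
            zeros += 1
    # Fewer than b ones: every occurrence of c lies in an earlier block.
    return zeros

# Groups of B for T = babc abab abca, copied from the slide.
groups = {"a": "10100100", "b": "10010010", "c": "10110"}

print(rank_over_from_bits(groups["b"], 3))   # 4
print(rank_over_from_bits(groups["a"], 3))   # 3
```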

  20. Over-block updates • If the number of blocks is fixed • Insert or delete 0s at the b-th block in I and B • Rank-select remains correct • E.g. • T = babc abab abca → babc aabaaabb abca • I = 1000 1000 1000 → 1000 10000000 1000 • B = 10100100 10010010 10110 → 10100000100 100100010 10110

  21. Over-block updates • If the number of blocks is changing • Split or merge the b-th block in I and B • Call O(σ) queries on B → amortized cost (σ < log n) • E.g. • T = babc aabaaabb abca → babc aaba aabb abca • I = 1000 10000000 1000 → 1000 1000 1000 1000 • B = 10100000100 100100010 10110 → 101000100100 1001010010 101110

  22. In-block structures • We use the same hierarchy as Makinen & Navarro: word, sub-block and block • Rank/select on word-size texts w • Convert w to a bit vector representing occurrences of c • E.g. c = b, w = abaacbab, mask = bbbbbbbb (each character takes log σ bits) • w XOR mask = x0xxx0x0 (a log σ-bit field is 0 exactly where w holds b) → 01000101(2) • O(1) time rank-select by lookup tables of o(n) bits
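The word-level conversion can be sketched with Python integers standing in for machine words. The 2-bit character codes and the per-field loop are assumptions for illustration. The talk obtains the bit vector and the final rank/select answers in O(1) time with precomputed tables.

```python
# Mark the occurrences of c in a packed word: XOR with a mask of repeated c,
# then test which fixed-width fields became zero.

BITS = 2                              # assumed field width (log sigma for sigma <= 4)
CODE = {"a": 0, "b": 1, "c": 2}       # assumed character codes

def pack(s):
    """Pack a string into an integer, first character in the highest field."""
    w = 0
    for ch in s:
        w = (w << BITS) | CODE[ch]
    return w

def occurrence_bits(w, length, c):
    """Bit string with '1' at the positions where the packed word holds c."""
    x = w ^ pack(c * length)          # fields equal to c become 0
    field_mask = (1 << BITS) - 1
    bits = []
    for j in range(length):
        shift = (length - 1 - j) * BITS
        bits.append("1" if (x >> shift) & field_mask == 0 else "0")
    return "".join(bits)

w = pack("abaacbab")
occ = occurrence_bits(w, 8, "b")
print(occ)                    # 01000101
print(occ[:6].count("1"))     # rank of 'b' up to position 6 in the word: 2
```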

  23. In-block structures • Linked list over sub-blocks • A block contains ½log n to 2log n words • A sub-block contains √log n words • One extra sub-block is a buffer for updates • Red-black tree over blocks • Leaf node: pointer to block, list of sub-blocks • Internal node: the number of blocks in its subtree

  24. In-block rank-select • RankTb(c, r) in O(log n) time • Traverse the tree to find the b-th block • Scan the b-th block of Θ(log n) words • (Figure: tree with node counts 5, 3, 2, 2 and leaf contents ab, ba, bc)

  25. In-block updates • Update words in the list in O(log n) time • Process carry characters using the extra space in a block • (Figure: tree with node counts 5, 3, 2, 2 and leaf contents ab, bc, ab, c)

  26. In-block updates • Split or merge a block whose size is out of range • Update tree nodes from leaf to root • (Figure: tree with node counts 5, 3, 2, 2 and leaf contents ab, bc, ac, ba, bc)

  27. In-block updates • Split or merge a block whose size is out of range • Update tree nodes from leaf to root • (Figure: tree with node counts 6, 4, 2, 2, 2 and leaf contents ab, bc, ac, ba, bc)

  28. Extension of our structure • Dynamic rank-select on plain texts over a large alphabet, σ < n • Use k-ary wavelet trees • O(log n logσ /loglog n) time & nlogσ + O(nlogσ /loglog n) bits • Application to run-length encoded texts • Start from RLFM (Makinen & Navarro) • Support dynamic BWT

  29. Application to RLE • Run-Length Encoding (RLE) of T • Character of runs: text T’ • Length of runs: bit vector L • E.g. T = aaabbaacccc → T’ = abac, L = 10010101000 • RLE of BWT (Makinen & Navarro) • Run-Length based FM-index • The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
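The run-length representation on this slide can be sketched in a few lines. The function name is illustrative, and the bit vector L is held as a plain string for readability.

```python
# Run-length encode T into the run characters T' and the run-length bit vector L.
from itertools import groupby

def run_length_encode(t):
    chars = []
    bits = []
    for ch, run in groupby(t):
        length = len(list(run))
        chars.append(ch)
        bits.append("1" + "0" * (length - 1))   # '1' starts a run, '0's extend it
    return "".join(chars), "".join(bits)

t_prime, L = run_length_encode("aaabbaacccc")
print(t_prime, L)   # abac 10010101000
```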

  30. Application to RLE • Assume rank/select on L and T’ • Total size of structure: O(n + n’logσ) • Operation time: O(log n + log n logσ/loglog n) • Some additional vectors • Sorted length vector: L’ • Frequency table F’: count characters in T’ • E.g. • T = bb aa bbbb cc aaa → runs sorted by character: aa aaa bb bbbb cc • L = 10 10 1000 10 100 → L’ = 10 100 10 1000 10 • T’ = babca → F’ = 001 001 01
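One way these vectors can answer rank on the original text is sketched below. The slide does not spell out the query, so the formula follows the general run-length FM-index idea and is an assumption: count the complete runs of c before the run containing position i, sum their lengths through L’, then add the part of the current run if it is a run of c. All helpers are naive linear scans with illustrative names.

```python
# rank on a run-length encoded text using T', L, the sorted lengths L' and F'.
from itertools import groupby

def select1(bits, k):
    """1-based position of the k-th '1'; 0 for k == 0, len(bits) + 1 past the last one."""
    pos = -1
    for _ in range(k):
        pos = bits.find("1", pos + 1)
        if pos < 0:
            return len(bits) + 1
    return pos + 1

def build(t):
    runs = [(ch, len(list(g))) for ch, g in groupby(t)]
    unary = lambda n: "1" + "0" * (n - 1)
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join(unary(n) for _, n in runs)
    # Stable sort by character keeps the runs of each character in text order.
    L_sorted = "".join(unary(n) for _, n in sorted(runs, key=lambda r: r[0]))
    alphabet = sorted(set(t_prime))
    F = "".join("0" * t_prime.count(c) + "1" for c in alphabet)
    return t_prime, L, L_sorted, F, alphabet

def rank_rle(index, c, i):
    t_prime, L, L_sorted, F, alphabet = index
    j = L[:i].count("1")                 # run that contains position i
    k = t_prime[:j - 1].count(c)         # complete runs of c before run j
    q = alphabet.index(c)                # number of smaller characters
    start = select1(F, q) - q            # runs that belong to smaller characters
    total = select1(L_sorted, start + k + 1) - select1(L_sorted, start + 1)
    if t_prime[j - 1] == c:              # position i falls inside a run of c
        total += i - select1(L, j) + 1
    return total

T = "bbaabbbbccaaa"
index = build(T)
print(index[:4])                         # ('babca', '1010100010100', '1010010100010', '00100101')
print(rank_rle(index, "b", 6))           # 4
```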

  31. Conclusion • The rank-select structure is an essential ingredient of compressed full-text indices • We propose dynamic rank-select for a small alphabet and its large-alphabet version • We can apply our structures to indices that use the BWT, such as RLFM and indices for text collections
