Semi-dynamic compact index for short patterns and succinct van Emde Boas tree

This paper proposes a self-index for searching patterns of limited length that is theoretically and practically efficient in terms of construction, updates, and searches. The index is compact, requiring only O(n log σ) bits of space, where n is the text size and σ is the alphabet size.


Presentation Transcript


  1. Semi-dynamic compact index for short patterns and succinct van Emde Boas tree • Yoshiaki Matsuoka (1), Tomohiro I (2), Shunsuke Inenaga (1), Hideo Bannai (1), Masayuki Takeda (1) • (1) Kyushu University, (2) TU Dortmund

  3. Overview • Many space-efficient indices exist (e.g. the FM-index [Ferragina & Manzini, 2000]), but most of them are static. • Some (e.g. the Dynamic FM-index [Salson et al., 2010]) are dynamic but consume more space than their static counterparts. • We propose a self-index for searching patterns of limited length, which: • is theoretically and practically efficient in terms of construction, updates (appending characters to the end of the text) and searches, • is compact, i.e., requires only O(n log σ) bits of space, where n is the text size and σ is the alphabet size, and • can be constructed online.

  5. Problem • Preprocess: a text T of length n over an alphabet of size σ. • Query: a pattern P of length at most r. • Answer: all occurrences of P in T. • Example. For the example text T shown on the slide and P = baa, we output {5, 9, 14, 19} (in any order).
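
As a baseline for the problem statement above, here is a minimal sketch that answers a query by scanning the text directly. The text `T` is a hypothetical string, not taken from the slides; it is chosen only so that the query baa yields the positions quoted in the example. The function name is likewise an illustration.

```python
def find_occurrences(text, pattern):
    """Return every 0-based start position of pattern in text (overlaps included)."""
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


# Hypothetical text (an assumption): consistent with the slide's example output.
T = "abbbabaaabaaabbaaaabaa"

print(find_occurrences(T, "baa"))  # prints [5, 9, 14, 19]
```

This scan needs no index but reads the whole text per query, which is what the index in the following slides avoids.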

  6. A naïve algorithm • Since we would like to search for any pattern of length at most r, a naïve solution would be to store all occurrences of all r-grams in T. • This naïve algorithm requires at least n log n bits, since roughly n positions are stored with log n bits each.

  7. Sampling of q-grams • To reduce the space, we store only the beginning positions divisible by some k (> 1). • We also sample longer substrings (of length q = r + k − 1) so that occurrences of substrings of length at most r are not missed.

  8. Sampling of q-grams • For any pattern P of length at most r, if w is a sampled q-gram at position x in T and P has an occurrence in w with relative position d (i.e., w[d .. d+|P|−1] = P), then x + d is an occurrence of P in T. • Example (P = baa): occurrences at 4+1, 8+1, 12+2 and 16+3.
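
The sampling rule can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the text `T` is a hypothetical string consistent with the slide's example, the function names are invented, and the search simply inspects every sampled q-gram instead of using the graph traversal introduced later.

```python
def build_samples(text, k, q):
    """Map each sampled q-gram to the sampled positions (multiples of k) where it starts."""
    samples = {}
    for x in range(0, len(text) - q + 1, k):
        samples.setdefault(text[x:x + q], []).append(x)
    return samples


def search_sampled(samples, pattern, k):
    """Report x + d whenever pattern occurs at relative position d < k in a q-gram sampled at x."""
    hits = []
    for w, positions in samples.items():
        for d in range(k):
            if w.startswith(pattern, d):  # True iff w[d : d + len(pattern)] == pattern
                hits.extend(x + d for x in positions)
    return sorted(hits)


# Hypothetical text (an assumption); r = 3, k = 4, so q = r + k - 1 = 6.
T = "abbbabaaabaaabbaaaabaa"
samples = build_samples(T, k=4, q=6)
print(search_sampled(samples, "baa", k=4))  # prints [5, 9, 14, 19]
```

Each occurrence p is reported exactly once, from the sample at x = p − (p mod k) with d = p mod k. Occurrences in a tail of the text that no complete sampled q-gram covers would need separate handling, which this sketch omits.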

  9. Set of q-grams Q_{P,d} • Let Q_{P,d} be the set of all q-grams w in T (not only the sampled ones) in which P has an occurrence with relative position d, i.e., w[d .. d+|P|−1] = P. • For example, for the example text T shown on the slide, if k = 4, q = 6 and P = baa, then • Q_{P,0} = {baaaab, baaaba, baaabb}, • Q_{P,1} = {abaaab, bbaaaa}, • Q_{P,2} = {aabaaa, abbaaa, babaaa}, and • Q_{P,3} = {aaabaa, aabbaa, bbabaa}.

  10. Set of q-grams Q_{P,d} • For example, for the example text T shown on the slide, if k = 4, q = 6 and P = baa, then • Q_{P,0} = {baaaab, baaaba, baaabb}, • Q_{P,1} = {abaaab, bbaaaa}, • Q_{P,2} = {aabaaa, abbaaa, babaaa}, and • Q_{P,3} = {aaabaa, aabbaa, bbabaa}. Observation • Q_{P,0} ∪ Q_{P,1} ∪ … ∪ Q_{P,k−1} contains all sampled q-grams which contain P (with its offset). • |Q_{P,d}| ≤ #occ for any 0 ≤ d < k, where #occ is the number of occurrences of P in T.
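
The definition of Q_{P,d} can be checked by brute force. The sketch below is only an illustration: the text `T` is a hypothetical string chosen to be consistent with the slide's example, and `qgram_sets` is an invented helper that enumerates every q-gram of the text rather than using the index.

```python
def qgram_sets(text, pattern, k, q):
    """Return [Q_{P,0}, ..., Q_{P,k-1}] as sorted lists, straight from the definition."""
    qgrams = {text[i:i + q] for i in range(len(text) - q + 1)}
    return [sorted(w for w in qgrams if w.startswith(pattern, d)) for d in range(k)]


# Hypothetical text (an assumption); k = 4, q = 6, P = baa as in the example.
T = "abbbabaaabaaabbaaaabaa"
Q = qgram_sets(T, "baa", k=4, q=6)
for d, members in enumerate(Q):
    print(d, members)
```

For this text the sketch yields Q_{P,1} = {abaaab, bbaaaa}, and every set has at most 4 = #occ members, in line with the observation.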

  11. Basic strategy of our search algorithm • To compute all occurrences of P in T, we incrementally compute Q_{P,0}, Q_{P,1}, …, Q_{P,k−1} and output occurrences of P when we encounter sampled q-grams in each Q_{P,d}. Observation • Q_{P,0} ∪ Q_{P,1} ∪ … ∪ Q_{P,k−1} contains all sampled q-grams which contain P (with its offset). • |Q_{P,d}| ≤ #occ for any 0 ≤ d < k.

  12. q-gram transition graph • To compute Q_{P,1}, …, Q_{P,k−1}, we consider a directed graph G = (Σ^q, E), which we call a q-gram transition graph. A q-gram transition graph is a subgraph of the de Bruijn graph of T s.t. the indegree of each vertex is at most 1.

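  13. q-gram transition graph • [Figure: the q-gram transition graph of the example text, with vertices abbbab, bbbaba, bbabaa, babaaa, abaaab, baaaba, baaabb, aabaaa, aaabaa, aaabba, aabbaa, abbaaa, bbaaaa, baaaab and aaaaba.] • We limit the indegree to at most 1, so one of the candidate edges is not constructed.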

  14. q-gram transition graph • [Figure: the same graph, with the positions of sampled q-grams (0, 4, 8, 12 and 16) attached to their vertices.]

  15. Computing Q_{P,0}, …, Q_{P,k−1} • [Figure: for P = baa, the traversal of the graph from Q_{P,0} through Q_{P,1}, Q_{P,2} and Q_{P,3}, passing the sampled positions 4, 8, 12 and 16.] • One candidate edge does not exist (the indegree is limited to 1), therefore abaaab is enumerated only once.
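
The traversal can be sketched end to end as below. Several things here are assumptions rather than the authors' design: the text `T` is a hypothetical string consistent with the slide's example, a Python set stands in for the bit array B, plain dictionaries stand in for the compact tables, and when two edges compete for the same vertex the first one seen is kept.

```python
from collections import defaultdict


def build_index(text, k, q):
    present = set()              # q-grams occurring in the text (stand-in for B)
    preds = defaultdict(list)    # u -> characters c with an edge u -> c + u[:-1]
    has_in_edge = set()          # vertices that already own their single incoming edge
    samples = defaultdict(list)  # sampled q-gram -> sampled start positions
    for i in range(len(text) - q + 1):
        w = text[i:i + q]
        present.add(w)
        if i % k == 0:
            samples[w].append(i)
        if i > 0:
            v = text[i - 1:i - 1 + q]  # the q-gram that precedes w in the text
            if v not in has_in_edge:   # keep indegree <= 1 (first edge wins: an assumption)
                has_in_edge.add(v)
                preds[w].append(v[0])
    return present, preds, samples


def search(index, pattern, k):
    present, preds, samples = index
    frontier = sorted(w for w in present if w.startswith(pattern))  # Q_{P,0}
    hits = []
    for d in range(k):
        next_frontier = []
        for w in frontier:  # frontier is Q_{P,d}
            hits.extend(x + d for x in samples.get(w, ()))
            next_frontier.extend(c + w[:-1] for c in preds.get(w, ()))
        frontier = next_frontier
    return sorted(hits)


# Hypothetical text (an assumption); r = 3, k = 4, q = 6.
T = "abbbabaaabaaabbaaaabaa"
index = build_index(T, k=4, q=6)
print(search(index, "baa", k=4))  # prints [5, 9, 14, 19]
```

Because every vertex has at most one incoming edge, no q-gram enters a frontier twice, so each level costs time proportional to |Q_{P,d}|. The sketch does not treat the end of the text specially: occurrences not covered by a complete sampled q-gram, and a q-gram whose only occurrence is the last one in the text, would need extra handling.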

  22. Computing Q_{P,0} • Given a pattern P, we first need to compute the source Q_{P,0} of the q-gram transition graph, i.e., all q-grams in T which begin with P. • Consider all q-grams in lexicographical order. For any w ∈ Σ^q (not necessarily appearing in T), we denote by rank(w) the lexicographical rank of w. • For any pattern P, there exists a single range [sp(P), ep(P)] s.t. a q-gram w begins with P iff sp(P) ≤ rank(w) ≤ ep(P). This range can be computed easily. • Example: sp(baa) = 32 and ep(baa) = 39 delimit the q-grams that begin with baa.
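
One way to compute the range, sketched under the assumption that ranks start at 0 and that a q-gram is read as a base-σ number over the sorted alphabet (the function names are invented for illustration): pad P with the smallest character to get sp(P) and with the largest character to get ep(P).

```python
def rank(w, alphabet):
    """Lexicographic rank of w among all strings of its length, as a base-sigma number."""
    r = 0
    for ch in w:
        r = r * len(alphabet) + alphabet.index(ch)
    return r


def pattern_range(pattern, q, alphabet):
    """Return (sp, ep): the ranks of the smallest and largest q-grams that begin with pattern."""
    pad = q - len(pattern)
    sp = rank(pattern + alphabet[0] * pad, alphabet)
    ep = rank(pattern + alphabet[-1] * pad, alphabet)
    return sp, ep


print(pattern_range("baa", 6, "ab"))  # prints (32, 39)
```

With a = 0 and b = 1 this gives baaaaa = 100000 in binary = 32 and baabbb = 100111 in binary = 39, matching the slide's example.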

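  23. Computing Q_{P,0} • Consider a bit array B of size σ^q s.t. B[rank(w)] = 1 iff w appears in T, where rank(w) is the lexicographical rank of w. Then, w ∈ Q_{P,0} iff sp(P) ≤ rank(w) ≤ ep(P) and B[rank(w)] = 1. • Hence we need to output all w s.t. sp(P) ≤ rank(w) ≤ ep(P) and B[rank(w)] = 1. • Example: sp(baa) = 32 and ep(baa) = 39 delimit the q-grams that begin with baa.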
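
Enumerating the 1s of B inside [sp(P), ep(P)] is a sequence of successor queries. In the sketch below a sorted Python list with `bisect` stands in for the successor structure, which is an assumption made for brevity and not the structure described on the later slides; the text `T` and all function names are hypothetical as well.

```python
from bisect import bisect_left

ALPHABET = "ab"  # sorted alphabet of the hypothetical example


def rank(w):
    r = 0
    for ch in w:
        r = r * len(ALPHABET) + ALPHABET.index(ch)
    return r


def unrank(r, q):
    chars = []
    for _ in range(q):
        r, digit = divmod(r, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def source_qgrams(text, pattern, q):
    # Sorted positions of the 1s in B: ranks of the q-grams that occur in the text.
    ones = sorted({rank(text[i:i + q]) for i in range(len(text) - q + 1)})
    pad = q - len(pattern)
    sp = rank(pattern + ALPHABET[0] * pad)
    ep = rank(pattern + ALPHABET[-1] * pad)
    out = []
    i = bisect_left(ones, sp)  # successor(sp)
    while i < len(ones) and ones[i] <= ep:
        out.append(unrank(ones[i], q))
        i += 1                 # next successor
    return out


T = "abbbabaaabaaabbaaaabaa"  # hypothetical text (an assumption)
print(source_qgrams(T, "baa", 6))  # prints ['baaaab', 'baaaba', 'baaabb']
```

The number of successor queries is |Q_{P,0}| + 1, so the cost depends on the answer size rather than on the width of the range.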

  26. Summary of our index • We need to store: • (a) the q-gram transition graph, • (b) the bit array B[0 .. σ^q − 1] for computing Q_{P,0}, and • (c) the positions of sampled q-grams. • We can represent • (a) in O(σ^q log σ) bits, • (b) in σ^q + O(σ^q/ω) bits, and • (c) in (n/k + σ^q) log(n/k) bits. • We can search any pattern in O(k × #occ + log_σ n) time. • I will explain these next. • Notation: n: length of T; σ: alphabet size; q: length of sampled substrings; k: sampling distance; ω: machine word size.

  28. Representation of (a) • Since the q-gram transition graph is a subgraph of the de Bruijn graph, for each node u it is enough to store the character c s.t. v = c u[0..q−2] if an edge (u, v) exists. • Since the number of vertices is σ^q and the indegree of each vertex is at most 1, the number of edges is at most σ^q. We can represent this graph in O(σ^q log σ) bits by using some tables. • [Figure: example edges of the graph, each labelled with its stored character.]

  31. Representation of (b) • With data structure (b), we output all w s.t. sp(P) ≤ rank(w) ≤ ep(P) and B[rank(w)] = 1, where rank(w) is the lexicographical rank of w. • So, using a fast successor data structure, we can compute all such q-grams w. • We need a dynamic successor data structure to support online updates to T. • We can use a van Emde Boas tree, but it requires Θ(σ^q) words = Θ(σ^q ω) bits. We want to reduce the space. • Example: sp(baa) = 32 and ep(baa) = 39 delimit the q-grams that begin with baa.

  32. Representation of (b) • We present a succinct variant of the van Emde Boas tree. • We divide B into blocks of size ω^h, where ω is the machine word size and h (> 1) is some constant integer. • We maintain an ω-ary tree of height h (bottom tree) for each block, and a van Emde Boas tree (top tree) over the bottom trees. • [Figure: a van Emde Boas tree on top of ω-ary trees of height h, each covering one block of ω^h bits of B.]

  33. Representation of (b): bottom tree • Each bottom tree is a complete ω-ary tree. • Each node has a bit array A of length ω s.t. A[j] = 1 iff the j-th child of the node contains 1. • [Figure: a bottom tree over a block of size ω^h.]
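
A bottom tree of this kind can be sketched as a hierarchy of ω-bit words. The class below is an illustration under assumptions: the class and method names are invented, Python integers play the role of machine words, only insertions are supported (bits of B are never cleared when characters are appended), and the van Emde Boas top tree that links the blocks is left out.

```python
class BottomTree:
    """Complete w-ary summary tree over one block of w**h bits (a sketch)."""

    def __init__(self, w=64, h=2):
        self.w, self.h = w, h
        # levels[0] is the block itself, stored as w-bit words. levels[j] summarises
        # levels[j - 1]: bit c of a word is 1 iff child word c contains a 1.
        self.levels = [[0] * (w ** (h - 1 - j)) for j in range(h)]

    def insert(self, i):
        """Set bit i of the block and the summary bits on the path to the root."""
        for level in self.levels:
            i, bit = divmod(i, self.w)
            level[i] |= 1 << bit

    def successor(self, i):
        """Smallest set position >= i within the block, or None if there is none."""
        if i >= self.w ** self.h:
            return None
        lvl, pos = 0, max(i, 0)
        while True:  # climb until some word has a 1 at or after pos
            word_idx, bit = divmod(pos, self.w)
            rest = self.levels[lvl][word_idx] >> bit
            if rest:
                pos += (rest & -rest).bit_length() - 1  # index of the lowest set bit
                break
            lvl, pos = lvl + 1, word_idx + 1  # continue to the right, one level up
            if lvl == self.h or pos >= self.w ** (self.h - lvl):
                return None
        while lvl > 0:  # descend to the leftmost 1 below
            lvl -= 1
            word = self.levels[lvl][pos]
            pos = pos * self.w + (word & -word).bit_length() - 1
        return pos


tree = BottomTree(w=64, h=2)  # one block of 64**2 = 4096 bits
for p in (5, 100, 4095):
    tree.insert(p)
print(tree.successor(6))  # prints 100
```

Both operations touch at most one word per level on the way up and on the way down, i.e. O(h) words, which matches the O(h) term in the time bound quoted for data structure (b).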

  34. Representation of (b) • Data structure (b) can be represented in σ^q + o(σ^q) bits. • The bottom trees require σ^q + O(σ^q/ω) = σ^q + o(σ^q) bits and the top tree requires O(σ^q/ω^{h−1}) = o(σ^q) bits, assuming the machine word size ω = Θ(log n). • Updates of a single bit in B and successor queries can be done in O(h + log log σ^q) = O(log log σ^q) time. • If σ^q ≤ n, then this is O(log log n) time.

  35. Complexities • We represent each q-gram by an integer, and we do not store the original text T. • We assume that σ = polylog(n), k ≥ 1, q = k + r − 1 and q ≤ log_σ n − log_σ log_σ n. • If we choose k = Θ(log_σ n), then the space complexity is O(n log σ) bits, and hence our index is compact.
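
The space bound follows from the per-component sizes quoted in the summary slide. The derivation below is a reconstruction from those stated bounds and assumptions, not a transcription of the authors' proof; it uses k = Θ(log_σ n), k ≥ 1 and the identity log n / log_σ n = log σ, and it assumes the amsmath package for the align* environment.

```latex
\begin{align*}
q \le \log_\sigma n - \log_\sigma \log_\sigma n
  &\;\Longrightarrow\; \sigma^q \le \frac{n}{\log_\sigma n}, \\
\text{(a)}\quad O(\sigma^q \log \sigma)
  &= O\!\left(\frac{n \log \sigma}{\log_\sigma n}\right) \subseteq O(n \log \sigma), \\
\text{(b)}\quad \sigma^q + o(\sigma^q)
  &= O\!\left(\frac{n}{\log_\sigma n}\right) \subseteq O(n \log \sigma), \\
\text{(c)}\quad \left(\frac{n}{k} + \sigma^q\right) \log \frac{n}{k}
  &\le \left(\frac{n}{k} + \frac{n}{\log_\sigma n}\right) \log n
   = O\!\left(\frac{n \log n}{\log_\sigma n}\right) = O(n \log \sigma).
\end{align*}
```

The positions of the sampled q-grams, component (c), dominate; the graph and the bit array are of lower order.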

  37. Experimental results of construction • Our index is the fastest to construct. • [Figure: time for construction (in seconds) against text size n (in megabytes).]

  39. Experimental results of searching • Ours is the fastest compact/compressed index to search. • [Figure: average time for searching, using 100 patterns of length 6 (in seconds), against text size n (in megabytes).]

  41. Experimental results of memory usage • Ours is much more space-efficient than the Dynamic FM-index. • [Figure: memory usage (in megabytes) against text size n (in megabytes).]

  42. Conclusion • We proposed a q-gram based self-index for searching patterns of limited length. Our self-index: • is theoretically and practically efficient in terms of construction, updates (appending characters to the end of the text) and searches, • is compact, i.e., requires only O(n log σ) bits of space, where n is the text size and σ is the alphabet size, and • can be constructed online. • When the text is a human DNA sequence (i.e., σ = 4 and n ≈ 10^9), the practical limit of pattern length is about 10 for our index. • Can we further reduce the space complexity?
