Space Efficient Suffix Trees
J. Ian Munro, Venkatesh Raman, S. Srinivasa Rao
Introduction
• n: length of the text; m: length of the search pattern string.
• Generally, suffix tree construction takes O(n) time and O(n) space, and searching takes O(m) time.
• Although the space requirement is O(n), the constant factor is usually large.
Introduction (Cont.)
• The motivation is to develop a space-efficient data structure whose leading constant on n is as small as possible.
• We present a suffix tree representation that uses n + O(n / lg n) words, or equivalently n lg n + O(n) bits, and supports string searching in O(m) time when the alphabet size is constant.
Previous Representations
• Earlier representations either had a higher lower-order term in space together with some expected-case assumption, or required more time for searching.
• Some of the approaches:
  - Keep a vector of alphabet size at each node, so the constant in the space bound is at least |Σ|.
  - Keep a pair <start, end> for each compressed node (or, equivalently, a pair <start, length>).
  - Save only the length, called the "skip value".
Using Skip Values for Search
• Skip value: the length of the compressed string at a node.
• At a compressed node, skip as many characters as the skip value specifies before comparing with the pattern.
• Search until the pattern is exhausted or the current character of the pattern has no match at the current node. In the first case, any leaf of the subtree rooted at that node gives a possible starting position of the pattern in the text. In the second case, the pattern does not occur in the text.
• Start at the position given by any of these leaves and verify that the pattern actually occurs there, since the skipped characters were never compared.
Suffix Trees (Patricias)
Input text: bababa#
[Figure: compressed suffix tree (Patricia) for bababa#. Internal nodes are labeled with skip values and leaves with suffix positions 1 to 7.]
Example
Input text: bababa#   Pattern: aba
[Figure: the suffix tree for bababa# with skip values, used to trace the search for the pattern aba.]
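The skip-value search can be sketched in a few lines of Python. This is a minimal illustration, not the representation from the talk: the trie is built naively from the suffixes, each internal node stores its string depth (the running sum of skip values) rather than the skip itself, positions are 0-based (the figure uses 1-based), and the names build, leaves and search are hypothetical.

```python
def build(text, starts, depth=0):
    """Build a compacted trie over the suffixes text[s:] for s in starts.

    Internal nodes keep only their string depth (the running sum of skip
    values) and one child per branching character; edge labels are dropped.
    Assumes text ends with a unique terminator such as '#'.
    """
    if len(starts) == 1:
        return {"leaf": starts[0]}
    # Extend past the characters shared by every suffix in the group (the skip).
    while len({text[s + depth] for s in starts}) == 1:
        depth += 1
    groups = {}
    for s in starts:
        groups.setdefault(text[s + depth], []).append(s)
    children = {c: build(text, grp, depth + 1) for c, grp in groups.items()}
    return {"depth": depth, "children": children}


def leaves(node):
    """Return all suffix start positions stored below node."""
    if "leaf" in node:
        return [node["leaf"]]
    return [p for child in node["children"].values() for p in leaves(child)]


def search(text, root, pattern):
    """Return the sorted 0-based start positions of pattern in text."""
    node = root
    while "leaf" not in node and node["depth"] < len(pattern):
        # Only the branching character is compared; skipped ones are not.
        node = node["children"].get(pattern[node["depth"]])
        if node is None:
            return []          # no matching branch: the pattern is absent
    candidates = leaves(node)
    start = candidates[0]      # any leaf below the stopping node will do
    if text[start:start + len(pattern)] != pattern:
        return []              # the skipped characters did not match
    return sorted(candidates)


text = "bababa#"
root = build(text, list(range(len(text))))
print(search(text, root, "aba"))   # [1, 3]
```

The final comparison against the text is what makes the blind descent safe: a pattern such as "bb" reaches a node during the descent but is rejected by the verification step.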
Previous Representations (Cont.)
• A compressed suffix tree has n+1 leaf nodes and at most n internal nodes, for a total of at most 2n+1 nodes.
• Storage is required for:
  - the tree
  - the skip values
  - the position indices at the leaves
• A representation using skip values requires (2n+1) + n + (n+1), about 4n, words. Each word takes lg n bits, so the total space is about 4n lg n + O(n) bits.
• A suffix array uses 2n words and has O(m + lg n) search time. A more compact representation uses 1.25n words, but its search time is given only as an expected bound.
Binary tree ↔ rooted ordered tree
• There is an isomorphism between binary trees and rooted ordered trees.
• The ordered tree has a root that does not correspond to any node in the binary tree.
• The left child of a binary tree node corresponds to the leftmost child of the corresponding node in the ordered tree.
• The right child of a binary tree node corresponds to the next sibling to the right in the ordered tree.
Binary tree representation using the parenthesis sequence
• The given binary tree on 10 nodes
[Figure: binary tree on nodes 1 to 10, rooted at node 1.]
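Binary tree representation using the parenthesis sequence
• Equivalent rooted ordered tree
[Figure: ordered tree with added root 0, whose children are 1, 6 and 7; node 1 has children 2, 3 and 5; node 3 has child 4; node 7 has children 8 and 10; node 8 has child 9.]
• The parenthesis representation
   0  1  2  2  3  4  4  3  5  5  1  6  6  7  8  9  9  8 10 10  7  0
   (  (  (  )  (  (  )  )  (  )  )  (  )  (  (  (  )  )  (  )  )  )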
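The conversion from a binary tree to the parenthesis sequence can be sketched as follows. The dictionary encoding of the tree and the function name to_parentheses are assumptions made for illustration; the child links are read off the labeled sequence in the example.

```python
def to_parentheses(left, right, root):
    """Encode a binary tree as the parenthesis string of its ordered tree.

    left[v] / right[v] give the children of v in the binary tree; in the
    ordered tree they are the first child and the next sibling of v.
    """
    def emit(v):
        # Emit v and every sibling to its right, each wrapped around its children.
        parts = []
        while v is not None:
            parts.append("(" + emit(left.get(v)) + ")")
            v = right.get(v)
        return "".join(parts)

    # The outermost pair is the extra root that has no binary-tree counterpart.
    return "(" + emit(root) + ")"


# The 10-node binary tree of the example.
left = {1: 2, 3: 4, 7: 8, 8: 9}
right = {1: 6, 2: 3, 3: 5, 6: 7, 8: 10}
seq = to_parentheses(left, right, 1)
print(seq)        # ((()(())())()((())()))
print(len(seq))   # 22
```

The 10 binary-tree nodes plus the extra root give 11 ordered-tree nodes, hence 22 parentheses.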
Parentheses tree representation
• A general rooted ordered tree on n nodes can be represented by 2n parentheses.
• Use a 2n + o(n) bit encoding of an n-node binary tree that supports, in constant time:
  1. moving to the left or right child
  2. moving to the parent
  3. getting the size of the subtree
Succinct Suffix Tree Representation
• Convert each symbol of the alphabet to binary (0, 1).
• The suffix tree then becomes a binary tree.
• Support these additional operations in constant time:
  - leafrank(x): return the number of leaves to the left of node x (in the preorder numbering)
  - leafselect(j): return the j-th leaf in the left-to-right ordering of the leaves
  - leafsize(x): return the number of leaves in the subtree rooted at node x
  - leftmost(x): return the leftmost leaf in the subtree rooted at node x
  - rightmost(x): return the rightmost leaf in the subtree rooted at node x
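Example
[Figure: three copies of a tree on nodes 1 to 6 (root 1, nodes 2 and 5 on the second level, nodes 3, 4 and 6 on the third level), one per operation.]
leafrank(1) = 2   leafselect(3) = 6   leafsize(1) = 3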
Succinct Suffix Tree Representation (Cont.)
• Important navigation operations on a binary string:
  - rank(i): the number of 1s up to and including position i
  - select(i): the position of the i-th 1
  - rank_p(i): the number of occurrences of pattern p up to and including position i
  - select_p(i): the position of the i-th occurrence of p in the given binary string
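A naive reference version of these four operations makes their meaning concrete. It runs in linear time, not constant time, and fixes conventions the slide leaves open: positions are 0-based, the bit string is a Python str of '0' and '1', and an occurrence of p is counted at its starting position.

```python
def rank_p(bits, p, i):
    """Number of occurrences of p that start at a position <= i."""
    return sum(1 for s in range(i + 1) if bits.startswith(p, s))


def select_p(bits, p, k):
    """Start position of the k-th occurrence of p (k >= 1), or -1 if none."""
    seen = 0
    for s in range(len(bits) - len(p) + 1):
        if bits.startswith(p, s):
            seen += 1
            if seen == k:
                return s
    return -1


def rank(bits, i):
    """Number of 1s in bits[0..i], inclusive."""
    return rank_p(bits, "1", i)


def select(bits, k):
    """Position of the k-th 1 (k >= 1), or -1 if there is none."""
    return select_p(bits, "1", k)


bits = "0110101"
print(rank(bits, 3))            # 2
print(select(bits, 3))          # 4
print(rank_p(bits, "10", 4))    # 2
print(select_p(bits, "10", 2))  # 4
```

Plain rank and select are the special case p = "1", which is why the pattern versions are the only ones implemented directly here.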
THEOREM 1 Given a binary string of length n and a binary pattern p of length at most ε lg n, where ε is any constant less than 1/2, both rank_p(i) and select_p(i) can be supported in constant time using o(n) bits, in addition to the space required for the given binary string.
Intuition
• Divide the string into blocks of size lg² n and keep the rank information for the first element of every block.
• Divide each block further into small blocks.
• For the smallest blocks, keep a precomputed table of answers, using o(n) bits.
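The blocking idea can be sketched for plain rank as follows. This is an illustration under simplifying assumptions, not the structure from the theorem: the bit string is a Python str, the block sizes are rounded choices of about (lg n)/2 and about lg² n / 2 bits, the lookup table is a dictionary keyed by small bit strings, and the class name RankDirectory is hypothetical.

```python
import math


class RankDirectory:
    """Two-level rank index over a string of '0'/'1' characters."""

    def __init__(self, bits):
        lg = max(1, int(math.log2(max(len(bits), 2))))
        self.small = max(1, lg // 2)   # small block: about (lg n)/2 bits
        self.big = self.small * lg     # big block: on the order of lg^2 n bits
        self.bits = bits
        self.big_counts = []     # 1s before the start of each big block
        self.small_counts = []   # 1s from the big-block start to each small block
        total = within = 0
        for start in range(0, len(bits), self.small):
            if start % self.big == 0:
                self.big_counts.append(total)
                within = 0
            self.small_counts.append(within)
            ones = bits[start:start + self.small].count("1")
            total += ones
            within += ones
        # Precomputed answers for every bit string of length <= small.
        self.table = {}
        for length in range(1, self.small + 1):
            for value in range(2 ** length):
                key = format(value, "0{}b".format(length))
                self.table[key] = key.count("1")

    def rank(self, i):
        """Number of 1s in bits[0..i], inclusive (0-based)."""
        block = i // self.small
        prefix = self.bits[block * self.small:i + 1]
        return (self.big_counts[i // self.big]
                + self.small_counts[block]
                + self.table[prefix])


bits = "1011001110001111010100110" * 8
rd = RankDirectory(bits)
print(rd.rank(9))   # 6
```

A query adds three stored numbers: one per big block, one per small block, and one table entry. The table has about 2^(small+1) entries, which is roughly the square root of n.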
THEOREM 2 A static binary tree on n nodes can be represented using 2n+o(n) bits such that, given a node x, in addition to finding its parent, left child, right child, and the size of the subtree rooted at node x, we can support leafrank(x), leafselect(j), leafsize(x), leftmost(x), and rightmost(x) operations in constant time.
Proof
• Convert the binary tree into a rooted ordered tree.
• Leaves of the binary tree correspond to the rightmost leaves in the general tree, that is, leaves that are the last child of their parent.
• Rightmost leaves in the general tree correspond to the pattern "())" in the parenthesis string.
Proof (Cont.)
[Figure: the binary tree on 10 nodes next to its equivalent rooted ordered tree with root 0, above the parenthesis sequence.]
   0  1  2  2  3  4  4  3  5  5  1  6  6  7  8  9  9  8 10 10  7  0
   (  (  (  )  (  (  )  )  (  )  )  (  )  (  (  (  )  )  (  )  )  )
Proof (Cont.)
• Since rank_p(x) counts the occurrences of the pattern p from the left end of the string, the number of occurrences of p up to node x is the number of leaves to the left of node x.
• leafrank(x) = rank_p(x), with p = "())"
Proof (Cont.)
• leafselect(j) = select_p(j): with p = "())", select_p(j) returns the j-th leaf from the left.
• leftmost(x) = select_p(rank_p(x) + 1)
• rightmost(x) = select_p(rank_p(close(parent(x)) - 1))
• leafsize(x) = rank_p(f(x)) - rank_p(x), where f(x) is the closing parenthesis of the parent of node x.
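The leaf operations can be tried out on the example sequence with naive, linear-time stand-ins for rank_p and select_p. The conventions here are assumptions, so some offsets differ by one from the formulas above: positions are 0-based, a node x is identified by the position of its opening parenthesis, an occurrence of "())" is counted at its starting position, and parent_close is a hypothetical helper playing the role of f(x).

```python
P = "())"  # a leaf "()" immediately followed by its parent's ")"


def rank_p(s, i):
    """Occurrences of P in s that start at a position <= i."""
    return sum(1 for k in range(i + 1) if s.startswith(P, k))


def select_p(s, j):
    """Start position of the j-th occurrence of P (j >= 1)."""
    seen = 0
    for k in range(len(s)):
        if s.startswith(P, k):
            seen += 1
            if seen == j:
                return k
    raise ValueError("fewer than j occurrences")


def parent_close(s, x):
    """f(x): position of the ')' closing the parent of the node opened at x."""
    depth = 0
    for k in range(x + 1, len(s)):
        depth += 1 if s[k] == "(" else -1
        if depth == -2:      # -1 closes x itself, -2 closes its parent
            return k
    raise ValueError("x has no parent")


def leafrank(s, x):
    return rank_p(s, x - 1)                       # leaves strictly left of x


def leafselect(s, j):
    return select_p(s, j)


def leftmost(s, x):
    return select_p(s, rank_p(s, x - 1) + 1)


def rightmost(s, x):
    return select_p(s, rank_p(s, parent_close(s, x) - 1))


def leafsize(s, x):
    return rank_p(s, parent_close(s, x) - 1) - rank_p(s, x - 1)


s = "((()(())())()((())()))"   # the example sequence; node 3 opens at position 4
print(leftmost(s, 4), rightmost(s, 4), leafsize(s, 4))   # 5 8 2
```

Positions 5 and 8 are the opening parentheses of nodes 4 and 5, which are the two leaves below node 3 in the binary tree.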
Representing Suffix Tree
• The binary encoding of the suffix tree yields a binary tree with 2n+1 nodes.
• Use the succinct representation of a binary tree, which takes 2 bits per node plus lower-order terms.
• The tree structure of our suffix tree therefore takes 4n + o(n) bits.
• The third component (the position indices at the leaves) takes n lg n bits.
• The second component (the skip values) is not stored.
• Total space needed: 4n + n lg n + o(n) bits = n lg n + O(n) bits = n + O(n / lg n) words.
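The total can be checked term by term. The block below is a worked restatement of the slide's arithmetic, assuming a word of lg n bits as in the earlier space accounting.

```latex
\begin{align*}
\text{tree structure} &: 2(2n+1) + o(n) = 4n + o(n) \text{ bits} \\
\text{skip values}    &: 0 \text{ bits (recomputed when needed)} \\
\text{leaf indices}   &: n \lg n \text{ bits} \\
\text{total}          &: n \lg n + 4n + o(n) = n \lg n + O(n) \text{ bits} \\
\text{in words}       &: \frac{n \lg n + O(n)}{\lg n} = n + O\!\left(\frac{n}{\lg n}\right)
\end{align*}
```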
Skip values storage trick
• Skip values need not be stored; they can be computed on the fly when needed.
• To find the skip value at a node, go to the leftmost and rightmost leaves of its subtree and compare the text at those two positions until they disagree. Suppose k characters agree, occupying l bits.
• Then find how many leading bits agree in the two differing characters; suppose j bits.
• The skip value is l + j bits.
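The l + j computation can be sketched as a bit-level common-prefix length of two suffixes. The fixed-width character codes, the function name common_prefix_bits, and measuring from the two given text positions are assumptions made for illustration.

```python
def common_prefix_bits(text, i, j, code, width):
    """Length in bits of the common prefix of text[i:] and text[j:].

    code maps each character to a distinct integer below 2**width.
    """
    k = 0
    while (i + k < len(text) and j + k < len(text)
           and text[i + k] == text[j + k]):
        k += 1
    l = k * width                        # bits covered by the k equal characters
    if i + k >= len(text) or j + k >= len(text):
        return l                         # one suffix ran out before a mismatch
    diff = code[text[i + k]] ^ code[text[j + k]]
    j_bits = width - diff.bit_length()   # leading bits shared by the two codes
    return l + j_bits


code = {"#": 0b00, "a": 0b01, "b": 0b10}   # hypothetical 2-bit codes
text = "bababa#"
print(common_prefix_bits(text, 1, 3, code, 2))   # 6
print(common_prefix_bits(text, 1, 6, code, 2))   # 1
```

In the first call the suffixes share "aba" (k = 3, l = 6) and the next characters b = 10 and # = 00 share no leading bit (j = 0). In the second call no character matches, but a = 01 and # = 00 share one leading bit.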
Searching
• Perform the search as before.
• If the search stops at a leaf, first find the leafrank of that leaf and then look up the suffix index in the array of pointers.
• If the end of the pattern is reached at an internal node, then any leaf in its subtree represents a possible matching suffix. Such a leaf can be found with leftmost(x) or rightmost(x) in constant time.
Searching (Cont.)
• With an alphabet of size |Σ|, the time to find a skip value is O(lg|Σ| + skip value).
• The sum of the skip values is at most m, so the total time spent finding skip values is O(m lg|Σ|).
Searching (Cont.)
• Once we confirm that the pattern exists (in O(m) time), the number of occurrences of the pattern is the leafsize of the node where the search ended.
• Theorem 3: A suffix tree for a text of length n can be represented using n lg n + O(n) bits such that, given a pattern of size m, the number of occurrences of the pattern in the text can be found in O(m lg|Σ|) time. Finding the positions of all the occurrences of the pattern requires O(m + s) time, where s is the number of occurrences of the pattern in the text.