Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru Department of Electrical and Computer Engineering Iowa State University.
Motivation • Large amount of biological sequence data. • Index for text usually is bigger than the text itself. • Requires efficient ways to store and query these data.
Related Works • String B-tree • Has the best worst-case performance in secondary storage while allowing updates. • However, most existing programs still use suffix trees instead of string B-trees. • Many other works focus only on the construction of suffix trees and give no worst-case bounds • S.J. Bedathur and J.R. Haritsa. Search-optimized suffix-tree storage for biological applications. • E. Hunt, M.P. Atkinson, and R.W. Irving. Database indexing for large DNA and protein sequence collections. • Clark and Munro. “Efficient suffix trees on secondary storage” • Focuses on reducing the space usage of suffix trees. • Performance depends on the height of the tree. • Farach, odd-even tree construction. • Optimal construction time in secondary storage. • The performance of search and update operations is not studied. • We show that suffix trees can achieve the same level of efficiency as string B-trees with a constant-size alphabet.
Definitions • Let v be an internal node of a suffix tree. • size(v) is the number of leaves in the subtree rooted at v. • rank(v) = i iff C^i ≤ size(v) < C^{i+1}. • Internal nodes u and v belong to the same partition iff u is the parent of v and rank(v) = rank(u). • The rank of a partition P, rank(P), is the rank of the internal nodes in the partition.
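The definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the tree is assumed to be given as a hypothetical `children` dictionary (node name to list of children), and the function names `rank`, `subtree_sizes` and `partition_roots` are made up for this example.

```python
from typing import Dict, List


def rank(size: int, C: int) -> int:
    """Return the i with C**i <= size < C**(i+1), using integer arithmetic."""
    if size < 1 or C < 2:
        raise ValueError("need size >= 1 and C >= 2")
    i, power = 0, C
    while power <= size:
        power *= C
        i += 1
    return i


def subtree_sizes(children: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """size(v) = number of suffix-tree leaves in the subtree rooted at v."""
    sizes: Dict[str, int] = {}
    stack = [(root, False)]
    while stack:  # iterative post-order traversal
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            sizes[node] = 1
        elif visited:
            sizes[node] = sum(sizes[k] for k in kids)
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return sizes


def partition_roots(children: Dict[str, List[str]], root: str, C: int) -> Dict[str, str]:
    """Map every internal node to the root of the partition it belongs to."""
    sizes = subtree_sizes(children, root)
    part = {root: root}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if not children.get(v):
                continue  # suffix-tree leaves are not assigned to partitions
            same_rank = rank(sizes[v], C) == rank(sizes[u], C)
            part[v] = part[u] if same_rank else v
            stack.append(v)
    return part


# Hypothetical toy tree with 6 leaves, C = 3.
toy = {"r": ["a", "b"], "a": ["a1", "a2", "a3"], "b": ["b1", "c"], "c": ["c1", "c2"]}
toy_partitions = partition_roots(toy, "r", 3)
```

In the toy tree, nodes r, a and b have sizes 6, 3 and 3 (all rank 1) and share one partition, while c has size 2 (rank 0) and starts a partition of its own.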
A Suffix Tree Partitioned [Figure: a suffix tree divided into partitions of rank 0, 1 and 2.] • Each root-to-leaf path goes through at most log_C n partitions.
Suffix Tree & Partition Example, C = 3 [Figure: partitions of rank 0.]
Suffix Tree & Partition Example, C = 3 [Figure: partitions of rank 1.]
Properties of a Partition • Nodes in a partition without any child in the same partition are referred to as leaves. • The node whose parent is in another partition is referred to as the root. • There are at most C-1 leaves in each partition of rank i: • size(root) ≥ size(u) for all leaves u of the partition. • C^{i+1} - 1 ≥ size(root) ≥ size(u) ≥ C^i • C · C^i = C^{i+1}, so C leaves, whose subtrees are disjoint, would already exceed size(root).
Properties of a Partition • If a node v has more than one child in the same partition as v, it is referred to as a branching node. • There can be at most C-2 branching nodes, because there are at most C-1 leaves. • A skeleton partition tree for a partition P contains the root, all the leaves and all the branching nodes of P. • There are at most 2C-2 nodes in a skeleton partition tree. • With a suitable choice of C, it can be stored in one disk page.
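As a small illustration of the skeleton partition tree, the sketch below collects the root, the leaves and the branching nodes of a single partition. It assumes a hypothetical `part_children` dictionary that lists, for each node, only its children inside the same partition. The function name `skeleton_nodes` is invented for this example.

```python
from typing import Dict, List, Set


def skeleton_nodes(part_children: Dict[str, List[str]], root: str) -> Set[str]:
    """Return the root, leaves and branching nodes of one partition."""
    leaves: Set[str] = set()
    branching: Set[str] = set()
    stack = [root]
    while stack:
        v = stack.pop()
        kids = part_children.get(v, [])
        if not kids:
            leaves.add(v)          # no child in the same partition
        elif len(kids) > 1:
            branching.add(v)       # more than one child in the same partition
        stack.extend(kids)
    return {root} | leaves | branching


# Hypothetical partition for C = 3: r -> x, x -> {y, z}, z -> w.
example_partition = {"r": ["x"], "x": ["y", "z"], "z": ["w"]}
skeleton = skeleton_nodes(example_partition, "r")
```

Here the skeleton is {r, x, y, w}: node z is neither root, leaf nor branching node, so it is left out. The result has 4 = 2C-2 nodes, which matches the bound for C = 3.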
Partition and Skeleton Partition Tree • Store a representative suffix in each node of the skeleton partition tree.
Searching for an Exact Match (1) p = TTAATGAT Load the representative suffix and compare to p. Suppose the representative suffix is TTATTAGGA…… The lcp between p and the representative suffix is 3.
Searching for an Exact Match (2) p = TTAATGAT The lcp between p and the representative suffix is 3. Move to the appropriate next partition. Total number of disk accesses: O(|p|/B + log_B n)
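The comparison step in the example can be reproduced with a short helper that computes the length of the longest common prefix (lcp). This is only a sketch of that single step. The function name `lcp` is chosen for illustration, and the suffix is cut off where the slide elides it.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


pattern = "TTAATGAT"
representative_prefix = "TTATTAGGA"  # visible part of the representative suffix
match_length = lcp(pattern, representative_prefix)  # 3
```

The first three characters (TTA) agree and the fourth differs (A versus T), which gives the lcp of 3 stated on the slide.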
Supporting Update Operations • With insertion and deletion, the sizes of nodes change, and so do the partitions. • During insertion of a suffix, • size(v) changes if and only if node v is an ancestor of the newly inserted leaf. • rank(v) may change only if size(v) changes and node v is the root of a partition. • If rank(v) changes, node v becomes either a new partition by itself or a leaf in its parent’s partition.
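Only the Rank of the Root of a Partition Changes • Suppose rank(v) increases by one during an insertion, but v is not the root of its partition. • Then size(v) was C^{rank(v)+1} - 1 before the insertion. • The root is a proper ancestor of v, so size(root) was at least C^{rank(v)+1}. • Hence the root was not in the same partition as v, a contradiction.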
Insertion and Deletion • By the same argument, only the rank of a leaf of a partition can change during the deletion of a suffix. • Store size(v) and keep it up to date for node v if • v is the root of the partition, • v is connected to the root by a chain of branching nodes, or • v is a non-branching node and is the child of a node u that satisfies one of the conditions above.
The Root of a Partition is Removed • Let v be a child of the old root in the partition. • If v is a branching node, nothing needs to be done, and the new partition with v as the root has all sizes set correctly. • If v is a non-branching node, we can calculate the size of its only child in the partition by subtracting the sizes of all other children from size(v). • After the updates, all size values are set correctly as stated previously.
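The subtraction step for a non-branching node can be written out as a one-line computation. The sketch assumes that the sizes of the other children are available, for instance because they are roots of other partitions (whose sizes are stored) or suffix-tree leaves (size 1). The function name and the numbers are hypothetical.

```python
from typing import Iterable


def size_of_only_partition_child(size_v: int, other_child_sizes: Iterable[int]) -> int:
    """size(child in partition) = size(v) - sum of the sizes of all other children."""
    return size_v - sum(other_child_sizes)


# Hypothetical node v with size 7 whose other children have sizes 1 and 2.
child_size = size_of_only_partition_child(7, [1, 2])  # 4
```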
The Root of a Partition is Removed [Figure: the old root and the new roots of the resulting partitions.]
The Leaf of a Partition is Removed • If a leaf is removed from a partition, • The leaf becomes a root; its size can be calculated as the sum of the sizes of all its children, which were all roots of different partitions. • Either a previously branching node becomes a non-branching node, in which case no update of size is necessary, or • A previously non-branching node becomes a new leaf, in which case the size of the new leaf can be calculated by adding the sizes of all its children.
The Leaf of a Partition is Removed [Figure: the removed leaf, shown as a leaf from another partition.]
Results • Let B be the size of a disk block. • Let n be the total length of the strings. • Let m be the length of the string being inserted or deleted. • Construction takes O(n log_B n) disk accesses. • Insertion and deletion take O(m log_B (n+m)) and O(m log_B n) disk accesses, respectively. • Let p be the length of a pattern. • Searching takes O(p/B + log_B n) disk accesses.
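To get a feel for the bounds, the expressions inside the O(...) can be evaluated for sample values. The sketch below ignores the hidden constants, so the outputs are not real access counts, and all input numbers are hypothetical.

```python
import math


def search_term(p: int, n: int, B: int) -> float:
    """Value of p/B + log_B n, the expression inside the search bound."""
    return p / B + math.log(n, B)


def insert_term(m: int, n: int, B: int) -> float:
    """Value of m * log_B (n + m), the expression inside the insertion bound."""
    return m * math.log(n + m, B)


# Hypothetical values: pattern of length 1000, text of length 10**9, block size 1000.
search_value = search_term(1000, 10**9, 1000)   # about 4
insert_value = insert_term(10, 990, 10)         # about 30
```

With these sample values the log_B n term is about 3, so for short patterns the logarithmic term dominates the search expression.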