330 likes | 473 Views
Building Suffix Trees in O(m) time. Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen developed a simpler to understand variant in 1995 This is what we will focus on. Implicit Suffix Trees.
E N D
Building Suffix Trees in O(m) time • Weiner had first linear time algorithm in 1973 • McCreight developed a more space efficient algorithm in 1976 • Ukkonen developed a simpler to understand variant in 1995 • This is what we will focus on
Implicit Suffix Trees • An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by • removing $ from all edge labels • removing any edges that now have no label • removing any node that does not still have at least two children • Some suffixes may no longer be leaves • An implicit suffix tree for prefix S[1..i] of S is similarly defined based on the suffix tree for S[1..i]$ • Ii will denote the implicit suffix tree for S[1..i]
Example • Implicit tree for xabxa from tree for xabxa$ • {xabxa$, abxa$, bxa$, xa$, a$, $} b x a $ x a 6 a $ b $ x 5 b $ x a a 4 $ $ 3 2 1
Remove $ • Remove $ • {xabxa$, abxa$, bxa$, xa$, a$, $} b x a $ x a 6 a $ b $ x 5 b $ x a a 4 $ $ 3 2 1
Remove $ after • Remove $ • {xabxa, abxa, bxa, xa, a} b x a x a 6 a b x 5 b x a a 4 3 2 1
Remove unlabeled edges • Remove unlabeled edges • {xabxa, abxa, bxa, xa, a} b x a x a 6 a b x 5 b x a a 4 3 2 1
Remove unlabeled edges • Remove unlabeled edges • {xabxa, abxa, bxa, xa, a} b x a x a a b x b x a a 3 2 1
Remove interior nodes • Remove internal nodes with only one child • {xabxa, abxa, bxa, xa, a} b x a x a a b x b x a a 3 2 1
Final implicit tree • Remove internal nodes with only one child • {xabxa, abxa, bxa, xa, a} b x x a a a b b x x a a 3 2 1
Basic Structure of Algorithm • Initialization • I1 has one edge labeled S(1) • For i = 1 to m-1 (build Ii+1) • For j = 1 to i+1 • Find location of string S[j..i] in tree • “Extend” to incorporate character S(i+1) • Expand final implicit tree to make full suffix tree
Order of operations visualization • S = xabxacdefghixabcab$ • i+1 = 1 • x • i+1 = 2 • extend x to xa • a • i+1 = 3 • extend xa to xab • extend a to ab • b • …
Extension Rules • Case 1: S[j..i] ends at a leaf • Add character S(i+1) to end of label on leaf edge • Case 2: Not a leaf, but no path from end of S[j..i] location continues with S(i+1) • Split a new leaf edge for character S(i+1) • May need to create an internal node if S[j..i] ends in the middle of an edge • Case 3: S[j..i+1] is already in the tree • No update
a b x x x b a b a b b x 4 b x b 5 b x 3 b 2 1 Visualization • Implicit tree for axabxb from tree for axabx b Rule 1: at a leaf node Rule 2: add a leaf edge (and an interior node) Rule 3: already in tree
Observations • Once S[j..i] is located in the tree, extending to accommodate S(i+1) is constant time • Making Ukkonen’s algorithm O(m2) • Finding the S[j..i] locations in the suffix trees quickly when explicit computation is needed
Edge Label Representation • Potential Problem • Size of edge labels may be W(m2) • Example • S = abcdefghijklmnopqrstuvwxyz • Total length is Sj<m+1 j = m(m+1)/2 • Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size • Solution • Label edges with pair of indices indicating beginning and end of positions of the substring in S
Full edge label illustration • String S = xabxa$ b x a $ x a 6 a $ b $ x 5 b $ x a a 4 $ $ 3 2 1
Compact edge label illustration • String S = xabxa$ (1,2) [or (4,5)?] (2,2) (6,6) (3,6) 6 (6,6) 5 (6,6) (3,6) (3,6) 4 3 2 1
Modified Extension Rules • Rule 2: new leaf edge • label new leaf edge (i+1, i+1) • Rule 1: leaf edge extension • label had to be (p,i) before extension • given rule 2 above and an induction argument • now will be (p, i+1) • Rule 3: still nothing needs to be done
Suffix Links • Consider the two strings a and xa • Suppose some internal node v of the tree is labeled with xa and another node s(v) in the tree is labeled with a • Then the edge (v,s(v)) is a suffix link • Do all internal nodes (the root is not considered an internal node) have suffix links?
Suffix Link Lemma • If a new internal node v with path-label xa is added to the current tree in extension j of some phase i+1, then • the path labeled a already exists at an internal node of the tree or • the internal node labeled a will be created in the extension of j+1 or • string a is empty and s(v) is the root
Proof of Suffix Link Lemma • A new internal node is created only by extension rule 2 • This means there are two distinct suffixes of S[1..i+1] that start with xa • xaS(i+1) and xacb where c is not S(i+1) • This means there are two distinct suffixes of S[1..i+1] that start with a • aS(i+1) and acb where c is not S(i+1) • Thus, if a is not empty, a will label an internal node once extension j+1 is processed which is the extension of a
Using suffix links to speed up location of S[j..i] • S[1..i] must end at a leaf since it is the longest string in implicit tree Ii • Keep a pointer to this leaf in all cases and extend according to rule 1 • Locating S[j+1..i] from S[j..i] which is at node w • If w is an internal node, set v to w • Otherwise, set v = parent(w) • If v is the root, you must traverse from root to find S[j+1..i] • If not, go to s(v) and begin search for remaining portion of S[j..i] from there • Remaining portion is the label of the edge we traversed up • (see figure 6.5 on page 100)
Skip/count Trick • Problem: Moving down from s(v) naively takes time proportional to the number of characters compared • Solution • At each node, only compare the first character in an edge label to the next character to be checked • Then, use the number of characters on that edge to update search in constant time • Running time is now proportional to the number of nodes in the path searched rather than the number of characters • See Figure 6.6 on page 102
O(m2) argument • node-depth of v: number of nodes on path from root to node v • Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, nd(v) <= nd(s(v))+1 • If xb is an ancestral internal node of v where b is not empty, then it has a suffix link to a node with path-label b • See Figure 6.7 on page 103
O(m2) argument • Lemma: Any phase takes O(m) time with skip/count trick • Proof • Decrements to node depth at most 2m • i+1 <= m extensions per phase • Walking up decreases node depth at most 1 • Suffix link traversal decreases node depth at most 1 • At most 3m downward edge traversal • Max node depth is m • None are negative • Each downward traversal increases depth by at least 1
Observation • Making Ukkonen’s algorithm O(m) • Implicit computations of many extensions • Need to take argument for a single phase and extend to multiple phases
Rule 3 • Suppose suffix extension rule 3 applies to S[j..i+1]. • This means S[j..i+1] already appears in the implicit suffix tree as a prefix of a larger suffix (and is thus a substring of S[1..i]) • Then it applies to S[k..i+1] for k > j. • Clearly, S[k..i+1] must also be a substring of S[1..i] and thus must be in the tree • Thus, stop a phase once the first application of rule 3 occurs • All future rule 3 extensions are done implicitly, not explicitly
Implicit expansion of leaf nodes • Once a leaf, always a leaf • It will always be extended using rule 1 • Implicit expansion of leaf nodes • When a leaf edge is created in phase i+1, instead of labeling it with (p, i+1), label it with (p,e) • e is a global index that is set to i+1 once in each phase • In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location • How can we easily identify leaf nodes to avoid explicitly expanding them in later phases?
Avoiding leaf nodes • For phase i, let last(i) denote the last extension of phase i that is not by rule 3 • Observation: • All the suffixes S[j..i] for 1 <= j <= last(i) end at a leaf node • All the extensions for 1 <= j <= last(i) in phase i+1 and higher are rule 1 extensions by previous slide • Therefore, last(i+1) >= last(i) • Trick • In phase i+1, only explicitly compute extensions for last(i)+1 up till first rule 3 extension is found
Single phase algorithm • Phase i+1 • Increment e to i+1 (implicitly extending all existing leaves) • Explicitly compute successive extensions starting at last(i)+1 and continuing until a rule 3 extension or no more extensions needed • Exact location is known given next step in previous phase • Set last(i+1) appropriately (last rule 1 or 2 extension) • Observation • Phase i and i+1 share at most 1 explicit extension • In all phases, there will be at most 2m extensions
Visualization of explicit extensions • S = xabxacdefghixabcab$ • Phase 1: extend x (rule 1) • Phase 2: extend a (rule 2) • Phase 3: extend b (rule 2) • Phase 4: extend x (rule 3) • Phase 5: extend x (rule 3) • Phase 6: extend x (rule 2), extend a (rule 2) extend c (rule 2) • Phase 7: extend d (rule 2) • …
O(m) argument • Lemma: All phases take O(m) time • Proof • Similar to previous node-depth argument • Key new observation • From one phase to the next, we either start at root or we work with same node we ended with last time so the node depth is identical • With 2m total extensions at most, we get O(m) upward and downward traversals of links
Finishing up • Convert final implicit suffix tree to a true suffix tree • Add $ using just another phase of execution • Now all suffixes will be leaves • Replace e in every leaf edge with m • Just requires a traversal of tree which is O(m) time