
Building Suffix Trees in O(m) time


fiorello


  1. Building Suffix Trees in O(m) time • Weiner gave the first linear-time algorithm in 1973 • McCreight developed a more space-efficient algorithm in 1976 • Ukkonen developed a simpler-to-understand variant in 1995 • Ukkonen's algorithm is what we will focus on

  2. Implicit Suffix Trees • An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by • removing $ from all edge labels • removing any edges that now have no label • removing any node that no longer has at least two children • Some suffixes may no longer end at leaves • An implicit suffix tree for prefix S[1..i] of S is similarly defined based on the suffix tree for S[1..i]$ • I_i will denote the implicit suffix tree for S[1..i]

  3. Example • Implicit tree for xabxa from tree for xabxa$ • Suffixes: {xabxa$, abxa$, bxa$, xa$, a$, $} • [Figure: suffix tree for xabxa$ with leaves numbered 1 to 6]

  4. Remove $ (before) • Remove $ • Suffixes: {xabxa$, abxa$, bxa$, xa$, a$, $} • [Figure: suffix tree for xabxa$ with leaves numbered 1 to 6]

  5. Remove $ (after) • Remove $ • Suffixes: {xabxa, abxa, bxa, xa, a} • [Figure: the same tree with $ removed from every edge label, leaves still numbered 1 to 6]

  6. Remove unlabeled edges (before) • Remove unlabeled edges • Suffixes: {xabxa, abxa, bxa, xa, a} • [Figure: tree with $ removed, leaves numbered 1 to 6]

  7. Remove unlabeled edges (after) • Remove unlabeled edges • Suffixes: {xabxa, abxa, bxa, xa, a} • [Figure: tree with the unlabeled edges removed, only leaves 1 to 3 remain]

  8. Remove interior nodes (before) • Remove internal nodes with only one child • Suffixes: {xabxa, abxa, bxa, xa, a} • [Figure: tree with leaves 1 to 3, still containing one-child internal nodes]

  9. Final implicit tree • Remove internal nodes with only one child • Suffixes: {xabxa, abxa, bxa, xa, a} • [Figure: final implicit suffix tree for xabxa with leaves 1 to 3]
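
The xabxa example can be checked without building a tree. A suffix keeps its own leaf in the implicit tree only if it is not a prefix of another suffix. The sketch below is a brute-force check under that observation, not part of Ukkonen's algorithm; the function name `implicit_tree_leaves` is made up for illustration.

```python
def implicit_tree_leaves(s):
    """Return the 1-based start positions of suffixes that end at a leaf.

    A suffix that is a proper prefix of another suffix ends inside the tree,
    so it has no leaf of its own.
    """
    suffixes = [s[j:] for j in range(len(s))]
    leaves = []
    for j, suffix in enumerate(suffixes):
        hidden = any(other != suffix and other.startswith(suffix)
                     for other in suffixes)
        if not hidden:
            leaves.append(j + 1)
    return leaves


print(implicit_tree_leaves("xabxa$"))  # every suffix has a leaf
print(implicit_tree_leaves("xabxa"))   # xa and a are hidden inside longer suffixes
```

With the terminator $ no suffix can be a prefix of another, which is why the true suffix tree has one leaf per suffix.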

  10. Basic Structure of Algorithm • Initialization • I_1 has one edge labeled S(1) • For i = 1 to m-1 (build I_{i+1}) • For j = 1 to i+1 • Find location of string S[j..i] in tree • "Extend" to incorporate character S(i+1) • Expand final implicit tree to make full suffix tree

  11. Order of operations visualization • S = xabxacdefghixabcab$ • i+1 = 1 • x • i+1 = 2 • extend x to xa • a • i+1 = 3 • extend xa to xab • extend a to ab • b • …

  12. Extension Rules • Rule 1: S[j..i] ends at a leaf • Add character S(i+1) to end of label on leaf edge • Rule 2: Not a leaf, but no path from end of S[j..i] location continues with S(i+1) • Add a new leaf edge for character S(i+1) • May need to create an internal node if S[j..i] ends in the middle of an edge • Rule 3: S[j..i+1] is already in the tree • No update

  13. Visualization • Implicit tree for axabxb from tree for axabx • Rule 1: at a leaf node • Rule 2: add a leaf edge (and an interior node) • Rule 3: already in tree • [Figure: implicit suffix tree for axabxb with leaves numbered 1 to 5, annotated with the rule applied at each extension]
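
The three extension rules can be traced with a deliberately naive sketch. It assumes an uncompressed trie (one character per edge, nested dicts) rather than the compressed tree on the slides, and it walks from the root in every extension, so it is far from linear time. The name `naive_phases` is hypothetical; the point is only to show which rule fires in each extension.

```python
def naive_phases(s):
    """Build a suffix trie phase by phase and log the rule used per extension.

    Returns (root, log) where log[k] lists the rules (1, 2 or 3) applied in
    extensions j = 1 .. i+1 of the phase that adds character s[k+1] (0-based).
    """
    root = {s[0]: {}}              # I_1: a single edge labeled S(1)
    log = []
    for i in range(1, len(s)):     # this phase adds s[i]
        c = s[i]
        rules = []
        for j in range(i + 1):     # s[j:i] is S[j+1..i]; empty when j == i
            node = root
            for ch in s[j:i]:      # naive walk from the root
                node = node[ch]
            if c in node:
                rules.append(3)    # rule 3: already in the tree
            elif not node:
                node[c] = {}
                rules.append(1)    # rule 1: the path ended at a leaf
            else:
                node[c] = {}
                rules.append(2)    # rule 2: branch off with a new leaf
        log.append(rules)
    return root, log


root, log = naive_phases("axabxb")
print(log[-1])  # rules used when b is appended to axabx
```

For the last phase the log shows rule 1 for the four existing leaves, rule 2 for the suffix x, and rule 3 for the final extension.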

  14. Observations • Once S[j..i] is located in the tree, extending to accommodate S(i+1) is constant time • Making Ukkonen's algorithm O(m^2) • Finding the S[j..i] locations in the suffix trees quickly when explicit computation is needed

  15. Edge Label Representation • Potential Problem • Size of edge labels may be Ω(m^2) • Example • S = abcdefghijklmnopqrstuvwxyz • Total length is Σ_{j=1..m} j = m(m+1)/2 • Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size • Solution • Label edges with a pair of indices indicating the beginning and end positions of the substring in S
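
The arithmetic on this slide can be reproduced directly. In the sketch below every suffix of the alphabet string hangs off the root as its own edge, because all characters are distinct; the variable names are illustrative only.

```python
import string

S = string.ascii_lowercase          # abcdefghijklmnopqrstuvwxyz
m = len(S)

# Explicit labels: the edge for suffix j stores all of its characters.
explicit_chars = sum(len(S[j:]) for j in range(m))

# Index-pair labels: the edge for suffix j stores (start, end), 1-based inclusive.
pairs = [(j + 1, m) for j in range(m)]
stored_integers = 2 * len(pairs)

print(explicit_chars, m * (m + 1) // 2)   # both are 351
print(stored_integers)                    # 52

# A pair still recovers the full label on demand.
start, end = pairs[23]
print(S[start - 1:end])                   # xyz
```

The pair representation costs two integers per edge regardless of label length, which keeps the whole tree at O(m) space.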

  16. Full edge label illustration • String S = xabxa$ • [Figure: suffix tree for xabxa$ with full substring labels on the edges, leaves numbered 1 to 6]

  17. Compact edge label illustration • String S = xabxa$ • [Figure: the same tree with each edge labeled by an index pair: (1,2) [or (4,5)?] for xa, (2,2) for a, (6,6) for $, (3,6) for bxa$; leaves numbered 1 to 6]

  18. Modified Extension Rules • Rule 2: new leaf edge • label new leaf edge (i+1, i+1) • Rule 1: leaf edge extension • label had to be (p,i) before extension • given rule 2 above and an induction argument • now will be (p, i+1) • Rule 3: still nothing needs to be done

  19. Suffix Links • Consider the two strings α and xα, where x is a single character and α is a (possibly empty) string • Suppose some internal node v of the tree has path-label xα and another node s(v) in the tree has path-label α • Then the edge (v, s(v)) is a suffix link • Do all internal nodes (the root is not considered an internal node) have suffix links?

  20. Suffix Link Lemma • If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then • the path labeled α already ends at an internal node of the tree, or • an internal node at the end of string α will be created in extension j+1 of the same phase, or • string α is empty and s(v) is the root

  21. Proof of Suffix Link Lemma • A new internal node is created only by extension rule 2 • This means the tree contains two distinct strings that start with xα • xαS(i+1) and xαcβ where c is not S(i+1) • Dropping the first character x, the tree must also contain two distinct strings that start with α • αS(i+1) and αcβ where c is not S(i+1) • Thus, if α is not empty, α will label an internal node once extension j+1, the extension that handles αS(i+1), has been processed
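
The consequence of the lemma for a finished tree can be checked by brute force: every internal node whose path-label is a character x followed by some string has a suffix link target labeled with that remaining string. The sketch below assumes that internal nodes correspond to substrings followed by at least two different characters; the helper names are hypothetical and the approach is quadratic, for checking only.

```python
from collections import defaultdict


def internal_labels(text):
    """Path-labels of internal (branching, non-root) nodes of the suffix tree."""
    followers = defaultdict(set)
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n):
            # text[start:end] is followed by the character text[end]
            followers[text[start:end]].add(text[end])
    return {sub for sub, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(text):
    """Map each internal label to the label of its suffix link target ('' = root)."""
    internal = internal_labels(text)
    links = {label: label[1:] for label in internal}
    # Every target must be the root or another internal node.
    assert all(target == "" or target in internal for target in links.values())
    return links


print(suffix_links("xabxa$"))
```

For xabxa$ the internal nodes are xa and a, with xa linking to a and a linking to the root.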

  22. Using suffix links to speed up location of S[j..i] • S[1..i] must end at a leaf since it is the longest string in implicit tree I_i • Keep a pointer to this leaf in all cases and extend according to rule 1 • Locating S[j+1..i] from S[j..i], which ends at node w • If w is an internal node, set v to w • Otherwise, set v = parent(w) • If v is the root, you must traverse from the root to find S[j+1..i] • If not, go to s(v) and begin the search for the remaining portion of S[j+1..i] from there • The remaining portion is the label of the edge we traversed up • (see figure 6.5 on page 100)

  23. Skip/count Trick • Problem: Moving down from s(v) naively takes time proportional to the number of characters compared • Solution • At each node, only compare the first character in an edge label to the next character to be checked • Then, use the number of characters on that edge to update search in constant time • Running time is now proportional to the number of nodes in the path searched rather than the number of characters • See Figure 6.6 on page 102
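
A minimal sketch of the skip/count walk, assuming a hand-built tree for xabxa$ whose edges are stored as 0-based half-open index pairs keyed by their first character. `Node`, `add_edge` and `skip_count` are illustrative names, and the string being located is assumed to be present in the tree, which is what makes skipping safe.

```python
TEXT = "xabxa$"


class Node:
    def __init__(self):
        # first character of edge label -> (start, end, child); label is TEXT[start:end]
        self.edges = {}


def add_edge(parent, start, end):
    child = Node()
    parent.edges[TEXT[start]] = (start, end, child)
    return child


def skip_count(node, lo, hi):
    """Walk down from node along TEXT[lo:hi], comparing one character per edge.

    Returns (node, first_char, offset, hops): the string ends `offset`
    characters into the edge that starts with first_char below node
    (first_char is None and offset is 0 when it ends exactly at node).
    """
    hops = 0
    while lo < hi:
        start, end, child = node.edges[TEXT[lo]]   # only the first character is inspected
        edge_len = end - start
        if hi - lo < edge_len:                     # the string ends inside this edge
            return node, TEXT[lo], hi - lo, hops
        lo += edge_len                             # skip the whole edge in O(1)
        node = child
        hops += 1
    return node, None, 0, hops


# Suffix tree for xabxa$ built by hand.
root = Node()
xa = add_edge(root, 0, 2)      # edge "xa"
add_edge(xa, 2, 6)             # "bxa$"
add_edge(xa, 5, 6)             # "$"
a = add_edge(root, 1, 2)       # edge "a"
add_edge(a, 2, 6)
add_edge(a, 5, 6)
add_edge(root, 2, 6)           # "bxa$"
add_edge(root, 5, 6)           # "$"

end_node, first_char, offset, hops = skip_count(root, 0, 4)   # locate "xabx"
print(end_node is xa, first_char, offset, hops)
```

Locating xabx costs two edge lookups here instead of four character comparisons; on long edges the saving grows with the edge length.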

  24. O(m^2) argument • node-depth of v: number of nodes on path from root to node v • Lemma: For any suffix link (v, s(v)) traversed in Ukkonen's algorithm, at that moment, nd(v) <= nd(s(v))+1 • Reason: if an internal ancestor of v has path-label xβ where β is not empty, then it has a suffix link to a node with path-label β, and that node is an ancestor of s(v) • See Figure 6.7 on page 103

  25. O(m^2) argument • Lemma: Any phase takes O(m) time with the skip/count trick • Proof • Total decrements to node depth are at most 2m • There are i+1 <= m extensions per phase • Walking up decreases node depth by at most 1 • A suffix link traversal decreases node depth by at most 1 • So there are at most 3m downward edge traversals • Max node depth is m • Node depth is never negative • Each downward traversal increases depth by at least 1

  26. Observation • Making Ukkonen’s algorithm O(m) • Implicit computations of many extensions • Need to take argument for a single phase and extend to multiple phases

  27. Rule 3 • Suppose suffix extension rule 3 applies to S[j..i+1]. • This means S[j..i+1] already appears in the implicit suffix tree as a prefix of a larger suffix (and is thus a substring of S[1..i]) • Then it applies to S[k..i+1] for k > j. • Clearly, S[k..i+1] must also be a substring of S[1..i] and thus must be in the tree • Thus, stop a phase once the first application of rule 3 occurs • All future rule 3 extensions are done implicitly, not explicitly

  28. Implicit expansion of leaf nodes • Once a leaf, always a leaf • It will always be extended using rule 1 • Implicit expansion of leaf nodes • When a leaf edge is created in phase i+1, instead of labeling it with (p, i+1), label it with (p,e) • e is a global index that is set to i+1 once in each phase • In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location • How can we easily identify leaf nodes to avoid explicitly expanding them in later phases?
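
One way to picture the global index e is as a single shared object that every leaf edge points to. The sketch below uses that assumption; `GlobalEnd` and `Edge` are hypothetical names, and indices are 1-based inclusive as on the slides.

```python
class GlobalEnd:
    """Shared end index e used by every leaf edge."""

    def __init__(self, value):
        self.value = value


class Edge:
    def __init__(self, start, end):
        self.start = start   # 1-based, inclusive
        self.end = end       # an int for internal edges, a GlobalEnd for leaf edges

    def label(self, s):
        end = self.end.value if isinstance(self.end, GlobalEnd) else self.end
        return s[self.start - 1:end]


S = "xabxa"
e = GlobalEnd(3)
leaf_edges = [Edge(1, e), Edge(2, e), Edge(3, e)]   # leaf edges of the tree for xab

before = [edge.label(S) for edge in leaf_edges]
e.value += 1                                        # phase 4
e.value += 1                                        # phase 5
after = [edge.label(S) for edge in leaf_edges]

print(before)   # ['xab', 'ab', 'b']
print(after)    # ['xabxa', 'abxa', 'bxa']
```

Two increments of e extend all three leaves at once, which is the constant-time replacement for three rule 1 extensions per phase.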

  29. Avoiding leaf nodes • For phase i, let last(i) denote the last extension of phase i that is not by rule 3 • Observation: • All the suffixes S[j..i] for 1 <= j <= last(i) end at a leaf node • All the extensions for 1 <= j <= last(i) in phase i+1 and higher are rule 1 extensions by previous slide • Therefore, last(i+1) >= last(i) • Trick • In phase i+1, only explicitly compute extensions for last(i)+1 up till first rule 3 extension is found

  30. Single phase algorithm • Phase i+1 • Increment e to i+1 (implicitly extending all existing leaves) • Explicitly compute successive extensions starting at last(i)+1 and continuing until a rule 3 extension or no more extensions needed • Exact location is known given next step in previous phase • Set last(i+1) appropriately (last rule 1 or 2 extension) • Observation • Phase i and i+1 share at most 1 explicit extension • In all phases, there will be at most 2m extensions

  31. Visualization of explicit extensions • S = xabxacdefghixabcab$ • Phase 1: extend x (rule 1) • Phase 2: extend a (rule 2) • Phase 3: extend b (rule 2) • Phase 4: extend x (rule 3) • Phase 5: extend x (rule 3) • Phase 6: extend x (rule 2), extend a (rule 2), extend c (rule 2) • Phase 7: extend d (rule 2) • …
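
The schedule of explicit extensions can be simulated without a tree, using the fact that rule 3 applies to extension j of phase i+1 exactly when S[j..i+1] already occurs in S[1..i]. The sketch below uses substring search for that test, so it only counts extensions and is not linear time; the function name `explicit_extensions` is hypothetical.

```python
def explicit_extensions(s):
    """List, per phase 2..m, the 1-based extension numbers done explicitly."""
    last = 1                        # phase 1 created the single leaf for suffix 1
    phases = []
    for i in range(1, len(s)):      # this phase adds s[i], i.e. S(i+1)
        prefix = s[:i]              # S[1..i]
        done = []
        j = last + 1                # leaves 1..last are extended implicitly via e
        while j <= i + 1:
            done.append(j)
            if s[j - 1:i + 1] in prefix:   # rule 3: stop the phase here
                break
            last = j                # rule 2 created leaf j
            j += 1
        phases.append(done)
    return phases


S = "xabxacdefghixabcab$"
phases = explicit_extensions(S)
for number, done in enumerate(phases[:6], start=2):
    print("phase", number, "explicit extensions", done)

total = sum(len(done) for done in phases)
print(total <= 2 * len(S))
```

The first phases match the list above: one extension each for phases 2 to 5, then extensions 4, 5 and 6 in phase 6. Consecutive phases overlap in at most one extension number, which is where the 2m bound comes from.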

  32. O(m) argument • Lemma: All phases take O(m) time • Proof • Similar to previous node-depth argument • Key new observation • From one phase to the next, we either start at root or we work with same node we ended with last time so the node depth is identical • With 2m total extensions at most, we get O(m) upward and downward traversals of links

  33. Finishing up • Convert final implicit suffix tree to a true suffix tree • Add $ using just another phase of execution • Now all suffixes will be leaves • Replace e in every leaf edge with m • Just requires a traversal of tree which is O(m) time
