Selected Applications of Suffix Trees
Reminder – suffix tree. Suffix tree for string S of length m: • rooted directed tree with m leaves numbered 1,...,m. • each internal node, except the root, has at least 2 children. • each edge is labeled with a nonempty substring of S. • edges out of a node begin with different characters. • the path from the root to leaf i spells out suffix S[i...m].
Reminder – suffix tree (continued) • Each substring a of S appears on some unique path from the root. • If a ends at point p, the leaves below p mark all its occurrences. • a occurs in S starting at position j ⇔ a is a prefix of S[j...m] ⇔ a labels an initial part of the path from the root to leaf j.
Example: S = xabxa$ (positions 1 to 6). [Figure: suffix tree for S with leaves numbered 1 to 6.]
Exact string matching. Find all occurrences of pattern P (length n) in text T (length m). • Build a suffix tree for T: O(m) (Ukkonen). • Match P along a path from the root: O(1) per character (finite alphabet), O(n) total. • If P fully matches a path, then the leaves below mark all starting positions of P in T: O(k), where k = number of occurrences.
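The matching procedure on this slide can be sketched in Python. This is only an illustration under simplifying assumptions: it uses a naive suffix trie built in quadratic time rather than Ukkonen's linear-time suffix tree, and it stores the occurrence list at every node instead of collecting the leaves below the match point. The names `build_suffix_trie` and `find_occurrences` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie for text (quadratic time and space, illustration only).

    Every node records the 1-based start positions of the suffixes passing
    through it, standing in for "the leaves below this point".
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root; return all 1-based starting positions."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # pattern falls off the trie: no occurrence
    return node["starts"]


trie = build_suffix_trie("xabxa$")
print(find_occurrences(trie, "xa"))   # [1, 4]
print(find_occurrences(trie, "bx"))   # [3]
print(find_occurrences(trie, "ax"))   # []
```

With a real suffix tree the build is O(m) and the occurrence list is read off the leaves, which gives the O(n + k) query bound stated on the slide; the sketch only mirrors the matching logic.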
Matching Statistics • ms(i) – the length of the longest substring of T starting at position i that matches a substring somewhere in P. • Example: T = abcxabcdex, P = wyabcwzqabcdw: ms(1)=3, ms(5)=4. • There is an occurrence of P starting at position i of T iff ms(i)=|P|.
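To make the definition concrete, here is a brute-force Python sketch that computes ms(i) directly from the definition. It is not the linear-time suffix-link method; the function name `matching_statistics` is hypothetical, and Python's `in` operator stands in for matching against a suffix tree of P.

```python
def matching_statistics(text, pattern):
    """ms[i-1] = length of the longest substring of text starting at
    1-based position i that occurs somewhere in pattern (brute force)."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Grow the match while the extended substring still occurs in pattern.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
print(ms[0], ms[4])   # 3 4  (ms(1) and ms(5) from the slide)
```

Positions where the value equals |P| are exactly the occurrences of P in T, so the same table answers the exact matching problem.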
Goal: Compute ms(i) for each position i in T, in O(m) total time, using only a suffix tree for P. • Naive way: match T[i...m] starting from the root for each i; this takes more than O(m) total. Using suffix links: • Build a suffix tree for P (Ukkonen) and keep the suffix links. • Suffix link: pointer from internal node v with path-label xa to node s(v) with path-label a (where x is a single character and a is a substring).
Compute ms(i) in order. Base case: for ms(1), match T[1...m] from the root. General case: suppose the matching path for ms(i) ended at point b; then for ms(i+1): • Let v be the first internal node at or above b. • If there is no such v, search from the root. • Otherwise, follow the suffix link from v to s(v) and search from s(v). This is valid because path_label(v)=xa is a prefix of T[i...m], so path_label(s(v))=a is a prefix of T[i+1...m].
Skip / count • Let β denote the string between node v and point b. • Substring xaβ in P matches a prefix of T[i...m]. • Substring aβ in P matches a prefix of T[i+1...m]. • Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path). • From the end of β, match single characters (starting with the first character that didn't match for ms(i)).
Time analysis. In the search for ms(i+1): • back up at most one edge from b to v: O(1). • traverse the suffix link from v to s(v): O(1). • traverse the path out of s(v) that spells the string between v and b, in time proportional to the number of nodes on it: O(m) total. • perform additional comparisons starting with the first character that didn't match for ms(i): O(m) total.
Definitions. For any position i in string S of length m: • Prior_i – the longest prefix of S[i...m] that occurs as a substring of S[1...i-1]. • l_i – the length of Prior_i. • s_i – the starting position of the left-most copy of Prior_i (defined when l_i > 0). Example: S = abaxcabaxabz, Prior_7 = bax, l_7 = 3, s_7 = 2. • The copy of Prior_i starting at s_i is totally contained in S[1...i-1].
Basic idea • Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0. • Then Prior_i need not be explicitly represented. • The pair (s_i,l_i) points to an earlier occurrence of Prior_i. • Example: S = abaxcabaxabz; the occurrence of bax at position 7 can be represented by the pair (2,3).
Compression algorithm (outline)
i := 1
Repeat
    compute l_i and s_i
    if l_i > 0 then
        output (s_i,l_i)
        i := i + l_i
    else
        output S(i)
        i := i + 1
Until i > m
Examples • S1 = abacabaxabz is compressed to: a b (1,1) c (1,3) x (1,2) z. • S2 = (ab)^16 = abab...ab (32 characters) is compressed to: a b (1,2) (1,4) (1,8) (1,16). • In general, for S = (ab)^k the compressed representation has length O(log k).
Decompress • Process the compressed string left to right. • Any pair (s_i,l_i) in the representation points to a substring that has already been fully decompressed.
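The compression outline and the decompression rule can be sketched together in Python. This is a brute-force illustration, not the suffix-tree implementation: each (s_i, l_i) is found with `str.find` restricted to the already-compressed prefix, so the running time is far from linear. The names `compress` and `decompress`, and the choice of representing output as a list of characters and (start, length) tuples, are assumptions made for this example.

```python
def compress(s):
    """Emit characters and (s_i, l_i) pairs (1-based start, length)."""
    out = []
    i = 0                                   # 0-based index of position i
    while i < len(s):
        best_start, best_len = 0, 0
        length = 1
        while i + length <= len(s):
            # Leftmost copy of s[i:i+length] lying entirely inside s[:i].
            pos = s.find(s[i:i + length], 0, i)
            if pos == -1:
                break
            best_start, best_len = pos + 1, length
            length += 1
        if best_len > 0:
            out.append((best_start, best_len))
            i += best_len
        else:
            out.append(s[i])
            i += 1
    return out


def decompress(tokens):
    """Process tokens left to right; every pair refers to decoded text."""
    chars = []
    for tok in tokens:
        if isinstance(tok, tuple):
            start, length = tok
            chars.extend(chars[start - 1:start - 1 + length])
        else:
            chars.append(tok)
    return "".join(chars)


print(compress("abacabaxabz"))   # ['a', 'b', (1, 1), 'c', (1, 3), 'x', (1, 2), 'z']
print(compress("ab" * 16))       # ['a', 'b', (1, 2), (1, 4), (1, 8), (1, 16)]
```

Because a copy must lie entirely inside the text already seen, the slice `chars[start - 1:start - 1 + length]` is always fully available when a pair is decoded.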
Computing (s_i,l_i) • The algorithm does not request (s_i,l_i) for any position i already in the compressed part of S. • For total O(m) time, find each requested pair (s_i,l_i) in O(l_i) time; this suffices because the step "if l_i > 0 then output (s_i,l_i); i := i + l_i" advances i by l_i.
Implementation using suffix trees (1). Before compression: • Build a suffix tree T for S. • For each node v, compute c_v: the smallest leaf index in v's subtree, which is also the starting position of the leftmost copy of the substring that labels the path from the root to v. • O(m) time.
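Implementation using suffix trees (2). Computing (s_i,l_i): [Figure: path from the root matching a prefix of S[i...m] down to leaf i; point p on the path has path-label a, and v is the first node at or below p. The copy of a starting at c_v lies inside S[1...i-1] exactly when |a| + c_v ≤ i.]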
Implementation using suffix trees (3) • To compute (s_i,l_i), traverse the unique path in T that matches a prefix of S[i...m]. • Let p be the current point and v the first node at or below p. • Traverse as long as string_depth(p) + c_v ≤ i. • At the last point p of the traversal: l_i = string_depth(p), s_i = c_v. • O(l_i) time.
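The traversal rule string_depth(p) + c_v ≤ i can be tried out with a small Python sketch. As a simplifying assumption it uses a naive suffix trie (one node per character, quadratic size) instead of a suffix tree, so every point p is itself a trie node and its stored value plays the role of c_v. The names `build_trie_with_c` and `prior` are hypothetical, and `prior` returns (0, 0) when l_i = 0, where s_i is undefined.

```python
def build_trie_with_c(s):
    """Naive suffix trie; node["c"] is the smallest start (1-based) of any
    suffix passing through the node, i.e. the c_v value of the slides."""
    root = {"children": {}, "c": 1}
    for start in range(len(s)):          # suffixes in increasing start order,
        node = root                      # so the creator of a node has the
        for ch in s[start:]:             # smallest start among its suffixes
            if ch not in node["children"]:
                node["children"][ch] = {"children": {}, "c": start + 1}
            node = node["children"][ch]
    return root


def prior(root, s, i):
    """Return (s_i, l_i) for 1-based position i, or (0, 0) if l_i = 0."""
    node, depth, start = root, 0, 0
    while i - 1 + depth < len(s):
        # The child always exists because S[i...m] is a suffix in the trie.
        child = node["children"][s[i - 1 + depth]]
        # Stop once the leftmost copy no longer fits inside S[1...i-1].
        if (depth + 1) + child["c"] > i:
            break
        node, depth, start = child, depth + 1, child["c"]
    return (start, depth)


s = "abababab"
root = build_trie_with_c(s)
print([prior(root, s, i) for i in (1, 2, 3, 5)])   # [(0, 0), (0, 0), (1, 2), (1, 4)]
```

Stopping at the first failure is safe because, along a root-to-leaf path, both the depth and c_v can only grow, so the condition never becomes true again once it fails.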
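Example: S = abababab (positions 1 to 8). i=1: l_1=0, output a. i=2: l_2=0, output b. i=3: l_3=2, c_v=1, output (1,2). i=5: l_5=4, c_v=1, output (1,4). [Figure: suffix tree for abababab$ with leaves numbered 1 to 8 and the c_v value marked at each node; nodes on the path starting with a have c_v=1, nodes on the path starting with b have c_v=2.]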
Online version • Compress S as it is being input, one character at a time. • Possible since S[1...i-1] is known before computing (s_i,l_i). • Implementation: build the suffix tree online with Ukkonen's algorithm: in phase i, build the implicit suffix tree T_i for the prefix S[1...i].
Claim 1. Assume: • The compaction has been done for S[1...i-1]. • The implicit suffix tree T_{i-1} for S[1...i-1] has been built. • c_v values are given for each node v in T_{i-1}. Then (s_i,l_i) can be obtained in O(l_i) time.
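Suppose we had a suffix tree for S[1...i-1] with c_v values. We could find (s_i,l_i) in O(l_i) time: l_i = string_depth(p), s_i = c_v. [Figure: path from the root spelling S(i) S(i+1) ... S(k-1) down to point p, where the next character c on the edge differs from S(k); v is the first node at or below p.]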
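The missing leaves in the implicit suffix tree are not needed. [Figure: the suffix tree and the implicit suffix tree for S[1...i-1] side by side, with the same path S(i) ... S(k-1) to point p; a leaf j that is missing from the implicit tree lies on the path to a leaf h with h < j, so the c_v values are unchanged.]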
Claim 2. c_v values for all implicit suffix trees can be computed in total O(m) time. In Ukkonen's algorithm: • Only extension rule 2 updates c_v values. • Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w. • Whenever a new leaf j is created: c_j ← j. • Constant update time per new node.
Updating c_v values. [Figure: the two cases of extension rule 2. New leaf and new node: leaf j hangs from a new internal node v that splits edge (u,w). New leaf only: leaf j hangs from an existing node.]
Online algorithm • Base case: output S(1) and build T_1. • General case: suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed. • Match S(i),S(i+1),... along a path from the root in T_{i-1}. • Let S(k) be the first character that doesn't match. • Find (s_i,l_i). • If l_i = 0, output S(i) and build T_i with c_v values. • If l_i > 0, output (s_i,l_i) and build T_i,...,T_{k-1} with c_v values. • Total time: O(m).
Maximal Pair • A maximal pair in string S: a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b. • Extending a and b in either direction would destroy the equality of the two strings. • Example: S = xabcyiiizabcqabcyrxar; the occurrences of abc at positions 2 and 10 form a maximal pair (left characters x and z, right characters y and q).
Maximal Pair (continued) • Overlap is allowed: S = cxxaxxaxxbcxxaxxaaxxaxxb. • To allow a prefix or suffix of S to be part of a maximal pair, replace S by #S$ (#,$ don't appear in S). Example: #abcxabc$.
Maximal Repeat • A maximal repeat in string S: a substring of S that occurs in a maximal pair in S. • Example: S = xabcyiiizabcqabcyrxar; maximal repeats: abc, abcy, ...
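The two definitions can be checked on small inputs with a brute-force Python sketch. It follows the definitions literally (with the #S$ sentinels) and makes no attempt at linear time; the name `maximal_pairs_bruteforce` and the (i, j, length) output format with 1-based positions are assumptions made for this illustration.

```python
def maximal_pairs_bruteforce(s):
    """All maximal pairs of s as (i, j, length), 1-based starts, i < j.

    Assumes '#' and '$' do not occur in s.
    """
    t = "#" + s + "$"            # t[k] is S(k) for k = 1..len(s)
    n = len(t)
    pairs = []
    for i in range(1, n - 1):
        for j in range(i + 1, n - 1):
            if t[i - 1] == t[j - 1]:
                continue         # same left character: extendable to the left
            length = 0
            # Extend to the right as far as the two copies agree.
            while j + length < n - 1 and t[i + length] == t[j + length]:
                length += 1
            if length > 0:
                pairs.append((i, j, length))
    return pairs


def maximal_repeats_bruteforce(s):
    """Substrings of s that occur in at least one maximal pair."""
    return {s[i - 1:i - 1 + length] for i, _, length in maximal_pairs_bruteforce(s)}


print(maximal_pairs_bruteforce("abcxabc"))     # [(1, 5, 3)]
print(maximal_repeats_bruteforce("xabxa"))     # {'xa'}
```

For a fixed pair of start positions with different left characters there is exactly one maximal length, which is why the inner loop simply extends until the first mismatch.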
Finding All Maximal Repeats In Linear Time • Given: string S of length n. • Goal: find all maximal repeats in O(n) time. • Lemma: Let T be a suffix tree for S. If string a is a maximal repeat in S, then a is the path-label of an internal node v in T.
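Proof – by the definition of maximal repeat: two occurrences of a are followed by different characters, so the path labeled a must end at a branching (internal) node. [Figure: in the suffix tree for S = xabcyiiizabcqabcyrxar, the path abc from the root ends at node v, whose outgoing edges begin with y and q.]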
Conclusion • There can be at most n maximal repeats in any string of length n. • Proof: by the lemma, since T has at most n internal nodes.
Which internal nodes correspond to maximal repeats? • The left character of leaf i in T is S(i-1). • Node v of T is left diverse if at least 2 leaves in v’s subtree have different left characters. • A leaf can’t be left diverse. • Left diversity propagates upward.
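Example: S = #xabxa$ (positions 1 to 6 for xabxa$). [Figure: suffix tree with leaves 1 to 6, each annotated with its left character; the node with path-label xa is left diverse (left characters # and b) and corresponds to the maximal repeat xa.]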
Theorem. The string a labeling the path to an internal node v of T is a maximal repeat ⇔ v is left diverse.
Proof of ⇒ • Suppose a is a maximal repeat. • It participates in a maximal pair. • It has at least two occurrences with distinct left characters: xa, ya, x ≠ y. • Let i and j be the two starting positions of a. Then leaves i and j are in v's subtree and have different left characters x, y. • v is left diverse.
Proof of ⇐ • Suppose v is left diverse ⇒ there are substrings xap and yaq in S, x ≠ y. • If p ≠ q ⇒ a's occurrences in xap and yaq form a maximal pair ⇒ a is a maximal repeat. • If p = q ⇒ since v is a branching node, there is a substring zar in S, r ≠ p. If z ≠ x ⇒ it forms a maximal pair with xap. If z ≠ y ⇒ it forms a maximal pair with yap. Since x ≠ y, at least one of these holds; in either case, a is a maximal repeat.
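Proof of ⇐ (continued). [Figure: node v with path-label a. Case 1: edges p... and q... out of v, leading to leaves with left characters x and y. Case 2: edges r... and p... out of v, leading to leaves with left characters z, x and y.]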
Compact Representation • Node v in T is a frontier node if v is left diverse and none of v's children are left diverse. • Each node at or above the frontier is left diverse. • The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S. • The representation takes O(n) space, though the total length of the maximal repeats may be larger.
Linear time algorithm • Build suffix tree T. • Find all left diverse nodes in linear time. • Delete all nodes that aren't left diverse, to obtain the compact representation.
Finding all left diverse nodes in linear time • Traverse T bottom-up, recording for each node either that it is left diverse, or the left character common to all leaves in its subtree. • For each leaf: record its left character. • For each internal node v: if any child is left diverse ⇒ v is left diverse; else if all children have a common character x ⇒ record x for v; else record that v is left diverse.
Finding All Maximal Pairs In Linear Time • Not every two occurrences of a maximal repeat form a maximal pair. Example: S = xabcyiiizabcqabcyrxar; the occurrences of abc at positions 2 and 14 are both followed by y, so they do not form a maximal pair. • There can be more than O(n) maximal pairs. • The algorithm is O(n+k), where k is the number of maximal pairs.
General Idea. For each node u and character x: keep all leaf numbers below u whose left character is x. To find all maximal pairs of a: for each character x, form the cartesian product of the list for x at v1 with every list for a character y ≠ x at v2. [Figure: node v with path-label a and children v1 (edge p...) and v2 (edge q...); leaf i below v1 has left character x, leaf j below v2 has left character y.]
The Algorithm • Build suffix tree T for S. • Record the left character of each leaf. • Traverse T bottom-up. • At each node v with path-label a: • Output all maximal pairs of a: the cartesian product of lists (u,x) and (u',x') for each pair of children u ≠ u' and pair of characters x ≠ x'. • Create the lists for node v by linking the lists of v's children.
Time Analysis • Suffix tree construction: O(n). • Bottom-up traversal including all list-linking: O(n). • All cartesian product operations: O(k), where k is the number of maximal pairs. • Total: O(n+k).
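The list-based algorithm can be sketched in Python under simplifying assumptions. The sketch uses a naive suffix trie instead of a suffix tree and merges lists by copying rather than by constant-time linking, so it does not achieve the O(n+k) bound; it only illustrates how pairs are produced from lists of different children with different left characters. The names `Node` and `maximal_pairs` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.start = None   # leaves only: 1-based start of the suffix
        self.left = None    # leaves only: left character


def maximal_pairs(s):
    """Maximal pairs of s as sorted (i, j, length) with i < j.

    Assumes '#' and '$' do not occur in s.
    """
    t = s + "$"
    root = Node()
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start + 1
        node.left = t[start - 1] if start > 0 else "#"

    pairs = []

    def visit(node, depth):
        if not node.children:
            return {node.left: [node.start]}
        merged = {}                      # left character -> leaf numbers
        for child in node.children.values():
            lists = visit(child, depth + 1)
            if depth > 0:                # the root has an empty path-label
                for x, starts in lists.items():
                    for y, others in merged.items():
                        if x != y:       # different left characters
                            pairs.extend((min(i, j), max(i, j), depth)
                                         for i in starts for j in others)
            for x, starts in lists.items():
                merged.setdefault(x, []).extend(starts)
        return merged

    visit(root, 0)
    return sorted(pairs)


print(maximal_pairs("abcxabc"))   # [(1, 5, 3)]
```

Leaves taken from different children of the same node are followed by different characters, and the x != y test enforces different left characters, so every emitted triple is a maximal pair of the node's path-label.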
Finding All Supermaximal Repeats In Linear Time • Supermaximal repeat: a maximal repeat that isn't a substring of any other maximal repeat. • Example: S = xabcyiiizabcqabcyrxar; abcy is supermaximal, abc isn't. • Theorem: A left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff • all of v's children are leaves • and each has a distinct left character.
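The theorem's test can be sketched in Python, again on a naive suffix trie as a stand-in for the suffix tree (an assumption made for brevity). In the trie, a suffix-tree child that is a leaf appears as a chain of single-child nodes ending in a leaf, which is what the hypothetical helper `leaf_below` checks. The names `Node`, `leaf_below` and `supermaximal_repeats` are illustrative only.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.left = None    # left character, set on leaves only


def leaf_below(node):
    """Follow a non-branching chain; return the leaf it ends in, or None
    if the chain reaches a branching node (child is not a leaf)."""
    while len(node.children) == 1:
        node = next(iter(node.children.values()))
    return node if not node.children else None


def supermaximal_repeats(s):
    """Assumes '#' and '$' do not occur in s."""
    t = s + "$"
    root = Node()
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.children.setdefault(ch, Node())
        node.left = t[start - 1] if start > 0 else "#"

    found = set()
    stack = [(root, "")]
    while stack:                              # iterative traversal
        node, label = stack.pop()
        for ch, child in node.children.items():
            stack.append((child, label + ch))
        if label and len(node.children) >= 2:     # internal, non-root node
            leaves = [leaf_below(c) for c in node.children.values()]
            if all(leaf is not None for leaf in leaves):
                lefts = [leaf.left for leaf in leaves]
                if len(set(lefts)) == len(lefts):  # all left chars distinct
                    found.add(label)
    return found


result = supermaximal_repeats("xabcyiiizabcqabcyrxar")
print("abcy" in result, "abc" in result)   # True False
```

A node with at least two leaf children that all have distinct left characters is automatically left diverse, so the sketch does not test left diversity separately.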