On-Line Construction

On-Line Construction of Suffix Trees. E. Ukkonen. Overview: suffix tries; on-line construction of suffix tries in quadratic time; suffix trees; on-line construction of suffix trees in linear time; applications.


Presentation Transcript


  1. On-Line Construction of Suffix Trees E. Ukkonen

  2. Overview • Suffix tries • On-line construction of suffix tries in quadratic time • Suffix trees • On-line construction of suffix trees in linear time • Applications

  3. Suffix Trees A suffix tree is a trie-like data structure representing all suffixes of a string. [Figure: example suffix tree, labelled "goo".]

  4. Notations • Let T = t_1…t_n be a string over an alphabet Σ. • For 0 ≤ i ≤ n, let T^i = t_1…t_i denote the length-i prefix of T. • For 1 ≤ i ≤ n + 1, let T_i = t_i…t_n denote the suffix of T that starts at the i-th position (so T_{n+1} = ε, the empty string). • Let σ(T) = {T_i | 1 ≤ i ≤ n + 1} be the set of all suffixes of T.
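
As a small illustration of this notation, the Python sketch below lists the prefixes T^0, …, T^n and the suffixes T_1, …, T_{n+1} of a string. The function names are illustrative and not part of the slides; note that Python slices are 0-based while the slides index from 1.

```python
def prefixes(T):
    """Return [T^0, T^1, ..., T^n], where T^i is the length-i prefix of T."""
    return [T[:i] for i in range(len(T) + 1)]


def suffixes(T):
    """Return [T_1, ..., T_{n+1}], where T_i starts at the i-th (1-based) position."""
    return [T[i - 1:] for i in range(1, len(T) + 2)]


print(prefixes("cacao"))
print(suffixes("cacao"))
```

The last suffix in the list is the empty string, which is why σ(T) has n + 1 elements.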

  5. Suffix Tries The suffix trie of T, denoted by STrie(T), is a trie representing σ(T).

  6. Suffix Tries (cont.) Definition: STrie(T) is an augmented DFA, STrie(T) = (Q ∪ {⊥}, root, F, g, f), where: • Q = {x | x is a substring of T} is the set of states of the DFA. • ⊥ is an auxiliary state. • root is the initial state, corresponding to the empty string ε. • F = σ(T) is the set of final states.

  7. Suffix Tries (cont.) • g : (Q ∪ {⊥}) × Σ → Q (a partial function) is the transition function, defined as follows: • g(x, a) = y for all x, y ∈ Q and a ∈ Σ s.t. y = xa. • g(⊥, a) = root for all a ∈ Σ. • f : Q → Q ∪ {⊥} is the suffix function, defined as follows: • f(x) = y for all x, y ∈ Q, x ≠ root, s.t. x = ay for some a ∈ Σ. • f(root) = ⊥.

  8. An Example: STrie(cacao) [Figure: the suffix trie of cacao, with the auxiliary state ⊥, the root ε, and the states c, a, o, ca, ac, ao, cac, aca, cao, caca, acao, cacao.]

  9. The Size of Suffix Tries Theorem: The size of STrie(T), where |T| = n, is O(n^2). Proof: The size of STrie(T) is linear in the number of substrings of T. T has O(n^2) substrings. Thus the size of STrie(T) is O(n^2).
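
To make the bound concrete, the sketch below counts the states in Q by brute force, one per distinct substring (the empty string being the root). The helper names are hypothetical, and this enumeration is only an illustration of the definition, not the construction used in the slides.

```python
def substrings(T):
    """All distinct substrings of T, including the empty string (the root)."""
    n = len(T)
    return {T[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def strie_state_count(T):
    """Number of states in Q, i.e. one state per distinct substring of T."""
    return len(substrings(T))


print(strie_state_count("cacao"))
print(strie_state_count("abcd"))
```

A string of n distinct characters has n(n + 1)/2 non-empty substrings, so the quadratic bound is reached; a string such as aaaa has only n + 1 states.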

  10. On-Line Construction of Suffix Tries • Let T = t_1…t_n. • ∀ 1 ≤ i ≤ n, the algorithm constructs STrie(T^i). • First we construct STrie(T^0) = STrie(ε). • Then, ∀ 1 ≤ i ≤ n, we obtain STrie(T^i) from STrie(T^{i-1}).

  11. On-Line Construction of Suffix Tries (cont.) Observation 1: σ(T^i) = {x t_i | x ∈ σ(T^{i-1})} ∪ {ε}. Observation 2: The suffixes of T^i can be found by starting at the state T^i and following the suffix links until ⊥. Thus, σ(T^i) = {f^j(T^i) | 0 ≤ j ≤ i}. Definition: The path from T^i to ⊥ following the suffix links is called the boundary path of STrie(T^i).

  12. On-Line Construction of Suffix Tries (cont.) [Figure: STrie(cacao), with the states ε, c, a, o, ca, ac, ao, cac, aca, cao, caca, acao, cacao.]

  13. STrie(T^{i-1}) → STrie(T^i) [Figure: the step from STrie(cac) to STrie(caca).]

  14. The Algorithm
      create STrie(ε)
      top ← ε
      for i ← 1 to n do
          r ← top
          while g(r, t_i) is undefined do
              create new state r' and g(r, t_i) ← r'
              if r ≠ top then f(old-r') ← r'
              old-r' ← r'
              r ← f(r)
          f(old-r') ← g(r, t_i)
          top ← g(top, t_i)
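
The pseudocode on this slide can be sketched in Python roughly as follows. The class and function names are assumptions made for the example, and the auxiliary state's transitions g(⊥, a) = root are added lazily, one character at a time, rather than for the whole alphabet up front.

```python
class State:
    def __init__(self):
        self.g = {}     # transition function: character -> State
        self.f = None   # suffix link


def build_strie(T):
    """Build the suffix trie of T on-line, one character at a time."""
    bottom, root = State(), State()   # bottom plays the role of the auxiliary state
    root.f = bottom
    states = [root]                   # every state in Q, in creation order
    top = root                        # state of the whole prefix read so far
    for t in T:
        bottom.g[t] = root            # g(bottom, a) = root, added when a first appears
        r, old_new = top, None
        while t not in r.g:           # walk the boundary path until the endpoint
            new = State()
            r.g[t] = new
            states.append(new)
            if old_new is not None:   # same effect as the slide's "r != top" test
                old_new.f = new
            old_new = new
            r = r.f
        if old_new is not None:
            old_new.f = r.g[t]
        top = top.g[t]
    return root, bottom, top, states


def accepts(root, w):
    """True if w spells a path from the root, i.e. w is a substring of T."""
    s = root
    for c in w:
        if c not in s.g:
            return False
        s = s.g[c]
    return True
```

For the input cacao this creates one state per distinct substring, and following the suffix links from `top` visits the states of all suffixes before reaching the auxiliary state.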

  15. The Algorithm (cont.) [Figure: step-by-step run of the algorithm; only scattered edge labels (a, c, o) survive.]

  16. Running Time Theorem: The running time of the algorithm is linear in the size of STrie(T), which is, in the worst case, O(|T|^2).

  17. Running Time (cont.)
      create STrie(ε)
      top ← ε
      for i ← 1 to n do
          r ← top
          while g(r, t_i) is undefined do
              create new state r' and g(r, t_i) ← r'
              if r ≠ top then f(old-r') ← r'
              old-r' ← r'
              r ← f(r)
          f(old-r') ← g(r, t_i)
          top ← g(top, t_i)
      The loop body costs O(1) for each node added to STrie(T).

  18. Suffix Trees • A suffix tree STree(T) represents STrie(T) in space linear in |T|. • This is achieved by representing only a subset Q' ∪ {⊥} of Q ∪ {⊥}, called the explicit states.

  19. Explicit and Implicit States Definition: A state q is called explicit in the following cases: • q is a leaf. • q is a branching state (has at least two transitions). • root and ⊥ are also defined to be branching states. Otherwise (if q has exactly one transition and is not the root or ⊥), q is called implicit.
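
The classification can be checked by brute force on small inputs. The sketch below treats each distinct substring as a trie state and counts its outgoing transitions; the function name is hypothetical and the auxiliary state ⊥ is left out.

```python
def classify_states(T):
    """Split the suffix-trie states of T into leaves, branching and implicit states."""
    n = len(T)
    subs = {T[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    alphabet = set(T)
    leaves, branching, implicit = set(), set(), set()
    for x in subs:
        # number of characters a such that xa is also a substring of T
        out = sum(1 for a in alphabet if x + a in subs)
        if x == "" or out >= 2:       # the root counts as branching by definition
            branching.add(x)
        elif out == 0:
            leaves.add(x)
        else:                         # exactly one transition
            implicit.add(x)
    return leaves, branching, implicit


leaves, branching, implicit = classify_states("cacao")
print(sorted(leaves), sorted(branching), sorted(implicit))
```

For cacao, 8 of the 13 states are explicit, and only those are kept in the suffix tree.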

  20. Explicit and Implicit States (cont.) [Figure: STrie(cacao), illustrating explicit and implicit states.]

  21. Generalized Transition Function • The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as a generalized transition g'(s, w) = r. • A generalized transition g'(s, w) = r is called an a-transition if w = av for some a ∈ Σ and v ∈ Σ*. • Note that for each explicit state s and a ∈ Σ there is at most one a-transition from s.

  22. STrie(T) → STree(T) [Figure: STrie(cacao).]

  23. STrie(T) → STree(T) [Figure: STrie(cacao), continued.]

  24. STrie(T) → STree(T) [Figure: the resulting STree(cacao), with generalized transitions labelled o, a, ca, o, cao, o, cao.]

  25. Suffix Links Definition: If x ∈ Q' is a branching state and x = ay, where a ∈ Σ, then the suffix link of x is defined by f'(x) = y, and f'(root) = ⊥. Proposition: If x ∈ Q' is a branching state and f'(x) = y, then y is also a branching state. Proof: ∃ a ≠ b s.t. xa and xb are substrings of T. y is a suffix of x. Thus ya and yb are also substrings of T.

  26. STree(T) STree(T) = (Q' ∪ {⊥}, root, g', f'). [Figure: STree(cacao), with generalized transitions labelled o, a, ca, o, cao, o, cao.]

  27. The Size of Suffix Trees Theorem: The size of STree(T), where |T| = n, is O(n). Proof: Since we represent each substring w = t_k…t_p of T by a pair of pointers (k, p), the size of STree(T) is linear in the number of explicit states. STree(T) has at most n leaves, and thus at most n − 1 branching states. Therefore, the size of STree(T) is O(n).

  28. Reference Pairs Definition: Let r be an explicit or implicit state. (s, w) is called a reference pair for r if: • s is an explicit state and an ancestor of r. • w is the string spelled out by the transitions from s to r in the corresponding suffix trie. Definition: A reference pair (s, w) for r is called canonical if s is the closest explicit ancestor of r (or r itself, if it is explicit).

  29. Active Point and Endpoint Let s_1 = T^{i-1}, s_2, …, s_i = root, s_{i+1} = ⊥ be the boundary path of STrie(T^{i-1}). Definition: s_j is called the active point of STrie(T^{i-1}) if j is the smallest index for which s_j is not a leaf. Definition: s_{j'} is called the endpoint of STrie(T^{i-1}) if j' is the smallest index for which g(s_{j'}, t_i) is defined.

  30. Active Point and Endpoint (cont.) [Figure: a suffix trie with the active point and the endpoint marked.]

  31. Active Point and Endpoint (cont.) Proposition: s_j and s_{j'} are well defined and j ≤ j'. Proof: • root is not a leaf ⇒ s_j is defined. • g(⊥, t_i) is defined ⇒ s_{j'} is defined. • g(s_{j'}, t_i) is defined ⇒ s_{j'} is not a leaf ⇒ j ≤ j'.

  32. Adding t_i-Transitions to STrie(T^{i-1}) Lemma: When obtaining STrie(T^i) from STrie(T^{i-1}), the algorithm adds a t_i-transition to each state s_h s.t. 1 ≤ h < j', and only to these states, as follows: • For 1 ≤ h < j, the new transition expands an old branch of the trie that ends at s_h. • For j ≤ h < j', the new transition initiates a new branch from s_h.

  33. Adding t_i-Transitions to STrie(T^{i-1}) (cont.) [Figure: a suffix trie with the endpoint and the active point marked.]

  34. On-Line Construction of Suffix Trees • We create STree(ε), and then ∀ 1 ≤ i ≤ n we obtain STree(T^i) from STree(T^{i-1}). • When obtaining STree(T^i) from STree(T^{i-1}), we update STree(T^{i-1}) according to the transitions we would add to STrie(T^{i-1}). • Note that s_1, …, s_{i-1} are not necessarily explicit states.

  35. On-Line Construction of Suffix Trees (cont.) For 1 ≤ h < j: • s_h is a leaf. Thus, ∃ s, 0 ≤ k ≤ i − 1 s.t. g'(s, (k, i − 1)) = s_h. We replace this transition by g'(s, (k, i)) = s_h. • This would take too much time. Thus, we denote transitions of the type g'(s, (k, i − 1)) in STree(T^{i-1}) by g'(s, (k, ∞)). Hence, no updates are needed.

  36. On-Line Construction of Suffix Trees (cont.) For j ≤ h < j': • If s_h is an implicit state, we turn it into an explicit state by splitting the transition containing it. • We create a new leaf s_h t_i and add a new transition g'(s_h, (i, ∞)) = s_h t_i.

  37. On-Line Construction of Suffix Trees (cont.) [Figure: step-by-step construction of STree(cacao), with the active point (AP) and the endpoint (EP) marked at each step.]

  38. Lemma 1 Lemma 1: Let (s, (k, p)) be some reference pair for a state r. Then ∃ s', k' s.t. (s', (k', p)) is the canonical reference pair for r. Proof: Let s' be the closest explicit ancestor of r, or r itself if r is explicit. t_k…t_p is the path from the explicit state s to r. Thus, the path from s' to r is a suffix t_{k'}…t_p of t_k…t_p.

  39. Lemma 2 Lemma 2: Let r be a state on the boundary path of STrie(T^i). Then ∃ s, k s.t. (s, (k, i)) is the canonical reference pair for r. Proof: r is on the boundary path of STrie(T^i) ⇒ r refers to some suffix t_{k'}…t_i of T^i ⇒ (root, (k', i)) is a reference pair for r ⇒ the claim holds by Lemma 1.

  40. Lemma 3 Lemma 3: Let (s, (k, i − 1)) be a reference pair for the endpoint of STrie(T^{i-1}). Then (s, (k, i)) is a reference pair for the active point of STrie(T^i). Proof: • s_j is the active point of STrie(T^{i-1}) iff t_j…t_{i-1} is the longest suffix of T^{i-1} that occurs at least twice in T^{i-1}.

  41. Lemma 3 (cont.) Proof (cont.): • s_{j'} is the endpoint of STrie(T^{i-1}) iff t_{j'}…t_{i-1} is the longest suffix of T^{i-1} such that t_{j'}…t_{i-1}t_i is a substring of T^{i-1}. • Thus, if s_{j'} is the endpoint of STrie(T^{i-1}), then t_{j'}…t_{i-1}t_i is the longest suffix of T^i that occurs at least twice in T^i. Therefore, s_{j'}t_i is the active point of STrie(T^i).
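
The two characterisations used in this proof can be tried out directly on strings. The sketch below is a brute-force check under the assumption that overlapping occurrences are counted; the function names are hypothetical, and `None` stands in for the auxiliary state ⊥.

```python
def occurrences(T, w):
    """Number of (possibly overlapping) positions at which w occurs in T."""
    return sum(1 for k in range(len(T) - len(w) + 1) if T.startswith(w, k))


def active_point(T):
    """Longest suffix of T that occurs at least twice in T."""
    for k in range(len(T) + 1):
        if occurrences(T, T[k:]) >= 2:
            return T[k:]
    return ""   # reached only for T == "": the root is never treated as a leaf


def endpoint(T, t):
    """Longest suffix w of T such that w + t is a substring of T.

    Returns None when t does not occur in T at all, in which case the
    endpoint is the auxiliary state.
    """
    for k in range(len(T) + 1):
        if T[k:] + t in T:
            return T[k:]
    return None


print(active_point("cac"), endpoint("cac", "a"), active_point("caca"))
```

In the cac → caca step, the endpoint c extended by t_i = a gives ca, which is the active point of the next trie, as Lemma 3 states.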

  42. The Algorithm
      create STree(ε)
      s ← root
      k ← 1
      for i ← 1 to n do
          (s, k) ← update(s, (k, i))
          (s, k) ← canonize(s, (k, i))
      update transforms STree(T^{i-1}) into STree(T^i). Input: (s, (k, i)) s.t. (s, (k, i − 1)) is a reference pair for the active point of STrie(T^{i-1}). Output: (s', k') s.t. (s', (k', i − 1)) is a reference pair for the endpoint of STrie(T^{i-1}).
      canonize: Input: a reference pair (s, (k, p)) for some state r. Output: (s', k') s.t. (s', (k', p)) is the canonical reference pair for r.

  43. update(s, (k, i))
          old-r ← root
          (endpoint, r) ← test-and-split(s, (k, i − 1), t_i)
          while not endpoint do
              create new state r'; g'(r, (i, ∞)) ← r'
              if old-r ≠ root then f'(old-r) ← r
              old-r ← r
              (s, k) ← canonize(f'(s), (k, i − 1))
              (endpoint, r) ← test-and-split(s, (k, i − 1), t_i)
          if old-r ≠ root then f'(old-r) ← s
          return (s, k)
      test-and-split: Input: the canonical reference pair for some state r, and t_i. Output: true/false according to whether r is the endpoint, and the explicit state r (creating it if needed).

  44. update [Figure: trace of update on cacao for i = 1, …, 5, showing the values of s (root or ⊥) and k at each step, and the transitions (1, ∞), (2, ∞), (3, ∞), (5, ∞), (1, 2), (2, 2).]

  45. test-and-split(s, (k, p), t)
          if k ≤ p then
              find the t_k-transition g'(s, (k', p')) = s' from s
              if t = t_{k'+p−k+1} then return (true, s)
              else
                  create a new state r
                  replace g'(s, (k', p')) = s' by g'(s, (k', k'+p−k)) = r and g'(r, (k'+p−k+1, p')) = s'
                  return (false, r)
          else
              if ∄ t-transition from s then return (false, s)
              else return (true, s)

  46. canonize(s, (k, p))
          if p < k then return (s, k)
          else
              find the t_k-transition g'(s, (k', p')) = s' from s
              while p' − k' ≤ p − k do
                  k ← k + p' − k' + 1
                  s ← s'
                  if k ≤ p then find the t_k-transition g'(s, (k', p')) = s' from s
              return (s, k)
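
Putting update, test-and-split and canonize together, a Python sketch of the whole construction might look as follows. It keeps the slides' 1-based indices by padding the text, stores each generalized transition as a tuple (k, p, child), and special-cases the auxiliary state instead of storing its transitions; `Node`, `build_stree`, `contains` and `count_states` are names chosen for this example, not part of the slides.

```python
INF = float("inf")   # open-ended right pointer of leaf transitions


class Node:
    def __init__(self):
        self.edges = {}    # first character -> (k, p, child); the label is t[k..p]
        self.link = None   # suffix link f'


def build_stree(T):
    t = " " + T                       # pad so that t[1..n] matches the slides
    bottom, root = Node(), Node()
    root.link = bottom

    def find(s, c):
        # from the auxiliary state, every character leads to root in one step
        return (0, 0, root) if s is bottom else s.edges[c]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k2, p2, s2 = find(s, t[k])
        while p2 - k2 <= p - k:
            k += p2 - k2 + 1
            s = s2
            if k <= p:
                k2, p2, s2 = find(s, t[k])
        return s, k

    def test_and_split(s, k, p, c):
        if k <= p:
            k2, p2, s2 = find(s, t[k])
            if c == t[k2 + p - k + 1]:
                return True, s
            r = Node()                # make the implicit state explicit
            s.edges[t[k2]] = (k2, k2 + p - k, r)
            r.edges[t[k2 + p - k + 1]] = (k2 + p - k + 1, p2, s2)
            return False, r
        if s is bottom or c in s.edges:
            return True, s
        return False, s

    def update(s, k, i):
        old_r = root
        end, r = test_and_split(s, k, i - 1, t[i])
        while not end:
            r.edges[t[i]] = (i, INF, Node())   # new leaf with an open transition
            if old_r is not root:
                old_r.link = r
            old_r = r
            s, k = canonize(s.link, k, i - 1)
            end, r = test_and_split(s, k, i - 1, t[i])
        if old_r is not root:
            old_r.link = s
        return s, k

    s, k = root, 1
    for i in range(1, len(T) + 1):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root


def contains(root, T, P):
    """True if P is a substring of T, found by walking down from the root."""
    t, n, node, j = " " + T, len(T), root, 0
    while j < len(P):
        if P[j] not in node.edges:
            return False
        k, p, node = node.edges[P[j]]
        label = t[k:min(p, n) + 1]
        m = min(len(label), len(P) - j)
        if label[:m] != P[j:j + m]:
            return False
        j += m
    return True


def count_states(node):
    """Number of explicit states below and including node (⊥ excluded)."""
    return 1 + sum(count_states(child) for _, _, child in node.edges.values())
```

Without an end marker, suffixes that also occur elsewhere in T end at implicit states, so `contains` answers substring queries rather than suffix queries. For cacao the sketch yields the root, two further branching states and five leaves.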

  47. Running Time Theorem: The running time of the algorithm is O(n). Proof: We divide the running time into two components: • The total time of the procedure canonize. • The rest.

  48. update (called n times)
          old-r ← root
          (endpoint, r) ← test-and-split(s, (k, i − 1), t_i)
          while not endpoint do
              create new state r'; g'(r, (i, ∞)) ← r'
              if old-r ≠ root then f'(old-r) ← r
              old-r ← r
              (s, k) ← canonize(f'(s), (k, i − 1))
              (endpoint, r) ← test-and-split(s, (k, i − 1), t_i)
          if old-r ≠ root then f'(old-r) ← s
          return (s, k)
      In each execution of the loop, a new state is created; apart from the call to canonize, each execution takes O(1) time.

  49. canonize (called O(n) times)
          if p < k then return (s, k)
          else
              find the t_k-transition g'(s, (k', p')) = s' from s
              while p' − k' ≤ p − k do
                  k ← k + p' − k' + 1
                  s ← s'
                  if k ≤ p then find the t_k-transition g'(s, (k', p')) = s' from s
              return (s, k)
      In each execution of the loop, the value of k increases. Since k never decreases and never exceeds n + 1, the loop is executed O(n) times in total.

  50. Applications - Exact String Matching Input: two strings: a text T and a pattern P. Output: all the occurrences of P in T. This problem can be solved in O(|T|+|P|) time (Boyer-Moore, Knuth-Morris-Pratt).
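
For reference, here is a minimal sketch of the Knuth-Morris-Pratt method mentioned on this slide, which meets the O(|T| + |P|) bound without any suffix structure. The function name and the 0-based output positions are choices made for the example.

```python
def kmp_search(T, P):
    """Return the 0-based start positions of all occurrences of P in T."""
    if not P:
        return list(range(len(T) + 1))
    # fail[i] = length of the longest proper border of P[:i + 1]
    fail = [0] * len(P)
    j = 0
    for i in range(1, len(P)):
        while j and P[i] != P[j]:
            j = fail[j - 1]
        if P[i] == P[j]:
            j += 1
        fail[i] = j
    hits, j = [], 0
    for i, c in enumerate(T):
        while j and c != P[j]:
            j = fail[j - 1]           # fall back without re-reading the text
        if c == P[j]:
            j += 1
        if j == len(P):
            hits.append(i - j + 1)
            j = fail[j - 1]           # keep going to find overlapping matches
    return hits


print(kmp_search("cacao", "ca"))
```

Each text character is read once and the pattern pointer only falls back as far as it has advanced, which gives the linear bound.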
