Bidirectional Online Construction of Affix Tree

Bidirectional Online Construction of Affix Tree. B89209016 張嘉真, B89902103 林虹佑, B89902106 高偉鈞. Source: Moritz G. Maaß. Linear Bidirectional On-Line Construction of Affix Trees. Algorithmica. Online publication May 28, 2003.


Presentation Transcript


  1. Bidirectional Online Construction of Affix Tree B89209016 張嘉真 B89902103 林虹佑 B89902106 高偉鈞

  2. Source • Moritz G. Maaß. Linear Bidirectional On-Line Construction of Affix Trees. Algorithmica. • Online publication May 28, 2003

  3. Outline • Introduction to Affix Tree • Online construction of Suffix Tree in reversed order • Online construction of Affix Tree • Online bidirectional construction of Affix Tree

  4. Σ+-Tree • Alphabet tree • A rooted, directed tree with edge labels from Σ+ (the non-empty strings over the alphabet Σ). • For every symbol a in Σ, each node in T has at most one outgoing edge whose label starts with a. • Example: S = baabab

  5. Path(n) • w = path(n) • w is the string constructed by concatenating all edge labels on the path from the root to the node n. • Example: path(n) = bbabac (Figure: a root-to-n path whose edge labels spell b b a b a c.)
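
The definition of path(n) can be tried out on a tiny hand-built tree. This is only a sketch: the `Node` class, its field names, and the split of bbabac into the edge labels b, ba, bac are assumptions made for illustration, not taken from the slides.

```python
class Node:
    """A node of a Sigma+-tree; each edge label is stored on the child it leads to."""

    def __init__(self, parent=None, label=""):
        self.parent = parent      # None for the root
        self.label = label        # label of the edge from parent to this node
        self.children = {}        # first character of an edge label -> child
        if parent is not None:
            # Sigma+-tree condition: non-empty label, unique first character per node
            assert label and label[0] not in parent.children
            parent.children[label[0]] = self


def path(n):
    """Concatenate the edge labels on the way from the root down to n."""
    parts = []
    while n.parent is not None:
        parts.append(n.label)
        n = n.parent
    return "".join(reversed(parts))


root = Node()
x = Node(root, "b")
y = Node(x, "ba")
z = Node(y, "bac")
print(path(z))  # bbabac
```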

  6. CST & CPT (1) • Compact Suffix Tree (CST): • strings(T) = {u | u is a suffix of t} • Example: S = baabb (Figure: the compact suffix tree of S.)

  7. CST & CPT (2) • Compact Prefix Tree (CPT): • strings(T) = {u | u is a prefix of t} • Example: S = baabb (Figure: the compact prefix tree of S.)

  8. Affix Tree (1) • Prefixes of t are the suffixes of rev(t), read backwards (rev(t) is the reversed string) • Prefix Tree(t) = Suffix Tree(rev(t)) • Example: T = baabb, rev(T) = bbaab (Figure: the compact prefix tree (CPT) of T.)
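
The correspondence between prefixes and reversed suffixes can be checked directly on the slide's example string. A minimal sketch; the helper names `suffixes` and `prefixes` are illustrative and use plain Python sets rather than trees.

```python
def suffixes(t):
    """All non-empty suffixes of t."""
    return {t[i:] for i in range(len(t))}


def prefixes(t):
    """All non-empty prefixes of t."""
    return {t[:i] for i in range(1, len(t) + 1)}


t = "baabb"
rev_t = t[::-1]  # "bbaab"

# Every prefix of t is a suffix of rev(t) read backwards, and vice versa.
reversed_suffixes = {s[::-1] for s in suffixes(rev_t)}
print(prefixes(t) == reversed_suffixes)  # True
```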

  9. Affix Tree (2) • Suffix links • Example: T = ababc (Figure: the CST for t = ababc.)

  10. Affix Tree (2) • Labels on suffix links • The label is the substring that is discarded • Example: t = ababc (Figure: the CST for t = ababc.)

  11. Affix Tree (3) • Mirror tree • The tree formed by the suffix links of T, taken in reversed order • Example: T = ababc (Figures: the CST for t = ababc; the reverse tree of the CST for t = ababc.)

  12. This is not the prefix tree… • t = ababc (Figures: the CST for t = ababc; the reverse tree of the CST for t; the CPT for t = ababc, which is the CST for t' = cbaba.)

  13. Affix Tree (4) • Dual suffix tree, t = ababc • Definition of Affix Tree: • strings(T) = {u | u is a suffix of t} and • strings(rev(T)) = {u | u is a suffix of rev(t)} • Example (Figures: the CAT for t focused on the suffix structure; the CAT focused on the prefix structure.)

  14. Nested Suffix / Prefix • Nested suffix: a suffix w of t that also occurs elsewhere in t • t = { __w______w } • Nested prefix: a prefix w of t that also occurs elsewhere in t • t = { w______w__ }

  15. Nested Suffix / Prefix (2) • Longest nested suffix • t = { __aw______bw } • This is the growth point (生長點) in Ukkonen's algorithm (Figure: positions 1 to 6 of S = a b a a b c …, with edge labels [1,-], [2,-], [1,1], [4,-], [2,-].)

  16. Nested Suffix / Prefix (3) • short(t): the shortest non-nested suffix of t • Shortest leaf
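
The three notions on the last slides (nested suffix, longest nested suffix, shortest non-nested suffix) can be made concrete with a brute-force sketch. It assumes that "nested" means the suffix has at least one further occurrence in t besides the one at the end; the function names are illustrative, and the quadratic scans are for clarity only.

```python
def is_nested_suffix(t, w):
    """True if w is a suffix of t that also occurs somewhere else in t."""
    return t.endswith(w) and t.find(w) < len(t) - len(w)


def longest_nested_suffix(t):
    for i in range(len(t) + 1):          # longest candidate first
        if is_nested_suffix(t, t[i:]):
            return t[i:]
    return ""                            # only reached for the empty string


def shortest_non_nested_suffix(t):
    for i in range(len(t), -1, -1):      # shortest candidate first
        if not is_nested_suffix(t, t[i:]):
            return t[i:]


print(longest_nested_suffix("abaab"))       # ab
print(shortest_non_nested_suffix("abaab"))  # aab
```

For S = abaab the longest nested suffix is ab, and the shortest non-nested suffix aab is exactly one symbol longer. After appending c, no non-empty suffix of abaabc is nested any more.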

  17. Constructing Affix Tree (1) • Prefixes of t are the suffixes of rev(t), read backwards • Prefix Tree(t) = Suffix Tree(rev(t))

  18. Constructing Affix Tree (2) • Ukkonen's algorithm: CST(T) → CST(Ta) • Construct in reversed order: CST(T) → CST(aT) • CAT(T) = CST(T) + CPT(T)

  19. Constructing Affix Tree (3) • Unidirectional Affix Tree Construction • CAT(T) = CST(T) + CPT(T) = CST(T) + CST(rev(T)) → CST(Ta) + CST(rev(Ta)) = CST(Ta) + CST(a·rev(T)) = CST(Ta) + CPT(Ta) = CAT(Ta)

  20. Constructing Affix Tree (4) • Bidirectional Affix Tree Construction • Extra growth point

  21. Constructing Affix Tree (5) • Algorithm for constructing the CST in reversed order • Algorithm for combining the construction of CST & CPT → unidirectional • Algorithm for coordinating the growth points → bidirectional

  22. Online construction of Suffix Tree in reversed order

  23. Online construction of Suffix Tree in reversed order • Goal: build CST(at) from CST(t). • Weiner's algorithm. • Idea: CST(t) already represents all suffixes of at except at itself, so only the suffix at needs to be inserted.
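
The idea on this slide can be illustrated with an uncompressed suffix trie built from nested dictionaries. This is a deliberately naive sketch, not Weiner's algorithm: the helper names are made up, and inserting the new suffix by walking from the root costs time proportional to its length, which is exactly what the growth-point machinery on the following slides avoids.

```python
def insert(trie, s):
    """Insert the string s into a trie of nested dicts."""
    node = trie
    for ch in s:
        node = node.setdefault(ch, {})


def suffix_trie(t):
    """Uncompressed suffix trie of t, built from scratch."""
    trie = {}
    for i in range(len(t)):
        insert(trie, t[i:])
    return trie


def prepend(trie, a, t):
    """Turn the suffix trie of t into the suffix trie of at."""
    insert(trie, a + t)   # the only suffix of at that is not represented yet
    return trie


t = "baca"
trie = prepend(suffix_trie(t), "a", t)
print(trie == suffix_trie("a" + t))  # True
```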

  24. Online construction of Suffix Tree in reversed order • α(t): the longest nested prefix of t • Additional information: we assume that |α(at)| is available in each iteration of the following algorithm. With this, we do not need the "indicator vector" used by Weiner. • |α(at)| = the length of α(at)

  25. From CST(t) to CST(at) • Idea: If we insert the suffix at into the CST(t), the leaf for at will branch at the location representing α(at) in CST(t). • Why?
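
One way to see why the new leaf branches at α(at): a prefix of at occurs a second time in at exactly when it is a substring of t, so α(at) is the longest prefix of at that CST(t) already represents. The brute-force sketch below compares the two definitions; `alpha` and `branch_prefix` are illustrative names, and the example reuses the string cabaabaca from the example slides.

```python
def alpha(t):
    """Longest nested prefix of t: the longest prefix that occurs again later in t."""
    for k in range(len(t), -1, -1):
        if t.find(t[:k], 1) != -1:
            return t[:k]
    return ""  # only reached for the empty string


def branch_prefix(a, t):
    """Longest prefix of at that is already a substring of t."""
    s = a + t
    for k in range(len(s), -1, -1):
        if s[:k] in t:
            return s[:k]


print(alpha("abaabaca"))               # aba
print(branch_prefix("a", "baabaca"))   # aba
```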

  26. Definition • Reference Pair • (base, (offset, length)) • (b,(o,l)) is a reference pair for s if path(b)·t[o,…,o+l-1] = s. (Figure: the same location referenced from successively deeper bases b1, b2, b3 connected by edges of lengths l1, l2, l3: (b1,(o,l)), (b2,(o+l1, l-l1)), (b3,(o+l1+l2, l-l1-l2)).)
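
The figure on this slide suggests how a reference pair is moved to a deeper base: each time a whole edge fits into the remaining length, the base advances and the offset and length are adjusted. The sketch below assumes that "canonical" means the base is as deep as possible, and it uses 0-based, non-negative indices instead of the non-positive indices of the later example slides. `Node`, `add_edge`, and `canonize` are hypothetical names.

```python
class Node:
    def __init__(self):
        # first character -> (start, length, child); the edge label is t[start:start+length]
        self.children = {}


def add_edge(t, parent, start, length):
    child = Node()
    parent.children[t[start]] = (start, length, child)
    return child


def canonize(t, b, o, l):
    """Move the base of the reference pair (b, (o, l)) as far down as possible."""
    while l > 0:
        start, length, child = b.children[t[o]]
        if length > l:
            break                      # the location lies strictly inside this edge
        b, o, l = child, o + length, l - length
    return b, o, l


t = "abcabc"
root = Node()
n1 = add_edge(t, root, 0, 2)   # edge "ab"
n2 = add_edge(t, n1, 2, 1)     # edge "c"
n3 = add_edge(t, n2, 3, 3)     # edge "abc"

b, o, l = canonize(t, root, 0, 4)   # the string "abca"
print(b is n2, o, l)                # True 3 1
```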

  27. From CST(t) to CST(at) • Goal: find growth point(at) from growth point(t). • growth point(at) = the canonical location of α(at) • growth point(t) = the canonical location of α(t)

  28. From CST(t) to CST(at) • Algorithm • Step 1: find growth point(at) from growth point(t). • Step 2: insert the leaf for at at growth point(at).

  29. Step 1: find growth point(at) from growth point(t). • Lemma 1: α(at) is a prefix of aα(t). • Proof: Both strings are prefixes of at = { aα(t)__, α(at)__ }, so either α(at) is a prefix of aα(t) or aα(t) is a proper prefix of α(at). If aα(t) were a proper prefix of α(at), then α(at) = aα(t)v for some non-empty string v. Since α(at) is nested in at, we would have t = { α(t)v__, __aα(t)v__ }, so α(t)v would be a nested prefix of t longer than α(t), which contradicts the definition of α.
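
Lemma 1 can also be checked exhaustively on short strings. This is a sanity check rather than a proof; the brute-force `alpha`, the alphabet {a, b, c}, and the length bound 6 are all choices made for this sketch.

```python
from itertools import product


def alpha(t):
    """Longest nested prefix of t: the longest prefix that occurs again later in t."""
    for k in range(len(t), -1, -1):
        if t.find(t[:k], 1) != -1:
            return t[:k]
    return ""  # only reached for the empty string


def lemma1_holds(a, t):
    """Lemma 1: alpha(at) is a prefix of a + alpha(t)."""
    return (a + alpha(t)).startswith(alpha(a + t))


alphabet = "abc"
all_ok = all(
    lemma1_holds(a, "".join(letters))
    for n in range(7)
    for letters in product(alphabet, repeat=n)
    for a in alphabet
)
print(all_ok)  # True
```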

  30. Step 1: find growth point(at) from growth point(t). • growth point(t) = (b,(O,L)), growth point(at) = (b',(O',L')) • path(b') is a prefix of α(at). Hence there is a suffix link labeled a from b' to some node s.

  31. Step 1: find growth point(at) from growth point(t). • Since α(at) is a prefix of aα(t) (by Lemma 1) and path(b') is a prefix of α(at), path(s) is a prefix of α(t). • s is an ancestor of b, thus there is a path p = (s,…,b) in CST(t).

  32. Step 1: find growth point(at) from growth point(t). • Lemma 2: For any intermediate node s' on the path p (s' ≠ s, s' ≠ b), there is no node q with a suffix link labeled a from q to s'. (Figure: such a node q does not exist.)

  33. Proof of Lemma 2 • Suppose such a q existed. Since path(s') is a nested prefix of t, path(q) is a nested prefix of at. We know that q ≠ b', therefore q would be a node above b'. • But |path(q)| = |a| + |path(s')| > |a| + |path(s)| = |path(b')|, which is a contradiction.

  34. Step 1: find growth point(at) from growth point(t). • As a result, we can find growth point(at) from growth point(t) by the following algorithm: start at the base b and walk up towards the root until a suffix link labeled a is found. • This gives the new base b'. Knowing |α(at)|, we set the length L to |α(at)| - depth(b') and obtain growth point(at).

  35. Step 1: find growth point(at) from growth point(t). • growth point(t) = (b,(O,L)). • Algorithm:
        while true:
            if some node b' has a suffix link labeled a pointing to b then
                b := b'; break
            if b = root then
                O := O - 1; break
            else
                len := getParentEdge(b).length
                O := O - len
                b := getParent(b)
        L := |α(at)| - depth(b)
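
The walk-up loop on this slide can be written as a small runnable sketch. It mirrors the slide's pseudocode step by step and does not build a suffix tree: the `Node` class, the `rev_links` dictionary (holding, for each symbol a, the node whose a-labeled suffix link points here), and the hand-built four-node tree are assumptions made for illustration.

```python
class Node:
    def __init__(self, parent=None, edge_len=0):
        self.parent = parent          # None for the root
        self.edge_len = edge_len      # length of the edge from the parent
        self.depth = 0 if parent is None else parent.depth + edge_len
        self.rev_links = {}           # a -> node whose suffix link labeled a points here


def next_growth_point(b, O, a, alpha_len):
    """Walk up from base b until a suffix link labeled a arrives; return (b', O', L')."""
    while True:
        q = b.rev_links.get(a)
        if q is not None:             # found the link: its source is the new base
            b = q
            break
        if b.parent is None:          # reached the root without finding a link
            O -= 1
            break
        O -= b.edge_len               # give the parent edge back to the offset
        b = b.parent
    L = alpha_len - b.depth           # alpha_len plays the role of |alpha(at)|
    return b, O, L


root = Node()
x = Node(root, 2)
y = Node(x, 1)
q = Node(root, 3)
x.rev_links["a"] = q                  # suffix link q -> x, labeled a

b, O, L = next_growth_point(y, -5, "a", 4)
print(b is q, O, L)                   # True -6 1
```

Each iteration moves one node closer to the root, so the loop runs at most depth-of-b times in node steps.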

  36. Step 1: find growth point(at) from growth point(t). • In practice, to ensure that O never increases and to avoid redundant work, we do not keep growth point(at) itself but its parent.

  37. Step 2: insert the leaf for at at growth point(at). • There are three possibilities for the location of growth point(at): 1. an inner node. 2. inside some edge. 3. a leaf.

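  38. Step 2: insert the leaf for at at growth point(at). • Case 1: inner node. (Figure labels: edge [u,v] with len = v-u+1; growth point(at) at depth d; edge [-|at|+d,0]; a leaf at depth |at|; the leaf for suffix t; label [-|at|,-|at|]; base b with O = x, L = 0 and O = x - len, L = len.)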

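  39. Step 2: insert the leaf for at at growth point(at). • Case 2: inside some edge. growth point(at) = (b,(O,L)) (Figure labels: edge [x,y] from b to some node, split into [O,O+L-1] and [x+L,y]; edge [O+L,0]; a leaf at depth |at|; the leaf for suffix t; label [-|at|,-|at|]; one edge label marked [??,??].)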

  40. Step 2: insert the leaf for at at growth point(at). • How do we generate the suffix link for the branching node? • Lemma 3: the suffix link is labeled a and points to a node s' on the path p = (s,…,b).

  41. Proof of Lemma 3 • Let b'' be the branching node, with path(b') = aw and path(b'') = awp. • Thus path(s) = w. • That CST(at) branches at b'' means that at = {__awpx__, __awpy__} for some x ≠ y. • Thus there is an inner node s' with path(s') = wp. • By Lemma 1, wp is a prefix of α(t). • Hence s' is on the path p = (s,…,b).

  42. How to generate the suffix link for the branching node? • Since |α(at)| is known, while executing Step 1 we look for a node with depth = |α(at)| - 1 and remember it. • Thus the suffix link for the branching node can be constructed in constant time.

  43. How to generate the suffix link for the branching node… cont. • What if some suffix link already points to s'? • There is at most one such node p. Why? (Figure: a node p with path xa below a node with path x.)

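  44. Step 2: insert the leaf for at at growth point(at). • Case 3: leaf. (Figure labels: edge [u,v] at depth d', split into [u,u] and [u+1,v]; edge [x,y] with len = y-x+1; growth point(at) at depth d; edges [-|at|+d',0] and [-|at|+d,0]; a leaf at depth |at|; the leaf for suffix t; label [-|at|,-|at|]; base b with O = z, L = 0 and O = z - len, L = len.)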

  45. Step 2: insert the leaf for at at growth point(at). • There are three possibilities for the location of growth point(at): 1. an inner node. 2. inside some edge. 3. a leaf. • In each case, inserting the leaf for at takes constant time.
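
The first two cases (location at a node, location inside an edge) can be sketched on a tree whose edges carry explicit label strings. This is a simplification and not the slides' data structure: the slides store index pairs such as [O,O+L-1] so that splitting an edge copies no characters, whereas the slicing below copies substrings. The leaf case and the suffix links are left out, and `Node` and `insert_leaf` are hypothetical names.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge label, child node)


def insert_leaf(base, rest, leaf_label):
    """Attach a new leaf below the location reached from base by reading rest."""
    if rest == "":
        branch = base                               # case 1: the location is a node
    else:
        label, child = base.children[rest[0]]
        assert label.startswith(rest) and len(rest) < len(label)
        branch = Node()                             # case 2: split the edge at the location
        base.children[rest[0]] = (rest, branch)
        tail = label[len(rest):]
        branch.children[tail[0]] = (tail, child)
    leaf = Node()
    branch.children[leaf_label[0]] = (leaf_label, leaf)
    return branch, leaf


root = Node()
x = Node()
root.children["a"] = ("abc", x)

branch, leaf = insert_leaf(root, "ab", "d")
print(root.children["a"][0])        # ab
print(sorted(branch.children))      # ['c', 'd']
```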

  46. Example • String positions -8 … 0: c a b a a b a c a (Figure labels: [0,0], [0,0], [-1,0], [-1,-1]; base b; O = 0, L = 0; O = -1, L = 1.)

  47. Example • String positions -8 … 0: c a b a a b a c a (Figure labels: [0,0], [-1,0], [0,0], [-1,0], [-1,-1], [-2,0], [-1,0], [-2,-2]; base b; O = -2, L = 1; O = -1, L = 1.)

  48. Example • String positions -8 … 0: c a b a a b a c a (Figure labels: [-1,0], [-1,0], [-2,0], [-3,0], [-2,-2], [-3,-3]; base b; O = -2, L = 1; O = -3, L = 1.)

  49. Example • String positions -8 … 0: c a b a a b a c a (Figure labels: [-2,-2], [-1,0], [-4,-4], [-1,0], [-1,-1], [-2,0], [-3,0], [-3,0], [-2,-2], [-1,0], [-4,-4], [-3,-3]; base b; O = -4, L = 1; O = -3, L = 1.)

  50. Example • String positions -8 … 0: c a b a a b a c a (Figure labels: [-2,-2], [-4,-4], [-1,0], [-1,-1], [-4,0], [-3,0], [-5,-5], [-3,0], [-2,-2], [-1,0], [-4,-4], [-3,-3]; base b; O = -4, L = 1; O = -5, L = 1; O = -4, L = 0.)
