Bidirectional Online Construction of Affix Tree B89209016 張嘉真 B89902103 林虹佑 B89902106 高偉鈞
Source • Moritz G. Maaß. Linear Bidirectional On-Line Construction of Affix Trees. Algorithmica. • Online publication May 28, 2003
Outline • Introduction to Affix Tree • Online construction of Suffix Tree in reversed order • Online construction of Affix Tree • Online bidirectional construction of Affix Tree
Σ+ Tree • Letter tree • A rooted, directed tree T with edge labels from Σ+ (the non-empty strings over the alphabet Σ). • For each letter a in Σ, every node in T has at most one outgoing edge whose label starts with a. • Example: S = baabab
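The two defining conditions can be checked mechanically. A minimal sketch, assuming a tree is stored as nested nodes whose `children` dictionary maps each edge label to a child; `Node` and `is_sigma_plus_tree` are illustrative names that do not appear in the slides:

```python
class Node:
    def __init__(self, children=None):
        # children maps an edge label (a string) to the child Node it leads to
        self.children = children or {}

def is_sigma_plus_tree(node):
    """True if every label is non-empty and no two edges leaving a node share a first letter."""
    labels = list(node.children)
    if any(label == "" for label in labels):
        return False
    first_letters = [label[0] for label in labels]
    if len(first_letters) != len(set(first_letters)):
        return False
    return all(is_sigma_plus_tree(child) for child in node.children.values())

good = Node({"a": Node(), "ba": Node({"a": Node(), "b": Node()})})
bad = Node({"ab": Node(), "aa": Node()})  # two edges leaving the root start with "a"
print(is_sigma_plus_tree(good), is_sigma_plus_tree(bad))  # True False
```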
Path(n) • w = path(n) • w is the string constructed by concatenating all edge labels on the path from the root to the node n. • Example: path(n) = bbabac
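A small sketch of path(n), assuming each node stores a pointer to its parent and the label of the edge leading to it. The split of bbabac into the edge labels b, ba, bac is a made-up illustration, since the figure's actual edges are not recoverable:

```python
class Node:
    def __init__(self, parent=None, label=""):
        self.parent = parent  # None for the root
        self.label = label    # label of the edge from parent to this node

def path(n):
    """Concatenate the edge labels from the root down to n."""
    labels = []
    while n.parent is not None:
        labels.append(n.label)
        n = n.parent
    return "".join(reversed(labels))

root = Node()
u = Node(root, "b")    # hypothetical edge labels
v = Node(u, "ba")
n = Node(v, "bac")
print(path(n))  # bbabac
```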
CST & CPT (1) • Compact Suffix Tree (CST): • words(T) = {u | u is a suffix of t} • Example: S = baabb • [Figure: Compact Suffix Tree (CST) of S]
CST & CPT (2) • Compact Prefix Tree (CPT): • words(T) = {u | u is a prefix of t} • Example: S = baabb • [Figure: Compact Prefix Tree (CPT) of S]
Affix Tree(1) • Read backwards, the prefixes of t are exactly the suffixes of rev(t), the reverse of t • Prefix Tree(t) = Suffix Tree(rev(t)) • Example: T = baabb, rev(T) = bbaab • [Figure: Compact Prefix Tree (CPT) of T]
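The correspondence between prefixes and suffixes of the reversed string can be checked directly on the slide's example. A brute-force sketch; the helper names are illustrative:

```python
def prefixes(t):
    """All non-empty prefixes of t."""
    return {t[:k] for k in range(1, len(t) + 1)}

def suffixes(t):
    """All non-empty suffixes of t."""
    return {t[k:] for k in range(len(t))}

t = "baabb"
rev_t = t[::-1]
print(rev_t)  # bbaab
# Reversing every prefix of t yields exactly the suffixes of rev(t).
print({p[::-1] for p in prefixes(t)} == suffixes(rev_t))  # True
```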
Affix Tree(2) • Suffix links • Example: T = ababc • [Figure: the CST for t = ababc]
Affix Tree(2) • Labels on suffix links • The label is the substring that is discarded when the link is followed • Example: t = ababc • [Figure: the CST for t = ababc with labeled suffix links]
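To make the link labels concrete, the following brute-force sketch lists the inner nodes of the CST by their path strings and gives each non-root inner node aw a link to w labeled with the discarded letter a. It covers only inner nodes and single-letter labels, which is an assumption of this sketch; the function names are illustrative.

```python
def inner_nodes(t):
    """Paths of the root and of every branching node of the compact suffix tree of t."""
    nodes = {""}
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            w = t[i:j]
            # letters that follow some occurrence of w in t
            followers = {t[k + len(w)] for k in range(len(t) - len(w)) if t.startswith(w, k)}
            if len(followers) >= 2:
                nodes.add(w)
    return nodes

def suffix_links(t):
    """Map each non-root inner node aw to (a, w): the link label and the link target."""
    return {w: (w[0], w[1:]) for w in inner_nodes(t) if w}

print(sorted(inner_nodes("ababc")))  # ['', 'ab', 'b']
print(suffix_links("ababc"))
```

For t = ababc this gives a link labeled a from the node ab to the node b, and a link labeled b from the node b to the root.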
Affix Tree(3) • Mirror tree (reverse tree) • The tree formed by the suffix links of T, followed in reverse direction • Example: T = ababc • [Figure: the reverse tree of the CST for t = ababc, next to the CST for t = ababc]
This is not the prefix tree… • t = ababc • [Figure: the reverse tree of the CST for t, the CST for t = ababc, and the CPT for t = ababc (the CST for t' = cbaba)]
Affix Tree(4) • Dual suffix tree, t = ababc • Definition of Affix Tree: words(T) = {u | u is a suffix of t} and words(rev(T)) = {u | u is a suffix of rev(t)} • Example: [Figure: the CAT for t focused on the suffix structure, and the CAT focused on the prefix structure]
Nested Suffix / Prefix • Nested suffix: a suffix w of t that also occurs elsewhere in t • t = { __w______w } • Nested prefix: a prefix w of t that also occurs elsewhere in t • t = { w______w__ }
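Both notions reduce to counting occurrences. A brute-force sketch with illustrative function names, assuming that overlapping occurrences count:

```python
def occurrences(w, t):
    """Number of (possibly overlapping) positions at which w occurs in t."""
    return sum(1 for i in range(len(t) - len(w) + 1) if t.startswith(w, i))

def is_nested_suffix(w, t):
    return t.endswith(w) and occurrences(w, t) >= 2

def is_nested_prefix(w, t):
    return t.startswith(w) and occurrences(w, t) >= 2

t = "abaab"
print(is_nested_suffix("ab", t))   # True: "ab" ends t and also occurs at the start
print(is_nested_suffix("aab", t))  # False: "aab" occurs only once
print(is_nested_prefix("ab", t))   # True
```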
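Nested Suffix / Prefix(2) • Longest nested suffix • t = { __aw______bw } • This is the active point (生長點, literally "growth point") of Ukkonen's algorithm • Example: S = abaab at positions 1 to 5, next letter c at position 6 • [Figure: suffix tree of S with edge labels [1,-], [2,-], [1,1], [4,-], [2,-]]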
Nested Suffix / Prefix(3) • short(t): the shortest non-nested suffix of t • Shortest leaf
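The longest nested suffix and the shortest non-nested suffix can be found by brute force, here on the string abaab used with Ukkonen's algorithm in the previous slide. The function names are illustrative:

```python
def occurrences(w, t):
    """Number of (possibly overlapping) positions at which w occurs in t."""
    return sum(1 for i in range(len(t) - len(w) + 1) if t.startswith(w, i))

def longest_nested_suffix(t):
    for k in range(len(t)):              # longest suffix first
        if occurrences(t[k:], t) >= 2:
            return t[k:]
    return ""                            # only the empty suffix is nested

def shortest_non_nested_suffix(t):
    for k in range(len(t) - 1, -1, -1):  # shortest non-empty suffix first
        if occurrences(t[k:], t) == 1:
            return t[k:]
    return ""                            # reached only for the empty string

print(longest_nested_suffix("abaab"))       # ab
print(shortest_non_nested_suffix("abaab"))  # aab
print(repr(longest_nested_suffix("abaabc")))  # ''
```

Appending c, a letter that has not occurred before, leaves no non-empty nested suffix, so the last call returns the empty string.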
Constructing Affix Tree(1) • Read backwards, the prefixes of t are exactly the suffixes of rev(t) • Prefix Tree(t) = Suffix Tree(rev(t))
Constructing Affix Tree(2) • Ukkonen's algorithm: CST(T) → CST(Ta) • Construct: CST(T) → CST(aT) • CAT(T) = CST(T) + CPT(T)
Constructing Affix Tree(3) • Unidirectional Affix Tree Construction • CAT(T) = CST(T) + CPT(T) = CST(T) + CST(rev(T)) → CST(Ta) + CST(rev(Ta)) = CST(Ta) + CST(a·rev(T)) = CST(Ta) + CPT(Ta) = CAT(Ta)
Constructing Affix Tree(4) • Bidirectional Affix Tree Construction • Extra active point
Constructing Affix Tree(5) • Algorithm for constructing the CST in reversed order • Algorithm for combining the construction of CST & CPT (unidirectional) • Algorithm for coordinating the active points (bidirectional)
Online construction of Suffix Tree in reversed order • Goal: build CST(at) from CST(t). • Weiner's algorithm. • Idea: CST(t) already represents every suffix of at except at itself, so only the suffix at needs to be inserted.
Online construction of Suffix Tree in reversed order • α(t): the longest nested prefix of t • Additional information: we assume that |α(at)| is known in each iteration of the following algorithm. With it, we do not need the "indicator vector" used by Weiner. • |α(at)| = the length of α(at)
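For small inputs α(t) can be computed by brute force, which is handy for checking the statements that follow. This sketch is only a reference definition, not the way the algorithm obtains |α(at)|; the name `alpha` is illustrative:

```python
def alpha(t):
    """Longest prefix of t that also occurs at some later position in t."""
    for k in range(len(t), 0, -1):
        w = t[:k]
        if any(t.startswith(w, i) for i in range(1, len(t) - k + 1)):
            return w
    return ""

print(alpha("abcab"))           # ab
print(len(alpha("abaabaca")))   # 3, since aba occurs at positions 0 and 3
```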
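From CST(t) to CST(at) • Idea: if we insert the suffix at into CST(t), the leaf for at branches off at the location representing α(at) in CST(t). • Why? α(at) is the longest prefix of at that also occurs inside t, so it is the longest prefix of at that CST(t) already contains.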
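The reason the new leaf branches at α(at) is that a prefix of at is nested exactly when it also occurs somewhere in t. A brute-force check of this equivalence over all short strings on a two-letter alphabet; the helper names and the length bound are choices made for this sketch:

```python
from itertools import product

def alpha(t):
    """Longest prefix of t that also occurs at some later position in t."""
    for k in range(len(t), 0, -1):
        if any(t.startswith(t[:k], i) for i in range(1, len(t) - k + 1)):
            return t[:k]
    return ""

def longest_prefix_occurring_in(s, t):
    """Longest prefix of s that is a substring of t."""
    for k in range(len(s), 0, -1):
        if s[:k] in t:
            return s[:k]
    return ""

def branch_point_is_alpha(max_len, alphabet="ab"):
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            t = "".join(letters)
            for a in alphabet:
                if alpha(a + t) != longest_prefix_occurring_in(a + t, t):
                    return False
    return True

print(branch_point_is_alpha(6))  # True
```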
Definition • Reference pair: (base, (offset, length)) • (b, (o, l)) is a reference pair for s if path(b)·t[o, …, o+l-1] = s. • [Figure: the same string referenced from successive nodes b1, b2, b3 on a path whose edges have lengths l1, l2, l3: (b1, (o, l)), (b2, (o+l1, l-l1)), (b3, (o+l1+l2, l-l1-l2))]
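Moving a reference pair down to the deepest possible base, as in the figure, can be sketched as follows. The tree layout (`Node.edges` keyed by first letter), the helper `add_edge`, the function name `canonize`, and the example chain are assumptions of this sketch, not taken from the slides:

```python
class Node:
    def __init__(self):
        self.edges = {}  # first letter -> (edge label, child Node)

def add_edge(parent, label):
    child = Node()
    parent.edges[label[0]] = (label, child)
    return child

def canonize(t, base, offset, length):
    """Replace (base, (offset, length)) by the equivalent pair with the deepest base."""
    while length > 0:
        label, child = base.edges[t[offset]]
        if len(label) > length:   # the location lies inside this edge
            break
        base, offset, length = child, offset + len(label), length - len(label)
    return base, offset, length

# Hypothetical chain b1 -ab-> b2 -c-> b3 -abc-> b4 over the text abcabc.
t = "abcabc"
b1 = Node()
b2 = add_edge(b1, "ab")
b3 = add_edge(b2, "c")
b4 = add_edge(b3, "abc")
base, offset, length = canonize(t, b1, 0, 4)  # the string abca, referenced from b1
print(base is b3, offset, length)  # True 3 1
```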
From CST(t) to CST(at) • Goal: find active point(at) from active point(t). • active point(at) = the canonical location of α(at) • active point(t) = the canonical location of α(t)
From CST(t) to CST(at) • Algorithm • Step 1: find active point(at) from active point(t). • Step 2: insert the leaf for at at active point(at).
Step 1: find active point(at) from active point(t). • Lemma 1: α(at) is a prefix of aα(t). • Proof: Both strings are prefixes of at, that is, at = { aα(t)__, α(at)__ }. So either α(at) is a prefix of aα(t), or aα(t) is a proper prefix of α(at). If aα(t) were a proper prefix of α(at), then α(at) = aα(t)v for some non-empty string v. Since α(at) is a nested prefix of at, we would have t = { α(t)v__, __aα(t)v__ }, so α(t)v would be a nested prefix of t longer than α(t), which contradicts the definition of α.
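Lemma 1 can also be sanity-checked by brute force over all short strings. A sketch, with `alpha` as a naive reference implementation of the longest nested prefix; the alphabet and length bound are arbitrary choices:

```python
from itertools import product

def alpha(t):
    """Longest prefix of t that also occurs at some later position in t."""
    for k in range(len(t), 0, -1):
        if any(t.startswith(t[:k], i) for i in range(1, len(t) - k + 1)):
            return t[:k]
    return ""

def lemma1_holds(max_len, alphabet="ab"):
    """Check that alpha(a + t) is a prefix of a + alpha(t) for all t up to max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            t = "".join(letters)
            for a in alphabet:
                if not (a + alpha(t)).startswith(alpha(a + t)):
                    return False
    return True

print(lemma1_holds(7))  # True
```

In particular |α(at)| ≤ |α(t)| + 1, so the length of α can grow by at most one per prepended letter.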
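Step 1: find active point(at) from active point(t). • Let active point(t) = (b, (O, L)) and active point(at) = (b', (O', L')). • path(b') is a prefix of α(at). Hence there is a suffix link labeled a from b' to some node s. • [Figure: nodes s, b and b', with the suffix link labeled a from b' to s]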
Step 1: find active point(at) from active point(t). • Since α(at) is a prefix of aα(t) (by Lemma 1) and path(b') is a prefix of α(at), path(s) is a prefix of α(t). • s is an ancestor of b, thus there is a path p = (s, …, b) in CST(t). • [Figure: nodes s, b and b', with the suffix link labeled a from b' to s]
Step 1: find active point(at) from active point(t). • Lemma 2: For any intermediate node s' on the path p (s' ≠ s, s' ≠ b), there is no node q with a suffix link labeled a from q to s'. • [Figure: nodes s, s', b and b'; a node q with a link labeled a to s' does not exist]
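Proof of Lemma 2 • Suppose such a q existed. Since path(s') is a nested prefix of t, path(q) = a·path(s') is a nested prefix of at, and therefore a prefix of α(at). We know that q ≠ b'. • Moreover |path(q)| = |a| + |path(s')| > |a| + |path(s)| = |path(b')|, so q would be a node below b' whose path is still a prefix of α(at). This contradicts the choice of b' as the base of the canonical location of α(at).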
Step 1: find active point(at) from active point(t). • As a result, we can find active point(at) from active point(t) with the following algorithm: start at the base b and walk up towards the root until a suffix link labeled a is found. • This gives the new base b'. Knowing |α(at)|, we set the length L to |α(at)| - depth(b') and obtain active point(at).
Step 1: find active point(at) from active point(t). • active point(t) = (b, (O, L)). • Algorithm:
WHILE (1)
    if (a suffix link labeled a is found at b, coming from a node b') then
        b := b'
        break
    if (b = root) then
        O := O - 1
        break
    else
        len := getParentEdge(b).length
        O := O - len
        b := getParent(b)
L := |α(at)| - depth(b)
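The walk towards the root can be sketched in Python as follows. This is a sketch under assumptions, not the authors' implementation: the `Node` layout, the field `rev_link` (which stores, at the target s, the node b' whose suffix link labeled a points to s), and the function name are hypothetical, and |α(at)| is passed in as `alpha_len` because the slides assume it is known.

```python
class Node:
    def __init__(self, parent=None, edge_len=0):
        self.parent = parent        # None for the root
        self.edge_len = edge_len    # length of the edge from parent to this node
        self.depth = 0 if parent is None else parent.depth + edge_len
        self.rev_link = {}          # letter a -> node b' with path(b') = a + path(self)

def next_active_point(b, offset, a, alpha_len):
    """From the base b and offset O of active point(t), return (b', O', L') for at."""
    while True:
        if a in b.rev_link:         # a suffix link labeled a points to b
            b = b.rev_link[a]
            break
        if b.parent is None:        # reached the root without finding a link
            offset -= 1
            break
        offset -= b.edge_len        # walk up one edge
        b = b.parent
    return b, offset, alpha_len - b.depth

# Hypothetical tree: root -1-> s -2-> b, and a node bp of depth 2 linked to s by a.
root = Node()
s = Node(root, 1)
b = Node(s, 2)
bp = Node(root, 2)
s.rev_link["a"] = bp
new_base, new_offset, new_len = next_active_point(b, -4, "a", 3)
print(new_base is bp, new_offset, new_len)  # True -6 1
```

For t = bab and a = a, the active point of t is (root, (-2, 1)) and |α(abab)| = 2; since the root has no link labeled a, the call `next_active_point(root, -2, "a", 2)` on a bare root returns the root with offset -3 and length 2.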
Step 1: find active point(at) from active point(t). • In practice, to ensure that O never increases and to avoid redundant work, we do not keep active point(at) itself but the parent of active point(at).
Step 2: insert the leaf for at at active point(at). • There are three possibilities for the location of active point(at): 1. inner node. 2. in some edge. 3. leaf.
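Step 2: insert the leaf for at at active point(at). • Case 1: inner node. • [Figure: active point(at) is an inner node at depth d, entered by an edge [u,v] with len = v-u+1; the new leaf at depth |at| hangs below it on an edge labeled [-|at|+d, 0]; also shown are the leaf for suffix t and the label [-|at|,-|at|]. Values shown for b: O = x, L = 0 and O = x - len, L = len.]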
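Step 2: insert the leaf for at at active point(at). • Case 2: in some edge. • active point(at) = (b, (O, L)) • [Figure: the edge [x,y] from b to some node is split after L letters; labels shown are [O, O+L-1], [x+L, y], [??,??] and, for the new leaf at depth |at|, [O+L, 0]; also shown are the leaf for suffix t and the label [-|at|,-|at|].]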
Step 2: insert the leaf for at at active point(at). • How do we generate the suffix link for the branching node? • Lemma 3: the suffix link is labeled a and points to a node s' on the path p = (s, …, b). • [Figure: nodes s, s', b and b', with the link labeled a]
Proof of Lemma 3 • Consider the branching node b'', and let path(b') = aw, path(b'') = awp. • Thus path(s) = w. • That CST(at) branches at b'' means that at = {__awpx__, __awpy__} for some letters x ≠ y. • Thus there is an inner node s' with path(s') = wp. • By Lemma 1, wp is a prefix of α(t). • Hence s' is on the path p = (s, …, b). • [Figure: nodes s, s', b, b' and b'', with the link labeled a]
How to generate the suffix link for the branching node? • Since |α(at)| is known, while executing step 1 we look for the node with depth = |α(at)| - 1 and remember it. • Thus the suffix link for the branching node can be constructed in constant time. • [Figure: nodes s, s', b and b'; s' has depth |α(at)| - 1]
How to generate the suffix link for the branching node… cont. • What if there is already some suffix link pointing to s'? • There is at most one such node p. Why? • [Figure: nodes s, s', b, b' and p, with edge labels a, x and xa]
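Step 2: insert the leaf for at at active point(at). • Case 3: leaf. • [Figure: active point(at) is a leaf at depth d; a node at depth d' is also shown. Labels in the figure: [u,v], [x,y] with len = y-x+1, [u+1,v], [u,u], [-|at|+d',0], [-|at|+d,0], [-|at|,-|at|]; the new leaf at depth |at| and the leaf for suffix t. Values shown for b: O = z, L = 0 and O = z - len, L = len.]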
Step 2: insert the leaf for at at active point(at). • There are three possibilities for the location of active point(at): 1. inner node. 2. in some edge. 3. leaf. • In each case, inserting the leaf for at takes constant time.
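The first two cases (inner node, inside an edge) can be sketched as below; the leaf case is deliberately left out. All names and the storage layout are assumptions of this sketch: the text is kept reversed in `rev`, so the slide-style position p ≤ 0 is `rev[-p]` and prepending a letter is an append, and each edge is stored as inclusive positions (start, end) plus the child.

```python
class Node:
    def __init__(self):
        self.edges = {}  # first letter -> (start, end, child), positions <= 0, inclusive

def letter(rev, pos):
    """Letter at slide-style position pos (0 is the last letter of the text)."""
    return rev[-pos]

def insert_leaf(rev, base, O, L):
    """Insert the leaf for the whole text at (base, (O, L)); base must be an inner node."""
    leaf = Node()
    if L == 0:
        branch = base                       # case 1: the location is the node itself
    else:
        first = letter(rev, O)              # case 2: split the edge after L letters
        x, y, child = base.edges[first]
        branch = Node()
        base.edges[first] = (x, x + L - 1, branch)
        branch.edges[letter(rev, x + L)] = (x + L, y, child)
    branch.edges[letter(rev, O + L)] = (O + L, 0, leaf)
    return leaf

def leaf_paths(rev, node, prefix=""):
    """Strings spelled out on the root-to-leaf paths below node."""
    if not node.edges:
        return {prefix}
    out = set()
    for start, end, child in node.edges.values():
        out |= leaf_paths(rev, child, prefix + rev[-end:-start + 1][::-1])
    return out

# CST(ab), then prepend a: at = aab, alpha(at) = a lies inside the edge ab.
rev = "baa"                                  # aab reversed
root, leaf_ab, leaf_b = Node(), Node(), Node()
root.edges = {"a": (-1, 0, leaf_ab), "b": (0, 0, leaf_b)}
insert_leaf(rev, root, -2, 1)
print(sorted(leaf_paths(rev, root)))  # ['aab', 'ab', 'b']
```

The work per insertion is a fixed number of dictionary updates, which matches the constant-time claim for these two cases.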
Example • Text at positions -8 to 0: c a b a a b a c a • [Figure: tree with edge labels [0,0], [-1,0], [-1,-1]. Values shown for b: O = 0, L = 0 and O = -1, L = 1.]
Example • Text at positions -8 to 0: c a b a a b a c a • [Figure: tree with edge labels [0,0], [-1,0], [-1,-1], [-2,0], [-2,-2]. Values shown for b: O = -2, L = 1 and O = -1, L = 1.]
Example • Text at positions -8 to 0: c a b a a b a c a • [Figure: tree with edge labels [-1,0], [-2,0], [-3,0], [-2,-2], [-3,-3]. Values shown for b: O = -2, L = 1 and O = -3, L = 1.]
Example • Text at positions -8 to 0: c a b a a b a c a • [Figure: tree with edge labels [-1,0], [-1,-1], [-2,0], [-2,-2], [-3,0], [-3,-3], [-4,-4]. Values shown for b: O = -4, L = 1 and O = -3, L = 1.]
Example • Text at positions -8 to 0: c a b a a b a c a • [Figure: tree with edge labels [-1,0], [-1,-1], [-2,-2], [-3,0], [-3,-3], [-4,0], [-4,-4], [-5,-5]. Values shown for b: O = -4, L = 1; O = -5, L = 1; and O = -4, L = 0.]
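For the example text, the lengths |α| that the reversed-order construction relies on can be listed by brute force, one value per suffix as the text grows to the left. These numbers are computed here from the definition of α; they are not the O and L values shown in the figures.

```python
def alpha(t):
    """Longest prefix of t that also occurs at some later position in t."""
    for k in range(len(t), 0, -1):
        if any(t.startswith(t[:k], i) for i in range(1, len(t) - k + 1)):
            return t[:k]
    return ""

text = "cabaabaca"
# Suffixes in the order they are built: a, ca, aca, baca, ..., cabaabaca
lengths = [len(alpha(text[k:])) for k in range(len(text) - 1, -1, -1)]
print(lengths)  # [0, 0, 1, 0, 1, 1, 2, 3, 2]
```

Each value exceeds its predecessor by at most one, in line with Lemma 1.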