On-line Linear-time Construction of Word Suffix Trees
Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University)
Masayuki Takeda (Kyushu University & JST)
Pattern Searching Problem
• Given: text T in Σ* and pattern P in Σ*
• Find: an occurrence of P in T
• Σ: alphabet
• Σ*: set of strings over Σ
• Using an indexing structure for T, we can solve the above problem in O(|P|) time.
Suffix Trie
• A trie representing all suffixes of T
• e.g., T = aacb$ has the suffixes aacb$, acb$, cb$, b$, and $
• [Figure: suffix trie of T = aacb$]
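To make the O(|P|) lookup concrete, here is a minimal Python sketch of a suffix trie. It is an illustration only: the function names `build_suffix_trie` and `occurs` are hypothetical, and the construction is the naive quadratic one (insert every suffix), not Ukkonen's on-line algorithm discussed later in the talk.

```python
def build_suffix_trie(text):
    """Naively insert every suffix of text into a nested-dict trie.

    Assumption: this quadratic-time construction is for illustration only.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Follow the edge labeled ch, creating it if it is missing.
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """Follow one edge per pattern character: |P| dictionary lookups."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aacb$")
print(occurs(trie, "acb"))  # True: "acb" is a prefix of the suffix "acb$"
print(occurs(trie, "ab"))   # False: no suffix starts with "ab"
```

Every substring of T is a prefix of some suffix, which is why a walk from the root answers the query.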
Unwanted Matching
• pattern: pace runner
• text: space runner
• The pattern occurs in the text, but only inside the word "space".
Introducing Word Separator #
• #: word separator, a special symbol not in Σ
• D = Σ*#: dictionary of words
• Text T: an element of D+ (T is a sequence T1T2…Tk of k words in D)
• e.g., T = This#is#a#pen#
• Σ = {A,…,z}
• D = {...,This#,...,a#,...,is#,...,pen#,...}
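As a small illustration of the definitions above, the following Python sketch splits a text in D+ into its words T1, T2, …, Tk. The helper name `split_into_words` is hypothetical, and the sketch assumes the separator is the single character `#`.

```python
def split_into_words(text, sep="#"):
    """Split a text in D+ into words of D, each ending with the separator.

    Assumption: sep is a single character that does not occur inside a word.
    """
    if not text or not text.endswith(sep):
        raise ValueError("a text in D+ must be non-empty and end with the separator")
    # Drop the final separator, split on the rest, and re-attach sep to each word.
    return [word + sep for word in text[:-1].split(sep)]


words = split_into_words("This#is#a#pen#")
print(words)       # ['This#', 'is#', 'a#', 'pen#']
print(len(words))  # 4, the value of k for this text
```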
Word-level Pattern Searching Problem
• Given: text T in D+ and pattern P in D+
• Find: an occurrence of P in T which immediately follows #
• e.g., The#space#runner#is#not#your#good#pace#runner#
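The word-level problem can be stated operationally with a naive scan. The Python sketch below is only a specification of the desired answer, not the index-based method of the talk; the function name `word_aligned_occurrences` is hypothetical, and it treats the beginning of the text as a valid word boundary (an assumption consistent with T itself being a word suffix).

```python
def word_aligned_occurrences(text, pattern, sep="#"):
    """Return the start positions of pattern in text that lie on a word boundary.

    A position is on a word boundary if it is 0 or immediately follows sep.
    """
    starts = []
    pos = text.find(pattern)
    while pos != -1:
        if pos == 0 or text[pos - 1] == sep:
            starts.append(pos)
        pos = text.find(pattern, pos + 1)
    return starts


text = "The#space#runner#is#not#your#good#pace#runner#"
pattern = "pace#runner#"
print(text.find(pattern))                       # 5: the unwanted match inside "space"
print(word_aligned_occurrences(text, pattern))  # [34]: the match that follows "#"
```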
Word Suffix Trie
• A trie representing the suffixes of T which immediately follow # (and T itself).
• e.g., T = aa#b#: of the suffixes aa#b#, a#b#, #b#, b#, and #, only aa#b# and b# are represented
• [Figure: word suffix trie of T = aa#b#]
Comparison
• T = aa#b#
• [Figure: word suffix trie and suffix trie of T]
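The size difference between the two tries can be checked with a short Python sketch. It builds both structures naively by inserting the relevant suffixes one by one; the helper names are hypothetical, and this is not the on-line construction presented in the following slides.

```python
def word_suffixes(text, sep="#"):
    """Return text itself plus every non-empty suffix that starts right after sep."""
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    return [text[s:] for s in starts]


def all_suffixes(text):
    """Return every non-empty suffix of text."""
    return [text[i:] for i in range(len(text))]


def build_trie(strings):
    """Insert each string into a nested-dict trie (naive construction)."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def count_nodes(trie):
    """Count the nodes of the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())


T = "aa#b#"
print(word_suffixes(T))                            # ['aa#b#', 'b#']
print(count_nodes(build_trie(word_suffixes(T))))   # 8 nodes in the word suffix trie
print(count_nodes(build_trie(all_suffixes(T))))    # 14 nodes in the suffix trie
```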
Construction
• Suffix Trie: Ukkonen’s on-line algorithm (1995)
• Word Suffix Trie: we modify Ukkonen’s algorithm by:
  • using the minimum DFA accepting the dictionary D
  • redefining suffix links
Minimum DFA
• The minimum DFA accepting D = Σ*# clearly requires constant space (for fixed Σ).
• We replace the root node of the suffix trie with the final state of the DFA.
• [Figure: DFA with transitions labeled Σ and #]
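A two-state automaton for D = Σ*# can be sketched in Python as follows. This is an illustrative assumption about the automaton's shape, not code from the talk: state 0 loops on every symbol of Σ, the separator leads to the final state 1, and any undefined transition is treated as rejection (a complete DFA would add an explicit dead state). The function name `accepts_word` is hypothetical.

```python
import string


def accepts_word(s, alphabet, sep="#"):
    """Simulate a partial DFA for the language alphabet* followed by one sep."""
    state = 0  # 0: initial state, 1: final state with no outgoing transitions
    for ch in s:
        if state == 0 and ch in alphabet:
            state = 0          # loop on alphabet symbols
        elif state == 0 and ch == sep:
            state = 1          # the separator leads to the final state
        else:
            return False       # undefined transition: reject
    return state == 1


sigma = set(string.ascii_letters)
print(accepts_word("This#", sigma))  # True
print(accepts_word("This", sigma))   # False: no separator at the end
print(accepts_word("a#b#", sigma))   # False: two words, not a single word of D
```

The number of states does not depend on the text, which matches the constant-space remark on the slide.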
Suffix Links
• T = aa#b#
• [Figure: suffix links in the word suffix trie of T, with DFA transitions labeled a,b and #]
Suffix Links [cont.]
• T = aa#b#
• [Figure: suffix links in the word suffix trie and in the suffix trie of T]
On-line Construction
• T = aa#b#
• [Figure sequence: the word suffix trie and the suffix trie of T, grown incrementally as T is read from left to right]
Pseudo Code
• Just change here!!
• [Figure: pseudocode of the construction algorithm, shown with the suffix trie and the word suffix trie]
Like Dress-up Doll [cont.]
• Hair is different
• Body is the same!!
• Different looking!!!!
• [Figure: suffix trie and word suffix trie side by side]
Drawback of Word Suffix Trie
• Word suffix tries require O(k|T|) space.
• Andersson et al. introduced word suffix trees, which can be implemented in O(k) space.
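The gap between the two bounds can be illustrated by counting nodes. The Python sketch below builds a word suffix trie naively and then counts only the nodes that would survive path compression (the root, branching nodes, and leaves). It is a counting exercise under stated assumptions, not the algorithm of Andersson et al. or of this talk; a real O(k)-space tree would also store each edge label as a pair of positions in T. All helper names are hypothetical.

```python
def word_suffixes(text, sep="#"):
    """Return text itself plus every non-empty suffix that starts right after sep."""
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    return [text[s:] for s in starts]


def build_trie(strings):
    """Insert each string into a nested-dict trie (naive construction)."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def count_trie_nodes(trie):
    """Count all nodes of the uncompacted trie, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in trie.values())


def count_tree_nodes(trie, is_root=True):
    """Count the nodes kept after merging every chain of single-child nodes."""
    children = list(trie.values())
    keep = 1 if is_root or len(children) != 1 else 0
    return keep + sum(count_tree_nodes(child, False) for child in children)


T = "This#is#a#pen#"
k = T.count("#")                       # number of words in T
trie = build_trie(word_suffixes(T))
print(k)                               # 4
print(count_trie_nodes(trie))          # 34
print(count_tree_nodes(trie))          # 5
```

With k word suffixes there are at most k leaves, so the compacted count stays within 2k regardless of how long the individual words are.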
Construction of Word Suffix Trees
• Algorithm by Andersson et al. (1996)
  • for text T = T1T2…Tk, constructs the word suffix tree in O(|T|) expected time with O(k) space.
• Our algorithm
  • simulates the on-line word suffix trie algorithm on word suffix trees.
  • runs in O(|T|) time in the worst case, with O(k) space.
Normal and Word Suffix Trees
• T = aa#b#
• [Figure: word suffix tree and suffix tree of T]
Construction Algorithm
• Just change here!!
Conclusions
• We first proposed an on-line word suffix trie construction algorithm.
• The keys to the algorithm are the minimum DFA accepting D and the redefined suffix links.
• Further, we introduced an on-line algorithm to build word suffix trees that works with O(k) space and in O(|T|) time in the worst case.
Further Work
• “Sparse Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda, accepted to SPIRE’06
• “Sparse Compact Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda, accepted to PSC’06