1 / 44

Sparse Compact Directed Acyclic Word Graphs

Sparse Compact Directed Acyclic Word Graphs. Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & Japan Science and Technology Agency). Traditional Pattern Matching Problem. Given : text T in S * and pattern P in S *

corby
Download Presentation

Sparse Compact Directed Acyclic Word Graphs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sparse Compact DirectedAcyclic Word Graphs Shunsuke Inenaga (Japan Society for thePromotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & Japan Science and Technology Agency)

  2. Traditional Pattern Matching Problem • Given:text T in S* and pattern P in S* • Return:whether or not P appearsin T • S: alphabet (set of characters) • S*: set of strings • A text indexing structure for T enables you to solve the above problem in O(m) time (for fixed S). • m: the length of P

  3. Suffix Trie • A trie representing all suffixes of T $ T = aacb$ a b c a aacb$ acb$ cb$ b$ $ $ c b c b $ b $ $

  4. Unwanted Matching pattern s pace runner text

  5. Unwanted Matching pattern s pace runner text

  6. Unwanted Matching pattern s pace runner text

  7. Introducing Word Separator # • # : word separator - special symbol not in S • D = S* # : dictionary of words • Text T : an element of D+(T is a sequence T1T2…Tk of k words in D) • e.g., T = This#is#a#pen# • S = {A,…,z} • D = {...,This#,...,a#,...is#,...pen#,...}

  8. Word-level Pattern Matching Problem • Given: text T in D+ and pattern P in D+ • Return: whether or not P appears at the beginning of any word in T e.g. T = The#space#runner#is#not#your#good#pace#runner# P = pace#runner#

  9. Word-level Pattern Matching Problem • Given: text T in D+ and pattern P in D+ • Return: whether or not P appears at the beginning of any word in T e.g. T = The#space#runner#is#not#your#good#pace#runner# P = pace#runner#

  10. Word Suffix Trie • A trie representing the suffixes of T which begin at a word. T = aa#b# a b a aa#b# a#b# #b# b# # # # b #

  11. Normal and Word Suffix Tries T = aa#b# a a b b # a a # # # b # # b # b b # # # Word Suffix Trie Normal Suffix Trie

  12. Normal and Word Suffix Trees T = aa#b# a b b # # a # a a b # # # # b b b # # # Word Suffix Tree Normal Suffix Tree

  13. Sizes of Word Suffix Tries and Trees • For text T = T1T2…Tk of length n, • the word suffix trie of T requires O(nk) space, but • the word suffix tree of T requires O(k) space!! • because the word suffix tree has only k leaves and has only branching internal nodes.

  14. Construction of Word Suffix Trees • Algorithm by Andersson et al.(1996) • for text T = T1T2…Tk of length n, constructs word suffix trees in O(n) expected time with O(k) space. • Our algorithm (CPM’06) • builds word suffix trees in O(n) time in the worst case, with O(k) space.

  15. Our Construction Algorithm • We modify Ukkonen’s on-line normal suffix tree construction algorithm by using minimum DFA accepting dictionary D • We replace the root node of the suffix tree with the final state of the DFA.

  16. Minimum DFA • The minimum DFA accepting D = S* # clearly requires constant space (for fixed S). S #

  17. On-line Construction of Word Suffix Trees T = aa#b# a,b # a

  18. On-line Construction of Word Suffix Trees T = aa#b# a,b # a

  19. On-line Construction of Word Suffix Trees T = aa#b# a,b # a a

  20. On-line Construction of Word Suffix Trees T = aa#b# a,b # a a

  21. On-line Construction of Word Suffix Trees T = aa#b# a,b # a a #

  22. On-line Construction of Word Suffix Trees T = aa#b# a,b # a a #

  23. On-line Construction of Word Suffix Trees T = aa#b# a,b # b a a # b

  24. On-line Construction of Word Suffix Trees T = aa#b# a,b # b a a # b

  25. On-line Construction of Word Suffix Trees T = aa#b# a,b # b # a a # b #

  26. Pseudo-Code Just change here

  27. Like Dress-up Doll...

  28. S,# a b # # a b # # # b b # # Different looking!!!! a,b # b a # a # b # Like Dress-up Doll... [cont.] Just change hair!! Body is the same!!

  29. a # b # b a minimization # b # # # b # Compact Directed Acyclic Word Graphs T = aa#b# a b # # a b # # # b b # # Compact Directed Acyclic Word Graph (CDAWG) Suffix Tree

  30. a a b # minimization # b # Sparse CDAWGs T = aa#b# b a # a # b # Sparse Compact Directed Acyclic Word Graph (SCDAWG) Word Suffix Tree

  31. a # b b minimization # a a # b b # a b # Sparse CDAWGs [cont.] T = a#b#a#bab# a b # b # a # a a b a # b # # b # b a a b b # # Word Suffix Tree SCDAWG

  32. SCDAWG Construction • SCDAWGs can be constructed by minimizing word suffix trees in O(k) time. • using Revuz’s DAG minimization algorithm (1992)

  33. SCDAWG Construction [cont.] • Question : Direct construction for SCDAWGs? • Answer : YES!Using minimal DFA accepting dictionary D, we can directly build SCDAWGs in O(n) time and O(k) space. • We modify the CDAWG on-line construction algorithm (Inenaga et al. 05) by using the above DFA.

  34. S,# a # b b # a # # a b # # b a b # # # b Different looking!!!! # a,b # Pseudo-Code Just change here Body is the same!!

  35. Some Events • Basically on-line construction of SCDAWGs is similar to that of word suffix trees. • Except for the two following unique events: • Edge merging • Node splitting

  36. Edge Merging a,b T = a#b#a#bab#bc... # a # b b # a # a # # b b

  37. Edge Merging a,b T = a#b#a#bab#bc... # a # b b # a # a # # b b a a

  38. Edge Merging a,b T = a#b#a#bab#bc... # a # b b # a # a a # # b b a a

  39. Edge Merging a,b T = a#b#a#bab#bc... # a # b b # a a # b a

  40. Node Splitting a,b T = a#b#a#bab#bc... # a # b b # a a # b b # a b #

  41. Node Splitting a,b T = a#b#a#bab#bc... # a # b b # a a # b b # a b b # b

  42. Node Splitting a,b T = a#b#a#bab#bc... # a # b b # # a a a # b a # b # a b b b b # a # b b b # b

  43. Conclusion • We introduced new text indexing structure sparse compact directed acyclic word graphs (SCDAWGs) for word-level pattern matching. • We presented an on-line algorithm to construct SCDAWGs directly, in O(n) time with O(k) space. • The key is the use of minimum DFA accepting dictionary D.

  44. Related Work • “Sparse Directed Acyclic Word Graphs”by Shunsuke Inenaga and Masayuki TakedaAccepted to SPIRE’06

More Related