Ternary Directed Acyclic Word Graphs (TDAWG). Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda and Ayumi Shinohara. Presented by Peera Liewlom (The Last Algorithm Group). CIAA 2003, Eighth International Conference on Implementation and Application of Automata.
CIAA 2003 • Eighth International Conference on Implementation and Application of Automata • July 16-18, 2003, Santa Barbara, CA, USA • Topic / Committee / Community
Why did I select this paper? • DAWGs date from 1985, so the field is still fairly young • Continuing development • cDAWG, ASDAWG, morphic DAWG, WDAWG, SDAWG, two-tree DAWG, DASG, CSDAWG etc. • TST: 1997–98, TDAWG: 2003 • DAWG: widely applied in bioinformatics, NLP, graph theory, string matching, automata etc. • Speed & space trends in huge data management • Topic for the Algorithm Group • Matches the interesting topics in this seminar group
Content • DFA (used in string-matching problems) • DAWG • Ternary Search Tree • Paper: TDAWG, Experiment & Result • Paper: Conclusion • Paper: Discussion
Formalities • Deterministic Finite Accepter (DFA): M = (Q, Σ, δ, q0, F) • Q: set of states • Σ: input alphabet • δ: transition function • q0: initial state • F: set of final states
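The five components listed above (states, alphabet, transition function, initial state, final states) can be sketched as a small simulator. This is only an illustration: the function name `dfa_accepts`, the state names and the example automaton (binary strings ending in "1") are assumptions made here, not material from the slides.

```python
def dfa_accepts(delta, start, finals, text):
    """Run a DFA over `text`; a missing transition means rejection."""
    state = start
    for ch in text:
        key = (state, ch)
        if key not in delta:
            return False          # no transition defined for this symbol
        state = delta[key]
    return state in finals        # accept only if we stop in a final state


# Hypothetical example automaton: accepts binary strings that end in "1".
DELTA = {
    ("q0", "0"): "q0",
    ("q0", "1"): "q1",
    ("q1", "0"): "q0",
    ("q1", "1"): "q1",
}
START = "q0"
FINALS = {"q1"}
```

Here the transition function is a dictionary keyed by (state, symbol); how that mapping is stored per state is exactly the implementation question the rest of the talk addresses.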
Another Example (figure: DFA state diagram with accepting states marked "accept")
= { all strings without substring }
TST History • Jon L. Bentley and Robert Sedgewick • Algorithms for Sorting and Searching Strings, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1997. • Ternary Search Trees, Dr. Dobb's Journal, April 1998. • Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology, http://www.nist.gov/
(figure: three trees side by side, labelled DST, BST and TST)
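A ternary search tree in the style of Bentley and Sedgewick can be sketched as follows: each node holds one character and three children (smaller character, next position, larger character). This is a minimal sketch, assuming non-empty words and no balancing; the names `TSTNode`, `tst_insert` and `tst_contains` are chosen here for illustration.

```python
class TSTNode:
    """One character plus three links: lower, equal (next char), higher."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False       # True if a stored word ends at this node


def tst_insert(node, word, i=0):
    """Insert a non-empty word; returns the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_end = True
    return node


def tst_contains(node, word):
    """Exact-match lookup; the empty word is treated as absent."""
    if not word:
        return False
    i = 0
    while node is not None:
        ch = word[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        else:
            if i + 1 == len(word):
                return node.is_end
            i += 1
            node = node.eq
    return False
```

Usage: start from `root = None` and call `root = tst_insert(root, w)` for each word; the lo/hi links form a small binary search tree over the characters available at one position.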
Introduction • DFA: how should the transitions of each state be implemented? (time & space efficiency) • TST: “implants” a BST for the transitions of each node • Good time • DAWG: smallest DFA accepting all suffixes of a string • Good space • TDAWG: a DAWG with TST-style transitions • Proof: TDAWG vs. DAWG
Hypothesis / Theorem (1/2) • Time = construction + search (usable online) • DFA transition function δ • Σ = alphabet (Chinese & Japanese: ~1000 characters) • State • Table: O(|p|) search time, where |p| is the length of pattern p • Table uses very large memory • Linked list: O(|Σ| × |p|) search time • If Σ is large … problem for search time
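The trade-off between the two transition layouts can be made concrete with a small sketch. It assumes symbols are encoded as integers 0..|Σ|-1 and counts work per transition; the helper names `table_step` and `list_step` and the sample edges are hypothetical.

```python
def table_step(row, symbol):
    """Table layout: one slot per alphabet symbol, -1 means no transition.
    One array access per step, but every state needs |Σ| slots."""
    return row[symbol], 1


def list_step(edges, symbol):
    """List layout: only existing edges are stored, but a step may
    scan all of them (up to |Σ| comparisons)."""
    comparisons = 0
    for label, target in edges:
        comparisons += 1
        if label == symbol:
            return target, comparisons
    return -1, comparisons


# Hypothetical state with three outgoing edges over an alphabet of 1000 symbols.
SIGMA = 1000
EDGES = [(3, 7), (500, 8), (999, 9)]
ROW = [-1] * SIGMA
for label, target in EDGES:
    ROW[label] = target
```

With this state, the table spends 1000 slots to store 3 edges, while the list stores 3 pairs but needs 3 comparisons to reach the edge labelled 999.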
Hypothesis / Theorem (2/2) • For TDAWG • Uses O(|S|) space • Uses O(log|Σ| × |p|) search time • Uses O(|Σ| × |S|²) construction time (Bentley & Sedgewick) • Uses O(|Σ| × |S|) construction time (this paper … adapted from Blumer's online DAWG construction) • Comparison: TDAWG vs. DAWG (table & linked list) • Space, search time, construction time
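For orientation, the widely known online DAWG (suffix automaton) construction that goes back to Blumer et al. can be sketched as below. This is not the paper's TDAWG algorithm: transitions are kept in Python dictionaries rather than ternary trees, final states are not marked (so walking from the root tests substring membership), and the names `build_dawg` and `is_substring` are chosen here for illustration.

```python
def build_dawg(s):
    """Online construction: process `s` left to right, one character at a time.
    Returns a list of per-state transition dicts; state 0 is the root."""
    nxt = [{}]        # outgoing transitions of each state
    link = [-1]       # suffix links
    length = [0]      # length of the longest string reaching each state
    last = 0          # state for the whole prefix read so far
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt


def is_substring(nxt, pattern):
    """Follow one transition per pattern character from the root."""
    state = 0
    for ch in pattern:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True
```

Search takes one transition lookup per pattern character, so the cost of a single lookup (table, list or ternary tree) is what separates the variants compared on this slide.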
Conclusion • New data structure … TDAWG • Construction time (English text, |Σ| = 256) • TDAWG < linklistDAWG < tableDAWG • Space requirement • linklistDAWG < TDAWG (~20%) • tableDAWG is not comparable on the same scale • Search time • Short patterns: tableDAWG is best, TDAWG < linklistDAWG • Log curve vs. linear curve (long patterns?)
Discussion & Future Work • Asian languages (alphabets of ~1000s of characters) should see a larger search-time benefit than English (256 characters), because search time is O(log|Σ| × |p|) • Apply to other DAWGs … cDAWG, minimumDAWG etc. • More efficiency with AVL trees (AVL balancing) • Bioinformatics has only 4 characters, but a sliding window of 12 characters gives an alphabet of 4^12
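The alphabet sizes mentioned in this discussion can be put side by side with a few lines of arithmetic. The sketch assumes a perfectly balanced search tree per state (the slide lists AVL balancing only as future work), so the logarithmic column is a best-case bound, not a measured result; `per_step_bounds` is a name chosen here.

```python
def per_step_bounds(sigma):
    """Worst-case comparisons for one transition over an alphabet of size sigma:
    (linear scan of a list, depth bound of a balanced binary search tree)."""
    linear = sigma
    balanced = (sigma - 1).bit_length()   # equals ceil(log2(sigma)) for sigma >= 1
    return linear, balanced


# Alphabet sizes taken from the slides: DNA, English bytes, ~1000 characters,
# and a 12-character sliding window over DNA treated as one symbol.
WINDOW_ALPHABET = 4 ** 12
SIZES = [4, 256, 1000, WINDOW_ALPHABET]
BOUNDS = {sigma: per_step_bounds(sigma) for sigma in SIZES}
```

For the sliding-window case the alphabet has 4^12 = 16,777,216 symbols, where a linear scan could need millions of comparisons per step while a balanced tree needs at most 24.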