Explore the MASDAWG concept, its construction, its size, and its applications in pattern matching. Learn how a MASDAWG can be constructed directly and how it supports beginning-sensitive and region-sensitive pattern matching, and see how the Wildcard DAWG is used for VLDC pattern matching.
Shunsuke Inenaga, Masayuki Takeda, Ayumi Shinohara, Hiromasa Hoshino, Setsuo Arikawa The Minimum DAWG for All Suffixes of a String and its Applications
DAWG = Dynamic Attractive Worldcup Game?
Directed Acyclic Word Graph!!
Directed Acyclic Word Graph
The Directed Acyclic Word Graph of a string w, DAWG(w), represents all substrings of w (Blumer et al., 1985). DAWG(w) is the smallest automaton that accepts all suffixes of w (Crochemore, 1986).
DAWG of string "abbab"
[Figure: DAWG(abbab), with accepting nodes marked.]
For any string w, DAWG(w) can be built in linear time and space in |w| (Blumer et al., 1985).
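As a rough illustration of the linear-time claim, here is a minimal Python sketch of the standard on-line suffix-automaton construction, which builds the same automaton as DAWG(w). The function names `build_dawg`, `accepts_suffix` and `has_substring` are made up for this example and do not come from the slides.

```python
def build_dawg(w):
    """Build the DAWG (suffix automaton) of w.

    Returns (nxt, accepting): nxt[s] maps a character to a target state,
    state 0 is the initial node, accepting is the set of accepting nodes.
    """
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        # Add the new edge along the suffix-link chain where it is missing.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # Accepting nodes are those on the suffix-link chain from the last node.
    accepting = set()
    s = last
    while s != -1:
        accepting.add(s)
        s = link[s]
    return nxt, accepting


def _walk(nxt, s):
    """Follow s from the initial node; return the node reached or None."""
    state = 0
    for ch in s:
        if ch not in nxt[state]:
            return None
        state = nxt[state][ch]
    return state


def accepts_suffix(dawg, s):
    nxt, accepting = dawg
    return _walk(nxt, s) in accepting


def has_substring(dawg, s):
    nxt, _ = dawg
    return _walk(nxt, s) is not None
```

With `dawg = build_dawg("abbab")`, `accepts_suffix(dawg, "bab")` holds while `accepts_suffix(dawg, "abb")` does not, even though `has_substring(dawg, "abb")` does.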
The DAWGs for All Suffixes of "abbab"
[Figure: DAWG(abbab), DAWG(bbab), DAWG(bab), DAWG(ab), DAWG(b), DAWG(ε).]
The collection of these DAWGs is called the naive All-Suffixes DAWG (ASDAWG) of w.
Minimizing the naive ASDAWG(abbab)
[Figure: the naive ASDAWG(abbab) with initial nodes 0 to 5; 22 nodes, 22 edges.]
DAG minimization algorithm (Revuz, 1992)
The Minimum ASDAWG(abbab)
[Figure: MASDAWG(abbab) with initial nodes 0 to 5; 12 nodes, 15 edges.]
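The two-step route (build every suffix's DAWG, then minimize) can be sketched in Python as follows. This is only an illustration: it merges nodes by hashing a (is-accepting, outgoing-edges) signature bottom-up, a simple stand-in for the linear-time DAG minimization of Revuz rather than that algorithm itself, and the names `build_dawg` and `minimize_asdawg` are assumptions of this sketch.

```python
def build_dawg(w):
    """Suffix automaton of w: returns (nxt, accepting), state 0 initial."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    accepting = set()
    s = last
    while s != -1:
        accepting.add(s)
        s = link[s]
    return nxt, accepting


def minimize_asdawg(w):
    """Return (naive_node_count, minimized_node_count, initial_classes).

    Nodes of all DAWG(w[k:]) are merged when they agree on acceptance and
    on their outgoing edges (label, merged target).
    """
    class_of = {}        # signature -> id of the merged node
    naive_nodes = 0
    initials = []        # merged node playing the role of the k-th initial node
    for k in range(len(w) + 1):
        nxt, accepting = build_dawg(w[k:])
        naive_nodes += len(nxt)
        memo = {}

        def cls(s):
            # Recursion depth is bounded by len(w); fine for short strings.
            if s not in memo:
                edges = tuple(sorted((c, cls(t)) for c, t in nxt[s].items()))
                sig = (s in accepting, edges)
                memo[s] = class_of.setdefault(sig, len(class_of))
            return memo[s]

        initials.append(cls(0))
    return naive_nodes, len(class_of), initials
```

Calling `minimize_asdawg("abbab")` gives 22 nodes before merging and 12 after, in line with the node counts on the slides; the six initial nodes remain distinct.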
What is MASDAWG exactly? For a string w of length n, MASDAWG(w) is the smallest DAG with n initial nodes, where the subgraph consisting of the nodes reachable from the k-th initial node and their outgoing edges is DAWG(w[k:n]).
[Figure: MASDAWG(abbab) shown next to the text abbab with positions 0 to 5; the subgraph DAWG(bbab) is highlighted inside MASDAWG(abbab).]
Time Taken to Construct MASDAWGs
MASDAWG(w) can be obtained in time linear in the number of edges of the naive ASDAWG(w). On the other hand, the size of the naive ASDAWG(w) is O(n^2).
The Size of MASDAWG
Theorem 1. The number of nodes in MASDAWG(w) is Θ(n) if |Σ| = 1; Θ(n^2) if |Σ| > 1.
The Size of MASDAWG: for a unary alphabet Σ = {a}
[Figure: MASDAWG(a^5), with nodes 0 to 5 and five edges labelled a.]
The Size of MASDAWG: for an alphabet Σ with |Σ| > 1
The series of strings (ab)^m(ba)^m gives the lower bound Ω(n^2).
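The two cases of Theorem 1 can be explored on small inputs with a brute-force node count. The sketch below assumes that a node of MASDAWG(w) can be identified with a distinct "right language": the set of strings y such that xy is a suffix of w[k:], for some position k and substring x of w[k:]. The function name `masdawg_node_count` is made up for this example, and the approach is far from linear time.

```python
def masdawg_node_count(w):
    """Count MASDAWG(w) nodes by enumerating distinct right languages.

    Brute force, meant only for short strings.
    """
    nodes = set()
    for k in range(len(w) + 1):
        s = w[k:]
        suffixes = [s[i:] for i in range(len(s) + 1)]
        for i in range(len(s) + 1):
            for j in range(i, len(s) + 1):
                x = s[i:j]  # a substring of w[k:]
                # Everything that may follow x to complete a suffix of w[k:].
                nodes.add(frozenset(t[len(x):] for t in suffixes
                                    if t.startswith(x)))
    return len(nodes)


# Unary alphabet: the count grows as n + 1.
for n in range(1, 6):
    print("a^%d" % n, masdawg_node_count("a" * n))

# Binary alphabet, the family (ab)^m (ba)^m from the slide.
for m in range(1, 5):
    w = "ab" * m + "ba" * m
    print("m=%d n=%d" % (m, len(w)), masdawg_node_count(w))
```

Printing the counts for growing m gives a feel for the quadratic growth, though a handful of small values is of course not a proof of the bound.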
Direct Construction of MASDAWGs
Question: Is it possible to construct MASDAWG(w) directly?
On-Line (Direct) Construction of MASDAWG(abbab)
[Figure sequence: MASDAWG(ε), MASDAWG(a), MASDAWG(ab), MASDAWG(abb), MASDAWG(abba), MASDAWG(abbab), each obtained from the previous one by appending one character. Finish!!!]
Length Information
[Figure: MASDAWG(abbab) with nodes annotated by pairs: DAWG(abbab); abba, DAWG(bbab); bba, DAWG(bab); ba, DAWG(ab); a.]
[Figure: the general case. Nodes are annotated DAWG(w[i:n]); x1, DAWG(w[i+1:n]); x2, ..., DAWG(w[i+k-1:n]); xk, DAWG(w[i+k:n]); xk+1, ..., DAWG(w[i+l-1:n]); xl, and these annotations are summarized by the tuples (i, k, k, |xk|), (i, l, k, |xk|) and (i+k, l, k+1, |xk+1|).]
Direct Construction of MASDAWGs
Theorem 2. For any string w, MASDAWG(w) can be constructed directly, in time and space linear in its output size.
Application of MASDAWGs
Question: Why MASDAWGs?
• Beginning Sensitive Pattern Matching
• Region Sensitive Pattern Matching
• VLDC Pattern Matching
Beginning Sensitive Pattern Matching
Beginning Sensitive Pattern: a pair <p, i> where p is a string and i is a non-negative integer.
BS-Pattern Matching Problem. Instance: text w and BS-pattern <p, i>. Determine: whether p is a substring of w[i:n].
[Figure: text w of length n with position i marked; does p appear in w[i:n]?]
BS-Pattern Matching with "abbab"
[Figure: text abbab with positions 0 to 5 and MASDAWG(abbab); the query is the BS-pattern <ab, 1>.]
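A BS-pattern query can be pictured as a walk that starts at the i-th initial node and follows the characters of p. The Python sketch below simulates that walk under two assumptions of this example: w[i:n] is read like the Python slice `w[i:]`, and each node is represented explicitly by its right language (the set of strings that still lead to acceptance) instead of by a precomputed graph. The names `initial_node`, `step` and `bs_match` are hypothetical.

```python
def initial_node(w, i):
    """Right language of the i-th initial node: all suffixes of w[i:]."""
    s = w[i:]
    return frozenset(s[k:] for k in range(len(s) + 1))


def step(node, ch):
    """Follow the edge labelled ch; an empty result means no such edge."""
    return frozenset(y[1:] for y in node if y.startswith(ch))


def bs_match(w, p, i):
    """Decide the BS-pattern <p, i>: is p a substring of w[i:]?"""
    node = initial_node(w, i)
    for ch in p:
        node = step(node, ch)
        if not node:
            return False
    return True


print(bs_match("abbab", "ab", 1))   # the query shown on the slide
print(bs_match("abbab", "abb", 1))  # "abb" only occurs starting at position 0
```

A real index would store the merged nodes once and answer the query in time proportional to |p|; the set-based nodes here only make the traversal easy to follow.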
Region Sensitive Pattern Matching
Region Sensitive Pattern: a triple <p, (i, j)> where p is a string and i, j are non-negative integers.
RS-Pattern Matching Problem. Instance: text w and RS-pattern <p, (i, j)>. Determine: whether p is a substring of w[i:j].
[Figure: text w of length n with positions i and j marked; does p appear in w[i:j]?]
RS-Pattern Matching with "abbab"
[Figure: text abbab with positions 0 to 5 and MASDAWG(abbab) whose nodes carry integer labels; the query is the RS-pattern <ab, (1, 4)>.]
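For an RS-pattern it is enough to know where the earliest occurrence of p at or after position i ends: p lies inside w[i:j] exactly when that end position is at most j. The Python sketch below uses a direct search in place of the index and reads w[i:j] like a Python slice; both choices, and the name `rs_match`, are assumptions of this example. How the integer labels in the figure encode this information is not recoverable from the slide text.

```python
def rs_match(w, p, i, j):
    """Decide the RS-pattern <p, (i, j)> for 0 <= i <= j <= len(w)."""
    start = w.find(p, i)              # earliest occurrence starting at >= i
    return start != -1 and start + len(p) <= j


print(rs_match("abbab", "ab", 1, 4))  # the query shown on the slide
print(rs_match("abbab", "ab", 1, 5))  # the same pattern with a wider region
```

The two calls differ only in j: the single occurrence of "ab" at or after position 1 ends at position 5, so it fits the region (1, 5) but not (1, 4).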
VLDC-Pattern Matching
Let * be a variable-length-don't-care (wildcard) that matches any string. A pattern containing characters in Σ and *'s is called a VLDC-pattern. An example of a VLDC-pattern is ab*ba*. The VLDC-pattern ab*ba* matches the string abababb with the first and the second *'s being replaced by a and bb, respectively.
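The matching relation itself can be checked without any index. The Python sketch below translates a VLDC-pattern into a regular expression and requires it to match the whole string, as in the ab*ba* example; the function name `vldc_match` is made up for this illustration, and this is a reference check rather than the WDAWG-based method.

```python
import re


def vldc_match(q, w):
    """True if VLDC-pattern q matches w, with each * standing for any string."""
    # Escape the literal pieces and join them with ".*" for the wildcards.
    regex = ".*".join(re.escape(part) for part in q.split("*"))
    return re.fullmatch(regex, w, flags=re.DOTALL) is not None


print(vldc_match("ab*ba*", "abababb"))  # the example from the slide
print(vldc_match("*bb*", "abbab"))      # a pattern of the form *p* tests for a substring
```

A pattern of the form *p* reduces to ordinary substring matching, while a pattern with no leading or trailing * pins p's first or last piece to the ends of w.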
Wildcard DAWGs The Wildcard DAWG of a string w, WDAWG(w), is the smallest automaton recognizing all VLDC-patterns matching w. WDAWG(w) is inherently the same structure as MASDAWG(w).
[Figure: WDAWG(abbab), with edges labelled a, b and *.]
VLDC-Pattern Matching
VLDC-Pattern Matching Problem. Instance: text w and VLDC-pattern q. Determine: whether q matches w.
Coming Soon
• Space-Economical Construction of Index Structures for All Suffixes of a String. Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Hideo Bannai, Setsuo Arikawa (To appear in MFCS 2002)
• Discovering Best Variable-Length-Don't-Care Patterns. Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa (Submitted to DS 2002)