Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Lecture on Information Knowledge Network"Information retrieval and pattern matching" Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University Takuya KIDA Lecture on Information knowledge network

The 5thRegular expression matching About regular expression Flow of processing Construction of syntax tree (parse tree) Construction of NFA for RE Simulating the NFA Lecture on Information knowledge network

What is regular expression? • Notation for flexible and strong pattern matching • Example of a regular expression of filenames: > rm ＊.txt > cp Important[0-9].doc • Example of a regular expression of search tool Grep： > grep –E “for.+(256|CHAR_SIZE)” ＊.c • Example of a regular expression of programming language Perl： $line = m|^http://.+\.jp/.+$| • A regular expression can express a regular set (regular language). • It expresses a language L (sets of strings) which can be accepted by a finite automaton matches to any files whose extensions are “.txt. “ matches to Important0.doc～Important9.doc matches to strings which start with “http://” and include ”.jp/”. Lecture on Information knowledge network

Definition of regular expression • Definition:A regular expression is a string over Σ∪{ε, |, ･, *, (, )}, which is recursively defined by the following rules. • (1) An element of {ε}∪Σ is a regular expressions. • (2) If α and β are regular expressions, then (α･β) is a regular expression. • (3) If α and β are regular expressions, then (α|β) is a regular expression. • (4) If α is a regular expression, α* is a regular expression. • (5) Only ones led on from the above are regular expressions. Example: (A･((A･T)|(C･G))*)　　→ A(AT|CG)* (α･β) is often described αβ for short ※ Symbols ‘| ’, ‘・’, ‘*’ are called operator. Moreover, for a regular expression α, “+" is often used in the meaning of α+ =α・α*. Lecture on Information knowledge network

q0 q2 An equivalent DFA to the left example q1 a a,b b a,b Semantic of regular expression • A regular expression is mapped into a subset of Σ*(language L) • (i) ||ε|| = {ε} • (ii) For a∈Σ,|| a || = { a } • (iii) For regular expressions α and β, ||(α・β)|| = ||α||・||β|| • (iv) For regular expressions α and β, ||(α|β)|| = ||α||∪||β|| • (v) For a regular expression α, ||α*|| = ||α||* • For example： (a・(a | b)*) || (a・(a | b) *) ||= ||a||・||(a | b)*||= {a}・||(a | b)||*= {a}・({a}∪{b})*= { ax | x∈{a, b}* } ※exercise:What is the equivalent language to (AT|GA)(TT)* ? Lecture on Information knowledge network

What is the regular expression matching problem? • Regular expression matching problem： • It is the problem to find any strings in L(α)＝||α||, which is defined by a given α, from a given text. • The ability of regular expression to define a language is equal to that of finite automaton! • We can construct a finite automaton that accepts the same language expressed by a regular expression. • We also can describe a regular expression that expresses the same language accepted by a finite automaton. ※ Please refer to "Automaton and computability" (2.5 regular expressions and regular sets), written by SetsuoArikawa and Satoru Miyano. • What we should do for matching a regular expression is to make an automaton (NFA/DFA) corresponding to the regular expression and then to simulate it. • A regular expression is easier to convert to NFA than to DFA. • The initialization state of the automaton is always active. • The pattern expressed by a given regular expression occurs when the automaton reaches to the final states by reading a text. Lecture on Information knowledge network

Flow of pattern matching process Constructing NFA by Thompson method General flow Scan texts Parsing Regular expression Parse tree NFA Report the occurrences Translate Scan texts Constructing NFA by Glushkovmethod DFA Flow with filtering technique Multiple pattern matching Extracting Verify Regularexpression A set of strings Find the candidates Report the occurrences Lecture on Information knowledge network

・・ | * | * Depth of parentheses Operator ・・ | ・・ | ・・ A T G A ・・・ A G ・ A A A Construction of parse tree • Parse tree:a tree structure used in preparation for making NFA • Each leaf node is labeled by a symbol a∈Σ or the empty word ε. • Each internal node is labeled by a operator symbol on {|, ・, *}. • Although a parser tool like Lex and Flex can parse regular expressions, it is too exaggerated. (The pseudo code of the next slide is enough to do that). Example: the parse tree TRE for regular expression RE=(AT|GA)((AG|AAA)*) ( A T | G A ) ( ( A G | A A A ) * ) A T G A 1 1 | 2 | A G A A A Lecture on Information knowledge network

Pseudo code • Parse (p=p1p2…pm, last) • v ← θ; • while plast≠$ do • if plast∈Σ or plast=ε then /* normal character */ • vr ← Create a node with plast; • if v≠θthen v ← [・](v, vr); • else v ← vr; • last ← last + 1; • else if plast = ‘|’ then /* union operator */ • (vr, last) ← Parse(p, last + 1); • v ← [ | ](v, vr); • else if plast = ‘*’ then /* star operator */ • v ← [ * ](v); • last ← last + 1; • else if plast = ‘( ’ then /* open parenthesis */ • (vr, last) ← Parse(p, last + 1); • last ← last + 1; • if v≠θthen v ← [・](v, vr); • else v ← vr; • else if plast = ‘)’ then /* close parenthesis */ • return (v, last); • end of if • end of while • return (v, last); Lecture on Information knowledge network

ε A G 9 10 11 ε ε A T ε 8 16 1 2 3 ε ε ε A A A ε ε 12 13 14 15 0 7 17 ε ε G A ε 4 5 6 NFA construction by Thompson method K. Thompson. Regular expression search algorithm. Communications of the ACM, 11:419-422, 1968. • Idea: • Traversing the parse tree TRE for a given RE in post-order traversal, we construct the automaton Th(v) that accepts language L(REv) corresponding to a partial tree whose top is node v. • The key point is that Th(v) can be obtained by connecting with ε transitions the automatons corresponding to each partial tree whose top is a child of v. • Properties of Thompson NFA • The number of states < 2m, and the number of state transitions < 4m →O(m). • It contains many ε transitions. • The transitions other than ε connect the states from i to i+1. Example：　Thompson NFA for RE = (AT|GA)((AG|AAA)*) Lecture on Information knowledge network

NFA construction algorithm • For the parse tree tree TRE, as traversing the tree in post-order traversal, it generates and connects the automatons for each node as follows. (i) When v is the empty word ε (iv) When v is a selection ”|” → (vL| vR) ε vL ε IL FL I F ε I F (ii) When v is a character “a” ε ε vR IR FR a I F (v) When v is a repetition”*” → v* (iii) When v is a concatenation ”・”→ (vL・vR) ε v ε ε vL vR IL FR I F ε Lecture on Information knowledge network

・ ε | * ε A G A G 9 10 11 ε ε 9 10 11 ε ・・ | ε A T A T ε 8 16 1 2 3 ε ε ε ε A A A 8 16 ε ε 1 2 3 ε A T G A ・・ ε ε A A A ε ε 12 13 14 15 12 13 14 15 0 7 17 A G ・ A ε ε 0 7 17 G A ε ε G A ε ε 4 5 6 A A 4 5 6 Move of the NFA construction algorithm Ex.： Parse tree TRE for RE=(AT|GA)((AG|AAA)*) 18 7 17 6 16 3 15 10 1 2 4 5 13 14 8 9 11 12 Ex.： Thompson NFA for RE = (AT|GA)((AG|AAA)*) Lecture on Information knowledge network

Pseudo code • Thompson_recur (v) • if v = “|”(vL, vR) or v = “・”(vL, vR) then • Th(vL) ← Thompson_recur(vL); • Th(vR) ← Thompson_recur(vR); • else if v=“*”(vC) then Th(v) ← Thompson_recur(vC); • /* the above is for recursive traversal (post-order) */ • if v=(ε) thenreturn construction (i); • if v=(α), α∈Σ thenreturn construction (ii); • if v=“・”(vL, vR) then return construction (iii); • if v=“|”(vL, vR) then return construction (iv); • if v=“*”(vC) then return construction (v); • Thompon(RE) • vRE ← Parse(RE$, 1); /* construct the parse tree */ • Th(vRE) ← Thompson_recur(vRE); Lecture on Information knowledge network

G3 G A7 A A5 A A1 A T2 T A4 A A5 A G6 G A7 A A8 A A9 A 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 A5 A A7 A A7 A A5 A NFA construction by Glushkovmethod V-M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961. • Idea • Making a new expression RE’ by numbering each symbol a∈∑ sequentially from the beginning to the end. (Let ∑’ be the alphabet with subscripts) • Example: RE = (AT|GA)((AG|AAA)*) → RE’ = (A1T2|G3A4)((A5G6|A7A8A9)*) • After constructing a NFA that accepts language L(RE'), we obtain the final NFA by removing the subscript numbers. • Properties of Glushkov NFA • The number of states is just m+1, and the number of state transitions is O(m2). • It doesn't contain any ε transitions. • For any node, all the labels of transitions entering to the node are the same. Example：　A NFA for RE’ = (A1T2|G3A4)((A5G6|A7A8A9)*) Example：　The Glushkov NFA Lecture on Information knowledge network

NFA construction algorithm (1) • Construction procedure: • Making a new expression RE’ by numbering each symbol a∈∑ sequentially from the beginning to the end. • Pos(RE’) = {1…m}, and ∑’ is the alphabet with subscript numbers. • As traversing the parse tree TRE’ in post-order traversal, for each language REv’ corresponding to a partial tree whose top is v, it calculates set First(REv’), set Last(REv’), function Emptyv, and function Follow(RE', x) of position x as follows. • First(RE’) = {x∈Pos(RE’) | ∃u∈∑’*, αxu∈L(RE’)} • Last(RE’) = {x∈Pos(RE’) | ∃u∈∑’*, uαx∈L(RE’)} • Follow(RE’, x) = {y∈Pos(RE’) | ∃u, v∈∑’*, uαxαyv∈L(RE’)} • EmptyRE: a function that returns {ε} if ε belongs to L(RE), otherwise returns φ. This can be recursively calculated as follows.Emptyε = {ε}, Emptyα∈∑ = φ, EmptyRE1|RE2 = EmptyRE1 ∪ EmptyRE2, EmptyRE1・RE2 = EmptyRE1 ∩ EmptyRE2, EmptyRE* = {ε}. • The NFA is constructed based on the values obtained from the above. Positions of the initial states Positions of the final states transition functions Whether the initial state of the NFA is a final state or not? Lecture on Information knowledge network

G3 A7 A5 A1 T2 A4 A5 G6 A7 A8 A9 0 1 2 3 4 5 6 7 8 9 A5 A7 A7 A5 NFA construction algorithm (2) • The Glushkov NFA GL’= (S, ∑’, I, F, δ’) that accepts language L(RE') • S :A set of states. S = {0, 1, …, m} • ∑' :The alphabet with subscript numbers • I :The initial state id; I = 0 • F :The final states; F = Last(RE’)∪(EmptyRE・{0}). • δ' :Transition function defined by the followings∀x∈ Pos(RE’), ∀y∈ Follow(RE’, x), δ’(x, αy) = y The transitions from the initial state are as follows:∀y∈ First(RE’), δ’(0, αy) = y Example： NFA for RE’ = (A1T2|G3A4)((A5G6|A7A8A9)*) Lecture on Information knowledge network

Pseudo code • Glushkov_variables (vRE, lpos) • if v = [ | ](vl,vr) or v = [・](vl,vr) then • lpos ← Glushkov_variables(vl, lpos); • lpos ← Glushkov_variables(vr, lpos); • else if v = [ * ](v*) then lpos ← Glushkov_variables(v*, lpos); • end of if • if v = (ε) then • First(v) ← φ, Last(v) ← φ, Emptyv ← {ε}; • else if v = (a), a∈Σ then • lpos ← lpos + 1; • First(v) ← {lpos}, Last(v) ← {lpos}, Emptyv ← φ, Follow(lpos) ← φ; • else if v = [ | ](vl,vr) then • First(v) ← First(vl)∪First(vr); • Last(v) ← Last(vl)∪Last(vr); • Emptyv ← Emptyvl∪Emptyvr; • else if v = [・](vl,vr)then • First(v) ← First(vl)∪(Emptyvl・First(vr)); • Last(v) ← (Emptyvr・Last(vl))∪Last(vr); • Emptyv ← Emptyvl∩Emptyvr; • for x∈Last(vl) do Follow(x) ← Follow(x)∪First(vr); • else if v = [ * ](v*) then • First(v) ← First(v*), Last(v) ← Last(v*), Emptyv ← {ε}; • for x∈Last(v*) do Follow(x) ← Follow(x)∪First(v*); • end of if • return lpos; O(m3) time totally It takes O(m2)time Lecture on Information knowledge network

Pseudo code (cont.) • Glushkov (RE) • /* make the parse tree by parsing the regular expression */ • vRE ← Parse(RE$, 1); • /* calculate each variable by using the parse tree */ • m ← Glushkov_variables(vRE, 0); • /* construct NFA GL(S,∑, I, F,δ) by the variables */ • Δ←φ; • for i ∈ 0…m do create state I; • for x ∈ First(vRE) do Δ←Δ∪ {(0, αx, x)}; • for i ∈ 0…m do • for i ∈ Follow(i) do Δ←Δ∪ {(i,αx, x)}; • end of for • for x∈ Last(vRE)∪(EmptyvRE・{0}) do mark x as terminal; Lecture on Information knowledge network

Take a breath Taiwan High-speed Railway@Taipei 2011.11.8 Lecture on Information knowledge network

Flow of pattern matching process An NFA can be simulated in O(mn) time Constructing NFA by Thompson method Parsing Scan texts Regularexpression Parse tree NFA Reportthe occurrences Translate Scan texts Constructing NFA by Glushkovmethod DFA To translate, we need O(2m)time and space There exists a method of converting directly into a DFA ※Please refer the section 3.9 of “Compilers – Principles, Techniques and Tools,” written by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986. Lecture on Information knowledge network

Methods of simulating NFAs • Simulating a Thompson NFA directly • The most naïve method • Storing the current active states with a list of size O(m), the method updates the states of the NFA in O(m) time for each symbol read from a text. • It obviously takes O(mn) time. • Simulating a Thompson NFA by converting into an equivalent DFA • It is a classical technique. • Refer “Compilers – Principles, Techniques and Tools,” written by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986. • The conversion is done as preprocessing → it takes O(2m) time and space. • There are also techniques that converses dynamically as scanning a text. • Hybrid method • E. W. Myers. A four russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2):430-448, 1992. • It is a method that combines NFA and DFA to do efficient matching. • It divides the Thompson NFA into modules which include O(k) nodes for each, and then converses each module into DFA. It simulates the transitions between modules as a NFA. • High-speed NFA simulation by bit-parallel technique • Simulating the Thompson NFA: proposed by Wu and U. Manber[1992] • Simulating the Glushkov NFA: proposed by G. Navarro and M. Raffinot[1999] Lecture on Information knowledge network

T T A A 0 1 8 0 1 9 0 2 T A T T A G C G C G 0 1 0 1 5 7 0 1 5 7 9 A C,T C A G G C A T C T C 0 0 1 8 9 C T G C,T C A A G C,T G A 0 3 0 4 0 3 6 0 1 4 5 7 0 1 5 7 8 A C,T A A G G G G G Simulating by converting into an equivalent DFA Ex.： A DFA converted from the Glushkov NFA for RE = (AT|GA)((AG|AAA)*) C,T • DFA Classical (N = (Q,∑, I, F,Δ), T = t1t2…tn) • Preprocessing: • for σ∈∑ do Δ←Δ∪ (i, σ, I); • (Qd,∑, Id, Fd,δ) ← BuildDFA(N); /* Make an equivalent DFA with NFA N */ • Searching: • s ← Id; • for pos ∈ 1…n do • if s∈Fdthen report an occurrence ending at pos – 1; • s ← δ(s, tpos); • end of for Lecture on Information knowledge network

Bit-parallel Thompson (BPThompson) S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992. • Simulating the Thompson NFA by bit-parallel technique • For Thompson NFAs, note that the next of the i-th state is the i+1th except for ε transitions.→ Bit-parallelism similar to the Shift-and method can be applicable. • ε transitions are separately simulated. • This needs the mask table of size 2L (L is the number of states of the NFA) • It takes O(2L + m|∑|) time for preprocessing. • It scans a text in O(n) time when L is small enough. • About NFA GL=(Q={s0,…,s|Q|-1}, ∑, I = s0, F, Δ) • The expression of mask tables of the NFA: Qn={0,…,|Q-1|}, In = 0|Q|-11, Fn = |sj∈F 0|Q|-1-j10j • Definitions of mask tables: • Bn[i,σ] = |(si,σ,sj)∈Δ 0|Q|-1-j10j • En[ i ] = |sj∈E(i) 0|Q|-1-j10j (where E(i) is the ε-closure of state si) • Ed[D] = |i, i=0 OR D&0L-i-110i ≠ 0L En[ i ] • B[σ] = |i∈0…mBn[i, σ] Lecture on Information knowledge network

Pseudo code • BuildEps (N = (Qn,∑,In,Fn,Bn,En) ) • for σ∈∑ do • B[σ] ← 0L; • for i∈0…L–1 do B[σ] ← B[σ] | Bn[i,σ]; • end of for • Ed[0] ← En[0]; • for i∈0…L–1 do • for j∈0…2i – 1 do • Ed[2i + j] ← En[ i ] | Ed[ j ]; • end of for • end of for • return (B, Ed); • BPThompson (N = (Qn,∑,In,Fn,Bn,En), T = t1t2…tn) • Preprocessing: • (B, Ed) ← BuildEps(N); • Searching: • D ← Ed[ In ]; /* initial state */ • for pos∈1…n do • if D & Fn≠ 0Lthen report an occurrence ending at pos–1; • D ← Ed[ (D << 1) & B[tpos] ]; • end of for Lecture on Information knowledge network

Bit-parallel Glushkov (BPGlushkov) G. Navarro and M. Raffinot. Fast regular expression search. In Proc. of WAE99, LNCS1668, 199-213, 1999. • Simulating the Glushkov NFA by bit-parallel technique • For Glushkov NFAs, note that, for any node, all the labels of transitions entering to the node are the same.→ Although the bit-parallel similar to the Shift-And method cannot be applicable, each state transition can be calculated by Td[D]&B[σ]. • The number of mask tables is 2|Q| (while it is 2L for BPThompson). • It takes O(2m + m|∑|) time for preprocessing. • It scans a text in O(n) time when m is small enough. • It is more efficient than BPThompson in almost all cases. • About NFA GL=(Q={s0,…,s|Q|-1}, ∑, I = s0, F, Δ) • The expression of mask tables of the NFA: Qn={0,…,|Q-1|}, In = 0|Q|-11, Fn = |sj∈F 0|Q|-1-j10j • Definitions of mask tables: • Bn[i,σ] = |(si,σ,sj)∈Δ 0|Q|-1-j10j • B[σ] = |i∈0…mBn[i, σ] • Td[D] = |(i,σ), D&0m-i10i ≠ 0m+1, σ∈∑ Bn[i,σ] Lecture on Information knowledge network

Pseudo code • BuildTran (N = (Qn,∑,In,Fn,Bn,En) ) • for i∈0…m do A[ i ] ← 0m+1; • for σ∈∑ do B[σ] ← 0m+1; • for i∈0…m, σ∈∑ do • A[ i ] ← A[ i ] | Bn[I,σ]; • B[σ] ← B[σ] | Bn[i,σ]; • end of for • Td[0] ← 0m+1; • for i∈0…m do • for j∈0…2i – 1 do • Td[2i + j] ← A[ i ] | Td[ j ]; • end of for • end of for • return (B, Ed); • BPGlushkov (N = (Qn,∑,In,Fn,Bn,En), T = t1t2…tn) • Preprocessing: • for σ∈∑ do Bn[0,σ] ← Bn[0,σ] | 0m1; /* initial self-loop */ • (B, Ed) ← BuildTran(N); • Searching: • D ← 0m1; /* initial state */ • for pos∈1…n do • if D & Fn≠ 0m+1then report an occurrence ending at pos–1; • D ← Td[D] & B[tpos]; • end of for Lecture on Information knowledge network

Other topics • Extended regular expression: • The one with allowing two operations, intersection and complementation, in addition to connection, selection, and repetition. • ￢(UNIX)∧(UNI(.)* | (.)*NIX) • It is different from POSIX regular expression. • H. Yamamoto, An Automata-based Recognition Algorithm for Semi-extended Regular Expressions, Proc. MFCS2000, LNCS1893, 699-708, 2000. • O. Kupferman and S. Zuhovitzky, An Improved Algorithm for the Membership Problem for Extended Regular Expressions, Proc. MFCS2002, LNCS2420, 446-458, 2002. • Researches on speeding-up regular expression matching • Filtration technique using BNDM + verification • G. Navarro and M. Raffinot, New Techniques for Regular Expression Searching, Algorithmica, 41(2): 89-116, 2004. - In this paper, the method of simulating the Glushkov NFA with mask tables of O(m2m) bits is also presented. Lecture on Information knowledge network

Regular expression • the ability of it to define the language is the same as that of finite automaton. • Flow of regular expression matching • After translating it to a parse tree, the corresponding NFA is constructed. Matching is done by simulating the NFA • Filtration + pattern plurals collation + inspection + NFA simulation • Methods for constructing an NFA • Thompson NFA: • The number of states < 2m, and the number of state transitions < 4m →O(m). • It contains many ε transitions. • The transitions other than ε connect the states from i to i+1. • Glushkov NFA • The number of states is just m+1, and the number of state transitions is O(m2). • It doesn't contain any ε transitions. • For any node, all the labels of transitions entering to the node are the same. • Methods of simulating NFAs • Simulating Thompson NFAs directly → O(mn) time • Converting into an equivalent DFA → It runs in O(n) for scanning, but it takes O(2m) time and space for preprocessing. • Speeding-up by bit-parallel techniques:Bit-parallel Thompson and Bit-parallel Glushkov • The next theme • Pattern matching on compressed texts: an introduction of Kida’s research (it’s a trend of 90's in this field!) The 5th summary Lecture on Information knowledge network

Appendix • About the definitions of terms which I didn’t explain in the first lecture. • A subset of ∑* is called a formal language or a language for short. • For languages L1, L2∈∑*, the set { xy | x∈L1 andy∈L2 }is called a product of L1and L2, and denoted by L1・L2 or L1L2for short. • For a language L⊆∑*, we define L0 = {ε}, Ln = Ln-1・L (n≧1)Moreover, we define L* = ∪n=0…∞ Lnand call it as a closure of L. We also denote L+ = ∪n=1…∞ Ln. • About look-behind notations • I told in the lecture that I couldn’t find the precise description of look-behind notations. But I eventually found that! • Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press, Elsevier, 1990. • (Japanese translation)コンピュータ基礎理論ハンドブックⅠ：アルゴリズムと複雑さ，丸善，1994. • Chapter 5, section 2.3 and section 6.1 • According to this, it seems that the notion of look-behind appeared in 1964. • It exceeds the frame of context-free grammar! • The matching problem of it is proved to be NP-complete. Lecture on Information knowledge network

Lecture on Information Knowledge Network "Information retrieval and pattern matching"