Regular Expressions into Finite Automata
Anne Bruggemann-Klein
Presenting: Rutie Mesing
Outline
• Building the Glushkov automaton in O((size of E)^2)
• Defining the star normal form
• Building the Glushkov automaton in O(size of E) for deterministic regular expressions
• Strong and weak unambiguity
• Quadratic-time decision algorithm for weak unambiguity
General definitions
• E: a regular expression
• L(E): the language specified by the regular expression E
• The size of a regular expression E: the number of symbols it contains, including syntactic symbols such as brackets, +, ·, and *
• The size of an NFA: the number of its transitions
pos(E), χ(x)
• Example: (a+b)*a(ab)* is marked as (a1+b2)*a3(a4b5)*
• pos(E): the set of subscripted symbols in a marked expression E
• x, y, z denote positions
• a, b, c denote elements of the alphabet Σ
• For a position x, χ(x) is the corresponding symbol of Σ
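The marking step can be sketched in a few lines of Python. This is only an illustration: the function name `mark` and the assumption that alphabet symbols are single lowercase letters are not from the slides.

```python
def mark(expr: str):
    """Subscript every alphabet symbol with its position number.

    Assumption (hypothetical): alphabet symbols are single lowercase
    letters; every other character is an operator or a bracket.
    """
    marked, chi, pos = [], {}, 0
    for ch in expr:
        if ch.islower():
            pos += 1
            chi[pos] = ch              # chi maps a position to its symbol
            marked.append(f"{ch}{pos}")
        else:
            marked.append(ch)
    return "".join(marked), chi


marked, chi = mark("(a+b)*a(ab)*")
print(marked)
print(chi)
```

Here `chi` plays the role of χ: it maps each position back to its symbol of Σ.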
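The Glushkov Automaton (NFA)
• M_E = (Q_E ∪ {q_I}, Σ, δ_E, q_I, F_E)
• Q_E = pos(E)
• For a ∈ Σ, let δ_E(q_I, a) = {x | x ∈ first(E), χ(x) = a}
• For x ∈ pos(E) and a ∈ Σ, let δ_E(x, a) = {y | y ∈ follow(E, x), χ(y) = a}
• F_E = last(E) ∪ {q_I} if ε ∈ L(E), and F_E = last(E) otherwise
• Proposition 2.1: L(M_E) = L(E)
• Example: (a*+ba)* is marked as (a1*+b2a3)*
[Figure: Glushkov automaton of (a1*+b2a3)* with states 1, 2, 3 and transitions labeled a and b]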
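As a rough sketch of this definition, the following Python builds the transition relation from given first, last and follow sets and runs the automaton on a word. The sets for (a1*+b2a3)* were worked out by hand for this example; the function names and the choice of `0` for q_I are assumptions made for illustration.

```python
def glushkov(first, last, follow, chi, nullable):
    """Return (delta, finals); state 0 plays the role of q_I."""
    delta = {}
    for x in first:                              # transitions leaving q_I
        delta.setdefault((0, chi[x]), set()).add(x)
    for x, successors in follow.items():         # transitions leaving position x
        for y in successors:
            delta.setdefault((x, chi[y]), set()).add(y)
    finals = set(last) | ({0} if nullable else set())
    return delta, finals


def accepts(delta, finals, word):
    """Simulate the NFA on word by tracking the set of current states."""
    states = {0}
    for a in word:
        states = set().union(*(delta.get((q, a), set()) for q in states))
    return bool(states & finals)


# Hand-computed data for E = (a1* + b2 a3)*
chi = {1: "a", 2: "b", 3: "a"}
first, last = {1, 2}, {1, 3}
follow = {1: {1, 2}, 2: {3}, 3: {1, 2}}
delta, finals = glushkov(first, last, follow, chi, nullable=True)

for w in ["", "a", "ba", "aba", "b", "ab"]:
    print(repr(w), accepts(delta, finals, w))
```

The language is the set of words in which every b is immediately followed by an a, so `"ba"` and `"aba"` are accepted while `"b"` and `"ab"` are not.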
The canonical method (O(n^3)) for computing first, last and follow
• Convert E into a syntax tree
  • Leaves are labeled with ∅, ε, or positions of E
  • Internal nodes are labeled +, · or *
  • Building time: O(n), where n = size of E
• Each node v of the syntax tree corresponds to a subexpression E_v of E
• A postorder traversal of the syntax tree computes:
  • nullable(v): Boolean, true if L(E_v) contains ε
  • first(v), last(v): elements of 2^pos(E)
• For each x ∈ pos(E) there is a global variable follow(x), an element of 2^pos(E)
• Total time: O(n^3)
case
  v is a node labeled ∅:
    nullable(v) := false; first(v) := ∅; last(v) := ∅;
  v is a node labeled ε:
    nullable(v) := true; first(v) := ∅; last(v) := ∅;
  v is a node labeled x:
    nullable(v) := false; follow(x) := ∅;
    first(v) := {x}; last(v) := {x};
  v is a node labeled +:
    nullable(v) := nullable(leftchild) or nullable(rightchild);
    first(v) := first(leftchild) ∪ first(rightchild);   (*)
    last(v) := last(leftchild) ∪ last(rightchild);   (*)
  v is a node labeled ·:
    nullable(v) := nullable(leftchild) and nullable(rightchild);
    for each x in last(leftchild) do
      follow(x) := follow(x) ∪ first(rightchild);   (**)
    if nullable(leftchild) then
      first(v) := first(leftchild) ∪ first(rightchild)   (*)
    else
      first(v) := first(leftchild);
    if nullable(rightchild) then
      last(v) := last(leftchild) ∪ last(rightchild)   (*)
    else
      last(v) := last(rightchild);
  v is a node labeled *:
    nullable(v) := true;
    for each x in last(child) do
      follow(x) := follow(x) ∪ first(child);   (***)
    first(v) := first(child);
    last(v) := last(child);
end case;
Lemma 2.5
The following invariant holds after node v has been visited:
1. nullable(v) is true if and only if ε ∈ L(E_v).
2. first(v) = first(E_v), last(v) = last(E_v).
Furthermore, if node v has been visited but the parent of v has not, then
3. follow(x) = follow(E_v, x) for x ∈ pos(E_v).
In particular, for the root node v_0:
1. first(v_0) = first(E), last(v_0) = last(E).
2. follow(x) = follow(E, x) for x ∈ pos(E).
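The postorder traversal and its invariant can be sketched in Python as follows. The tuple encoding of syntax trees (`("pos", x)`, `("alt", l, r)`, `("cat", l, r)`, `("star", c)`, `("eps",)`, `("empty",)`) and the name `analyse` are assumptions made for this illustration. Python sets are used for brevity, so the sketch hides the cost difference between disjoint and non-disjoint unions that the complexity analysis is about.

```python
def analyse(v, follow):
    """Postorder traversal of syntax tree v.

    Returns (nullable, first, last) for v and fills the shared dict
    follow, which maps each position x to its current follow set.
    """
    kind = v[0]
    if kind == "empty":
        return False, set(), set()
    if kind == "eps":
        return True, set(), set()
    if kind == "pos":
        follow[v[1]] = set()
        return False, {v[1]}, {v[1]}
    if kind == "star":
        _, f, l = analyse(v[1], follow)
        for x in l:
            follow[x] |= f          # union at a star node, may overlap
        return True, f, l
    n1, f1, l1 = analyse(v[1], follow)
    n2, f2, l2 = analyse(v[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    # kind == "cat"
    for x in l1:
        follow[x] |= f2             # union at a concatenation node, disjoint
    first = (f1 | f2) if n1 else f1
    last = (l1 | l2) if n2 else l2
    return n1 and n2, first, last


# E = (a1* b2*)*
E = ("star", ("cat", ("star", ("pos", 1)), ("star", ("pos", 2))))
follow = {}
nullable, first, last = analyse(E, follow)
print(nullable, first, last, follow)
```

For this expression both positions end up with the follow set {1, 2}, and the union performed at the outer star re-adds elements that the inner stars had already inserted.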
Observations
• All unions labeled (*) or (**), that is, the unions for first and last and the follow unions at concatenation nodes, are disjoint
  • pos(F) ∩ pos(G) = ∅ for the two operands F and G
• Only the unions labeled (***), the follow unions at star nodes, are not necessarily disjoint
  • Example: E = (a*b*)*, H = a*b*
  • Elements of first(H) are added to follow(H, x) for x ∈ last(H), but some elements of first(H) may already belong to follow(H, x) for some x ∈ last(H).
• O(n^3) for computing first(E), last(E) and follow(E, x)
Computing first, last and follow in a better time bound (O(n^2))
• General strategy:
  • We only consider expressions for which all unions, including the follow unions at star nodes (type (***)), are disjoint.
  • Such expressions are in star normal form (SNF).
  • We then show that the algorithm runs in time O(size(M_E)) for expressions E in star normal form.
  • Finally, we show why the restriction to star normal form is justified.
Star Normal Form: Definition
A regular expression E is in star normal form if, for each starred subexpression H* of E, the SNF conditions hold:
• follow(H, last(H)) ∩ first(H) = ∅
• ε ∉ L(H)
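A direct check of the two SNF conditions might look like the sketch below. It reuses the postorder traversal idea: when a star node is reached, the shared `follow` dict still holds follow(H, x) for the child H, so both conditions can be tested on the spot. The tuple encoding of expressions and all function names are assumptions made for illustration.

```python
def check(v, follow):
    """Return (in_snf, nullable, first, last) for the syntax tree v."""
    kind = v[0]
    if kind == "empty":
        return True, False, set(), set()
    if kind == "eps":
        return True, True, set(), set()
    if kind == "pos":
        follow[v[1]] = set()
        return True, False, {v[1]}, {v[1]}
    if kind == "star":
        ok, n, f, l = check(v[1], follow)
        # At this point follow[x] equals follow(H, x) for the child H.
        snf_here = (not n) and all(not (follow[x] & f) for x in l)
        for x in l:
            follow[x] |= f
        return ok and snf_here, True, f, l
    ok1, n1, f1, l1 = check(v[1], follow)
    ok2, n2, f2, l2 = check(v[2], follow)
    if kind == "alt":
        return ok1 and ok2, n1 or n2, f1 | f2, l1 | l2
    for x in l1:                                  # kind == "cat"
        follow[x] |= f2
    first = (f1 | f2) if n1 else f1
    last = (l1 | l2) if n2 else l2
    return ok1 and ok2, n1 and n2, first, last


def in_star_normal_form(expr):
    return check(expr, {})[0]


a, b = ("pos", 1), ("pos", 2)
print(in_star_normal_form(("star", ("cat", ("star", a), ("star", b)))))  # (a1* b2*)*
print(in_star_normal_form(("star", ("alt", a, b))))                      # (a1 + b2)*
```

The first expression fails because H = a1*b2* contains ε; the second satisfies both conditions.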
Lemma 2.7
• Let E be a regular expression in star normal form.
• M_E can be computed from E in time O(size(E) + size(M_E)).
• Proof
  • Each union of type (*) (for first and last) takes constant time (list concatenation).
  • Unions of type (**) or (***) (into follow, at concatenation and star nodes):
    • Observation: for any subexpression F of a subexpression G of E and x ∈ pos(F),
      follow(F, x) ⊆ follow(G, x) ⊆ follow(E, x)
    • The run time for (**) or (***) in a node v and for a position x is proportional to the number of positions in follow(E_v, x) that are not already in follow(F, x) for any proper subexpression F of E_v.
    • Because all unions are disjoint (SNF), the total run time spent in instructions (**) or (***) is Σ_{x ∈ pos(E)} |follow(E, x)|,
    • which is less than or equal to the number of transitions in M_E.
Why the restriction to star normal form is justified
Theorem 3.1
• For each regular expression E, there is a regular expression E^• such that
  • M_E = M_{E^•} (same Glushkov automaton)
  • E^• is in star normal form
  • E^• can be computed from E in linear time.
From a starred expression E* to E^∘*
• Goal: the SNF conditions are fulfilled for E^∘
• Observation
  • Remove from M_E all “feedback” transitions, that is, transitions leading from a final state (other than q_I) to a state that q_I is directly connected to, and make q_I non-final.
  • The resulting NFA is the Glushkov automaton of an expression E^∘ with follow(E^∘, last(E^∘)) ∩ first(E^∘) = ∅.
• Example: E = (a1*b2*)*, E^∘ = (a1+b2)
[Figure: Glushkov automata of E and of E^∘]
E^∘: inductive definition
• Example: E = (a1*b2*)*, E^∘ = (a1+b2)
[Figure: Glushkov automata of E and of E^∘]
Lemma 3.3
1. size(E^∘) ≤ size(E).
2. ε ∉ L(E^∘).
3. pos(E^∘) = pos(E).
4. first(E^∘) = first(E), last(E^∘) = last(E).
5. follow(E^∘, x) = follow(E, x) for all x ∈ pos(E) \ last(E).
6. follow(E^∘, x) = follow(E, x) \ first(E) for all x ∈ last(E); hence follow(E^∘, last(E^∘)) ∩ first(E^∘) = ∅.
7. follow(E^∘*, x) = follow(E*, x) for all x ∈ pos(E).
8. M_{E^∘*} = M_{E*}.
The proof is by induction on E. Claims 7 and 8 follow directly from 5 and 6.
From E to E^•
• Substitute each starred subexpression H* of E with H^∘*
• Proceed bottom-up in E
• We can expect to obtain an expression E^• in star normal form with M_E = M_{E^•}
E^•: inductive definition
• Example: E = (a1*b2*)*, E^∘ = (a1+b2)
• E^• = ((a*b*)*)^• = ((a*b*)^•)^∘* = (a*b*)^∘* = (a+b)*
[Figure: Glushkov automata of E and of E^∘]
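The slide text does not preserve the case analysis behind the two inductive definitions. The Python sketch below uses the cases as recalled from Brüggemann-Klein's paper, so treat every case in `circ` and `bullet` as an assumption rather than as the slides' own wording. It is also not the linear-time version: `nullable` is recomputed on demand instead of being carried along the postorder traversal. The tuple encoding and all function names are hypothetical.

```python
def nullable(e):
    k = e[0]
    if k in ("eps", "star"):
        return True
    if k in ("empty", "pos"):
        return False
    if k == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])          # cat


def circ(e):
    """E -> E^∘ (assumed cases): drop feedback from last(E) to first(E)."""
    k = e[0]
    if k in ("empty", "eps"):
        return ("empty",)
    if k == "pos":
        return e
    if k == "star":
        return circ(e[1])
    f, g = e[1], e[2]
    if k == "alt":
        return ("alt", circ(f), circ(g))
    nf, ng = nullable(f), nullable(g)                 # k == "cat"
    if nf and ng:
        return ("alt", circ(f), circ(g))
    if ng:
        return ("cat", circ(f), g)
    if nf:
        return ("cat", f, circ(g))
    return e


def bullet(e):
    """E -> E^• (assumed cases): replace each H* by (H^•)^∘*, bottom-up."""
    k = e[0]
    if k == "star":
        return ("star", circ(bullet(e[1])))
    if k in ("alt", "cat"):
        return (k, bullet(e[1]), bullet(e[2]))
    return e


def summary(e):
    """Return (nullable, first, last, follow) of e, i.e. the data defining M_E."""
    follow = {}

    def go(v):
        k = v[0]
        if k == "empty":
            return False, set(), set()
        if k == "eps":
            return True, set(), set()
        if k == "pos":
            follow[v[1]] = set()
            return False, {v[1]}, {v[1]}
        if k == "star":
            _, f, l = go(v[1])
            for x in l:
                follow[x] |= f
            return True, f, l
        n1, f1, l1 = go(v[1])
        n2, f2, l2 = go(v[2])
        if k == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for x in l1:
            follow[x] |= f2
        return n1 and n2, (f1 | f2) if n1 else f1, (l1 | l2) if n2 else l2

    n, f, l = go(e)
    return n, f, l, follow


E = ("star", ("cat", ("star", ("pos", 1)), ("star", ("pos", 2))))   # (a1* b2*)*
print(bullet(E))
print(summary(E) == summary(bullet(E)))
```

For E = (a1*b2*)* the result is the tree of (a1+b2)*, and `summary` confirms that first, last, follow and nullability, and therefore the Glushkov automaton, are unchanged.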
M_E = M_{E^•}
Lemma 3.5
• L(E^•) = L(E)
• size(E^•) ≤ size(E)
• pos(E^•) = pos(E)
• first(E^•) = first(E)
• last(E^•) = last(E)
• follow(E^•, x) = follow(E, x) for x ∈ pos(E)
• q_I ∈ F_{E^•} if and only if q_I ∈ F_E
These claims imply the first part of Theorem 3.1, M_E = M_{E^•}.
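E^• is in SNF
• SNF conditions for each starred subexpression H* of E^•:
  • follow(H, last(H)) ∩ first(H) = ∅
  • ε ∉ L(H)
• The proof is by induction on the size of E.
• The star case [E = F*]: E^• = F^{•∘}*
  • The SNF conditions hold for F^{•∘} (Lemma 3.3)
  • F^• is in SNF, by the induction hypothesis
  • It remains to show that F^{•∘•} = F^{•∘}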
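Lemma 3.6
(1) E^{∘∘} = E^∘
(2) E^{•∘•} = E^{•∘}
(3) E^{••} = E^•
• Proof: by induction on E
• The star case [E = F*]:
(1) E^{∘∘} = F^{∘∘} (def) = F^∘ (induction) = E^∘ (def)
(2) E^{•∘•} = (F^{•∘}*)^{∘•} (def) = F^{•∘∘•} (def) = F^{•∘•} (by (1)) = F^{•∘} (induction) = E^{•∘} (def)
(3) E^{••} = (F^{•∘}*)^• (def) = F^{•∘•∘}* (def) = F^{•∘∘}* (by (2), induction) = F^{•∘}* (by (1)) = E^• (def)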
Computing E^• from E in linear time
• For each subexpression H of E, we need H^• and H^{•∘} for computing E^•
• H^• and H^{•∘} are computed simultaneously during the postorder traversal
• It remains to prove that only a constant amount of time is spent at each node
Conclusions so far
• Theorem 3.9: The Glushkov automaton M_E can be computed from a regular expression E in time linear in size(E) + size(M_E)
• Proof
  • E^• is computed from E in linear time.
  • E^• is in star normal form.
  • M_E = M_{E^•} can be computed from E^• in time O(size(E^•) + size(M_{E^•})).
Deterministic regular expressions
• A regular expression E is deterministic if the corresponding NFA M_E is deterministic.
Theorem 3.11
1. It can be decided in linear time whether a regular expression E is deterministic.
2. If E is deterministic, then the deterministic finite automaton M_E can be computed from E in linear time.
Theorem 3.11: Proof
• E is deterministic if and only if E^• is
  • The Glushkov automata are isomorphic, so we can assume that E is in star normal form.
• We compute first(E), last(E), and follow(E, x) for x ∈ pos(E) incrementally,
  • keeping track of the sets follow(E, x) in a |pos(E)| × |Σ| matrix
• Examples
  • E = (a1+b2)*: E is deterministic
  • E = (a1+b2)*a3: E is nondeterministic
[Figure: pos × Σ matrices for the two examples]
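The matrix idea can be illustrated with a small Python sketch. It takes finished first and follow sets (hand-computed here for the two examples) instead of building them incrementally, so it shows only the collision test, not the linear-time bookkeeping of the proof. The function name and the use of row `0` for q_I are assumptions.

```python
def is_deterministic(first, follow, chi):
    """True if no state of M_E has two successors carrying the same symbol."""
    alphabet = sorted(set(chi.values()))
    column = {a: j for j, a in enumerate(alphabet)}
    rows = dict(follow)
    rows[0] = first                       # row 0 stands for q_I
    for successors in rows.values():
        row = [None] * len(alphabet)      # one matrix row: symbol -> successor
        for y in successors:
            j = column[chi[y]]
            if row[j] is not None:        # second successor on the same symbol
                return False
            row[j] = y
    return True


# E = (a1 + b2)*
print(is_deterministic({1, 2}, {1: {1, 2}, 2: {1, 2}}, {1: "a", 2: "b"}))

# E = (a1 + b2)* a3
print(is_deterministic({1, 2, 3},
                       {1: {1, 2, 3}, 2: {1, 2, 3}, 3: set()},
                       {1: "a", 2: "b", 3: "a"}))
```

In the second expression positions 1 and 3 both carry the symbol a and both appear in first(E), so the test fails already in the row for q_I.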
Ambiguity in automata and expressions
• Unambiguous NFA, definition:
  • For each word w, there is at most one path from the initial state to a final state that spells out w.
• Weakly unambiguous
  • Intuition: each word of L(E) has a unique path through E
  • Definition: a regular expression E is weakly unambiguous if and only if the NFA M_E is unambiguous.
• Strongly unambiguous
  • Intuition: each word of L(E) can be uniquely decomposed into subwords according to E
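To make the NFA definition concrete, the sketch below counts the accepting paths that spell out a given word. The example automaton is the Glushkov automaton of a1*a2*, written down by hand for this illustration; the function name and the use of `0` for the initial state are assumptions.

```python
def count_accepting_paths(delta, finals, word, start=0):
    """Number of paths from start to a final state that spell out word."""
    paths = {start: 1}                     # state -> number of paths reaching it
    for a in word:
        nxt = {}
        for q, n in paths.items():
            for p in delta.get((q, a), ()):
                nxt[p] = nxt.get(p, 0) + n
        paths = nxt
    return sum(n for q, n in paths.items() if q in finals)


# Glushkov automaton of a1* a2* (hand-computed)
delta = {(0, "a"): {1, 2}, (1, "a"): {1, 2}, (2, "a"): {2}}
finals = {0, 1, 2}

for w in ["", "a", "aa"]:
    print(repr(w), count_accepting_paths(delta, finals, w))
```

The word a already has two accepting paths, so a1*a2* is not weakly unambiguous.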
Strongly unambiguous
• Concatenation: L·L' is unambiguous if, for all v, w ∈ L and v', w' ∈ L', vv' = ww' implies v = w and v' = w'.
• Star: L* is unambiguous if, for all v1, ..., vm ∈ L and w1, ..., wn ∈ L with m, n ≥ 0, v1...vm = w1...wn implies m = n and vi = wi for 1 ≤ i ≤ m.
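For finite languages the concatenation condition can be checked by brute force, as in this small sketch. Restricting to finite sets of strings is an assumption made so that the example stays runnable; the function name is hypothetical.

```python
from itertools import product


def concat_is_unambiguous(L1, L2):
    """True if every word of L1·L2 has exactly one factorisation v·v'."""
    seen = set()
    for v, w in product(L1, L2):
        if v + w in seen:          # a second pair yields the same word
            return False
        seen.add(v + w)
    return True


print(concat_is_unambiguous({"a", "ab"}, {"b", ""}))
print(concat_is_unambiguous({"a", "ab"}, {"c"}))
```

In the first call the word ab arises both as a·b and as ab·ε, so the concatenation is ambiguous.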
Strongly unambiguous, in terms of automata
• Let M'_E be the NFA recognizing L(E) according to any of the standard constructions
• Lemma 4.5: E is strongly unambiguous if and only if M'_E is unambiguous
• Lemma 4.6: if E is strongly unambiguous, then E is weakly unambiguous
• Proof
  • Elimination of ε-transitions transforms M'_E into M_E.
  • Different paths in M_E spelling out a word w correspond to different paths in M'_E doing the same.
  • Unambiguity of M'_E (Lemma 4.5) ⇒ unambiguity of M_E
Epsilon Normal Form
• Epsilon normal form condition: no subexpression of E denotes the empty word ambiguously
Strongly unambiguous expressions are in star and in epsilon normal form
• Lemma 4.10: if E* is strongly unambiguous, then follow(E, last(E)) ∩ first(E) = ∅
• Proof
  • Assume that there exist x ∈ last(E), y ∈ follow(E, x) ∩ first(E), and z ∈ last(E); x and z are final states in M_E.
  • Then x1...xn x y y1...ym z is a path through M_E.
  • But this path is also the composition of two paths through M_E, namely x1...xn x and y y1...ym z.
  • This makes L(E)* ambiguous.
Theorem 4.9
• E is strongly unambiguous if and only if
  1. E is weakly unambiguous
  2. E is in star normal form
  3. E is in epsilon normal form
• Proof
  • For expressions in star and epsilon normal form, weak and strong unambiguity are identical (using Lemma 4.7)
  • Strongly unambiguous expressions are in star and in epsilon normal form (Lemma 4.10)
Test for weak unambiguity in quadratic time
• Theorem 4.11: regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time.
• Proof
  • Let E be in epsilon normal form.
  • E can be transformed into an expression E^• in star normal form without changing the Glushkov automaton, in linear time.
  • E^• is also in epsilon normal form.
  • E is weakly unambiguous if and only if E^• is weakly unambiguous, if and only if E^• is strongly unambiguous.
  • Strong unambiguity of expressions can be decided in quadratic time.
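The proof relies on a quadratic test for strong unambiguity that the slides do not spell out. As a different and standard technique, unambiguity of an ε-free NFA such as M_E can also be checked directly with a product construction: the automaton is ambiguous exactly when some pair (p, q) with p ≠ q is reachable from (q_I, q_I) and can reach a pair of final states. The sketch below is written for clarity and makes no attempt to meet the quadratic bound; all names are assumptions.

```python
from collections import deque


def is_unambiguous(delta, finals, start=0):
    """delta maps (state, symbol) to a set of states; no epsilon transitions."""
    # Forward pass: pairs reachable from (start, start) in the product automaton.
    succ = {}
    seen = {(start, start)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        for (s, a), targets in delta.items():
            if s != p:
                continue
            for p2 in targets:
                for q2 in delta.get((q, a), ()):
                    succ.setdefault((p, q), set()).add((p2, q2))
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        queue.append((p2, q2))
    # Backward pass: reachable pairs from which two final states can be reached.
    good = {pq for pq in seen if pq[0] in finals and pq[1] in finals}
    changed = True
    while changed:
        changed = False
        for pq, nxt in succ.items():
            if pq not in good and nxt & good:
                good.add(pq)
                changed = True
    # Two distinct accepting paths on one word must pass through some p != q.
    return not any(p != q for (p, q) in good)


# Glushkov automaton of a1* a2* (ambiguous)
m1 = ({(0, "a"): {1, 2}, (1, "a"): {1, 2}, (2, "a"): {2}}, {0, 1, 2})
# Glushkov automaton of (a1 + b2)* a3 (nondeterministic but unambiguous)
m2 = ({(0, "a"): {1, 3}, (0, "b"): {2}, (1, "a"): {1, 3}, (1, "b"): {2},
       (2, "a"): {1, 3}, (2, "b"): {2}}, {3})

print(is_unambiguous(*m1))
print(is_unambiguous(*m2))
```

The second example shows that weak unambiguity is strictly weaker than determinism: (a1+b2)*a3 is nondeterministic, yet every accepted word has a single accepting path.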
Open problems
• It is easy to see that a regular expression can be tested for epsilon normal form in linear time.
• Can a given regular expression be transformed into epsilon normal form in linear time?
  • Our transformation into star normal form can deal with starred subexpressions.
  • Hence, the crucial point is how expressions E = F+G with ε ∈ L(F) ∩ L(G) can be handled.
  • A straightforward approach would eliminate the empty string either from L(F) or from L(G).
• This opens up another question:
  • Is there a linear-time algorithm transforming a regular expression E into an expression E' with L(E') = L(E) \ {ε}?