
Regular Expressions into Finite Automata

Regular Expressions into Finite Automata. Anne Bruggemann-Klein. Presenting: Rutie Mesing. Outline: Building the Glushkov automaton in O((size of E)^2). Defining the star normal form. Building the Glushkov automaton in O(size of E) for deterministic regular expressions.



  1. Regular Expressions into Finite Automata • Anne Bruggemann-Klein • Presenting: Rutie Mesing

  2. Outline • Building the Glushkov automaton in O((size of E)^2) • Defining the star normal form • Building the Glushkov automaton in O(size of E) for deterministic regular expressions • Strong and weak unambiguity • Quadratic-time decision algorithm for weak unambiguity

  3. General definitions • E – a regular expression • L(E) – the language specified by the regular expression E • The size of a regular expression E: the number of symbols it contains, including syntactic symbols such as brackets, +, · and * • The size of an NFA: the number of its transitions

  4. pos(E), χ(x) • Example: (a+b)*a(ab)* is marked as (a1+b2)*a3(a4b5)* • pos(E) – the set of subscripted symbols (positions) in an expression E • x, y, z are used to denote positions • a, b, c are used for elements of the alphabet Σ • For a position x, χ(x) is the corresponding symbol of Σ
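
The marking step can be sketched in a few lines of Python. This is only an illustration: the function name `mark_positions` and the convention that the alphabet is given as a string of single-character symbols are assumptions made here, not part of the slides.

```python
def mark_positions(expr: str, alphabet: str = "ab") -> tuple[str, dict[int, str]]:
    """Subscript each alphabet symbol with its position number, left to right.

    Returns the marked expression and the map chi (position -> symbol).
    """
    marked = []
    chi = {}  # plays the role of the map written chi(x) on the slide
    for ch in expr:
        if ch in alphabet:
            pos = len(chi) + 1      # positions are numbered 1, 2, 3, ...
            chi[pos] = ch
            marked.append(f"{ch}{pos}")
        else:
            marked.append(ch)       # syntactic symbols are copied unchanged
    return "".join(marked), chi


marked, chi = mark_positions("(a+b)*a(ab)*")
print(marked)  # (a1+b2)*a3(a4b5)*
print(chi)     # {1: 'a', 2: 'b', 3: 'a', 4: 'a', 5: 'b'}
```

Here pos(E) is the key set of `chi`, and `chi[x]` recovers the symbol of position x.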

  5. Position sets: first(E), last(E) – inductive definition

  6. Position sets: follow(E, x) – inductive definition
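
The bodies of the two definition slides did not survive in this transcript. As a hedged reconstruction, the usual inductive definitions of these position sets can be written as follows, for marked expressions F and G with disjoint position sets and a single position x. The case layout is an assumption of this sketch; it agrees with the update rules used by the traversal algorithm on the later slides.

```latex
\begin{align*}
\mathrm{first}(\emptyset) &= \mathrm{first}(\varepsilon) = \emptyset,
  \qquad \mathrm{first}(x) = \{x\},\\
\mathrm{first}(F+G) &= \mathrm{first}(F) \cup \mathrm{first}(G),\\
\mathrm{first}(FG) &=
  \begin{cases}
    \mathrm{first}(F) \cup \mathrm{first}(G) & \text{if } \varepsilon \in L(F),\\
    \mathrm{first}(F) & \text{otherwise,}
  \end{cases}\\
\mathrm{first}(F^{*}) &= \mathrm{first}(F),\\[6pt]
\mathrm{last}(\emptyset) &= \mathrm{last}(\varepsilon) = \emptyset,
  \qquad \mathrm{last}(x) = \{x\},\\
\mathrm{last}(F+G) &= \mathrm{last}(F) \cup \mathrm{last}(G),\\
\mathrm{last}(FG) &=
  \begin{cases}
    \mathrm{last}(F) \cup \mathrm{last}(G) & \text{if } \varepsilon \in L(G),\\
    \mathrm{last}(G) & \text{otherwise,}
  \end{cases}\\
\mathrm{last}(F^{*}) &= \mathrm{last}(F),\\[6pt]
\mathrm{follow}(x, x) &= \emptyset,\\
\mathrm{follow}(F+G, y) &=
  \begin{cases}
    \mathrm{follow}(F, y) & \text{if } y \in \mathrm{pos}(F),\\
    \mathrm{follow}(G, y) & \text{if } y \in \mathrm{pos}(G),
  \end{cases}\\
\mathrm{follow}(FG, y) &=
  \begin{cases}
    \mathrm{follow}(F, y) & \text{if } y \in \mathrm{pos}(F) \setminus \mathrm{last}(F),\\
    \mathrm{follow}(F, y) \cup \mathrm{first}(G) & \text{if } y \in \mathrm{last}(F),\\
    \mathrm{follow}(G, y) & \text{if } y \in \mathrm{pos}(G),
  \end{cases}\\
\mathrm{follow}(F^{*}, y) &=
  \begin{cases}
    \mathrm{follow}(F, y) & \text{if } y \in \mathrm{pos}(F) \setminus \mathrm{last}(F),\\
    \mathrm{follow}(F, y) \cup \mathrm{first}(F) & \text{if } y \in \mathrm{last}(F).
  \end{cases}
\end{align*}
```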

  7. The Glushkov Automaton (NFA) • M_E = (Q_E ∪ {q_I}, Σ, δ_E, q_I, F_E) • Q_E = pos(E) • For a ∈ Σ, let δ_E(q_I, a) = {x | x ∈ first(E), χ(x) = a} • For x ∈ pos(E) and a ∈ Σ, let δ_E(x, a) = {y | y ∈ follow(E, x), χ(y) = a} • F_E = last(E) ∪ {q_I} if ε ∈ L(E), and F_E = last(E) otherwise • Proposition 2.1: L(M_E) = L(E) • Example: (a*+ba)* = (a1*+b2a3)* [Figure: Glushkov automaton of the example; states 1, 2, 3, edges labeled a and b]
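
To make the construction concrete, here is a minimal Python sketch for the slide's example E = (a1*+b2a3)*. The position sets are worked out by hand and hard-coded; the names `build_glushkov`, `accepts` and the use of `0` for the initial state q_I are choices of this sketch, not of the slides.

```python
# Hand-computed sets for the example E = (a1* + b2 a3)*.
CHI = {1: "a", 2: "b", 3: "a"}            # position -> symbol
FIRST = {1, 2}
LAST = {1, 3}
FOLLOW = {1: {1, 2}, 2: {3}, 3: {1, 2}}
NULLABLE = True                            # the empty word belongs to L(E)

Q_I = 0  # hypothetical encoding of the initial state q_I


def build_glushkov(chi, first, last, follow, nullable):
    """Return (delta, finals); delta[(state, symbol)] is a set of next states."""
    delta = {}
    for x in first:                        # transitions leaving q_I
        delta.setdefault((Q_I, chi[x]), set()).add(x)
    for x, successors in follow.items():   # transitions between positions
        for y in successors:
            delta.setdefault((x, chi[y]), set()).add(y)
    finals = set(last) | ({Q_I} if nullable else set())
    return delta, finals


def accepts(delta, finals, word):
    """Simulate the NFA by tracking the set of states reachable so far."""
    current = {Q_I}
    for symbol in word:
        current = set().union(*(delta.get((q, symbol), set()) for q in current))
    return bool(current & finals)


delta, finals = build_glushkov(CHI, FIRST, LAST, FOLLOW, NULLABLE)
print(accepts(delta, finals, "aba"))  # True:  a · ba
print(accepts(delta, finals, "ab"))   # False: a trailing b needs a following a
```

Every transition into a position y is labeled χ(y), which is why the label can be read off the target state.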

  8. The canonical method (O(n^3)) for computing first, last & follow • Convert E into a syntax tree • Leaves are labeled with ∅, ε or positions of E • Internal nodes are labeled with +, · or * • Building time: O(n) (n = size of E) • Each node v in the syntax tree corresponds to a subexpression E_v of E • A postorder traversal of the syntax tree computes: • nullable(v): Boolean – does L(E_v) contain ε? • first(v), last(v): subsets of pos(E) (type 2^pos(E)) • For each x ∈ pos(E) there is a global variable follow(x): 2^pos(E) • Overall: O(n^3)

  9. Actions at node v during the postorder traversal:
     case
       v is a node labeled +:
         nullable(v) := nullable(leftchild) or nullable(rightchild);
         first(v) := first(leftchild) ∪ first(rightchild);     (first/last union)
         last(v) := last(leftchild) ∪ last(rightchild);        (first/last union)
       v is a node labeled ·:
         nullable(v) := nullable(leftchild) and nullable(rightchild);
         for each x in last(leftchild) do
           follow(x) := follow(x) ∪ first(rightchild);         (follow union, concatenation)
         if nullable(leftchild) then
           first(v) := first(leftchild) ∪ first(rightchild)    (first/last union)
         else
           first(v) := first(leftchild);
         if nullable(rightchild) then
           last(v) := last(leftchild) ∪ last(rightchild)       (first/last union)
         else
           last(v) := last(rightchild);
       v is a node labeled *:
         nullable(v) := true;
         for each x in last(child) do
           follow(x) := follow(x) ∪ first(child);              (follow union, star)
         first(v) := first(child);
         last(v) := last(child);
       v is a node labeled ∅:
         nullable(v) := false; first(v) := ∅; last(v) := ∅;
       v is a node labeled ε:
         nullable(v) := true; first(v) := ∅; last(v) := ∅;
       v is a node labeled x:
         nullable(v) := false; follow(x) := ∅; first(v) := {x}; last(v) := {x};
     end case;
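
The traversal can be tried out with the following Python sketch. The tuple encoding of syntax trees (`("pos", x)`, `("eps",)`, `("empty",)`, `("+", l, r)`, `(".", l, r)`, `("*", c)`) and the function name `visit` are assumptions of this sketch. It uses Python sets, so it illustrates the update rules rather than the list-based cost analysis of the slides.

```python
def visit(node, follow):
    """Postorder pass returning (nullable, first, last) for the subtree.

    `follow` is the global table: it is filled in place, one entry per position.
    """
    kind = node[0]
    if kind == "empty":
        return False, set(), set()
    if kind == "eps":
        return True, set(), set()
    if kind == "pos":
        x = node[1]
        follow[x] = set()
        return False, {x}, {x}
    if kind == "*":
        _, first, last = visit(node[1], follow)
        for x in last:
            follow[x] |= first      # the only union that may overlap
        return True, first, last
    # Binary nodes: visit both children before the node itself.
    null_l, first_l, last_l = visit(node[1], follow)
    null_r, first_r, last_r = visit(node[2], follow)
    if kind == "+":
        return null_l or null_r, first_l | first_r, last_l | last_r
    # kind == ".": concatenation
    for x in last_l:
        follow[x] |= first_r
    first = first_l | first_r if null_l else first_l
    last = last_l | last_r if null_r else last_r
    return null_l and null_r, first, last


# E = (a1* + b2 a3)* with positions 1, 2, 3
tree = ("*", ("+", ("*", ("pos", 1)), (".", ("pos", 2), ("pos", 3))))
follow = {}
print(visit(tree, follow))  # (True, {1, 2}, {1, 3})
print(follow)               # {1: {1, 2}, 2: {3}, 3: {1, 2}}
```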

  10. Lemma 2.5 • The following invariant holds after node v has been visited: 1. nullable(v) is true if and only if ε ∈ L(E_v). 2. first(v) = first(E_v), last(v) = last(E_v). • Furthermore, if node v has been visited but the parent of v has not, then 3. follow(x) = follow(E_v, x) for x ∈ pos(E_v). • In particular, for the root node v0: 1. first(v0) = first(E), last(v0) = last(E). 2. follow(x) = follow(E, x) for x ∈ pos(E).

  11. Observations • All unions that build first and last sets, and the follow unions at concatenation nodes, are disjoint, because pos(F) ∩ pos(G) = ∅ for the two children F and G • Only the follow unions at star nodes are not necessarily disjoint • Example: E = (a*b*)*, H = a*b* • Elements of first(H) are added to follow(H, x) for x ∈ last(H), but some elements of first(H) may already belong to follow(H, x) for some x ∈ last(H) • Hence O(n^3) for computing first(E), last(E) and follow(E, x)

  12. Computing first, last & follow in a better time bound (O(n^2)) • General strategy: • First, consider only expressions for which all unions, including the follow unions at star nodes, are disjoint • Such expressions are in star normal form (SNF) • Then show that the algorithm runs in time O(size(E) + size(M_E)) for expressions E in star normal form • Finally, show why the restriction to star normal form is justified

  13. Star Normal Form – Definition • A regular expression E is in star normal form if, for each starred subexpression H* of E, the SNF conditions follow(H, last(H)) ∩ first(H) = ∅ and ε ∉ L(H) hold • Here follow(H, last(H)) denotes the union of follow(H, x) over all x ∈ last(H)

  14. Lemma 2.7 • Let E be a regular expression in star normal form • M_E can be computed from E in time O(size(E) + size(M_E)) • Proof • Each union that builds a first or last set takes constant time (list concatenation) • For the follow unions at concatenation and star nodes: • Observation: for any subexpression F of a subexpression G of E and x ∈ pos(F), follow(F, x) ⊆ follow(G, x) ⊆ follow(E, x) • The run time of a follow union at node v for position x is proportional to the number of positions in follow(E_v, x) that are not already in follow(H, x) for a proper subexpression H of E_v • Since all these unions are disjoint (SNF), the total time spent in follow unions is at most Σ_{x ∈ pos(E)} |follow(E, x)| • This is less than or equal to the number of transitions in M_E

  15. Why the restriction to star normal form is justified • Theorem 3.1 • For each regular expression E, there is a regular expression E• (E with a superscript bullet) such that • M_{E•} = M_E (same Glushkov automaton) • E• is in star normal form • E• can be computed from E in linear time

  16. From starred expression E* to E°* • Goal: the SNF conditions are fulfilled for E° • Observation: after removing from M_E all "feedback" transitions, i.e. those leading from a final state (apart from q_I) to a state that q_I is directly connected to, and making q_I non-final, the resulting NFA is the Glushkov automaton of an expression E° with follow(E°, last(E°)) ∩ first(E°) = ∅ • Example: E = (a1*b2*)*, E° = (a1+b2) [Figure: Glushkov automata of E and of E°]

  17. E° – inductive definition • Example: E = (a1*b2*)*, E° = (a1+b2) [Figure: Glushkov automata of E and of E°]

  18. Lemma 3.3 • 1. size(E°) ≤ size(E) • 2. ε ∉ L(E°) • 3. pos(E°) = pos(E) • 4. first(E°) = first(E), last(E°) = last(E) • 5. follow(E°, x) = follow(E, x) for all x ∈ pos(E) \ last(E) • 6. follow(E°, x) = follow(E, x) \ first(E) for all x ∈ last(E); hence follow(E°, last(E°)) ∩ first(E°) = ∅ • 7. follow(E°*, x) = follow(E*, x) for all x ∈ pos(E) • 8. M_{E°*} = M_{E*} • The proof is by induction on E • Claims 7 and 8 follow directly from claims 5 and 6

  19. From E to E• • If we substitute in E each starred subexpression H* with H°* • proceeding bottom-up in E, • we can expect to get an expression E• (E with a superscript bullet) in star normal form with M_{E•} = M_E

  20. E• – inductive definition • Example: E = (a1*b2*)*, E° = (a1+b2) [Figure: Glushkov automata of E and of E°] • For E = (a*b*)*: E• = ((a*b*)*)• = (a*b*)•°* = (a+b)•* = (a+b)*
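
The slide bodies with the two inductive definitions are missing from this transcript. The Python sketch below encodes the definitions as they are commonly stated for this construction, so it should be read as a reconstruction under that assumption. The tuple encoding (`("sym", a)`, `("eps",)`, `("empty",)`, `("+", f, g)`, `(".", f, g)`, `("*", f)`) and the names `nullable`, `circ` and `bullet` are hypothetical. The sketch recomputes `nullable` repeatedly, so it shows the definitions only, not the linear-time procedure.

```python
def nullable(e):
    """True if the empty word belongs to L(e)."""
    kind = e[0]
    if kind in ("eps", "*"):
        return True
    if kind in ("empty", "sym"):
        return False
    if kind == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])     # concatenation "."


def circ(e):
    """E°: drops the empty word and the feedback from last(E) to first(E)."""
    kind = e[0]
    if kind in ("empty", "eps"):
        return ("empty",)
    if kind == "sym":
        return e
    if kind == "*":
        return circ(e[1])
    f, g = e[1], e[2]
    if kind == "+" or (nullable(f) and nullable(g)):
        return ("+", circ(f), circ(g))
    if nullable(g):                              # only G is nullable
        return (".", circ(f), g)
    if nullable(f):                              # only F is nullable
        return (".", f, circ(g))
    return e                                     # neither side is nullable


def bullet(e):
    """E•: replace every starred subexpression H* by H°*, bottom-up."""
    kind = e[0]
    if kind in ("empty", "eps", "sym"):
        return e
    if kind == "*":
        return ("*", circ(bullet(e[1])))
    return (kind, bullet(e[1]), bullet(e[2]))


a, b = ("sym", "a"), ("sym", "b")
e = ("*", (".", ("*", a), ("*", b)))             # (a*b*)*
print(bullet(e))  # ('*', ('+', ('sym', 'a'), ('sym', 'b'))), i.e. (a+b)*
```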

  21. M_{E•} = M_E • Lemma 3.5 • L(E•) = L(E) • size(E•) ≤ size(E) • pos(E•) = pos(E) • first(E•) = first(E) • last(E•) = last(E) • follow(E•, x) = follow(E, x) for x ∈ pos(E) • q_I ∈ F_{E•} if and only if q_I ∈ F_E • These claims imply the first part of Theorem 3.1, M_{E•} = M_E

  22. E• is in SNF • To show: for each starred subexpression H* of E•, follow(H, last(H)) ∩ first(H) = ∅ and ε ∉ L(H) • The proof is by induction on the size of E • The star case [E = F*] ⇒ E• = F•°* • The SNF conditions hold for F•° (Lemma 3.3) • F°• is in SNF by the induction hypothesis, since size(F°) ≤ size(F) • Need to show that F•° = F°•

  23. Lemma 3.6 • (1) E°° = E° • (2) E•° = E°• • (3) E•• = E• • Proof – by induction on E • The star case [E = F*]: • (1) E°° = F°° = F° = E° (by def. of °, induction, def. of °) • (2) E•° = (F•°*)° = F•°° = F•° = F°• = E°• (by def. of •, def. of °, (1), induction, def. of °) • (3) E•• = (F•°*)• = F•°•°* = F••°°* = F•°* = E• (by def. of •, def. of •, (2), induction and (1), def. of •)

  24. Computing E• from E in linear time • For each subexpression H of E, we need H• and H•° for computing E• • H• and H•° are computed simultaneously during the postorder traversal • It remains to prove that only a constant amount of time is spent at each node

  25. Conclusions so far • Theorem 3.9: The Glushkov automaton M_E can be computed from a regular expression E in time linear in size(E) + size(M_E) • Proof • E• is computed from E in linear time • E• is in star normal form • M_E = M_{E•} can be computed from E• in time O(size(E•) + size(M_E))

  26. Deterministic regular expressions • A regular expression E is deterministic if the corresponding NFA M_E is deterministic • Theorem 3.11 • 1. It can be decided in linear time whether a regular expression E is deterministic. 2. If E is deterministic, then the deterministic finite automaton M_E can be computed from E in linear time.

  27. Theorem 3.11 – Proof • E is deterministic if and only if E• is • The Glushkov automata are identical ⇒ we can assume that E is in star normal form • We start to compute first(E), last(E) and follow(E, x) for x ∈ pos(E) incrementally, • keeping track of the sets follow(E, x) in a |pos(E)| × |Σ| matrix • Examples: E = (a1+b2)* is deterministic; E = (a1+b2)*a3 is nondeterministic [Figure: the pos × Σ matrices for both examples]
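
The determinism test behind the matrix can be sketched as follows. This simplified version takes the finished first and follow sets as input instead of interleaving the check with their incremental computation, and it uses one small dictionary per row in place of the |pos(E)| × |Σ| matrix; the name `is_deterministic` is an assumption of this sketch.

```python
def is_deterministic(chi, first, follow):
    """True if neither first nor any follow set contains two distinct
    positions that carry the same symbol."""
    for positions in [first, *follow.values()]:
        seen = {}                       # one matrix row: symbol -> position
        for x in positions:
            symbol = chi[x]
            if symbol in seen:          # second position for this symbol
                return False            # stop at the first conflict
            seen[symbol] = x
    return True


# E = (a1+b2)*
print(is_deterministic({1: "a", 2: "b"}, {1, 2},
                       {1: {1, 2}, 2: {1, 2}}))            # True

# E = (a1+b2)*a3: positions 1 and 3 compete for the symbol a
print(is_deterministic({1: "a", 2: "b", 3: "a"}, {1, 2, 3},
                       {1: {1, 2, 3}, 2: {1, 2, 3}, 3: set()}))  # False
```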

  28. Ambiguity in automata and expressions • Unambiguous ε-NFA – definition: for each word w, there is at most one path from the initial state to a final state that spells out w • Weakly unambiguous • Intuition: each word of L(E) has a unique path through E • Definition: a regular expression E is weakly unambiguous if and only if the NFA M_E is unambiguous • Strongly unambiguous • Intuition: each word of L(E) can be uniquely decomposed into subwords matching the subexpressions of E

  29. Strongly unambiguous • Concatenation: L·L' is unambiguous if for all v, w ∈ L and v', w' ∈ L', vv' = ww' implies v = w and v' = w' • Star: L* is unambiguous if for all v1, …, vm ∈ L and w1, …, wn ∈ L with m, n ≥ 0, v1…vm = w1…wn implies m = n and vi = wi for 1 ≤ i ≤ m
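
For finite languages the concatenation condition can be checked by brute force, which may help to see what it rules out. The function name `concat_is_unambiguous` and the representation of languages as Python sets of strings are assumptions of this sketch.

```python
from itertools import product


def concat_is_unambiguous(L1, L2):
    """Check that every word of L1·L2 has exactly one split v·v'
    with v in L1 and v' in L2 (finite languages only)."""
    seen = {}                           # word -> the split that produced it
    for v, v2 in product(L1, L2):
        word = v + v2
        if word in seen and seen[word] != (v, v2):
            return False                # two different decompositions
        seen[word] = (v, v2)
    return True


# "ab" splits both as a·b and as ab·ε, so this concatenation is ambiguous.
print(concat_is_unambiguous({"a", "ab"}, {"", "b"}))  # False
print(concat_is_unambiguous({"a", "ab"}, {"c"}))      # True
```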

  30. Strongly unambiguous – in terms of automata • Let M'_E be the ε-NFA recognizing L(E) according to any of the standard constructions • Lemma 4.5 • E is strongly unambiguous if and only if M'_E is unambiguous • Lemma 4.6 • If E is strongly unambiguous, then E is weakly unambiguous • Proof • Elimination of ε-transitions transforms M'_E into M_E • Different paths in M_E spelling out a word w correspond to different paths in M'_E doing the same • Unambiguity of M'_E (Lemma 4.5) ⇒ unambiguity of M_E

  31. Lemma 4.7 – weakly unambiguous

  32. Epsilon Normal Form • Epsilon normal form condition: no subexpression of E denotes the empty word ambiguously

  33. Strongly unambiguous expressions are in star and in epsilon normal form • Lemma 4.10 • If E* is strongly unambiguous, then follow(E, last(E)) ∩ first(E) = ∅ • Proof • Assume that there exist x ∈ last(E), y ∈ follow(E, x) ∩ first(E) and z ∈ last(E) • Then x is a final state of M_E (and so is z) • x1…xn x y y1…ym z is a path through M_E • But this path is also the composition of two paths through M_E, namely x1…xn x and y y1…ym z • This makes L(E)* ambiguous

  34. Theorem 4.9 • E is strongly unambiguous if and only if 1. E is weakly unambiguous, 2. E is in star normal form, and 3. E is in epsilon normal form • Proof • For expressions in star and epsilon normal form, weak and strong unambiguity are identical (using Lemma 4.7) • Strongly unambiguous expressions are in star and in epsilon normal form (Lemma 4.10)

  35. Test for weak unambiguity in quadratic time • Theorem 4.11 • Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time • Proof • Let E be in epsilon normal form • E can be transformed into star normal form E• without changing the Glushkov automaton, in linear time • E• is also in epsilon normal form • E is weakly unambiguous if and only if E• is, if and only if E• is strongly unambiguous • Strong unambiguity of expressions can be decided in quadratic time

  36. Open problems • It is easy to see that a regular expression can be tested for epsilon normal form in linear time • Can a given regular expression be transformed into epsilon normal form in linear time? • Our transformation into star normal form can deal with starred subexpressions • Hence, the crucial point is how expressions E = F+G with ε ∈ L(F) ∩ L(G) can be handled • A straightforward approach would eliminate the empty string from either L(F) or L(G) • This opens up another question: • Is there a linear-time algorithm transforming a regular expression E into an expression E' with L(E') = L(E) \ {ε}?

  37. The End
