700 likes | 968 Views
CS 3813: Introduction to Formal Languages and Automata. Chapter 3 Regular languages and expressions
E N D
CS 3813: Introduction to Formal Languages and Automata Chapter 3 Regular languages and expressions These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata, 3rd ed., by Peter Linz, published by Jones and Bartlett Publishers, Inc., Sudbury, MA, 2001. They are intended for classroom use only and are not a substitute for reading the textbook.
Operations on formal languages Let L1 = {10} and L2 = {011, 11}. Union: L1 L2 = {10, 011, 11} Concatenation: L1 L2 = {10011, 1011} Kleene Star: L1* = {λ, 10, 1010, 101010, … } Other operations: intersection, complement, difference
Definition Of Regular Languages A regular language over an alphabet is one that contains either a single string of length 0 or 1, or strings which can be obtained by using the operations of union, concatenation, or Kleene* on strings of length 0 or 1.
Alternative definition of regular languages The simplest possible regular languages are the empty set and languages consisting of a single string that is either the empty string or has length one. For example; if = {a,b}, the simplest languages are , {λ}, {a}, and {b}. A regular language is a language that can be built from these simple languages, by using the three operations of union, concatenation, and Kleene star.
Regular Languages correspond to Regular Expressions • L = Ø RE is Ø • L = {} RE = • L = {a} RE = a • L = L1 L2 RE = (r1 + r2) • L = L1 L2 RE = (r1r2) • L = L1* RE = (r1*)
Regular expressions A useful shorthand for describing regular languages. Compare to arithmetic expressions, such as (x + 3)/2. An arithmetic expression is constructed using arithmetic operators, such as addition and division. A regular expression is constructed using operations on languages, such as concatenation, union, and Kleene star. The value of an arithmetic expression is a number. The value of a regular expression is a language.
Recursive definition of a regular expression is a regular expression corresponding to the language . λ is a regular expression corresponding to the language {λ}. For each symbol a , a is a regular expression corresponding to the language {a}. For any regular expressions r and s, corresponding to the regular languages L(r) and L(s), respectively, each of the following is a regular expression: (r + s) corresponds to the language L(r) L(s) (r · s) or (rs) corresponds to the language L(r)L(s) (r*) corresponds to the language (L(r))*
Examples a*+ b= {λ, a, b, aa, aaa, aaaa, aaaaa, … } a*ba* = {w * : w has exactly one b} (a + b)* = any string of a’s and b’s (a + b)*aa (a + b)* = {w * : w contains aa} (a + b)*aa (a + b)* + (a + b)*bb (a + b)* = {w * : w contains aa or bb} (a +λ)b* = {abn : n 0} + {bn : n 0} As with arithmetic expressions, there is an order of precedence for operators -- unless you change it using parentheses. The order is: star closure first, then concatenation, then union.
More examples All strings containing no more than two a’s: (b + c)*(λ + a)(b + c)*(λ + a)(b + c)* All strings containing no runs of a’s of length greater than two: (b + c)*(λ + a + aa)(b + c)*((b + c)(b + c)*(λ + a + aa)(b + c)*)* All strings in which all runs of a’s have lengths that are multiples of three: (aaa + b + c)*
Hints for writing regular expressions Assume = {a, b, c}. Zero or more a’s: a* One or more a’s: aa* Any string at all: (a + b + c)* Any nonempty string: (a + b + c)(a + b + c)* Any string that does not contain a: (b + c)* Any string containing exactly one a: (b + c)*a(b + c)*
Practice Let = {a,b,c}. Give a regular expression for the following languages: (a) all strings containing exactly two a’s (b) all strings containing no more than three a’s
Practice Let = {a,b,c}. Give a regular expression for the following languages: (a) all strings containing exactly two a’s (b + c)*a(b + c)*a(b + c)* (b) all strings containing no more than three a’s (b + c)*(λ + a)(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*
Practice What languages correspond to the following regular expressions? a*b (aaa + bba) (ab)*
More practice Give regular expressions for the following languages, where the alphabet is = {a, b, c}. --all strings ending in b --all strings containing no more than two a’s -- all strings of even length
More practice Give regular expressions for the following languages, where the alphabet is = {0, 1}. --all strings of one or more 0’s followed by a 1 --all strings of two or more symbols followed by three or more 0’s -- all strings that do not end with 01
Do these strings match the regular expression? Regular expressionString (01* + 1) 0101 (a +λ)b b (ab)*a* λ (a + b)(ab) bb
Let M = (Q, S, q0, d, A) be an FA. A string x S *is accepted by M if d*(q0, x) A The language accepted (or recognized) by M is the set L(M) = {x S *| x is accepted by M} A language L over the alphabet S is regular iff there is a Finite Automaton that accepts L. Accepting (review)
Kleene’s theorem 1) For any regular expression r that represents language L(r), there is a finite automaton that accepts that same language. 2) For any finite automaton M that accepts language L(M), there is a regular expression that represents the same language. Therefore, the class of languages that can be represented by regular expressions is equivalent to the class of languages accepted by finite automata -- the regular languages.
Kleene’s theorem part 1 NFA regular expression proved Kleene’s Theorem part 2 DFA
Theorem 3.1 1st half of Kleene’s theorem: Let r be a regular expression. Then there exists some nondeterministic regular accepter that accepts L(r). Consequently, L(r) is a regular language. Proof strategy: for any regular expression, we show how to construct an equivalent NFA. Because regular expressions are defined recursively, the proof is by induction.
Base step: Give a NFA that accepts each of the simple or “base” languages, , {λ}, and {a} for each a . a
Inductive step: For each of the operations -- union, concatenation and Kleene star -- show how to construct an accepting NFA. Closure under union: M1 λ λ M2 λ λ
M2 λ M1 Closure under concatenation:
Closure under Kleene Star: λ M1 λ λ λ
Union, concatenation, and Kleene star of two regular languages will result in a regular language, since we can write a regular expression for them. Intersection and difference (complement) of two regular languages will also produce a regular language. The class of regular languages is said to be closed under these operations. (More in Ch. 4.) Closure properties of Regular Languages
Exercise Use the construction of the first half of Kleene’s theorem to construct a NFA that accepts the language L(ab*aa + bba*ab).
Exercise Use the construction of the first half of Kleene’s theorem to construct a NFA that accepts the language L(ab*aa + bba*ab). λ λ FA accepting ab*aa q0 qf λ λ FA accepting bba*ab
Exercise Construct a NFA that accepts the language corresponding to the regular expression: ((b(a+b)*a) + a)
Theorem 3.2 Kleene’s theorem part 2: Let L be a regular language. Then there exists a regular expression r such that L = L(r). Any language accepted by a finite automaton can be represented by a regular expression. The proof strategy: For any DFA, we show how create an equivalent regular expression. In other words, we describe an algorithm for converting any DFA to a regular expression.
ab (a+b) a* Expression diagram • A labeled directed graph (similar to a finite state diagram) in which transitions are labeled by regular expressions • Has a single start state with no incoming transitions • Has a single accepting state with no outgoing transitions • Example:
Algorithm for converting a DFA into an equivalent regular expression Initial step: Change every transition labeled a,b to (a+b). Add a single start state with an outgoing λ-transition to the current start state, and add a single final state with incoming λ-transitions from every previous final state. Main step: Until expression diagram has only two states (initial state and final state), repeat the following: -- pick some non-start, non-final state -- remove it from the diagram and re-label transitions with regular expressions so that the same language is accepted
ab*b ab*a The key step is removing states and re-labeling transitions with regular expressions. Here are some examples of how to do this. b a a ab*a b ab*a a b ab*b a b b a a
Exercise a,b a a (a+b) λ λ b b Continue ...
Exercise a (a+b) λ λ b (a+b) λ a*b a*b (a+b)*
a b a b Exercise Find a regular expression that corresponds to the language accepted by the following DFA.
a b a b Exercise λ λ a*bb*a a*bb* λ (a*bb*a)*a*bb*
Homework Find a regular expression that corresponds to the language accepted by the following DFA. 0 q1 q0 0 1 1 q2 1 0
Applications of regular expressions • Validation • checking that an input string is in valid format • example 1: checking format of email address on WWW entry form • example 2: UNIX regex command • Search and selection • looking for strings that match a certain pattern • example: UNIX grep command • Tokenization • converting sequence of characters (a string) into sequence of tokens (e.g., keywords, identifiers) • used in lexical analysis phase of compiler
Grammar • A grammar G = (V, T, S, P) consists of the following quadruple: • a set V of variables (non-terminal symbols), including a starting symbol S NT • a set T of terminals (same as an alphabet, ) • A start symbol S V • a set P of production rules • Example: S aS | A A bA | λ
Derivation • Strings are “derived” from a grammar • Example of a derivation S aS aaS aaA aabA aab • At each step, a nonterminal is replaced by the sentential form on the right-hand side of a rule (a sentential form can contain nonterminals and/or terminals) • Automata recognize languages; grammars generate languages
Context-free grammar • A grammar is said to be context-free if every rule has a single non-terminal on the left-hand side • This means you can apply the rule in any context. More complicated languages (such as English) have context-dependent rules. A language generated from a context-free grammar is called a context-free language
Regular grammar • A grammar is said to be right-linear if all productions are of the form AxB or Ax, where A and B are variables and x is a string of terminals • A grammar is said to be left-linear if all productions are of the form ABx or Ax • A regular grammar is either right-linear or left-linear.
Linear grammar • A grammar can be linear without being right- or left-linear. • A linear grammar is a grammar in which at most one variable can occur on the right side of any production rule, without any restriction on the position of the variable. • Example: S aS | A A Ab | λ
Another formalism for regular languages • Every regular grammar generates a regular language, and every regular language can be generated by a regular grammar. • A regular grammar is a simpler, special-case of a context-free grammar • The regular languages are a proper subset of the context-free languages
Exercises • Find a regular grammar that generates the language on = {a,b} consisting of all strings with no more than three a’s
Exercises • Find a regular grammar that generates the language on = {a,b} consisting of all strings with no more than three a’s S bS | aA | λ A bA | aB | λ B bB | aC | λ C bC | λ
Exercises • Find a regular grammar that generates the language consisting of even-length strings over {a,b}
Exercises • Find a regular grammar that generates the language consisting of even-length strings over {a,b} S aaS | abS | baS | bbS | λ
Exercise What language is generated by the following context-free (but not regular) grammar? S aSa | bSb | a | b | λ
Exercise What language is generated by the following context-free grammar? S aSa | bSb | a | b | λ The odd/even palindrome language: L = {w(a+b+λ)wR}