Languages, Grammars, and Regular Expressions Chuck Cusack • Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5th edition, by Kenneth Rosen
Alphabets and Languages • Definition: A vocabulary (or alphabet) V is a finite, nonempty set of symbols. • Definition: A word or sentence over V is a finite string of symbols from V. • Definition: The empty string or null string, denoted by λ, is the string containing no symbols. • Definition: The set of all words over V is denoted by V*. • Definition: A language over V is a subset of V*.
Language Examples • Let V={0,1} • 00110, 11111, 00, and 11 are words over V • 012, a234, and 222 are not words over V • V*={λ,0,1,00,01,10,11,000,…} • In other words, V* is the set of all binary strings, including the empty string • The set of strings consisting of only 0s is a language over V • {1,10,100,1000,10000,…} is a language over V
Concatenation • Definition: Let V be a vocabulary, and A and B be subsets of V*. The concatenation of A and B, denoted by AB, is the set of all strings of the form xy, where x ∈ A and y ∈ B. • Example: Let A={0, 10}, and B={1,12}. Then • AB={01, 012, 101, 1012} • BA={10, 110, 120, 1210} • AA={00, 010, 100, 1010} • AAA=A(AA)={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
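The definition of AB translates directly into a set comprehension. The following is a minimal sketch in Python; the function name `concat` is an illustrative choice, not something from the slides.

```python
def concat(A, B):
    """Return the concatenation AB = {x + y : x in A, y in B}."""
    return {x + y for x in A for y in B}

# The sets from the example on this slide.
A = {"0", "10"}
B = {"1", "12"}

print(sorted(concat(A, B)))             # AB
print(sorted(concat(B, A)))             # BA, which differs from AB
print(sorted(concat(A, concat(A, A))))  # AAA = A(AA)
```

Because the result is a set, duplicate strings produced by different pairs (x, y) are merged automatically.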
Concatenation: A^n • Definition: Let V be a vocabulary, and A a subset of V*. Then A^0={λ}, and for n>0, we define A^n=A^(n-1)A • Example: Let A={0, 10}. Then • A^0={λ} • A^1=A^0A={λ}A=A={0,10} • A^2=A^1A={00, 010, 100, 1010} • A^3=A^2A={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
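The recursive definition of the n-th power of A can be sketched as a loop that starts from {λ} and concatenates A onto the result n times. The names `concat` and `power` are illustrative, and the empty string λ is represented by Python's `""`.

```python
def concat(A, B):
    """Return the concatenation AB = {x + y : x in A, y in B}."""
    return {x + y for x in A for y in B}

def power(A, n):
    """Return A^n, using A^0 = {lambda} and A^n = A^(n-1) A."""
    result = {""}            # A^0 contains only the empty string
    for _ in range(n):
        result = concat(result, A)
    return result

A = {"0", "10"}
for n in range(4):
    print(n, sorted(power(A, n)))
```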
Kleene Closure • Definition: Let V be a vocabulary, and A a subset of V*. The Kleene closure of A, denoted by A*, is the set consisting of concatenations of an arbitrary number of strings from A. That is, A* = A^0 ∪ A^1 ∪ A^2 ∪ … • Definition: A+ is the set of concatenations of one or more strings from A. In other words, A+ = A^1 ∪ A^2 ∪ A^3 ∪ …
Kleene Closure Example • Example: Let A={0, 1}. Then • A^0={λ} • A^1={0,1} • A^2={00, 01, 10, 11} • A^3={000, 001, 010, 011, 100, 101, 110, 111} • A*={0,1}*={All binary strings} • Example: Let B={111}. Then • B^0={λ}, B^1={111}, B^2={111111} • B^3={111111111} • B* is the set of strings consisting of 3n 1s, for every n ≥ 0.
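Since A* is usually infinite, a program can only list a finite part of it. The sketch below collects every string of A* up to a chosen length; the function name `kleene_upto` and the length cutoff are assumptions made for illustration.

```python
def kleene_upto(A, max_len):
    """Return all strings in A* whose length is at most max_len."""
    result = {""}        # A^0 contributes the empty string
    frontier = {""}      # strings found in the previous round
    while frontier:
        # Extend each recently found string by one more string from A.
        frontier = {x + y for x in frontier for y in A
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_upto({"0", "1"}, 2), key=lambda w: (len(w), w)))
print(sorted(kleene_upto({"111"}, 9), key=len))
```

The loop stops once a round produces no new strings, which must happen because only finitely many strings fit under the length bound.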
Regular Sets • Definition: A regular set is a set that can be generated starting from the empty set, empty string, and single elements from the vocabulary, using concatenations, unions, and Kleene closures in arbitrary order. • We will give a more precise definition after we define a regular expression.
Regular Expressions • Definition: The regular expressions over a set I are defined recursively by: • ∅ (the empty set) is a regular expression, • λ (the set containing the empty string) is a regular expression, • x is a regular expression for all x ∈ I, • (AB), (A ∪ B), and A* are regular expressions if A and B are regular expressions • Definition: A regular set is a set represented by a regular expression. • Examples: 001*, 1(0 ∪ 1)0, (0 ∪ 1)*11, and AB*C are regular expressions
Regular Expression Example • The regular set defined by the regular expression 01* is the set of strings starting with a 0 followed by 0 or more 1s. • The regular set defined by (10)* is the set of strings containing 0 or more copies of 10. • The regular set defined by 0(0 ∪ 1)*1 is the set of all binary strings beginning with 0 and ending with 1. • The regular set defined by (0 ∪ 1)1(0 ∪ 1) is the set of strings {010, 011, 110, 111}.
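These four expressions can be checked by brute force with Python's `re` module, which writes union as `|` instead of ∪. The sketch below lists every binary string up to a chosen length that an expression matches in full; the helper name `matches` and the length bound are illustrative assumptions.

```python
import re
from itertools import product

def matches(pattern, max_len):
    """Binary strings of length <= max_len matched entirely by pattern."""
    words = ("".join(bits)
             for n in range(max_len + 1)
             for bits in product("01", repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

for pattern in ["01*", "(10)*", "0(0|1)*1", "(0|1)1(0|1)"]:
    print(pattern, sorted(matches(pattern, 4), key=lambda w: (len(w), w)))
```

`re.fullmatch` is used rather than `re.search` because a regular set contains only the strings described by the whole expression, not strings that merely contain a matching piece.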
Regular Expression Applications • Regular expressions are actually used quite often in computer science. • For instance, if you are editing a file with vi, and want to see if it contains the string blah followed by a number followed by any character followed by the letter Q, you can use the regular expression blah[0-9][0-9]*.Q • This works because vi uses regular expressions for searching.
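The same search can be tried outside the editor. The sketch below uses Python's `re` module rather than vi; for this particular pattern the two syntaxes happen to coincide. The name `contains_pattern` and the sample strings are made up for illustration.

```python
import re

# "blah", one digit, any further digits, any one character, then "Q".
PATTERN = re.compile(r"blah[0-9][0-9]*.Q")

def contains_pattern(text):
    """Return True if text contains a substring matching PATTERN."""
    return PATTERN.search(text) is not None

for sample in ["see blah42_Q here", "blahxQ", "blah7Q", "blah77Q"]:
    print(sample, contains_pattern(sample))
```

Note that `blah7Q` is rejected because the `.` must consume a character between the number and the Q, while `blah77Q` is accepted because the second 7 can play the role of that character.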
Grammars and Languages • Many languages can be defined by grammars. • We are particularly interested in phrase-structure grammars. • Before we can define phrase-structure grammars, we need to define a few more terms.
Special Symbols • Definition: A nonterminal symbol (or just nonterminal) is a symbol which can be replaced by other symbols. • Definition: A terminal symbol (or just terminal) is a symbol which cannot be replaced by other symbols. • Definition: The start symbol is a special symbol, usually denoted by S. • The set of terminals is denoted by T, and the set of nonterminals by N. • S is a nonterminal.
Productions • Definition: A production is a rule which tells how to replace one string from V* with another string. • Productions are denoted by α → β, which denotes that α can be replaced by β. • Example • Let S → A0, A → A1, and A → 0 be productions • Then I can replace S with A0 • Since I can replace A with A1, A0 can become A10 • Since I can replace A with 0, A10 can become 010 • Thus, starting from S, I can obtain 010
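The replacement steps in this example can be traced mechanically. In the sketch below a production is a pair of strings and `apply_production` (an illustrative name) rewrites the leftmost occurrence of the left side.

```python
def apply_production(word, lhs, rhs):
    """Replace the leftmost occurrence of lhs in word by rhs."""
    if lhs not in word:
        raise ValueError(f"{lhs!r} does not occur in {word!r}")
    return word.replace(lhs, rhs, 1)

# Apply S -> A0, then A -> A1, then A -> 0, as on the slide.
steps = ["S"]
for lhs, rhs in [("S", "A0"), ("A", "A1"), ("A", "0")]:
    steps.append(apply_production(steps[-1], lhs, rhs))

print(" => ".join(steps))
```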
Phrase-Structure Grammars • Definition: A phrase-structure grammar is a 4-tuple G=(V,T,S,P), where • V is a vocabulary • T ⊆ V is a set of terminals • S ∈ V is a start symbol • P is a set of productions • N=V-T is the set of nonterminals • Each production contains at least one nonterminal on its left side. • We will always use S as the start symbol.
Direct Derivations • Let G=(V,T,S,P) be a phrase-structure grammar. • Let A=lαr and B=lβr, where l, α, β, r ∈ V* (here l and r are the unchanged left and right parts of the string). • Let α → β be a production. • Then we can derive B from A in one step. • Thus we say that B is directly derivable from A. • We write this as A ⇒ B
Derivations • Let G=(V,T,S,P) be a phrase-structure grammar • Let A1, A2,…,An ∈ V* be such that A1 ⇒ A2 ⇒ … ⇒ An • Then we say that An is derivable from A1. • We write A1 ⇒* An • The sequence of productions used is called a derivation.
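One way to make direct derivability concrete is to compute every string reachable in a single step and then check a chain link by link. This is a sketch only: the function names are illustrative, and the productions S → A0, A → A1, A → 0 are reused as sample data.

```python
def directly_derivable(word, productions):
    """All strings B with word => B, trying every production at every position."""
    results = set()
    for lhs, rhs in productions:
        start = word.find(lhs)
        while start != -1:
            results.add(word[:start] + rhs + word[start + len(lhs):])
            start = word.find(lhs, start + 1)
    return results

def is_derivation(chain, productions):
    """True if each string in chain is directly derivable from the previous one."""
    return all(b in directly_derivable(a, productions)
               for a, b in zip(chain, chain[1:]))

P = [("S", "A0"), ("A", "A1"), ("A", "0")]
print(sorted(directly_derivable("A0", P)))
print(is_derivation(["S", "A0", "A10", "010"], P))
```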
Generating Languages • Let G=(V,T,S,P) be a grammar • Definition: The language generated by G, denoted L(G), is the set of all strings of terminals that are derivable from S. • Put another way, L(G)={w ∈ T* | S ⇒* w}
Example 1 Let G be the grammar with • V={S,0,1} • T={0,1} • P={S → S0, S → 0} • Clearly S ⇒ 0, so 0 ∈ L(G) • Also, S ⇒ S0 ⇒ 00, so 00 ∈ L(G) • And, S ⇒ S0 ⇒ S00 ⇒ 000, so 000 ∈ L(G) • It is not hard to see that L(G) is the language consisting of all strings with 1 or more 0s.
Example 2 Let G be the grammar with V={S,0,1}, T={0,1}, and P={S → SS, S → 1, S → 0} • Clearly S ⇒ 0, so 0 ∈ L(G) • Also, S ⇒ 1, so 1 ∈ L(G) • Since S ⇒ SS ⇒ S1 ⇒ 01, we have 01 ∈ L(G) • In general, we can get a sequence of Ss, and replace each with either 0 or 1. • Given this fact, it is easy to see that L(G)={0,1}+, the set of all non-empty binary strings
Example 3 Let G be the grammar with V={S,A,B,0,1}, T={0,1}, and P={S → AB, B → BB, A → AA, A → 0, B → 1} • Clearly S ⇒ AB ⇒ 0B ⇒ 01, so 01 ∈ L(G) • Also, S ⇒ AB ⇒ AAB ⇒ 0AB ⇒ 00B ⇒ 001, so 001 ∈ L(G) • Similarly, we can get 011, 0011, 0001, etc. • In general, we can get a sequence of n 0s followed by m 1s, where n>0, m>0. • Thus L(G)={0^n 1^m | m and n are positive integers}
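The claims in Examples 1 to 3 can be spot-checked by exhaustively searching derivations from S. The sketch below is a breadth-first search with illustrative names; it assumes that no production makes a string shorter (true for these three grammars), so any intermediate string longer than the bound can be discarded safely.

```python
from collections import deque

def directly_derivable(word, productions):
    """All strings obtainable from word by one production application."""
    results = set()
    for lhs, rhs in productions:
        start = word.find(lhs)
        while start != -1:
            results.add(word[:start] + rhs + word[start + len(lhs):])
            start = word.find(lhs, start + 1)
    return results

def language_upto(productions, terminals, max_len, start="S"):
    """Strings of terminals, of length <= max_len, derivable from start."""
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        word = queue.popleft()
        if all(ch in terminals for ch in word):
            words.add(word)
        for nxt in directly_derivable(word, productions):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words

T = {"0", "1"}
example1 = [("S", "S0"), ("S", "0")]
example2 = [("S", "SS"), ("S", "1"), ("S", "0")]
example3 = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
for P in (example1, example2, example3):
    print(sorted(language_upto(P, T, 4), key=lambda w: (len(w), w)))
```

A bounded search like this can support or refute a conjecture about L(G) for short strings, but it is not a proof of the general statement.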
Type 0 Grammars • Type 0 grammars have no restrictions on the types of productions that are allowed. • Thus type 0 grammars are just phrase-structure grammars. • This is not too exciting, so we will move on to type 1 grammars.
Type 1 Grammars • In a type 1 grammar, productions are of the form • αXβ → αγβ, where X ∈ N and α, β, γ ∈ V* with γ ≠ λ • (or S → λ, but ignore this for now) • Thus, a production can only be applied if the symbol X is surrounded by α and β. • In other words, the production can only be applied in a certain context. • This is why type 1 grammars are also called context-sensitive grammars.
Type 2 Grammars • Productions are of the form • X → α, where X ∈ N and α ∈ V*. • Thus, if X is in a string, we can replace X with α no matter what surrounds X. • In other words, the context in which X appears does not matter. • This is why type 2 grammars are called context-free grammars. • Context-free grammars produce context-free languages.
Type 3 Grammars • Productions are of the form • X → a, where X ∈ N and a ∈ T • X → aY, where X,Y ∈ N and a ∈ T • S → λ • Type 3 grammars are called regular grammars. • Regular grammars produce regular languages. • It is easy to see that a type 3 grammar is a type 2 grammar.
Types of Grammars • The following summarizes the relationships between the types of grammars • Type 0: phrase-structure • Type 1: context-sensitive • Type 2: context-free • Type 3: regular
Regular Grammar Example • Let G be the grammar with • V={S,A,0,1}, • T={0,1}, and • P={S → 0A, A → 0A, A → 1A, A → 1} • We can determine what the language is by constructing a few words. • S ⇒ 0A ⇒ 01 • S ⇒ 0A ⇒ 00A ⇒ 001 and S ⇒ 0A ⇒ 01A ⇒ 011 • S ⇒ 0A ⇒ 00A ⇒ 000A ⇒ 0001 and S ⇒ 0A ⇒ 00A ⇒ 001A ⇒ 0011 • S ⇒ 0A ⇒ 01A ⇒ 010A ⇒ 0101 and S ⇒ 0A ⇒ 01A ⇒ 011A ⇒ 0111 • We can see that in general, L(G) is the set of binary strings beginning with 0 and ending with 1.
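Whether a set of productions has the type 3 or type 2 shape can be checked mechanically. The sketch below follows the production forms stated for type 2 and type 3 grammars in this deck; the function names are illustrative, symbols are single characters, and λ is written as the empty string.

```python
def is_context_free(productions, nonterminals):
    """Type 2: every left side is a single nonterminal."""
    return all(lhs in nonterminals for lhs, _ in productions)

def is_regular(productions, nonterminals, terminals, start="S"):
    """Type 3: each production is X -> a, X -> aY, or S -> lambda."""
    for lhs, rhs in productions:
        if lhs not in nonterminals:
            return False
        if rhs == "":
            if lhs != start:          # only the start symbol may produce lambda
                return False
        elif len(rhs) == 1:
            if rhs not in terminals:  # X -> a
                return False
        elif len(rhs) == 2:
            if rhs[0] not in terminals or rhs[1] not in nonterminals:
                return False          # X -> aY
        else:
            return False
    return True

T = {"0", "1"}
this_slide = [("S", "0A"), ("A", "0A"), ("A", "1A"), ("A", "1")]
example3 = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
print(is_regular(this_slide, {"S", "A"}, T))
print(is_regular(example3, {"S", "A", "B"}, T),
      is_context_free(example3, {"S", "A", "B"}))
```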
Regular Languages and Sets • Theorem: Let A be a subset of V*. Then A is a regular language if and only if A is a regular set. • In other words, a language defined by a regular grammar can also be defined by a regular expression, and vice-versa. • Example: We just saw that the grammar with V={S,A,0,1}, T={0,1}, and P={S → 0A, A → 0A, A → 1A, A → 1} generates the set of binary strings beginning with 0 and ending with 1. • Recall that the regular set defined by 0(0 ∪ 1)*1 is also the set of all binary strings beginning with 0 and ending with 1.
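The agreement between this grammar and this regular expression can be spot-checked for short strings. The sketch below is not a proof of the theorem; it only compares the two descriptions up to an arbitrary length bound. The names `generate` and `regex_words` are illustrative, and the union ∪ is written `|` in Python's `re` syntax.

```python
import re
from itertools import product

# Productions S -> 0A, A -> 0A, A -> 1A, A -> 1, grouped by left side.
P = {"S": ["0A"], "A": ["0A", "1A", "1"]}

def generate(max_len, form="S"):
    """Words of length <= max_len derivable from form.

    In this grammar a sentential form is a run of terminals followed by
    at most one nonterminal, so only the last symbol needs expanding.
    """
    if len(form) > max_len:
        return set()
    if form[-1] not in P:            # no nonterminal left: form is a word
        return {form}
    words = set()
    for rhs in P[form[-1]]:
        words |= generate(max_len, form[:-1] + rhs)
    return words

def regex_words(pattern, max_len):
    """Binary strings of length <= max_len matched entirely by pattern."""
    return {"".join(bits)
            for n in range(max_len + 1)
            for bits in product("01", repeat=n)
            if re.fullmatch(pattern, "".join(bits))}

print(generate(6) == regex_words("0(0|1)*1", 6))
```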
Grammar Applications • Context-free grammars are used to define the syntax of most programming languages. • Regular grammars are used in several applications, including the following • Searching text for patterns • Lexical analysis (during program compilation) • Efficient algorithms exist to determine if a string is in a context-free or regular language. • This is important for tasks like determining whether or not a program is syntactically valid.
Backus-Naur Form • Backus-Naur form (BNF) is a more compact representation of productions in a type 2 grammar. • All productions with the same left hand side are combined into one production • The symbol → is replaced with ::= • All nonterminals are enclosed in < and > • The right hand sides of the various productions are combined, and separated by |
Backus-Naur Form Example • Consider the set of productions • S → AB • B → BB • A → AA • A → 0 • B → 1 • In BNF, they are represented by • <S> ::= <A><B> • <B> ::= <B><B> | 1 • <A> ::= <A><A> | 0
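The conversion on this slide is mechanical enough to sketch in a few lines: group productions by left side, bracket each nonterminal, and join the alternatives with `|`. The function name `to_bnf` is illustrative, and the sketch assumes every symbol is a single character.

```python
def to_bnf(productions, nonterminals):
    """Render (lhs, rhs) productions as a list of BNF lines."""
    def wrap(symbols):
        # Bracket nonterminals; leave terminals as they are.
        return "".join(f"<{s}>" if s in nonterminals else s for s in symbols)

    grouped = {}                     # left side -> list of alternatives
    for lhs, rhs in productions:
        grouped.setdefault(lhs, []).append(wrap(rhs))
    return [f"{wrap(lhs)} ::= {' | '.join(alts)}"
            for lhs, alts in grouped.items()]

P = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
for line in to_bnf(P, {"S", "A", "B"}):
    print(line)
```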
Backus-Naur Form Example 2 • The Backus-Naur form for the production of a signed integer is • <signed integer> ::= <sign><integer> • <sign> ::= + | - • <integer> ::= <digit> | <digit><integer> • <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
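Each BNF rule above can be mirrored by one small function, which gives a simple recognizer for signed integers. This is a sketch with illustrative function names; note that, exactly as the grammar is written, the sign is mandatory.

```python
def is_digit(s):
    # <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    return len(s) == 1 and s in "0123456789"

def is_sign(s):
    # <sign> ::= + | -
    return s in ("+", "-")

def is_integer(s):
    # <integer> ::= <digit> | <digit><integer>
    if is_digit(s):
        return True
    return len(s) > 1 and is_digit(s[0]) and is_integer(s[1:])

def is_signed_integer(s):
    # <signed integer> ::= <sign><integer>
    return len(s) >= 2 and is_sign(s[0]) and is_integer(s[1:])

for text in ["+42", "-0", "42", "+", "+4a"]:
    print(repr(text), is_signed_integer(text))
```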
Backus-Naur Form Applications • Specifying the syntax for programming languages including • Java • LISP • Specifying database languages • SQL • Specifying markup languages • XML