Chapter 2 Grammar and Formal Language. Zhang Jing, Yu SiLiang College of Computer Science & Technology Harbin Engineering University.
The goal of this chapter is to help readers review the basic mathematics behind compiler theory and to understand the symbolic language of that mathematics: formal language. Specifically, we discuss the concepts of string, grammar, parsing tree, formal language, and so on. These concepts are basic knowledge for the reader and will make the following chapters easier to follow.
zhangjing@hrbeu.edu.cn
String
1. Alphabet
An alphabet is a finite, non-empty set of characters. For example, A={a, b, c, …, z} and B={0, 1} are alphabets.
2. String
A string is a finite sequence of characters taken from an alphabet. The empty string is written ε. Strings are usually denoted by lowercase letters.
For example, the alphabet A={a, b, c} has the characters a, b, c, and its strings are a, b, c, ab, ac, aa, abc, …
The alphabet B={0, 1} has the characters 0 and 1, and its strings are 0, 1, 00, 01, 10, 11, 000, …, 01000, …
Note: The order of the characters matters. For example, the strings “ab” and “ba” are different, and 001 and 010 are different strings.
3. String length
The length of a string is the number of characters in it. The length of a string x is written |x|. Taking alphabet B as an example, |01|=2, |000|=3, |01000|=5. The length of the empty string is |ε|=0.
4. Connection of strings
Given strings x and y, writing y after x gives “xy”, which we call the connection (concatenation) of x and y. If x=abc and y=de, then xy=abcde and yx=deabc.
Note: εx=xε=x
5. Power of a string
If x is a string, then
x^2=xx
x^3=xxx
……
x^n=xx…x=x^{n-1}x=xx^{n-1}
x^n is the n-th power of x, and x^0=ε.
Let x=aTb. The powers of x and their lengths are as follows.
x^0=ε, |x^0|=0
x^1=aTb, |x^1|=3
x^2=aTbaTb, |x^2|=6
x^3=aTbaTbaTb, |x^3|=9
6. Head and tail of a string
If z=xy is a string, then x is a head of z and y is a tail of z. If y≠ε, x is a true (proper) head of z. Similarly, if x≠ε, y is a true (proper) tail of z.
For the string u=abc, the heads of u are ε, a, ab, abc, and its true heads are ε, a, ab. The tails of u are ε, c, bc, abc, and its true tails are ε, c, bc.
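The string operations of items 3 to 6 can be tried out with ordinary Python strings. This is only a small sketch: the function names power, heads, tails, and proper are made up for illustration, and each character of a Python string stands for one symbol.

```python
def power(x: str, n: int) -> str:
    """Return x^n: x connected with itself n times (x^0 is the empty string)."""
    return x * n

def heads(z: str) -> list[str]:
    """All heads of z, from the empty string up to z itself."""
    return [z[:i] for i in range(len(z) + 1)]

def tails(z: str) -> list[str]:
    """All tails of z, from the empty string up to z itself."""
    return [z[i:] for i in range(len(z), -1, -1)]

def proper(parts: list[str], z: str) -> list[str]:
    """Drop z itself, leaving only the true (proper) heads or tails."""
    return [p for p in parts if p != z]

x, y = "abc", "de"
print(len("01000"), x + y, y + x)            # length and connection
print([power("aTb", n) for n in range(4)])   # x^0 .. x^3
print(heads("abc"), proper(tails("abc"), "abc"))
```

The empty Python string "" plays the role of ε, so εx=xε=x holds automatically.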
7. Product of string sets A and B
If A and B are sets of strings, their product is
AB={xy | (x∈A)∧(y∈B)}
That is, AB is the set of all connections xy where x belongs to A and y belongs to B. For the sets A={a, b} and B={0, 1}, AB={a0, a1, b0, b1}.
Note: {ε}A=A{ε}=A
8. Power of a string set
If A is a set of strings, the powers of A are
A^0={ε}
A^1=A
……
A^n=AA…A=AA^{n-1}=A^{n-1}A
If A={a, b}, then
A^0={ε}
A^2={a, b}{a, b}={aa, ab, ba, bb}
A^3={aaa, aab, aba, abb, baa, bab, bba, bbb}
A^2 has 4 elements (namely, 2^2=4), while A^3 has 8 elements (namely, 2^3=8).
9. Positive closure of a string set
The positive closure of a string set A is written A+:
A+=A^1∪A^2∪…∪A^n∪…
If A={a, b}, then A+={a, b}∪{aa, ab, ba, bb}∪…={a, b, aa, ab, ba, bb, aaa, …, bbb, …}
10. Closure of a string set
The closure of a string set A is written A*:
A*=A^0∪A+={ε}∪A+
If A={a, b}, then A*={ε, a, b, aa, ab, ba, bb, aaa, …, bbb, …}
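The set operations of items 7 to 10 can be sketched with Python sets. Because A+ and A* are infinite, the sketch only collects the powers up to a chosen bound. The names set_product, set_power, and closure_up_to are illustrative, not part of the chapter.

```python
from itertools import product as cartesian

def set_product(A: set[str], B: set[str]) -> set[str]:
    """AB = {xy | x in A and y in B}."""
    return {x + y for x, y in cartesian(A, B)}

def set_power(A: set[str], n: int) -> set[str]:
    """A^n, starting from A^0 = {ε} (ε is the empty Python string)."""
    result = {""}
    for _ in range(n):
        result = set_product(result, A)
    return result

def closure_up_to(A: set[str], max_power: int, positive: bool = False) -> set[str]:
    """Finite part of A* (or of A+ when positive=True): the union of A^n for n <= max_power."""
    start = 1 if positive else 0
    out: set[str] = set()
    for n in range(start, max_power + 1):
        out |= set_power(A, n)
    return out

A = {"a", "b"}
print(sorted(set_product(A, {"0", "1"})))
print(len(set_power(A, 2)), len(set_power(A, 3)))
print(sorted(closure_up_to(A, 2), key=len))
```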
Grammar and Formal Language
A compiler includes a scanner, a parser, a semantic analyzer, and so on. How do they work? What principles are they based on? Are there rules that connect them? The theory that helps us analyze the compiling process is formal language theory. A formal language is completely described and rigidly governed by rules. This section introduces the basic concepts of rule, grammar, and language.
1. Rule
A rule is an ordered pair (U, x), usually written U ::= x (or U→x). U is the left side of the rule and the non-empty string x is the right side; that is, the left side is defined by the right side. The two sides are connected by “::=” or “→”. This notation is called Backus Naur Form (BNF).
The following are all rules.
S::=0S1
S::=01
U::=Tb
T::=a
2. Grammar G[Z]
A grammar G[Z] is a finite, non-empty set of rules. Z is the identified symbol (or start symbol), and it must appear on the left side of at least one rule. The set of all symbols that appear in the rules is called the vocabulary V.
Example 2.1. G[S]={S::=0S1, S::=01}
This grammar has two rules, and its vocabulary is V={S, 0, 1}.
Example 2.2. G[U]={U::=Tb, T::=a}
Here V={U, T, a, b}, and the identified symbol of the grammar is U.
Example 2.3. G[Z]={Z::=D, Z::=ZD, D::=0|1|…|9}
Here V={Z, D, 0, 1, …, 9}, and the start symbol of the grammar is Z.
Note: The rule D::=0|1|…|9 is shorthand for the rules
D::=0
D::=1
D::=2
…
D::=9
The symbol “|” means a choice among alternatives.
3. Nonterminal symbol
A symbol that appears on the left side of a rule is called a nonterminal symbol. The set of all nonterminal symbols is written VN. For Examples 2.1, 2.2, and 2.3, VN is respectively
VN={S}
VN={U, T}
VN={Z, D}
4. Terminal symbol
A symbol of the vocabulary V that does not belong to VN is called a terminal symbol. The set of all terminal symbols is written VT. For the examples above, VT is respectively
VT={0, 1}
VT={a, b}
VT={0, 1, …, 9}
Note: The vocabulary V is defined as V=VN∪VT.
5. The definition of grammar
A grammar is a finite, non-empty set of rules, and it is formally defined as a 4-tuple
G=(VN, VT, P, S)
where VN is the set of nonterminal symbols, VT is the set of terminal symbols, P is the set of rules, and S is the start symbol.
Example 2.3 can be written as G[Z]=(VN, VT, P, Z) with
VN={Z, D}
VT={0, 1, …, 9}
P: Z::=D
Z::=ZD
D::=0|1|…|9
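One possible way to hold the 4-tuple in a program is sketched below for Example 2.3. The class name Grammar and the helper make_grammar are hypothetical; the helper simply applies the definitions above (VN is every symbol on a left side, VT is every other symbol of the vocabulary).

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    VN: frozenset   # nonterminal symbols
    VT: frozenset   # terminal symbols
    P: tuple        # rules as (left, right) pairs; right is a tuple of symbols
    S: str          # start symbol

def make_grammar(rules, start):
    """Build (VN, VT, P, S) from a list of rules and a start symbol."""
    VN = frozenset(left for left, _ in rules)
    V = VN | {sym for _, right in rules for sym in right}   # vocabulary
    return Grammar(VN, frozenset(V - VN), tuple(rules), start)

# Example 2.3: Z ::= D, Z ::= Z D, D ::= 0 | 1 | ... | 9
digits = "0123456789"
rules = [("Z", ("D",)), ("Z", ("Z", "D"))] + [("D", (d,)) for d in digits]
G = make_grammar(rules, "Z")
print(sorted(G.VN), sorted(G.VT), G.S)
```

Writing the alternatives of D as ten separate pairs mirrors the note that “|” is only shorthand for several rules.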
So far we have learned what a grammar is and how to represent it. Next we want to decide whether a given string can be produced by a grammar.
Take the string “12” and the grammar of Example 2.3:
Z ⇒ ZD ⇒ DD ⇒ 1D ⇒ 12
So “12” is a string of the grammar in Example 2.3. The procedure above is a derivation that begins with the start symbol. In each step of a derivation, one nonterminal symbol is replaced by the right side of one of its rules.
6. Derivation
A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (a string of terminal symbols only).
Let v=xUy and w=xuy be strings, where x and y are strings in V*. If U::=u is a rule of the grammar, then replacing U in v=xUy by u gives
xUy ⇒ xuy (v ⇒ w)
This is a direct derivation: we say that v directly produces w, or that w is directly derived from v.
Note: With x=ε and y=ε, the rule U::=u gives the direct derivation U ⇒ u.
The grammar G[S] of Example 2.1 has the following direct derivations.
S ⇒ 01 (x=y=ε, by rule S::=01)
S ⇒ 0S1 (x=y=ε, by rule S::=0S1)
00S11 ⇒ 000S111 (x=00, y=11, by rule S::=0S1)
00S11 ⇒ 000111 (x=00, y=11, by rule S::=01)
A derivation can be written as
v=u_0 ⇒ u_1 ⇒ u_2 ⇒ … ⇒ u_{n-1} ⇒ u_n=w
where u_0, u_1, …, u_n are strings in V* and n>0. We call this a derivation of length n and write
v ⇒+ w
If either v ⇒+ w or v=w holds, we write
v ⇒* w
For Example 2.1, one derivation is
0S1 ⇒ 00S11 ⇒ 000S111 ⇒ 00001111
so 0S1 ⇒+ 00001111 (the length of the derivation is 3).
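A direct derivation xUy ⇒ xuy is just a string replacement at one position, which the following sketch makes concrete for Example 2.1. The names RULES, direct_derivations, and is_derivation are illustrative, and the sketch assumes every symbol is a single character.

```python
RULES = [("S", "0S1"), ("S", "01")]   # Example 2.1, one character per symbol

def direct_derivations(v: str, rules=RULES) -> set[str]:
    """All w with v => w: replace one occurrence of a rule's left side U
    in v = xUy by the rule's right side u, giving w = xuy."""
    results = set()
    for left, right in rules:
        for i, symbol in enumerate(v):
            if symbol == left:
                results.add(v[:i] + right + v[i + 1:])
    return results

def is_derivation(steps: list[str], rules=RULES) -> bool:
    """True if every string in steps directly derives the next one."""
    return all(w in direct_derivations(v, rules) for v, w in zip(steps, steps[1:]))

chain = ["0S1", "00S11", "000S111", "00001111"]
print(direct_derivations("00S11"))
print(is_derivation(chain), len(chain) - 1)   # the length is the number of steps
```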
7. Sentence pattern
Given a grammar G[Z], if Z ⇒* x, then the string x is a sentence pattern (sentential form) of G[Z]. In other words, the sentence patterns of G[Z] are exactly the strings that can be derived from the identified symbol Z.
In Example 2.1, S is the identified symbol, and
S ⇒ 01
S ⇒ 0S1 ⇒ 00S11 ⇒ 000111
So the strings 01, 0S1, 00S11, and 000111 are sentence patterns of G[S]. The strings 000S11 and 0000111 in V* are not sentence patterns of G[S], because they cannot be derived from S.
8. Sentence
A sentence is a sentence pattern that consists only of terminal symbols. In the example above, 01 and 000111 are sentences, but 0S1 and 00S11 are not. A correct source program should be composed of sentences.
9. Language
The language of a grammar G[Z] is the set of all sentences of G[Z]:
L(G)={x | Z ⇒+ x, x∈VT+}
The grammar of Example 2.1 is G=({S}, {0, 1}, {S::=0S1, S::=01}, S). Number its rules (1) S::=0S1 and (2) S::=01. Applying rule (1) n-1 times and then rule (2) once gives
S ⇒ 0S1 ⇒ 00S11 ⇒ 000S111 ⇒ … ⇒ 0^{n-1}S1^{n-1} ⇒ 0^n1^n
Namely, S ⇒+ 0^n1^n, so
L(G)={0^n1^n | n≥1}
Note: For grammar G, VT+={0, 1, 00, 01, 10, 11, …}, so L(G) is a subset of VT+.
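The claim L(G)={0^n1^n | n≥1} can be checked for short sentences by searching through all sentence patterns up to a length bound. This is a rough sketch, not a general algorithm: the names are made up, and cutting off long strings is only safe because no rule of this grammar shortens a string.

```python
from collections import deque

RULES = [("S", "0S1"), ("S", "01")]   # Example 2.1
NONTERMINALS = {"S"}

def sentences(start: str = "S", max_len: int = 8) -> set[str]:
    """Breadth-first search over sentence patterns no longer than max_len."""
    seen, found = {start}, set()
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if not any(c in NONTERMINALS for c in v):
            found.add(v)              # terminal symbols only: v is a sentence
            continue
        for left, right in RULES:
            for i, c in enumerate(v):
                if c == left:
                    w = v[:i] + right + v[i + 1:]
                    if len(w) <= max_len and w not in seen:
                        seen.add(w)
                        queue.append(w)
    return found

print(sorted(sentences(max_len=8), key=len))
```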
Example 2.4
Grammar G1=(VN, VT, P, U), where
VN={U, T}
VT={a, b}
P={U::=Tb, T::=a}
Grammar G2=(VN, VT, P, U), where
VN={U}
VT={a, b}
P={U::=ab}
What are L(G1) and L(G2)?
In grammar G1 the only derivation of a sentence is U ⇒ Tb ⇒ ab, so L(G1)={ab}. Similarly, in grammar G2 the only sentence derived from U is “ab”, so L(G2)={ab}.
From this analysis we see that G1≠G2 but L(G1)=L(G2). Different grammars can generate the same language; in that case we say that G1 and G2 are equivalent.
Example 2.5
G[S]: VN={S, A}, VT={3, +, *}, and P is
S::=S+S|S*S|A
A::=3
S ⇒ S+S ⇒ S+S*S ⇒+ A+A*A ⇒+ 3+3*3. Here the result of the calculation is 12, because 3+3*3 is grouped as 3+(3*3).
S ⇒ S*S ⇒ S+S*S ⇒+ A+A*A ⇒+ 3+3*3. Here the result of the calculation is 18, because 3+3*3 is grouped as (3+3)*3.
Both derivations produce the same sentence 3+3*3 of L(G), but the calculated results are different. That is to say, the same sentence can have different meanings.
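The two readings of 3+3*3 can be written down as two trees and evaluated. In this sketch a tree is either the number 3 (from A::=3) or a triple (left, operator, right); this representation and the names evaluate and leaves are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Compute the value of a tree from the bottom up."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))

def leaves(tree) -> str:
    """Read the sentence back off the leaves, from left to right."""
    if isinstance(tree, int):
        return str(tree)
    left, op, right = tree
    return leaves(left) + op + leaves(right)

tree_1 = (3, "+", (3, "*", 3))   # S => S+S first, then the right S => S*S
tree_2 = ((3, "+", 3), "*", 3)   # S => S*S first, then the left S => S+S

for tree in (tree_1, tree_2):
    print(leaves(tree), "=", evaluate(tree))
```

Both trees have the same leaves, so the sentence alone does not tell us which value is meant.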
10. Phrase
Let G[Z] be a grammar and w=xuy a sentence pattern of it. If
Z ⇒* xUy and U ⇒+ u
then u is a phrase of the sentence pattern xuy. If in addition U::=u is a rule, that is, U ⇒ u in a single step, then u is a simple phrase of xuy.
Example 2.6.
G=({U, T, V, M}, {a, b}, P, U), where P is
U::=TM
T::=Va
V::=a
M::=b
The string “aab” has the derivation
U ⇒ TM ⇒ VaM ⇒ aaM ⇒ aab
Because T ⇒+ aa, aa is a phrase of aab. In addition, because V::=a and M::=b are rules, the first a and the b are simple phrases of aab.
11. Handle
The leftmost simple phrase of a sentence pattern is its handle. In Example 2.6 there are two simple phrases, a and b, and the left one, “a”, is the handle.
12. Leftmost (rightmost) derivation
A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded. A rightmost derivation always expands the rightmost nonterminal instead. A derivation may be neither leftmost nor rightmost.
In Example 2.6, the leftmost derivation is
U ⇒ TM ⇒ VaM ⇒ aaM ⇒ aab
The rightmost derivation is
U ⇒ TM ⇒ Tb ⇒ Vab ⇒ aab
A derivation that is neither leftmost nor rightmost is
U ⇒ TM ⇒ VaM ⇒ Vab ⇒ aab
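The leftmost and rightmost derivations of Example 2.6 can be generated mechanically. The sketch below relies on the fact that every nonterminal of this grammar has exactly one rule, so there is never a choice of rule to make; the names RULES and derive are illustrative.

```python
RULES = {"U": "TM", "T": "Va", "V": "a", "M": "b"}   # Example 2.6, one rule per nonterminal

def derive(start: str, leftmost: bool = True) -> list[str]:
    """Expand the leftmost (or rightmost) nonterminal until none remain."""
    steps = [start]
    current = start
    while True:
        positions = [i for i, c in enumerate(current) if c in RULES]
        if not positions:
            return steps
        i = positions[0] if leftmost else positions[-1]
        current = current[:i] + RULES[current[i]] + current[i + 1:]
        steps.append(current)

print(" => ".join(derive("U", leftmost=True)))
print(" => ".join(derive("U", leftmost=False)))
```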
We define:
(1) A formal (canonical) derivation is a rightmost derivation.
(2) A formal sentence pattern is a sentence pattern produced by a formal derivation.
(3) A rule is left recursive if it has the form U::=U…
(4) A rule is right recursive if it has the form U::=…U
(5) Left and right recursion within a single rule are both direct recursion.
(6) A grammar is left recursive if it has a derivation U ⇒+ U…
(7) A grammar is right recursive if it has a derivation U ⇒+ …U
Parsing tree and ambiguity
In a compiler, the step after the lexical analyzer is the parser. The input of the parser is a sequence of tokens and its output is a parsing tree. The parsing tree is a very important concept, and it also helps in understanding some of the concepts introduced earlier.
1. Parsing tree
For a grammar G=(VN, VT, P, Z), a parsing tree is a tree with the following features:
(1) The root is labeled by the start symbol.
(2) Each node is labeled by a symbol from V=VN∪VT.
(3) If a node is not a leaf, that is, it has at least one child node, it must be labeled by a nonterminal symbol.
(4) If a node is labeled U and its child nodes, from left to right, are labeled x1, x2, …, xn, then U::=x1x2…xn is a rule of the grammar.
Example 2.7
G[Z]=(VN, VT, P, Z)
VN={Z, D}
VT={0, 1, …, 9}
P: Z::=D
Z::=ZD
D::=0|1|…|9
The derivation of the sentence 12 is
Z ⇒ ZD ⇒ DD ⇒ 1D ⇒ 12
The parsing tree of the sentence 12 is shown in Fig. 2.1.
The derivation of the sentence aab in Example 2.6 is U ⇒ TM ⇒ VaM ⇒ aaM ⇒ aab, and its parsing tree is shown in Fig. 2.2.
In Fig. 2.2 there are three subtrees below the root, rooted at T, V, and M. The leaves of each subtree, read from left to right, form a phrase, so the phrases are “aa”, “a”, and “b”. The simple phrases are “a” and “b”. So a simple phrase can be described as the leaves of a subtree that consists only of a parent and its children. The handle of aab is “a”, because it is the leftmost simple phrase.
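Reading phrases off a parsing tree can be sketched as a walk over the subtrees. The nested-tuple form of the tree below is an assumption reconstructed from the derivation U ⇒ TM ⇒ VaM ⇒ aaM ⇒ aab, and the function names are illustrative. Note that the walk also reports the whole sentence aab, which comes from the subtree rooted at U itself.

```python
# Parsing tree of "aab" in Example 2.6: (label, child, child, ...); leaves are plain strings.
TREE = ("U", ("T", ("V", "a"), "a"), ("M", "b"))

def yield_of(node) -> str:
    """The leaves of a subtree, read from left to right."""
    if isinstance(node, str):
        return node
    return "".join(yield_of(child) for child in node[1:])

def phrases(node, offset: int = 0) -> list:
    """Return (start position, text, is_simple) for every subtree.
    A phrase is simple when its subtree has only a parent and leaf children."""
    if isinstance(node, str):
        return []
    found = [(offset, yield_of(node), all(isinstance(c, str) for c in node[1:]))]
    for child in node[1:]:
        found += phrases(child, offset)
        offset += len(yield_of(child))
    return found

all_phrases = phrases(TREE)
simple = sorted(p for p in all_phrases if p[2])
handle = simple[0]   # the leftmost simple phrase
print(all_phrases)
print("handle:", handle[1], "at position", handle[0])
```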
2. Ambiguity
A grammar that produces more than one parsing tree for some sentence is called an ambiguous grammar. A grammar in which every sentence has exactly one parsing tree is an unambiguous grammar.
Example 2.8
Grammar G[E], with P:
E::=i
E::=E+E
E::=E*E
E::=(E)
The string i*i+i has these two derivations:
(1) E ⇒ E+E ⇒ E*E+E ⇒ i*E+E ⇒ i*i+E ⇒ i*i+i
(2) E ⇒ E*E ⇒ i*E ⇒ i*E+E ⇒ i*i+E ⇒ i*i+i
Their parsing trees are shown in Fig. 2.3.
Example 2.9
Grammar G[S], with P:
S::=IF B THEN S
S::=IF B THEN S ELSE S
S::=S’
S’::=S2|S3
B::=B1|B2
The string is “IF B1 THEN IF B2 THEN S2 ELSE S3”. How do we obtain its parsing tree, and is that parsing tree unique? The answer is shown in Fig. 2.4.
3. Removing ambiguity
An ambiguous grammar is not acceptable in a compiler. There are several methods to eliminate ambiguity.
(1) Modify the ambiguous grammar to turn it into an unambiguous grammar. Example 2.8 can be modified in this way.
(2) Add a restriction to the semantics. For example, we can add a condition to Example 2.9 to eliminate the ambiguity: each “ELSE” must match the closest preceding unmatched “THEN”. Under this condition, only parsing tree (a) of Example 2.9 is correct.
Accordingly, the rules of Example 2.9 can be modified like this:
G’: S::=M|U
M::=IF B THEN M ELSE M|S’
U::=IF B THEN S|IF B THEN M ELSE U
S’::=S2|S3
B::=B1|B2
G’ and G are equivalent, but G’ is an unambiguous grammar.
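The ambiguity claims for Examples 2.8 and 2.9 can be checked on the example strings by counting parsing trees. The counter below is a small sketch, not a parser from the chapter: count_trees and the grammar dictionaries are hypothetical names, and the method assumes a grammar without empty right sides and without cyclic unit rules, which holds for these three grammars.

```python
from functools import lru_cache

def count_trees(rules, start, tokens):
    """Count the parsing trees of `tokens` from `start`.
    `rules` maps each nonterminal to a list of right sides (tuples of symbols)."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def symbol(sym, i, j):
        if sym not in rules:                       # terminal symbol
            return int(j == i + 1 and tokens[i] == sym)
        return sum(sequence(right, i, j) for right in rules[sym])

    @lru_cache(maxsize=None)
    def sequence(right, i, j):
        if len(right) == 1:
            return symbol(right[0], i, j)
        rest = right[1:]
        # Every symbol covers at least one token, so leave room for the rest.
        return sum(symbol(right[0], i, k) * sequence(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return symbol(start, 0, len(tokens))

EX_2_8 = {"E": [("i",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")]}
COMMON = {"S'": [("S2",), ("S3",)], "B": [("B1",), ("B2",)]}
G_2_9 = {"S": [("IF", "B", "THEN", "S"), ("IF", "B", "THEN", "S", "ELSE", "S"), ("S'",)],
         **COMMON}
G_PRIME = {"S": [("M",), ("U",)],
           "M": [("IF", "B", "THEN", "M", "ELSE", "M"), ("S'",)],
           "U": [("IF", "B", "THEN", "S"), ("IF", "B", "THEN", "M", "ELSE", "U")],
           **COMMON}

text = "IF B1 THEN IF B2 THEN S2 ELSE S3".split()
print(count_trees(EX_2_8, "E", "i * i + i".split()))
print(count_trees(G_2_9, "S", text), count_trees(G_PRIME, "S", text))
```

A count above 1 for one sentence is enough to show that a grammar is ambiguous. A count of 1 for one sentence only shows that this particular sentence has a unique parsing tree.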
Extended Backus Naur Form
We have discussed grammars represented in Backus Naur Form. This section introduces Extended BNF (EBNF), which improves the readability and writability of BNF.
{ }
BNF often contains recursive rules, such as E::=T|E+T and T::=F|T*F. To eliminate the left recursion, we can rewrite the rules like this:
E::=T{+T}
T::=F{*F}
{x} means zero or more repetitions of x.
{x}_m^n means that x is repeated at least m times and at most n times. When m=0, {x}_0^n can be written as {x}^n.
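One reason the { } form is convenient is that each repetition becomes a simple loop in a hand-written parser. The sketch below evaluates expressions using E::=T{+T} and T::=F{*F}. The text does not define F, so the rule F::=digit|(E) is an assumption made only for this example, and all function names are illustrative.

```python
def parse_expression(text: str) -> int:
    """Evaluate text using E ::= T{+T} and T ::= F{*F}.
    Assumption (not in the text): F ::= digit | (E)."""
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def take(expected=None) -> str:
        nonlocal pos
        ch = peek()
        if ch == "" or (expected is not None and ch != expected):
            raise SyntaxError(f"unexpected {ch!r} at position {pos}")
        pos += 1
        return ch

    def E() -> int:
        value = T()
        while peek() == "+":       # {+T}: zero or more repetitions
            take("+")
            value += T()
        return value

    def T() -> int:
        value = F()
        while peek() == "*":       # {*F}: zero or more repetitions
            take("*")
            value *= F()
        return value

    def F() -> int:
        if peek() == "(":
            take("(")
            value = E()
            take(")")
            return value
        ch = take()
        if ch not in "0123456789":
            raise SyntaxError(f"expected a digit, got {ch!r}")
        return int(ch)

    value = E()
    if pos != len(text):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return value

print(parse_expression("3+3*3"), parse_expression("(3+3)*3"))
```

With these rules 3+3*3 has only the reading 3+(3*3), because * is handled one level below +.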
[ ]
[x] means that the string x is optional: [x] stands for either ε or x.
S::=IF B THEN S
S::=IF B THEN S ELSE S
The rules above can be written as
S::=IF B THEN S [ELSE S]
( )
The rule U::=E+E|E*E can be rewritten as U::=E(+E|*E). So “( )” is used for grouping.
The rule Z::=01|0S|0S0 can be rewritten as
Z::=0S(ε|0)|01 or Z::=0(S(ε|0)|1)