Chapter 6 Simplification of Context-free Grammars and Normal Forms These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata, 4th ed., by Peter
Parsing • Given a string w and a grammar G, a parser finds a derivation of the string w from the grammar G, or else determines that the string is not part of the language • Thus, a parser solves the membership problem for a language, which is the problem of deciding, for any string w and grammar G, whether w belongs to the language generated by G • Typically, a parser also constructs a parse tree for the string (which can be used by a compiler for code generation)
Two questions • Can we solve the membership problem for context-free languages? That is, can we develop a parsing algorithm for any context-free language? • If so, can we develop an efficient parsing algorithm? • We saw in the previous chapter that we can, if we place restrictions on the grammar.
Simplified forms and normal forms Simplified forms can eliminate ambiguity and otherwise “improve” a grammar. What we would like is for every production in a CFG to have a form under which the length of the sentential form never decreases during a derivation. Once the productions are in this form, whenever the sentential form becomes longer than the input string, we know that this derivation path cannot produce the input string and can be abandoned.
Simplified forms and normal forms Normal forms of context-free grammars are interesting in that, although they are restricted forms, it can be shown that every CFG can be converted to a normal form. The two types of normal forms that we will look at are Chomsky normal form and Greibach normal form.
The empty string The empty string often complicates things, so we would like to define (and work with) the subset of a language that excludes the empty string. Let L be a context-free language and let G’ = (V, T, S, P) be a context-free grammar for L – { λ }. If λ is in L, we can construct a grammar G that generates L by adding the following to G’:
Create a new start variable, S0.
Add two new production rules:
S0 → S
S0 → λ
The empty string Most of the proofs for CFG languages are demonstrated by using λ-free languages. It usually can be shown quite easily that the proof can also be extended to “equivalent” languages for which the only difference is the acceptance of the empty string. (yes, this is handwaving, but . . .)
Simplified forms Theorem 6.1: Let G = (V, T, S, P) be a context-free grammar. Suppose that P contains a production rule of the form
A → x1Bx2
Assume that A and B are different variables and that
B → y1 | y2 | . . . | yn
is the set of all productions in P which have B as the left side.
Simplified forms Theorem 6.1: (continued) Let G’ = (V, T, S, P’) be the grammar in which P’ is constructed by deleting
A → x1Bx2
from P, and adding to it
A → x1y1x2 | x1y2x2 | . . . | x1ynx2
Then it may be shown that L(G’) = L(G) (see the Linz textbook for the proof).
Simplified forms Example:
A → a | aaA | abBc
B → abbA | b
Here we can’t eliminate all rules with B on the left side, but we can eliminate B from the right side of any A rules. The equivalent productions would be:
A → a | aaA | ababbAc | abbc
B → abbA | b
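The substitution in Theorem 6.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: productions are stored in a dict from a variable to a list of right-hand-side strings, variables are single uppercase letters, and the helper name `substitute` is hypothetical rather than anything from the textbook.

```python
def substitute(productions, a, rule, b):
    """Replace the rule a -> x1 b x2 by a -> x1 y x2 for every b -> y.

    Assumes b occurs in `rule`; only its first occurrence is substituted.
    """
    x1, _, x2 = rule.partition(b)          # split the rule around b
    new_rules = [r for r in productions[a] if r != rule]
    for y in productions[b]:
        candidate = x1 + y + x2
        if candidate not in new_rules:     # avoid duplicate productions
            new_rules.append(candidate)
    result = dict(productions)             # leave the input grammar untouched
    result[a] = new_rules
    return result


# The example grammar from the slide above.
grammar = {"A": ["a", "aaA", "abBc"], "B": ["abbA", "b"]}
simplified = substitute(grammar, "A", "abBc", "B")
print(simplified["A"])
```

The B rules are kept as they are; whether they are still needed is a separate question of reachability.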
Simplified forms Example: Suppose that our complete simplified grammar is:
S → A
A → a | aaA | ababbAc | abbc
B → abbA | b
Since you can’t get to B from S, there is no longer any way that any B rules can play a part in any derivation; they are useless.
Simplified forms Another example: Suppose that our grammar is:
S → aSb | λ | A
A → aA
Notice that the production rule A → aA can never be used to produce a sequence of all terminals. It is therefore useless. The production rule S → A is also useless. (Why?) Both of these rules may be deleted without changing the language the grammar generates.
Reachable Definition: A variable A in a CFG G = (V, T, S, P) is reachable if S ⇒* xAy for some x, y ∈ (V ∪ T)*. Reachable variables are variables that appear in strings derivable from S.
Example
S → EA
A → abA | ab
C → EC | Ab
E → bC
G → EbE | CE | ba
Reachable variables:
R0 = {S}
R1 = {S, E, A}
R2 = {S, E, A, C}
R3 = {S, E, A, C}
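The reachable set can be computed by a simple graph search from S. The sketch below assumes the same illustrative encoding as before (a dict from variable to right-hand-side strings, single uppercase letters as variables); the function name `reachable` is a hypothetical choice.

```python
def reachable(productions, start):
    """Return the set of variables that appear in some string derivable from start."""
    reached = {start}
    frontier = [start]
    while frontier:
        var = frontier.pop()
        for rhs in productions.get(var, []):
            for sym in rhs:
                # Assumption: a symbol is a variable iff it is an uppercase letter.
                if sym.isupper() and sym not in reached:
                    reached.add(sym)
                    frontier.append(sym)
    return reached


grammar = {
    "S": ["EA"],
    "A": ["abA", "ab"],
    "C": ["EC", "Ab"],
    "E": ["bC"],
    "G": ["EbE", "CE", "ba"],
}
print(sorted(reachable(grammar, "S")))
```

The search visits variables in a different order from the sets R0, R1, R2 on the slide, but it ends with the same final set; G is never reached.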
Useful variables Definition: Let G = (V, T, S, P) be a context-free grammar. Let A ∈ V; then A is live iff there is at least one terminal string w ∈ T* such that A ⇒* w. Informally, live variables are those from which strings of terminals can be derived. Variables which are not live are said to be dead.
Example
S → AB | CD | ADF | CF | EA
A → abA | ab
B → bB | aD | BF | aF
C → cB | EC | Ab
D → bB | FFB
E → bC | AB
F → abbF | baF | bD | BB
G → EbE | CE | ba
Live variables:
L0 = {A, G}
L1 = {A, G, C}
L2 = {A, G, C, E}
L3 = {A, G, C, E, S}
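The live set is a fixed point: keep adding any variable that has a right-hand side made up only of terminals and variables already known to be live. A minimal sketch, assuming the same illustrative dict encoding (single uppercase letters are variables) and a hypothetical function name:

```python
def live_variables(productions):
    """Return the set of variables that can derive a string of terminals."""
    live = set()
    changed = True
    while changed:
        changed = False
        for var, rules in productions.items():
            if var in live:
                continue
            # var is live if some rule uses only terminals and live variables.
            if any(all(not s.isupper() or s in live for s in rhs) for rhs in rules):
                live.add(var)
                changed = True
    return live


grammar = {
    "S": ["AB", "CD", "ADF", "CF", "EA"],
    "A": ["abA", "ab"],
    "B": ["bB", "aD", "BF", "aF"],
    "C": ["cB", "EC", "Ab"],
    "D": ["bB", "FFB"],
    "E": ["bC", "AB"],
    "F": ["abbF", "baF", "bD", "BB"],
    "G": ["EbE", "CE", "ba"],
}
print(sorted(live_variables(grammar)))
```

Because the set is updated during a pass, the intermediate rounds need not match L0 through L3 exactly, but the final set does. B, D, and F are dead: every one of their rules mentions B, D, or F.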
Useful variables Definition 6.1 (modified): A variable A in a CFG G = (V, T, S, P) is useful if, for some string w ∈ L(G), there is a derivation of w that takes the form S ⇒* xAy ⇒* w. Informally, a variable is useful if it can be used in a derivation of a string in the language L(G). A variable which is not useful is said to be useless. Variables which are dead are useless. Variables which are not reachable are useless.
Useless variables So a variable is useless if either: 1. it is not live (i.e., cannot derive a terminal string), or 2. it is not reachable from the start symbol A production is useless if it involves any useless variables.
Exercise Example: Given G = ({S, A, B, C}, {a, b}, S, P), with P =
S → aS | A | C
A → a
B → aa
C → aCb
eliminate all useless variables and productions. First, we find any dead variables. C can never generate a string of all terminals, because its only rule, C → aCb, reintroduces C. C is dead.
Exercise Delete any productions involving C. New grammar:
S → aS | A
A → a
B → aa
Next, we check to see if there are any variables which cannot be reached from the start symbol. To do this, we may use a dependency graph.
Exercise Example:
S → aS | A | C
A → a
B → aa
C → aCb
Dependency graph: [figure with nodes S, A, B, and C]
Clearly, B is not reachable from S.
Exercise Delete any productions involving B. New grammar:
S → aS | A
A → a
The only productions that were deleted from the original grammar were useless. This new grammar generates all and only the strings generated by the original grammar. It is equivalent to the original grammar.
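The two passes of the exercise (drop dead variables first, then drop unreachable ones) can be combined into one routine. This is a sketch under assumptions: the dict encoding and the names `live_variables`, `reachable`, and `remove_useless` are illustrative choices, not from the textbook.

```python
def is_var(sym):
    # Assumption for this sketch: variables are single uppercase letters.
    return sym.isupper()


def live_variables(productions):
    """Variables that can derive a string of terminals."""
    live, changed = set(), True
    while changed:
        changed = False
        for var, rules in productions.items():
            if var not in live and any(
                all(not is_var(s) or s in live for s in rhs) for rhs in rules
            ):
                live.add(var)
                changed = True
    return live


def reachable(productions, start):
    """Variables that occur in some string derivable from start."""
    reached, frontier = {start}, [start]
    while frontier:
        for rhs in productions.get(frontier.pop(), []):
            for s in rhs:
                if is_var(s) and s not in reached:
                    reached.add(s)
                    frontier.append(s)
    return reached


def remove_useless(productions, start):
    """Drop dead variables first, then unreachable ones."""
    live = live_variables(productions)
    # Keep only live variables and rules that mention no dead variable.
    pruned = {
        var: [r for r in rules if all(not is_var(s) or s in live for s in r)]
        for var, rules in productions.items()
        if var in live
    }
    keep = reachable(pruned, start)
    return {var: rules for var, rules in pruned.items() if var in keep}


grammar = {"S": ["aS", "A", "C"], "A": ["a"], "B": ["aa"], "C": ["aCb"]}
print(remove_useless(grammar, "S"))
```

The order matters in this sketch: removing dead variables can make other variables unreachable, so reachability is checked on the already pruned grammar.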
Useless variables Theorem 6.2: Let G = (V, T, S, P) be a context-free grammar. Then there exists an equivalent grammar G’ = (V’, T’, S, P’) that does not contain any useless variables or productions. Note that useless variables may be removed from V to give V’, and any terminals not occurring in any useful production may be removed from T to give T’.
Simplified forms and normal forms Two undesirable types of productions in a CFG can keep the string length in sentential forms from increasing:
λ-productions - these productions are of the form A → λ, and they actually decrease the length of the string
unit productions - these productions are of the form A → B, and they allow rules to be applied to a string without increasing the length of the string and without getting us any closer to the goal of ending up with a string of all terminals
λ-productions Definition 6.2: Any production of a context-free grammar of the form A → λ is called a λ-production. Any variable A for which the derivation A ⇒* λ is possible is called nullable.
Nullable variables A nullable variable in a context-free grammar G = (V, T, S, P) is defined as follows:
1. Any variable A for which P contains the production A → λ is nullable.
2. If P contains the production A → B1B2…Bn and B1, B2, …, Bn are all nullable variables, then A is nullable.
3. No other variables in V are nullable.
The nullable variables in V are precisely those variables A for which A ⇒* λ.
The effect of λ-productions Suppose we are trying to see if our CFG generates the string aabaa, which contains 5 terminal characters. In the process of applying productions, we have generated an intermediate string, aaYbYaa, containing 7 characters. Since λ-productions decrease the length of the string, it might still be possible to generate aabaa from aaYbYaa (if there were a derivation path Y ⇒* λ).
λ-productions Note that without λ-productions, a grammar would have no way to reduce the number of characters in its intermediate strings. In such a grammar, we could stop processing intermediate strings as soon as they exceeded the length of the target string.
λ-productions So, given a CFG G without λ-productions, we could determine if a given string x of length |x| belonged to L(G) simply by applying production rules and generating all sentential forms of length at most |x|. If x had not been generated up to that point, it could not belong to that language.
λ-productions Given the grammar
S → aS1b
S1 → aS1b | λ
What is the effect of the production S1 → λ? The effect is to delete S1 wherever it occurs on the right-hand side of a production rule.
λ-productions If we apply the production S1 → λ to
S → aS1b
the resulting production rule is
S → ab
If we apply the production S1 → λ to
S1 → aS1b
the resulting production rule is
S1 → ab
λ-productions Therefore, we can eliminate any λ-productions from this grammar by adding the new productions obtained by substituting λ for S1 wherever S1 appears on the right-hand side of the production rules, and then deleting the λ-production. When we do this, we obtain the equivalent grammar:
S → aS1b | ab
S1 → aS1b | ab
λ-productions Theorem 6.3: Let G be any context-free grammar with λ not in L(G). Then there exists an equivalent grammar G’ having no λ-productions.
Algorithm FindNull Establish the set N0, which is the set of all variables A in the grammar that have a production A → λ. Now loop: on each pass, add to the set every variable B that has a production B → X1X2…Xn in which every Xi is a variable already in the set. Stop when no new variables were added to the set during the last iteration of the loop.
Example Let G be the CFG with the productions:
S → ABCBCDA
A → CD
B → Cb
C → a | λ
D → bD | λ
Here, C and D are nullable because there are production rules C → λ and D → λ. But A is also nullable, because A → CD, and both C and D are nullable.
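FindNull can be sketched as a fixed-point loop. The encoding is an assumption made for illustration: productions live in a dict from variable to right-hand-side strings, variables are single uppercase letters, and the empty string `""` stands for λ. The name `find_nullable` is hypothetical.

```python
def find_nullable(productions):
    """Return the set of nullable variables (those A with A =>* lambda)."""
    # N0: variables with a direct lambda-production ("" encodes lambda).
    nullable = {var for var, rules in productions.items() if "" in rules}
    changed = True
    while changed:
        changed = False
        for var, rules in productions.items():
            if var in nullable:
                continue
            # var is nullable if some rule consists only of nullable variables.
            # Terminals are never in the set, so they block the rule.
            if any(all(s in nullable for s in rhs) for rhs in rules):
                nullable.add(var)
                changed = True
    return nullable


grammar = {
    "S": ["ABCBCDA"],
    "A": ["CD"],
    "B": ["Cb"],
    "C": ["a", ""],
    "D": ["bD", ""],
}
print(sorted(find_nullable(grammar)))
```

B is not nullable because its only rule contains the terminal b, and S is not nullable because its rule contains B.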
Algorithm: Eliminate λ-productions Given a CFG G = (V, T, S, P), construct a CFG G’ = (V, T, S, P’) with no λ-productions as follows:
1. Initialize P’ = P.
2. Find all nullable variables in V, using FindNull.
3. For every production A → x in P (x ∈ (V ∪ T)*), where x contains nullable variables, add to P’ every production that can be obtained from this one by deleting from x one or more of the occurrences in x of nullable variables.
4. Delete all λ-productions from P’.
5. In addition, delete any duplicates and delete productions of the form A → A.
Implications of Theorem 6.3: Let G = (V, T, S, P) be any context-free grammar, and let G’ be the grammar obtained from G by the previous algorithm. Then:
1. G’ has no λ-productions, and
2. L(G’) = L(G) - {λ}.
3. Moreover, if G is unambiguous, then so is G’.
Example Given a context-free grammar with the following production rules, find the nullable variables:
S → ABC
A → B | a
B → C | b | λ
C → AB | D
D → Cd
N0 = {B}
N1 = {B, A}
N2 = {B, A, C}
N3 = {B, A, C, S}
Example (continued)
S → ABC
A → B | a
B → C | b | λ
C → AB | D
D → Cd
N = {A, B, C, S}
Productions that gain new alternatives:
S → ABC  becomes  S → ABC | BC | AC | AB | A | B | C
C → AB | D  becomes  C → AB | A | B | D
D → Cd  becomes  D → Cd | d
Example (continued)
S → ABC | AB | AC | BC | A | B | C
A → B | a
B → C | b
C → AB | A | B | D
D → Cd | d
Note that we have gotten rid of all λ-productions. However, other beneficial changes can still be made.
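The λ-elimination procedure applied in this example can be sketched as follows. The assumptions are again illustrative: a dict from variable to right-hand-side strings, single uppercase letters as variables, `""` for λ, and hypothetical function names. For each rule, every nullable symbol is either kept or dropped, and all combinations are generated with `itertools.product`.

```python
from itertools import product


def find_nullable(productions):
    """Variables A with A =>* lambda ("" encodes lambda)."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for var, rules in productions.items():
            if var not in nullable and any(
                all(s in nullable for s in rhs) for rhs in rules
            ):
                nullable.add(var)
                changed = True
    return nullable


def eliminate_lambda(productions):
    """Return an equivalent grammar (up to lambda) with no lambda-productions."""
    nullable = find_nullable(productions)
    result = {}
    for var, rules in productions.items():
        new_rules = []
        for rhs in rules:
            # Each nullable symbol may be kept or deleted; others must stay.
            options = [(s, "") if s in nullable else (s,) for s in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                # Skip lambda-productions, A -> A, and duplicates.
                if candidate and candidate != var and candidate not in new_rules:
                    new_rules.append(candidate)
        result[var] = new_rules
    return result


grammar = {
    "S": ["ABC"],
    "A": ["B", "a"],
    "B": ["C", "b", ""],
    "C": ["AB", "D"],
    "D": ["Cd"],
}
lambda_free = eliminate_lambda(grammar)
for var, rules in lambda_free.items():
    print(var, "->", " | ".join(rules))
```

The alternatives may come out in a different order from the slide, but each variable ends up with the same set of right-hand sides.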
Unit productions Definition 6.3: Any production of a context-free grammar of the form A → B, where A, B ∈ V, is called a unit-production.
Unit productions Theorem 6.4: Let G = (V, T, S, P) be any context-free grammar without λ-productions. Then there exists a context-free grammar G’ = (V’, T’, S, P’) that does not have any unit-productions and that is equivalent to G. Proof: See p. 159 in the Linz text.
Definition of A-derivable variables The set of “A-derivable variables” is the set of all variables B for which A ⇒* B.
1. If A → B is a production, then B is A-derivable.
2. If C is A-derivable, C → B is a production, and B ≠ A, then B is A-derivable.
3. No other variables are A-derivable.
Algorithm: Eliminating Unit Productions Given a context-free grammar G = (V, T, S, P) with no λ-productions, construct a grammar G’ = (V, T, S, P’) having no unit productions as follows:
1. Initialize P’ to be P.
2. For each A ∈ V, find the set of A-derivable variables.
3. For every pair (A, B) such that B is A-derivable, and every non-unit production B → x (where x ∈ (V ∪ T)+), add the production A → x to P’.
4. Delete all unit productions from P’.
Example Original grammar:
S → S+T | T
T → T*F | F
F → (S) | a
{S-derivable} = {T}, {T-derivable} = {F}, and so {S-derivable} = {T, F}
Resulting grammar:
S → S+T | T*F | (S) | a
T → T*F | (S) | a
F → (S) | a
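Unit-production elimination can be sketched in the same style. The assumptions are illustrative: a dict from variable to right-hand-side strings, single uppercase letters as variables, a grammar with no λ-productions, and hypothetical function names. A rule counts as a unit production only when it is exactly one variable.

```python
def is_unit(rhs):
    # A unit production has a right-hand side that is exactly one variable.
    return len(rhs) == 1 and rhs.isupper()


def derivable(productions, a):
    """Variables B (other than a) reachable from a through unit productions only."""
    found, stack = set(), [a]
    while stack:
        for rhs in productions.get(stack.pop(), []):
            if is_unit(rhs) and rhs != a and rhs not in found:
                found.add(rhs)
                stack.append(rhs)
    return found


def eliminate_unit(productions):
    """Return an equivalent grammar with no unit productions."""
    result = {}
    for a, rules in productions.items():
        new_rules = [r for r in rules if not is_unit(r)]
        # Copy every non-unit rule of each a-derivable variable up to a.
        for b in sorted(derivable(productions, a)):
            for r in productions.get(b, []):
                if not is_unit(r) and r not in new_rules:
                    new_rules.append(r)
        result[a] = new_rules
    return result


grammar = {"S": ["S+T", "T"], "T": ["T*F", "F"], "F": ["(S)", "a"]}
unit_free = eliminate_unit(grammar)
for var, rules in unit_free.items():
    print(var, "->", " | ".join(rules))
```

The length check in `is_unit` is needed because a string such as `"S+T"` has no lowercase letters and would otherwise be mistaken for a variable by `str.isupper`.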
Summary Theorem 6.5: Let L be a context-free language that does not contain λ. Then there exists a context-free grammar that generates L and that does not have any useless productions, λ-productions, or unit-productions. Proof: Find a CFG that generates L. Apply the procedures in theorems 6.2, 6.3, and 6.4. The result is an equivalent CFG that generates L but does not have any useless productions, λ-productions, or unit-productions.
Summary Note that the procedure specified above must occur in a particular order. The procedure for removing λ-productions can create new unit-productions, and the procedure for eliminating unit-productions must start with a CFG that has no λ-productions. The required sequence is: 1. Remove λ-productions 2. Remove unit productions 3. Remove useless productions
Unit productions Given a context-free grammar G’ without λ-productions or unit-productions, any production rule must either: • Convert a non-terminal to a terminal, or • Replace a non-terminal with at least two other symbols
Unit productions Let: l = length of the current string t = the number of terminals in the current string The value of l + t is 1 for the starting string S and 2k for a string (all terminals) of length k in the language. The value of l + t for an intermediate string of length k containing 1 or more variables would be < 2k. Any intermediate string with l + t > 2k cannot generate a string of length k in the language.
Simplified forms What does this mean for us? Given a grammar G with no λ-productions and no unit-productions, every derivation step increases l + t by at least 1. Since l + t is 1 for S and 2k for a terminal string of length k, a string x in L(G) with |x| = k is derived from S in no more than 2k - 1 steps.
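The 2k - 1 bound gives a brute-force membership test: explore leftmost derivations for at most 2k - 1 steps and discard any sentential form longer than the target. The sketch below is only an illustration of that idea, not an efficient parser. It assumes a grammar with no λ-productions and no unit productions, encoded as a dict from single-uppercase-letter variables to right-hand-side strings; the name `derives` is hypothetical.

```python
def derives(productions, start, target):
    """Decide whether target is in L(G) by bounded leftmost derivation."""
    k = len(target)
    frontier = {start}
    for _ in range(2 * k - 1):            # at most 2k - 1 derivation steps
        next_frontier = set()
        for form in frontier:
            # Position of the leftmost variable, if any.
            i = next((j for j, s in enumerate(form) if s.isupper()), None)
            if i is None:
                continue                   # all terminals, nothing to expand
            if form[:i] != target[:i]:
                continue                   # terminal prefix already mismatches
            for rhs in productions.get(form[i], []):
                new_form = form[:i] + rhs + form[i + 1:]
                if len(new_form) <= k:     # longer forms can never shrink
                    next_frontier.add(new_form)
        if target in next_frontier:
            return True
        frontier = next_frontier
    return False


# The unit-free expression grammar from the earlier example.
grammar = {
    "S": ["S+T", "T*F", "(S)", "a"],
    "T": ["T*F", "(S)", "a"],
    "F": ["(S)", "a"],
}
print(derives(grammar, "S", "a+a*a"))
print(derives(grammar, "S", "a+"))
```

The search space can still grow exponentially with k, which is why the normal forms and dedicated parsing algorithms are of interest.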