1. CS240 • Language Theory and Automata • Fall 2008
2. Context-Free Grammars We have seen finite automata and regular expressions as equivalent means to describe languages
Simple languages only
Can’t do e.g., {aⁿbⁿ}
Context-free grammars are a more powerful way to describe languages
Can describe recursive structure
3. CFGs First used to describe human languages
Natural recursion among things like noun, verb, preposition and their phrases because e.g. noun phrases can appear inside verb phrases and prepositional phrases, and vice versa
Important application of CFGs: specification and compilation of programming languages
Parser describes and analyzes syntax of a program
Algorithms exist to construct parser from CFG
4. Context-free Languages and Pushdown Automata CFLs are described by CFGs
Include the regular languages plus others
Pushdown automata are to CFLs as finite automata are to RLs
Equivalent: recognize same languages
5. Example Production rules: substitutions
Non-terminals: variables that can be substituted
Terminals: symbols that are part of the alphabet, no substitutions
Start variable: left side of top-most rule
6. Generating strings Use a grammar to describe a language by generating each string as follows:
Write down the start variable
Find a variable that is written down and a rule that starts with that variable (i.e., the variable is on the left of the arrow). Replace the variable with the right side of that rule
Repeat step 2 until no variables remain
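The three-step procedure above can be sketched in Python; the grammar used here (A → aAb | B, B → c, generating aⁿcbⁿ) is read off the derivation example on the next slide.

```python
import random

# Grammar from the derivation example: A -> aAb | B, B -> c.
# Uppercase symbols are variables, lowercase symbols are terminals.
GRAMMAR = {"A": ["aAb", "B"], "B": ["c"]}

def generate(start="A", rng=None):
    rng = rng or random.Random()
    form = start                                # step 1: write down the start variable
    while any(c.isupper() for c in form):       # step 3: repeat while variables remain
        i = next(j for j, c in enumerate(form) if c.isupper())
        body = rng.choice(GRAMMAR[form[i]])     # step 2: pick a rule for that variable
        form = form[:i] + body + form[i+1:]     #         and substitute its right side
    return form

print(generate())   # some string of the form a^n c b^n
```

Every run halts with a terminal string, since each substitution either keeps one variable or removes it.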
7. Example Derivation Sequence of substitutions is called a derivation
Derivation of aaacbbb (using the productions A → aAb, A → B, B → c):
A ⇒ aAb ⇒ aaAbb ⇒ aaaAbbb ⇒ aaaBbbb ⇒ aaacbbb
8. CFG for English <sentence> → <noun-phrase> <verb-phrase>
<noun-phrase> → <cmplx-noun>
| <cmplx-noun> <prep-phrase>
<verb-phrase> → <cmplx-verb>
| <cmplx-verb> <prep-phrase>
<prep-phrase> → <prep> <cmplx-noun>
<cmplx-noun> → <article> <noun>
<cmplx-verb> → <verb>
| <verb> <noun-phrase>
<article> → a | the
<noun> → boy | girl | flower
<verb> → touches | likes | sees
<prep> → with
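As a sanity check, the grammar above can be run as a toy sentence generator; the dictionary encoding (each head mapped to a list of bodies) is just one convenient representation.

```python
import random

# The English grammar from the slide; lowercase strings are terminals,
# capitalized names are variables.
RULES = {
    "SENTENCE":    [["NOUN-PHRASE", "VERB-PHRASE"]],
    "NOUN-PHRASE": [["CMPLX-NOUN"], ["CMPLX-NOUN", "PREP-PHRASE"]],
    "VERB-PHRASE": [["CMPLX-VERB"], ["CMPLX-VERB", "PREP-PHRASE"]],
    "PREP-PHRASE": [["PREP", "CMPLX-NOUN"]],
    "CMPLX-NOUN":  [["ARTICLE", "NOUN"]],
    "CMPLX-VERB":  [["VERB"], ["VERB", "NOUN-PHRASE"]],
    "ARTICLE":     [["a"], ["the"]],
    "NOUN":        [["boy"], ["girl"], ["flower"]],
    "VERB":        [["touches"], ["likes"], ["sees"]],
    "PREP":        [["with"]],
}

def expand(symbol, rng):
    if symbol not in RULES:              # terminal: emit it
        return [symbol]
    body = rng.choice(RULES[symbol])     # variable: pick a body, recurse
    return [word for sym in body for word in expand(sym, rng)]

print(" ".join(expand("SENTENCE", random.Random(42))))
```

The exact sentence depends on the seed, but every output starts with an article and contains exactly the vocabulary above.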
9. Formal CFG Notation Productions = rules of the form
head → body
head is a variable.
body is a string of zero or more variables and/or terminals.
Start Symbol = variable that represents "the language."
Notation: G = (V, Σ, P, S)
V = variables
Σ = terminals
P = productions
S = start symbol
10. Derivations ?A? ? ??? whenever there is a production A ? ?
(Subscript with name of grammar, e.g., ?G, if necessary.)
Example: abbAS ? abbaAbS
? ? ? means string can become ? in zero or more derivation steps.
Examples:
abbAS ⇒* abbAS (zero steps)
abbAS ⇒* abbaAbS (one step)
abbAS ⇒* abbaabb (three steps)
11. Language of a CFG
L(G) = set of terminal strings w such that S ⇒*G w, where S is the start symbol.
Notation
a, b, … are terminals; …, y, z are strings of terminals.
Greek letters are strings of variables and/or terminals, often called sentential forms.
A,B,… are variables.
…, Y, Z are variables or terminals.
S is typically the start symbol.
12. Leftmost/Rightmost Derivations We have a choice of variable to replace at each step.
Derivations may appear different only because we make the same replacements in a different order.
To avoid such differences, we may restrict the choice.
A leftmost derivation always replaces the leftmost variable in a sentential form.
Yields left-sentential forms.
Rightmost derivation defined analogously.
⇒lm and ⇒rm are used to indicate that derivations are leftmost or rightmost.
13. Example A simple example grammar generates strings of a's and b's such that each block of a's is followed by at least as many b's: S → AS | ε, A → Ab | aAb | ab
Note the vertical bar separates different bodies for the same head.
14. Example Leftmost derivation:
S ⇒ AS ⇒ AbS ⇒ abbS ⇒ abbAS ⇒ abbaAbS ⇒ abbaabbS ⇒ abbaabb
Rightmost derivation:
S ⇒ AS ⇒ AAS ⇒ AA ⇒ AaAb ⇒ Aaabb ⇒ Abaabb ⇒ abbaabb
Note we derived the same string
15. Derivation Trees Nodes = variables, terminals, or ε
Variables at interior nodes, terminals and ε at leaves
A leaf can be ε only if it is the only child of its parent
An interior node and its children, read from the left, must form the head and body of a production
16. Example
17. Ambiguous Grammars A CFG is ambiguous if one or more terminal strings have multiple leftmost derivations from the start symbol.
Equivalently: multiple rightmost derivations, or multiple parse trees.
Example
Consider S → AS | ε, A → Ab | aAb | ab
The string aabbb has the following two leftmost derivations from S:
S ⇒ AS ⇒ aAbS ⇒ aAbbS ⇒ aabbbS ⇒ aabbb
S ⇒ AS ⇒ AbS ⇒ aAbbS ⇒ aabbbS ⇒ aabbb
Intuitively, we can use A ? Ab first or second to generate the extra b.
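One way to see the ambiguity concretely is to count leftmost derivations by brute force. This sketch prunes the search using the fact that A derives at least two terminals and S at least zero.

```python
# Grammar: S -> AS | eps, A -> Ab | aAb | ab.  Count the leftmost
# derivations of a target string; a count above 1 witnesses ambiguity.
GRAMMAR = {"S": ["AS", ""], "A": ["Ab", "aAb", "ab"]}
MIN_YIELD = {"S": 0, "A": 2}       # fewest terminals each variable can derive

def count_lm(target, form="S"):
    i = next((j for j, c in enumerate(form) if c in GRAMMAR), None)
    if i is None:                                  # no variables left
        return 1 if form == target else 0
    if form[:i] != target[:i]:                     # terminal prefix must match
        return 0
    if sum(MIN_YIELD.get(c, 1) for c in form) > len(target):
        return 0                                   # form can't shrink enough
    return sum(count_lm(target, form[:i] + body + form[i+1:])
               for body in GRAMMAR[form[i]])

print(count_lm("aabbb"))   # -> 2, matching the two derivations above
```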
18. Inherently Ambiguous Languages A CFL L is inherently ambiguous if every CFG for L is ambiguous.
Such things exist, see book.
Example
The language of our example grammar is not inherently ambiguous, even though the grammar is ambiguous.
Change the grammar to force the extra b's to be generated last.
19. Why Care? Ambiguity of the grammar implies that at least some strings in its language have different structures (parse trees).
Thus, such a grammar is unlikely to be useful for a programming language, because two structures for the same string (program) imply two different meanings (two different executables) for that program.
Common example: the easiest grammars for arithmetic expressions are ambiguous and need to be replaced by more complex, unambiguous grammars.
An inherently ambiguous language would be absolutely unsuitable as a programming language, because we would not have any way of finding a unique structure for all its programs.
20. Expression Grammar
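The grammar on this slide is not preserved in the transcript; a standard ambiguous expression grammar is E → E+E | E*E | (E) | a (our assumption, not from the slide). Counting leftmost derivations (one per parse tree) shows the ambiguity directly.

```python
# Assumed expression grammar: E -> E+E | E*E | (E) | a  (ambiguous).
GRAMMAR = {"E": ["E+E", "E*E", "(E)", "a"]}

def count_parses(target, form="E"):
    i = next((j for j, c in enumerate(form) if c in GRAMMAR), None)
    if i is None:
        return 1 if form == target else 0
    if form[:i] != target[:i] or len(form) > len(target):
        return 0   # every symbol yields >= 1 terminal, so prune by length
    return sum(count_parses(target, form[:i] + body + form[i+1:])
               for body in GRAMMAR[form[i]])

print(count_parses("a+a*a"))   # -> 2: the trees (a+a)*a and a+(a*a)
```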
21. Pushdown Automata Add a stack to a FA
Typically non-deterministic
An automaton equivalent to CFG's
22. Example Notation for "transition diagrams":
a, Z / X1X2…Xk
Meaning: on input a, with Z on top of the stack, consume the a, make this state transition, and replace the Z on top of the stack by X1X2…Xk (with X1 at the top).
24. Formal PDA P = (Q, Σ, Γ, δ, q0, Z0, F)
Q, Σ, q0, and F have their meanings from FA.
Γ = stack alphabet
Z0 in Γ = start symbol (symbol initially on stack)
δ = transition function; takes a state, an input symbol (or ε), and a stack symbol, and gives you a finite number of choices of:
A new state (possibly the same)
A string of stack symbols (possibly ε) to replace the top stack symbol
25. Instantaneous Descriptions (ID's) For a FA, the only thing of interest is its state. For a PDA, we want to know its state and the entire contents of its stack.
It is also convenient to maintain a fiction that there is an input string waiting to be read.
Represented by an ID (q, w, γ), where q = state, w = waiting input, and γ = stack [top on left].
26. Moves of the PDA If δ(q, a, X) contains (p, α), then
(q, aw, Xβ) ⊢ (p, w, αβ)
Extend ⊢ to ⊢* to represent 0, 1, or many moves.
Subscript by the name of the PDA, if necessary.
Input string w is accepted if (q0, w, Z0) ⊢* (p, ε, α) for any accepting state p and any stack string α.
L(P) = set of strings accepted by P.
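The definitions above are easy to animate. The sketch below simulates a PDA for {aⁿbⁿ : n ≥ 0} by breadth-first search over IDs; the states q0/q1/qf and the dict encoding of δ are our own choices, not from the slides.

```python
from collections import deque

# PDA for { a^n b^n : n >= 0 }, accepting by final state.
# DELTA maps (state, input symbol or "", top of stack) to a set of
# (new state, string pushed in place of the top symbol) pairs.
DELTA = {
    ("q0", "a", "Z"): {("q0", "AZ")},   # push an A for each a
    ("q0", "a", "A"): {("q0", "AA")},
    ("q0", "b", "A"): {("q1", "")},     # switch to popping on the first b
    ("q1", "b", "A"): {("q1", "")},     # pop one A per b
    ("q0", "", "Z"): {("qf", "Z")},     # n = 0: accept immediately
    ("q1", "", "Z"): {("qf", "Z")},     # all A's popped: accept
}
FINAL = {"qf"}

def accepts(w, start="q0", bottom="Z"):
    """Run the PDA on w; IDs are (state, unread input, stack, top on left)."""
    ids, seen = deque([(start, w, bottom)]), set()
    while ids:
        q, rest, stack = ids.popleft()
        if q in FINAL and rest == "":
            return True
        if not stack or (q, rest, stack) in seen:
            continue
        seen.add((q, rest, stack))
        top = stack[0]
        for p, push in DELTA.get((q, "", top), set()):       # eps-moves
            ids.append((p, rest, push + stack[1:]))
        if rest:
            for p, push in DELTA.get((q, rest[0], top), set()):
                ids.append((p, rest[1:], push + stack[1:]))
    return False
```

Note the nondeterminism: from q0 with Z on top the PDA may either read an a or jump to qf on ε; the search explores both.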
27. Example
28. Acceptance by Empty Stack Another one of those technical conveniences:
when we prove that PDA's and CFG's accept the same languages, it helps to assume that the stack is empty whenever acceptance occurs
N(P) = set of strings w such that
(q0, w, Z0) ⊢* (p, ε, ε) for some state p.
Note p need not be in F
In fact, if we talk about N(P) only, then we need not even specify a set of accepting states
29. Example For our previous example, to accept by empty stack:
Add a new transition δ(p, ε, Z0) = {(p, ε)}
That is, when starting to look for a new a-b block, the PDA has the option to pop the last symbol off the stack instead
p is no longer an accepting state, in fact, there are no accepting states
30. A language is L(P1) for some PDA P1 if and only if it is N(P2) for some PDA P2.
Can show with constructive proofs
31. Given P1 = (Q, Σ, Γ, δ, q0, Z0, F), construct P2:
Introduce new start state p0 and new bottom-of-stack marker X0.
First move of P2 : replace X0 by Z0X0 and go to state q0. The presence of X0 prevents P2 from "accidentally" emptying its stack and accepting when P1 did not accept.
Then, P2 simulates P1, i.e., give P2 all the transitions of P1.
Introduce a new state r that keeps popping the stack of P2 until it is empty.
If (the simulated) P1 is in an accepting state, give P2 the additional choice of going to state r on ε input, and thus emptying its stack without reading any more input. Final State → Empty Stack
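The construction can be sketched as a transformation on a transition table. The dict encoding and the names p0, r, X for the new start state, popping state, and bottom marker are our own (assumed fresh); the input PDA here is a final-state acceptor for {aⁿbⁿ}.

```python
from collections import deque

# delta[(state, symbol-or-"", stack top)] = set of (new state, pushed string).
# Example input PDA: accepts { a^n b^n : n >= 0 } by final state qf.
DELTA = {
    ("q0", "a", "Z"): {("q0", "AZ")}, ("q0", "a", "A"): {("q0", "AA")},
    ("q0", "b", "A"): {("q1", "")},   ("q1", "b", "A"): {("q1", "")},
    ("q0", "", "Z"): {("qf", "Z")},   ("q1", "", "Z"): {("qf", "Z")},
}

def to_empty_stack(delta, finals, start, bottom):
    """Slide 31's construction: new start p0 pushes the old bottom marker
    above a fresh marker X; accepting states may jump to r, which pops
    everything (p0, r, X are assumed unused by the input PDA)."""
    d2 = {k: set(v) for k, v in delta.items()}
    syms = {t for (_, _, t) in delta} | \
           {c for v in delta.values() for _, s in v for c in s} | {bottom, "X"}
    d2[("p0", "", "X")] = {(start, bottom + "X")}   # install old bottom marker
    for f in finals:                                # accepting => may start popping
        for t in syms:
            d2.setdefault((f, "", t), set()).add(("r", t))
    for t in syms:                                  # r pops everything, X included
        d2.setdefault(("r", "", t), set()).add(("r", ""))
    return d2

def accepts_empty(delta, w, start, bottom):
    """BFS over IDs; accept when input and stack are both exhausted."""
    ids, seen = deque([(start, w, bottom)]), set()
    while ids:
        q, rest, stack = ids.popleft()
        if rest == "" and stack == "":
            return True
        if not stack or (q, rest, stack) in seen:
            continue
        seen.add((q, rest, stack))
        for a in {"", rest[:1]}:
            for p, push in delta.get((q, a, stack[0]), set()):
                ids.append((p, rest[len(a):], push + stack[1:]))
    return False

D2 = to_empty_stack(DELTA, {"qf"}, "q0", "Z")
```

Because P1 never sees X, D2 cannot empty its stack unless the simulated P1 reached an accepting state, exactly as the slide argues.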
32. Given P2 = (Q, Σ, Γ, δ, q0, Z0), construct P1:
Introduce new start state p0 and new bottom-of-stack marker X0
First move of P1: replace X0 by Z0X0 and go to state q0. Then P1 simulates P2, i.e., give P1 all the transitions of P2
Introduce a new state r for P1, it is the only accepting state
P1 simulates P2
If (the simulated) P2 ever exposes X0, it has emptied its own stack and accepted, so P1 goes to state r on ε input. Empty Stack → Final State
33. Equivalence of CFG's and PDA's The title says it all
We'll show a language L is L(G) for some CFG if and only if it is N(P) for some PDA P
34. Only If (CFG to PDA) Let L = L(G) for some CFG G = (V, Σ, P, S)
Idea: have PDA A simulate leftmost derivations in G, where a left-sentential form (LSF) is represented by:
The sequence of input symbols that A has consumed from its input, followed by…
A's stack, top left-most
Example: If (q, abcd, S) ⊢* (q, cd, ABC), then the LSF represented is abABC
35. Moves of A If a terminal a is on top of the stack, then there had better be an a waiting on the input; if so, A consumes the a from the input and pops it from the stack
The LSF represented doesn't change!
If a variable B is on top of the stack, then PDA A has a choice of replacing B on the stack by the body of any production with head B
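The two move types translate directly into a simulator for the constructed PDA A; the minimum-yield pruning is only there to keep the nondeterministic search finite, and is not part of the construction. The grammar is the running example S → AS | ε, A → Ab | aAb | ab.

```python
from collections import deque

# PDA A built from the grammar S -> AS | eps, A -> Ab | aAb | ab:
# one state; expand a variable on top of the stack (eps-move), or match
# a terminal on top against the next input symbol.  Accept by empty stack.
GRAMMAR = {"S": ["AS", ""], "A": ["Ab", "aAb", "ab"]}
MIN_YIELD = {"S": 0, "A": 2}   # used only to prune the simulation

def accepts(w, start="S"):
    ids, seen = deque([(w, start)]), set()
    while ids:
        rest, stack = ids.popleft()
        if rest == "" and stack == "":
            return True                      # input consumed, stack empty: in N(A)
        if not stack or (rest, stack) in seen:
            continue
        if sum(MIN_YIELD.get(c, 1) for c in stack) > len(rest):
            continue                         # stack can't possibly shrink to fit
        seen.add((rest, stack))
        top = stack[0]
        if top in GRAMMAR:                   # variable on top: replace by any body
            for body in GRAMMAR[top]:
                ids.append((rest, body + stack[1:]))
        elif rest and rest[0] == top:        # terminal on top: match and pop
            ids.append((rest[1:], stack[1:]))
    return False
```

Consumed input plus stack contents always spell out a left-sentential form, exactly the invariant on slide 34.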
36. APPENDIX
37. Equivalence of Parse Trees, Leftmost, and Rightmost Derivations The following statements about a grammar G = (V, Σ, P, S) and a terminal string w are all equivalent:
(1) S ⇒* w (i.e., w is in L(G)).
(2) S ⇒*lm w
(3) S ⇒*rm w
(4) There is a parse tree for G with root S and yield (labels of leaves, from the left) w.
Obviously (2) and (3) each imply (1).
38. Parse Tree Implies LM/RM Derivations Generalize all statements to talk about an arbitrary variable A in place of S.
Except now (1) no longer means w is in L(G).
Induction on the height of the parse tree.
Basis: Height 1: Tree is root A and leaves a1, a2, …, ak, where w = a1a2···ak.
A → w must be a production, so A ⇒*lm w and A ⇒*rm w.
39. Induction: Height > 1: Tree is root A with children = X1, X2, …, Xk.
Those Xi's that are variables are roots of shorter trees.
Thus, the IH says that they have LM derivations of their yields.
Construct a LM derivation of w from A by starting with A ⇒ X1X2…Xk, then using LM derivations from each Xi that is a variable, in order from the left.
RM derivation analogous.
40. Derivations to Parse Trees Induction on length of the derivation.
Basis: One step. There is an obvious parse tree.
Induction: More than one step.
Let the first step be A ⇒ X1X2…Xk.
Subsequent changes can be reordered so that all changes to X1 and the sentential forms that replace it are done first, then those for X2, and so on (i.e., we can rewrite the derivation as a LM derivation).
The derivations from those Xi's that are variables are all shorter than the given derivation, so the IH applies.
By the IH, there are parse trees for each of these derivations.
Make the roots of these trees be children of a new root labeled A.
41. Example Consider derivation S ⇒ AS ⇒ AAS ⇒ AA ⇒ A1A ⇒ A10A1 ⇒ 0110A1 ⇒ 0110011
Sub-derivation from A is: A ⇒ A1 ⇒ 011
Sub-derivation from S is: S ⇒ AS ⇒ A ⇒ 0A1 ⇒ 0011
Each has a parse tree, put them together with new root S.
42. Only-If Proof (i.e., Grammar → PDA) Prove by induction on the number of steps in the leftmost derivation S ⇒*lm α that for any x, (q, wx, S) ⊢* (q, x, β), where
wβ = α
β is the suffix of α that begins at the leftmost variable (β = ε if there is no variable).
Also prove the converse: if (q, wx, S) ⊢* (q, x, β) then S ⇒*lm wβ.
Inductive proofs in book.
As a consequence, if y is a terminal string, then S ⇒*lm y iff (q, y, S) ⊢* (q, ε, ε), i.e., y is in L(G) iff y is in N(A).