This lecture covers topics such as grammars for expressions and if-then-else, formal proofs of L(G), top-down parsing, left factoring, and removing left recursion.
Lecture 6: Grammar Modifications • CSCE 531 Compiler Construction • January 30, 2006 • Topics • Grammars for expressions and if-then-else • Formal proofs of L(G) • Top-down parsing • Left factoring • Removing left recursion • Readings: 4.3-4.4 • Homework: 4.1, 4.2a, 4.6a, 4.11a
Overview • Last Time • Should have mentioned DFA minimization • Grammars, Derivations, Ambiguity • Lec05-Grammars: Slides 1-27 • Today’s Lecture • Ambiguity in classic programming language grammars • Expressions • If-Then-Else • Top-Down parsing • References • Sections 4.3-4.4 • Parse demos • http://ag-kastens.uni-paderborn.de/lehre/material/compiler/parsdemo/ • Chomsky Hierarchy – types of grammars and recognizers • http://en.wikipedia.org/wiki/Chomsky_hierarchy • Homework: 4.1, 4.2a, 4.6a, 4.11a
DFA Minimization • Algorithm 3.6 in the text • We will not cover this algorithm beyond this slide. • Partition the states into F and Q−F (final and non-final states). • Refine the partitioning as much as possible. • Refinement: a string x = x1 x2 … xt distinguishes two states Si and Sk if, starting in each and following the path determined by x, one ends in an accepting state and the other ends in a non-accepting state. • [Figure: on input x, Si reaches an accepting state Sa while Sk reaches a non-accepting state Sna.]
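As an informal illustration of the refinement idea (not Algorithm 3.6 itself), here is a minimal Python sketch: start from the partition {F, Q−F} and repeatedly split any group whose members move to different groups on some input symbol. The function and state names are my own.

```python
# A minimal sketch of minimization by partition refinement (not Algorithm 3.6
# verbatim): start with {F, Q - F} and split any group whose members go to
# different groups on some input symbol, until nothing changes.

def minimize_partition(Q, sigma, delta, F):
    """delta: dict (state, symbol) -> state. Returns the final list of groups."""
    partition = [g for g in (set(F), set(Q) - set(F)) if g]   # drop an empty group
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)
        new_partition = []
        for g in partition:
            # Two states stay together only if every symbol sends them to the same group.
            buckets = {}
            for s in g:
                key = tuple(group_of(delta[(s, a)]) for a in sigma)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):
            return new_partition
        partition = new_partition

# Example: q1 and q2 are indistinguishable, so they end up in the same group.
delta = {("q0", "a"): "q1", ("q0", "b"): "q2",
         ("q1", "a"): "q3", ("q1", "b"): "q3",
         ("q2", "a"): "q3", ("q2", "b"): "q3",
         ("q3", "a"): "q3", ("q3", "b"): "q3"}
print(minimize_partition({"q0", "q1", "q2", "q3"}, ["a", "b"], delta, {"q3"}))
# e.g. [{'q3'}, {'q0'}, {'q1', 'q2'}]  (group order may vary)
```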
LM Derivation of 5 * X + 3 * Y + 17 • Grammar: • E → E + T | E – T | T • T → T * F | T / F | F • F → id | num | ( E ) • Leftmost derivation (the corresponding parse tree is built from the root E downward): • E ⇒ E+T ⇒ E+T+T ⇒ T+T+T • ⇒ T*F+T+T ⇒ F*F+T+T • ⇒ num*F+T+T ⇒ num*id+T+T • ⇒ num*id+T*F+T ⇒ num*id+F*F+T • ⇒ num*id+num*F+T …
Notes on rewritten grammar • It is more complex; more nonterminals, more productions. • It requires more steps in the derivation • But it does eliminate the ambiguity, so we make the right choices in derivations.
Ambiguous Grammar 2: If-Else • Another classic ambiguity problem in programming languages is the IF-ELSE • Stmt → if Expr then Stmt | if Expr then Stmt else Stmt | other stmts • Abbreviated: S → if E then S | if E then S else S | OS
Ambiguity This sentential form has two derivations if Expr1 then if Expr2 then Stmt1 else Stmt2
Removing the ambiguity • To eliminate the ambiguity we must rewrite the grammar so that it cannot generate the problem form • We must associate each else with the innermost unmatched if • Idea: split statements into those with a matching else and those without (see the next slide)
Removing the IF-ELSE Ambiguity • Original: Stmt → if Expr then Stmt | if Expr then Stmt else Stmt | other stmts • Rewritten: • Stmt → MatchedStmt | UnmatchedStmt • MatchedStmt → if Expr then MatchedStmt else MatchedStmt | OtherStatements • UnmatchedStmt → if Expr then Stmt | if Expr then MatchedStmt else UnmatchedStmt
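For a concrete sense of what the rewrite buys us, here is a small recursive-descent sketch (hypothetical token and node names, not code from the course) that yields the same resolution the Matched/Unmatched grammar enforces: each else attaches to the innermost unmatched if.

```python
# A small sketch of a recursive-descent parser for
#     S -> "if" ID "then" S ("else" S)?  |  other
# It resolves the dangling else the same way the Matched/Unmatched rewrite does:
# every "else" is attached to the innermost unmatched "if".

def parse_stmt(tokens, i=0):
    """Return (parse_tree, next_index) for the statement starting at tokens[i]."""
    if tokens[i] == "if":
        cond = tokens[i + 1]                 # assume the condition is one identifier
        assert tokens[i + 2] == "then"
        then_branch, i = parse_stmt(tokens, i + 3)
        # Greedily consuming "else" here binds it to this (innermost) "if".
        if i < len(tokens) and tokens[i] == "else":
            else_branch, i = parse_stmt(tokens, i + 1)
            return ("if-then-else", cond, then_branch, else_branch), i
        return ("if-then", cond, then_branch), i
    return ("other", tokens[i]), i + 1       # any non-"if" token is an "other stmt"

# if E1 then if E2 then S1 else S2  ==>  the else pairs with "if E2"
tree, _ = parse_stmt(["if", "E1", "then", "if", "E2", "then", "s1", "else", "s2"])
print(tree)
# ('if-then', 'E1', ('if-then-else', 'E2', ('other', 's1'), ('other', 's2')))
```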
Ambiguity • if Expr1 then if Expr2 then Stmt1 else Stmt2 • (With the rewritten grammar this sentence now has a unique parse: the else pairs with the innermost if.)
Ambiguity that is more than Grammar • The examples of ambiguity we have looked at are solved by tweaking the CFG • Overloading can create deeper ambiguity, e.g., a = f(17) • In some languages, f could be either a function or a subscripted variable • Disambiguating this requires semantics, not just syntax: declarations and type information to say what "f" is • Requires an extra-grammatical solution • Must handle these with a different mechanism • Step outside the grammar rather than use a more complex grammar
Regular versus Context-Free Languages • A regular language is a set of strings that can be: • Recognized by a DFA, • Recognized by an NFA, or • Denoted by regular expressions. • Examples of non-regular languages? • A context-free language is one that is generated by a context-free grammar, e.g. • S → 0S1 | ε
Formal verification of L(G) • Example 4.7 • Induction on the length of the derivation of a sentential form • Formulate the inductive hypothesis in terms of sentential forms • Basis step: n = 1 • Assume derivations of length n satisfy the inductive hypothesis • Show that derivations of length n+1 also satisfy it
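A sketch of this induction pattern, using the grammar S → 0S1 | ε from the previous slide rather than the book's Example 4.7:

```latex
% Sketch of the induction pattern for S -> 0 S 1 | epsilon (grammar from the
% previous slide, not necessarily the book's Example 4.7).
\textbf{Claim.} After $n \ge 1$ derivation steps the sentential form is either
$0^{n} S\, 1^{n}$ or $0^{n-1} 1^{n-1}$.

\textbf{Basis} ($n = 1$): $S \Rightarrow 0S1 = 0^{1} S 1^{1}$ or
$S \Rightarrow \varepsilon = 0^{0} 1^{0}$.

\textbf{Induction:} a form $0^{n-1}1^{n-1}$ contains no nonterminal, so only
$0^{n} S 1^{n}$ can be extended: $S \to 0S1$ gives $0^{n+1} S\, 1^{n+1}$, and
$S \to \varepsilon$ gives $0^{n} 1^{n} = 0^{(n+1)-1} 1^{(n+1)-1}$.

\textbf{Hence} every sentence of $L(G)$ has the form $0^{n}1^{n}$, and conversely
$0^{n}1^{n}$ is derived in $n+1$ steps, so $L(G) = \{\, 0^{n}1^{n} \mid n \ge 0 \,\}$.
```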
Regular Grammars (Linear Grammars) • A right-linear grammar is a restricted form of context-free grammar in which the productions have a special form: • N → T* N2 • N → T* • where N and N2 (possibly the same) are nonterminals and T* is a string of tokens • In these productions, if there is a nonterminal on the right-hand side then it is the last symbol • Linear grammars (right- and left-linear) are also called regular grammars. Why?
DFA → Right-linear Grammar • Consider a DFA M = (Q, Σ, δ, q0, F) • (notice the re-ordering! and Q!) • Construct a grammar G = (N, T, P, S) where • N = Q, i.e., each state corresponds to a nonterminal • T = Σ • For each transition δ(si, a) = sj, we have a production Si → a Sj • And for each state S in F we add a production S → ε • Then L(M) = L(G). How would we formally prove this? • Thus regular languages are a subset of the context-free languages
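A minimal sketch of this construction in Python (illustrative names and an assumed example DFA, not from the text):

```python
# A sketch of the DFA -> right-linear grammar construction: one nonterminal per
# state, a production Si -> a Sj for each transition delta(si, a) = sj, and
# S -> epsilon for each accepting state.

def dfa_to_right_linear(states, alphabet, delta, start, finals):
    """delta is a dict mapping (state, symbol) -> state."""
    productions = []
    for (s, a), t in delta.items():
        productions.append((s, [a, t]))      # Si -> a Sj
    for f in finals:
        productions.append((f, []))          # Sf -> epsilon
    return {"nonterminals": set(states), "terminals": set(alphabet),
            "productions": productions, "start": start}

# Example: a DFA over {a, b} accepting strings that contain "ab".
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q2",
         ("q2", "a"): "q2", ("q2", "b"): "q2"}
g = dfa_to_right_linear({"q0", "q1", "q2"}, {"a", "b"}, delta, "q0", {"q2"})
for lhs, rhs in g["productions"]:
    print(lhs, "->", " ".join(rhs) if rhs else "ε")
```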
Example: DFA → Regular Grammar (Fig 3.23, p. 117) • N0 → a N1 | b N0 • N1 → a N1 | b N2 • N2 → … • N3 → …
Chomsky Hierarchy • Noam Chomsky (linguist): formal levels of grammars • Regular grammars: N → T* N • Context-free grammars: N → (N ∪ T)* • Context-sensitive grammars: αNω → αβω • We can rewrite N as β, but only in the "context" α…ω • Unrestricted grammars: α → β with α and β in (N ∪ T)* • Recognizers: • DFA (regular) • Pushdown automaton: a DFA augmented with a stack • Linear-bounded Turing machine • Turing machine • http://en.wikipedia.org/wiki/Chomsky_hierarchy
Non-Context-Free Languages • Certain languages cannot be generated by any context-free grammar; they are not context-free languages • Examples • Σ = {a, b, c}, L = {wcw | w is in Σ*} • {a^n b^n c^n | n > 0} • However they are context sensitive, or are they? • Well, not relevant for this course. • We would eliminate any non-context-free construct from a programming language! (at least for parsing) • A context-sensitive grammar for the second example: • S → abc | aSBc • cB → Bc • bB → bb • Alternative form of context-sensitive productions: α → β with |α| ≤ |β|
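As a quick check (not from the text), the grammar above really does derive a²b²c²:

```latex
% Deriving aabbcc with  S -> abc | aSBc,   cB -> Bc,   bB -> bb
S \Rightarrow aSBc \Rightarrow aabcBc
  \Rightarrow aabBcc \quad (\text{by } cB \to Bc)
  \Rightarrow aabbcc \quad (\text{by } bB \to bb)
```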
Parsing Techniques • Top-down parsers • Start at the root and try to generate the parse tree • Pick a production and try to match the input • If we make a bad choice then backtrack and try another choice • Grammars that allow backtrack-free (predictive) parsing exist for many constructs, and we will see how to obtain them • Bottom-up parsers • Start at the leaves and grow toward the root • As input is consumed, encode possibilities in an internal state • Start in a state valid for legal first tokens • Bottom-up parsers handle a large class of grammars
Top-down Parsing Algorithm • Add the start symbol as the root of the parse tree • While the frontier of the parse tree != input { • Pick the leftmost non-terminal in the frontier, A • Choose an A-production, A → β1 β2 … βk, and expand the tree (other choices are saved on a stack) • If a token is added to the frontier that does not match the input, backtrack and choose another production (if we run out of choices the parse fails) • } • We now will look at modifications to grammars to facilitate top-down parsing.
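A minimal sketch of this backtracking scheme in Python (my own illustrative code; note that a full backtracking parser would also retry earlier choices when a later match fails, which this toy version omits):

```python
# A minimal sketch of top-down parsing with backtracking: try each A-production in
# turn, expand left to right, and abandon the choice if the frontier cannot match
# the remaining input.

def parse(grammar, symbol, tokens, pos=0):
    """Try to derive tokens[pos:] from `symbol`; return the end position or None."""
    if symbol not in grammar:                       # terminal: must match the input
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    for production in grammar[symbol]:              # try each A-production in order
        p = pos
        for sym in production:                      # expand left to right
            p = parse(grammar, sym, tokens, p)
            if p is None:
                break                               # mismatch: backtrack, next choice
        else:
            return p                                # every symbol matched
    return None

# A right-recursive expression grammar (a left-recursive one would loop forever here).
grammar = {
    "E": [["T", "+", "E"], ["T"]],
    "T": [["F", "*", "T"], ["F"]],
    "F": [["num"], ["(", "E", ")"]],
}
tokens = ["num", "*", "num", "+", "num"]
print(parse(grammar, "E", tokens) == len(tokens))   # True: the whole input derives from E
```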
Reconsider Our Expression Grammar • First we number the productions for documentation • 1. E → E + T • 2. E → E – T • 3. E → T • 4. T → T * F • 5. T → T / F • 6. T → F • 7. F → id • 8. F → num • 9. F → ( E ) • Example: 5 * X + 3 * Y + 17 • Token sequence: num * id + num * id + num
How do we choose which production? • It should be guided by trying to match the input • E.g., if the next input symbol is the token "if" and we are choosing between • S → if Expr then S else S • S → while Expr do S • What choice is best? Well, the choice is obvious!
How do we choose which production? (continued) • But if the next input symbol is the token "if" and we are choosing between • S → if Expr then S else S • S → if Expr then S • What choice is best? Well, now the choice is not obvious!
Other Grammar Modifications to Guide the Parser • Left Factoring • Stmt → if Expr then Stmt else Stmt | if Expr then Stmt • If the next tokens are "if" and "id" then we have no basis to choose; in fact we have to look ahead to see the "else" • Stmt → if Expr then Stmt Rest • Rest → else Stmt | ε • Left Recursion • A → Aα | β • Why recursive? • A ⇒ Aα ⇒ Aαα ⇒ Aααα ⇒ … ⇒ Aα^n ⇒ βα^n • What do we do? • A → βA' and A' → αA' | ε • A ⇒ βA' ⇒ βαA' ⇒ βααA' ⇒ … ⇒ βα^n A' ⇒ βα^n
General Left Factoring Algorithm • Algorithm 4.2 • Input: a grammar G • Output: an equivalent left-factored grammar • Method: • For each nonterminal A • find the longest prefix α common to two or more A-productions • A → αβ1 | αβ2 | … | αβm | ξ, where ξ represents the A-productions that do not start with the prefix α • Replace with • A → αA' | ξ • A' → β1 | β2 | … | βm
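A rough sketch of one factoring round in Python (my own simplification of Algorithm 4.2; names are illustrative, and a complete implementation would repeat until no two productions share a prefix):

```python
# A rough sketch of one left-factoring round: pull out the longest prefix shared by
# two or more A-productions and introduce a fresh nonterminal A'.
from itertools import takewhile

def left_factor_once(lhs, productions):
    """productions: list of right-hand-side tuples for `lhs`."""
    best = ()
    for i in range(len(productions)):               # longest prefix shared by any pair
        for j in range(i + 1, len(productions)):
            common = tuple(a for a, b in takewhile(lambda p: p[0] == p[1],
                                                   zip(productions[i], productions[j])))
            if len(common) > len(best):
                best = common
    if not best:
        return {lhs: productions}                   # nothing to factor
    new = lhs + "'"                                 # fresh nonterminal A'
    factored = [p[len(best):] or ("ε",) for p in productions if p[:len(best)] == best]
    rest = [p for p in productions if p[:len(best)] != best]
    return {lhs: rest + [best + (new,)], new: factored}

stmt = [("if", "Expr", "then", "Stmt", "else", "Stmt"),
        ("if", "Expr", "then", "Stmt")]
print(left_factor_once("Stmt", stmt))
# Stmt -> if Expr then Stmt Stmt'        Stmt' -> else Stmt | ε
```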
Left Factoring: a graphical explanation of the same idea • A → αβ1 | αβ2 | αβ3 becomes A → αZ with Z → β1 | β2 | β3 (in general Z → β1 | β2 | … | βn) • [Figure: the three branches leaving A through the common prefix are merged into one branch to a new node Z, which then branches on β1, β2, β3.] • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
Left Factoring, graphically • Factor → Identifier | Identifier [ ExprList ] | Identifier ( ExprList ) gives no basis for choice; after factoring out Identifier, the next word determines the correct choice among [ ExprList ], ( ExprList ), or nothing • [Figure: the three Factor alternatives share the prefix Identifier; after factoring, a single Identifier branch is followed by the choice of [ ExprList ], ( ExprList ), or ε.] • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
Eliminating Left Recursion: Expr Grammar • General approach for immediate left recursion • Replace A → Aα | β • with A → βA' and A' → αA' | ε • So for the expression grammar • E → E + T | E – T | T • we rewrite the E productions as • E → T E' • E' → + T E' | – T E' | ε
Eliminating Left Recursion: Expr Grammar (continued) • Replace T → T * F | T / F | F with • T → F T' • T' → * F T' | / F T' | ε • No rewriting is needed for the F productions, so the grammar becomes: • E → T E' • E' → + T E' | – T E' | ε • T → F T' • T' → * F T' | / F T' | ε • F → id | num | ( E )
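One payoff of the rewritten grammar: it maps directly onto recursive-descent procedures that terminate, whereas the left-recursive E → E + T would call the procedure for E forever without consuming input. A sketch (my own code, assuming the token names id, num, +, -, *, /, and parentheses):

```python
# A sketch of a recursive-descent parser for the rewritten grammar. Each
# nonterminal becomes a procedure; the epsilon alternatives simply return.

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, t):
        assert self.peek() == t, f"expected {t}, got {self.peek()}"
        self.i += 1

    def E(self):                        # E  -> T E'
        self.T(); self.Eprime()

    def Eprime(self):                   # E' -> + T E' | - T E' | ε
        if self.peek() in ("+", "-"):
            self.eat(self.peek()); self.T(); self.Eprime()

    def T(self):                        # T  -> F T'
        self.F(); self.Tprime()

    def Tprime(self):                   # T' -> * F T' | / F T' | ε
        if self.peek() in ("*", "/"):
            self.eat(self.peek()); self.F(); self.Tprime()

    def F(self):                        # F  -> id | num | ( E )
        if self.peek() == "(":
            self.eat("("); self.E(); self.eat(")")
        else:
            assert self.peek() in ("id", "num"); self.i += 1

# num * id + num * id + num, i.e. 5 * X + 3 * Y + 17
p = Parser(["num", "*", "id", "+", "num", "*", "id", "+", "num"])
p.E()
print("accepted" if p.i == len(p.toks) else "syntax error")
```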
Eliminating Immediate Left Recursion • In general, consider all the A-productions • A → Aα1 | Aα2 | … | Aαn | β1 | β2 | … | βm • Replace them with • A → β1A' | β2A' | … | βmA' • A' → α1A' | α2A' | … | αnA' | ε • But not all left recursion is immediate. Consider • S → Aa | Bb | c • A → Ca | aA | a • C → Sc • B → bB | b • Then S ⇒ Aa ⇒ Caa ⇒ Scaa • In general, a grammar is left recursive if A ⇒+ Aβ for some nonterminal A
Eliminating Left Recursion Algorithm • Algorithm 4.1: Eliminating Left Recursion • Input: a grammar with no cycles or ε-productions • Output: an equivalent grammar with no left recursion • Arrange the nonterminals in some order A1, A2, …, An • for i = 1 to n do • for j = 1 to i−1 do • replace each production of the form Ai → Aj ξ by the productions Ai → δ1ξ | δ2ξ | … | δkξ, where Aj → δ1 | δ2 | … | δk are all the current Aj-productions • end • eliminate immediate left recursion among the Ai-productions • end
Eliminating Left Recursion • How does this algorithm work? • 1. Impose an arbitrary order on the nonterminals • 2. The outer loop cycles through the nonterminals in that order • 3. The inner loop ensures that a production expanding Ai has no nonterminal Aj in its rhs, for j < i • 4. The last step in the outer loop converts any direct recursion on Ai to right recursion using the transformation shown earlier • 5. New nonterminals are added at the end of the order and have no left recursion • Invariant: at the start of the ith outer-loop iteration, for all k < i, no production that expands Ak contains a nonterminal As in its rhs, for s < k
Example • Order of symbols: G, E, T • G → E • E → E + T • E → T • T → E ~ T • T → id • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
Example • Order of symbols: G, E, T • 1. Ai = G: • G → E • E → E + T • E → T • T → E ~ T • T → id • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
Example • Order of symbols: G, E, T • 1. Ai = G: • G → E • E → E + T • E → T • T → E ~ T • T → id • 2. Ai = E: • G → E • E → T E' • E' → + T E' • E' → ε • T → E ~ T • T → id • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
Example • Order of symbols: G, E, T • 1. Ai = G: • G → E • E → E + T • E → T • T → E ~ T • T → id • 2. Ai = E: • G → E • E → T E' • E' → + T E' • E' → ε • T → E ~ T • T → id • 3. Ai = T, As = E: • G → E • E → T E' • E' → + T E' • E' → ε • T → T E' ~ T • T → id • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
Example • Order of symbols: G, E, T • 1. Ai = G: • G → E • E → E + T • E → T • T → E ~ T • T → id • 2. Ai = E: • G → E • E → T E' • E' → + T E' • E' → ε • T → E ~ T • T → id • 3. Ai = T, As = E: • G → E • E → T E' • E' → + T E' • E' → ε • T → T E' ~ T • T → id • 4. Ai = T: • G → E • E → T E' • E' → + T E' • E' → ε • T → id T' • T' → E' ~ T T' • T' → ε • From Engineering a Compiler by Keith D. Cooper and Linda Torczon
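A sketch of Algorithm 4.1 in Python, run on the example grammar above (my own code; it assumes, as the algorithm does, that the grammar has no cycles or ε-productions):

```python
# A sketch of Algorithm 4.1. Productions are tuples of symbols; () stands for ε.

def eliminate_immediate(A, prods):
    """A -> A a1 | ... | b1 | ...  becomes  A -> bi A',  A' -> ai A' | ε."""
    recursive = [p[1:] for p in prods if p and p[0] == A]
    others = [p for p in prods if not p or p[0] != A]
    if not recursive:
        return {A: prods}
    Aprime = A + "'"
    return {A: [b + (Aprime,) for b in others],
            Aprime: [a + (Aprime,) for a in recursive] + [()]}

def eliminate_left_recursion(grammar, order):
    g = {n: list(ps) for n, ps in grammar.items()}
    for i, Ai in enumerate(order):
        for Aj in order[:i]:
            # Replace Ai -> Aj ξ with Ai -> δ1 ξ | ... | δk ξ for the current Aj -> δ's.
            new = []
            for p in g[Ai]:
                if p and p[0] == Aj:
                    new.extend(d + p[1:] for d in g[Aj])
                else:
                    new.append(p)
            g[Ai] = new
        g.update(eliminate_immediate(Ai, g[Ai]))    # then remove immediate recursion
    return g

# The grammar from the example above.
grammar = {"G": [("E",)],
           "E": [("E", "+", "T"), ("T",)],
           "T": [("E", "~", "T"), ("id",)]}
for lhs, prods in eliminate_left_recursion(grammar, ["G", "E", "T"]).items():
    print(lhs, "->", " | ".join(" ".join(p) or "ε" for p in prods))
# G -> E,  E -> T E',  E' -> + T E' | ε,  T -> id T',  T' -> E' ~ T T' | ε
```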
Predictive Parsing • Basic idea: given A → α | β, the parser should be able to choose between α and β • FIRST sets: for some rhs α in G, define FIRST(α) as the set of tokens that appear as the first symbol in some string that derives from α • That is, x ∈ FIRST(α) iff α ⇒* xγ, for some γ • If A → α and A → β both appear in the grammar, and FIRST(α) ∩ FIRST(β) = ∅ • This would appear to allow the parser to make a correct choice with a lookahead of exactly one symbol! (If there are no ε-productions then it does.)
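A sketch of computing FIRST for every nonterminal of the rewritten expression grammar (my own code; ε is represented by the empty string, and FIRST of a whole right-hand side follows the same rule):

```python
# A sketch of computing FIRST for each nonterminal; an ε-production is the empty
# tuple (), and ε appears in a FIRST set as "".

def first_sets(grammar):
    """grammar: dict nonterminal -> list of right-hand-side tuples."""
    FIRST = {nt: set() for nt in grammar}
    changed = True
    while changed:                                   # iterate to a fixed point
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                before = len(FIRST[A])
                all_nullable = True
                for sym in rhs:
                    if sym in grammar:               # nonterminal
                        FIRST[A] |= FIRST[sym] - {""}
                        if "" not in FIRST[sym]:
                            all_nullable = False
                            break
                    else:                            # terminal: first symbol found
                        FIRST[A].add(sym)
                        all_nullable = False
                        break
                if all_nullable:                     # the whole rhs can derive ε
                    FIRST[A].add("")
                if len(FIRST[A]) != before:
                    changed = True
    return FIRST

# The left-recursion-free expression grammar from the earlier slides.
grammar = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ("-", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ("/", "F", "T'"), ()],
    "F":  [("id",), ("num",), ("(", "E", ")")],
}
for nt, f in first_sets(grammar).items():
    print(nt, sorted(s if s else "ε" for s in f))
# e.g. E ['(', 'id', 'num'], E' ['+', '-', 'ε'], T' ['*', '/', 'ε'], F ['(', 'id', 'num']
```

With these sets in hand, the parser can check whether FIRST of the alternatives for each nonterminal are pairwise disjoint, which is the condition stated above.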