This article explores the concept of parsing context-free grammars (CFG) and dealing with ambiguous grammars using examples and explanations.
CSE P501 – Compilers
Parsing
• Context-Free Grammars (CFG)
• Ambiguous Grammars
• Next
Jim Hogg - UW - CSE - P501
Parsing
[Figure: compiler pipeline from Source to Target. Front End: Scan (chars → tokens), Parse (tokens → AST), Convert (AST → IR). 'Middle End': Optimize (IR → IR). Back End: Select Instructions, Allocate Registers, Emit (IR → Machine Code).]
AST = Abstract Syntax Tree; IR = Intermediate Representation
Valid Tokens != Valid Program
MiniJava includes the following tokens (among many others):
• class int [ ( . true < this ) + * ; while = if id ilit ! / new {
So a MiniJava Scanner would happily accept the following program:
• int ; = true { while ( x < true * if { or 123 ) goto count_99
We rely on a MiniJava Parser to reject this kind of gibberish.
But how do we specify what makes a valid MiniJava program?
What is Parsing?
• Analogous to parsing an English sentence
  • Analyze words into subject, verb, object, etc
• How to parse a program?
  • Analyze tokens into language constructs: assignment, if-clause, function-call, while-loop, etc
Parsing – so what's the problem?
• The set of valid programs is infinite
• The set of invalid programs is infinite
• Q: How to specify all valid programs, succinctly?
• A: Define a Grammar
  • More specifically, a Context-Free Grammar (CFG)
Context-Free Grammars (CFG)
Grammar for the Hokum Language
• Prog → Stm ; Prog | Stm
• Stm → AsStm | IfStm
• AsStm → Var = Exp
• IfStm → if Exp then AsStm
• VorC → Var | Const
• Exp → VorC | VorC + VorC
• Var → [a-z]
• Const → [0-9]
• Context-Free Grammar ~ CFG ~ Grammar ~ Backus-Naur Form ~ BNF
• Productions, or Rules
• Terminals & Non-Terminals; Start (Symbol)
• Multiple languages present in the description
Example Hokum Programs
Hokum BNF Grammar
• Prog → Stm ; Prog | Stm
• Stm → AsStm | IfStm
• AsStm → Var = Exp
• IfStm → if Exp then AsStm
• VorC → Var | Const
• Exp → VorC | VorC + VorC
• Var → [a-z]
• Const → [0-9]
Legal: a = 1; b = a + 4; z = 1; if b + 3 then z = 2
Illegal: a = x < 20 b = a + 4 + 5 ; z = 1 if (a == 33) z < 2 ;
But how do we know which programs are legal or illegal, in Hokum?
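One way to answer that question mechanically is a small recognizer with one method per non-terminal. The sketch below is an illustration, not part of the course material: the class name `HokumRecognizer`, the helper `is_legal`, and the tokenizing regex are all assumptions, and the legal example is treated as a single program.

```python
import re

def tokenize(text):
    # Words, numbers, or any other single non-space character.
    return re.findall(r"[a-z]+|[0-9]+|\S", text)

def is_var(tok):
    return re.fullmatch(r"[a-z]", tok) is not None

def is_const(tok):
    return re.fullmatch(r"[0-9]", tok) is not None

class HokumRecognizer:
    """Hypothetical recursive-descent recognizer for the Hokum grammar."""

    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def expect(self, ok):
        tok = self.peek()
        if tok is None or not ok(tok):
            raise SyntaxError(f"unexpected token {tok!r} at position {self.pos}")
        self.pos += 1

    def prog(self):      # Prog -> Stm ; Prog | Stm
        self.stm()
        if self.peek() == ";":
            self.pos += 1
            self.prog()

    def stm(self):       # Stm -> AsStm | IfStm
        if self.peek() == "if":
            self.if_stm()
        else:
            self.as_stm()

    def as_stm(self):    # AsStm -> Var = Exp
        self.expect(is_var)
        self.expect(lambda t: t == "=")
        self.exp()

    def if_stm(self):    # IfStm -> if Exp then AsStm
        self.expect(lambda t: t == "if")
        self.exp()
        self.expect(lambda t: t == "then")
        self.as_stm()

    def exp(self):       # Exp -> VorC | VorC + VorC
        self.vorc()
        if self.peek() == "+":
            self.pos += 1
            self.vorc()

    def vorc(self):      # VorC -> Var | Const
        self.expect(lambda t: is_var(t) or is_const(t))

def is_legal(text):
    parser = HokumRecognizer(text)
    try:
        parser.prog()
    except SyntaxError:
        return False
    # Legal only if the whole input was consumed.
    return parser.pos == len(parser.toks)
```

Note that under this grammar a trailing `;` is itself illegal, because `Prog → Stm ; Prog` demands another statement after the semicolon.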
Derivation
Grammar: Prog → Stm ; Prog | Stm, Stm → AsStm | IfStm, AsStm → Var = Exp, IfStm → if Exp then AsStm, VorC → Var | Const, Exp → VorC | VorC + VorC, Var → [a-z], Const → [0-9]
Prog => Stm ; Prog
=> AsStm ; Prog
=> Var = Exp ; Prog
=> a = Exp ; Prog
=> a = VorC ; Prog
=> a = Const ; Prog
=> a = 1 ; Prog
=> a = 1 ; Stm
=> a = 1 ; IfStm
=> a = 1 ; if Exp then AsStm
=> a = 1 ; if VorC + VorC then AsStm
=> a = 1 ; if Var + VorC then AsStm
=> a = 1 ; if a + VorC then AsStm
=> a = 1 ; if a + Const then AsStm
=> a = 1 ; if a + 1 then AsStm
=> a = 1 ; if a + 1 then Var = Exp
=> a = 1 ; if a + 1 then b = Exp
=> a = 1 ; if a + 1 then b = VorC
=> a = 1 ; if a + 1 then b = Const
=> a = 1 ; if a + 1 then b = 2
• => versus →
• Leftmost, rightmost, middlemost
• Sentential Form & Sentence
• What is a Context-Sensitive Grammar?
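The derivation can be replayed mechanically: at each step, replace the leftmost non-terminal in the sentential form with the right-hand side of one of its productions. The following is a minimal sketch under the assumption that the grammar is stored as a Python dict; the names `GRAMMAR`, `leftmost_step`, `derive` and `CHOICES` are hypothetical.

```python
GRAMMAR = {
    "Prog": [["Stm", ";", "Prog"], ["Stm"]],
    "Stm": [["AsStm"], ["IfStm"]],
    "AsStm": [["Var", "=", "Exp"]],
    "IfStm": [["if", "Exp", "then", "AsStm"]],
    "VorC": [["Var"], ["Const"]],
    "Exp": [["VorC"], ["VorC", "+", "VorC"]],
    "Var": [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
    "Const": [[d] for d in "0123456789"],
}

def leftmost_step(form, rhs):
    """Rewrite the leftmost non-terminal of a sentential form with rhs."""
    for i, sym in enumerate(form):
        if sym in GRAMMAR:                      # dict keys are the non-terminals
            if rhs not in GRAMMAR[sym]:
                raise ValueError(f"{sym} -> {rhs} is not a production")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

def derive(start, choices):
    """Return every sentential form of the leftmost derivation, in order."""
    forms = [[start]]
    for rhs in choices:
        forms.append(leftmost_step(forms[-1], rhs))
    return forms

# The production chosen at each step of the derivation shown on the slide.
CHOICES = [
    ["Stm", ";", "Prog"], ["AsStm"], ["Var", "=", "Exp"], ["a"],
    ["VorC"], ["Const"], ["1"], ["Stm"], ["IfStm"],
    ["if", "Exp", "then", "AsStm"], ["VorC", "+", "VorC"],
    ["Var"], ["a"], ["Const"], ["1"],
    ["Var", "=", "Exp"], ["b"], ["VorC"], ["Const"], ["2"],
]

forms = derive("Prog", CHOICES)
for form in forms:
    print("=>", " ".join(form))
```

Every intermediate list is a sentential form; only the last one, which contains terminals alone, is a sentence.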
Parse Tree
Grammar: Prog → Stm ; Prog | Stm, Stm → AsStm | IfStm, AsStm → Var = Exp, IfStm → if Exp then AsStm, VorC → Var | Const, Exp → VorC | VorC + VorC, Var → [a-z], Const → [0-9]
[Figure: parse tree for a = 1 ; if a + 1 then b = 2]
Prog( Stm( AsStm( Var(a) = Exp( VorC( Const(1) ) ) ) ) ; Prog( Stm( IfStm( if Exp( VorC( Var(a) ) + VorC( Const(1) ) ) then AsStm( Var(b) = Exp( VorC( Const(2) ) ) ) ) ) ) )
Junk Nodes in the Parse Tree
[Figure: the same parse tree for a = 1 ; if a + 1 then b = 2]
AST (Abstract Syntax Tree)
[Figure: AST for a = 1 ; if a + 1 then b = 2]
Prog( =( Var:a, Const:1 ), IfStm( +( Var:a, Const:1 ), =( Var:b, Const:2 ) ) )
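In code, an AST of this shape is usually a handful of small node classes. The sketch below is one possible encoding, assuming Python dataclasses; the class and field names (`Assign`, `IfStm`, `target`, `cond` and so on) are illustrative choices, not taken from the course.

```python
from dataclasses import dataclass, fields, is_dataclass

@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: int

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Assign:
    target: Var
    value: object

@dataclass
class IfStm:
    cond: object
    then: Assign

@dataclass
class Prog:
    stms: list

def count_nodes(node):
    """Count AST nodes; plain field values such as "a" or 1 are not nodes."""
    if isinstance(node, list):
        return sum(count_nodes(n) for n in node)
    if not is_dataclass(node):
        return 0
    return 1 + sum(count_nodes(getattr(node, f.name)) for f in fields(node))

# AST for: a = 1 ; if a + 1 then b = 2
AST = Prog([
    Assign(Var("a"), Const(1)),
    IfStm(Add(Var("a"), Const(1)), Assign(Var("b"), Const(2))),
])

print(count_nodes(AST))
```

The chain nodes of the parse tree (Stm, AsStm, VorC, and the punctuation) have no counterpart here, which is the point of the AST.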
Why Not Just RegEx?
Try to invent a regex description for arithmetic expressions: single-character variable names; operators + - × and ÷
v = [a-z] // variable
o = + | - | × | ÷ // operator
v ( o v )* // derives a + b × c: ok, but now add ( )
(? v (o v )? )* // derives (a + b) × c
// but also gibberish like: a + b) × c (
Almost every programming language includes such balanced pairs: ( ), { }, begin end.
Conclusion: regex won't work. More generally, regexes correspond to DFAs. They can only 'count' pairs up to a finite limit.
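The failure is easy to reproduce. The sketch below uses a regex of my own (not the one on the slide, and with ASCII `*` and `/` as operators) that allows parentheses around any variable, next to a checker that keeps an unbounded nesting counter, which is exactly what a DFA cannot do. The names `LOOSE` and `balanced` are hypothetical.

```python
import re

# A variable with optional parentheses either side, then (operator, variable)*.
# Assumed illustration only: it cannot insist that the parentheses pair up.
LOOSE = re.compile(r"\(*[a-z]\)*([+\-*/]\(*[a-z]\)*)*")

def balanced(text):
    """Check that parentheses pair up, using a counter of unbounded depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
    return depth == 0

for text in ["(a+b)*c", "a+b)*(c"]:
    print(text, bool(LOOSE.fullmatch(text)), balanced(text))
```

The regex accepts both strings; only the counter tells them apart.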
Context-Sensitive Grammar?
• All compiler work uses Context-Free Grammars, or CFGs
• Why so-called? Alternatively:
• What is a non-context-free grammar? (ie, a Context-Sensitive Grammar)
• Suppose production B → γ
• CFG: we can replace B by γ, no matter what
  • eg: α B β => α γ β
• CSG: we can replace B by γ only in certain contexts. Ie, only when B is preceded and/or followed by certain strings
  • eg: c B d → c γ d
Example CSG
The following CSG generates the language a^n b^n c^n for n >= 1
• S → a b c
•   | a S B c
• c B → W B
• W B → W X
• W X → B X
• B X → B c
• b B → b b
Note: CSGs will not be discussed further, nor examined, as part of P501
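Because every rule here has a right-hand side at least as long as its left-hand side, the sentences up to a given length can be enumerated by brute-force string rewriting. This is a sketch for checking the claim on small n, assuming upper-case letters are non-terminals and lower-case letters are terminals; `RULES` and `generate` are hypothetical names.

```python
from collections import deque

RULES = [
    ("S", "abc"), ("S", "aSBc"),
    ("cB", "WB"), ("WB", "WX"), ("WX", "BX"), ("BX", "Bc"),
    ("bB", "bb"),
]

def generate(max_len):
    """Return every sentence of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    sentences = set()
    while queue:
        form = queue.popleft()
        if form.islower():                 # terminals only: a sentence
            sentences.add(form)
            continue
        for lhs, rhs in RULES:
            start = form.find(lhs)
            while start != -1:             # try the rule at every position
                new = form[:start] + rhs + form[start + len(lhs):]
                # No rule shrinks a form, so longer forms can be pruned safely.
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return sentences

print(sorted(generate(9), key=len))
```

Note how the rule `c B → W B` only fires when B follows a c: that dependence on the neighbours is what makes the grammar context-sensitive.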
Parsing
• The syntax of most programming languages can be specified by a Context-Free Grammar or CFG
• Parsing = "How to fill the gap between Start Symbol and Sentence"
• L(G) = the language generated by G = the set of sentences generated by G
• Parsing: Given G and a sentence w in L(G), construct the derivation (parse tree) for w in some order
• As we parse, do something useful at each node in the tree
"in some order" • Top-down • Start with the root • Traverse the parse tree depth-first; scan tokens, Left-to-right; create a Leftmost derivation • LL(k) • Bottom-up • Start at leaves and build up to the root; scan tokens, Left-to-right; create a Rightmost derivation (in reverse) • LR(k) and subsets: LALR(k) and SLR(k) Jim Hogg - UW - CSE - P501
"do something useful at each node" • Perform some semantic action: • Construct nodes of full parse tree (rare) • Construct abstract syntax tree (common) • Construct linear, lower-level representation • like assembler code • Generate target code on the fly • 1-pass compiler • not common in production compilers: poor code quality Jim Hogg - UW - CSE - P501
Context-Free Grammars – Formal Description
• A grammar G is a tuple <N, T, P, S> where
  • N a finite set of Non-terminal symbols
  • T a finite set of Terminal symbols
  • P a finite set of Productions
    • A subset of N × (N ∪ T)*
  • S the start symbol, a distinguished element of N
    • If not specified otherwise, this is taken as the Non-Terminal on the LHS of the first production
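The tuple translates directly into a data structure. The sketch below is one assumed encoding (a frozen dataclass named `Grammar` with a hypothetical `is_well_formed` check); the example instance is a small arithmetic grammar chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset   # N
    terminals: frozenset      # T
    productions: tuple        # P: pairs (lhs, rhs), rhs a tuple of symbols
    start: str                # S

    def is_well_formed(self):
        """Check the side conditions of the definition <N, T, P, S>."""
        symbols = self.nonterminals | self.terminals
        return (
            not (self.nonterminals & self.terminals)      # N and T are disjoint
            and self.start in self.nonterminals           # S is an element of N
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.productions          # P within N x (N u T)*
            )
        )

# Illustrative instance: Exp -> Exp op Exp | Dig, Dig -> 0..9
EXPR = Grammar(
    nonterminals=frozenset({"Exp", "Dig"}),
    terminals=frozenset("+-*/0123456789"),
    productions=tuple(
        [("Exp", ("Exp", op, "Exp")) for op in "+-*/"]
        + [("Exp", ("Dig",))]
        + [("Dig", (d,)) for d in "0123456789"]
    ),
    start="Exp",
)

print(EXPR.is_well_formed())
```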
Standard Notations
• a, b, c: elements of T
• A, B, C: elements of N
• w, x, y, z: elements of T*
• X, Y, Z: elements of N ∪ T
• α, β, γ: elements of (N ∪ T)*
• (A, α) ∈ P is written A → α
Derivation Relations (1)
• if B → γ ∈ P then α B β => α γ β
  • simply affirms G is context-free
• A =>* α
  • denotes there is a chain of zero-or-more productions, starting with A, that generates α
  • reflexive, transitive closure of =>
Derivation Relations (2)
• if B → γ ∈ P then w B β =>lm w γ β
  • derives leftmost
  • the prefix w before B is all terminals (by construction)
• if B → γ ∈ P then α B w =>rm α γ w
  • derives rightmost
  • the prefix α before B may include terminals and non-terminals
• We will only be interested in leftmost and rightmost derivations, not random orderings
Languages
• All the sentences (strings of Terminals) I can generate from Non-Terminal A:
  • For A ∈ N, L(A) = { w | A =>* w }
• All the sentences (strings of Terminals) I can generate from start symbol S:
  • If S is the start symbol of grammar G, define L(G) = L(S)
Reduced Grammars
• Grammar G is reduced iff for every production A → α in P there is some derivation S =>* x A z => x α z =>* xyz
  • ie, no production is useless
• Convention: we will use only reduced grammars
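A common way to test this property is in two passes: first find the non-terminals that can derive some terminal string, then check that every non-terminal is reachable from the start symbol. The sketch below assumes a grammar stored as a dict from non-terminal to a list of right-hand sides, with any symbol that is not a key treated as a terminal; the function names and the `Loop` and `Spare` rules in the usage lines are hypothetical.

```python
def productive_nonterminals(grammar):
    """Non-terminals that derive at least one string of terminals."""
    productive = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in productive:
                continue
            for rhs in alternatives:
                if all(s in productive or s not in grammar for s in rhs):
                    productive.add(lhs)
                    changed = True
                    break
    return productive

def is_reduced(grammar, start):
    productive = productive_nonterminals(grammar)
    # Every production must use only terminals and productive non-terminals.
    for alternatives in grammar.values():
        for rhs in alternatives:
            if not all(s in productive or s not in grammar for s in rhs):
                return False
    # Every non-terminal must be reachable from the start symbol.
    reached, stack = {start}, [start]
    while stack:
        for rhs in grammar[stack.pop()]:
            for s in rhs:
                if s in grammar and s not in reached:
                    reached.add(s)
                    stack.append(s)
    return reached == set(grammar)

HOKUM = {
    "Prog": [["Stm", ";", "Prog"], ["Stm"]],
    "Stm": [["AsStm"], ["IfStm"]],
    "AsStm": [["Var", "=", "Exp"]],
    "IfStm": [["if", "Exp", "then", "AsStm"]],
    "VorC": [["Var"], ["Const"]],
    "Exp": [["VorC"], ["VorC", "+", "VorC"]],
    "Var": [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
    "Const": [[d] for d in "0123456789"],
}

print(is_reduced(HOKUM, "Prog"))
print(is_reduced({**HOKUM, "Loop": [["Loop", "x"]]}, "Prog"))   # Loop never terminates
print(is_reduced({**HOKUM, "Spare": [["z"]]}, "Prog"))          # Spare is unreachable
```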
Ambiguous Grammars
• Grammar G is unambiguous iff every w in L(G) has a unique leftmost (or rightmost) derivation
  • Fact: unique leftmost or unique rightmost implies the other
• A grammar lacking this property is ambiguous
  • Note: other grammars that generate the same language may be unambiguous
  • So, "ambiguous" applies to a grammar, not a language
• We need unambiguous grammars for parsing (well mostly: see later)
Example: Ambiguous Grammar
Exp → Exp Op Exp | Dig
Op → + | - | * | /
Dig → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Exercise: show that this is ambiguous
  • How? Show two different leftmost or rightmost derivations for the same string
  • Equivalently: show two different parse trees for the same string
2+3*4 – parts 1 to 9
Grammar: Exp → Exp + Exp | Exp - Exp | Exp * Exp | Exp / Exp | Dig, Dig → [0-9]
Give a leftmost derivation of 2+3*4; show the parse tree
Exp
=> Exp + Exp
=> Dig + Exp
=> 2 + Exp
=> 2 + Exp * Exp
=> 2 + Dig * Exp
=> 2 + 3 * Exp
=> 2 + 3 * Dig
=> 2 + 3 * 4
[Figure: the parse tree grows by one step per slide; the finished tree is Exp( Exp( Dig(2) ) + Exp( Exp( Dig(3) ) * Exp( Dig(4) ) ) )]
Part 9 repeats the derivation using the Exp → Exp Op Exp form of the grammar:
Exp => Exp Op Exp => Dig Op Exp => 2 Op Exp => 2 + Exp => 2 + Exp Op Exp => 2 + Dig Op Exp => 2 + 3 Op Exp => 2 + 3 * Exp => 2 + 3 * Dig => 2 + 3 * 4
2+3*4 – part 10
Grammar: Exp → Exp + Exp | Exp - Exp | Exp * Exp | Exp / Exp | Dig, Dig → [0-9]
Give a different leftmost derivation of 2+3*4
Exp => Exp * Exp => Exp + Exp * Exp => 2 + Exp * Exp => 2 + 3 * Exp => 2 + 3 * 4
[Figure: parse tree Exp( Exp( Exp( Dig(2) ) + Exp( Dig(3) ) ) * Exp( Dig(4) ) )]
Are derivations equivalent?
[Figure: two ASTs for 2 + 3 * 4]
• +( 2, *( 3, 4 ) ): Result = 2 + (3 * 4) = 14
• *( +( 2, 3 ), 4 ): Result = (2 + 3) * 4 = 20
Another example
Grammar: Exp → Exp + Exp | Exp - Exp | Exp * Exp | Exp / Exp | Dig, Dig → [0-9]
• Give two different derivations of 5 - 6 - 7
Another example
Grammar: Exp → Exp + Exp | Exp - Exp | Exp * Exp | Exp / Exp | Dig, Dig → [0-9]
Give two different rightmost derivations of 5 - 6 - 7
• Exp => Exp - Exp => Exp - Exp - Exp => Exp - Exp - 7 => Exp - 6 - 7 => 5 - 6 - 7
  [Figure: AST -( 5, -( 6, 7 ) )] result = 6
• Exp => Exp - Exp => Exp - 7 => Exp - Exp - 7 => Exp - 6 - 7 => 5 - 6 - 7
  [Figure: AST -( -( 5, 6 ), 7 )] result = -8
Another example
Grammar: Exp → Exp + Exp | Exp - Exp | Exp * Exp | Exp / Exp | Dig, Dig → [0-9]
Give two different leftmost derivations of 5 - 6 - 7
• Exp => Exp - Exp => 5 - Exp => 5 - Exp - Exp => 5 - 6 - Exp => 5 - 6 - 7
  [Figure: AST -( 5, -( 6, 7 ) )] result = 6
• Exp => Exp - Exp => Exp - Exp - Exp => 5 - Exp - Exp => 5 - 6 - Exp => 5 - 6 - 7
  [Figure: AST -( -( 5, 6 ), 7 )] result = -8
What went wrong?
• Grammar did not capture precedence or associativity
  • Eg: 2 + (3 * 4) = 14 versus (2 + 3) * 4 = 20
  • Eg: 5 - (6 - 7) = 6 versus (5 - 6) - 7 = -8
• Solution
  • Create a non-terminal for each level of precedence
  • Isolate the corresponding part of the grammar
  • Force the parser to recognize higher precedence sub-expressions first
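The two readings of each expression can be enumerated directly. Under Exp → Exp Op Exp | Dig, any operator in the string may sit at the root of the parse tree, so a short recursive function can yield every tree together with its value. This is a brute-force illustration with hypothetical names (`OPS`, `all_parses`), not a parsing technique from the course.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def all_parses(tokens):
    """Yield (tree, value) for every parse of tokens under Exp -> Exp Op Exp | Dig."""
    if len(tokens) == 1:
        yield tokens[0], int(tokens[0])
        return
    for i in range(1, len(tokens), 2):          # each operator may be the root
        for left_tree, left_val in all_parses(tokens[:i]):
            for right_tree, right_val in all_parses(tokens[i + 1:]):
                tree = (tokens[i], left_tree, right_tree)
                yield tree, OPS[tokens[i]](left_val, right_val)

for source in ["2 + 3 * 4", "5 - 6 - 7"]:
    for tree, value in all_parses(source.split()):
        print(source, "=>", tree, "=", value)
```

Each expression yields two trees with two different values, which is why the ambiguity matters in practice.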
Classic Expression Grammar
exp → exp + term | exp - term | term
term → term * factor | term / factor | factor
factor → int | ( exp )
int → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | D
D → [0-9]
Derive 2 + 3 * 4
Grammar: E → E + T | E - T | T, T → T * F | T / F | F, F → ( E ) | D, D → [0-9]
E => E + T => E + T * F => E + T * D => E + T * 4 => E + F * 4 => E + D * 4 => E + 3 * 4 => T + 3 * 4 => F + 3 * 4 => D + 3 * 4 => 2 + 3 * 4
[Figure: AST +( 2, *( 3, 4 ) )]
Result = 2 + (3 * 4) = 14
This grammar yields the correct, expected (school algebra) result
Derive 5 - 6 - 7
Grammar: E → E + T | E - T | T, T → T * F | T / F | F, F → ( E ) | D, D → [0-9]
E => E - T => E - F => E - D => E - 7 => E - T - 7 => E - F - 7 => E - D - 7 => E - 6 - 7 => T - 6 - 7 => F - 6 - 7 => D - 6 - 7 => 5 - 6 - 7
[Figure: AST -( -( 5, 6 ), 7 )]
result = -8
• This grammar yields the correct, expected (school algebra) result
• Note how left-recursive rules yield left-associativity
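The layered grammar maps onto one function per non-terminal. A hand-written top-down parser cannot call itself on a left-recursive rule without looping forever, so the sketch below swaps `E → E + T` for the equivalent loop "a T followed by any number of (+|-) T", folding the values left to right to keep left-associativity. The function names are assumptions, and `/` is taken to be true division since the slides do not say.

```python
def evaluate(text):
    """Evaluate single-digit arithmetic using the E / T / F / D layering."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                      # E -> T { (+|-) T }
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value = value + term()
            else:
                value = value - term()
        return value

    def term():                      # T -> F { (*|/) F }
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value = value * factor()
            else:
                value = value / factor()
        return value

    def factor():                    # F -> ( E ) | D
        tok = peek()
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if tok is not None and "0" <= tok <= "9":
            advance()
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

print(evaluate("2 + 3 * 4"), evaluate("5 - 6 - 7"), evaluate("(2 + 3) * 4"))
```

Because `term` is called from inside `expr`, multiplication is always grouped before the surrounding addition, which is the "higher precedence first" effect the grammar was designed to force.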
Classic Example of Ambiguous Grammar
• Grammar for conditional statements
  stm → if ( cond ) stm | if ( cond ) stm else stm
• Exercise: show that this is ambiguous
• How? "The Dangling Else" - a 'weakness' in C, Pascal, etc
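Two Derivations
Grammar: stm → if ( cond ) stm | if ( cond ) stm else stm
Sentence: if (cond) if (cond) stm else stm
[Figure: two parse trees for the sentence. In the first, the outer if has no else and its body is the inner if with the else. In the second, the outer if owns the else and its body is the inner if without an else.]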
Solving the Dangling Else
• Fix the grammar to separate if statements with else clause from those without
  • Done in Java reference grammar
  • Adds lots of non-terminals
• Use some ad-hoc rule in parser
  • "else matches closest unpaired if"
• Change the language
  • Only possible if you 'own' the language
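The ad-hoc rule falls out naturally in a hand-written top-down parser: after parsing the body of an if, greedily take an else if one is next. The sketch below assumes a simplified token stream in which a condition is a single token (standing for "( cond )") and any token other than `if` or `else` is a plain statement; `parse_stm` and the token names are hypothetical.

```python
def parse_stm(tokens, pos=0):
    """Return (tree, next_pos); an else binds to the closest unpaired if."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_stm(tokens, pos + 2)
        # Greedy step: the innermost if still being parsed sees the else first.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stm(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    return tokens[pos], pos + 1      # any other token is a plain statement

tree, end = parse_stm("if c1 if c2 s1 else s2".split())
print(tree)
```

The else ends up attached to the inner `if c2`, so only one of the two parse trees is ever built.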
Resolving Ambiguity with Grammar (1)
Stm → IfElse | IfNoElse
IfElse → if ( Exp ) IfElse else IfElse
IfNoElse → if ( Exp ) Stm | if ( Exp ) IfElse else IfNoElse
• formal, no additional rules beyond syntax
• sometimes obscures original grammar
Resolving Ambiguity with Grammar (2)
• If you can (re-)design the language, avoid the problem entirely
IfStm → if Exp then Stm end | if Exp then Stm else Stm end
• formal, clear, elegant
• allows sequence of Stms in then and else branches, no { } needed
• extra end required for every if
Parser Tools and Operators
• Most parser tools cope with ambiguous grammars
  • Earlier productions chosen before later ones
  • Longest match used if there is a choice
  • Makes life simpler if used with discipline
  • But be sure the tool does what you really want
• Specify operator precedence & associativity
  • Allows simpler, ambiguous grammar with fewer non-terminals
  • Used in CUP
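To see how a table of precedence and associativity can stand in for extra non-terminals, here is a sketch of precedence climbing. It is a different technique from what an LR tool such as CUP does internally, but it consumes the same kind of declaration. The `OPERATORS` table, the function name, and the right-associative `^` entry (added only to show the other associativity) are assumptions.

```python
import operator

# operator: (precedence, associativity, function); higher binds tighter
OPERATORS = {
    "+": (1, "left", operator.add),
    "-": (1, "left", operator.sub),
    "*": (2, "left", operator.mul),
    "/": (2, "left", operator.truediv),
    "^": (3, "right", operator.pow),   # hypothetical, not in the slides
}

def evaluate(tokens, min_prec=1):
    """Precedence climbing; consumes tokens from the front of the list."""
    value = int(tokens.pop(0))
    while (tokens and tokens[0] in OPERATORS
           and OPERATORS[tokens[0]][0] >= min_prec):
        prec, assoc, fn = OPERATORS[tokens.pop(0)]
        # Left-associative: the right operand may only contain tighter operators.
        next_min = prec + 1 if assoc == "left" else prec
        value = fn(value, evaluate(tokens, next_min))
    return value

print(evaluate("2 + 3 * 4".split()))
print(evaluate("5 - 6 - 7".split()))
print(evaluate("2 ^ 3 ^ 2".split()))
```

The grammar behind this is still the ambiguous Exp → Exp Op Exp | Dig; the table alone decides which parse is chosen.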
Next
• Next
  • LR (bottom-up / shift-reduce) parsing
• Reading
  • Continue Cooper & Torczon chapter 3
• Note
  • LR parsing is the toughest session in P501