Kanat Bolazar January 28, 2010 Compiler Design 4. Language Grammars
Introduction to Parsing: Language Grammars
• Programming language grammars are usually written as some variation of context-free grammars (CFGs).
• The notation used is often BNF (Backus-Naur form):
    <block> -> { <statementlist> }
    <statementlist> -> <statement> ; <statementlist>
    <statement> -> <assignment> ;
                 | if ( <expr> ) <block> else <block>
                 | while ( <expr> ) <block>
    ...
Example Grammar: Language 0+0
• A language that we'll call "Language 0+0":
    E -> E + E | 0
• Equivalently:
    E -> E + E
    E -> 0
• Note that if there are multiple rules for the same left-hand side, they are alternatives.
• This language only contains sentences of the form:
    0    0+0    0+0+0    0+0+0+0    ...
• Derivation for 0+0+0 (the last step applies E -> 0 three times):
    E => E + E => E + E + E => 0 + 0 + 0
• Note: This grammar is ambiguous: in the second step, did we expand the first or the second E to E + E? Both paths work, and they produce different parse trees.
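To make the ambiguity concrete, here is a minimal sketch that counts the distinct parse trees the grammar E -> E + E | 0 allows for a sentence with a given number of zeros. The function name `count_parse_trees` is made up for this illustration; it is not part of the original slides.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parse_trees(zeros: int) -> int:
    """Count parse trees for a sentence with `zeros` zeros (zeros >= 1)
    under the grammar E -> E + E | 0."""
    if zeros == 1:
        return 1  # only E -> 0 applies
    # E -> E + E: the top-level '+' can split the zeros in several ways;
    # `left` is the number of zeros derived from the left-hand E.
    return sum(
        count_parse_trees(left) * count_parse_trees(zeros - left)
        for left in range(1, zeros)
    )


for n in range(1, 6):
    print(n, count_parse_trees(n))
```

For 0+0+0 the count is 2, which matches the two choices described in the note above: expanding the first E or the second E in the second step.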
Example Grammar: Arithmetic, Ambiguous
• Arithmetic expressions:
    Exp -> num | Exp Op Exp
    Op -> + | - | * | / | %
• The "num" here represents a token. What it corresponds to is defined in the lexical analyzer with a regular expression:
    num    [0-9]+
• This language allows:
    45    35 + 257 * 5 - 2    ...
• The grammar as defined here is ambiguous:
    2 + 5 * 7: is it Exp * 7 or 2 + Exp?
• Depending on the tools you use, you may be able to just define the precedence of the operators, or you may have to change the grammar.
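As an illustration of how the lexical analyzer might turn input text into the tokens this grammar uses, here is a small sketch built on the regular expression [0-9]+ for num. The `tokenize` function and the (kind, lexeme) token shape are assumptions made for this example, not something the slides define.

```python
import re

# One token per match: either a num ([0-9]+) or a single operator character.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>[0-9]+)|(?P<op>[-+*/%]))")


def tokenize(text):
    """Split `text` into (kind, lexeme) pairs; kind is 'num' or the operator itself."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        if match.lastgroup == "num":
            tokens.append(("num", match.group("num")))
        else:
            tokens.append((match.group("op"), match.group("op")))
        pos = match.end()
    return tokens


print(tokenize("35 + 257 * 5 - 2"))
```

The parser then only sees the token kinds (num, +, *, ...), which are the terminal symbols of the grammar.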
Example Grammar: Arithmetic, Factored
• Arithmetic expressions grammar, factored for operator precedence:
    Exp -> Factor | Factor Addop Exp
    Factor -> num | num Multop Factor
    Addop -> + | -
    Multop -> * | / | %
• This grammar allows the same sentences:
    45    35 + 257 * 5 - 2    ...
• This grammar is not ambiguous; it first groups factors:
    2 + 5 * 7
    Factor Addop Exp
    num + Exp
    num + Factor
    num + num Multop Factor
    num + num * num
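The factored grammar can be sketched as a small recursive-descent parser with one function per nonterminal. This is only an illustration: the `parse` function, the tuple shape `(operator, left, right)`, and the simplified tokenizer (which silently skips characters it does not recognize) are assumptions, not part of the slides.

```python
import re


def parse(text):
    """Parse `text` with the factored grammar and return a nested tuple tree."""
    tokens = re.findall(r"[0-9]+|[-+*/%]", text)  # simplified lexer
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tok

    def exp():
        # Exp -> Factor | Factor Addop Exp
        left = factor()
        if peek() in ("+", "-"):
            op = take()
            return (op, left, exp())
        return left

    def factor():
        # Factor -> num | num Multop Factor
        tok = take()
        if not tok.isdigit():
            raise SyntaxError(f"expected num, got {tok!r}")
        left = int(tok)
        if peek() in ("*", "/", "%"):
            op = take()
            return (op, left, factor())
        return left

    tree = exp()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("2 + 5 * 7"))
```

The multiplication ends up deeper in the tree than the addition, which is what gives it higher precedence. Because both rules are right-recursive, chains of the same operator nest to the right, so this grammar as written groups 5 - 2 - 1 as 5 - (2 - 1).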
Grammar Definitions
• The grammar is a set of rules, sometimes called productions, that construct valid sentences in the language.
• Nonterminal symbols represent constructs in the language. These would be the phrases in a natural language.
• Terminal symbols are the actual words of the language. These are the tokens produced by the lexical analyzer. In a natural language, these would be the words, symbols, and space.
• A sentence in the language only contains terminal symbols.
• Nonterminals are intermediate linguistic constructs to define the structure of a sentence.
Rules, Nonterminal and Terminal Symbols
• Arithmetic expressions grammar, using multiplicative factors for operator precedence:
    Exp -> Factor | Factor Addop Exp
    Factor -> num | num Multop Factor
    Addop -> + | -
    Multop -> * | / | %
• This grammar has four rules as written here. If we expand each alternative into its own rule, we have 2 + 2 + 2 + 3 = 9 rules.
• There are four nonterminals:
    Exp    Factor    Addop    Multop
• There are six terminals (tokens):
    num    +    -    *    /    %
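The counts above can be checked mechanically. The sketch below stores the grammar as a Python dictionary (a representation chosen only for this example) that maps each nonterminal to its list of alternatives, then derives the number of expanded rules, the nonterminals, and the terminals.

```python
# Each nonterminal maps to a list of alternatives; each alternative is a list of symbols.
GRAMMAR = {
    "Exp": [["Factor"], ["Factor", "Addop", "Exp"]],
    "Factor": [["num"], ["num", "Multop", "Factor"]],
    "Addop": [["+"], ["-"]],
    "Multop": [["*"], ["/"], ["%"]],
}

# Nonterminals are exactly the symbols that appear on a left-hand side.
nonterminals = set(GRAMMAR)

# One expanded rule per alternative.
rule_count = sum(len(alternatives) for alternatives in GRAMMAR.values())

# Terminals are the right-hand-side symbols that never appear on a left-hand side.
terminals = {
    symbol
    for alternatives in GRAMMAR.values()
    for rhs in alternatives
    for symbol in rhs
    if symbol not in nonterminals
}

print(rule_count, sorted(nonterminals), sorted(terminals))
```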
Grammar Definitions: Rules
• The production rules are rewrite rules. The basic CFG rule form is:
    X -> Y1 Y2 Y3 … Yn
  where X is a nonterminal and the Y's may be nonterminals or terminals.
• There is a special nonterminal called the start symbol.
• The language is defined to be all the strings that can be generated by starting with the start symbol and repeatedly replacing a nonterminal by the right-hand side (rhs) of one of its rules until there are no more nonterminals.
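This definition of a language can be sketched directly as a search: start from the start symbol, keep replacing the leftmost nonterminal by each of its right-hand sides, and collect the strings that contain only terminals. The example below does this for Language 0+0; the `generate` function and its `max_symbols` cutoff are assumptions added so that the search terminates.

```python
from collections import deque

GRAMMAR = {"E": [["E", "+", "E"], ["0"]]}
START = "E"


def generate(max_symbols):
    """Return all sentences derivable from START using at most `max_symbols` symbols."""
    seen = set()
    sentences = set()
    queue = deque([(START,)])
    while queue:
        form = queue.popleft()
        # Index of the leftmost nonterminal, or None if the form is a sentence.
        index = next((i for i, s in enumerate(form) if s in GRAMMAR), None)
        if index is None:
            sentences.add("".join(form))
            continue
        for rhs in GRAMMAR[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            # Forms never shrink in this grammar, so long forms can be dropped.
            if len(new_form) <= max_symbols and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(sentences, key=len)


print(generate(7))
```

With a cutoff of 7 symbols this yields the first four sentences listed for Language 0+0.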
Larger Grammar Examples
• We'll look at language grammar examples for MicroJava and Decaf.
• Note: Decaf extends the standard notation. The very useful { X }, meaning X | X, X | X, X, X | ..., is not standard BNF.
Parse Trees
• The derivation of a sentence by the language rules can be used to construct a parse tree.
• We expect parse trees to correspond to meaningful semantic phrases of the programming language.
• Each node of the parse tree will represent some portion that can be implemented as one section of code.
• The nonterminals expanded during the derivation are the trunk and branches (interior nodes) of the parse tree.
• The terminals at the ends of the branches are the leaves of the parse tree.
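A small sketch of the relationship between a parse tree and its sentence, using one of the two parse trees for 0+0+0 in Language 0+0. The `(symbol, children)` tuple representation and the `leaves` helper are made up for this illustration.

```python
def leaves(node):
    """Collect the leaf symbols of a (symbol, children) tree from left to right."""
    symbol, children = node
    if not children:
        return [symbol]  # a terminal: it has no children
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


# Parse tree for 0+0+0 in which the first E was expanded to E + E.
zero = ("E", [("0", [])])
tree = ("E", [("E", [zero, ("+", []), zero]), ("+", []), zero])

print("".join(leaves(tree)))
```

Reading the leaves from left to right gives back the sentence, while every interior node is a nonterminal that was expanded during the derivation.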
Parsing
• A parser:
    • Uses the grammar to check whether a sentence (a program for us) is in the language or not.
    • Reports a syntax error if the input is not a proper sentence/program.
    • Constructs a parse tree from the derivation of the correct program from the grammar rules.
• Top-down parsing:
    • Starts with the start symbol and applies rules until it gets the desired input program.
• Bottom-up parsing:
    • Starts with the input program and applies rules in reverse until it can get back to the start symbol.
    • Looks at the left part of the input program to see if it matches the rhs of a rule.
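The bottom-up idea can be sketched for Language 0+0 (E -> E + E | 0) as a tiny shift-reduce recognizer: it reads tokens onto a stack and applies a rule in reverse whenever the top of the stack matches a right-hand side. The `recognize` function is an illustration only. Reducing greedily happens to work for this small grammar; practical bottom-up parsers use tables to decide when to shift and when to reduce.

```python
def recognize(tokens):
    """Return True if `tokens` is a sentence of E -> E + E | 0."""
    stack = []
    for token in tokens:
        stack.append(token)  # shift the next input token
        while True:
            if stack[-1:] == ["0"]:
                stack[-1:] = ["E"]  # reduce by E -> 0
            elif stack[-3:] == ["E", "+", "E"]:
                stack[-3:] = ["E"]  # reduce by E -> E + E
            else:
                break
    # The input is accepted if everything reduced to the start symbol.
    return stack == ["E"]


print(recognize(list("0+0+0")))
print(recognize(list("0+")))
```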
Parsing Issues
• Derivation Paths = Choices
    • Naïve top-down and bottom-up parsing may require backtracking to find a correct parse.
    • Restrictions on the form of grammar rules can make parsing deterministic.
• Ambiguity
    • One program may have two different correct derivations from the grammar.
    • This may be a problem if it implies two different semantic interpretations.
    • Famous examples are arithmetic operators and the dangling else problem.
Ambiguity: Dangling Else Problem
• Which if does this else associate with?
    if X
    if Y
    find()
    else
    getConfused()
• The corresponding ambiguous grammar may be (where an Action can itself be an IfStmt):
    IfStmt -> if Cond Action
            | if Cond Action else Action
• The two derivations at the top (for the top "if") are:
    if Cond Action
    if Cond Action else Action
• Programming languages often associate the else with the inner if.
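The usual resolution, attaching the else to the inner if, falls out naturally from a recursive-descent parser that consumes an else as soon as it sees one. The sketch below assumes a simplified grammar in which a statement is either a single-token action or an if statement; the `parse_stmt` function and the tuple shape `("if", cond, then_branch, else_branch)` are made up for this example.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at `pos`; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "else":
        raise SyntaxError("else without a matching if")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # a plain action such as find()
    if pos + 1 >= len(tokens):
        raise SyntaxError("missing condition after if")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 2)
    else_branch = None
    # Greedy choice: the innermost unfinished if takes the else.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos


tokens = "if X if Y find() else getConfused()".split()
tree, _ = parse_stmt(tokens)
print(tree)
```

In the resulting tree the outer if has no else branch, and getConfused() is the else branch of the inner if on Y.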
Resources
• Aho, Lam, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Addison-Wesley, 2006.
• Compiler Construction Course Notes at Linz: http://www.ssw.uni-linz.ac.at/Misc/CC/
• CS 143 Compiler Course at Stanford: http://www.stanford.edu/class/cs143/