460 likes | 483 Views
Topic #2: Infix to Postfix. EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003. Why this topic?. Program to convert infix to postfix A simple, one-pass compiler! Infix notation is source language Postfix is intermediate code No code generation (but could be actions)
E N D
Topic #2: Infix to Postfix EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003
Why this topic? • Program to convert infix to postfix • A simple, one-pass compiler! • Infix notation is source language • Postfix is intermediate code • No code generation (but could be actions) • A good example that introduces many aspects of writing a compiler
Syntax vs. Semantics • Syntax • Describes what is allowable in a language • Relatively easy to describe (we will use context free grammars) • Semantics • Describes the meaning of a program (or expression) • Very difficult to describe in a formal sense • We are discussing programming languages, but these definitions also apply to natural languages
Structure of Simple Compiler • A lexical analyzer will tokenize the input • A “syntax-directed translator” combines syntax analysis and intermediate code generation
Context-free Grammars • Often used to describe the hierarchical structure of a programming language • Every CFG has four components: • A set of tokens (terminal symbols) • A set of nonterminals • A set of productions (rules) • A start symbol (left side of first rule) Example rule: stmt→if (expr) stmtelsestmt
Productions • Every production consists of left side, an arrow (“can have the form of”), right side • Left side is a always nonterminal; the production is “for” this nonterminal • Right side consists of terminals and/or non-terminals • Convention, nonterminals will be italicized Example rule: stmt→if (expr) stmtelsestmt
Example CFG • List of digits separated by plus or minus signs: • For convenience, nonterminals can be grouped using ‘|’ (“or”) • The tokens of this grammar: + - 0 1 2 3 4 5 6 7 8 9 • The start symbol: list list→ list + digit list → list – digit list→digit digit→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 list→ list + digit | list – digit | digit digit→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
CFG for Expressions expr→ expr + term | expr – term | term term → term * factor | term / factor | factor factor→digit | (expr) digit→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Strings • A string of tokens is a sequence of zero or more tokens • The string containing zero tokens is the empty string: ‘ε’ • A grammar “derives” strings • starting with start symbol • repeatedly replacing nonterminals with right side of productions for the nonterminals • The set of strings that can be derived from the start symbol is the “language” for the grammar
Parse Trees • Shows how the start symbol of a grammar can derive a string in the language • A tree with the following properties: • The root is labeled with a start symbol • Each leaf labeled with a token or ε • Each interior node is labeled by a nonterminal • If A is the label for an interior node, and X1,X2,…,Xn (nonterminals or tokens) are the labels of its children, then the following production must exist: A→X1 X2 … Xn
More on Parse Trees • A language can be defined as all strings that can be generated by some parse tree • The process of finding a parse tree for a given string is called “parsing” the string
Ambiguous Grammars • If any string has more than one parse tree, grammar is said to be ambiguous • Need to avoid for compilation, since string can have more than one meaning • List of digits separated by plus or minus signs: • Example merges notion of digit and list into single nonterminal string • Same strings are derivable, but some strings have multiple parse trees (possible meanings) string→ string + string | string – string |0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Precedence and Associativity • Precedence • Determines the order in which different operators are evaluated when they occur in the same expression • Operators of higher precedence are applied before operators of lower precedence • Associativity • Determines the order in which operators of equal precedence are evaluated when they occur in the same expression • Most operators have a left-to-right associativity, but some have right-to-left associativity
Postfix Notation • Formal rules, infix → postfix • If E is variable or constant, E → E • If E is expression of form E1 op E2, where op is binary operator, E1 → E1’, and E2 → E2’, then E → E1’ E2’ op • If E is expression of form (E1) and E1 → E1’, then E → E1’ • Parentheses are not needed!
Syntax-Directed Definitions • CFG to specify syntactic structure of input • Each grammar symbol has associated attributes • Each production has associated semantic rules for computing values of attributes • Grammar and semantic rules together constitute the syntax-directed definition • A parse tree showing the attribute values at each node is called an annotated parse tree
Two Types of Attributes • Synthesized Attributes • Value at a parse-tree node can be determined based on values of attributes of children • Can be evaluated during a single bottom-up traversal of parse tree • Inherited Attributes • More complicated • Will be discussed later in course
Depth-First Search • Recursively visit all children before evaluating semantic rules of given node • Suitable if all attributes are synthesized procedure visit (n:node) begin for each child m of n, from left to right do visit(m) evaluate semantic rules at node n end
Translation Schemes • Adds to a CFG • Includes “semantic actions” embedded within productions • Similar to syntax-directed definition, but order of evaluation is explicit
Example Translation Scheme expr expr + term { print(‘+’) } expr expr – term { print(‘-’) } expr term term 0 { print(‘0’) } term 1 { print(‘1’) } … term 9 { print(‘9’) }
Equivalent Translation Scheme expr term rest rest + term { print(‘+’) } rest rest - term { print(‘-’) } rest rest ε term 0 { print(‘0’) } term 1 { print(‘1’) } … term 9 { print(‘9’) }
Parsing • Parsing is the process of determining if a string of tokens can be generated by a grammar • For any CFG, there is a parser that takes at most O(n^3) time • For most programming languages that arise in practice, linear time parsers exist
Top-down Parsing • Recursively apply the following steps: • At node n with nonterminal A, select a production for A • Construct children at n for symbols on right side of selected production • Find next node for which subtree needs to be constructed • Top-down parsing uses a “lookahead” symbol • Selecting production may involve trial-and-error and backtracking
Predictive Parsing • Recursive-descent parsing is a recursive, top-down approach to parsing • A procedure is associated with each nonterminal of the grammar • Predictive parsing • Special case of recursive-descent parsing • The lookahead symbol unambiguously determines the procedure for each nonterminal
Procedures for Nonterminals • Production with right side α used if lookahead is in FIRST(α) • FIRST(α) is set of all symbols that can be first symbol of α • If lookahead symbol is not in FIRST set for any production, can use production with right side of ε • If two or more possibilities, can not use this method • If no possibilities, an error is declared • Nonterminals on right side of selected production are recursively expanded
Left Recursion • Left-recursive productions can cause recursive-descent parsers to loop forever • Example: example example + term • Can eliminate left recursion A β R R α R | ε A A α | β
Eliminating Left Recursion expr expr + term { print(‘+’) } expr expr – term { print(‘-’) } expr term term 0 { print(‘0’) } term 1 { print(‘1’) } … term 9 { print(‘9’) } expr term rest rest + term { print(‘+’) } rest rest - term { print(‘-’) } rest rest ε term 0 { print(‘0’) } term 1 { print(‘1’) } … term 9 { print(‘9’) }
Infix to Prefix Code: Part 1 #include <stdio.h> #include <ctype.h> int lookahead; void expr(void); void rest(void); void term(void); void match(int); void error(void); int main(void) { lookahead = getchar(); expr(); putchar('\n'); /* adds trailing newline character */ } …
Infix to Prefix Code: Part 2 … void expr(void) { term(); rest(); } void term(void) { if (isdigit(lookahead)) { putchar(lookahead); match(lookahead); } else error(); } …
Infix to Prefix Code: Part 3 … void rest(void) { if (lookahead == '+') { match('+'); term(); putchar('+'); rest(); } else if (lookahead == '-') { match('-'); term(); putchar('-'); rest(); } } …
Infix to Prefix Code: Part 4 … void match(int t) { if (lookahead == t) lookahead = getchar(); else error(); } void error(void) { printf("syntax error\n"); /* print error message */ exit(1); /* then halt */ }
Code Optimization 1 void rest(void) { REST: if (lookahead == '+') { match('+'); term(); putchar('+'); goto REST; } else if (lookahead == '-') { match('-'); term(); putchar('-'); goto REST; } }
Code Optimization 2 void expr(void) { term(); while (1) { if (lookahead == '+') { match('+'); term(); putchar('+'); } else if (lookahead == '-') { match('-'); term(); putchar('-'); } else break; } }
Improvements Remaining • Want to ignore whitespace • Allow numbers • Allow identifiers • Allow additional operators (multiplications and division) • Allow multiple expressions (separated by semicolons)
Lexical Analyzer • Eliminates whitespace (and comments) • Reads numbers (not just single digits) • Reads identifiers and keywords
Allowable Tokens • expected tokens: +, -, *, /, DIV, MOD, (, ), ID, NUM, DONE • ID represents an identifier, NUM represents a number, DONE represents EOF
A Simple Symbol Table • Each record of symbol table contains a token type and a string (lexeme or keyword) • Symbol table has fixed size • All lexemes in array of fixed size • Will be able to insert and search for tokens: • insert(s, t): creates entry with string s and token t, returns index into symbol table • lookup(s): searches for entry with string s, returns index if found, 0 otherwise • Keywords (div and mod) will be inserted into symbol table, they can not be used as identifiers
Updated Translation Scheme start listeof list expr; list | ε expr expr + term { print(‘+’) } | expr – term { print(‘-’) } | term term term * factor { print(‘*’) } | term / factor { print(‘/’) } | termdivfactor { print(‘DIV’) } | termmodfactor { print(‘MOD’) } | factor factor (expr) | id { print(id.lexeme) } | num { print(num.value) }
After Eliminating Left Recursion start listeof list expr; list | ε expr term moreterms moreterms + term { print(‘+’) } moreterms | - term { print(‘-’) } moreterms | ε term factor morefactors morefactors * factor { print(‘*’) } morefactors | / factor { print(‘/’) } morefactors | divfactor { print(‘DIV’) } morefactors | modfactor { print(‘MOD’) } morefactors | ε factor (expr) | id { print(id.lexeme) } | num { print(num.value) }
Final Code • About 250 lines of C • Pretty sloppy, otherwise would be longer