CSC 415: Translators and Compilers Spring 2009. Chapter 4 Syntactic Analysis. Syntactic Analysis. Sub-phases of Syntactic Analysis Grammars Revisited Parsing Abstract Syntax Trees Scanning Case Study: Syntactic Analysis in the Triangle Compiler. Structure of a Compiler.
CSC 415: Translators and Compilers Spring 2009 Chapter 4 Syntactic Analysis
Syntactic Analysis • Sub-phases of Syntactic Analysis • Grammars Revisited • Parsing • Abstract Syntax Trees • Scanning • Case Study: Syntactic Analysis in the Triangle Compiler
Structure of a Compiler • Source code → Lexical Analyzer → tokens → Parser & Semantic Analyzer → parse tree → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code • A Symbol Table is shown alongside the phases
Syntactic Analysis • Main function • Parse source program to discover its phrase structure • Recursive-descent parsing • Constructing an AST • Scanning to group characters into tokens
Sub-phases of Syntactic Analysis • Scanning (or lexical analysis) • Source program transformed to a stream of tokens • Identifiers • Literals • Operators • Keywords • Punctuation • Comments and blank spaces discarded • Parsing • To determine the source program's phrase structure • Source program is input as a stream of tokens (from the Scanner) • Treats each token as a terminal symbol • Representation of phrase structure • AST
Lexical Analysis – A Simple Example let var y: Integer in !new year y := y+1 Note: !new year does not appear in list of tokens. Comments are removed along with white spaces. • Scan the file character by character and group characters into words and punctuation (tokens), remove white space and comments • Tokens for this example: let var y : Integer in y := y + 1
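The grouping step described on this slide can be sketched in a few lines of Python. This is only an illustration, not the Triangle scanner: the keyword set, the token kind names, and the assumption that a `!` comment runs to the end of the line are choices made for this sketch.

```python
import re

# Reserved words needed for the slide's example (an assumed subset).
KEYWORDS = {"let", "var", "in"}

# One alternative per token kind; ":=" is listed before ":" so it wins.
TOKEN_RE = re.compile(r"""
    (?P<skip>\s+|![^\n]*)            # white space, or a '!' comment to end of line
  | (?P<word>[A-Za-z][A-Za-z0-9]*)   # identifier or keyword
  | (?P<intlit>[0-9]+)               # integer literal
  | (?P<becomes>:=)
  | (?P<colon>:)
  | (?P<op>[+\-*/<>=])
""", re.VERBOSE)

def scan(source):
    """Group the characters of source into (kind, spelling) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"illegal character {source[pos]!r} at offset {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "skip":
            continue                  # white space and comments are discarded
        if kind == "word":
            kind = text if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    tokens.append(("eot", ""))        # end-of-text marker
    return tokens

program = "let var y: Integer\nin !new year\n  y := y+1\n"
print([text for _, text in scan(program)])
```

The comment `!new year` never reaches the token list, which matches the note on the slide.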
Creating Tokens – Mini-Triangle Example • Source: let var y: Integer in !new year y := y+1 • The Input Converter turns the input into a character string held in a buffer (S = space): l e t S v a r S y : S I n t e g e r S i n . . . • The Scanner turns the character string into tokens: let (let), var (var), Ident. (y), colon (:), Ident. (Integer), in (in), Ident. (y), becomes (:=), Ident. (y), op. (+), Intlit. (1), eot
Tokens in Triangle (token kind = code, followed by its spelling)
// literals, identifiers, operators...
INTLITERAL = 0 "<int>", CHARLITERAL = 1 "<char>", IDENTIFIER = 2 "<identifier>", OPERATOR = 3 "<operator>",
// reserved words - must be in alphabetical order...
ARRAY = 4 "array", BEGIN = 5 "begin", CONST = 6 "const", DO = 7 "do", ELSE = 8 "else", END = 9 "end", FUNC = 10 "func", IF = 11 "if", IN = 12 "in", LET = 13 "let", OF = 14 "of", PROC = 15 "proc", RECORD = 16 "record", THEN = 17 "then", TYPE = 18 "type", VAR = 19 "var", WHILE = 20 "while",
// punctuation...
DOT = 21 ".", COLON = 22 ":", SEMICOLON = 23 ";", COMMA = 24 ",", BECOMES = 25 ":=", IS = 26 "~",
// brackets...
LPAREN = 27 "(", RPAREN = 28 ")", LBRACKET = 29 "[", RBRACKET = 30 "]", LCURLY = 31 "{", RCURLY = 32 "}",
// special tokens...
EOT = 33 "", ERROR = 34 "<error>"
Grammars Revisited • A context-free grammar generates a set of sentences • Each sentence is a string of terminal symbols • An unambiguous sentence has a unique phrase structure embodied in its syntax tree • Parsers are developed from context-free grammars
Regular Expressions • A regular expression (RE) is a convenient notation for expressing a set of strings of terminal symbols • Main features • ‘|’ separates alternatives • ‘*’ indicates that the previous item may be repeated zero or more times • ‘(‘ and ‘)’ are grouping parentheses • ε is the empty string, a special string of length 0
Regular Expression Basics • Algebraic Properties • | is commutative and associative • r|s = s|r • r|(s|t) = (r|s)|t • Concatenation is associative • (rs)t = r(st) • Concatenation distributes over | • r(s|t) = rs|rt • (s|t)r = sr|tr • ε (the empty string) is the identity for concatenation • εr = r • rε = r • * is idempotent • r** = r* • r* = (r|ε)*
Regular Expression Basics • Common Extensions • r+ one or more occurrences of expression r, same as rr* • r^k exactly k repetitions of r, e.g. r^3 = rrr • ~r the characters not in the expression r, e.g. ~[\t\n] • a-z a range of characters, e.g. [0-9a-z] • r? zero or one copy of expression r (used for optional fields of an expression)
Regular Expression Example • Regular Expression for Representing Months • Examples of legal inputs • January represented as 1 or 01 • October represented as 10 • First Try: [0|1|ε][0-9] • 0, 1, or ε (nothing) followed by a digit between 0 and 9 • Matches all legal inputs? Yes: 1, 2, 3, …, 10, 11, 12, 01, 02, …, 09 • Matches any illegal inputs? Yes: for example 0, 00, 18
Regular Expression Example • Regular Expression for Representing Months • Examples of legal inputs • January represented as 1 or 01 • October represented as 10 • Second Try: [1-9]|(0[1-9])|(1[0-2]) • Any number between 1 and 9 or 0 followed by any number between 1 and 9 or 1 followed by any number between 0 and 2 • Matches all legal inputs? Yes 1, 2, 3, …, 10, 11, 12, 01, 02, …, 09 • Matches any illegal inputs? No
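Both attempts can be checked mechanically. The sketch below transcribes them into Python's `re` syntax (the optional prefix `[0|1|ε]` is written as `(0|1)?`) and uses `fullmatch` so that the whole input must match. The extra illegal input `13` is added here for illustration and is not on the slide.

```python
import re

first_try = re.compile(r"(0|1)?[0-9]")            # [0|1|ε][0-9]
second_try = re.compile(r"[1-9]|0[1-9]|1[0-2]")   # [1-9]|(0[1-9])|(1[0-2])

# 1..12 plus the zero-padded forms 01..09
legal = [str(m) for m in range(1, 13)] + [f"{m:02d}" for m in range(1, 10)]
illegal = ["0", "00", "13", "18"]

def accepts(regex, text):
    """True if the regular expression matches the entire text."""
    return regex.fullmatch(text) is not None

for name, regex in [("first try", first_try), ("second try", second_try)]:
    missed = [s for s in legal if not accepts(regex, s)]
    wrongly_accepted = [s for s in illegal if accepts(regex, s)]
    print(name, "misses:", missed, "wrongly accepts:", wrongly_accepted)
```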
Regular Expression Example • Regular Expression for Floating Point Numbers • Examples of legal inputs • 1.0, 0.2, 3.14159, -1.0, 2.7e8, 1.0E-6, -2.5e+5 • Assume that a 0 is required before the decimal point for numbers less than 1, and that extra leading zeros are not prevented, so a number such as 0003.14159 is legal (0011 on its own would still need a fractional part, e.g. 0011.0, to match the expression built here) • Building the regular expression • Assume digit ::= 0|1|2|3|4|5|6|7|8|9 • Handle simple decimals such as 1.0, 0.2, 3.14159: digit+.digit+ (1 or more digits, followed by ., followed by 1 or more digits) • Add an optional sign (only minus, no plus): (-|ε)digit+.digit+ or -?digit+.digit+
Regular Expression Example • Regular Expression for Floating Point Numbers (cont.) • Building the regular expression (cont.) • Format for the exponent: (E|e)(+|-)?(digit+), where E and e are the literal letters • Adding it as an optional expression to the decimal part: (-|ε)digit+.digit+((E|e)(+|-)?(digit+))?, where ε is the empty string
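As a quick check, the finished expression can be transcribed into Python's `re` syntax. In this sketch the decimal point is escaped as `\.` (an unescaped `.` would match any character) and `digit` is written as `[0-9]`. The rejected inputs are chosen here for illustration.

```python
import re

# -?digit+.digit+((E|e)(+|-)?(digit+))?
FLOAT = re.compile(r"-?[0-9]+\.[0-9]+([Ee][+-]?[0-9]+)?")

def is_float(text):
    """True if the whole text is a floating point number under the slide's rules."""
    return FLOAT.fullmatch(text) is not None

legal = ["1.0", "0.2", "3.14159", "-1.0", "2.7e8", "1.0E-6", "-2.5e+5", "0003.14159"]
rejected = [".5", "1.", "+1.0", "1.0e", "11"]

print([s for s in legal if is_float(s)])
print([s for s in rejected if is_float(s)])   # expected to be empty
```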
Extended BNF • Extended BNF (EBNF) • Combination of BNF and RE • N ::= X, where N is a nonterminal symbol and X is an extended RE, i.e., an RE constructed from both terminal and nonterminal symbols • EBNF • Right hand side may use |, *, (, ) • Right hand side may contain both terminal and nonterminal symbols
Example EBNF
Expression ::= primary-Expression (Operator primary-Expression)*
primary-Expression ::= Identifier | ( Expression )
Identifier ::= a|b|c|d|e
Operator ::= +|-|*|/
Generates, for example:
e
a + b
a - b - c
a + (b * c)
a + (b + c) / d
a - (b - (c - (d - e)))
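A recursive-descent recognizer for this small grammar is one way to see how the EBNF maps onto code: each nonterminal becomes a function, `|` becomes an `if`, and `*` becomes a `while` loop. The sketch below is an illustration only. It treats every non-blank character as one token, and the function names are made up for this example.

```python
IDENTIFIERS = set("abcde")
OPERATORS = set("+-*/")

def recognizes(text):
    """Return True if text is a sentence of Expression (blanks are ignored)."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        # Current token, or None at end of input.
        return tokens[pos] if pos < len(tokens) else None

    def parse_expression():
        # Expression ::= primary-Expression (Operator primary-Expression)*
        nonlocal pos
        parse_primary()
        while peek() in OPERATORS:
            pos += 1
            parse_primary()

    def parse_primary():
        # primary-Expression ::= Identifier | ( Expression )
        nonlocal pos
        if peek() in IDENTIFIERS:
            pos += 1
        elif peek() == "(":
            pos += 1
            parse_expression()
            if peek() != ")":
                raise SyntaxError("expected )")
            pos += 1
        else:
            raise SyntaxError(f"unexpected token {peek()!r}")

    try:
        parse_expression()
    except SyntaxError:
        return False
    return pos == len(tokens)   # all input must be consumed

for sentence in ["e", "a + b", "a + (b + c) / d", "a +", "(a"]:
    print(repr(sentence), recognizes(sentence))
```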
Grammar Transformations • Left Factorization: XY | XZ is equivalent to X(Y | Z)
Before:
single-Command ::= V-name := Expression
  | if Expression then single-Command
  | if Expression then single-Command else single-Command
After:
single-Command ::= V-name := Expression
  | if Expression then single-Command (ε | else single-Command)
Grammar Transformations • Elimination of left recursion: N ::= X | NY is equivalent to N ::= X(Y)*
Original:
Identifier ::= Letter | Identifier Letter | Identifier Digit
After left factorization:
Identifier ::= Letter | Identifier (Letter | Digit)
After eliminating left recursion:
Identifier ::= Letter (Letter | Digit)*
Grammar Transformations • Substitution of nonterminal symbols: given N ::= X, we can substitute each occurrence of N with X iff N ::= X is nonrecursive and is the only production rule for N
Before:
single-Command ::= for Control-Variable := Expression To-or-Downto Expression do single-Command | …
Control-Variable ::= Identifier
To-or-Downto ::= to | downto
After:
single-Command ::= for Identifier := Expression (to | downto) Expression do single-Command | …
Starter Sets • Starter set of an RE X • Written starters[[X]] • Set of terminal symbols that can start a string generated by X • Examples • starters[[his | her | its]] = {h, i} • starters[[(re)* set]] = {r, s}
Starter Sets • Precise and complete definition of starters (ε is the empty string):
starters[[ε]] = {}
starters[[t]] = {t} where t is a terminal symbol
starters[[X Y]] = starters[[X]] ∪ starters[[Y]] if X generates ε
starters[[X Y]] = starters[[X]] if X does not generate ε
starters[[X | Y]] = starters[[X]] ∪ starters[[Y]]
starters[[X*]] = starters[[X]]
• To generalize for a starter set of an extended RE, add
starters[[N]] = starters[[X]] where N is a nonterminal symbol defined by production rule N ::= X
Example Starter Set
Expression ::= primary-Expression (Operator primary-Expression)*
primary-Expression ::= Identifier | ( Expression )
Identifier ::= a|b|c|d|e
Operator ::= +|-|*|/
starters[[Expression]]
= starters[[primary-Expression (Operator primary-Expression)*]]
= starters[[primary-Expression]]
= starters[[Identifier]] ∪ starters[[ ( Expression ) ]]
= starters[[a | b | c | d | e]] ∪ { ( }
= {a, b, c, d, e, (}
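The definition of starters translates almost line for line into a recursive function. The sketch below is one possible encoding, not taken from the course material: extended REs are nested tuples, the helper names (`t`, `n`, `seq`, `alt`, `star`, `word`) are invented for this example, and the grammar is assumed not to be left-recursive (otherwise the recursion would not terminate).

```python
from functools import reduce

EPS = ("eps",)                                  # the empty string
def t(sym): return ("t", sym)                   # terminal symbol
def n(name): return ("n", name)                 # nonterminal symbol
def seq(*xs): return reduce(lambda x, y: ("seq", x, y), xs)
def alt(*xs): return reduce(lambda x, y: ("alt", x, y), xs)
def star(x): return ("star", x)
def word(s): return seq(*[t(c) for c in s])     # a string of single-character terminals

def generates_empty(x, rules):
    """True if the extended RE x can generate the empty string."""
    kind = x[0]
    if kind in ("eps", "star"):
        return True
    if kind == "t":
        return False
    if kind == "n":
        return generates_empty(rules[x[1]], rules)
    if kind == "seq":
        return generates_empty(x[1], rules) and generates_empty(x[2], rules)
    return generates_empty(x[1], rules) or generates_empty(x[2], rules)   # alt

def starters(x, rules=None):
    """Set of terminal symbols that can start a string generated by x."""
    rules = rules or {}
    kind = x[0]
    if kind == "eps":
        return set()
    if kind == "t":
        return {x[1]}
    if kind == "n":
        return starters(rules[x[1]], rules)
    if kind == "star":
        return starters(x[1], rules)
    if kind == "alt":
        return starters(x[1], rules) | starters(x[2], rules)
    # seq: include starters of Y only if X generates the empty string
    first = starters(x[1], rules)
    if generates_empty(x[1], rules):
        first = first | starters(x[2], rules)
    return first

RULES = {
    "Expression": seq(n("primary-Expression"),
                      star(seq(n("Operator"), n("primary-Expression")))),
    "primary-Expression": alt(n("Identifier"), seq(t("("), n("Expression"), t(")"))),
    "Identifier": alt(*[t(c) for c in "abcde"]),
    "Operator": alt(*[t(c) for c in "+-*/"]),
}

print(sorted(starters(n("Expression"), RULES)))
print(sorted(starters(seq(star(word("re")), word("set")))))
```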
Scanning (Lexical Analysis) • The purpose of scanning is to recognize tokens in the source program, i.e., to group input characters (the source program text) into tokens • Difference between parsing and scanning: • Parsing groups terminal symbols, which are tokens, into larger phrases such as expressions and commands and analyzes the tokens for correctness and structure • Scanning groups individual characters into tokens
What Does a Scanner Do? • Handle keywords (reserved words) • Recognize identifiers and keywords • Option 1: match explicitly • Write a regular expression for each keyword • An identifier is any alphanumeric string that is not a keyword • Option 2: match as an identifier, then perform a lookup • No special regular expressions for keywords • When an identifier is found, look it up in a preloaded keyword table • How does Triangle handle keywords? Discuss in terms of efficiency and ease of coding.
What Does a Scanner Do? • Remove white space • Tabs, spaces, new lines • Remove comments • Single line: -- Ada comment • Multi-line, with start and end delimiters: { Pascal comment } /* C comment */ • Nested • Runaway comments • Unterminated comments can't be detected until end of file
What Does a Scanner Do? • Perform look ahead • Multi-character tokens 1..10 vs. 1.10 &, && <, <= etc • Challenging input languages • FORTRAN • Keywords not reserved • Blanks are not a delimiter • Example (comma vs. decimal) DO10I=1,5 start of a do loop (equivalent to a C for loop) DO10I=1.5 an assignment statement, assignment to variable DO10I
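The `1..10` versus `1.10` case shows why one character of look-ahead is not always enough: after reading `1` and seeing `.`, the scanner has to inspect the character after the dot before deciding. A minimal sketch follows, using hypothetical token names (`INT`, `REAL`, `DOTDOT`) that are not taken from any particular language definition.

```python
def scan_numbers(text):
    """Tokenize integers, reals, and the '..' range operator."""
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            # Look ahead two characters: '.' followed by a digit starts a
            # fraction; '.' followed by '.' belongs to the range operator,
            # so it is left unread.
            if j + 1 < len(text) and text[j] == "." and text[j + 1].isdigit():
                j += 1
                while j < len(text) and text[j].isdigit():
                    j += 1
                tokens.append(("REAL", text[i:j]))
            else:
                tokens.append(("INT", text[i:j]))
            i = j
        elif text.startswith("..", i):
            tokens.append(("DOTDOT", ".."))
            i += 2
        elif c.isspace():
            i += 1
        else:
            raise ValueError(f"illegal character {c!r} at offset {i}")
    return tokens

print(scan_numbers("1..10"))
print(scan_numbers("1.10"))
```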
What Does a Scanner Do? • Challenging input languages (cont.) • PL/I, keywords not reserved IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
What Does a Scanner Do? • Error Handling • Error token passed to parser, which reports the error • Recovery • Delete the characters of the current token that have been read so far and restart scanning at the next unread character • Or delete the first character of the current lexeme and resume scanning from the next character • Examples of lexical errors: • 3.25e (bad format for a constant) • Var#1 (illegal character) • Some errors that are not lexical errors • Mistyped keywords, e.g. Begim • Mismatched parentheses • Undeclared variables
Scanner Implementation • Issues • Simpler design – parser doesn’t have to worry about white space, etc. • Improve compiler efficiency – allows the construction of a specialized and potentially more efficient processor • Compiler portability is enhanced – input alphabet peculiarities and other device-specific anomalies can be restricted to the scanner
Scanner Implementation • What are the keywords in Triangle? • How are keywords and identifiers implemented in Triangle? • Is look ahead implemented in Triangle? • If so, how?
Structure of a Compiler • Source code → Lexical Analyzer → tokens → Parser → parse tree → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code • The Symbol Table and the Semantic Analyzer are shown alongside the phases
Parsing • Given an unambiguous, context-free grammar, parsing is • Recognition of an input string, i.e., deciding whether or not the input string is a sentence of the grammar • Parsing of an input string, i.e., recognition of the input string plus determination of its phrase structure. The phrase structure can be represented by a syntax tree, or otherwise. • Unambiguity is necessary so that every sentence of the grammar has exactly one syntax tree.
Parsing • The syntax of programming language constructs is described by context-free grammars. • Advantages of unambiguous, context-free grammars • A precise, yet easy-to-understand, syntactic specification of the programming language • For certain classes of grammars we can automatically construct an efficient parser that determines if a source program is syntactically well formed. • Imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors. • Easier to add new constructs to the language if the implementation is based on a grammatical description of the language
Parsing • (Diagram: sequence of tokens → parser → syntax tree) • Check the syntax (structure) of a program and create a tree representation of the program • Programming languages have non-regular constructs • Nesting • Recursion • Context-free grammars are used to express the syntax for programming languages
Context-Free Grammars • Comprised of • A set of tokens or terminal symbols • A set of non-terminal symbols • A set of rules or productions which express the legal relationships between symbols • A start or goal symbol • Example:
1. expr → expr - digit
2. expr → expr + digit
3. expr → digit
4. digit → 0|1|2|…|9
• Tokens: -, +, 0, 1, 2, …, 9 • Non-terminals: expr, digit • Start symbol: expr
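Context-Free Grammars
1. expr → expr - digit
2. expr → expr + digit
3. expr → digit
4. digit → 0|1|2|…|9
Example input: 3 + 8 - 2
(Figure: syntax tree for 3 + 8 - 2. The root expr has children expr, -, digit (2); that left expr has children expr, +, digit (8); the innermost expr has the single child digit (3).)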
Checking for Correct Syntax • Given a grammar for a language and a program, how do you know if the syntax of the program is legal? • A legal program can be derived from the start symbol of the grammar • The grammar must be unambiguous and context-free
Deriving a String • The derivation begins with the start symbol • At each step of a derivation the right hand side of a grammar rule is used to replace a non-terminal symbol • Continue replacing non-terminals until only terminal symbols remain
Rules: 1. expr → expr - digit 2. expr → expr + digit 3. expr → digit 4. digit → 0|1|2|…|9
Example input: 3 + 8 - 2
Rule 1: expr ⇒ expr - digit
Rule 4: expr - digit ⇒ expr - 2
Rule 2: expr - 2 ⇒ expr + digit - 2
Rule 4: expr + digit - 2 ⇒ expr + 8 - 2
Rule 3: expr + 8 - 2 ⇒ digit + 8 - 2
Rule 4: digit + 8 - 2 ⇒ 3 + 8 - 2
Rightmost Derivation • The rightmost non-terminal is replaced in each step
Rules: 1. expr → expr - digit 2. expr → expr + digit 3. expr → digit 4. digit → 0|1|2|…|9
Example input: 3 + 8 - 2
Rule 1: expr ⇒ expr - digit
Rule 4: expr - digit ⇒ expr - 2
Rule 2: expr - 2 ⇒ expr + digit - 2
Rule 4: expr + digit - 2 ⇒ expr + 8 - 2
Rule 3: expr + 8 - 2 ⇒ digit + 8 - 2
Rule 4: digit + 8 - 2 ⇒ 3 + 8 - 2
Leftmost Derivation • The leftmost non-terminal is replaced in each step
Rules: 1. expr → expr - digit 2. expr → expr + digit 3. expr → digit 4. digit → 0|1|2|…|9
Example input: 3 + 8 - 2
Rule 1: expr ⇒ expr - digit
Rule 2: expr - digit ⇒ expr + digit - digit
Rule 3: expr + digit - digit ⇒ digit + digit - digit
Rule 4: digit + digit - digit ⇒ 3 + digit - digit
Rule 4: 3 + digit - digit ⇒ 3 + 8 - digit
Rule 4: 3 + 8 - digit ⇒ 3 + 8 - 2
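The leftmost derivation of 3 + 8 - 2 can be replayed mechanically. In the sketch below a sentential form is a list of symbols, and each step rewrites the leftmost non-terminal with the chosen rule. The function names and the way Rule 4's alternative is passed in are choices made for this illustration.

```python
NONTERMINALS = {"expr", "digit"}
RULES = {
    1: ("expr", ["expr", "-", "digit"]),
    2: ("expr", ["expr", "+", "digit"]),
    3: ("expr", ["digit"]),
}

def leftmost_step(form, rule, digit=None):
    """Replace the leftmost non-terminal in form using the given rule number."""
    index = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
    if rule == 4:
        # Rule 4 is digit -> 0|1|...|9; the caller picks the alternative.
        lhs, rhs = "digit", [digit]
    else:
        lhs, rhs = RULES[rule]
    if form[index] != lhs:
        raise ValueError(f"rule {rule} does not apply to {form[index]}")
    return form[:index] + rhs + form[index + 1:]

def derive(steps):
    """Return every sentential form, starting from the start symbol expr."""
    form = ["expr"]
    trace = [" ".join(form)]
    for rule, digit in steps:
        form = leftmost_step(form, rule, digit)
        trace.append(" ".join(form))
    return trace

STEPS = [(1, None), (2, None), (3, None), (4, "3"), (4, "8"), (4, "2")]
for line in derive(STEPS):
    print(line)
```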
Leftmost Derivation • The leftmost non-terminal is replaced in each step
Step 1, Rule 1: expr ⇒ expr - digit
Step 2, Rule 2: expr - digit ⇒ expr + digit - digit
Step 3, Rule 3: expr + digit - digit ⇒ digit + digit - digit
Step 4, Rule 4: digit + digit - digit ⇒ 3 + digit - digit
Step 5, Rule 4: 3 + digit - digit ⇒ 3 + 8 - digit
Step 6, Rule 4: 3 + 8 - digit ⇒ 3 + 8 - 2
(Figure: syntax tree for 3 + 8 - 2 whose nodes carry the numbers 1 to 6, matching the derivation steps.)
Bottom-Up Parsing • Parser examines terminal symbols of the input string, in order from left to right • Reconstructs the syntax tree from the bottom (terminal nodes) up (toward the root node) • Bottom-up parsing reduces a string w to the start symbol of the grammar. • At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse.
Bottom-Up Parsing • Types of bottom-up parsing algorithms • Shift-reduce parsing • At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse. • LR(k) parsing • L is for left-to-right scanning of the input, the R is for constructing a right-most derivation in reverse, and the k is for the number of input symbols of look-ahead that are used in making parsing decisions.
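Shift-reduce parsing can be sketched for the small expr/digit grammar used in the derivation slides (expr → expr - digit | expr + digit | digit, digit → 0|1|…|9). This is not a general LR parser: it simply tries the longer right-hand sides first, which happens to pick the correct sub-string for this particular grammar. The function name and the textual form of the reductions are invented for the example.

```python
# Longer right-hand sides come first so they are preferred when reducing.
PRODUCTIONS = [
    ("expr", ["expr", "-", "digit"]),
    ("expr", ["expr", "+", "digit"]),
    ("expr", ["digit"]),
]

def shift_reduce(text):
    """Return (accepted, reductions) for the expr/digit grammar."""
    stack, reductions = [], []
    for ch in text.replace(" ", ""):
        stack.append(ch)                       # shift the next input symbol
        if ch.isdigit():
            stack[-1] = "digit"                # reduce by digit -> 0|1|...|9
            reductions.append(f"digit -> {ch}")
        reduced = True
        while reduced:                         # reduce while a right-hand side is on top
            reduced = False
            for lhs, rhs in PRODUCTIONS:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    reductions.append(f"{lhs} -> {' '.join(rhs)}")
                    reduced = True
                    break
    return stack == ["expr"], reductions

accepted, steps = shift_reduce("3 + 8 - 2")
print(accepted)
for step in steps:
    print(step)
```

Read from bottom to top, the printed reductions are the steps of the rightmost derivation of 3 + 8 - 2, which is the "rightmost derivation in reverse" that the slide describes.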
Bottom-Up Parsing Example: 3+8-2 • Rules: 1. expr → expr - digit 2. expr → expr + digit 3. expr → digit 4. digit → 0|1|2|…|9 • Example input: 3 + 8 - 2 • (Figures: successive snapshots of the syntax tree being built upward from the terminals 3 + 8 - 2, with digit and expr nodes added above them.)
Bottom-Up Parsing Example: 3+8-2 (cont.) • (Figures: further snapshots in which the remaining digit and expr nodes are added until a single expr node covers the whole input.)
Bottom-Up Parsing Example: abbcde • Grammar: S → aABe • A → Abc | b • B → d • Example input: abbcde • (Figures: the input a b b c d e; the first b is reduced to A, giving the sentential form aAbcde.)