Course Overview
PART I: overview material
  1 Introduction
  2 Language processors (tombstone diagrams, bootstrapping)
  3 Architecture of a compiler
PART II: inside a compiler
  4 Syntax analysis
  5 Contextual analysis
  6 Runtime organization
  7 Code generation
PART III: conclusion
  • Interpretation
  • Review
The “Phases” of a Compiler
Source Program → Syntax Analysis (this chapter) → Abstract Syntax Tree → Contextual Analysis → Decorated Abstract Syntax Tree → Code Generation → Object Code
Syntax Analysis and Contextual Analysis each also produce Error Reports.
In Chapter 4
• Syntax Analysis
  • Scanning: recognize “words” or “tokens” in the input
  • Parsing: recognize the structure of the program
    • Different parsing strategies
    • How to construct a recursive descent parser
  • AST Construction
• Use of theoretical “Tools”:
  • Regular Expressions and Finite–State Machines
  • Grammars
  • Extended BNF notation
  • First sets and Follow sets
Syntax Analysis • The “job” of syntax analysis is to read the source program (text file) and determine its structure. • Subphases • Scanning • Parsing • Construct an internal representation of the source text that shows the structure (usually an AST) Note: A single-pass compiler usually does not explicitly construct an AST.
Multi Pass Compiler
A multi-pass compiler makes several passes over the program. The output of each phase is stored in a data structure and used by subsequent phases.
Dependency diagram of a typical multi-pass compiler: the Compiler Driver calls the Syntactic Analyzer (this chapter), the Contextual Analyzer, and the Code Generator.
• Syntactic Analyzer: input Source Text, output AST
• Contextual Analyzer: input AST, output Decorated AST
• Code Generator: input Decorated AST, output Object Code
Syntax Analysis: Dataflow chart
Source Program (stream of characters) → Scanner → stream of “tokens” → Parser → Abstract Syntax Tree
The Scanner and the Parser each also produce Error Reports.
(1) Scan: Divide Input into Tokens
An example Mini–Triangle source program:
  let var y: Integer
  in !new year
     y := y+1
Tokens are “words” in the input, for example keywords, operators, identifiers, literals, etc.
The scanner turns the program text into the following token stream (the comment "!new year" is skipped):
  let, var, ident. (y), colon (:), ident. (Integer), in, ident. (y), becomes (:=), ident. (y), op. (+), intlit (1), eot
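A scanner for the example above can be sketched with regular expressions. This is a minimal illustration, not the course's scanner: the token kind names follow the slide's labels, while `TOKEN_SPEC`, `KEYWORDS`, and `scan` are hypothetical names, and the operator set is an assumption.

```python
import re

# Token kinds follow the slide's labels; the regexes are assumptions for this sketch.
TOKEN_SPEC = [
    ("comment", r"![^\n]*"),                # '!' starts a comment that runs to end of line
    ("space",   r"\s+"),
    ("intlit",  r"[0-9]+"),
    ("ident",   r"[A-Za-z][A-Za-z0-9]*"),
    ("becomes", r":="),                     # must be tried before the plain colon
    ("colon",   r":"),
    ("op",      r"[+\-*/<>=]"),             # assumed operator set
]
KEYWORDS = {"let", "var", "in"}
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def scan(source):
    """Return a list of (kind, spelling) pairs, ending with an 'eot' token."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        kind, spelling = match.lastgroup, match.group()
        if kind in ("space", "comment"):
            continue                        # separators are not passed to the parser
        if kind == "ident" and spelling in KEYWORDS:
            kind = spelling                 # keywords get their own token kind
        tokens.append((kind, spelling))
    tokens.append(("eot", ""))
    return tokens
```

Calling `scan("let var y: Integer\nin !new year\n   y := y+1")` yields the twelve tokens listed on the slide, from `let` through `eot`.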
(2) Parse: Determine structure of program
The parser analyzes the structure of the token stream with respect to the grammar of the language.
[Figure: parse tree for the token stream let var id.(y) col.(:) id.(Integer) in id.(y) bec.(:=) id.(y) op(+) intlit(1) eot. The root is Program; the interior nodes are single-Command, Declaration, single-Declaration, Type Denoter, Expression, primary-Exp, and V-Name; the leaves are Ident, Op., and Int.Lit tokens.]
(3) AST Construction
Program
  LetCommand
    VarDecl
      Ident y
      SimpleType
        Ident Integer
    AssignCommand
      SimpleVar
        Ident y
      BinaryExpr
        VNameExp
          SimpleVar
            Ident y
        Op +
        Int.Expr
          Int.Lit 1
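One way to hold such a tree in memory is one small class per node kind. The sketch below is only an illustration: the class names mirror the node labels on the slide, but the field names and the `example_ast` helper are assumptions, not the course compiler's actual classes.

```python
from dataclasses import dataclass


@dataclass
class Ident:
    spelling: str


@dataclass
class SimpleType:
    name: Ident


@dataclass
class VarDecl:
    name: Ident
    type: SimpleType


@dataclass
class SimpleVar:
    name: Ident


@dataclass
class VNameExp:
    vname: SimpleVar


@dataclass
class IntExpr:
    literal: str          # spelling of the integer literal


@dataclass
class BinaryExpr:
    left: object
    op: str
    right: object


@dataclass
class AssignCommand:
    target: SimpleVar
    expr: object


@dataclass
class LetCommand:
    decl: VarDecl
    body: object


@dataclass
class Program:
    command: object


def example_ast():
    """Build the AST for: let var y: Integer in y := y+1"""
    decl = VarDecl(Ident("y"), SimpleType(Ident("Integer")))
    expr = BinaryExpr(VNameExp(SimpleVar(Ident("y"))), "+", IntExpr("1"))
    return Program(LetCommand(decl, AssignCommand(SimpleVar(Ident("y")), expr)))
```

Later phases can then walk this structure, for example a contextual analyzer that attaches type information to each expression node.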
Grammars
RECAP:
• The syntax of a language can be specified by means of a CFG (Context-Free Grammar).
• A CFG can be expressed in BNF (Backus-Naur Form).
Example: Mini–Triangle grammar in BNF
  Program ::= single-Command
  Command ::= single-Command
            | Command ; single-Command
  single-Command ::= V-name := Expression
                   | begin Command end
                   | ...
Grammars (continued)
For our convenience, we will use EBNF or “Extended BNF” rather than simple BNF.
EBNF = BNF + regular expressions
* means 0 or more occurrences of the preceding item
Example: Mini–Triangle in EBNF
  Program ::= single-Command
  Command ::= (single-Command ;)* single-Command
  single-Command ::= V-name := Expression
                   | begin Command end
                   | ...
Regular Expressions
• REs are a notation for expressing a set of strings of terminal symbols.
• Different kinds of RE:
  • ε      The empty string
  • t      Generates only the string t
  • X Y    Generates any string xy such that x is generated by X and y is generated by Y
  • X | Y  Generates any string that is generated either by X or by Y
  • X*     The concatenation of zero or more strings generated by X
  • (X)    Used for grouping
RE: Examples
What set of strings does each of the following REs generate?
  1. ε
  2. M(r|s)“.”
  3. (foo|bar)*
  4. (foo|bar)(foo|bar)*
  5. (0|1|2|3|4|5|6|7|8|9)*
  6. 0|(1|..|9)(0|1|..|9)*
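The examples can be explored with Python's `re` module. This is a rough sketch for experimenting: the Python patterns are hand translations of REs 2 to 6 (an assumption of this sketch, with the ".." ranges written out), and `generates` is a hypothetical helper that asks whether a pattern generates a whole string.

```python
import re

# Hand translations of the slide's REs into Python syntax (assumed equivalent).
TITLE = r"M(r|s)\."                                   # RE 2: the "." is a literal dot
FOOBAR_STAR = r"(foo|bar)*"                           # RE 3
FOOBAR_PLUS = r"(foo|bar)(foo|bar)*"                  # RE 4
DIGITS = r"(0|1|2|3|4|5|6|7|8|9)*"                    # RE 5
NUMERAL = r"0|(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"  # RE 6


def generates(pattern, text):
    """True if the whole of text is one of the strings generated by pattern."""
    return re.fullmatch(pattern, text) is not None
```

For instance, `generates(FOOBAR_STAR, "")` holds while `generates(FOOBAR_PLUS, "")` does not, and `generates(NUMERAL, "007")` fails because RE 6 admits no leading zeros.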
Regular Expressions • The “languages” that can be defined by RE and CFG have been extensively studied by theoretical computer scientists. These are some important conclusions and terminology: • RE is a “weaker” formalism than CFG: any language expressible by a RE can be expressed by a CFG, but not the other way around! • The languages expressible by REs are called regular languages. • Generally: a language that exhibits “self–embedding” cannot be expressed by a RE. • Programming languages exhibit self–embedding. (Examples: an expression can contain another expression, and a command can contain another command.)
Extended BNF • Extended BNF combines BNF with RE • A production in EBNF looks like LHS ::= RHS where LHS is a non terminal symbol and RHS is an extended regular expression • An extended RE is just like a regular expression except it is composed of terminals and non–terminals of the grammar. • Simply put, EBNF adds to BNF these notations • (...) for the purpose of grouping and • * for denoting “0 or more repetitions of … ”
Extended BNF: an Example
Example: a simple expression language
  Expression ::= PrimaryExp (Operator PrimaryExp)*
  PrimaryExp ::= Literal | Identifier | ( Expression )
  Identifier ::= Letter (Letter | Digit)*
  Literal    ::= Digit Digit*
  Letter     ::= a | b | c | ... | z
  Digit      ::= 0 | 1 | 2 | 3 | 4 | ... | 9
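The EBNF above maps quite directly onto a recursive descent recognizer: each `( ... )*` becomes a loop and each `|` becomes a choice on the next character. The sketch below is illustrative only. It works on characters without whitespace, and since the grammar leaves Operator undefined, the `OPERATORS` set is an assumption; `Parser` and `recognizes` are hypothetical names.

```python
LETTERS = set("abcdefghijklmnopqrstuvwxyz")
DIGITS = set("0123456789")
OPERATORS = set("+-*/")          # assumption: the grammar does not define Operator


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        # "" marks end of input; it is a member of none of the sets above
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def accept(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r} at offset {self.pos}")
        self.pos += 1

    def parse_expression(self):
        # Expression ::= PrimaryExp (Operator PrimaryExp)*
        self.parse_primary()
        while self.peek() in OPERATORS:
            self.pos += 1
            self.parse_primary()

    def parse_primary(self):
        # PrimaryExp ::= Literal | Identifier | ( Expression )
        ch = self.peek()
        if ch in DIGITS:                     # Literal ::= Digit Digit*
            while self.peek() in DIGITS:
                self.pos += 1
        elif ch in LETTERS:                  # Identifier ::= Letter (Letter | Digit)*
            self.pos += 1
            while self.peek() in LETTERS or self.peek() in DIGITS:
                self.pos += 1
        elif ch == "(":
            self.accept("(")
            self.parse_expression()
            self.accept(")")
        else:
            raise SyntaxError(f"unexpected {ch!r} at offset {self.pos}")


def recognizes(text):
    """True if text is a complete Expression according to the grammar."""
    parser = Parser(text)
    try:
        parser.parse_expression()
    except SyntaxError:
        return False
    return parser.pos == len(text)
```

The choice in `parse_primary` can be made by looking at a single character because the three alternatives begin with disjoint sets of symbols.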
A little bit of useful theory • We will now look at a few useful bits of theory. These will be necessary later when we implement parsers. • Grammar transformations • A grammar can be transformed in a number of ways without changing its meaning (i.e. its language, or the set of strings that it generates) • The definition and computation of starter sets (first sets), follow sets, and nullable symbols
Grammar Transformations
Left factorization: X Y | X Z can be rewritten as X ( Y | Z )
Example:
  single-Command ::= V-name := Expression
                   | if Expression then single-Command
                   | if Expression then single-Command else single-Command
becomes
  single-Command ::= V-name := Expression
                   | if Expression then single-Command (ε | else single-Command)
Here X is "if Expression then single-Command", Y = ε, and Z is "else single-Command".
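Left factorization can be sketched as a small function over lists of symbols. This is a simplified illustration that factors out only the longest prefix common to all the given alternatives; `left_factor` is a hypothetical helper, not part of any tool mentioned in the course.

```python
def left_factor(alternatives):
    """Rewrite X Y | X Z as X (Y | Z), where X is the longest common prefix.

    Each alternative is a list of grammar symbols; the result is an EBNF string.
    """
    prefix = []
    for symbols in zip(*alternatives):        # walk the alternatives in step
        if len(set(symbols)) != 1:
            break                             # the alternatives diverge here
        prefix.append(symbols[0])
    # What is left of each alternative after the prefix; an empty rest is ε.
    rests = [" ".join(alt[len(prefix):]) or "ε" for alt in alternatives]
    if not prefix:
        return " | ".join(rests)              # nothing in common to factor out
    return " ".join(prefix) + " (" + " | ".join(rests) + ")"
```

Applied to the two `if` alternatives of single-Command, it returns `if Expression then single-Command (ε | else single-Command)`.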
Grammar Transformations (continued)
Elimination of left recursion: N ::= X | N Y can be rewritten as N ::= X Y*
Example:
  Identifier ::= Letter | Identifier Letter | Identifier Digit
after left factorization:
  Identifier ::= Letter | Identifier (Letter | Digit)
after elimination of left recursion:
  Identifier ::= Letter (Letter | Digit)*
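The rule N ::= X | N Y to N ::= X Y* can be sketched as a function that separates the left-recursive alternatives from the rest. This is a minimal sketch under stated assumptions: only direct left recursion is handled, every alternative is non-empty, and `eliminate_left_recursion` is a hypothetical name.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite N ::= X | N Y as the EBNF production N ::= X Y*.

    alternatives is a list of non-empty symbol lists for the non-terminal name.
    """
    tails = [alt[1:] for alt in alternatives if alt[0] == name]   # the Y parts
    bases = [alt for alt in alternatives if alt[0] != name]       # the X parts
    if not bases or any(not tail for tail in tails):
        raise ValueError("need a non-recursive alternative and non-empty tails")

    def group(alts):
        text = " | ".join(" ".join(alt) for alt in alts)
        # A single symbol needs no parentheses; anything else does.
        return text if len(alts) == 1 and len(alts[0]) == 1 else f"({text})"

    if not tails:
        return f"{name} ::= " + " | ".join(" ".join(alt) for alt in bases)
    return f"{name} ::= {group(bases)} {group(tails)}*"
```

For the Identifier example it produces `Identifier ::= Letter (Letter | Digit)*`, combining the factorization and elimination steps shown above.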
Grammar Transformations (continued)
Substitution of non-terminal symbols: given N ::= X, an occurrence of N on the right-hand side of another production, M ::= α N β, can be replaced by X, giving M ::= α X β
Example:
  single-Command ::= for controlVar := Expression direction Expression do single-Command
  direction ::= to | downto
becomes
  single-Command ::= for controlVar := Expression (to | downto) Expression do single-Command
Starter Sets (a.k.a. First Sets)
• Informal Definition:
  • The starter set of a RE X is the set of terminal symbols that can occur as the start of any string generated by X
• Example:
  • starters[ (“+” | “-” | ε) (0 | 1 | … | 9)+ ] = {+, -, 0, 1, …, 9}
• Formal Definition:
  • starters[ε] = { }
  • starters[t] = {t} (where t is any terminal symbol)
  • starters[X Y] = starters[X] (if X doesn’t generate ε)
  • starters[X Y] = starters[X] ∪ starters[Y] (if X generates ε)
  • starters[X | Y] = starters[X] ∪ starters[Y]
  • starters[X*] = starters[X]
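The formal definition translates almost line by line into a recursive function. In this sketch a RE is a nested tuple, an encoding chosen here for illustration: `("eps",)`, `("t", symbol)`, `("seq", X, Y)`, `("alt", X, Y)`, and `("star", X)`. The helper `generates_empty` decides the side condition "X generates ε". In the usage check, the trailing + of the slide's example is read as "one or more" and encoded as D D*, which is an assumption.

```python
def generates_empty(re):
    """True if the RE can generate the empty string."""
    kind = re[0]
    if kind == "eps" or kind == "star":
        return True
    if kind == "t":
        return False
    if kind == "seq":
        return generates_empty(re[1]) and generates_empty(re[2])
    if kind == "alt":
        return generates_empty(re[1]) or generates_empty(re[2])
    raise ValueError(f"unknown RE node {kind!r}")


def starters(re):
    """Set of terminals that can begin a string generated by the RE."""
    kind = re[0]
    if kind == "eps":
        return set()
    if kind == "t":
        return {re[1]}
    if kind == "seq":
        result = starters(re[1])
        if generates_empty(re[1]):          # X may vanish, so Y's starters count too
            result = result | starters(re[2])
        return result
    if kind == "alt":
        return starters(re[1]) | starters(re[2])
    if kind == "star":
        return starters(re[1])
    raise ValueError(f"unknown RE node {kind!r}")


def alt_of(*res):
    """Fold several REs into nested binary 'alt' nodes."""
    result = res[0]
    for re in res[1:]:
        result = ("alt", result, re)
    return result


DIGIT = alt_of(*[("t", d) for d in "0123456789"])
SIGN = alt_of(("t", "+"), ("t", "-"), ("eps",))
SIGNED_NUMBER = ("seq", SIGN, ("seq", DIGIT, ("star", DIGIT)))
```

`starters(SIGNED_NUMBER)` contains the two signs and all ten digits, because the sign part can generate ε.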
Derivations
• A derivation step replaces a non-terminal by the right-hand side of one of its productions.
  S ::= E
  E ::= T | E + T
  T ::= i | ( E )
  S
  S => E
  S => E => E + T
  S => E => E + T => T + T
  S => E => E + T => T + T => i + T
  S => E => E + T => T + T => i + T => i + i
• This is a left-most derivation (it replaces the left-most non-terminal at each step).
• Can you find the corresponding right-most derivation?
• Can you find a derivation that is neither left-most nor right-most?
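A derivation can be replayed mechanically once the production chosen at each step is fixed. The sketch below is a small illustration: `GRAMMAR` encodes the three productions above as lists of symbols, and `derive` is a hypothetical helper that replaces either the left-most or the right-most non-terminal at each step.

```python
GRAMMAR = {
    "S": [["E"]],
    "E": [["T"], ["E", "+", "T"]],
    "T": [["i"], ["(", "E", ")"]],
}


def derive(start, choices, leftmost=True):
    """Return the sentential forms of a derivation.

    choices[k] is the index of the alternative used at step k for the
    non-terminal being replaced (left-most or right-most one).
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        positions = [i for i, symbol in enumerate(form) if symbol in GRAMMAR]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        i = positions[0] if leftmost else positions[-1]
        form[i:i + 1] = GRAMMAR[form[i]][choice]   # splice in the right-hand side
        steps.append(" ".join(form))
    return steps
```

`derive("S", [0, 1, 0, 0, 0])` reproduces the left-most derivation on the slide; passing `leftmost=False` with the same choices gives a right-most derivation of the same sentence.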
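Sentential forms
• A sentential form is a sequence of grammar symbols (terminals and non-terminals) that can be derived from the start symbol.
• A sentence is a sentential form that contains only terminal symbols, that is, a string that can be generated using the grammar.
  S => E => E + T => T + T => i + T => i + i
  Every step of this derivation is a sentential form; only the last, i + i, is a sentence.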
Ambiguous grammars
A grammar is ambiguous if some sentence has more than one distinct parse tree. Equivalently, a grammar is ambiguous if some sentence has more than one left-most derivation, or more than one right-most derivation.
  S ::= E
  E ::= i | ( E ) | E + E
Does i + i demonstrate the ambiguity? No: it has only one left-most derivation.
  E => E + E => i + E => i + i
Does i + i + i demonstrate the ambiguity? Yes: it has two distinct left-most derivations.
  E => E + E => i + E => i + E + E => i + i + E => i + i + i
  E => E + E => E + E + E => i + E + E => i + i + E => i + i + i
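For this particular grammar the number of parse trees of a sentence can be counted directly, which makes the ambiguity concrete. The sketch below is specific to E ::= i | ( E ) | E + E and is not a general ambiguity test; `count_parse_trees` is a hypothetical helper that tries every way of applying the three productions to a span of tokens.

```python
from functools import lru_cache


def count_parse_trees(sentence):
    """Count the parse trees of sentence under E ::= i | ( E ) | E + E."""
    tokens = tuple(sentence.replace(" ", ""))     # one character per token

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways tokens[i:j] can be derived from E.
        if i >= j:
            return 0
        total = 0
        if j - i == 1 and tokens[i] == "i":       # E ::= i
            total += 1
        if tokens[i] == "(" and tokens[j - 1] == ")":   # E ::= ( E )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):             # E ::= E + E, split at each '+'
            if tokens[k] == "+":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

A count of 1 means the sentence is unambiguous, a count above 1 means it witnesses the ambiguity, and 0 means the string is not a sentence at all.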
Augmented grammars
We augment grammars to ensure that we can recognize and handle the end of the input string.
Original grammar:
  S ::= E
  E ::= i | ( E ) | E + E
Augmented grammar:
  S’ ::= S $
  S ::= E
  E ::= i | ( E ) | E + E
Here $ denotes the end-of-file token.
Nullable, First sets (starter sets), and Follow sets • A non-terminal is nullable if it derives the empty string • First(N) or starters(N) is the set of all terminals that can begin a sentence derived from N • Follow(N) is the set of terminals that can follow N in some sentential form Next we will see algorithms to compute each of these.
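As a preview, the three definitions can be computed together by repeating simple rules until nothing changes. The sketch below is one common fixed-point formulation and not necessarily the algorithm presented next in the course. Grammars are dictionaries from non-terminal to lists of alternatives, an empty list stands for ε, and `analyse`, `AUGMENTED`, and `WITH_EMPTY` are hypothetical names.

```python
def analyse(grammar):
    """Return (nullable, first, follow) for a grammar given as
    {non_terminal: [alternative, ...]}, each alternative a list of symbols."""
    nullable = set()
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}
    changed = True
    while changed:                                  # iterate to a fixed point
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                # Nullable: every symbol of some alternative is nullable.
                if lhs not in nullable and all(s in nullable for s in rhs):
                    nullable.add(lhs)
                    changed = True
                # First: scan left to right until a non-nullable symbol.
                for symbol in rhs:
                    new = first[symbol] if symbol in grammar else {symbol}
                    if not new <= first[lhs]:
                        first[lhs] |= new
                        changed = True
                    if symbol not in nullable:
                        break
                # Follow: scan right to left, tracking what can come next.
                trailer = set(follow[lhs])
                for symbol in reversed(rhs):
                    if symbol in grammar:
                        if not trailer <= follow[symbol]:
                            follow[symbol] |= trailer
                            changed = True
                        if symbol in nullable:
                            trailer = trailer | first[symbol]
                        else:
                            trailer = set(first[symbol])
                    else:
                        trailer = {symbol}
    return nullable, first, follow


# The augmented grammar S' ::= S $, S ::= E, E ::= i | ( E ) | E + E
AUGMENTED = {
    "S'": [["S", "$"]],
    "S": [["E"]],
    "E": [["i"], ["(", "E", ")"], ["E", "+", "E"]],
}
# A tiny grammar with an ε alternative: A ::= B c, B ::= ε | b
WITH_EMPTY = {"A": [["B", "c"]], "B": [[], ["b"]]}
```

For `AUGMENTED` no non-terminal is nullable, First(E) is {i, (} and Follow(E) is {$, ), +}; for `WITH_EMPTY`, B is nullable, so First(A) includes c as well as b.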