CIS 461 Compiler Design and Construction, Fall 2012
Instructor: Hugh McGuire
Lecture Module 2b: Phases of a Compiler
High-Level View of a Compiler
A compiler maps source code to machine code, reporting errors along the way. It must:
• recognize legal (and reject illegal) programs
• generate correct code
• manage storage of all variables (and code)
• agree with the OS and linker on the format for object code
A Higher-Level View: How Does the Compiler Fit In?
• Preprocessor: collects the source program (which may be divided into separate files) and performs macro expansion (skeletal source program → source program)
• Compiler: translates the source program into a target assembly program
• Assembler: generates relocatable machine code from the assembly code
• Loader/Linker: links the library routines and other relocatable object files, and generates absolute addresses (relocatable machine code → absolute machine code)
Traditional Two-Pass Compiler
• Uses an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code
• Admits multiple front ends and multiple passes
• Typically, the front end runs in O(n) or O(n log n) time, while optimal back-end code generation is NP-complete
• The phases of the compiler also interact through the symbol table
[Diagram: Source code → Front End → IR → Back End → Machine code; both phases report Errors and share the Symbol Table]
The Front End
Responsibilities:
• Recognize legal programs
• Report errors for illegal programs in a useful way
• Produce IR and construct the symbol table
Much of front-end construction can be automated.
[Diagram: Source code → Scanner → tokens → Parser → IR → Type Checker → IR; errors reported]
The Front End: Scanner
• Maps the character stream into words, the basic units of syntax
• Produces tokens and stores lexemes when necessary:
  x = x + y ;  becomes  <id,x> EQ <id,x> PLUS <id,y> SEMICOLON
• Typical tokens include number, identifier, +, -, while, if
• The scanner eliminates white space and comments
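The scanner's job can be sketched in a few lines of Python. This is a minimal illustration, not a production scanner; the token names and rule list are hypothetical, chosen to mirror the example above.

```python
import re

# Each rule pairs a token name with a regular expression for its lexemes.
# Rule names (ID, EQ, PLUS, ...) are illustrative, mirroring the slide.
TOKEN_RULES = [
    ("WS",        r"[ \t\n]+"),              # white space: discarded
    ("ID",        r"[A-Za-z][A-Za-z0-9]*"),  # identifiers
    ("EQ",        r"="),
    ("PLUS",      r"\+"),
    ("SEMICOLON", r";"),
]

def scan(source):
    """Map a character stream into a list of (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "WS":     # the scanner eliminates white space
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"illegal character at position {pos}")
    return tokens
```

Running `scan("x = x + y ;")` yields the token stream shown on the slide, with lexemes attached to the identifiers.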
The Front End: Parser
• Uses the scanner as a subroutine (calls it to get the next token)
• Recognizes context-free syntax and reports errors
• Guides context-sensitive analysis (type checking)
• Builds the IR for the source program
Scanning and parsing can be grouped into one pass.
The Front End: Context-Sensitive Analysis
• Check that all variables are declared before they are used
• Type checking: check for type errors such as adding a procedure and an array
• Add the necessary type conversions: int-to-float, float-to-double, etc.
The Back End
Responsibilities:
• Translate IR into target machine code
• Choose instructions to implement each IR operation
• Decide which values to keep in registers
• Schedule the instructions for the instruction pipeline
Automation has been much less successful in the back end.
[Diagram: IR → Instruction Selection → IR → Instruction Scheduling → IR → Register Allocation → Machine code; errors reported]
The Back End: Instruction Selection
• Produce fast, compact code
• Take advantage of target-language features such as addressing modes
• Usually viewed as a pattern-matching problem: ad hoc methods, pattern matching, dynamic programming
• This was "the problem of the future" in the late 1970s, when instruction sets were complex; RISC architectures simplified it
The Back End: Instruction Scheduling
• Avoid hardware stalls (keep the pipeline moving)
• Use all functional units productively
• Optimal scheduling is NP-complete
The Back End: Register Allocation
• Have each value in a register when it is used
• Manage a limited set of registers
• Can change instruction choices and insert LOADs and STOREs
• Optimal allocation is NP-complete
Compilers approximate solutions to NP-complete problems.
Traditional Three-Pass Compiler: Code Optimization
• Analyzes and transforms the IR
• Primary goal is to reduce the running time of the compiled code
• May also improve space or power consumption (mobile computing)
• Must preserve the "meaning" of the code
[Diagram: Source code → Front End → IR → Middle End → IR → Back End → Machine code; errors reported]
The Optimizer (or Middle End)
Typical transformations:
• Discover and propagate constant values
• Move a computation to a less frequently executed place
• Discover a redundant computation and remove it
• Remove unreachable code
Modern optimizers are structured as a series of passes (Opt 1 → Opt 2 → ... → Opt n), each taking IR in and producing IR out.
First Phase: Lexical Analysis (Scanning)
• The scanner maps the stream of characters into words, the basic units of syntax
• The characters that form a word are its lexeme; its syntactic category is called its token
• The scanner discards white space and comments
Why Lexical Analysis?
By separating context-free syntax from lexical analysis:
• We can develop efficient scanners
• We can automate efficient scanner construction
• We can write simple specifications for tokens
[Diagram: specifications (regular expressions) → Scanner Generator → tables or code → Scanner; source code → Scanner → tokens]
What are Tokens?
• Token: basic unit of syntax
• Keywords: if, while, ...
• Operators: +, *, <=, ||, ...
• Identifiers (names of variables, arrays, procedures, classes): i, i1, j1, count, sum, ...
• Numbers: 12, 3.14, 7.2E-2, ...
What are Tokens?
Tokens are terminal symbols for the parser; they are treated as indivisible units in the grammar defining the source language.
1. S → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op → +
7.    | -
number, id, +, and - are tokens passed from the scanner to the parser. They form the terminal symbols of this simple grammar.
Lexical Concepts
• Token: basic unit of syntax, the syntactic output of the scanner
• Pattern: the rule that describes the set of strings that correspond to a token; the specification of the token
• Lexeme: a sequence of input characters that matches a pattern and generates the token

Token  | Lexeme                   | Pattern
WHILE  | while                    | while
IF     | if                       | if
ID     | i1, length, count, sqrt  | letter followed by letters and digits
Tokens can have Attributes
if (i == j) z = 0; else z = 1;
becomes
IF, LPAREN, ID, EQEQ, ID, RPAREN, ID, EQ, NUM, SEMICOLON, ELSE, ID, EQ, NUM, SEMICOLON
• A problem: if we send this output to the parser, is it enough? Where are the variable names, procedure names, etc.? All identifiers look the same.
• Tokens can have attributes that they pass to the parser (using the symbol table):
IF, LPAREN, <ID,i>, EQEQ, <ID,j>, RPAREN, <ID,z>, EQ, <NUM,0>, SEMICOLON, ELSE, <ID,z>, EQ, <NUM,1>, SEMICOLON
How do we specify lexical patterns?
Some patterns are easy:
• Keywords and operators: specified as literal patterns: if, then, else, while, =, +, ...
Specifying Lexical Patterns
Some patterns are more complex:
• Identifiers: a letter followed by letters and digits
• Numbers:
  - Integer: 0, or a digit between 1 and 9 followed by digits between 0 and 9
  - Decimal: an optional sign ("+" or "-"), followed by "0" or a nonzero digit followed by an arbitrary number of digits, followed by a decimal point, followed by an arbitrary number of digits
GOAL: We want concise descriptions of patterns, and we want to construct the scanner automatically from these descriptions.
Specifying Lexical Patterns: Regular Expressions
Regular expressions (REs) describe regular languages.
Regular expression (over alphabet Σ):
• ε (the empty string) is an RE denoting the set {ε}
• If a is in Σ, then a is an RE denoting {a}
• If x and y are REs denoting languages L(x) and L(y), then
  - (x) is an RE denoting L(x)
  - x | y is an RE denoting L(x) ∪ L(y)
  - xy is an RE denoting L(x)L(y)
  - x* is an RE denoting L(x)*
Precedence: closure binds tightest, then concatenation, then alternation; all operators are left-associative.
x | y* z is equivalent to x | ((y*) z)
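Python's `re` module uses the same precedence rules, so we can check the last equivalence directly. A quick sanity check (the anchors `^`/`$` just force whole-string matching):

```python
import re

# Closure > concatenation > alternation, so x|y*z parses as x|((y*)z).
pattern = re.compile(r"^(x|y*z)$")

assert pattern.match("x")        # matches via the x alternative
assert pattern.match("yyz")      # matches via (y*)z
assert pattern.match("z")        # zero ys, then z
assert not pattern.match("xz")   # x|y*z is NOT ((x|y)*)z
```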
Operations on Languages
• Union of L and M (written L ∪ M): L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation of L and M (written LM): LM = { st | s ∈ L and t ∈ M }
• Exponentiation of L (written L^i): L^0 = {ε}; L^i = L^(i-1)L if i > 0
• Kleene closure of L (written L*): L* = union of L^i for 0 ≤ i ≤ ∞
• Positive closure of L (written L+): L+ = union of L^i for 1 ≤ i ≤ ∞
Examples of Regular Expressions
• All strings of 1s and 0s: ( 0 | 1 )*
• All strings of 1s and 0s beginning with a 1: 1 ( 0 | 1 )*
• All strings of 0s and 1s containing at least two consecutive 1s: ( 0 | 1 )* 1 1 ( 0 | 1 )*
• All strings of alternating 0s and 1s: ( ε | 1 ) ( 0 1 )* ( ε | 0 )
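These four examples translate directly into Python regular expressions; ε (an empty alternative) is written as an empty branch, e.g. `(|1)`. A sketch, with anchors added for whole-string matching:

```python
import re

binary      = re.compile(r"^(0|1)*$")          # all strings of 1s and 0s
starts_one  = re.compile(r"^1(0|1)*$")         # beginning with a 1
two_ones    = re.compile(r"^(0|1)*11(0|1)*$")  # two consecutive 1s
alternating = re.compile(r"^(|1)(01)*(|0)$")   # alternating 0s and 1s

assert binary.match("0110")
assert starts_one.match("1001") and not starts_one.match("01")
assert two_ones.match("0110") and not two_ones.match("0101")
assert alternating.match("10101") and not alternating.match("1100")
```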
Extensions to Regular Expressions (à la JLex)
• x+ = x x*, denotes L(x)+
• x? = x | ε, denotes L(x) ∪ {ε}
• [abc] = a | b | c: matches one character in the square brackets
• a-z = a | b | c | ... | z: a character range
• [0-9a-z] = 0 | 1 | 2 | ... | 9 | a | b | c | ... | z
• [^abc]: ^ means negation; matches any character except a, b, or c
• . (dot): matches any character except the newline; since \n means newline, . is equivalent to [^\n]
• "[": metacharacters in double quotes become plain characters; matches a left square bracket
• \[: a metacharacter after a backslash becomes a plain character; matches a left square bracket
Regular Definitions
• We can define macros using regular expressions and use them in other regular expressions:
  Letter → (a|b|c| ... |z|A|B|C| ... |Z)
  Digit → (0|1|2| ... |9)
  Identifier → Letter ( Letter | Digit )*
• Important: we should be able to order these definitions so that every definition uses only definitions defined before it (i.e., no recursion)
• Regular definitions can be converted to basic regular expressions by macro expansion
• In JLex, uses of definitions are enclosed in curly braces:
  Identifier = {Letter} ( {Letter} | {Digit} )*
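Macro expansion of regular definitions can be mimicked with ordinary string composition. A sketch in Python, building the Identifier pattern out of the Letter and Digit definitions above:

```python
import re

# Each definition uses only definitions that come before it (no recursion).
LETTER     = r"[a-zA-Z]"
DIGIT      = r"[0-9]"
IDENTIFIER = re.compile(rf"^{LETTER}({LETTER}|{DIGIT})*$")

assert IDENTIFIER.match("count")
assert IDENTIFIER.match("i1")
assert not IDENTIFIER.match("1i")   # must start with a letter
```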
Examples of Regular Definitions
  Digit → (0|1|2| ... |9)
  Integer → (+|-)? (0 | (1|2|3| ... |9) Digit*)
  Decimal → Integer "." Digit*
  Real → ( Integer | Decimal ) E (+|-)? Digit*
  Complex → "(" Real , Real ")"
Numbers can get even more complicated.
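The Integer and Decimal definitions expand into concrete patterns the same way. A Python sketch (the anchored whole-string matching is an addition for testing):

```python
import re

# Macro-expanding Digit -> Integer -> Decimal, as defined above.
DIGIT   = r"[0-9]"
INTEGER = rf"[+-]?(0|[1-9]{DIGIT}*)"
DECIMAL = rf"{INTEGER}\.{DIGIT}*"

integer = re.compile(rf"^{INTEGER}$")
decimal = re.compile(rf"^{DECIMAL}$")

assert integer.match("-42")
assert not integer.match("007")   # no leading zeros in this definition
assert decimal.match("3.14")
assert decimal.match("+1.")       # Digit* after the point may be empty
```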
From Regular Expressions to Scanners
• Regular expressions are useful for specifying the patterns that correspond to tokens
• However, we also want to construct programs that recognize these patterns
• How do we do it? Use finite automata!
Example: Recognizing Register Names
Consider the problem of recognizing register names in an assembler:
  Register → R (0|1|2| ... |9) (0|1|2| ... |9)*
• Allows registers of arbitrary number
• Requires at least one digit
The RE corresponds to a recognizer (a DFA):
[Diagram: initial state s0 --R--> s1 --digit--> s2 (accepting state); s2 loops on digits; all other transitions lead to error state se, which loops on R and digits]
Deterministic Finite Automata (DFA)
• A set of states S: S = { s0, s1, s2, se }
• A set of input symbols (an alphabet) Σ: Σ = { R, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
• A transition function δ : S × Σ → S, mapping (state, symbol) pairs to states:
  δ = { (s0,R)→s1, (s0,0-9)→se, (s1,0-9)→s2, (s1,R)→se, (s2,0-9)→s2, (s2,R)→se, (se,R|0-9)→se }
• A start state: s0
• A set of final (or accepting) states: Final = { s2 }
A DFA accepts a word x iff there exists a path in the transition graph from the start state to a final state such that the edge labels along the path spell out x.
DFA Simulation
• Start in state s0 and follow the transitions on each input character
• The DFA accepts a word x iff x leaves it in a final state (s2)
So:
• "R17" takes it through s0, s1, s2 and is accepted
• "R" takes it through s0, s1 and fails
• "A" takes it straight to se
• "R17R" takes it through s0, s1, s2, se and is rejected
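This simulation translates directly into code. A minimal Python rendering of the register DFA (state names follow the slide; the function names are ours):

```python
DIGITS = set("0123456789")

def delta(state, ch):
    """Transition function for the register recognizer."""
    if state == "s0" and ch == "R":
        return "s1"
    if state in ("s1", "s2") and ch in DIGITS:
        return "s2"
    return "se"          # everything else goes to the error state

FINAL = {"s2"}

def accepts(word):
    """Follow transitions from s0; accept iff we end in a final state."""
    state = "s0"
    for ch in word:
        state = delta(state, ch)
    return state in FINAL

assert accepts("R17")        # s0 -> s1 -> s2 -> s2
assert not accepts("R")      # s1 is not final
assert not accepts("R17R")   # trailing R leads to se
```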
Simulating a DFA
We can store the transition table δ in a two-dimensional array:

         |  R  | 0-9 | other
    s0   | s1  | se  | se
    s1   | se  | s2  | se
    s2   | se  | s2  | se
    se   | se  | se  | se

Final = { s2 }. We can also store the final states in an array.

  state = s0;
  char = get_next_char();
  while (char != EOF) {
    state = δ(state, char);
    char = get_next_char();
  }
  if (state ∈ Final) report acceptance;
  else report failure;

• The recognizer translates directly into code
• To change DFAs, just change the arrays
• Takes O(|x|) time for an input string x
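The table-driven version looks like this in Python (a sketch; a dict stands in for the two-dimensional array, and `symbol_class` is a helper we introduce to fold the ten digits into one column):

```python
def symbol_class(ch):
    """Map an input character to a column of the transition table."""
    if ch == "R":
        return "R"
    if ch.isdigit():
        return "digit"
    return "other"

# The transition table from the slide, one entry per (state, column).
DELTA = {
    ("s0", "R"): "s1", ("s0", "digit"): "se", ("s0", "other"): "se",
    ("s1", "R"): "se", ("s1", "digit"): "s2", ("s1", "other"): "se",
    ("s2", "R"): "se", ("s2", "digit"): "s2", ("s2", "other"): "se",
    ("se", "R"): "se", ("se", "digit"): "se", ("se", "other"): "se",
}
FINAL = {"s2"}

def recognize(word):
    """Run the table-driven DFA; O(|word|) time."""
    state = "s0"
    for ch in word:
        state = DELTA[(state, symbol_class(ch))]
    return state in FINAL

assert recognize("R17") and not recognize("A17")
```

To change DFAs, just swap in different DELTA and FINAL tables; the driver loop stays the same.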
Recognizing the Longest Accepted Prefix
Given an input string, this simulation algorithm returns the longest accepted prefix:

  accepted = false;
  current_string = ε;   // empty string
  state = s0;           // initial state
  if (state ∈ Final) {
    accepted_string = current_string;
    accepted = true;
  }
  char = get_next_char();
  while (char != EOF) {
    state = δ(state, char);
    current_string = current_string + char;
    if (state ∈ Final) {
      accepted_string = current_string;
      accepted = true;
    }
    char = get_next_char();
  }
  if (accepted) return accepted_string;
  else report error;

Given the input "R17R", this simulation algorithm returns "R17".
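The same maximal-munch idea in Python, again using the register DFA. One small shortcut relative to the pseudocode: because se is a dead (trap) state, we can stop as soon as we reach it rather than consuming the rest of the input; the result is the same.

```python
FINAL = {"s2"}

def delta(state, ch):
    """Transition function for the register recognizer."""
    if state == "s0" and ch == "R":
        return "s1"
    if state in ("s1", "s2") and ch.isdigit():
        return "s2"
    return "se"

def longest_accepted_prefix(text):
    """Return the longest prefix of text that the DFA accepts."""
    state, accepted = "s0", None
    for i, ch in enumerate(text):
        state = delta(state, ch)
        if state == "se":            # dead state: no longer prefix can win
            break
        if state in FINAL:           # remember the last accepting position
            accepted = text[:i + 1]
    if accepted is None:
        raise SyntaxError("no accepted prefix")
    return accepted

assert longest_accepted_prefix("R17R") == "R17"
```

Real scanners use exactly this discipline, restarting the DFA after each returned token so that "R17R2" scans as the registers R17 and R2.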