Introduction to Compiler Design
What is a Compiler? • A compiler is a language translator that takes as input a program written in a high level language and produces an equivalent program in a low-level language. • For example, a compiler may translate a C++ program into an executable program running on a processor.
Phases of Compilation • In the process of translation, a compiler goes through several phases: • Lexical analysis (also called scanning) • Syntax analysis (also called parsing) • Semantic analysis • Optimization (not in this course!) • Code generation
Phases of a Compiler • Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program • Each phase transforms the source program from one representation into another representation. • The phases communicate with error handlers. • The phases communicate with the symbol table.
Lexical Analysis • The job of the lexical analyzer, or scanner, is to read the source program one character at a time and produce as output a stream of tokens. • The tokens produced by the scanner serve as input to the next phase, the parser. • Thus, the lexical analyzer’s job is to translate the source program into a form more conducive to recognition by the parser.
Tokens • The scanner takes a stream of characters and identifies tokens from the lexemes. • Tokens are used to represent low-level program units such as • Identifiers, such as sum, value, and x • Numeric literals, such as 123 and 1.35e06 • Operators, such as +, *, &&, <=, and % • Keywords, such as if, else, and return • Many other language symbols • The scanner also eliminates comments and redundant whitespace.
Tokens • Whitespace: A sequence of space, tab, newline, carriage-return, form-feed characters etc. • Lexeme: A sequence of non-whitespace characters delimited by whitespace or special characters (e.g. operators like +, -, *). • Examples of lexemes. • reserved words, keywords, identifiers etc. • Each comment is usually a single lexeme • preprocessor directives
Classes of Tokens • There are many ways we could represent the tokens of a programming language. One possibility is to use a 2-tuple of the form <token_class, value>. • For example, consider the token class identifier. The identifiers sum and value may be represented as <ident, “sum”> and <ident, “value”>, respectively.
Classes of Tokens • The token class NumericLiteral may be represented in the same way; for example, the literals 123 and 1.35e06 may be represented as <NumericLiteral, “123”> and <NumericLiteral, “1.35e06”>, respectively. • The same applies to operators; for example, <relop, “>=”> and <addop, “-”>.
Representing Tokens • These 2-tuples are easily represented as a struct or class in C++: enum TokenClass {ident, numlit,…}; struct Token { TokenClass tokenClass; string tokenValue; };
Tokens: An Example • The scanner may take the expression x = 2 + f(3); and produce the following stream of tokens: <ident, “x”> <assign_op, “=”> <numlit, “2”> <addop, “+”> <ident, “f”> <lparen, “(”> <numlit, “3”> <rparen, “)”> <semicolon, “;”>
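The token stream above can be reproduced with a small hand-written scanner. The following is a minimal sketch, not the course's implementation: the function name `scan`, the `unknown` token class, and the restriction to integer literals and single-character operators are all assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Token classes taken from the example; the set is illustrative, not complete.
enum class TokenClass { ident, numlit, assign_op, addop, lparen, rparen, semicolon, unknown };

struct Token {
    TokenClass tokenClass;
    std::string tokenValue;
};

// Hypothetical helper: scans the whole input and returns its tokens in order.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }            // whitespace is discarded
        if (std::isalpha(c) || c == '_') {                 // identifier: letter, then letters/digits
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            tokens.push_back({TokenClass::ident, src.substr(start, i - start)});
            continue;
        }
        if (std::isdigit(c)) {                             // integer literals only in this sketch
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            tokens.push_back({TokenClass::numlit, src.substr(start, i - start)});
            continue;
        }
        TokenClass cls = TokenClass::unknown;              // single-character symbols
        switch (c) {
            case '=': cls = TokenClass::assign_op; break;
            case '+': case '-': cls = TokenClass::addop; break;
            case '(': cls = TokenClass::lparen; break;
            case ')': cls = TokenClass::rparen; break;
            case ';': cls = TokenClass::semicolon; break;
            default: break;
        }
        tokens.push_back({cls, std::string(1, src[i])});
        ++i;
    }
    return tokens;
}
```

Calling `scan("x = 2 + f(3);")` yields nine tokens in the order listed on the slide. A real scanner would normally hand out one token per call and would also handle keywords, floating-point literals, multi-character operators, and comments.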
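The Scanner in Action • Source text fed to the scanner: if (x < y) min = x; • The figure shows part of the resulting token stream: it begins with <IFTOK, “if”> <LPAREN, “(”> <IDENT, “x”> and ends with <SEMICOLON, “;”>.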
Syntax Analysis • The job of the syntax analyzer, or parser, is to take a stream of tokens produced by the lexical analyzer and build a parse tree (or syntax tree). • The parser is basically a program that determines if sentences in a language are constructed properly according to the rules of the language.
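A Parse Tree • Root: if_statement, with children if ( expr ) statement • The condition expr has children ident (x), relop (<), and numlit (10) • statement has the single child assign, with children var = expr • var is an ident (y) • The right-hand expr is a numlit (23)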
Parsing Techniques • Depending on how the parse tree is created, there are different parsing techniques. • These parsing techniques are categorized into two groups: • Top-Down Parsing • Bottom-Up Parsing • Top-Down Parsing: • Construction of the parse tree starts at the root and proceeds towards the leaves. • Efficient top-down parsers can be easily constructed by hand. • Examples: Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing). • Bottom-Up Parsing: • Construction of the parse tree starts at the leaves and proceeds towards the root. • Efficient bottom-up parsers are normally created with the help of software tools. • Bottom-up parsing is also known as shift-reduce parsing. • Operator-Precedence Parsing: simple, restrictive, easy to implement. • LR Parsing: a much more general form of shift-reduce parsing; variants include LR, SLR, and LALR.
Syntax Analysis • The syntax of a language is defined by using a context free grammar (CFG). • A CFG uses BNF rules to describe the syntax: <if-stmt> -> if ( <cond> ) <stmt> [ else <stmt> ] ;
Syntax Analyzer (CFG) • The syntax of a language is specified by a context free grammar (CFG). • The rules in a CFG are mostly recursive. • A syntax analyzer checks whether a given program satisfies the rules implied by a CFG. • If it does, the syntax analyzer creates a parse tree for the given program. • Ex: We use BNF (Backus Naur Form) to specify a CFG: • assgstmt -> identifier := expression • expression -> identifier • expression -> number • expression -> expression + expression
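A hand-written top-down recognizer for this small grammar might look as follows. This is a sketch under stated assumptions: the rule expression -> expression + expression is left-recursive and ambiguous, so the code uses the rewritten form expression -> operand { + operand }, which is a swapped-in equivalent and not part of the slide. The class name `AssignmentRecognizer` is hypothetical, and the input is assumed to be already split into lexemes by a scanner.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Recognizer (no tree is built) for:
//   assgstmt   -> identifier := expression
//   expression -> operand { + operand }     (rewritten form, an assumption)
//   operand    -> identifier | number
class AssignmentRecognizer {
public:
    explicit AssignmentRecognizer(std::vector<std::string> lexemes)
        : lexemes_(std::move(lexemes)) {}

    // Returns true only if the whole input is one valid assignment statement.
    bool parseAssgStmt() {
        pos_ = 0;
        if (!acceptIdentifier()) return false;
        if (!accept(":=")) return false;
        if (!parseExpression()) return false;
        return pos_ == lexemes_.size();
    }

private:
    bool parseExpression() {
        if (!parseOperand()) return false;
        while (accept("+")) {                 // each '+' must be followed by an operand
            if (!parseOperand()) return false;
        }
        return true;
    }

    bool parseOperand() { return acceptIdentifier() || acceptNumber(); }

    bool accept(const std::string& text) {
        if (pos_ < lexemes_.size() && lexemes_[pos_] == text) { ++pos_; return true; }
        return false;
    }

    // Lexemes are assumed well formed; only the first character is inspected.
    bool acceptIdentifier() {
        if (pos_ >= lexemes_.size()) return false;
        const std::string& s = lexemes_[pos_];
        if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0]))) return false;
        ++pos_;
        return true;
    }

    bool acceptNumber() {
        if (pos_ >= lexemes_.size()) return false;
        const std::string& s = lexemes_[pos_];
        if (s.empty()) return false;
        for (char ch : s) {
            if (!std::isdigit(static_cast<unsigned char>(ch))) return false;
        }
        ++pos_;
        return true;
    }

    std::vector<std::string> lexemes_;
    std::size_t pos_ = 0;
};
```

With the lexemes `x := y + 1` the recognizer accepts; with `x := + 1` or `x := y +` it rejects. Each grammar rule maps to one member function, which is why top-down parsers of this kind are easy to write by hand.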
Syntax Analyzer versus Lexical Analyzer • Which constructs of a program should be recognized by the lexical analyzer, and which ones by the syntax analyzer? • Both do similar things, but the lexical analyzer deals with the simple, non-recursive constructs of the language. • The syntax analyzer deals with the recursive constructs of the language. • The lexical analyzer simplifies the job of the syntax analyzer. • The lexical analyzer recognizes the smallest meaningful units (tokens) in a source program. • The syntax analyzer works on those tokens to recognize the meaningful structures of the programming language.
Semantic Analyzer • A semantic analyzer checks the source program for semantic errors and collects type information for code generation. • Type checking is an important part of the semantic analyzer. • Semantic information normally cannot be expressed by the context-free grammars used in syntax analysis. • The semantic analyzer’s job is to attach some meaning to the structure produced by the parser. Activities include: • Ensuring an identifier is defined before being used in a statement or expression. • Enforcing the scope rules of the language. • Performing type checking. • Producing intermediate code.
Semantic Analysis • Static semantics can be determined by the compiler prior to execution, including • Declarations • Determine the structure and attributes of a user-defined data type • Determine the type of a variable • Determine the number and types of parameters of a procedure • Type checking • The process of ensuring that the type(s) of the operand(s) are appropriate for an operation
Semantic Analysis • Attributes are extra pieces of information computed by the semantic analyzer. These include the types of variables, constants, operators, etc. • An annotated syntax tree is a syntax tree that has been “decorated” with attributes. • Inherited attributes come down the syntax tree from parent or sibling nodes. • Synthesized attributes come up the syntax tree from child nodes.
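Semantic Analysis • Annotated syntax tree for a[ndx] = x + 3: • Root: Assign-expr, with the check lhs.type = rhs.type • Left child: Subscript-expr (integer), with children Identifier a and Identifier ndx • Right child: Add-operator (integer), with children Identifier x and Number 3 • In the figure, each of the four leaves is annotated integer.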
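A synthesized type attribute for the tree of a[ndx] = x + 3 can be computed with a post-order walk. The sketch below is illustrative only: the names `Kind`, `Type`, `Node`, `leaf`, `binary`, and `annotate` are hypothetical, and the array identifier a is given an assumed `IntegerArray` type (the figure labels every leaf integer) so that the subscript check has something to verify.

```cpp
#include <memory>
#include <utility>
#include <vector>

enum class Kind { Identifier, Number, Add, Subscript, Assign };
enum class Type { Integer, IntegerArray, Error };   // hypothetical type set

struct Node {
    Kind kind;
    Type type;                                      // the attribute; leaves start with their declared type
    std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> leaf(Kind k, Type t) {
    auto n = std::make_unique<Node>();
    n->kind = k;
    n->type = t;
    return n;
}

std::unique_ptr<Node> binary(Kind k, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->kind = k;
    n->type = Type::Error;                          // unknown until annotate() runs
    n->children.push_back(std::move(l));
    n->children.push_back(std::move(r));
    return n;
}

// Post-order walk: the children's types are computed first and then
// combined at the parent, so the attribute flows up the tree.
Type annotate(Node& n) {
    if (n.children.empty()) return n.type;
    Type left = annotate(*n.children[0]);
    Type right = annotate(*n.children[1]);
    switch (n.kind) {
        case Kind::Subscript:                       // array[integer] has the element type
            n.type = (left == Type::IntegerArray && right == Type::Integer)
                         ? Type::Integer : Type::Error;
            break;
        case Kind::Add:                             // integer + integer
        case Kind::Assign:                          // lhs.type must equal rhs.type
            n.type = (left == Type::Integer && right == Type::Integer)
                         ? Type::Integer : Type::Error;
            break;
        default:
            n.type = Type::Error;
            break;
    }
    return n.type;
}
```

Building Assign(Subscript(a, ndx), Add(x, 3)) and calling `annotate` on the root marks the subscript, the addition, and the assignment as integer. Replacing x with an array-typed identifier makes the addition, and therefore the assignment, evaluate to `Type::Error`.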
Semantic Analysis • Some optimization may be done during this phase: • Source code optimization (e.g., constant folding): • X := 2 + 4; can be optimized to X := 6; • Intermediate code optimization: • Temp := 5; A[index] := Temp can be optimized to A[index] := 5;
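Constant folding, as in X := 2 + 4 becoming X := 6, can be sketched as a bottom-up rewrite of an expression tree. The `Expr` type and the helpers `number`, `variable`, `add`, and `fold` are hypothetical names, only addition is handled, and integer overflow is ignored in this sketch.

```cpp
#include <memory>
#include <string>
#include <utility>

struct Expr {
    enum Kind { Number, Variable, Add } kind = Number;
    int value = 0;                        // meaningful only when kind == Number
    std::string name;                     // meaningful only when kind == Variable
    std::unique_ptr<Expr> lhs, rhs;       // used only when kind == Add
};

std::unique_ptr<Expr> number(int v) {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Number;
    e->value = v;
    return e;
}

std::unique_ptr<Expr> variable(std::string n) {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Variable;
    e->name = std::move(n);
    return e;
}

std::unique_ptr<Expr> add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Add;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Folds the operands first; if both turn out to be constants,
// the Add node is replaced by a single Number node.
std::unique_ptr<Expr> fold(std::unique_ptr<Expr> e) {
    if (e->kind != Expr::Add) return e;
    e->lhs = fold(std::move(e->lhs));
    e->rhs = fold(std::move(e->rhs));
    if (e->lhs->kind == Expr::Number && e->rhs->kind == Expr::Number) {
        return number(e->lhs->value + e->rhs->value);
    }
    return e;
}
```

`fold(add(number(2), number(4)))` returns a Number node holding 6. In `fold(add(variable("y"), add(number(1), number(2))))` only the inner sum is folded, because the value of y is not known at compile time.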
Intermediate Code Generation • A compiler may produce explicit intermediate code representing the source program. • Intermediate code is generally machine (architecture) independent, but its level is close to the level of machine code. • Ex: newval := oldval * fact + 1 is represented as id1 := id2 * id3 + 1, which yields the intermediate code (quadruples): • MULT id2,id3,temp1 • ADD temp1,#1,temp2 • MOV temp2,,id1
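The three quadruples in the example can be produced by a post-order walk over the expression tree that allocates a fresh temporary for every operator. This is one possible sketch; `Quad`, `Expr`, `QuadGenerator`, `leaf`, and `binary` are hypothetical names, and only `+` and `*` are supported.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Quad { std::string op, arg1, arg2, result; };

struct Expr {
    char op = 0;                          // '+' or '*' for interior nodes, 0 for leaves
    std::string name;                     // operand text for leaves, e.g. "id2" or "#1"
    std::unique_ptr<Expr> lhs, rhs;
};

std::unique_ptr<Expr> leaf(std::string name) {
    auto e = std::make_unique<Expr>();
    e->name = std::move(name);
    return e;
}

std::unique_ptr<Expr> binary(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

class QuadGenerator {
public:
    // Emits quadruples for `target := e` and returns everything emitted so far.
    std::vector<Quad> assign(const std::string& target, const Expr& e) {
        std::string value = gen(e);
        quads_.push_back({"MOV", value, "", target});
        return quads_;
    }

private:
    // Returns the name holding the value of e, emitting code for subtrees first.
    std::string gen(const Expr& e) {
        if (e.op == 0) return e.name;
        std::string a = gen(*e.lhs);
        std::string b = gen(*e.rhs);
        std::string t = "temp" + std::to_string(++tempCount_);
        quads_.push_back({e.op == '*' ? "MULT" : "ADD", a, b, t});
        return t;
    }

    std::vector<Quad> quads_;
    int tempCount_ = 0;
};
```

For the tree of id2 * id3 + 1 assigned to id1, the generator emits MULT id2,id3,temp1, then ADD temp1,#1,temp2, then MOV temp2,,id1, matching the example.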
Code Optimizer (for Intermediate Code Generator) • The code optimizer improves the code produced by the intermediate code generator in terms of time and space. • Ex: • MULT id2,id3,temp1 • ADD temp1,#1,id1
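The optimized sequence in the example can be obtained from the three-quadruple form (MULT, ADD, MOV) by a simple peephole rule: when an instruction writes a temporary and the very next instruction only copies that temporary elsewhere, write the destination directly. The sketch below assumes the temporary is not used again later, which a real optimizer would have to verify; `Quad` and `removeRedundantMoves` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Quad { std::string op, arg1, arg2, result; };

// Merges "OP a,b,t" followed by "MOV t,,x" into "OP a,b,x".
// Assumption: t is a compiler temporary that is dead after the MOV.
std::vector<Quad> removeRedundantMoves(const std::vector<Quad>& in) {
    std::vector<Quad> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool nextIsCopy = i + 1 < in.size() &&
                          in[i].op != "MOV" &&
                          in[i + 1].op == "MOV" &&
                          in[i + 1].arg1 == in[i].result;
        if (nextIsCopy) {
            Quad merged = in[i];
            merged.result = in[i + 1].result;   // write the final destination directly
            out.push_back(merged);
            ++i;                                // skip the MOV that was absorbed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Applied to MULT id2,id3,temp1 / ADD temp1,#1,temp2 / MOV temp2,,id1, the function returns the two quadruples shown on the slide.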
Code Generation • Takes intermediate code and produces object code, for example: • la $t0, _a • addi $sp, $sp, 4 • sw $t0, 0($sp) • la $t0, _ndx • lw $t0, 0($t0) • . . .
Compiler Data Structures • Tokens • Often represented as an enumeration type • May include other information such as • Spelling of an identifier • Value of a constant • The scanner needs to generate only one token at a time (single-symbol lookahead)
Compiler Data Structures • Parse tree (or syntax tree) • A linked structure built by the parser, with information added by the semantic analyzer • Each node is a record whose fields contain information about the syntactic construct which the node represents • Nodes may be represented in various ways: • A C structure • A Pascal or Ada variant record • A C++ or Java class
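As a C++ illustration of such a linked structure, the sketch below stores a label and a list of children in each node. The names `ParseNode`, `makeNode`, `addChild`, and `collectLeaves` are hypothetical; a production compiler would typically use a distinct node type per construct and attach semantic attributes.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ParseNode {
    std::string label;                                   // grammar symbol or lexeme
    std::vector<std::unique_ptr<ParseNode>> children;    // empty for leaves
};

std::unique_ptr<ParseNode> makeNode(std::string label) {
    auto n = std::make_unique<ParseNode>();
    n->label = std::move(label);
    return n;
}

// Appends a child and returns a reference to it so that calls can be nested.
ParseNode& addChild(ParseNode& parent, std::string label) {
    parent.children.push_back(makeNode(std::move(label)));
    return *parent.children.back();
}

// Collects the leaf labels from left to right, separated by spaces.
// For a parse tree this recovers the token sequence that was parsed.
void collectLeaves(const ParseNode& n, std::string& out) {
    if (n.children.empty()) {
        if (!out.empty()) out += ' ';
        out += n.label;
        return;
    }
    for (const auto& child : n.children) collectLeaves(*child, out);
}
```

For example, a tree for the statement if ( x < 10 ) y = 23 can be built by nesting `addChild` calls under a root created with `makeNode("if_statement")`; `collectLeaves` then returns the original token sequence.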
Compiler Data Structures • Symbol Table • Keeps information about identifiers declared in a program. • Efficient insertion and lookup are required, so a hash table or tree structure may be used. • Several tables may be maintained in a list or stack.
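One way to combine these ideas is a stack of hash tables, one per scope. This is a minimal sketch with hypothetical names (`SymbolInfo`, `SymbolTable`, `declare`, `lookup`); it stores only a type name per identifier.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct SymbolInfo {
    std::string type;                        // e.g. "int"; real tables hold much more
};

class SymbolTable {
public:
    SymbolTable() { enterScope(); }          // the global scope always exists

    void enterScope() { scopes_.emplace_back(); }
    void exitScope() { if (scopes_.size() > 1) scopes_.pop_back(); }

    // Returns false if the name is already declared in the current scope.
    bool declare(const std::string& name, const std::string& type) {
        return scopes_.back().emplace(name, SymbolInfo{type}).second;
    }

    // Searches from the innermost scope outward; returns nullptr if undeclared.
    const SymbolInfo* lookup(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return &found->second;
        }
        return nullptr;
    }

private:
    std::vector<std::unordered_map<std::string, SymbolInfo>> scopes_;
};
```

A declaration in an inner scope shadows an outer one until `exitScope()` is called, and a `nullptr` result from `lookup` is how a semantic analyzer could detect use of an undeclared identifier.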
Compiler Data Structures • Literal Table • Stores constants and strings used in a program. • Data in the literal table applies globally to a program, so deletions are not necessary.
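Because entries are never deleted, a literal table can be an append-only vector with a hash index to avoid storing duplicates. The class name `LiteralTable` and its members are hypothetical; this is a sketch, not a prescribed design.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class LiteralTable {
public:
    // Returns the index of the literal, adding it on first use.
    std::size_t intern(const std::string& text) {
        auto it = index_.find(text);
        if (it != index_.end()) return it->second;   // already stored: reuse the entry
        literals_.push_back(text);
        std::size_t id = literals_.size() - 1;
        index_.emplace(text, id);
        return id;
    }

    // Throws std::out_of_range for an invalid index.
    const std::string& at(std::size_t id) const { return literals_.at(id); }

    std::size_t size() const { return literals_.size(); }

private:
    std::vector<std::string> literals_;                    // append-only storage
    std::unordered_map<std::string, std::size_t> index_;   // literal text -> index
};
```

Interning the same string twice returns the same index, so later phases can refer to a literal by a small integer instead of copying its text.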
Compiler Data Structures • Intermediate Code • Could be kept in an array, temporary file, or linked list. • Representations include P-code and 3-address code
The Structure of a Compiler • Scanner [Lexical Analyzer] → Tokens → Parser [Syntax Analyzer] → Parse tree → Semantic Process [Semantic analyzer] → Abstract Syntax Tree w/ Attributes → Code Generator [Intermediate Code Generator] → Non-optimized Intermediate Code → Code Optimizer → Optimized Intermediate Code → Code Optimizer → Target machine code
The Structure of a Compiler • Compiler writing tools • Compiler generators or compiler-compilers • E.g. scanner and parser generators • Examples : Yacc, Lex