Compiler design • Text Book: Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman. • Topics: Compiler phases, Lexical analysis, Syntax analysis, Code generation • Homework: there will be two major programming assignments. They must be done independently. • Examinations: there will be two hourly exams and a final exam. • Grading: the total grade will be computed as follows: 20% for each hourly exam, 15% for the homework, 50% for the final exam.
Overview of Compiler • A compiler is a program (written in a high-level language) that translates a source program written in a high-level language into equivalent machine code. • source program → compiler → machine code (or object code)
What is a Compiler? • Definition: A compiler is a program that translates one language to another • Usually, the translation takes place between a high-level language and a low-level language • Clearly, our first step is to discuss some terminology…
Terminology • Source language – the language that is being translated • Object language – the language into which the translation is being done • High-level language – a language that is far removed from a computer; one which is close to the problem area(s) for which the language is designed
Terminology… • Low-level language – a language that is close to the machine (computer) upon which the language will run (execute) • Machine language – (sometimes called machine code) the language of some computer. This language usually is not human readable (and is expressed in bits or hex)
Terminology… • Intermediate language – a language that is used either: • because it is a temporary step in the translation process; or, • because it is neither particularly high nor particularly low, and is the output of a translation • Assembly language – a language that translates almost one-to-one to machine language, but is in human readable form
What’s a Compiler?... • Today, compilers are written using high-level languages (such as Java, C++, etc.) • The earliest compilers were written using assembly language (e.g., for FORTRAN, developed 1954–1957, and COBOL, around 1960) • Sometimes a compiler is written in the same language that it compiles. This is done through bootstrapping.
Why Should I learn Compiler Construction? • How do compilers work? • How do computers work? (instruction set, registers, addressing modes, run time data structures, …) • What machine code is generated for certain language constructs? (efficiency considerations) • Getting "a feeling" for good language design
Why Compilers? A Brief History • The first computers were “hard-wired” • That is, they were collections of physical devices that connected to one-another, in an assemblage designed to calculate particular kinds of results
Why Compilers? A Brief History… • For example, Babbage’s Analytical Engine and his Difference Engine were assemblages of gears that solved numeric problems • The primary driving force was the calculation of ballistics tables for artillery • Jacquard’s loom is another example • And Hollerith’s work for the US Census Bureau is another
Why Compilers? A Brief History… • In the late 1940s John von Neumann “invented” the stored program computer • The “invention” is the observation that just as you can store data in the memory of a computer, you can also store machine instructions there • Then the computer can not only take its instructions from memory…
Why Compilers? A Brief History… • But the computer can modify the instructions in its memory… • And, in fact, can write its own programs, storing them in memory • It quickly became apparent that the simplest way to store information in a computer was in the form of binary numbers
Why Compilers? A Brief History… • So, to program a computer, you only needed to enter a sequence of binary numbers into memory, and then tell the computer at which memory address to start execution • This was programming in machine language • Instructions (and data) were entered from a console, one word (in binary) at a time…
Why Compilers? A Brief History… • This form of coding (note the word!) quickly was replaced by programming in assembly language • A program was written (in machine language) which translated assembly language to machine language (called an assembler)
Why Compilers? A Brief History… • After the first assembler was written, no one needed to code in machine language any longer • But, coding x = 3; can take many instructions… • So, the thought was – can we create a program that translates something like x = 3; into assembly language or into machine language?
Why Compilers? A Brief History. Formal Languages • About the same time, in the mid-1950’s, Noam Chomsky (M.I.T.) began investigating the formal structure of natural languages • His work led to the Chomsky hierarchy of type 0, 1, 2, 3 languages and their associated grammars
Why Compilers? A Brief History. Formal Languages… • The type 2 (context-free) grammars turned out to be very good at describing computer languages • And, efficient ways to recognize the structure of a source program using a type 2 grammar were developed • Such recognition is called parsing
Why Compilers? A Brief History. Formal Languages… • Very closely related to context-free grammars are the type 3 grammars • These are the regular grammars, which are equivalent to finite automata and regular expressions • An entire sub-branch of mathematics studies automata; it’s called automata theory
Why Compilers? A Brief History. Formal Languages… • It turns out that type 3 (regular) grammars are very good at describing the “atoms” used in computer languages • These “atoms” are the reserved words, symbols, and user-defined words that are used in a computer language • Recognizing atoms is called scanning (or lexing)
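The link between type 3 grammars and finite automata can be made concrete with a small hand-coded automaton. The following is a minimal sketch, not part of the course material: the state names, the transition table, and the `classify` helper are all assumptions chosen for illustration. It accepts two kinds of atoms, identifiers (a letter or underscore followed by letters, digits, or underscores) and unsigned integers.

```python
# A hand-coded deterministic finite automaton (DFA) for two kinds of atoms.
# State names and the classify() helper are hypothetical, for illustration only.

def char_class(ch):
    """Map a character to one symbol of the automaton's input alphabet."""
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Transition table: (current state, character class) -> next state.
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("start", "digit"): "number",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("number", "digit"): "number",
}

# Accepting states and the kind of atom each one recognizes.
ACCEPTING = {"ident": "IDENTIFIER", "number": "NUMBER"}

def classify(text):
    """Run the DFA over text; return the atom kind, or None if rejected."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:  # no transition defined: the input is rejected
            return None
    return ACCEPTING.get(state)
```

For example, `classify("my_name")` yields `"IDENTIFIER"`, `classify("1997")` yields `"NUMBER"`, and `classify("9lives")` is rejected because the automaton has no transition from the number state on a letter.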
Why Compilers? A Brief History… • By far the most difficult and complicated problem has been how to generate object code that is concise, and most importantly, executes efficiently • This is called “optimization”
Why Compilers? A Brief History… • Far simpler are the front-end issues of scanning and parsing, i.e., recognizing the source code • This is due to the fact that we’ve developed (semi-) automatic ways to create scanners and parsers… • using scanner generators and parser generators
Programs Related to Compilers… • Interpreters – directly execute the code upon recognition; usually statement by statement • Assemblers – translate assembly language to machine language • Macro Assemblers – ditto, but with (powerful) macro capabilities
Programs Related to Compilers… • Linkers – combine object modules to produce an executable module • Linkage Editors – manage the linking process, and are able to create/maintain object libraries
Programs Related to Compilers… • Loaders – load executable modules into memory, and launch execution • Dynamic Loaders – loaders that stay around during execution to handle the loading of DLLs (dynamic-link libraries)
Programs Related to Compilers… • Preprocessors – usually a separate program whose input is source code and whose output is source code; perform macro expansion, comment deletion, etc. Sometimes the first phase of a compiler
Programs Related to Compilers… • Editors – allow the user to create and update source code • Smart Editors – include syntax coloring, parenthesis balancing, etc. • Debuggers – provide an environment in which code may be debugged, including single stepping, symbol tables, etc.
Programs Related to Compilers… • IDEs – integrated development environments; provide integrated editor-debugger-execution environments • Profilers – collect statistics about where programs spend their time during execution; important for optimizing at the source code level
Programs Related to Compilers… • Project Managers – programs that help software managers deal with hundreds or thousands of modules; build reports, etc. • SCCS – source code control systems; provide for multiple access to shared code in a controlled manner
The Translation Process • The translation process consists of a collection of phases, with the output of one phase feeding the input of the next • The original source code is transformed into a sequence of intermediate representations (IRs) during this process
Phases of a Compiler • Two activities run parallel to all other phases: • Symbol table manipulation. The symbol table is one of the primary data structures that a compiler uses. This data structure is used by all of the phases. • Error detection and handling
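Since every phase consults the symbol table, it helps to see how small the core of one can be. The sketch below is only an illustration under stated assumptions: the `SymbolTable` class, its method names, and the use of nested scopes are hypothetical and are not prescribed by the slides. It maps each declared name to a type string and reports the two most common symbol errors.

```python
# A minimal scoped symbol table (hypothetical design, for illustration only).

class SymbolTable:
    def __init__(self):
        # A stack of dictionaries; the last entry is the innermost scope.
        self.scopes = [{}]

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        if len(self.scopes) == 1:
            raise RuntimeError("cannot exit the global scope")
        self.scopes.pop()

    def declare(self, name, type_name):
        """Record a new name in the innermost scope."""
        scope = self.scopes[-1]
        if name in scope:
            raise NameError(f"{name!r} is already declared in this scope")
        scope[name] = type_name

    def lookup(self, name):
        """Search from the innermost scope outward."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name!r} is not declared")
```

An inner declaration shadows an outer one until its scope is exited, which is why `lookup` walks the stack from the innermost scope outward.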
The Scanner • The scanner reads the source program, as a stream of characters, and it performs lexical analysis – collecting sequences of characters into meaningful units called tokens • The scanner also may create a symbol table and a literal table
The Parser • The parser reads the tokens produced by the scanner and performs syntactic analysis – creating an IR (a parse tree or a syntax tree) showing the structure of the program • Syntax trees (abstract syntax trees) are reduced representations of the tree, with many irrelevant nodes eliminated
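One common way to write a parser by hand is recursive descent, where each grammar rule becomes one method. The sketch below is an assumption-laden illustration rather than the method used in the course: the tiny grammar, the `Parser` class, the `lex` helper, and the tuple-based syntax tree are all hypothetical. It handles assignments whose right-hand side is built from numbers, variables, one-argument calls, and `+`.

```python
import re

def lex(source):
    """Split source into numbers, identifiers, and single-character symbols."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[=+();]", source)

def is_identifier(token):
    return token[0].isalpha() or token[0] == "_"

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        self.pos += 1
        return token

    def statement(self):
        # stmt -> IDENT '=' expr ';'
        target = self.take()
        if not is_identifier(target):
            raise SyntaxError(f"cannot assign to {target!r}")
        self.take("=")
        value = self.expr()
        self.take(";")
        return ("assign", target, value)

    def expr(self):
        # expr -> term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.take("+")
            node = ("add", node, self.term())
        return node

    def term(self):
        # term -> NUMBER | IDENT | IDENT '(' expr ')'
        token = self.take()
        if token.isdigit():
            return ("num", int(token))
        if not is_identifier(token):
            raise SyntaxError(f"unexpected token {token!r}")
        if self.peek() == "(":
            self.take("(")
            argument = self.expr()
            self.take(")")
            return ("call", token, argument)
        return ("var", token)

def parse(source):
    """Parse a sequence of statements into a list of syntax trees."""
    parser = Parser(lex(source))
    statements = []
    while parser.peek() is not None:
        statements.append(parser.statement())
    return statements
```

Calling `parse("a = 100; b = f(a) + 3;")` returns two `("assign", ...)` trees. The result is already an abstract syntax tree: punctuation such as `;` and the parentheses is consumed during parsing but never stored as a node.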
The Semantic Analyzer • The semantics of a program are its “meaning” – what it is intended to accomplish • The semantic analyzer creates an intermediate data structure that contains this meaning – these are the static semantics • The dynamic semantics of a program can be determined only by executing the program
The Semantic Analyzer… • An example of the static semantics of a program is the data types of the variables (and expressions) • These static semantics usually are represented in the intermediate representations (IRs) as attributes • The IR usually is a tree, “decorated” with these attributes
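Decorating a tree with type attributes can be sketched in a few lines. Everything here is an assumption made for illustration: the dictionary-based node layout, the `decorate` function, and the rule that adding an `int` to a `float` yields a `float` are not taken from the slides.

```python
# Attach a "type" attribute to every node of an expression tree.
# Node layout and typing rule are hypothetical, for illustration only.

def decorate(node, symbols):
    """Return a copy of the tree in which every node carries a 'type'."""
    kind = node["kind"]
    if kind == "num":
        node_type = "float" if isinstance(node["value"], float) else "int"
        return {**node, "type": node_type}
    if kind == "var":
        if node["name"] not in symbols:
            raise NameError(f"undeclared variable {node['name']!r}")
        return {**node, "type": symbols[node["name"]]}
    if kind == "add":
        left = decorate(node["left"], symbols)
        right = decorate(node["right"], symbols)
        # Assumed rule: mixing int and float widens the result to float.
        result_type = "float" if "float" in (left["type"], right["type"]) else "int"
        return {"kind": "add", "left": left, "right": right, "type": result_type}
    raise ValueError(f"unknown node kind {kind!r}")
```

Here `symbols` plays the role of the symbol table, mapping each declared variable to its type. An undeclared variable is reported as a static semantic error before any code runs.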
(Source) Code Optimization • Optimization may occur during several phases • Source code optimization rearranges the source (or the IR of the source) in order to produce more efficient code • E.g., x = 7 + 9; can become x = 16; • This is called constant folding
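Constant folding is easy to express as a recursive walk over an expression tree. The sketch below assumes a hypothetical tuple representation, `("num", value)`, `("var", name)`, `("add", left, right)`, and `("mul", left, right)`; neither this layout nor the `fold` function comes from the slides.

```python
import operator

# Operators that can be evaluated at compile time (illustrative subset).
OPS = {"add": operator.add, "mul": operator.mul}

def fold(node):
    """Replace every operator node whose operands are both constants."""
    kind = node[0]
    if kind in OPS:
        left, right = fold(node[1]), fold(node[2])
        if left[0] == "num" and right[0] == "num":
            # Both operands are known: compute the result now.
            return ("num", OPS[kind](left[1], right[1]))
        return (kind, left, right)
    return node  # numbers and variables are left unchanged
```

Folding works bottom-up, so `fold(("add", ("num", 7), ("num", 9)))` gives `("num", 16)`, and a constant subexpression nested inside a larger expression is folded even when the whole expression is not constant.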
(Source) Code Optimization… • Duplicated computations can be saved as temporaries and then their values re-used • Recursion can be converted to iteration • Repeated calculations can be moved out of loops • The possibilities are endless…
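Moving a repeated calculation out of a loop can be illustrated at the source level. In the sketch below the transformation is applied by hand to show the before and after; an optimizing compiler would perform it on its intermediate representation. The function names are hypothetical.

```python
import math

def scale_naive(values, angle):
    result = []
    for v in values:
        # math.cos(angle) is recomputed on every iteration although
        # neither the function nor its argument changes inside the loop.
        result.append(v * math.cos(angle))
    return result

def scale_hoisted(values, angle):
    factor = math.cos(angle)  # loop-invariant: computed once, before the loop
    result = []
    for v in values:
        result.append(v * factor)
    return result
```

Both functions return the same list; the second simply evaluates the cosine once instead of once per element. The move is safe here because the hoisted expression has no side effects and does not depend on the loop variable.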
The Code Generator • The code generator takes the IR and generates code for the target machine • Here the details of how various numeric and non-numeric quantities are represented become important • E.g., word length, hardware stack, hardware calling conventions, memory access, etc.
The Target Code Optimizer • The target code optimizer examines the emitted target code to see if further possibilities for optimization are present and then capitalizes upon them • E.g., reuse of registers, using a shift instruction to replace a multiplication or division, etc.
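Replacing a multiplication with a shift is a typical peephole optimization. The sketch below assumes a hypothetical instruction format, a tuple `(op, dest, src, constant)`, and handles only multiplication by a positive power of two; division is deliberately left out because a plain shift does not match integer division for negative operands on every machine.

```python
def strength_reduce(instructions):
    """Rewrite ("mul", d, s, 2**k) as ("shl", d, s, k); keep everything else."""
    optimized = []
    for op, dest, src, constant in instructions:
        # A positive integer is a power of two when exactly one bit is set.
        is_power_of_two = constant > 0 and (constant & (constant - 1)) == 0
        if op == "mul" and is_power_of_two:
            shift = constant.bit_length() - 1  # position of the single set bit
            optimized.append(("shl", dest, src, shift))
        else:
            optimized.append((op, dest, src, constant))
    return optimized
```

With this format, `("mul", "r1", "r2", 8)` becomes `("shl", "r1", "r2", 3)`, while a multiplication by 6 is left untouched.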
Phases of the compiler • Source Program → Scanner (Lexical Analyzer) → Tokens → Parser (Syntax Analyzer) → Parse Tree → Semantic Analyzer → Abstract Syntax Tree with attributes
Sample Program Compiled • Consider the example: int a, b{ a = 100; b = f (a) + 3} • Source Program → Lexical Analyzer → Token stream
Sample Program Compiled • Tokens are entities of interest that are defined by the compiler writer. A sequence of characters with a collective meaning is grouped to form a token. • Examples of Tokens: • Single-character operators: = + - * > < • Multi-character operators: ++, --, ==, <= • Numeric Constants: 1997 45.89 19.9e+7 • Key Words: int, while, for • Identifiers: x, my_name, Your_Name, a • Homework: Identify all token types in C programs.
Example Program Compiled-Continued What are the tokens in the example?
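One way to explore this question is to run the example through a small regular-expression scanner. The sketch below is a hedged illustration: the token names (`KEYWORD`, `IDENT`, `NUMBER`, `OP`, `PUNCT`), the `TOKEN_SPEC` table, and the `tokenize` function are assumptions, and the keyword set contains only the three key words listed on the earlier slide.

```python
import re

KEYWORDS = {"int", "while", "for"}

# Order matters: longer operators must come before their one-character prefixes.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"\+\+|--|==|<=|[=+\-*<>]"),
    ("PUNCT", r"[(){},;]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, text) pairs for the given source string."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {text!r}")
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"  # reserved words look like identifiers
        tokens.append((kind, text))
    return tokens

example = "int a, b{ a = 100; b = f (a) + 3}"
for kind, text in tokenize(example):
    print(kind, text)
```

Under these assumptions the example yields 18 tokens, starting with `KEYWORD int`, `IDENT a`, `PUNCT ,`. A different compiler writer could reasonably choose finer categories, for instance a separate token type for each punctuation mark.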
Example Continued • The parser produces a parse tree: it is a heterogeneous tree (nodes have different data types) • root_node has two children, stmt1 and stmt2 • stmt1: = with children a and 100 • stmt2: = with children b and +, where + has children f ( a ) and 3
Intermediate-Code Generation • Using temporary location to save values • t1 = 100 • store t1, a • load a, t2 • t3 = f(t2) • t4 = t3 + 3 • store t4, b
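The listing above can be produced mechanically by walking the syntax tree and allocating a fresh temporary for each computed value. The following is a minimal sketch under stated assumptions: the tuple-based tree, the `CodeGen` class, and the choice to emit small constants inside `+` as immediate operands are hypothetical, picked so that the output matches the listing.

```python
class CodeGen:
    def __init__(self):
        self.code = []
        self.count = 0

    def new_temp(self):
        self.count += 1
        return f"t{self.count}"

    def value(self, node):
        """Emit code that computes node; return the temporary holding it."""
        kind = node[0]
        if kind == "num":
            temp = self.new_temp()
            self.code.append(f"{temp} = {node[1]}")
        elif kind == "var":
            temp = self.new_temp()
            self.code.append(f"load {node[1]}, {temp}")
        elif kind == "call":
            argument = self.value(node[2])
            temp = self.new_temp()
            self.code.append(f"{temp} = {node[1]}({argument})")
        elif kind == "add":
            left = self.operand(node[1])
            right = self.operand(node[2])
            temp = self.new_temp()
            self.code.append(f"{temp} = {left} + {right}")
        else:
            raise ValueError(f"unknown node kind {kind!r}")
        return temp

    def operand(self, node):
        # Assumption: constants may appear directly as operands of '+'.
        return str(node[1]) if node[0] == "num" else self.value(node)

    def assign(self, statement):
        _, target, expression = statement
        temp = self.value(expression)
        self.code.append(f"store {temp}, {target}")

def generate(statements):
    generator = CodeGen()
    for statement in statements:
        generator.assign(statement)
    return generator.code

program = [
    ("assign", "a", ("num", 100)),
    ("assign", "b", ("add", ("call", "f", ("var", "a")), ("num", 3))),
]
print("\n".join(generate(program)))
```

Each call to `value` emits the code for the children first and only then the instruction for the node itself, which is why the operands of an instruction are always computed before it in the output.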
Intermediate-Code Optimization • Eliminate unnecessary code or statements that won’t be executed • t1 = 100 • store t1, a • t3 = f(t1) • t4 = t3 + 3 • store t4, b
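The step that removes the redundant load from the intermediate code can be sketched as a single forward pass. The sketch assumes a hypothetical tuple encoding of the instructions, that each temporary is assigned exactly once, and that temporary names never collide with variable or function names. It is deliberately conservative: after a call it forgets what it knows, since the callee might change any variable.

```python
def forward_stores(code):
    """Drop loads of a variable whose value is already held in a temporary."""
    held_in = {}  # variable -> temporary known to hold its current value
    alias = {}    # eliminated temporary -> temporary that replaces it
    result = []
    for instr in code:
        op = instr[0]
        if op == "load" and instr[1] in held_in:
            # ("load", variable, temp): reuse the temporary we already have.
            alias[instr[2]] = held_in[instr[1]]
            continue
        # Replace any eliminated temporary with its stand-in.
        instr = tuple(alias.get(x, x) if isinstance(x, str) else x for x in instr)
        if op == "store":    # ("store", temp, variable)
            held_in[instr[2]] = instr[1]
        elif op == "load":   # first load of this variable: remember the temp
            held_in[instr[1]] = instr[2]
        elif op == "call":   # ("call", dest, function, argument)
            held_in.clear()
        result.append(instr)
    return result

before = [
    ("const", "t1", 100),
    ("store", "t1", "a"),
    ("load", "a", "t2"),
    ("call", "t3", "f", "t2"),
    ("add", "t4", "t3", 3),
    ("store", "t4", "b"),
]
after = forward_stores(before)
```

On this input the load of `a` disappears and the call takes `t1` directly, leaving five instructions instead of six.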
Target-code Generation • Machine code generated for some machine • r1 = 100 • store r1, 0x10 • jsr _f • r2 = r0 + 3 • store r2, 0x16
Compiler Architecture • Single pass vs. multi pass architecture • Single pass: all phases interleaved, driven by the parser
Multi pass: each pass finishes before the next starts • Saves main memory; the passes communicate through files • Used if the language is complex or portability is important
Front end & Back end • Front end: the phases, or parts of phases, that depend on the source language. • Back end: the phases, or parts of phases, that depend on the target machine.