Introduction Fan Wu Department of Computer Science and Engineering Shanghai Jiao Tong University Fall 2012
Why study compiling? • Importance: • Programs written in high-level languages have to be translated into binary code before they can execute • Reduce the execution overhead of programs • Make high-performance computer architectures effective on users' programs • Influence: • Language design • Computer architecture (the influence is bi-directional) • Compiler techniques also influence other areas • Text editors, information retrieval systems, and pattern recognition programs • Query processing systems such as SQL • Equation solvers • Natural language processing • Debugging and finding security holes in code • …
Compiler Concept • Diagram: source program (normally a program written in a high-level programming language) → COMPILER → target program (normally the equivalent program in machine code, as a relocatable object file); the compiler also produces error messages. • A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.
Interpreter • Diagram: source program + input → INTERPRETER → output; the interpreter also produces error messages. • An interpreter directly executes the operations specified in the source program on inputs supplied by the user.
Programming Languages • Compiled languages: • Fortran, Pascal, C, C++, C#, Delphi, Visual Basic, … • Interpreted languages: • BASIC, Perl, PHP, Ruby, TCL, MATLAB,… • Joint Compiled and Interpreted languages • Java, Python, …
Compiler vs. Interpreter • Preprocessing • Compilers do extensive preprocessing • Interpreters run programs “as is”, with little or no preprocessing • Efficiency • The target program produced by a compiler usually runs much faster than an interpreter executing the same source code
Compiler Structure • Diagram: Source Language → Front End (language specific; analysis) → Intermediate Language → Back End (machine specific; synthesis) → Target Language; both ends use the Symbol Table. • Separation of Concerns • Retargeting
Two Main Phases • Analysis Phase: breaks up a source program into constituent pieces and produces an internal representation of it called intermediate code. • Synthesis Phase: translates the intermediate code into the target program.
Phases of Compilation • Compilers work in a sequence of phases. • Each phase transforms the source program from one representation into another representation. • They use the symbol table to store information about the entire source program. • Diagram: Source Language → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator (analysis, ending in the Intermediate Language) → Code Optimizer → Code Generator (synthesis) → Target Language; all phases share the Symbol Table.
A Model of a Compiler Front End • Lexical analyzer reads the source program character by character and returns the tokens of the source program. • Parser creates the tree-like syntactic structure of the given program. • Intermediate-code generator translates the syntax tree into three-address code.
Lexical Analysis • The lexical analyzer reads the source program character by character and returns the tokens of the source program, each of the form <token-name, attribute-value>. • A token describes a pattern of characters that has the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
White Space Removal • The lexical analyzer skips white space. • As a result, no blank, tab, newline, or comment appears in the grammar.
Constants • When a sequence of digits appears in the input stream, the lexical analyzer passes to the parser a token consisting of the terminal num along with an integer-valued attribute computed from the digits. • Example: 31+28+59 → <num, 31> <+> <num, 28> <+> <num, 59> • Simulate parsing some number ....
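The grouping of digits into num tokens can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name `tokenize` and the tuple encoding of tokens are assumptions made for the example, and only digits, `+`, `-`, and white space are handled.

```python
def tokenize(text):
    """Turn a string of digits and +/- signs into a list of tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in " \t\n":                 # white space is skipped, never passed on
            i += 1
        elif "0" <= ch <= "9":            # group consecutive digits into one num token
            value = 0
            while i < len(text) and "0" <= text[i] <= "9":
                value = value * 10 + int(text[i])
                i += 1
            tokens.append(("num", value))  # <num, integer attribute>
        elif ch in "+-":
            tokens.append((ch,))           # operators carry no attribute
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens


print(tokenize("31 + 28 + 59"))
```

The integer attribute is accumulated digit by digit, so the parser sees the single terminal num regardless of how many digits the constant has.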
Keywords and Identifiers Keywords: Fixed character strings used as punctuation marks or to identify constructs. Identifiers: A character string forms an identifier only if it is not a keyword.
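One common way to separate keywords from identifiers is to scan a word first and then look it up in a table of reserved words. The sketch below assumes an illustrative keyword set and hypothetical helper names (`classify`, `scan_words`); real languages define their own reserved words.

```python
KEYWORDS = {"if", "else", "while", "for"}   # illustrative reserved words only


def classify(lexeme):
    """Return a (token-name, attribute) pair for a word-like lexeme."""
    if lexeme in KEYWORDS:
        return (lexeme, None)      # a keyword is its own token name
    return ("id", lexeme)          # anything else is an identifier


def scan_words(text):
    """Collect the keyword and identifier tokens that occur in text."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isalpha() or text[i] == "_":
            start = i
            while i < len(text) and (text[i].isalnum() or text[i] == "_"):
                i += 1
            tokens.append(classify(text[start:i]))
        else:
            i += 1                 # this sketch ignores every other character
    return tokens


print(scan_words("if count else iffy"))
```

Note that `iffy` is an identifier even though it starts with `if`: the whole lexeme is scanned before the keyword table is consulted.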
Lexical Analysis Cont’d Puts information about identifiers into the symbol table. Regular expressions are used to describe tokens (lexical constructs). A (Deterministic) Finite State Automaton can be used in the implementation of a lexical analyzer.
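To make the automaton idea concrete, here is a small table-driven deterministic finite automaton. It is a sketch under assumptions: the two token classes (identifiers shaped roughly like `[A-Za-z_][A-Za-z0-9_]*` and numbers shaped like `[0-9]+`), the state names, and the function names are all chosen for illustration.

```python
def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


# Transition table: state -> {character class -> next state}
TRANSITIONS = {
    "start":  {"letter": "in_id", "digit": "in_num"},
    "in_id":  {"letter": "in_id", "digit": "in_id"},
    "in_num": {"digit": "in_num"},
}
ACCEPTING = {"in_id": "id", "in_num": "num"}   # accepting state -> token name


def recognize(lexeme):
    """Return the token name the DFA accepts lexeme as, or None if it rejects."""
    state = "start"
    for ch in lexeme:
        state = TRANSITIONS[state].get(char_class(ch))
        if state is None:          # no transition: the automaton is stuck
            return None
    return ACCEPTING.get(state)


print(recognize("x1"), recognize("42"), recognize("4x"))
```

Each input character causes exactly one table lookup, which is why a DFA-based lexical analyzer runs in time linear in the length of the input.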
Symbol Table • Symbol tables are data structures that are used by compilers to hold information about the source-program constructs. • For each identifier, there is an entry in the symbol table containing its information. • Symbol tables need to support multiple declarations of the same identifier. • One symbol table per scope (of declaration): in { int x; char y; { bool y; x; y; } x; y; } the outer symbol table holds int x and char y, and the inner symbol table holds bool y.
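One table per scope can be sketched as a chain of dictionaries, where a lookup that fails in the inner table continues in the enclosing one. The class name `Env` and its `put`/`get` methods are assumptions made for this illustration.

```python
class Env:
    """One symbol table per scope, chained to the table of the enclosing scope."""

    def __init__(self, outer=None):
        self.table = {}
        self.outer = outer

    def put(self, name, info):
        self.table[name] = info

    def get(self, name):
        env = self
        while env is not None:         # search innermost scope first
            if name in env.table:
                return env.table[name]
            env = env.outer
        return None                    # undeclared identifier


# { int x; char y; { bool y; x; y; } x; y; }
outer = Env()
outer.put("x", "int")
outer.put("y", "char")
inner = Env(outer)
inner.put("y", "bool")

print(inner.get("x"), inner.get("y"))   # uses inside the inner block
print(outer.get("x"), outer.get("y"))   # uses after the inner block closes
```

Inside the inner block, `y` resolves to the `bool` declaration while `x` falls through to the outer table; after the block, both names resolve in the outer table again.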
Parsing • A syntax analyzer (parser) creates the syntactic structure (generally a parse tree) of the given program. • Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar.
Syntax Definition • Context-Free Grammar (CFG) is used to specify the syntax of a formal language (for example a programming language like C, Java) • Grammar describes the structure (usually hierarchical) of programming languages. • Example: in Java an IF statement should fit in • if ( expression ) statement else statement • Written as a production: statement → if ( expression ) statement else statement • Note the recursive nature of statement.
Definition of CFG • Four components: • A set of terminal symbols (tokens): elementary symbols of the language defined by the grammar • A set of non-terminals (syntactic variables): each represents a set of strings of terminals • A set of productions: non-terminal → a sequence of terminals and/or non-terminals • A designation of one of the non-terminals as the start symbol.
A Grammar Example • List of digits separated by plus or minus signs • Accepts strings such as 9-5+2, 3-1, or 7. • 0, 1, …, 9, +, - are the terminal symbols • list and digit are non-terminals • Every “line” is a production • list is the start symbol • Grouping: list → list + digit | list - digit | digit and digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Derivations • A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of a production • Language: The terminal strings that can be derived from the start symbol defined by the grammar. • Example: Derivation of 9-5+2 • 9 is a list, since 9 is a digit. • 9-5 is a list, since 9 is a list and 5 is a digit. • 9-5+2 is a list, since 9-5 is a list and 2 is a digit.
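The three steps of the example can be mirrored by a small recursive check. This is only a sketch for the grammar list → list + digit | list - digit | digit; the function name `is_list` is an assumption, and the input is taken as a string of single-character terminals.

```python
DIGITS = set("0123456789")


def is_list(s):
    """True if s can be derived from `list` in the grammar
       list -> list + digit | list - digit | digit."""
    if len(s) == 1:
        return s in DIGITS                       # list -> digit
    # list -> list + digit | list - digit:
    # the last symbol must be a digit, preceded by + or -, preceded by a list
    return (len(s) >= 3
            and s[-1] in DIGITS
            and s[-2] in "+-"
            and is_list(s[:-2]))


for text in ("9", "9-5", "9-5+2", "9-", "95"):
    print(text, is_list(text))
```

The recursion peels one `op digit` pair off the right end at each step, which matches the left-recursive shape of the productions: 9-5+2 is a list because 9-5 is a list, which in turn is a list because 9 is.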
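Parse Trees • A parse tree shows how the start symbol of a grammar derives a string in the language. • For a production A → XYZ, the tree has an interior node labeled A with children labeled X, Y, and Z, from left to right.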
Parse Trees Properties • The root is labeled by the start symbol. • Each leaf is labeled by a terminal or by ε. • Each interior node is labeled by a non-terminal. • If A is the non-terminal labeling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then there must be a production A → X1 X2 ··· Xn.
Ambiguity • A grammar can have more than one parse tree generating a given string of terminals. • Unambiguous grammar: list → list + digit | list - digit | digit and digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Ambiguous grammar: string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • With the ambiguous grammar, 9-5+2 has two parse trees: (9-5)+2 = 6 and 9-(5+2) = 2
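The two readings of 9-5+2 can be enumerated mechanically. The sketch below tries every operator as the root of the tree under the ambiguous grammar string → string + string | string - string | digit; the function name `parses` and the parenthesized-text encoding of a parse tree are assumptions made for the example.

```python
def parses(s):
    """Yield (parenthesized text, value) for every parse tree of s under
       string -> string + string | string - string | digit."""
    if len(s) == 1 and s in "0123456789":
        yield s, int(s)                          # string -> digit
    for i, ch in enumerate(s):
        if ch in "+-":                           # try this operator as the root
            for ltext, lval in parses(s[:i]):
                for rtext, rval in parses(s[i + 1:]):
                    value = lval + rval if ch == "+" else lval - rval
                    yield f"({ltext}{ch}{rtext})", value


for text, value in parses("9-5+2"):
    print(text, "=", value)
```

Two trees are produced, with values 2 and 6. Under the list/digit grammar only the left-grouped tree exists, because the right operand of each operator must be a single digit.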
Eliminating Ambiguity • Operator Associativity: in most programming languages arithmetic operators have left associativity. • Example: 9+5-2 = (9+5)-2 • Exception: the assignment operator = has right associativity: a=b=c is equivalent to a=(b=c) • Operator Precedence: if an operator has higher precedence, then it will bind to its operands first. • Example: * has higher precedence than +, therefore 9+5*2 = 9+(5*2)
Parsing • Parsing is the process of determining how a string of terminals can be generated by a grammar. • Two classes: • Top-down: construction of parse tree starts at the root and proceeds towards the leaves • Bottom-up: construction of parse tree starts at the leaves and proceeds towards the root
Top-Down Parsing • The top-down construction of a parse tree is done by starting from the root, and repeatedly performing the following two steps. • At node N, labeled with non-terminal A, select the proper production of A and construct children at N for the symbols in the production body. • Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded non-terminal of the tree.
Predictive Parsing • Recursive-descent parsing: a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. • Predictive parsing: a simple form of recursive-descent parsing • The lookahead symbol unambiguously determines the flow of control through the procedure for each non-terminal, based on the first terminal(s) of its production bodies
Procedure for stmt • Necessary condition to use predictive parsing? • No conflict among the first symbols of the bodies for the same head.
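The slide's own procedure is not reproduced here, so the following is a sketch for a hypothetical grammar, stmt → if ( expr ) stmt | while ( expr ) stmt | expr ; with expr treated as a single terminal to keep the example short. The class and function names are likewise assumptions.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, terminal):
        if self.lookahead != terminal:
            raise SyntaxError(f"expected {terminal!r}, found {self.lookahead!r}")
        self.pos += 1

    def stmt(self):
        # The lookahead alone selects the production, because the three
        # bodies begin with different terminals: 'if', 'while', 'expr'.
        if self.lookahead in ("if", "while"):
            self.match(self.lookahead)
            self.match("(")
            self.match("expr")
            self.match(")")
            self.stmt()                    # recursive call for the nested stmt
        elif self.lookahead == "expr":
            self.match("expr")
            self.match(";")
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")


def parse(tokens):
    """Return True if tokens form exactly one stmt; raise SyntaxError otherwise."""
    parser = Parser(tokens)
    parser.stmt()
    if parser.lookahead is not None:
        raise SyntaxError("trailing input after statement")
    return True


print(parse(["if", "(", "expr", ")", "while", "(", "expr", ")", "expr", ";"]))
```

If two bodies for stmt started with the same terminal, the `if`/`elif` chain could not choose between them from one lookahead symbol, which is the conflict the necessary condition rules out.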
Left Recursion Elimination • A production is left-recursive when the leftmost symbol of its body is the same as the non-terminal at its head, e.g. A → Aα | β. • A left-recursive production can be eliminated by rewriting the offending production: A → βR, R → αR | ε.
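A sketch of the standard rewrite for immediate left recursion, applied to one non-terminal at a time. The function name, the list-of-symbols encoding of production bodies, and the `_rest` suffix for the new non-terminal are assumptions made for the example; indirect left recursion is not handled.

```python
def eliminate_left_recursion(head, bodies):
    """Rewrite  A -> A a1 | A a2 | ... | b1 | b2 | ...   as
           A -> b1 R | b2 R | ...
           R -> a1 R | a2 R | ... | ε
    Bodies are lists of symbols; the empty list stands for ε."""
    recursive = [body[1:] for body in bodies if body and body[0] == head]
    others = [body for body in bodies if not body or body[0] != head]
    if not recursive:
        return {head: bodies}              # nothing to rewrite
    rest = head + "_rest"                  # hypothetical name for the new non-terminal R
    return {
        head: [body + [rest] for body in others],
        rest: [alpha + [rest] for alpha in recursive] + [[]],
    }


grammar = eliminate_left_recursion(
    "list", [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]]
)
for lhs, bodies in grammar.items():
    print(lhs, "->", " | ".join(" ".join(body) or "ε" for body in bodies))
```

The rewritten grammar generates the same strings, but every body of `list_rest` now starts with a terminal (or is empty), so a predictive parser can pick a production from the lookahead symbol without looping.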
Syntax Analyzer vs. Lexical Analyzer • Both of them do similar things • Granularity • The lexical analyzer works on the characters to recognize the smallest meaningful units (tokens) in a source program. • The syntax analyzer works on the smallest meaningful units (tokens) in a source program to recognize meaningful structures in the programming language. • Recursion • The lexical analyzer deals with simple non-recursive constructs of the language. • The syntax analyzer deals with recursive constructs of the language.
Semantic Analysis • Semantic Analyzer • adds semantic information to the parse tree (syntax-directed translation) • checks the source program for semantic errors • collects type information for the code generation • type checking: check whether each operator has matching operands • coercion: type conversion
Syntax-Directed Translation • Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. • Example: infix expression → postfix expression • Techniques: Attributes & Translation Schemes
Postfix Notation • Definition (E' denotes the postfix notation for E): • If E is a variable or constant, then E' is E itself: E → E. • If E is an expression of the form E1 op E2, then E1 op E2 → E1' E2' op. • If E is a parenthesized expression of the form (E1), then (E1) → E1'. • Examples: • 9-5+2 → 95-2+ • 9-(5+2) → 952+-
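The recursive definition translates almost line for line into code. In this sketch an expression is either a string (a variable or constant) or a tuple `(op, e1, e2)`; that encoding and the function name `postfix` are assumptions for illustration. Parentheses need no case of their own because the tree shape already records the grouping.

```python
def postfix(e):
    """Return the postfix notation of expression tree e."""
    if isinstance(e, str):                 # variable or constant: unchanged
        return e
    op, e1, e2 = e                         # E1 op E2  ->  E1' E2' op
    return postfix(e1) + postfix(e2) + op


print(postfix(("+", ("-", "9", "5"), "2")))   # tree for 9-5+2
print(postfix(("-", "9", ("+", "5", "2"))))   # tree for 9-(5+2)
```

The two calls reproduce the slide's examples, 95-2+ and 952+-, and show that postfix notation needs no parentheses to distinguish the two groupings.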
Attributes • A syntax-directed definition • associates attributes with non-terminals and terminals in a grammar • attaches semantic rules to the productions of the grammar • An attribute is said to be synthesized if its value at a parse-tree node is determined from attribute values of its children and itself.
Semantic Rules for Infix to Postfix • Example: 9-5+2 → 95-2+ • Figures: syntax-directed definition; annotated parse tree
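An annotated parse tree can be sketched by giving every node a string attribute and filling it in bottom-up. The slide's own rule table is not reproduced here, so the rules in the comments (concatenate the children's attributes and append the operator) are the usual ones for this grammar and should be read as an assumption, as should the `Node` class and the attribute name `t`.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.t = None                  # synthesized attribute: postfix string


def annotate(node):
    """Compute node.t from the attributes of the node's children (bottom-up)."""
    for child in node.children:
        annotate(child)
    kids = node.children
    if not kids:                       # terminal leaf: its own text
        node.t = node.label
    elif len(kids) == 3:               # list -> list1 op digit: list1.t digit.t op
        node.t = kids[0].t + kids[2].t + kids[1].t
    else:                              # list -> digit, digit -> 0 | ... | 9
        node.t = kids[0].t
    return node.t


def digit(d):
    return Node("digit", [Node(d)])


# Parse tree of 9-5+2 under list -> list + digit | list - digit | digit
list_9 = Node("list", [digit("9")])
list_9_5 = Node("list", [list_9, Node("-"), digit("5")])
root = Node("list", [list_9_5, Node("+"), digit("2")])

print(annotate(root))
```

Every `t` depends only on the children of its node, which is what makes the attribute synthesized and lets a single bottom-up pass evaluate the whole tree.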
Translation Schemes A Syntax-Directed Translation Scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar. The program fragments are called semantic actions.
A Translation Scheme • Example: 9-5+2 → 95-2+ • Figures: translation scheme; parse tree with semantic actions
Attribute vs. Translation Scheme • A syntax-directed definition attaches strings as attributes to the nodes in the parse tree. • A syntax-directed translation scheme prints the translation incrementally, through semantic actions.
A Simple Translator • Grammar for a list of digits separated by plus or minus signs
Translation of 9-5+2 to 95-2+ • Left recursion eliminated
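A translator of this kind can be sketched as a recursive-descent parser whose semantic actions emit output as soon as each piece is recognized. The grammar in the comments (expr → term rest, with the actions embedded in rest) is the usual result of eliminating left recursion and is an assumption here, since the slide's figure is not available; the non-terminal and function names are also illustrative.

```python
def translate(text):
    """Translate an infix list of digits into postfix while parsing."""
    out = []          # stands in for print(): semantic actions append here
    pos = 0

    def lookahead():
        return text[pos] if pos < len(text) else None

    def match(ch):
        nonlocal pos
        if lookahead() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def term():
        # term -> digit { print(digit) }
        ch = lookahead()
        if ch is None or ch not in "0123456789":
            raise SyntaxError(f"digit expected at position {pos}")
        match(ch)
        out.append(ch)

    def rest():
        # rest -> + term { print('+') } rest | - term { print('-') } rest | ε
        op = lookahead()
        if op in ("+", "-"):
            match(op)
            term()
            out.append(op)         # action runs after the operand is emitted
            rest()
        # otherwise take the ε-production and consume nothing

    term()                         # expr -> term rest
    rest()
    if lookahead() is not None:
        raise SyntaxError(f"unexpected input at position {pos}")
    return "".join(out)


print(translate("9-5+2"))
```

No parse tree is ever built: the order in which the procedures run is the order of a depth-first walk of the tree, so appending inside the procedures yields the postfix string directly.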
Syntax vs. Semantics • The syntax of a programming language describes the proper form of its programs. • The semantics of the language defines what its programs mean, what each program does when it executes.
Intermediate Code Generation • The front end of a compiler constructs an intermediate representation of the source program from which the back end generates the target program. • Two kinds of intermediate representations • Tree: parse trees and (abstract) syntax trees • Linear representation: three-address code
Three-Address Codes • Three-address code is a sequence of instructions of the form x = y op z • Arrays are handled with the following two variants of instructions: x[y] = z and x = y[z] • Instructions for control flow: ifFalse x goto L, ifTrue x goto L, and goto L • Instruction for copying a value: x = y
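To show how instructions of the form x = y op z arise, the sketch below flattens an expression tree into three-address code, introducing one temporary per operator. The tuple encoding of the tree, the `t1, t2, …` naming of temporaries, and the function name `gen_tac` are assumptions made for the example.

```python
def gen_tac(expr):
    """Return (instructions, address holding the result) for an expression tree.
    Leaves are names or constants (str); inner nodes are (op, left, right)."""
    code = []
    counter = 0

    def walk(e):
        nonlocal counter
        if isinstance(e, str):
            return e                       # a name or constant is its own address
        op, left, right = e
        y = walk(left)                     # operands are evaluated first
        z = walk(right)
        counter += 1
        x = f"t{counter}"                  # fresh temporary for this operator
        code.append(f"{x} = {y} {op} {z}")
        return x

    result = walk(expr)
    return code, result


instructions, result = gen_tac(("+", "a", ("*", "b", "c")))   # a + b * c
for line in instructions:
    print(line)
print("result in", result)
```

Each instruction has at most one operator on its right-hand side, so nested sub-expressions are computed into temporaries before the instruction that uses them.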