Introduction to Compilers. Professor Yihjia Tsai, 2006 Spring, Tamkang University.
What is a compiler?
• Translates source code to target code
• Source code is typically a high-level programming language (Java, C++, etc.) but does not have to be
• Target code is often a low-level language like assembly or machine code but does not have to be
• Can you think of other compilers that you have used, according to this definition?
Before we begin
• * star
• + plus
• , comma
• - hyphen, minus
• / slash
• : colon
• ; semicolon
• < less than
• = equal
• A-Z, a-z, 0-9
• " double quote
• # hash
• $ dollar sign
• % percent
• & ampersand
• ' single quote
• ( left parenthesis
• ) right parenthesis
Symbols
• ` back quote
• { open brace
• | or
• } close brace
• ~ tilde
• . period, dot
• • bullet
• > greater than
• ? question mark
• @ at sign
• [ left (open) square bracket
• \ back slash
• ] right (close) square bracket
• ^ caret, power
• _ underscore
Greek symbols
• μ mu
• ν nu
• ξ xi
• π pi
• ρ rho
• σ sigma
• τ tau
• χ chi
• ψ psi
• η eta
• ω omega
• α alpha
• β beta
• γ gamma
• δ delta
• ε epsilon
• φ phi
• ζ zeta
• θ theta
• ι iota
• κ kappa
• λ lambda
Other Compilers
• Javadoc -> HTML
• XML -> HTML
• SQL query output -> table
• PostScript -> PDF
• High-level description of a circuit -> machine instructions to fabricate the circuit
The Analysis Stage
• Broken up into four phases:
  • Lexical analysis (also called scanning or tokenization)
  • Parsing
  • Semantic analysis
  • Intermediate code generation
Lexing Example
    double d1;
    double d2;
    d2 = d1 * 2.0;

    Lexeme   Token             Note
    double   TOK_DOUBLE        reserved word
    d1       TOK_ID            variable name
    ;        TOK_PUNCT         has value of ";"
    double   TOK_DOUBLE        reserved word
    d2       TOK_ID            variable name
    ;        TOK_PUNCT         has value of ";"
    d2       TOK_ID            variable name
    =        TOK_OPER          has value of "="
    d1       TOK_ID            variable name
    *        TOK_OPER          has value of "*"
    2.0      TOK_FLOAT_CONST   has value of 2.0
    ;        TOK_PUNCT         has value of ";"
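A scanner for the lexing example might look roughly like the sketch below. The token names come from the slide; the regular expressions, the `tokenize` function, and the `SKIP` pseudo-token for whitespace are assumptions made for illustration, not the course's actual lexer.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
# Order matters: the reserved word must be tried before the identifier rule.
TOKEN_SPEC = [
    ("TOK_FLOAT_CONST", r"\d+\.\d+"),
    ("TOK_DOUBLE",      r"\bdouble\b"),
    ("TOK_ID",          r"[A-Za-z_]\w*"),
    ("TOK_OPER",        r"[=*+\-/]"),
    ("TOK_PUNCT",       r";"),
    ("SKIP",            r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token, lexeme) pairs; raise on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":          # whitespace separates lexemes but is not a token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for token, lexeme in tokenize("double d1; double d2; d2 = d1 * 2.0;"):
    print(f"{lexeme:8} {token}")
```

Run on the three statements from the slide, this prints the same twelve lexeme/token pairs, one per line.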
Syntax and Semantics
• Syntax: the form or structure of the expressions, that is, whether an expression is well formed
• Semantics: the meaning of an expression
Syntactic Structure
• Syntax is almost always expressed using some variant of a notation called a context-free grammar (CFG), or simply a grammar
• BNF (Backus-Naur Form)
• EBNF (Extended BNF)
A CFG has 4 parts
• A set of tokens (lexemes), known as terminal symbols
• A set of non-terminals
• A set of rules (productions), where each production consists of a left-hand side (LHS) and a right-hand side (RHS). The LHS is a non-terminal and the RHS is a sequence of terminal and/or non-terminal symbols.
• A special non-terminal symbol designated as the start symbol
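The four parts map naturally onto a small data structure. The sketch below is one possible encoding; the `Grammar` class, its `check` method, and the toy grammar `S -> a S | b` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # part 1: terminal symbols
    nonterminals: frozenset   # part 2: non-terminals
    productions: tuple        # part 3: pairs (LHS, RHS), where RHS is a tuple of symbols
    start: str                # part 4: start symbol

    def check(self):
        """Return a list of problems; an empty list means the four parts fit together."""
        problems = []
        if self.terminals & self.nonterminals:
            problems.append("a symbol is both terminal and non-terminal")
        if self.start not in self.nonterminals:
            problems.append("start symbol is not a non-terminal")
        known = self.terminals | self.nonterminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                problems.append(f"LHS {lhs!r} is not a non-terminal")
            for symbol in rhs:
                if symbol not in known:
                    problems.append(f"unknown symbol {symbol!r} in RHS of {lhs!r}")
        return problems

# A hypothetical toy grammar: S -> a S | b
toy = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    productions=(("S", ("a", "S")), ("S", ("b",))),
    start="S",
)
print(toy.check())  # prints []
```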
An example of BNF syntax for real numbers
    <r>  ::= <ds> . <ds>
    <ds> ::= <d> | <d> <ds>
    <d>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• < > encloses non-terminal symbols
• ::= means 'is', 'is made up of', or 'derives' (sometimes denoted with an arrow ->)
• | means 'or'
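Each production of the real-number grammar can be turned into one small function. The following is a minimal sketch of such a recognizer; the function names `parse_d`, `parse_ds`, and `is_real` are invented here, and the code only accepts or rejects a string without building a tree.

```python
DIGITS = "0123456789"

def parse_d(text, pos):
    """<d> ::= 0 | 1 | ... | 9. Return the position after the digit, or None."""
    if pos < len(text) and text[pos] in DIGITS:
        return pos + 1
    return None

def parse_ds(text, pos):
    """<ds> ::= <d> | <d> <ds>. The recursion consumes as many digits as possible."""
    pos = parse_d(text, pos)
    if pos is None:
        return None
    rest = parse_ds(text, pos)
    return pos if rest is None else rest

def is_real(text):
    """<r> ::= <ds> . <ds>"""
    pos = parse_ds(text, 0)
    if pos is None or pos >= len(text) or text[pos] != ".":
        return False
    pos = parse_ds(text, pos + 1)
    return pos == len(text)

print(is_real("3.14"))  # True
print(is_real("3."))    # False: the grammar requires digits on both sides
print(is_real(".5"))    # False
```

Note how the list of digits is expressed through recursion in `parse_ds`, mirroring the recursive production for `<ds>`.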
Example
• On the example from the previous slide:
  • What are the tokens?
  • What are the lexemes?
  • What are the non-terminals?
  • What are the productions?
Token vs. lexeme
• to·ken: One that represents a group, as an employee whose presence is used to deflect from the employer criticism or accusations of discrimination.
• to·ken: A basic, grammatically indivisible unit of a language, such as a keyword, operator, or identifier.
• lexeme: A minimal unit (as a word or stem) in the lexicon of a language; 'go', 'went', 'gone', and 'going' are all members of the English lexeme 'go'.
• lexeme: A minimal lexical unit of a language. Lexical analysis converts strings in a language into a list of lexemes. For a programming language, these word-like pieces include keywords, identifiers, literals, and punctuation. The lexemes are then passed to the parser for syntactic analysis.
BNF Points
• A non-terminal can have more than one RHS; the alternatives can be written as separate productions or joined with an OR (|)
• Lists or sequences are expressed via recursion
• A derivation is just a repeated application of productions (rules)
• Examples
Example Grammar
    <program> -> <stmts>
    <stmts>   -> <stmt> | <stmt> ; <stmts>
    <stmt>    -> <var> = <expr>
    <var>     -> a | b | c | d
    <expr>    -> <term> + <term> | <term> - <term>
    <term>    -> <var> | const
Example Derivation
    <program> => <stmts>
              => <stmt>
              => <var> = <expr>
              => a = <expr>
              => a = <term> + <term>
              => a = <var> + <term>
              => a = b + <term>
              => a = b + const
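The derivation can be replayed mechanically: at each step, replace the leftmost non-terminal with one of its alternatives. The sketch below encodes the example grammar as a dictionary and takes a list of alternative indices; `GRAMMAR`, `leftmost_derivation`, and the index encoding are assumptions made for illustration.

```python
# The example grammar: each non-terminal maps to its list of alternatives.
GRAMMAR = {
    "<program>": [["<stmts>"]],
    "<stmts>":   [["<stmt>"], ["<stmt>", ";", "<stmts>"]],
    "<stmt>":    [["<var>", "=", "<expr>"]],
    "<var>":     [["a"], ["b"], ["c"], ["d"]],
    "<expr>":    [["<term>", "+", "<term>"], ["<term>", "-", "<term>"]],
    "<term>":    [["<var>"], ["const"]],
}

def leftmost_derivation(start, choices):
    """Apply one production per step to the leftmost non-terminal.

    choices[i] is the index of the alternative used at step i.
    Returns every sentential form, beginning with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Find the leftmost symbol that still has productions.
        index = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        form[index:index + 1] = GRAMMAR[form[index]][choice]
        steps.append(" ".join(form))
    return steps

steps = leftmost_derivation("<program>", [0, 0, 0, 0, 0, 0, 1, 1])
print("\n=> ".join(steps))
```

With the choices shown, the output reproduces the nine sentential forms of the example derivation, ending in `a = b + const`.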
Parse Trees
• Alternative representation for a derivation
• Example parse tree for the previous example

            stmts
              |
            stmt
          /   |   \
       var    =    expr
        |         /  |  \
        a     term   +   term
               |          |
              var       const
               |
               b
Another Example
    Expression -> Expression + Expression
                | Expression - Expression
                | ...
                | Variable
                | Constant
                | ...
    Variable   -> T_IDENTIFIER
    Constant   -> T_INTCONSTANT
                | T_DOUBLECONSTANT
The Parse
    a + 2

    Expression -> Expression + Expression
               -> Variable + Expression
               -> T_IDENTIFIER + Expression
               -> T_IDENTIFIER + Constant
               -> T_IDENTIFIER + T_INTCONSTANT
Parse Trees
    PS -> P | P PS
    P  -> e | '(' PS ')' | '<' PS '>' | '[' PS ']'
• Here e stands for the empty string (ε)
• What's the parse tree for this statement?
    < [ ] [ < > ] >
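One way to explore the bracket grammar is a small recursive-descent parser that returns the parse tree as nested tuples. This is a sketch under two assumptions: `e` is read as the empty string, and `P -> e` is chosen only when the next token is not an opening bracket (the grammar as written is ambiguous, so some such rule is needed). The names `PAIRS`, `parse_ps`, `parse_p`, and `parse` are invented for the example.

```python
PAIRS = {"(": ")", "<": ">", "[": "]"}

def parse_ps(tokens, pos):
    """PS -> P | P PS. Returns (tree, next position)."""
    p, pos = parse_p(tokens, pos)
    if pos < len(tokens) and tokens[pos] in PAIRS:
        rest, pos = parse_ps(tokens, pos)
        return ("PS", p, rest), pos
    return ("PS", p), pos

def parse_p(tokens, pos):
    """P -> e | '(' PS ')' | '<' PS '>' | '[' PS ']'. Returns (tree, next position)."""
    if pos < len(tokens) and tokens[pos] in PAIRS:
        opener = tokens[pos]
        inner, pos = parse_ps(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != PAIRS[opener]:
            raise SyntaxError(f"expected {PAIRS[opener]!r} at token {pos}")
        return ("P", opener, inner, PAIRS[opener]), pos + 1
    return ("P", "e"), pos   # assumption: take the empty alternative when no bracket opens

def parse(text):
    """Parse whitespace-separated brackets into a nested-tuple parse tree."""
    tokens = text.split()
    tree, pos = parse_ps(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r} at token {pos}")
    return tree

print(parse("[ ]"))
# ('PS', ('P', '[', ('PS', ('P', 'e')), ']'))
```

Calling `parse("< [ ] [ < > ] >")` yields a tree whose root `PS` has a single `P` child for the outer angle brackets, with a two-element `PS` list inside it.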
EBNF - Extended BNF
• Like BNF except that:
  • Non-terminals start with an uppercase letter
  • Parentheses ( ) are used for grouping terminals
  • Braces { } represent zero or more occurrences (iteration)
  • Brackets [ ] represent an optional construct, that is, a construct that appears either once or not at all
EBNF example
    Exp    -> Term { ('+' | '-') Term }
    Term   -> Factor { ('*' | '/') Factor }
    Factor -> '(' Exp ')' | variable | constant
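EBNF translates almost line for line into a recursive-descent parser: each non-terminal becomes a method and each `{ ... }` becomes a `while` loop. The sketch below evaluates the expression while parsing; the `Parser` class, `tokenize`, `evaluate`, and the variable dictionary are assumptions added for illustration, and the simple tokenizer silently skips characters it does not recognize.

```python
import re

def tokenize(text):
    """Split into numbers, identifiers, operators, and parentheses."""
    return re.findall(r"\d+\.?\d*|[A-Za-z_]\w*|[-+*/()]", text)

class Parser:
    def __init__(self, text, variables=None):
        self.tokens = tokenize(text)
        self.pos = 0
        self.variables = variables or {}

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def exp(self):
        # Exp -> Term { ('+' | '-') Term }
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # Term -> Factor { ('*' | '/') Factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # Factor -> '(' Exp ')' | variable | constant
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token[0].isdigit():
            return float(token)
        if token[0].isalpha() or token[0] == "_":
            return self.variables[token]
        raise SyntaxError(f"unexpected token {token!r}")

def evaluate(text, variables=None):
    parser = Parser(text, variables)
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value

print(evaluate("2 * (3 + 4) - 5"))                        # 9.0
print(evaluate("a - b - c", {"a": 10, "b": 3, "c": 2}))  # 5
```

Because the loops consume operators from left to right, `a - b - c` is evaluated as `(a - b) - c`, which is the usual left-associative reading.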
EBNF/BNF
• EBNF and BNF are equivalent
• How can { } be expressed in BNF?
• How can ( ) be expressed?
• How can [ ] be expressed?
Semantic Analysis
• The syntactically correct parse tree (or derivation) is checked for semantic errors
• Check for constructs that, while syntactically valid, do not obey the semantic rules of the source language
• Examples:
  • Use of an undeclared or uninitialized variable
  • Function called with improper arguments
  • Incompatible operands and type mismatches
Examples
    void fun1(int i);
    double d;
    d = fun1(2.1);

    int i;
    int j;
    i = i + 2;

    int arr[2], c;
    c = arr * 10;
• Most semantic analysis pertains to the checking of types.
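The three examples can be run through a toy semantic checker built around a symbol table. Everything in the sketch below is an assumption made for illustration: the tuple-based statement encoding, the `check` function, the `"int[]"` type name, and the rule that argument types must match parameter types exactly (a real C compiler would instead convert `2.1` implicitly).

```python
NUMERIC = {"int", "double"}

def check(program, functions):
    """Return a list of semantic error messages for a list of statements."""
    types = {}           # symbol table: variable name -> declared type
    initialized = set()  # variables that have been assigned a value
    errors = []

    def type_of(expr):
        """Compute the type of an expression, or None if it is already in error."""
        kind = expr[0]
        if kind == "const":                       # ("const", type)
            return expr[1]
        if kind == "var":                         # ("var", name)
            name = expr[1]
            if name not in types:
                errors.append(f"undeclared variable {name}")
                return None
            # Arrays are exempt from the initialization check in this sketch.
            if types[name] in NUMERIC and name not in initialized:
                errors.append(f"uninitialized variable {name}")
            return types[name]
        if kind == "call":                        # ("call", name, [args])
            name, args = expr[1], expr[2]
            if name not in functions:
                errors.append(f"undeclared function {name}")
                return None
            return_type, params = functions[name]
            if [type_of(arg) for arg in args] != params:
                errors.append(f"bad arguments to {name}")
            return return_type
        if kind == "binop":                       # ("binop", op, left, right)
            left, right = type_of(expr[2]), type_of(expr[3])
            if left is None or right is None:
                return None
            if left not in NUMERIC or right not in NUMERIC:
                errors.append(f"operator {expr[1]} cannot be applied to {left} and {right}")
                return None
            return "double" if "double" in (left, right) else "int"
        raise ValueError(f"unknown expression kind {kind!r}")

    for stmt in program:
        if stmt[0] == "decl":                     # ("decl", type, name)
            types[stmt[2]] = stmt[1]
        elif stmt[0] == "assign":                 # ("assign", name, expr)
            name, value_type = stmt[1], type_of(stmt[2])
            if name not in types:
                errors.append(f"undeclared variable {name}")
            elif value_type is not None and value_type not in NUMERIC:
                errors.append(f"cannot assign {value_type} to {name}")
            initialized.add(name)
    return errors

FUNCTIONS = {"fun1": ("void", ["int"])}           # void fun1(int i);
example1 = [("decl", "double", "d"),
            ("assign", "d", ("call", "fun1", [("const", "double")]))]
example2 = [("decl", "int", "i"), ("decl", "int", "j"),
            ("assign", "i", ("binop", "+", ("var", "i"), ("const", "int")))]
example3 = [("decl", "int[]", "arr"), ("decl", "int", "c"),
            ("assign", "c", ("binop", "*", ("var", "arr"), ("const", "int")))]
for example in (example1, example2, example3):
    print(check(example, FUNCTIONS))
```

Each example is syntactically fine, yet the checker reports at least one error for it, and all but the uninitialized-variable case are type errors.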
Intermediate Code Generation
• This is where the intermediate representation of the source program is created
• The representation can take a variety of forms, but a common one is called three-address code (TAC)
• Like assembly, TAC is a sequence of simple instructions, each of which can have at most three operands
Example
    a = b * c + b * d

    _t1 = b * c
    _t2 = b * d
    _t3 = _t1 + _t2
    a = _t3
• Note: temporaries (_t1, _t2, _t3)
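Generating TAC for an expression amounts to a post-order walk of the expression tree, creating one temporary per operator. The sketch below assumes the expression is already parsed into nested tuples `(op, left, right)`; `generate_tac` and that tuple encoding are hypothetical, and only the `_tN` naming is taken from the slide.

```python
import itertools

def generate_tac(target, expr):
    """Translate `target = expr` into a list of three-address instructions.

    expr is either a variable name (str) or a tuple (op, left, right).
    """
    code = []
    counter = itertools.count(1)

    def emit(node):
        """Emit code for node and return the name that holds its value."""
        if isinstance(node, str):
            return node                      # a plain variable needs no instruction
        op, left, right = node
        left_name = emit(left)               # operands first (post-order)
        right_name = emit(right)
        temp = f"_t{next(counter)}"
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    result = emit(expr)
    code.append(f"{target} = {result}")
    return code

# a = b * c + b * d
print("\n".join(generate_tac("a", ("+", ("*", "b", "c"), ("*", "b", "d")))))
```

For the statement above, the generated instructions match the four TAC lines of the example. Every instruction has at most three operands because each temporary holds the result of exactly one operator.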
Another Example
    if (a <= b)
        a = a - c;
    c = b * c;

    _t1 = a > b
    if _t1 goto L0
    _t2 = a - c
    a = _t2
    L0: _t3 = b * c
    c = _t3
• Note:
  • Temps
  • Symbolic addresses (labels such as L0)
Next Time
• Finish introduction to compilation stages
• Read Appel, Chapters 1 and 2, if you have not already done so
• What is a splay tree?
Selected References
• Appel, A., Modern Compiler Implementation in Java (2nd ed.), Cambridge University Press, 2002. ISBN 0-521-82060-X.
• Aho, A.V., R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988. ISBN 0-201-10088-6.
• Muchnick, S., Advanced Compiler Design and Implementation, Morgan Kaufmann, 1998. ISBN 1-55860-320-4.