1 / 39

Introduction to Compilers: Overview, Structure, and Grammars

This article provides an introduction to compilers, covering their motivation, structure, and the role of grammars. It also explores syntax trees, Chomsky's classification of grammars, and the history of compiler construction. The article discusses the Z# language and its lexical analysis, syntax analysis, and semantic analysis. It also explains the dynamic and static structures of a compiler, as well as the phases involved in single-pass and multi-pass compilers. Additionally, it covers the difference between compilers and interpreters, and the advantages and disadvantages of each. The article concludes with a discussion on the static structure of a compiler and the components of a grammar.

leroya
Download Presentation

Introduction to Compilers: Overview, Structure, and Grammars

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars 1.6 The Z# Language

  2. Short History of Compiler Construction Formerly "a mystery", today one of the best-known areas of computing 1957Fortranfirst compilers (arithmetic expressions, statements, procedures) 1960Algolfirst formal language definition (grammars in Backus-Naur form, block structure, recursion, ...) 1970Pascaluser-defined types, virtual machines (P-code) 1985C++object-orientation, exceptions, templates 1995Javajust-in-time compilation We only look at imperative languages Functional languages (e.g. Lisp) and logical languages (e.g. Prolog) require different techniques.

  3. Why should I learn about compilers? It's part of the general background of a software engineer • How do compilers work? • How do computers work?(instruction set, registers, addressing modes, run-time data structures, ...) • What machine code is generated for certain language constructs?(efficiency considerations) • What is good language design? • Opportunity for a non-trivial programming project • Also useful for general software development • Reading syntactically structured command-line arguments • Reading structured data (e.g. XML files, part lists, image files, ...) • Searching in hierarchical namespaces • Interpretation of command codes • ...

  4. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars 1.6 The Z# Language

  5. lexical analysis (scanning) token stream token number 1 (ident) "val" 3 (assign) - 2 (number) 10 4 (times) - 1 (ident) "val" 5 (plus) - 1 (ident) "i" token value syntax analysis (parsing) Statement syntax tree Expression Term ident = number * ident + ident Dynamic Structure of a Compiler character stream v a l = 1 0 * v a l + i

  6. optimization code generation ld.i4.s 10 ldloc.1 mul ... machine code Dynamic Structure of a Compiler Statement syntax tree Expression Term ident = number * ident + ident semantic analysis (type checking, ...) intermediate representation syntax tree, symbol table, ...

  7. Single-Pass Compilers Phases work in an interleaved way scan token parse token check token generate code for token n eof? y The target program is already generated while the source program is read.

  8. Multi-Pass Compilers Phases are separate "programs", which run sequentially sem. analysis scanner parser ... characters tokens tree code Each phase reads from a file and writes to a new file Why multi-pass? • if memory is scarce (irrelevant today) • if the language is complex • if portability is important

  9. language-dependent machine-dependent Java C Pascal Pentium PowerPC SPARC any combination possible Today: Often Two-Pass Compilers Front End Back End scanning parsing sem. analysis code generation intermediate representation • Advantages • better portability • many combinations between front endsand back ends possible • optimizations are easier on the intermediaterepresentation than on source code • Disadvantages • slower • needs more memory

  10. Interpreter executes source code "directly" • statements in a loop arescanned and parsedagain and again scanner parser source code interpretation Variant: interpretation of intermediate code • source code is translated into thecode of a virtual machine (VM) • VM interprets the codesimulating the physical machine ... compiler ... VM source code intermediate code (e.g. Common IntermediateLanguage (CIL)) Compiler versus Interpreter Compiler translates to machine code scanner parser ... code generator loader source code machine code

  11. data flow Static Structure of a Compiler "main program" directs the whole compilation parser & sem. analysis scanner code generation provides tokens from the source code generates machine code symbol table maintains information about declared names and types uses

  12. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars 1.6 The Z# Language

  13. Four components terminal symbols are atomic "if", ">=", ident, number, ... nonterminal symbols are derived into smaller units Statement, Expr, Type, ... productions rules how to decom- pose nonterminals Statement = Designator "=" Expr ";". Designator = ident ["." ident]. ... start symbol topmost nonterminal CSharp What is a grammar? Example Statement = "if" "(" Condition ")" Statement ["else" Statement].

  14. | (...) [...] {...} separates alternatives groups alternatives optional part repetitive part a | b | c  a or b or c a ( b | c )  ab | ac [ a ] b  ab | b { a } b  b | ab | aab | aaab | ... EBNF Notation Extended Backus-Naur form John Backus: developed the first Fortran compiler Peter Naur: edited the Algol60 report symbol meaning examples string name = . denotes itself denotes a T or NT symbol separates the sides of a production terminates a production "=", "while" ident, Statement A = b c d . • Conventions • terminal symbols start with lower-case letters (e.g. ident) • nonterminal symbols start with upper-case letters (e.g. Statement)

  15. Terminal symbols simple TS: terminal classes: "+", "-", "*", "/", "(", ")" (just 1 instance) ident, number (multiple instances) Nonterminal symbols Expr, Term, Factor Start symbol Expr Example: Grammar for Arithmetic Expressions Productions Expr = [ "+" | "-" ] Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = ident | number | "(" Expr ")". Expr Term Factor

  16. - ident * number + ident / number - ident  - Factor * Factor + Factor / Factor - Factor  "*" and "/" have higher priority than "+" and "-" - Term + Term - Term "-" does not refer to a, but to a*3  Expr Operator Priority Grammars can be used to define the priority of operators Expr = [ "+" | "-" ] Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = ident | number | "(" Expr ")". input:- a * 3 + b / 4 - c How must the grammar be transformed, so that "-" refers to a? Expr = Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = [ "+" | "-" ] ( ident | number | "(" Expr ")" ).

  17. Terminal Start Symbols of Nonterminals Which terminal symbols can a nonterminal start with? Expr = ["+" | "-"] Term {("+" | "-") Term}. Term = Factor {("*" | "/") Factor}. Factor = ident | number | "(" Expr ")". First(Factor) = ident, number, "(" First(Term) = First(Factor) = ident, number, "(" First(Expr) = "+", "-", First(Term) = "+", "-", ident, number, "("

  18. Terminal Successors of Nonterminals Which terminal symbols can follow after a nonterminal in the grammar? Expr = [ "+" | "-" ] Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = ident | number | "(" Expr ")". Where does Expr occur on the right-hand side of a production? What terminal symbols can follow there? Follow(Expr) = ")", eof Follow(Term) = "+", "-", Follow(Expr) = "+", "-", ")", eof Follow(Factor) = "*", "/", Follow(Term) = "*", "/", "+", "-", ")", eof

  19. String A finite sequence of symbols from an alphabet. Empty String The string that contains no symbol (denoted by e). Some Terminology Alphabet The set of terminal and nonterminal symbols of a grammar Strings are denoted by greek letters (a, b, g, ...) e.g: a = ident + number b = - Term + Factor * number

  20. a*b (indirect derivation) ag1g2 ... gnb aLb (left-canonical derivation) the leftmost NTS in a is derived first aRb (right-canonical derivation) the rightmost NTS in a is derived first Reduction The converse of a derivation: If the right-hand side of a production occurs in b it is replaced with the corresponding NTS Derivations and Reductions Derivation a b ab (direct derivation)  Term + Factor * Factor Term + ident * Factor NTS right-hand side of a production of NTS

  21. Deletability A string a is called deletable, if it can be derived to the empty string. a* e Example A = B C. B = [ b ]. C = c | d | . B is deletable: B e C is deletable: C e A is deletable: A  B C  C e

  22. Sentential form Any string that can be derived from the start symbol of the grammar. e.g.: Expr Term + Term + Term Term + Factor * ident + Term ... Sentence A sentential form that consists of terminal symbols only. e.g.: ident * number + ident Language (formal language) The set of all sentences of a grammar (usually infinitely large). e.g.: the C# language is the set of all valid C# programs More Terminology Phrase Any string that can be derived from a nonterminal symbol. Term phrases: Factor Factor * Factor ident * Factor ...

  23. Direct recursion A w1 A w2 Left recursion A = b | A a. A  A a  A a a  A a a a  b a a a a a ... Right recursion A = b | a A. A  a A  a a A  a a a A  ... a a a a a b Central recursion A = b | "(" A ")". A  (A)  ((A))  (((A)))  (((... (b)...))) Indirect recursion A * w1 A w2 Example Expr  Term  Factor  "(" Expr ")" Expr = Term { "+" Term }. Term = Factor { "*" Factor }. Factor = id | "(" Expr ")". Recursion A production is recursive if A * w1 A w2 Can be used to represent repetitions and nested structures

  24. Left recursion can be transformed to iteration E = T | E "+" T. What sentences can be derived? T T + T T + T + T ... From this one can deduce the iterative EBNF rule: E = T { "+" T }. How to Remove Left Recursion Left recursion cannot be handled in topdown syntax analysis Both alternatives start with b. The parser cannot decide which one to choose A = b | A a.

  25. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars 1.6 The Z# Language

  26. BNF grammar for arithmetic expressions • Alternatives are transformed intoseparate productions • Repetition must be expressed by recursion <Expr> ::= <Sign> <Term> <Expr> ::= <Expr> <Addop> <Term> <Sign> ::= + <Sign> ::= - <Sign> ::= <Addop> ::= + <Addop> ::= - <Term> ::= <Factor> <Term> ::= <Term> <Mulop> <Factor> <Mulop> ::= * <Mulop> ::= / <Factor> ::= ident <Factor> ::= number <Factor> ::= ( <Expr> ) Plain BNF Notation terminal symbols are written without quotes (ident, +, -) nonterminal symbols are written in angle brackets (<Expr>, <Term>) sides of a production are separated by ::= • Advantages • fewer meta symbols (no |, (), [], {}) • it is easier to build a syntax tree • Disadvantages • more clumsy

  27. Abstract Syntax Tree(leaves = operands, inner nodes = operators) + often used as an internal program representation; used for optimizations * ident number ident Syntax Tree Shows the structure of a particular sentence e.g. for 10 + 3 * i Concrete Syntax Tree(parse tree) Would not be possible with EBNF because of [...] and {...}, e.g.: Expr = [ Sign ] Term { Addop Term }. Expr Expr Addop Term Sign Term Term Mulop Factor Also reflects operator priorities:operators further down in the tree have a higher priority than operators further up in the tree. Factor Factor number + number * ident e

  28. Two syntax trees can be built for this sentence: T T T T T T T T T T F F F F F F id * id * id id * id * id Ambiguous grammars cause problems in syntax analysis! Ambiguity A grammar is ambiguous, if more than one syntax tree can be built for a given sentence. Example sentence:id * id * id T = F | T "*" T. F = id.

  29. Even better: transformation to EBNF T = F { "*" F }. F = id. Removing Ambiguity Example T = F | T "*" T. F = id. Only the grammar is ambiguous, not the language. The grammar can be transformed to i.e. T has priority over F T T = F | T "*" F. F = id. T only this syntax tree is possible T T T F F F id * id * id

  30. Statement There is no unambiguous grammar for this language! Statement Condition Condition Statement Statement C# solutionAlways recognize the longest possible right-hand side of a production  leads to the lower of the two syntax trees Condition Condition Statement Statement Statement Statement Inherent Ambiguity There are languages which are inherently ambiguous Example: Dangling Else Statement = Assignment | "if" Condition Statement | "if" Condition Statement "else" Statement | ... . if (a < b) if (b < c) x = c; else x = b;

  31. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars 1.6 The Z# Language

  32. class 0 Unrestricted grammars (a and b arbitrary) e.g: A = a A b | B c B. aBc = d. dB = bb. class 1 Contex-sensitive grammars (|a|  |b|) e.g: a A = a b c. Recognized by linear bounded automata class 2 Context-free grammars (a = NT, be) e.g: A = a b c. Recognized by push-down automata Only these two classes are relevant in compiler construction class 3 Regular grammars (a = NT, b= T | T NT) e.g: A = b | b B. Recognized by finite automata Classification of Grammars Due to Noam Chomsky (1956) Grammars are sets of productions of the form a = b. A  aAb  aBcBb  dBb  bbb Recognized by Turing machines

  33. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars 1.6 The Z# Language

  34. Sample Z# Program main program class; no separate compilation class P const int size = 10; class Table { int[] pos; int[] neg; } Table val; { void Main () int x, i; { /*---------- initialize val ----------*/ val = new Table; val.pos = new int[size]; val.neg = new int[size]; i = 0; while (i < size) { val.pos[i] = 0; val.neg[i] = 0; i++; } /*---------- read values ----------*/ read(x); while (-size < x && x < size) { if (0 <= x) val.pos[x]++; else val.neg[-x]++; read(x); } } } inner classes (without methods) global variables local variables

  35. Keywords class if else while read write return break void const new Operators + - * / % ++ -- == != > >= < <= && || ( ) [ ] { } = ; ,. Comments may be nested /* ... */ Types intchar arrays classes Lexical Structure of Z# Names ident = letter { letter | digit | '_' }. Numbers all numbers are of type int number = digit { digit }. Char constants all character constants are of type char (may contain \r and \n) charConst = '\'' char '\''. no strings

  36. Declarations ConstDecl = "const" Type ident "=" ( number | charConst ) ";". VarDecl = Type ident { "," ident } ";". MethodDecl = (Type | "void") ident "(" [ FormPars ] ")" Block. Type = ident [ "[" "]" ]. FormPars = Type ident { "," Type ident }. only one-dimensional arrays Syntactical Structure of Z# (1) Programs Program = "class" ident { ConstDecl | VarDecl | ClassDecl } "{" { MethodDecl } "}". class P ... declarations ... { ... methods ... }

  37. Syntactical Structure of Z# (2) Statements Block = "{" {Statement} "}". Statement = Designator ( "=" Expr ";" | "(" [ActPars] ")" ";" | "++" ";" | "--" ";" ) | "if" "(" Condition ")" Statement [ "else" Statement ] | "while" "(" Condition ")" Statement | "break" ";" | "return" [ Expr ] ";" | "read" "(" Designator ")" ";" | "write" "(" Expr [ "," number ] ")" ";" | Block | ";". ActPars = Expr { "," Expr }. • input from System.Console • output to System.Console

  38. Syntactical Structure of Z# (3) Expressions Condition = CondTerm { "||" CondTerm }. CondTerm = CondFact { "&&" CondFact }. CondFact = Expr Relop Expr. Relop = "==" | "!=" | ">" | ">=" | "<" | "<=". Expr = [ "-" ] Term { Addop Term }. Term = Factor { Mulop Factor }. Factor = Designator [ "(" [ ActPars ] ")" ] | number | charConst | "new" ident [ "[" Expr "]" ] | "(" Expr ")". Designator = ident { "." ident | "[" Expr "]" }. Addop = "+" | "-". Mulop = "*" | "/" | "%". no constructors

  39. Project structure, Compilation, Usage Project structure NUnit Test Cases Command Line Compiler zsctest.dll zsclib.dll zsc.exe TokenTest.cs ScannerTest.cs ParserTest.cs SymTabTest.cs TestTextWriter.cs Token.cs Scanner.cs Parser.cs SymTab.cs CodeGen.cs ErrorStrings.cs Main.cs csc /target:library /out:zsclib.dll Token.cs Scanner.cs Parser.cs SymTab.cs CodeGen.cs ErrorStrings.cs csc /target:library /out:zsctest.dll /reference:zsclib.dll,"%NUnitHOME%\bin\nunit.framework.dll" TokenTest.cs ScannerTest.cs ParserTest.cs SymTabTest.cs TestTextWriter.cs csc /target:exe /out:zsc.exe /reference:zsclib.dll Main.cs Compilation: zsc myProg.zs Compiler myProg.exe myProg.zs Execution: CLR myProg.exe myProg.exe Disassembling: myProg.exe ildasm myProg.exe

More Related