
Theory of Compilation - Introduction, Lexical Analysis

Lecture 1 of the Theory of Compilation course by Erez Petrank. Learn about compilers, how they work, and the tools and techniques used. Important for understanding programming languages and building parsers.



Presentation Transcript


  1. Theory of Compilation 236360 • Erez Petrank • Lecture 1: Introduction, Lexical Analysis

  2. Theory of Compilation • Lecturer: Erez Petrank • erez@cs.technion.ac.il • Reception hour: Thursday 14:30-15:30, Taub 528. • Teaching assistants: • Adi Sosnovich (responsible TA) • Maya Arbel • Lior Friedman

  3. Administration • Site: http://webcourse.cs.technion.ac.il/236360 • Grade: • 25% homework (important!) • 5% dry (MAGEN) • 20% wet: compulsory • 75% Test • Failure in test means failure in course, independent of homework grade. • Prerequisite: Automata and formal languages 236353 • MOED GIMMEL (third exam sitting) for Miluim (military reserve duty) only • Copying (cheating) ...

  4. Books • Main book • A.V. Aho, M. S. Lam, R. Sethi, and J.D. Ullman – “Compilers – Principles, Techniques, and Tools”, Addison-Wesley, 2007. • Additional book: • Dick Grune, Kees van Reeuwijk, Henri E. Bal, Ceriel J.H. Jacobs, and Koen G. Langendoen. “Modern Compiler Design”, Second Edition, Wiley 2010.

  5. Turing Award • 2008: Barbara Liskov, programming languages and system design. • 2009: Charles P. Thacker, architecture. • 2010: Leslie Valiant, theory of computation. • 2011: Judea Pearl, artificial intelligence. • 2012: Goldwasser & Micali, cryptography.

  6. Goals • Understand what a compiler is, • How a compiler works, • Tools and techniques that can be used in other settings.

  7. Complexity

  8. What is a Compiler? • “A compiler is a computer program that transforms source code written in a programming language (source language) into another language (target language). The most common reason for wanting to transform source code is to create an executable program.” --Wikipedia

  9. “The Education of a Computer” 1952 Grace Hopper 1906 -- 1992 Image source: http://en.wikipedia.org/wiki/File:Museum_of_Science,_Boston,_MA_-_IMG_3163.JPG

  10. “The Education of a Computer” (Hopper 1952)

  11. 1957 John Backus and team at IBM The first complete compiler

  12. What is a Compiler? • Source text (txt) → Compiler → Executable code (exe) • Source languages: C, C++, Pascal, Java, Postscript, TeX, Perl, JavaScript, Python, Ruby, Prolog, Lisp, Scheme, ML, OCaml • Target languages: IA32, IA64, SPARC, C, C++, Pascal, Java, Java Bytecode, …

  13. What is a Compiler? • Source text (txt) → Compiler → Executable code (exe) • The source and target program are semantically equivalent. • Since the translation is difficult, it is partitioned into standard modular steps. • Example source: int a, b; a = 2; b = a*2 + 1; • Example target: MOV R1,2 SAL R1 INC R1 MOV R2,R1

  14. Anatomy of a Compiler (Coarse Grained) • Source text (txt) → Frontend (analysis) → Intermediate Representation → Backend (synthesis) → Executable code (exe) • Optimization • Example source: int a, b; a = 2; b = a*2 + 1; • Example target: MOV R1,2 SAL R1 INC R1 MOV R2,R1

  15. Modularity • Build m+n modules instead of m·n compilers… • Source languages 1, 2, …, n (txt) → Frontends SL1, SL2, SL3 → Intermediate Representation → Backends TL1, TL2, TL3 → Executable targets 1, 2, …, m (exe) • Example source: int a, b; a = 2; b = a*2 + 1; • Code sequences shown in the figure: SET R1,2 STORE #0,R1 SHIFT R1,1 STORE #1,R1 ADD R1,1 STORE #2,R1 and MOV R1,2 SAL R1 INC R1 MOV R2,R1

  16. Anatomy of a Compiler • Source text (txt) → Compiler → Executable code (exe) • Frontend (analysis): Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis • Intermediate Representation (IR) • Backend (synthesis): Code Generation • Example source: int a, b; a = 2; b = a*2 + 1; • Example target: MOV R1,2 SAL R1 INC R1 MOV R2,R1

  17. Interpreter • Source text (txt) → Frontend (analysis) → Intermediate Representation → Execution Engine • The execution engine reads the Input and produces the Output. • Example source: int a, b; a = 2; b = a*2 + 1;

  18. Compiler vs. Interpreter • Compiler: Source text (txt) → Frontend (analysis) → Intermediate Representation → Backend (synthesis) → Executable code (exe) • Interpreter: Source text (txt) → Frontend (analysis) → Intermediate Representation → Execution Engine, which reads the Input and produces the Output

  19. Compiler vs. Interpreter • Compiler: b = a*2 + 1; → Frontend (analysis) → Intermediate Representation → Backend (synthesis) → MOV R1,8(ebp) SAL R1 INC R1 MOV R2,R1; the generated code maps input 3 to output 7 • Interpreter: b = a*2 + 1; → Frontend (analysis) → Intermediate Representation → Execution Engine, which maps input 3 to output 7

  20. Just-in-time Compiler (e.g., Java) • Java Source (txt) → Java source to Java bytecode compiler → Java Bytecode (txt) → Java Virtual Machine, which reads the Input and produces the Output • Machine dependent and optimized. • Just-in-time compilation: the bytecode interpreter (in the JVM) compiles program fragments during interpretation to avoid expensive re-interpretation.

  21. Importance • Many in this class will build a parser some day • Or wish they knew how to build one… • Useful techniques and algorithms • Lexical analysis / parsing • Intermediate representation • … • Register allocation • Understand programming languages better • Understand internals of compilers • Understanding of compilation versus runtime, • Understanding of how the compiler treats the program (how to improve the efficiency, how to use error messages), • Understand (some) details of target architectures

  22. Complexity: Various areas, theory and practice. • Useful formalisms • Regular expressions • Context-free grammars • Attribute grammars • Data structures • Algorithms • Areas surrounding the compiler (Source → Compiler → Target): Operating systems, Runtime environment, Garbage collection, Architecture, Programming Languages, Software Engineering

  23. Course Overview • Source text (txt) → Compiler → Executable code (exe) • Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Inter. Rep. (IR) → Code Gen.

  24. Front End Ingredients • Character Stream → Lexical Analyzer → Token Stream → Syntax Analyzer → Syntax Tree → Semantic Analyzer → Decorated Syntax Tree → Intermediate Code Generator • Lexical Analyzer: easy, regular expressions. • Syntax Analyzer: more complex, recursive, context-free grammar. • Semantic Analyzer: more complex, recursive, requires walking up and down in the derivation tree. • Intermediate Code Generator: get code from the tree. Some optimizations are easier on a tree, and some easier on the code.

  25. Lexical Analysis • Input (txt): x = b*b – 4*a*c • Token Stream: <ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS> <INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”> • Phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.
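The character-to-token step on this slide can be sketched in a few lines of Python. This is only an illustration, not the course's lexer: the token kinds mirror the slide, but the regular expressions in `TOKEN_SPEC` and the function name `tokenize` are assumptions, and the minus sign is written as plain ASCII `-`.

```python
import re

# Hypothetical token specification: (kind, pattern) pairs, tried in order.
TOKEN_SPEC = [
    ("INT",   r"\d+"),
    ("ID",    r"[A-Za-z][A-Za-z0-9]*"),
    ("EQ",    r"="),
    ("MULT",  r"\*"),
    ("MINUS", r"-"),
    ("SKIP",  r"\s+"),   # whitespace separates lexemes and yields no token
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def tokenize(text):
    """Turn a character string into a list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        kind = match.lastgroup
        lexeme = match.group()
        if kind == "INT":
            tokens.append((kind, int(lexeme)))
        elif kind == "ID":
            tokens.append((kind, lexeme))
        elif kind != "SKIP":
            tokens.append((kind,))   # operators carry no attribute
        pos = match.end()
    return tokens


print(tokenize("x = b*b - 4*a*c"))
```

Note that this sketch would split `3oranges` into an INT followed by an ID rather than reject it; a real lexer needs an extra rule for that case.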

  26. Syntax Analysis (Parsing) • Input tokens: <ID,”b”> <MULT> <ID,”b”> <MINUS> <INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”> • [Figure: Syntax Tree deriving the input through expression, term, and factor nodes, with MINUS and MULT operators and leaves ‘b’, ‘b’, ‘4’, ‘a’, ‘c’] • Phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.

  27. Simplified Tree • Abstract Syntax Tree: MINUS( MULT(‘b’, ‘b’), MULT( MULT(‘4’, ‘a’), ‘c’ ) ) • Phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.
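To make the step from token stream to abstract syntax tree concrete, here is a small recursive-descent sketch. The grammar it assumes (expression → term (MINUS term)*, term → factor (MULT factor)*, factor → ID | INT) is a simplification chosen for this example and may differ from the grammar drawn on the slides; `parse_expression` and the nested-tuple tree encoding are hypothetical.

```python
def parse_expression(tokens):
    """Parse a list of (kind, value) tokens into a nested-tuple AST."""
    pos = 0

    def peek():
        # Kind of the next unread token, or None at end of input.
        return tokens[pos][0] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        if peek() in ("ID", "INT"):
            value = tokens[pos][1]
            pos += 1
            return value
        raise SyntaxError(f"expected ID or INT at token {pos}")

    def term():
        nonlocal pos
        node = factor()
        while peek() == "MULT":          # left-associative multiplication
            pos += 1
            node = ("MULT", node, factor())
        return node

    def expression():
        nonlocal pos
        node = term()
        while peek() == "MINUS":         # MINUS binds weaker than MULT
            pos += 1
            node = ("MINUS", node, term())
        return node

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


tokens = [("ID", "b"), ("MULT",), ("ID", "b"), ("MINUS",),
          ("INT", 4), ("MULT",), ("ID", "a"), ("MULT",), ("ID", "c")]
print(parse_expression(tokens))
```

The parser builds the simplified tree directly and skips the intermediate expression/term/factor nodes of the full syntax tree.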

  28. Semantic Analysis • Annotated Abstract Syntax Tree (every node has type: int): • MINUS (loc: R1) • left operand MULT (loc: R1) over ‘b’ (loc: sp+16) and ‘b’ (loc: sp+16) • right operand MULT (loc: R2) over MULT (loc: R2) and ‘c’ (loc: sp+24) • inner MULT (loc: R2) over ‘4’ (loc: const) and ‘a’ (loc: sp+8) • Phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.

  29. Intermediate Representation • [Figure: the annotated abstract syntax tree from the Semantic Analysis slide] • Intermediate Representation: R2 = 4*a • R1 = b*b • R2 = R2*c • R1 = R1-R2 • Phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.
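One way to obtain such three-address instructions is a post-order walk of the tree that emits one instruction per operator node. The sketch below is an assumption-laden illustration: `gen_ir` is a hypothetical name, it allocates a fresh temporary (`t1`, `t2`, …) for every node instead of reusing R1 and R2 as the slide does, and it visits the left operand first, so the instruction order differs from the slide.

```python
def gen_ir(ast):
    """Emit three-address code for a nested-tuple AST; return (code, result_name)."""
    code = []
    counter = 0
    symbols = {"MULT": "*", "MINUS": "-"}

    def visit(node):
        nonlocal counter
        if not isinstance(node, tuple):
            return str(node)             # leaf: identifier or constant
        op, left, right = node
        left_name = visit(left)          # post-order: operands before operator
        right_name = visit(right)
        counter += 1
        temp = f"t{counter}"
        code.append(f"{temp} = {left_name} {symbols[op]} {right_name}")
        return temp

    result = visit(ast)
    return code, result


ast = ("MINUS", ("MULT", "b", "b"), ("MULT", ("MULT", 4, "a"), "c"))
for line in gen_ir(ast)[0]:
    print(line)
```

Mapping the unbounded temporaries onto a small set of registers is the job of register allocation, which the lecture lists under machine code generation.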

  30. Generating Code • [Figure: the annotated abstract syntax tree from the Semantic Analysis slide] • Intermediate Representation: R2 = 4*a • R1 = b*b • R2 = R2*c • R1 = R1-R2 • Assembly Code: MOV R2,(sp+8) • SAL R2,2 • MOV R1,(sp+16) • MUL R1,(sp+16) • MUL R2,(sp+24) • SUB R1,R2 • Phases: Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.

  31. The Symbol Table • A data structure that holds attributes for each identifier, and provides fast access to them. • Example: location in memory, type, scope. • The table is built during the compilation process. • E.g., the lexical analysis cannot tell what the type is, the location in memory is discovered only during code generation, etc.
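A minimal sketch of such a table, assuming a stack of dictionaries for nested scopes. The class name `SymbolTable` and its methods are invented for this example. The usage at the end mimics the slide's point that different phases fill in different attributes at different times.

```python
class SymbolTable:
    """Maps identifiers to attribute dictionaries, with nested scopes."""

    def __init__(self):
        self.scopes = [{}]               # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attrs):
        scope = self.scopes[-1]
        if name in scope:
            raise KeyError(f"{name!r} already declared in this scope")
        scope[name] = dict(attrs)

    def lookup(self, name):
        # Search from the innermost scope outwards; dict lookup is O(1) on average.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

    def set_attr(self, name, key, value):
        entry = self.lookup(name)
        if entry is None:
            raise KeyError(f"undefined identifier {name!r}")
        entry[key] = value


table = SymbolTable()
table.declare("a")                       # identifier first seen by an early phase
table.set_attr("a", "type", "int")       # type attached during semantic analysis
table.set_attr("a", "loc", "sp+8")       # location known only at code generation
print(table.lookup("a"))
```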

  32. Error Checking • An important part of the process. • Done at each stage. • Lexical analysis: illegal tokens • Syntax analysis: illegal syntax • Semantic analysis: incompatible types, undefined variables, … • Each phase tries to recover and proceed with compilation (why?)

  33. Errors in lexical analysis • pi = 3.141.562 (txt): Illegal token • pi = 3oranges (txt): Illegal token • pi = oranges3 (txt): <ID,”pi”>, <EQ>, <ID,”oranges3”>

  34. Error detection: type checking • Input (txt): x = 4*a*”oranges” • Annotated tree: outer MULT (type: int, loc: R2) over inner MULT (type: int, loc: R2) and “oranges” (type: string, loc: const) • inner MULT over ‘4’ (type: int, loc: const) and ‘a’ (type: int, loc: sp+8)
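The type error on this slide can be detected by a bottom-up pass that computes a type for every tree node. The sketch below assumes a tagged-tuple tree encoding and a rule that MULT and MINUS accept only int operands; both are assumptions made for illustration, and `type_of` is a hypothetical name.

```python
def type_of(node, env):
    """Return the type of an expression tree, or raise TypeError on a mismatch.

    Leaves are ("INT", value), ("STR", value) or ("ID", name);
    inner nodes are (operator, left, right). env maps variable names to types.
    """
    kind = node[0]
    if kind == "INT":
        return "int"
    if kind == "STR":
        return "string"
    if kind == "ID":
        if node[1] not in env:
            raise TypeError(f"undefined variable {node[1]!r}")
        return env[node[1]]
    if kind in ("MULT", "MINUS"):
        left = type_of(node[1], env)     # children are typed first (bottom-up)
        right = type_of(node[2], env)
        if left != "int" or right != "int":
            raise TypeError(f"{kind} expects int operands, got {left} and {right}")
        return "int"
    raise ValueError(f"unknown node kind {kind!r}")


env = {"a": "int"}
ok = ("MULT", ("INT", 4), ("ID", "a"))
bad = ("MULT", ok, ("STR", "oranges"))   # 4*a*"oranges" from the slide
print(type_of(ok, env))
try:
    type_of(bad, env)
except TypeError as err:
    print("type error:", err)
```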

  35. The Real Anatomy of a Compiler • Source text (txt) → Compiler → Executable code (exe) • Stages: Process text input → Lexical Analysis → Syntax Analysis → Sem. Analysis → Intermediate code generation → Intermediate code optimization → Code generation → Target code optimization → Machine code generation → Write executable output • Data passed between stages, in order of appearance: characters, tokens, AST, Annotated AST, IR, IR, Symbolic Instructions, MI

  36. Optimizations • “Optimal code” is out of reach • many problems are undecidable or too expensive (NP-complete) • Use approximation and/or heuristics • Must preserve correctness, should (mostly) improve code, should run fast. • Improvements in time, space, energy, etc. • This part takes most of the compilation time. • A major question: how much time should be invested in optimization to make the code run faster. • Answer changes for the JIT setting.

  37. Optimization Examples • Loop optimizations: invariants, unrolling, … • Peephole optimizations • Constant propagation • Leverage compile-time information to save work at runtime (pre-computation) • Dead code elimination • space • …
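As a small illustration of using compile-time information to save work at runtime, the sketch below combines constant propagation with constant folding on an expression tree. The nested-tuple encoding, the operator names, and the function `fold` are assumptions for this example; production compilers usually perform these passes on the intermediate representation with dataflow analysis.

```python
def fold(node, constants):
    """Replace known variables by their values and evaluate constant subtrees.

    node is an int, a variable name (str), or (op, left, right);
    constants maps variable names to values known at compile time.
    """
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        return constants.get(node, node)   # propagate a known constant
    op, left, right = node
    left = fold(left, constants)
    right = fold(right, constants)
    if isinstance(left, int) and isinstance(right, int):
        if op == "MULT":
            return left * right
        if op == "PLUS":
            return left + right
        if op == "MINUS":
            return left - right
    return (op, left, right)               # not fully constant: keep the node


# a = 2; b = a*2 + 1  becomes  b = 5 with no arithmetic left for runtime.
print(fold(("PLUS", ("MULT", "a", 2), 1), {"a": 2}))
# Only the constant subtree is folded when x is unknown.
print(fold(("MULT", "x", ("PLUS", 1, 2)), {}))
```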

  38. Modern Optimization Challenges • Main challenge is to exploit modern platforms • Multicores • Vector instructions • Memory hierarchy • Is the code sent on the net (Java byte-code)?

  39. Machine code generation • A major goal: determine location of variables. • Register allocation • Optimal register assignment is NP-Complete • In practice, known heuristics perform well • assign variables to memory locations • Instruction selection • Convert IR to actual machine instructions • Modern architectures • Multicores • Challenging memory hierarchies

  40. Compiler Construction Toolset • Lexical analysis generators • lex • Parser generators • yacc • Syntax-directed translators • Dataflow analysis engines

  41. Summary • A compiler is a program that translates code from a source language to a target language • Compilers play a critical role • Bridge from programming languages to the machine • Many useful techniques and algorithms • Many useful tools (e.g., lexer/parser generators) • Compilers are constructed from modular phases • Reusable • Debuggable, understandable. • Different front/back ends

  42. Theory of Compilation Lexical Analysis

  43. You are here • Source text (txt) → Compiler → Executable code (exe) • Lexical Analysis (you are here) → Syntax Analysis (Parsing) → Semantic Analysis → Inter. Rep. (IR) → Code Gen.

  44. From characters to tokens • What is a token? • Roughly – a “word” in the source language • Identifiers • Values • Language keywords • (Really - anything that should appear in the input to syntax analysis) • Technically • A token is a pair of (kind,value)

  45. Example: kinds of tokens

  46. Strings with special handling

  47. The Terminology • Lexeme (aka symbol): a series of letters separated from the rest of the program according to a convention (space, semicolon, comma, etc.) • Pattern: a rule specifying a set of strings. Example: “an identifier is a string that starts with a letter and continues with letters and digits”. • Token: a pair of (pattern, attributes)
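The relation between lexemes and patterns can be sketched with regular expressions: a lexeme is legal if some pattern matches it in full. The identifier pattern follows the slide's definition. The `NUM` and `EQ` patterns and the function `classify` are assumptions added so that the lecture's examples `3.141.562`, `3oranges`, and `oranges3` can be tried.

```python
import re

# Hypothetical patterns, each specifying a set of strings.
PATTERNS = [
    ("NUM", re.compile(r"\d+(\.\d+)?")),          # digits, optional single fraction
    ("ID",  re.compile(r"[A-Za-z][A-Za-z0-9]*")), # letter, then letters and digits
    ("EQ",  re.compile(r"=")),
]


def classify(lexeme):
    """Return the kind of the first pattern matching the whole lexeme, else None."""
    for kind, pattern in PATTERNS:
        if pattern.fullmatch(lexeme):
            return kind
    return None


# Lexemes are separated here by whitespace, one possible convention.
for line in ["pi = 3.141.562", "pi = 3oranges", "pi = oranges3"]:
    print([(lexeme, classify(lexeme)) for lexeme in line.split()])
```

A result of `None` corresponds to an illegal token: the lexeme satisfies no pattern.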

  48. From characters to tokens • Input (txt): x = b*b – 4*a*c • Token Stream: <ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS> <INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>

  49. Errors in lexical analysis • pi = 3.141.562 (txt): Illegal token • pi = 3oranges (txt): Illegal token • pi = oranges3 (txt): <ID,”pi”>, <EQ>, <ID,”oranges3”>

  50. Error Handling • Many errors cannot be identified at this stage. • For example: “fi (a==f(x))”. Should “fi” be “if”? Or is it a routine name? • We will discover this later in the analysis. • At this point, we just create an identifier token. • But sometimes the lexeme does not satisfy any pattern. • Easiest: eliminate letters until the beginning of a legitimate lexeme. • Alternatives: eliminate one letter, add one letter, replace one letter, swap two adjacent letters, etc. • Goal: allow the compilation to continue. • Problem: errors that spread all over.
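The "easiest" recovery strategy above, dropping characters until a legitimate lexeme begins, might look like the following sketch. The token pattern and the name `tokenize_with_recovery` are assumptions made for this example. Errors are collected and scanning continues, so that one bad character does not stop the compilation.

```python
import re

# Hypothetical token patterns for a tiny expression language.
TOKEN = re.compile(
    r"(?P<INT>\d+)|(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<OP>[=*+\-()])|(?P<SKIP>\s+)"
)


def tokenize_with_recovery(text):
    """Return (tokens, errors); illegal characters are skipped one at a time."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            # No pattern starts here: record the error, drop one character,
            # and retry from the next position.
            errors.append((pos, text[pos]))
            pos += 1
            continue
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, errors


tokens, errors = tokenize_with_recovery("a = b @# 3")
print(tokens)
print(errors)
```

Each skipped character is reported separately here, which hints at the "errors that spread all over" problem: a real lexer would typically merge adjacent errors into one message.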
