
Compilers

Compilers. Week 1, 2005/09/01. 권혁철. Making Languages Usable.



Presentation Transcript


  1. Compilers • Week 1, 2005/09/01 • 권혁철

  2. Making Languages Usable • “It was our belief that if FORTRAN, during its first months, were to translate any reasonable ‘scientific’ source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger... I believe that had we failed to produce efficient programs, the widespread use of languages like FORTRAN would have been seriously delayed.” (John Backus) • 18 person-years to complete!

  3. Compiler construction • Compiler writing is perhaps the most pervasive topic in computer science, involving many fields: • Programming languages • Architecture • Theory of computation • Algorithms • Software engineering • In this course, you will put everything you have learned together. Exciting, right??

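  4. Assignments • It might be the biggest program you’ve ever written. • It cannot be done the day it’s due! • Follow the schedule laid out in the syllabus. • The TA will write the program in advance and then provide the key modules. • You will also be able to run the TA’s program. • The aim is for everyone, as far as possible, to complete a compiler. • Anyone who is not confident about programming should meet the TA in person to get help.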

  5. A quick question • Consider the grammar shown below (<S> is the start symbol). Circle the strings below that are in the language described by the grammar. There may be zero or more correct answers. • Grammar: • <S> ::= <A> a <B> b • <A> ::= b <A> | b • <B> ::= <A> a | a • Strings: A) baab B) bbbabb C) bbaaaa D) baaabb E) bbbabab • Write a grammar for the language whose sentences consist of some number of a’s followed by the same number of b’s. For example, aaabbb is in the language, aabbb is not, and the empty string is not in the language.
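
One way to check candidate strings against the first grammar is a small hand-written recognizer. The Python sketch below transcribes each production into a function that returns every position where a match could end. The function names (`match_A`, `match_B`, `in_language`) are illustrative assumptions and are not part of the course material.

```python
def match_A(s, i):
    """<A> ::= b <A> | b. Return every index at which an <A> starting at i can end."""
    ends = set()
    while i < len(s) and s[i] == "b":
        i += 1
        ends.add(i)
    return ends


def match_B(s, i):
    """<B> ::= <A> a | a. Return every index at which a <B> starting at i can end."""
    ends = set()
    if i < len(s) and s[i] == "a":          # alternative: a
        ends.add(i + 1)
    for j in match_A(s, i):                 # alternative: <A> a
        if j < len(s) and s[j] == "a":
            ends.add(j + 1)
    return ends


def in_language(s):
    """<S> ::= <A> a <B> b. True if the whole string s can be derived from <S>."""
    for j in match_A(s, 0):
        if j < len(s) and s[j] == "a":
            for k in match_B(s, j + 1):
                if s[k:] == "b":            # exactly one trailing b must remain
                    return True
    return False


if __name__ == "__main__":
    candidates = ["baab", "bbbabb", "bbaaaa", "baaabb", "bbbabab"]
    for label, text in zip("ABCDE", candidates):
        print(label, text, in_language(text))
```

Returning a set of end positions, rather than a single one, lets the recognizer try every way of splitting the string, so it does not depend on the grammar being deterministic.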

  6. What is a compiler? • Diagram: Source Program → Compiler → Target Program, with Error Messages as a second output • The source language might be • General purpose, e.g. C or Pascal • A “little language” for a specific domain, e.g. SIML • The target language might be • Some other programming language • The machine language of a specific machine

  7. Related terminology • Interpreter: a program that reads an executable program and produces the results of executing that program • Target machine: the machine on which the compiled program is to be run • Cross-compiler: a compiler that runs on a different type of machine than its target • Compiler-compiler: a tool that simplifies the construction of compilers (YACC/JCUP)

  8. Is it hard?? • In the 1950s, compiler writing took an enormous amount of effort. • The first FORTRAN compiler took 18 person-years • Today, though, we have very good software tools • You will write your own compiler in a team of 3 in one semester!

  9. Intrinsic interest • Compiler construction involves ideas from many different parts of computer science: • Artificial intelligence: greedy algorithms, heuristic search techniques • Algorithms: graph algorithms, union-find, dynamic programming • Theory: DFAs & PDAs, pattern matching, fixed-point algorithms • Systems: allocation & naming, synchronization, locality • Architecture: pipeline & hierarchy management, instruction set use

  10. Intrinsic merit • Compiler construction poses challenging and interesting problems: • Compilers must do a lot but also run fast • Compilers have primary responsibility for run-time performance • Compilers are responsible for making it acceptable to use the full power of the programming language • Computer architects perpetually create new challenges for the compiler by building more complex machines • Compilers must hide that complexity from the programmer • Success requires mastery of complex interactions

  11. High-level View of a Compiler • Diagram: Source code → Compiler → Machine code, with Errors as a second output • Implications: • Must recognize legal (and illegal) programs • Must generate correct code • Must manage storage of all variables (and code) • Must agree with OS & linker on format for object code

  12. Two Pass Compiler • We break compilation into two phases: • ANALYSIS breaks the program into pieces and creates an intermediate representation of the source program. • SYNTHESIS constructs the target program from the intermediate representation. • Sometimes we call the analysis part the FRONT END and the synthesis part the BACK END of the compiler. They can be written independently.

  13. Traditional Two-pass Compiler • Diagram: Source code → Front End → IR → Back End → Machine code, with Errors as an additional output • Implications: • Use an intermediate representation (IR) • Front end maps legal source code into IR • Back end maps IR into target machine code • Admits multiple front ends & multiple passes (better code) • Typically, the front end runs in O(n) or O(n log n) time, while many back-end problems are NP-Complete

  14. A Common Fallacy • Diagram: four front ends (Fortran, Scheme, Java, Smalltalk) connected to three back ends (SPARC, i86, Power PC) • Can we build n x m compilers with n+m components? • Must encode all language specific knowledge in each front end • Must encode all features in a single IR • Must encode all target specific knowledge in each back end • Limited success in systems with very low-level IRs

  15. Source code analysis • Analysis is important for many applications besides compilers: • STRUCTURE EDITORS try to fill out syntax units as you type • PRETTY PRINTERS highlight comments, indent your code for you, and so on • STATIC CHECKERS try to find programming bugs without actually running the program • INTERPRETERS don’t bother to produce target code, but just perform the requested operations (e.g. Matlab)

  16. Source code analysis • Analysis comes in three phases: • LINEAR ANALYSIS processes characters left-to-right and groups them into TOKENS • HIERARCHICAL ANALYSIS groups tokens hierarchically into nested collections of tokens • SEMANTIC ANALYSIS makes sure the program components fit together, e.g. variables should be declared before they are used

  17. Linear (lexical) analysis • The linear analysis stage is called LEXICAL ANALYSIS or SCANNING. • Example: position = initial + rate * 60 gets translated as: • The IDENTIFIER “position” • The ASSIGNMENT SYMBOL “=” • The IDENTIFIER “initial” • The PLUS OPERATOR “+” • The IDENTIFIER “rate” • The MULTIPLICATION OPERATOR “*” • The NUMERIC LITERAL 60
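
The scanning step above can be sketched with Python's standard `re` module. The token class names follow the slide. The regular expressions and the `scan` function are assumptions made for illustration, and they cover only the tokens that appear in the example.

```python
import re

# Token classes named on the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMERIC_LITERAL", r"\d+"),
    ("ASSIGNMENT_SYMBOL", r"="),
    ("PLUS_OPERATOR", r"\+"),
    ("MULTIPLICATION_OPERATOR", r"\*"),
    ("SKIP", r"\s+"),  # whitespace separates tokens but is not itself a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source):
    """Process characters left to right and group them into (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # A lexical error: the characters here do not form a legal token.
            raise SyntaxError(f"illegal character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, text in scan("position = initial + rate * 60"):
        print(kind, text)
```

Describing each token class with a regular expression is the same input format that scanner generators accept, which is why a table of patterns is used here instead of hand-written character tests.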

  18. Hierarchical (syntax) analysis • The hierarchical stage is called SYNTAX ANALYSIS or PARSING. • The hierarchical structure of the source program can be represented by a PARSE TREE, for example:

  19. Parse tree for position = initial + rate * 60:
      assignment statement
        identifier: position
        =
        expression
          expression
            identifier: initial
          +
          expression
            expression
              identifier: rate
            *
            expression
              number: 60

  20. Syntax analysis • The hierarchical structure of the syntactic units in a programming language is normally represented by a set of recursive rules. Example for expressions: • Any identifier is an expression • Any number is an expression • If expression1 and expression2 are expressions, so are: expression1 + expression2; expression1 * expression2; ( expression1 )

  21. Syntax analysis • Example for statements: • If identifier1 is an identifier and expression2 is an expression, then identifier1 = expression2 is a statement. • If expression1 is an expression and statement2 is a statement, then the following are statements: while ( expression1 ) statement2; if ( expression1 ) statement2

  22. Lexical vs. syntactic analysis • Generally if a syntactic unit can be recognized in a linear scan, we convert it into a token during lexical analysis. • More complex syntactic units, especially recursive structures, are normally processed during syntactic analysis (parsing). • Identifiers, for example, can be recognized easily in a linear scan, so identifiers are tokenized during lexical analysis.

  23. Source code analysis • It is common to convert complex parse trees to simpler SYNTAX TREES, with a node for each operator and children for the operands of each operator. • Example: analysis of position = initial + rate * 60 yields the syntax tree:
      =
        position
        +
          initial
          *
            rate
            60
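
A minimal recursive-descent parser can build this kind of syntax tree directly. The sketch below is an illustration under assumptions: the tree is encoded as nested tuples `(operator, left, right)`, `*` is assumed to bind tighter than `+` (the recursive rules on the earlier slides do not fix a precedence), and the names `tokenize` and `Parser` are hypothetical.

```python
import re


def tokenize(source):
    # Simplified scanner: identifiers, integer literals, and single-character
    # symbols. Characters that match none of these are silently ignored.
    return re.findall(r"[A-Za-z_]\w*|\d+|[=+*()]", source)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        self.pos += 1
        return tok

    def statement(self):
        # statement -> identifier '=' expression
        target = self.take()
        if not (target[0].isalpha() or target[0] == "_"):
            raise SyntaxError(f"expected an identifier, found {target!r}")
        self.take("=")
        tree = ("=", target, self.expression())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expression(self):
        # expression -> term { '+' term }
        node = self.term()
        while self.peek() == "+":
            self.take("+")
            node = ("+", node, self.term())
        return node

    def term(self):
        # term -> factor { '*' factor }
        node = self.factor()
        while self.peek() == "*":
            self.take("*")
            node = ("*", node, self.factor())
        return node

    def factor(self):
        # factor -> identifier | number | '(' expression ')'
        tok = self.take()
        if tok == "(":
            node = self.expression()
            self.take(")")
            return node
        if tok.isdigit():
            return int(tok)
        if tok[0].isalpha() or tok[0] == "_":
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")


if __name__ == "__main__":
    print(Parser(tokenize("position = initial + rate * 60")).statement())
```

Parentheses leave no node of their own in the result, which is one of the ways a syntax tree is simpler than the full parse tree.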

  24. Semantic analysis • The semantic analysis stage: • Checks for semantic errors, e.g. undeclared variables • Gathers type information • Determines the operators and operands of expressions • Example: if rate is a float, the integer literal 60 should be converted to a float before multiplying, giving the tree:
      =
        position
        +
          initial
          *
            rate
            inttoreal
              60
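
The conversion in this example can be sketched as a recursive walk over the syntax tree. This is an illustration under assumptions: the tree is encoded as nested tuples, only the two types `"int"` and `"real"` exist, and the `check` function and the `types` dictionary (standing in for declarations) are hypothetical names.

```python
def check(node, types):
    """Return (typed_tree, type), inserting inttoreal where an int meets a real."""
    if isinstance(node, int):
        return node, "int"
    if isinstance(node, str):
        if node not in types:
            # A semantic error: the variable was never declared.
            raise NameError(f"undeclared variable {node!r}")
        return node, types[node]

    op, left, right = node
    left, left_type = check(left, types)
    right, right_type = check(right, types)

    if op == "=":
        if left_type == "real" and right_type == "int":
            right = ("inttoreal", right)
        elif left_type != right_type:
            raise TypeError(f"cannot assign {right_type} to {left_type}")
        return (op, left, right), left_type

    if left_type != right_type:
        # Mixed arithmetic: widen the int operand and compute in real.
        if left_type == "int":
            left = ("inttoreal", left)
        else:
            right = ("inttoreal", right)
        return (op, left, right), "real"
    return (op, left, right), left_type


if __name__ == "__main__":
    tree = ("=", "position", ("+", "initial", ("*", "rate", 60)))
    declared = {"position": "real", "initial": "real", "rate": "real"}
    typed_tree, _ = check(tree, declared)
    print(typed_tree)
```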

  25. The rest of the process • source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program • The symbol-table manager and the error handler interact with every phase.

  26. Symbol-table management • During analysis, we record the identifiers used in the program. • The symbol table stores each identifier with its ATTRIBUTES. • Example attributes: • How much STORAGE is allocated for the id • The id’s TYPE • The id’s SCOPE • For functions, the PARAMETER PROTOCOL • Some attributes can be determined immediately; some are delayed.
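
A symbol table along these lines can be sketched as a stack of dictionaries, one per nested scope. The class and field names below are assumptions for illustration, and the 4-byte sizes are placeholder values rather than figures from the course.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Symbol:
    name: str
    type: Optional[str] = None           # may be unknown when the id is first seen
    size: Optional[int] = None           # storage in bytes, once known
    scope: int = 0                       # nesting depth of the declaring scope
    params: Optional[List[str]] = None   # parameter protocol, for functions only


class SymbolTable:
    def __init__(self):
        self.scopes = [{}]               # index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type=None, size=None, params=None):
        current = self.scopes[-1]
        if name in current:
            raise NameError(f"{name!r} is already declared in this scope")
        symbol = Symbol(name, type, size, scope=len(self.scopes) - 1, params=params)
        current[name] = symbol
        return symbol

    def lookup(self, name):
        # Search from the innermost scope outward.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"undeclared identifier {name!r}")


if __name__ == "__main__":
    table = SymbolTable()
    table.declare("rate")                # recorded as soon as the scanner sees it
    table.lookup("rate").type = "real"   # attribute filled in later
    table.lookup("rate").size = 4        # assumed size, for illustration only
    table.enter_scope()
    table.declare("i", type="int", size=4)
    print(table.lookup("rate"))
    print(table.lookup("i"))
    table.exit_scope()
```

Leaving `type` and `size` optional mirrors the point that some attributes are known immediately while others are filled in by later phases.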

  27. Error detection • Each compilation phase can have errors • Normally, we want to keep processing after an error, in order to find more errors. • Each stage has its own characteristic errors, e.g. • Lexical analysis: a string of characters that do not form a legal token • Syntax analysis: unmatched { } or missing ; • Semantic: trying to add a float and a pointer

  28. Internal Representations • Each stage of processing transforms a representation of the source code program into a new representation. • Input: position = initial + rate * 60 • After the lexical analyzer: id1 = id2 + id3 * 60 (identifiers are entered in the symbol table) • After the syntax analyzer: the tree =(id1, +(id2, *(id3, 60))) • After the semantic analyzer: the tree =(id1, +(id2, *(id3, inttoreal(60))))

  29. Intermediate code generation • Some compilers explicitly create an intermediate representation of the source code program after semantic analysis. • The representation is as a program for an abstract machine. • Most common representation is “three-address code” in which all memory locations are treated as registers, and most instructions apply an operator to two operand registers, and store the result to a destination register.

  30. Intermediate code generation • Input from the semantic analyzer: the tree =(position, +(initial, *(rate, inttoreal(60)))) • Output of the intermediate code generator:
      temp1 := inttoreal(60)
      temp2 := id3 * temp1
      temp3 := id2 + temp2
      id1 := temp3
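
The translation from the typed tree to three-address code can be sketched as a post-order walk that allocates a fresh temporary for every operator node. The tuple encoding of the tree, the `ids` mapping from names to symbol-table entries, and the function name `gen_tac` are assumptions made for this illustration.

```python
def gen_tac(tree, ids):
    """Translate an assignment tree into a list of three-address instructions."""
    code = []
    counter = 0

    def new_temp():
        nonlocal counter
        counter += 1
        return f"temp{counter}"

    def walk(node):
        # Leaves produce no code; they are simply named.
        if isinstance(node, str):
            return ids[node]
        if isinstance(node, (int, float)):
            return str(node)
        if node[0] == "inttoreal":
            operand = walk(node[1])
            temp = new_temp()
            code.append(f"{temp} := inttoreal({operand})")
            return temp
        op, left, right = node
        left_name = walk(left)
        right_name = walk(right)
        temp = new_temp()
        code.append(f"{temp} := {left_name} {op} {right_name}")
        return temp

    _, target, expression = tree
    code.append(f"{ids[target]} := {walk(expression)}")
    return code


if __name__ == "__main__":
    tree = ("=", "position", ("+", "initial", ("*", "rate", ("inttoreal", 60))))
    ids = {"position": "id1", "initial": "id2", "rate": "id3"}
    for line in gen_tac(tree, ids):
        print(line)
```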

  31. The Optimizer (or Middle End) • Diagram: IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → ... → Opt n → IR, with Errors as an additional output • Modern optimizers are structured as a series of passes • Typical Transformations: • Discover & propagate some constant value • Move a computation to a less frequently executed place • Specialize some computation based on context • Discover a redundant computation & remove it • Remove useless or unreachable code • Encode an idiom in some particularly efficient form

  32. Code optimization • At this stage, we improve the code to make it run faster. • Input to the code optimizer:
      temp1 := inttoreal(60)
      temp2 := id3 * temp1
      temp3 := id2 + temp2
      id1 := temp3
      Output of the code optimizer:
      temp1 := id3 * 60.0
      id1 := id2 + temp1
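
The improvement in this example can be reproduced with three small passes over the three-address code. The sketch assumes each instruction is a tuple `(dest, op, arg1, arg2)` and that every temporary is used exactly once, which holds for this example. The pass names are hypothetical, and a real optimizer would need data-flow information before making these rewrites.

```python
def fold_conversions(code):
    """Evaluate inttoreal on integer constants at compile time."""
    constants, out = {}, []
    for dest, op, a, b in code:
        a, b = constants.get(a, a), constants.get(b, b)
        if op == "inttoreal" and a.isdigit():
            constants[dest] = str(float(a))   # "60" becomes "60.0"
        else:
            out.append((dest, op, a, b))
    return out


def forward_copies(code):
    """Rewrite `t := x op y; d := t` as `d := x op y` (assumes t is not used again)."""
    out = []
    for dest, op, a, b in code:
        if op == "copy" and out and out[-1][0] == a and a.startswith("temp"):
            _, prev_op, prev_a, prev_b = out.pop()
            out.append((dest, prev_op, prev_a, prev_b))
        else:
            out.append((dest, op, a, b))
    return out


def renumber_temps(code):
    """Rename the surviving temporaries to temp1, temp2, ... in order of appearance."""
    names = {}

    def rename(x):
        if x is not None and x.startswith("temp"):
            return names.setdefault(x, f"temp{len(names) + 1}")
        return x

    return [(rename(d), op, rename(a), rename(b)) for d, op, a, b in code]


def show(instruction):
    dest, op, a, b = instruction
    if op == "copy":
        return f"{dest} := {a}"
    if b is None:
        return f"{dest} := {op}({a})"
    return f"{dest} := {a} {op} {b}"


if __name__ == "__main__":
    code = [
        ("temp1", "inttoreal", "60", None),
        ("temp2", "*", "id3", "temp1"),
        ("temp3", "+", "id2", "temp2"),
        ("id1", "copy", "temp3", None),
    ]
    for optimization in (fold_conversions, forward_copies, renumber_temps):
        code = optimization(code)
    for instruction in code:
        print(show(instruction))
```

Running the passes one after another, each reading and producing the same instruction format, follows the "series of passes" structure described for the optimizer.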

  33. Code generation • In the final stage, we take the three-address code (3AC) or other intermediate representation, and convert to the target language. • We must pick memory locations for variables and allocate registers. • Input to the code generator:
      temp1 := id3 * 60.0
      id1 := id2 + temp1
      Output of the code generator:
      MOVF id3, R2
      MULF #60.0, R2
      MOVF id2, R1
      ADDF R2, R1
      MOVF R1, id1
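
A naive code generator for this example can be sketched as follows. It assumes the two-operand instruction style shown on the slide (`OP source, destination`), gives each instruction's result a fresh register, and never frees or spills registers, so it only works for short sequences. The function names and the order of the free-register list (chosen so the output uses the same register numbers as the slide) are assumptions.

```python
OPCODES = {"*": "MULF", "+": "ADDF"}


def operand(name, register_of):
    """Format an operand: a register for temporaries, #literal for constants."""
    if name in register_of:
        return register_of[name]
    if name[0].isdigit():
        return "#" + name
    return name                              # a variable's memory location


def generate(code, free_registers):
    """Translate (dest, op, arg1, arg2) tuples into assembly-like text."""
    register_of = {}                         # temporary name -> register holding it
    free = list(free_registers)
    assembly = []
    for dest, op, a, b in code:
        register = free.pop(0)               # raises IndexError when registers run out
        assembly.append(f"MOVF {operand(a, register_of)}, {register}")
        assembly.append(f"{OPCODES[op]} {operand(b, register_of)}, {register}")
        if dest.startswith("temp"):
            register_of[dest] = register     # keep the temporary in its register
        else:
            assembly.append(f"MOVF {register}, {dest}")   # store to the variable
    return assembly


if __name__ == "__main__":
    code = [
        ("temp1", "*", "id3", "60.0"),
        ("id1", "+", "id2", "temp1"),
    ]
    for line in generate(code, ["R2", "R1"]):
        print(line)
```

Keeping temporaries in registers and writing only named variables back to memory is the simplest possible allocation policy. Choosing which values stay in registers when there are not enough is the register allocation problem discussed with the back end.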

  34. The Back End • Diagram: IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code, with Errors as an additional output • Responsibilities: • Translate IR into target machine code • Choose instructions to implement each IR operation • Decide which value to keep in registers • Ensure conformance with system interfaces • Automation has been less successful in the back end

  35. The Back End: Instruction Selection • Produce fast, compact code • Take advantage of target features such as addressing modes • Usually viewed as a pattern matching problem • Approaches: ad hoc methods, pattern matching, dynamic programming

  36. The Back End: Register Allocation • Have each value in a register when it is used • Manage a limited set of resources • Can change instruction choices & insert LOADs & STOREs • Optimal allocation is NP-Complete (1 or k registers) • Compilers approximate solutions to NP-Complete problems

  37. The Back End: Instruction Scheduling • Avoid hardware stalls and interlocks • Use all functional units productively • Can increase lifetime of variables (changing the allocation) • Optimal scheduling is NP-Complete in nearly all cases • Heuristic techniques are well developed

  38. Cousins of the compiler • PREPROCESSORS take raw source code and produce the input actually read by the compiler • MACRO PROCESSING: macro calls need to be replaced by the correct text • Macros can be used to define a constant used in many places, e.g. in C: #define BUFSIZE 100 • Also useful as shorthand for often-repeated expressions:
      #define DEG_TO_RADIANS(x) ((x)/180.0*M_PI)
      #define ARRAY(a,i,j,ncols) ((a)[(i)*(ncols)+(j)])
      • FILE INCLUSION: included files (e.g. using #include in C) need to be expanded

  39. Cousins of the compiler • ASSEMBLERS take assembly code and convert it to machine code. • Some compilers go directly to machine code; others produce assembly code and then call a separate assembler. • Either way, the output machine code is usually RELOCATABLE, with memory addresses starting at location 0.

  40. Cousins of the compiler • LOADERS take relocatable machine code and alter the addresses, putting the instructions and data in a particular location in memory. • The LINK EDITOR (part of the loader) pieces together a complete program from several independently compiled parts.

  41. Compiler writing tools • We’ve come a long way since the 1950s. • SCANNER GENERATORS produce lexical analyzers automatically. • Input: a specification of the tokens of a language (usually written as regular expressions) • Output: C code to break the source language into tokens. • PARSER GENERATORS produce syntactic analyzers automatically. • Input: a specification of the language syntax (usually written as a context-free grammar) • Output: C code to build the syntax tree from the token sequence. • There are also automated systems for code synthesis.
