Chapter 1: Introduction to Compiling. Prof. Steven A. Demurjian Computer Science & Engineering Department The University of Connecticut 371 Fairfield Way, Unit 2155 Storrs, CT 06269-3155. steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818.
Material for course thanks to: Laurent Michel, Aggelos Kiayias, Robert LeBarre
Introduction to Compilers • Diverse & Varied • As a Discipline, Involves Multiple CSE Areas • Programming Languages and Algorithms • Software Engineering & Theory / Foundations • Computer Architecture & Operating Systems • But, Has Surprisingly Simplistic Intent: Source Program → Compiler → Target Program, with the Compiler Also Reporting Error Messages
What is the Compilation Process? Consider the main.c C Program:
#include <stdio.h>
#include <ctype.h>
#include <string.h>
main()
{
    char day[10];
    int m, d, y;
    m = 9; d = 7; y = 2011;
    strcpy(day, "Wednesday");
    printf("Today is: ");
    printf("%s %d/%d/%d\n", day, m, d, y);
}
Viewing C from Compiler Theory • Tokens or Patterns for the “Words” of the Programming Language • Specified in Flex • Grammar Rules that Describe the Allowable Sentences of the Programming Language • Specified in Bison • Other Stages of Compilation • Semantic and Type Checking Analysis • Intermediate Code Generation • Code Optimization • Final (Relocatable) Code Generation
Viewing C from Usage Side • ANSI C Specification of Lexical Structure in Flex • ANSI C Specification of Grammar Rules in Bison • Multi-Stage Compilation Process from Source Code to Preprocessed Code to Assembly Code to Object Code to Executable • Preprocess: gcc -E main.c > main.e • Assembly: gcc -S main.c Generates main.s • Object: gcc -c main.c Generates main.o • Executable: gcc main.c Generates a.out
What are Tokens? Smallest Individual Units that are Recognized
#include <stdio.h>
#include <ctype.h>
#include <string.h>
main()
{
    char day[10];
    int m, d, y;
    m = 9; d = 7; y = 2011;
    strcpy(day, "Wednesday");
    printf("Today is: ");
    printf("%s %d/%d/%d\n", day, m, d, y);
}
Tokens???
C Lexical Specification in Flex • http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
D   [0-9]
L   [a-zA-Z_]
H   [a-fA-F0-9]
E   [Ee][+-]?{D}+
FS  (f|F|l|L)
IS  (u|U|l|L)*
%{
#include <stdio.h>
#include "y.tab.h"
void count();
%}
%%
"/*"   { comment(); }
C Flex Continued … "auto" { count(); return(AUTO); } "break" { count(); return(BREAK); } "case" { count(); return(CASE); } "char" { count(); return(CHAR); } "const" { count(); return(CONST); } "continue" { count(); return(CONTINUE); } "default" { count(); return(DEFAULT); } "do" { count(); return(DO); } "double" { count(); return(DOUBLE); } "else" { count(); return(ELSE); } "enum" { count(); return(ENUM); } "extern" { count(); return(EXTERN); } "float" { count(); return(FLOAT); } "for" { count(); return(FOR); } "goto" { count(); return(GOTO); } "if" { count(); return(IF); } "int" { count(); return(INT); }
C Flex Continued … "long" { count(); return(LONG); } "register" { count(); return(REGISTER); } "return" { count(); return(RETURN); } "short" { count(); return(SHORT); } "signed" { count(); return(SIGNED); } "sizeof" { count(); return(SIZEOF); } "static" { count(); return(STATIC); } "struct" { count(); return(STRUCT); } "switch" { count(); return(SWITCH); } "typedef" { count(); return(TYPEDEF); } "union" { count(); return(UNION); } "unsigned" { count(); return(UNSIGNED); } "void" { count(); return(VOID); } "volatile" { count(); return(VOLATILE); } "while" { count(); return(WHILE); } What TOKEN is Missing?
C Flex Continued …
{L}({L}|{D})*            { count(); return(check_type()); }
0[xX]{H}+{IS}?           { count(); return(CONSTANT); }
0{D}+{IS}?               { count(); return(CONSTANT); }
{D}+{IS}?                { count(); return(CONSTANT); }
L?'(\\.|[^\\'])+'        { count(); return(CONSTANT); }
{D}+{E}{FS}?             { count(); return(CONSTANT); }
{D}*"."{D}+({E})?{FS}?   { count(); return(CONSTANT); }
{D}+"."{D}*({E})?{FS}?   { count(); return(CONSTANT); }
L?\"(\\.|[^\\"])*\"      { count(); return(STRING_LITERAL); }
What Do These Represent?
C Flex Continued … "..." { count(); return(ELLIPSIS); } ">>=" { count(); return(RIGHT_ASSIGN); } "<<=" { count(); return(LEFT_ASSIGN); } "+=" { count(); return(ADD_ASSIGN); } "-=" { count(); return(SUB_ASSIGN); } "*=" { count(); return(MUL_ASSIGN); } "/=" { count(); return(DIV_ASSIGN); } "%=" { count(); return(MOD_ASSIGN); } "&=" { count(); return(AND_ASSIGN); } "^=" { count(); return(XOR_ASSIGN); } "|=" { count(); return(OR_ASSIGN); } ">>" { count(); return(RIGHT_OP); } "<<" { count(); return(LEFT_OP); } "++" { count(); return(INC_OP); } "--" { count(); return(DEC_OP); } "->" { count(); return(PTR_OP); }
C Flex Continued … MISSING LOTS OF FLEX "<" { count(); return('<'); } ">" { count(); return('>'); } "^" { count(); return('^'); } "|" { count(); return('|'); } "?" { count(); return('?'); } [ \t\v\n\f] { count(); } . { /* ignore bad characters */ } %% yywrap() { return(1); } MISSING OTHER CODE
C Bison Specification • http://www.lysator.liu.se/c/ANSI-C-grammar-y.html %token IDENTIFIER CONSTANT STRING_LITERAL SIZEOF %token PTR_OP INC_OP DEC_OP LEFT_OP RIGHT_OP LE_OP %token GE_OP EQ_OP NE_OP %token AND_OP OR_OP MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN %token SUB_ASSIGN LEFT_ASSIGN RIGHT_ASSIGN AND_ASSIGN %token XOR_ASSIGN OR_ASSIGN TYPE_NAME ADD_ASSIGN %token TYPEDEF EXTERN STATIC AUTO REGISTER %token CHAR SHORT INT LONG SIGNED UNSIGNED FLOAT DOUBLE CONST %token VOLATILE VOID %token STRUCT UNION ENUM ELLIPSIS %token CASE DEFAULT IF ELSE SWITCH WHILE DO FOR GOTO CONTINUE %token BREAK RETURN %start translation_unit %%
Bison Spec Continued … primary_expression : IDENTIFIER | CONSTANT | STRING_LITERAL | '(' expression ')' ; postfix_expression : primary_expression | postfix_expression '[' expression ']' | postfix_expression '(' ')' | postfix_expression '(' argument_expression_list ')' | postfix_expression '.' IDENTIFIER | postfix_expression PTR_OP IDENTIFIER | postfix_expression INC_OP | postfix_expression DEC_OP ; argument_expression_list : assignment_expression | argument_expression_list ',' assignment_expression These are Grammar Rules
Bison Spec Continued … These Rules Declare Variables declaration_list : declaration | declaration_list declaration declaration : declaration_specifiers ';' | declaration_specifiers init_declarator_list ';' ; declaration_specifiers : storage_class_specifier | storage_class_specifier declaration_specifiers | type_specifier | type_specifier declaration_specifiers | type_qualifier | type_qualifier declaration_specifiers
Bison Spec Continued … statement_list : statement | statement_list statement ; statement : labeled_statement | compound_statement | expression_statement | selection_statement | iteration_statement | jump_statement ; labeled_statement : IDENTIFIER ':' statement | CASE constant_expression ':' statement | DEFAULT ':' statement ; List of Statements and Different Statements
Bison Spec Continued … expression_statement : ';' | expression ';' ; selection_statement : IF '(' expression ')' statement | IF '(' expression ')' statement ELSE statement | SWITCH '(' expression ')' statement ; iteration_statement : WHILE '(' expression ')' statement | DO statement WHILE '(' expression ')' ';' | FOR '(' expression_statement expression_statement ')' statement | FOR '(' expression_statement expression_statement expression ')' statement ;
Reviewing Compilation Process in C • Today's Programmers Are Used to State-of-the-Art IDEs (Eclipse, etc.) • IDEs Focus on Higher Level Concepts • Syntax Correction Capabilities • Matching Parens and Brackets • Sophisticated Compilation Error Messages • Debugging Environment • This Process Hides Details. In C: • Multi-Stage Compilation Process from Source Code to Preprocessed Code to Assembly Code to Object Code to Executable
From Source to Preprocessing: What Does main.e Contain?
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "main.c"
# 1 "/usr/include/stdio.h" 1 3 4
# 28 "/usr/include/stdio.h" 3 4
# 1 "/usr/include/features.h" 1 3 4
# 330 "/usr/include/features.h" 3 4
# 1 "/usr/include/sys/cdefs.h" 1 3 4
# 348 "/usr/include/sys/cdefs.h" 3 4
# 1 "/usr/include/bits/wordsize.h" 1 3 4
# 349 "/usr/include/sys/cdefs.h" 2 3 4
# 331 "/usr/include/features.h" 2 3 4
# 354 "/usr/include/features.h" 3 4
# 1 "/usr/include/gnu/stubs.h" 1 3 4
… etc ….
From Source to Preprocessing: What Does main.e Contain?
… missing almost 900 lines of code …
# 4 "main.c" 2
main()
{
    char day[10];
    int m, d, y;
    m = 9; d = 7; y = 2011;
    strcpy(day, "Wednesday");
    printf("Today is: ");
    printf("%s %d/%d/%d\n", day, m, d, y);
}
What are Final Two Steps? • Generation of “.o” files • Linking of “.o” files to Create a.out • In Olden Days (no IDEs): • Compiling 250K of Code Took 2-3 Hours • Utilization of Makefiles for Smart Compilation • Creation of Directory Structure for Code • Makefiles to Establish Dependencies Among “.h” and “.c” and “.o” Files • Change in Time Stamp Causes Recompile • Try to Recompile the “Least” Amount of Code • Then Relink All “.o” Files into a.out
Programming Languages Offer … • Abstractions • At different levels • From low • Good for machines…. • To high • Good for humans…. • Three Approaches • Interpreted • Compiled • Mixed
Interpreter • Motivation… • Easiest to implement! • Upside ? • Downside ? • Phases • Lexical analysis • Parsing • Semantic checking • Interpretation • Interpreted Languages?
Compiler • Motivation • It is natural! • Upside? • Downside? • Phases • [Interpreter] • Code Generation • Code Optimization • Link & load • Compiled Languages?
Mixed • Motivation • The best of two breeds… • Upside ? • Downside? • Mixed Languages?
Classifications of Compilers • Compilers Viewed from Many Perspectives • By Construction: Single Pass, Multiple Pass, Load & Go • By Function: Debugging, Optimizing • However, All Utilize the Same Basic Tasks to Accomplish Their Actions
Compiler Structure • Two Fundamental Sets of Issues • Analysis: Decompose Source into an Intermediate Representation (Text, Syntactic, and Structural Analysis) • Synthesis: Generate the Target Program from the Intermediate Representation, Including Optimization • We Will Discuss Each Category in This Class
Important Notes • In Today’s Technology, Analysis Is Often Performed by Software Tools - This Wasn’t the Case in Early CSE Days • Structure / Syntax Directed Editors: Force “syntactically” correct code to be entered • Pretty Printers: Produce a standardized layout of program structure (i.e., blank space, indenting, etc.) • Static Checkers: A “quick” compilation to detect rudimentary errors • Interpreters: “Real” time execution of code, a line at a time
Important Notes • Compilation Is Not Limited to Programming Language Applications • Text Formatters • LaTeX & TROFF Are Languages Whose Commands Format Text • Silicon Compilers • Textual / Graphical: Take Input and Generate Circuit Design • Database Query Processors • Database Query Languages Are Also Programming Languages • Input Is “compiled” Into a Set of Operations for Accessing the Database
Summary of Various Languages • Compiled • Programming Languages • [C,C#,ML,LISP,…] • Communication Languages • [XML,HTML,…] • Presentation Languages • [CSS,SGML,…] • Hardware Languages • [VHDL,…] • Formatting Languages • [Postscript,troff,LaTeX,…] • Query Languages • [SQL & friends]
The Many Phases of a Compiler • Source Program → 1 Lexical Analyzer → 2 Syntax Analyzer → 3 Semantic Analyzer → 4 Intermediate Code Generator → 5 Code Optimizer → 6 Code Generator → Target Program • Symbol-table Manager and Error Handler Are Shared by All Six Phases • 1, 2, 3 : Analysis - Our Focus • 4, 5, 6 : Synthesis
What Does Relocatable Mean? • Source Program → Pre-Processor → Compiler → Assembler → Relocatable Machine Code → Loader / Link-Editor → Executable • The Loader / Link-Editor Also Takes Library and Relocatable Object Files as Input
The Analysis Task For Compilation • Language Analysis Phases: Source Program → 1 Lexical Analyzer → 2 Syntax Analyzer → 3 Semantic Analyzer • Three Phases: • Linear / Lexical Analysis: • Left-to-Right Scan to Identify Tokens • Hierarchical Analysis: • Grouping of Tokens Into Meaningful Collections • Semantic Analysis: • Checking to Ensure Correctness of Components
Phase 1. Lexical Analysis • Easiest Analysis - Identify Tokens, Which Are the Building Blocks • For Example: position := initial + rate * 60 ; • All Are Tokens: position, :=, initial, +, rate, *, 60, ; • Blanks, Line Breaks, etc. Are Scanned Out
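The scan described on this slide can be sketched by hand in C. This is only an illustration, not the Flex-generated scanner shown earlier: the names TokenKind, Token, next_token and tokenize are hypothetical, and the sketch assumes that ":=" is the only multi-character symbol.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch (not taken from the slides). */
typedef enum { TOK_ID, TOK_NUMBER, TOK_SYMBOL, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* lexeme, truncated if longer than 31 characters */
} Token;

/* Scan one token starting at *src and advance *src past it. */
Token next_token(const char **src)
{
    Token tok = { TOK_END, "" };
    const char *p = *src;
    size_t n = 0;

    /* Blanks, line breaks, etc. are scanned out. */
    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        /* End of input: tok already has kind TOK_END. */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_ID;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < sizeof tok.text - 1)
                tok.text[n++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) {
            if (n < sizeof tok.text - 1)
                tok.text[n++] = *p;
            p++;
        }
    } else {
        tok.kind = TOK_SYMBOL;
        tok.text[n++] = *p++;
        /* Assumption: ":=" is the only two-character symbol. */
        if (tok.text[0] == ':' && *p == '=')
            tok.text[n++] = *p++;
    }
    tok.text[n] = '\0';
    *src = p;
    return tok;
}

/* Fill out[] with at most max tokens; return how many were stored. */
size_t tokenize(const char *src, Token *out, size_t max)
{
    size_t count = 0;
    assert(src != NULL && out != NULL);
    while (count < max) {
        Token t = next_token(&src);
        if (t.kind == TOK_END)
            break;
        out[count++] = t;
    }
    return count;
}
```

Calling tokenize on the example statement yields eight tokens, the same eight units listed on the slide. A generated scanner would normally be preferred in practice; the hand-written loop just makes the left-to-right scan visible.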
Lexical Analysis • Purpose • Slice the sequence of symbols into tokens • Example: Date x := new Date ( System.today( ) + 30 ) ; • Tokens: Id Date, Id x, Symbol :=, Keyword new, Id Date, Symbol (, Id System, Symbol ., Id today, Symbol (, Symbol ), Symbol +, Integer 30, Symbol ), Symbol ;
Phase 2. Hierarchical Analysis, aka Parsing or Syntax Analysis • For 1st example, we would have Parse Tree:
assignment statement
    identifier: position
    :=
    expression
        expression
            identifier: initial
        +
        expression
            expression
                identifier: rate
            *
            expression
                number: 60
• Nodes of tree are constructed using a grammar for the language
Syntax Analysis (parsing) • For Second example, we would have: • Organize tokens in sentences based on grammar
What is a Grammar? • Grammar is a Set of Rules Which Govern the Interdependencies & Structure Among the Tokens statement is an assignment statement, or while statement, or if statement, or ... assignment statement is an identifier := expression ; expression is an (expression), or expression + expression, or expression * expression, or number, or identifier, or ...
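To make the expression rules concrete, here is a minimal recursive-descent sketch in C. It is not the Bison parser from the earlier slides. The rules on this slide are ambiguous as written, so the sketch assumes the usual refactoring into precedence levels (expression for +, term for *, factor for operands and parentheses); all function names are hypothetical, and the tree is emitted as prefix text such as (+ a b).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define BUF 256   /* every tree buffer in this sketch has this size */

typedef struct {
    const char *p;   /* next unread character */
    int ok;          /* cleared on the first syntax error */
} Parser;

static void parse_expression(Parser *ps, char *out);

static void skip_blanks(Parser *ps)
{
    while (isspace((unsigned char)*ps->p))
        ps->p++;
}

/* factor: number | identifier | '(' expression ')' */
static void parse_factor(Parser *ps, char *out)
{
    size_t n = 0;
    skip_blanks(ps);
    if (*ps->p == '(') {
        ps->p++;
        parse_expression(ps, out);
        skip_blanks(ps);
        if (*ps->p == ')')
            ps->p++;
        else
            ps->ok = 0;              /* missing ')' */
        return;
    }
    while (isalnum((unsigned char)*ps->p) || *ps->p == '_') {
        if (n < BUF - 1)
            out[n++] = *ps->p;
        ps->p++;
    }
    out[n] = '\0';
    if (n == 0)
        ps->ok = 0;                  /* expected an operand */
}

/* Parse operands separated by op, grouping them left to right. */
static void parse_binary(Parser *ps, char *out, char op,
                         void (*operand)(Parser *, char *))
{
    char right[BUF], joined[BUF];
    operand(ps, out);
    skip_blanks(ps);
    while (*ps->p == op) {
        ps->p++;
        operand(ps, right);
        snprintf(joined, sizeof joined, "(%c %s %s)", op, out, right);
        strcpy(out, joined);
        skip_blanks(ps);
    }
}

/* term: factor { '*' factor } */
static void parse_term(Parser *ps, char *out)
{
    parse_binary(ps, out, '*', parse_factor);
}

/* expression: term { '+' term } */
static void parse_expression(Parser *ps, char *out)
{
    parse_binary(ps, out, '+', parse_term);
}

/* Return 1 and fill tree (BUF bytes) on success, 0 on a syntax error. */
int parse(const char *src, char *tree)
{
    Parser ps = { src, 1 };
    assert(src != NULL && tree != NULL);
    parse_expression(&ps, tree);
    skip_blanks(&ps);
    return ps.ok && *ps.p == '\0';
}
```

Because term sits below expression, rate * 60 is grouped before the addition, which matches the shape of the parse tree for the first example.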
Summary so far… • Turn a symbol stream into a parse tree
Semantic Analysis • Lexical Analysis - Scans Input to Identify “words” that are the Tokens of the Language • Syntactic Analysis uses Recursion to Identify Structure as Indicated in Parse Tree • What is Semantic Analysis? • Purpose • Determine Unique / Unambiguous Interpretation • Catch errors related to program meaning • Determine Whether the Sentences have One and Only One Unambiguous Interpretation • “John Took Picture of Mary Out on the Patio” • For a PL – Wrong Types, Missing Declaration, Missing Methods, Ambiguous Statements, etc.
Phase 3. Semantic Analysis • Find More Complicated Semantic Errors and Support Code Generation • Parse Tree Is Augmented With Semantic Actions • Compressed Tree: position := initial + rate * 60 • After Conversion Action: position := initial + rate * inttoreal(60)
Phase 3. Semantic Analysis • Most Important Activity in This Phase: • Type Checking - Legality of Operands • Many Different Situations: • Primary Tool is Symbol Table Real := int + char ; A[int] := A[real] + int ; while char <> int do …. Etc.
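A small C sketch of the type-checking step follows. It only shows how an inttoreal conversion might be inserted when an int operand meets a real one. The node layout and the names leaf, binop and check are hypothetical, and leaf types are passed in directly here, whereas a real compiler would look them up in the symbol table. Rejecting illegal combinations such as the char cases on this slide is left out.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { TY_INT, TY_REAL } Type;
typedef enum { N_LEAF, N_BINOP, N_INTTOREAL } NodeKind;

typedef struct Node {
    NodeKind kind;
    Type type;              /* declared type for leaves, computed otherwise */
    char op;                /* '+' or '*' for N_BINOP, 0 otherwise */
    const char *name;       /* lexeme for N_LEAF, NULL otherwise */
    struct Node *left, *right;
} Node;

static Node *new_node(NodeKind kind, Type type, char op, const char *name,
                      Node *left, Node *right)
{
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->kind = kind;
    n->type = type;
    n->op = op;
    n->name = name;
    n->left = left;
    n->right = right;
    return n;
}

Node *leaf(const char *name, Type type)
{
    return new_node(N_LEAF, type, 0, name, NULL, NULL);
}

Node *binop(char op, Node *left, Node *right)
{
    /* The type is a placeholder until check() computes it. */
    return new_node(N_BINOP, TY_INT, op, NULL, left, right);
}

/* Wrap an int-typed subtree in a conversion node; leave reals alone. */
static Node *to_real(Node *n)
{
    return n->type == TY_REAL
        ? n
        : new_node(N_INTTOREAL, TY_REAL, 0, NULL, n, NULL);
}

/* Compute types bottom-up, inserting inttoreal where int meets real. */
void check(Node *n)
{
    if (n->kind != N_BINOP)
        return;
    check(n->left);
    check(n->right);
    if (n->left->type == TY_REAL || n->right->type == TY_REAL) {
        n->left = to_real(n->left);
        n->right = to_real(n->right);
        n->type = TY_REAL;
    } else {
        n->type = TY_INT;
    }
}

void free_tree(Node *n)
{
    if (n == NULL)
        return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

For initial + rate * 60 with initial and rate declared real, check wraps the leaf 60 in an inttoreal node and marks both operators as real.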
Analysis in Text Formatting • Simple Commands: LaTeX • \begin{single} \end{single} \noindent \section{Introduction} $A_i$ $A_{i_j}$ • Language Commands (begin, single, noindent, section) Are Embedded in a Stream of Text, i.e., a FILE • \ and $ Serve as Signals to LaTeX • What are tokens? • What is hierarchical structure? • What kind of semantic analysis is required?
Supporting Phases/ Activities for Analysis • Symbol Table Creation / Maintenance • Contains Info on Each “Meaningful” Token, Typically Identifiers • Data Structure Created / Initialized During Lexical Analysis • Utilized / Updated During Later Analysis & Synthesis • Error Handling • Detection of Different Errors Which Correspond to All Phases • What Kinds of Errors Are Found During the Analysis Phase? • What Happens When an Error Is Found?
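A symbol table of the kind described here can be sketched in C as a small fixed-size array with linear search. This is an illustration under stated assumptions: the names SymbolTable, st_lookup and st_insert, the size limits, and the three-valued type field are all hypothetical, and a production compiler would more likely use a hash table with scoping.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_NAME    32

/* Hypothetical type tags; the lexer does not know the type yet. */
typedef enum { TYPE_UNKNOWN, TYPE_INT, TYPE_REAL } SymType;

typedef struct {
    char name[MAX_NAME];
    SymType type;
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    int count;
} SymbolTable;

/* Return the index of name, or -1 if it has not been entered. */
int st_lookup(const SymbolTable *t, const char *name)
{
    int i;
    for (i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return i;
    return -1;
}

/* Enter name if it is new; return its index, or -1 if the table is full.
   Intended to be called by the lexer each time it sees an identifier. */
int st_insert(SymbolTable *t, const char *name)
{
    int i = st_lookup(t, name);
    assert(strlen(name) < MAX_NAME);
    if (i >= 0)
        return i;
    if (t->count == MAX_SYMBOLS)
        return -1;
    i = t->count++;
    strcpy(t->entries[i].name, name);
    t->entries[i].type = TYPE_UNKNOWN;   /* filled in by later phases */
    return i;
}
```

The lexer creates entries, and later phases update them, for example by setting the type field once a declaration has been analyzed.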
Summary so far... • Turned a symbol stream into an annotated parse tree
From Analysis to Synthesis • Source Program → 1 Lexical Analyzer → 2 Syntax Analyzer → 3 Semantic Analyzer → 4 Intermediate Code Generator → 5 Code Optimizer → 6 Code Generator → Target Program • Symbol-table Manager and Error Handler Are Shared by All Six Phases • 1, 2, 3 : Analysis - Our Focus • 4, 5, 6 : Synthesis Phase
The Synthesis Task For Compilation • Intermediate Code Generation • Abstract Machine Version of Code - Independent of Architecture • Easy to Produce, and Easy to Translate in the Final, Machine Dependent Code Generation Step • Code Optimization • Find More Efficient Ways to Execute Code • Replace Code With More Efficient Statements • Two Approaches: High-level Language & “Peephole” Optimization • Final Code Generation • Generate Relocatable Machine Dependent Code
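As one small illustration of "peephole" optimization, the C sketch below looks at one instruction at a time and replaces x := y + 0 and x := y * 1 with the plain copy x := y. The Instr layout and the function name peephole are hypothetical, the sketch assumes integer operands, and real peephole optimizers apply many more patterns over short instruction windows.

```c
#include <assert.h>
#include <string.h>

/* One three-address instruction: dst := lhs op rhs. */
typedef struct {
    char dst[16];
    char lhs[16];
    char op;        /* '+', '*', or '=' for a plain copy dst := lhs */
    char rhs[16];   /* empty for a copy */
} Instr;

/* Rewrite dst := lhs + 0 and dst := lhs * 1 as copies.
   Returns the number of instructions that were rewritten. */
int peephole(Instr *code, int n)
{
    int i, changed = 0;
    assert(code != NULL || n == 0);
    for (i = 0; i < n; i++) {
        Instr *in = &code[i];
        int add_zero = in->op == '+' && strcmp(in->rhs, "0") == 0;
        int mul_one  = in->op == '*' && strcmp(in->rhs, "1") == 0;
        if (add_zero || mul_one) {
            in->op = '=';
            in->rhs[0] = '\0';
            changed++;
        }
    }
    return changed;
}
```

The pass changes no results; it only removes work that has no effect, which is the sense of "more efficient statements" above.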
Intermediate Code • What is intermediate code ? • A low level representation • A simple representation • Easy to reason about • Easy to manipulate • Programming Language independent • Hardware independent
IR Code example • Quadruples • (x,y,op,z) to represent x := y op z • Infinitely many temporaries • Implicit call stack management • Example: Date x := new Date ( System.today( ) + 30 ) ;
push System
t0 := call today
t1 := 30
t2 := t0 + t1
push t2
t3 := call DateFactory
x := t3
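The quadruple form (x,y,op,z) might be stored along the following lines in C. This is a sketch only: the names Quad, Program, new_temp, emit and format_quad are hypothetical, '=' is an assumed marker for a plain copy, the "infinitely many temporaries" are approximated by a counter, and the push and call instructions from the example are not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_QUADS 32
#define NAME_LEN  16

/* One quadruple (x, y, op, z), read as x := y op z. */
typedef struct {
    char x[NAME_LEN];   /* result */
    char y[NAME_LEN];   /* first operand */
    char op;            /* operator, or '=' for a plain copy x := y */
    char z[NAME_LEN];   /* second operand, empty for a copy */
} Quad;

typedef struct {
    Quad code[MAX_QUADS];
    int count;          /* quadruples emitted so far */
    int next_temp;      /* number of the next fresh temporary */
} Program;

/* Write a fresh temporary name (t0, t1, ...) into buf and return buf. */
const char *new_temp(Program *p, char *buf)
{
    snprintf(buf, NAME_LEN, "t%d", p->next_temp++);
    return buf;
}

/* Append the quadruple (x, y, op, z) to the program. */
void emit(Program *p, const char *x, const char *y, char op, const char *z)
{
    Quad *q;
    assert(p->count < MAX_QUADS);
    q = &p->code[p->count++];
    snprintf(q->x, NAME_LEN, "%s", x);
    snprintf(q->y, NAME_LEN, "%s", y);
    snprintf(q->z, NAME_LEN, "%s", z);
    q->op = op;
}

/* Render quadruple i as text in the notation used on the slide. */
void format_quad(const Program *p, int i, char *out, size_t cap)
{
    const Quad *q;
    assert(i >= 0 && i < p->count);
    q = &p->code[i];
    if (q->op == '=')
        snprintf(out, cap, "%s := %s", q->x, q->y);
    else
        snprintf(out, cap, "%s := %s %c %s", q->x, q->y, q->op, q->z);
}
```

Emitting (t1, 30, =, "") and then (t2, t0, +, t1) reproduces the two arithmetic lines of the example; a fuller IR would add instruction kinds for push and call.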