Lexical and Syntax Analysis Chapter 4
Compilation • Translating from high-level language to machine code is organized into several phases or passes. • In the early days passes communicated through files, but this is no longer necessary.
Language Specification • We must first describe the language in question by giving its specification. • Syntax: • Defines symbols (vocabulary) • Defines programs (sentences) • Semantics: • Gives meaning to sentences. • The formal specifications are often the input to tools that build translators automatically.
Compiler passes [Figure: pipeline of passes and the representation each one hands to the next. String of characters → Lexical Analyzer → string of tokens → Parser → abstract syntax tree → Semantic Analyzer → abstract syntax tree → Source-to-source optimizer → abstract syntax tree → Translator and Optimizer stages (two variants shown, exchanging medium-level or low-level intermediate code) → low-level intermediate code → Final Assembly → executable/object code.]
Compiler passes [Figure: source program → front end (lexical scanner, parser, semantic analyzer) → back end (translator, optimizer, final assembly) → target program. The symbol table manager and the error handler sit alongside the phases.]
Lexical analyzer • Also called a scanner or tokenizer • Converts stream of characters into a stream of tokens • Tokens are: • Keywords such as for, while, and class. • Special characters such as +, -, (, and < • Variable name occurrences • Constant occurrences such as 1, 0, true.
Lexical analyzer • The lexical analyzer is usually a subroutine of the parser. • Each token is a single entity. A numerical code is usually assigned to each type of token.
Lexical analyzer • Lexical analyzers perform: • Line reconstruction • delete comments • delete white space • perform text substitution • Lexical translation: translation of lexemes -> tokens • Often additional information is associated with a token.
Parser • Performs syntax analysis • Imposes syntactic structure on a sentence. • Parse trees are used to expose the structure. • These trees are often not explicitly built • Simpler representations of them are often used • The parser accepts a string of tokens and builds a parse tree representing the program
Parser • The collection of all the programs in a given language is usually specified using a list of rules known as a context free grammar.
Parser • A grammar has four components: • A set of tokens known as terminal symbols • A set of variables or non-terminals • A set of productions where each production consists of a non-terminal, an arrow, and a sequence of tokens and/or non-terminals • A designation of one of the nonterminals as the start symbol.
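The four components listed above map directly onto a small data structure. The sketch below is only illustrative: the `Grammar` class, its field names, and the toy expression grammar are assumptions made for this example, not something defined on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # tokens
    nonterminals: frozenset   # variables
    productions: tuple        # (head, body) pairs; body is a tuple of symbols
    start: str                # designated start symbol

    def is_well_formed(self):
        """Check that the four components are consistent with each other."""
        symbols = self.terminals | self.nonterminals
        return (
            self.terminals.isdisjoint(self.nonterminals)
            and self.start in self.nonterminals
            and all(head in self.nonterminals for head, _ in self.productions)
            and all(sym in symbols for _, body in self.productions for sym in body)
        )


# Toy expression grammar (hypothetical, chosen only for illustration):
#   expr -> expr + term | term
#   term -> id | num
EXPR_GRAMMAR = Grammar(
    terminals=frozenset({"+", "id", "num"}),
    nonterminals=frozenset({"expr", "term"}),
    productions=(
        ("expr", ("expr", "+", "term")),
        ("expr", ("term",)),
        ("term", ("id",)),
        ("term", ("num",)),
    ),
    start="expr",
)
```

Each production is stored as a head plus a tuple of symbols, which mirrors "a non-terminal, an arrow, and a sequence of tokens and/or non-terminals".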
Symbol Table Management • The symbol table is a data structure used by all phases of the compiler to keep track of user defined symbols and keywords. • During early phases (lexical and syntax analysis) symbols are discovered and put into the symbol table • During later phases symbols are looked up to validate their usage.
Symbol Table Management • Typical symbol table activities: • add a new name • add information for a name • access information for a name • determine if a name is present in the table • remove a name • revert to a previous usage for a name (close a scope).
Symbol Table Management • Many possible Implementations: • linear list • sorted list • hash table • tree structure
Symbol Table Management • Typical information fields: • print value • kind (e.g. reserved, typeid, varid, funcid, etc.) • block number/level number • type • initial value • base address • etc.
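The activities and fields above can be sketched with a stack of hash tables, one per open scope. This is one possible design among the implementations listed, not the one any particular compiler uses; the class name, method names, and field names below are assumptions for illustration.

```python
class SymbolTable:
    """Scoped symbol table: a stack of dicts (hash tables), innermost scope last."""

    def __init__(self):
        self.scopes = [{}]          # scopes[0] is the outermost (global) scope

    def open_scope(self):
        self.scopes.append({})

    def close_scope(self):
        # Closing a scope reverts every name to its previous usage.
        if len(self.scopes) == 1:
            raise RuntimeError("cannot close the outermost scope")
        self.scopes.pop()

    def add(self, name, **info):
        # Add a new name to the current scope, with its information fields.
        scope = self.scopes[-1]
        if name in scope:
            raise KeyError(f"{name!r} already declared in this scope")
        entry = dict(info)
        entry["level"] = len(self.scopes) - 1   # block/level number
        scope[name] = entry

    def lookup(self, name):
        # Search from the innermost scope outward; None if the name is absent.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

    def set_info(self, name, **info):
        # Add information for a name that is already present.
        entry = self.lookup(name)
        if entry is None:
            raise KeyError(f"{name!r} is not declared")
        entry.update(info)
```

A short usage example: after `t.add("x", kind="varid", type="int")`, `t.open_scope()`, and `t.add("x", kind="varid", type="float")`, a lookup of `x` finds the inner entry; after `t.close_scope()` the same lookup finds the outer one again.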
Abstract Syntax Tree • The parse tree is used to recognize the components of the program and to check that the syntax is correct. • As the parser applies productions, it usually generates the components of a simpler tree (known as the Abstract Syntax Tree). • The meaning of a component is derived from the way the statement is organized in a subtree.
Semantic Analyzer • The semantic analyzer completes the symbol table with information on the characteristics of each identifier. • The symbol table is usually initialized during parsing. • One entry is created for each identifier and constant. • Scope is taken into account. Two different variables with the same name will have different entries in the symbol table. • The semantic analyzer completes the table using information from declarations.
Semantic Analyzer • The semantic analyzer does: • Type checking • Flow of control checks • Uniqueness checks (identifiers, case labels, etc.) • One objective is to identify semantic errors statically. For example: • Undeclared identifiers • Unreachable statements • Identifiers used in the wrong context. • Methods called with the wrong number of parameters or with parameters of the wrong type.
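One of the static checks listed above, finding undeclared identifiers, can be sketched as a walk over the tree. The AST encoding here is an assumption made for the example: a tuple `(operator, child, ...)` is an interior node, a string is an identifier, and any other value is a literal.

```python
def undeclared(ast, declared):
    """Return the identifiers used in `ast` that are not in `declared`."""
    if isinstance(ast, str):                 # identifier leaf
        return [] if ast in declared else [ast]
    if isinstance(ast, tuple):               # interior node: skip the operator
        _, *children = ast
        return [name for child in children for name in undeclared(child, declared)]
    return []                                # literal leaf (e.g. a number)
```

For the assignment `c = 1 * d`, encoded as `("=", "c", ("*", 1, "d"))`, the check reports `d` when only `c` has been declared. A real semantic analyzer would consult the symbol table rather than a plain set.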
Semantic Analyzer • Some semantic errors have to be detected at run time, because the necessary information may not be available at compile time. • Array subscript is out of bounds. • Variables are not initialized. • Divide by zero.
Error Management • Errors can occur at all phases in the compiler • Invalid input characters, syntax errors, semantic errors, etc. • Good compilers will attempt to recover from errors and continue.
Translator • The lexical scanner, parser, and semantic analyzer are collectively known as the front end of the compiler. • The second part, or back end, starts by generating low-level code from the (possibly optimized) AST.
Translator • Rather than generate code for a specific architecture, most compilers generate intermediate language • Three address code is popular. • Really a flattened tree representation. • Simple. • Flexible (captures the essence of many target architectures). • Can be interpreted.
Translator • One way of performing intermediate code generation: • Attach meaning to each node of the AST. • The meaning of the sentence = the “meaning” attached to the root of the tree.
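One concrete reading of "attach meaning to each node" is to let each node's meaning be the name of the place that holds its value, and to emit one three-address instruction per interior node. The sketch below assumes an AST encoded as nested tuples `(op, left, right)` with string leaves; the temporary names `t1`, `t2`, ... and the textual instruction format are illustrative choices, not a standard.

```python
import itertools


def gen_tac(ast):
    """Flatten an expression tree into three-address code.

    Returns (instructions, place), where `place` names the value of the root.
    """
    code = []
    temps = (f"t{i}" for i in itertools.count(1))   # fresh temporary names

    def visit(node):
        if isinstance(node, str):      # leaf: its meaning is its own name
            return node
        op, left, right = node
        left_place = visit(left)
        right_place = visit(right)
        target = next(temps)
        code.append(f"{target} = {left_place} {op} {right_place}")
        return target

    place = visit(ast)
    return code, place
```

For `a + b * c`, encoded as `("+", "a", ("*", "b", "c"))`, this produces `t1 = b * c` followed by `t2 = a + t1`, which shows why three-address code can be seen as a flattened tree.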
XIL • An example of a medium-level intermediate language is XIL. XIL is used by IBM to compile FORTRAN, C, C++, and Pascal for RS/6000. • Compilers for Fortran 90 and C++ have been developed using XIL for other machines such as Intel 386, Sparc, and S/370.
Optimizers • Intermediate code is examined and improved. • Can be simple: • changing “a:=a+1” to “increment a” • changing “3*5” to “15” • Can be complicated: • reorganizing data and data accesses for cache efficiency • Optimization can improve running time by orders of magnitude, often also decreasing program size.
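The simple case of changing "3*5" to "15" is usually called constant folding. A minimal sketch follows, assuming an expression tree encoded as nested tuples `(op, left, right)` whose leaves are integers or variable names; the encoding and the function name are assumptions for this example.

```python
import operator

# Operators this sketch knows how to evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Replace subtrees whose operands are all integer constants by their value."""
    if not isinstance(node, tuple):
        return node                       # leaf: constant or variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int) and op in OPS:
        return OPS[op](left, right)       # both operands known: compute now
    return (op, left, right)              # otherwise keep the node
```

Folding works bottom-up, so `("+", "a", ("*", 3, 5))` becomes `("+", "a", 15)` while the reference to `a` is left untouched.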
Code Generation • Generation of “real executable code” for a particular target machine. • It is completed by the Final Assembly phase • Final output can either be • assembly language for the target machine • object code ready for linking • The “target machine” can be a virtual machine (such as the Java Virtual Machine, JVM), and the “real executable code” is “virtual code” (such as Java Bytecode).
Compiler Overview
Source Program: IF (a<b) THEN c=1*d;
Lexical Analyzer → Token Sequence: IF ( ID "a" < ID "b" THEN ID "c" = CONST "1" * ID "d"
Syntax Analyzer → Syntax Tree: IF_stmt with a cond_expr (a < b) and a list holding an assign_stmt (lhs c, rhs 1 * d)
Semantic Analyzer → 3-Address Code: GE a, b, L1 ; MULT 1, d, c ; L1:
Code Optimizer → Optimized 3-Addr. Code: GE a, b, L1 ; MOV d, c ; L1:
Code Generation → Assembly Code: loadi R1,a ; cmpi R1,b ; jge L1 ; loadi R1,d ; storei R1,c ; L1:
What is Lexical Analysis? • The lexical analyzer deals with small-scale language constructs, such as names and numeric literals. • The syntax analyzer deals with the large-scale constructs, such as expressions, statements, and program units. • The syntax analysis portion of a compiler consists of two parts: 1. A low-level part called a lexical analyzer (essentially a pattern matcher). 2. A high-level part called a syntax analyzer, or parser. • The lexical analyzer collects characters into logical groupings and assigns internal codes to the groupings according to their structure.
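Lexical Analyzer in Perspective [Figure: the source program feeds the lexical analyzer; the parser asks the lexical analyzer to "get next token" and receives a token in return; both the lexical analyzer and the parser use the symbol table.]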
Lexical Analyzer in Perspective • LEXICAL ANALYZER: • Scan Input • Remove white space, … • Identify Tokens • Create Symbol Table • Insert Tokens into AST • Generate Errors • Send Tokens to Parser • PARSER: • Perform Syntax Analysis • Actions Dictated by Token Order • Update Symbol Table Entries • Create Abstract Rep. of Source • Generate Errors
Lexical analyzers extract lexemes from a given input string and produce the corresponding tokens. Example: sum = oldsum - value / 100;
Token          Lexeme
IDENT          sum
ASSIGN_OP      =
IDENT          oldsum
SUBTRACT_OP    -
IDENT          value
DIVISION_OP    /
INT_LIT        100
SEMICOLON      ;
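The lexeme-to-token mapping for this statement can be reproduced with a small regular-expression scanner. This is a sketch only: the token names come from the slide, but the pattern table, the `WS` pseudo-token, and the function name are assumptions made for the example.

```python
import re

# (token name, pattern) pairs; order matters only when patterns overlap.
TOKEN_SPEC = [
    ("INT_LIT",     r"\d+"),
    ("IDENT",       r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN_OP",   r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON",   r";"),
    ("WS",          r"\s+"),      # white space is matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs for `text`."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("sum = oldsum - value / 100;")` yields the eight (token, lexeme) pairs listed on the slide, in order.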
Basic Terminology • What are Major Terms for Lexical Analysis? • TOKEN • A classification for a common set of strings • Examples Include <Identifier>, <number>, etc. • PATTERN • The rules which characterize the set of strings for a token • LEXEME • Actual sequence of characters that matches pattern and is classified by a token • Identifiers: x, count, name, etc…
Basic Terminology
Token       Sample Lexemes          Informal Description of Pattern
const       const                   const
if          if                      if
relation    <, <=, =, <>, >, >=     < or <= or = or <> or >= or >
id          pi, count, D2           letter followed by letters and digits
num         3.1416, 0, 6.02E23      any numeric constant
literal     "core dumped"           any characters between " and " except "
• The token classifies the pattern. • Actual values are critical. Info is: 1. Stored in symbol table 2. Returned to parser
Token Definitions • Suppose s is the string banana • Prefix: ban, banana • Suffix: ana, banana • Substring: nan, ban, ana, banana • Subsequence: bnan, nn
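The four notions differ in whether the characters must be contiguous and where they must sit. A short sketch with hypothetical helper names makes the distinction checkable against the banana examples.

```python
def is_prefix(t, s):
    return s.startswith(t)          # t sits at the beginning of s


def is_suffix(t, s):
    return s.endswith(t)            # t sits at the end of s


def is_substring(t, s):
    return t in s                   # t occurs contiguously anywhere in s


def is_subsequence(t, s):
    # Characters of t must appear in s in order, but not necessarily adjacent.
    remaining = iter(s)
    return all(ch in remaining for ch in t)
```

`bnan` is a subsequence of `banana` but not a substring, because the characters appear in order yet are not adjacent.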
Token Definitions
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand Notation:
"+" : one or more. r* = r+ | ε and r+ = r r*
"?" : zero or one. r? = r | ε
[range] : set range of characters (replaces "|"). [A-Z] = A | B | C | … | Z
id → [A-Za-z][A-Za-z0-9]*
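The shorthand form of the id definition carries over almost unchanged to the regular-expression syntax of most languages. A small sketch using Python's `re` module; the helper name `is_id` is an assumption.

```python
import re

# id -> [A-Za-z][A-Za-z0-9]*  (a letter followed by letters and digits)
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_id(lexeme):
    """True when the whole lexeme matches the id pattern."""
    return ID_PATTERN.fullmatch(lexeme) is not None
```

`fullmatch` is used so that the entire lexeme must fit the pattern; `D2` is accepted, while `2D` is rejected because it starts with a digit.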
Token Recognition
Assume Following Tokens: if, then, else, relop, id, num
What language construct are they used for?
Given Tokens, What are Patterns?
Grammar:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
Patterns:
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
What does this represent?
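The num pattern reads as: one or more digits, an optional fraction, and an optional exponent with an optional sign. A sketch of the same pattern in Python's `re` syntax, with a hypothetical helper name:

```python
import re

# num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
NUM_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def is_num(lexeme):
    """True when the whole lexeme is an unsigned numeric constant."""
    return NUM_PATTERN.fullmatch(lexeme) is not None
```

Under this pattern `3.1416`, `0`, and `6.02E23` are accepted, while `1.` is not, because a decimal point must be followed by at least one digit.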
What Else Does Lexical Analyzer Do? • Scan away blanks (b), newlines (nl), tabs • Can we Define Tokens For These?
blank → b
tab → ^T
newline → ^M
delim → blank | tab | newline
ws → delim+
Symbol Tables
Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Note: Each token has a unique token identifier to define category of lexemes
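For the relational operators, the token is always relop and the attribute value tells the six lexemes apart. The sketch below tries the two-character lexemes before the one-character ones so that `<=` is not split into `<` and `=`; the function name and return shape are assumptions for this example.

```python
# Lexeme -> attribute value, as in the table.
RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}


def scan_relop(text, pos=0):
    """Try to read a relop at `pos`.

    Returns ((token, attribute), new_pos) on success, or (None, pos) otherwise.
    """
    two = text[pos:pos + 2]
    if two in RELOPS:                       # longest match first
        return ("relop", RELOPS[two]), pos + 2
    one = text[pos:pos + 1]
    if one in RELOPS:
        return ("relop", RELOPS[one]), pos + 1
    return None, pos
```

Preferring the longer lexeme is the usual "longest match" rule that scanners apply when one token's pattern is a prefix of another's.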
Building a Lexical Analyzer There are three approaches to building a lexical analyzer: 1. Write a formal description of the token patterns of the language using a descriptive language. Tool on UNIX system called lex 2. Design a state transition diagram that describes the token patterns of the language and write a program that implements the diagram. 3. Design a state transition diagram and hand-construct a table-driven implementation of the state diagram.
Diagrams for Tokens • Transition Diagrams (TD) are used to represent the tokens • Each Transition Diagram has: • States: Represented by Circles • Actions: Represented by Arrows between states • Start State: Beginning of a pattern (Arrowhead) • Final State(s): End of pattern (Concentric Circles) • Deterministic: No need to choose between 2 different actions
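A deterministic transition diagram can be stored as a table keyed by (state, input class), which is the table-driven approach. The sketch below encodes a diagram for unsigned numbers (digits, optional fraction, optional exponent). The state numbers 0 to 6 and the input-class names are assumptions of this example and are not the state numbers used in the slide figures.

```python
def char_class(ch):
    """Map a character to the label used on the diagram's arrows."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    if ch == "E":
        return "E"
    if ch in "+-":
        return "sign"
    return "other"


# (state, input class) -> next state; a missing entry means "no arrow".
TRANSITIONS = {
    (0, "digit"): 1,
    (1, "digit"): 1, (1, "dot"): 2, (1, "E"): 4,
    (2, "digit"): 3,
    (3, "digit"): 3, (3, "E"): 4,
    (4, "digit"): 6, (4, "sign"): 5,
    (5, "digit"): 6,
    (6, "digit"): 6,
}
START = 0
FINAL = {1, 3, 6}        # integer, fraction, and exponent forms


def accepts(text):
    """Run the table-driven automaton over the whole string."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:            # no arrow for this input: reject
            return False
    return state in FINAL
```

Because each (state, input class) pair has at most one entry, the driver loop never has to choose between two actions, which is exactly the determinism property listed above.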
Example: Transition Diagrams [Figure: three transition diagrams, states 12 to 27, with arrows labeled digit, ".", E, "+ | -", and other.]
State diagram to recognize names, reserved words, and integer literals
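A state diagram of this kind can also be implemented directly as a program, with one branch per start-state arrow. The sketch below is an assumption-laden illustration: the reserved-word set, the token names, and the choice to recognize reserved words by looking the finished name up in a table are choices made for this example.

```python
# Hypothetical reserved-word table; a real language would list its own.
RESERVED = {"for", "while", "class", "if", "then", "else"}


def lex(text):
    """Split `text` into (token, lexeme) pairs for names, reserved words, integers."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():                 # skip white space between lexemes
            i += 1
            continue
        start = i
        if ch.isalpha():                 # letter: stay while letters or digits
            while i < n and text[i].isalnum():
                i += 1
            word = text[start:i]
            kind = "RESERVED" if word in RESERVED else "IDENT"
            tokens.append((kind, word))
        elif ch.isdigit():               # digit: stay while digits
            while i < n and text[i].isdigit():
                i += 1
            tokens.append(("INT_LIT", text[start:i]))
        else:
            raise SyntaxError(f"illegal character {ch!r} at position {i}")
    return tokens
```

Looking reserved words up after the name has been read keeps the diagram small, since it avoids a separate path of states for every keyword.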
Reasons to use BNF to Describe Syntax • Provides a clear syntax description • The parser can be based directly on the BNF • Parsers based on BNF are easy to maintain
Reasons to Separate Lexical and Syntax Analysis • Simplicity: less complex approaches can be used for lexical analysis; separating them simplifies the parser • Efficiency: separation allows optimization of the lexical analyzer • Portability: parts of the lexical analyzer may not be portable, but the parser always is portable
Summary of Lexical Analysis • A lexical analyzer is a pattern matcher for character strings • A lexical analyzer is a “front-end” for the parser • Identifies substrings of the source program that belong together - lexemes • Lexemes match a character pattern, which is associated with a lexical category called a token - sum is a lexeme; its token may be IDENT
The Compiler So Far • Lexical analysis • Detects inputs with illegal tokens • Parsing • Detects inputs with ill-formed parse trees • Semantic analysis • The last “front end” phase • Catches more errors
What’s Wrong? • Example 1 int in x; • Example 2 int i = 12.34;