540 likes | 673 Views
Lexical Analysis and Scanning. Compiler Construction Lecture 2 Spring 2001 Robert Dewar. The Input. Read string input Might be sequence of characters (Unix) Might be sequence of lines (VMS) Character set ASCII ISO Latin-1 ISO 10646 (16-bit = unicode) Others (EBCDIC, JIS, etc).
E N D
Lexical Analysis and Scanning Compiler Construction Lecture 2 Spring 2001 Robert Dewar
The Input • Read string input • Might be sequence of characters (Unix) • Might be sequence of lines (VMS) • Character set • ASCII • ISO Latin-1 • ISO 10646 (16-bit = unicode) • Others (EBCDIC, JIS, etc)
The Output • A series of tokens • Punctuation ( ) ; , [ ] • Operators + - ** := • Keywords begin end if • Identifiers Square_Root • String literals “hello this is a string” • Character literals ‘x’ • Numeric literals 123 4_5.23e+2 16#ac#
Free form vs Fixed form • Free form languages • White space does not matter • Tabs, spaces, new lines, carriage returns • Only the ordering of tokens is important • Fixed format languages • Layout is critical • Fortran, label in cols 1-6 • COBOL, area A B • Lexical analyzer must worry about layout
Punctuation • Typically individual special characters • Such as + - • Lexical analyzer does not know : from : • Sometimes double characters • E.g. (* treated as a kind of bracket • Returned just as identity of token • And perhaps location • For error message and debugging purposes
Operators • Like punctuation • No real difference for lexical analyzer • Typically single or double special chars • Operators + - • Operations := • Returned just as identity of token • And perhaps location
Keywords • Reserved identifiers • E.g. BEGIN END in Pascal, if in C • Maybe distinguished from identifiers • E.g. mode vs mode in Algol-68 • Returned just as token identity • With possible location information • Unreserved keywords (e.g. PL/1) • Handled as identifiers (parser distinguishes)
Identifiers • Rules differ • Length, allowed characters, separators • Need to build table • So that junk1 is recognized as junk1 • Typical structure: hash table • Lexical analyzer returns token type • And key to table entry • Table entry includes location information
More on Identifier Tables • Most common structure is hash table • With fixed number of headers • Chain according to hash code • Serial search on one chain • Hash code computed from characters • No hash code is perfect! • Avoid any arbitrary limits
String Literals • Text must be stored • Actual characters are important • Not like identifiers • Character set issues • Table needed • Lexical analyzer returns key to table • May or may not be worth hashing
Character Literals • Similar issues to string literals • Lexical Analyzer returns • Token type • Identity of character • Note, cannot assume character set of host machine, may be different
Numeric Literals • Also need a table • Typically record value • E.g. 123 = 0123 = 01_23 (Ada) • But cannot use int for values • Because may have different characteristics • Float stuff much more complex • Denormals, correct rounding • Very delicate stuff
Handling Comments • Comments have no effect on program • Can therefore be eliminated by scanner • But may need to be retrieved by tools • Error detection issues • E.g. unclosed comments • Scanner does not return comments
Case Equivalence • Some languages have case equivalence • Pascal, Ada • Some do not • C, Java • Lexical analyzer ignores case if needed • This_Routine = THIS_RouTine • Error analysis may need exact casing
Issues to Address • Speed • Lexical analysis can take a lot of time • Minimize processing per character • I/O is also an issue (read large blocks) • We compile frequently • Compilation time is important • Especially during development
General Approach • Define set of token codes • An enumeration type • A series of integer definitions • These are just codes (no semantics) • Some codes associated with data • E.g. key for identifier table • May be useful to build tree node • For identifiers, literals etc
Interface to Lexical Analyzer • Convert entire file to a file of tokens • Lexical analyzer is separate phase • Parser calls lexical analyzer • Get next token • This approach avoids extra I/O • Parser builds tree as we go along
Implementation of Scanner • Given the input text • Generate the required tokens • Or provide token by token on demand • Before we describe implementations • We take this short break • To describe relevant formalisms
Relevant Formalisms • Type 3 (Regular) Grammars • Regular Expressions • Finite State Machines
Regular Grammars • Regular grammars • Non-terminals (arbitrary names) • Terminals (characters) • Two forms of rules • Non-terminal ::= terminal • Non-terminal ::= terminal Non-terminal • One non-terminal is the start symbol • Regular (type 3) grammars cannot count • No concept of matching nested parens
Regular Grammars • Regular grammars • E.g. grammar of reals with no exponent • REAL ::= 0 REAL1 (repeat for 1 .. 9) • REAL1 ::= 0 REAL1 (repeat for 1 .. 9) • REAL1 ::= . INTEGER • INTEGER ::= 0 INTEGER (repeat for 1 .. 9) • INTEGER ::= 0 (repeat for 1 .. 9) • Start symbol is REAL
Regular Expressions • Regular expressions (RE) defined by • Any terminal character is an RE • Alternation RE | RE • Concatenation RE1 RE2 • Repetition RE* (zero or more RE’s) • Language of RE’s = type 3 grammars • Regular expressions are more convenient
Specifying RE’s in Unix Tools • Single characters a b c d \x • Alternation [bcd] [b-z] ab|cd • Match any character . • Match sequence of characters x* y+ • Concatenation abc[d-q] • Optional [0-9]+(.[0-9]*)?
Finite State Machines • Languages and Automata • A language is a set of strings • An automaton is a machine • That determines if a given string is in the language or not. • FSM’s are automata that recognize regular languages (regular expressions)
Definitions of FSM • A set of labeled states • Directed arcs labeled with character • A state may be marked as terminal • Transition from state S1 to S2 • If and only if arc from S1 to S2 • Labeled with next character (which is eaten) • Recognized if ends up in terminal state • One state is distinguished start state
Building FSM from Grammar • One state for each non-terminal • A rule of the form • Nont1 ::= terminal • Generates transition from S1 to final state • A rule of the form • Nont1 ::= terminal Nont2 • Generates transition from S1 to S2
Building FSM’s from RE’s • Every RE corresponds to a grammar • For all regular expressions • A natural translation to FSM exists • We will not give details of algorithm here
Non-Deterministic FSM • A non-deterministic FSM • Has at least one state • With two arcs to two separate states • Labeled with the same character • Which way to go? • Implementation requires backtracking • Nasty
Deterministic FSM • For all states S • For all characters C • There is either ONE or NO arcs • From state S • Labeled with character C • Much easier to implement • No backtracking
Dealing with ND FSM • Construction naturally leads to ND FSM • For example, consider FSM for • [0-9]+ | [0-9]+\.[0-9]+ • (integer or real) • We will naturally get a start state • With two sets of 0-9 branches • And thus non-deterministic
Converting to Deterministic • There is an algorithm for converting • From any ND FSM • To an equivalent deterministic FSM • Algorithm is in the text book • Example (given in terms of RE’s) • [0-9]+ | [0-9]+\.[0-9]+ • [0-9]+(\.[0-9]+)?
Implementing the Scanner • Three methods • Completely informal, just write code • Define tokens using regular expressions • Convert RE’s to ND finite state machine • Convert ND FSM to deterministic FSM • Program the FSM • Use an automated program • To achieve above three steps
Ad Hoc Code (forget FSM’s) • Write normal hand code • A procedure called Scan • Normal coding techniques • Basically scan over white space and comments till non-blank character found. • Base subsequent processing on character • E.g. colon may be : or := • / may be operator or start of comment • Return token found • Write aggressive efficient code
Using FSM Formalisms • Start with regular grammar or RE • Typically found in the language standard • For example, for Ada: • Chapter 2. Lexical Elements • Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • decimal-literal ::= integer [.integer][exponent] • integer ::= digit {[underline] digit} • exponent ::= E [+] integer | E - integer
Using FSM formalisms, cont • Given RE’s or grammar • Convert to finite state machine • Convert ND FSM to deterministic FSM • Write a program to recognize • Using the deterministic FSM
Implementing FSM (Method 1) • Each state is code of the form: • <<state1>> case Next_Character is when ‘a’ => goto state3; when ‘b’ => goto state1; when others => End_of_token_processing; end case; • <<state2>> …
Implementing FSM (Method 2) • There is a variable called State loop case State is when state1 =><<state1>> case Next_Character is when ‘a’ => State := state3; when ‘b’ => State := state1; when others => End_token_processing; end case; when state2 … … end case; end loop;
Implementing FSM (Method 3) • T : array (State, Character) of State;while More_Input loop Curstate := T (Curstate, Next_Char); if Curstate = Error_State then …end loop;
Automatic FSM Generation • Our example, FLEX • See home page for manual in HTML • FLEX is given • A set of regular expressions • Actions associated with each RE • It builds a scanner • Which matches RE’s and executes actions
Flex General Format • Input to Flex is a set of rules: • Regexp actions (C statements) • Regexp actions (C statements) • … • Flex scans the longest matching Regexp • And executes the corresponding actions
An Example of a Flex scanner • DIGIT [0-9]ID [a-z][a-z0-9]*%%{DIGIT}+ { printf (“an integer %s (%d)\n”, yytext, atoi (yytext)); }{DIGIT}+”.”{DIGIT}* { printf (“a float %s (%g)\n”, yytext, atof (yytext));if|then|begin|end|procedure|function { printf (“a keyword: %s\n”, yytext));
Flex Example (continued) {ID} printf (“an identifier %s\n”, yytext);“+”|“-”|“*”|“/” { printf (“an operator %s\n”, yytext); } “--”.*\n /* eat Ada style comment */ [ \t\n]+ /* eat white space */ . printf (“unrecognized character”);%%
Assembling the flex program %{ #include <math.h> /* for atof */ %} <<flex text we gave goes here>> %% main (argc, argv) int argc; char **argv; { yyin = fopen (argv[1], “r”); yylex(); }
Running flex • flex is a program that is executed • The input is as we have given • The output is a running C program • For Ada fans • Look at aflex (www.adapower.com) • For C++ fans • flex can run in C++ mode • Generates appropriate classes
Choice Between Methods? • Hand written scanners • Typically much faster execution • And pretty easy to write • And a easier for good error recovery • Flex approach • Simple to Use • Easy to modify token language
The GNAT Scanner • Hand written (scn.adb/scn.ads) • Basically a call does • Super quick scan past blanks/comments etc • Big case statement • Process based on first character • Call special routines • Namet.Get_Name for identifier (hashing) • Keywords recognized by special hash • Strings (stringt.ads) • Integers (uintp.ads) • Reals (ureal.ads)
More on the GNAT Scanner • Entire source read into memory • Single contiguous block • Source location is index into this block • Different index range for each source file • See sinput.adb/ads for source mgmt • See scans.ads for definitions of tokens
More on GNAT Scanner • Read scn.adb code • Very easy reading, e.g.
DTL (Dewar Trivial Language) • DTL Grammar • Program ::= DECLARE Decls BEGIN Stmts • Decls ::= {Decl}* • Stmts ::= {Stmt}+ • Type ::= INTEGER | REAL • Identifier ::= letter (_{digit}+)* • Decl ::= DECLARE identifier : Type
DTL (Continued) • Integer_Literal ::= {digit}+Real_Literal ::= {digit}+”.”{digit}*Stmt ::= Assignstmt | Ifstmt | WhilestmtAssignstmt ::= Identifier := ExprExpr ::= Literal | (Expr) Op (Expr)Op ::= + | -Literal ::= Integer_Literal | Real_LiteralIfstmt ::= IF Expr Relop Expr THEN StmtsWhilestmt ::= WHILE Expr Relop Expr DO StmtsRelop ::= > | < | >= | <=