First Phase: Lexical Analysis (Scanning)

CIS 461 Compiler Design & Construction, Fall 2012
Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon
Lecture-Module #6: Further Lexical Analysis

First Phase: Lexical Analysis (Scanning)
Scanner
• Maps a stream of characters into tokens, the basic unit of syntax
• The characters that form a word are its lexeme
• Its syntactic category is called its token
• The scanner discards white space and comments
• The scanner works as a subroutine of the parser
[Figure: source code → Scanner → token → Parser → IR; the parser sends "get next token" requests to the scanner; both report errors]

Lexical Analysis
• Specify tokens using regular expressions
• Translate regular expressions to finite automata
• Use finite automata to generate tables or code for the scanner
[Figure: specifications (regular expressions) → Scanner Generator → tables or code → Scanner; source code → Scanner → tokens]

Automating Scanner Construction
To build a scanner:
• Write down the RE that specifies the tokens
• Translate the RE to an NFA
• Build the DFA that simulates the NFA
• Systematically shrink the DFA
• Turn it into code or a table
Scanner generators
• Lex, Flex, and JLex work along these lines
• The algorithms are well known and well understood
• The interface to the parser is important

Automating Scanner Construction
RE → NFA (Thompson's construction)
• Build an NFA for each term
• Combine them with ε-moves
NFA → DFA (subset construction)
• Build the simulation
DFA → Minimal DFA
• Hopcroft's algorithm
DFA → RE
• All pairs, all paths problem
• Union together paths from s0 to a final state
[Figure: The Cycle of Constructions: RE → NFA → DFA → minimal DFA, with DFA → RE closing the cycle]
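The NFA → DFA step can be illustrated with a short Java sketch of the subset construction. It is not taken from the slides: the class name, the triple-based NFA encoding, the EPS marker, and the example NFA for (a|b)*abb are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Sketch of the subset construction; all names here are illustrative. */
final class SubsetConstruction {
    static final int EPS = -1; // label for an epsilon-move (a convention of this sketch)

    // Hypothetical NFA for (a|b)*abb as {from, label, to} triples.
    // State 0 is the start state and state 3 is the accepting state.
    static final int[][] NFA = {
        { 0, 'a', 0 }, { 0, 'b', 0 }, { 0, 'a', 1 }, { 1, 'b', 2 }, { 2, 'b', 3 },
    };
    static final int NFA_START = 0;
    static final int NFA_ACCEPT = 3;

    /** Adds every NFA state reachable through epsilon-moves alone. */
    static TreeSet<Integer> closure(TreeSet<Integer> states) {
        TreeSet<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : NFA) {
                if (e[0] == s && e[1] == EPS && result.add(e[2])) work.push(e[2]);
            }
        }
        return result;
    }

    /** NFA states reachable from the given set on one input character. */
    static TreeSet<Integer> move(TreeSet<Integer> states, char c) {
        TreeSet<Integer> result = new TreeSet<>();
        for (int[] e : NFA) {
            if (e[1] == c && states.contains(e[0])) result.add(e[2]);
        }
        return result;
    }

    static TreeSet<Integer> startState() {
        TreeSet<Integer> s = new TreeSet<>();
        s.add(NFA_START);
        return closure(s);
    }

    /** Builds the DFA transition table; each DFA state is a set of NFA states. */
    static Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> build(String alphabet) {
        Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> dfa = new HashMap<>();
        Deque<TreeSet<Integer>> work = new ArrayDeque<>();
        TreeSet<Integer> s0 = startState();
        dfa.put(s0, new HashMap<>());
        work.push(s0);
        while (!work.isEmpty()) {
            TreeSet<Integer> s = work.pop();
            for (char c : alphabet.toCharArray()) {
                TreeSet<Integer> t = closure(move(s, c));
                if (t.isEmpty()) continue; // no transition: implicit error state
                dfa.get(s).put(c, t);
                if (!dfa.containsKey(t)) {
                    dfa.put(t, new HashMap<>());
                    work.push(t);
                }
            }
        }
        return dfa;
    }

    /** Runs the generated DFA on x with one table lookup per character. */
    static boolean accepts(String x) {
        Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> dfa = build("ab");
        TreeSet<Integer> state = startState();
        for (char c : x.toCharArray()) {
            state = dfa.get(state).get(c);
            if (state == null) return false;
        }
        return state.contains(NFA_ACCEPT);
    }
}
```

For this particular NFA the construction yields four DFA states ({0}, {0,1}, {0,2}, {0,3}); a minimization pass such as Hopcroft's algorithm would run on that table next.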

NFA vs. DFA Scanners
• Given a regular expression r, we can convert it to an NFA of size O(|r|)
• Given an NFA, we can convert it to a DFA of size O(2^|r|)
• We can simulate a DFA on a string x in O(|x|) time
• We can simulate an NFA N (constructed by Thompson's construction) on a string x in O(|N| |x|) time
Recognizing input string x for regular expression r:
  Automaton   Space       Time
  NFA         O(|r|)      O(|r| |x|)
  DFA         O(2^|r|)    O(|x|)
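The NFA side of this trade-off can be sketched in Java by tracking the set of active states while reading the input. The encoding below (edge triples, the EPS marker, and a hand-built NFA for (a|b)*abb) is an assumption made for illustration, and the epsilon-closure is a naive fixpoint loop rather than a tuned implementation.

```java
/** Sketch of on-the-fly NFA simulation; all names here are illustrative. */
final class NfaSimulation {
    static final int EPS = -1; // label for an epsilon-move (a convention of this sketch)

    // Hypothetical NFA for (a|b)*abb as {from, label, to} triples.
    // State 0 is the start, with an epsilon-move into the looping state 1; state 4 accepts.
    static final int[][] EDGES = {
        { 0, EPS, 1 }, { 1, 'a', 1 }, { 1, 'b', 1 }, { 1, 'a', 2 }, { 2, 'b', 3 }, { 3, 'b', 4 },
    };
    static final int NUM_STATES = 5;
    static final int START = 0;
    static final int ACCEPT = 4;

    /** Extends the set in place with every state reachable through epsilon-moves. */
    static void closure(boolean[] active) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : EDGES) {
                if (e[1] == EPS && active[e[0]] && !active[e[2]]) {
                    active[e[2]] = true;
                    changed = true;
                }
            }
        }
    }

    /** Work per input character depends only on the NFA size, not on the input length. */
    static boolean accepts(String x) {
        boolean[] current = new boolean[NUM_STATES];
        current[START] = true;
        closure(current);
        for (int i = 0; i < x.length(); i++) {
            boolean[] next = new boolean[NUM_STATES];
            for (int[] e : EDGES) {
                if (e[1] == x.charAt(i) && current[e[0]]) next[e[2]] = true;
            }
            closure(next);
            current = next;
        }
        return current[ACCEPT];
    }
}
```

No DFA table is ever built here, so the space stays proportional to the NFA, at the price of scanning the edges for every input character.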

Scanner Generators: JLex, Lex, Flex
Layout of a JLex specification file:
  user code                  (copied directly to the output file)
  %%
  JLex directives            (macro (regular) definitions, e.g., digits = [0-9]+, and state names)
  %%
  regular expression rules   (each rule: optional state list, regular expression, action)
• User code at the top (from the parser generator) specifies what the tokens are
• States can be mixed with regular expressions
• For each regular expression we can define a set of states where it is valid (JLex, Flex)
• Standard format of a regular expression rule:
  <optional_state_list> regular_expression { actions }

JLex, Flex, Lex
Regular expression rules:
  r_1 { action_1 }
  r_2 { action_2 }
  . . .
  r_n { action_n }
The generated scanner is Java code for JLex and C code for Flex and Lex.
[Figure: a new start state s0 with ε-moves to the automata A_r_1, A_r_2, ..., A_r_n for the regular expressions r_1, ..., r_n; each automaton has its own error state, and their final states become the new final states]
Rules used by scanner generators:
1) Continue scanning the input until reaching an error state
2) Accept the longest prefix that matches a regular expression and execute the corresponding action
3) If two patterns match the longest prefix, the action specified earlier is executed
4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
For faster scanning, convert this NFA to a DFA and minimize the states
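Rules 2 through 4 (longest match, earlier rule wins a tie, resume after the accepted prefix) can be sketched in Java with java.util.regex. This is not how JLex works internally, since a generated scanner runs a DFA; the rule names and patterns below are illustrative assumptions. The patterns contain no alternation, so a greedy match is also the longest match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of longest-match scanning with rule priority; all names are illustrative. */
final class LongestMatch {
    // Rules in priority order: an earlier rule wins when match lengths tie.
    private static final String[] NAMES = { "WS", "IF", "ID", "NUM" };
    private static final Pattern[] RULES = {
        Pattern.compile("[ \\t\\r\\n]+"),
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            int bestRule = -1;
            for (int r = 0; r < RULES.length; r++) {
                Matcher m = RULES[r].matcher(input).region(pos, input.length());
                // Strict '>' keeps the earlier rule when two rules match the same length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = r;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            if (bestRule != 0) { // rule 0 is white space, which is discarded
                tokens.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen; // resume right after the accepted prefix
        }
        return tokens;
    }
}
```

With these rules, "if" is reported as IF because the IF rule comes first, while "iffy" is reported as ID because the identifier rule matches a longer prefix.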

A Simple Example
Recognize the following tokens:
  Id = [a-z][a-z0-9]*
  Num = [0-9]+
  if = "if"
Also take care of one-line comments and white space:
  WhiteSpace = [\ \t\f\b\r\n]
  Comment = \/\/.*

/* User code */
import java.io.*; // For FileInputStream and its exceptions.
/* ========================================== */
class Type {
  static final int IF = 0;
  static final int ID = 1;
  static final int NUM = 2;
  static final int EOF = 3;
};
class Token {
  public int type;
  public String attribute;
  public Token(int t) { type = t; }
  public Token(int t, String s) { type = t; attribute = s; }
  public static String spellingOf(int t) {
    switch (t) {
      case Type.IF  : return "IF";
      case Type.ID  : return "ID";
      case Type.NUM : return "NUM";
      default       : return "Undefined token type";
    }
  }
  public String toString() {
    switch (type) {
      case Type.ID  :
      case Type.NUM : return spellingOf(type) + ", " + attribute;
      default       : return spellingOf(type);
    }
  }
};

/* ================================================= */
class Example {
  public static void main(String[] args) throws FileNotFoundException, IOException {
    FileInputStream fis = new FileInputStream(args[0]);
    Lexer L = new Lexer(fis);
    Token T = L.next();
    while (T.type != Type.EOF) {
      System.out.println(T);
      T = L.next();
    }
  }
}
/* ================================================ */
%%
/* JLex directives */
%class Lexer
%function next
%type Token
%eofval{
  return new Token(Type.EOF);
%eofval}
/* white space */
WhiteSpace = [\ \t\f\b\r\n]
/* comments */
Comment = \/\/.*
Id = [a-z][a-z0-9]*
Num = [0-9]+
%%
{WhiteSpace} {}
{Comment} {}
"if" { return new Token(Type.IF); }
{Id} { return new Token(Type.ID, yytext()); }
{Num} { return new Token(Type.NUM, yytext()); }

If the above JLex specification is in a file simple.jlx, you can generate a scanner for that specification as follows:
  % cd <directory for simple.jlx>
  % setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
  % java JLex.Main simple.jlx
  % javac simple.jlx.java
  % java Example input1

Input:
  if i1 // this is a comment
  if var15 15 1 2 4253
Output:
  IF
  ID, i1
  IF
  ID, var15
  NUM, 15
  NUM, 1
  NUM, 2
  NUM, 4253

Input:
  if i1 // this is a comment
  if var15 15 1 , 2 , 3
Output:
  IF
  ID, i1
  IF
  ID, var15
  NUM, 15
  NUM, 1
  Undefined token type
  NUM, 2
  Undefined token type
  NUM, 3

Building Faster Scanners from the DFA
Table-driven recognizers waste a lot of effort
• Read (& classify) the next character
• Find the next state
• Assign to the state variable
• Branch back to the top
We can do better
• Encode state & actions in the code
• Do transition tests locally
• Generate ugly, spaghetti-like code (it is OK, this is automatically generated code)
• Takes (many) fewer operations per input character
Table-driven recognizer:
  state = s0;
  string = ε;
  char = get_next_char();
  while (char != eof) {
    state = δ(state, char);
    string = string + char;
    char = get_next_char();
  }
  if (state in Final) then report acceptance;
  else report failure;
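The table-driven loop can be written out as a small Java sketch. The DFA chosen here, for register names of the form r Digit Digit*, and the three-way character classification are assumptions made for illustration; a generated scanner would emit much larger tables.

```java
/** Sketch of a table-driven recognizer for r Digit Digit*; all names are illustrative. */
final class TableDrivenRecognizer {
    // Character classes: 0 = 'r', 1 = digit, 2 = anything else.
    private static int classify(char c) {
        if (c == 'r') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    // Transition table indexed by [state][character class].
    // States: 0 = s0, 1 = s1, 2 = s2 (final), 3 = se (error).
    private static final int[][] DELTA = {
        /* s0 */ { 1, 3, 3 },
        /* s1 */ { 3, 2, 3 },
        /* s2 */ { 3, 2, 3 },
        /* se */ { 3, 3, 3 },
    };
    private static final boolean[] FINAL = { false, false, true, false };

    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            // Classify the character, look up the next state, assign it, loop back.
            state = DELTA[state][classify(input.charAt(i))];
        }
        return FINAL[state];
    }
}
```

Every character pays for a classification, a two-dimensional table lookup, and an assignment, which is the overhead the slide points at.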

Building Faster Scanners from the DFA
A direct-coded recognizer for the register regular expression r Digit Digit*
• Many fewer operations per character
• State is encoded as the location in the code
  goto s0;
  s0: string ← ε;
      char ← get_next_char();
      if (char = 'r') then goto s1;
      else goto se;
  s1: string ← string + char;
      char ← get_next_char();
      if ('0' ≤ char ≤ '9') then goto s2;
      else goto se;
  s2: string ← string + char;
      char ← get_next_char();
      if ('0' ≤ char ≤ '9') then goto s2;
      else if (char = eof) then report acceptance;
      else goto se;
  se: print error message;
      return failure;
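A rough Java rendering of the same idea follows. Java has no goto, so straight-line code and a loop stand in for the labels s0, s1, and s2, and an early return stands in for the jump to se. The class and method names are illustrative assumptions.

```java
/** Sketch of a direct-coded recognizer for r Digit Digit*; all names are illustrative. */
final class DirectCodedRecognizer {
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static boolean accepts(String input) {
        int i = 0;

        // s0: the first character must be 'r'; anything else goes to the error state.
        if (i >= input.length() || input.charAt(i) != 'r') return false;
        i++;

        // s1: at least one digit must follow.
        if (i >= input.length() || !isDigit(input.charAt(i))) return false;
        i++;

        // s2: stay here while digits keep coming; end of input means acceptance.
        while (i < input.length()) {
            if (!isDigit(input.charAt(i))) return false;
            i++;
        }
        return true;
    }
}
```

There is no state variable and no table: the position in the method body is the state, and each character costs only the comparisons relevant to that state.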

Building Faster Scanners
Hashing keywords versus encoding them directly
• Some compilers recognize keywords as identifiers and check them in a hash table
• Encoding keywords in the DFA is a better idea
  • O(1) cost per transition
  • Avoids a hash lookup on each identifier
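For comparison, the hash-table approach that the slide argues against looks roughly like this in Java. The keyword set and the names are illustrative assumptions; the point is that every identifier-shaped lexeme pays for a lookup after the scanner has already matched it.

```java
import java.util.Set;

/** Sketch of keyword recognition by hash lookup; all names are illustrative. */
final class KeywordLookup {
    // Hypothetical keyword set for some small language.
    private static final Set<String> KEYWORDS = Set.of("if", "then", "else", "while");

    /** Called on every lexeme the scanner has matched with its identifier rule. */
    static String classify(String lexeme) {
        return KEYWORDS.contains(lexeme) ? "KEYWORD" : "ID";
    }
}
```

When the keywords are rules in the scanner specification instead, as with the "if" rule in the JLex example, the DFA reaches a keyword-specific final state and no separate lookup is needed.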

What is hard about Lexical Analysis?
Poor language design can complicate scanning
• Reserved words are important
  • In PL/I there are no reserved keywords, so you can write a valid statement like:
      if then then then = else; else else = then
• Significant blanks
  • In Fortran blanks are not significant
      do 10 i = 1,25    (do loop)
      do 10 i = 1.25    (assignment to a variable named do10i)
• Closures
  • Limited identifier length adds states to the automata to count length
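The Fortran case shows why the scanner cannot decide locally: the two statements are identical up to the character after the 1. A heavily simplified Java sketch of one way to tell them apart is to drop the blanks and look for a comma outside parentheses after the "=". It ignores string literals, statement labels, and every other statement form; all names are illustrative assumptions.

```java
/** Sketch: distinguishing a Fortran DO loop from an assignment; names are illustrative. */
final class FortranDo {
    static String classify(String statement) {
        // Blanks are not significant, so remove them before looking at the text.
        String s = statement.replace(" ", "").toLowerCase();
        int eq = s.indexOf('=');
        if (!s.startsWith("do") || eq < 0) return "not a DO candidate";

        // A DO loop has a comma at parenthesis depth 0 after the '=' (as in 1,25).
        int depth = 0;
        for (int i = eq + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (c == ',' && depth == 0) return "do loop";
        }
        return "assignment";
    }
}
```

The decision needs lookahead to the end of the statement, which is far beyond what a single DFA transition can see.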

What can be so hard? (Fortran 66/77)
[Fortran code listing; the annotations on its statements, in order:]
• integer function A
• Macro definitions
• First A and B are converted to (6-2); this statement declares that variables that begin with A and B are of data type four-character string
• )=(3 is a literal constant
• statement for formatting input, output
• assigns a value to variable DO9E1
• assigns a value to an array element
• one statement split into two lines
How does a compiler do this?
• First pass finds & inserts blanks
• Can add extra words or tags to create a scannable language
• Second pass is a normal scanner
