
First Phase: Lexical Analysis (Scanning)

CIS 461 Compiler Design & Construction, Fall 2012. Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon. Lecture-Module #6: Further Lexical Analysis. First Phase: Lexical Analysis (Scanning). The scanner maps a stream of characters into tokens, the basic units of syntax.


Presentation Transcript


  1. CIS 461 Compiler Design & Construction, Fall 2012. Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon. Lecture-Module #6: Further Lexical Analysis

  2. First Phase: Lexical Analysis (Scanning)
     Scanner
     • Maps stream of characters into tokens, the basic unit of syntax
     • Characters that form a word are its lexeme
     • Its syntactic category is called its token
     • Scanner discards white space and comments
     • Scanner works as a subroutine of the parser
     [Diagram: Source code → Scanner → token → Parser → IR; the Parser asks the Scanner to "get next token"; Errors are reported along the way]

  3. Lexical Analysis
     • Specify tokens using Regular Expressions
     • Translate Regular Expressions to Finite Automata
     • Use Finite Automata to generate tables or code for the scanner
     [Diagram: specifications (regular expressions) → Scanner Generator → tables or code → Scanner; source code → Scanner → tokens]

  4. Automating Scanner Construction
     To build a scanner:
     • Write down the RE that specifies the tokens
     • Translate the RE to an NFA
     • Build the DFA that simulates the NFA
     • Systematically shrink the DFA
     • Turn it into code or table
     Scanner generators
     • Lex, Flex, JLex work along these lines
     • Algorithms are well-known and well-understood
     • Interface to parser is important

  5. Automating Scanner Construction
     The Cycle of Constructions: RE → NFA → DFA → minimal DFA, and DFA → RE
     RE → NFA (Thompson's construction)
     • Build an NFA for each term
     • Combine them with ε-moves
     NFA → DFA (subset construction)
     • Build the simulation
     DFA → Minimal DFA
     • Hopcroft's algorithm
     DFA → RE
     • All pairs, all paths problem
     • Union together paths from s0 to a final state

  6. NFA vs. DFA Scanners
     • Given a regular expression r we can convert it to an NFA of size O(|r|)
     • Given an NFA we can convert it to a DFA of size O(2^|r|)
     • We can simulate a DFA on string x in O(|x|) time
     • We can simulate an NFA N (constructed by Thompson's construction) on a string x in O(|N| |x|) time
     Recognizing input string x for regular expression r (summary of the bounds above, using |N| = O(|r|)):
       Automaton   Space        Time
       NFA         O(|r|)       O(|r| |x|)
       DFA         O(2^|r|)     O(|x|)
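The O(|N| |x|) bound for NFA simulation comes from tracking the set of states the NFA could be in and updating that set once per input character. The following Java sketch illustrates the idea under stated assumptions: the class `NfaSim`, its edge representation, and the hand-built automaton for `a(b|c)*` are hypothetical and are not taken from the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

/** Hypothetical minimal NFA: states are 0..n-1, EPS labels an epsilon move. */
class NfaSim {
    static final char EPS = '\0';

    // edges.get(s) holds pairs {label, target} for state s.
    private final List<List<int[]>> edges = new ArrayList<>();
    private final int start;
    private final int accept;

    NfaSim(int numStates, int start, int accept) {
        for (int i = 0; i < numStates; i++) edges.add(new ArrayList<>());
        this.start = start;
        this.accept = accept;
    }

    void addEdge(int from, char label, int to) {
        edges.get(from).add(new int[] {label, to});
    }

    /** Adds to 'set' every state reachable from it through epsilon moves. */
    private void closure(BitSet set) {
        Deque<Integer> work = new ArrayDeque<>();
        for (int s = set.nextSetBit(0); s >= 0; s = set.nextSetBit(s + 1)) work.push(s);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : edges.get(s)) {
                if (e[0] == EPS && !set.get(e[1])) {
                    set.set(e[1]);
                    work.push(e[1]);
                }
            }
        }
    }

    /** One pass over x; each step looks at every edge a bounded number of times. */
    boolean accepts(String x) {
        BitSet current = new BitSet();
        current.set(start);
        closure(current);
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            BitSet next = new BitSet();
            for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
                for (int[] e : edges.get(s)) {
                    if (e[0] != EPS && e[0] == c) next.set(e[1]);
                }
            }
            closure(next);
            current = next;
        }
        return current.get(accept);
    }

    /** Hand-built Thompson-style NFA for a(b|c)*; state 9 is the final state. */
    static NfaSim example() {
        NfaSim n = new NfaSim(10, 0, 9);
        n.addEdge(0, 'a', 1);
        n.addEdge(1, EPS, 2);   // concatenation: a, then the starred part
        n.addEdge(2, EPS, 3);   // enter the loop body
        n.addEdge(2, EPS, 9);   // or skip it entirely
        n.addEdge(3, EPS, 4);   // alternative b
        n.addEdge(3, EPS, 6);   // alternative c
        n.addEdge(4, 'b', 5);
        n.addEdge(6, 'c', 7);
        n.addEdge(5, EPS, 8);
        n.addEdge(7, EPS, 8);
        n.addEdge(8, EPS, 3);   // repeat the loop body
        n.addEdge(8, EPS, 9);   // or leave the loop
        return n;
    }
}
```

Calling `NfaSim.example().accepts("abcb")` walks the string once and never builds the exponential-size DFA, which is the space/time trade-off the slide describes.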

  7. Scanner Generators: JLex, Lex, FLex
     Layout of a specification file:
       user code                   (directly copied to the output file)
       %%
       JLex directives             (macro (regular) definitions, e.g., digits = [0-9]+, and state names)
       %%
       regular expression rules    (each rule: optional state list, regular expression, action)
     • User code at top (from parser-generator) specifies what tokens are
     • States can be mixed with regular expressions
     • For each regular expression we can define a set of states where it is valid (JLex, Flex)
     • Standard format of regular expression rule:
       <optional_state_list> regular_expression { actions }

  8. JLex, FLex, Lex
     Regular expression rules:
       r_1 { action_1 }
       r_2 { action_2 }
       . . .
       r_n { action_n }
     The generated scanner is Java code for JLex, C code for FLex and Lex.
     [Diagram: a new start state s0 has ε-transitions to the automata A_r_1, A_r_2, ..., A_r_n, one automaton per regular expression; each automaton has an error state and leads to a new final state]
     Rules used by scanner generators
     1) Continue scanning the input until reaching an error state
     2) Accept the longest prefix that matches a regular expression and execute the corresponding action
     3) If two patterns match the longest prefix, then the action which is specified earlier will be executed
     4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token
     For faster scanning, convert this NFA to a DFA and minimize the states
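The longest-prefix and rule-order conventions can be imitated in a few lines of Java. This is only a sketch under stated assumptions: it uses `java.util.regex` (a backtracking matcher, not the DFA a generator such as JLex would build), the class name `MaximalMunch` and the `ERROR(...)` entries are hypothetical, and the patterns are simple enough that each greedy match is also that rule's longest match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaximalMunch {
    // Rule order matters: "if" is listed before Id, so it wins ties (rule 3).
    static final String[] NAMES = {"WS", "COMMENT", "IF", "ID", "NUM"};
    static final Pattern[] RULES = {
        Pattern.compile("[ \\t\\f\\x08\\r\\n]"),  // white space (\x08 is backspace)
        Pattern.compile("//.*"),                   // one-line comment
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
    };

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestRule = -1;
            int bestEnd = pos;
            for (int r = 0; r < RULES.length; r++) {
                Matcher m = RULES[r].matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when two rules match the same prefix.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = r;
                    bestEnd = m.end();
                }
            }
            if (bestRule < 0) {
                // Assumption: report the offending character and skip it.
                out.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
                continue;
            }
            String lexeme = input.substring(pos, bestEnd);
            if (bestRule >= 2) {  // white space and comments are discarded
                out.add(bestRule == 2 ? "IF" : NAMES[bestRule] + ", " + lexeme);
            }
            pos = bestEnd;        // rule 4: resume right after the accepted prefix
        }
        return out;
    }
}
```

For the input `if ifx 15`, the first word matches both `"if"` and Id with the same length, so the earlier rule yields `IF`; the second word is matched further by Id, so the longest-prefix rule yields `ID, ifx`.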

  9. A Simple Example
     Recognize the following tokens:
       Id = [a-z][a-z0-9]*
       Num = [0-9]+
       if = "if"
     Also take care of one line comments and white space:
       WhiteSpace = [\ \t\f\b\r\n]
       Comment = \/\/.*

  10. /* User code */
      import java.io.*; // For FileInputStream and its exceptions.
      /* ========================================== */
      /* Integer codes for the token categories. */
      class Type {
        static final int IF = 0;
        static final int ID = 1;
        static final int NUM = 2;
        static final int EOF = 3;
      };
      /* A token: its category plus, for ID and NUM, its lexeme. */
      class Token {
        public int type;
        public String attribute;
        public Token(int t) { type = t; }
        public Token(int t, String s) { type = t; attribute = s; }
        public static String spellingOf(int t) {
          switch (t) {
            case Type.IF  : return "IF";
            case Type.ID  : return "ID";
            case Type.NUM : return "NUM";
            default       : return "Undefined token type";
          }
        }
        public String toString() {
          switch (type) {
            case Type.ID  :
            case Type.NUM : return spellingOf(type) + ", " + attribute;
            default       : return spellingOf(type);
          }
        }
      };

  11. /* ================================================= */
      class Example {
        public static void main(String[] args) throws FileNotFoundException, IOException {
          FileInputStream fis = new FileInputStream(args[0]);
          Lexer L = new Lexer(fis);
          Token T = L.next();
          while (T.type != Type.EOF) {
            System.out.println(T);
            T = L.next();
          }
        }
      }
      /* ================================================ */
      %%
      /* JLex directives */
      %class Lexer
      %function next
      %type Token
      %eofval{
        return new Token(Type.EOF);
      %eofval}
      /* white space */
      WhiteSpace = [\ \t\f\b\r\n]
      /* comments */
      Comment = \/\/.*
      Id = [a-z][a-z0-9]*
      Num = [0-9]+
      %%
      {WhiteSpace} {}
      {Comment} {}
      "if" { return new Token(Type.IF); }
      {Id} { return new Token(Type.ID, yytext()); }
      {Num} { return new Token(Type.NUM, yytext()); }

  12. If the above JLex specification is in a file simple.jlx, you can generate a scanner for that specification as follows:
      % cd <directory for simple.jlx>
      % setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
      % java JLex.Main simple.jlx
      % javac simple.jlx.java
      % java Example input1

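  13. Two sample runs (line breaks in the inputs beyond the end of the comment are not preserved in the transcript).
      Input:
        if i1 // this is a comment
        if var15 15 1 2 4253
      Output:
        IF
        ID, i1
        IF
        ID, var15
        NUM, 15
        NUM, 1
        NUM, 2
        NUM, 4253
      Input:
        if i1 // this is a comment
        if var15 15 1 , 2 , 3
      Output:
        IF
        ID, i1
        IF
        ID, var15
        NUM, 15
        NUM, 1
        Undefined token type
        NUM, 2
        Undefined token type
        NUM, 3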

  14. Building Faster Scanners from the DFA
      Table-driven recognizers waste a lot of effort
      • Read (& classify) the next character
      • Find the next state
      • Assign to the state variable
      • Branch back to the top
      We can do better
      • Encode state & actions in the code
      • Do transition tests locally
      • Generate ugly, spaghetti-like code (it is OK, this is automatically generated code)
      • Takes (many) fewer operations per input character
      Table-driven recognizer skeleton:
        state = s0;
        string = ε;
        char = get_next_char();
        while (char != eof) {
          state = δ(state, char);
          string = string + char;
          char = get_next_char();
        }
        if (state in Final) then report acceptance;
        else report failure;
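The table-driven loop can be written out in Java as follows. This is a minimal sketch, not generator output: the class name, the three character classes, and the choice of the register pattern r Digit Digit* as the language are assumptions made for illustration.

```java
class TableDrivenRecognizer {
    static final int S0 = 0, S1 = 1, S2 = 2, SE = 3;

    // Transition table: rows are states, columns are character classes
    // (0 = 'r', 1 = digit, 2 = anything else).
    static final int[][] DELTA = {
        /* s0 */ {S1, SE, SE},
        /* s1 */ {SE, S2, SE},
        /* s2 */ {SE, S2, SE},
        /* se */ {SE, SE, SE},
    };
    static final boolean[] FINAL = {false, false, true, false};

    /** Maps a character to its column in DELTA. */
    static int classify(char c) {
        if (c == 'r') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    static boolean accepts(String input) {
        int state = S0;
        for (int i = 0; i < input.length(); i++) {
            // Per character: classify, look up the next state, assign, loop.
            state = DELTA[state][classify(input.charAt(i))];
        }
        return FINAL[state];
    }
}
```

Every character costs a classification, a two-dimensional array lookup, an assignment, and a loop branch, which is the per-character overhead the slide lists.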

  15. Building Faster Scanners from the DFA
      A direct-coded recognizer for the Register regular expression r Digit Digit*
      • Many fewer operations per character
      • State is encoded as the location in the code
        goto s0;
        s0: string ← ε;
            char ← get_next_char();
            if (char = 'r') then goto s1;
            else goto se;
        s1: string ← string + char;
            char ← get_next_char();
            if ('0' ≤ char ≤ '9') then goto s2;
            else goto se;
        s2: string ← string + char;
            char ← get_next_char();
            if ('0' ≤ char ≤ '9') then goto s2;
            else if (char = eof) then report acceptance;
            else goto se;
        se: print error message;
            return failure;
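Java has no goto, so a direct-coded recognizer has to be approximated with structured control flow. The sketch below (class and method names are hypothetical) keeps the key property: there is no state variable and no table, and the current state is simply the position in the code.

```java
class DirectCodedRecognizer {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Recognizes r Digit Digit*; an early 'return false' plays the role of state se. */
    static boolean accepts(String in) {
        int i = 0;

        // s0: the first character must be 'r'.
        if (i >= in.length() || in.charAt(i) != 'r') return false;
        i++;

        // s1: at least one digit must follow.
        if (i >= in.length() || !isDigit(in.charAt(i))) return false;
        i++;

        // s2: consume further digits; end of input here means acceptance.
        while (i < in.length()) {
            if (!isDigit(in.charAt(i))) return false;
            i++;
        }
        return true;
    }
}
```

Each character is handled by one or two local comparisons, with no classification step or table lookup.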

  16. Building Faster Scanners
      Hashing keywords versus encoding them directly
      • Some compilers recognize keywords as identifiers and check them in a hash table
      • Encoding it in the DFA is a better idea
        - O(1) cost per transition
        - Avoids hash lookup on each identifier
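The two keyword strategies can be put side by side in Java. This is a sketch with hypothetical names; it assumes the lexeme has already been matched by the identifier pattern [a-z][a-z0-9]* and that "if" is the only keyword.

```java
import java.util.HashMap;
import java.util.Map;

class KeywordDemo {
    static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", "IF");
    }

    /** Strategy 1: scan as an identifier, then consult a hash table. */
    static String classifyWithHash(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, "ID");
    }

    /** Strategy 2: extra DFA states track progress through the keyword while scanning. */
    static String classifyWithDfa(String lexeme) {
        int state = 0;  // 0 = start, 1 = saw "i", 2 = saw "if", 3 = ordinary identifier
        for (int i = 0; i < lexeme.length(); i++) {
            char c = lexeme.charAt(i);
            switch (state) {
                case 0:  state = (c == 'i') ? 1 : 3; break;
                case 1:  state = (c == 'f') ? 2 : 3; break;
                default: state = 3;  // anything after "if" makes it an identifier
            }
        }
        return state == 2 ? "IF" : "ID";
    }
}
```

Both methods agree on every lexeme; the second decides during the transitions the scanner performs anyway, so no separate lookup runs for each identifier.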

  17. What is hard about Lexical Analysis?
      Poor language design can complicate scanning
      • Reserved words are important
        - In PL/I there are no reserved keywords, so you can write a valid statement like:
          if then then then = else; else else = then
      • Significant blanks
        - In Fortran blanks are not significant
          do 10 i = 1,25    (do loop)
          do 10 i = 1.25    (assignment to variable named do10i)
      • Closures
        - Limited identifier length adds states to the automata to count length

  18. What can be so hard? (Fortran 66/77)
      [Fortran listing not preserved in the transcript; only its annotations survive:]
      • integer function A
      • Macro definitions: first A and B are converted to (6-2)
      • This statement declares that variables that begin with A and B are of data-type four character string
      • )=(3 is a literal constant
      • statement for formatting input, output
      • assigns value to variable DO9E1
      • assigns value to array element
      • one statement split into two lines
      How does a compiler do this?
      • First pass finds & inserts blanks
      • Can add extra words or tags to create a scannable language
      • Second pass is normal scanner
