1 / 26

Lexical Analysis

Lexical Analysis. Mooly Sagiv msagiv@post.tau.ac.il Schrierber 317 03-640-7606 Wed 10:00-12:00 html://www.math.tau.ac.il/~msagiv/courses/wcc.html Textbook:Modern Compiler Implementation in C Chapter 2. A motivating example.

joie
Download Presentation

Lexical Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lexical Analysis Mooly Sagiv msagiv@post.tau.ac.il Schrierber 317 03-640-7606 Wed 10:00-12:00 html://www.math.tau.ac.il/~msagiv/courses/wcc.html Textbook:Modern Compiler Implementation in C Chapter 2

  2. A motivating example • Create a program that counts the number of lines in a given input file

  3. A motivating examplesolution int num_lines = 0; %% \n ++num_lines; . ; %% main() { yylex(); printf( "# of lines = %d\n", num_lines); }

  4. Subjects • Roles of lexical analysis • The straightforward solution a manual scanner for C • Regular Expressions • Finite automata • From regular languages into finite automata • Flex

  5. Source program (string) Basic Compiler Phases Finite automata lexical analysis Tokens Pushdown automata syntax analysis Abstract syntax tree semantic analysis Memory organization Translate Intermediate representation Instruction selection Dynamic programming Assembly Register Allocation graph algorithms Fin. Assembly

  6. Example • Input string • Tokens a\b := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b) id (“a”) assignnum (5) + num(3) ; id(“b”) assign print(id(“a”) , id(“a”) - num(1)), num(10) * id(“a”)) ; print(id(“b”))

  7. Lexical Analysis (Scanning) • Functionality • input • program text (file) • output • sequence of tokens • Read input file • Identify language keywords and standard identifiers • Handle include files and macros • Count line numbers • Remove whitespaces • Report illegal symbols • Produce symbol table

  8. A simplified scanner for C Token nextToken() { char c ; loop: c = getchar(); switch (c){ case ` `:goto loop ; case `;`: return SemiColumn; case `+`: c = getchar() ; switch (c) { case `+': return PlusPlus ; case '=’ return PlusEqual; default: putchar(c); return Plus; } case `<`: case `w`: }

  9. Automatic Generation of Lexical Analysis • The matching of input strings can be performed by a finite automaton • Examples: • An automaton for while • An automaton for C identifier • An automaton for C comment • The program for the automaton is automatically generated from regular expressions

  10. regular expressions tokens scanner input program Flex • Input • regular expressions and actions (C code) • Output • A scanner program that reads the input and applies actions when input regular expression is matched flex

  11. Regular Expression Notations a An ordinary character stands for itself M|NM or N MN M followed by N M* Zero or more times of M M+ One or more times of M M? Zero or one occurrence of M [a-zA-Z] Character set alternation (single character) . Any (single) character but newline “a.+” Quotation \ Convert an operator into text

  12. Ambiguity Resolving • Find the longest matching token • Between two tokens with the same length use the one declared first

  13. A Flex specification of C Scanner Letter [a-zA-Z_] Digit [0-9] %% [ \t] {;} [\n] {line_count++;} “;” { return SemiColumn;} “++” { return PlusPlus ;} “+=“ { return PlusEqual ;} “+” { return Plus} “while” { return While ; } {Letter}({Letter}|{Digit})* { return Id ;} “<=” { return LessOrEqual;} “<” { return LessThen ;}

  14. Running Example if { return IF; } [a-z][a-z0-9]* { return ID; } [0-9]+ { return NUM; } [0-9]”.”[0-9]*|[0-9]*”.”[0-9]+ { return REAL; } (\-\-[a-z]*\n)|(“ “|\n|\t) { ; } . { error(); }

  15. int edges[][256] ={ /* …, 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */ /* state 0 */ {0, ..., 0, 0, …, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0} /* state 1 */ {13, ..., 7, 7, 7, 7, …, 9, 4, 4, 4, 4, 2, 4, ..., 13, 13} /* state 2 */ {0, …, 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0} /* state 3 */ {0, …, 4, 4, 4, 4, …, 0, 4, 4, 4, 4, 4, 4, , 0, 0} /* state 4 */ {0, …, 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0} /* state 5 */ {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0} /* state 6 */ {0, …, 6, 6, 6, 6, …, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0} /* state 7 */ ... /* state 13 */ {0, …, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, 0, 0, …, 0, 0}

  16. Token nextToken() { lastFinal = 0; currentState = 1 ; inputPositionAtLastFinal = input; currentPosition = input; while (not(isDead(currentState))) { nextState = edges[currentState][currentPosition]; if (isFinal(nextState)) { lastFinal = nextState ; inputPositionAtLastFinal = currentPosition; } currentState = nextState; advance currentPosition; } input = inputPositionAtLastFinal ; return action[lastFinal]; } Pseudo Code for Scanner

  17. Example Input: “if --not-a-com”

  18. Efficient Scanners • Efficient state representation • Input buffering • Using switch and goto instead of tables

  19. Constructing Automaton from Specification • Create a non-deterministic automaton (NDFA) from every regular expression • Merge all the automata using epsilon moves(like the | construction) • Construct a deterministic finite automaton (DFA) • Minimize the automaton starting with separate accepting states

  20. NDFA Construction if { return IF; } [a-z][a-z0-9]* { return ID; } [0-9]+ { return NUM; } [0-9]”.”[0-9]*|[0-9]*”.”[0-9]+ { return REAL; } (\-\-[a-z]*\n)|(“ “|\n|\t) { ; } . { error(); }

  21. DFA Construction

  22. Minimization

  23. %{ /* C declarations */ #include “tokens.h'' /* Mapping of tokens into integers */ #include “errormsg.h'' /* Shared by all the phases */ union {int ival; string sval; double fval;} yylval; int charPos=1 ; #define ADJ (EM_tokPos=charPos, charPos+=yyleng) %} /* Lex Definitions */ digits [0-9]+ %% if { ADJ; return IF;} [a-z][a-z0-9] { ADJ; yylval.sval=String(yytext); return ID; } {digits} {ADJ; yylval.ival=atoi(yytext); return NUM; } ({digits}\.{digits}?)|({digits}?\.{digits}) { ADJ; yylval.fval=atof(yytext); return REAL; } (\-\-[a-z]*\n)|([\n\t]|" ")* { ADJ; } . { ADJ; EM_error(“illegal character''); }

  24. Start States • Regular expressions may be more complicated than automata • C comments • Solutions • Conversion of automata into regular expressions • Start States % start s1 s2 %% < INITIAL>r1 { action0 ; BEGIN s_1; } <s1>r1 { action1 ; BEGIN s2; } <s2>r2 { action2 ; BEGIN INITIAL};

  25. Realistic Example % start Comment %% <INITIAL>”/*'' { BEGIN Comment; } <INITIAL>r1 { Usual actions; } <INITIAL>r2 { Usual actions; } ... <INITIAL>rk { Usual actions; } <Comment>”*/”’ { BEGIN Initial; } <Comment>.|\n ;

  26. Summary • For most programming languages lexical analyzers can be easily constructed • Exceptions: • Fortran • PL/1 • Flex is a useful tool beyond compilers

More Related