Lexical Analysis
Mooly Sagiv
msagiv@post.tau.ac.il
Schrierber 317
03-640-7606
Wed 10:00-12:00
html://www.math.tau.ac.il/~msagiv/courses/wcc.html
Textbook: Modern Compiler Implementation in C, Chapter 2
A motivating example
• Create a program that counts the number of lines in a given input file
A motivating example: solution
%{
#include <stdio.h>
int num_lines = 0;   /* incremented once per newline */
%}
%option noyywrap
%%
\n      ++num_lines;
.       ;
%%
int main(void) {
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}
Subjects
• Roles of lexical analysis
• The straightforward solution: a manual scanner for C
• Regular Expressions
• Finite automata
• From regular languages into finite automata
• Flex
Basic Compiler Phases
[Figure: the compiler pipeline, with the technique underlying each phase in parentheses]
Source program (string)
→ lexical analysis (finite automata) → Tokens
→ syntax analysis (pushdown automata) → Abstract syntax tree
→ semantic analysis
→ Translate (memory organization) → Intermediate representation
→ Instruction selection (dynamic programming) → Assembly
→ Register Allocation (graph algorithms) → Fin. Assembly
Example
• Input string
  a\b := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)
• Tokens
  id("a") assign num(5) + num(3) ;
  id("b") assign ( print ( id("a") , id("a") - num(1) ) , num(10) * id("a") ) ;
  print ( id("b") )
Lexical Analysis (Scanning)
• Functionality
  • input
    • program text (file)
  • output
    • sequence of tokens
• Read input file
• Identify language keywords and standard identifiers
• Handle include files and macros
• Count line numbers
• Remove whitespace
• Report illegal symbols
• Produce symbol table
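A simplified scanner for C
Token nextToken(void) {
    int c;                              /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;                /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();                  /* one character of lookahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin);     /* push the lookahead back onto the input */
                  return Plus;
        }
    case '<': /* ... */
    case 'w': /* ... */
    }
}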
Automatic Generation of Lexical Analysis
• The matching of input strings can be performed by a finite automaton
• Examples:
  • An automaton for while
  • An automaton for C identifiers
  • An automaton for C comments
• The program for the automaton is automatically generated from regular expressions
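As an illustration of the second example, an automaton for C identifiers can be written by hand in a few lines. This is only a sketch: the state names and the functions id_step and match_identifier are made up for this example, and a generated scanner would normally use a table instead.

```c
#include <ctype.h>
#include <stddef.h>

/* States of a hand-written automaton for C identifiers (names are illustrative). */
enum { ID_START, ID_BODY, ID_DEAD };

/* One transition: [A-Za-z_] starts an identifier, [A-Za-z0-9_] continues it,
   and any other character leads to the dead state. */
static int id_step(int state, unsigned char c) {
    switch (state) {
    case ID_START: return (isalpha(c) || c == '_') ? ID_BODY : ID_DEAD;
    case ID_BODY:  return (isalnum(c) || c == '_') ? ID_BODY : ID_DEAD;
    default:       return ID_DEAD;
    }
}

/* Returns the length of the longest identifier that is a prefix of s,
   or 0 if s does not start with an identifier. */
size_t match_identifier(const char *s) {
    int state = ID_START;
    size_t len = 0;
    while (s[len] != '\0') {
        state = id_step(state, (unsigned char)s[len]);
        if (state == ID_DEAD) break;
        len++;
    }
    return len;
}
```

Every state other than ID_START and ID_DEAD is accepting here, so the number of characters consumed before the dead state is the length of the match.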
Flex
• Input
  • regular expressions and actions (C code)
• Output
  • A scanner program that reads the input and applies actions when an input regular expression is matched
[Figure: regular expressions → flex → scanner; input program → scanner → tokens]
Regular Expression Notations
a          An ordinary character stands for itself
M|N        M or N
MN         M followed by N
M*         Zero or more occurrences of M
M+         One or more occurrences of M
M?         Zero or one occurrence of M
[a-zA-Z]   Character set alternation (single character)
.          Any (single) character but newline
"a.+"      Quotation
\          Convert an operator into text
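Most of these operators can be tried out without Flex by using POSIX extended regular expressions from <regex.h>. The sketch below assumes a POSIX system. The helper full_match is a name invented for this example, and the 256-byte pattern buffer is an arbitrary limit. Flex-style "..." quotation is not part of POSIX syntax, and POSIX "." also matches a newline unless REG_NEWLINE is used, so only the shared operators are exercised.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole string s matches the POSIX extended regular
   expression pattern, 0 if it does not, and -1 if the pattern is too long
   or does not compile. */
int full_match(const char *pattern, const char *s) {
    char anchored[256];
    regex_t re;
    int n, ok;

    /* Anchor the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For instance, full_match("ab*", "abbb") succeeds while full_match("ab+", "a") does not, and the backslash form full_match("a\\.", "a.") treats the dot as ordinary text.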
Ambiguity Resolving
• Find the longest matching token
• Between two tokens with the same length, use the one declared first
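The two rules can be sketched in C as a loop over the rules in declaration order that only replaces the current choice when a strictly longer match is found. The Rule type, the three hand-written matchers and pick_token are all hypothetical names for this illustration; Flex itself works on a combined automaton rather than trying each rule separately.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule pairs a token name with a matcher that returns the length of the
   prefix of s it accepts (0 if none). All names here are illustrative. */
typedef struct {
    const char *token;
    size_t (*match)(const char *s);
} Rule;

static size_t match_if(const char *s) {      /* the keyword if */
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s) {      /* [a-z][a-z0-9]* */
    size_t n = 0;
    if (!islower((unsigned char)s[0])) return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n])) n++;
    return n;
}

static size_t match_num(const char *s) {     /* [0-9]+ */
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Rules in declaration order: earlier rules win ties. */
static const Rule rules[] = {
    { "IF", match_if }, { "ID", match_id }, { "NUM", match_num },
};

/* Picks the rule with the longest match at s; on equal length the rule
   declared first is kept. Returns NULL if no rule matches. */
const char *pick_token(const char *s, size_t *len) {
    const char *best = NULL;
    size_t i;
    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {          /* strictly longer only, so ties keep the earlier rule */
            *len = n;
            best = rules[i].token;
        }
    }
    return best;
}
```

On the input "if (x)" both IF and ID match two characters, so the earlier rule IF is chosen; on "iffy" the ID rule matches four characters and wins on length.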
A Flex specification of C Scanner
Letter [a-zA-Z_]
Digit [0-9]
%%
[ \t]     {;}
[\n]      {line_count++;}
";"       { return SemiColumn; }
"++"      { return PlusPlus; }
"+="      { return PlusEqual; }
"+"       { return Plus; }
"while"   { return While; }
{Letter}({Letter}|{Digit})*   { return Id; }
"<="      { return LessOrEqual; }
"<"       { return LessThen; }
Running Example
if                                  { return IF; }
[a-z][a-z0-9]*                      { return ID; }
[0-9]+                              { return NUM; }
[0-9]+"."[0-9]*|[0-9]*"."[0-9]+     { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
.                                   { error(); }
int edges[][256] = {
                /* ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
/* state 0 */  {0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0},
/* state 1 */  {13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13},
/* state 2 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0},
/* state 3 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
/* state 4 */  {0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0},
/* state 5 */  {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 6 */  {0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0},
/* state 7 */  ...
/* state 13 */ {0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0}
};
Pseudo Code for Scanner
Token nextToken() {
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][character at currentPosition];
        advance currentPosition;
        if (isFinal(nextState)) {
            lastFinal = nextState;
            /* just past the longest match seen so far */
            inputPositionAtLastFinal = currentPosition;
        }
        currentState = nextState;
    }
    input = inputPositionAtLastFinal;
    return action[lastFinal];
}
Example Input: “if --not-a-com”
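A minimal C sketch of the table-driven scanner on this input follows. It makes several assumptions: the state numbering is modelled on the transition table slide (0 = dead, 1 = start, 13 = illegal character), the REAL rule is read as [0-9]+"."[0-9]*|[0-9]*"."[0-9]+, whitespace is matched one character at a time, and the names TokenKind, build_edges and next_token are invented for this example. The input is a NUL-terminated string in an ASCII-compatible character set rather than a file.

```c
#include <stddef.h>
#include <string.h>

enum { NSTATES = 14 };                 /* state 0 is the dead state */
typedef enum { T_NONE, T_IF, T_ID, T_NUM, T_REAL, T_SKIP, T_ERROR } TokenKind;

static unsigned char edges[NSTATES][256];

/* Token reported when the scan stops with this state as the last final
   state; T_NONE marks the non-final states 0, 1 and 10. */
static const TokenKind action[NSTATES] = {
    T_NONE, T_NONE, T_ID, T_IF, T_ID, T_ERROR, T_REAL, T_NUM,
    T_REAL, T_ERROR, T_NONE, T_SKIP, T_SKIP, T_ERROR
};

static void set_range(int from, int lo, int hi, int to) {
    int c;
    for (c = lo; c <= hi; c++) edges[from][c] = (unsigned char)to;
}

/* Fills the transition table for the running example. */
void build_edges(void) {
    int s;
    memset(edges, 0, sizeof edges);
    set_range(1, 1, 255, 13);          /* default: any other character is an error */
    set_range(1, 'a', 'z', 4);  edges[1]['i'] = 2;
    set_range(1, '0', '9', 7);
    edges[1]['.'] = 5;  edges[1]['-'] = 9;
    edges[1][' '] = edges[1]['\n'] = edges[1]['\t'] = 12;
    for (s = 2; s <= 4; s++) {         /* identifier states */
        set_range(s, 'a', 'z', 4);
        set_range(s, '0', '9', 4);
    }
    edges[2]['f'] = 3;                 /* "if" */
    set_range(5, '0', '9', 6);  set_range(6, '0', '9', 6);   /* . followed by digits */
    set_range(7, '0', '9', 7);  edges[7]['.'] = 8;            /* digits, then a dot */
    set_range(8, '0', '9', 8);
    edges[9]['-'] = 10;                /* "--" starts a comment */
    set_range(10, 'a', 'z', 10);  edges[10]['\n'] = 11;
}

/* Scans one token starting at *input with the longest-match loop,
   advances *input past it and stores its length in *len.
   Returns T_NONE with length 0 at the end of the input. */
TokenKind next_token(const char **input, size_t *len) {
    const char *pos = *input, *last_pos = *input;
    int state = 1, last_final = 0;
    while (state != 0 && *pos != '\0') {
        state = edges[state][(unsigned char)*pos];
        pos++;
        if (action[state] != T_NONE) { /* entered a final state: remember it */
            last_final = state;
            last_pos = pos;
        }
    }
    *len = (size_t)(last_pos - *input);
    *input = last_pos;
    return action[last_final];
}
```

On "if --not-a-com" the scanner reads past "--not" looking for the newline that would end a comment, hits the dead state at the next "-", and falls back to the last final state, which covers only the first "-". Under these assumptions the token sequence is IF, whitespace, error, error, ID(not), error, ID(a), error, ID(com).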
Efficient Scanners
• Efficient state representation
• Input buffering
• Using switch and goto instead of tables
Constructing Automaton from Specification
• Create a non-deterministic automaton (NDFA) from every regular expression
• Merge all the automata using epsilon moves (like the | construction)
• Construct a deterministic finite automaton (DFA)
• Minimize the automaton, starting with separate accepting states
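The third step, turning the merged NDFA into a DFA, is usually done with the subset construction. The sketch below shows one way to write it in C under simplifying assumptions: at most 32 NDFA states stored as bits of an unsigned int (assumed to be at least 32 bits wide), a two-symbol alphabet, and NDFA state 0 as the start state. The types Nfa and Dfa and the functions closure, step_set and nfa_to_dfa are names chosen for this example. Accepting states and minimization are left out.

```c
#include <stddef.h>

enum { MAX_NFA = 32, NSYM = 2, MAX_DFA = 64 };   /* symbols: 0 = 'a', 1 = 'b' */

typedef struct {
    int nstates;
    unsigned eps[MAX_NFA];            /* eps[s]: states reachable from s by one epsilon move */
    unsigned delta[MAX_NFA][NSYM];    /* delta[s][c]: successors of s on symbol c */
} Nfa;

typedef struct {
    int nstates;
    unsigned set[MAX_DFA];            /* set of NDFA states behind each DFA state */
    int next[MAX_DFA][NSYM];          /* -1 means no transition (dead) */
} Dfa;

/* Epsilon closure of a set of NDFA states, iterated to a fixed point. */
static unsigned closure(const Nfa *n, unsigned set) {
    unsigned prev;
    int s;
    do {
        prev = set;
        for (s = 0; s < n->nstates; s++)
            if (set & (1u << s)) set |= n->eps[s];
    } while (set != prev);
    return set;
}

/* Union of the successors of every state in set on symbol c. */
static unsigned step_set(const Nfa *n, unsigned set, int c) {
    unsigned out = 0;
    int s;
    for (s = 0; s < n->nstates; s++)
        if (set & (1u << s)) out |= n->delta[s][c];
    return out;
}

/* Subset construction: DFA state 0 is the closure of NDFA state 0.
   Returns the number of DFA states, or -1 if MAX_DFA is exceeded. */
int nfa_to_dfa(const Nfa *n, Dfa *d) {
    int i, j, c;
    d->nstates = 1;
    d->set[0] = closure(n, 1u);
    for (i = 0; i < d->nstates; i++) {            /* d->nstates grows as new sets appear */
        for (c = 0; c < NSYM; c++) {
            unsigned t = closure(n, step_set(n, d->set[i], c));
            d->next[i][c] = -1;
            if (t == 0) continue;                 /* empty set: no transition */
            for (j = 0; j < d->nstates && d->set[j] != t; j++)
                ;                                 /* look for an existing DFA state */
            if (j == d->nstates) {
                if (j == MAX_DFA) return -1;
                d->set[d->nstates++] = t;
            }
            d->next[i][c] = j;
        }
    }
    return d->nstates;
}
```

As a usage example, merging an automaton for "ab" (states 1, 2, 3) and one for "a+" (states 4, 5) under a new start state 0 with two epsilon moves gives a six-state NDFA, and the construction yields four DFA states: {0,1,4}, {2,5}, {5} and {3}.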
NDFA Construction
if                                  { return IF; }
[a-z][a-z0-9]*                      { return ID; }
[0-9]+                              { return NUM; }
[0-9]+"."[0-9]*|[0-9]*"."[0-9]+     { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)          { ; }
.                                   { error(); }
%{
/* C declarations */
#include "tokens.h"    /* Mapping of tokens into integers */
#include "errormsg.h"  /* Shared by all the phases */
union {int ival; string sval; double fval;} yylval;
int charPos=1;
#define ADJ (EM_tokPos=charPos, charPos+=yyleng)
%}
/* Lex Definitions */
digits [0-9]+
%%
if                      { ADJ; return IF; }
[a-z][a-z0-9]*          { ADJ; yylval.sval=String(yytext); return ID; }
{digits}                { ADJ; yylval.ival=atoi(yytext); return NUM; }
({digits}\.{digits}?)|({digits}?\.{digits})   { ADJ; yylval.fval=atof(yytext); return REAL; }
(\-\-[a-z]*\n)|([\n\t]|" ")+   { ADJ; }
.                       { ADJ; EM_error("illegal character"); }
Start States
• Regular expressions may be more complicated than automata
  • C comments
• Solutions
  • Conversion of automata into regular expressions
  • Start States
%start s1 s2
%%
<INITIAL>r1 { action0 ; BEGIN s1; }
<s1>r1 { action1 ; BEGIN s2; }
<s2>r2 { action2 ; BEGIN INITIAL; }
Realistic Example
%start Comment
%%
<INITIAL>"/*" { BEGIN Comment; }
<INITIAL>r1 { Usual actions; }
<INITIAL>r2 { Usual actions; }
...
<INITIAL>rk { Usual actions; }
<Comment>"*/" { BEGIN INITIAL; }
<Comment>.|\n ;
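What the start states do at run time can be imitated in plain C with an explicit state variable. The sketch below is not what Flex generates. It is a hand-written stand-in, with strip_comments, INITIAL_STATE and COMMENT_STATE as invented names, in which the usual rules r1..rk are replaced by simply copying the character. String literals that contain comment delimiters are not handled.

```c
#include <stddef.h>

enum { INITIAL_STATE, COMMENT_STATE };

/* Copies src to dst without C comments, tracking the start state by hand.
   dst must have room for strlen(src) + 1 bytes. Returns the state at the
   end of the input: COMMENT_STATE means an unterminated comment. */
int strip_comments(const char *src, char *dst) {
    int state = INITIAL_STATE;
    size_t i = 0, o = 0;
    while (src[i] != '\0') {
        if (state == INITIAL_STATE && src[i] == '/' && src[i + 1] == '*') {
            state = COMMENT_STATE;      /* comment opener: switch to the Comment state */
            i += 2;
        } else if (state == COMMENT_STATE && src[i] == '*' && src[i + 1] == '/') {
            state = INITIAL_STATE;      /* comment closer: switch back to INITIAL */
            i += 2;
        } else if (state == COMMENT_STATE) {
            i++;                        /* inside a comment: discard the character */
        } else {
            dst[o++] = src[i++];        /* stands in for the usual rules */
        }
    }
    dst[o] = '\0';
    return state;
}
```

The same two patterns, the opener and the closer, are only recognized in the state where the corresponding rule is active, which is the effect of the state prefixes in the specification.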
Summary
• For most programming languages lexical analyzers can be easily constructed
• Exceptions:
  • Fortran
  • PL/1
• Flex is a useful tool beyond compilers