SCANNING Chuen-Liang Chen Department of Computer Science and Information Engineering National Taiwan University Taipei, TAIWAN
Scanner (lexical analyzer)
• primary function -- grouping input characters into tokens
• called by -- parser
• return -- 1. token code 2. attribute (optional)
• theoretical bases -- regular expressions, finite automata
• implementation
  • dedicated program (hardwired)
  • table-driven
• construction
  • hand-coded
  • by generator, which limits the effort of building a scanner to specifying which tokens the scanner is to recognize
    • program [lex]
    • table + standard driver program [ScanGen]
Regular expression (1/2)
• being used to
  • specify simple sets of strings (regular sets)
  • specify tokens of a programming language
  • program a scanner generator
• string -- catenation of characters in the vocabulary, denoted V
• regular expression
  • meta-characters: ( ) ' * + ? |
    • have to be quoted when used as ordinary characters
  1. ∅ -- empty set
  2. λ -- set containing only the null string
  3. s -- { string s }
  4. A | B -- alternation of the corresponding regular sets
  5. A B -- catenation of the corresponding regular sets
  6. A* -- Kleene closure of the corresponding regular set
    • repeating zero or more times
Regular expression (2/2)
• other notations
  • A+ = A A*
  • A? = A | λ
  • Not(A) = V - A for a set of characters A
  • Not(S) = V* - S for a set of strings S
    • may be infinite but still regular
  • A^k = A A ... A (k times)
• examples
  • comment: -- anything Eol
    Comment = - - ( Not(Eol) )* Eol
  • fixed decimal literal
    Lit = D+ . D+
  • identifier: begins with a letter, ends with a letter/digit, without consecutive underlines
    ID = L ( L | D )* ( _ ( L | D )+ )*
• being able to represent all finite sets and many, but not all, infinite sets
  • QUIZ: counter example?
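The identifier expression above can be checked by hand with a short routine. The following is a minimal sketch, assuming that L means an ASCII letter and D an ASCII digit; the function name match_id is hypothetical and not part of the slides.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Returns the length of the longest prefix of s that matches
 *     ID = L ( L | D )* ( _ ( L | D )+ )*
 * or 0 if s does not start with an identifier.
 * Assumption: L = isalpha(), D = isdigit() in the "C" locale.
 */
size_t match_id(const char *s)
{
    size_t i = 0;

    if (!isalpha((unsigned char)s[i]))      /* L */
        return 0;
    i++;
    while (isalnum((unsigned char)s[i]))    /* ( L | D )* */
        i++;
    /* ( _ ( L | D )+ )* : accept an underscore only if a letter or
     * digit follows, so the match never ends with an underscore and
     * never contains two underscores in a row. */
    while (s[i] == '_' && isalnum((unsigned char)s[i + 1])) {
        i++;                                /* the underscore */
        while (isalnum((unsigned char)s[i]))
            i++;
    }
    return i;
}
```

For example, match_id("a_b1") covers the whole string, while for "a__b" only the leading "a" is accepted because of the consecutive underlines.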
Finite automata
• being used to recognize the tokens specified by a regular expression
• consisting of
  • a finite set of states
  • a set of transitions labeled with characters in V
  • a start state
  • a set of final states
• transition diagram
• transition table
  • blank: error entry
• deterministic finite automata (DFA)
  • unique transition for a given state and character
  • otherwise, nondeterministic finite automata (NFA)
[Figure: transition diagram with states 1, 2, 3, 4 and edge labels -, -, Not(Eol), Eol]
From RE to NFA
• rules
  • λ
  • vocabulary
  • alternation
  • catenation
  • Kleene closure
[Figure: NFA construction for each rule, built from the NFA for A and the NFA for B joined by λ-transitions]
From NFA to DFA
• major operation: λ-closure
• example
  1. λ-closure( 1 ) = 1, 2
  2. λ-closure( 3, 4, 5 ) = 3, 4, 5
  3. λ-closure( 4, 5 ) = 4, 5
  4. λ-closure( 5 ) = 5
[Figure: NFA with states 1 to 5 and edge labels λ, a, b, a | b, and the step-by-step construction of a DFA whose states are 1,2 / 3,4,5 / 4,5 / 5]
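The λ-closure operation can be sketched as a small worklist loop over bitmasks. This is an illustration only: the representation (one bit per NFA state, at most 32 states) and the names lambda_closure and lambda are assumptions, and the NFA of the slide is not reproduced here.

```c
#include <stdint.h>

#define MAX_STATES 32   /* assumption: states are numbered 0..31 */

/*
 * lambda[s] is the bitmask of states reachable from state s by ONE
 * lambda-transition.  Returns the lambda-closure of `set`: every state
 * reachable from a member of `set` by zero or more lambda-transitions.
 */
uint32_t lambda_closure(uint32_t set, const uint32_t lambda[MAX_STATES])
{
    uint32_t closure = set;
    uint32_t pending = set;     /* states whose lambda-edges are not yet followed */

    while (pending != 0) {
        int s = 0;
        while (!(pending & (UINT32_C(1) << s)))   /* pick the lowest pending state */
            s++;
        pending &= ~(UINT32_C(1) << s);

        uint32_t fresh = lambda[s] & ~closure;    /* states not seen before */
        closure |= fresh;
        pending |= fresh;
    }
    return closure;
}
```

With a hypothetical table in which only state 1 has a λ-edge (to state 2), the closure of { 1 } is { 1, 2 } and the closure of { 3, 4, 5 } is the set itself, which agrees with the closures listed in the example.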
DFA optimization
• major operation: partition states into equivalence classes according to
  • final / non-final states
  • transition functions
• example (successive partitions)
  ( A B C D E )
  ( A B C D ) ( E )
  ( A B C ) ( D ) ( E )
  ( A C ) ( B ) ( D ) ( E )
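Partition refinement can be sketched as repeated relabeling until the number of classes stops growing. The slide does not show the DFA behind its example, so the table below is an assumption: the well-known five-state textbook DFA for (a|b)*abb, chosen because its refinement passes through the same partitions. All names (minimize_dfa, example_delta, example_final) are hypothetical.

```c
#define MAX_STATES 16
#define MAX_SYMS   2

/* Assumed example DFA, states A..E = 0..4, symbols a = 0, b = 1. */
const int example_delta[5][MAX_SYMS] = {
    /* A */ {1, 2}, /* B */ {1, 3}, /* C */ {1, 2}, /* D */ {1, 4}, /* E */ {1, 2}
};
const int example_final[5] = {0, 0, 0, 0, 1};

/* A negative target stands for the error entry of the table. */
static int class_of(const int cls[], int state)
{
    return state < 0 ? -1 : cls[state];
}

/*
 * Fills cls[s] with the equivalence class of each state and returns the
 * number of classes.  Starts from final / non-final and splits classes
 * whose members disagree on the class of some successor.
 */
int minimize_dfa(int n, int nsyms, const int delta[][MAX_SYMS],
                 const int is_final[], int cls[])
{
    int next[MAX_STATES];
    int count = 0, prev_count;

    for (int s = 0; s < n; s++)
        cls[s] = is_final[s] ? 1 : 0;

    do {
        prev_count = count;
        count = 0;
        for (int s = 0; s < n; s++) {
            int t;
            /* look for an earlier state in the same class with the same behavior */
            for (t = 0; t < s; t++) {
                int same = (cls[t] == cls[s]);
                for (int c = 0; c < nsyms && same; c++)
                    same = class_of(cls, delta[s][c]) == class_of(cls, delta[t][c]);
                if (same)
                    break;
            }
            next[s] = (t < s) ? next[t] : count++;
        }
        for (int s = 0; s < n; s++)
            cls[s] = next[s];
    } while (count != prev_count);   /* no class was split: stable */

    return count;
}
```

On the assumed DFA the routine ends with four classes, with A and C sharing one class.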
From DFA to scanner (1/3)
• dedicated program
• example
    if (current_char == '-') {
        current_char = getchar();
        if (current_char == '-') {
            do
                current_char = getchar();
            while (current_char != '\n');
        } else {
            ungetc(current_char, stdin);
            lexical_error(current_char);
        }
    } else
        lexical_error(current_char);
    /* Return or process valid token. */
• ungetc() -- lookahead
[Figure: transition diagram with states 1, 2, 3, 4 and edge labels -, -, Not(Eol), Eol]
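A self-contained variant of the dedicated scanner might look as follows. It is a sketch under assumptions rather than the slide's program: it reads from a FILE * parameter instead of stdin, reports errors through its return value instead of lexical_error(), pushes back the offending character in both error cases, and treats end of file inside a comment as an error so that the loop always terminates. The name scan_comment is hypothetical.

```c
#include <stdio.h>

/*
 * Hardwired recognizer for  Comment = - - ( Not(Eol) )* Eol.
 * Returns 1 if a comment was consumed from `in`, 0 on a lexical error.
 */
int scan_comment(FILE *in)
{
    int c = getc(in);

    if (c != '-') {                 /* state 1: only '-' is allowed */
        if (c != EOF)
            ungetc(c, in);          /* lookahead: leave the character unread */
        return 0;
    }
    c = getc(in);
    if (c != '-') {                 /* state 2: only '-' is allowed */
        if (c != EOF)
            ungetc(c, in);
        return 0;
    }
    do {                            /* state 3: skip Not(Eol) characters */
        c = getc(in);
    } while (c != '\n' && c != EOF);

    return c == '\n';               /* state 4 is reached only on Eol */
}
```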
From DFA to scanner (2/3)
• table-driven
  • transition table + return token code + character save/toss operation + process of valid token
• example
    /*
     * Note: current_char is already set
     * to the current input character.
     */
    state = initial_state;
    while (TRUE) {
        next_state = T[state][current_char];
        if (next_state == ERROR)
            break;
        state = next_state;
        if (current_char == EOF)
            break;
        current_char = getchar();
    }
    if (is_final_state(state)) {
        /* Return or process valid token. */
    } else
        lexical_error(current_char);
• QUIZ: where is “lookahead” ?
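The driver loop can be tried out on the comment DFA with a concrete table. The sketch below makes several assumptions: input comes from a NUL-terminated string rather than getchar(), the table is indexed directly by character code, and there is no backtracking to an earlier final state. The names scan, build_table and the state constants are hypothetical.

```c
#include <string.h>

enum { ERROR = 0, S1 = 1, S2, S3, S4, NSTATES };

static unsigned char T[NSTATES][256];   /* transition table, ERROR by default */

static void build_table(void)
{
    memset(T, ERROR, sizeof T);
    T[S1]['-'] = S2;
    T[S2]['-'] = S3;
    for (int c = 0; c < 256; c++)       /* Not(Eol) keeps the DFA in state 3 */
        T[S3][c] = S3;
    T[S3]['\n'] = S4;                   /* Eol reaches the final state */
}

/*
 * Runs the DFA on the start of s.  Returns the number of characters in
 * the recognized token, or -1 if the DFA stops in a non-final state.
 */
int scan(const char *s)
{
    int state = S1;
    int i = 0;

    build_table();
    while (s[i] != '\0') {
        int next_state = T[state][(unsigned char)s[i]];
        if (next_state == ERROR)
            break;                      /* s[i] is left unconsumed */
        state = next_state;
        i++;
    }
    return state == S4 ? i : -1;
}
```

Only the table changes when the token definitions change; the loop in scan stays the same, which is the point of the table-driven approach.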
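From DFA to scanner (3/3)
• toss operation
• example -- ( " ( Not(") | " " )* " )
  • input """Hi""" is saved as "Hi"
• QUIZ: how to program?
[Figure: transition diagram for the quoted string with edge labels Not("), T("), T("), "; T( ) marks a character that is tossed]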
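One possible way to program the save/toss distinction is sketched below, assuming the input is a NUL-terminated string and the caller supplies an output buffer at least as large as the input. The name scan_string is hypothetical, and this is one reading of the toss operation, not necessarily the answer the quiz has in mind.

```c
#include <stddef.h>

/*
 * Recognizes  " ( Not(") | " " )* "  at the start of s.
 * Saved characters are copied to out (NUL-terminated); the opening quote,
 * the closing quote and one quote of each doubled pair are tossed.
 * Returns the number of input characters consumed, or 0 if s does not
 * start with a complete string literal (out may then hold partial data).
 */
size_t scan_string(const char *s, char *out)
{
    size_t i = 0, n = 0;

    if (s[i] != '"')
        return 0;
    i++;                                /* toss the opening quote */
    for (;;) {
        if (s[i] == '\0')
            return 0;                   /* input ended inside the literal */
        if (s[i] != '"') {
            out[n++] = s[i++];          /* save an ordinary character */
        } else if (s[i + 1] == '"') {
            out[n++] = '"';             /* save one quote ... */
            i += 2;                     /* ... and toss the other */
        } else {
            i++;                        /* toss the closing quote */
            break;
        }
    }
    out[n] = '\0';
    return i;
}
```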
Reserved words
• identifiers reserved for particular usage
• approach 1
  • one regular expression per reserved word
• approach 2
  • exceptions to ordinary identifiers
  • approach used in our simple example
• QUIZ: comparison?
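Approach 2 can be sketched as a table lookup applied after an ordinary identifier has been recognized. The token codes below are borrowed from the Lex example in this deck (1 for identifiers, 4 to 7 for begin, end, read, write), and the comparison ignores case because that example accepts either case; the function and table names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

enum { TOK_ID = 1, TOK_BEGIN = 4, TOK_END = 5, TOK_READ = 6, TOK_WRITE = 7 };

/* The exception list: identifiers that are really reserved words. */
static const struct { const char *text; int code; } reserved[] = {
    { "begin", TOK_BEGIN }, { "end", TOK_END },
    { "read",  TOK_READ  }, { "write", TOK_WRITE },
};

static int equal_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;                    /* both strings must end together */
}

/* Returns the reserved-word code for lexeme, or TOK_ID if it is ordinary. */
int classify_identifier(const char *lexeme)
{
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (equal_ignore_case(lexeme, reserved[i].text))
            return reserved[i].code;
    return TOK_ID;
}
```

A linear search is enough for four words; a larger language would more likely use a hash table or a sorted table with binary search.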
Lexical error recovery
• strategies
  • delete the characters read so far
  • delete the first character
• handling of runaway string
  • QUIZ: why need special handling?
  • " ( Not("|Eol) | " " )* "
  • " ( Not("|Eol) | " " )* Eol
    • print out special error message
• handling of runaway comment
  • { Not({|})* }
  • { ( Not({|})* { Not({|})* )+ }
    • warning
  • { Not(})* Eof
    • error
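The three comment patterns can be told apart in a single pass. The following sketch assumes the input is a NUL-terminated string, with the terminating NUL standing in for Eof; the enum and the name scan_brace_comment are hypothetical.

```c
#include <stddef.h>

enum comment_status {
    COMMENT_OK,         /* { Not({|})* }                     */
    COMMENT_WARNING,    /* closed, but an extra { was seen   */
    COMMENT_ERROR,      /* { Not(})* Eof : never closed      */
    NOT_A_COMMENT       /* input does not start with {       */
};

/*
 * Classifies the comment at the start of s and stores in *len the number
 * of characters it covers (0 if s does not start with a comment).
 */
enum comment_status scan_brace_comment(const char *s, size_t *len)
{
    size_t i;
    int saw_open = 0;

    *len = 0;
    if (s[0] != '{')
        return NOT_A_COMMENT;
    for (i = 1; s[i] != '\0'; i++) {
        if (s[i] == '{') {
            saw_open = 1;               /* possible runaway: remember it */
        } else if (s[i] == '}') {
            *len = i + 1;
            return saw_open ? COMMENT_WARNING : COMMENT_OK;
        }
    }
    *len = i;                           /* reached Eof inside the comment */
    return COMMENT_ERROR;
}
```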
Lex (1/2)
• input file --
    E            [Ee]
    OtherLetter  [A-DF-Za-df-z]
    Digit        [0-9]
    Letter       {E}|{OtherLetter}
    IntLit       {Digit}+
    %%
    [ \t\n]+                                   { /* delete */ }
    [Bb][Ee][Gg][Ii][Nn]                       { minor=0; return(4); }
    [Ee][Nn][Dd]                               { minor=0; return(5); }
    [Rr][Ee][Aa][Dd]                           { minor=0; return(6); }
    [Ww][Rr][Ii][Tt][Ee]                       { minor=0; return(7); }
    {Letter}({Letter}|{Digit}|_)*              { minor=0; return(1); }
    {IntLit}                                   { minor=1; return(2); }
    ({IntLit}[.]{IntLit})({E}[+-]?{IntLit})?   { minor=2; return(2); }
    \"([^\"\n]|\"\")*\"                        { stripquotes(); minor=3; return(2); }
    \"([^\"\n]|\"\")*\n                        { stripquotes(); minor=0; return(3); }
    "("                                        { minor=0; return(8); }
    ")"                                        { minor=0; return(9); }
    ";"                                        { minor=0; return(10); }
    ","                                        { minor=0; return(11); }
    ":="                                       { minor=0; return(12); }
    "+"                                        { minor=0; return(13); }
    "-"                                        { minor=0; return(14); }
    %%
• slide callouts
  • class -- to reduce table size
  • precedence
  • regular expression
  • executed when RE is matched
Lex (2/2)
• input file --
    /* Strip unwanted quotes from string in yytext; adjust yyleng. */
    void stripquotes(void)
    {
        int frompos, topos = 0, numquotes = 2;

        for (frompos = 1; frompos < yyleng; frompos++) {
            yytext[topos++] = yytext[frompos];
            if (yytext[frompos] == '"' && yytext[frompos+1] == '"') {
                frompos++;
                numquotes++;
            }
        }
        yyleng -= numquotes;
        yytext[yyleng] = '\0';
    }
• output -- a program
• interface --
    int yylex( )
    char yytext[ ];
    int yyleng;
    auxiliary routine(s)