Scanning & FLEX CPSC 388 Ellen Walker Hiram College
Scanning (review) • Input: characters from the source code • Output: Tokens • Keywords: IF, THEN, ELSE, FOR … • Symbols: PLUS, LBRACE, SEMI … • Variable tokens: ID, NUM • Augment with string or numeric value
Token Class (partial)
class Token {
public:
  TokenType tokenval;  // which kind of token this is
  string tokenchars;   // the matched characters (for ID)
  double numval;       // the numeric value (for NUM)
};
GetToken(): A scanning function • Token *getToken(istream &sin) • Read characters from sin until a complete token is extracted, return the token • Usually called by the parser • Note: version in the book uses global variables and returns only the token type
Using GetToken (Review)
Token *myToken = getToken(cin);
while (myToken != NULL) {
  // process the token
  switch (myToken->tokenval) {
    // cases for each token type
  }
  myToken = getToken(cin);
}
Regular Expressions for Common Tokens • Special characters: (the characters) • Identifier: [a-zA-Z][a-zA-Z_]* • Numbers: • Int: [1-9][0-9]* • Float: [1-9][0-9]*(ε|(\.[0-9]*)) • Scientific: [1-9][0-9]*(ε|(\.[0-9]*))(E|e)(+|-|ε)[1-9][0-9]*
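The number and identifier patterns above can be tried out with std::regex. This is only a sketch: the helper names (isIdentifier, isInt, isFloat, isScientific) are made up for illustration, the empty alternative is written as an optional group ("?"), and the decimal point is escaped because "." is a wildcard in ECMAScript syntax.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helpers; each one tests a whole string against one of the
// slide's patterns, using std::regex's default ECMAScript syntax.
bool isIdentifier(const std::string& s) {
    static const std::regex re("[a-zA-Z][a-zA-Z_]*");
    return std::regex_match(s, re);
}

bool isInt(const std::string& s) {
    static const std::regex re("[1-9][0-9]*");
    return std::regex_match(s, re);
}

bool isFloat(const std::string& s) {
    // Integer part followed by an optional fraction.
    static const std::regex re("[1-9][0-9]*(\\.[0-9]*)?");
    return std::regex_match(s, re);
}

bool isScientific(const std::string& s) {
    // Float, then E or e, an optional sign, and a nonzero-leading exponent.
    static const std::regex re("[1-9][0-9]*(\\.[0-9]*)?[Ee][+-]?[1-9][0-9]*");
    return std::regex_match(s, re);
}
```

Because every pattern insists on a leading digit from 1 to 9, strings such as "0" or "042" are rejected; a real scanner would probably want to relax that.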
Reg. Exp. For Comments • Comment to end of line • //[^\n]* (last part: (all chars except \n)* ) • /*…*/ comment • ab (~b|b~a)*b?ba <--- ab … ba • /\* (~\* | \*~/)*(\*)? \*/ <--- needs escapes! • Does not require matching of “inner” /**/
Comments in Practice • Often handled by “ad-hoc” methods • Scanner simply loops to ignore characters from /* to */ • If character is not ‘*’, ignore it • Else if next character is not ‘/’, ignore it • Else ignore “*/” and return to scanning normally
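A minimal sketch of that ad-hoc loop, assuming the scanner has already consumed the opening delimiter. The function name skipBlockComment is hypothetical, and the extra inner loop for runs of '*' is an addition so that a comment ending in "**/" still closes.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: call after the opening "/*" has been consumed.
// Returns true if the closing "*/" was found, false if input ran out first.
bool skipBlockComment(std::istream& in) {
    char c;
    while (in.get(c)) {
        if (c != '*') continue;              // not '*': ignore it
        // Saw '*': absorb any further '*' so "**/" also ends the comment.
        while (in.peek() == '*') in.get(c);
        if (in.peek() == '/') {              // "*/" found: consume and stop
            in.get(c);
            return true;
        }
        // '*' not followed by '/': ignore it and keep looping
    }
    return false;                             // unterminated comment
}
```

On return the stream is positioned just past the closing delimiter, so normal scanning can resume from there. Returning false lets the caller report an unterminated comment.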
Delimiters and Ambiguity • Comments are not totally ignored! • “fo/**/r” is not the keyword “for” ! • Principle of longest substring (“maximal munch”) • “fork” is not “for” followed by “k” • Disallow keywords as identifiers • Scan identifier, then look it up instead of including keywords explicitly in language
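The last point, scanning an identifier first and then looking it up, might be sketched as follows. The enum values and the function name classifyWord are assumptions chosen for illustration; only the token names IF, THEN, ELSE, FOR and ID come from the earlier slides.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative subset of token types (an assumption, not the course's enum).
enum TokenType { IF, THEN, ELSE, FOR, ID };

// Called with a complete lexeme that already matched the identifier pattern.
// Reserved words are found in the table; everything else is a plain ID.
TokenType classifyWord(const std::string& lexeme) {
    static const std::unordered_map<std::string, TokenType> keywords = {
        {"if", IF}, {"then", THEN}, {"else", ELSE}, {"for", FOR}};
    auto it = keywords.find(lexeme);
    return it == keywords.end() ? ID : it->second;
}
```

Since the lookup happens only after the longest identifier has been scanned, "fork" reaches the table as a whole word and comes back as ID rather than FOR followed by "k".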
FORTRAN’s mistakes • Ignored white space (no delimiters) • DO99I=1.2 (DO99I = 1.2) vs. • DO99I=1,2 (DO 99 I = 1 , 2) • No reserved words • IF(IF.EQ.0)THENTHEN=17 • Result: arbitrary backtracking (or lookahead) needed!
TINY Lexemes • Reserved words: if, then, else, end, repeat, until, read, write • Symbols: +, -, *, /, =, <, (, ), ;, := • Other: number (integer only), identifier (letters only) • Comment: {…} • Principle of longest substring holds
Using the TINY DFA • Implement DFA directly or with a table • Each call to gettoken() starts at the current point of the string, scans until no transition is possible. • If final state is reached, return the token determined by the link to the final state. Otherwise, report an error. • Characters in [ ] are not consumed
DFA pseudocode
state = start_state
while (chars available) {
  last_state = state
  state = next_state(next_char, state)
  if state == null return final(last_state)
}
return final(state)
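As a rough illustration of this loop, here is a toy scanner for two token kinds only (runs of letters and runs of digits). It is not the TINY DFA; the names State, Tok, nextState, finalToken and scanToken are all hypothetical, and a DEAD state stands in for the "null" transition.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum State { START, IN_ID, IN_NUM, DEAD };
enum class Tok { Id, Num, Error };

// Transition function of the toy DFA: letters build an Id, digits a Num.
State nextState(State s, char c) {
    unsigned char u = static_cast<unsigned char>(c);
    if (std::isalpha(u) && (s == START || s == IN_ID)) return IN_ID;
    if (std::isdigit(u) && (s == START || s == IN_NUM)) return IN_NUM;
    return DEAD;  // no transition possible
}

// Token determined by the state in which the scan stopped.
Tok finalToken(State s) {
    if (s == IN_ID) return Tok::Id;
    if (s == IN_NUM) return Tok::Num;
    return Tok::Error;  // stopped in a non-accepting state
}

// Scans from pos and advances pos past the longest lexeme the DFA accepts.
Tok scanToken(const std::string& input, std::size_t& pos) {
    State state = START;
    while (pos < input.size()) {
        State next = nextState(state, input[pos]);
        if (next == DEAD) break;  // leave the offending character unconsumed
        state = next;
        ++pos;
    }
    return finalToken(state);
}
```

Calling scanToken repeatedly on "abc123" would yield an Id covering "abc" and then a Num covering "123", with pos marking where the next call starts.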
LEX (FLEX) • FLEX generates a scanner automatically! • Input: a regular expression describing each token, plus optional additional code • Output: lex.yy.c - includes function yylex() for scanning (like gettoken)
DFA Pseudocode
state = initial-state
while (chars in string) {
  c = next char from string
  state = next_state[state][c]
}
if final[state] return ACCEPT
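The table-driven version can be sketched for a single small language, the integer pattern [1-9][0-9]*. The table layout, the character-class trick and the names kNextState, kFinal, charClass and accepts are illustrative assumptions; generated scanners use much larger (and compressed) tables.

```cpp
#include <cassert>
#include <string>

// Toy DFA for the integer pattern [1-9][0-9]*.
// States (rows): 0 = initial, 1 = in number (accepting), 2 = error sink.
// Character classes (columns): 0 = '0', 1 = '1'..'9', 2 = anything else.
const int kNextState[3][3] = {
    {2, 1, 2},  // initial: only a nonzero digit starts a number
    {1, 1, 2},  // in number: any digit continues it
    {2, 2, 2},  // error sink: stays in error
};
const bool kFinal[3] = {false, true, false};

// Maps a character to its column in the transition table.
int charClass(char c) {
    if (c == '0') return 0;
    if (c >= '1' && c <= '9') return 1;
    return 2;
}

// Runs the DFA over the whole string and reports whether it ends accepting.
bool accepts(const std::string& s) {
    int state = 0;
    for (char c : s) state = kNextState[state][charClass(c)];
    return kFinal[state];
}
```

Grouping characters into classes keeps the table at three columns instead of one per possible character, at the cost of one extra function call per input character.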
Parts of a LEX file • Definitions • code for the top of the file, and define expressions such as “digit” • All code in %{ and %} directly copied • Rules • { expression } {code when recognized} • Auxiliary Routines • Define additional functions here (including main)
Predefined items • yylex() - lex scanning routine (like getToken) - generated by FLEX • yytext - current string (a character array, not a C++ string class) • input() - get a char from flex input • ECHO - print yytext to yyout
Example: Definitions
%{
/* add line numbers to text and print */
#include <iostream>
int lineno=1;
%}
line .*\n
%%
Example: Rules & Aux. Code
{line} { std::cout << lineno++ << " " << yytext; }
%%
int main() {
  yylex();
  return 0;
}
Using the Scanner • First, create the code • flex test.lex • Next, compile the program • g++ lex.yy.c -o test -lfl • Finally, scan the input file • ./test < input_file