FSA and Scanners Lecture 5
Building Scanners From FSA
• FSA are not scanners
  • FSA accept or reject a string
  • Scanners recognize and act on tokens
    • Stop reading input when a token is seen
    • For example, convert a string of digits to an integer
• Key observation: a scanner should recognize the longest prefix of the input string that forms a token
• For example: ID = [a-z][a-z0-9]* INT = [0-9]+
  • Given “son77”, do we return {son, 77}, {son, 7, 7}, or {son77}?
  • By the longest-prefix rule, the scanner returns the single ID token {son77}
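The longest-prefix rule can be sketched in C as follows. This is a minimal illustration, not the lecture's code: it assumes the input is an in-memory string, reads the ID class as lowercase letters and digits, and uses the hypothetical names token_kind, match_id, match_int, and longest_token.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch. */
typedef enum { TOK_NONE, TOK_ID, TOK_INT } token_kind;

static int is_lower(char c) { return c >= 'a' && c <= 'z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Length of the longest prefix of s matching ID = [a-z][a-z0-9]*, or 0. */
static size_t match_id(const char *s) {
    size_t n = 0;
    if (!is_lower(s[0])) return 0;
    n = 1;
    while (is_lower(s[n]) || is_digit(s[n])) n++;
    return n;
}

/* Length of the longest prefix of s matching INT = [0-9]+, or 0. */
static size_t match_int(const char *s) {
    size_t n = 0;
    while (is_digit(s[n])) n++;
    return n;
}

/* Report the token kind with the longest match at the start of s.
   The matched length is stored in *len (0 when nothing matches). */
token_kind longest_token(const char *s, size_t *len) {
    size_t id_len, int_len;
    assert(s != NULL && len != NULL);
    id_len = match_id(s);
    int_len = match_int(s);
    if (id_len == 0 && int_len == 0) { *len = 0; return TOK_NONE; }
    if (id_len >= int_len) { *len = id_len; return TOK_ID; }
    *len = int_len;
    return TOK_INT;
}
```

For “son77” the ID matcher consumes all five characters, so the whole string is returned as one ID rather than being split after “son”.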
First attempt at a Scanner
• A messy scanner
  • Write REs for all tokens and create an FSA from each RE
  • Try each machine in turn, in order of longest possible accepted string
    • If an FSA accepts, return its token
    • Otherwise, restore the input and try the next FSA
  • If none accepts, report an error
• Problems
  • Need to restore input
    • Messy but not impossible
  • Most machines fail and waste effort
    • Combine machines with common prefixes
  • Hard to add new tokens since the order is subtle and essential
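A rough sketch of this first attempt, to make the "restore input" cost concrete. It assumes a seekable input stream (so ftell and fseek can rewind it) and two hand-written machines for INT and Real. The names machine_fn, accept_int, accept_real, and scan_first_attempt are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { TOK_ERROR, TOK_REAL, TOK_INT } token_kind;

/* Hypothetical machine type: consumes input, returns nonzero on accept. */
typedef int (*machine_fn)(FILE *in);

static int is_digit(int c) { return c >= '0' && c <= '9'; }

/* Consume [0-9]*, put back the first non-digit, return the digit count. */
static int digits(FILE *in) {
    int c, n = 0;
    while (is_digit(c = getc(in))) n++;
    if (c != EOF) ungetc(c, in);
    return n;
}

/* INT = [0-9]+ */
static int accept_int(FILE *in) { return digits(in) > 0; }

/* Real = [0-9]+ '.' [0-9]+ */
static int accept_real(FILE *in) {
    if (digits(in) == 0) return 0;
    if (getc(in) != '.') return 0;
    return digits(in) > 0;
}

/* Try each machine in turn, longest possible token first.
   After a failure, seek back to where the token started. */
token_kind scan_first_attempt(FILE *in) {
    static const struct { machine_fn run; token_kind kind; } machines[] = {
        { accept_real, TOK_REAL },
        { accept_int,  TOK_INT  },
    };
    size_t i;
    long start = ftell(in);
    assert(start >= 0);               /* assumption: stream is seekable */
    for (i = 0; i < sizeof machines / sizeof machines[0]; i++) {
        if (machines[i].run(in)) return machines[i].kind;
        fseek(in, start, SEEK_SET);   /* restore input for the next machine */
    }
    return TOK_ERROR;
}
```

Every plain integer is scanned twice here, once by the failing Real machine and once by the INT machine, which is the wasted effort the slide points out.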
Combining FSA
• Consider FSA for integers and real numbers
  • INT = [0-9]+
  • Real = [0-9]+ ‘.’ [0-9]+
[Figure: separate FSA for INT (states 1-2) and Real (states 1-4), followed by the combined FSA (states 1-4) whose final states are labeled INT and Real]
Combining FSA, cont’d
• Define errors more carefully
  • If the FSA reaches a non-final state that has no arc labeled with the next character, it is an error
  • However, if the state is final, return the token for that state
• Need the ability to look ahead one character
  • Cannot consume a character unless it moves the FSA into a new state
  • A character that matches no arc begins the next token
  • Can implement look-ahead by reading a character and putting it back: getc, ungetc
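One-character look-ahead with getc and ungetc might look like the sketch below. The helper names peek_char and scan_digits are hypothetical; only getc and ungetc come from the slide.

```c
#include <assert.h>
#include <stdio.h>

/* Look at the next character without consuming it. */
int peek_char(FILE *in) {
    int c;
    assert(in != NULL);
    c = getc(in);
    if (c != EOF) ungetc(c, in);   /* put it back for the next reader */
    return c;
}

/* Consume a run of digits and return how many were read.
   The first non-digit stays in the input and begins the next token. */
int scan_digits(FILE *in) {
    int c, count = 0;
    while ((c = peek_char(in)) >= '0' && c <= '9') {
        getc(in);                  /* the arc exists, so consume for real */
        count++;
    }
    return count;
}
```

The C standard guarantees only one character of pushback per stream, which is exactly the single character of look-ahead this scheme needs.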
Implementing Scanners
• Table-Driven Scanners
  • FSAs are represented in a table of transition arcs
    • Current state x Input char → Next state
  • Usually generated automatically from REs, e.g. Lex
• Ad-hoc Scanners
  • Represent transitions in the flow of control of a program
  • Usually written by hand
  • More difficult to modify
  • Typically faster than table-driven
Table-Driven Scanners
• Two-dimensional table indexed by (current state, current character)
• Entries are pairs: (next state, action)
[Figure: transition table with rows labeled by state and columns labeled by input character, shown beside the combined INT/Real FSA (states 1-4)]
Table-Driven Scanners, Cont’d
• Could have a column for each character
  • Scanner would be slightly faster since it need not map each character to a class
  • Table would be much larger (10x, for example)
• In a real scanner, most entries are errors
  • Compress the table to remove error entries
  • Scanner is slower because it must uncompress the table
• Speed is important in a scanner
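The character-to-class mapping that keeps the transition table small could be sketched as a one-time lookup table. The class other_c1 and the functions init_classes and get_class as written here are assumptions for illustration; the deck only names digit_c1 and dot_c1.

```c
#include <assert.h>
#include <limits.h>

/* digit_c1 and dot_c1 follow the deck; other_c1 is an added assumption. */
typedef enum { digit_c1, dot_c1, other_c1 } char_class;

static char_class class_of[UCHAR_MAX + 1];
static int class_table_ready = 0;

/* Fill the per-character lookup table once, before scanning starts. */
void init_classes(void) {
    int c;
    for (c = 0; c <= UCHAR_MAX; c++) class_of[c] = other_c1;
    for (c = '0'; c <= '9'; c++) class_of[c] = digit_c1;
    class_of['.'] = dot_c1;
    class_table_ready = 1;
}

/* Map a character (as returned by getc) to its class. */
char_class get_class(int c) {
    assert(class_table_ready);
    if (c < 0 || c > UCHAR_MAX) return other_c1;   /* EOF, out of range */
    return class_of[c];
}
```

With three classes the transition table needs three columns per state instead of one column per character, at the cost of one extra array lookup for every character read.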
Structure for Table Driven Scanner
• Use enumerated types for states, actions, and character classes
    typedef enum { start_st, int_st, dot_st, real_st, error_st } state;
    typedef enum { int_action, real_action, default_action, error_action } action;
    typedef enum { digit_c1, dot_c1 } char_class;
• Entries in table
    typedef struct { action act; state next_st; } as_pair;
• Table
    as_pair transition[4][2] = { { {default_action, int_st}, {…} }, … };
Interpreter for Table-Driven Scanner
    Token scan(FILE *input) {
        state st = start_st;                  /* current FSA state */
        while (1) {
            int c = getc(input);
            char_class cc = get_class(c);     /* map character to its class */
            action act = transition[st][cc].act;
            state new_st = transition[st][cc].next_st;
            switch (act) {
            case int_action:
                ungetc(c, input);             /* look-ahead begins the next token */
                return int_token;
            /* ... other actions ... */
            case default_action:
                break;
            }
            st = new_st;
        }
    }
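Putting the table and the interpreter together, a self-contained version might look like this. The table contents are an assumption, since the deck elides them: they encode the combined INT/Real FSA, with a hypothetical third class other_c1 for every character that is neither a digit nor a dot (including EOF). The Token values and the error_token result are likewise illustrative.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { start_st, int_st, dot_st, real_st, error_st } state;
typedef enum { int_action, real_action, default_action, error_action } action;
typedef enum { digit_c1, dot_c1, other_c1 } char_class;  /* other_c1: assumed */
typedef enum { int_token, real_token, error_token } Token;

typedef struct { action act; state next_st; } as_pair;

/* Rows: start_st, int_st, dot_st, real_st. Columns: digit, dot, other.
   Assumed contents; the original slide leaves the table elided. */
static const as_pair transition[4][3] = {
    { {default_action, int_st},  {error_action, error_st},  {error_action, error_st} },
    { {default_action, int_st},  {default_action, dot_st},  {int_action, start_st}   },
    { {default_action, real_st}, {error_action, error_st},  {error_action, error_st} },
    { {default_action, real_st}, {real_action, start_st},   {real_action, start_st}  },
};

static char_class get_class(int c) {
    if (c >= '0' && c <= '9') return digit_c1;
    if (c == '.') return dot_c1;
    return other_c1;
}

Token scan(FILE *input) {
    state st = start_st;
    assert(input != NULL);
    for (;;) {
        int c = getc(input);
        as_pair entry = transition[st][get_class(c)];
        switch (entry.act) {
        case int_action:                      /* final state, no arc for c */
            if (c != EOF) ungetc(c, input);
            return int_token;
        case real_action:
            if (c != EOF) ungetc(c, input);
            return real_token;
        case error_action:                    /* non-final state, no arc for c */
            return error_token;
        case default_action:
            break;
        }
        st = entry.next_st;
    }
}
```

Changing the token set means editing only the table, which is why generators such as Lex can emit it mechanically. Note that this sketch consumes the offending character on error; a real scanner would choose its own recovery policy.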
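Ad-hoc Scanner
• Program written to recognize tokens
    c = getc(input);
    if (get_class(c) == digit_c1) {
        while (get_class(c) == digit_c1)      /* consume the integer part */
            c = getc(input);
        switch (get_class(c)) {
        case dot_c1:
            c = getc(input);                  /* move past the '.' */
            if (get_class(c) != digit_c1)
                error(…);                     /* no digit after '.': not a Real */
            while (get_class(c) == digit_c1)  /* consume the fraction */
                c = getc(input);
            ungetc(c, input);                 /* look-ahead begins the next token */
            return real_token;
        case eof_c1:
            return int_token;
        default:
            ungetc(c, input);                 /* look-ahead begins the next token */
            return int_token;
        }
    }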
Actions
• For token classes, it is not enough to recognize them
  • For example, integers and identifiers: the scanner must also report which integer or identifier it saw
  • Unlike single-spelling tokens such as +, if
• Scanners associate semantic actions with tokens
  • When the scanner recognizes a token, it invokes the appropriate action routine and gives it the characters in the token
• Typical Actions
  • Int_action: converts digits to an integer (atoi)
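An integer action routine could be sketched as below. The slide mentions atoi; this sketch swaps in strtol instead, because atoi cannot report overflow or trailing garbage. The function name int_action_convert and its return convention are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical action routine: convert the characters of an INT token
   to a long. Returns 1 on success, 0 if the lexeme is not a clean
   in-range integer. */
int int_action_convert(const char *lexeme, long *value) {
    char *end;
    assert(lexeme != NULL && value != NULL);
    errno = 0;
    *value = strtol(lexeme, &end, 10);
    if (end == lexeme || *end != '\0') return 0;   /* empty or trailing junk */
    if (errno == ERANGE) return 0;                 /* does not fit in a long */
    return 1;
}
```

The scanner would call this with the matched characters copied into a NUL-terminated buffer and attach the resulting value to the token it returns.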
Reserved Words
• Most programming languages reserve a set of keywords that programmers cannot use as identifiers
  • For example: if, then, else, for, function, var, …
  • Typically 30-300 words
  • Reserving keywords makes programs easier to parse and read
• Two ways for a scanner to recognize keywords
  • Create REs for all keywords and build them into the scanner
    • For example, i.f, t.h.e.n
    • FSA and scanner are larger
  • In the action routine for identifiers, check whether the identifier is a keyword
    • Could use binary search on a sorted list of keywords
    • Or pre-enter keywords into the symbol table with a special value
    • Need to write additional code, and it may slow the scanner
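The binary-search option can be sketched with the standard library's bsearch. The keyword list reuses the examples from the slide; the names keywords, compare_keyword, and is_keyword are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Must stay sorted in strcmp order for bsearch to work. */
static const char *const keywords[] = {
    "else", "for", "function", "if", "then", "var"
};

/* bsearch passes the key first, then a pointer to an array element. */
static int compare_keyword(const void *key, const void *element) {
    return strcmp((const char *)key, *(const char *const *)element);
}

/* Called from the identifier action routine: 1 if lexeme is reserved. */
int is_keyword(const char *lexeme) {
    assert(lexeme != NULL);
    return bsearch(lexeme, keywords,
                   sizeof keywords / sizeof keywords[0],
                   sizeof keywords[0], compare_keyword) != NULL;
}
```

With this approach the FSA keeps a single identifier path, and each identifier costs at most a handful of string comparisons, even for a 300-word keyword list.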
Are Regular Expressions Enough?
• Can we write a compiler with only FSA (a scanner)?
• No, scanners recognize only a simple class of strings
  • For example, they cannot recognize balanced parentheses such as (()(()))
  • FSA have a finite set of states and cannot count
• Easy to recognize balanced parentheses with a counter
    n = 0;
    while (1) {
        c = getc(input);
        if (c == '(')
            n++;
        else if (c == ')') {
            if (--n < 0) { error(…); break; }   /* ')' with no matching '(' */
        } else if (c == EOF) {
            if (n != 0) error(…);               /* unclosed '(' remain */
            break;                              /* always stop at end of input */
        }
    }