
FSA and Scanners


marion

Presentation Transcript


  1. FSA and Scanners Lecture 5

  2. Building Scanners From FSA
     • FSA are not scanners
        • An FSA accepts or rejects a string
        • A scanner recognizes and acts on tokens
        • A scanner stops reading input when a token is seen
        • For example, it converts a string of digits to an integer
     • Key observation: the scanner should recognize the longest prefix of the input string that forms a token
     • For example: ID = [a-z][a-z0-9]*, INT = [0-9]+
     • Given “son77”, do we return {son, 77}, {son, 7, 7}, or {son77}? Under the longest-prefix rule the scanner returns {son77}.
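
The longest-prefix rule can be sketched in C as follows. This is a minimal illustration, not the lecture's own code: the names `match_id`, `match_int`, `longest_token`, and the `token_kind` enum are assumptions made for the example, and the input is taken from a string rather than a file. Ties go to ID, although with these two token classes a nonzero tie cannot occur because an ID never starts with a digit.

```c
#include <stddef.h>

/* Hypothetical token kinds for the two classes on the slide. */
typedef enum { TOK_ID, TOK_INT, TOK_NONE } token_kind;

static int is_lower(char c) { return c >= 'a' && c <= 'z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Length of the longest prefix of s matching ID = [a-z][a-z0-9]*, or 0. */
size_t match_id(const char *s) {
    size_t n;
    if (!is_lower(s[0])) return 0;
    n = 1;
    while (is_lower(s[n]) || is_digit(s[n])) n++;
    return n;
}

/* Length of the longest prefix of s matching INT = [0-9]+, or 0. */
size_t match_int(const char *s) {
    size_t n = 0;
    while (is_digit(s[n])) n++;
    return n;
}

/* Returns the kind of the longest token at the start of s and stores
   its length in *len. Returns TOK_NONE with *len == 0 if neither matches. */
token_kind longest_token(const char *s, size_t *len) {
    size_t id_len = match_id(s);
    size_t int_len = match_int(s);
    if (id_len == 0 && int_len == 0) { *len = 0; return TOK_NONE; }
    if (id_len >= int_len) { *len = id_len; return TOK_ID; }
    *len = int_len;
    return TOK_INT;
}
```

Calling `longest_token("son77", &n)` yields `TOK_ID` with `n == 5`, which is the {son77} reading of the input.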

  3. First attempt at a Scanner
     • A messy scanner
        • Write REs for all tokens and create an FSA from each RE
        • Try each machine in turn, in order of longest possible accepted string
        • If an FSA accepts, return its token
        • Otherwise, restore the input and try the next FSA
        • If none accept, report an error
     • Problems
        • Need to restore input: messy but not impossible
        • Most machines fail and waste effort; combining machines with common prefixes avoids this
        • Hard to add new tokens since the order is subtle and essential

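  4. Combining FSA
     • Consider the FSA for integers and real numbers
        • INT = [0-9]+
        • Real = [0-9]+ '.' [0-9]+
     • [Figure: the separate FSA for INT and Real, and the combined FSA. In the combined machine, state 1 (start) goes to state 2 on [0-9]; state 2 (final, INT) loops on [0-9] and goes to state 3 on '.'; state 3 goes to state 4 on [0-9]; state 4 (final, Real) loops on [0-9].]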

  5. Combining FSA, cont’d
     • Define errors more carefully
        • If the FSA reaches a non-final state that has no arc labeled with the next character, it is an error
        • However, if the state is final, return the token for that state
     • Need the ability to look ahead one character
        • Cannot consume a character unless it puts the FSA into a new state
        • A character that matches no arc begins the next token
        • Can implement look-ahead by reading a character and putting it back: getc, ungetc
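
The error rule and the one-character look-ahead can be sketched for the combined INT/Real machine as follows. This is an illustrative sketch under assumptions: the names `scan_number`, `num_state`, and `num_token` are invented for the example, and the input is a string, so "looking ahead" is simply inspecting `s[i]` without advancing. With a `FILE *`, the same effect would come from `getc` followed by `ungetc`.

```c
#include <stddef.h>

/* Hypothetical names for the four states of the combined machine. */
typedef enum { ST_START, ST_INT, ST_DOT, ST_REAL } num_state;
typedef enum { NUM_INT, NUM_REAL, NUM_ERROR } num_token;

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Scans the longest INT or Real at the start of s.
   *len receives the number of characters consumed. */
num_token scan_number(const char *s, size_t *len) {
    num_state st = ST_START;
    size_t i = 0;
    for (;;) {
        char c = s[i];          /* look ahead without consuming */
        num_state next = st;
        int has_arc = 1;
        switch (st) {
        case ST_START:
            if (is_digit(c)) next = ST_INT; else has_arc = 0;
            break;
        case ST_INT:
            if (is_digit(c)) next = ST_INT;
            else if (c == '.') next = ST_DOT;
            else has_arc = 0;
            break;
        case ST_DOT:
            if (is_digit(c)) next = ST_REAL; else has_arc = 0;
            break;
        case ST_REAL:
            if (is_digit(c)) next = ST_REAL; else has_arc = 0;
            break;
        }
        if (!has_arc) break;    /* c is left for the next token */
        st = next;
        i++;                    /* consume c only when an arc exists */
    }
    *len = i;
    if (st == ST_INT) return NUM_INT;    /* stopped in a final state */
    if (st == ST_REAL) return NUM_REAL;
    return NUM_ERROR;                    /* stopped in a non-final state */
}
```

Note that an input such as `12.x` is reported as an error here, because the machine is stuck in the non-final state after the dot. That follows the rule as stated on the slide; a scanner that backs up to the last final state would instead return INT for `12`.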

  6. Implementing Scanners
     • Table-Driven Scanners
        • FSAs are represented as a table of transition arcs
        • Current state x input char → next state
        • Usually generated automatically from REs, e.g. by Lex
     • Ad-hoc Scanners
        • Represent transitions in the flow of control of a program
        • Usually written by hand
        • More difficult to modify
        • Typically faster than table-driven

  7. Table-Driven Scanners
     • Two-dimensional table indexed by (current state, current character)
     • Entries are pairs: (next state, action)
     • [Figure: transition table with one row per state and one column per input character class, shown next to the combined INT/Real FSA with states 1 to 4.]

  8. Table-Driven Scanners, Cont’d
     • Could have a column for each character instead of each character class
        • Scanner would be slightly faster since it need not map a character to its class
        • Table would be much larger (10x, for example)
     • In a real scanner, most entries are errors
        • Can compress the table to remove error entries
        • Scanner is then slower because it must uncompress the table
     • Speed is important in a scanner
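
The character-to-class mapping that this trade-off refers to might look like the sketch below. It is one possible arrangement, not the lecture's code: the names `class_table`, `init_classes`, and the `CC_*` classes are assumptions. The transition table then needs one column per class (four here) instead of one per character value, at the cost of one extra array lookup per input character.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical character classes; a real scanner would have more. */
typedef enum { CC_DIGIT, CC_DOT, CC_OTHER, CC_EOF } char_class;

/* One entry per possible unsigned char value. */
static char_class class_table[UCHAR_MAX + 1];

/* Fills the lookup table once, before scanning starts. */
void init_classes(void) {
    int i;
    for (i = 0; i <= UCHAR_MAX; i++) class_table[i] = CC_OTHER;
    for (i = '0'; i <= '9'; i++) class_table[i] = CC_DIGIT;
    class_table['.'] = CC_DOT;
}

/* Maps a value returned by getc (an unsigned char value or EOF)
   to its class with a single array lookup. */
char_class get_class(int c) {
    if (c < 0 || c > UCHAR_MAX) return CC_EOF;  /* EOF or out of range */
    return class_table[c];
}
```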

  9. Structure for Table Driven Scanner
     • Use enumerated types for states, actions, and character classes
          typedef enum { start_st, int_st, dot_st, real_st, error_st } state;
          typedef enum { int_action, real_action, default_action, error_action } action;
          typedef enum { digit_c1, dot_c1 } char_class;
     • Entries in the table are (action, next state) pairs
          struct as_pair { action act; state next_st; };
     • Table
          struct as_pair transition[4][2] = {{default_action, int_st}, {…}, ..};

  10. Interpreter for Table-Driven Scanner
          Token scan(FILE *input) {
              state st = start_st;                  /* begin in the start state */
              while (1) {
                  int c = getc(input);
                  char_class cc = get_class(c);     /* map character to its class */
                  action act = transition[st][cc].act;
                  state new_st = transition[st][cc].next_st;
                  switch (act) {
                  case int_action:
                      return int_token;
                  /* ---- */
                  case default_action:
                      break;
                  }
                  st = new_st;                      /* follow the arc */
              }
          }
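
A complete, runnable variant of the table and its interpreter is sketched below for the combined INT/Real machine. Several parts are assumptions rather than the slides' content: the table entries are filled in by hand from the combined FSA, a third character class `other_c1` and the `token` enum are added, and the input is a string so that `*len` can report where the token ends instead of using `ungetc`.

```c
#include <stddef.h>

typedef enum { start_st, int_st, dot_st, real_st, error_st } state;
typedef enum { int_action, real_action, default_action, error_action } action;
typedef enum { digit_c1, dot_c1, other_c1 } char_class;  /* other_c1 is an added class */
typedef enum { int_token, real_token, error_token } token; /* hypothetical token type */

struct as_pair { action act; state next_st; };

/* Rows: start_st, int_st, dot_st, real_st. Columns: digit, dot, other.
   Entries are an assumed reconstruction from the combined FSA. */
static const struct as_pair transition[4][3] = {
    /* start_st */ { {default_action, int_st},  {error_action, error_st},
                     {error_action, error_st} },
    /* int_st   */ { {default_action, int_st},  {default_action, dot_st},
                     {int_action, start_st} },
    /* dot_st   */ { {default_action, real_st}, {error_action, error_st},
                     {error_action, error_st} },
    /* real_st  */ { {default_action, real_st}, {real_action, start_st},
                     {real_action, start_st} },
};

static char_class get_class(char c) {
    if (c >= '0' && c <= '9') return digit_c1;
    if (c == '.') return dot_c1;
    return other_c1;               /* includes the terminating '\0' */
}

/* Scans one token from the start of s; *len receives its length. */
token scan(const char *s, size_t *len) {
    state st = start_st;
    size_t i = 0;
    for (;;) {
        struct as_pair e = transition[st][get_class(s[i])];
        switch (e.act) {
        case int_action:     *len = i; return int_token;
        case real_action:    *len = i; return real_token;
        case error_action:   *len = i; return error_token;
        case default_action: break;   /* keep scanning */
        }
        st = e.next_st;            /* follow the arc and consume s[i] */
        i++;
    }
}
```

The loop always terminates because every row has a non-default action in the `other_c1` column, which is the class of the string terminator.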

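  11. Ad-hoc Scanner
     • Program written to recognize tokens
          int c = getc(input);
          if (get_class(c) == digit_c1) {
              while (get_class(c) == digit_c1)       /* consume the integer part */
                  c = getc(input);
              switch (get_class(c)) {
              case dot_c1:
                  c = getc(input);                   /* first character after '.' */
                  if (get_class(c) != digit_c1)
                      error(…);                      /* Real needs a digit after '.' */
                  while (get_class(c) == digit_c1)   /* consume the fraction */
                      c = getc(input);
                  ungetc(c, input);                  /* put back the look-ahead */
                  return real_token;
              case eof_c1:
                  return int_token;
              default:
                  ungetc(c, input);                  /* this character begins the next token */
                  return int_token;
              }
          }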

  12. Actions
     • For token classes, it is not enough to recognize them
        • For example, integers and identifiers: the value or name is also needed
        • In contrast, tokens such as + and if are fully identified by their kind
     • Scanners associate semantic actions with tokens
        • When the scanner recognizes a token, it invokes the appropriate action routine and gives it the characters in the token
     • Typical actions
        • int_action: converts digits to an integer (atoi)
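
An action routine for integers might be sketched as below. The slide mentions `atoi`, which needs a NUL-terminated string and reports no errors, so this sketch converts the token's characters directly and adds an overflow check. The name `digits_to_long` and the (pointer, length) interface are assumptions made for the example.

```c
#include <limits.h>
#include <stddef.h>

/* Converts the len digit characters starting at text into a long.
   Returns 0 on success and stores the value in *out.
   Returns -1 if the token is empty, has a non-digit, or overflows. */
int digits_to_long(const char *text, size_t len, long *out) {
    long v = 0;
    size_t i;
    if (len == 0) return -1;
    for (i = 0; i < len; i++) {
        int d;
        if (text[i] < '0' || text[i] > '9') return -1;
        d = text[i] - '0';
        if (v > (LONG_MAX - d) / 10) return -1;  /* v * 10 + d would overflow */
        v = v * 10 + d;
    }
    *out = v;
    return 0;
}
```

The scanner would call this with the start and length of the token it has just recognized, for example `digits_to_long("77son", 2, &v)` for the token `77`.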

  13. Reserved Words
     • Most programming languages reserve a set of keywords that programmers cannot use as identifiers
        • For example: if, then, else, for, function, var, …
        • Typically 30-300 words
        • Reserving keywords makes programs easier to parse and read
     • Two ways for a scanner to recognize keywords
        • Create REs for all keywords and build them into the scanner
           • For example: i.f, t.h.e.n
           • FSA and scanner are larger
        • In the action routine for identifiers, check whether the identifier is a keyword
           • Could use binary search on a sorted list of keywords
           • Or pre-enter keywords into the symbol table with a special value
           • Need to write additional code, and it may slow the scanner
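
The binary-search option can be sketched with the standard library's `bsearch`. The keyword list below reuses the examples from the slide; the names `keywords`, `cmp_keyword`, `classify_identifier`, and the `ident_kind` enum are assumptions for illustration. The list must stay sorted in `strcmp` order for the search to be correct.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { TOK_IDENT, TOK_KEYWORD } ident_kind;  /* hypothetical kinds */

/* Keywords in strcmp order, as required by bsearch. */
static const char *const keywords[] = {
    "else", "for", "function", "if", "then", "var"
};

/* bsearch passes the search key first and a pointer to an array element second. */
static int cmp_keyword(const void *key, const void *elem) {
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Called from the identifier action routine with the NUL-terminated lexeme. */
ident_kind classify_identifier(const char *name) {
    const void *hit = bsearch(name, keywords,
                              sizeof keywords / sizeof keywords[0],
                              sizeof keywords[0], cmp_keyword);
    return hit != NULL ? TOK_KEYWORD : TOK_IDENT;
}
```

With this approach the FSA only needs the single ID pattern, and the keyword check costs a handful of string comparisons per identifier.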

  14. Are Regular Expressions Enough?
     • Can we write a compiler with only FSA (a scanner)?
     • No, scanners only recognize a simple class of strings
        • For example, they cannot recognize balanced parentheses such as (()(()))
        • FSA have a finite set of states and cannot count
     • Easy to recognize balanced parentheses with a counter
          int n = 0;
          while (1) {
              int c = getc(input);
              if (c == '(') n++;
              else if (c == ')') {
                  if (--n < 0) { error(….); break; }   /* ')' with no matching '(' */
              }
              else if (c == EOF) {
                  if (n != 0) error(….);               /* unclosed '(' remains */
                  break;                               /* stop at end of input */
              }
          }
