Compiler Structures

Compiler Structures 242-437, Semester 2, 2018-2019

2. Lexical Analysis

Objectives
• what is lexical analysis?
• look at a lexical analyzer for a simple 'expressions' language

Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. The Expressions Language
5. exprTokens.c
6. From REs to Code Automatically

[Figure: compiler phases]
Source Program
  Front End: Lexical Analyzer (in this lecture) -> Syntax Analyzer -> Semantic Analyzer -> Int. Code Generator
Intermediate Code
  Back End: Code Optimizer -> Target Code Generator
Target Lang. Prog.

1. Why Lexical Analysis?
• A stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants)
• Simplifies the design of the rest of the compiler
  • the code uses tokens, not strings or characters
• Can be implemented efficiently
  • by hand or automatically
• Improves portability
  • non-standard symbols / foreign characters are translated here, so they do not affect the rest of the compiler

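2. Using a Lexical Analyzer
[Figure: Source Program -> Lexical Analyzer (using chars) -> Syntax Analyzer (using tokens)]
1. The syntax analyzer asks the lexical analyzer to get the next token.
2. The lexical analyzer gets chars from the source program to make a token.
3. The lexical analyzer returns the token and its token value.
• The lexical analyzer reports lexical errors; the syntax analyzer reports syntax errors.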

A Source Program is Chars
• Consider the program fragment:
    if (i==j);
        z=1;
    else;
        z=0;
    endif;
• The lexical analyzer reads it in as a string of characters:
    i f _ ( i = = j ) ; \n \t z = 1 ; \n e l s e ; \n \t z = 0 ; \n e n d i f ;
• Lexical analysis divides the string into tokens.

Tokens and Token Values
[Figure: the Lexical Analyzer gets chars from "y = 31 + 28*foo"; the Syntax Analyzer gets tokens from it, one at a time]
  <token, token value>
  <id, "y">
  <=, >
  <int, 31>
  <+, >
  <int, 28>
  <*, >
  <id, "foo">

Tokens, Lexemes, and Patterns
• A token is a lexical type
  • e.g. id, int
• A lexeme is a token value
  • e.g. "abc", 123
• A pattern says how to make a token from chars
  • e.g. id = letter followed by letters and digits
         int = non-empty sequence of digits
  • a pattern is defined using regular expressions (REs)
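The three terms can be tied together in a few lines of C. This is only a sketch: the names TokType, TokenRec, matchesId(), matchesInt() and classify() are made up for illustration and are not part of exprTokens.c. The two match functions hand-code the example patterns above (without REs), and classify() pairs a token with its lexeme.

```c
#include <ctype.h>
#include <string.h>

#define MAX_LEXEME 32   /* assumed limit, chosen for this sketch */

typedef enum { TOK_ID, TOK_INT } TokType;   /* tokens: lexical types */

typedef struct {
    TokType type;              /* the token */
    char lexeme[MAX_LEXEME];   /* the lexeme (token value), kept as text */
} TokenRec;

/* Pattern for id: a letter followed by letters and digits. */
int matchesId(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))
            return 0;
    return 1;
}

/* Pattern for int: a non-empty sequence of digits. */
int matchesInt(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/* Fill in *tok if s matches one of the two patterns.
   Returns 1 on success, 0 if s matches neither pattern or is too long. */
int classify(const char *s, TokenRec *tok)
{
    if (strlen(s) >= MAX_LEXEME)
        return 0;
    if (matchesId(s))
        tok->type = TOK_ID;
    else if (matchesInt(s))
        tok->type = TOK_INT;
    else
        return 0;
    strcpy(tok->lexeme, s);
    return 1;
}
```

For example, classify("abc", &t) succeeds with t.type set to TOK_ID, while classify("1a", &t) fails because "1a" fits neither pattern.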

3. Implementing a Lexical Analyzer
Issues:
• Lookahead
  • how to group chars into tokens
• Ignoring whitespace and comments
• Separating variables from keywords
  • e.g. "if", "else"
• (Automatically) translating REs into a lexical analyzer

Lookahead
• A token is created by reading in characters and grouping them together.
• It is not always possible to decide if a token is finished without looking ahead at the next char.
• For example:
  • Is "i" a variable, or the first character of "if"?
  • Is "=" an assignment, or the beginning of "=="?
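One common way to look ahead in C is to read the next character and push it back with ungetc() if it does not belong to the current token. The sketch below applies this to the "=" versus "==" question. The names EqTok, TOK_ASSIGN, TOK_EQ and scanAfterEquals() are hypothetical, and the function takes a FILE * argument (rather than always using stdin) only so that it is easy to try out.

```c
#include <stdio.h>

typedef enum { TOK_ASSIGN, TOK_EQ } EqTok;   /* hypothetical token names */

/* Call this after a '=' has already been read from in.
   It looks one character ahead to decide between "=" and "==". */
EqTok scanAfterEquals(FILE *in)
{
    int next = getc(in);

    if (next == '=')
        return TOK_EQ;        /* both characters of "==" are now consumed */
    if (next != EOF)
        ungetc(next, in);     /* not part of this token: push it back */
    return TOK_ASSIGN;
}
```

After the function returns TOK_ASSIGN, the pushed-back character is the first one seen by the next read, so no input is lost.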

4. The Expressions Language
• In my expressions language, a program is a series of expressions and assignments.
• Example:
    // test2.txt example
    let x56 = 2
    let bing_BONG = (27 * 2) - x56
    5 * (67 / 3)

4.1. REs for the Language
• alpha = a | b | c | ... | z | A | B | ... | Z
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• alphanum = alpha | digit
• id = alpha (alphanum | '_')*
• int = digit+

• keywords = "let" | "SCANEOF"
• punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
• Ignore:
  • whitespace (but not newlines)
  • comments ("//" to the end of the line)

4.2. From REs to Tokens
• Using the REs as a guide, we create tokens and token values. How?
• In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.

Tokens and Token Values

Token      Token Value
ID         "var" and the id string
INT        "num" and the value
LPAREN     '('
RPAREN     ')'
PLUSOP     '+'
MINUSOP    '-'
MULTOP     '*'
DIVOP      '/'
ASSIGNOP   '='
NEWLINE    '\n'
LET        "let"
SCANEOF    eof character

5. exprTokens.c
• exprTokens.c is a lexical analyzer for the expressions language.
• It reads in an expressions program on stdin, and prints out the tokens (and their values).

5.1. Usage
    > gcc -Wall -o exprTokens exprTokens.c
    > ./exprTokens < test2.txt
    1:
    2:
    3:
    4: 'let' var(x56) '=' num(2)
    5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
    6:
    7: num(5) '*' '(' num(67) '/' num(3) ')'
    8: 'eof'
    >
• or use a Windows C compiler: lcc-win32, http://www.cs.virginia.edu/~lcc-win32/

5.2. Code
    // headers needed by the code on the following slides
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    // constants for tokens and their values
    #define NUMKEYS 2

    typedef enum token_types {
        LET, ID, INT, LPAREN, RPAREN, NEWLINE, ASSIGNOP,
        PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
    } Token;

    char *tokSyms[] = {"let", "var", "num", "(", ")", "\n",
                       "=", "+", "-", "*", "/", "eof"};

    char *keywords[NUMKEYS] = {"let", "SCANEOF"};
    Token keywordToks[NUMKEYS] = {LET, SCANEOF};

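Call graph for exprTokens.c
[Figure: main() calls nextToken() and printToken(); nextToken() calls scanner(); scanner() calls clearTokStr(), extendTokStr(), checkKeyword() and lexicalErr()]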

main() and its globals
    Token currToken;
    int lineNum = 1;    // num lines read in

    int main(void)
    {
        printf("%2d: ", lineNum);
        do {
            nextToken();
            printToken();
        } while (currToken != SCANEOF);
        return 0;
    }

Printing the Tokens
    #define MAX_IDLEN 30

    char tokString[MAX_IDLEN];
    int currTokValue;    // used when token is an integer

    void printToken(void)
    {
        if (currToken == ID)    // an ID, variable name
            printf("%s(%s) ", tokSyms[currToken], tokString);
        else if (currToken == INT)    // a number
            printf("%s(%d) ", tokSyms[currToken], currTokValue);    // show value
        else if (currToken == NEWLINE)
            printf("%s%2d: ", tokSyms[currToken], lineNum);    // print newline token
        else
            printf("'%s' ", tokSyms[currToken]);    // other toks
    }  // end of printToken()

Getting a Token
    void nextToken(void)
    {
        currToken = scanner();
    }

scanner() Overview
    Token scanner(void)
    // converts chars into a token
    {
        int inCh;

        clearTokStr();
        if (feof(stdin))
            return SCANEOF;

        while ((inCh = getchar()) != EOF) {    /* EOF is ^D */
            if (inCh == '\n') {
                lineNum++;
                return NEWLINE;
            }
            else if (isspace(inCh))    // do nothing
                continue;
            else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
                // read in chars to make id token
                // return ID or keyword
            }
            else if (isdigit(inCh)) {    // INT = DIGIT+
                // read in chars to make int token
                // change token to int
                return INT;
            }
            else if (inCh == '(')    // punctuation
                return LPAREN;
            else if ...    // more tests of inCh
            ...
            else if (inCh == '=')
                return ASSIGNOP;
            else
                lexicalErr(inCh);
        }
        return SCANEOF;
    }  // end of scanner()
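The overview does not show how "//" comments are ignored. Since '/' is also the DIVOP character, one character of lookahead is needed to tell the two apart. The following is a sketch of one possible way to do it, not the code from exprTokens.c: the function name skipIfComment() is made up, and it takes a FILE * argument (instead of using stdin directly) so that it can be tried on its own.

```c
#include <stdio.h>

/* Call this after a '/' has already been read from in.
   Returns 1 if the '/' started a "//" comment; the comment text is then
   skipped up to, but not including, the newline.
   Returns 0 if it was a lone '/' (i.e. a DIVOP). */
int skipIfComment(FILE *in)
{
    int ch = getc(in);

    if (ch != '/') {               /* lone '/': push back the lookahead char */
        if (ch != EOF)
            ungetc(ch, in);
        return 0;
    }
    while ((ch = getc(in)) != EOF && ch != '\n')
        ;                          /* discard the comment text */
    if (ch == '\n')
        ungetc(ch, in);            /* leave '\n' so NEWLINE is still reported */
    return 1;
}
```

Leaving the newline in the input means a comment line still produces a NEWLINE token, which matches the usage output, where the line number advances past the comment on line 1.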

Processing an ID in scanner()
    :
    else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
        extendTokStr(inCh);
        for (inCh = getchar(); (isalnum(inCh) || inCh == '_'); inCh = getchar())
            extendTokStr(inCh);
        ungetc(inCh, stdin);
        return checkKeyword();
    }
    :

Token String Functions
    int tokStrLen = 0;    // current length of tokString (declaration not shown on the original slide)

    void clearTokStr(void)
    // reset the token string to be empty
    {
        tokString[0] = '\0';
        tokStrLen = 0;
    }  // end of clearTokStr()

    void extendTokStr(char ch)
    // add ch to the end of the token string
    {
        if (tokStrLen == (MAX_IDLEN-1))
            printf("Token string too long for %c\n", ch);
        else {
            tokString[tokStrLen] = ch;
            tokStrLen++;
            tokString[tokStrLen] = '\0';    // terminate string
        }
    }  // end of extendTokStr()

Checking for a Keyword
    Token checkKeyword(void)
    // return the keyword's token if tokString is a keyword, otherwise ID
    {
        int i;
        for (i = 0; i < NUMKEYS; i++) {
            if (!strcmp(tokString, keywords[i]))
                return keywordToks[i];
        }
        return ID;
    }  // end of checkKeyword()

Processing an INT in scanner()
    :
    else if (isdigit(inCh)) {    // INT = DIGIT+
        extendTokStr(inCh);
        for (inCh = getchar(); isdigit(inCh); inCh = getchar())
            extendTokStr(inCh);
        ungetc(inCh, stdin);
        currTokValue = atoi(tokString);    // token --> int
        return INT;
    }
    :
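atoi() gives no way to detect a digit string whose value does not fit in an int. If that matters, strtol() can be used instead, since it reports out-of-range values through errno. The sketch below shows the idea. The function name digitsToInt() is hypothetical and is not part of exprTokens.c.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Convert a string of digits to an int.
   Returns 1 and stores the value in *out on success.
   Returns 0 if the string is empty, has trailing non-digit characters,
   or holds a value that does not fit in an int. */
int digitsToInt(const char *digits, int *out)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(digits, &end, 10);
    if (end == digits || *end != '\0')       /* nothing converted, or junk left over */
        return 0;
    if (errno == ERANGE || val > INT_MAX || val < INT_MIN)
        return 0;                            /* value out of range */
    *out = (int)val;
    return 1;
}
```

In scanner(), a call such as digitsToInt(tokString, &currTokValue) could replace the atoi() line, with a failure reported as a lexical error.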

Reporting an Error
    void lexicalErr(char ch)
    {
        printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
        exit(1);
    }
• No recovery attempted.

5.3. Some Good News
• Most programming languages use very similar lexical analyzers
  • e.g. the same kind of IDs, INTs, punctuation, and keywords
• Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.

6. From REs to Code Automatically
1. Write the REs for the language.
2. Convert to Non-deterministic Finite Automata (NFA).
3. Convert to Deterministic Finite Automata (DFA).
4. Convert to a table that can be 'plugged' into an 'empty' lexical analyzer.
• There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
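To give a feel for stage 4, here is a small hand-written transition table for the id and int REs of the expressions language, together with a generic loop that runs it. It is only an illustration of the table-driven idea: the state names, the character classes, and runDFA() are made up for this sketch, and the table is not the output of lex or any other tool.

```c
#include <ctype.h>

/* DFA states: S_ID and S_INT are the accepting states */
enum { S_START, S_ID, S_INT, S_ERR, NUM_STATES };

/* character classes used as the table's columns */
enum { C_ALPHA, C_DIGIT, C_UNDER, C_OTHER, NUM_CLASSES };

/* delta[state][class] = next state
   id  = alpha (alphanum | '_')*
   int = digit+                     */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /*              alpha  digit  '_'    other */
    /* S_START */ { S_ID,  S_INT, S_ERR, S_ERR },
    /* S_ID    */ { S_ID,  S_ID,  S_ID,  S_ERR },
    /* S_INT   */ { S_ERR, S_INT, S_ERR, S_ERR },
    /* S_ERR   */ { S_ERR, S_ERR, S_ERR, S_ERR }
};

/* Map a character to its column in the table. */
static int charClass(unsigned char ch)
{
    if (isalpha(ch)) return C_ALPHA;
    if (isdigit(ch)) return C_DIGIT;
    if (ch == '_')   return C_UNDER;
    return C_OTHER;
}

/* Run the DFA over the whole of s and return the final state:
   S_ID if s is an id, S_INT if s is an int, otherwise S_START or S_ERR. */
int runDFA(const char *s)
{
    int state = S_START;

    for (; *s != '\0'; s++)
        state = delta[state][charClass((unsigned char)*s)];
    return state;
}
```

The loop in runDFA() is the 'empty' analyzer: it knows nothing about ids or ints, so changing the language only means changing the table.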
