CPSC 325 - Compiler Tutorial 2 Scanner & Lex
Tokens
• Token Stream: each significant lexical chunk of the input program is represented by a token
• Operators & Punctuation: { } ! + - = * ; : …
• Keywords: if while return goto
• Identifiers: id & actual name
• Constants: kind & value; int, floating-point, character, string, …
Token – example 1
Input text: if( x >= y ) y = 10;
Token stream: IF LP ID(x) GEQ ID(y) RP ID(y) Assign INT(10) SEMI
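The token kinds in this example can be modelled with a small C data type. The following is only a sketch: the enum, the struct layout, and the helper names (make_simple, make_id, make_int, token_to_string) are assumptions made for illustration and do not come from the tutorial. It shows how an identifier carries its actual name and a constant carries its value, while other tokens carry only a kind.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Token kinds taken from the example above; the enum itself is hypothetical. */
typedef enum {
    TOK_IF, TOK_LP, TOK_RP, TOK_ID, TOK_GEQ, TOK_ASSIGN, TOK_INT, TOK_SEMI
} TokenKind;

typedef struct {
    TokenKind kind;
    char name[32];  /* actual name, meaningful only when kind == TOK_ID  */
    int value;      /* numeric value, meaningful only when kind == TOK_INT */
} Token;

/* Token with no attribute, e.g. LP or SEMI. */
Token make_simple(TokenKind kind) {
    Token t;
    memset(&t, 0, sizeof t);
    t.kind = kind;
    return t;
}

/* Identifier token; names longer than 31 characters are truncated. */
Token make_id(const char *name) {
    Token t = make_simple(TOK_ID);
    strncpy(t.name, name, sizeof t.name - 1);
    return t;
}

/* Integer constant token. */
Token make_int(int value) {
    Token t = make_simple(TOK_INT);
    t.value = value;
    return t;
}

/* Format a token in the notation used on the slide, e.g. ID(x) or INT(10). */
const char *token_to_string(const Token *t, char *buf, size_t size) {
    static const char *const names[] = {
        "IF", "LP", "RP", "ID", "GEQ", "Assign", "INT", "SEMI"
    };
    assert(t != NULL && buf != NULL && size > 0);
    switch (t->kind) {
    case TOK_ID:  snprintf(buf, size, "ID(%s)", t->name);   break;
    case TOK_INT: snprintf(buf, size, "INT(%d)", t->value); break;
    default:      snprintf(buf, size, "%s", names[t->kind]); break;
    }
    return buf;
}
```

With these helpers, the statement above would be stored as ten Token values, starting with make_simple(TOK_IF) and ending with make_simple(TOK_SEMI).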
Parser
Tokens: IF LP ID(x) GEQ ID(y) RP ID(y) Assign INT(10) SEMI
Resulting tree: IfStmt( >=( ID(x), ID(y) ), assign( ID(y), INT(10) ) )
Sample Grammar
• program ::= statement | program statement
• statement ::= assignStmt | ifStmt
• assignStmt ::= id = expr ;
• ifStmt ::= if ( expr ) statement
• expr ::= id | int | expr + expr
• id ::= a | b | … | y | z
• int ::= 1 | 2 | … | 9 | 0
Why Separate the Scanner and Parser?
• Simplicity & Separation of Concerns
 - Scanner hides details from parser (comments, whitespace, input files, etc.)
 - Parser is easier to build; has simpler input stream
• Efficiency
 - Scanner can use simpler, faster design
 - (But still often consumes a surprising amount of the compiler’s total execution time)
Principle of Longest Match
• In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice.
• Example: return apple != banana;
• This should be recognized as 5 tokens, not more (not parts of words or identifiers, and not ! and = as separate tokens):
return ID(apple) NEQ ID(banana) SEMI
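One way to see the longest-match rule in action is to try every known operator against the input and keep the longest one that fits. The sketch below assumes a small, hypothetical operator table (OPS) and function name (longest_operator); neither is part of Lex, and the token names follow the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *lexeme;  /* characters that make up the operator */
    const char *token;   /* token name reported to the parser    */
} OpEntry;

/* Hypothetical operator table; "!" is a prefix of "!=" and "<" of "<=". */
static const OpEntry OPS[] = {
    {"!", "NOT"}, {"!=", "NEQ"}, {"<", "LESS"}, {"<=", "LEQ"},
    {"=", "Assign"}, {";", "SEMI"}
};

/* Return the token name of the longest operator that is a prefix of s and
   store its length in *len. Returns NULL (and length 0) if none matches. */
const char *longest_operator(const char *s, size_t *len) {
    const char *best = NULL;
    assert(s != NULL && len != NULL);
    *len = 0;
    for (size_t i = 0; i < sizeof OPS / sizeof OPS[0]; i++) {
        size_t n = strlen(OPS[i].lexeme);
        /* Only replace the current candidate with a strictly longer match. */
        if (n > *len && strncmp(s, OPS[i].lexeme, n) == 0) {
            best = OPS[i].token;
            *len = n;
        }
    }
    return best;
}
```

For the input "!= banana;" both "!" and "!=" are prefixes, and the function reports NEQ with length 2, so ! and = are never returned as separate tokens.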
Scanner DFA Example (1)
• State 0 (start): loops on white space or comments
• 0 --end of input--> 1: Accept EOF
• 0 --(--> 2: Accept LP
• 0 --)--> 3: Accept RP
• 0 --;--> 4: Accept SEMI
Scanner DFA Example (2)
• Start state: loops on white space or comments
• start --!--> 5
• 5 --=--> 6: Accept NEQ
• 5 --other--> 7: Accept NOT
• start --<--> 8
• 8 --=--> 9: Accept LEQ
• 8 --other--> 10: Accept LESS
Scanner DFA Example (3)
• Start state: loops on white space or comments
• start --[0-9]--> 11
• 11 --[0-9]--> 11
• 11 --other--> 12: Accept INT
Scanner DFA Example (4)
• Start state: loops on white space or comments
• start --[a-zA-Z]--> 13
• 13 --[a-zA-Z]--> 13
• 13 --other--> 14: Accept ID or keyword
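The four DFA fragments can be combined into one hand-written scanner routine. The sketch below is one possible reading of the diagrams, not the tutorial's own code: the function name next_token and the keyword tables are assumptions, comment skipping is left out, and any character the diagrams do not cover is reported as ERROR. Each "other" edge is implemented by accepting without consuming the lookahead character.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *const KEYWORDS[]  = {"if", "while", "return", "goto"};
static const char *const KW_TOKENS[] = {"IF", "WHILE", "RETURN", "GOTO"};

/* Scan one token starting at *pp, write its name into out, and advance *pp
   past the token. State numbers in the comments refer to the DFA slides. */
void next_token(const char **pp, char *out, size_t size) {
    const char *p, *start;
    assert(pp != NULL && *pp != NULL && out != NULL && size > 0);
    p = *pp;
    while (isspace((unsigned char)*p)) p++;   /* state 0: skip white space */
    start = p;
    if (*p == '\0') {                          /* state 1 */
        snprintf(out, size, "EOF");
    } else if (*p == '(') {                    /* state 2 */
        p++; snprintf(out, size, "LP");
    } else if (*p == ')') {                    /* state 3 */
        p++; snprintf(out, size, "RP");
    } else if (*p == ';') {                    /* state 4 */
        p++; snprintf(out, size, "SEMI");
    } else if (*p == '!') {                    /* state 5 */
        p++;
        if (*p == '=') { p++; snprintf(out, size, "NEQ"); }  /* state 6 */
        else snprintf(out, size, "NOT");                     /* state 7 */
    } else if (*p == '<') {                    /* state 8 */
        p++;
        if (*p == '=') { p++; snprintf(out, size, "LEQ"); }  /* state 9 */
        else snprintf(out, size, "LESS");                    /* state 10 */
    } else if (isdigit((unsigned char)*p)) {   /* state 11 */
        while (isdigit((unsigned char)*p)) p++;
        snprintf(out, size, "INT(%.*s)", (int)(p - start), start); /* 12 */
    } else if (isalpha((unsigned char)*p)) {   /* state 13 */
        size_t n;
        const char *kw = NULL;
        while (isalpha((unsigned char)*p)) p++;
        n = (size_t)(p - start);
        for (size_t i = 0; i < 4; i++)         /* state 14: ID or keyword */
            if (strlen(KEYWORDS[i]) == n && strncmp(start, KEYWORDS[i], n) == 0)
                kw = KW_TOKENS[i];
        if (kw) snprintf(out, size, "%s", kw);
        else snprintf(out, size, "ID(%.*s)", (int)n, start);
    } else {                                   /* not covered by the slides */
        p++; snprintf(out, size, "ERROR");
    }
    *pp = p;
}
```

Calling next_token repeatedly on "return apple != banana;" yields RETURN, ID(apple), NEQ, ID(banana), SEMI and then EOF. Because the inner loops keep consuming while the next character still fits, the routine also follows the longest-match principle.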
Lex/Flex
• Use Flex instead of Lex
• Use Bison instead of yacc
• When compiling, link with the lex library
• flex file.lex
• gcc -o object lex.yy.c -ll (with flex, the library is usually linked as -lfl)
• ./object
Lex - Structure
• Declarations/Definitions
%%
• Rules/Productions, each made up of:
 - Lex expression
 - white space
 - C statement (optional)
%%
• Additional Code/Subroutines
Lex – Basic operators
• * - zero or more occurrences
• . - any single character (except newline)
• .* - matches any sequence of characters (up to the end of the line)
• | - alternation: a|b matches a or b
• + - one or more occurrences (a+ ::= aa*)
• ? - zero or one occurrence (b? ::= b | empty)
• [ ] - character class (choice): [12345] is (1|2|3|4|5). Note: [*+] is a choice between star and plus; inside brackets they lose their special meaning.
• - - range inside brackets: [a-zA-Z] is a to z and A to Z, all the letters
• \ - escape: \* matches a literal *, and \. matches a period or decimal point
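These operators can be tried out from plain C without running Lex. The sketch below uses the POSIX regex.h interface with extended regular expressions, which is a stand-in and not Lex itself: the operators listed above behave the same way for these simple patterns, although a POSIX . may also match a newline. The helper name full_match is an assumption, and it wraps the pattern in ^( )$ so that the whole text must match.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the extended regular expression
   pattern, 0 otherwise (including when the pattern fails to compile). */
int full_match(const char *pattern, const char *text) {
    char anchored[256];
    regex_t re;
    int n, ok;
    assert(pattern != NULL && text != NULL);
    /* Anchor the pattern so that partial matches do not count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) return 0;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return 0;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For example, full_match("a+", "aaa") and full_match("aa*", "aaa") both succeed, which illustrates a+ ::= aa*. full_match("[*+]", "+") succeeds because the plus is an ordinary character inside brackets, and full_match("3\\.14", "3x14") fails because the escaped period only matches a literal period.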