
Building a Lexical Analyzer: Roles, Design, and Implementation

Discover the crucial roles, design steps, and implementation concerns in building a lexical analyzer for programming languages, with a focus on symbols, regular expressions, DFA, and backtracking. Enhance your understanding of token recognition and state transition tables.

fgoldstein

Presentation Transcript


  1. Lecture 2 Lexical Analysis Joey Paquet, 2000, 2002, 2012

  2. Part I: Building a Lexical Analyzer

  3. Roles of the Scanner
     • Removal of comments
         • Comments are not part of the program's meaning
         • Multiple-line comments?
         • Nested comments?
     • Case conversion
         • Is the lexical definition case sensitive?
             • For identifiers
             • For keywords

  4. Roles of the Scanner
     • Removal of white space
         • Blanks, tabs, carriage returns
         • Is it possible to identify tokens in a program without spaces?
     • Interpretation of compiler directives
         • #include, #ifdef, #ifndef and #define are directives to "redirect the input" of the compiler
         • May be done by a pre-compiler

  5. Roles of the Scanner
     • Communication with the symbol table
         • A symbol table entry is created when an identifier is encountered
         • The lexical analyzer cannot fill in the whole entry
     • Conversion of the input file to a token stream
         • The input file is a character stream
         • Lexical conventions: literals, operators, keywords, punctuation

  6. Tokens and Lexemes
     • Token: An element of the lexical definition of the language.
     • Lexeme: A sequence of characters identified as a token.

     Token      Lexemes
     id         distance, rate, time, a, x
     relop      >=, <, ==
     openpar    (
     if         if
     then       then
     assignop   =
     semi       ;
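The token/lexeme distinction on slide 6 can be made concrete with a small record type. This is only an illustrative sketch: the `Token` class and its field names are assumptions, not something the slides define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical record pairing a token type with the matched lexeme."""
    kind: str    # token type, e.g. "id" or "relop"
    lexeme: str  # the exact characters matched in the source


# Pairs taken from slide 6.
examples = [
    Token("id", "distance"),
    Token("relop", ">="),
    Token("openpar", "("),
    Token("assignop", "="),
    Token("semi", ";"),
]

# Several different lexemes can share one token type.
relop_tokens = [Token("relop", lexeme) for lexeme in (">=", "<", "==")]
```

The parser usually only needs `kind`; `lexeme` matters for identifiers and literals, where the actual characters carry meaning.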

  7. Design of a Lexical Analyzer
     • Steps
         1. Construct a set of regular expressions (REs) that define the form of any valid token
         2. Derive an NDFA from the REs
         3. Derive a DFA from the NDFA
         4. Translate to a state transition table
         5. Implement the table
         6. Implement the algorithm to interpret the table

  8. Regular Expressions
     • Example: id ::= letter(letter|digit)*
     • Notation:
         ∅     : { }
         s     : {s | s in s^}
         a     : {a}
         r | s : {r | r in r^} or {s | s in s^}
         s*    : {s^n | s in s^ and n >= 0}
         s+    : {s^n | s in s^ and n >= 1}
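The id rule from slide 8 maps directly onto a regular expression in most libraries. The sketch below uses Python's `re` module and assumes that letter means an ASCII letter and digit means 0 to 9; the helper name `is_identifier` is hypothetical.

```python
import re

# id ::= letter(letter|digit)*
# Assumption: letter = [A-Za-z], digit = [0-9].
ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(text: str) -> bool:
    """Return True if the whole string is a valid id lexeme."""
    return ID_RE.fullmatch(text) is not None
```

`fullmatch` is used so that a string such as `1x` is rejected rather than partially matched.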

  9. Derive NDFA from REs
     • Could derive a DFA directly from the REs, but:
         • It is much easier to build an NDFA, then derive the DFA
         • No standard way of deriving DFAs from REs
     • Use Thompson's construction (cf. Louden)
     [Figure: NDFA with edges labeled letter, digit and ε]

  10. Derive DFA from NDFA
      • Use subset construction (cf. Louden)
      • May be optimized
      • Easier to implement:
          • No ε edges
          • Deterministic (no backtracking)
      [Figure: NDFA and derived DFA, with edges labeled letter, digit and [other]]
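The subset construction can be sketched in a few lines. The NDFA below is a hand-built, simplified Thompson-style automaton for letter(letter|digit)*; its state numbers, the `EPS` marker and all function names are assumptions made for illustration, not the automaton drawn on the slides.

```python
from collections import deque

EPS = None  # label used for epsilon edges (assumed convention)

# NFA[state][label] -> set of successor states; state 4 is accepting.
NFA = {
    0: {"letter": {1}},
    1: {EPS: {2, 4}},
    2: {"letter": {3}, "digit": {3}},
    3: {EPS: {2, 4}},
    4: {},
}
NFA_ACCEPT = 4


def eps_closure(states):
    """All states reachable from `states` using only epsilon edges."""
    closure, stack = set(states), list(states)
    while stack:
        for target in NFA[stack.pop()].get(EPS, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def move(states, symbol):
    """States reachable from `states` on one `symbol` edge."""
    out = set()
    for state in states:
        out |= NFA[state].get(symbol, set())
    return out


def subset_construction(start, alphabet):
    """Return (start_set, dfa) where each DFA state is a set of NFA states."""
    start_set = eps_closure({start})
    dfa, work = {}, deque([start_set])
    while work:
        current = work.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            target = eps_closure(move(current, symbol))
            if target:  # an empty set means "no transition"
                dfa[current][symbol] = target
                work.append(target)
    return start_set, dfa


START, DFA = subset_construction(0, ["letter", "digit"])
ACCEPTING = {s for s in DFA if NFA_ACCEPT in s}
```

The result has three DFA states, two of them accepting with identical outgoing edges; merging those two is the kind of optimization the slide alludes to.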

  11. Generate State Transition Table

      state   letter   digit   other   final
      0       1                        N
      1       1        1       2       N
      2                                Y

      [Figure: DFA with states 0, 1 and 2; edges labeled letter, digit and [other]]
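The table on slide 11 can be stored as a nested dictionary and interpreted by a short loop. This is a minimal sketch: the `classify` and `match_id` helpers are hypothetical, letters and digits are assumed to be ASCII, and a newline sentinel stands in for the [other] character at end of input.

```python
# Rows are states, columns are character classes (from slide 11).
TABLE = {
    0: {"letter": 1},
    1: {"letter": 1, "digit": 1, "other": 2},
    2: {},
}
FINAL = {2}


def classify(ch):
    """Map a character to a column of the table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def match_id(text):
    """Return the id lexeme at the start of `text`, or None if there is none."""
    state = 0
    for i, ch in enumerate(text + "\n"):  # sentinel acts as an [other] character
        state = TABLE[state].get(classify(ch))
        if state is None:
            return None  # empty cell: no valid transition
        if state in FINAL:
            return text[:i]  # the [other] character is not part of the lexeme
    return None
```

State 2 is reached only after reading one character too many, which is why the lexeme stops at `text[:i]`.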

  12. Implementation Concerns
      • Backtracking
          • Principle: A token is normally recognized only when the next character is read.
          • Problem: This character may be part of the next token.
          • Example: x<1. "<" is recognized only when "1" is read. In this case, we have to back up one character to continue token recognition.
          • These cases can be recorded in the state transition table.
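The x<1 case can be illustrated with a character reader that supports a one-character backup. The `CharReader` class and `scan_relop` function are hypothetical helpers written for this sketch; the empty string is assumed to signal end of input.

```python
class CharReader:
    """Character source with one-character backup (hypothetical helper)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_char(self):
        """Return the next character, or "" at end of input."""
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def backup_char(self):
        """Un-read the most recently read character."""
        self.pos -= 1


def scan_relop(reader):
    """Recognize <, <= or <>, assuming the reader is positioned at '<'."""
    if reader.next_char() != "<":
        raise ValueError("expected '<'")
    c = reader.next_char()
    if c == "=":
        return "<="
    if c == ">":
        return "<>"
    reader.backup_char()  # c belongs to the next token
    return "<"
```

After `scan_relop` returns `<` for the input `<1`, the next call to `next_char` yields `1` again, so the following token is not damaged.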

  13. Implementation Concerns
      • Ambiguity
          • Problem: Some tokens' lexemes are substrings of other tokens' lexemes.
          • Example: n-1. Is it <n><-><1> or <n><-1>?
          • Solutions:
              • Postpone the decision to the syntactic analyzer
              • Do not allow a sign prefix on numbers in the lexical specification
              • Interact with the syntactic analyzer to find a solution (induces coupling)
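The second solution, leaving the sign out of the lexical specification, can be sketched with a regex-based tokenizer. The token names and the `tokenize` helper are assumptions; note that `finditer` silently skips characters that match no rule, which a real scanner would report as errors.

```python
import re

# num has no sign prefix, so "-" is always its own token.
TOKEN_RE = re.compile(
    r"(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<minus>-)"
)


def tokenize(text):
    """Return (token type, lexeme) pairs for every match in `text`."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

With this specification `n-1` always scans as three tokens, and the parser decides whether the minus is binary or unary.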

  14. Example
      • Alphabet: {:, *, =, (, ), <, >, {, }, [a..z], [0..9]}
      • Simple tokens: {(, ), {, }, :, <, >}
      • Composite tokens: {:=, >=, <=, <>, (*, *)}
      • Words:
          • id ::= letter(letter | digit)*
          • num ::= digit*

  15. Example
      • Ambiguity problems:

        Character   Possible tokens
        :           :, :=
        >           >, >=
        <           <, <=, <>
        (           (, (*
        *           *, *)

      • Backtracking:
          • Must back up a character when we read a character that is part of the next token.
          • Occurrences are coded in the table
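One common way to resolve the ambiguities listed on slide 15 is to prefer the longest symbol that matches. The sketch below assumes that policy; the function name is hypothetical, and `*` and `=` are added to the simple tokens as an assumption because slide 15 lists `*` as a possible token.

```python
COMPOSITE = {":=", ">=", "<=", "<>", "(*", "*)"}
# Slide 14's simple tokens, plus "*" and "=" (assumption, see lead-in).
SIMPLE = {"(", ")", "{", "}", ":", "<", ">", "*", "="}


def longest_symbol(text, pos):
    """Return the longest symbol starting at `pos`, or None if there is none."""
    two = text[pos:pos + 2]
    if two in COMPOSITE:
        return two
    one = text[pos:pos + 1]
    return one if one in SIMPLE else None
```

Looking at two characters before committing is the same one-character lookahead that the backtracking bullet describes.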

  16. [Figure: DFA for the example, states 1 to 20, with transitions labeled l, d, sp, {, }, (, ), *, :, =, < and >; the legend distinguishes final states from final states with backtracking]

  17. Table-driven Scanner (Table)

  18. Table-driven Scanner (Algorithm)

      nextToken()
        state = 0
        token = null
        do
          lookup = nextChar()
          state = Table(state, lookup)
          if (isFinalState(state))
            token = createToken()
            if (Table(state, "backup") == yes)
              backupChar()
        until (token != null)
        return (token)

  19. Table-driven Scanner
      • nextToken()
          • Extracts the next token in the program (called by the syntactic analyzer)
      • nextChar()
          • Reads the next character in the input program
      • backupChar()
          • Backs up one character in the input file

  20. Table-driven Scanner
      • isFinalState(state)
          • Returns TRUE if state is a final state
      • table(state, column)
          • Returns the value corresponding to [state, column] in the state transition table.
      • createToken()
          • Creates and returns a structure that contains the token type, its location in the source code, and its value (for literals).
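The algorithm and helper functions of slides 18 to 20 can be sketched as a small Python class. Everything specific here is an assumption made for illustration: the table covers only id, num, ":" and ":=", the state numbers do not match the DFA on slide 16, and the backup flag is stored next to the token type rather than in a "backup" column.

```python
# Transition table for a tiny subset of the example language.
TABLE = {
    0: {"letter": 1, "digit": 3, ":": 5, "sp": 0},
    1: {"letter": 1, "digit": 1},
    3: {"digit": 3},
    5: {"=": 6},
}
# Target state for any column not listed above (the "other" column).
OTHER = {1: 2, 3: 4, 5: 7}
# Final states: token type and whether one character must be backed up.
FINAL = {2: ("ID", True), 4: ("NUM", True), 6: ("ASSIGNOP", False), 7: ("COLON", True)}


def column(ch):
    """Map a character to a table column."""
    if ch == "\0":
        return "eof"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch in " \t\r\n":
        return "sp"
    return ch


class Scanner:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def next_char(self):
        """Return the next character, or "\\0" at end of input."""
        ch = self.text[self.pos] if self.pos < len(self.text) else "\0"
        self.pos += 1
        return ch

    def backup_char(self):
        self.pos -= 1

    def next_token(self):
        """Return the next (type, lexeme) pair, or None at end of input."""
        state, start = 0, self.pos
        while True:
            ch = self.next_char()
            col = column(ch)
            if state == 0 and col == "eof":
                return None
            if state == 0 and col == "sp":
                start = self.pos  # skip leading white space
            nxt = TABLE[state].get(col, OTHER.get(state))
            if nxt is None:
                raise ValueError(f"unexpected character {ch!r}")
            state = nxt
            if state in FINAL:
                kind, backup = FINAL[state]
                if backup:
                    self.backup_char()
                return (kind, self.text[start:self.pos])


def tokenize(text):
    """Collect all tokens of `text` by calling next_token repeatedly."""
    scanner, tokens = Scanner(text), []
    token = scanner.next_token()
    while token is not None:
        tokens.append(token)
        token = scanner.next_token()
    return tokens
```

Supporting more tokens only requires new rows in `TABLE`, `OTHER` and `FINAL`; the `next_token` loop stays unchanged, which is the main attraction of the table-driven approach.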

  21. Hand-written Scanner

      nextToken()
        c = nextChar()
        case (c) of
          "[a..z],[A..Z]":
            c = nextChar()
            while (c in {[a..z],[A..Z],[0..9]}) do
              s = makeUpString()
              c = nextChar()
            if ( isReservedWord(s) ) then
              token = createToken(RESWORD,null)
            else
              token = createToken(ID,s)
            backupChar()
          "[0..9]":
            c = nextChar()
            while (c in [0..9]) do
              v = makeUpValue()
              c = nextChar()
            token = createToken(NUM,v)
            backupChar()

  22. Hand-written Scanner

          "{":
            c = nextChar()
            while ( c != "}" ) do
              c = nextChar()
          "(":
            c = nextChar()
            if ( c == "*" ) then
              c = nextChar()
              repeat
                while ( c != "*" ) do
                  c = nextChar()
                c = nextChar()
              until ( c == ")" )
              return
            else
              token = createToken(LPAR,null)
              backupChar()
          ":":
            c = nextChar()
            if ( c == "=" ) then
              token = createToken(ASSIGNOP,null)
            else
              token = createToken(COLON,null)
              backupChar()

  23. Hand-written Scanner

          "<":
            c = nextChar()
            if ( c == "=" ) then
              token = createToken(LEQ,null)
            else if ( c == ">" ) then
              token = createToken(NEQ,null)
            else
              token = createToken(LT,null)
              backupChar()
          ">":
            c = nextChar()
            if ( c == "=" ) then
              token = createToken(GEQ,null)
            else
              token = createToken(GT,null)
              backupChar()
          ")": token = createToken(RPAR,null)
          "*": token = createToken(STAR,null)
          "=": token = createToken(EQ,null)
        end case
        return token
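The pseudocode of slides 21 to 23 translates fairly directly into a runnable sketch. Several details are assumptions: the reserved-word set, skipping white space and comments inside the loop instead of returning, returning the lexeme for every token (the slides pass null for most), and raising `ValueError` on an unexpected character.

```python
RESERVED = {"if", "then"}  # assumed; the slides do not list reserved words
SIMPLE = {")": "RPAR", "*": "STAR", "=": "EQ"}


class HandScanner:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def next_char(self):
        """Return the next character, or "" at end of input."""
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def backup_char(self):
        self.pos -= 1

    def skip_star_comment(self):
        """Skip up to and including the closing *) or the end of input."""
        prev, c = "", self.next_char()
        while c != "" and not (prev == "*" and c == ")"):
            prev, c = c, self.next_char()

    def next_token(self):
        """Return the next (type, value) pair, or None at end of input."""
        while True:
            c = self.next_char()
            if c == "":
                return None
            if c in " \t\r\n":
                continue
            if c.isascii() and c.isalpha():
                s = c
                c = self.next_char()
                while c.isascii() and c.isalnum():
                    s += c
                    c = self.next_char()
                self.backup_char()  # c belongs to the next token
                return ("RESWORD", s) if s in RESERVED else ("ID", s)
            if c.isascii() and c.isdigit():
                v = c
                c = self.next_char()
                while c.isascii() and c.isdigit():
                    v += c
                    c = self.next_char()
                self.backup_char()
                return ("NUM", int(v))
            if c == "{":  # { ... } comment
                while c not in ("}", ""):
                    c = self.next_char()
                continue
            if c == "(":
                if self.next_char() == "*":
                    self.skip_star_comment()
                    continue
                self.backup_char()
                return ("LPAR", "(")
            if c == ":":
                if self.next_char() == "=":
                    return ("ASSIGNOP", ":=")
                self.backup_char()
                return ("COLON", ":")
            if c == "<":
                c = self.next_char()
                if c == "=":
                    return ("LEQ", "<=")
                if c == ">":
                    return ("NEQ", "<>")
                self.backup_char()
                return ("LT", "<")
            if c == ">":
                if self.next_char() == "=":
                    return ("GEQ", ">=")
                self.backup_char()
                return ("GT", ">")
            if c in SIMPLE:
                return (SIMPLE[c], c)
            raise ValueError(f"invalid character {c!r}")


def scan_all(text):
    """Collect all tokens of `text`."""
    scanner, tokens = HandScanner(text), []
    token = scanner.next_token()
    while token is not None:
        tokens.append(token)
        token = scanner.next_token()
    return tokens
```

Compared with the table-driven version, each token's logic is visible in one place, at the cost of having to edit code whenever the lexical specification changes.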

  24. Part II: Error Recovery in Lexical Analysis

  25. Possible Lexical Errors
      • Depends on the accepted conventions:
          • Invalid character
          • Letter not allowed to terminate a number
          • Numerical overflow
          • Identifier too long
          • End of line before end of string
      • Are these lexical errors?

  26. Accepted or Not?
      • 123a
          • <Error> or <num><id>?
      • 123456789012345678901234567
          • <Error> related to the machine's limitations
      • "Hello world <CR>
          • Either <CR> is skipped or <Error>
      • ThisIsAVeryLongVariableName = 1
          • Limit identifier length?

  27. Error Recovery Techniques
      • Finding only the first error is not acceptable
      • Panic mode:
          • Skip characters until a valid character is read
      • Guess mode:
          • Do pattern matching between erroneous strings and valid strings
          • Example: beggin vs. begin
          • Rarely implemented
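Both techniques can be sketched briefly. The `VALID` set follows the alphabet of slide 14 plus white space, and the keyword list is an assumption; guess mode is approximated here with the standard library's `difflib.get_close_matches`, which is one possible way to do the pattern matching, not necessarily what the author had in mind.

```python
import difflib

# Alphabet from slide 14 plus white space (assumption).
VALID = set("abcdefghijklmnopqrstuvwxyz0123456789:*=()<>{} \t\r\n")


def panic_mode(text):
    """Drop runs of invalid characters; return (clean_text, errors).

    Each error is a (position, skipped_characters) pair, so scanning can
    continue and later errors are still reported.
    """
    errors, kept, i = [], [], 0
    while i < len(text):
        if text[i] in VALID:
            kept.append(text[i])
            i += 1
            continue
        start = i
        while i < len(text) and text[i] not in VALID:
            i += 1  # skip until a valid character is read
        errors.append((start, text[start:i]))
    return "".join(kept), errors


def guess_keyword(word, keywords=("begin", "end", "if", "then")):
    """Return the closest keyword to `word`, or None if nothing is close."""
    matches = difflib.get_close_matches(word, keywords, n=1)
    return matches[0] if matches else None
```

Panic mode is cheap and predictable, whereas guess mode has to weigh a helpful suggestion against the risk of silently turning a legitimate identifier into a keyword.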

  28. Conclusions

  29. Possible Implementations
      • Lexical analyzer generator (e.g. Lex)
          • Safe, quick
          • Must learn the software; unable to handle unusual situations
      • Table-driven lexical analyzer
          • General and adaptable method; the same function can be used for all table-driven lexical analyzers
          • Building the transition table can be tedious and error-prone

  30. Possible Implementations
      • Hand-written
          • Can be optimized, can handle any unusual situation, easy to build for most languages
          • Error-prone, not adaptable or maintainable

  31. Lexical Analyzer's Modularity
      • Why should the lexical analyzer and the syntactic analyzer be separated?
          • Modularity/maintainability: the system is more modular, thus more maintainable
          • Efficiency: modularity = task specialization = easier optimization
          • Reusability: the whole lexical analyzer can be changed without changing other parts
