
Lexical Analysis



  1. Lexical Analysis Cheng-Chia Chen

  2. Outline • Introduction to lexical analyzer • Tokens • Regular expressions (RE) • Finite automata (FA) • deterministic and nondeterministic finite automata (DFA and NFA) • from RE to NFA • from NFA to DFA • from DFA to optimized DFA

  3. Outline • Informal sketch of lexical analysis • Identifies tokens in input string • Issues in lexical analysis • Lookahead • Ambiguities • Specifying lexers • Regular expressions • Examples of regular expressions

  4. The Structure of a Compiler • Source → Lexical analysis → Tokens → Parsing → Interm. Language → Optimization → Code Gen. → Machine Code

  5. Lexical Analysis • What do we want to do? Example: if (i == j) z = 0; else z = 1; • The input is just a sequence of characters: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Goal: Partition the input string into substrings • And determine the categories (tokens) to which the substrings belong

  6. What’s a Token? • Output of lexical analysis is a stream of tokens • A token is a syntactic category • In English: noun, verb, adjective, … • In a programming language: Identifier, Integer, Keyword, Whitespace, … • Parser relies on the token distinctions: identifiers are treated differently than keywords

  7. Aspects of Token • Language view: the set of strings belonging to that token • [while], [identifier], [arithOp] • Pattern (grammar): a rule defining a token • [while]: while • [identifier]: letter followed by letters and digits • [arithOp]: + or - or * or / • Lexeme (member): a string matched by the pattern of a token • while, var23, count, +, *

  8. Attributes of Tokens • Attributes are used to distinguish different lexemes in a token • [while,] • [identifier, var35] • [arithOp, +] • [integer, 26] • [string, “26”] • positional information: start/end line/position • Tokens affect syntax analysis and attributes affect semantic analysis and error handling

  9. Lexical Analyzer: Implementation • An implementation must do two things: • Recognize substrings corresponding to tokens • Return the attributes (lexeme and positional information) of the token • The lexeme is either the substring itself or some data object constructed from the substring.

  10. Example • input lines: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Token-lexeme pairs returned by the lexer: • [Whitespace, “\t”] • [if, ] • [OpenPar, “(”] • [Identifier, “i”] • [Relation, “==”] • [Identifier, “j”] • …
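
A minimal sketch of such a lexer loop in Python (not the slides' code; the rule names and patterns are illustrative assumptions). It scans left to right and emits token–lexeme pairs like the ones above:

```python
import re

# Illustrative rules; order matters (keywords before Identifier, cf. slide 42).
# Note: this sketch takes the first rule that matches at each position;
# longest-match handling is discussed on slides 41-42.
TOKEN_RULES = [
    ("Whitespace", r"[ \t\n]+"),
    ("if",         r"if"),
    ("else",       r"else"),
    ("Relation",   r"==|!=|<=|>=|<|>"),
    ("Assign",     r"="),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
    ("Semi",       r";"),
    ("Integer",    r"[0-9]+"),
    ("Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
]

def tokenize(text):
    """Yield (token, lexeme) pairs, scanning left to right."""
    pos = 0
    while pos < len(text):
        for token, pattern in TOKEN_RULES:
            m = re.match(pattern, text[pos:])
            if m:
                yield (token, m.group())
                pos += len(m.group())
                break
        else:
            raise SyntaxError(f"no rule matches at position {pos}")

print(list(tokenize("\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;")))
```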

  11. Lexical Analyzer: Implementation • The lexer usually discards “uninteresting” tokens that don’t contribute to parsing. • Examples: Whitespace, Comments • Question: What happens if we remove all whitespace and all comments prior to lexing?

  12. Lexical Analysis in FORTRAN • FORTRAN rule: Whitespace is insignificant • E.g., VAR1 is the same as VA R1 • Footnote: the FORTRAN whitespace rule was motivated by the inaccuracy of punch-card operators

  13. Example: a terrible design! • Consider • DO 5 I = 1,25 • DO 5 I = 1.25 • The first is a DO loop: DO 5 I = 1 , 25 • The second is an assignment: DO5I = 1.25 • Reading left-to-right, we cannot tell whether DO5I is a variable or a DO statement until the “,” (or “.”) is reached

  14. Lexical Analysis in FORTRAN. Lookahead. • Two important points: • The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time • “Lookahead” may be required to decide where one token ends and the next token begins • Even our simple example has lookahead issues: i vs. if, = vs. ==
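
A tiny sketch (an illustration, not the slides' code) of how one character of lookahead separates = from ==:

```python
def scan_operator(text, pos):
    """Decide between '=' and '==' by peeking one character ahead."""
    if text[pos] == "=":
        # Lookahead: peek at the next character without consuming it.
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("Relation", "==", pos + 2)   # consumed two characters
        return ("Assign", "=", pos + 1)          # consumed one character
    raise ValueError("not an operator at this position")

print(scan_operator("i == j", 2))   # ('Relation', '==', 4)
print(scan_operator("z = 0", 2))    # ('Assign', '=', 3)
```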

  15. Lexical Analysis in PL/I • PL/I keywords are not reserved: IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN • PL/I declarations: DECLARE (ARG1, ..., ARGN) • Can’t tell whether DECLARE is a keyword or an array reference until after the “)” • Requires arbitrary lookahead!

  16. Review • The goal of lexical analysis is to • Partition the input string into lexemes • Identify the token of each lexeme • Left-to-right scan => lookahead sometimes required

  17. Next • We still need • A way to describe the lexemes of each token • A way to resolve ambiguities • Is if two variables i and f? • Is == two equal signs = =?

  18. Regular Expressions and Regular Languages • There are several formalisms for specifying tokens • Regular languages are the most popular • Simple and useful theory • Easy to understand • Efficient implementations

  19. Languages • Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ (Σ is called the alphabet)

  20. Examples of Languages • Alphabet = English characters, Language = English words • Not every string of English characters is an English word: likes, school, … are words; beee, yykk, … are not • Alphabet = ASCII, Language = C programs • Note: the ASCII character set is different from the English character set

  21. Notation • Languages are sets of strings. • Need some notation for specifying which sets we want • For lexical analysis we care about regular languages, which can be described using regular expressions.

  22. Regular Expressions and Regular Languages • Each regular expression is a notation for a regular language (a set of words) • If A is a regular expression then we write L(A) to refer to the language denoted by A

  23. Atomic Regular Expressions • Single character: c, L(c) = { c } (for any c ∈ Σ) • Epsilon (empty string): ε, L(ε) = { ε }

  24. Compound Regular Expressions • Union (or choice): L( A | B ) = { s | s ∈ L(A) or s ∈ L(B) } • Concatenation: A·B (where A and B are reg. exp.), L(A·B) = { ab | a ∈ L(A) and b ∈ L(B) } • Note: • A·B (set concatenation) and a·b (string concatenation) will be abbreviated to AB and ab, respectively. • Examples: • if | then | else = { if, then, else } • 0 | 1 | … | 9 = { 0, 1, …, 9 } • Another example: (0 | 1)(0 | 1) = { 00, 01, 10, 11 }
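
To make the set definitions concrete, here is an illustrative sketch that models finite regular languages as Python sets, so union and concatenation can be computed directly (the helper names are assumptions):

```python
# Illustrative only: finite regular languages modelled as Python sets.
def union(A, B):
    # L(A | B) = { s | s in L(A) or s in L(B) }
    return A | B

def concat(A, B):
    # L(A·B) = { ab | a in L(A) and b in L(B) }
    return {a + b for a in A for b in B}

bit = {"0", "1"}
print(union({"if", "then"}, {"else"}))  # {'if', 'then', 'else'}
print(concat(bit, bit))                 # {'00', '01', '10', '11'}
```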

  25. More Compound Regular Expressions • So far we do not have a notation for infinite languages • Iteration: A* L(A*) = { ε } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ … • Examples: 0* = { ε, 0, 00, 000, … } 1 0* = { strings starting with 1 and followed by 0’s }
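
A* is infinite, so a sketch can only enumerate a finite approximation; this illustration builds the words of A* reachable with at most n concatenations:

```python
def star_up_to(A, n):
    """Words of L(A)* built from at most n concatenations
    (a finite approximation of the infinite language A*)."""
    words = {""}                  # epsilon: zero concatenations
    layer = {""}
    for _ in range(n):
        layer = {w + a for w in layer for a in A}   # L(A^(k+1)) from L(A^k)
        words |= layer
    return words

print(sorted(star_up_to({"0"}, 3)))        # ['', '0', '00', '000']
print(sorted(star_up_to({"0", "1"}, 2)))   # ['', '0', '00', '01', '1', '10', '11']
```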

  26. Example: Keyword • Keyword: else or if or begin… else | if | begin | …

  27. Example: Integers • Integer: a non-empty string of digits ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )* • Problem: reusing a complicated expression is tedious • Improvement: define intermediate reg. expr. digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 number = digit digit* • Abbreviation: A+ = A A*

  28. Regular Definitions • Names for regular expressions • d1 = r1 • d2 = r2 • ... • dn = rn where ri is a regular expression over the alphabet Σ ∪ { d1, d2, ..., di-1 } • Note: recursion is not allowed.

  29. Example • Identifier: strings of letters or digits, starting with a letter digit = 0 | 1 | ... | 9 letter = A | … | Z | a | … | z identifier = letter (letter | digit)* • Is (letter* | digit*) the same? (see the sketch below)
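
It is not the same. A quick illustration with Python's re module (the test strings are arbitrary):

```python
import re

identifier  = r"[A-Za-z][A-Za-z0-9]*"   # letter (letter | digit)*
alternative = r"[A-Za-z]*|[0-9]*"       # (letter* | digit*)

for s in ["var23", "23var", "x", "42", ""]:
    print(f"{s!r}: identifier={bool(re.fullmatch(identifier, s))}, "
          f"alternative={bool(re.fullmatch(alternative, s))}")
# (letter* | digit*) rejects "var23" but accepts "42" and even ""
```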

  30. Example: Whitespace • Whitespace: a non-empty sequence of blanks, newlines, CRLF pairs, and tabs WS = ( \  | \t | \n | \r\n )+

  31. Example: Email Addresses • Consider chencc@cs.nccu.edu.tw • Σ = letters ∪ { ‘.’, ‘@’ } name = letter+ address = name ‘@’ name (‘.’ name)*
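
The same definition transcribed as a Python regular expression (an illustrative translation that assumes letter means an ASCII letter):

```python
import re

name = r"[A-Za-z]+"                         # name = letter+
address = rf"{name}@{name}(?:\.{name})*"    # address = name '@' name ('.' name)*

print(bool(re.fullmatch(address, "chencc@cs.nccu.edu.tw")))  # True
print(bool(re.fullmatch(address, "chencc@")))                # False
```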

  32. Notational Shorthands • One or more instances • r+ = r r* • r* = r+ | ε • Zero or one instance • r? = r | ε • Character classes • [abc] = a | b | c • [a-z] = a | b | ... | z • [ac-f] = a | c | d | e | f • [^ac-f] = Σ – [ac-f]
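
These shorthands carry over verbatim to most regex libraries; a brief illustration in Python's re:

```python
import re

print(bool(re.fullmatch(r"[0-9]+", "123")))    # one or more digits -> True
print(bool(re.fullmatch(r"-?[0-9]+", "-7")))   # optional minus sign -> True
print(bool(re.fullmatch(r"[ac-f]", "d")))      # class a|c|d|e|f -> True
print(bool(re.fullmatch(r"[^ac-f]", "b")))     # complement class -> True
```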

  33. Summary • Regular expressions describe many useful languages • Regular languages are a language specification • We still need an implementation • Problem: Given a string s and a rexp R, is s ∈ L(R) ?

  34. Implementation of Lexical Analysis

  35. Outline • Specifying lexical structure using regular expressions • Finite automata • Deterministic Finite Automata (DFAs) • Non-deterministic Finite Automata (NFAs) • Implementation of regular expressions RegExp => NFA => DFA => optimized DFA => Tables

  36. Regular Expressions in Lexical Specification • Last lecture: a specification for the predicate s ∈ L(R) • But a yes/no answer is not enough! • Instead: partition the input into lexemes • We will adapt regular expressions to this goal

  37. Regular Expressions => Lexical Spec. (1) • Select a set of tokens • Number, Keyword, Identifier, ... • Write a rexp for the lexemes of each token • Number = digit+ • Keyword = if | else | … • Identifier = letter (letter | digit)* • OpenPar = ‘(’ • …

  38. Regular Expressions => Lexical Spec. (2) • Construct R, matching all lexemes for all tokens R = Keyword | Identifier | Number | … = R1 | R2 | R3 | … • Facts: If s ∈ L(R) then s is a lexeme • Furthermore s ∈ L(Ri) for some “i” • This “i” determines the token that is reported
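
One common way to implement this (a sketch, not the slides' prescribed method) is to combine the per-token expressions into a single alternation with one named group per Ri, so the group that matched identifies the “i”:

```python
import re

# R = R1 | R2 | R3 | ...  with one named group per token.
# Keyword is listed before Identifier so rule order decides ties, but note:
# re's alternation is first-match, not longest-match (see slides 41-42).
R = re.compile(
    r"(?P<Keyword>if|else)"
    r"|(?P<Identifier>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<Number>[0-9]+)"
)

m = R.match("x42")
print(m.lastgroup, m.group())   # Identifier x42
m = R.match("if")
print(m.lastgroup, m.group())   # Keyword if
```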

  39. Regular Expressions => Lexical Spec. (3) • Let the input be x1…xn (x1 ... xn are characters in the language alphabet) • For 1 ≤ i ≤ n check x1…xi ∈ L(R) ? • It must be that x1…xi ∈ L(Rj) for some j • Remove x1…xi from the input and go to (4)

  40. How to Handle Spaces and Comments? • We could create a token Whitespace Whitespace = (‘ ’ + ‘\n’ + ‘\r’ + ‘\t’)+ • We could also add comments in there • An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace • Lexer skips spaces (preferred) • Modify step 5 from before as follows: It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace) • Parser is not bothered with spaces

  41. Ambiguities (1) • There are ambiguities in the algorithm • How much input is used? What if • x1…xi ∈ L(R) and also • x1…xK ∈ L(R) • Rule: Pick the longest possible substring • The maximal match

  42. Ambiguities (2) • Which token is used? What if • x1…xi ∈ L(Rj) and also • x1…xi ∈ L(Rk) • Rule: use the rule listed first (j if j < k) • Example: • R1 = Keyword and R2 = Identifier • “if” matches both. • Treat “if” as a keyword, not an identifier
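
Both disambiguation rules together in one sketch (the rule list and helper name are illustrative assumptions): try every rule at the current position, keep the longest match, and break length ties by the order in which the rules are listed:

```python
import re

RULES = [("Keyword",    r"if|else"),              # listed first: wins ties
         ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
         ("Number",     r"[0-9]+")]

def next_token(text, pos):
    """Maximal munch: longest match wins; the earlier rule wins a tie."""
    best = None                       # (length, rule_index, token, lexeme)
    for idx, (token, pattern) in enumerate(RULES):
        m = re.match(pattern, text[pos:])
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), idx, token, m.group())
    return best

print(next_token("if", 0))     # (2, 0, 'Keyword', 'if')      tie -> first rule
print(next_token("iffy", 0))   # (4, 1, 'Identifier', 'iffy') longest match
```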

  43. Error Handling • What if no rule matches a prefix of the input? • Problem: Can’t just get stuck … • Solution: • Write a rule matching all “bad” strings • Put it last • Lexer tools allow the writing of: R = R1 | ... | Rn | Error • Token Error matches if nothing else matches
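
Continuing the sketch above, an illustrative catch-all rule: a pattern matching any single character, appended last so it fires only when nothing else matches:

```python
# '.' matches any one character (except newline), so the lexer can report
# an Error token and keep going instead of getting stuck.
RULES.append(("Error", r"."))

print(next_token("?oops", 0))  # (1, 3, 'Error', '?')
```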

  44. Summary • Regular expressions provide a concise notation for string patterns • Use in lexical analysis requires small extensions • To resolve ambiguities • To handle errors • Good algorithms known (next) • Require only single pass over the input • Few operations per character (table lookup)

  45. Finite Automata • Regular expressions = specification • Finite automata = implementation • A finite automaton consists of • An input alphabet Σ • A set of states S • A start state n • A set of accepting states F ⊆ S • A set of transitions: state →(input) state

  46. Finite Automata • Transition: s1 →(a) s2 • Is read: in state s1, on input “a”, go to state s2 • If at end of input (or no transition is possible) • If in an accepting state => accept • Otherwise => reject
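
A finite automaton is directly executable; here is a small simulator sketch (the table encoding and state names are assumptions, not the slides' notation):

```python
def run_dfa(start, accepting, delta, s):
    """Simulate a DFA: delta maps (state, char) -> state."""
    state = start
    for ch in s:
        if (state, ch) not in delta:   # no transition possible => reject
            return False
        state = delta[(state, ch)]
    return state in accepting          # end of input: accept iff accepting

# The automaton of slide 48, accepting only the string "1".
delta = {("A", "1"): "B"}
print(run_dfa("A", {"B"}, delta, "1"))   # True
print(run_dfa("A", {"B"}, delta, "11"))  # False
```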

  47. Finite Automata State Graphs • A state • The start state • An accepting state • A transition, labeled with an input character (e.g. a) • (each element is drawn as a diagram on the slide)

  48. A Simple Example • A finite automaton that accepts only “1” (drawn on the slide as a start state with a 1-transition into an accepting state)

  49. Another Simple Example • A finite automaton accepting any number of 1’s followed by a single 0 • Alphabet: {0,1} (drawn on the slide as a start state with a self-loop on 1 and a 0-transition into an accepting state)
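
That machine, encoded for the run_dfa sketch above (the state names S and F are assumptions):

```python
# 1*0: loop on '1' in the start state, then one '0' into the accepting state.
delta = {("S", "1"): "S", ("S", "0"): "F"}
for w in ["0", "1110", "110", "11", "100"]:
    print(w, run_dfa("S", {"F"}, delta, w))
# 0 True, 1110 True, 110 True, 11 False, 100 False
```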

  50. And Another Example • Alphabet: {0,1} • What language does this recognize? (the automaton is given only as a diagram on the slide, with transitions labeled 0 and 1)
