Lexical Analysis Cheng-Chia Chen
Outline • The goal and niche of lexical analysis in a compiler • Lexical tokens • Regular expressions (RE) • Using regular expressions in lexical specification • Finite automata (FA) • DFA and NFA • from RE to NFA • from NFA to DFA • from DFA to optimized DFA • Lexical-analyzer generators
1. The goal and niche of lexical analysis • Compilation pipeline: Source (char stream) → Lexical analysis → Tokens (token stream) → Parsing → Interm. Language → Optimization → Code Gen. → Machine Code • Goal of lexical analysis: breaking the input into individual words or "tokens"
Lexical Analysis • What do we want to do? Example: if (i == j) z = 0; else z = 1; • The input is just a sequence of characters: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Goal: partition the input string into substrings • and determine the categories (token types) to which the substrings belong
2. Lexical Tokens • What's a token? • Token attributes • Normal tokens and special tokens • Examples of tokens and special tokens
What's a token? • A sequence of characters that can be treated as a unit in the grammar of a PL. The output of lexical analysis is a stream of tokens. • Tokens are partitioned into categories called token types. ex: • In English: • book, students, like, help, strong, … : tokens • noun, verb, adjective, … : token types • In a programming language: • student, var34, 345, if, class, "abc", … : tokens • ID, Integer, IF, WHILE, Whitespace, … : token types • The parser relies on token types rather than individual tokens to analyze: • var32 and var1 are treated the same, • var32 (ID), 32 (Integer) and if (IF) are treated differently.
Token attributes • token type : • category of the token; used by syntax analysis. • ex: identifier, integer, string, if, plus, … • token value : • semantic value used in semantic analysis. • ex: [integer, 26], [string, “26”] • token lexeme (member, text): • textual content of a token • [while, “while”], [identifier, “var23”], [plus, “+”],… • positional information: • start/end line/position of the textual content in the source program.
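As a sketch of how a scanner might bundle these four attributes into one record, here is a minimal Python token type; the field names and the example values are illustrative, not prescribed by the slides.

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str      # token type, e.g. "INTEGER", "ID", "IF" (used by syntax analysis)
    value: object  # semantic value, e.g. the int 26 for lexeme "26" (used by semantic analysis)
    lexeme: str    # the exact source text, e.g. "var23"
    line: int      # positional information for error handling
    column: int

# e.g. the integer literal "26" found at line 3, column 10:
tok = Token(type="INTEGER", value=26, lexeme="26", line=3, column=10)
```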
Notes on Token attributes • Token types affect syntax analysis • Token values affect semantic analysis • lexeme and positional information affect error handling • Only token type information must be supplied by the lexical analyzer. • Any program performing lexical analysis is called a scanner (lexer, lexical analyzer).
Aspects of Token types • Language view: A token type is the set of all lexemes of all its token instances. • ID = {a, ab, … } – {if, do,…}. • Integer = { 123, 456, …} • IF = {if}, WHILE={while}; • STRING={“abc”, “if”, “WHILE”,…} • Pattern (regular expression): a rule defining the language of all instances of a token type. • WHILE: w h i l e • ID: letter (letters | digits )* • ArithOp: + | - | * | /
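The patterns on this slide carry over almost directly to Python's re syntax; a small illustrative check (the character-class spellings for "letter" and "digit" are assumptions):

```python
import re

# Patterns mirroring the slide: WHILE, ID, ArithOp
WHILE = re.compile(r"while")
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letters | digits)*
ARITH_OP = re.compile(r"[+\-*/]")         # + | - | * | /

assert WHILE.fullmatch("while")
assert ID.fullmatch("var34")
assert not ID.fullmatch("345")  # an integer literal is not an identifier
assert ARITH_OP.fullmatch("+")
```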
Lexical Analyzer: Implementation • An implementation must do two things: • Recognize substrings corresponding to lexemes of tokens • Determine token attributes • type is necessary • value depends on the type/application, • lexeme/positional information depends on applications (eg: debug or not).
Example • input lines: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Token-lexeme pairs returned by the lexer: • [Whitespace, "\t"] • [if, - ] • [OpenPar, "("] • [Identifier, "i"] • [Relation, "=="] • [Identifier, "j"] • …
Normal Tokens and Special Tokens • Kinds of tokens: • normal tokens: needed for later syntax analysis and must be passed to the parser. • special tokens: • skipped tokens (or nontokens): • do not contribute to parsing, • discarded by the scanner. • Examples: whitespace, comments • Why do we need them? • Question: What happens if we remove all whitespace and all comments prior to scanning?
Lexical Analysis in FORTRAN • FORTRAN rule: whitespace is insignificant • E.g., VAR1 is the same as VA R1 • Footnote: FORTRAN's whitespace rule was motivated by the inaccuracy of punch card operators
A terrible design! Example • Consider • DO 5 I = 1,25 • DO 5 I = 1.25 • The first is DO 5 I = 1 , 25 (a DO-loop statement) • The second is DO5I = 1.25 (an assignment to the variable DO5I) • Reading left-to-right, the scanner cannot tell whether DO5I is a variable or a DO statement until the "," (or ".") is reached
Lexical Analysis in FORTRAN: Lookahead • Two important points: • The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time. • "Lookahead" may be required to decide where one token ends and the next token begins. • Even our simple example has lookahead issues: i vs. if, = vs. == (a sketch follows below)
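A minimal sketch of one-character lookahead for the = vs. == case; the function name and the token-type strings are illustrative:

```python
def scan_equals(src: str, i: int) -> tuple[str, int]:
    """At src[i] == '=', peek one character ahead to decide between
    the assignment token '=' and the relation token '=='."""
    if i + 1 < len(src) and src[i + 1] == "=":
        return "RELOP_EQ", i + 2  # consumed "=="
    return "ASSIGN", i + 1        # consumed "="

assert scan_equals("i == j", 2) == ("RELOP_EQ", 4)
assert scan_equals("z = 0", 2) == ("ASSIGN", 3)
```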
Some Special Tokens • [The accompanying table of example special tokens, numbered 1 through 5, is not reproduced here.] • Tokens 1 and 5 are skipped; 2 and 3 need preprocessing; 4 needs to be expanded.
The geography of lexical tokens • Within the set of all strings: • ID: var1, last5, … • NUM: 23, 56, 0, 000 • REAL: 12.35, 2.4e-10, … • IF: if • LPAREN: ( • RPAREN: ) • special tokens: \t \n /* … */
Issues • Definition problem: • how to define (formally specify) the set of strings (tokens) belonging to a token type? • => regular expressions • Recognition problem: • how to determine which set (token type) an input string belongs to? • => DFA!
Languages • Def. Let Σ be a set of symbols (or characters). • A language over Σ is a set of strings of characters drawn from Σ. • (Σ is called the alphabet.)
Examples of Languages • Alphabet = English characters; Language = English words • Not every string over English characters is an English word: likes, school, … are words; beee, yykk, … are not. • Alphabet = ASCII; Language = C programs • Note: the ASCII character set is different from the English character set.
Regular Expressions • A metalanguage for representing (or defining) languages (sets of words) • Definition: Let Σ be an alphabet. The set of regular expressions (RegExpr) over Σ is defined recursively as follows: • (Atomic RegExpr): 1. any symbol c ∈ Σ is a RegExpr. 2. ε (the empty string) is a RegExpr. • (Compound RegExpr): if A and B are RegExpr, then so are 3. (A | B) (alternation) 4. (A B) (concatenation) 5. A* (repetition)
Semantics (Meaning) of regular expressions • For each regular expression A, we write L(A) for the language defined by A. • I.e., L is the function L: RegExpr(Σ) → the set of languages over Σ, with L(A) = the language denoted by the RegExpr A. • The meaning of RegExpr can be made clear by explicitly defining L.
Atomic Regular Expressions • 1. Single symbol: c, with L(c) = { c } (for any c ∈ Σ) • 2. Epsilon (empty string): ε, with L(ε) = { ε }
Compound Regular Expressions • 3. Alternation (or union or choice): L((A | B)) = { s | s ∈ L(A) or s ∈ L(B) } • 4. Concatenation: AB (where A and B are reg. exp.) L((A B)) = L(A) L(B) =def { ab | a ∈ L(A) and b ∈ L(B) } • Note: • Parentheses enclosing (A|B) and (AB) can be omitted when there is no risk of confusion. • Set concatenation L(A) L(B) and string concatenation a b are written as juxtapositions AB and ab, respectively. • AA and L(A) L(A) are abbreviated as A² and L(A)², respectively.
Examples • L(if | then | else) = { if, then, else } • L(0 | 1 | … | 9) = { 0, 1, …, 9 } • L((0 | 1) (0 | 1)) = { 00, 01, 10, 11 }
More Compound Regular Expressions • 5. Repetition (or iteration): A*, with L(A*) = { ε } ∪ L(A) ∪ L(A)² ∪ L(A)³ ∪ … • Examples: • 0* : { ε, 0, 00, 000, … } • 10* : strings starting with 1 and followed by 0's. • (0|1)* 0 : binary representations of even numbers. • (a|b)* aa (a|b)* : strings of a's and b's containing consecutive a's. • b* (a b b*)* (a | ε) : strings of a's and b's with no consecutive a's.
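These set equations can be checked directly in Python on finite approximations; the sketch below is illustrative (function names are assumptions, and A* is truncated at a fixed depth since the full language is infinite):

```python
def alt(A: set, B: set) -> set:
    """L(A | B) = L(A) union L(B)"""
    return A | B

def concat(A: set, B: set) -> set:
    """L(AB) = { ab | a in L(A), b in L(B) }"""
    return {a + b for a in A for b in B}

def star(A: set, depth: int = 3) -> set:
    """L(A*) = {eps} union L(A) union L(A)^2 union ..., truncated at `depth`."""
    result, power = {""}, {""}
    for _ in range(depth):
        power = concat(power, A)
        result |= power
    return result

bit = alt({"0"}, {"1"})                              # 0 | 1
assert concat(bit, bit) == {"00", "01", "10", "11"}  # (0|1)(0|1)
assert star({"0"}, depth=2) == {"", "0", "00"}       # 0*, up to length 2
```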
Example: Keyword • Keyword: else or if or begin… else | if | begin | …
Example: Integers Integer: a non-empty string of digits ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )* • problem: reusing such a complicated expression is tedious • improvement: define intermediate reg. expr. digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 number = digit digit* Abbreviation: A+ = A A*
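The same definition in Python's re, where the + abbreviation is built in:

```python
import re

digit = r"[0-9]"                  # digit = 0 | 1 | ... | 9
number = re.compile(digit + "+")  # number = digit digit*, i.e. digit+

assert number.fullmatch("345")
assert not number.fullmatch("")   # a number must be non-empty
assert not number.fullmatch("12a")
```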
Regular Definitions • Names for regular expressions: • d1 = r1 • d2 = r2 • ... • dn = rn, where each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1} • note: recursion is not allowed.
Example • Identifier: strings of letters or digits, starting with a letter digit = 0 | 1 | ... | 9 letter = A | … | Z | a | … | z identifier = letter (letter | digit)* • Is (letter* | digit*) the same? (see the check below)
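Answering the question above: no. A quick check with Python's re shows (letter* | digit*) accepts strings that are not identifiers and rejects identifiers that mix letters and digits:

```python
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letter | digit)*
alternative = re.compile(r"[A-Za-z]*|[0-9]*")     # (letter* | digit*)

# The alternative accepts strings an identifier must reject ...
assert alternative.fullmatch("123")  # all digits, no leading letter
assert alternative.fullmatch("")     # the empty string
# ... and rejects legal identifiers that mix letters and digits:
assert not alternative.fullmatch("var34")
assert identifier.fullmatch("var34")
```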
Example: Whitespace Whitespace: a non-empty sequence of blanks, newlines, CRLFs and tabs WS = (\ | \t | \n | \r\n )+
Example: Email Addresses • Consider chencc@cs.nccu.edu.tw • Σ = letters ∪ { ., @ } name = letter+ address = name '@' name ('.' name)*
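The regular definition above, rendered in Python's re; a minimal sketch, assuming the reconstruction Σ = letters ∪ { ., @ } and letter = English letters:

```python
import re

name = r"[A-Za-z]+"  # name = letter+
address = re.compile(name + "@" + name + r"(\." + name + r")*")

assert address.fullmatch("chencc@cs.nccu.edu.tw")
assert not address.fullmatch("chencc@")  # '@' must be followed by a name
```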
Notational Shorthands • One or more instances: r+ = r r* • r* = (r+ | ε) • Zero or one instance: r? = (r | ε) • Character classes: • [abc] = a | b | c • [a-z] = a | b | ... | z • [ac-f] = a | c | d | e | f • [^ac-f] = Σ - [ac-f]
Summary • Regular expressions describe many useful languages • Regular languages are a language specification • We still need an implementation • problem: Given a string s and a rexp R, is s ∈ L(R)?
Goal • Specifying lexical structure using regular expressions
Regular Expressions in Lexical Specification • Last lecture: the specification of all lexemes of a token type using a regular expression. • But we want a specification of all lexemes of all token types in a programming language, • which may enable us to partition the input into lexemes. • We will adapt regular expressions to this goal.
Regular Expressions => Lexical Spec. (1) • Select a set of token types • Number, Keyword, Identifier, ... • Write a rexp for the lexemes of each token type • Number = digit+ • Keyword = if | else | … • Identifier = letter (letter | digit)* • LParen = '(' • …
Regular Expressions => Lexical Spec. (2) • Construct R, matching all lexemes for all tokens R = Keyword | Identifier | Number | … = R1 | R2 | R3 | … • Facts: If s ∈ L(R) then s is a lexeme • Furthermore s ∈ L(Ri) for some "i" • This "i" determines the token type that is reported
Regular Expressions => Lexical Spec. (3) 4. Let the input be x1…xn (x1 ... xn are symbols in the language alphabet) • For 1 ≤ i ≤ n, check x1…xi ∈ L(R) ? 5. It must be that x1…xi ∈ L(Rj) for some j 6. Remove t = x1…xi from the input; if t is a normal token, then pass it to the parser // else it is whitespace or a comment, just skip it! 7. Go to (4)
Ambiguities (1) • There are ambiguities in the algorithm • How much input is used? What if • x1…xi ∈ L(R) and also • x1…xk ∈ L(R) for some i ≠ k • Rule: pick the longest possible substring • The longest match principle!
Ambiguities (2) • Which token is used? What if • x1…xi ∈ L(Rj) and also • x1…xi ∈ L(Rk) • Rule: use the rule listed first (Rj if j < k) • Earlier rule first! • Example: • R1 = Keyword and R2 = Identifier • "if" matches both. • Treat "if" as a keyword, not an identifier
Error Handling • What if no rule matches a prefix of the input? • Problem: the lexer can't just get stuck … • Solution: • Write a rule matching all "bad" strings • Put it last • Lexer tools allow the writing of: R = R1 | ... | Rn | Error • The token Error matches if nothing else matches • (a sketch combining steps 4-7 with these rules follows below)
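A minimal sketch tying together steps 4-7, the longest-match rule, the earlier-rule-first rule, and the catch-all Error rule; the rule set and all names are illustrative:

```python
import re

# Rules in priority order: keywords before identifiers ("earlier rule first"),
# with a catch-all Error rule last.
RULES = [
    ("KEYWORD",    re.compile(r"if|else")),
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUMBER",     re.compile(r"[0-9]+")),
    ("WHITESPACE", re.compile(r"[ \t\n]+")),      # special token: skipped
    ("ERROR",      re.compile(r".", re.DOTALL)),  # any single bad character
]

SKIPPED = {"WHITESPACE"}

def tokenize(src: str):
    i = 0
    while i < len(src):
        best_type, best_end = None, i
        for tok_type, pattern in RULES:
            m = pattern.match(src, i)
            # Longest match wins; on a tie the earlier rule is kept,
            # because only a strictly longer match replaces the current best.
            if m and m.end() > best_end:
                best_type, best_end = tok_type, m.end()
        if best_type not in SKIPPED:
            yield (best_type, src[i:best_end])  # pass normal tokens to the parser
        i = best_end

print(list(tokenize("if x1 else 42 $")))
# [('KEYWORD', 'if'), ('IDENTIFIER', 'x1'), ('KEYWORD', 'else'),
#  ('NUMBER', '42'), ('ERROR', '$')]
```

Note how "if" is reported as KEYWORD (tie with IDENTIFIER, earlier rule wins), while an input like "ifx" would be one IDENTIFIER by the longest-match rule.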
Summary • Regular expressions provide a concise notation for string patterns • Use in lexical analysis requires small extensions • To resolve ambiguities • To handle errors • Efficient algorithms exist (next) • Require only a single pass over the input • Few operations per character (table lookup)
5. Finite Automata • Regular expressions = specification • Finite automata = implementation • A finite automaton consists of • An input alphabet Σ • A finite set of states S • A start state n ∈ S • A set of accepting states F ⊆ S • A set of transitions: state →input state
Finite Automata • Transition s1 →a s2 • is read: in state s1, on input "a", go to state s2 • If at the end of input (or no transition is possible): • if in an accepting state => accept • otherwise => reject
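As a sketch, the components above map directly onto a table-driven simulator; the representation below (a dict keyed by (state, symbol)) is one common choice, not the only one:

```python
def run_dfa(transitions: dict, start, accepting: set, s: str) -> bool:
    """Simulate a deterministic FA. `transitions` maps (state, symbol) -> state.
    Accept iff the whole input is consumed and we end in an accepting state;
    a missing transition means 'no transition possible', hence reject."""
    state = start
    for ch in s:
        if (state, ch) not in transitions:
            return False  # stuck: no transition possible => reject
        state = transitions[(state, ch)]
    return state in accepting
```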
Finite Automata: State Transition Graphs • [Graphical notation, shown as a figure in the original slides:] • A state: a circle • The start state: a circle with an incoming arrow • An accepting state: a double circle • A transition: an arrow between states, labeled with an input symbol (e.g., a)
A Simple Example • A finite automaton that accepts only "1": a start state with a single transition on input 1 to an accepting state
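Using the run_dfa sketch above, this automaton can be written down directly (state names are illustrative):

```python
# States: "A" (start), "B" (accepting). One transition: A --1--> B.
transitions = {("A", "1"): "B"}

assert run_dfa(transitions, "A", {"B"}, "1")       # accepted
assert not run_dfa(transitions, "A", {"B"}, "11")  # stuck in B on the second '1'
assert not run_dfa(transitions, "A", {"B"}, "")    # the start state is not accepting
```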