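The Role of the Lexical Analyzer • [Figure: Source Program → Lexical Analyzer → Parser. The parser calls getNextToken and the lexical analyzer returns a Token; both components report errors and both access the Symbol Table.]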
The Reason for Using the Lexical Analyzer • Simplifies the design of the compiler • LL(1) or LR(1) parsing with 1 token lookahead would not be possible (multiple characters/tokens to match) • Compiler efficiency is improved • Systematic techniques to implement lexical analyzers by hand or automatically from specifications • Stream buffering methods to scan input • Compiler portability is enhanced • Input-device-specific peculiarities can be restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes • Token • A pair consisting of a token name and an optional attribute value. • Example: num, id • Pattern • A description of the form that the lexemes of a token may take. • Example: “non-empty sequence of digits”, “letter followed by letters and digits” • Lexeme • A sequence of characters that matches the pattern for a token. • Example: 123, abc
Input Buffering • [Figure: input buffer with the two pointers lexemeBegin and forward] • Sentinels
Strings and Languages • Alphabet • An alphabet Σ is a finite set of symbols (characters) • String • A string is a finite sequence of symbols from Σ • |s| denotes the length of string s • ε denotes the empty string, thus |ε| = 0 • Language • A language is a countable set of strings over some fixed alphabet Σ • Abstract languages • ∅ (the empty set) • {ε} (the set containing only the empty string)
String Operations • Concatenation • The concatenation of two strings x and y is denoted by xy • Identity • The empty string ε is the identity under concatenation. • εs = sε = s • Exponentiation • Define s^0 = ε and s^i = s^{i-1}s for i > 0 • By definition, s^1 = s and s^2 = ss
Language Operations • Union: L ∪ M = { s | s ∈ L or s ∈ M } • Concatenation: LM = { xy | x ∈ L and y ∈ M } • Exponentiation: L^0 = {ε}, L^i = L^{i-1}L • Kleene closure: L* = ∪_{i=0,…,∞} L^i • Positive closure: L+ = ∪_{i=1,…,∞} L^i
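The operations above can be tried out on small finite languages. The following is a minimal sketch, assuming languages are represented as Python sets of strings and ε as the empty string; the helper names (union, concat, power, kleene_upto) are chosen here for illustration. Since L* is infinite in general, kleene_upto only builds the finite union L^0 ∪ … ∪ L^n.

```python
from itertools import product

def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every x from L followed by every y from M."""
    return {x + y for x, y in product(L, M)}

def power(L, i):
    """L^i, built from L^0 = {ε} and L^i = L^{i-1} L."""
    result = {""}                      # "" stands for ε
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, n):
    """Finite approximation of L*: the union of L^0 .. L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

L = {"a", "b"}
M = {"0", "1"}
print(sorted(concat(L, M)))
print(sorted(kleene_upto(L, 2)))
```

Dropping power(L, 0) from the union in kleene_upto would give the corresponding approximation of the positive closure L+.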
Regular Expressions • Regular Expressions • A convenient means of specifying certain simple sets of strings. • We use regular expressions to define structures of tokens. • Tokens are built from symbols of a finite vocabulary. • Regular Sets • The sets of strings defined by regular expressions.
Regular Expressions • Basis symbols: • ε is a regular expression denoting language L(ε) = {ε} • a ∈ Σ is a regular expression denoting L(a) = {a} • If r and s are regular expressions denoting languages L(r) and M(s) respectively, then • r|s is a regular expression denoting L(r) ∪ M(s) • rs is a regular expression denoting L(r)M(s) • r* is a regular expression denoting L(r)* • (r) is a regular expression denoting L(r) • A language defined by a regular expression is called a regular set.
Regular Definitions • If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form: d1 → r1, d2 → r2, …, dn → rn • Each di is a new symbol, not in Σ and not the same as any other of the d’s. • Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1} • Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
Example: Regular Definitions • letter_ → A | B | … | Z | a | b | … | z | _ • digit → 0 | 1 | … | 9 • id → letter_ ( letter_ | digit )* • Regular definitions are not recursive: • digits → digit digits | digit is wrong
Extensions of Regular Definitions • One or more instances • r+ = rr* = r*r • r* = r+ | ε • Zero or one instance • r? = r | ε • Character classes • [a-z] = a|b|c|…|z • [A-Za-z] = A|B|…|Z|a|…|z • Example • digit → [0-9] • num → digit+ (. digit+)? ( E (+|-)? digit+ )?
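These definitions map almost directly onto the notation of practical regex libraries. The sketch below, assuming Python's re module as the target, spells out num from this slide and the id definition (a letter or underscore followed by letters, underscores, and digits); the names DIGIT, LETTER_, ID, NUM, is_id, and is_num are chosen here for illustration. Each definition is substituted textually into the next, which is exactly why regular definitions must not be recursive.

```python
import re

# Each definition is built only from earlier ones (textual substitution).
DIGIT = r"[0-9]"
LETTER_ = r"[A-Za-z_]"
ID = rf"{LETTER_}(?:{LETTER_}|{DIGIT})*"
# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = rf"{DIGIT}+(?:\.{DIGIT}+)?(?:E[+-]?{DIGIT}+)?"

def is_id(s):
    """True if the whole string s is a lexeme of id."""
    return re.fullmatch(ID, s) is not None

def is_num(s):
    """True if the whole string s is a lexeme of num."""
    return re.fullmatch(NUM, s) is not None

for lexeme in ["123", "3.14E+10", "3.", "abc_1", "1abc"]:
    print(lexeme, is_num(lexeme), is_id(lexeme))
```

Note that "3." and ".5" are not lexemes of num, because both the integer part and the digits after the point are written with digit+.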
Regular Definitions and Grammars • Context-Free Grammar • stmt → if expr then stmt | if expr then stmt else stmt • expr → term relop term | term • term → id | num • Regular Definitions • digit → [0-9] • letter → [A-Za-z] • if → if • then → then • else → else • relop → < | <= | <> | > | >= | = • id → letter ( letter | digit )* • num → digit+ (. digit+)? ( E (+ | -)? digit+ )? • ws → ( blank | tab | newline )+
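Transition Diagrams • relop → < | <= | <> | > | >= | = • [Transition diagram] start → state 0 • 0 on < → 1; 1 on = → 2, return(relop, LE); 1 on > → 3, return(relop, NE); 1 on other → 4 *, return(relop, LT) • 0 on = → 5, return(relop, EQ) • 0 on > → 6; 6 on = → 7, return(relop, GE); 6 on other → 8 *, return(relop, GT) • * marks an accepting state that retracts the input by one character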
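A transition diagram like this one is commonly hand-coded as nested branches, one per state. The following is a minimal sketch under assumptions: the function name get_relop, the (token, attribute) tuples, and the use of an index into a string instead of a forward pointer are all choices made here. Retraction in the starred states is modeled by simply not consuming the "other" character.

```python
def get_relop(text, pos=0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    def peek(i):
        # Past the end of the input, "" plays the role of 'other'.
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":                                   # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2        # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2        # state 3
        return ("relop", "LT"), pos + 1            # state 4*: 'other' is not consumed
    if c == "=":
        return ("relop", "EQ"), pos + 1            # state 5
    if c == ">":                                   # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2        # state 7
        return ("relop", "GT"), pos + 1            # state 8*: 'other' is not consumed
    return None                                    # no edge leaves state 0 on c

print(get_relop("<= b"))
print(get_relop("<b"))
```

The returned position is where scanning for the next token would resume, so after "<b" it points at b rather than past it.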
Transition Diagrams • id → letter ( letter | digit )* • [Transition diagram] start → state 9 • 9 on letter → 10 • 10 on letter or digit → 10 • 10 on other → 11 *, return( getToken(), installID() )
Finite Automata • Finite automata are recognizers. • An FA simply says “yes” or “no” about each possible input string. • An FA can be used to recognize the tokens specified by a regular expression • FA are used in the design of a lexical analyzer generator • Two kinds of finite automata • Nondeterministic finite automata (NFA) • Deterministic finite automata (DFA) • Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions • NFA = { S, Σ, δ, s0, F } • A finite set of states S • A set of input symbols Σ • the input alphabet; ε is not in Σ • A transition function δ • δ : S × (Σ ∪ {ε}) → 2^S (each state and each symbol, or ε, maps to a set of next states) • A special start state s0 • A set of final (accepting) states F, F ⊆ S
Transition Graph for FA • [Legend: graphical symbols for a state, a transition, the start state, and a final state]
Example • [Transition graph with states 0, 1, 2, 3 and edges labeled a, b, c] • This machine accepts abccabc, but it rejects abcab. • This machine accepts the language (abc+)+.
Transition Table • [Transition graph: start → 0; 0 loops on a and b; 0 on a → 1; 1 on b → 2; 2 on b → 3] • The mapping δ of an NFA can be represented in a transition table • δ(0, a) = {0,1} • δ(0, b) = {0} • δ(1, b) = {2} • δ(2, b) = {3}
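The transition table can be stored as a dictionary from (state, symbol) pairs to sets of states, with missing entries standing for the empty set. The sketch below simulates the NFA by tracking the set of states it could be in. It assumes start state 0 and accepting set {3}, which this slide does not state explicitly, and the names DELTA, move, and nfa_accepts are chosen here for illustration. This NFA has no ε-moves, so no ε-closure step is needed.

```python
# Transition table from the slide; a missing entry means the empty set.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}   # assumption: state 3 is the only accepting state

def move(states, symbol):
    """All states reachable from any state in `states` on `symbol`."""
    out = set()
    for s in states:
        out |= DELTA.get((s, symbol), set())
    return out

def nfa_accepts(x):
    """True if some path labeled x leads from START to an accepting state."""
    current = {START}
    for c in x:
        current = move(current, c)
    return bool(current & ACCEPTING)

for word in ["abb", "aabb", "ab", "abba"]:
    print(word, nfa_accepts(word))
```

Tracking a set of states instead of a single state is what distinguishes this from DFA simulation, and it is the same idea the subset construction precomputes.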
DFA • DFA is a special case of an NFA • There are no moves on input ε • For each state s and input symbol a, there is exactly one edge out of s labeled a. • Both DFA and NFA are capable of recognizing the same languages.
Simulating a DFA
• Input: An input string x terminated by an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move.
• Output: Answer “yes” if D accepts x; “no” otherwise.
    s = s0;
    c = nextChar();
    while ( c != eof ) {
        s = move(s, c);
        c = nextChar();
    }
    if ( s is in F ) return “yes”;
    else return “no”;
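The simulation loop translates to a few lines of runnable code. This is a minimal sketch: the end of the Python string plays the role of eof, and the transition table MOVE is an assumed example (a four-state DFA for (a | b)*abb with start state 0 and accepting state 3), not something given by the algorithm itself.

```python
# Assumed example DFA for (a | b)*abb: states 0..3, start 0, accepting {3}.
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

def simulate_dfa(x, move=MOVE, s0=0, accepting=frozenset({3})):
    """Answer "yes" if the DFA accepts x, "no" otherwise."""
    s = s0
    for c in x:              # reaching the end of x corresponds to c == eof
        s = move[(s, c)]     # exactly one next state per (state, symbol)
    return "yes" if s in accepting else "no"

for word in ["abb", "aabb", "abab", ""]:
    print(repr(word), simulate_dfa(word))
```

A symbol outside the alphabet raises KeyError in this sketch; a production scanner would instead route such input to an error or dead state.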
NFA vs DFA • (a | b)*abb • [Two transition graphs over states 0, 1, 2, 3: an NFA and an equivalent DFA] • S = {0,1,2,3} • Σ = {a, b} • s0 = 0 • F = {3}
The Regular Language • The regular language defined by an NFA is the set of input strings it accepts. • Example: (a|b)*abb for the example NFA • An NFA accepts an input string x if and only if • there is some path in the transition graph, from the start state to some accepting state, whose edge labels spell out x in sequence • A state transition from one state to another on the path is called a move.
Theorem • The following are equivalent • Regular Expression • NFA • DFA • Regular Language • Regular Grammar
Conversion Concept • Regular Expression → Nondeterministic Finite Automata → Deterministic Finite Automata → (Minimization) → Deterministic Finite Automata
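Construction of an NFA from a Regular Expression • Use Thompson’s Construction • [Diagrams of the NFA fragments: one for ε, one for a symbol a, one for s | t built from N(s) and N(t), one for st built from N(s) followed by N(t), and one for s* built from N(s)]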
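Example • ( a | b )* a b b • [Parse tree decomposing the expression into subexpressions] • r1 = a • r2 = b • r3 = r1 | r2 • r4 = ( r3 ) • r5 = r4* • r6 = a • r7 = r5 r6 • r8 = b • r9 = r7 r8 • r10 = b • r11 = r9 r10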
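Example • ( a | b )* a b b • [NFA with states 0 to 10 and start state 0: edges 2 → 3 and 7 → 8 on a, edges 4 → 5, 8 → 9, and 9 → 10 on b; the remaining edges are ε-transitions]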
Conversion of an NFA to a DFA • The subset construction algorithm converts an NFA into a DFA using the operations ε-closure and move.
Subset Construction (1)
    Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
    while ( there is an unmarked state T in Dstates ) {
        mark T;
        for ( each input symbol a ) {
            U = ε-closure( move(T, a) );
            if ( U is not in Dstates )
                add U as an unmarked state to Dstates;
            Dtran[T, a] = U;
        }
    }
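The algorithm above can be sketched in runnable form as follows. The NFA used is an assumption: it is the usual Thompson-style NFA for (a | b)*abb with states 0 to 10, and its ε-edges are written out by hand here. The names EPS, NFA, eps_closure, move, and subset_construction are likewise chosen for illustration. DFA states are represented as frozensets of NFA states so they can be stored in sets and used as dictionary keys.

```python
EPS = ""   # label used in this sketch for ε-transitions

# Assumed NFA for (a | b)*abb; a missing entry means the empty set.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},    (9, "b"): {10},
}
START, ACCEPT, ALPHABET = 0, 10, "ab"

def eps_closure(states):
    """States reachable from `states` using ε-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, a):
    """States reachable from `states` by one transition on symbol a."""
    return {t for s in states for t in NFA.get((s, a), ())}

def subset_construction():
    start = eps_closure({START})
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        T = unmarked.pop()                 # mark T
        for a in ALPHABET:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return start, dstates, dtran

def dfa_accepts(x):
    """Run the constructed DFA; a DFA state accepts if it contains ACCEPT."""
    state, _, dtran = subset_construction()
    for c in x:
        state = dtran[(state, c)]
    return ACCEPT in state

start, dstates, dtran = subset_construction()
print(len(dstates), sorted(start))
```

For this NFA the construction yields five DFA states, and the start state is ε-closure(0) = {0, 1, 2, 4, 7}.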
Example • ( a | b )* a b b • [The NFA with states 0 to 10, and the DFA obtained from it by subset construction, with states A, B, C, D, E and edges labeled a and b]
Example • a • abb • a*b+ • [Combined NFA with states 0 to 8 and edges labeled a and b] • Dstates • A = {0,1,3,7} • B = {2,4,7} • C = {8} • D = {7} • E = {5,8} • F = {6,8} • [Resulting DFA over the states 0137, 247, 7, 8, 58, 68 with edges labeled a and b]
Minimizing the DFA • Step 1 • Start with an initial partition Π with two groups: F and S-F (accepting and nonaccepting) • Step 2 • Apply the split procedure to obtain Πnew • Step 3 • If ( Πnew = Π ) then Πfinal = Π and continue with step 4; else Π = Πnew and go to step 2 • Step 4 • Construct the minimum-state DFA from the groups of Πfinal. • Delete the dead state
Split Procedure
    Initially, let Πnew = Π;
    for ( each group G of Π ) {
        Partition G into subgroups such that two states s and t are in the
        same subgroup if and only if, for all input symbols a, states s and t
        have transitions on a to states in the same group of Π.
        /* at worst, a state will be in a subgroup by itself */
        replace G in Πnew by the set of all subgroups formed;
    }
Example • Initially, two sets: {1, 2, 3, 5, 6} and {4, 7}. • {1, 2, 3, 5, 6} splits into {1, 2, 5} and {3, 6} on c. • {1, 2, 5} splits into {1} and {2, 5} on b.
Minimizing the DFA • Major operation: partition states into equivalence classes according to • final / non-final states • transition functions • ( A B C D E ) → ( A B C D ) ( E ) → ( A B C ) ( D ) ( E ) → ( A C ) ( B ) ( D ) ( E )
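The partition refinement can be sketched in a few lines. The transition table below is an assumption: it is the five-state DFA usually obtained for (a | b)*abb, with E as the only accepting state, and the names DFA, split, and minimize are chosen here. Each pass groups the states of a group by the tuple of groups their transitions lead to, and the loop stops when a pass changes nothing.

```python
# Assumed DFA for (a | b)*abb with states A..E; E is accepting.
DFA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}
ALPHABET = "ab"

def split(partition):
    """One pass of the split procedure over every group of the partition."""
    def group_index(state):
        return next(i for i, g in enumerate(partition) if state in g)

    new_partition = []
    for group in partition:
        buckets = {}
        for s in sorted(group):
            # States stay together iff they go to the same groups on every symbol.
            key = tuple(group_index(DFA[s][a]) for a in ALPHABET)
            buckets.setdefault(key, set()).add(s)
        new_partition.extend(frozenset(b) for b in buckets.values())
    return new_partition

def minimize():
    """Refine {S-F, F} until the partition no longer changes."""
    partition = [frozenset(set(DFA) - ACCEPTING), frozenset(ACCEPTING)]
    while True:
        refined = split(partition)
        if set(refined) == set(partition):
            return set(partition)
        partition = refined

print(sorted(sorted(g) for g in minimize()))
```

On this table the refinement passes through (A B C D)(E) and (A B C)(D)(E) before settling on (A C)(B)(D)(E), so A and C can be merged into one state of the minimum-state DFA.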
Important States of an NFA • The “important states” of an NFA are those with a non-ε out-transition, that is, • if move({s}, a) ≠ ∅ for some a, then s is an important state • The subset construction algorithm uses only the important states when it determines ε-closure( move(T, a) ) • Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#
Converting a RE Directly to a DFA • Construct a syntax tree for (r)# • Traverse the tree to construct the functions nullable, firstpos, lastpos, and followpos • Construct the DFA D by algorithm 3.62
Functions Computed From the Syntax Tree • nullable(n) • True if the subtree at node n generates a language that includes the empty string • firstpos(n) • The set of positions that can match the first symbol of a string generated by the subtree at node n • lastpos(n) • The set of positions that can match the last symbol of a string generated by the subtree at node n • followpos(i) • The set of positions that can follow position i in the tree
Computing followpos
    for ( each node n in the tree ) {
        // n is a cat-node with left child c1 and right child c2
        if ( n == c1 · c2 )
            for ( each i in lastpos(c1) )
                followpos(i) = followpos(i) ∪ firstpos(c2);
        else if ( n is a star-node )
            for ( each i in lastpos(n) )
                followpos(i) = followpos(i) ∪ firstpos(n);
    }
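The followpos rules can be checked on a small tree. The sketch below assumes a tuple encoding of syntax-tree nodes ("leaf", "or", "cat", "star") and helper names invented here; ε-leaves are left out for brevity. It uses the augmented expression ( a | b )* a b b # with leaf positions 1 to 6, and computes nullable, firstpos, and lastpos recursively rather than caching them on the nodes.

```python
from collections import defaultdict

def leaf(sym, pos): return ("leaf", sym, pos)
def cat(l, r):      return ("cat", l, r)
def alt(l, r):      return ("or", l, r)
def star(c):        return ("star", c)

def nullable(n):
    kind = n[0]
    if kind == "leaf": return False          # no ε-leaves in this sketch
    if kind == "or":   return nullable(n[1]) or nullable(n[2])
    if kind == "cat":  return nullable(n[1]) and nullable(n[2])
    return True                              # star-node

def firstpos(n):
    kind = n[0]
    if kind == "leaf": return {n[2]}
    if kind == "or":   return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        return firstpos(n[1]) | firstpos(n[2]) if nullable(n[1]) else firstpos(n[1])
    return firstpos(n[1])                    # star-node

def lastpos(n):
    kind = n[0]
    if kind == "leaf": return {n[2]}
    if kind == "or":   return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        return lastpos(n[1]) | lastpos(n[2]) if nullable(n[2]) else lastpos(n[2])
    return lastpos(n[1])                     # star-node

def followpos(root):
    follow = defaultdict(set)
    def visit(n):
        if n[0] == "leaf":
            return
        for child in n[1:]:
            visit(child)
        if n[0] == "cat":                    # lastpos(c1) is followed by firstpos(c2)
            for i in lastpos(n[1]):
                follow[i] |= firstpos(n[2])
        elif n[0] == "star":                 # lastpos(n) is followed by firstpos(n)
            for i in lastpos(n):
                follow[i] |= firstpos(n)
    visit(root)
    return dict(follow)

# ( a | b )* a b b #  with positions 1..6
n = cat(star(alt(leaf("a", 1), leaf("b", 2))), leaf("a", 3))
root = cat(cat(cat(n, leaf("b", 4)), leaf("b", 5)), leaf("#", 6))
print(followpos(root))
```

Position 6 (the end marker #) never appears as a key, since nothing can follow it.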
Converting a RE Directly to a DFA
    Initialize Dstates to contain only the unmarked state firstpos(n0),
        where n0 is the root of the syntax tree T for (r)#;
    while ( there is an unmarked state S in Dstates ) {
        mark S;
        for ( each input symbol a ) {
            let U be the union of followpos(p) for all p in S that correspond to a;
            if ( U is not in Dstates )
                add U as an unmarked state to Dstates;
            Dtran[S, a] = U;
        }
    }
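As a hedged illustration of this construction, the sketch below runs it for ( a | b )* a b b # with the position data written out by hand: the followpos table and firstpos of the root are the values that the followpos rules give for this expression, and the names FOLLOWPOS, SYMBOL_AT, FIRSTPOS_ROOT, END_POS, and build_dfa are chosen here. A DFA state is a frozenset of positions and is accepting when it contains the position of #.

```python
# Hand-written position data for ( a | b )* a b b #  (positions 1..6).
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
FIRSTPOS_ROOT = frozenset({1, 2, 3})
END_POS = 6                      # position of the end marker #

def build_dfa(alphabet="ab"):
    start = FIRSTPOS_ROOT
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        S = unmarked.pop()       # mark S
        for a in alphabet:
            # Union of followpos(p) over the positions p in S labeled a.
            U = frozenset().union(*(FOLLOWPOS[p] for p in S if SYMBOL_AT[p] == a))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[(S, a)] = U
    return start, dstates, dtran

def accepts(x):
    state, _, dtran = build_dfa()
    for c in x:
        state = dtran[(state, c)]
    return END_POS in state

start, dstates, dtran = build_dfa()
print(sorted(sorted(s) for s in dstates))
```

For this expression the result has four states, {1,2,3}, {1,2,3,4}, {1,2,3,5}, and {1,2,3,6}, one fewer than the subset construction typically produces from a Thompson-style NFA for the same expression.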
Example • ( a | b )* a b b # • [Syntax tree: the leaves a, b, a, b, b, # carry positions 1 to 6; n is the cat-node for ( a | b )* a] • n = ( a | b )* a • nullable(n) = false • firstpos(n) = { 1, 2, 3 } • lastpos(n) = { 3 } • followpos(1) = { 1, 2, 3 }
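Example • ( a | b )* a b b # • [Annotated syntax tree: each node is shown with its firstpos and lastpos; the star-node is nullable] • leaf a (position 1): firstpos {1}, lastpos {1} • leaf b (position 2): firstpos {2}, lastpos {2} • or-node a | b: firstpos {1, 2}, lastpos {1, 2} • star-node ( a | b )*: firstpos {1, 2}, lastpos {1, 2} • leaf a (position 3): firstpos {3}, lastpos {3} • cat-node ( a | b )* a: firstpos {1, 2, 3}, lastpos {3} • leaf b (position 4): firstpos {4}, lastpos {4} • cat-node ( a | b )* a b: firstpos {1, 2, 3}, lastpos {4} • leaf b (position 5): firstpos {5}, lastpos {5} • cat-node ( a | b )* a b b: firstpos {1, 2, 3}, lastpos {5} • leaf # (position 6): firstpos {6}, lastpos {6} • root cat-node ( a | b )* a b b #: firstpos {1, 2, 3}, lastpos {6}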