LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006
Outline
• Introduction to Lexical Analysis
• Token specification
    • Language
    • Regular Expressions (REs)
• Token recognition
    • REs → NFA (Thompson’s construction, Algorithm 3.3)
    • NFA → DFA (subset construction, Algorithm 3.2)
    • DFA → minimal DFA (Algorithm 3.6)
• Programming
Introduction
• Read the input characters
• Produce as output a sequence of tokens
• Eliminate white space and comments
[Figure: source program → lexical analyzer → (token) → parser; the parser requests "get next token"; both components use the symbol table]
Why a separate lexical analysis phase?
• Simplify design
• Improve compiler efficiency
• Enhance compiler portability
Tokens, Patterns, Lexemes
Alphabet, Strings and Languages
• Alphabet ∑: any finite set of symbols
    • The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ,…}
    • The binary alphabet {0,1}
    • The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
    • Length |s| of a string s: the number of symbols in s
    • The empty string, denoted ε, has length |ε| = 0
• Language: any set of strings over ∑; its two special cases:
    • ∅: the empty set
    • {ε}: the set containing only the empty string
Examples of Languages
• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ,…}
    • Vietnamese language
• ∑ = {0,1}
    • A string is an instruction
    • The set of Pentium instructions
• ∑ = the ASCII set
    • A string is a program
    • The set of C programs
Terms (Fig. 3.7)
String operations
• String concatenation
    • If x and y are strings, xy is the string formed by appending y to x. E.g.: x = hom, y = nay ⇒ xy = homnay
    • ε is the identity: εy = y; xε = x
• String exponentiation
    • s^0 = ε
    • s^i = s^(i-1)s for i > 0
    • E.g.: s = 01, s^0 = ε, s^2 = 0101, s^3 = 010101
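The two operations above can be sketched with ordinary Java strings, where the empty string "" plays the role of ε. The class and method names are illustrative only and do not come from the slides.

```java
final class StringOps {
    // s^0 = ε (the empty string); s^i = s^(i-1) s
    static String power(String s, int i) {
        String result = "";               // s^0 = ε
        for (int k = 0; k < i; k++) {
            result = result + s;          // s^(k+1) = s^k s
        }
        return result;
    }

    public static void main(String[] args) {
        String x = "hom", y = "nay";
        System.out.println(x + y);            // concatenation xy
        System.out.println("" + y);           // ε is the identity for concatenation
        System.out.println(power("01", 3));   // s^3 for s = 01
    }
}
```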
Language Operations (Fig. 3.8)
Examples
• L = {A,B,…,Z,a,b,…,z}
• D = {0,1,…,9}
• L ∪ D: letters and digits
• LD: strings consisting of a letter followed by a digit
• L^4: all four-letter strings
• L*: all strings of letters, including ε
• L(L ∪ D)*: all strings of letters and digits beginning with a letter
• D+: all strings of one or more digits
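For finite languages, union, concatenation and exponentiation can be tried out directly on sets of strings. The sketch below is only an illustration: the names `LanguageOps`, `of`, `power` and `closureUpTo` are assumptions of this example. Because L* is infinite for any language containing a non-empty string, `closureUpTo` computes only the finite part L^0 ∪ … ∪ L^max.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class LanguageOps {
    // Convenience constructor for a finite language
    static Set<String> of(String... strings) {
        return new TreeSet<>(Arrays.asList(strings));
    }

    // L ∪ M
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>(l);
        result.addAll(m);
        return result;
    }

    // LM = { xy | x in L, y in M }
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>();
        for (String x : l) {
            for (String y : m) {
                result.add(x + y);
            }
        }
        return result;
    }

    // L^0 = {ε}; L^i = L^(i-1) L
    static Set<String> power(Set<String> l, int i) {
        Set<String> result = of("");
        for (int k = 0; k < i; k++) {
            result = concat(result, l);
        }
        return result;
    }

    // Finite approximation of L*: L^0 ∪ L^1 ∪ … ∪ L^max
    static Set<String> closureUpTo(Set<String> l, int max) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i <= max; i++) {
            result.addAll(power(l, i));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> letters = of("a", "b");   // a tiny stand-in for L
        Set<String> digits = of("0", "1");    // a tiny stand-in for D
        System.out.println(union(letters, digits));
        System.out.println(concat(letters, digits));
        System.out.println(power(letters, 2));
        System.out.println(closureUpTo(digits, 2));
    }
}
```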
Regular Expressions (REs) over Alphabet ∑
• Inductive base:
    • ε is an RE, denoting the regular language (RL) {ε}
    • a ∈ ∑ is an RE, denoting the RL {a}
• Inductive step: suppose r and s are REs, denoting the languages L(r) and L(s). Then
    • (r)|(s) is an RE, denoting the RL L(r) ∪ L(s)
    • (r)(s) is an RE, denoting the RL L(r)L(s)
    • (r)* is an RE, denoting the RL (L(r))*
    • (r) is an RE, denoting the RL L(r)
Precedence and Associativity
• Precedence:
    • “*” has the highest precedence
    • concatenation has the second highest precedence
    • “|” has the lowest precedence
• Associativity:
    • all are left-associative
• Unnecessary parentheses can be removed. E.g.: (a)|((b)*(c)) ≡ a|b*c
Example
• ∑ = {a, b}
• a|b denotes {a,b}
• (a|b)(a|b) denotes {aa,ab,ba,bb}
• a* denotes {ε,a,aa,aaa,aaaa,…}
• (a|b)* denotes ?
• a|a*b denotes ?
Notational Shorthands
• One or more instances +: r+ = rr*
    • denotes the language (L(r))+
    • has the same precedence and associativity as *
• Zero or one instance ?: r? = r|ε
    • denotes the language L(r) ∪ {ε}
• Character classes
    • [abc] denotes a|b|c
    • [A-Z] denotes A|B|…|Z
    • [a-zA-Z_][a-zA-Z0-9_]* denotes ?
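Java's `java.util.regex` package happens to use the same shorthands, so they can be tried out with `Pattern.matches`, which tests whether the whole input matches. This is only an illustration; the helper name `matches` and the sample inputs are choices made for this example.

```java
import java.util.regex.Pattern;

final class ShorthandDemo {
    // True iff the whole input string matches the regular expression
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // r+ = rr*: both reject the empty string and accept one or more a's
        System.out.println(matches("a+", "aaa") + " " + matches("aa*", "aaa"));
        System.out.println(matches("a+", "") + " " + matches("aa*", ""));

        // r? = r|ε: the b is optional
        System.out.println(matches("ab?", "a") + " " + matches("ab?", "ab"));

        // Character classes: a letter or underscore, then letters, digits or underscores
        String pattern = "[a-zA-Z_][a-zA-Z0-9_]*";
        System.out.println(matches(pattern, "_tmp1") + " " + matches(pattern, "1tmp"));
    }
}
```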
Overview
[Figure: RE → NFA (Algorithm 3.3) → DFA (Algorithm 3.2) → mDFA (Algorithm 3.6); one further edge in the figure is labelled 3.5]
Nondeterministic finite automata
• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
    • a finite set of states S
    • a set of input symbols ∑
    • a transition function move: S × (∑ ∪ {ε}) → 2^S, mapping state-symbol pairs to sets of states
    • a start state s0
    • a finite set of final or accepting states F
Transition graph
[Figure: a transition graph over states A and B]
Transition table
[Table: one row per state, one column per input symbol]
Acceptance
• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
[Figure: two traces on an automaton with states A and B. Input 01010 follows A → B → A → B → A → B; input 01011 follows A → B → A → B → A and then has no move on the last symbol (error).]
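The acceptance condition can be checked without enumerating paths by tracking the set of states reachable after each input symbol. The sketch below assumes an NFA without ε-transitions, and its example automaton (states 0 to 3, recognising (a|b)*abb) is a hypothetical one chosen for illustration, not the automaton from the slide's figure.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class NfaAccept {
    // move.get(state).get(symbol) = set of successor states
    private final Map<Integer, Map<Character, Set<Integer>>> move = new HashMap<>();
    private final int start;
    private final Set<Integer> accepting;

    NfaAccept(int start, Set<Integer> accepting) {
        this.start = start;
        this.accepting = accepting;
    }

    void addEdge(int from, char symbol, int to) {
        move.computeIfAbsent(from, k -> new HashMap<>())
            .computeIfAbsent(symbol, k -> new HashSet<>())
            .add(to);
    }

    // x is accepted iff some path spelling x ends in an accepting state
    boolean accepts(String x) {
        Set<Integer> current = new HashSet<>();
        current.add(start);
        for (char c : x.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) {
                next.addAll(move.getOrDefault(s, Collections.emptyMap())
                                .getOrDefault(c, Collections.emptySet()));
            }
            current = next;   // an empty set means every path has died
        }
        for (int s : current) {
            if (accepting.contains(s)) return true;
        }
        return false;
    }

    // Hypothetical example NFA for (a|b)*abb
    static NfaAccept example() {
        NfaAccept nfa = new NfaAccept(0, Collections.singleton(3));
        nfa.addEdge(0, 'a', 0);
        nfa.addEdge(0, 'b', 0);
        nfa.addEdge(0, 'a', 1);   // nondeterministic choice on a
        nfa.addEdge(1, 'b', 2);
        nfa.addEdge(2, 'b', 3);
        return nfa;
    }

    public static void main(String[] args) {
        System.out.println(example().accepts("babb"));
        System.out.println(example().accepts("abba"));
    }
}
```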
Deterministic finite automata
• A deterministic finite automaton (DFA) is a special case of NFA in which
    • no state has an ε-transition, and
    • for each state s and input symbol a, there is at most one edge labeled a leaving s.
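Thompson’s construction of NFA from REs
• guided by the syntactic structure of the RE r
• For ε: start state i, accepting state f, and one edge i → f labelled ε
• For a in ∑: start state i, accepting state f, and one edge i → f labelled a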
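Thompson’s construction (cont’d)
• Suppose N(s) and N(t) are NFAs for REs s and t
• For s|t: a new start state i has ε-edges to the start states of N(s) and N(t); their accepting states have ε-edges to a new accepting state f
• For st: the accepting state of N(s) is merged with the start state of N(t); the start state of N(s) and the accepting state of N(t) become those of the composite NFA
• For s*: a new start state i has ε-edges to the start state of N(s) and to a new accepting state f; the accepting state of N(s) has ε-edges back to its own start state and to f
• For (s), use N(s) itself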
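A minimal sketch of the construction, written as one method per RE operator. Several things here are assumptions of the example rather than part of the slides: the class and method names, the use of the character '\0' to stand for ε, and the representation of edges as label/target pairs. For concatenation the sketch links the two fragments with an ε-edge instead of merging states, a common simplification that accepts the same language. Each fragment should be used only once when composing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class Thompson {
    static final char EPS = '\0';   // stands for ε; input must not contain '\0'

    // An NFA fragment with one start state and one accepting state
    static final class Nfa {
        final int start, accept;
        Nfa(int start, int accept) { this.start = start; this.accept = accept; }
    }

    // edges.get(state) = list of {label, target} pairs
    private final List<List<int[]>> edges = new ArrayList<>();

    private int newState() { edges.add(new ArrayList<>()); return edges.size() - 1; }
    private void edge(int from, char label, int to) { edges.get(from).add(new int[] {label, to}); }

    Nfa symbol(char a) {                       // RE a (or ε when a == EPS)
        int i = newState(), f = newState();
        edge(i, a, f);
        return new Nfa(i, f);
    }

    Nfa union(Nfa s, Nfa t) {                  // RE s|t
        int i = newState(), f = newState();
        edge(i, EPS, s.start); edge(i, EPS, t.start);
        edge(s.accept, EPS, f); edge(t.accept, EPS, f);
        return new Nfa(i, f);
    }

    Nfa concat(Nfa s, Nfa t) {                 // RE st, linked by an ε-edge
        edge(s.accept, EPS, t.start);
        return new Nfa(s.start, t.accept);
    }

    Nfa star(Nfa s) {                          // RE s*
        int i = newState(), f = newState();
        edge(i, EPS, s.start); edge(i, EPS, f);
        edge(s.accept, EPS, s.start); edge(s.accept, EPS, f);
        return new Nfa(i, f);
    }

    // All states reachable from the given states on ε-edges alone
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> stack = new ArrayDeque<>(states);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int[] e : edges.get(s)) {
                if (e[0] == EPS && result.add(e[1])) stack.push(e[1]);
            }
        }
        return result;
    }

    boolean accepts(Nfa n, String x) {
        Set<Integer> current = closure(Collections.singleton(n.start));
        for (char c : x.toCharArray()) {
            Set<Integer> moved = new HashSet<>();
            for (int s : current) {
                for (int[] e : edges.get(s)) {
                    if (e[0] == c) moved.add(e[1]);
                }
            }
            current = closure(moved);
        }
        return current.contains(n.accept);
    }

    // Builds the NFA for (a|b)*abb and runs it on x
    static boolean demo(String x) {
        Thompson t = new Thompson();
        Nfa prefix = t.star(t.union(t.symbol('a'), t.symbol('b')));
        Nfa n = t.concat(t.concat(t.concat(prefix, t.symbol('a')), t.symbol('b')), t.symbol('b'));
        return t.accepts(n, x);
    }
}
```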
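Subset construction
Operations on NFA states (s: an NFA state; T: a set of NFA states):
• ε-closure(s): the set of NFA states reachable from s on ε-transitions alone
• ε-closure(T): the set of NFA states reachable from some s in T on ε-transitions alone
• move(T, a): the set of NFA states to which there is a transition on input symbol a from some s in T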
Subset construction (cont’d)
Let s0 be the start state of the NFA;
initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;
DFA
• Let (∑, S, T, F, s0) be the original NFA. The DFA is:
    • The alphabet: ∑
    • The states: all states in Dstates
    • The transitions: DTran
    • The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
    • The start state: ε-closure(s0)
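The subset construction can be sketched in Java as follows. The representation is an assumption of this example: NFA states are integers, '\0' stands for ε, DFA states are identified by their index in `dstates`, and "marked" states are simply those whose index has already been visited. The example NFA is the usual eleven-state Thompson NFA for (a|b)*abb, used here purely as test data.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPS = '\0';   // stands for ε (an assumption of this sketch)

    final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
    final List<Set<Integer>> dstates = new ArrayList<>();               // Dstates
    final Map<Integer, Map<Character, Integer>> dtran = new HashMap<>(); // DTran

    void edge(int from, char label, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(label, k -> new TreeSet<>()).add(to);
    }

    private Set<Integer> targets(int s, char label) {
        return nfa.getOrDefault(s, Collections.emptyMap())
                  .getOrDefault(label, Collections.emptySet());
    }

    // ε-closure(T)
    Set<Integer> closure(Set<Integer> t) {
        Set<Integer> result = new TreeSet<>(t);
        Deque<Integer> stack = new ArrayDeque<>(t);
        while (!stack.isEmpty()) {
            for (int u : targets(stack.pop(), EPS)) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    // move(T, a)
    Set<Integer> move(Set<Integer> t, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int s : t) result.addAll(targets(s, a));
        return result;
    }

    void build(int s0, char[] alphabet) {
        dstates.add(closure(Collections.singleton(s0)));
        for (int t = 0; t < dstates.size(); t++) {          // states below index t are marked
            for (char a : alphabet) {
                Set<Integer> u = closure(move(dstates.get(t), a));
                int index = dstates.indexOf(u);
                if (index < 0) {                             // U is not yet in Dstates
                    dstates.add(u);
                    index = dstates.size() - 1;
                }
                dtran.computeIfAbsent(t, k -> new HashMap<>()).put(a, index);
            }
        }
    }

    // Example input: an NFA for (a|b)*abb with states 0..10, accepting state 10
    static SubsetConstruction example() {
        SubsetConstruction sc = new SubsetConstruction();
        int[][] eps = {{0, 1}, {0, 7}, {1, 2}, {1, 4}, {3, 6}, {5, 6}, {6, 1}, {6, 7}};
        for (int[] e : eps) sc.edge(e[0], EPS, e[1]);
        sc.edge(2, 'a', 3);
        sc.edge(4, 'b', 5);
        sc.edge(7, 'a', 8);
        sc.edge(8, 'b', 9);
        sc.edge(9, 'b', 10);
        sc.build(0, new char[] {'a', 'b'});
        return sc;
    }
}
```

In this example a DFA state is accepting when its set contains NFA state 10, following the rule on the slide above.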
Minimise a DFA
Initially, create two states:
• one is the set of all final states: F
• the other is the set of all non-final states: S - F
while (more splits are possible) {
    Let G = {s1,…, sn} be one of the current states (a group of original DFA states) and c be any char in ∑
    Let t1,…, tn be the successor states to s1,…, sn under c
    if (t1,…, tn don't all belong to the same state) {
        Split G into new states so that si and sj remain in the same state iff ti and tj are in the same state
    }
}
Example
[Figure: a DFA over {a, b} with states A, B, C, D, E, and the minimised DFA obtained from it]
Step 1: {A,B,C,D} {E}
    For a, {B,B,B,B}
    For b, {C,D,C,E}
    Split: {A,B,C} {D} {E}
Step 2: For b, {C,D,C}
    Split: {A,C} {B} {D} {E}
Step 3: For a, {B,B}
    For b, {C,C}
    Terminate
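The splitting procedure can be sketched as repeated refinement: each state gets a signature made of its own group and the groups of its successors, states with equal signatures stay together, and the loop stops once the number of groups no longer grows. This formulation splits on all symbols at once rather than one symbol at a time, but it reaches the same final partition. The transitions of A to D below are read off the example steps; the row for E (a → B, b → C) is an assumption, since the figure did not survive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Minimise {
    // delta[state][symbolIndex] = successor state.
    // Returns, for each state, the id of the group it ends up in.
    static int[] minimise(int[][] delta, boolean[] accepting) {
        int n = delta.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) {
            group[s] = accepting[s] ? 1 : 0;    // initial split: F and S - F
        }
        int count = -1;
        while (true) {
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                // Signature: own group, then the group of each successor
                List<Integer> signature = new ArrayList<>();
                signature.add(group[s]);
                for (int t : delta[s]) signature.add(group[t]);
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                next[s] = id;
            }
            if (ids.size() == count) return next;   // no group was split
            count = ids.size();
            group = next;
        }
    }

    // States A..E are numbered 0..4; column 0 is input a, column 1 is input b
    static int[] example() {
        int[][] delta = {
            {1, 2},   // A: a -> B, b -> C
            {1, 3},   // B: a -> B, b -> D
            {1, 2},   // C: a -> B, b -> C
            {1, 4},   // D: a -> B, b -> E
            {1, 2},   // E: a -> B, b -> C (assumed)
        };
        boolean[] accepting = {false, false, false, false, true};
        return minimise(delta, accepting);
    }
}
```

On this input A and C receive the same group id while B, D and E each get their own, matching the final partition {A,C} {B} {D} {E}.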
Input Buffering
[Figure: the Scanner reads from a buffer of two halves holding “begin…”, with eof marking the end of the input]
if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++
Input Buffering (with sentinels)
[Figure: the same buffer with an eof sentinel at the end of each half, plus the eof that ends the input]
forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}
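A minimal Java sketch of the sentinel scheme above. It assumes that the character '\0' never occurs in the source text so that it can serve as the eof sentinel, and the names `SentinelBuffer`, `load` and `readAll` are inventions of this example. It also ignores the lexeme-beginning pointer, which a real scanner would have to protect from being overwritten on reload.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SentinelBuffer {
    static final char EOF = '\0';    // sentinel; assumed never to occur in the input
    private final Reader in;
    private final int n;             // size of each half (must be at least 1)
    private final char[] buf;        // layout: [half 1][sentinel][half 2][sentinel]
    private int forward = -1;

    SentinelBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        load(0);
    }

    // Fills one half starting at offset and places a sentinel after the last char read
    private void load(int offset) throws IOException {
        int read = 0;
        while (read < n) {
            int k = in.read(buf, offset + read, n - read);
            if (k < 0) break;        // the reader is exhausted
            read += k;
        }
        buf[offset + read] = EOF;
    }

    // Returns the next input character, or EOF at the end of the input
    char next() throws IOException {
        forward++;
        if (buf[forward] == EOF) {
            if (forward == n) {                    // end of first half
                load(n + 1);
                forward++;
            } else if (forward == 2 * n + 1) {     // end of second half
                load(0);
                forward = 0;
            }
            if (buf[forward] == EOF) {             // genuine end of input
                forward--;                         // stay put so later calls also return EOF
                return EOF;
            }
        }
        return buf[forward];
    }

    // Reads the whole text back through the buffer, one character at a time
    static String readAll(String text, int halfSize) {
        try {
            SentinelBuffer b = new SentinelBuffer(new StringReader(text), halfSize);
            StringBuilder out = new StringBuilder();
            for (char c = b.next(); c != EOF; c = b.next()) out.append(c);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The common case costs a single comparison per character (`buf[forward] == EOF`); the two boundary tests run only when a sentinel is actually seen.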
Transition Diagrams
• relop → <= | < | <>
    • 0 → 1 on <
    • 1 → 2 on =: return(relop, LE)
    • 1 → 3 on >: return(relop, NE)
    • 1 → 4 on other: return(relop, LT)
• id → letter(letter|digit)*
    • 5 → 6 on letter
    • 6 → 6 on letter or digit
    • 6 → 7 on other: return(id, lexeme)
• A transition diagram is a DFA in which no edge leaves a final state
Implementation
Token nexttoken() {
    while (true) {
        switch (state) {
        case 0: c = nextchar();
            if (c == '<') state = 1; else state = fail(0);
            break;
        case 1: c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: retract(0); return new Token(relop, "<=");
        case 3: retract(0); return new Token(relop, "<>");
        case 4: retract(1); return new Token(relop, "<");
        case 5: c = nextchar();
            if (Character.isLetter(c)) state = 6; else state = fail(5);
            break;
        case 6: c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;  // stay in state 6
            else state = 7;
            break;
        case 7: retract(1); return new Token(id, getLexeme());
        }
    }
}
Implementation (cont’d)
int fail(int current_state) {
    forward = beginning;            // restart the scan at the beginning of the lexeme
    switch (current_state) {
    case 0: return 5;               // relop diagram failed: try the id diagram
    case 5: error();                // no diagram left to try
    }
}
void retract(int flag) {
    if (flag == 1) move forward back
    get lexeme from beginning to forward
    move forward onward
    beginning = forward
    state = 0
}
[Figure: input buffer b│e│g│i│n│:│=│ │ │…]
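A compact, self-contained variant of the state-machine code above, for readers who want something runnable. It is a sketch under several assumptions that are not in the slides: the input is a plain `String` rather than a buffer pair, tokens are rendered as strings such as `relop:<=`, blanks between tokens are skipped, and an unexpected character raises an exception. State numbers follow the transition diagrams for relop (0 to 4) and id (5 to 7).

```java
import java.util.ArrayList;
import java.util.List;

final class RelopIdLexer {
    private final String input;
    private int beginning = 0;   // start of the current lexeme
    private int forward = 0;     // next character to read

    RelopIdLexer(String input) { this.input = input; }

    private char nextChar() {
        char c = forward < input.length() ? input.charAt(forward) : '\0';
        forward++;               // advance even at the end so retraction is uniform
        return c;
    }

    // The relop diagram failed: rewind and switch to the id diagram (state 5)
    private int fail() { forward = beginning; return 5; }

    // Optionally give back the lookahead character, then cut out the lexeme
    private String accept(String kind, boolean retract) {
        if (retract) forward--;
        String lexeme = input.substring(beginning, forward);
        beginning = forward;
        return kind + ":" + lexeme;
    }

    // Returns the next token, or null at the end of the input
    String nextToken() {
        while (beginning < input.length() && input.charAt(beginning) == ' ') beginning++;
        if (beginning >= input.length()) return null;
        forward = beginning;
        int state = 0;
        while (true) {
            switch (state) {
                case 0: state = (nextChar() == '<') ? 1 : fail(); break;
                case 1: {
                    char c = nextChar();
                    state = (c == '=') ? 2 : (c == '>') ? 3 : 4;
                    break;
                }
                case 2: return accept("relop", false);   // <=
                case 3: return accept("relop", false);   // <>
                case 4: return accept("relop", true);    // <
                case 5:
                    if (!Character.isLetter(nextChar())) {
                        throw new IllegalStateException("unexpected character at " + beginning);
                    }
                    state = 6;
                    break;
                case 6:
                    if (!Character.isLetterOrDigit(nextChar())) state = 7;
                    break;
                case 7: return accept("id", true);
            }
        }
    }

    static List<String> tokenize(String input) {
        RelopIdLexer lexer = new RelopIdLexer(input);
        List<String> tokens = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) tokens.add(t);
        return tokens;
    }
}
```

For instance, `RelopIdLexer.tokenize("a1 <= b<>c <")` yields the six tokens id:a1, relop:<=, id:b, relop:<>, id:c and relop:<.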