LEXICAL ANALYSIS

Phung Hua Nguyen, University of Technology, 2006

Outline
• Introduction to Lexical Analysis
• Token specification
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction, Algorithm 3.3)
  • NFA → DFA (subset construction, Algorithm 3.2)
  • DFA → minimal DFA (Algorithm 3.6)
• Programming

Introduction
• Read the input characters
• Produce as output a sequence of tokens
• Eliminate white space and comments
[Figure: source program → lexical analyzer → token → parser; the parser asks the lexical analyzer for each token ("get next token"); both components use the symbol table]

Why a separate lexical analyzer?
• Simplify design
• Improve compiler efficiency
• Enhance compiler portability

Tokens, Patterns, Lexemes

Outline
• Introduction ✓
• Token specification
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction, Algorithm 3.3)
  • NFA → DFA (subset construction, Algorithm 3.2)
  • DFA → minimal DFA (Algorithm 3.6)
• Programming

Alphabet, Strings and Languages
• Alphabet ∑: any finite set of symbols
  • The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  • The binary alphabet {0, 1}
  • The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
  • Length |s| of a string s: the number of symbols in s
  • The empty string, denoted ε, has length |ε| = 0
• Language: any set of strings over ∑
  • Two special cases: ∅, the empty set, and {ε}, the set containing only the empty string

Examples of Languages
• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  • The Vietnamese language
• ∑ = {0, 1}
  • A string is an instruction
  • The set of Pentium instructions
• ∑ = the ASCII set
  • A string is a program
  • The set of C programs

Terms (Fig. 3.7)

String operations
• String concatenation
  • If x and y are strings, xy is the string formed by appending y to x. E.g.: x = hom, y = nay → xy = homnay
  • ε is the identity: εy = y; xε = x
• String exponentiation
  • s^0 = ε
  • s^i = s^(i-1) s
  • E.g.: s = 01, s^0 = ε, s^2 = 0101, s^3 = 010101
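The two operations can be sketched in Java. This is only an illustration: the names `StringOps`, `concat` and `power` are made up for this example, and ε is represented by the empty string `""`.

```java
// Illustrative sketch: concatenation and exponentiation of strings.
final class StringOps {
    // xy: the string formed by appending y to x. "" plays the role of ε.
    static String concat(String x, String y) {
        return x + y;
    }

    // s^0 = ε and s^i = s^(i-1) s
    static String power(String s, int i) {
        if (i < 0) throw new IllegalArgumentException("exponent must be >= 0");
        String result = "";                 // s^0 = ε
        for (int k = 0; k < i; k++) {
            result = concat(result, s);     // s^(k+1) = s^k s
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(concat("hom", "nay")); // homnay
        System.out.println(power("01", 3));       // 010101
    }
}
```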

Language Operations (Fig. 3.8)

Examples
• L = {A, B, …, Z, a, b, …, z}
• D = {0, 1, …, 9}
• L ∪ D: letters and digits
• LD: strings consisting of a letter followed by a digit
• L^4: all four-letter strings
• L*: all strings of letters, including ε
• L(L ∪ D)*: all strings of letters and digits beginning with a letter
• D+: all strings of one or more digits

Regular Expressions (REs) over Alphabet ∑
• Inductive base:
  • ε is a RE, denoting the regular language (RL) {ε}
  • a ∈ ∑ is a RE, denoting the RL {a}
• Inductive step: suppose r and s are REs, denoting the languages L(r) and L(s). Then
  • (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
  • (r)(s) is a RE, denoting the RL L(r)L(s)
  • (r)* is a RE, denoting the RL (L(r))*
  • (r) is a RE, denoting the RL L(r)

Precedence and Associativity
• Precedence:
  • “*” has the highest precedence
  • “concatenation” has the second highest precedence
  • “|” has the lowest precedence
• Associativity:
  • all are left-associative
• E.g.: (a)|((b)*(c)) ≡ a|b*c, so unnecessary parentheses can be removed

Example
• ∑ = {a, b}
• a|b denotes {a, b}
• (a|b)(a|b) denotes {aa, ab, ba, bb}
• a* denotes {ε, a, aa, aaa, aaaa, …}
• (a|b)* denotes ?
• a|a*b denotes ?

Notational Shorthands
• One or more instances +: r+ = rr*
  • denotes the language (L(r))+
  • has the same precedence and associativity as *
• Zero or one instance ?: r? = r|ε
  • denotes the language L(r) ∪ {ε}
• Character classes
  • [abc] denotes a|b|c
  • [A-Z] denotes A|B|…|Z
  • [a-zA-Z_][a-zA-Z0-9_]* denotes ?
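These shorthands can be tried out with `java.util.regex`, whose syntax agrees with the slide notation for the simple patterns used here. This is only a way to test membership in the denoted language; it says nothing about how a lexical analyzer is built, since Java's engine is not the automaton construction covered in this lecture. The class name `ShorthandDemo` and the helper `inLanguage` are made up for the example.

```java
import java.util.regex.Pattern;

final class ShorthandDemo {
    // True iff the whole string s belongs to the language denoted by regex.
    static boolean inLanguage(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String id = "[a-zA-Z_][a-zA-Z0-9_]*";
        System.out.println(inLanguage("a+", "aaa"));    // true: one or more a
        System.out.println(inLanguage("a+", ""));       // false: ε is not in L(a+)
        System.out.println(inLanguage("a?", ""));       // true: zero or one a
        System.out.println(inLanguage("[abc]", "b"));   // true: same as a|b|c
        System.out.println(inLanguage(id, "_count1"));  // true
        System.out.println(inLanguage(id, "1count"));   // false: starts with a digit
    }
}
```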

Outline
• Introduction ✓
• Token specification ✓
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction, Algorithm 3.3)
  • NFA → DFA (subset construction, Algorithm 3.2)
  • DFA → minimal DFA (Algorithm 3.6)
• Programming

Overview
[Figure: RE → NFA (Algorithm 3.3) → DFA (Algorithm 3.2) → mDFA (Algorithm 3.6); the figure also carries an edge labelled 3.5]

Nondeterministic finite automata
• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
  • a finite set of states S
  • a set of input symbols ∑
  • a transition function move: S × ∑ → 2^S, mapping each state-symbol pair to a set of states
  • a start state s0
  • a finite set of final or accepting states F

Transition graph
[Figure: transition graph of an example NFA with states A and B]

Transition table
[Table: one row per state, one column per input symbol; the entries are not preserved]

Acceptance
• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
• 01010: A -0→ B -1→ A -0→ B -1→ A -0→ B
• 01011: A -0→ B -1→ A -0→ B -1→ A -1→ ? (error)
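The acceptance condition can be checked by tracking the set of states reachable after each input symbol. The sketch below is a minimal illustration, not the lecture's code. The two-state automaton in `example()` (A goes to B on 0, B goes to A on 1, B accepting) is an assumption reconstructed from the two traces above, and the names `NfaSim`, `EPS`, `addEdge` and `accepts` are made up for the example.

```java
import java.util.*;

final class NfaSim {
    static final char EPS = '\0';   // stands for an ε label (a convention of this sketch)
    final Map<String, Map<Character, Set<String>>> delta = new HashMap<>();
    final String start;
    final Set<String> accepting;

    NfaSim(String start, Set<String> accepting) {
        this.start = start;
        this.accepting = accepting;
    }

    void addEdge(String from, char label, String to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(label, k -> new HashSet<>())
             .add(to);
    }

    // States reachable from s by one edge with the given label.
    Set<String> successors(String s, char label) {
        return delta.getOrDefault(s, Collections.emptyMap())
                    .getOrDefault(label, Collections.emptySet());
    }

    // All states reachable from `states` using ε-edges only.
    Set<String> closure(Set<String> states) {
        Deque<String> stack = new ArrayDeque<>(states);
        Set<String> result = new HashSet<>(states);
        while (!stack.isEmpty()) {
            for (String t : successors(stack.pop(), EPS)) {
                if (result.add(t)) stack.push(t);
            }
        }
        return result;
    }

    // True iff some path from the start state to an accepting state spells out x.
    boolean accepts(String x) {
        Set<String> current = closure(Collections.singleton(start));
        for (char c : x.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String s : current) next.addAll(successors(s, c));
            current = closure(next);
        }
        for (String s : current) {
            if (accepting.contains(s)) return true;
        }
        return false;
    }

    // Assumed automaton: A -0-> B, B -1-> A, with B accepting.
    static NfaSim example() {
        NfaSim n = new NfaSim("A", Collections.singleton("B"));
        n.addEdge("A", '0', "B");
        n.addEdge("B", '1', "A");
        return n;
    }

    public static void main(String[] args) {
        System.out.println(example().accepts("01010")); // true
        System.out.println(example().accepts("01011")); // false
    }
}
```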

Deterministic finite automata
• A deterministic finite automaton (DFA) is a special case of NFA in which
  • no state has an ε-transition, and
  • for each state s and input symbol a, there is at most one edge labeled a leaving s.

Thompson’s construction of NFA from REs
• guided by the syntactic structure of the RE r
• For ε: i -ε→ f
• For a in ∑: i -a→ f
(i is the start state and f the accepting state of the constructed NFA)

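Thompson’s construction (cont’d)
• Suppose N(s) and N(t) are NFAs for REs s and t
• For s|t: a new start state i has ε-transitions to the start states of N(s) and N(t); the accepting states of N(s) and N(t) have ε-transitions to a new accepting state f
• For st: the accepting state of N(s) is merged with the start state of N(t); the start state of N(s) becomes i and the accepting state of N(t) becomes f
• For s*: a new start state i has ε-transitions to the start state of N(s) and to a new accepting state f; the accepting state of N(s) has ε-transitions back to the start state of N(s) and to f
• For (s), use N(s) itself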
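The construction rules can be sketched as one method per rule, each returning a fragment with a single start state i and a single accepting state f. This is an illustrative sketch, not the textbook algorithm verbatim: for concatenation it links N(s) to N(t) with an ε-edge instead of merging the two states, which is simpler to code and accepts the same language. All names (`Thompson`, `Frag`, `Edge`, `symbol`, `union`, `concat`, `star`) are made up for the example.

```java
import java.util.*;

final class Thompson {
    // One NFA edge; label == null means an ε-transition.
    static final class Edge {
        final int from; final Character label; final int to;
        Edge(int from, Character label, int to) { this.from = from; this.label = label; this.to = to; }
    }
    // An NFA fragment with start state i and accepting state f.
    static final class Frag {
        final int i, f;
        Frag(int i, int f) { this.i = i; this.f = f; }
    }

    final List<Edge> edges = new ArrayList<>();
    int stateCount = 0;

    int newState() { return stateCount++; }
    void edge(int from, Character label, int to) { edges.add(new Edge(from, label, to)); }

    Frag epsilon() {                       // i -ε-> f
        Frag n = new Frag(newState(), newState());
        edge(n.i, null, n.f);
        return n;
    }
    Frag symbol(char a) {                  // i -a-> f
        Frag n = new Frag(newState(), newState());
        edge(n.i, a, n.f);
        return n;
    }
    Frag union(Frag s, Frag t) {           // s|t
        Frag n = new Frag(newState(), newState());
        edge(n.i, null, s.i); edge(n.i, null, t.i);
        edge(s.f, null, n.f); edge(t.f, null, n.f);
        return n;
    }
    Frag concat(Frag s, Frag t) {          // st, linked by an ε-edge (simplification)
        edge(s.f, null, t.i);
        return new Frag(s.i, t.f);
    }
    Frag star(Frag s) {                    // s*
        Frag n = new Frag(newState(), newState());
        edge(n.i, null, s.i); edge(n.i, null, n.f);
        edge(s.f, null, s.i); edge(s.f, null, n.f);
        return n;
    }

    // ε-closure by repeated scanning of the edge list (fine for small examples).
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Edge e : edges) {
                if (e.label == null && result.contains(e.from) && result.add(e.to)) changed = true;
            }
        }
        return result;
    }

    boolean accepts(Frag n, String x) {
        Set<Integer> cur = closure(Collections.singleton(n.i));
        for (char c : x.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (Edge e : edges) {
                if (e.label != null && e.label == c && cur.contains(e.from)) next.add(e.to);
            }
            cur = closure(next);
        }
        return cur.contains(n.f);
    }

    public static void main(String[] args) {
        Thompson t = new Thompson();
        Frag ab = t.star(t.union(t.symbol('a'), t.symbol('b')));   // (a|b)*
        Frag n = t.concat(ab, t.symbol('a'));                      // (a|b)*a
        n = t.concat(n, t.symbol('b'));                            // (a|b)*ab
        n = t.concat(n, t.symbol('b'));                            // (a|b)*abb
        System.out.println(t.accepts(n, "aababb")); // true
        System.out.println(t.accepts(n, "abab"));   // false
    }
}
```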

Outline
• Introduction ✓
• Token specification ✓
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction) ✓
  • NFA → DFA (subset construction)
  • DFA → minimal DFA (Algorithm 3.6)
• Programming

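Subset construction
• s: an NFA state
• T: a set of NFA states
• ε-closure(s): the set of NFA states reachable from s on ε-transitions alone
• ε-closure(T): the set of NFA states reachable from some state in T on ε-transitions alone
• move(T, a): the set of NFA states to which there is a transition on input symbol a from some state in T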

Subset construction (cont’d)
Let s0 be the start state of the NFA;
initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;

DFA
• Let (∑, S, T, F, s0) be the original NFA. The DFA is:
  • The alphabet: ∑
  • The states: all states in Dstates
  • The transitions: DTran
  • The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
  • The start state: ε-closure(s0)
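The subset construction can be sketched in Java as follows. This is a minimal illustration under stated assumptions: the input NFA in `example()` is the usual textbook NFA for (a|b)*abb with states 0 to 10, which is not given in this transcript; empty target sets are skipped rather than stored as a dead state; and all names (`SubsetConstruction`, `EPS`, `edge`, `build`) are made up for the example.

```java
import java.util.*;

final class SubsetConstruction {
    static final char EPS = '\0';   // stands for an ε label (a convention of this sketch)
    // nfa.get(s) maps a label to the set of successor states of NFA state s.
    final List<Map<Character, Set<Integer>>> nfa = new ArrayList<>();

    void edge(int from, char label, int to) {
        while (nfa.size() <= Math.max(from, to)) nfa.add(new HashMap<>());
        nfa.get(from).computeIfAbsent(label, k -> new TreeSet<>()).add(to);
    }

    Set<Integer> epsClosure(Set<Integer> t) {
        Deque<Integer> stack = new ArrayDeque<>(t);
        Set<Integer> closure = new TreeSet<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : nfa.get(s).getOrDefault(EPS, Collections.emptySet())) {
                if (closure.add(u)) stack.push(u);
            }
        }
        return closure;
    }

    Set<Integer> move(Set<Integer> t, char a) {
        Set<Integer> out = new TreeSet<>();
        for (int s : t) out.addAll(nfa.get(s).getOrDefault(a, Collections.emptySet()));
        return out;
    }

    // Returns DTran; its key set is Dstates, and the first key is the start state.
    Map<Set<Integer>, Map<Character, Set<Integer>>> build(int s0, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtran = new LinkedHashMap<>();
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        Set<Integer> start = epsClosure(Collections.singleton(s0));
        dtran.put(start, new HashMap<>());
        unmarked.add(start);
        while (!unmarked.isEmpty()) {
            Set<Integer> t = unmarked.remove();          // mark T
            for (char a : alphabet.toCharArray()) {
                Set<Integer> u = epsClosure(move(t, a));
                if (u.isEmpty()) continue;               // no transition on a (dead state omitted)
                if (!dtran.containsKey(u)) {             // U is not yet in Dstates
                    dtran.put(u, new HashMap<>());
                    unmarked.add(u);
                }
                dtran.get(t).put(a, u);                  // DTran[T, a] := U
            }
        }
        return dtran;
    }

    // Assumed input: NFA for (a|b)*abb, start state 0, accepting state 10.
    static SubsetConstruction example() {
        SubsetConstruction sc = new SubsetConstruction();
        int[][] eps = {{0, 1}, {0, 7}, {1, 2}, {1, 4}, {3, 6}, {5, 6}, {6, 1}, {6, 7}};
        for (int[] e : eps) sc.edge(e[0], EPS, e[1]);
        sc.edge(2, 'a', 3); sc.edge(4, 'b', 5);
        sc.edge(7, 'a', 8); sc.edge(8, 'b', 9); sc.edge(9, 'b', 10);
        return sc;
    }

    public static void main(String[] args) {
        example().build(0, "ab").forEach((state, row) -> System.out.println(state + " " + row));
    }
}
```

For this input the construction yields five DFA states, of which only the one containing NFA state 10 is accepting.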

Outline
• Introduction ✓
• Token specification ✓
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction) ✓
  • NFA → DFA (subset construction) ✓
  • DFA → minimal DFA (Algorithm 3.6)
• Programming

Minimise a DFA
Each state of the minimal DFA is a set (group) of states of the original DFA.
Initially, create two states:
• one is the set of all final states: F
• the other is the set of all non-final states: S - F
while (more splits are possible) {
    Let G = {s1, …, sn} be a state and c be any char in ∑
    Let t1, …, tn be the successor states of s1, …, sn under c
    if (t1, …, tn don't all belong to the same state) {
        Split G into new states so that si and sj remain in the same state
        iff ti and tj are in the same state
    }
}

Example
[Figure: a DFA with states A, B, C, D, E over {a, b}, and the minimised DFA with states A, B, D, E]
Step 1: {A,B,C,D} {E}
    For a: {B,B,B,B}
    For b: {C,D,C,E}
    Split: {A,B,C} {D} {E}
Step 2:
    For b: {C,D,C}
    Split: {A,C} {B} {D} {E}
Step 3:
    For a: {B,B}
    For b: {C,C}
    Terminate
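The splitting procedure can be sketched in Java and run on the example above. This is an illustration under stated assumptions: the sketch splits a group on all input symbols at once rather than one symbol at a time, which reaches the same final partition; the transitions of A to D are read off the trace, while the row for E (a → B, b → C) is an assumption that does not influence the result because E is alone in its group. The names `MinimizeDfa`, `minimize`, `groupOf` and `example` are made up for the example.

```java
import java.util.*;

final class MinimizeDfa {
    // Returns the final partition of the DFA states into groups.
    static Set<Set<String>> minimize(Map<String, Map<Character, String>> dfa,
                                     Set<String> accepting, String alphabet) {
        Set<String> finals = new TreeSet<>(accepting);
        Set<String> others = new TreeSet<>(dfa.keySet());
        others.removeAll(finals);
        Set<Set<String>> partition = new LinkedHashSet<>();
        if (!others.isEmpty()) partition.add(others);    // S - F
        if (!finals.isEmpty()) partition.add(finals);    // F

        boolean split = true;
        while (split) {                                  // while more splits are possible
            split = false;
            Set<Set<String>> next = new LinkedHashSet<>();
            for (Set<String> group : partition) {
                // Two states stay together iff, for every symbol,
                // their successors lie in the same current group.
                Map<List<Set<String>>, Set<String>> bySignature = new LinkedHashMap<>();
                for (String s : group) {
                    List<Set<String>> signature = new ArrayList<>();
                    for (char c : alphabet.toCharArray()) {
                        signature.add(groupOf(partition, dfa.get(s).get(c)));
                    }
                    bySignature.computeIfAbsent(signature, k -> new TreeSet<>()).add(s);
                }
                if (bySignature.size() > 1) split = true;
                next.addAll(bySignature.values());
            }
            partition = next;
        }
        return partition;
    }

    static Set<String> groupOf(Set<Set<String>> partition, String state) {
        if (state != null) {
            for (Set<String> g : partition) {
                if (g.contains(state)) return g;
            }
        }
        return Collections.emptySet();   // missing transition
    }

    // The DFA of the example; the row for E is an assumption of this sketch.
    static Map<String, Map<Character, String>> example() {
        String[][] rows = {
            {"A", "B", "C"}, {"B", "B", "D"}, {"C", "B", "C"}, {"D", "B", "E"}, {"E", "B", "C"}
        };
        Map<String, Map<Character, String>> dfa = new LinkedHashMap<>();
        for (String[] r : rows) {
            Map<Character, String> t = new HashMap<>();
            t.put('a', r[1]);
            t.put('b', r[2]);
            dfa.put(r[0], t);
        }
        return dfa;
    }

    public static void main(String[] args) {
        System.out.println(minimize(example(), Collections.singleton("E"), "ab"));
    }
}
```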

Outline
• Introduction ✓
• Token specification ✓
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction) ✓
  • NFA → DFA (subset construction) ✓
  • DFA → minimal DFA (Algorithm 3.6) ✓
• Programming

Input Buffering
[Figure: the scanner reads from a buffer divided into two halves; the buffer holds "begin…" and the input ends with eof]
if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++

Input Buffering (with sentinels)
[Figure: the same two-half buffer, with an eof sentinel at the end of each half in addition to the eof that ends the input]
forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}
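The sentinel scheme can be sketched in Java as below. This is only an illustration of the pointer logic: the class name `SentinelBuffer`, the half size `n`, the use of the NUL character as the eof sentinel (assuming NUL never occurs in the source text) and the helper `readAll` are all assumptions of the example, and retracting across a reloaded half is not handled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SentinelBuffer {
    static final char EOF = '\0';   // sentinel; assumes NUL never occurs in the input
    final int n;                    // size of each half, n >= 1
    final char[] buf;               // [0..n-1] first half, [n] sentinel,
                                    // [n+1..2n] second half, [2n+1] sentinel
    final Reader in;
    int forward = -1;

    SentinelBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        load(0);
    }

    // Fill the half starting at index `from`, then put a sentinel right after
    // the characters read: at the end of the half, or earlier at end of input.
    void load(int from) throws IOException {
        int count = 0;
        while (count < n) {
            int r = in.read(buf, from + count, n - count);
            if (r < 0) break;
            count += r;
        }
        buf[from + count] = EOF;
    }

    // Returns the next input character, or EOF once the input is exhausted.
    char next() throws IOException {
        forward = forward + 1;
        if (buf[forward] == EOF) {
            if (forward == n) {                 // end of first half
                load(n + 1);
                forward = n + 1;
            } else if (forward == 2 * n + 1) {  // end of second half
                load(0);
                forward = 0;
            }
            if (buf[forward] == EOF) {          // a sentinel inside a half: real end of input
                forward--;                      // stay put so that later calls return EOF again
                return EOF;
            }
        }
        return buf[forward];
    }

    // Convenience helper for the example: read a whole string through the buffer.
    static String readAll(String text, int n) {
        try {
            SentinelBuffer b = new SentinelBuffer(new StringReader(text), n);
            StringBuilder sb = new StringBuilder();
            for (char c = b.next(); c != EOF; c = b.next()) sb.append(c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll("begin:=", 4)); // begin:=
    }
}
```

Only one comparison per character is needed in the common case; the checks for the end of a half run only when a sentinel is seen.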

Transition Diagrams
relop → <= | < | <>
    state 0 -<→ state 1
    state 1 -=→ state 2: return(relop, LE)
    state 1 ->→ state 3: return(relop, NE)
    state 1 -other→ state 4: return(relop, LT)
id → letter(letter|digit)*
    state 5 -letter→ state 6
    state 6 -letter or digit→ state 6
    state 6 -other→ state 7: return(id, lexeme)
A transition diagram is a DFA in which no edge leaves a final state

Implementation
Token nexttoken() {
    while (true) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else state = fail(0);
            break;
        case 1:
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2:
            retract(0);
            return new Token(relop, "<=");
        case 3:  // state 3 of the transition diagram: <>
            retract(0);
            return new Token(relop, "<>");
        case 4:  // the extra character read belongs to the next token
            retract(1);
            return new Token(relop, "<");
        case 5:
            c = nextchar();
            if (Character.isLetter(c)) state = 6;
            else state = fail(5);
            break;
        case 6:
            c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;
            else state = 7;
            break;
        case 7:
            retract(1);
            return new Token(id, getLexeme());
        }
    }
}

Implementation (cont’d)
int fail(int current_state) {
    forward = beginning;          // restart at the beginning of the lexeme
    switch (current_state) {
    case 0: return 5;             // not a relop: try the id diagram
    case 5: error();              // no diagram matches
    }
}
void retract(int flag) {
    if (flag == 1) move forward back
    get lexeme from beginning to forward
    move forward onward
    beginning = forward
    state = 0
}
[Figure: input buffer b│e│g│i│n│:│=│ │ │…]
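The two transition diagrams can be turned into a small runnable scanner. The sketch below keeps the state numbers 0 to 7 and the roles of `beginning`, `forward`, `fail` and retraction from the slides, but everything else is an assumption made to keep it self-contained: tokens are plain strings such as `relop:<=`, the input is a `String` rather than a buffer, white space between tokens is skipped, and the names `RelopIdLexer`, `accept` and `tokens` are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class RelopIdLexer {
    final String input;
    int beginning = 0;   // index of the first character of the current lexeme
    int forward = 0;     // index of the next character to read

    RelopIdLexer(String input) { this.input = input; }

    // Reads one character; '\0' stands for "past the end of the input".
    char nextChar() {
        char c = forward < input.length() ? input.charAt(forward) : '\0';
        forward++;
        return c;
    }

    // Restart at the beginning of the lexeme with the next diagram, if any.
    int fail(int state) {
        forward = beginning;
        if (state == 0) return 5;
        throw new IllegalStateException("no token starts at index " + beginning);
    }

    // Retract `retract` characters, cut out the lexeme and prepare for the next token.
    String accept(String kind, int retract) {
        forward -= retract;
        String lexeme = input.substring(beginning, forward);
        beginning = forward;
        return kind + ":" + lexeme;
    }

    // Returns the next token, or null at the end of the input.
    String nextToken() {
        while (beginning < input.length() && Character.isWhitespace(input.charAt(beginning))) {
            beginning++;
        }
        if (beginning == input.length()) return null;
        forward = beginning;
        int state = 0;
        char c;
        while (true) {
            switch (state) {
                case 0: c = nextChar(); state = (c == '<') ? 1 : fail(0); break;
                case 1: c = nextChar(); state = (c == '=') ? 2 : (c == '>') ? 3 : 4; break;
                case 2: return accept("relop", 0);   // <=
                case 3: return accept("relop", 0);   // <>
                case 4: return accept("relop", 1);   // <, one character read too far
                case 5: c = nextChar(); state = Character.isLetter(c) ? 6 : fail(5); break;
                case 6: c = nextChar(); if (!Character.isLetterOrDigit(c)) state = 7; break;
                default: return accept("id", 1);     // state 7
            }
        }
    }

    static List<String> tokens(String input) {
        RelopIdLexer lexer = new RelopIdLexer(input);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a1 <= b <> c < d"));
        // [id:a1, relop:<=, id:b, relop:<>, id:c, relop:<, id:d]
    }
}
```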

Outline
• Introduction ✓
• Token specification ✓
  • Language
  • Regular Expressions (REs)
• Token recognition
  • REs → NFA (Thompson’s construction) ✓
  • NFA → DFA (subset construction) ✓
  • DFA → minimal DFA (Algorithm 3.6) ✓
• Programming ✓
