Lecture # 3

Lecture # 3 Chapter #3: Lexical Analysis

Role of Lexical Analyzer • It is the first phase of compiler • Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis • Reasons to make it a separate phase are: • Simplifies the design of the compiler • Provides efficient implementation • Improves portability

Interaction of the Lexical Analyzer with the Parser (Fig 3.1) Token,tokenval LexicalAnalyzer Parser SourceProgram Get nexttoken error error Symbol Table

Tokens, Patterns, and Lexemes • A token is a classification of lexical units • For example: id and num • Lexemes are the specific character strings that make up a token • For example: abc and 123 • Patterns are rules describing the set of lexemes belonging to a token • For example: “letter followed by letters and digits” and “non-empty sequence of digits”

Diff b/w Token, Lexeme and Pattern

Attributes of Tokens Lexical analyzer y := 31 + 28*x <id, “y”> <assign, :=> <num, 31> <operator , +> <num, 28> <operator, *> <id, “x”> token Parser tokenval(token attribute)

Section # 3.3: Specification of Tokens • Alphabet: Finite, nonempty set of symbols Example: Example: • Strings: Finite sequence of symbols from an alphabet e.g. 0011001 • Empty String: The string with zero occurrences of symbols from alphabet. The empty string is denoted by

Continue… • Length of String: Number of positions for symbols in the string. |w| denotes the length of string w Example |0110| = 4; | | = 0 • Powers of an Alphabet: = = the set of strings of length k with symbols from Example:

Continue.. • The set of all strings over is denoted

Continue.. • Language: is a specific set of strings over some fixed alphabet  Example: The set of legal English words The set of strings consisting of n 0's followed by n 1’s LP = the set of binary numbers whose value is prime

Concatenation and Exponentiation • The concatenation of two strings x and y is denoted by xy • The exponentiationof a string s is defined by s0 = si = si-1s for i > 0note that s = s = s

Language Operations • Union L  M = {s  s  L or s  M} • Concatenation LM = {xy x  L and y  M} • Exponentiation L0 = {}; Li = Li-1L • Kleene closure L* = i=0,…, Li • Positive closureL+ = i=1,…, Li

Regular Expressions • Basis symbols: •  is a regular expression denoting language {} • a   is a regular expression denoting {a} • If r and s are regular expressions denoting languages L(r) and M(s) respectively, then • rs is a regular expression denoting L(r)  M(s) • rs is a regular expression denoting L(r)M(s) • r* is a regular expression denoting L(r)* • (r) is a regular expression denoting L(r) • A language defined by a regular expression is called a Regular set or a Regular Language

Regular Definitions • Regular definitions introduce a naming convention: d1 r1d2 r2…dn rnwhere each ri is a regular expression over  {d1, d2, …, di-1}

Example:letter  AB…Zab…z digit  01…9id  letter ( letterdigit)* • The following shorthands are often used:r+ = rr*r? = r [a-z] = abc…z • Examples:digit  [0-9]num  digit+ (. digit+)? ( E (+-)? digit+ )?

Regular Definitions and Grammars Grammar stmt ifexprthenstmtif exprthenstmtelsestmtexpr term reloptermtermterm  idnum Regular definitions if if then then else elserelop <<=<>>>== id letter ( letter | digit )* num digit+ (. digit+)? ( E (+-)? digit+ )?

Coding Regular Definitions in Transition Diagrams relop <<=<>>>== start < = 0 1 2 return(relop, LE) > 3 return(relop, NE) other * 4 return(relop, LT) = return(relop, EQ) 5 > = 6 7 return(relop, GE) other * 8 return(relop, GT) id letter ( letterdigit )* letter or digit start letter other * 9 10 11 return(gettoken(),install_id())

Section 3.6 : Finite Automata • Finite Automata are used as a model for: • Software for designing digital circuits • Lexical analyzer of a compiler • Searching for keywords in a file or on the web. • Software for verifying finite state systems, such as communication protocols.

Design of a Lexical Analyzer Generator • Translate regular expressions to NFA • Translate NFA to an efficient DFA Optional regularexpressions NFA DFA Simulate NFAto recognizetokens Simulate DFAto recognizetokens

Nondeterministic Finite Automata • An NFA is a 5-tuple (S, , , s0, F) whereS is a finite set of states is a finite set of symbols, the alphabet is a mapping from S   to a set of statess0  S is the start stateF  S is the set of accepting (or final) states

Transition Graph • An NFA can be diagrammatically represented by a labeled directed graph called a transition graph a S = {0,1,2,3} = {a,b}s0 = 0F = {3} start b b a 0 1 2 3 b

Transition Table • The mapping  of an NFA can be represented in a transition table (0,a) = {0,1}(0,b) = {0}(1,b) = {2}(2,b) = {3}

The Language Defined by an NFA • An NFA accepts an input string x if and only if there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph • A state transition from one state to another on the path is called a move • The language defined by an NFA is the set of input strings it accepts, such as (ab)*abb for the example NFA

From Regular Expression to -NFA (Thompson’s Construction)  start  i f start a a i f   N(r1) start r1r2 i f   N(r2) start r1r2 i N(r1) N(r2) f    start r* i N(r) f 

Combining the NFAs of a Set of Regular Expressions start a 1 2 a { action1 }abb { action2 }a*b+ { action3 } start a b b 3 4 5 6 a b start 7 8 b a 1 2  start  a b b 3 0 4 5 6 a b  7 8 b

Deterministic Finite Automata • A deterministic finite automaton is a special case of an NFA • No state has an -transition • For each state s and input symbol a there is at most one edge labeled a leaving s • Each entry in the transition table is a single state • At most one path exists to accept a string • Simulation algorithm is simple

Example DFA A DFA that accepts (ab)*abb b b a start b b a 0 1 2 3 a a

Conversion of an NFA into a DFA • The subset construction algorithm converts an NFA into a DFA using:-closure(s) = {s}  {ts  …  t} -closure(T) = sT-closure(s)move(T,a) = {ts a t and s  T} • The algorithm produces:Dstates is the set of states of the new DFA consisting of sets of states of the NFADtran is the transition table of the new DFA

-closure and move Examples -closure({0}) = {0,1,3,7}move({0,1,3,7},a) = {2,4,7}-closure({2,4,7}) = {2,4,7}move({2,4,7},a) = {7}-closure({7}) = {7}move({7},b) = {8}-closure({8}) = {8}move({8},a) =  a 1 2  start  a b b 3 0 4 5 6 a b  7 8 b a a b a none Also used to simulate NFAs

Subset Construction Example 1  a 2 3     start a b b 0 1 6 7 8 9 10   b 4 5  b DstatesA = {0,1,2,4,7}B = {1,2,3,4,6,7,8}C = {1,2,4,5,6,7}D = {1,2,4,5,6,7,9}E = {1,2,4,5,6,7,10} C b a b start a b b A B D E a a a

Subset Construction Example 2 a a1 1 2  start  a b b a2 3 0 4 5 6 a b  a3 7 8 b b DstatesA = {0,1,3,7}B = {2,4,7}C = {8}D = {7}E = {5,8}F = {6,8} a3 C a b b b start A D a a b b B E F a1 a3 a2 a3

Lecture # 3

Lecture # 3

Presentation Transcript

LECTURE

Lecture 25 Lecture 26

Lecture

Lecture

Lecture VIII Lecture IX

Lecture

Lecture 10 Lecture 10 Lecture 11 Lecture 11 Lecture 11 Lecture 11

Lecture S1: Sample Lecture

Lecture