CHAPTER 3 LEXICAL ANALYSIS

From: Chapter 3, The Dragon Book and Qin book
Sequence of characters → Sequence of tokens

3.0 Approaches to implement a lexical analyzer
Construct a diagram that illustrates the structure of the tokens of the source language, and then hand-translate the diagram into a program for finding tokens.

Pattern-matching technique
• Specify and design programs that execute actions triggered by patterns in strings
• Introduce a pattern-action language called Lex for specifying lexical analyzers
• Patterns are specified by regular expressions
• A compiler for Lex can generate an efficient finite-automaton recognizer for the regular expressions

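A Simple Lexical Analyzer
[Figure: the input E=M*C**2 is passed to the analyzer, which must decide whether each lexeme is a keyword, an identifier, or an operator; identifiers are recognized by identifier pattern matching.]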

Grammar for the identifier pattern:
Id → Letter (Letter | Digit)*
Letter → a | b | c | … | z
Digit → 0 | 1 | 2 | … | 9
• Regular expression: (a|b|c|…|z)(a|b|c|…|z|0|1|2|…|9)*

Finite State Automaton
[Figure: state 1 moves to state 2 on a letter; state 2 loops to itself on a letter or a digit.]

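bool identifier_pattern_matching(const char *s) {
    int flag = 1;                  /* state 1: start */
    char ch = *s++;
    if (ch >= 'a' && ch <= 'z') flag = 2;   /* state 2: inside an identifier */
    else return false;
    while (flag == 2) {
        ch = *s++;
        if (!((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')))
            flag = 0;              /* any other character ends the identifier */
    }
    return true;
}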

3.1 The role of the lexical analyzer
Lexical analyzers are divided into two processes:
• Scanning: no tokenization of the input; deletion of comments and compaction of whitespace characters
• Lexical analysis: producing tokens

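[Figure: the source program is read by the lexical analyzer; the parser sends "get next token" requests and the lexical analyzer returns a token; both components use the symbol table.]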

3.1.1 Reasons for separating lexical analysis from parsing
• Simplicity of design is the most important consideration.
• Compiler efficiency is improved.
• Compiler portability is enhanced.

3.1.2 Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value. Token names: keywords, operators, identifiers, constants, literal strings, punctuation symbols (such as commas and semicolons).
• A pattern is a description of the form that the lexemes of a token may take.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
E.g., the token relation has the lexemes {<, <=, >, >=, ==, <>}.

3.1.3 Attributes for Tokens
For an identifier, the attribute is a pointer to the symbol-table entry in which the information about the token is kept.
E.g. 3.2: E=M*C**2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<multi_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
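The token pairs of this example can be sketched in C++ as follows. The names TokenName, Token, install_id, and tokenize are hypothetical, and a vector index stands in for the symbol-table pointer; the slides do not prescribe a representation, so this is only one possible layout.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token names for the example E=M*C**2.
enum class TokenName { Id, AssignOp, MultiOp, ExpOp, Num };

struct Token {
    TokenName name;
    int attribute;  // Id: index into the symbol table; Num: integer value; otherwise -1
};

// Returns the index of `lexeme` in the symbol table, inserting it if absent.
int install_id(std::vector<std::string>& symtab, const std::string& lexeme) {
    for (std::size_t i = 0; i < symtab.size(); ++i)
        if (symtab[i] == lexeme) return static_cast<int>(i);
    symtab.push_back(lexeme);
    return static_cast<int>(symtab.size()) - 1;
}

// Splits `src` into tokens; characters outside this tiny language are skipped.
std::vector<Token> tokenize(const std::string& src, std::vector<std::string>& symtab) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isalpha(c)) {
            std::size_t start = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({TokenName::Id, install_id(symtab, src.substr(start, i - start))});
        } else if (std::isdigit(c)) {
            int value = 0;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                value = value * 10 + (src[i++] - '0');
            out.push_back({TokenName::Num, value});
        } else if (c == '=') {
            out.push_back({TokenName::AssignOp, -1});
            ++i;
        } else if (c == '*') {
            // "**" is the exponent operator, a single "*" is multiplication.
            if (i + 1 < src.size() && src[i + 1] == '*') {
                out.push_back({TokenName::ExpOp, -1});
                i += 2;
            } else {
                out.push_back({TokenName::MultiOp, -1});
                ++i;
            }
        } else {
            ++i;
        }
    }
    return out;
}
```

Calling tokenize("E=M*C**2", symtab) on an empty symbol table yields seven tokens and leaves the three identifiers E, M, and C in the table.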

3.1.4 Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
E.g., fi ( a == f(x)) ... Is fi a misspelling of the keyword "if", or an identifier?

Suppose a situation arises in which none of the patterns for tokens matches a prefix of the remaining input. E.g., $%#if a>0 a+=1;

The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token. This technique may occasionally confuse the parser, but in an interactive computing environment it may be quite adequate.
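A minimal sketch of panic-mode recovery, assuming for simplicity that "a well-formed token can be found" is approximated by "the next character can start a token". The functions can_start_token and panic_mode_skip and the set of starter characters are hypothetical choices for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumption for this sketch: a token may start only with a letter, a digit,
// or one of the characters listed in `starters`.
bool can_start_token(char c) {
    const std::string starters = "+-*/=<>();";
    return std::isalnum(static_cast<unsigned char>(c)) ||
           starters.find(c) != std::string::npos;
}

// Panic-mode recovery: starting at `pos`, skip (delete) successive characters
// until one is found that can begin a token. Returns the new position.
std::size_t panic_mode_skip(const std::string& input, std::size_t pos) {
    while (pos < input.size() && !can_start_token(input[pos])) ++pos;
    return pos;
}
```

For an input such as $%#if a>0 a+=1; the three leading characters are skipped and scanning resumes at the i of if.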

Other possible error-recovery actions:
• Delete one extraneous character from the remaining input.
• Insert a missing character into the remaining input.
• Replace a character by another character.
• Transpose two adjacent characters.
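As an illustration of the last action, the following sketch checks whether a lexeme becomes a given keyword after one transposition of adjacent characters. The function name is_transposition_of is hypothetical, and how a real analyzer would choose among candidate repairs is not covered by the slides.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Returns true if swapping one pair of adjacent characters of `lexeme`
// yields exactly `keyword`.
bool is_transposition_of(std::string lexeme, const std::string& keyword) {
    if (lexeme.size() != keyword.size()) return false;
    for (std::size_t i = 0; i + 1 < lexeme.size(); ++i) {
        std::swap(lexeme[i], lexeme[i + 1]);
        if (lexeme == keyword) return true;
        std::swap(lexeme[i], lexeme[i + 1]);   // undo and try the next pair
    }
    return false;
}
```

With this check, the lexeme fi is one transposition away from the keyword if.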

3.2 Input Buffering
• Examining ways of speeding up the reading of the source program
• A two-buffer scheme that handles large lookaheads safely

3.2.1 Buffer Pairs
Two buffers of the same size, say 4096, are alternately reloaded. Two pointers to the input are maintained:
• Pointer lexeme_Begin marks the beginning of the current lexeme.
• Pointer forward scans ahead until a pattern match is found.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

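3.2.2 Sentinels
[Figure: a buffer pair with sentinels. The first half holds E = M * followed by eof; the second half holds C * * 2 eof and ends with a second eof. An eof sentinel closes each half, and the eof directly after 2 marks the end of the input.]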

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else terminate lexical analysis   /* eof within a buffer marks the end of the input */
end
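The sentinel scheme can be sketched in C++ as follows. Several assumptions are made for illustration: the class name BufferPair is hypothetical, the "file" is an in-memory string, the sentinel is '\0' (so the source must not contain it), the half size N is tiny, and the forward pointer here designates the next character to read rather than the last one read. The lexeme_Begin pointer is omitted.

```cpp
#include <cstddef>
#include <string>

class BufferPair {
public:
    static constexpr std::size_t N = 4;      // half size, tiny for demonstration
    static constexpr char SENTINEL = '\0';   // plays the role of eof

    explicit BufferPair(const std::string& source)
        : src_(source), read_(0), forward_(0) {
        reload(0);                           // fill the first half
    }

    // Returns the next input character, or SENTINEL at the real end of input.
    char next() {
        char c = buf_[forward_];
        if (c == SENTINEL) {
            if (forward_ == N) {                  // sentinel closing the first half
                reload(N + 1);
                forward_ = N + 1;
            } else if (forward_ == 2 * N + 1) {   // sentinel closing the second half
                reload(0);
                forward_ = 0;
            } else {
                return SENTINEL;                  // eof inside a half: end of input
            }
            c = buf_[forward_];
            if (c == SENTINEL) return SENTINEL;   // the reloaded half was empty
        }
        ++forward_;
        return c;
    }

private:
    // Copies up to N characters into the half starting at `start`, then writes
    // a sentinel right after the last character copied.
    void reload(std::size_t start) {
        std::size_t n = 0;
        while (n < N && read_ < src_.size()) buf_[start + n++] = src_[read_++];
        buf_[start + n] = SENTINEL;
    }

    std::string src_;
    std::size_t read_;                       // how much of src_ has been loaded
    std::size_t forward_;                    // the forward pointer
    char buf_[2 * N + 2];                    // two halves, one sentinel slot each
};
```

The point of the design is that the common case costs a single comparison per character (is it the sentinel?); the tests for the ends of the halves run only when a sentinel is actually seen.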

A very long lexeme is a problem in the two-buffer scheme, because the lookahead is limited by the buffer size.
E.g., DECLARE(ARG1, ARG2, …, ARGn)
E.g., when a function is overloaded in C++, one function name represents several functions.

3.3 Specification of Tokens
• Regular expressions are an important notation for specifying token patterns.
• Study formal notations for regular expressions.
• These expressions are used in lexical-analyzer generators.
• Sec. 3.7 shows how to build the lexical analyzer by converting regular expressions to automata.

1. Regular definition of tokens
Tokens are defined by regular expressions. E.g., an identifier can be defined by the regular grammar
Id → letter (letter | digit)*
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
An identifier can also be expressed by the following regular expression:
(A|B|…|Z|a|b|…|z)(A|B|…|Z|a|b|…|z|0|1|2|…|9)*

Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.

2. Regular expression and regular language
• Regular expression: a notation that allows us to define a pattern in a high-level language.
• Regular language: each regular expression r denotes a language L(r) (the set of sentences relating to the regular expression r).

Each token in a program can be expressed in a regular expression

3. Construction rules for regular expressions over an alphabet Σ
1) ε is a regular expression that denotes {ε}: ε is the regular expression, {ε} is the related regular language.
2) If a is a symbol in Σ, then a is a regular expression that denotes {a}: a is the regular expression, {a} is the related regular language.

3) Suppose α and β are regular expressions; then α|β, αβ, (α), α*, and β* are also regular expressions:
L(α|β) = L(α) ∪ L(β)
L(αβ) = L(α)L(β)
L((α)) = L(α)
L(α*) = {ε} ∪ L(α) ∪ L(α)L(α) ∪ … ∪ L(α)…L(α) ∪ …

4. Algebraic laws of regular expressions
1) α|β = β|α
2) α|(β|γ) = (α|β)|γ    α(βγ) = (αβ)γ
3) α(β|γ) = αβ|αγ    (α|β)γ = αγ|βγ
4) εα = αε = α
5) (α*)* = α*
6) α* = α⁺|ε    α⁺ = αα* = α*α
7) (α|β)* = (α*|β*)* = (α*β*)*

8) If ε ∉ L(γ), then
α = β|γα ⇔ α = γ*β
α = β|αγ ⇔ α = βγ*
Notes: We assume that the precedence of * is the highest, that concatenation comes next, that the precedence of | is the lowest, and that all three are left associative.

5. Notational shorthands
a) One or more instances: (r)+, e.g. digit+
b) Zero or one instance: r? is a shorthand for r|ε, e.g. (E(+|-)?digits)?
c) Character classes: [a-z] denotes a|b|c|…|z; other examples are [A-Za-z] and [A-Za-z0-9]
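These shorthands carry over almost directly to the C++ standard library's std::regex (ECMAScript syntax by default). The sketch below assumes that digits means digit+ and that a number is a digit string with an optional exponent part; the function names is_identifier and is_number are hypothetical.

```cpp
#include <regex>
#include <string>

// Identifier: letter (letter | digit)*, written with character classes.
bool is_identifier(const std::string& s) {
    static const std::regex re("[A-Za-z][A-Za-z0-9]*");
    return std::regex_match(s, re);   // the whole string must match
}

// Unsigned number with an optional exponent: digit+ (E (+|-)? digit+)?
bool is_number(const std::string& s) {
    static const std::regex re("[0-9]+(E[+-]?[0-9]+)?");
    return std::regex_match(s, re);
}
```

Note that std::regex_match requires the entire string to match, which is what a pattern definition means; a scanner looking for the longest matching prefix would need a different call or a hand-built automaton.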

3.4 Recognition of Tokens
1. Task of token recognition in a lexical analyzer
• Isolate the lexeme for the next token in the input buffer
• Produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table given in the figure on the next page

2. Method for recognizing tokens: use a transition diagram

3. Transition diagram (a stylized flowchart)
Depicts the actions that take place when a lexical analyzer is called by the parser to get the next token.

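[Figure: transition diagram for the relational operators that begin with >. From start state 0, the input > leads to state 6. From state 6, = leads to accepting state 7, which executes return(relop, GE); any other character leads to accepting state 8, marked *, which executes return(relop, GT).]
Note: here '*' indicates states on which input retraction must take place.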
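The states 0, 6, 7, and 8 of this diagram, including the retraction at the starred state, can be sketched as follows. The names Relop and scan_greater and the position-based interface are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <string>

enum class Relop { None, GT, GE };

// Follows states 0, 6, 7, 8 on `input` starting at `pos`.
// On success `pos` is left just past the lexeme; on failure it is unchanged.
Relop scan_greater(const std::string& input, std::size_t& pos) {
    std::size_t forward = pos;
    // State 0: expect '>'.
    if (forward >= input.size() || input[forward] != '>') return Relop::None;
    ++forward;
    // State 6: '=' leads to state 7, anything else to state 8.
    if (forward < input.size() && input[forward] == '=') {
        pos = forward + 1;          // state 7: both characters belong to the lexeme
        return Relop::GE;
    }
    // State 8 (marked *): the "other" character was only looked at, so it is
    // retracted, i.e. left in the input for the next token.
    pos = forward;
    return Relop::GT;
}
```

On the input >1 the function returns GT and leaves pos at the character 1, which is the retraction that the '*' on state 8 calls for.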

4. Implementing a transition diagram
• Each state gets a segment of code
• If there are edges leaving a state, its code reads a character and selects an edge to follow, if possible
• Use nextchar() to read the next character from the input buffer

while (1) {
    switch (state) {
    case 0:
        c = nextchar();
        if (c == blank || c == tab || c == newline) {
            state = 0;              /* skip whitespace, stay in state 0 */
            lexeme_beginning++;
        }
        else if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else state = fail();
        break;
    case 9:
        c = nextchar();
        if (isletter(c)) state = 10;
        else state = fail();
        break;
    …
    }
}

5. A generalized transition diagram: the finite automaton
• An FA may be deterministic or non-deterministic
• Non-deterministic means that more than one transition out of a state may be possible on the same input symbol

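6. The model of recognition of tokens
[Figure: the input buffer holds i f d 2 = …; the lexeme_beginning pointer marks a position in the buffer, and the FA simulator reads characters from it.]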

E.g., the FA simulator for identifiers, which represents the rule identifier = letter(letter|digit)*:
[Figure: state 1 moves to state 2 on a letter; state 2 loops to itself on a letter or a digit.]

3.5 Finite automata
1. Usage of FA: precisely recognize the regular sets. A regular set is the set of sentences denoted by a regular expression.
2. Sorts of FA
• Deterministic FA
• Non-deterministic FA

3. Deterministic FA (DFA)
A DFA is a quintuple M = (S, Σ, move, s0, F), where
S: a set of states
Σ: the input symbol alphabet
move: a transition function mapping S × Σ to S, move(s, a) = s'
s0: the start state, s0 ∈ S
F: the set of accepting states, F ⊆ S

Note:
1) In a DFA, no state has an ε-transition.
2) In a DFA, for each state s and input symbol a, there is at most one edge labeled a leaving s.
3) To describe an FA, we use a transition graph or a transition table.
4) A DFA accepts an input string x if and only if there is some path in the transition graph from the start state to some accepting state such that the edge labels along the path spell out x.

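E.g., DFA M = ({0,1,2,3}, {a,b}, move, 0, {3})
move: move(0,a)=1  move(0,b)=2  move(1,a)=3  move(1,b)=2  move(2,a)=1  move(2,b)=3  move(3,a)=3  move(3,b)=3
Transition table:
state   a   b
0       1   2
1       3   2
2       1   3
3       3   3
Transition graph:
[Figure: the same eight transitions drawn as a graph, with start state 0 and accepting state 3.]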
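A table-driven simulation of this DFA might look as follows. The array name MOVE and the function name accepts are hypothetical, and rejecting any character outside {a, b} is a choice made for this sketch.

```cpp
#include <string>

// Transition table of the example DFA: rows are states 0..3,
// column 0 is input a, column 1 is input b.
const int MOVE[4][2] = {
    {1, 2},   // move(0,a)=1, move(0,b)=2
    {3, 2},   // move(1,a)=3, move(1,b)=2
    {1, 3},   // move(2,a)=1, move(2,b)=3
    {3, 3},   // move(3,a)=3, move(3,b)=3
};

// Returns true if the DFA is in the accepting state 3 after reading all of `x`.
// Characters other than a and b have no transition, so the string is rejected.
bool accepts(const std::string& x) {
    int state = 0;                       // start state s0
    for (char c : x) {
        if (c != 'a' && c != 'b') return false;
        state = MOVE[state][c == 'a' ? 0 : 1];
    }
    return state == 3;
}
```

Tracing a few inputs by hand shows that state 3 is reached exactly when two equal symbols appear in a row: aa and abb are accepted, while abab is not.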

E.g., construct a DFA M that accepts the strings over {a, b, c} which begin with a or b, or which begin with c and contain at most one a. Please write a C++ function to implement the DFA.
[Figure: transition graph of M with states 0, 1, 2, 3 and edges labeled a, b, c.]
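One possible answer to this exercise is sketched below. The meaning assigned to each state and the function name accepts_exercise are assumptions, since the original figure cannot be recovered exactly; the sketch also assumes that the empty string is rejected because it begins with neither a, b, nor c.

```cpp
#include <string>

// States (the numbering is an assumption):
//   0: start
//   1: began with a or b                 (accepting)
//   2: began with c, no a seen so far    (accepting)
//   3: began with c, exactly one a seen  (accepting)
// A second a in state 3 has no transition, so the string is rejected.
bool accepts_exercise(const std::string& x) {
    int state = 0;
    for (char c : x) {
        if (c != 'a' && c != 'b' && c != 'c') return false;
        switch (state) {
        case 0: state = (c == 'c') ? 2 : 1; break;
        case 1: break;                              // loops on a, b, c
        case 2: state = (c == 'a') ? 3 : 2; break;  // loops on b, c
        case 3: if (c == 'a') return false; break;  // loops on b, c
        }
    }
    return state != 0;   // the empty string is rejected
}
```

For example, baaa and cab are accepted, while caa is rejected because it begins with c and contains two occurrences of a.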
