Chapter 3. Lexical Analysis (1)

Interaction of lexical analyzer with parser.

Lexical Analysis
• Issues (reasons for separating lexical analysis from parsing)
  • Simpler design is preferred
  • Compiler efficiency is improved
  • Compiler portability is improved
• Terms
  • Tokens: terminal symbols in a grammar
  • Patterns: rules describing the set of strings that can form a token
  • Lexemes: sequences of characters in the source program that are matched by the pattern for a token

Examples of tokens.

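Difficulties in implementing lexical analyzers
• FORTRAN
  • Blanks are not significant, so no delimiter separates tokens
  • DO 5 I=1.25 vs. DO 5 I=1,25 vs. DO 5 I= 1 25
  • The first is an assignment to the variable DO5I; the second begins a DO loop. The lexer cannot tell them apart until it reaches the period or the comma.
• PL/I
  • Keywords are not reserved
  • IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;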

Attributes for tokens
• A lexical analyzer collects information about tokens into their associated attributes
• Example: E = M * C ** 2
  • <id, pointer to symbol-table entry for E>
  • <assign_op, >
  • <id, pointer to symbol-table entry for M>
  • <mult_op, >
  • <id, pointer to symbol-table entry for C>
  • <exp_op, >
  • <num, integer value 2> (the value is generally stored in a constant table)
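The token/attribute pairs above can be reproduced with a small scanner. This is a minimal sketch, not the slide's implementation: the regular expressions, the `tokenize` function, and the use of a list index as the "pointer to the symbol-table entry" are all assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("num",       r"\d+"),
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op",    r"\*\*"),   # listed before mult_op so ** is not read as two *
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table) for a source string."""
    symbol_table = []   # identifier lexemes; the list index stands in for a pointer
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                      # white space yields no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute
    return tokens, symbol_table
```

Calling `tokenize("E = M * C ** 2")` yields seven pairs in the same order as the slide, with the identifiers E, M, and C entered into the symbol table in order of first appearance.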

Lexical Errors
• Rules for error recovery
  • Deleting an extraneous character
  • Inserting a missing character
  • Replacing an incorrect character by a correct character
  • Transposing two adjacent characters
  • Minimum-distance error correction
• Example
  • Detectable: 2as3, 2#31, …
  • Undetectable: fi(a == f(x)) …
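The four single-character repairs can be enumerated mechanically. The sketch below is an illustration under stated assumptions: the keyword set and the function names `single_edit_variants` and `repair_keyword` are hypothetical. Note that a lexer on its own still accepts `fi` as a valid identifier, so a repair like this is only useful once some later phase has flagged the error.

```python
import string

KEYWORDS = {"if", "then", "else"}   # hypothetical keyword set for illustration


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings reachable from word by one of the four repair rules."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}                  # drop a character
    inserts = {a + c + b for a, b in splits for c in alphabet}     # add a character
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | inserts | replaces | transposes


def repair_keyword(word):
    """Keywords that are exactly one repair step away from word."""
    return sorted(single_edit_variants(word) & KEYWORDS)
```

Here `repair_keyword("fi")` finds `if` through a transposition, and `repair_keyword("thn")` finds `then` through an insertion.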

Input Buffering
• A single buffer can cause serious difficulty
  • A word may straddle the boundary between two buffer loads
  • Declare (arg1, …, argn): array or function?
• Buffer pairs
  • A good solution
  • With sentinels, there is no need to test for both the end of the buffer and the end of the file on every advance

Sentinels at end of each buffer half.
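A buffer pair with sentinels might be sketched as follows. The class name `BufferPair`, the half size, and the choice of `"\0"` as the sentinel are assumptions; the sketch only tracks the forward pointer and assumes the sentinel character never occurs in the source text.

```python
import io

EOF = "\0"   # sentinel; assumed never to appear in the source text


class BufferPair:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, half_size=8):
        self.stream = stream
        self.n = half_size
        # Layout: [half 0: n chars][EOF][half 1: n chars][EOF]
        self.buf = [EOF] * (2 * self.n + 2)
        self.forward = 0
        self._load(0)

    def _load(self, half):
        start = half * (self.n + 1)
        data = self.stream.read(self.n)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        # A full read puts EOF in the sentinel slot; a short read puts it
        # right after the data, where it marks the real end of input.
        self.buf[start + len(data)] = EOF

    def next_char(self):
        ch = self.buf[self.forward]
        if ch != EOF:                        # common case: a single test
            self.forward += 1
            return ch
        if self.forward == self.n:           # sentinel ending the first half
            self._load(1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:   # sentinel ending the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                          # sentinel inside a half: end of input
```

For example, `"".join(iter(BufferPair(io.StringIO(text), half_size=4).next_char, None))` reproduces `text` even though every lexeme longer than four characters straddles a buffer boundary.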

Specification of Tokens
• Strings and languages
  • Alphabet or character class: a finite set of symbols
  • String = sentence = word
  • |s|: length of a string s
  • ε: the empty string; ∅: the empty set (note that ∅ ≠ {ε})
  • If x and y are strings, xy is their concatenation, and εx = xε = x
• Operations on languages

Terms for parts of a string.

Definitions of operations on languages.
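The standard operations on languages (union, concatenation, Kleene closure, positive closure) can be tried out on finite sets of strings. This is a sketch with hypothetical function names; because the closures are infinite in general, they are truncated at an assumed `max_power`.

```python
def union(L, M):
    """L ∪ M: strings in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """L* truncated to L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure(L, max_power):
    """L+ truncated to L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result
```

The difference between the empty set and {ε} shows up directly: `concat(set(), L)` is empty, while `concat({""}, L)` is `L` itself.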

Regular Expressions 1. is a regular expression that denotes {}, that is, the set containing the empty string. 2. If a is symbol in , then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol. 3. Suppose r and s are regular expressions denoting the language L(r) and L(s). Then, a) (r)|(s) is a regular expression denoting L(r)  L(s). b) (r)(s) is a regular expression denoting L(r)L(s). c) (r)* is a regular expression denoting (L(r))*. d) (r) is a regular expression denoting L(r).

Examples on operations in regular expressions
• Σ = {a, b}: the alphabet
• a | b denotes {a, b}
• (a|b)(c|d) denotes {ac, ad, bc, bd} (over an alphabet that also contains c and d)
• a* denotes {ε, a, aa, aaa, …}
• (a|b)* = (a*|b*)*
• aa* = a+, ε | a+ = a*
• (a|b) = (b|a)
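The identities above can be spot-checked by brute force with Python's `re` module: enumerate every string over the alphabet up to some length and compare which ones each expression matches. The helper names `language` and `same_up_to` and the length bound are assumptions, and agreement up to a bound is evidence rather than a proof of equivalence.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=4):
    """Strings over alphabet, up to max_len long, that pattern matches in full."""
    regex = re.compile(pattern)
    strings = ("".join(p)
               for n in range(max_len + 1)
               for p in product(alphabet, repeat=n))
    return {s for s in strings if regex.fullmatch(s)}


def same_up_to(p, q, **kwargs):
    """True if p and q match exactly the same strings within the bound."""
    return language(p, **kwargs) == language(q, **kwargs)
```

For instance, `same_up_to("(a|b)*", "(a*|b*)*")` holds, while `same_up_to("a*", "a+")` does not, because only `a*` matches the empty string.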

Algebraic properties of regular expressions.

Regular Definitions
• Regular definition
  d1 → r1
  d2 → r2
  …
  dn → rn
• Example
  letter → A | B | … | Z | a | b | … | z
  digit → 0 | 1 | … | 9
  id → letter (letter | digit)*

Notational Shorthands (1/2)
• One or more instances. The unary postfix operator + means “one or more instances of.” If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a’s. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+ | ε and r+ = rr* relate the Kleene and positive closure operators.
• Zero or one instance. The unary postfix operator ? means “zero or one instance of.” The notation r? is a shorthand for r | ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5 as
  num → digit+ (. digit+)? (E (+ | -)? digit+)?

Notational Shorthands (2/2)
• Character classes. The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression
  [A-Za-z][A-Za-z0-9]*
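Character classes and the + and ? shorthands carry over almost unchanged to Python's `re` syntax. In this sketch the identifier pattern is the one given above, the number pattern mirrors the `number` definition in the Lex program for Fig. 3.10, and the helper names `is_id` and `is_num` are hypothetical.

```python
import re

# Identifier: a letter followed by any number of letters or digits.
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Number: digits, an optional fraction, an optional exponent.
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def is_id(s):
    """True if the whole string is an identifier."""
    return ID.fullmatch(s) is not None


def is_num(s):
    """True if the whole string is a number."""
    return NUM.fullmatch(s) is not None
```

With these patterns, `count1` is an identifier and `6.02E23` is a number, while `2as3` is neither.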

Nonregular set
• {wcw⁻¹ | w is a string of a’s and b’s}, where w⁻¹ is w reversed: no regular expression denotes this set; a context-free grammar is required to describe it

Regular-expression patterns for tokens.

Transition diagram
• Finite-state automata
  • States and edges
• Several examples are shown …
• On the next page, Fig. 3.14 is drawn based on the preceding examples

Transition diagram for identifiers and keywords.
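A transition diagram of this kind can be simulated with an explicit state variable. The sketch below assumes a two-state diagram (a start state and an in-identifier state with a letter-or-digit loop), an "other" edge that accepts without consuming the character, and a keyword table consulted after the lexeme is complete. The state names, the `KEYWORDS` table, and `scan_identifier` are hypothetical.

```python
KEYWORDS = {"if": "IF", "then": "THEN", "else": "ELSE"}   # assumed keyword table


def scan_identifier(text, start=0):
    """Run the diagram from text[start].

    Returns (token, lexeme, next_position), or None if no identifier starts here.
    """
    state = "start"
    pos = start
    while True:
        ch = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == "start":
            if ch.isalpha() and ch.isascii():
                state, pos = "in_id", pos + 1        # edge labeled letter
            else:
                return None                          # this diagram fails
        elif state == "in_id":
            if ch.isascii() and ch.isalnum():
                pos += 1                             # loop on letter or digit
            else:
                # Edge labeled "other": accept without consuming ch.
                lexeme = text[start:pos]
                return KEYWORDS.get(lexeme, "ID"), lexeme, pos
```

Scanning `then x` returns the keyword token THEN, whereas `thenx=1` returns an ordinary ID, because the keyword lookup happens only after the longest identifier has been read.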

Implementation with Lex
• Regular definition → finite automata, transition diagram
• Output as a C program
• Lexical analysis, pattern matching, …

Creating a lexical analyzer with Lex.

Lex program for the tokens of Fig. 3.10. (1/2)

%{
    /* definitions of manifest constants
       LT, LE, EQ, NE, GT, GE,
       IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

Lex program for the tokens of Fig. 3.10. (2/2)

%%

{ws}      { /* no action and no return */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = install_id(); return(ID); }
{number}  { yylval = install_num(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }
"<="      { yylval = LE; return(RELOP); }
"="       { yylval = EQ; return(RELOP); }
"<>"      { yylval = NE; return(RELOP); }
">"       { yylval = GT; return(RELOP); }
">="      { yylval = GE; return(RELOP); }

%%

install_id() {
    /* procedure to install the lexeme, whose first character
       is pointed to by yytext and whose length is yyleng,
       into the symbol table and return a pointer thereto */
}

install_num() {
    /* similar procedure to install a lexeme that is a number */
}

Lookahead operator
• DO 5 I = 1.25 vs. DO 5 I = 1,25
  • DO/({letter}|{digit})*=({letter}|{digit})*,
  • DO/{id}*={digit}*,
• IF(I,J)=3 vs. IF(condition) statement
  • IF/\(.*\){letter}
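The Lex trailing-context operator r1/r2 behaves much like a regular-expression lookahead assertion: r2 must follow, but it is not part of the matched lexeme. The sketch below uses Python's `(?=...)` to decide whether a FORTRAN statement begins with the keyword DO. It assumes blanks are stripped first, and the names `DO_KEYWORD`, `IDENT`, and `first_token` are hypothetical.

```python
import re

# DO is a keyword only if what follows looks like "<letters/digits>=<letters/digits>,".
DO_KEYWORD = re.compile(r"DO(?=[A-Z0-9]*=[A-Z0-9]*,)")
IDENT = re.compile(r"[A-Z][A-Z0-9]*")


def first_token(statement):
    """Classify the first token of a FORTRAN statement, ignoring blanks."""
    s = statement.replace(" ", "").upper()   # blanks are not significant
    m = DO_KEYWORD.match(s)
    if m:
        return ("DO", m.group())             # the lookahead text is not consumed
    m = IDENT.match(s)
    if m:
        return ("ID", m.group())
    return None
```

With a comma after the equals sign, `DO 5 I = 1,25` starts with the keyword DO; with a period, `DO 5 I = 1.25` starts with the identifier DO5I.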
