Chapter 3. Lexical Analysis (1) • Interaction of lexical analyzer with parser
Lexical Analysis • Issues (why lexical analysis is separated from parsing) • Simpler design is preferred • Compiler efficiency is improved • Compiler portability is improved • Terms • Tokens: terminal symbols in a grammar • Patterns: rules describing the set of strings that form a token • Lexemes: character sequences in the source program that are matched by the pattern of a token
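Difficulties in implementing lexical analyzers • FORTRAN • Blanks are not significant, so no delimiter separates tokens • DO 5 I = 1.25 (an assignment to the variable DO5I) • DO 5 I = 1,25 (a DO loop; DO can be recognized as a keyword only once the comma is seen) • DO 5 I= 1 25 • PL/I • Keywords are not reserved • IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;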
Attributes for tokens • A lexical analyzer collects information about tokens into their associated attributes • Example • E = M * C ** 2 • <id, pointer to symbol-table entry for E> • <assign_op, > • <id, pointer to symbol-table entry for M> • <mult_op, > • <id, pointer to symbol-table entry for C> • <exp_op, > • <num, integer value 2> (the value is generally stored in a constant table)
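The token/attribute pairs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the symbol table is a plain list, the "pointer" is an index into it, and the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical.

```python
import re

# Token patterns, tried in order; "**" is listed before "*" so the longer operator wins.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op", r"\*\*"),
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []  # assumption: a list of lexemes stands in for the symbol table
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue  # white space produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))  # index plays the pointer
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))  # operators carry no attribute
    return tokens, symbol_table
```

Calling `tokenize("E = M * C ** 2")` yields seven pairs in the same order as the slide, with the identifiers E, M, and C installed in the table.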
Lexical Errors • Rules for error recovery • Deleting an extraneous character • Inserting a missing character • Replacing an incorrect character by a correct character • Transposing two adjacent characters • Minimum-distance error correction • Example • Detectable: 2as3, 2#31, … • Undetectable: fi(a == f(x)) … (fi is a valid identifier, so the lexer cannot tell that it is a misspelled if)
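The four single-character repair rules can be sketched as a candidate generator. This is a minimal illustration, not a recovery strategy from the slides: the keyword set, the lowercase-only alphabet, and the function names are assumptions.

```python
import string

KEYWORDS = {"if", "then", "else"}  # hypothetical keyword set for illustration
ALPHABET = string.ascii_lowercase  # assumption: only lowercase letters are tried


def single_edit_repairs(lexeme):
    """All strings reachable from lexeme by exactly one of the four repair rules."""
    splits = [(lexeme[:i], lexeme[i:]) for i in range(len(lexeme) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | inserts | replaces | transposes


def suggest_keywords(lexeme):
    """Keywords that are one repair away from the given lexeme."""
    return sorted(single_edit_repairs(lexeme) & KEYWORDS)
```

For example, `suggest_keywords("fi")` finds `if` through a transposition. Note that a lexer alone still cannot know that `fi` was meant as a keyword; the sketch only enumerates the repairs.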
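Input Buffering • A single buffer causes difficulty • A word may straddle the boundary between two buffer loads • Declare (arg1, …, argn): array or function? Deciding may require looking far ahead • Buffer pairs • A good solution • With sentinels, there is no need to test for both the end of the buffer and the end of the file on every character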
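A buffer pair with sentinels might look like the following sketch. It assumes a tiny half size `N`, a sentinel character `"\0"` that never occurs in the source text, and hypothetical names (`BufferPair`, `next_char`, `read_all`); a real scanner would also track the start of the current lexeme.

```python
import io

EOF = "\0"  # sentinel character; assumed never to occur in the source text
N = 4       # half-buffer size, kept tiny so that reloads happen often


class BufferPair:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [0..N-1] first half, [N] sentinel, [N+1..2N] second half, [2N+1] sentinel.
        self.buf = [EOF] * (2 * N + 2)
        self.forward = 0
        self._load(0)

    def _load(self, start):
        chunk = self.stream.read(N)
        for i in range(N):
            # A short read leaves a sentinel inside the half, marking the real end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None at the real end of input."""
        while True:
            ch = self.buf[self.forward]
            if ch != EOF:  # the common case costs a single test
                self.forward += 1
                return ch
            if self.forward == N:  # sentinel closing the first half
                self._load(N + 1)
                self.forward = N + 1
            elif self.forward == 2 * N + 1:  # sentinel closing the second half
                self._load(0)
                self.forward = 0
            else:  # sentinel inside a half: end of input
                return None


def read_all(text):
    """Read text back through the buffer pair, one character at a time."""
    pair = BufferPair(io.StringIO(text))
    return "".join(iter(pair.next_char, None))
```

Only when the sentinel is seen does the code ask which of the three cases applies, which is the saving the slide describes.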
Specification of Tokens • Strings and languages • Alphabet or character class: a finite set of symbols • String = sentence = word • |s|: length of a string s • ε: empty string; Ф = {}: empty set (not the same as {ε}) • x, y are strings • xy: concatenation; εx = xε = x • Operations on languages
Regular Expressions 1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string. 2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol. 3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then, a) (r)|(s) is a regular expression denoting L(r) ∪ L(s). b) (r)(s) is a regular expression denoting L(r)L(s). c) (r)* is a regular expression denoting (L(r))*. d) (r) is a regular expression denoting L(r).
Examples on operations in regular expressions • Σ = {a, b}: alphabet • a | b denotes {a, b} • (a|b)(c|d) denotes {ac, ad, bc, bd} (with c and d also in the alphabet) • a* denotes {ε, a, aa, aaa, …} • (a|b)* = (a*|b*)* • aa* = a+, ε|a+ = a* • (a|b) = (b|a)
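The language operations behind these examples can be tried on finite sets of strings. The sketch below is an assumption-laden illustration: languages are Python sets, the empty string `""` stands for ε, and since L* is infinite, `closure` only builds a finite approximation bounded by `max_rounds`.

```python
def union(L, M):
    """L ∪ M."""
    return L | M


def concat(L, M):
    """LM = { xy | x in L and y in M }."""
    return {x + y for x in L for y in M}


def closure(L, max_rounds):
    """Finite approximation of L*: concatenations of at most max_rounds strings of L."""
    result = {""}  # L^0 = {ε}
    power = {""}
    for _ in range(max_rounds):
        power = concat(power, L)  # L^(i+1) = L^i L
        result |= power
    return result
```

For instance, `concat({"a", "b"}, {"c", "d"})` gives the four strings listed for (a|b)(c|d), and `concat({"a"}, closure({"a"}, 2))` equals `closure({"a"}, 3)` without the empty string, which mirrors aa* = a+.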
Regular Definitions • Regular definition • d1 → r1, d2 → r2, …, dn → rn • Example • letter → A|B| … |Z|a|b| … |z • digit → 0|1| … |9 • id → letter (letter|digit)*
Unsigned numbers • Pascal • digit → 0|1| … |9 • digits → digit digit* • optional_fraction → . digits | ε • optional_exponent → ( E ( + | - | ε ) digits ) | ε • num → digits optional_fraction optional_exponent
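The regular definition for num translates almost directly into a Python regular expression. This is a sketch: Python's `re` has no ε, so `x | ε` is written as an optional group `(?:x)?`, and the names `DIGITS`, `NUM`, and `is_num` are hypothetical.

```python
import re

DIGITS = r"[0-9]+"                           # digits -> digit digit*
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"      # . digits | ε
OPTIONAL_EXPONENT = rf"(?:E[+-]?{DIGITS})?"  # ( E ( + | - | ε ) digits ) | ε
NUM = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)


def is_num(s):
    """True if the whole string s is an unsigned number."""
    return NUM.fullmatch(s) is not None
```

Strings such as `5280`, `39.37`, `6.336E4`, and `1.894E-4` are accepted, while `1.` is rejected because at least one digit must follow the point.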
Notational Shorthands (1/2) • One or more instances. The unary postfix operator + means “one or more instances of.” If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a’s. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators. • Zero or one instance. The unary postfix operator ? means “zero or one instance of.” The notation r? is a shorthand for r|ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5 as • digit → 0|1| … |9 • digits → digit+ • optional_fraction → ( . digits )? • optional_exponent → ( E ( + | - )? digits )? • num → digits optional_fraction optional_exponent
Notational Shorthands (2/2) • Character classes. The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*
Nonregular set • { wcw⁻¹ | w is a string of a’s and b’s } (w⁻¹ read here as the reverse of w) • No regular expression describes this set; a context-free grammar is required
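Why a finite automaton is not enough can be seen from a recognizer for this set: it has to remember all of w, which takes unbounded memory. The sketch below uses a stack for that purpose and assumes that the superscript on the second w denotes reversal; the function name is hypothetical.

```python
def is_wcw_reversed(s):
    """True if s = w c w' where w is over {a, b} and w' is w reversed (assumed reading)."""
    stack = []
    i = 0
    # Phase 1: push the symbols of w until the centre marker c.
    while i < len(s) and s[i] in "ab":
        stack.append(s[i])
        i += 1
    if i == len(s) or s[i] != "c":
        return False
    i += 1
    # Phase 2: every remaining symbol must equal the symbol popped from the stack.
    while i < len(s):
        if not stack or stack.pop() != s[i]:
            return False
        i += 1
    return not stack  # all of w must have been matched
```

The stack is exactly the extra power that a pushdown automaton, and hence a context-free grammar, adds over a finite automaton.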
Transition diagram • Finite-state automata • States and edges • Several examples are shown … • Next page: Fig. 3.14 is drawn based on the preceding examples
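A transition diagram can be coded as a loop over a state variable. The sketch below handles the relational operators <, <=, =, <>, >, >= that appear in the Lex program of this chapter; the state numbers and the function name `relop` are assumptions, since the figure itself is not reproduced here.

```python
def relop(text, start=0):
    """Follow the diagram from `start`; return (attribute, next position) or None."""
    state, pos = 0, start
    while True:
        ch = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("EQ", pos + 1)
            elif ch == ">":
                state = 2
            else:
                return None  # no relational operator starts here
        elif state == 1:  # "<" has been seen
            if ch == "=":
                return ("LE", pos + 1)
            if ch == ">":
                return ("NE", pos + 1)
            return ("LT", pos)  # retract: ch belongs to the next token
        elif state == 2:  # ">" has been seen
            if ch == "=":
                return ("GE", pos + 1)
            return ("GT", pos)  # retract
        pos += 1
```

The returned position shows the retraction: after `<a` the scanner resumes at `a`, because that character was only inspected, not consumed.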
Implementation with Lex • Regular definitions → finite automata, transition diagrams • Output as a C program • Lexical analysis, pattern matching, …
Lex program for the tokens of Fig. 3.10 (1/2)
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
Lex program for the tokens of Fig. 3.10 (2/2)
%%
{ws}       { /* no action and no return */ }
if         { return(IF); }
then       { return(THEN); }
else       { return(ELSE); }
{id}       { yylval = install_id(); return(ID); }
{number}   { yylval = install_num(); return(NUMBER); }
"<"        { yylval = LT; return(RELOP); }
"<="       { yylval = LE; return(RELOP); }
"="        { yylval = EQ; return(RELOP); }
"<>"       { yylval = NE; return(RELOP); }
">"        { yylval = GT; return(RELOP); }
">="       { yylval = GE; return(RELOP); }
%%
int install_id() {
    /* procedure to install the lexeme, whose first character is
       pointed to by yytext and whose length is yyleng, into the
       symbol table and return a pointer thereto */
}
int install_num() {
    /* similar procedure to install a lexeme that is a number */
}
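Lex resolves conflicts between rules by taking the longest match and, on a tie, the rule listed first; this is why `if` is listed before `{id}`. The Python sketch below imitates that behaviour for the same patterns. It is an illustration, not what Lex generates: the `RULES` table and `scan` are hypothetical names, and the lexeme is returned in place of `yylval`.

```python
import re

RULES = [
    ("WS", r"[ \t\n]+"),
    ("IF", r"if"),
    ("THEN", r"then"),
    ("ELSE", r"else"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER", r"[0-9]+(\.[0-9]+)?(E[+\-]?[0-9]+)?"),
    ("RELOP", r"<=|<>|>=|<|=|>"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def scan(source):
    """Return a list of (token, lexeme) pairs, skipping white space."""
    tokens, pos = [], 0
    while pos < len(source):
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            m = regex.match(source, pos)
            # Strict ">" keeps the earliest rule when two matches have equal length.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"lexical error at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, source[pos:best_end]))
        pos = best_end
    return tokens
```

With this scheme `if` is returned as the keyword IF, while `ifx` is returned as an ID because the identifier pattern matches a longer lexeme.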
Lookahead operator • r1/r2 matches r1 only when it is followed by r2; r2 is not part of the lexeme • DO 5 I = 1.25 vs. DO 5 I = 1,25 • DO/({letter}|{digit})* = ({letter}|{digit})*, • DO/{id}* = {digit}*, • IF(I,J)=3 vs. IF(condition) statement • IF/\(.*\){letter}
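Python's `re` module has a comparable construct, the lookahead assertion `(?=...)`, so the first DO pattern can be tried out directly. The sketch assumes uppercase input, strips blanks first because they are not significant in FORTRAN, and uses hypothetical names.

```python
import re

# DO is a keyword only if "<letters/digits> = <letters/digits> ," follows it.
DO_KEYWORD = re.compile(r"DO(?=[A-Z0-9]*=[A-Z0-9]*,)")


def starts_with_do_keyword(statement):
    """True if the statement begins with the keyword DO rather than an identifier."""
    return DO_KEYWORD.match(statement.replace(" ", "")) is not None
```

The match itself covers only `DO`; the text inside the lookahead is inspected but left in the input for the following tokens, just as with the Lex `/` operator.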