200 likes | 426 Views
컴파일러 입문. 제 4 장 어휘 분석. Lexical Analysis the process by which the compiler groups certain strings of characters into individual tokens. Lexical Analyzer Scanner Lexer. 4.1 서 론. Token 문법적으로 의미 있는 최소 단위 Token - a single syntactic entity(terminal symbol).
E N D
컴파일러 입문 제 4 장 어휘 분석
Lexical Analysis the process by which the compiler groups certain strings of characters into individual tokens. Lexical Analyzer Scanner Lexer 4.1 서 론
Token • 문법적으로 의미 있는 최소 단위 Token - a single syntactic entity(terminal symbol). Token Number - string 처리의 효율성 위한 integer number. Token Value - numeric value or string value. ex) if(a>10) ... Token Number : 32 7 4 25 5 8 Token Value : 0 0 ‘a’ 0 10 0
Token classes • Special form - language designer 1. Keyword --- const, else, if, int, ... 2. Operator symbols --- +, -, *, /, ++, -- etc. 3. Delimiters --- ;, ,, (, ), [, ] etc. • General form - programmer 4. identifier --- stk, ptr, sum, ... 5. constant --- 526, 3.0, 0.1234e-10, ‘c’, “string” etc. • Token Structure - represented by regular expression. ex) id = (l + _)( l + d + _)*
Interaction of Lexical Analyzer with Parser • Lexical Analyzer is the procedureof Syntax Analyzer. L.A. Finite Automata. S.A. Pushdown Automata. • Token type • scanner가 parser에게 넘겨주는 토큰 형태.(token number, token value) ex) if ( x > y ) x = 10 ; (32,0) (7,0) (4,x) (25,0) (4,y) (8,0) (4,x) (23,0) (5,10) (20,0)
The reasons for separating the analysis phase of compiling into lexical analysis(scanning) and syntax analysis(parsing). 1. modular construction - simpler design. 2. compiler efficiency is improved. 3. compiler portability is enhanced. • Parsing table • Parser의 행동(Shift, Reduce, Accept, Error)을 결정. • Token number는 Parsing table의 index.
Symbol table의 용도 • L.A와 S.A시 identifier에 관한 정보를 수집하여 저장. • Semantic analysis와 Code generation시에 사용. • name + attributes ex) Hashed symbol table • chapter 12 참조
Specification of token structure - RE Specification of PL - CFG Scanner design steps 1. describe the structure of tokens inre. 2. or, directly design a transition diagram for the tokens. 3. and program a scanner according to the diagram. 4. moreover, we verify the scanner action through regular language theory. Character classification letter : a | b | c... | z | A | B | C |…| Z l digit : 0 | 1 | 2... | 9d special character : + | - | * | / | . | , | ... 4.2 토큰 인식
Transition diagram Regular grammar S lA | _A A lA | dA | _A | ε Regular expression S = lA + _A = (l + _)A A = lA + dA + _A + ε = (l + d + _)A + ε = (l + d + _)* S = (l + _)( l + d + _)* 4.2.1 Identifier Recognition
n : non-zero digit o : octal digit h : hexa digit 4.2.2 Integer number Recognition • Form : 10진수, 8진수, 16진수로 구분되어진다. 10진수 : 0이 아닌 수 시작 8진수 : 0으로 시작, 16진수 : 0x, 0X로 시작 • Transition diagram
Regular grammar S nA | 0B A dA | ε B oC | xD | XD | ε C oC | ε D hE E hE | ε • Regular expression E = hE + ε = h*ε = h* D = hE = hh* = h+ C = oC + ε = o* B = oC + xD + XD + ε = o+ + (x + X)D = o+ + (x + X)h+ + ε A = dA + ε = d* S = nA + 0B = nd* + 0(o+ + (x + X)h+ + ε) = nd* + 0 + 0o+ + 0(x + X)h+ ∴ S = nd* + 0 + 0o+ + 0(x + X)h+
4.2.3 Real number Recognition • Form : Fixed-point number & Floating-point number • Transition diagram • Regular grammar S dA D dE | +F | -G A dA | .B E dE |ε B dCF dE C dC | eD |ε G dE
Regular expression E = dE + ε = d* F = dE = dd* = d+ G = dE = dd* = d+ D = dE + '+'F + -G = dd* + '+'d+ + -d + = d+ + '+'d+ + -d+ = (ε+ '+' +-)d + C = dC + eD + ε = dC+e(ε + '+' +-)d+ + e = d*(e(ε + '+' +-) d+ + ε) B = dC=dd*(e(ε + '+' +-)d+ +ε) = d++(e(ε + '+' +-) d+ +ε) A = dA + .B = d*.d+(e(ε + '+' +-)d+ + ε) S = dA = dd*. d+(e(ε + '+' +-) d+ +ε) = d+.d+(e(ε + '+' +-) d+ + ε) = d+.d++ d+.d+e(ε + '+' +-) d+ 참고 Terminal +를 ‘+’로 표기.
Form : a sequence of characters between a pair of double quotes. Transition diagram where, a = char_set - {", \} and c = char_set Regular grammar S "A A aA | "B | \C B ε C cA 4.2.4 String Constant Recognition
Regular expression A = aA + " B + \C = aA + " + \cA = (a + \c)A + " = (a + \c)* " S = " A = "(a + \c)*" • ∴ S ="(a + \c)*"
Transition diagram where, a = char_set - {*} and b = char_set - {*, /}. Regular grammar S /A A *B B aB | *C C *C | bB | /D D ε 4.2.5 Comment Recognition
Regular expression C = *C + bB + /D = **(bB + /) B = aB + ***(bB + /) = aB + ***bB + ***/ = (a + *** b)B + ***/= (a + ***b)****/ A = *B = *(a + ***b)****/ S = /A =/* (a + ***b)****/ • A program which recognizes a comment statement. do { while (ch !='*') ch = getchar(); ch = getchar(); } while (ch != '/');