Lecture 2 Lexical Analysis. CSCE 531 Compiler Construction. Topics Sample Simple Compiler Operations on strings Regular expressions Finite Automata Readings:. January 11, 2006. Overview. Last Time A little History Compilers vs Interpreter Data-Flow View of Compilers
Lecture 2 Lexical Analysis CSCE 531 Compiler Construction • Topics • Sample Simple Compiler • Operations on strings • Regular expressions • Finite Automata • Readings: January 11, 2006
Overview • Last Time • A little History • Compilers vs Interpreters • Data-Flow View of Compilers • Regular Languages • Course Pragmatics • Today’s Lecture • Why Study Compilers? • xx • References • Chapter 2, Chapter 3 • Assignment Due Wednesday Jan 18 • 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b
A Simple Compiler for Expressions • Chapter Two Overview • Structure of the simple compiler, really just a translator from infix expressions to postfix • Grammars • Parse Trees • Syntax-directed Translation • Predictive Parsing • Translator for Simple Expressions • Grammar • Rewritten grammar (an equivalent one better suited to predictive parsing) • Parsing modules fig 2.24 • Specification of Translator fig 2.35 • Structure of translator fig 2.36
Grammars • A grammar (more correctly, a context-free grammar) has • A set of tokens, also known as terminals • A set of nonterminals • A set of productions of the form nonterminal → sequence of tokens and/or nonterminals • A special nonterminal, the start symbol • Example • E → E + E • E → E * E • E → digit
Derivations • A derivation is a sequence of rewritings of a string of grammar symbols using the productions of a grammar. • We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production • X ⇒ Y if there is a production N → β where • The nonterminal N occurs in the sequence X of grammar symbols • And Y is the same as X except that β replaces one occurrence of N • Example (d stands for digit) • E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d
Parse Trees • A graphical presentation of a derivation, satisfying • Root is the start symbol • Each leaf is a token or ε (note different font from text) • Each interior node is a nonterminal • If A is a parent with children X1, X2, …, Xn then A → X1 X2 … Xn is a production
Syntax directed Translation • Frequently the rewriting by a production will be called a reduction or reducing by the particular production. • Syntax directed translation attaches actions (code) that are executed when the reductions are performed • Example • E → E + T {print('+');} • E → E - T {print('-');} • E → T • T → 0 {print('0');} • T → 1 {print('1');} • … • T → 9 {print('9');}
Specification of the translator • S → L eof figure 2.38 • L → E ; L • L → ε • E → T E' • E' → + T { print('+'); } E' • E' → - T { print('-'); } E' • E' → ε • T → F T' • T' → * F { print('*'); } T' • T' → / F { print('/'); } T' • T' → ε • F → ( E ) • F → id { print(id.lexeme); } • F → num { print(num.value); }
Translating to code • E → T E' • E' → + T { print('+'); } E' • E' → - T { print('-'); } E' • E' → ε
Expr() {
    int t;
    term();
    while (1)
        switch (lookahead) {
        case '+': case '-':
            t = lookahead; match(lookahead); term();
            emit(t, NONE);
            continue;
        …
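The loop-based translation of E' above can be tried out with a minimal sketch such as the following. It is an illustration under stated assumptions, not the book's code: operands are restricted to single digits, only + and - are handled, and the names `to_postfix`, `in`, `out` and `ok` are hypothetical helpers introduced here in place of `lookahead`, `match` and `emit`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical translator state (not from the slides). */
static const char *in;   /* next unread input character */
static char *out;        /* next free slot in the output buffer */
static int ok;           /* cleared on the first syntax error */

/* T -> digit { print(digit) } */
static void term(void) {
    if (isdigit((unsigned char)*in)) {
        *out++ = *in++;          /* match the digit and emit it */
    } else {
        ok = 0;
    }
}

/* E -> T E';  E' -> + T {print('+')} E' | - T {print('-')} E' | epsilon.
   The tail recursion on E' is written as a loop. */
static void expr(void) {
    term();
    while (ok && (*in == '+' || *in == '-')) {
        char op = *in++;         /* match the operator */
        term();
        *out++ = op;             /* action runs after both operands are emitted */
    }
}

/* Translate infix src to postfix in dst; returns 1 on success, 0 on a
   syntax error. dst must have room for strlen(src) + 1 bytes. */
int to_postfix(const char *src, char *dst) {
    assert(src != NULL && dst != NULL);
    in = src;
    out = dst;
    ok = 1;
    expr();
    if (*in != '\0')
        ok = 0;                  /* leftover input that no production matches */
    *out = '\0';
    return ok;
}
```

Calling `to_postfix("9-5+2", buf)` fills `buf` with `95-2+`, which shows that each operator is printed only after both of its operands.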
Overview of the Code Figure 2.36 • /class/csce531-001
Operations on Strings • A language over an alphabet is a set of strings of characters from the alphabet. • Operations on strings: • let $x = x_1 x_2 \ldots x_n$ and $t = t_1 t_2 \ldots t_m$ then • Concatenation: $xt = x_1 x_2 \ldots x_n t_1 t_2 \ldots t_m$ • Alternation: $x \mid t$ = either $x_1 x_2 \ldots x_n$ or $t_1 t_2 \ldots t_m$
Operations on Sets of Strings • Operations on sets of strings: • For these let $S = \{s_1, s_2, \ldots, s_m\}$ and $R = \{r_1, r_2, \ldots, r_n\}$ • Alternation: $S \mid R = S \cup R = \{s_1, s_2, \ldots, s_m, r_1, r_2, \ldots, r_n\}$ • Concatenation: • $SR = \{sr \mid s \in S \text{ and } r \in R\}$ • $= \{s_1 r_1, s_1 r_2, \ldots, s_1 r_n, s_2 r_1, \ldots, s_2 r_n, \ldots, s_m r_1, \ldots, s_m r_n\}$ • Power: $S^2 = SS$, $S^3 = S^2 S$, $S^n = S^{n-1} S$ • What is $S^0$? • Kleene Closure: $S^* = \bigcup_{i=0}^{\infty} S^i$; note $S^0 = \{\varepsilon\}$ is contained in $S^*$
Operations cont. Kleene Closure • Powers: • $S^2 = SS$ • $S^3 = S^2 S$ • … • $S^n = S^{n-1} S$ • What is $S^0$? • Kleene Closure: $S^* = \bigcup_{i=0}^{\infty} S^i$; note $S^0 = \{\varepsilon\}$ is contained in $S^*$
Examples of Operations on Sets of Strings • Operations on sets of strings: • For these let $S = \{a, b, c\}$ and $R = \{t, u\}$ • Alternation: $S \mid R = S \cup R = \{a, b, c, t, u\}$ • Concatenation: • $SR = \{sr \mid s \in S \text{ and } r \in R\}$ • $= \{at, au, bt, bu, ct, cu\}$ • Power: $S^2 = \{aa, ab, ac, ba, bb, bc, ca, cb, cc\}$ • $S^3 = \{aaa, aab, aac, \ldots, ccc\}$ (27 elements) • Kleene closure: $S^*$ = {any string of any length, including the empty string, of a's, b's and c's}
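The concatenation of two sets of strings can be sketched in C as a nested loop over both sets. The function name `concat_sets` and the bound `MAX_LEN` are assumptions made for this illustration, and the sketch does not remove duplicates, so it matches the set definition only when all products are distinct (as they are in the example above).

```c
#include <assert.h>
#include <stdio.h>

#define MAX_LEN 16   /* hypothetical bound on the length of each result string */

/* Writes every concatenation sr (s in S, r in R) into out, in the order
   s1r1, s1r2, ..., smrn, and returns the number of strings written.
   out must have room for m * n entries. */
size_t concat_sets(const char *S[], size_t m, const char *R[], size_t n,
                   char out[][MAX_LEN]) {
    size_t k = 0;
    assert(S != NULL && R != NULL && out != NULL);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            snprintf(out[k++], MAX_LEN, "%s%s", S[i], R[j]);
    return k;
}
```

With S = {a, b, c} and R = {t, u} this produces the six strings at, au, bt, bu, ct, cu; passing S for both arguments produces the nine strings of the second power of S.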
Regular Expressions • For a given alphabet Σ the following are regular expressions: • If a ∈ Σ then a is a regular expression and L(a) = { a } • ε is a regular expression and L(ε) = { ε } • ∅ is a regular expression and L(∅) = ∅, the empty language • And if s and t are regular expressions denoting languages L(s) and L(t) respectively then • st is a regular expression and L(st) = L(s) L(t) • s | t is a regular expression and L(s | t) = L(s) ∪ L(t) • s* is a regular expression and L(s*) = L(s)*
Why Regular Expressions? • We use regular expressions to describe the tokens • Examples: • Reg expr for C identifiers • C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore • ID reg expr = (letter | underscore) (letter | underscore | digit)* • Or more explicitly • ID reg expr = (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)*
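The identifier expression can be checked by hand with a short C function. This is a minimal sketch, assuming the default "C" locale so that `isalpha` and `isalnum` match exactly the ASCII letters and digits; the name `is_identifier` is introduced here for illustration. It tests only the lexical shape, so keywords such as `for` also pass.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if s matches (letter | underscore)(letter | underscore | digit)*,
   and 0 otherwise (including for the empty string). */
int is_identifier(const char *s) {
    assert(s != NULL);
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                 /* first character: letter or underscore */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;             /* later characters may also be digits */
    return 1;
}
```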
Pop Quiz • Given r and s are regular expressions then • What is rε? r | ε? • Describe the language denoted by 0*110* • Describe the language denoted by (0|1)*110* • Give a regular expression for the language of strings of 0's and 1's that end in a 1 • Give a regular expression for the language of strings of 0's and 1's in which every 0 is followed by a 1
Recognizers of Regular Languages • To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs. • The construction of a lexical analyzer will then proceed as: • 1. Identify all tokens • 2. Develop regular expressions for each • 3. Convert the regular expressions to finite automata • 4. Use the transition table for the finite automata as the basis for the scanner • We will actually use the tools lex and/or flex for steps 3 and 4.
Transition Diagram for a DFA • Start in state s0; then if the input is “f” make a transition to state s1. • Then from state s1 if the input is “o” make a transition to state s2. • And from state s2 if the input is “r” make a transition to state s3. • The double circle denotes an “accepting state”, which means we recognized the token. • Actually there is a missing state and transition • [Diagram: s0 --f--> s1 --o--> s2 --r--> s3, with s3 drawn as a double circle]
Now what about “fort” • The string “fort” is an identifier, not the keyword “for” followed by “t”. • Thus we can’t really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ])
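One common way to handle this is to scan the longest identifier-shaped lexeme first and only then compare it with the keyword. The sketch below assumes that approach; the token codes and the name `scan_word` are hypothetical, and any character that is not a letter, digit or underscore is treated as a terminator.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum token { TOK_FOR, TOK_ID, TOK_NONE };   /* hypothetical token codes */

/* Scans the longest identifier-shaped prefix of s, stores its length in
   *len, and only then decides whether the lexeme is the keyword "for". */
enum token scan_word(const char *s, size_t *len) {
    size_t n = 0;
    assert(s != NULL && len != NULL);
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        while (isalnum((unsigned char)s[n]) || s[n] == '_')
            n++;                             /* stop at the first terminator */
    }
    *len = n;
    if (n == 0)
        return TOK_NONE;                     /* s does not start a word */
    return (n == 3 && strncmp(s, "for", 3) == 0) ? TOK_FOR : TOK_ID;
}
```

On the input `for(` the function reports the keyword with length 3, while on `fort ` it reports an identifier of length 4, because the decision is delayed until the terminator is seen.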
Deterministic Finite Automata • A deterministic finite automaton (DFA) is a mathematical model that consists of • 1. a set of states $S$ • 2. a set of input symbols $\Sigma$, the input alphabet • 3. a transition function $\delta: S \times \Sigma \to S$ that for each state and each input symbol gives the next state • 4. a state $s_0$ that is distinguished as the start state • 5. a set of states $S_F \subseteq S$ distinguished as accepting (or final) states
DFA to recognize keyword “for” • $\Sigma$ = {a, b, c, …, z, A, B, …, Z, 0, …, 9, ',', ';', …} • $S = \{s_0, s_1, s_2, s_3, s_{dead}\}$ • $s_0$ is the start state • $S_F = \{s_3\}$ • δ given by the table below
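A transition table for this DFA can be sketched as a two-dimensional C array indexed by state and input character. The entries below are an assumption inferred from the transition diagram (three labeled edges, with every other pair going to the dead state), and the 7-bit ASCII alphabet size `NSYMS` and the names `build_table` and `accepts_for` are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { S0, S1, S2, S3, SDEAD, NSTATES };   /* the five states of the DFA */
#define NSYMS 128                          /* assumption: 7-bit ASCII alphabet */

static int table[NSTATES][NSYMS];          /* table[s][c] = delta(s, c) */

/* Fill the table: every entry is the dead state except the three edges
   s0 -f-> s1, s1 -o-> s2, s2 -r-> s3. */
void build_table(void) {
    for (int s = 0; s < NSTATES; s++)
        for (int c = 0; c < NSYMS; c++)
            table[s][c] = SDEAD;
    table[S0]['f'] = S1;
    table[S1]['o'] = S2;
    table[S2]['r'] = S3;
}

/* Run the DFA over x; accept iff the run ends in the accepting state s3. */
int accepts_for(const char *x) {
    int s = S0;
    assert(x != NULL);
    build_table();
    for (; *x != '\0'; x++) {
        unsigned char c = (unsigned char)*x;
        s = (c < NSYMS) ? table[s][c] : SDEAD;   /* non-ASCII input is dead */
    }
    return s == S3;
}
```

Because s3 has no outgoing edge except to the dead state, `accepts_for("fort")` is 0: this DFA accepts exactly the string `for`.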
Language Accepted by a DFA • A string $x_0 x_1 \ldots x_n$ is accepted by a DFA $M = (\Sigma, S, s_0, \delta, S_F)$ if $s_{i+1} = \delta(s_i, x_i)$ for $i = 0, 1, \ldots, n$ and $s_{n+1} \in S_F$ • i.e. if $x_0 x_1 \ldots x_n$ determines a path through the state diagram for the DFA that ends in an accepting state. • Then the language accepted by the DFA $M = (\Sigma, S, s_0, \delta, S_F)$, denoted $L(M)$, is the set of all strings accepted by M.
DFA1.c
/*
 * Deterministic Finite Automaton Simulation
 *
 * One line of input is read and then processed character by character.
 * Thus '\n' (EOL) is treated as the end of input.
 * The major functions are:
 *   delta(s,c) - that implements the transition function, and
 *   accept(s)  - that tells whether state s is an accepting state or not.
 * The particular DFA recognizes strings of digits that end in 000.
 * The DFA has:
 *   S = {0, 1, 2, 3, DEAD_STATE}
 *   Transitions on 0: S0=>S1, S1=>S2, S2=>S3, S3=>S3
 *   Transitions on non-zero digits: S0=>S0, S1=>S0, S2=>S0, S3=>S0
 *   Transitions on non-digits: Si=> DEAD_STATE
 */
#include <stdio.h>

#define DEAD_STATE -1
#define ACCEPT 1
#define DO_NOT 0
#define EOL '\n'

/* Prototypes, so that main can call the functions defined below it. */
int delta(int s, int c);
int accept(int state);

int main(void) {
    int c;
    int state = 0;

    /* Stop at end of line, at end of file, or once the DFA is dead. */
    while ((c = getchar()) != EOL && c != EOF && state != DEAD_STATE) {
        state = delta(state, c);
    }
    if (accept(state)) {
        printf("Accept!\n");
    } else {
        printf("Do not accept!\n");
    }
    return 0;
}

/* DFA Transition function delta */
/* delta(s,c) = transition from state s on input c */
int delta(int s, int c) {
    switch (s) {
    case 0:
        if (c == '0') return 1;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case 1:
        if (c == '0') return 2;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case 2:
        if (c == '0') return 3;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case 3:
        if (c == '0') return 3;
        else if ((c > '0') && (c <= '9')) return 0;
        else return DEAD_STATE;
    case DEAD_STATE:
        return DEAD_STATE;
    default:
        printf("Bad State\n");
        return DEAD_STATE;
    }
}

/* State 3 is the only accepting state. */
int accept(int state) {
    if (state == 3) return ACCEPT;
    else return DO_NOT;
}
Non-Deterministic Finite Automata • What does deterministic mean? • In a Non-Deterministic Finite Automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. $\delta: S \times \Sigma \to S$ • An NFA can: • Have multiple transitions from a state for the same input • Have ε transitions, where a transition from one state to another can be accomplished without consuming an input character • Not have transitions defined for every state and every input • Note for NFAs $\delta: S \times (\Sigma \cup \{\varepsilon\}) \to 2^S$ where $2^S$ is the power set of S
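Language Accepted by an NFA • A string $x_0 x_1 \ldots x_n$ is accepted by an NFA • $M = (\Sigma, S, s_0, \delta, S_F)$ if there are states $s_1, \ldots, s_{n+1}$ with $s_{i+1} \in \delta(s_i, x_i)$ for $i = 0, 1, \ldots, n$ and $s_{n+1} \in S_F$, where ε transitions may be taken before or after any input symbol • i.e. if $x_0 x_1 \ldots x_n$ can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε transitions wherever necessary. • Then the language accepted by the NFA $M = (\Sigma, S, s_0, \delta, S_F)$, denoted $L(M)$, is the set of all strings accepted by M.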
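Acceptance by an NFA can be simulated by tracking the set of all states the machine could be in. The sketch below assumes a small hand-built example NFA for (a|b)*abb that is not from the slides: state 0 loops on a and b and has an ε transition to state 1, followed by the chain 1 -a-> 2 -b-> 3 -b-> 4, with state 4 accepting. State sets are stored as bit masks, and all names are introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NSTATES 5
#define BIT(s) (1u << (s))
typedef unsigned set_t;              /* a set of states as a bit mask */

/* Hypothetical example NFA for (a|b)*abb.
   delta[s][0] is the move on 'a', delta[s][1] the move on 'b'. */
static const set_t delta[NSTATES][2] = {
    { BIT(0), BIT(0) },              /* s0: loops on a and on b */
    { BIT(2), 0      },              /* s1: a -> s2 */
    { 0,      BIT(3) },              /* s2: b -> s3 */
    { 0,      BIT(4) },              /* s3: b -> s4 */
    { 0,      0      },              /* s4: accepting, no moves */
};
static const set_t eps[NSTATES] = { BIT(1), 0, 0, 0, 0 };  /* s0 -eps-> s1 */
static const set_t accepting = BIT(4);

/* Add every state reachable through epsilon transitions alone. */
static set_t closure(set_t T) {
    set_t prev;
    do {
        prev = T;
        for (int s = 0; s < NSTATES; s++)
            if (T & BIT(s))
                T |= eps[s];
    } while (T != prev);
    return T;
}

/* All states reachable from some state in T by consuming one symbol. */
static set_t move(set_t T, int sym) {
    set_t U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & BIT(s))
            U |= delta[s][sym];
    return U;
}

/* Accept iff some path for x ends in an accepting state. */
int nfa_accepts(const char *x) {
    set_t cur = closure(BIT(0));
    assert(x != NULL);
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b')
            return 0;                /* symbol outside the alphabet */
        cur = closure(move(cur, *x == 'b'));
    }
    return (cur & accepting) != 0;
}
```

Tracking a set of states follows every nondeterministic choice in parallel, so the string is accepted exactly when at least one path ends in an accepting state.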
Thompson Construction • For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
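The construction builds one small NFA fragment per regular-expression operator, each with a single start state and a single accepting state. The following is a minimal sketch under stated assumptions rather than the textbook's exact construction: concatenation here joins the two fragments with an ε edge instead of merging states, the fixed array bounds are arbitrary, and every name (`nfa_sym`, `nfa_cat`, `nfa_alt`, `nfa_star`, `nfa_match`) is introduced for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_STATES 64
#define MAX_EDGES  256
#define EPS (-1)                        /* label used for epsilon edges */

typedef struct { int from, label, to; } edge_t;
typedef struct { int start, accept; } frag_t;   /* one start, one accept */

static edge_t edges[MAX_EDGES];
static int n_edges, n_states;

void nfa_reset(void) { n_edges = 0; n_states = 0; }

static int new_state(void) { assert(n_states < MAX_STATES); return n_states++; }

static void add_edge(int from, int label, int to) {
    assert(n_edges < MAX_EDGES);
    edges[n_edges++] = (edge_t){ from, label, to };
}

static frag_t new_frag(void) {
    frag_t f;
    f.start = new_state();
    f.accept = new_state();
    return f;
}

/* a:  start --a--> accept */
frag_t nfa_sym(int a) {
    frag_t f = new_frag();
    add_edge(f.start, a, f.accept);
    return f;
}

/* st:  s.accept --eps--> t.start */
frag_t nfa_cat(frag_t s, frag_t t) {
    add_edge(s.accept, EPS, t.start);
    return (frag_t){ s.start, t.accept };
}

/* s|t:  new start branches to both, both rejoin at a new accept */
frag_t nfa_alt(frag_t s, frag_t t) {
    frag_t f = new_frag();
    add_edge(f.start, EPS, s.start);
    add_edge(f.start, EPS, t.start);
    add_edge(s.accept, EPS, f.accept);
    add_edge(t.accept, EPS, f.accept);
    return f;
}

/* s*:  skip edge for zero repetitions, back edge for more */
frag_t nfa_star(frag_t s) {
    frag_t f = new_frag();
    add_edge(f.start, EPS, s.start);
    add_edge(f.start, EPS, f.accept);
    add_edge(s.accept, EPS, s.start);
    add_edge(s.accept, EPS, f.accept);
    return f;
}

/* Add to set[] every state reachable through epsilon edges alone. */
static void closure(char set[]) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n_edges; i++)
            if (edges[i].label == EPS && set[edges[i].from] && !set[edges[i].to]) {
                set[edges[i].to] = 1;
                changed = 1;
            }
    }
}

/* Simulate the NFA fragment f on x by tracking the set of current states. */
int nfa_match(frag_t f, const char *x) {
    char cur[MAX_STATES] = {0}, next[MAX_STATES];
    cur[f.start] = 1;
    closure(cur);
    for (; *x != '\0'; x++) {
        memset(next, 0, sizeof next);
        for (int i = 0; i < n_edges; i++)
            if (edges[i].label == (unsigned char)*x && cur[edges[i].from])
                next[edges[i].to] = 1;
        closure(next);
        memcpy(cur, next, sizeof cur);
    }
    return cur[f.accept];
}
```

For example, after `nfa_reset()`, the expression (a|b)*ab can be built as `nfa_cat(nfa_star(nfa_alt(nfa_sym('a'), nfa_sym('b'))), nfa_cat(nfa_sym('a'), nfa_sym('b')))` and then tested with `nfa_match`.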