Lecture 2 Lexical Analysis CSCE 531 Compiler Construction • Topics • Sample Simple Compiler • Operations on strings • Regular expressions • Finite Automata • Readings: January 18, 2018
Overview • Last Time • A little History • Compilers vs Interpreter • Data-Flow View of Compilers • Regular Languages • Course Pragmatics • Today’s Lecture • Why Study Compilers? • xx • References • Chapter 2, Chapter 3 • Assignment Due Wednesday Jan 18 • 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b
A Simple Compiler for Expressions • Chapter Two Overview • Structure of the simple compiler, really just a translator from infix expressions to postfix • Grammars • Parse Trees • Syntax-directed Translation • Predictive Parsing • Translator for Simple Expressions • Grammar • Rewritten grammar (an equivalent one better suited to predictive parsing) • Parsing modules fig 2.24 • Specification of Translator fig 2.35 • Structure of translator fig 2.36
Grammars
• A grammar (more correctly, a context-free grammar) has
  • A set of tokens, also known as terminals
  • A set of nonterminals
  • A set of productions of the form nonterminal → sequence of tokens and/or nonterminals
  • A special nonterminal, the start symbol
• Example
  • E → E + E
  • E → E * E
  • E → digit
Derivations
• A derivation is a sequence of rewritings of a string of grammar symbols using the productions of a grammar.
• We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production.
• X ⇒ Y if there is a production N → β where
  • the nonterminal N occurs in the sequence X of grammar symbols
  • and Y is the same as X except that β replaces one occurrence of N
• Example (d abbreviates digit)
  • E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d
Language generated by a grammar
• ⇒* is formally the reflexive, transitive closure of the relation ⇒
• Informally, s1 ⇒* s2 if s2 can be derived from s1 in zero or more steps
• The language generated by the grammar G is denoted L(G) and is defined by
  • L(G) = {w ∈ T* | S ⇒* w}
  • where S is the start symbol and T* is the set of all finite strings of terminals
• Example: S → aS | b
  • L(G) = {b, ab, aab, …}
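The definition of L(G) can be explored with a small sketch that enumerates the short strings derivable from S in the example grammar. The dictionary encoding of the grammar and the names `GRAMMAR` and `generate` are assumptions made for illustration. The length-based pruning also assumes that every production adds at least one terminal, as it does for S → aS | b.

```python
from collections import deque

# Hypothetical encoding: dictionary keys are nonterminals, every other
# character is a terminal. This is the slide's example S -> aS | b.
GRAMMAR = {"S": ["aS", "b"]}

def generate(grammar, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start."""
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal in the sentential form, if any.
        pos = next((i for i, ch in enumerate(form) if ch in grammar), None)
        if pos is None:
            results.add(form)  # only terminals remain: a string of L(G)
            continue
        for rhs in grammar[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            terminals = sum(1 for ch in new_form if ch not in grammar)
            # Prune forms that already contain too many terminals.
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=lambda w: (len(w), w))

print(generate(GRAMMAR, "S", 3))
```

Each step of the search is one application of ⇒ to the leftmost nonterminal, so the strings collected are exactly those w with S ⇒* w, up to the length bound.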
Parse Trees
• A graphical presentation of a derivation, satisfying
  • Root is the start symbol
  • Each leaf is a token or ε (note different font from text)
  • Each interior node is a nonterminal
  • If A is a parent with children X1, X2, …, Xn then A → X1X2 … Xn is a production
• Note for each derivation there is a unique parse tree.
• However, for any parse tree there are many corresponding derivations.
• A leftmost derivation is a derivation in which at each step the leftmost nonterminal is replaced.
  • E ⇒lm E + E ⇒lm id + E ⇒lm id + E * E ⇒lm id + id * E ⇒lm id + id * id
Ambiguity • E → E + E | E * E | ‘(’ E ‘)’ | id
The Empty String ε • ε = the string with no characters • S → Sa | a • S → Sa | ε
Syntax-directed Translation
• Frequently the rewriting by a production will be called a reduction, or reducing by the particular production.
• Syntax-directed translation attaches actions (code) that are executed when the reductions are performed.
• Example
  • E → E + T {print(‘+’);}
  • E → E - T {print(‘-’);}
  • E → T
  • T → 0 {print(‘0’);}
  • T → 1 {print(‘1’);}
  • …
  • T → 9 {print(‘9’);}
Quadruples • Quadruples • Result • Left operand • Operator • Right operand
Fig 2.4 Intermediate forms for a loop
• do i = i + 1; while ( a[i] < v );
• [Figure: the loop shown as a parse tree and as quadruples]
Specification of the translator (figure 2.38)
• S → L eof
• L → E ; L
• L → ε
• E → T E’
• E’ → + T { print(‘+’); } E’
• E’ → - T { print(‘-’); } E’
• E’ → ε
• T → F T’
• T’ → * F { print(‘*’); } T’
• T’ → / F { print(‘/’); } T’
• T’ → ε
• F → ( E )
• F → id { print(id.lexeme); }
• F → num { print(num.value); }
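Translating to code
• E → T E’
• E’ → + T { print(‘+’); } E’
• E’ → - T { print(‘-’); } E’
• E’ → ε

    Expr() {
        int t;
        term();                       /* E -> T E' : parse the first term */
        while(1)                      /* the loop plays the role of E' */
            switch(lookahead){
            case '+': case '-':
                t = lookahead;        /* remember the operator */
                match(lookahead); term();
                emit(t, NONE);        /* the print action, after the operand */
                continue;
            …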
Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. • One procedure is associated with each nonterminal of a grammar. • Here, we consider a simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the flow of control through the procedure body for each nonterminal. • The sequence of procedure calls during the analysis of an input string implicitly defines a parse tree for the input, and can be used to build an explicit parse tree, if desired.
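As a rough illustration of one procedure per nonterminal, the sketch below follows the E, T, F productions of the translator specification and emits postfix. It is a minimal sketch, not the course's code: the class `Parser`, the helper `to_postfix`, and the simplification that every token is a single character (one letter for an id, one digit for a num) are all assumptions. The tail nonterminals E’ and T’ are written as loops, in the same way as the C fragment on the "Translating to code" slide.

```python
class Parser:
    def __init__(self, text):
        # Assumption: every non-blank character is one token.
        self.tokens = [ch for ch in text if not ch.isspace()]
        self.pos = 0
        self.out = []  # postfix output, one entry per emitted symbol

    @property
    def lookahead(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        if self.lookahead != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.lookahead!r}")
        self.pos += 1

    def expr(self):  # E -> T E'
        self.term()
        while self.lookahead in ("+", "-"):  # E' -> + T {print} E' | - T {print} E' | ε
            op = self.lookahead
            self.match(op)
            self.term()
            self.out.append(op)  # semantic action runs after the operand

    def term(self):  # T -> F T'
        self.factor()
        while self.lookahead in ("*", "/"):  # T' -> * F {print} T' | / F {print} T' | ε
            op = self.lookahead
            self.match(op)
            self.factor()
            self.out.append(op)

    def factor(self):  # F -> ( E ) | id | num
        tok = self.lookahead
        if tok == "(":
            self.match("(")
            self.expr()
            self.match(")")
        elif tok is not None and tok.isalnum():
            self.match(tok)
            self.out.append(tok)  # print(id.lexeme) or print(num.value)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

def to_postfix(text):
    parser = Parser(text)
    parser.expr()
    if parser.lookahead is not None:
        raise SyntaxError(f"unexpected trailing token {parser.lookahead!r}")
    return " ".join(parser.out)

print(to_postfix("x * 2 + z"))
```

In each procedure the lookahead token alone decides which branch to take, which is what makes the parser predictive.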
stmt → for ( optexpr ; optexpr ; optexpr ) stmt • each nonterminal leads to a call of its procedure, in the following sequence of calls: • match(for); match(‘(’); optexpr(); match(‘;’); optexpr(); match(‘;’); optexpr(); match(‘)’); stmt();
First(alpha); nullable
• Define FIRST(α) to be the set of terminals that appear as the first symbol of one or more strings of terminals generated from α; if α ⇒* ε, then ε is also in FIRST(α).
• Example: S → aS | b | ε
  • FIRST(a) = {a}
  • FIRST(S) = {a, b, ε}
• Note: if X → X1X2 … Xn then
  • FIRST(X) contains FIRST(X1)
  • If X1 ⇒* ε then FIRST(X) contains FIRST(X2)
  • If X1X2 … Xi ⇒* ε then FIRST(X) contains FIRST(Xi+1)
• A string w is nullable if w ⇒* ε
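The containment rules above suggest a fixed-point computation: keep applying them until no FIRST set grows. The following is a minimal sketch under assumed conventions: a grammar is a dictionary from nonterminal to a list of right-hand sides, each a list of symbols, with the empty list standing for ε. The names `first_sets` and `first_of_string` are illustrative.

```python
EPS = "ε"

def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given the current FIRST sets."""
    result = set()
    for sym in symbols:
        # A terminal's FIRST set is just the terminal itself.
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result  # this symbol is not nullable, so stop here
    result.add(EPS)  # every symbol was nullable (or the sequence was empty)
    return result

def first_sets(grammar):
    """Iterate the containment rules until no FIRST set changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                before = len(first[nt])
                first[nt] |= first_of_string(rhs, first)
                if len(first[nt]) != before:
                    changed = True
    return first

# The slide's example: S -> aS | b | ε
example = {"S": [["a", "S"], ["b"], []]}
print(first_sets(example)["S"])
```

With this encoding a nonterminal X is nullable exactly when ε is in FIRST(X).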
Using First to direct parsing
• Lookahead = next_token
• If A → α is a production and the lookahead is in FIRST(α), then apply the production A → α (expand A to α)
• For predictive parsing to work, if there are two (or more) productions
  • A → α
  • A → β
• then FIRST(α) ∩ FIRST(β) must be empty
Overview of the Code Figure 2.36 • ~matthews/public/csce531
Semantic Actions – Translate To Postfix
• expr → expr + term { print(‘+’) }
•      | expr - term { print(‘-’) }
•      | term
• term → term * factor { print(‘*’) }
•      | term / factor { print(‘/’) }
•      | factor
• factor → ( expr )
•      | num { print(num.value) }
•      | id { print(id.lexeme) }
Trace • x * 2 + z • Parse Tree • Leftmost derivation • Production Action • E ⇒*lm F
Operations on Strings • A language over an alphabet is a set of strings of characters from the alphabet. • Operations on strings: • let x=x1x2…xn and t=t1t2…tm then • Concatenation: xt =x1x2…xnt1t2…tm • Alternation: x | t = either x1x2…xn or t1t2…tm
Operations on Sets of Strings
• Operations on sets of strings:
• For these let S = {s1, s2, … sm} and R = {r1, r2, … rn}
• Alternation: S | R = S ∪ R = {s1, s2, … sm, r1, r2, … rn}
• Concatenation:
  • SR = {sr | s ∈ S and r ∈ R}
  • = {s1r1, s1r2, … s1rn, s2r1, … s2rn, … smr1, … smrn}
• Power: S² = S S, S³ = S² S, Sⁿ = Sⁿ⁻¹ S
• What is S⁰?
• Kleene Closure: S* = S⁰ ∪ S¹ ∪ S² ∪ … (the union of Sⁱ for all i ≥ 0); note S⁰ = {ε} is in S*
Operations cont. Kleene Closure
• Powers:
  • S² = S S
  • S³ = S² S
  • …
  • Sⁿ = Sⁿ⁻¹ S
• What is S⁰?
• Kleene Closure: S* = S⁰ ∪ S¹ ∪ S² ∪ … (the union of Sⁱ for all i ≥ 0); note S⁰ = {ε} is in S*
Examples of Operations on Sets of Strings
• Operations on sets of strings:
• For these let S = {a, b, c} and R = {t, u}
• Alternation: S | R = S ∪ R = {a, b, c, t, u}
• Concatenation:
  • SR = {sr | s ∈ S and r ∈ R}
  • = {at, au, bt, bu, ct, cu}
• Power: S² = {aa, ab, ac, ba, bb, bc, ca, cb, cc}
  • S³ = {aaa, aab, aac, … ccc}, 27 elements
• Kleene closure: S* = {every string of a’s, b’s and c’s of any finite length, including the empty string ε}
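These examples can be checked mechanically. The sketch below models sets of strings as Python sets; the function names are illustrative assumptions. Since S* is infinite, `kleene_up_to` only builds the finite union S⁰ ∪ … ∪ Sᵏ.

```python
from itertools import product

def concat(S, R):
    """SR = { sr | s in S and r in R }"""
    return {s + r for s, r in product(S, R)}

def power(S, n):
    """S^n, with S^0 = {ε} represented as {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, S)
    return result

def kleene_up_to(S, max_power):
    """Finite approximation of S*: the union of S^0 .. S^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(S, i)
    return result

S = {"a", "b", "c"}
R = {"t", "u"}
print(sorted(S | R))          # alternation is set union
print(sorted(concat(S, R)))   # concatenation
print(len(power(S, 3)))       # number of strings in S^3
```

Starting `power` from {""} is what makes S⁰ = {ε} fall out of the definition Sⁿ = Sⁿ⁻¹ S.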
Regular Expressions
• For a given alphabet Σ the following are regular expressions:
  • If a ∈ Σ then a is a regular expression and L(a) = {a}
  • ε is a regular expression and L(ε) = {ε}
  • ∅ is a regular expression and L(∅) = ∅
• And if s and t are regular expressions denoting languages L(s) and L(t) respectively then
  • st is a regular expression and L(st) = L(s) L(t)
  • s | t is a regular expression and L(s | t) = L(s) ∪ L(t)
  • s* is a regular expression and L(s*) = L(s)*
Why Regular Expressions?
• We use regular expressions to describe the tokens
• Example: a regular expression for C identifiers
  • C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore
  • ID reg expr = (letter | underscore) (letter | underscore | digit)*
  • Or more explicitly, ID reg expr = (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)*
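The identifier expression translates directly into the notation of a practical regex library. This sketch uses Python's `re` module, where a character class such as `[A-Za-z_]` abbreviates the alternation (a|b|…|Z|_). The names `IDENT` and `is_identifier` are illustrative.

```python
import re

# (letter | underscore)(letter | underscore | digit)*
IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(text):
    """True if the whole string matches the identifier pattern."""
    return IDENT.fullmatch(text) is not None

for candidate in ["fort", "_tmp1", "9lives", ""]:
    print(repr(candidate), is_identifier(candidate))
```

The pattern describes only the shape of an identifier; it does not exclude reserved words such as for, which a scanner has to handle separately.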
Pop Quiz • Given r and s are regular expressions then • What is rε ? r | ε ? • Describe the language denoted by 0*110* • Describe the language denoted by (0|1)*110* • Give a regular expression for the language of strings of 0’s and 1’s that end in a 1 • Give a regular expression for the language of strings of 0’s and 1’s in which every 0 is followed by a 1
Recognizers of Regular Languages
• To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs.
• The construction of a lexical analyzer will then proceed as:
  1. Identify all tokens
  2. Develop regular expressions for each
  3. Convert the regular expressions to finite automata
  4. Use the transition table for the finite automata as the basis for the scanner
• We will actually use the tools lex and/or flex for steps 3 and 4.
Transition Diagram for a DFA
• Start in state s0; then if the input is “f” make a transition to state s1.
• Then from state s1, if the input is “o” make a transition to state s2.
• And from state s2, if the input is “r” make a transition to state s3.
• The double circle denotes an “accepting state”, which means we recognized the token.
• Actually there is a missing state and transition
• [Figure: transition diagram s0 --f--> s1 --o--> s2 --r--> s3, with s3 drawn as a double circle]
Now what about “fort” • The string “fort” is an identifier, not the keyword “for” followed by “t.” • Thus we can’t really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ])
Deterministic Finite Automata
• A deterministic finite automaton (DFA) is a mathematical model that consists of
  1. a set of states S
  2. a set of input symbols Σ, the input alphabet
  3. a transition function δ: S × Σ → S that for each state and each input symbol gives the next state
  4. a state s0 that is distinguished as the start state
  5. a set of states F distinguished as accepting (or final) states
DFA to recognize keyword “for”
• Σ = {a, b, c, …, z, A, B, …, Z, 0, …, 9, ‘,’, ‘;’, …}
• S = {s0, s1, s2, s3, sdead}
• s0 is the start state
• SF = {s3}
• δ given by the table below
Language Accepted by a DFA
• A string x₀x₁…xₙ is accepted by a DFA M = (Σ, S, s₀, δ, SF) if sᵢ₊₁ = δ(sᵢ, xᵢ) for i = 0, 1, …, n and sₙ₊₁ ∈ SF
• i.e. if x₀x₁…xₙ determines a path through the state diagram for the DFA that ends in an accepting state.
• Then the language accepted by the DFA M = (Σ, S, s₀, δ, SF), denoted L(M), is the set of all strings accepted by M.
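The acceptance condition amounts to a loop over the input that repeatedly applies δ. The sketch below does this for the keyword "for" DFA. The transition table is not reproduced in these notes, so the table here is an assumption: it lists only the three transitions named on the diagram slide and sends every other (state, input) pair to sdead, which also means sdead never leaves itself.

```python
DEAD = "sdead"

# Assumed transition table: only the three labelled edges are listed;
# every missing entry is treated as a transition to the dead state.
DELTA = {
    ("s0", "f"): "s1",
    ("s1", "o"): "s2",
    ("s2", "r"): "s3",
}
START = "s0"
ACCEPTING = {"s3"}

def accepts(text):
    """Follow one transition per input character, then test the final state."""
    state = START
    for ch in text:
        state = DELTA.get((state, ch), DEAD)
    return state in ACCEPTING

for word in ["for", "fo", "fort", ""]:
    print(repr(word), accepts(word))
```

Because δ is a function, there is exactly one path for each input string, so the loop never needs to backtrack.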
Non-Deterministic Finite Automata
• What does deterministic mean?
• In a non-deterministic finite automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S × Σ → S
• An NFA can:
  • Have multiple transitions from a state for the same input
  • Have ε transitions, where a transition from one state to another can be accomplished without consuming an input character
  • Not have transitions defined for every state and every input
• Note for NFAs δ: S × (Σ ∪ {ε}) → 2^S, where 2^S is the power set of S
Language Accepted by an NFA
• A string x₀x₁…xₙ is accepted by an NFA M = (Σ, S, s₀, δ, SF) if there are states with sᵢ₊₁ ∈ δ(sᵢ, xᵢ) for i = 0, 1, …, n and sₙ₊₁ ∈ SF, allowing ε transitions between steps
• i.e. if x₀x₁…xₙ can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε transitions wherever necessary.
• Then the language accepted by the NFA M = (Σ, S, s₀, δ, SF), denoted L(M), is the set of all strings accepted by M.
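One way to test whether such a path exists is to track the set of all states the NFA could be in after each character, taking the ε-closure at every step. The example NFA below, for the language a*b*, is made up for illustration and is not taken from the lecture. Using the empty string as the key for ε transitions is also an assumption of this sketch.

```python
EPS = ""  # assumed marker for ε transitions

# Hypothetical NFA for a*b*: q0 loops on a, q0 --ε--> q1, q1 loops on b.
NFA = {
    ("q0", "a"): {"q0"},
    ("q0", EPS): {"q1"},
    ("q1", "b"): {"q1"},
}
START = "q0"
ACCEPTING = {"q1"}

def eps_closure(states, delta):
    """All states reachable from `states` using only ε transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in delta.get((state, EPS), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure

def nfa_accepts(text, delta=NFA, start=START, accepting=ACCEPTING):
    current = eps_closure({start}, delta)
    for ch in text:
        moved = set()
        for state in current:
            moved |= delta.get((state, ch), set())  # missing entry: no move
        current = eps_closure(moved, delta)
    return bool(current & accepting)

for word in ["", "aabbb", "ba"]:
    print(repr(word), nfa_accepts(word))
```

The string is accepted if at least one of the tracked states is accepting, which matches the "there exists a path" reading of the definition.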
Thompson Construction • For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
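The construction works case by case on the structure of R: one small NFA per alphabet symbol, combined with ε transitions for concatenation, alternation and closure. The following is a minimal sketch of that idea, not the lecture's own presentation. The fragment representation (start, accept, transitions) and all function names are assumptions. For concatenation it links the two fragments with an ε edge, a common variant of the textbook step that merges the two states. Each sub-fragment should be built fresh and used once, since fragments own their states.

```python
import itertools

EPS = ""  # assumed marker for ε transitions
_ids = itertools.count()

def _edge(delta, src, label, dst):
    delta.setdefault((src, label), set()).add(dst)

def _merge(d1, d2):
    merged = {key: set(val) for key, val in d1.items()}
    for key, val in d2.items():
        merged.setdefault(key, set()).update(val)
    return merged

def symbol(a):
    """NFA for the single symbol a: start --a--> accept."""
    s, f = next(_ids), next(_ids)
    delta = {}
    _edge(delta, s, a, f)
    return s, f, delta

def concat(m1, m2):
    """NFA for st: accept state of s linked to start state of t by ε."""
    (s1, f1, d1), (s2, f2, d2) = m1, m2
    delta = _merge(d1, d2)
    _edge(delta, f1, EPS, s2)
    return s1, f2, delta

def union(m1, m2):
    """NFA for s | t: new start and accept states joined by ε edges."""
    (s1, f1, d1), (s2, f2, d2) = m1, m2
    s, f = next(_ids), next(_ids)
    delta = _merge(d1, d2)
    for target in (s1, s2):
        _edge(delta, s, EPS, target)
    for source in (f1, f2):
        _edge(delta, source, EPS, f)
    return s, f, delta

def star(m):
    """NFA for s*: ε edges allow skipping s or repeating it."""
    s1, f1, d1 = m
    s, f = next(_ids), next(_ids)
    delta = _merge(d1, {})
    for src, dst in ((s, s1), (s, f), (f1, s1), (f1, f)):
        _edge(delta, src, EPS, dst)
    return s, f, delta

def accepts(m, text):
    """Simulate the NFA by tracking ε-closed sets of states."""
    start, accept, delta = m

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for target in delta.get((stack.pop(), EPS), ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    current = closure({start})
    for ch in text:
        moved = set()
        for state in current:
            moved |= delta.get((state, ch), set())
        current = closure(moved)
    return accept in current

# (a|b)*ab, assembled bottom-up from its subexpressions
m = concat(concat(star(union(symbol("a"), symbol("b"))), symbol("a")), symbol("b"))
print(accepts(m, "babab"), accepts(m, "aba"))
```

Every fragment has exactly one start and one accept state, which is what lets the three combining steps be stated independently of what is inside the sub-fragments.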