Simple Compiler Design for Expressions: A Practical Guide

Learn about lexical analysis and syntax-directed translation in compiler construction with this lecture on simple compiler design for expressions, covering grammars, parse trees, and predictive parsing. Explore the structure and functionality of a basic compiler, understand the derivation process, and delve into regular languages. Get a practical overview of how to design a compiler for expression evaluation.


  1. Lecture 2 Lexical Analysis CSCE 531 Compiler Construction • Topics • Sample Simple Compiler • Operations on strings • Regular expressions • Finite Automata • Readings: January 18, 2018

  2. Overview • Last Time • A little History • Compilers vs Interpreter • Data-Flow View of Compilers • Regular Languages • Course Pragmatics • Today’s Lecture • Why Study Compilers? • xx • References • Chapter 2, Chapter 3 • Assignment Due Wednesday Jan 18 • 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

  3. A Simple Compiler for Expressions • Chapter Two Overview • Structure of the simple compiler, really just a translator for infix expressions → postfix • Grammars • Parse Trees • Syntax-directed Translation • Predictive Parsing • Translator for Simple Expressions • Grammar • Rewritten grammar (an equivalent one better suited to predictive parsing) • Parsing modules fig 2.24 • Specification of Translator fig 2.35 • Structure of translator fig 2.36

  4. Grammars • A grammar (more correctly, a context-free grammar) has • A set of tokens, also known as terminals • A set of nonterminals • A set of productions of the form nonterminal → sequence of tokens and/or nonterminals • A special nonterminal, the start symbol • Example • E → E + E • E → E * E • E → digit

  5. Derivations • A derivation is a sequence of rewritings of a string of grammar symbols using the productions in a grammar. • We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production • X ⇒ Y if there is a production N → β where • The nonterminal N occurs in the sequence X of grammar symbols • And Y is the same as X except that β replaces one occurrence of N • Example • E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

  6. Language generated by a grammar • ⇒* is formally the reflexive, transitive closure of the relation ⇒ • Informally, s1 ⇒* s2 if s2 can be derived from s1 in zero or more steps • The language generated by the grammar G is denoted L(G) and is defined by • L(G) = {w ∈ T* | S ⇒* w} • where T* is the set of all finite strings of terminals and S is the start symbol • Example: S → aS | b • L(G) = {b, ab, aab, …}
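The definition of L(G) can be tried out on the example grammar S → aS | b. The following is a minimal Python sketch, not part of the lecture: the names `PRODUCTIONS` and `generate` are made up for illustration, and the length-based pruning assumes a grammar without ε-productions. It enumerates, by leftmost derivation, every sentence up to a chosen length.

```python
from collections import deque

# Grammar from the slide: S -> aS | b. Keys of PRODUCTIONS are the nonterminals;
# every other character is treated as a terminal.
PRODUCTIONS = {"S": ["aS", "b"]}

def generate(start="S", max_len=4):
    """Return all terminal strings of length <= max_len derivable from start."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal of the sentential form, if any.
        pos = next((i for i, c in enumerate(form) if c in PRODUCTIONS), None)
        if pos is None:
            found.add(form)  # only terminals remain: a sentence of L(G)
            continue
        for rhs in PRODUCTIONS[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Prune forms that already hold too many terminals. This is safe
            # only because the grammar has no ε-productions (an assumption).
            terminals = sum(1 for c in new if c not in PRODUCTIONS)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=lambda w: (len(w), w))

print(generate(max_len=4))  # ['b', 'ab', 'aab', 'aaab']
```

Each string in the result corresponds to one derivation S ⇒* w, which is exactly the membership condition in the definition of L(G).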

  7. Parse Trees • A graphical presentation of a derivation, satisfying • Root is the start symbol • Each leaf is a token or ε (note different font from text) • Each interior node is a nonterminal • If A is a parent with children X1, X2 … Xn then A → X1 X2 … Xn is a production • Note for each derivation there is a unique parse tree. • However, for any parse tree there may be many corresponding derivations. • A leftmost derivation is a derivation in which at each step the leftmost nonterminal is replaced. • E ⇒lm E + E ⇒lm id + E ⇒lm id + E * E ⇒lm id + id * E ⇒lm id + id * id

  8. Ambiguity • E → E + E | E * E | '(' E ')' | id
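A grammar is ambiguous when some sentence has more than one parse tree. As an illustration only, the Python sketch below counts the parse trees that the grammar E → E + E | E * E | ( E ) | id assigns to a token list. The function name `count_parse_trees` and the token spelling are assumptions made for this example.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving tokens[i:j] from E."""
        n = 0
        if j - i == 1 and tokens[i] == "id":
            n += 1                                  # E -> id
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += count(i + 1, j - 1)                # E -> ( E )
        for k in range(i + 1, j - 1):               # E -> E op E, split at k
            if tokens[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))  # 2
```

The sentence id + id * id gets two trees, one grouping the addition first and one grouping the multiplication first, which is why the grammar needs to be rewritten before it can drive a translator.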

  9. The Empty String ε • ε = the string with no characters • S → Sa | a • S → Sa | ε

  10. Equivalent Grammars

  11. Syntax-directed Translation • Frequently the rewriting by a production will be called a reduction, or reducing by the particular production. • Syntax-directed translation attaches actions (code) that are performed when the reductions are performed • Example • E → E + T { print('+'); } • E → E - T { print('-'); } • E → T • T → 0 { print('0'); } • T → 1 { print('1'); } • … • T → 9 { print('9'); }
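The effect of these actions can be seen in a small Python sketch (an illustration, not the course code; the name `to_postfix` is hypothetical). Because the grammar is left-recursive, the sketch uses a loop instead of a recursive call for E, but it emits output at the same points where the slide's actions fire: a digit when T is recognized, and the operator after its right operand.

```python
DIGITS = "0123456789"

def to_postfix(text):
    """Translate single-digit infix expressions with + and - into postfix."""
    out = []
    pos = 0

    def term():
        nonlocal pos
        if pos < len(text) and text[pos] in DIGITS:
            out.append(text[pos])        # action of T -> digit: print the digit
            pos += 1
        else:
            raise SyntaxError(f"digit expected at position {pos}")

    term()                               # E -> T
    while pos < len(text) and text[pos] in "+-":
        op = text[pos]
        pos += 1
        term()
        out.append(op)                   # action of E -> E op T, after T is done
    if pos != len(text):
        raise SyntaxError(f"unexpected character at position {pos}")
    return "".join(out)

print(to_postfix("9-5+2"))  # 95-2+
```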

  12. Fig 2.1

  13. Quadruples • Quadruples • Result • Left operand • Operator • Right operand

  14. Fig 2.3 Dataflow model of compiler

  15. Fig 2.4 Intermediate forms for a loop • do • i = i + 1; • while ( a[i] < v ); • Parse Tree • Quadruples

  16. Left Factoring

  17. Specification of the translator (figure 2.38) • S → L eof • L → E ; L • L → ε • E → T E' • E' → + T { print('+'); } E' • E' → - T { print('-'); } E' • E' → ε • T → F T' • T' → * F { print('*'); } T' • T' → / F { print('/'); } T' • T' → ε • F → ( E ) • F → id { print(id.lexeme); } • F → num { print(num.value); }

  18. Translating to code • E → T E' • E' → + T { print('+'); } E' • E' → - T { print('-'); } E' • E' → ε • The corresponding procedure:
      Expr() {
          int t;
          term();
          while (1)
              switch (lookahead) {
              case '+': case '-':
                  t = lookahead;
                  match(lookahead); term();
                  emit(t, NONE);
                  continue;
              …
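For comparison, here is a minimal predictive translator in Python for the expression part of the specification (E → T E', T → F T', F → ( E ) | id | num). It is a sketch under stated assumptions, not the course's C code: the names `tokenize` and `translate` and the regular-expression tokenizer are invented for illustration, only a single expression is handled (the list rule L → E ; L is left out), and the tail-recursive E' and T' are written as loops in the same way the C fragment does.

```python
import re

# Hypothetical tokenizer: integers, identifiers, or any single non-space character.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))")

def tokenize(text):
    tokens = []
    for num, ident, other in TOKEN.findall(text):
        if num:
            tokens.append(("num", num))
        elif ident:
            tokens.append(("id", ident))
        else:
            tokens.append((other, other))
    tokens.append(("eof", ""))
    return tokens

def translate(text):
    """Translate one infix expression to postfix by predictive parsing."""
    tokens, out, pos = tokenize(text), [], 0

    def lookahead():
        return tokens[pos][0]

    def match(kind):
        nonlocal pos
        if lookahead() != kind:
            raise SyntaxError(f"expected {kind!r}, found {lookahead()!r}")
        pos += 1

    def expr():                              # E -> T E'
        term()
        while lookahead() in ("+", "-"):     # E' -> + T {print} E' | - T {print} E' | ε
            op = lookahead()
            match(op)
            term()
            out.append(op)

    def term():                              # T -> F T'
        factor()
        while lookahead() in ("*", "/"):     # T' -> * F {print} T' | / F {print} T' | ε
            op = lookahead()
            match(op)
            factor()
            out.append(op)

    def factor():                            # F -> ( E ) | id | num
        if lookahead() == "(":
            match("(")
            expr()
            match(")")
        elif lookahead() in ("id", "num"):
            out.append(tokens[pos][1])       # print(id.lexeme) / print(num.value)
            match(lookahead())
        else:
            raise SyntaxError(f"unexpected token {lookahead()!r}")

    expr()
    match("eof")
    return " ".join(out)

print(translate("x * 2 + z"))  # x 2 * z +
```

Each procedure inspects only the current lookahead token to decide what to do, which is what makes the parser predictive.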

  19. Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. • One procedure is associated with each nonterminal of a grammar. • Here, we consider a simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the flow of control through the procedure body for each nonterminal. • The sequence of procedure calls during the analysis of an input string implicitly defines a parse tree for the input, and can be used to build an explicit parse tree, if desired.

  20. stmt → for ( optexpr ; optexpr ; optexpr ) stmt • Each nonterminal leads to a call of its procedure, in the following sequence of calls: • match(for); match('('); optexpr(); match(';'); optexpr(); match(';'); optexpr(); match(')'); stmt();

  21. First(alpha); nullable • FIRST(α) is defined to be the set of terminals that appear as the first symbol of one or more strings of terminals generated from α; if α ⇒* ε, then ε is also in FIRST(α) • S → aS | b | ε • FIRST(a) = {a} • FIRST(S) = {a, b, ε} • Note: if X → X1 X2 … Xn is a production then • FIRST(X) contains FIRST(X1) (apart from ε) • If X1 ⇒* ε then FIRST(X) contains FIRST(X2) • If X1 X2 … Xi ⇒* ε then FIRST(X) contains FIRST(Xi+1) • A string w is nullable if w ⇒* ε
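The rules above amount to a fixed-point computation: keep applying them until no FIRST set grows. The Python sketch below is one way to write that. The grammar encoding (a dict from nonterminal to a list of right-hand sides, with an empty list standing for ε) and the names `first_of` and `first_sets` are assumptions made for this example.

```python
EPS = "ε"

def first_of(symbols, first, grammar):
    """FIRST of a sequence of grammar symbols, given the nonterminals' FIRST sets."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in grammar else {sym}  # a terminal starts itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result      # sym is not nullable, so later symbols cannot come first
    result.add(EPS)            # every symbol was nullable (or the sequence is empty)
    return result

def first_sets(grammar):
    """Iterate the FIRST rules until no set changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of(rhs, first, grammar)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# The slide's example: S -> aS | b | ε
example = {"S": [["a", "S"], ["b"], []]}
print(sorted(first_sets(example)["S"]))  # ['a', 'b', 'ε']
```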

  22. Using First to direct parsing • Lookahead = next_token • If A → α is a production and • the lookahead is in FIRST(α) • then reduce by the production A → α • For predictive parsing to work, if there are two (or more) productions • A → α • A → β • then FIRST(α) ∩ FIRST(β) must be empty
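The disjointness condition is easy to check mechanically once the FIRST sets of the alternatives are known. The Python sketch below takes those sets as given input; the function name `first_conflicts` and the way alternatives are keyed by their text are assumptions for illustration. It checks only the FIRST condition stated on the slide.

```python
from itertools import combinations

def first_conflicts(first_by_rhs):
    """Return (rhs1, rhs2, overlap) for each pair of alternatives of one
    nonterminal whose FIRST sets intersect."""
    conflicts = []
    for (rhs1, f1), (rhs2, f2) in combinations(first_by_rhs.items(), 2):
        overlap = f1 & f2
        if overlap:
            conflicts.append((rhs1, rhs2, overlap))
    return conflicts

# Left-recursive form E -> E + T | T: both alternatives start with a digit.
print(first_conflicts({"E + T": {"digit"}, "T": {"digit"}}))
# [('E + T', 'T', {'digit'})]

# Rewritten form E' -> + T E' | - T E' | ε: the FIRST sets are pairwise disjoint.
print(first_conflicts({"+ T E'": {"+"}, "- T E'": {"-"}, "ε": {"ε"}}))
# []
```

The first result shows why the left-recursive grammar cannot drive a predictive parser, and the second why the rewritten grammar can.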

  23. Overview of the Code Figure 2.36 • ~matthews/public/csce531

  24. Semantic Actions – Translate To Postfix • expr → expr + term { print('+') } • | expr - term { print('-') } • | term • term → term * factor { print('*') } • | term / factor { print('/') } • | factor • factor → ( expr ) • | num { print(num.value) } • | id { print(id.lexeme) }

  25. Trace • x * 2 + z • Parse Tree • Leftmost derivation • Production Action • E ⇒*lm F

  26. Fig 2.46

  27. Operations on Strings • A language over an alphabet is a set of strings of characters from the alphabet. • Operations on strings: • let x=x1x2…xn and t=t1t2…tm then • Concatenation: xt =x1x2…xnt1t2…tm • Alternation: x | t = either x1x2…xn or t1t2…tm

  28. Operations on Sets of Strings • Operations on sets of strings: • For these let S = {s1, s2, … sm} and R = {r1, r2, … rn} • Alternation: S | R = S ∪ R = {s1, s2, … sm, r1, r2, … rn} • Concatenation: • SR = {sr | s ∈ S and r ∈ R} • = {s1r1, s1r2, … s1rn, s2r1, … s2rn, … smr1, … smrn} • Power: S^2 = S S, S^3 = S^2 S, S^n = S^(n-1) S • What is S^0? • Kleene Closure: S* = ∪ (i = 0 to ∞) S^i; note S^0 = {ε}, so ε is in S*

  29. Operations cont. Kleene Closure • Powers: • S^2 = S S • S^3 = S^2 S • … • S^n = S^(n-1) S • What is S^0? • Kleene Closure: S* = ∪ (i = 0 to ∞) S^i; note S^0 = {ε}, so ε is in S*

  30. Examples of Operations on Sets of Strings • Operations on sets of strings: • For these let S = {a,b,c} and R = {t,u} • Alternation: S | R = S ∪ R = {a,b,c,t,u} • Concatenation: • SR = {sr | s ∈ S and r ∈ R} • = {at, au, bt, bu, ct, cu} • Power: S^2 = {aa, ab, ac, ba, bb, bc, ca, cb, cc} • S^3 = {aaa, aab, aac, … ccc}, 27 elements • Kleene closure: S* = {any string of any length of a's, b's and c's}, including the empty string
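These operations map directly onto Python sets of strings. The sketch below is illustrative (the function names are made up), and since S* is infinite it can only build the finite part S^0 ∪ S^1 ∪ … ∪ S^k for a chosen k.

```python
def concat(s, r):
    """Concatenation SR = {sr | s in S and r in R}."""
    return {x + y for x in s for y in r}

def power(s, n):
    """S^n, starting from S^0 = {ε} (the set holding only the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, s)
    return result

def kleene_upto(s, max_power):
    """Finite approximation of S*: the union of S^0 .. S^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(s, i)
    return result

S, R = {"a", "b", "c"}, {"t", "u"}
print(sorted(S | R))          # ['a', 'b', 'c', 't', 'u']
print(sorted(concat(S, R)))   # ['at', 'au', 'bt', 'bu', 'ct', 'cu']
print(len(power(S, 3)))       # 27
```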

  31. Examples of Operations on Sets of Strings

  32. Regular Expressions • For a given alphabet Σ the following are regular expressions: • If a ∈ Σ then a is a regular expression and L(a) = {a} • ε is a regular expression and L(ε) = {ε} • Φ is a regular expression and L(Φ) = ∅ (the empty set) • And if s and t are regular expressions denoting languages L(s) and L(t) respectively then • st is a regular expression and L(st) = L(s) L(t) • s | t is a regular expression and L(s | t) = L(s) ∪ L(t) • s* is a regular expression and L(s*) = L(s)*

  33. Why Regular Expressions? • We use regular expressions to describe the tokens • Example: a regular expression for C identifiers • A C identifier is any string of letters, underscores and digits that starts with a letter or underscore • ID reg expr = (letter | underscore) (letter | underscore | digit)* • Or more explicitly • ID reg expr = (a|b|…|z|A|B|…|Z|_) (a|b|…|z|A|B|…|Z|_|0|1|…|9)*
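The same pattern can be tried with Python's `re` module, where character classes stand in for the long alternations. This is only an illustration of the regular expression; the name `is_c_identifier` is made up, and the check ignores C keywords.

```python
import re

# (letter | underscore) (letter | underscore | digit)*
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_c_identifier(text):
    """True if the whole string matches the identifier pattern."""
    return C_IDENTIFIER.fullmatch(text) is not None

print(is_c_identifier("fort"))    # True
print(is_c_identifier("_x1"))     # True
print(is_c_identifier("9lives"))  # False
```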

  34. Pop Quiz • Given that r and s are regular expressions • What is rε? r | ε? • Describe the language denoted by 0*110* • Describe the language denoted by (0|1)*110* • Give a regular expression for the language of strings of 0's and 1's that end in a 1 • Give a regular expression for the language of strings of 0's and 1's in which every 0 is followed by a 1

  35. Recognizers of Regular Languages • To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs. • The construction of a lexical analyzer will then proceed as: 1. Identify all tokens 2. Develop regular expressions for each 3. Convert the regular expressions to finite automata 4. Use the transition table for the finite automata as the basis for the scanner • We will actually use the tools lex and/or flex for steps 3 and 4.

  36. Transition Diagram for a DFA • Start in state s0; then if the input is "f" make a transition to state s1. • Then from state s1 if the input is "o" make a transition to state s2. • And from state s2 if the input is "r" make a transition to state s3. • The double circle denotes an "accepting state", which means we recognized the token. • Actually there is a missing state and transition • Diagram: s0 -f→ s1 -o→ s2 -r→ s3 (accepting)

  37. Now what about "fort" • The string "fort" is an identifier, not the keyword "for" followed by "t". • Thus we can't really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ])

  38. Deterministic Finite Automata • A deterministic finite automaton (DFA) is a mathematical model that consists of • 1. a set of states S • 2. a set of input symbols Σ, the input alphabet • 3. a transition function δ: S × Σ → S that for each state and each input maps to the next state • 4. a state s0 that is distinguished as the start state • 5. a set of states F distinguished as accepting (or final) states

  39. DFA to recognize keyword "for" • Σ = {a, b, c, …, z, A, B, …, Z, 0, …, 9, ',', ';', …} • S = {s0, s1, s2, s3, sdead} • s0 is the start state • SF = {s3} • δ given by the table below

  40. Language Accepted by a DFA • A string x0x1…xn is accepted by a DFA M = (Σ, S, s0, δ, SF) if s_{i+1} = δ(s_i, x_i) for i = 0, 1, …, n and s_{n+1} ∈ SF • i.e. if x0x1…xn determines a path through the state diagram for the DFA that ends in an accepting state. • Then the language accepted by the DFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.
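The acceptance condition translates into a short loop over the input. The Python sketch below encodes the DFA for the keyword "for" using the state names from the earlier slide; the dict-based transition table, the helper names, and the convention that every missing table entry leads to sdead are assumptions made for illustration.

```python
def make_for_dfa():
    """DFA for the keyword "for": returns (delta, start state, accepting states)."""
    delta = {
        ("s0", "f"): "s1",
        ("s1", "o"): "s2",
        ("s2", "r"): "s3",
    }
    return delta, "s0", {"s3"}

def accepts(delta, start, accepting, text, dead="sdead"):
    """Apply s[i+1] = delta(s[i], x[i]) and test whether the final state accepts."""
    state = start
    for ch in text:
        # Any transition not listed in the table goes to the dead state,
        # and the dead state has no way out.
        state = delta.get((state, ch), dead)
    return state in accepting

delta, start, accepting = make_for_dfa()
print(accepts(delta, start, accepting, "for"))   # True
print(accepts(delta, start, accepting, "fort"))  # False
```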

  41. What is the Language Accepted by…

  42. Non-Deterministic Finite Automata • What does deterministic mean? • In a non-deterministic finite automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S × Σ → S • An NFA can: • Have multiple transitions from a state for the same input • Have ε-transitions, where a transition from one state to another can be accomplished without consuming an input character • Not have transitions defined for every state and every input • Note for NFAs δ: S × (Σ ∪ {ε}) → 2^S, where 2^S is the power set of S

  43. Language Accepted by an NFA • A string x0x1…xn is accepted by an NFA • M = (Σ, S, s0, δ, SF) if there are states with s_{i+1} ∈ δ(s_i, x_i) for i = 0, 1, …, n and s_{n+1} ∈ SF • i.e. if x0x1…xn can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε-transitions wherever necessary. • Then the language accepted by the NFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.
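One common way to test acceptance by an NFA is to track the set of all states the machine could be in, taking ε-closures along the way. The Python sketch below does this for a small example NFA that is not from the lecture: it accepts the strings of 0's and 1's ending in 11. The state names, the use of the empty string as the ε label, and the function names are all assumptions for illustration.

```python
EPS = ""  # label used for ε-transitions in this sketch

def epsilon_closure(states, delta):
    """All states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, start, accepting, text):
    """True if some path labelled by `text` ends in an accepting state."""
    current = epsilon_closure({start}, delta)
    for ch in text:
        moved = set()
        for s in current:
            moved |= set(delta.get((s, ch), ()))  # a missing entry means no move
        current = epsilon_closure(moved, delta)
    return bool(current & accepting)

# Hypothetical example NFA for (0|1)*11, with one ε-transition at the start.
NFA_DELTA = {
    ("q0", EPS): {"q1"},
    ("q1", "0"): {"q1"},
    ("q1", "1"): {"q1", "q2"},   # two choices on the same input
    ("q2", "1"): {"q3"},
}

print(nfa_accepts(NFA_DELTA, "q0", {"q3"}, "0011"))  # True
print(nfa_accepts(NFA_DELTA, "q0", {"q3"}, "0110"))  # False
```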

  44. Language Accepted by an NFA

  45. Thompson Construction • For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
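A compact way to experiment with the construction is to build the NFA from a regular expression given as a small syntax tree. The Python sketch below is a simplified variant written for illustration: it links sub-automata for concatenation with extra ε-edges instead of merging states as the textbook version does, and the tuple encoding of the tree, the use of the empty string as the ε label, and the names `thompson` and `matches` are assumptions.

```python
from itertools import count

EPS = ""  # label used for ε-transitions in this sketch

def thompson(ast):
    """Build an NFA (start, accept, delta) from a regex syntax tree.
    Node forms: ("eps",), ("sym", c), ("cat", r, s), ("alt", r, s), ("star", r)."""
    ids, delta = count(), {}

    def edge(src, label, dst):
        delta.setdefault((src, label), set()).add(dst)

    def build(node):
        kind = node[0]
        start, accept = next(ids), next(ids)   # one fresh start/accept pair per node
        if kind == "eps":
            edge(start, EPS, accept)
        elif kind == "sym":
            edge(start, node[1], accept)
        elif kind in ("cat", "alt"):
            s1, a1 = build(node[1])
            s2, a2 = build(node[2])
            if kind == "cat":                  # run r, then s
                edge(start, EPS, s1); edge(a1, EPS, s2); edge(a2, EPS, accept)
            else:                              # choose r or s
                edge(start, EPS, s1); edge(start, EPS, s2)
                edge(a1, EPS, accept); edge(a2, EPS, accept)
        elif kind == "star":                   # zero or more repetitions of r
            s1, a1 = build(node[1])
            edge(start, EPS, s1); edge(start, EPS, accept)
            edge(a1, EPS, s1); edge(a1, EPS, accept)
        else:
            raise ValueError(f"unknown node kind: {kind!r}")
        return start, accept

    start, accept = build(ast)
    return start, accept, delta

def matches(ast, text):
    """Simulate the constructed NFA on text by tracking sets of states."""
    start, accept, delta = thompson(ast)

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in delta.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = closure(moved)
    return accept in current

# (a|b)*abb written as a syntax tree
A, B = ("sym", "a"), ("sym", "b")
REGEX = ("cat", ("star", ("alt", A, B)), ("cat", A, ("cat", B, B)))
print(matches(REGEX, "babb"))  # True
print(matches(REGEX, "abba"))  # False
```

Every node of the tree contributes a fixed number of states and edges, so the resulting NFA grows linearly with the size of the regular expression.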

