Lecture 2 Lexical Analysis

Lecture 2 Lexical Analysis. CSCE 531 Compiler Construction. Topics Sample Simple Compiler Operations on strings Regular expressions Finite Automata Readings:. January 18, 2018. Overview. Last Time A little History Compilers vs Interpreter Data-Flow View of Compilers

Presentation Transcript


  1. Lecture 2 Lexical Analysis CSCE 531 Compiler Construction • Topics • Sample Simple Compiler • Operations on strings • Regular expressions • Finite Automata • Readings: January 18, 2018

  2. Overview • Last Time • A little History • Compilers vs Interpreter • Data-Flow View of Compilers • Regular Languages • Course Pragmatics • Today’s Lecture • Why Study Compilers? • xx • References • Chapter 2, Chapter 3 • Assignment Due Wednesday Jan 18 • 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

  3. A Simple Compiler for Expressions • Chapter Two Overview • Structure of the simple compiler, really just a translator from infix expressions → postfix • Grammars • Parse Trees • Syntax-directed Translation • Predictive Parsing • Translator for Simple Expressions • Grammar • Rewritten grammar (an equivalent one better suited to predictive parsing) • Parsing modules fig 2.24 • Specification of Translator fig 2.35 • Structure of translator fig 2.36

  4. Grammars • A grammar (more correctly, a context-free grammar) has • A set of tokens, also known as terminals • A set of nonterminals • A set of productions of the form nonterminal → sequence of tokens and/or nonterminals • A special nonterminal, the start symbol • Example • E → E + E • E → E * E • E → digit

  5. Derivations • A derivation is a sequence of rewritings of a string of grammar symbols using the productions of a grammar. • We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production • X ⇒ Y if there is a production N → β where • The nonterminal N occurs in the sequence X of grammar symbols • And Y is the same as X except that β replaces one occurrence of N • Example • E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

  6. Language generated by a grammar • ⇒* is formally the reflexive, transitive closure of the relation ⇒ • s1 ⇒* s2 informally if s2 can be derived in zero or more steps from s1 • The language generated by the grammar G is denoted L(G) and is defined by • L(G) = {w ∈ T* | S ⇒* w} • Where T* is the set of all finite strings of terminals. • Example: S → aS | b • L(G) = {b, ab, aab, …}
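
The definition of L(G) can be tried out on the example grammar S → aS | b. The following is a minimal sketch, not part of the original slides: the names `PRODUCTIONS` and `generate` are made up for illustration, and the length bound is an assumption needed because L(G) is infinite.

```python
from collections import deque

# Grammar from the slide: S -> aS | b. Keys of PRODUCTIONS are the nonterminals.
PRODUCTIONS = {"S": ["aS", "b"]}

def generate(start="S", max_len=4):
    """Return all terminal strings of length <= max_len derivable from start."""
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal in the sentential form, if any.
        pos = next((i for i, ch in enumerate(form) if ch in PRODUCTIONS), None)
        if pos is None:
            found.add(form)          # only terminals left: a string of L(G)
            continue
        for rhs in PRODUCTIONS[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Terminals never disappear in this grammar, so forms that already
            # contain more than max_len terminals can be pruned.
            terminals = sum(1 for ch in new if ch not in PRODUCTIONS)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=lambda w: (len(w), w))

print(generate())  # ['b', 'ab', 'aab', 'aaab']
```

Each string in the result corresponds to one derivation S ⇒* w, which is exactly the membership condition in the definition of L(G).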

  7. Parse Trees • A graphical presentation of a derivation, satisfying • Root is the start symbol • Each leaf is a token or ε (note different font from text) • Each interior node is a nonterminal • If A is a parent with children X1, X2 … Xn then A → X1X2 … Xn is a production • Note that for each derivation there is a unique parse tree. • However, for a given parse tree there may be many corresponding derivations. • A leftmost derivation is a derivation in which at each step the leftmost nonterminal is replaced. • E ⇒*lm E + E ⇒*lm id + E ⇒*lm id + E * E ⇒*lm id + id * E ⇒*lm id + id * id

  8. Ambiguity • E → E + E | E * E | ‘(’ E ‘)’ | id
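
One way to see why this grammar is ambiguous is to count the parse trees for a given token sequence. The sketch below is an illustration, not taken from the slides; the function name `count_parse_trees` and the token spelling are assumptions. A sentence with more than one parse tree shows the ambiguity.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving tokens[i:j] from E."""
        n = 0
        if j - i == 1 and tokens[i] == "id":
            n += 1                                   # E -> id
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += count(i + 1, j - 1)                 # E -> ( E )
        for k in range(i + 1, j - 1):                # k is the operator position
            if tokens[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)   # E -> E + E or E -> E * E
        return n

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))  # 2
```

The two trees for id + id * id correspond to grouping the sum first or the product first; a grammar that separates expr, term and factor removes this choice.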

  9. The Empty String ε • ε = the string with no characters • S → Sa | a • S → Sa | ε

  10. Equivalent Grammars

  11. Syntax-directed Translation • Frequently the rewriting by a production will be called a reduction, or reducing by the particular production. • Syntax-directed translation attaches actions (code) that are performed when the reductions are performed • Example • E → E + T {print(‘+’);} • E → E - T {print(‘-’);} • E → T • T → 0 {print(‘0’);} • T → 1 {print(‘1’);} • … • T → 9 {print(‘9’);}
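
The effect of these actions can be illustrated on an explicit parse tree. This is a small sketch under assumptions: the nested-tuple tree encoding and the name `translate` are invented here, and the tree for 9-5+2 is written out by hand rather than produced by a parser. Each node's action runs after its children have been visited, which is when the corresponding reduction would happen.

```python
def translate(node, out):
    """Visit children left to right, then run the action of the node's production."""
    label, *children = node
    for child in children:
        if isinstance(child, tuple):      # subtrees are tuples, tokens are strings
            translate(child, out)
    if label == "T":
        out.append(children[0])           # T -> digit    { print(digit) }
    elif label == "E" and len(children) == 3:
        out.append(children[1])           # E -> E op T   { print(op) }
    # E -> T has no action
    return out

# Hand-built parse tree for 9-5+2 using E -> E + T | E - T | T, T -> digit.
tree = ("E",
        ("E", ("E", ("T", "9")), "-", ("T", "5")),
        "+",
        ("T", "2"))

print("".join(translate(tree, [])))  # 95-2+
```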

  12. Fig 2.1

  13. Quadruples • Quadruples • Result • Left operand • Operator • Right operand

  14. Fig 2.3 Dataflow model of compiler

  15. Fig 2.4 Intermediate forms for a loop • do • i = i + 1; • while ( a[i] < v ); • Parse Tree • Quadruples
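
Since the figure itself is not reproduced in this transcript, the following is only a plausible sketch of how the loop could be written as quadruples with the four fields listed on slide 13. The temporary `t1`, the label `L1`, and the operator spellings `[]` and `if<` are assumptions made for illustration, not the contents of Fig 2.4.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    """One quadruple, with the fields in the order listed on slide 13."""
    result: str
    left: str
    op: str
    right: str

# Hypothetical quadruples for: do i = i + 1; while (a[i] < v);
LOOP = [
    Quad("i", "i", "+", "1"),       # L1: i = i + 1
    Quad("t1", "a", "[]", "i"),     #     t1 = a[i]
    Quad("L1", "t1", "if<", "v"),   #     if t1 < v goto L1 (result holds the target)
]

def show(q):
    """Render a quadruple as a line of three-address code."""
    if q.op == "[]":
        return f"{q.result} = {q.left}[{q.right}]"
    if q.op.startswith("if"):
        return f"if {q.left} {q.op[2:]} {q.right} goto {q.result}"
    return f"{q.result} = {q.left} {q.op} {q.right}"

for quad in LOOP:
    print(show(quad))
```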

  16. Left Factoring

  17. Specification of the translator (figure 2.38) • S → L eof • L → E ; L • L → ε • E → T E’ • E’ → + T { print(‘+’); } E’ • E’ → - T { print(‘-’); } E’ • E’ → ε • T → F T’ • T’ → * F { print(‘*’); } T’ • T’ → / F { print(‘/’); } T’ • T’ → ε • F → ( E ) • F → id { print(id.lexeme);} • F → num { print(num.value);}

  18. Translating to code • E → T E’ • E’ → + T { print(‘+’); } E’ • E’ → - T { print(‘-’); } E’ • E’ → ε • Expr() { int t; term(); while(1) switch(lookahead){ case ‘+’: case ‘-’: t = lookahead; match(lookahead); term(); emit(t, NONE); continue; …
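
The same idea can be sketched end to end in Python for the whole translation scheme of slide 17. This is an illustration under assumptions, not the course's code: the class `Translator`, the helper `tokenize`, and the decision to return one postfix string per expression are all made up here. As in the slide's `Expr()`, the tail-recursive E’ and T’ productions become loops.

```python
import re

def tokenize(text):
    """Split text into (kind, lexeme) pairs; kind is 'num', 'id', or the character."""
    tokens = []
    for lexeme in re.findall(r"\d+|[A-Za-z_]\w*|\S", text):
        if lexeme[0].isdigit():
            tokens.append(("num", lexeme))
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append(("id", lexeme))
        else:
            tokens.append((lexeme, lexeme))
    tokens.append(("eof", ""))
    return tokens

class Translator:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0
        self.out = []

    @property
    def lookahead(self):
        return self.tokens[self.pos][0]

    def match(self, kind):
        if self.lookahead != kind:
            raise SyntaxError(f"expected {kind!r}, found {self.lookahead!r}")
        lexeme = self.tokens[self.pos][1]
        self.pos += 1
        return lexeme

    def parse(self):                       # S -> L eof
        results = []
        while self.lookahead != "eof":     # L -> E ; L | epsilon
            self.out = []
            self.expr()
            self.match(";")
            results.append(" ".join(self.out))
        self.match("eof")
        return results

    def expr(self):                        # E -> T E'
        self.term()
        while self.lookahead in ("+", "-"):
            op = self.match(self.lookahead)
            self.term()
            self.out.append(op)            # { print('+') } or { print('-') }

    def term(self):                        # T -> F T'
        self.factor()
        while self.lookahead in ("*", "/"):
            op = self.match(self.lookahead)
            self.factor()
            self.out.append(op)            # { print('*') } or { print('/') }

    def factor(self):                      # F -> ( E ) | id | num
        if self.lookahead == "(":
            self.match("(")
            self.expr()
            self.match(")")
        elif self.lookahead in ("id", "num"):
            self.out.append(self.match(self.lookahead))
        else:
            raise SyntaxError(f"unexpected {self.lookahead!r}")

def translate(text):
    return Translator(text).parse()

print(translate("x * 2 + z;"))  # ['x 2 * z +']
```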

  19. Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. • One procedure is associated with each nonterminal of a grammar. • Here, we consider a simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the flow of control through the procedure body for each nonterminal. • The sequence of procedure calls during the analysis of an input string implicitly defines a parse tree for the input, and can be used to build an explicit parse tree, if desired.

  20. stmt → for ( optexpr ; optexpr ; optexpr ) stmt • each nonterminal leads to a call of its procedure, in the following sequence of calls: • match(for); match(‘(’); optexpr(); match(‘;’); optexpr(); match(‘;’); optexpr(); match(‘)’); stmt();

  21. First(alpha); nullable • Define FIRST(α) to be the set of terminals that appear as the first symbol of one or more strings of terminals generated from α; if α ⇒* ε then ε is also in FIRST(α). • S → aS | b | ε • FIRST(a) = {a} • FIRST(S) = { a, b, ε } • Note if X → X1X2 … Xn then • FIRST(X) contains FIRST(X1) • If X1 ⇒* ε then FIRST(X) contains FIRST(X2) • If X1X2 … Xi ⇒* ε then FIRST(X) contains FIRST(Xi+1) • A string w is nullable if w ⇒* ε
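
The containment rules above suggest computing FIRST sets by repeating them until nothing changes. The following is a minimal sketch of that fixed-point iteration; the grammar encoding (a dict from nonterminal to a list of tuples, with the empty tuple standing for ε) and the function names are assumptions made for this example.

```python
EPS = "ε"

def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given the FIRST sets found so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # terminal: FIRST(a) = {a}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result          # this symbol is not nullable, so stop here
    result.add(EPS)                # every symbol was nullable (or the string is empty)
    return result

def first_sets(grammar):
    """Compute FIRST for every nonterminal by iterating until no set grows."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def nullable(symbols, first):
    """A string of grammar symbols is nullable if it derives ε."""
    return EPS in first_of_string(symbols, first)

# S -> aS | b | ε from the slide.
G = {"S": [("a", "S"), ("b",), ()]}
print(sorted(first_sets(G)["S"]))  # ['a', 'b', 'ε']
```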

  22. Using First to direct parsing • Lookahead = next_token • If A → α is a production and • the lookahead is in FIRST(α) • then reduce by the production A → α • For predictive parsing to work if there are two (or more) productions • A → α • A → β • Then FIRST(α) ∩ FIRST(β) must be empty
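
The disjointness condition and the choice of production can be written down directly. In this sketch the FIRST sets are entered by hand for two small examples, and the names `is_predictive`, `choose`, `F_ALTS` and `E_ALTS` are hypothetical.

```python
from itertools import combinations

def is_predictive(first_of_alternatives):
    """True if the FIRST sets of one nonterminal's alternatives are pairwise disjoint."""
    sets = first_of_alternatives.values()
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

def choose(first_of_alternatives, lookahead):
    """Return the production whose FIRST set contains the lookahead, or None."""
    for production, first in first_of_alternatives.items():
        if lookahead in first:
            return production
    return None

# F -> ( E ) | id | num : the lookahead alone picks the production.
F_ALTS = {"F -> ( E )": {"("}, "F -> id": {"id"}, "F -> num": {"num"}}

# E -> E + E | E * E | digit : every alternative can start with a digit.
E_ALTS = {"E -> E + E": {"digit"}, "E -> E * E": {"digit"}, "E -> digit": {"digit"}}

print(is_predictive(F_ALTS), choose(F_ALTS, "id"))  # True F -> id
print(is_predictive(E_ALTS))                        # False
```

The second grammar fails the test, which is why the expression grammar is rewritten before a predictive parser is built for it.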

  23. Overview of the Code Figure 2.36 • ~matthews/public/csce531

  24. Semantic Actions – Translate To Postfix • expr → expr + term { print( ‘+’ ) } • | expr - term { print( ‘-’ ) } • | term • term → term * factor { print( ‘*’ ) } • | term / factor { print( ‘/’ ) } • | factor • factor → ( expr ) • | num { print( num.value ) } • | id { print( id.lexeme ) }

  25. Trace • x * 2 + z • Parse Tree • Leftmost derivation • Production Action • E ⇒*lm F

  26. Fig 2.46

  27. Operations on Strings • A language over an alphabet is a set of strings of characters from the alphabet. • Operations on strings: • let x=x1x2…xn and t=t1t2…tm then • Concatenation: xt =x1x2…xnt1t2…tm • Alternation: x | t = either x1x2…xn or t1t2…tm

  28. Operations on Sets of Strings • Operations on sets of strings: • For these let S = {s1, s2, … sm} and R = {r1, r2, … rn} • Alternation: S | R = S ∪ R = {s1, s2, … sm, r1, r2, … rn} • Concatenation: • SR = {sr | where s ∈ S and r ∈ R} • = {s1r1, s1r2, … s1rn, s2r1, … s2rn, … smr1, … smrn} • Power: S² = S S, S³ = S² S, Sⁿ = Sⁿ⁻¹ S • What is S⁰? • Kleene Closure: S* = ⋃ (i = 0 to ∞) Sⁱ, note S⁰ = {ε} is contained in S*

  29. Operations cont. Kleene Closure • Powers: • S² = S S • S³ = S² S • … • Sⁿ = Sⁿ⁻¹ S • What is S⁰? • Kleene Closure: S* = ⋃ (i = 0 to ∞) Sⁱ, note S⁰ = {ε} is contained in S*

  30. Examples of Operations on Sets of Strings • Operations on sets of strings: • For these let S = {a,b,c} and R = {t,u} • Alternation: S | R = S ∪ R = {a,b,c,t,u} • Concatenation: • SR = {sr | where s ∈ S and r ∈ R} • = {at, au, bt, bu, ct, cu} • Power: S² = {aa, ab, ac, ba, bb, bc, ca, cb, cc} • S³ = {aaa, aab, aac, … ccc}, 27 elements • Kleene closure: S* = {any string of any length of a’s, b’s and c’s, including ε}
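
These operations map directly onto Python sets of strings, which makes the examples easy to check. The sketch below is illustrative only; the function names are made up, and because S* is infinite the closure is truncated at a chosen maximum power.

```python
from itertools import product

def concat(S, R):
    """Concatenation of two sets of strings: every s followed by every r."""
    return {s + r for s, r in product(S, R)}

def power(S, n):
    """S^n, built as S^n = S^(n-1) S starting from S^0 = {ε}."""
    result = {""}                 # the empty string plays the role of ε
    for _ in range(n):
        result = concat(result, S)
    return result

def kleene_upto(S, max_power):
    """Finite part of S*: the union of S^0, S^1, ..., S^max_power."""
    out = set()
    for i in range(max_power + 1):
        out |= power(S, i)
    return out

S = {"a", "b", "c"}
R = {"t", "u"}

print(sorted(S | R))          # ['a', 'b', 'c', 't', 'u']
print(sorted(concat(S, R)))   # ['at', 'au', 'bt', 'bu', 'ct', 'cu']
print(len(power(S, 3)))       # 27
```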

  31. Examples of Operations on Sets of Strings

  32. Regular Expressions • For a given alphabet Σ the following are regular expressions: • If a ∈ Σ then a is a regular expression and L(a) = { a } • ε is a regular expression and L(ε) = { ε } • ∅ is a regular expression and L(∅) = ∅ • And if s and t are regular expressions denoting languages L(s) and L(t) respectively then • st is a regular expression and L(st) = L(s) L(t) • s | t is a regular expression and L(s | t) = L(s) ∪ L(t) • s* is a regular expression and L(s*) = L(s)*

  33. Why Regular Expressions? • We use regular expressions to describe the tokens • Examples: • Reg expr for C identifiers • C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore • ID reg expr = (letter | underscore) (letter | underscore | digit)* • Or more explicitly ID reg expr = (a|b|…|z|A|…|Z|_)(a|b|…|z|A|…|Z|_|0|1|…|9)*
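
The identifier expression can be tried with Python's standard `re` module. This is a small sketch; the names `LETTER`, `DIGIT`, `IDENTIFIER` and `is_identifier` are chosen here for illustration, and character classes stand in for the long alternations.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# (letter | underscore) (letter | underscore | digit)*
IDENTIFIER = re.compile(rf"(?:{LETTER}|_)(?:{LETTER}|_|{DIGIT})*")

def is_identifier(text):
    """True if the whole string matches the identifier regular expression."""
    return IDENTIFIER.fullmatch(text) is not None

for word in ["count", "_tmp1", "fort", "1abc", "a-b"]:
    print(word, is_identifier(word))
```

Note that the pattern also matches keywords such as for; a scanner has to separate keywords from identifiers in a later step.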

  34. Pop Quiz • Given r and s are regular expressions then • What is rε? r | ε? • Describe the language denoted by 0*110* • Describe the language denoted by (0|1)*110* • Give a regular expression for the language of strings of 0’s and 1’s that end in a 1 • Give a regular expression for the language of strings of 0’s and 1’s in which every 0 is followed by a 1

  35. Recognizers of Regular Languages • To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs. • The construction of a lexical analyzer will then proceed as: • 1. Identify all tokens • 2. Develop regular expressions for each • 3. Convert the regular expressions to finite automata • 4. Use the transition table for the finite automata as the basis for the scanner • We will actually use the tools lex and/or flex for steps 3 and 4.

  36. Transition Diagram for a DFA • Start in state s0; then if the input is “f” make the transition to state s1. • Then from state s1 if the input is “o” make the transition to state s2. • And from state s2 if the input is “r” make the transition to state s3. • The double circle denotes an “accepting state”, which means we recognized the token. • Actually there is a missing state and transition • [Transition diagram: s0 --f--> s1 --o--> s2 --r--> s3, with s3 drawn as a double circle]

  37. Now what about “fort” • The string “fort” is an identifier, not the keyword “for” followed by “t”. • Thus we can’t really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ])
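
A common way to handle this is to read the longest possible run of identifier characters first and only then decide whether the lexeme is a keyword. The sketch below assumes a hypothetical `next_token` helper and a one-element keyword set; it is not how lex or flex is implemented internally.

```python
KEYWORDS = {"for"}   # assumed keyword set for this example

def is_word_char(ch):
    return ch.isascii() and (ch.isalnum() or ch == "_")

def next_token(text, pos=0):
    """Return (kind, lexeme, new_pos) for the word that starts at pos."""
    if pos >= len(text) or not is_word_char(text[pos]) or text[pos].isdigit():
        raise ValueError(f"no identifier or keyword starts at position {pos}")
    end = pos
    # Keep reading until a terminator (whitespace or a special symbol) is seen.
    while end < len(text) and is_word_char(text[end]):
        end += 1
    lexeme = text[pos:end]
    kind = "keyword" if lexeme in KEYWORDS else "id"
    return kind, lexeme, end

print(next_token("for(i = 0;"))   # ('keyword', 'for', 3)
print(next_token("fort = 1;"))    # ('id', 'fort', 4)
```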

  38. Deterministic Finite Automata • A deterministic finite automaton (DFA) is a mathematical model that consists of • 1. a set of states S • 2. a set of input symbols Σ, the input alphabet • 3. a transition function δ: S × Σ → S that for each state and each input symbol gives the next state • 4. a state s0 that is distinguished as the start state • 5. a set of states F distinguished as accepting (or final) states

  39. DFA to recognize keyword “for” • Σ = {a,b,c,…,z, A,B,…,Z, 0,…,9, ‘,’, ‘;’, …} • S = {s0, s1, s2, s3, sdead} • s0 is the start state • SF = {s3} • δ given by the table below
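
The transition table is not reproduced in this transcript, so the following is a reconstruction from the transition diagram on slide 36, under the assumption that every transition not shown there leads to sdead. The names `DELTA`, `delta` and `accepts` are made up for the sketch.

```python
START = "s0"
ACCEPTING = {"s3"}
DEAD = "sdead"

# Only the three transitions from the diagram are listed explicitly.
DELTA = {
    ("s0", "f"): "s1",
    ("s1", "o"): "s2",
    ("s2", "r"): "s3",
}

def delta(state, ch):
    """Total transition function: anything not in the table goes to sdead."""
    return DELTA.get((state, ch), DEAD)

def accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = START
    for ch in text:
        state = delta(state, ch)
    return state in ACCEPTING

for word in ["for", "fo", "fort", ""]:
    print(repr(word), accepts(word))
```

Making sdead an explicit state keeps δ defined for every state and every input symbol, as the definition of a DFA requires.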

  40. Language Accepted by a DFA • A string x0x1…xn is accepted by a DFA M = (Σ, S, s0, δ, SF) if si+1 = δ(si, xi) for i = 0, 1, …, n and sn+1 ∈ SF • i.e. if x0x1…xn determines a path through the state diagram for the DFA that ends in an accepting state. • Then the language accepted by the DFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.

  41. What is the Language Accepted by…

  42. Non-Deterministic Finite Automata • What does deterministic mean? • In a non-deterministic finite automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S × Σ → S • An NFA can: • Have multiple transitions from a state for the same input • Have ε transitions, where a transition from one state to another can be accomplished without consuming an input character • Not have transitions defined for every state and every input • Note for NFAs δ: S × (Σ ∪ {ε}) → 2^S, where 2^S is the power set of S

  43. Language Accepted by an NFA • A string x0x1…xn is accepted by an NFA • M = (Σ, S, s0, δ, SF) if there are states with si+1 ∈ δ(si, xi) for i = 0, 1, …, n and sn+1 ∈ SF • i.e. if x0x1…xn can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε transitions wherever necessary. • Then the language accepted by the NFA M = (Σ, S, s0, δ, SF), denoted L(M), is the set of all strings accepted by M.
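
Acceptance by an NFA can be checked by tracking the set of all states the machine could be in after each input character. The sketch below is an illustration with assumed names; `None` stands for ε in the transition table, and the example machine (for the language (a|b)*abb, with an extra ε move out of a separate start state) is made up here rather than taken from the slides.

```python
def epsilon_closure(states, delta):
    """All states reachable from the given states using only ε transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(text, delta, start, accepting):
    """True if some path labelled by text ends in an accepting state."""
    current = epsilon_closure({start}, delta)
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = epsilon_closure(moved, delta)
    return bool(current & accepting)

# Hypothetical NFA for (a|b)*abb. State 0 has two transitions on 'a'.
DELTA = {
    ("start", None): {0},
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

for word in ["abb", "babb", "ab", ""]:
    print(repr(word), nfa_accepts(word, DELTA, "start", {3}))
```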

  44. Language Accepted by an NFA

  45. Thompson Construction • For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
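
A compact sketch of the idea follows. It is a variant written for illustration, not the construction as presented in the lecture or textbook: regular expressions are given as nested tuples instead of being parsed from text, every operator gets a fresh start and accept state joined by ε edges (`None` in the table), and the names `thompson` and `matches` are assumptions.

```python
from itertools import count

def thompson(regex):
    """Build an NFA (delta, start, accept) for a regex given as nested tuples:
    ("eps",), ("sym", c), ("cat", r, s), ("alt", r, s), ("star", r)."""
    delta = {}
    fresh = count()

    def edge(src, symbol, dst):
        delta.setdefault((src, symbol), set()).add(dst)

    def build(node):
        kind = node[0]
        start, accept = next(fresh), next(fresh)
        if kind == "eps":
            edge(start, None, accept)
        elif kind == "sym":
            edge(start, node[1], accept)
        elif kind == "cat":
            s1, a1 = build(node[1])
            s2, a2 = build(node[2])
            edge(start, None, s1)
            edge(a1, None, s2)
            edge(a2, None, accept)
        elif kind == "alt":
            s1, a1 = build(node[1])
            s2, a2 = build(node[2])
            edge(start, None, s1)
            edge(start, None, s2)
            edge(a1, None, accept)
            edge(a2, None, accept)
        elif kind == "star":
            s1, a1 = build(node[1])
            edge(start, None, s1)      # enter the loop body
            edge(start, None, accept)  # or skip it entirely
            edge(a1, None, s1)         # repeat the body
            edge(a1, None, accept)     # or leave the loop
        else:
            raise ValueError(f"unknown regex node {kind!r}")
        return start, accept

    start, accept = build(regex)
    return delta, start, accept

def matches(regex, text):
    """Simulate the constructed NFA on text using sets of states."""
    delta, start, accept = thompson(regex)

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for t in delta.get((stack.pop(), None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = closure(moved)
    return accept in current

# (0|1)*110* from the pop quiz, written as a tuple tree.
ZERO, ONE = ("sym", "0"), ("sym", "1")
R = ("cat", ("cat", ("cat", ("star", ("alt", ZERO, ONE)), ONE), ONE), ("star", ZERO))

for word in ["11", "0110", "1100", "10", "1101"]:
    print(word, matches(R, word))
```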
