Lecture 3: Syntactic Definition
KU | Fall 2018 | Drew Davidson
Announcements
• Entry Surveys (E1) processed
  • Lots of really useful info – thanks!
• Office hours now set on course website
• Lecture slides now posted on website
• P1 is now released
• L2 video / assignment is up
Live Assignments: L2, P1, H1
Live Assignments: P1, H2
Last Time: Implementing Tokenizers
Review Lecture 2 – Implementing Scanners
RegEx → ε-NFA → ε-free NFA → DFA → Tokenizer
• RegEx to ε-NFA: Thompson’s Construction Algorithm
• ε-NFA to ε-free NFA: ε-elimination
• ε-free NFA to DFA: Rabin-Scott Powerset Construction
• DFA to Tokenizer: Transition Action Table
From FSMs to Tokenizers…
Where we left off last time…
• Give our FSMs the ability to put chars back
• Add an EOF (end of file) alphabet symbol
[Figure: FSM with states S, S2, S3, and A, edges labeled “letter” and “letter, digit”, and callouts “Token to return” and “Amount to rewind”]
A Simple Tokenizer
DFA -> Tokenizer
• Consider a language with 2 statement types
  • Assignment: ID = expr
  • Increment: ID += expr
• Where expr is of the form
  • ID + ID
  • ID < ID
  • ID <= ID
• Identifiers follow C conventions
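A minimal sketch of a hand-written tokenizer for this two-statement language. The token names (ID, ASSIGN, PLUSASSIGN, PLUS, LT, LE) and the function `tokenize` are hypothetical, and the "put chars back" idea is realized here as one character of lookahead rather than as the transition action table built in lecture.

```python
# Hypothetical token names; the lecture does not fix them.
TWO_CHAR = {"+=": "PLUSASSIGN", "<=": "LE"}
ONE_CHAR = {"=": "ASSIGN", "+": "PLUS", "<": "LT"}


def is_ident_char(ch, first):
    """C-style identifiers: '_' or letter first, then '_', letters, or digits."""
    if ch == "_":
        return True
    if not ch.isascii():
        return False
    return ch.isalpha() if first else ch.isalnum()


def tokenize(src):
    """Return a list of (token_name, lexeme) pairs."""
    tokens = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif is_ident_char(ch, first=True):
            start = i
            while i < len(src) and is_ident_char(src[i], first=False):
                i += 1
            # The character that ended the identifier is left unconsumed,
            # which has the same effect as reading it and putting it back.
            tokens.append(("ID", src[start:i]))
        elif ch in ONE_CHAR:
            two = src[i:i + 2]  # peek one character ahead
            if two in TWO_CHAR:
                tokens.append((TWO_CHAR[two], two))
                i += 2
            else:
                tokens.append((ONE_CHAR[ch], ch))
                i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

For example, `tokenize("x += y <= z")` produces an ID, a PLUSASSIGN, an ID, an LE, and an ID, in that order.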
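A Simple Tokenizer
DFA -> Tokenizer
[Figure: DFA for the language, with start state S, states B through I, and accepting states labeled A; edges labeled ‘=’, ‘+’, ‘<’, ‘_’|letter, and ‘_’|letter|digit. Legend: a parenthesized label such as (‘=’) stands for any other character]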
Fill in the Transition Action table
[Figure: the same DFA as on the previous slide]
[Figure: course roadmap for the COMPILER, with topics Lexical Analysis, Syntactic Definition, Parsing, SDT, Semantics, Intermediate Representation, Optimization, Code Generation, Runtime Environment, Execution]
This Time
Preview Lecture 3 – Defining Syntax
How Language Syntax is Formally Defined
• Check in on our compiler
• Quick review of Context-Free Grammars
  • Why we need ‘em
  • How we use ‘em
Building the Compiler
Progress Pics
[Figure: compiler pipeline with an “In progress” marker. Source code (sequence of chars) → Scanner (lexical analysis) → Parser (syntactic analysis) → Semantic analysis → IR (Intermediate Representation) code generation → IR optimization → Code generation → Machine code optimization → Output code in T]
• Our Enhanced-RegEx scanner can emit a stream of tokens: + X Z Y =
• … but doesn’t enforce structure
Building the Compiler
Progress Pics
• Our Enhanced-RegEx scanner can emit a stream of tokens: + X Z Y =
• … but doesn’t enforce structure
An unstructured, unordered soup of tokens
Regular Languages: Lack Strength
CFGs: Why we need ‘em
Cannot specify source code constructs we need using RegExes
• i.e. no DFA can recognize exactly the constructs we need
Cute, but weak
Regular Languages: Matching Problem
CFGs: Why we need ‘em
Consider the language of nested parentheses
Examples: ( )   (( ))   () ()
This language cannot be matched by a regular expression (it is not a regular language)
• Intuition: an FSM has finitely many states, so it can only keep track of a finite depth of parentheses
• Let’s sketch the proof
Nested Parens: Proof Sketch
CFGs: Why we need ‘em
• Assume an FSM M can recognize the language
• Let n be the number of states in M
• Feed n + 1 left parens into M
• By the pigeonhole principle, M must have revisited some state s at two input positions i and j (i < j)
• After i left parens, i right parens must lead to acceptance, so there must be a path from s to a final state
• But this means M accepts the same suffix of closed parens after input position i and after input position j, and both cannot be correct
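The pigeonhole step can be tried on a concrete machine. The sketch below assumes a hypothetical DFA encoding (a dict from (state, character) to state) and a hypothetical six-state machine `DELTA` that tracks nesting up to depth 2; it illustrates the argument and is not code from the course.

```python
def find_repeat(delta, start, n_states):
    """Feed '(' repeatedly; return (i, j), i < j, where the DFA is in the same state."""
    seen = {start: 0}  # state -> number of '(' read when first reached
    state = start
    for j in range(1, n_states + 1):
        state = delta[(state, "(")]
        if state in seen:
            return seen[state], j
        seen[state] = j
    raise AssertionError("unreachable: n_states + 1 visits must repeat a state")


def accepts(delta, start, finals, text):
    """Run the DFA over text and report whether it ends in a final state."""
    state = start
    for ch in text:
        state = delta[(state, ch)]
    return state in finals


# Hypothetical DFA: o0/o1/o2 count opens (saturating at 2), c1/c0 count closes.
DELTA = {
    ("o0", "("): "o1", ("o0", ")"): "dead",
    ("o1", "("): "o2", ("o1", ")"): "c0",
    ("o2", "("): "o2", ("o2", ")"): "c1",
    ("c1", "("): "dead", ("c1", ")"): "c0",
    ("c0", "("): "dead", ("c0", ")"): "dead",
    ("dead", "("): "dead", ("dead", ")"): "dead",
}
FINALS = {"o0", "c0"}

i, j = find_repeat(DELTA, "o0", n_states=6)
# The same suffix of i right parens is accepted after i and after j left parens.
balanced = "(" * i + ")" * i
unbalanced = "(" * j + ")" * i
```

Here the machine is in the same state after 2 and after 3 left parens, so it accepts both `(())` and the unbalanced `((())`. Adding more states only moves the collision further out.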
A Brief Reality Check
CFGs: Why we need ‘em
Question 1: Given the previous, can we recognize the language of C-style comments with a regex?
/* … */
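One way to check the question experimentally: C-style comments do not nest (the first `*/` closes the comment), so no unbounded counting is needed. The pattern below is one common formulation and is an assumption of this sketch, not taken from the slides; `is_c_comment` is a hypothetical helper name.

```python
import re

# "/*", then any run of non-stars or stars followed by a non-star non-slash,
# then one or more stars and the closing "/".
C_COMMENT = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")


def is_c_comment(text):
    """True if the whole text is exactly one C-style comment."""
    return C_COMMENT.fullmatch(text) is not None
```

With this pattern, `/* a * b */` is a single comment, while `/* a */ b */` is not, because the comment already ended at the first `*/`.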
Need a More Powerful Language Class
CFGs: Why we need ‘em
Chomsky Hierarchy (most to least powerful):
• Recursively enumerable
• Context-Sensitive
• Context-Free
• Regular
Why Not Max Out Power Level?
CFGs: Why we need ‘em
Question: Why not use something more powerful for tokenization?
Expressive power comes with a price
• Less efficient matching
• Fewer decidable properties of the language
Defining Languages with Grammars
CFGs: How we use ‘em
• A set of (recursive) rewriting rules to rewrite a sequence of symbols
• Any “completed” sequence represents a string in the language
CFG = (N, Σ, P, S) where:
• N: set of nonterminal symbols
• Σ: set of terminal symbols
• P: set of productions
  • Rules of the form LHS → RHS, where
  • LHS: a single nonterminal symbol
  • RHS: a sequence of any symbols
• S: start nonterminal in N
Defining Languages with Grammars
CFGs: How we use ‘em
Example:
N = { A }
Σ = { (, ) }
S = A
P = { A → ( A ), A → ε }
Defining Languages with Grammars
CFGs: How we use ‘em
Producing a string
Example: N = { A }, Σ = { (, ) }, S = A, P = { A → ( A ), A → ε }
• Begin sequence with start symbol: A
• Apply a production in P (a derivation step): A → ( A )
• Get a new sequence: ( A )
• Apply another production in P: A → ( A )
• Get a new sequence: (( A ))
• Apply another production in P: A → ε
• Get a new sequence: (( ))
• All terminals, so this string is in the language
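The derivation steps above can be sketched in a few lines. This assumes the example grammar is A → ( A ) | ε; the dict encoding (an empty list stands for ε) and the function `derive` are hypothetical choices for illustration.

```python
# Nonterminal -> list of right-hand sides; [] encodes the empty string ε.
GRAMMAR = {"A": [["(", "A", ")"], []]}


def derive(start, choices, grammar=GRAMMAR):
    """Apply one production per choice to the leftmost nonterminal.

    Returns every sequence produced along the way, starting with [start].
    """
    sequence = [start]
    history = [list(sequence)]
    for choice in choices:
        # Position of the leftmost nonterminal in the current sequence.
        pos = next(i for i, sym in enumerate(sequence) if sym in grammar)
        sequence[pos:pos + 1] = grammar[sequence[pos]][choice]
        history.append(list(sequence))
    return history


steps = ["".join(seq) for seq in derive("A", [0, 0, 1])]
```

Choosing production 0 twice and then production 1 gives the sequences A, (A), ((A)), (()), and the last one contains only terminals.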
Simplifying Notation: Shorthand
CFGs: How we use ‘em
Example: N = { A }, Σ = { (, ) }, S = A, P = { A → ( A ), A → ε }
• Say N and Σ implicitly: whatever symbols appear in the productions
• Say S implicitly: LHS of the top production
• Collapse rules with the same LHS using a bar
Denote grammar as
A → ( A ) | ε
Or equivalently as
A → ( A )
  | ε
Simplifying Notation: Shorthand
CFGs: How we use ‘em
• BNF (Backus-Naur Form, also called Backus Normal Form)
Denote grammar as
A ::= ( A ) | ε
Or equivalently as
A → ( A ) | ε
Some languages denoted in BNF
CFGs: How we use ‘em
A ::= ( A ) | ε
F ::= l G o l | r o f l
G ::= G o | ε
• F ⇒ lGol ⇒ lol
• F ⇒ lGol ⇒ lGool ⇒ lool
• F ⇒ rofl
Y ::= a Y
Z ::= w t f
• Y ⇒ aY ⇒ aaY ⇒ …
• Accepts no strings (not even the empty string)
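A rough way to explore what such grammars generate is to enumerate their strings up to a length bound. The sketch assumes the grammars on this slide are F ::= l G o l | r o f l with G ::= G o | ε, and Y ::= a Y; the function `strings_up_to` is hypothetical, and the search is only guaranteed to stop for grammars like these, where repeated expansion keeps adding terminals.

```python
from collections import deque


def strings_up_to(grammar, start, max_len):
    """Breadth-first search over sentential forms; collect all-terminal strings."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        pos = next((i for i, s in enumerate(form) if s in grammar), None)
        if pos is None:  # no nonterminals left: a completed string
            results.add("".join(form))
            continue
        for rhs in grammar[form[pos]]:
            new = form[:pos] + tuple(rhs) + form[pos + 1:]
            terminals = sum(1 for s in new if s not in grammar)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results)


LOL = {"F": [["l", "G", "o", "l"], ["r", "o", "f", "l"]], "G": [["G", "o"], []]}
NEVER_DONE = {"Y": [["a", "Y"]]}
```

`strings_up_to(LOL, "F", 5)` finds lol, lool, loool, and rofl, while `strings_up_to(NEVER_DONE, "Y", 5)` finds nothing, because every sequence derived from Y still contains Y.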
Parse Trees
CFGs: How we use ‘em
Represent Derivations
• Nodes are symbols in a tree
• Rooted at the start symbol
• Children are the symbols produced by a derivation step
• Leaves are the final string (if all are terminals)
Example: F ⇒ lGol ⇒ lGool ⇒ lool
F
  l
  G
    G
      ε
    o
  o
  l
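A small sketch of a parse tree as a data structure, assuming the derivation F ⇒ lGol ⇒ lGool ⇒ lool. The `Node` class and `tree_yield` function are hypothetical names; a nonterminal node with no children stands for an ε expansion.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)


def tree_yield(node, nonterminals):
    """Concatenate the leaves left to right to recover the derived string."""
    if node.symbol not in nonterminals:
        return node.symbol  # a terminal leaf contributes itself
    # A nonterminal contributes whatever its children derive ("" for ε).
    return "".join(tree_yield(child, nonterminals) for child in node.children)


# F -> l G o l, outer G -> G o, inner G -> ε
tree = Node("F", [
    Node("l"),
    Node("G", [Node("G"), Node("o")]),
    Node("o"),
    Node("l"),
])
derived = tree_yield(tree, {"F", "G"})
```

Reading the leaves of this tree left to right gives back the string lool.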
CFG use in the Compiler
CFGs: How we use ‘em
Compile Push Symbols
CFG use in the Compiler
CFGs: How we use ‘em
CFG for PL Syntactic Structure
Productions specify valid programs
• Let the set of terminals be the tokens in the language
• Let the nonterminals be the groupings of language constructs
  • (loops, statements, functions, calls, etc.)
• The grammar will recognize (or reject) the stream of tokens from the Lexer
Let’s see an example with this grammar
Productions
Prog ::= begin Stmts end
Stmts ::= Stmts semi Stmt
        | Stmt
Stmt ::= id assign Expr
Expr ::= id
       | Expr cross id
Productions
Prod. 1: Prog ::= begin Stmts end
Prod. 2: Stmts ::= Stmts semi Stmt
Prod. 3: Stmts ::= Stmt
Prod. 4: Stmt ::= id assign Expr
Prod. 5: Expr ::= id
Prod. 6: Expr ::= Expr cross id

Derivation Sequence
Prog
⇒ begin Stmts end (Prod. 1)
⇒ begin Stmts semi Stmt end (Prod. 2)
⇒ begin Stmt semi Stmt end (Prod. 3)
⇒ begin id assign Expr semi Stmt end (Prod. 4)
⇒ begin id assign Expr semi id assign Expr end (Prod. 4)
⇒ begin id assign id semi id assign Expr end (Prod. 5)
⇒ begin id assign id semi id assign Expr cross id end (Prod. 6)
⇒ begin id assign id semi id assign id cross id end (Prod. 5)

Parse Tree
Prog
  begin
  Stmts
    Stmts
      Stmt
        id
        assign
        Expr
          id
    semi
    Stmt
      id
      assign
      Expr
        Expr
          id
        cross
        id
  end
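To see the grammar recognize or reject a token stream, here is a minimal hand-written recognizer. It is a sketch under an assumption: the left-recursive rules are rewritten as loops (Stmts as Stmt followed by any number of semi Stmt, Expr as id followed by any number of cross id), which accepts the same token sequences. It is not a parsing method from the lecture, and the function name `recognizes` is hypothetical.

```python
def recognizes(tokens):
    """True if tokens match: begin Stmt (semi Stmt)* end."""
    pos = 0

    def take(kind):
        """Consume the next token if it has the given kind."""
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == kind:
            pos += 1
            return True
        return False

    def stmt():
        # Stmt ::= id assign Expr, with Expr as id (cross id)*
        if not (take("id") and take("assign") and take("id")):
            return False
        while take("cross"):
            if not take("id"):
                return False
        return True

    if not (take("begin") and stmt()):
        return False
    while take("semi"):
        if not stmt():
            return False
    return take("end") and pos == len(tokens)


example = "begin id assign id semi id assign id cross id end".split()
```

`recognizes(example)` is true for the token stream derived above, and false for a stream such as `begin id assign semi end`, where the expression is missing.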