COMP3190: Principle of Programming Languages Formal Language Syntax
Motivation • The problem of parsing structured text is very common. • Consider the structure of email addresses (using a grammar):
<emailAddress> := <person> @ <host>
<person> := <word>
<host> := <word> | <word>.<host>
• Describe and recognize email addresses in arbitrary text.
Outline • DFA & NFA • Regular expression • Regular languages • Context-free languages & PDA • Scanner • Parser
Deterministic Finite Automata (DFA) • Q: finite set of states • Σ: finite set of “letters” (alphabet) • δ: Q × Σ → Q (transition function) • q0: start state (in Q) • F: set of accept states (subset of Q) • Acceptance: all input is consumed and the automaton ends in an accept state.
Example of a DFA [Diagram: states q1 and q2 with transitions labeled 0 and 1] Accepts all strings that end in 1
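The five-part definition maps directly onto a small table-driven simulator. The sketch below is only an illustration: it assumes q1 is the start state and q2 the single accept state (the flattened diagram does not say), and the names `DELTA` and `dfa_accepts` are made up for this example.

```python
# DFA for "all strings over {0, 1} that end in 1".
# Assumption: q1 is the start state and q2 is the only accept state.
DELTA = {
    ("q1", "0"): "q1",
    ("q1", "1"): "q2",
    ("q2", "0"): "q1",
    ("q2", "1"): "q2",
}
START = "q1"
ACCEPT = {"q2"}


def dfa_accepts(text):
    """Run the DFA over text; accept if it stops in an accept state."""
    state = START
    for symbol in text:
        # Deterministic: exactly one move per (state, symbol) pair.
        state = DELTA[(state, symbol)]
    return state in ACCEPT


print(dfa_accepts("0101"))  # True
print(dfa_accepts("10"))    # False
```

Because δ is a total function, the loop never has to choose between moves, which is what makes a DFA cheap to run.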
Another Example of a DFA [Diagram: states S, q1, q2, r1, r2 with transitions labeled a and b] Accepts all strings that start and end with “a” OR start and end with “b”
Non-deterministic Finite Automata (NFA) • The transition function is different: δ: Q × Σε → P(Q) • P(Q) is the powerset of Q (the set of all subsets) • Σε is the union of Σ and the special symbol ε (denoting the empty string) • A string is accepted if at least one path consumes all the input and leads to an accept state.
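One way to check "at least one accepting path" without backtracking is to track the whole set of states the NFA could be in. The sketch below uses a hypothetical three-state NFA (states `a`, `b`, `c`, not taken from the slides) that accepts strings whose last or second-to-last symbol is 1; the empty Python string stands in for ε.

```python
EPS = ""  # stands for ε

# Hypothetical NFA over {0, 1}. DELTA maps (state, symbol) to a SET of states.
DELTA = {
    ("a", "0"): {"a"},
    ("a", "1"): {"a", "b"},
    ("b", "0"): {"c"},
    ("b", "1"): {"c"},
    ("b", EPS): {"c"},
}
START, ACCEPT = "a", {"c"}


def closure(states):
    """All states reachable from `states` using only ε-moves."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in DELTA.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nfa_accepts(text):
    """Track every state the NFA could be in after each input symbol."""
    current = closure({START})
    for symbol in text:
        moved = set()
        for state in current:
            moved |= DELTA.get((state, symbol), set())
        current = closure(moved)
    # Accepted if at least one possible path ended in an accept state.
    return bool(current & ACCEPT)


print(nfa_accepts("10"))   # True
print(nfa_accepts("100"))  # False
```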
Example of an NFA [Diagram: states q1, q2, q3, q4 with transition labels 0, 1 (twice); 0, ε; 1 (twice)] What strings does this NFA accept?
Regular Expressions R is a regular expression if R is • “a” for some a in Σ • ε (the empty string) • ∅ (the empty language) • the union of two regular expressions • the concatenation of two regular expressions • R1* (Kleene closure: zero or more repetitions of R1).
Regular Expression Notation • a: an ordinary letter • ε: the empty string • M | N: choosing from M or N • MN: concatenation of M and N • M*: zero or more times (Kleene star) • M+: one or more times • M?: zero or one occurrence • [a-zA-Z]: character set alternation (choice) • . : a period stands for any single character except newline
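Most of this notation carries over unchanged to Python's standard `re` module, so it can be tried out directly. The helper name `matches` is made up for this sketch. Note that `re` is a backtracking engine with extensions beyond regular languages; only the shared subset is used here.

```python
import re


def matches(pattern, text):
    """True if the WHOLE of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"a|b", "b"))             # True: M | N
print(matches(r"ab", "ab"))             # True: MN
print(matches(r"a*", ""))               # True: M* allows zero repetitions
print(matches(r"a+", ""))               # False: M+ needs at least one
print(matches(r"ab?", "a"))             # True: M? makes b optional
print(matches(r"[a-zA-Z]+", "Parser"))  # True: character set
print(matches(r"a.c", "a-c"))           # True: . is any single character
print(matches(r"a.c", "a\nc"))          # False: . excludes newline
```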
Examples of Regular Expressions • {0, 1}* 0: all strings that end in 0 • {0, 1} 0*: strings that start with 0 or 1, followed by zero or more 0s • {0, 1}*: all strings • {0^n 1^n | n ≥ 0}: not a regular language, so no regular expression describes it
Converting a Regular Expression to an NFA [Diagrams: NFA fragments for a, MN, M|N, and M*, with ε-transitions connecting the sub-automata for M and N]
Regular expression->NFA Language: Strings of 0s and 1s in which the number of 0s is even Regular expression: (1*01*0)*1*
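The claim that (1*01*0)*1* describes exactly the strings with an even number of 0s can be checked by brute force on short inputs. This is a sanity check, not a proof; the names `EVEN_ZEROS` and `has_even_zeros` are made up for the sketch.

```python
import re
from itertools import product

EVEN_ZEROS = re.compile(r"(1*01*0)*1*")


def has_even_zeros(text):
    """True if the whole string matches (1*01*0)*1*."""
    return EVEN_ZEROS.fullmatch(text) is not None


# Compare the regular expression against a direct count of 0s
# for every binary string of length 0 through 8.
for length in range(9):
    for bits in product("01", repeat=length):
        s = "".join(bits)
        assert has_even_zeros(s) == (s.count("0") % 2 == 0)

print(has_even_zeros("1001"))  # True
print(has_even_zeros("10"))    # False
```

Each pass through the starred group consumes exactly two 0s, which is why the total is always even.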
Converting an NFA to a DFA • For a set of states S, closure(S) is the set of states that can be reached from S without consuming any input. • For a set of states S, DFAedge(S, c) is the set of states that can be reached from S by consuming input symbol c (followed by any ε-moves). • Each set of NFA states corresponds to one DFA state (hence at most 2^n DFA states for an NFA with n states).
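The two definitions above are enough to build the DFA by exploring only the reachable sets of NFA states. The sketch below uses a hypothetical NFA (states `a`, `b`, `c`, not from the slides) that accepts strings whose last or second-to-last symbol is 1; function names mirror the slide's closure and DFAedge but are otherwise illustrative.

```python
EPS = ""  # stands for ε

# Hypothetical NFA over {0, 1}: (state, symbol) -> set of next states.
NFA_DELTA = {
    ("a", "0"): {"a"},
    ("a", "1"): {"a", "b"},
    ("b", "0"): {"c"},
    ("b", "1"): {"c"},
    ("b", EPS): {"c"},
}


def closure(states):
    """States reachable from `states` without consuming input."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in NFA_DELTA.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def dfa_edge(states, symbol):
    """States reachable from `states` by consuming `symbol`."""
    moved = set()
    for state in states:
        moved |= NFA_DELTA.get((state, symbol), set())
    return closure(moved)


def subset_construction(start, accept, alphabet):
    """Build only the DFA states reachable from closure({start})."""
    start_set = closure({start})
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = dfa_edge(current, symbol)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A DFA state accepts if it contains at least one NFA accept state.
    dfa_accept = {s for s in seen if s & accept}
    return dfa_delta, start_set, dfa_accept


DFA_DELTA, DFA_START, DFA_ACCEPT = subset_construction("a", {"c"}, "01")


def dfa_accepts(text):
    state = DFA_START
    for symbol in text:
        state = DFA_DELTA[(state, symbol)]
    return state in DFA_ACCEPT


print(len({source for source, _ in DFA_DELTA}))  # 3
```

For this NFA only 3 of the 2^3 = 8 possible subsets are reachable, which is typical: the exponential bound is a worst case.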
NFA -> DFA • Initial classes: {A, B, E}, {C, D} • No class requires partitioning, hence a two-state DFA is obtained.
Obtaining the minimal equivalent DFA • Initially two equivalence classes: final and nonfinal states. • Search for an equivalence class C and an input letter a such that with a as input, the states in C make transitions to states in k>1 different equivalence classes. • Partition C into k classes accordingly • Repeat until unable to find a class to partition.
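The refinement loop can be sketched in a few lines. This version is a small variation on the steps above: instead of picking one class and one letter at a time, each round splits every class by where its states go on all letters at once, which reaches the same final partition. It assumes a complete DFA whose states are all reachable; the three-state DFA used here (states `A`, `B`, `C`) is hypothetical.

```python
def minimize(states, alphabet, delta, accept):
    """Return the equivalence classes of a complete DFA as a list of sets."""
    # Start with two classes: accept states and non-accept states.
    partition = [
        block for block in (set(accept), set(states) - set(accept)) if block
    ]
    while True:
        index = {s: i for i, block in enumerate(partition) for s in block}
        refined = []
        for block in partition:
            groups = {}
            for s in sorted(block):
                # Which class does s reach on each input letter?
                signature = tuple(index[delta[(s, a)]] for a in alphabet)
                groups.setdefault(signature, set()).add(s)
            refined.extend(groups.values())
        if len(refined) == len(partition):
            return refined  # no class was split: the partition is stable
        partition = refined


# Hypothetical DFA for "ends in 1" with a redundant state C.
STATES = {"A", "B", "C"}
DELTA = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "C", ("B", "1"): "B",
    ("C", "0"): "C", ("C", "1"): "B",
}
CLASSES = minimize(STATES, "01", DELTA, {"B"})
print(len(CLASSES))  # 2
```

Here A and C end up in the same class, so the minimal DFA has two states.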
Regular Grammar • Later definitions build on earlier ones • Nothing is defined in terms of itself (no recursion) • Regular grammar for numeric literals in Pascal:
digit -> 0 | 1 | 2 | ... | 8 | 9
unsigned_integer -> digit digit*
unsigned_number -> unsigned_integer ( ( . unsigned_integer ) | ε ) ( ( e ( + | - | ε ) unsigned_integer ) | ε )
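Because nothing in this grammar is recursive, each rule can be substituted into the next until a single regular expression remains. The sketch below does that substitution by hand in Python's `re` syntax; the names `UNSIGNED_NUMBER` and `is_unsigned_number` are illustrative, and only a lowercase `e` is accepted because that is what the grammar shows.

```python
import re

DIGIT = r"[0-9]"
UNSIGNED_INTEGER = DIGIT + r"+"                      # digit digit*
UNSIGNED_NUMBER = re.compile(
    UNSIGNED_INTEGER
    + r"(\." + UNSIGNED_INTEGER + r")?"              # optional fraction
    + r"(e[+-]?" + UNSIGNED_INTEGER + r")?"          # optional exponent
)


def is_unsigned_number(text):
    """True if text is derivable from unsigned_number."""
    return UNSIGNED_NUMBER.fullmatch(text) is not None


print(is_unsigned_number("3.14"))    # True
print(is_unsigned_number("1.5e-3"))  # True
print(is_unsigned_number("1."))      # False: a fraction needs digits
print(is_unsigned_number(".5"))      # False: must start with a digit
```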
Languages and Automata in Programming Languages • Regular languages: recognized (accepted) by finite automata; useful for tokenizing program text (lexical analysis) • Context-free languages: recognized (accepted) by pushdown automata; useful for parsing the syntax of a program
Important Theorems • A language is regular if and only if some regular expression describes it. • A language is regular if and only if some finite automaton recognizes it. • DFAs and NFAs are equally powerful.
Context-free Grammars • Context-free grammars are defined by substitution rules • Big Jim ate green cheese • green Jim ate green cheese • Jim ate cheese • Cheese ate Jim
Context-free Grammars • Context-free grammars are used to formally describe the syntax of programming languages. • Every syntactically correct program is derived using the context-free grammar of the language. • Parsing a program involves tracing such a derivation, given the context-free grammar and the program.
Context-free Grammars A context-free grammar consists of • V: a finite set of variables • Σ: a finite set of terminals • R: a finite set of rules of the form variable -> {variable, terminal}* • S: the start variable
Pushdown Automata (PDA) • A pushdown automaton consists of • Q: a set of states • Σ: input alphabet (of terminals) • Γ: stack alphabet • δ: a set of transition rules, Q × Σε × Γε → P(Q × Γε), read as (currentState, inputSymbol, headOfStack) → (newState, pushSymbolOnStack) • q0: the start state • F: the set of accept states (subset of Q) • Deterministic: at most one move is possible from any configuration
How does a PDA accept? • By final state: consume all the input while reaching a final state • By empty stack: consume all the input while having an empty stack; the set of final states is irrelevant
Example of a PDA [Diagram: states q1, q2, q3, q4 with transition labels ε, ε → $; 0, ε → 0; 1, 0 → ε; ε, $ → ε; 1, 0 → ε] Notation: a, b → c means that when the PDA reads “a” from the input, it replaces “b” at the top of the stack with “c”. What does this PDA accept?
Important Theorems • A language is context-free if and only if some pushdown automaton recognizes it • Non-deterministic PDAs are strictly more powerful than deterministic ones
Example of a Context-free Language That Requires a Non-deterministic PDA {w w^R | w ∈ {0, 1}*}, where w^R is w written backwards. Idea: non-deterministically guess the middle of the input string
The Solution [Diagram: states q1, q2, q3, q4 with transition labels ε, ε → $; 0, ε → 0; 1, ε → 1; ε, ε → ε; ε, $ → ε; 1, 1 → ε; 0, 0 → ε]
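"Guessing the middle" can be simulated by exploring every reachable configuration (state, input position, stack). The sketch below is a generic acceptance-by-final-state simulator; the assignment of the labels to edges (q1 → q2 pushes $, q2 loops while pushing, q2 → q3 is the guess, q3 loops while popping, q3 → q4 pops $) is an assumption based on the usual construction, since the flattened diagram does not show it. The empty string stands in for ε.

```python
# (state, read, pop) -> set of (next state, push); "" stands for ε.
# Assumption: this edge layout is one standard reading of the diagram.
TRANSITIONS = {
    ("q1", "", ""): {("q2", "$")},
    ("q2", "0", ""): {("q2", "0")},
    ("q2", "1", ""): {("q2", "1")},
    ("q2", "", ""): {("q3", "")},   # guess: the middle is here
    ("q3", "0", "0"): {("q3", "")},
    ("q3", "1", "1"): {("q3", "")},
    ("q3", "", "$"): {("q4", "")},
}
START, ACCEPT = "q1", {"q4"}


def pda_accepts(text):
    """Explore all configurations; accept by final state with input consumed."""
    first = (START, 0, ())          # stack is a tuple, top at the end
    frontier, seen = [first], {first}
    while frontier:
        state, pos, stack = frontier.pop()
        if state in ACCEPT and pos == len(text):
            return True
        for (src, read, pop), targets in TRANSITIONS.items():
            if src != state:
                continue
            if read and (pos >= len(text) or text[pos] != read):
                continue
            if pop and (not stack or stack[-1] != pop):
                continue
            rest = stack[:-1] if pop else stack
            for nxt_state, push in targets:
                config = (nxt_state, pos + len(read),
                          rest + ((push,) if push else ()))
                if config not in seen:
                    seen.add(config)
                    frontier.append(config)
    return False


print(pda_accepts("0110"))  # True
print(pda_accepts("010"))   # False
```

The search terminates because the stack only grows when an input symbol is read and no ε-move loops back on itself.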
Derivations and Parse Trees • Nested constructs require recursion, i.e. context-free grammars • CFG for arithmetic expressions:
expression -> identifier | number | - expression | ( expression ) | expression operator expression
operator -> + | - | * | /
Parse Tree for Slope*x + Intercept Is this the only parse tree for this expression and grammar?
A Better Expression Grammar
1. expression -> term | expression add_op term
2. term -> factor | term mult_op factor
3. factor -> identifier | number | - factor | ( expression )
4. add_op -> + | -
5. mult_op -> * | /
A good grammar reflects the internal structure of programs. This grammar is unambiguous and captures (HOW?): • operator precedence (*, / bind tighter than +, -) • associativity (operators group left to right)
And Better Parse Trees... [Parse trees for 3 + 4 * 5 and 10 - 4 - 3]
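The shape of these trees can be reproduced with a small hand-written parser for the better expression grammar. This is a sketch under two stated assumptions: the left-recursive rules (expression -> expression add_op term) are rewritten as loops, which preserves left-to-right grouping, and the result is an abbreviated tree of nested tuples rather than a full parse tree. All function names are illustrative.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expression():  # expression -> term | expression add_op term
        tree = term()
        while peek() in ("+", "-"):
            op = advance()
            tree = (op, tree, term())   # older tree goes on the LEFT
        return tree

    def term():        # term -> factor | term mult_op factor
        tree = factor()
        while peek() in ("*", "/"):
            op = advance()
            tree = (op, tree, factor())
        return tree

    def factor():      # factor -> identifier | number | - factor | (expression)
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok == "-":
            return ("neg", factor())
        if tok == "(":
            tree = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return tree
        if tok in ("+", "*", "/", ")"):
            raise SyntaxError(f"unexpected {tok!r}")
        return int(tok) if tok.isdigit() else tok

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


print(parse("3 + 4 * 5"))   # ('+', 3, ('*', 4, 5))
print(parse("10 - 4 - 3"))  # ('-', ('-', 10, 4), 3)
```

Precedence falls out of the nesting of the rules (a term is completed before an add_op is considered), and associativity falls out of building the tree leftwards inside each loop.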
Syntax-directed Compilation • Parser calls scanner to obtain tokens. • Assembles tokens into parse tree. • Passes tree to later phases of compilation. • Scanner: deterministic finite automaton. • Parser: pushdown automaton. • Scanners and parsers can be generated automatically from regular expressions and CFGs (e.g. lex/yacc).
Scanning • Accept the longest possible token in each invocation of the scanner. • Implementation: capture the finite automaton either with case (switch) statements or with a table and driver.
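The longest-match rule can be illustrated without building the automaton by hand: try every token pattern at the current position and keep the longest match. The token names and patterns below are hypothetical, and Python's `re` stands in for the DFA that a real scanner would use.

```python
import re

# Hypothetical token set. Earlier entries win ties.
TOKEN_SPECS = [
    ("NUMBER", re.compile(r"[0-9]+(\.[0-9]+)?")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("LE", re.compile(r"<=")),
    ("LT", re.compile(r"<")),
    ("ASSIGN", re.compile(r":=")),
    ("COLON", re.compile(r":")),
    ("SKIP", re.compile(r"\s+")),
]


def scan(text):
    """Split text into (kind, lexeme) pairs using the longest-match rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_SPECS:
            match = pattern.match(text, pos)
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if best_kind != "SKIP":          # whitespace is discarded
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens


print(scan("x1 := 3.14 <= y"))
# [('ID', 'x1'), ('ASSIGN', ':='), ('NUMBER', '3.14'), ('LE', '<='), ('ID', 'y')]
```

Without the longest-match rule, `<=` could be split into `<` followed by an error, and `3.14` could stop at `3`.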
Scanner Generators • Start with a regular expression. • Construct an NFA from it. • Use a set of subsets construction to obtain an equivalent DFA. • Construct the minimal equivalent DFA.
Outline • DFA & NFA • Regular expression • Regular languages • Context-free languages & PDA • Scanner • Parser • Top-down parsing • Bottom-up parsing • Comparison
Parsing approaches • Parsing an arbitrary context-free grammar has O(n^3) cost in general. • Need classes of grammars that can be parsed in linear time. • Top-down, or predictive parsing, or recursive descent parsing, or LL parsing (Left-to-right, Left-most derivation) • Bottom-up, or shift-reduce parsing, or LR parsing (Left-to-right, Right-most derivation)
A Simple Grammar for a Comma-separated List of Identifiers
id_list -> id id_list_tail
id_list_tail -> , id id_list_tail
id_list_tail -> ;
String to be parsed: A, B, C;
Top-down Parsing • Predicts a derivation • Matches non-terminal against token observed in input
LL(1) Grammar • A grammar for which a top-down deterministic parser can be produced with one token of look-ahead. • For a given non-terminal, the lookahead symbol uniquely determines the production to apply. • Top-down parsing = predictive parsing • Driven by a predictive parsing table: non-terminals × terminals → productions
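A table-driven predictive parser for the comma-separated identifier list grammar fits in a few lines. The sketch below is illustrative: the table was filled in by hand from the three productions, `$` is used as an assumed end-of-input marker, and `tokenize` is a hypothetical helper that treats every run of ASCII letters as an `id` token.

```python
import re

# Predictive parsing table: (non-terminal, lookahead) -> production body.
TABLE = {
    ("id_list", "id"): ["id", "id_list_tail"],
    ("id_list_tail", ","): [",", "id", "id_list_tail"],
    ("id_list_tail", ";"): [";"],
}
NONTERMINALS = {"id_list", "id_list_tail"}


def tokenize(text):
    """Hypothetical scanner: letter runs become `id`, other symbols stay."""
    tokens = []
    for lexeme in re.findall(r"[A-Za-z]+|\S", text):
        is_id = lexeme.isascii() and lexeme.isalpha()
        tokens.append("id" if is_id else lexeme)
    return tokens + ["$"]            # assumed end-of-input marker


def ll1_parse(text):
    """Return the list of productions applied, or None on a syntax error."""
    tokens, pos = tokenize(text), 0
    stack = ["$", "id_list"]         # start symbol on top of the end marker
    derivation = []
    while stack:
        top, look = stack.pop(), tokens[pos]
        if top in NONTERMINALS:
            production = TABLE.get((top, look))
            if production is None:
                return None          # no table entry: syntax error
            derivation.append((top, production))
            stack.extend(reversed(production))  # leftmost symbol on top
        elif top == look:
            pos += 1                 # terminal matches the lookahead
        else:
            return None
    return derivation if pos == len(tokens) else None


for step in ll1_parse("A, B, C;"):
    print(step[0], "->", " ".join(step[1]))
```

Each table lookup uses exactly one token of look-ahead, and the printed sequence is the left-most derivation of `A, B, C;`.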