1.14k likes | 1.38k Views
Lecture 3 Introduction to Parsing and Top-Down Parsing. Cheng-Chia Chen. Outlines. Parsing overview Limitation of regular expressions Context-free Grammar Derivation and Parse Tree Ambiguity of CFGs Concrete Parse Tree v.s. Abstract Syntax Tree. Recursive Descent Parsing
E N D
Lecture 3 Introduction to Parsing andTop-Down Parsing Cheng-Chia Chen
Outlines • Parsing overview • Limitation of regular expressions • Context-free Grammar • Derivation and Parse Tree • Ambiguity of CFGs • Concrete Parse Tree v.s. Abstract Syntax Tree. • Recursive Descent Parsing • LL(1) parsing
Parsing Overview • What is syntax ? • The way in which words are put together to form phrases, clauses, or sentences. --- Webster’s Dictionary • The function of a parser : • Input: sequence of tokens from lexer • Output: parse tree of the program
?: INT INT == ID ID Example • Java expr x == y ? 1 : 2 • Parser input ID == ID ? INT : INT • Parser output
The Role of the Parser • Not all sequences of tokens are programs . . . • . . . Parser must distinguish between valid and invalid sequences of tokens • We need • A language for describing valid sequences of tokens • A method for distinguishing valid from invalid sequences of tokens
Limitation of regular expression • Recall how we define lexical structures using regular expressions. • Ex: ---- (1) • digits = [0-9]+ • sum = (digits “+”)* digits • match sums of the form: 28 + 301 + 9. • Can we use the same way to define the syntax of a language?
Inadequacy of regular expressions • Consider how to use regular expressions to express sums with parentheses ? • (109+235), • 61, • (1+ (250 + 34)) • A solution: --- (2) • digits = [0-9]+ • sum = expr + expr • expr = “(“ sum “)” | digits • Note the difference b/t (1) and (2). • in (1) digits and sum can be and actually is treated as abbreviations of their right hand side(RHS). • in (2), the LHS name are used recursively at their RHS, and cannot be treated as abbreviations.
Inadequacy of regular expressions • To translate (1) • digits = [0-9]+ • sum = (digits “+”)* digits • into FA, we first translate each definition into normal regular expressions by replacing every reference of definitions by its RHS recursively: • digits = [0-9]+ • sum = ([0-9]+ “+”)* [0-9]+ • and then apply normal procedures to translate regular expressions into DFAs. • But for definitions with recursion like (2), this approach does not work.
--- (2) • digits = [0-9]+ • sum = expr “+” expr • expr = “(“sum“)” | digits • Repalce sum in 3. by their RHS at 2: • expr = “(“expr“+”expr“)” | digits---- (4) • Repace expr at (4) by itself, we get: • expr = “(“ (“(“ expr “+” expr “)” | digits) “+” • (“(“ expr “+” expr “)” | digits) • | digits--- (5) • Conclusion: • It is hopeless to eliminate names in RHS by finite substitutions it there are direct or indirect recursions.
It can be shown that • regular expressions with non-recursive abbreviations do not increase the expressive power of regular expressions, but , • Regular expressions allowing recursive abbreviations do increase the expressive power of regular expressions • --- This formalism is called context-free grammar, • and is just what we need for parsing.
Redundancy induced by recursion • Alternation(|) is not needed: • expr = ab(c|d) e ==> aux = c | d expr = a b aux e • ==> aux=c aux = d expr = a b aux e • Repetition(*) is not needed • expr = (a b c ) * ==> • expr = (a b c) expr --- right recursion • // expr = expr (a b c ) --- left recursion • expr = e.
Context-Free Grammars • Programming language constructs have recursive structure • An EXPR is if EXPR then EXPR else EXPR fi , or while ( EXPR ) EXPR , or … • Context-free grammars are a natural notation for this recursive structure
CFGs (Cont.) • A CFG consists of • A set of terminals T • A set of non-terminals N • A start symbolS (a non-terminal) • A set P of productions, where each rule r is of the form: AssumingX N X e, or X Y1 Y2 ... Ynwhere Yi N T
Notational Conventions: • In these lecture notes • Non-terminals are written upper-case (S, A,B,C,…) • Terminals are written lower-case (a,b,c,…) • X,Y,Z, … range over T U N • a,b,g, … ranges over strings of V U T. • The start symbol (S) is the left-hand side of the first production • Each terminal symbol corresponds to a token type from the lexer. • Each non-terminal symbol corresponds to a symbol occurring at the LHS of a production rule.
Examples of CFGs Expr if Expr then Expr else Expr | while Expr do Expr | id
Examples of CFGs Simple arithmetic expressions: E E * E | E + E | ( E ) | id
The Language of a CFG Read productions as replacement rules: X Y1 … Yn Means X can be replaced by Y1 … Yn X e Means X can be erased (replaced with empty string)
Key Idea • Begin with a string consisting of the start symbol “S” • Replace any non-terminal Xin the string by a right-hand side of some production • Repeat (2) until there are no non-terminals in the string X Y1 … Yn
The Language of a CFG (Cont.) • Write X1…X2…Xn X1… Xi-1 Y1 … Yn Xi+1…Xn if there is a production Xi Y1 … Yn • More formally : • a X b a g b if ∃ x g ∈ P. • or define to be the binary relation • { (a X b,a g b ) | a X b a g b • if ∃ x g ∈ P. } • on (T U N)*.
The Language of a CFG (Cont.) • Write X1…Xn* Y1 … Ym if X1…Xn…… Y1 … Ym in 0 or more steps • Or formally: • Define * to be the reflexive and transitive closure of the relation . • i.e., a * b iff • a = b or • ∃ a1,a2,…,an (n ≥ 1) such that a a1 … an = b.
The Language of a CFG Let Gbe a context-free grammar with start symbol S. Then the language L(G) of G is the set: { a | S * a where a is a terminal string. }
Terminals • Terminals are called because there are no rules for replacing them • Once generated, terminals are permanent • Terminals ought to be token types of the language
Examples Strings of balanced parentheses : Two representations of the same grammar G: OR
Arithmetic Example Simple arithmetic expressions: Some elements of the language:
Notes The idea of a CFG is a big step. But: • Membership in a language is just “yes” or “no” • We need also the parse tree of the input • Must handle errors gracefully • Need (tools for) an implementation of CFG’s.
Derivations and Parse Trees A derivation is a sequence of strings over terminals and nonterminals S0, S1 , S2 , … Sn such that 1. Si S i+1 for 0 ≤ i < n. 2. S0 = S is the start symbol. • A derivation can be drawn as a tree • Start symbol is the tree’s root • For a production X Y1… Yn • add children Y1… Yn to node X
Derivation Example • Grammar • String
Derivation Example (Cont.) E E + E E * E id id id
Derivation in Detail (2) E E + E
Derivation in Detail (3) E E + E E * E
Derivation in Detail (4) E E + E E * E id
Derivation in Detail (5) E E + E E * E id id
Derivation in Detail (6) E E + E E * E id id id
Properties of parse trees. • A parse tree has • start symbol at the root • terminals or empty nodes at the leaves • Non-terminals at the internal nodes • if internal node X has children Y1,…,Yn, then • X Y1 Y2… Yn is a production rule. • An in-order traversal of the leaves is the original input. • The parse tree makes explicit the structure of the input string.
Left-most and Right-most Derivations • The example is a right-most derivation • At each step, replace the right-most non-terminal • There is an equivalent notion of a left-most derivation
Right-most Derivation in Detail (3) E E + E id
Right-most Derivation in Detail (4) E E + E E * E id
Right-most Derivation in Detail (5) E E + E E * E id id
Right-most Derivation in Detail (6) E E + E E * E id id id
Derivations and Parse Trees • Note that right-most and left-most derivations have the same parse tree • The difference is the order in which branches are added
Summary of Derivations • We are not just interested in whether s 2L(G) • We need a parse tree for s • A derivation defines a parse tree • But one parse tree may have many derivations • Left-most and right-most derivations are important in parser implementation
Issues • A parser consumes a sequence of tokens s and produces a parse tree • Issues: • How to recognize that s 2L(G) ? • How to generate a parse tree of s once s L(G) • Ambiguity: Is there more than one parse tree (interpretation) for some string s ? • Error handling: What should we do if no parse tree exist for an input string.
Ambiguity • Grammar E ! E + E | E * E | ( E ) | int • String int * int + int
Ambiguity (Cont.) This string has two parse trees E E E + E E E * E E int int E + E * int int int int
Ambiguity (Cont.) • A grammar is ambiguous if it has more than one parse tree for some string • Equivalently, there is more than one right-most or left-most derivation for some string • Ambiguity is bad • Leave meaning of some programs ill-defined • Ambiguity is common in programming languages • Arithmetic expressions • IF-THEN-ELSE
Dealing with Ambiguity • There are several ways to handle ambiguity • Most direct method is to rewrite the grammar unambiguously E ! T + E | T T ! int * T | int | ( E ) • Enforces precedence of * over +