Parsing Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn
Syntax Tree • A systematic way to put a program into memory • data type definition + a bunch of functions • the programmer calls them explicitly • tedious and error-prone • But we write programs in ASCII form, so how can we construct the tree automatically? • A technique called (automatic) parsing • A program clever enough to do this automatically
Roadmap: stream of characters -> Lexer -> stream of tokens -> Parser -> abstract syntax -> other part • Lexer: eats an ASCII sequence, emits a token sequence • Parser: eats a token sequence, emits abstract syntax trees • other part: later in this course
Parsing • Take as input a sequence of terminals, and construct syntax trees automatically • Problem: how do we know whether a sequence of input tokens is valid?
Example: is the input "y m z" valid?
S ::= x A | y B
A ::= u C | v C
B ::= t C | m C
C ::= w | z
Nonterminals: S A B C
Terminals: ID_X ID_Y ID_U ID_V ID_T ID_M ID_W ID_Z
• Derive "y m z" from S. First try: S -> x A. Oops, mismatch!
• Another try: S -> y B. Aha, great! The "y" matches.
• Recursion: derive "m z" from B. First try: B -> t C. Mismatch. Second try: B -> m C. The "m" matches.
• Recursion: derive "z" from C. First try: C -> w. Mismatch. Second try: C -> z. Matched. Success.
• Derivation used: S -> y B, B -> m C, C -> z
• Resulting syntax tree: S ( y B ( m C ( z ) ) )
Recursive Descent Algorithm • This process can be described by a recursive descent algorithm • For each nonterminal, write a (recursive) parsing function • every RHS becomes a case in a big switch • the function may take some semantic actions, besides parsing • later in this course
Recursive Descent Algorithm
S ::= x A | y B
A ::= u C | v C
B ::= t C | m C
C ::= w | z
// The function interface looks like:
void parseS ();
void parseA ();
void parseB ();
void parseC ();
Recursive Descent Algorithm
S ::= x A | y B
A ::= u C | v C
B ::= t C | m C
C ::= w | z
struct token t;      // recall the token module
t = getToken ();     // and the lexer module: read the first token before parsing
void parseS ()
{
  switch (t.kind) {  // assumption: the token struct exposes its kind as t.kind
  case ID_X:
    t = getToken ();
    parseA ();
    return;
  case ID_Y:
    t = getToken ();
    parseB ();
    return;
  default:           // not ID_X or ID_Y
    error ("want 'x' or 'y'");
    return;
  }
}
// Leave the algorithm for parseA (), parseB () and parseC () to you.
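For reference, here is one way the remaining functions might be filled in. This is only a sketch under simplifying assumptions: each token is a single character of a string (standing in for the course's token and lexer modules), and the names `advance`, `syntax_error` and `parse` are hypothetical helpers, not part of the lab code.

```c
#include <stdio.h>

/* Assumption: each token is one character of a NUL-terminated string;
   this stands in for the course's token and lexer modules. */
static const char *input;   /* remaining input */
static char t;              /* current lookahead token */
static int ok;              /* cleared on the first syntax error */

/* Load the next token into t; '\0' plays the role of end of input. */
static void advance(void) { t = *input; if (t != '\0') input++; }

/* Report only the first error, then remember that parsing failed. */
static void syntax_error(const char *msg) {
    if (ok) fprintf(stderr, "syntax error: %s\n", msg);
    ok = 0;
}

/* C ::= w | z */
static void parseC(void) {
    switch (t) {
    case 'w': case 'z': advance(); return;
    default: syntax_error("want 'w' or 'z'"); return;
    }
}

/* A ::= u C | v C */
static void parseA(void) {
    switch (t) {
    case 'u': case 'v': advance(); parseC(); return;
    default: syntax_error("want 'u' or 'v'"); return;
    }
}

/* B ::= t C | m C */
static void parseB(void) {
    switch (t) {
    case 't': case 'm': advance(); parseC(); return;
    default: syntax_error("want 't' or 'm'"); return;
    }
}

/* S ::= x A | y B */
static void parseS(void) {
    switch (t) {
    case 'x': advance(); parseA(); return;
    case 'y': advance(); parseB(); return;
    default: syntax_error("want 'x' or 'y'"); return;
    }
}

/* Returns 1 if the whole string is a sentence of S, 0 otherwise. */
int parse(const char *s) {
    input = s;
    ok = 1;
    advance();
    parseS();
    if (ok && t != '\0') syntax_error("trailing input");
    return ok;
}
```

With these definitions, `parse("ymz")` returns 1, while `parse("xmz")` returns 0 because no production of A starts with m. One token of lookahead is enough to pick the right case in every switch, so the parser never backtracks.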
Summary so Far • Recursive descent parsing: • also called predictive parsing, or top-down parsing • simple and efficient • can be coded by hand quickly • see problem 2 in lab #4 • But the constraint is that not every grammar can be parsed by a recursive descent parser • Example below
Example
Derive me: x m z
S ::= x A | x B
A ::= m C | m C
B ::= m z | m z
C ::= w | y
S -> x A or S -> x B?
Recursive Descent Algorithm?
S ::= x A | x B
A ::= m C | m C
B ::= m z | m z
C ::= w | y
struct token t;      // recall the token module
t = getToken ();     // and the lexer module
void parseS ()
{
  switch (t.kind) {  // assumption: the token struct exposes its kind as t.kind
  case ID_X:
    t = getToken ();
    ??????           // what code here? parseA () or parseB ()?
    return;
  default:
    error (“….”);
    return;
  }
}
Another Example
Derive me: x = 3+4; y = x-(1+2);
stm -> id = exp; | id = exp; stm
exp -> exp + exp | exp - exp | num | id | (exp)
stm -> id = exp; or stm -> id = exp; stm?
Moral • We need a notion of what a production’s RHS s could start with • s \in (T \/ N)* • We call it the first (terminal) set, written F[s], for a string s \in (T \/ N)* • Next, we compute the first set F for given terminals and nonterminals
First Set F
S ::= x A | y B
A ::= u C | v C
B ::= t C | m C
C ::= w | z
F [S] = {x, y} F [A] = {u, v} F [B] = {t, m} F [C] = {w, z}
// And generalize to strings
F [x A] = {x} F [y B] = {y} F [u C] = {u} F [v C] = {v} …
// Then why could this grammar be parsed?
Example
S ::= x A | x B
A ::= m C | m C
B ::= m z | m z
C ::= w | y
F (S) = ? F (A) = ? F (B) = ? F (C) = ?
Predictive parsing?
Another Example
stm -> id = exp; | id = exp; stm
exp -> exp + exp | exp - exp | num | id | (exp)
F (stm) = ? F (exp) = ?
Predictive parsing?
Empty Production Rules
Z ::= d | X Y Z
Y ::= c | \eps
X ::= Y | a
F (Z) = ? F (Y) = ? F (X) = ?
Predictive parsing?
Algorithm #1: nullable
// nullable[X]: whether X derives \eps or not
// all initialized to false
repeat
  for each production rule X -> Y1 Y2 … Yn
    if ((nullable[Y1], …, nullable[Yn] are all true) or (n=0))
      nullable[X] = true
until nullable[] did not change
Example
Z ::= d | X Y Z
Y ::= c | \eps
X ::= Y | a
// initialization
nullable[Z, Y, X] = false
// round 1
nullable[Z] = false nullable[Y] = true nullable[X] = true
// round 2
nullable[Z] = false nullable[Y] = true nullable[X] = true
// finished!
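The fixed-point loop of Algorithm #1 can be sketched in C as follows. The encoding is an assumption made for illustration: nonterminals are upper-case letters, terminals are lower-case letters, each right-hand side is a string, and the empty string stands for \eps. The names `struct production`, `compute_nullable` and `is_nullable` are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Assumed encoding: 'A'..'Z' are nonterminals, 'a'..'z' are terminals,
   and an empty right-hand side stands for \eps. */
struct production { char lhs; const char *rhs; };

/* Z ::= d | X Y Z    Y ::= c | \eps    X ::= Y | a */
static const struct production grammar[] = {
    {'Z', "d"}, {'Z', "XYZ"},
    {'Y', "c"}, {'Y', ""},
    {'X', "Y"}, {'X', "a"},
};
enum { NPROD = sizeof grammar / sizeof grammar[0] };

static int nullable[26];   /* indexed by nonterminal letter - 'A' */

/* 1 if every symbol of s is a nullable nonterminal (trivially 1 for ""). */
static int all_nullable(const char *s) {
    for (; *s != '\0'; s++)
        if (!isupper((unsigned char)*s) || !nullable[*s - 'A'])
            return 0;
    return 1;
}

/* Repeat over all productions until no nullable[] entry changes. */
void compute_nullable(void) {
    int changed;
    memset(nullable, 0, sizeof nullable);
    do {
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int x = grammar[p].lhs - 'A';
            if (!nullable[x] && all_nullable(grammar[p].rhs)) {
                nullable[x] = 1;
                changed = 1;
            }
        }
    } while (changed);
}

int is_nullable(char nonterminal) { return nullable[nonterminal - 'A']; }
```

After `compute_nullable()`, `is_nullable('Y')` and `is_nullable('X')` are 1 and `is_nullable('Z')` is 0, matching the rounds shown in the example. The loop terminates because entries only ever flip from false to true.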
Algorithm #2: First Set
// F[X]: first terminals X could derive
// all initialized to empty; for a terminal a, F[a] = {a}
repeat
  for each production rule X -> Y1 Y2 … Yn
    for (i=1 to n)
      if ((nullable[Y1], …, nullable[Y_{i-1}] are all true) or (i=1))
        F[X] = F[X] \/ F[Yi]
until F[] did not change
Example
Z ::= d | X Y Z
Y ::= c | \eps
X ::= Y | a
// initialization
F[Z, Y, X] = {}
// round 1
F[Z] = {d} F[Y] = {c} F[X] = {c, a}
// round 2
F[Z] = {d, c, a} F[Y] = {c} F[X] = {c, a}
// round 3…
Pitfalls
S ::= x A B
A ::= y | \eps
B ::= y | \eps
Try to calculate nullable[] and F[]
Try to derive “x y”
What’s the problem?
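Algorithm #3: Follow Set
// W[X]: terminals that may follow X
// all initialized to empty; EOF is then added to W of the start symbol
repeat
  for each production rule X -> Y1 Y2 … Yn
    for (i=1 to n)
      // nothing but nullable symbols after Yi: whatever follows X follows Yi
      if ((nullable[Y_{i+1}], …, nullable[Yn] are all true) or (i=n))
        W[Yi] = W[Yi] \/ W[X]
      for (j=i+1 to n)
        // only nullable symbols between Yi and Yj: Yj's first set follows Yi
        if ((nullable[Y_{i+1}], …, nullable[Y_{j-1}] are all true) or (i+1=j))
          W[Yi] = W[Yi] \/ F[Yj]
until W[] did not change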
Example
Z ::= d | X Y Z
Y ::= c | \eps
X ::= Y | a
// initialization
W[Z, Y, X] = {}
// round 1
W[Z] = {EOF} W[Y] = {d, c, a} W[X] = {c, d, a}
// round 2
W[Z] = {EOF} W[Y] = {d, c, a} W[X] = {d, c, a}
// finished!
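Algorithms #2 and #3 can be sketched together in C. The sketch below makes several assumptions that are not in the slides: terminals are lower-case letters, nonterminals are upper-case letters, a set of terminals is a bitmask, bit 26 stands for EOF, and the three computations share one fixed-point loop instead of running one after another. All names (`set`, `BIT`, `EOF_BIT`, `compute_sets`) are hypothetical.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Assumptions: 'A'..'Z' are nonterminals, 'a'..'z' are terminals, and a set
   of terminals is a bitmask (bit 0 = 'a'); bit 26 stands for EOF. */
typedef uint32_t set;
#define BIT(c)  ((set)1 << ((c) - 'a'))
#define EOF_BIT ((set)1 << 26)

struct production { char lhs; const char *rhs; };

/* Z ::= d | X Y Z    Y ::= c | \eps    X ::= Y | a */
static const struct production grammar[] = {
    {'Z', "d"}, {'Z', "XYZ"}, {'Y', "c"}, {'Y', ""}, {'X', "Y"}, {'X', "a"},
};
enum { NPROD = sizeof grammar / sizeof grammar[0] };

static int nullable[26];   /* all three tables are indexed by letter - 'A' */
static set F[26], W[26];

static int sym_nullable(char c) {
    return isupper((unsigned char)c) && nullable[c - 'A'];
}

/* First set of one symbol: F[a] = {a} for a terminal a. */
static set sym_first(char c) {
    return isupper((unsigned char)c) ? F[c - 'A'] : BIT(c);
}

/* Union src into *dst; return 1 if *dst grew. */
static int add(set *dst, set src) {
    set old = *dst;
    *dst |= src;
    return *dst != old;
}

void compute_sets(char start) {
    int changed;
    memset(nullable, 0, sizeof nullable);
    memset(F, 0, sizeof F);
    memset(W, 0, sizeof W);
    W[start - 'A'] = EOF_BIT;          /* EOF may follow the start symbol */
    do {
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int x = grammar[p].lhs - 'A';
            const char *y = grammar[p].rhs;
            int n = (int)strlen(y);
            int i, j;

            /* nullable and First: walk the RHS while the prefix is nullable */
            for (i = 0; i < n; i++) {
                changed |= add(&F[x], sym_first(y[i]));
                if (!sym_nullable(y[i])) break;
            }
            if (i == n && !nullable[x]) { nullable[x] = 1; changed = 1; }

            /* Follow: for each nonterminal Yi, look at what comes after it */
            for (i = 0; i < n; i++) {
                if (!isupper((unsigned char)y[i])) continue;
                for (j = i + 1; j < n; j++) {
                    changed |= add(&W[y[i] - 'A'], sym_first(y[j]));
                    if (!sym_nullable(y[j])) break;
                }
                /* everything after Yi is nullable: inherit W of the LHS */
                if (j == n) changed |= add(&W[y[i] - 'A'], W[x]);
            }
        }
    } while (changed);
}
```

Calling `compute_sets('Z')` leaves F[Z] = {d, c, a}, F[Y] = {c}, F[X] = {c, a}, W[Z] = {EOF} and W[Y] = W[X] = {d, c, a}, the same sets as in the examples above. Merging the three loops is safe here because every table only grows, so the combined iteration reaches the same fixed point.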
First and Follow Together
Z ::= d | X Y Z
Y ::= c | \eps
X ::= Y | a
// first
F[Z] = {d, c, a} F[Y] = {c} F[X] = {c, a}
// follow
W[Z] = {EOF} W[Y] = {d, c, a} W[X] = {d, c, a}
// first and follow combined: the tokens that may be the lookahead when parsing X
All[X] = F[X], if nullable[X]=false
All[X] = F[X] \/ W[X], otherwise
LL(1) • Grammars whose predictive parsing tables contain no duplicate entries are called LL(1) • left-to-right parse, leftmost derivation, 1-symbol lookahead • one pass, no backtracking, very efficient • precise error reports • For some non-LL(1) grammars, there are standard methods to transform them • example below
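A rough sketch of the "no duplicate entries" test for the grammar Z ::= d | X Y Z, Y ::= c | \eps, X ::= Y | a follows. It hard-codes the nullable, first and follow results from the earlier examples as input. The slides do not spell out how the table is filled, so the rule used here is an assumption: a production X -> s is entered under every token in the first set of s, and also under every token in W[X] when s is nullable. The names `predict` and `count_conflicts` are hypothetical.

```c
#include <ctype.h>
#include <stdint.h>

/* Assumptions: terminals are 'a'..'z', a set is a bitmask, bit 26 is EOF. */
typedef uint32_t set;
#define BIT(c)  ((set)1 << ((c) - 'a'))
#define EOF_BIT ((set)1 << 26)

struct production { char lhs; const char *rhs; };

/* Z ::= d | X Y Z    Y ::= c | \eps    X ::= Y | a */
static const struct production grammar[] = {
    {'Z', "d"}, {'Z', "XYZ"}, {'Y', "c"}, {'Y', ""}, {'X', "Y"}, {'X', "a"},
};
enum { NPROD = sizeof grammar / sizeof grammar[0] };

/* nullable, F and W copied from the earlier examples (inputs, not computed). */
static int nullable_of(char nt) { return nt == 'X' || nt == 'Y'; }

static set first_of(char nt) {
    switch (nt) {
    case 'Z': return BIT('d') | BIT('c') | BIT('a');
    case 'Y': return BIT('c');
    case 'X': return BIT('c') | BIT('a');
    default:  return 0;
    }
}

static set follow_of(char nt) {
    switch (nt) {
    case 'Z': return EOF_BIT;
    case 'Y':
    case 'X': return BIT('d') | BIT('c') | BIT('a');
    default:  return 0;
    }
}

/* Lookahead tokens that select production p: the first set of its RHS,
   plus the follow set of its LHS when the whole RHS is nullable. */
static set predict(const struct production *p) {
    set s = 0;
    for (const char *y = p->rhs; *y != '\0'; y++) {
        if (islower((unsigned char)*y)) return s | BIT(*y);
        s |= first_of(*y);
        if (!nullable_of(*y)) return s;
    }
    return s | follow_of(p->lhs);
}

/* Count lookahead tokens shared by two productions of the same nonterminal;
   each one is a table cell with a duplicate entry. */
int count_conflicts(void) {
    int conflicts = 0;
    for (int p = 0; p < NPROD; p++)
        for (int q = p + 1; q < NPROD; q++)
            if (grammar[p].lhs == grammar[q].lhs) {
                set both = predict(&grammar[p]) & predict(&grammar[q]);
                for (; both != 0; both &= both - 1)   /* clear lowest set bit */
                    conflicts++;
            }
    return conflicts;
}
```

For this grammar `count_conflicts()` returns 3: Z clashes on d, Y on c, and X on a. Under the table rule assumed above, the grammar is therefore not LL(1).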
Eliminating Left Recursion
X -> X a | c
// General transformation rule:
X -> X α1 | … | X αm | β1 | … | βn
// to
X -> β1 X’ | … | βn X’
X’ -> α1 X’ | … | αm X’ | \eps
// Applied to the example:
X -> c X’
X’ -> a X’ | \eps
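To see why the transformation matters: a parsing function written directly for X -> X a would call itself before consuming any token and never return. The transformed grammar X -> c X’, X’ -> a X’ | \eps can be coded directly. A minimal sketch, assuming single-character tokens read from a string (`accepts` and `parseXPrime` are hypothetical names):

```c
/* Assumption: tokens are single characters read from a string. */
static const char *in;   /* remaining input */

/* X' -> a X' | \eps */
static int parseXPrime(void) {
    if (*in == 'a') {        /* lookahead 'a' selects X' -> a X' */
        in++;
        return parseXPrime();
    }
    return 1;                /* any other lookahead selects X' -> \eps */
}

/* X -> c X' */
static int parseX(void) {
    if (*in != 'c') return 0;
    in++;
    return parseXPrime();
}

/* Returns 1 if s is a 'c' followed by zero or more 'a's, 0 otherwise. */
int accepts(const char *s) {
    in = s;
    return parseX() && *in == '\0';
}
```

Here `accepts("caaa")` returns 1 and `accepts("a")` returns 0. Every call consumes a token before recursing, so the recursion depth is bounded by the input length.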
Left Factoring
S -> if (E) then S else S; | if (E) then S;
// General rule:
X -> α X1 | … | α Xn
// to
X -> α X’
X’ -> X1 | … | Xn
// Applied to the example:
S -> if (E) then S S’
S’ -> else S; | ;
Eliminating Ambiguity • In programming language syntax, ambiguity often arises from missing operator precedence or associativity • does * have higher precedence than +? • are * and + left associative? • Ambiguous grammars are hard to use as language syntax
Ambiguous grammars • A grammar is ambiguous if there is a sentence with more than one parse tree • Example: 15 - 3 - 4
[Figure: two parse trees for 15 - 3 - 4, one grouping it as (15 - 3) - 4 and the other as 15 - (3 - 4)]
Associativity
exp -> exp - exp | num
// Derivation #1:
exp -> exp - exp -> 15 - exp -> 15 - exp - exp -> 15 - 3 - exp -> 15 - 3 - 4
// Rewritten grammar:
exp -> num X
X -> - num X | \eps
// Derivation #2:
exp -> 15 X -> 15 - 3 X -> 15 - 3 - 4 X -> 15 - 3 - 4
Precedence
exp -> exp - exp | exp * exp | num
// Rewritten as before:
exp -> num X
X -> - num X | * num X | \eps
// But the derivation of 3-4*5:
exp -> 3 X -> 3 - 4 X -> 3 - 4 * 5 X -> 3 - 4 * 5 \eps -> 3 - 4 * 5
// What’s the problem?
Precedence
exp -> exp - exp | exp * exp | num
// Rewritten with one level per precedence:
exp -> term X
X -> - term X | \eps
term -> num Y
Y -> * num Y | \eps
// The derivation of 3-4*5:
exp -> term X -> 3 Y X -> 3 \eps X -> 3 X -> 3 - term X -> 3 - 4 Y X -> 3 - 4 * 5 Y X -> 3 - 4 * 5 \eps X -> 3 - 4 * 5 X -> 3 - 4 * 5 \eps -> 3 - 4 * 5
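The layered grammar can be turned into a small evaluator to check that it yields the intended precedence and associativity. This is a sketch, not the course's code: it assumes well-formed input made of decimal numbers, '-', '*' and spaces, does no error reporting, and passes the value computed so far into `parseX` and `parseY` as a parameter (one possible semantic action). All function names are hypothetical.

```c
#include <ctype.h>

/* Assumption: the input is well formed; there is no error reporting. */
static const char *in;   /* remaining input */

static void skip_spaces(void) { while (*in == ' ') in++; }

/* num: a run of decimal digits, with optional spaces around it */
static long parseNum(void) {
    long v = 0;
    skip_spaces();
    while (isdigit((unsigned char)*in)) v = v * 10 + (*in++ - '0');
    skip_spaces();
    return v;
}

/* Y -> * num Y | \eps ; left is the product accumulated so far */
static long parseY(long left) {
    if (*in == '*') {
        in++;
        return parseY(left * parseNum());
    }
    return left;
}

/* term -> num Y */
static long parseTerm(void) { return parseY(parseNum()); }

/* X -> - term X | \eps ; left is the difference accumulated so far */
static long parseX(long left) {
    if (*in == '-') {
        in++;
        return parseX(left - parseTerm());
    }
    return left;
}

/* exp -> term X */
long eval(const char *s) {
    in = s;
    return parseX(parseTerm());
}
```

With this sketch `eval("3-4*5")` is -17, because the whole term 4*5 is parsed before the subtraction is applied, and `eval("15 - 3 - 4")` is 8, because the accumulated left operand makes '-' left associative.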
Parsing Function
// Given production: X -> s
// Parsing function has the form:
void parseX () { trans (s); }
// case analysis on possible shape of s:
// 1. s == a, for some terminal a
trans (a) = if (t==a) t = getToken (); else error (“syntax error: expecting: a”);
// 2. s == Y, for some nonterminal Y
trans (Y) = parseY ();
// 3. s == s1 s2 … sn
trans (s1 s2 … sn) = trans (s1); trans (s2); …; trans (sn);