Parsing

Parsing Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn

Syntax Tree • A systematic way to put a program into memory • data type definition + a bunch of functions • the programmer calls them explicitly • tedious and error-prone • But we write programs in ASCII form, so how can we construct the tree automatically? • A technique called (automatic) parsing • A program clever enough to do this automatically

Roadmap • Pipeline: stream of characters -> Lexer -> stream of tokens -> Parser -> abstract syntax -> other parts • Lexer: eats an ASCII sequence, emits a token sequence • Parser: eats a token sequence, emits abstract syntax trees • other parts: later in this course

Parsing • Take as input a sequence of terminals, and construct syntax trees automatically • Problem: how do we know whether a sequence of input tokens is valid?

Example Am I valid? y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Oops, mismatch! S -> x A Nonterminals: S A B C terminals: ID_X ID_Y ID_U ID_V ID_T ID_M ID_W ID_Z

Example Am I valid? y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Another try! S -> y B

Example Am I valid? y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Aha, Great! S -> y B

Example Derive “m z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Recursion S -> y B

Example Derive “m z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z First try S -> y B B -> t C

Example Derive “m z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Mismatch S -> y B B -> t C

Example Derive “m z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Second try S -> y B B -> m C

Example Derive “z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Recursion S -> y B B -> m C

Example Derive “z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Mismatch S -> y B B -> m C C -> w

Example Derive “z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Second try S -> y B B -> m C C -> z

Example Derive “z” y m z S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z Matched. Success! S -> y B B -> m C C -> z Parse tree: S (y, B (m, C (z)))

Recursive Descent Algorithm • This process can be described by a recursive descent algorithm • For each nonterminal, write a (recursive) parsing function • every RHS becomes a case in a big switch • the function may perform semantic actions besides parsing • later in this course

Recursive Descent Algorithm // The function interface looks like: void parseS (); void parseA (); void parseB (); void parseC (); S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z

Recursive Descent Algorithm struct token t; // recall the token module t = getToken (); // and the lexer module void parseS () { switch (t) { case ID_X: t = getToken (); parseA (); return; case ID_Y: t = getToken (); parseB (); return; S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z

Recursive Descent Algorithm default: // not ID_X or ID_Y error ("want 'x' or 'y'"); return; } } // Leave the algorithm for // parseA (), parseB () and // parseC () to you. S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z
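The remaining parsing functions follow the same pattern as parseS. Below is a minimal sketch of the whole parser, under some assumptions that are not in the slides: each token is a single character taken from a string (a stand-in for the lexer's getToken), the end of input is the terminating '\0', and errors are recorded in a hypothetical `ok` flag. The wrapper `accepts` is also made up for illustration.

```c
#include <stdio.h>

/* Assumption: one character per token; '\0' marks the end of input.
   The course's real token and lexer modules are not shown here. */
static const char *input;   /* remaining input */
static char t;              /* current token */
static int ok;              /* cleared on the first syntax error */

static void getToken(void) { t = *input; if (t != '\0') input++; }

static void error(const char *msg) {
    if (ok) fprintf(stderr, "syntax error: %s\n", msg);
    ok = 0;
}

/* C ::= w | z */
static void parseC(void) {
    switch (t) {
    case 'w': case 'z': getToken(); return;
    default: error("want 'w' or 'z'"); return;
    }
}

/* A ::= u C | v C */
static void parseA(void) {
    switch (t) {
    case 'u': case 'v': getToken(); parseC(); return;
    default: error("want 'u' or 'v'"); return;
    }
}

/* B ::= t C | m C */
static void parseB(void) {
    switch (t) {
    case 't': case 'm': getToken(); parseC(); return;
    default: error("want 't' or 'm'"); return;
    }
}

/* S ::= x A | y B */
static void parseS(void) {
    switch (t) {
    case 'x': getToken(); parseA(); return;
    case 'y': getToken(); parseB(); return;
    default: error("want 'x' or 'y'"); return;
    }
}

/* Returns 1 if the whole string is a sentence of the grammar, else 0. */
int accepts(const char *s) {
    input = s;
    ok = 1;
    getToken();
    parseS();
    if (ok && t != '\0') error("trailing input");
    return ok;
}
```

With this sketch, `accepts("ymz")` returns 1 (the derivation walked through above), while `accepts("xmz")` returns 0 because parseA sees 'm' where it wants 'u' or 'v'.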

Summary so Far • Recursive descent parsing: • a form of top-down parsing, also called predictive parsing when it needs no backtracking • simple and efficient • can be coded by hand quickly • see problem 2 in lab #4 • But the constraint is that not every grammar can be parsed by a recursive descent parser • Example below

Example Derive me x m z S ::= x A | x B A ::= m C | m C B ::= m z | m z C ::= w | y S -> x A or S -> x B

Recursive Descent Algorithm? struct token t; // recall the token module t = getToken (); // and the lexer module void parseS () { switch (t) { case ID_X: t = getToken (); ?????? // what code here? return; default: error (“….”); return;} } S ::= x A | x B A ::= m C | m C B ::= m z | m z C ::= w | y

Moral • We need a notion of which terminals a production’s RHS s can start with • s \in (T \/ N)* • We call it the first (terminal) set, written as F[s], for a string s \in (T \/ N)* • Next, we compute the first set F for single terminals and nonterminals, then generalize to strings

First Set F S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z F [S] = {x, y} F [A] = {u, v} F [B] = {t, m} F [C] = {w, z} // And generalize to string F [x A] = {x} F [y B] = {y} F [u C] = {u} F [v C] = {v} …

First Set F S ::= x A | y B A ::= u C | v C B ::= t C | m C C ::= w | z F [S] = {x, y} F [A] = {u, v} F [B] = {t, m} F [C] = {w, z} // And generalize to string F [x A] = {x} F [y B] = {y} F [u C] = {u} F [v C] = {v} … // So why can this grammar be parsed?

Parsing Table

Example S ::= x A | x B A ::= m C | m C B ::= m z | m z C ::= w | y F [S] = ? F [A] = ? F [B] = ? F [C] = ? Predictive parsing?

Empty Production Rules Z ::= d | X Y Z Y ::= c | \eps X ::= Y | a F [Z] = ? F [Y] = ? F [X] = ? Predictive parsing?

Algorithm #1: nullable
// nullable[X]: whether X derives \eps or not
// all initialized to false
repeat
  for each production rule X -> Y1 Y2 … Yn
    if ((nullable[Y1, …, Yn]=true) or (n=0))
      nullable[X] = true
until nullable[] did not change

Example Z ::= d | X Y Z Y ::= c | \eps X ::= Y | a // initialization nullable[Z, Y, X] = false // round 1 nullable[Z] = false nullable[Y] = true nullable[X] = true // round 2 nullable[Z] = false nullable[Y] = true nullable[X] = true // finished!
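The fixed-point loop of Algorithm #1 can be sketched in C as follows. The grammar encoding is an assumption made for illustration: nonterminals are upper-case letters, terminals are lower-case letters, and an empty right-hand side stands for \eps. The names `rule`, `grammar` and `compute_nullable` are hypothetical.

```c
#include <string.h>

/* Assumption: upper-case = nonterminal, lower-case = terminal,
   empty right-hand side = \eps. */
struct rule { char lhs; const char *rhs; };

static const struct rule grammar[] = {
    {'Z', "d"}, {'Z', "XYZ"},
    {'Y', "c"}, {'Y', ""},
    {'X', "Y"}, {'X', "a"},
};
enum { NRULES = sizeof grammar / sizeof grammar[0] };

int nullable[128];   /* indexed by symbol; terminals always stay 0 */

void compute_nullable(void) {
    int changed;
    memset(nullable, 0, sizeof nullable);
    do {
        changed = 0;
        for (int r = 0; r < NRULES; r++) {
            unsigned char x = (unsigned char)grammar[r].lhs;
            const char *p = grammar[r].rhs;
            /* Skip over the nullable prefix of Y1 ... Yn. */
            while (*p != '\0' && nullable[(unsigned char)*p]) p++;
            /* Reached the end: every Yi is nullable, or n = 0. */
            if (*p == '\0' && !nullable[x]) {
                nullable[x] = 1;
                changed = 1;
            }
        }
    } while (changed);
}
```

After `compute_nullable()`, `nullable['Y']` and `nullable['X']` are 1 and `nullable['Z']` is 0, matching the rounds shown above.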

Algorithm #2: First Set
// F[X]: first terminals X could derive
// F[a] = {a} for every terminal a; all others initialized to empty
repeat
  for each production rule X -> Y1 Y2 … Yn
    for (i=1 to n)
      if (nullable[Y1, …, Y_{i-1}]=true or (i=1))
        F[X] = F[X] \/ F[Yi]
until F[] did not change

Example Z ::= d | X Y Z Y ::= c | \eps X ::= Y | a // initialization F[Z, Y, X] = {} // round 1 F[Z] = {d} F[Y] = {c} F[X] = {c, a} // round 2 F[Z] = {d, c, a} F[Y] = {c} F[X] = {c, a} // round 3…
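A possible C rendering of Algorithm #2 is sketched below. Several choices here are assumptions rather than part of the slides: sets of terminals are bitmasks over 'a'..'z', symbols are single letters (upper-case for nonterminals), and nullable[] is computed inside the same loop instead of in a separate pass, which reaches the same fixed point.

```c
#include <string.h>

/* Assumption: upper-case = nonterminal, lower-case = terminal,
   empty right-hand side = \eps. */
struct rule { char lhs; const char *rhs; };

static const struct rule grammar[] = {
    {'Z', "d"}, {'Z', "XYZ"},
    {'Y', "c"}, {'Y', ""},
    {'X', "Y"}, {'X', "a"},
};
enum { NRULES = sizeof grammar / sizeof grammar[0] };

static int nullable[128];
unsigned first[128];   /* bit i set means terminal 'a'+i is in the set */

static unsigned bit(char c) { return 1u << (c - 'a'); }

void compute_first(void) {
    int changed;
    memset(nullable, 0, sizeof nullable);
    memset(first, 0, sizeof first);
    /* Base case: F[a] = {a} for every terminal a. */
    for (char c = 'a'; c <= 'z'; c++) first[(unsigned char)c] = bit(c);
    do {
        changed = 0;
        for (int r = 0; r < NRULES; r++) {
            unsigned char x = (unsigned char)grammar[r].lhs;
            const char *p = grammar[r].rhs;
            /* Visit Yi only while Y1 ... Y_{i-1} are all nullable. */
            for (; *p != '\0'; p++) {
                unsigned char y = (unsigned char)*p;
                unsigned merged = first[x] | first[y];
                if (merged != first[x]) { first[x] = merged; changed = 1; }
                if (!nullable[y]) break;
            }
            /* Loop ran off the end: the whole RHS is nullable. */
            if (*p == '\0' && !nullable[x]) { nullable[x] = 1; changed = 1; }
        }
    } while (changed);
}
```

For the example grammar this yields F[Z] = {d, c, a}, F[Y] = {c} and F[X] = {c, a}, that is, `first['Z'] == (bit('d') | bit('c') | bit('a'))` and so on.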

Pitfalls S ::= x A B A ::= y | \eps B ::= y | \eps Try to calculate nullable[] and F[] Try to derive “x y” What’s the problem?

Algorithm #3: Follow Set
// W[X]: terminals that may follow X
// all initialized to empty, except W[S] = {EOF} for the start symbol S
repeat
  for each production rule X -> Y1 Y2 … Yn
    for (i=1 to n)
      if (nullable[Y_{i+1}, …, Yn]=true or (i=n))
        W[Yi] = W[Yi] \/ W[X]
      for (j=i+1 to n)
        if (nullable[Y_{i+1}, …, Y_{j-1}]=true or (i+1=j))
          W[Yi] = W[Yi] \/ F[Yj]
until W[] did not change

Example Z ::= d | X Y Z Y ::= c | \eps X ::= Y | a // initialization W[Z, Y, X] = {} // round 1 W[Z] = {EOF} W[Y] = {d, c, a} W[X] = {c, d, a} // round 2 W[Z] = {EOF} W[Y] = {d, c, a} W[X] = {d, c, a} // finished!
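The follow-set computation can be sketched in C along the same lines. As before, the encoding is an assumption for illustration: single-letter symbols, bitmask sets over 'a'..'z', and a hypothetical `EOF_BIT` for the end-of-input marker. nullable[] and F[] are recomputed in the same loop so that the block stands alone.

```c
#include <string.h>

/* Assumption: upper-case = nonterminal, lower-case = terminal,
   empty right-hand side = \eps. */
struct rule { char lhs; const char *rhs; };

static const struct rule grammar[] = {
    {'Z', "d"}, {'Z', "XYZ"},
    {'Y', "c"}, {'Y', ""},
    {'X', "Y"}, {'X', "a"},
};
enum { NRULES = sizeof grammar / sizeof grammar[0] };

#define EOF_BIT (1u << 26)   /* bits 0..25 stand for 'a'..'z' */

static int nullable[128];
static unsigned first[128];
unsigned follow[128];

static unsigned bit(char c) { return 1u << (c - 'a'); }

/* Union elems into *set; report whether *set grew. */
static int add(unsigned *set, unsigned elems) {
    unsigned merged = *set | elems;
    int grew = (merged != *set);
    *set = merged;
    return grew;
}

void compute_follow(char start) {
    int changed;
    memset(nullable, 0, sizeof nullable);
    memset(first, 0, sizeof first);
    memset(follow, 0, sizeof follow);
    for (char c = 'a'; c <= 'z'; c++) first[(unsigned char)c] = bit(c);
    follow[(unsigned char)start] = EOF_BIT;
    do {
        changed = 0;
        for (int r = 0; r < NRULES; r++) {
            const unsigned char *y = (const unsigned char *)grammar[r].rhs;
            unsigned char x = (unsigned char)grammar[r].lhs;
            int n = (int)strlen(grammar[r].rhs), i, j;
            /* nullable and first sets, as in Algorithms #1 and #2 */
            for (i = 0; i < n; i++) {
                changed |= add(&first[x], first[y[i]]);
                if (!nullable[y[i]]) break;
            }
            if (i == n && !nullable[x]) { nullable[x] = 1; changed = 1; }
            /* follow sets */
            for (i = 0; i < n; i++) {
                for (j = i + 1; j < n; j++) {
                    changed |= add(&follow[y[i]], first[y[j]]);
                    if (!nullable[y[j]]) break;
                }
                /* Everything after Yi is nullable (or i = n). */
                if (j == n) changed |= add(&follow[y[i]], follow[x]);
            }
        }
    } while (changed);
}
```

Calling `compute_follow('Z')` on the example grammar gives W[Z] = {EOF} and W[Y] = W[X] = {d, c, a}.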

First and Follow Together Z ::= d | X Y Z Y ::= c | \eps X ::= Y | a // first F[Z] = {d, c, a} F[Y] = {c} F[X] = {c, a} // follow W[Z] = {EOF} W[Y] = {d, c, a} W[X] = {d, c, a} // first and follow combined: All[X] = F[X], if nullable[X]=false; F[X] \/ W[X], otherwise.

Parsing Table

LL(1) • Grammars whose predictive parsing tables contain no duplicate entries are called LL(1) • left-to-right parse, leftmost derivation, 1-symbol lookahead • one pass, no backtracking, very efficient • precise error reports • For some non-LL(1) grammars, there are standard methods to transform them • example below

Eliminating Left Recursion • Example: X -> X a | c becomes X -> c X’ X’ -> a X’ | \eps // General transforming rules: X -> X α1 | … | X αm | β1 | … | βn // to X -> β1 X’ | … | βn X’ X’ -> α1 X’ | … | αm X’ | \eps
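To see why the transformed grammar suits recursive descent, here is a small sketch of a parser for X -> c X’ and X’ -> a X’ | \eps. It assumes single-character tokens read from a string; `accepts_x` and `parseXPrime` are hypothetical names. A function written directly from X -> X a would call itself before consuming any token and never terminate.

```c
/* Assumption: one character per token; '\0' marks the end of input. */
static const char *input;   /* remaining input */
static char t;              /* current token */

static void getToken(void) { t = *input; if (t != '\0') input++; }

/* X' ::= a X' | \eps */
static int parseXPrime(void) {
    if (t == 'a') {
        getToken();
        return parseXPrime();
    }
    return 1;   /* \eps: consume nothing */
}

/* X ::= c X' */
static int parseX(void) {
    if (t != 'c') return 0;
    getToken();
    return parseXPrime();
}

/* Returns 1 if s is a 'c' followed by zero or more 'a's, else 0. */
int accepts_x(const char *s) {
    input = s;
    getToken();
    return parseX() && t == '\0';
}
```

Here `accepts_x("caaa")` returns 1 and `accepts_x("ac")` returns 0.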

Left Factoring S -> if (E) then S else S; | if (E) then S; // General rules: X -> α X1 | … | α Xn // to X -> α X’ X’ -> X1 | … | Xn // Applied to the example: S -> if (E) then S S’ S’ -> else S; | ;

Eliminating Ambiguity • In programming language syntax, ambiguity often arises from missing operator precedence or associativity • * higher precedence than +? • * and + are left associative? • Ambiguous grammars are hard to use as language syntax

Ambiguous grammars • A grammar is ambiguous if there is a sentence with more than one parse tree • [Figure: two parse trees for 15 - 3 - 4, one grouping it as (15 - 3) - 4, the other as 15 - (3 - 4)]

Associativity exp -> exp - exp | num // Derivation #1 (original grammar): exp -> exp - exp -> 15 - exp -> 15 - exp - exp -> 15 - 3 - exp -> 15 - 3 - 4 // Rewritten grammar: exp -> num X X -> - num X | \eps // Derivation #2 (rewritten grammar): exp -> 15 X -> 15 - 3 X -> 15 - 3 - 4 X -> 15 - 3 - 4

Precedence exp -> exp - exp | exp * exp | num // Rewritten as before: exp -> num X X -> - num X | * num X | \eps // But the derivation of 3-4*5: exp -> 3 X -> 3 - 4 X -> 3 - 4 * 5 X -> 3 - 4 * 5 // What’s the problem?

Precedence exp -> exp - exp | exp * exp | num // The derivation: 3-4*5 exp -> term X -> 3 Y X -> 3 \eps X -> 3 X -> 3 - term X -> 3 - 4 Y X -> 3 - 4 * 5 Y X -> 3 - 4 * 5 \eps X -> 3 - 4 * 5 X -> 3 - 4 * 5 \eps -> 3 - 4 * 5 exp -> term X X -> - term X | \eps term -> num Y Y -> * num Y | \eps
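The layered grammar exp -> term X, X -> - term X | \eps, term -> num Y, Y -> * num Y | \eps maps directly onto four parsing functions. The sketch below also computes the value of the expression, which is an assumption added for illustration (the slides defer semantic actions to later). Each function for X and Y takes the value accumulated so far, which is what makes - and * left associative. Input handling (a plain string, spaces skipped, non-negative decimal numbers) and the name `eval` are hypothetical.

```c
#include <ctype.h>

static const char *input;   /* remaining input */
static int syntax_ok;       /* cleared on a syntax error */

static void skipSpaces(void) { while (*input == ' ') input++; }

/* num: one or more decimal digits */
static long parseNum(void) {
    long v = 0;
    skipSpaces();
    if (!isdigit((unsigned char)*input)) { syntax_ok = 0; return 0; }
    while (isdigit((unsigned char)*input)) v = v * 10 + (*input++ - '0');
    return v;
}

/* Y ::= * num Y | \eps   (acc is the product so far) */
static long parseY(long acc) {
    skipSpaces();
    if (*input == '*') { input++; return parseY(acc * parseNum()); }
    return acc;
}

/* term ::= num Y */
static long parseTerm(void) { return parseY(parseNum()); }

/* X ::= - term X | \eps   (acc is the difference so far) */
static long parseX(long acc) {
    skipSpaces();
    if (*input == '-') { input++; return parseX(acc - parseTerm()); }
    return acc;
}

/* exp ::= term X; *ok is set to 0 on a syntax error or trailing input. */
long eval(const char *s, int *ok) {
    long v;
    input = s;
    syntax_ok = 1;
    v = parseX(parseTerm());
    skipSpaces();
    if (*input != '\0') syntax_ok = 0;
    *ok = syntax_ok;
    return v;
}
```

With this sketch, `eval("3-4*5", &ok)` returns -17 because 4 * 5 is grouped inside a term first, and `eval("15 - 3 - 4", &ok)` returns 8, the left-associative reading.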

Parsing Function // Given production: X -> s // Parsing function has the form: void parseX () { trans (s); } // case analysis on possible shape of s: // 1. s == a; for some terminal a trans (a) = if (t==a) t = getToken (); else error ("syntax error: expecting: a");

Parsing Function // Given production: X -> s void parseX () { trans (s); } // case analysis on possible shape of s: // 2. s == Y; for some nonterminal Y trans (Y) = parseY ();

Parsing Function // Given production: X -> s void parseX () { trans (s); } // case analysis on possible shape of s: // 3. s == s1 s2 … sn trans (s1 s2 … sn) = trans(s1) trans(s2) … trans(sn)
