Understanding Top-Down Parsing for Compiler Design

Lecture 6: Top-Down Parsing CS5363 Programming Languages and Compilers Xiaoyin Wang

Lecture Outline
• Implementation of parsers
• Two approaches
  • Top-down
  • Bottom-up
• Today: Top-Down
  • Easier to understand and program manually
• Then: Bottom-Up
  • More powerful and used by most parser generators

Parser Example
• Input: token stream
    IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON
• Output: abstract syntax tree
    ifStmt
      >= with children ID(x), ID(y)
      assign with children ID(y), INT(42)

Intro to Top-Down Parsing
• The parse tree is constructed
  • From the top
  • From left to right
• Terminals are seen in order of appearance in the token stream: t2 t5 t6 t8 t9
[Figure: parse tree with internal nodes 1, 3, 4, 7 and leaves t2, t5, t6, t8, t9]

Recursive Descent Parsing
• Consider the grammar
    E → T + E | T
    T → int | int * T | ( E )
• Token stream is: int5 * int2
• Start with top-level non-terminal E
• Try the rules for E in order

Recursive Descent Parsing. Example (Cont.)
• Try E0 → T1 + E2
• Then try a rule for T1: T1 → ( E3 ), giving ( E3 ) + E2
  • But ( does not match input token int5
• Try T1 → int, giving int + E2. Token matches.
  • But + after T1 does not match input token *
• Try T1 → int * T2
  • This will match but + after T1 will be unmatched
• The choices for T1 are now exhausted
  • Backtrack to the choice for E0

Recursive Descent Parsing. Example (Cont.)
• Try E0 → T1
• Follow the same steps as before for T1
  • And succeed with T1 → int * T2 and T2 → int
  • With the following parse tree
      E0
        T1
          int5 * T2
                   int2

A Recursive Descent Parser. Preliminaries
• Let TOKEN be the type of tokens
  • Special tokens INT, LPARA, RPARA, PLUS, TIMES
• Let the global next point to the next token

A Recursive Descent Parser (2)
• Define boolean functions that check the token string for a match of
  • A given token terminal
      bool term(TOKEN tok) { return *next++ == tok; }
  • A given production of S (the nth)
      bool Sn() { … }
  • Any production of S
      bool S() { … }
• These functions advance next

A Recursive Descent Parser (3)
• For production E → T + E
    bool E1() { return T() && term(PLUS) && E(); }
• For production E → T
    bool E2() { return T(); }
• For all productions of E (with backtracking)
    bool E() {
      TOKEN *save = next;
      return (next = save, E1()) || (next = save, E2());
    }

A Recursive Descent Parser (4)
• Functions for non-terminal T
    bool T1() { return term(INT); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(LPARA) && E() && term(RPARA); }
    bool T() {
      TOKEN *save = next;
      return (next = save, T1()) || (next = save, T2()) || (next = save, T3());
    }

Recursive Descent Parsing. Notes.
• To start the parser
  • Initialize next to point to first token
  • Invoke E()
• Notice how this simulates our previous example
• Easy to implement by hand
• But does not always work …
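The functions from the preceding slides can be assembled into one small C program. This is only a sketch under stated assumptions: the END sentinel token, the parse driver, and the version of term that advances only on success are additions that do not appear in the slides. T also tries T2 (int * T) before T1 (int), because with this simple backtracking scheme a call that has already returned true is never retried.

```c
#include <stdbool.h>

/* Token kinds from the slides, plus END as an assumed end-of-input sentinel. */
typedef enum { INT, LPARA, RPARA, PLUS, TIMES, END } TOKEN;

static const TOKEN *next; /* points to the next unread token */

static bool E(void);
static bool T(void);

/* Match one terminal. Assumption: unlike the slide version, this advances
   only on success, so next never moves past the END sentinel. */
static bool term(TOKEN tok) {
    if (*next != tok) return false;
    next++;
    return true;
}

static bool E1(void) { return T() && term(PLUS) && E(); }          /* E -> T + E */
static bool E2(void) { return T(); }                               /* E -> T */
static bool E(void) {
    const TOKEN *save = next;
    return (next = save, E1()) || (next = save, E2());
}

static bool T1(void) { return term(INT); }                         /* T -> int */
static bool T2(void) { return term(INT) && term(TIMES) && T(); }   /* T -> int * T */
static bool T3(void) { return term(LPARA) && E() && term(RPARA); } /* T -> ( E ) */
static bool T(void) {
    const TOKEN *save = next;
    /* Assumption: T2 is tried before T1 so that "int * ..." is not cut short. */
    return (next = save, T2()) || (next = save, T1()) || (next = save, T3());
}

/* Hypothetical driver: accept only if E matches and all input is consumed. */
bool parse(const TOKEN *tokens) {
    next = tokens;
    return E() && *next == END;
}
```

Calling parse on the token array { INT, TIMES, INT, END } follows the steps of the earlier example: E1 fails on the missing +, the parser backtracks, and E2 succeeds.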

When Recursive Descent Does Not Work
• Consider a production S → S a | a
    bool S1() { return S() && term(a); }
    bool S() { return S1(); }
• S() will get into an infinite loop: S() calls S1(), which calls S() again before any token is consumed
• A left-recursive grammar has a non-terminal S such that S →+ S α for some α
• Recursive descent does not work in such cases

Elimination of Left Recursion
• Consider the left-recursive grammar
    S → S α | β
• S generates all strings starting with a β and followed by any number of α
• Can rewrite using right-recursion
    S → β S'
    S' → α S' | ε

More Elimination of Left-Recursion
• In general
    S → S α1 | … | S αn | β1 | … | βm
• All strings derived from S start with one of β1, …, βm and continue with several instances of α1, …, αn
• Rewrite as
    S → β1 S' | … | βm S'
    S' → α1 S' | … | αn S' | ε
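As a small illustration of the rewrite, take the concrete instance S → S b | a, so that α is the character b and β is the character a (an illustrative choice, not from the slides). The rewritten grammar S → a S', S' → b S' | ε can then be parsed by recursive descent without looping. The names S_prime and accepts are hypothetical helpers for this sketch.

```c
#include <stdbool.h>

static const char *next; /* next unread input character */

/* S' -> b S' | epsilon */
static bool S_prime(void) {
    if (*next == 'b') {   /* choose S' -> b S' */
        next++;
        return S_prime();
    }
    return true;          /* choose S' -> epsilon, consuming nothing */
}

/* S -> a S' */
static bool S(void) {
    if (*next != 'a') return false;
    next++;
    return S_prime();
}

/* Hypothetical driver: accept only if S matches the whole string. */
bool accepts(const char *input) {
    next = input;
    return S() && *next == '\0';
}
```

Every recursive call now happens after a character has been consumed, so the recursion depth is bounded by the input length.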

General Left Recursion
• The grammar
    S → A α | δ
    A → S β
  is also left-recursive because S →+ S β α

Summary of Recursive Descent
• Simple and general parsing strategy
  • Left-recursion must be eliminated first
  • … but that can be done automatically
• Used in GCC, Chrome, Roslyn, …
• Limitation: backtracking
  • Thought to be too inefficient
  • In practice, backtracking is eliminated by restricting the grammar

Predictive Parsers
• Like recursive-descent but parser can “predict” which production to use
  • By looking at the next few tokens
  • No backtracking
• Predictive parsers accept LL(k) grammars
  • L means “left-to-right” scan of input
  • L means “leftmost derivation”
  • k means “predict based on k tokens of lookahead”
• In practice, LL(1) is usually used

LL(1) Languages
• In recursive-descent, for each non-terminal and input token there may be a choice of production
• LL(1) means that for each non-terminal and token there is only one production
• Can be specified via 2D tables
  • One dimension for current non-terminal to expand
  • One dimension for next token
  • A table entry contains one production

An example
• S -> *BC | abc
• B -> bC | *
• C -> aCc | c
• input: *baccc
  Step  Production  Sentential form  Input
  (1)   S -> *BC    *BC              *baccc
  (2)   B -> bC     *bCC             *baccc
  (3)   C -> aCc    *baCcC           *baccc
  (4)   C -> c      *baccC           *baccc
  (5)   C -> c      *baccc           *baccc

Left Factoring
• Recall the grammar
    E → T + E | T
    T → int | int * T | ( E )
• Hard to predict because
  • For T two productions start with int
  • For E it is not clear how to predict
• A grammar must be left-factored before use for predictive parsing

Checking whether we are fine
• After left-factoring, check each non-terminal:
    E → T X              Only 1 production
    X → + E | ε          Different 1st token
    T → ( E ) | int Y    Different 1st token
    Y → * T | ε          Different 1st token
• Then we can simply perform predictive parsing as in the example before.
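For this left-factored grammar, a predictive parser can be sketched in C as one function per non-terminal, each choosing its production from the single token at next. The END sentinel, the parse driver, and the non-advancing-on-failure term are assumptions of this sketch. As a simplification, X and Y pick their ε alternative whenever the lookahead is not + or *, leaving any resulting error to be detected by the caller.

```c
#include <stdbool.h>

/* Token kinds from the slides, plus END as an assumed end-of-input sentinel. */
typedef enum { INT, LPARA, RPARA, PLUS, TIMES, END } TOKEN;

static const TOKEN *next; /* points to the next unread token */

static bool E(void);
static bool T(void);

/* Match one terminal, advancing only on success. */
static bool term(TOKEN tok) {
    if (*next != tok) return false;
    next++;
    return true;
}

/* Y -> * T | epsilon: predicted from the lookahead token alone. */
static bool Y(void) {
    if (*next == TIMES) return term(TIMES) && T();
    return true; /* epsilon */
}

/* T -> ( E ) | int Y: the two alternatives start with different tokens. */
static bool T(void) {
    switch (*next) {
    case LPARA: return term(LPARA) && E() && term(RPARA);
    case INT:   return term(INT) && Y();
    default:    return false;
    }
}

/* X -> + E | epsilon */
static bool X(void) {
    if (*next == PLUS) return term(PLUS) && E();
    return true; /* epsilon */
}

/* E -> T X: only one production, nothing to predict. */
static bool E(void) { return T() && X(); }

/* Hypothetical driver: accept only if E matches and all input is consumed. */
bool parse(const TOKEN *tokens) {
    next = tokens;
    return E() && *next == END;
}
```

No function saves or restores next, which is the practical difference from the backtracking version.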

We may still have some issues: Non-Terminals
• Consider this grammar
    E → T X | X * X      Which one to choose?
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε
• Which production of E applies if the input starts with + … ? With ( … ? With int … ?

We may still have some issues: Epsilon
• Consider this grammar
    E → T X | X * X      Which one to choose?
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε
• What about if the input starts with * … ?

Constructing Parsing Tables
• If A → α, when should we select α?
• Given a token t where t can start a string derived from α
  • α →* t β
  • We say that t ∈ First(α)
• Given a token t, if α is ε (or derives ε) and t can follow an A
  • S →* β A t δ
  • We say t ∈ Follow(A)

Computing First Sets
Definition: First(X) = { t | X →* t α } ∪ { ε | X →* ε }
Algorithm sketch (see book for details):
1. for all terminals t do First(t) ← { t }
2. for each production X → ε do First(X) ← { ε }
3. if X → A1 … An α and ε ∈ First(Ai), 1 ≤ i ≤ n (including n = 0, where α starts the right-hand side) do
   • add First(α) − { ε } to First(X)
4. for each X → A1 … An s.t. ε ∈ First(Ai), 1 ≤ i ≤ n do
   • add ε to First(X)
5. repeat steps 3 & 4 until no First set can be grown

First Sets. Example
• Recall the grammar
    E → X T
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε
• First sets
    First( ( ) = { ( }        First( T ) = { int, ( }
    First( ) ) = { ) }        First( E ) = { +, int, ( }
    First( int ) = { int }    First( X ) = { +, ε }
    First( + ) = { + }        First( Y ) = { *, ε }
    First( * ) = { * }
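The algorithm sketch can be tried out on this grammar with a short C program. This is a minimal sketch, assuming sets are stored as bit masks with one bit per terminal plus one bit for ε. The symbol names, the Production struct, and compute_first are hypothetical and not taken from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Symbols: five terminals first, then four non-terminals. */
enum { T_INT, T_PLUS, T_TIMES, T_LP, T_RP, N_E, N_X, N_T, N_Y, NSYM };
#define NTERM 5               /* number of terminals */
#define EPS (1u << NTERM)     /* bit that stands for epsilon */

typedef struct { int lhs; int len; int rhs[3]; } Production;

/* E -> X T,  X -> + E | eps,  T -> ( E ) | int Y,  Y -> * T | eps */
static const Production G[] = {
    { N_E, 2, { N_X, N_T } },
    { N_X, 2, { T_PLUS, N_E } },
    { N_X, 0, { 0 } },
    { N_T, 3, { T_LP, N_E, T_RP } },
    { N_T, 2, { T_INT, N_Y } },
    { N_Y, 2, { T_TIMES, N_T } },
    { N_Y, 0, { 0 } },
};
#define NPROD (sizeof G / sizeof G[0])

unsigned first[NSYM]; /* first[s] is a bit set over terminals and EPS */

void compute_first(void) {
    for (int s = 0; s < NSYM; s++)
        first[s] = (s < NTERM) ? (1u << s) : 0u;  /* First(t) = { t } */

    bool changed = true;
    while (changed) {                             /* iterate to a fixed point */
        changed = false;
        for (size_t p = 0; p < NPROD; p++) {
            unsigned add = 0;
            int i = 0;
            /* Walk the right-hand side while all earlier symbols can derive eps. */
            for (; i < G[p].len; i++) {
                unsigned f = first[G[p].rhs[i]];
                add |= f & ~EPS;
                if (!(f & EPS)) break;
            }
            if (i == G[p].len) add |= EPS;        /* whole right-hand side is nullable */
            if ((first[G[p].lhs] | add) != first[G[p].lhs]) {
                first[G[p].lhs] |= add;
                changed = true;
            }
        }
    }
}
```

An empty right-hand side falls into the "whole right-hand side is nullable" case, so step 2 of the sketch needs no separate code.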

Computing Follow Sets
• Definition: Follow(X) = { t | S →* β X t δ }
• Intuition
  • If S is the start symbol then $ ∈ Follow(S)
  • If X → A B then First(B) − { ε } ⊆ Follow(A) and Follow(X) ⊆ Follow(B)
  • Also if B →* ε then Follow(X) ⊆ Follow(A)

Computing Follow Sets (Cont.)
Algorithm sketch:
• Follow(S) ← { $ }
• For each production A → α X β
  • add First(β) − { ε } to Follow(X)
• For each A → α X β where ε ∈ First(β)
  • add Follow(A) to Follow(X)
• repeat step(s) until no Follow set grows

Follow Sets. Example
• Recall the grammar
    E → T X
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε
• Follow sets
    Follow( + ) = { int, ( }         Follow( * ) = { int, ( }
    Follow( ( ) = { int, ( }         Follow( E ) = { ), $ }
    Follow( X ) = { $, ) }           Follow( T ) = { +, ), $ }
    Follow( ) ) = { +, ), $ }        Follow( Y ) = { +, ), $ }
    Follow( int ) = { *, +, ), $ }
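The Follow algorithm can be checked against these sets with a C sketch along the same lines. It assumes bit-mask sets and takes the First sets for this grammar as hand-computed constants (an assumption of the sketch, since First(E) here is { int, ( } for E → T X). The names Production, first, follow, and compute_follow are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum { T_INT, T_PLUS, T_TIMES, T_LP, T_RP, N_E, N_X, N_T, N_Y, NSYM };
#define NTERM 5
#define BIT(t) (1u << (t))
#define DOLLAR BIT(NTERM)       /* end-of-input marker $ */
#define EPS    BIT(NTERM + 1)   /* epsilon, used only in First sets */

typedef struct { int lhs; int len; int rhs[3]; } Production;

/* E -> T X,  X -> + E | eps,  T -> ( E ) | int Y,  Y -> * T | eps */
static const Production G[] = {
    { N_E, 2, { N_T, N_X } },
    { N_X, 2, { T_PLUS, N_E } },
    { N_X, 0, { 0 } },
    { N_T, 3, { T_LP, N_E, T_RP } },
    { N_T, 2, { T_INT, N_Y } },
    { N_Y, 2, { T_TIMES, N_T } },
    { N_Y, 0, { 0 } },
};
#define NPROD (sizeof G / sizeof G[0])

/* Assumption: First sets for this grammar, worked out by hand. */
static const unsigned first[NSYM] = {
    [T_INT] = BIT(T_INT), [T_PLUS] = BIT(T_PLUS), [T_TIMES] = BIT(T_TIMES),
    [T_LP] = BIT(T_LP), [T_RP] = BIT(T_RP),
    [N_E] = BIT(T_INT) | BIT(T_LP),
    [N_X] = BIT(T_PLUS) | EPS,
    [N_T] = BIT(T_INT) | BIT(T_LP),
    [N_Y] = BIT(T_TIMES) | EPS,
};

unsigned follow[NSYM]; /* follow[s] is a bit set over terminals and DOLLAR */

void compute_follow(void) {
    for (int s = 0; s < NSYM; s++) follow[s] = 0;
    follow[N_E] = DOLLAR;                          /* Follow(S) = { $ } */
    bool changed = true;
    while (changed) {                              /* repeat until nothing grows */
        changed = false;
        for (size_t p = 0; p < NPROD; p++) {
            for (int i = 0; i < G[p].len; i++) {   /* X = rhs[i], beta = rhs[i+1..] */
                unsigned add = 0;
                int j = i + 1;
                for (; j < G[p].len; j++) {
                    unsigned f = first[G[p].rhs[j]];
                    add |= f & ~EPS;               /* First(beta) - { eps } */
                    if (!(f & EPS)) break;
                }
                if (j == G[p].len)                 /* eps in First(beta) */
                    add |= follow[G[p].lhs];
                int x = G[p].rhs[i];
                if ((follow[x] | add) != follow[x]) {
                    follow[x] |= add;
                    changed = true;
                }
            }
        }
    }
}
```

Like the slide, the sketch computes Follow for terminals as well as non-terminals, so each entry can be compared directly with the sets listed above.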

Constructing LL(1) Parsing Tables
• Construct a parsing table T for CFG G
• For each production A → α in G do:
  • For each terminal t ∈ First(α) do
    • T[A, t] = α
  • If ε ∈ First(α), for each t ∈ Follow(A) do
    • T[A, t] = α
  • If ε ∈ First(α) and $ ∈ Follow(A) do
    • T[A, $] = α

LL(1) Parsing Table Example
• Left-factored grammar
    E → T X | X X
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε
• The LL(1) parsing table:
  [Table: LL(1) parsing table for this grammar]

LL(1) Parsing Tables. Errors
• Blank entries indicate error situations
  • Consider the [E,*] entry
  • “There is no way to derive a string starting with * from non-terminal E”
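A table-driven parser uses such a table together with an explicit stack. The following C sketch assumes the grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε, with table entries derived here from that grammar's First and Follow sets by the construction rules above. The table is therefore an assumption of this sketch, not a copy of the slide's table. The END token stands for $, and the numbering of productions, the rhs array, and parse are hypothetical.

```c
#include <stdbool.h>

/* Terminals (END plays the role of $) followed by non-terminals. */
enum { INT, PLUS, TIMES, LPARA, RPARA, END, N_E, N_X, N_T, N_Y };
#define NTERM 6
#define NNONTERM 4
#define STACK_MAX 256

/* Right-hand sides, terminated by -1. Number 0 is reserved for blank entries. */
static const int rhs[][4] = {
    { -1 },                     /* 0: unused, marks an error entry */
    { N_T, N_X, -1 },           /* 1: E -> T X   */
    { PLUS, N_E, -1 },          /* 2: X -> + E   */
    { -1 },                     /* 3: X -> eps   */
    { LPARA, N_E, RPARA, -1 },  /* 4: T -> ( E ) */
    { INT, N_Y, -1 },           /* 5: T -> int Y */
    { TIMES, N_T, -1 },         /* 6: Y -> * T   */
    { -1 },                     /* 7: Y -> eps   */
};

/* table[non-terminal][terminal] = production number, 0 = blank (error). */
static const int table[NNONTERM][NTERM] = {
    /*        int  +  *  (  )  $ */
    /* E */ {  1,  0, 0, 1, 0, 0 },
    /* X */ {  0,  2, 0, 0, 3, 3 },
    /* T */ {  5,  0, 0, 4, 0, 0 },
    /* Y */ {  0,  7, 6, 0, 7, 7 },
};

/* Input must be an array of terminals ending with END. */
bool parse(const int *next) {
    int stack[STACK_MAX];
    int top = 0;
    stack[top++] = END;   /* bottom-of-stack marker */
    stack[top++] = N_E;   /* start symbol */
    while (top > 0) {
        int sym = stack[--top];
        if (sym < NTERM) {                       /* terminal: must match the input */
            if (*next != sym) return false;
            if (sym == END) return true;         /* matched $: input fully consumed */
            next++;
        } else {                                 /* non-terminal: consult the table */
            int p = table[sym - N_E][*next];
            if (p == 0) return false;            /* blank entry: syntax error */
            int len = 0;
            while (rhs[p][len] != -1) len++;
            if (top + len > STACK_MAX) return false;  /* sketch-level overflow guard */
            for (int i = len - 1; i >= 0; i--)   /* push right-hand side in reverse */
                stack[top++] = rhs[p][i];
        }
    }
    return false;
}
```

With input starting with *, the first lookup is table[E][*], which is blank, so parse reports an error immediately. This matches the [E,*] remark above.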

Notes on LL(1) Parsing Tables
• If any entry is multiply defined then G is not LL(1)
  • If G is ambiguous
  • If G is left recursive
  • If G is not left-factored
  • And in other cases as well
• Most programming language grammars are not LL(1)
• There are tools that build LL(1) tables

More Examples on First Set and Follow Set
• Grammar
    A -> 'x' B | ε
    B -> 'y' | ε
    C -> 'z'
    E -> ABC
    F -> CBA
    S -> E | F
• First sets
    First (A) = {'x', ε}
    First (B) = {'y', ε}
    First (C) = {'z'}
    First (E) = First (A)/{ε} U First (B)/{ε} U First (C) = {'x', 'y', 'z'}
    First (F) = First (C) = {'z'}
    First (S) = First (E) U First (F) = {'x', 'y', 'z'}

Example: Follow Set
• S -> E | F
    Follow (S) = {$}, Follow (E) = Follow (F) = Follow (S) = {$}
• F -> CBA
    Follow (A) = Follow (F) = {$}
    Follow (B) = (First (A) U Follow (F)) / {ε} = {'x', $}
    Follow (C) = (First (A) U First (B) U Follow (F)) / {ε} = {'x', 'y', $}
• E -> ABC
    Follow (A) U= (First (B) U First (C)) / {ε}, giving {'y', 'z', $}
    Follow (B) U= First (C) / {ε}, giving {'x', 'z', $}
    Follow (C) U= Follow (E), giving {'x', 'y', $}

Example: Follow Set (Cont.)
• C -> 'z': nothing to add
• B -> 'y' | ε: nothing to add
• A -> 'x' B | ε
    Follow (B) U= Follow (A), giving {'x', 'y', 'z', $}

Summary
• Grammar: A -> 'x' B | ε   B -> 'y' | ε   C -> 'z'   E -> ABC   F -> CBA   S -> E | F
• First sets
    First (A) = {'x', ε}
    First (B) = {'y', ε}
    First (C) = {'z'}
    First (E) = {'x', 'y', 'z'}
    First (F) = {'z'}
    First (S) = {'x', 'y', 'z'}
• Follow sets
    Follow (A) = {'y', 'z', $}
    Follow (B) = {'x', 'y', 'z', $}
    Follow (C) = {'x', 'y', $}
    Follow (E) = {$}
    Follow (F) = {$}
    Follow (S) = {$}

Table Construction
• Grammar: A -> 'x' B | ε   B -> 'y' | ε   C -> 'z'   E -> ABC   F -> CBA   S -> E | F
• S -> E: First (E) = {'x', 'y', 'z'} and ε not in First (E); enter S -> E in T[S, 'x'], T[S, 'y'], T[S, 'z']
• S -> F: First (F) = {'z'} and ε not in First (F); enter S -> F in T[S, 'z']
• F -> CBA: First (CBA) = {'z'} and ε not in First (CBA); enter F -> CBA in T[F, 'z']
• E -> ABC: First (ABC) = {'x', 'y', 'z'} and ε not in First (ABC); enter E -> ABC in T[E, 'x'], T[E, 'y'], T[E, 'z']
• C -> 'z': First ('z') = {'z'} and ε not in First ('z'); enter C -> 'z' in T[C, 'z']
• B -> 'y': First ('y') = {'y'}; enter B -> 'y' in T[B, 'y']
• B -> ε: ε in First (B), so use Follow (B) = {'x', 'y', 'z', $}; enter B -> ε in T[B, 'x'], T[B, 'y'], T[B, 'z'], T[B, $]
• A -> 'x' B: First ('x' B) = {'x'}; enter A -> 'x' B in T[A, 'x']
• A -> ε: ε in First (A), so use Follow (A) = {'y', 'z', $}; enter A -> ε in T[A, 'y'], T[A, 'z'], T[A, $]
• Not LL(1), conflicts: T[S, 'z'] holds both S -> E and S -> F, and T[B, 'y'] holds both B -> 'y' and B -> ε

The process of Recursive Descent Parsing
• Consider the grammar
    E → T X
    X → + E | ε
    T → ( E ) | int Y
    Y → * T | ε
• Token stream is: int * int
