930 likes | 992 Views
Lecture 6: Top-Down Parsing. CS5363 Programming Languages and Compilers Xiaoyin Wang. Lecture Outline. Implementation of parsers Two approaches Top-down Bottom-up Today: Top-Down Easier to understand and program manually Then: Bottom-Up More powerful and used by most parser generators.
E N D
Lecture 6: Top-Down Parsing CS5363 Programming Languages and Compilers Xiaoyin Wang
Lecture Outline • Implementation of parsers • Two approaches • Top-down • Bottom-up • Today: Top-Down • Easier to understand and program manually • Then: Bottom-Up • More powerful and used by most parser generators
Token Stream Input Abstract Syntax Tree Parser Example IF LPAREN ID(x) ifStmt RPAREN GEQ ID(y) >= assign ID(y) BECOMES INT(42) SCOLON ID(x) ID(y) ID(y) INT(42)
1 t2 3 t9 4 7 t6 t5 t8 Intro to Top-Down Parsing • The parse tree is constructed • From the top • From left to right • Terminals are seen in order of appearance in the token stream: t2 t5 t6 t8 t9
Recursive Descent Parsing • Consider the grammar E T + E | T T int | int * T | ( E ) • Token stream is: int5 * int2 • Start with top-level non-terminal E • Try the rules for E in order
Recursive Descent Parsing. Example (Cont.) • Try E0 T1 + E2 => (E3)+ E2 • Then try a rule for T1 ( E3 ) • But ( does not match input token int5 • TryT1 int . Token matches. => int + E2 • But + after T1 does not match input token * • Try T1 int * T2 • This will match but + after T1 will be unmatched • Has exhausted the choices for T1 • Backtrack to choice for E0
E0 T1 int5 T2 * int2 Recursive Descent Parsing. Example (Cont.) • Try E0 T1 • Follow same steps as before for T1 • And succeed with T1 int * T2 and T2 int • Withthe following parse tree
A Recursive Descent Parser. Preliminaries • Let TOKEN be the type of tokens • Special tokens INT, LPARA, RPARA, PLUS, TIMES • Let the global next point to the next token
A Recursive Descent Parser (2) • Define boolean functions that check the token string for a match of • A given token terminal bool term(TOKEN tok) { return *next++ == tok; } • A given production of S (the nth) bool Sn() { … } • Any production of S: bool S() { … } • These functions advance next
A Recursive Descent Parser (3) • For production E T + E bool E1() { return T() && term(PLUS) && E(); } • For production E T bool E2() { return T(); } • For all productions of E (with backtracking) bool E() { TOKEN *save = next; return (next = save, E1()) || (next = save, E2()); }
A Recursive Descent Parser (4) • Functions for non-terminal T bool T1() { return term(INT); } bool T2() { return term(INT) && term(TIMES) && T(); } bool T3() { return term(LPARA) && E() && term(RPARA); } bool T() { TOKEN *save = next; return (next = save, T1()) || (next = save, T2()) || (next = save, T3()); }
Recursive Descent Parsing. Notes. • To start the parser • Initialize next to point to first token • Invoke E() • Notice how this simulates our previous example • Easy to implement by hand • But does not always work …
When Recursive Descent Does Not Work • Consider a production S S a | a bool S1() { return S() && term(a); } bool S() { return S1(); } • S() will get into an infinite loop • A left-recursive grammar has a non-terminal S S + S for some • Recursive descent does not work in such cases
Elimination of Left Recursion • Consider the left-recursive grammar S S | • S generates all strings starting with a and followed by a number of • Can rewrite using right-recursion S S’ S’ S’ |
More Elimination of Left-Recursion • In general S S 1 | … | S n | 1 | … | m • All strings derived from S start with one of 1,…,mand continue with several instances of1,…,n • Rewrite as S 1 S’ | … | m S’ S’ 1 S’ | … | n S’ |
General Left Recursion • The grammar S A | A S is also left-recursive because S+ S
Summary of Recursive Descent • Simple and general parsing strategy • Left-recursion must be eliminated first • … but that can be done automatically • Used in gcc, chrome, roslyn, … • Limitation: backtracking • Thought to be too inefficient • In practice, backtracking is eliminated by restricting the grammar
Predictive Parsers • Like recursive-descent but parser can “predict” which production to use • By looking at the next few tokens • No backtracking • Predictive parsers accept LL(k) grammars • L means “left-to-right” scan of input • L means “leftmost derivation” • k means “predict based on k tokens of lookahead” • In practice, LL(1) is usually used
LL(1) Languages • In recursive-descent, for each non-terminal and input token there may be a choice of production • LL(1) means that for each non-terminal and token there is only one production • Can be specified via 2D tables • One dimension for current non-terminal to expand • One dimension for next token • A table entry contains one production
An example • S -> *BC | abc • B -> bC | * • C -> aCc | c • input: *baccc Step (1): S -> *BC Step (2): B -> bC Step (3): C -> aCc Step (4): C -> c Step (5): C -> c *BC (*baccc) *bCC (*baccc) *baCcC (*baccc) *baccC (*baccc) *baccc (*baccc)
Left Factoring • Recall the grammar E T + E | T T int | int * T | ( E ) • Hard to predict because • For T two productions start with int • For E it is not clear how to predict • A grammar must be left-factored before use for predictive parsing
Left-Factoring Example • Recall the grammar E T + E | T T int | int * T | ( E ) • Factor out common prefixes of productions E T X X + E | T ( E ) | int Y Y * T |
Checking whether we are fine • E T X Only 1 production • X + E | Different 1st token • T ( E ) | int Y Different 1st token • Y * T | Different 1st token Then we can simply perform predictive parsing as in the example before.
We may still have some issues: Non-Terminals • Consider this grammar E T X | X * X Which one to choose? X + E | T ( E ) | int Y Y * T | if +... ? if (... ? if int ... ?
We may still have some issues: Epsilon • Consider this grammar E T X | X * X Which one to choose? X + E | T ( E ) | int Y Y * T | What about if *…. ?
Constructing Parsing Tables • If A , whether we should select ? • Given a token t where t can start a string derived from • * t • We say that t First() • Given a token t if is and t can follow an A • S * A t • We say t Follow(A)
Computing First Sets Definition: First(X) = { t | X * t} { | X * } Algorithm sketch (see book for details): • for all terminals t do First(t) { t } • for each production X do First(X) { } • if X A1 … An and First(Ai), 1 i n do • add First() - {} to First(X) • for each X A1 … An s.t. First(Ai), 1 i n do • add to First(X) • repeat steps 3 & 4 until no First set can be grown
First Sets. Example • Recall the grammar E X T X + E | T ( E ) | int Y Y * T | • First sets First( ( ) = { ( } First( T ) = {int, ( } First( ) ) = { ) } First( E ) = {+, int, (} First( int) = { int } First( X ) = {+, } First( + ) = { + } First( Y ) = {*, } First( * ) = { * }
Computing Follow Sets • Definition: Follow(X) = { t | S * X t } • Intuition • If S is the start symbol then $ Follow(S) • If X A B then First(B) Follow(A) and Follow(X) Follow(B) • Also if B * then Follow(X) Follow(A)
Computing Follow Sets (Cont.) Algorithm sketch: • Follow(S) { $ } • For each production A X • add First() - {} to Follow(X) • For each A X where First() • add Follow(A) to Follow(X) • repeat step(s) until no Follow set grows
Follow Sets. Example • Recall the grammar E T X X + E | T ( E ) | int Y Y * T | • Follow sets Follow( + ) = { int, ( } Follow( * ) = { int, ( } Follow( ( ) = { int, ( } Follow( E ) = {), $} Follow( X ) = {$, ) } Follow( T ) = {+, ) , $} Follow( ) ) = {+, ) , $} Follow( Y ) = {+, ) , $} Follow( int) = {*, +, ) , $}
Constructing LL(1) Parsing Tables • Construct a parsing table T for CFG G • For each production A in G do: • For each terminal t First() do • T[A, t] = • If First(), for each t Follow(A) do • T[A, t] = • If First() and $ Follow(A) do • T[A, $] =
LL(1) Parsing Table Example • Left-factored grammar E T X | X X X + E | T ( E ) | int Y Y * T | • The LL(1) parsing table:
LL(1) Parsing Tables. Errors • Blank entries indicate error situations • Consider the [E,*] entry • “There is no way to derive a string starting with * from non-terminal E”
Notes on LL(1) Parsing Tables • If any entry is multiply defined then G is not LL(1) • If G is ambiguous • If G is left recursive • If G is not left-factored • And in other cases as well • Most programming language grammars are not LL(1) • There are tools that build LL(1) tables
More Examples on First Set and Follow Set {'x', ε} A -> ‘x’ B | ε B-> ‘y’ | ε C -> ‘z’ E -> ABC F -> CBA S -> E | F First (A) = ? {‘y', ε} First (B) = ? {‘z’} First (C) = ? First (E) = ? First (A)/{ε} U First (B)/{ε} U First (C) ={‘x’, ‘y’, ‘z’} First (F) = ? First (C) ={‘z’} First (E) U First (F) = {‘x’, ‘y’, ‘z’} First (S) = ?
C -> ‘z’ B-> ‘y’ | ε A -> ‘x’ B | ε Example: Follow Set • S -> E | F Follow (S) = {$}, Follow (E) = Follow (F) = Follow (S) = {$} • F -> CBA Follow (A) = Follow (F) = {$} Follow (B) = First (A) U Follow (F) / {ε} = {‘x’, $} Follow (C) = First (A) U First (B) U Follow (F) / {ε} = {‘x’, ‘y’, $} • E -> ABC Follow (A) U= (First (B) U First (C) / {ε}) = {‘y’, ‘z’, ‘$’} Follow (B) U= First (C) / {ε} = {‘x’, ‘z’, $} Follow (C) U= Follow (E) = {‘x’, ‘y’, $}
S -> E | F E -> ABC F -> CBA Example: Follow Set • C -> ‘z’ • B-> ‘y’ | ε • A -> ‘x’ B | ε Follow (B) U= Follow (A) = {‘x’, ‘y’, ‘z’, $}
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Summary • First (A) = {‘x’, ε} • First (B) = {‘y’, ε} • First (C) = {‘z’} • First (E) = {‘x’, ‘y’, ‘z’} • First (F) = {‘z’} • First (S) = {‘x’, ‘y’, ‘z’} Follow (A) = {‘y’, ‘z’, ‘$’} Follow (B) = {‘x’, ‘y’, ‘z’, $} Follow (C) = {‘x’, ‘y’, $} Follow (E) = {$} Follow (F) = {$} Follow (S) = {$}
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction S -> E, First (E) = {'x', 'y', 'z'} and ε not in First (S)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction S -> F, First (F) = {'z'} and ε not in First (S)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction F -> CBA, First (CBA) = {'z'} and ε not in First (F)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction E -> ABC, First (ABC) = {‘x’, ‘y’, 'z'} and ε not in First (F)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction C -> ‘z’, First (‘z’) = {'z'} and ε not in First (C)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction B -> ‘y’, First (‘y’) = {‘y'} and ε in First (B)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction B -> ε, Follow (B) = {‘x’, ‘y‘, ‘z’, $} and ε in First (B)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction A -> ‘x’ B, First (A) = {‘x’} and ε in First (A)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction A -> ε, Follow (A) = {‘y’, ‘z’, $} and ε in First (A)
A -> ‘x’ B | ε E -> ABC B-> ‘y’ | ε F -> CBA C -> ‘z’ S -> E | F Table Construction Not a LL(1), Conflicts!!!
The process of Recursive Descent Parsing • Consider the grammar E T X X + E | T ( E ) | int Y Y * T | • Token stream is: int* int