380 likes | 504 Views
CS 381 - Summer 2005 Top-down and Bottom-up Parsing - a whirlwind tour. June 20, 2005 Slide acknowledgment: Radu Rugina, CS 412. Simplified Compiler Structure. Source code. Understand source code. if (b == 0) a = b;. Front end (machine-independent). Intermediate code. Optimize.
E N D
CS 381 - Summer 2005Top-down and Bottom-up Parsing - a whirlwindtour June 20, 2005 Slide acknowledgment: Radu Rugina, CS 412
Simplified Compiler Structure Source code Understand source code if (b == 0) a = b; Front end (machine-independent) Intermediate code Optimize Optimizer Intermediate code Back end (machine-dependent) Generate assembly code Assembly code cmp $0,ecx cmovz edx,ecx
Simplified Front-End Structure Source code (character stream) if (b == 0) a = b; Lexical Analysis Tokenstream if ( b == 0 ) a = b ; Syntax Analysis (Parsing) if Abstract SyntaxTree (AST) == = b 0 a b Semantic Analysis
Parse Tree vs. AST S E + S E ( S ) 5 E + S + + 1 E S + 5 E 2 1 + ( S ) + 2 E + S E 3 4 3 4 • Parse tree also called “concrete syntax” Abstract Syntax Tree Parse Tree (Concrete Syntax) Discards (abstracts) unneeded information
How to build an AST • Need to find a derivation for the program in the grammar • Want an efficient algorithm • should only read token stream once • exponential brute-force search out of question • even CKY is too slow • Two main ways to parse: • top-down parsing (recursive descent) • bottom-up parsing (shift-reduce)
Parsing Top-down S E + S | E E num | ( S ) Goal: construct a leftmost derivation of string while reading in token stream Partly-derived String Lookahead S ( (1+2+(3+4))+5 E+S ( (1+2+(3+4))+5 (S) +S 1 (1+2+(3+4))+5 • (E+S)+S 1 (1+2+(3+4))+5 (1+S)+S 2 (1+2+(3+4))+5 (1+E+S)+S 2 (1+2+(3+4))+5 (1+2+S)+S 2 (1+2+(3+4))+5 • (1+2+E)+S ( (1+2+(3+4))+5 • (1+2+(S))+S 3 (1+2+(3+4))+5 • (1+2+(E+S))+S 3 (1+2+(3+4))+5 parsed partunparsed part
Problem S E + S | E E num | ( S ) • Want to decide which production to apply based on next symbol (1) S E (S) (E) (1) (1)+2 S E+ S (S) + S (E) + S (1)+E (1)+2 • Why is this hard?
Grammar is Problem • This grammar cannot be parsed top-down with only a single look-ahead symbol • NotLL(1)=Left-to-right-scanning, Left-most derivation, 1 look-ahead symbol • Is it LL(k) for some k? • Can rewrite grammar to allow top-down parsing: create LL(1) grammar for same language
Making a grammar LL(1) S E + S S E E num E ( S ) • Problem: can’t decide which S production to apply until we see symbol after first expression • Left-factoring: Factor common S prefix, add new non-terminal S' at decision point. S' derives (+E)* S ES' S' S' + S E num E ( S )
Parsing with new grammar S ( (1+2+(3+4))+5 E S' ( (1+2+(3+4))+5 (S) S' 1 (1+2+(3+4))+5 (ES') S' 1 (1+2+(3+4))+5 (1 S') S' + (1+2+(3+4))+5 (1+ES') S' 2 (1+2+(3+4))+5 (1+2 S') S' + (1+2+(3+4))+5 (1+2 + S) S' ( (1+2+(3+4))+5 (1+2 + ES') S' ( (1+2+(3+4))+5 (1+2 + (S) S') S' 3 (1+2+(3+4))+5 (1+2 + (ES') S') S' 3 (1+2+(3+4))+5 (1+2 + (3 S') S') S' + (1+2+(3+4))+5 (1+2 + (3 + E) S') S' 4 (1+2+(3+4))+5 S ES ' S ' | + S E num | ( S )
Predictive Parsing • LL(1) grammar: • for a given non-terminal, the look-ahead symbol uniquely determines the production to apply • top-down parsing = predictive parsing • Driven by predictive parsing table of non-terminals terminals productions
Using Table S E S ' S ' | + S E num | ( S ) S ( (1+2+(3+4))+5 E S' ( (1+2+(3+4))+5 (S) S' 1 (1+2+(3+4))+5 (ES') S' 1 (1+2+(3+4))+5 (1 S') S' + (1+2+(3+4))+5 (1 + S) S' 2 (1+2+(3+4))+5 (1+ES') S' 2 (1+2+(3+4))+5 (1+2 S') S' + (1+2+(3+4))+5 num + ( ) $ S E S 'E S ' S ' +S E num ( S )
How to Implement? • Table can be converted easily into a recursive-descent parser num + ( ) $ S E S 'E S ' S ' +S E num ( S ) • Three procedures: parse_S, parse_S’, parse_E
Recursive-Descent Parser void parse_S () { switch (token) { case num: parse_E(); parse_S’(); return; case ‘(’: parse_E(); parse_S’(); return; default: throw new ParseError(); } } number + ( ) $ S ES’ ES’ S’ +S E number ( S ) lookahead token
Recursive-Descent Parser void parse_S’() { switch (token) { case ‘+’: token = input.read(); parse_S(); return; case ‘)’: return; case EOF: return; default: throw new ParseError(); } } number + ( )$ S ES’ ES’ S’ +S E number ( S )
Recursive-Descent Parser void parse_E() { switch (token) { case number: token = input.read(); return; case ‘(‘: token = input.read(); parse_S(); if (token != ‘)’) throw new ParseError(); token = input.read(); return; default: throw new ParseError(); } } number + ( ) $ S ES’ ES’ S’ +S E number ( S )
Call Tree = Parse Tree S E S’ ( S ) + S E S’ 5 1 + S E S’ 2 + S E S’ ( S ) E S’ + S 3 E 4 (1 + 2 + (3 + 4)) + 5 parse_S parse_S’ parse_E parse_S parse_S parse_E parse_S’ parse_S parse_S’ parse_E parse_S parse_S’ parse_E parse_S
How to Construct Parsing Tables • There exists an algorithm for automatically generating a predictive parse table from a grammar (take 412 for details) N + ( ) $ S ES’ES’ S’ +S E N ( S ) S ES’ S’ | + S E number | ( S )
Summary for top-down parsing • LL(k) grammars • left-to-right scanning • leftmost derivation • can determine what production to apply from the next k symbols • Can automatically build predictive parsing tables • Predictive parsers • Can be easily built for LL(k) grammars from the parsing tables • Also called recursive-descent, or top-down parsers
Top-Down Parsing Summary Language grammar Left-recursion elimination Left-factoring LL(1) grammar predictive parsing table recursive-descent parser parser with AST generation
Now: Bottom-up Parsing • A more powerful parsing technology • LR grammars -- more expressive than LL • construct right-most derivation of program • virtually all programming languages • easier to express programming language syntax • Shift-reduce parsers • Parsers for LR grammars • automatic parser generators (e.g. yacc,CUP)
Bottom-up Parsing • Right-most derivation -- backward • Start with the tokens • End with the start symbol (1+2+(3+4))+5 (E+2+(3+4))+5 (S+2+(3+4))+5 (S+E+(3+4))+5 (S+(3+4))+5 (S+(E+4))+5 (S+(S+4))+5 (S+(S+E))+5 (S+(S))+5 (S+E)+5 (S)+5 E+5 S+ES S S + E | E E num | ( S )
Progress of Bottom-up Parsing (1+2+(3+4))+5 (1+2+(3+4))+5 (E+2+(3+4))+5 (1 +2+(3+4))+5 (S+2+(3+4))+5 (1 +2+(3+4))+5 (S+E+(3+4))+5 (1+2 +(3+4))+5 (S+(3+4))+5 (1+2+(3+4))+5 (S+(E+4))+5 (1+2+(3+4))+5 (S+(S+4))+5 (1+2+(3+4))+5 (S+(S+E))+5 (1+2+(3+4))+5 (S+(S))+5 (1+2+(3+4))+5 (S+E)+5 (1+2+(3+4))+5 (S)+5 (1+2+(3+4) )+5 E+5 (1+2+(3+4))+5 S+E(1+2+(3+4))+5 S(1+2+(3+4))+5 right-most derivation
Bottom-up Parsing • (1+2+(3+4))+5 (E+2+(3+4))+5 (S+2+(3+4))+5 (S+E+(3+4))+5 … • Advantage of bottom-up parsing: can postpone the selection of productions until more of the input is scanned S S + E | E E num | ( S ) S S + E E 5 ( S ) S + E ( S ) S+E E S + E 2 4 1 E 3
Top-down Parsing (1+2+(3+4))+5 S S+E E+E (S)+E (S+E)+E (S+E+E)+E (E+E+E)+E (1+E+E)+E (1+2+E)+E ... • In left-most derivation, entire tree above a token (2) has been expanded when encountered S S + E | E E num | ( S ) S S + E E 5 ( S ) S + E ( S ) S + E E S + E 2 4 1 E 3
Top-down vs. Bottom-up Bottom-up: Don’t need to figure out as much of the parse tree for a given amount of input scanned unscanned scanned unscanned Top-down Bottom-up
Shift-reduce Parsing • Parsing actions: is a sequence of shift and reduce operations • Parser state: a stack of terminals and non-terminals (grows to the right) • Current derivation step = always stack+input Derivation step stack unconsumed input (1+2+(3+4))+5 (1+2+(3+4))+5 (E+2+(3+4))+5 (E +2+(3+4))+5 (S+2+(3+4))+5 (S +2+(3+4))+5 (S+E+(3+4))+5 (S+E +(3+4))+5
Shift-reduce Parsing • Parsing is a sequence of shifts and reduces • Shift : move look-ahead token to stack stack input action ( 1+2+(3+4))+5 shift 1 (1 +2+(3+4))+5 • Reduce : Replace symbols from top of stack with non-terminal symbol X, corresponding to production X (pop , push X) stack input action (S+E +(3+4))+5 reduce S S+E (S +(3+4))+5
Shift-reduce Parsing S S + E | E E num | ( S ) (1+2+(3+4))+5 (1+2+(3+4))+5 shift (1+2+(3+4))+5 ( 1+2+(3+4))+5 shift (1+2+(3+4))+5 (1 +2+(3+4))+5 reduce Enum (E+2+(3+4))+5 (E +2+(3+4))+5 reduce S E (S+2+(3+4))+5 (S +2+(3+4))+5 shift (S+2+(3+4))+5 (S+2+(3+4))+5 shift (S+2+(3+4))+5 (S+2 +(3+4))+5 reduce Enum (S+E+(3+4))+5 (S+E+(3+4))+5 reduce S S+E (S+(3+4))+5 (S+(3+4))+5 shift (S+(3+4))+5 (S+ (3+4))+5 shift (S+(3+4))+5 (S+(3+4))+5 shift (S+(3+4))+5 (S+(3+4))+5 reduce Enum derivation input stream action stack
Problem • How do we know which action to take: whether to shift or reduce, and which production? • Issues: • Sometimes can reduce but shouldn’t • Sometimes can reduce in different ways
Action Selection Problem • Given stack and look-ahead symbol b, should parser: • shift b onto the stack (making it b) • reduceX assuming that stack has the form (making it X) • If stack has form , should apply reduction X (or shift) depending on stack prefix • is different for different possible reductions, since ’s have different length.
LR Parsing Engine • Basic mechanism: • Use a set of parser states • Use a stack with alternating symbols and states • E.g: 1(6S10+5 • Use a parsing table to: • Determine what action to apply (shift/reduce) • Determine the next state • The parser actions can be precisely determined from the table
The LR Parsing Table Terminals Non-terminals Next action and next state Next state State Goto table Action table • Algorithm: look at entry for current state S and input terminal C • If Table[S,C] = s(S’) thenshift: • push(C), push(S’) • If Table[S,C] = Xa then reduce: • pop(2*|a|), S’=top(), push(X), push(Table[S’,X])
LR Parsing Table Example ( ) id , $ S L 1 s3 s2 g4 2Sid Sid Sid Sid Sid 3 s3 s2 g7 g5 4accept 5 s6 s8 6S(L) S(L) S(L) S(L) S(L) 7LS LS LS LS LS 8s3 s2 g9 9LL,S LL,S LL,S LL,S LL,S
LR(k) Grammars • LR(k) = Left-to-right scanning, Right-most derivation, k look-ahead characters • Main cases: LR(0), LR(1), and some variations (SLR and LALR(1)) • Parsers for LR(0) Grammars: • Determine the actions without any lookahead symbol
Building LR(0) Parsing Tables • To build the parsing table: • Define states of the parser • Build a DFA to describe the transitions between states • Use the DFA to build the parsing table
Summary for bottom-up parsing • LR(k) grammars • left-to-right scanning • rightmost derivation • can determine whether to shift or reduce from the next k symbols • Can automatically build predictive parsing tables • Shift-reduce parsers • Can be built for LR(k) grammars using automated parser generator tools, eg. CUP, yacc.
Top-down vs. Bottom-up again LL(k), recursive descent LR(k), shift-reduce scanned unscanned scanned unscanned Top-down Bottom-up