Parsing

CIS 461Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan, Keith Cooper, and Linda TorczonLecture-Module #13 Parsing

Top-down Parsing S A fringe of the parse tree B start symbol ? D C S left-to-right scan ? left-most derivation lookahead Bottom-up Parsing S input string lookahead upper fringe of the parse tree ? A D C right-most derivation in reverse lookahead

LR Parsers LR(k) parsers are table-driven, bottom-up, shift-reduce parsers that use a limited right context (k-token lookahead) for handle recognition LR(k): Left-to-right scan of the input, Rightmost derivation in reverse with k token lookahead A grammar is LR(k) if, given a rightmost derivation S 0 1 2 …  n-1 nsentence We can 1. isolate the handle of each right-sentential form i , and 2. determine the production by which to reduce, by scanning i from left-to-right, going at most k symbols beyond the right end of the handle of i

LR Parsers A table-driven LR parser looks like Scanner Table-driven Parser ACTION & GOTO Tables Parser Generator Stack source code IR grammar

LR Shift-Reduce Parsers push($); // $ is the end-of-file symbol push(s0); // s0 is the start state of the DFA that recognizes handles lookahead = get_next_token(); repeat forever s = top_of_stack(); if ( ACTION[s,lookahead] == reduce  ) then pop 2*|| symbols; s = top_of_stack(); push(); push(GOTO[s,]); else if ( ACTION[s,lookahead] == shift si ) then push(lookahead); push(si); lookahead = get_next_token(); else if ( ACTION[s,lookahead] == accept and lookahead == $ ) then return success; else error(); • The skeleton parser • uses ACTION & GOTO • does |words| shifts • does |derivation| • reductions • does 1 accept

LR Parsers How does this LR stuff work? Unambiguous grammar  unique rightmost derivation Keep upper fringe on a stack All active handles include TOS Shift inputs until TOS is right end of a handle Language of handles is regular Build a handle-recognizing DFA ACTION & GOTO tables encode the DFA To match subterms, recurse & leave DFA’s state on stack Final states of the DFA correspond to reduce actions New state is GOTO[lhs , state at TOS]

Building LR Parsers How do we generate the ACTION and GOTO tables? Use the grammar to build a model of the handle recognizing DFA Use the DFA model to build ACTION & GOTO tables If construction succeeds, the grammar is LR How do we build the handle recognizing DFA ? Encode the set of productions that can be used as handles in the DFA state: Use LR(k) items Use two functions goto( s,  ) and closure( s ) goto() is analogous to move() in the DFA to NFA conversion closure() is analogous to -closure Build up the states and transition functions of the DFA Use this information to fill in the ACTION and GOTO tables

LR(k) items An LR(k) item is a pair [A , B], where A is a production  with a • at some position in the rhs B is a lookahead string of length ≤ k (terminal symbols or $) Examples: [• , a], [• , a], [• , a], & [• , a] The • in an item indicates the position of the top of the stack LR(0) items [  •  ] (no lookahead symbol) LR(1) items [  •  , a ] (one token lookahead) LR(2) items [  •  , a b ] (two token lookahead) ...

LR(1) Items The production •, with lookahead a, generates 4 items [• , a], [• , a], [• , a], & [• , a] The set of LR(1) items for a grammar is finite What’s the point of all these lookahead symbols? Carry them along to choose correct reduction Lookaheads are bookkeeping, unless item has • at right end Has no direct use in [• , a] In [• , a], a lookahead of a implies a reduction by  For { [• , a],[• , b] } lookahead = areduce to ; lookahead  FIRST() shift Limited right context is enough to pick the actions

LR(1) Table Construction High-level overview • Build the handle recognizing DFA (aka Canonical Collection of sets of LR(1) items), C = { I0 , I1 , ... , In} a Introduce a new start symbol S’ which has only one production S’  S b Initial state, I0 should include • [S’ •S, $], along with any equivalent items • Derive equivalent items as closure( I0 ) c Repeatedly compute, for each Ik , and each grammar symbol , goto(Ik , ) • If the set is not already in the collection, add it • Record all the transitions created by goto( ) This eventually reaches a fixed point • Fill in the ACTION and GOTO tables using the DFA The canonical collection completely encodes the transition diagram for the handle-finding DFA

Computing Closures closure(I) adds all the items implied by items already in I • Any item [ , a] implies [  , x] for each production with  on the lhs, and x  FIRST(a) • Since  is valid, any way to derive  is valid, too The algorithm Closure( I ) while ( I is still changing ) for each item [   •  , a]  I for each production     P for each terminal b  FIRST(a) if [  •  , b]  I then add [  •  , b] to I Fixpoint computation

Computing Gotos goto(I , x) computes the state that the parser would reach if it recognized an x while in state I • goto( { [   , a] },  ) produces [   , a] • It also includes closure( [   , a] ) to fill out the state The algorithm Goto( I, x ) new = Ø for each [   • x  , a]  I new = new  [  x •  , a] return closure(new) • Not a fixpoint method • Uses closure

Building the Canonical Collection of LR(1) Items This is where we build the handle recognizing DFA! Start from I0= closure( [S’ • S , $] ) Repeatedly construct new states, until ano new states are generated The algorithm I0 = closure( [S’ • S , $] ) C= { I0 } while ( C is still changing ) for each Ii  C and for each x  ( TNT ) Inew = goto(Ii, x) if Inew  C then C = C  Inew record transition Ii Inew on x • Fixed-point computation • Loop adds to C • C  2ITEMS, so C is finite

Algorithms for LR(1) Parsers Computing closure of set of LR(1) items: Computing goto for set of LR(1) items: Closure( I ) while ( I is still changing ) for each item [   •  , a]  I for each production     P for each terminal b  FIRST(a) if [  •  , b]  I then add [  •  , b] to I Goto( I, x ) new = Ø for each [   • x  , a]  I new = new  [  x •  , a] return closure(new) Constructing canonical collection of LR(1) items: I0 = closure( [S’ • S , $] ) C= { I0 } while ( C is still changing ) for each Ii  C and for each x  ( TNT ) Inew = goto(Ii, x) if Inew  C then C = C  Inew record transition Ii Inew on x • Canonical collection construction • algorithm is the algorithm for • constructing handle recognizing DFA • Uses Closure to compute the states • of the DFA • Uses Goto to compute the transitions • of the DFA

Example Simplified, right recursive expression grammar S  Expr Expr  Term - Expr Expr  Term Term  Factor * Term Term  Factor Factor  id Symbol FIRST S { id } Expr{ id} Term{ id } Factor{ id} -{ - } *{ * } id{ id}

I1= {[S  Expr •, $]} I0= {[S  •Expr ,$], [Expr  •Term - Expr , $], [Expr  •Term , $], [Term  •Factor * Term , {$,-}], [Term  •Factor , {$,-}], [Factor  • id ,{$,-,*}]} Expr Term id Factor I2 ={ [Expr  Term • - Expr , $], [Expr  Term •, $] } I4= { [Factor id •, {$,-,*}] } I3 = {[Term  Factor •* Term , {$,-}], [Term  Factor •, {$,-}]} id Factor * • I6={[Term  Factor * • Term , {$,-}], [Term • Factor * Term , {$,-}], • [Term  • Factor , {$,-}], [Factor• id , {$, -, *}]} Factor - Term id I5={[Expr  Term - •Expr , $], [Expr  •Term - Expr , $], [Expr  •Term , $], [Term  •Factor * Term , {$,-}], [Term  •Factor , {$,-}], [Factor  • id , {$,-,*}] } Expr I8= { [Term  Factor * Term•, {$,-}] } I7= { [Expr  Term - Expr •, $] }

Constructing the ACTION and GOTO Tables x is the state number Each state corresponds to a set of LR(1) items The algorithm for each set of items Ix  C for each item  Ix if item is [ •a,b] and a  T and goto(Ix,a) = Ik , then ACTION[x,a]  “shift k” else if item is [S’S •,$] then ACTION[x ,$]  “accept” else if item is [ •,a] then ACTION[x,a]  “reduce ” for each n  NT if goto(Ix ,n) = Ik then GOTO[x,n]  k

Example (Constructing the LR(1) tables) The algorithm produces the following table

What can go wrong in LR(1) parsing? What if state s contains [   • a , b] and [   • , a] ? First item generates “shift”, second generates “reduce” Both define ACTION[s,a] — cannot do both actions This is called a shift/reduce conflict Modify the grammar to eliminate it Shifting will often resolve it correctly (remember the danglin else problem?) What if set s contains [   • , a] and [   • , a] ? Each generates “reduce”, but with a different production Both define ACTION[s,a] — cannot do both reductions This is called a reduce/reduce conflict Modify the grammar to eliminate it In either case, the grammar is not LR(1)

Algorithms for LR(0) Parsers Computing closure of set of LR(0) items: Computing goto for set of LR(0) items: Closure( I ) while ( I is still changing ) for each item [   •  ]  I for each production     P if [  •  ]  I then add [  •  ] to I Goto( I, x ) new = Ø for each [   • x  ]  I new = new  [  x •  ] return closure(new) • Basically the same algorithms we use • for LR(1) • The only difference is there is no • lookahead symbol in LR(0) items and therefore there is no lookahead propagation • LR(0) parsers generate less states • They cannot recognize all the grammars that LR(1) parsers recognize • They are more likely to get conflicts in the parse table since there is no lookahead Constructing canonical collection of LR(0) items: I0 = closure( [S’ • S] ) C= { I0 } while ( C is still changing ) for each Ii  C and for each x  ( TNT ) Inew = goto(Ii, x) if Inew  C then C = C  Inew record transition Ii Inew on x

SLR (Simple LR) Table Construction • SLR(1) algorithm uses canonical collection of LR(0) items • It also uses the FOLLOW sets to decide when to do a reduction for each set of items Ix  C for each item  Ix if item is [ •a] and a  T and goto(Ix,a) = Ik , then ACTION[x,a]  “shift k” else if item is [S’S •] then ACTION[x ,$]  “accept” else if item is [ •] then for each a  FOLLOW() then ACTION[x,a]  “reduce ” for each n  NT if goto(Ix ,n) = Ik then GOTO[x,n]  k

LALR (LookAhead LR) Table Construction • LR(1) parsers have many more states than SLR(1) parsers • Idea: Merge the LR(1) states • Look at the LR(0) core of the LR(1) items (meaning ignore the lookahead) • If two sets of LR(1) items have the same core, merge them and change the ACTION and GOTO tables accordingly • LALR(1) parsers can be constructed in two ways • Build the canonical collection of LR(1) items and then merge the sets with the same core ( you should be able to do this) • Ignore the items with the dot at the beginning of the rhs and construct kernels of sets of LR(0) items. Use a lookahead propagation algorithm to compute lookahead symbols. The second approach is more efficient since it avoids the construction of the large LR(1) table.

Properties of LALR(1) Parsers • An LALR(1) parser for a grammar G has same number of states as the SLR(1) parser for that grammar • If an LR(1) parser for a grammar G does not have any shift/reduce conflicts, then the LALR(1) parser for that grammar will not have any shift/reduce conflicts • It is possible for LALR(1) parser for a grammar G to have a reduce/reduce conflict while an LR(1) parser for that grammar does not have any conflicts

Error recovery • Panic-mode recovery: On discovering an error discard input symbols one at a time until one synchronizing token is found • For example delimiters such as “;” or “}” can be used as synchronizing tokens • Phrase-level recovery: On discovering an error make local corrections to the input • For example replace “,” with “;” • Error-productions: If we have a good idea about what type of errors occur we can augment the grammar with error productions and generate appropriate error messages when an error production is used • Global correction: Given an incorrect input string try to find a correct string which will require minimum changes to the input string • In general too costly

Direct Encoding of Parse Tables Rather than using a table-driven interpreter … Generate spaghetti code that implements the logic Each state becomes a small case statement or if-then-else Analogous to direct coding a scanner Advantages No table lookups and address calculations No representation for don’t care states No outer loop —it is implicit in the code for the states This produces a faster parser with more code but no table

LR(k) versus LL(k) (Top-down Recursive Descent ) Finding Reductions LR(k)  Each reduction in the parse is detectable with the complete left context, the reducible phrase, itself, and the k terminal symbols to its right LL(k)  Parser must select the reduction based on The complete left context The next k terminals Thus, LR(k) examines more context

Left Recursion versus Right Recursion Right recursion Required for termination in top-down parsers Uses (on average) more stack space Produces right-associative operators Left recursion More efficient bottom-up parsers Limits required stack space Produces left-associative operators Rule of thumb Left recursion for bottom-up parsers (Bottom-up parsers can also deal with right-recursion but left recursion is more efficient) Right recursion for top-down parsers * * * * w * * x z y w * ( x * ( y * z ) ) z y w x ( (w * x ) * y ) * z

Hierarchy of Context-Free Languages Context-free languages Deterministic languages (LR(k)) LL(k) languages Simple precedence languages LL(1) languages Operator precedence languages LR(k)  LR(1) The inclusion hierarchy for context-free languages

Hierarchy of Context-Free Grammars Context-free grammars Unambiguous CFGs Operator Precedence LR(k) Inclusion hierarchy for context-free grammars LR(1) LL(k) • Operator precedence • includes some ambiguous • grammars • LL(1) is a subset of SLR(1) LALR(1) SLR(1) LL(1) LR(0)

Summary Top-down recursive descent Advantages Fast Simplicity Good error detection Fast Deterministic langs. Automatable Left associativity Disadvantages Hand-coded High maintenance Right associativity Large working sets Poor error messages Large table sizes LR(1)

Parsing

Parsing

Presentation Transcript

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing