CIS 461 Compiler Design and Construction, Fall 2012
Slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon
Lecture-Module #13: Parsing
Top-down Parsing: left-to-right scan of the input with lookahead; builds a left-most derivation; the fringe of the parse tree grows downward from the start symbol S toward the input.

Bottom-up Parsing: left-to-right scan of the input with lookahead; builds a right-most derivation in reverse; the upper fringe of the parse tree grows upward from the input string toward S.
LR Parsers

LR(k) parsers are table-driven, bottom-up, shift-reduce parsers that use a limited right context (k-token lookahead) for handle recognition.

LR(k): Left-to-right scan of the input, Rightmost derivation in reverse, with k tokens of lookahead.

A grammar is LR(k) if, given a rightmost derivation

    S = γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn−1 ⇒ γn = sentence

we can
1. isolate the handle of each right-sentential form γi, and
2. determine the production by which to reduce,
by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi.
LR Parsers

A table-driven LR parser: a Parser Generator consumes the grammar and emits the ACTION & GOTO tables; at run time, the Scanner feeds tokens from the source code to the table-driven Parser, which uses a Stack plus the tables to produce IR.
LR Shift-Reduce Parsers

push($);   // $ is the end-of-file symbol
push(s0);  // s0 is the start state of the DFA that recognizes handles
lookahead = get_next_token();
repeat forever
    s = top_of_stack();
    if ( ACTION[s,lookahead] == reduce A → β ) then
        pop 2*|β| symbols;
        s = top_of_stack();
        push(A);
        push(GOTO[s,A]);
    else if ( ACTION[s,lookahead] == shift si ) then
        push(lookahead);
        push(si);
        lookahead = get_next_token();
    else if ( ACTION[s,lookahead] == accept and lookahead == $ ) then
        return success;
    else error();

The skeleton parser
• uses ACTION & GOTO tables
• does |words| shifts
• does |derivation| reductions
• does 1 accept
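The skeleton above can be written out as a short Python sketch. This is an illustration, not the lecture's code: the ACTION and GOTO tables are hand-filled for the expression grammar used later in these slides, and the state and production numbering follow that example (production 0, S → Expr, plays the role of the augmented start rule, so state 1 accepts on $).

```python
# Sketch of the skeleton shift-reduce driver with hand-filled tables for the
# slides' expression grammar.

# Productions 0..5: (lhs, |rhs|)
PRODS = [("S", 1),        # 0: S -> Expr
         ("Expr", 3),     # 1: Expr -> Term - Expr
         ("Expr", 1),     # 2: Expr -> Term
         ("Term", 3),     # 3: Term -> Factor * Term
         ("Term", 1),     # 4: Term -> Factor
         ("Factor", 1)]   # 5: Factor -> id

ACTION = {
    (0, "id"): ("shift", 4), (1, "$"): ("accept",),
    (2, "-"): ("shift", 5), (2, "$"): ("reduce", 2),
    (3, "*"): ("shift", 6), (3, "-"): ("reduce", 4), (3, "$"): ("reduce", 4),
    (4, "-"): ("reduce", 5), (4, "*"): ("reduce", 5), (4, "$"): ("reduce", 5),
    (5, "id"): ("shift", 4), (6, "id"): ("shift", 4),
    (7, "$"): ("reduce", 1),
    (8, "-"): ("reduce", 3), (8, "$"): ("reduce", 3),
}
GOTO = {(0, "Expr"): 1, (0, "Term"): 2, (0, "Factor"): 3,
        (5, "Expr"): 7, (5, "Term"): 2, (5, "Factor"): 3,
        (6, "Term"): 8, (6, "Factor"): 3}

def parse(tokens):
    """Run the skeleton parser; return the sequence of reductions performed."""
    stack = ["$", 0]                 # alternates grammar symbols and DFA states
    tokens = tokens + ["$"]          # $ is the end-of-file symbol
    pos, lookahead = 0, tokens[0]
    reductions = []
    while True:
        s = stack[-1]
        act = ACTION.get((s, lookahead))
        if act is None:
            raise SyntaxError(f"unexpected {lookahead!r} in state {s}")
        if act[0] == "shift":        # push lookahead and the new state
            stack += [lookahead, act[1]]
            pos += 1
            lookahead = tokens[pos]
        elif act[0] == "reduce":     # pop 2*|beta| symbols, push A and GOTO state
            lhs, n = PRODS[act[1]]
            del stack[-2 * n:]
            stack += [lhs, GOTO[(stack[-1], lhs)]]
            reductions.append(act[1])
        else:                        # accept
            return reductions

print(parse(["id", "-", "id"]))      # prints [5, 4, 5, 4, 2, 1]
```

The returned list is the reverse rightmost derivation: Factor → id and Term → Factor fire for each operand, then Expr → Term and Expr → Term - Expr, and the parser accepts without reducing by production 0.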
LR Parsers

How does this LR stuff work?
• Unambiguous grammar ⇒ unique rightmost derivation
• Keep the upper fringe on a stack
  • All active handles include the top of stack (TOS)
  • Shift inputs until TOS is the right end of a handle
• The language of handles is regular
  • Build a handle-recognizing DFA
  • The ACTION & GOTO tables encode the DFA
• To match subterms, recurse & leave the DFA's state on the stack
• Final states of the DFA correspond to reduce actions
  • The new state is GOTO[state at TOS (after popping the handle), lhs]
Building LR Parsers

How do we generate the ACTION and GOTO tables?
• Use the grammar to build a model of the handle-recognizing DFA
• Use the DFA model to build the ACTION & GOTO tables
• If the construction succeeds, the grammar is LR

How do we build the handle-recognizing DFA?
• Encode the set of productions that can be used as handles in the DFA states: use LR(k) items
• Use two functions, goto( s, x ) and closure( s )
  • goto() is analogous to move() in the NFA-to-DFA conversion
  • closure() is analogous to ε-closure
• Build up the states and transition function of the DFA
• Use this information to fill in the ACTION and GOTO tables
LR(k) items

An LR(k) item is a pair [A → β•γ, w], where
• A → βγ is a production, with a • marking some position in the rhs
• w is a lookahead string of length ≤ k (terminal symbols and/or $)

Examples: [A → •BCD, a], [A → B•CD, a], [A → BC•D, a], & [A → BCD•, a]

The • in an item indicates the position of the top of the stack.
• LR(0) items: [A → β•γ] (no lookahead symbol)
• LR(1) items: [A → β•γ, a] (one token lookahead)
• LR(2) items: [A → β•γ, a b] (two token lookahead)
...
LR(1) Items

The production A → BCD, with lookahead a, generates 4 items:
[A → •BCD, a], [A → B•CD, a], [A → BC•D, a], & [A → BCD•, a]

The set of LR(1) items for a grammar is finite.

What's the point of all these lookahead symbols?
• Carry them along to choose the correct reduction
• Lookaheads are bookkeeping, unless the item has the • at its right end
  • The lookahead has no direct use in [A → β•γ, a]
  • In [A → β•, a], a lookahead of a implies a reduction by A → β
  • For { [A → β•, a], [B → β•γδ, b] }: lookahead = a ⇒ reduce to A; lookahead ∈ FIRST(γδ) ⇒ shift
• Limited right context is enough to pick the actions
LR(1) Table Construction

High-level overview
1. Build the handle-recognizing DFA (aka the Canonical Collection of sets of LR(1) items), C = { I0, I1, ..., In }
   a. Introduce a new start symbol S' which has only one production, S' → S
   b. The initial state I0 should include [S' → •S, $], along with any equivalent items
      • Derive the equivalent items as closure( I0 )
   c. Repeatedly compute, for each Ik and each grammar symbol x, goto(Ik, x)
      • If the set is not already in the collection, add it
      • Record all the transitions created by goto( )
      This eventually reaches a fixed point
2. Fill in the ACTION and GOTO tables using the DFA

The canonical collection completely encodes the transition diagram for the handle-finding DFA.
Computing Closures

closure(I) adds all the items implied by the items already in I
• Any item [A → β•Bδ, a] implies [B → •τ, x] for each production B → τ with B on the lhs, and each x ∈ FIRST(δa)
• Since βBδ is valid, any way to derive βBδ is valid, too

The algorithm

Closure( I )
    while ( I is still changing )
        for each item [A → β•Bδ, a] ∈ I
            for each production B → τ ∈ P
                for each terminal b ∈ FIRST(δa)
                    if [B → •τ, b] ∉ I
                        then add [B → •τ, b] to I

Fixpoint computation
Computing Gotos

goto(I, x) computes the state that the parser would reach if it recognized an x while in state I
• goto( { [A → β•xδ, a] }, x ) produces [A → βx•δ, a]
• It also includes closure( { [A → βx•δ, a] } ) to fill out the state

The algorithm

Goto( I, x )
    new = Ø
    for each [A → β•xδ, a] ∈ I
        new = new ∪ { [A → βx•δ, a] }
    return closure(new)

• Not a fixpoint method
• Uses closure
Building the Canonical Collection of LR(1) Items

This is where we build the handle-recognizing DFA!
• Start from I0 = closure( { [S' → •S, $] } )
• Repeatedly construct new states, until no new states are generated

The algorithm

I0 = closure( { [S' → •S, $] } )
C = { I0 }
while ( C is still changing )
    for each Ii ∈ C and for each x ∈ ( T ∪ NT )
        Inew = goto(Ii, x)
        if Inew ∉ C then C = C ∪ { Inew }
        record transition Ii → Inew on x

• Fixed-point computation
• The loop adds to C
• C ⊆ 2^ITEMS, so C is finite
Algorithms for LR(1) Parsers

Computing the closure of a set of LR(1) items:

Closure( I )
    while ( I is still changing )
        for each item [A → β•Bδ, a] ∈ I
            for each production B → τ ∈ P
                for each terminal b ∈ FIRST(δa)
                    if [B → •τ, b] ∉ I
                        then add [B → •τ, b] to I

Computing goto for a set of LR(1) items:

Goto( I, x )
    new = Ø
    for each [A → β•xδ, a] ∈ I
        new = new ∪ { [A → βx•δ, a] }
    return closure(new)

Constructing the canonical collection of LR(1) items:

I0 = closure( { [S' → •S, $] } )
C = { I0 }
while ( C is still changing )
    for each Ii ∈ C and for each x ∈ ( T ∪ NT )
        Inew = goto(Ii, x)
        if Inew ∉ C then C = C ∪ { Inew }
        record transition Ii → Inew on x

• The canonical collection construction algorithm is the algorithm for constructing the handle-recognizing DFA
• It uses Closure to compute the states of the DFA
• It uses Goto to compute the transitions of the DFA
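The three routines above fit together as one short program. The sketch below is an illustration under stated assumptions, not the authors' code: it encodes an LR(1) item as a tuple (production index, dot position, lookahead terminal), uses the expression grammar from the upcoming example (with S itself acting as the augmented start symbol S'), and exploits the fact that this grammar has no ε-productions, which keeps the FIRST computation simple.

```python
# Closure, Goto, and the canonical-collection construction, applied to the
# right-recursive expression grammar of the example. An LR(1) item is encoded
# as (production index, dot position, lookahead terminal).

PRODUCTIONS = [
    ("S", ("Expr",)),                   # 0 (S plays the role of S')
    ("Expr", ("Term", "-", "Expr")),    # 1
    ("Expr", ("Term",)),                # 2
    ("Term", ("Factor", "*", "Term")),  # 3
    ("Term", ("Factor",)),              # 4
    ("Factor", ("id",)),                # 5
]
NT = {lhs for lhs, _ in PRODUCTIONS}
SYMBOLS = NT | {s for _, rhs in PRODUCTIONS for s in rhs}

# FIRST for every symbol, by fixpoint (no epsilon-productions in this grammar).
FIRST = {s: ({s} if s not in NT else set()) for s in SYMBOLS}
changed = True
while changed:
    changed = False
    for lhs, rhs in PRODUCTIONS:
        if not FIRST[rhs[0]] <= FIRST[lhs]:
            FIRST[lhs] |= FIRST[rhs[0]]
            changed = True

def closure(items):
    """Add every item implied by the items already in the set (fixpoint)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot, a) in list(items):
            rhs = PRODUCTIONS[p][1]
            if dot < len(rhs) and rhs[dot] in NT:
                delta = rhs[dot + 1:]
                lookaheads = FIRST[delta[0]] if delta else {a}  # FIRST(delta a)
                for q, (qlhs, _) in enumerate(PRODUCTIONS):
                    if qlhs != rhs[dot]:
                        continue
                    for b in lookaheads:
                        if (q, 0, b) not in items:
                            items.add((q, 0, b))
                            changed = True
    return frozenset(items)

def goto(items, x):
    """Advance the dot over x in every item that allows it, then take closure."""
    moved = {(p, dot + 1, a) for (p, dot, a) in items
             if dot < len(PRODUCTIONS[p][1]) and PRODUCTIONS[p][1][dot] == x}
    return closure(moved) if moved else frozenset()

# Canonical collection: start from closure of [S -> .Expr, $] and iterate goto.
I0 = closure({(0, 0, "$")})
C, transitions, work = [I0], {}, [I0]
while work:
    I = work.pop()
    for x in SYMBOLS:
        new = goto(I, x)
        if not new:
            continue
        if new not in C:
            C.append(new)
            work.append(new)
        transitions[(C.index(I), x)] = C.index(new)

print(len(C))   # the example yields 9 sets of items, I0 .. I8
```

Running this on the example grammar produces the same nine item sets that the next slides derive by hand, with I1 = { [S → Expr•, $] } reached from I0 on Expr.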
Example

Simplified, right-recursive expression grammar

S → Expr
Expr → Term - Expr
Expr → Term
Term → Factor * Term
Term → Factor
Factor → id

Symbol    FIRST
S         { id }
Expr      { id }
Term      { id }
Factor    { id }
-         { - }
*         { * }
id        { id }
Example (continued)

I0 = { [S → •Expr, $], [Expr → •Term - Expr, $], [Expr → •Term, $],
       [Term → •Factor * Term, {$,-}], [Term → •Factor, {$,-}], [Factor → •id, {$,-,*}] }

goto(I0, Expr):   I1 = { [S → Expr•, $] }
goto(I0, Term):   I2 = { [Expr → Term• - Expr, $], [Expr → Term•, $] }
goto(I0, Factor): I3 = { [Term → Factor• * Term, {$,-}], [Term → Factor•, {$,-}] }
goto(I0, id):     I4 = { [Factor → id•, {$,-,*}] }

goto(I2, -): I5 = { [Expr → Term - •Expr, $], [Expr → •Term - Expr, $], [Expr → •Term, $],
                    [Term → •Factor * Term, {$,-}], [Term → •Factor, {$,-}], [Factor → •id, {$,-,*}] }
goto(I3, *): I6 = { [Term → Factor * •Term, {$,-}], [Term → •Factor * Term, {$,-}],
                    [Term → •Factor, {$,-}], [Factor → •id, {$,-,*}] }

goto(I5, Expr): I7 = { [Expr → Term - Expr•, $] }
goto(I6, Term): I8 = { [Term → Factor * Term•, {$,-}] }

(The remaining transitions lead back to existing states: goto(I5, Term) = I2, goto(I5, Factor) = I3, goto(I5, id) = I4, goto(I6, Factor) = I3, goto(I6, id) = I4.)
Constructing the ACTION and GOTO Tables

x is the state number; each state corresponds to a set of LR(1) items

The algorithm

for each set of items Ix ∈ C
    for each item ∈ Ix
        if item is [A → β•aγ, b] and a ∈ T and goto(Ix, a) = Ik
            then ACTION[x,a] ← "shift k"
        else if item is [S' → S•, $]
            then ACTION[x,$] ← "accept"
        else if item is [A → β•, a]
            then ACTION[x,a] ← "reduce A → β"
    for each n ∈ NT
        if goto(Ix, n) = Ik
            then GOTO[x,n] ← k
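The table-filling pass is a direct transcription of the algorithm above. The sketch below is illustrative, not the lecture's code: it re-includes the canonical-collection machinery so it stands alone, keeps the same item encoding (production index, dot position, lookahead), and lets production 0, S → Expr, act as S' → S; the example grammar has no conflicts, so each table slot is written at most one distinct value.

```python
# Filling ACTION and GOTO from the canonical collection (illustrative sketch).
# Items are (production index, dot position, lookahead); production 0 acts as S' -> S.

PRODUCTIONS = [("S", ("Expr",)), ("Expr", ("Term", "-", "Expr")), ("Expr", ("Term",)),
               ("Term", ("Factor", "*", "Term")), ("Term", ("Factor",)),
               ("Factor", ("id",))]
NT = {lhs for lhs, _ in PRODUCTIONS}
SYMBOLS = NT | {s for _, rhs in PRODUCTIONS for s in rhs}

FIRST = {s: ({s} if s not in NT else set()) for s in SYMBOLS}
for _ in range(len(SYMBOLS)):            # enough passes to reach the fixpoint here
    for lhs, rhs in PRODUCTIONS:
        FIRST[lhs] |= FIRST[rhs[0]]      # no epsilon-productions in this grammar

def closure(items):
    items, work = set(items), list(items)
    while work:
        p, dot, a = work.pop()
        rhs = PRODUCTIONS[p][1]
        if dot < len(rhs) and rhs[dot] in NT:
            las = FIRST[rhs[dot + 1]] if dot + 1 < len(rhs) else {a}
            for q, (qlhs, _) in enumerate(PRODUCTIONS):
                if qlhs == rhs[dot]:
                    for b in las:
                        if (q, 0, b) not in items:
                            items.add((q, 0, b))
                            work.append((q, 0, b))
    return frozenset(items)

def goto(items, x):
    moved = {(p, d + 1, a) for (p, d, a) in items
             if d < len(PRODUCTIONS[p][1]) and PRODUCTIONS[p][1][d] == x}
    return closure(moved) if moved else frozenset()

C, trans, work = [closure({(0, 0, "$")})], {}, [0]
while work:
    i = work.pop()
    for x in SYMBOLS:
        new = goto(C[i], x)
        if not new:
            continue
        if new not in C:
            C.append(new)
            work.append(len(C) - 1)
        trans[(i, x)] = C.index(new)

# The table-filling algorithm from this slide:
ACTION, GOTO_TABLE = {}, {}
for i, items in enumerate(C):
    for (p, dot, a) in items:
        rhs = PRODUCTIONS[p][1]
        if dot < len(rhs) and rhs[dot] not in NT:        # [A -> beta . a gamma, b]
            ACTION[(i, rhs[dot])] = ("shift", trans[(i, rhs[dot])])
        elif dot == len(rhs) and p == 0:                 # [S' -> S ., $]
            ACTION[(i, "$")] = ("accept",)
        elif dot == len(rhs):                            # [A -> beta ., a]
            ACTION[(i, a)] = ("reduce", p)
    for n in NT:
        if (i, n) in trans:
            GOTO_TABLE[(i, n)] = trans[(i, n)]

print(len(ACTION), len(GOTO_TABLE))      # 15 ACTION entries, 8 GOTO entries
```

For the example grammar this yields 5 shift entries, 9 reduce entries, 1 accept entry, and 8 GOTO entries, matching the hand-built table.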
Example (Constructing the LR(1) tables)

The algorithm produces the following table (si = shift to state i; rj = reduce by production j, numbered 1: S → Expr, 2: Expr → Term - Expr, 3: Expr → Term, 4: Term → Factor * Term, 5: Term → Factor, 6: Factor → id):

State    ACTION                          GOTO
         id     -      *      $          Expr   Term   Factor
0        s4                              1      2      3
1                             acc
2               s5            r3
3               r5     s6     r5
4               r6     r6     r6
5        s4                              7      2      3
6        s4                                     8      3
7                             r2
8               r4            r4
What can go wrong in LR(1) parsing?

What if state s contains [A → β•aγ, b] and [B → γ•, a]?
• The first item generates "shift", the second generates "reduce"
• Both define ACTION[s,a]; the parser cannot do both actions
• This is called a shift/reduce conflict
• Modify the grammar to eliminate it
• Shifting will often resolve it correctly (remember the dangling-else problem?)

What if state s contains [A → γ•, a] and [B → γ•, a]?
• Each generates "reduce", but with a different production
• Both define ACTION[s,a]; the parser cannot do both reductions
• This is called a reduce/reduce conflict
• Modify the grammar to eliminate it

In either case, the grammar is not LR(1).
Algorithms for LR(0) Parsers

Computing the closure of a set of LR(0) items:

Closure( I )
    while ( I is still changing )
        for each item [A → β•Bδ] ∈ I
            for each production B → τ ∈ P
                if [B → •τ] ∉ I
                    then add [B → •τ] to I

Computing goto for a set of LR(0) items:

Goto( I, x )
    new = Ø
    for each [A → β•xδ] ∈ I
        new = new ∪ { [A → βx•δ] }
    return closure(new)

Constructing the canonical collection of LR(0) items:

I0 = closure( { [S' → •S] } )
C = { I0 }
while ( C is still changing )
    for each Ii ∈ C and for each x ∈ ( T ∪ NT )
        Inew = goto(Ii, x)
        if Inew ∉ C then C = C ∪ { Inew }
        record transition Ii → Inew on x

• Basically the same algorithms we use for LR(1); the only difference is that there is no lookahead symbol in LR(0) items, and therefore no lookahead propagation
• LR(0) parsers generate fewer states
• They cannot recognize all the grammars that LR(1) parsers recognize
• They are more likely to get conflicts in the parse table, since there is no lookahead
SLR (Simple LR) Table Construction

• The SLR(1) algorithm uses the canonical collection of LR(0) items
• It also uses the FOLLOW sets to decide when to do a reduction

for each set of items Ix ∈ C
    for each item ∈ Ix
        if item is [A → β•aγ] and a ∈ T and goto(Ix, a) = Ik
            then ACTION[x,a] ← "shift k"
        else if item is [S' → S•]
            then ACTION[x,$] ← "accept"
        else if item is [A → β•]
            then for each a ∈ FOLLOW(A): ACTION[x,a] ← "reduce A → β"
    for each n ∈ NT
        if goto(Ix, n) = Ik
            then GOTO[x,n] ← k
LALR (LookAhead LR) Table Construction

• LR(1) parsers have many more states than SLR(1) parsers
• Idea: Merge the LR(1) states
  • Look at the LR(0) core of the LR(1) items (i.e., ignore the lookahead)
  • If two sets of LR(1) items have the same core, merge them and change the ACTION and GOTO tables accordingly
• LALR(1) parsers can be constructed in two ways
  • Build the canonical collection of LR(1) items and then merge the sets with the same core (you should be able to do this)
  • Construct the kernels of the sets of LR(0) items (ignoring the items with the dot at the beginning of the rhs) and use a lookahead-propagation algorithm to compute the lookahead symbols
• The second approach is more efficient, since it avoids the construction of the large LR(1) table
Properties of LALR(1) Parsers

• An LALR(1) parser for a grammar G has the same number of states as the SLR(1) parser for that grammar
• If an LR(1) parser for a grammar G does not have any shift/reduce conflicts, then the LALR(1) parser for that grammar will not have any shift/reduce conflicts either
• It is possible for an LALR(1) parser for a grammar G to have a reduce/reduce conflict while the LR(1) parser for that grammar has no conflicts
Error recovery

• Panic-mode recovery: On discovering an error, discard input symbols one at a time until a synchronizing token is found
  • For example, delimiters such as ";" or "}" can be used as synchronizing tokens
• Phrase-level recovery: On discovering an error, make local corrections to the input
  • For example, replace "," with ";"
• Error productions: If we have a good idea about what types of errors occur, we can augment the grammar with error productions and generate appropriate error messages when an error production is used
• Global correction: Given an incorrect input string, try to find a correct string that requires the minimum number of changes to the input string
  • In general, too costly
Direct Encoding of Parse Tables

Rather than using a table-driven interpreter ...
• Generate spaghetti code that implements the logic
• Each state becomes a small case statement or if-then-else
• Analogous to direct-coding a scanner

Advantages
• No table lookups and address calculations
• No representation for don't-care states
• No outer loop; it is implicit in the code for the states

This produces a faster parser with more code but no table.
LR(k) versus LL(k) (Top-down Recursive Descent)

Finding Reductions
• LR(k): Each reduction in the parse is detectable with the complete left context, the reducible phrase itself, and the k terminal symbols to its right
• LL(k): The parser must select the reduction based on the complete left context and the next k terminals

Thus, LR(k) examines more context.
Left Recursion versus Right Recursion

Right recursion
• Required for termination in top-down parsers
• Uses (on average) more stack space
• Produces right-associative operators: w * x * y * z parses as w * ( x * ( y * z ) )

Left recursion
• Works in more efficient bottom-up parsers
• Limits required stack space
• Produces left-associative operators: w * x * y * z parses as ( ( w * x ) * y ) * z

Rule of thumb
• Left recursion for bottom-up parsers (bottom-up parsers can also deal with right recursion, but left recursion is more efficient)
• Right recursion for top-down parsers
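The two tree shapes can be made concrete with a tiny illustration. The function names and the hypothetical rule shapes in the comments (a right-recursive Expr → Term * Expr versus a left-recursive Expr → Expr * Term) are ours, not the slides'; each function just builds the nesting the corresponding rule would produce.

```python
# Illustrates the associativity difference: the nesting produced by a
# right-recursive rule (Expr -> Term * Expr) versus a left-recursive one
# (Expr -> Expr * Term), shown as nested pairs.

def right_recursive(operands):
    """w * ( x * ( y * z ) ): the rightmost operator binds tightest."""
    if len(operands) == 1:
        return operands[0]
    return (operands[0], right_recursive(operands[1:]))

def left_recursive(operands):
    """( ( w * x ) * y ) * z: the leftmost operator binds tightest."""
    if len(operands) == 1:
        return operands[0]
    return (left_recursive(operands[:-1]), operands[-1])

ops = ["w", "x", "y", "z"]
print(right_recursive(ops))  # ('w', ('x', ('y', 'z')))
print(left_recursive(ops))   # ((('w', 'x'), 'y'), 'z')
```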
Hierarchy of Context-Free Languages

The inclusion hierarchy for context-free languages:
• Context-free languages
• Deterministic languages (LR(k)); for languages, LR(k) = LR(1)
• LL(k) languages
• Simple precedence languages
• LL(1) languages
• Operator precedence languages
Hierarchy of Context-Free Grammars

Inclusion hierarchy for context-free grammars:
• Context-free grammars ⊃ unambiguous CFGs ⊃ LR(k) ⊃ LR(1) ⊃ LALR(1) ⊃ SLR(1) ⊃ LR(0)
• LL(k) ⊂ LR(k), and LL(1) is a subset of SLR(1)
• Operator precedence includes some ambiguous grammars
Summary

                  Top-down recursive descent    LR(1)
Advantages        Fast                          Fast
                  Simplicity                    Deterministic langs.
                  Good error detection          Automatable
                                                Left associativity
Disadvantages     Hand-coded                    Large working sets
                  High maintenance              Poor error messages
                  Right associativity           Large table sizes