Parsing

CIS 461Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan, Keith Cooper, and Linda TorczonLecture-Module #9 Parsing

Announcements • Professor Christian Trefftz will substitute on Friday, October 5. • Read through Chapter 5. • Writing Assignment #5 is due Monday, October 8. • Programming Assignment #2 will be posted on the class webpage; it will be due in approximately two weeks, tentatively October 17. • For this assignment you should work in teams of two. Please find a teammate. • You’ll use the JJTree utility associated with JavaCC; start reading its documentation. • I’ll explain it more starting Monday, October 8. • Start the project early; please don’t leave it to the last weekend! (;-) • Midterm will be tentatively next Wednesday, October 10 • in class • closed books, closed notes

Parsing Techniques Top-down parsers (LL(1), recursive descent) Start at the root of the parse tree from the start symbol and grow toward leaves (similar to a derivation) Pick a production and try to match the input Bad “pick”  may need to backtrack Some grammars are backtrack-free (predictive parsing) Bottom-up parsers (LR(1), operator precedence) Start at the leaves and grow toward root We can think of the process as reducing the input string to the start symbol At each reduction step a particular substring matching the right-side of a production is replaced by the symbol on the left-side of the production Bottom-up parsers handle a large class of grammars

Eliminating Immediate Left Recursion To remove left recursion, we can transform the grammar Consider a grammar fragment of the form A A  |  where  or  are strings of terminal and nonterminal symbols and neither  nor  start with A We can rewrite this as A   R R   R | where R is a new non-terminal This accepts the same language, but uses only right recursion A A  A   A  R  R  R 

Predictive Parsing Basic idea Given A    , the parser should be able to choose between & FIRST sets For a string of grammar symbols , define FIRST() as the set of tokens that appear as the first symbol in some string that derives from  That is, x  FIRST() iff* x , for some  The LL(1) Property If A   and A  both appear in the grammar, we would like FIRST()  FIRST() =  This would allow the parser to make a correct choice with a lookahead of exactly one symbol ! (Pursuing this idea leads to LL(1) parser generators...)

Recursive Descent Parsing Recursive-descent parsing • A top-down parsing method • The term descent refers to the direction in which the parse tree is traversed (or built). • Use a set of mutually recursive procedures (one procedure for each nonterminal symbol) • Start the parsing process by calling the procedure that corresponds to the start symbol • Each production becomes one clause in procedure • We consider a special type of recursive-descent parsing called predictive parsing • Use a lookahead symbol to decide which production to use

Recursive Descent Parsing: Expression Grammar int PLUS=1, MINUS=2, ... int lookahead; void match(int token) { if (lookahead==token) lookahead=getNextToken(); else error(); } void main() { lookahead=getNextToken(); S(); match(EOF); } void S() { Expr(); } void Expr() { Term(); ExprPrime(); } void ExprPrime() { switch(lookahead) { case PLUS : match(PLUS); Term(); ExprPrime(); break; case MINUS : match(MINUS); Term(); ExprPrime(); break; default: return; } } void Term() { Factor(); TermPrime(); } void TermPrime() { switch(lookahead) { case TIMES: match(TIMES); Factor(); TermPrime(); break; case DIV: match(DIV); Factor(); TermPrime(); break; default: return; } } void Factor() { switch(lookahead) { case ID : match(ID); break; case NUMBER: match(NUMBER); break; default: error(); }

Recursive Descent Parsing: Another Grammar 1 S if E then S else S 2 | begin S L 3 | print E 4 L end 5 | ; S L 6 E num = num void main() { lookahead=getNextToken(); S(); match(EOF); } void S() { switch(lookahead) { case IF: match(IF); E(); match(THEN); S(); match(ELSE); S(); break; case BEGIN: matvh(BEGIN); S(); L(); break; case PRINT: match(PRINT); E(); break; default: error(); } } void E() { match(NUM); match(EQ); match(NUM); } void L() { switch(lookahead) { case END: match(END); break; case SEMI: match(SEMI); S(); L(); break; default: error(); } }

Example Execution For Input: if 2=2 then print 5=5 else print 1=1 main: call S(); S1: find the production for (S, IF) : Sif E then S else S S1: match(IF); S1: call E(); E1: find the production for (E, NUM): Enum = num E1: match(NUM); match(EQ); match(NUM); E1: return from E1 to S1 S1: match(THEN); S1:call S(); S2: find the production for (S, PRINT):Sprint E S2: match(PRINT); S2: call E(); E2: find the production for (E, NUM): Enum = num E2: match(NUM); match(EQ); match(NUM); E2: return from E2 to S2 S2: return from S2 to S1 S1: match(ELSE); S1: call S(); S3: find the production for (S, PRINT): Sprint E S3: match(PRINT); S3: call E(); E3: find the production for (E, NUM): Enum = num E3: match(NUM); match(EQ); match(NUM); E3: return from E2 to S3 S3: return from S3 to S1 S1: return from S1 to main main: match(EOF); return success;

Another Approach: Stack-Based Table-Driven Parsing The parsing table • A two dimensional array M[A, a]  gives a production • A: a nonterminal symbol • a: a terminal symbol • What does it mean? • If top of the stack is A and the lookahead symbol is a then we apply the production M[A, a] IF BEGIN PRINT END SEMI NUM S Sif E then S else SSbegin S LSprint E LL endL ; S L EEnum = num

Table-driven Parsers A table-driven parser looks like Parsing tables canbe built automatically! Stack source code Scanner Table-driven Parser IR Parsing Table grammar Parser Generator

Table-Driven Predictive Parsing Algorithm • Push the end-of-file symbol ($) and the start symbol onto the stack • Consider the symbol X on the top of the stack and lookahead symbol a • If X = a = $ announce successful parse and halt • If X = a $ pop X off the stack and advance the input pointer to the next input symbol • If X is a nonterminal, look at the production M[X, a] • If there is no such production (M[X, a] = error), then call an error routine • If M[X, a] is a production X Y1 Y2 ... Yk ,then pop X and push Yk , Yk-1 , ..., Y1 onto the stack with Y1 on top • If none of the cases above apply, then call an error routine

Table-Driven Predictive Parsing Algorithm Push($); // $ is the end-of-file symbol Push(S); // S is the start symbol of the grammar lookahead = get_next_token(); repeat X = top_of_stack(); if (X is a terminal or X == $) then if (X == lookahead) then pop(X); lookahead = get_next_token(); else error(); else // X is a non-terminal if ( M[X, lookahead] == X Y1 Y2 ... Yk) then pop(X); push(Yk); push(Yk-1); ... push(Y1); else error(); until (X = $)

Recursive Descent Parser On: if 2=2 then print 5=5 else print 1=1 main: call S(); S1: find the production for (S, IF) : Sif E then S else S S1: match(IF); S1: call E(); E1: find the production for (E, NUM): Enum = num E1: match(NUM); match(EQ); match(NUM); E1: return from E1 to S1 S1: match(THEN); S1:call S(); S2: find the production for (S, PRINT):Sprint E S2: match(PRINT); S2: call E(); E2: find the production for (E, NUM): Enum = num E2: match(NUM); match(EQ); match(NUM); E2: return from E2 to S2 S2: return from S2 to S1 S1: match(ELSE); S1: call S(); S3: find the production for (S, PRINT): Sprint E S3: match(PRINT); S3: call E(); E3: find the production for (E, NUM): Enum = num E3: match(NUM); match(EQ); match(NUM); E3: return from E2 to S3 S3: return from S3 to S1 S1: return from S1 to main main: match(EOF); return success;

Table Driven Parser On: if 2=2 then print 5=5 else print 1=1$ Stacklookahead Parse-table lookup $S IF M[S,IF]:Sif E then S else S $S,ELSE,S,THEN,E,IF IF $S,ELSE,S,THEN,E NUM M[E,NUM]: Enum = num $S,ELSE,S,THEN,NUM,EQ,NUM NUM $S,ELSE,S,THEN,NUM,EQ EQ $S,ELSE,S,THEN,NUM NUM $S,ELSE,S,THEN THEN $S,ELSE,S PRINT M[S,PRINT]: Sprint E $S,ELSE,E,PRINT PRINT $S,ELSE,E NUM M[E,NUM]: Enum = num $S,ELSE,NUM,EQ,NUM NUM $S,ELSE,NUM,EQ EQ $S,ELSE,NUM NUM $S,ELSE ELSE $S PRINT M[S,PRINT]: Sprint E $E,PRINT PRINT $E NUM M[E,NUM]: Enum = num $NUM,EQ,NUM NUM $NUM,EQ EQ $NUM NUM $ $ report success!

How to Build Parse Tables? FIRST Sets For a string of grammar symbols  define FIRST() as The set of tokens that appear as the first symbol in some string that derives from  If  *, then  is in FIRST() To construct FIRST(X) for a grammar symbol X, apply the following rules until no more symbols can be added to FIRST(X) If X is a terminal FIRST(X) is {X} If X  is a production then  is in FIRST(X) If X is a nonterminal and X Y1Y2 ... Yk is a production then put every symbol in FIRST(Y1) other than  to FIRST(X) If X is a nonterminal and X Y1Y2 ... Yk is a production, then put terminal a in FIRST(X) if a is in FIRST(Yi) and  is in FIRST(Yj) for all 1  j  i If X is a nonterminal and X Y1Y2 ... Yk is a production, then put  in FIRST(X) if  is in FIRST(Yi) for all 1  i  k

Computing FIRST Sets for Strings of Symbols To construct the FIRST set for any string of grammar symbols X1X2 ... Xn(given the FIRST sets for symbols X1 , X2 , ... Xn) apply the following rules. FIRST(X1X2 ... Xn) contains: • Any symbol in FIRST(X1) other than  • Any symbol in FIRST(Xi) other than , if  is in FIRST(Xj) for all 1  j  i •  , if  is in FIRST(Xj) for all 1  i  n

FIRST Sets 1S Expr 2ExprTerm Expr 3Expr+ Term Expr 4|- Term Expr 5| 6TermFactor Term 7Term* Factor Term 8|/ Factor Term 9| 10Factor num 11|id Symbol FIRST S {num, id} Expr {num, id} Expr { , +, - } Term {num, id} Term { , *, / } Factor {num, id} num {num} id {id} + {+} - {-} * {*} / {/}

How to build Parse Tables? FOLLOW Sets For a non-terminal symbol A, define FOLLOW(A) as: The set of terminal symbols that can appear immediately to the right of A in some sentential form To construct FOLLOW(A) for a non-terminal symbol A apply the following rules until no more symbols can be added to FOLLOW(A) • Place $ in FOLLOW(S) ($ is the end-of-file symbol, S is the start symbol) • If there is a production A   B , then everything in FIRST() except  is placed in FOLLOW(B) • If there is a production A   B, then everything in FOLLOW(A) is placed in FOLLOW(B) • If there is a production A   B , and  is in FIRST() then everything in FOLLOW(A) is placed in FOLLOW(B)

FOLLOW Sets 1S Expr 2ExprTerm Expr 3Expr+ Term Expr 4|- Term Expr 5| 6TermFactor Term 7Term* Factor Term 8|/ Factor Term 9| 10Factor num 11|id Symbol FOLLOW S { $ } Expr { $ } Expr { $ } Term { $, +, - } Term {$, +, - } Factor { $, +, -, *, / }

LL(1) Parse Table Construction • For all productions A , perform the following steps: • For each terminal symbol a in FIRST(), add A  to M[A, a] • If  is in FIRST(), then add A   to M[A, b] for each terminal symbol b in FOLLOW(A) and add A   to M[A, $] if $ is in FOLLOW(A) • Set all the undefined entries in M to error

Grammar: 1S Expr 2ExprTerm Expr 3Expr+ Term Expr 4|- Term Expr 5| 6TermFactor Term 7Term* Factor Term 8|/ Factor Term 9| 10Factor num 11|id LL(1) Parse table: id num + - * / $ SSE SE E ET EET E E’E + T E E - T E E  T TF T TF T T’ T’  T’  T * F T T / F T T’  F F  id F  num

LL(1) grammars Left-to-right scan of the input, Leftmost derivation, 1-token lookahead Two alternative definitions of LL(1) grammars: • A grammar G is LL(1) if there are no multiple entries in its LL(1) parse table • A grammar G is LL(1) if for each set of its productions A 1 | 2 | ... | n • FIRST(1), FIRST(2), ..., FIRST(n), are all pairwise disjoint • If i *  ,then FIRST (j)  FOLLOW (A) =  for all 1  i  n, i  j

Parsing

Parsing

Presentation Transcript

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing

Parsing