Recursive Descent Algorithm Guide: Predictive Parser

The Recursive Descent Algorithm: a useful predictive parser for many applications. Under construction (Nov 16).

The Recursive Descent Algorithm
• The recursive descent algorithm directly implements a grammar written as EBNF rules.
• The rules should not contain left recursion.
• There is one function (method) for each EBNF rule.
• Each method parses the input corresponding to its EBNF rule and returns a value. The value may be:
    • a node of the abstract syntax tree of the input
    • a value computed by evaluating the input (e.g. a calculator)
• Recursive descent is a predictive parser.
• Limited look-ahead ("peek" at the next token) can be incorporated.

Recursive-descent intro (0)
Grammar:
    expr   => expr + term | expr - term | term
    term   => term * factor | factor
    factor => '(' expr ')' | number

Recursive-descent intro (0.5)
Grammar in EBNF (no "self-recursion"):
    expr   => term { (+ | -) term }
    term   => factor { * factor }
    factor => '(' expr ')' | number

Recursive-descent intro (1)
• Grammar:
    expr   => term { + term }
    term   => factor { * factor }
    factor => '(' expr ')' | number
• Generic C code, for concept only (don't use this):
    expr() {
        term();
        while (token == '+') { match('+'); term(); }
    }
    term() {
        factor();
        while (token == '*') { match('*'); factor(); }
    }

Recursive-descent intro (2)
• Grammar:
    expr   => term { + term }
    term   => factor { * factor }
    factor => '(' expr ')' | number
• Factor and number:
    factor() {
        if (token == '(') { match('('); expr( ); match(')'); }
        else number( );
    }
    number() {
        if ( isNumber(token) ) { add_to_parse_tree(); nextToken( ); }
        else error("invalid number");
    }

Recursive-descent intro (3)
• match(value) is a utility that requires a match: if the current token matches the argument, consume the token and get the next token.
• Otherwise print an error ... and then what?
    void match(char what) {
        if ( *token == what ) {
            nextToken( );
        } else {
            /* 'printf' style error function */
            error("expected %c got %s", what, token);
        }
    }

Where's the token?
• In this algorithm, token is a global variable that always contains the next unread token.
• nextToken() sets the token variable and returns true if there are more tokens.
    boolean nextToken( ) {
        token = scanner( );
        return ( token != EOF );
    }
• Another utility function is match(value):
    1) if value matches token, get a new token
    2) if value doesn't match, raise an error condition.

Where's the output?
• In the generic algorithm, the result is a global variable.
• The methods must either return a value or accumulate a value as a side effect.
• Rules which have terminal values should return the terminal value.
• Example for the rule factor => '(' expr ')' | number:
    number() {
        if ( isNumber(token) ) {
            // add token to the parse tree
            // or return a value
        }
        else error("invalid number");
    }
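The "return a value" option can be illustrated with a parser whose methods each return a tree node. This is only a sketch under assumptions: the class name AstSketch, the Node class, and the prefix-style toString output are all invented for illustration, tokens are read character by character, and numbers are unsigned integers. It follows the EBNF grammar from the intro slides (+, -, * and parentheses).

```java
/** Sketch: each grammar method returns a node of the syntax tree (hypothetical names). */
class AstSketch {
    /** A tree node: a number leaf, or an operator with two children. */
    static class Node {
        final String label;      // operator symbol, or the text of a number
        final Node left, right;  // both null for a leaf
        Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
        @Override public String toString() {
            return left == null ? label : "(" + label + " " + left + " " + right + ")";
        }
    }

    private final String input;  // input with spaces removed (a simplification)
    private int pos = 0;         // index of the next unread character

    private AstSketch(String input) { this.input = input.replace(" ", ""); }

    static Node parse(String text) {
        AstSketch p = new AstSketch(text);
        Node tree = p.expr();
        if (p.pos != p.input.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return tree;
    }

    private char peek() { return pos < input.length() ? input.charAt(pos) : '\0'; }

    // expr => term { (+ | -) term }
    private Node expr() {
        Node node = term();
        while (peek() == '+' || peek() == '-') {
            String op = String.valueOf(input.charAt(pos++));
            node = new Node(op, node, term());   // left operand is the tree built so far
        }
        return node;
    }

    // term => factor { * factor }
    private Node term() {
        Node node = factor();
        while (peek() == '*') {
            pos++;
            node = new Node("*", node, factor());
        }
        return node;
    }

    // factor => '(' expr ')' | number
    private Node factor() {
        if (peek() == '(') {
            pos++;
            Node inner = expr();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;
            return inner;
        }
        return number();
    }

    // A terminal rule returns the terminal value, wrapped in a leaf node.
    private Node number() {
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("invalid number at " + pos);
        return new Node(input.substring(start, pos), null, null);
    }
}
```

Because the loop in expr() always uses the tree built so far as the left operand, AstSketch.parse("8 - 2 - 1") groups as (8 - 2) - 1, which is the left-associative reading the original left-recursive grammar describes.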

Recursive Descent Example (1)
Let's look at recursive descent code for a calculator. We will modify the generic algorithm so that each function returns a double value.
    input:  expr '\n'
    expr:   term { (+|-) term }
    term:   factor { (*|/) factor }
    factor: '(' expr ')' | number

Recursive Descent Example (1)
Example: here is a modified expr( ) function that returns a double.
Grammar rule: expr: term { (+|-) term }
    double expr() {
        double expr = term();
        while ( token == '+' || token == '-' ) {
            if ( token == '+' ) { match('+'); expr = expr + term(); }
            else { match('-'); expr = expr - term(); }
        }
        return expr;
    }

Recursive Descent Example (2)
The rule for factor is more interesting: we must check the first token to decide which alternative to use.
Grammar rule: factor: '(' expr ')' | number
    double factor() {
        double fact;
        if ( token == '(' ) {
            nextToken( );
            fact = expr( );
            match( ')' );
            return fact;
        } else {
            fact = number( );
            return fact;
        }
    }

Recursive Descent Example (3)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "2"
Call stack:
    input line   token = nextToken(); ans = expr( );

Recursive Descent Example (4)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "2"
Call stack:
    input line   ans = expr( );
    expr         expr = term( );
Code:
    expr( ) {
        expr = term( );
        while ( token=='+' || token=='-' ) {
            if ( token=='+' ) { match('+'); expr = expr + term( ); }
            else { match('-'); expr = expr - term( ); }
        }
        return expr;
    }

Recursive Descent Example (5)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "2"
Call stack:
    input line   ans = expr( );
    expr         expr = term( );
    term         term = factor( );
Code:
    term( ) {
        term = factor( );
        while ( token=='*' || token=='/' ) {
            if ( token=='*' ) { match('*'); term = term * factor( ); }
            else { match('/'); term = term / factor( ); }
        }
        return term;
    }

Recursive Descent Example (6)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "*"
Call stack:
    input line   ans = expr( );
    expr         expr = term( );
    term         term = factor( );
    factor       fact = number( ); /* token = '*' */ return fact
Code:
    factor( ) {
        if ( token=='(' ) { match('('); fact = expr( ); match(')'); }
        else { fact = number( ); }
        return fact;
    }

Recursive Descent Example (7)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "3"
Call stack:
    input line   ans = expr( );
    expr         expr = term( );
    term         term = 2; match('*'); term = term * factor( );
Code:
    term( ) {
        term = factor( );
        while ( token=='*' || token=='/' ) {
            if ( token=='*' ) { match('*'); term = term * factor( ); }
            else { match('/'); term = term / factor( ); }
        }
        return term;
    }

Recursive Descent Example (8)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "+"
Call stack:
    input line   ans = expr( );
    expr         expr = term( );
    term         term = term * factor( );
    factor       fact = number( ); /* = 3, token = '+' */ return fact
Code:
    factor( ) {
        if ( token=='(' ) { match('('); fact = expr( ); match(')'); }
        else { fact = number( ); }
        return fact;
    }

Recursive Descent Example (9)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "+"
Call stack:
    input line   ans = expr( );
    expr         expr = term( );
    term         term = term * 3; return term
Code:
    term( ) {
        term = factor( );
        while ( token=='*' || token=='/' ) {
            if ( token=='*' ) { match('*'); term = term * factor( ); }
            else { match('/'); term = term / factor( ); }
        }
        return term;
    }

Recursive Descent Example (10)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "("
Call stack:
    input line   ans = expr( );
    expr         expr = 6; token = '+'; match('+'); expr = expr + term( );
Code:
    expr( ) {
        expr = term( );
        while ( token=='+' || token=='-' ) {
            if ( token=='+' ) { match('+'); expr = expr + term( ); }
            else { match('-'); expr = expr - term( ); }
        }
        return expr;
    }

Recursive Descent Example (11)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "("
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
Code:
    term( ) {
        term = factor( );
        while ( token=='*' || token=='/' ) {
            if ( token=='*' ) { match('*'); term = term * factor( ); }
            else { match('/'); term = term / factor( ); }
        }
        return term;
    }

Recursive Descent Example (12)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "4"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
    factor       match('('); fact = expr( );
Code:
    factor( ) {
        if ( token=='(' ) { match('('); fact = expr( ); match(')'); }
        else { fact = number( ); }
        return fact;
    }

Recursive Descent Example (13)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "4"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
    factor       fact = expr( );
    expr         expr = term( );
Code:
    expr( ) {
        expr = term( );
        while ( token=='+' || token=='-' ) {
            if ( token=='+' ) { match('+'); expr = expr + term( ); }
            else { match('-'); expr = expr - term( ); }
        }
        return expr;
    }

Recursive Descent Example (14)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "-"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
    factor       fact = expr( );
    expr         expr = term( );
    term         term = factor( );
    factor       fact = number( ); /* = 4, token = "-" */

Recursive Descent Example (15)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "-", then token = "5"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
    factor       fact = expr( );
    expr         match('-'); expr = expr - term( );
Code:
    expr( ) {
        expr = term( );
        while ( token=='+' || token=='-' ) {
            if ( token=='+' ) { match('+'); expr = expr + term( ); }
            else { match('-'); expr = expr - term( ); }
        }
        return expr;
    }

Recursive Descent Example (16)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "5"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
    factor       fact = expr( );
    expr         expr = expr - term( );
    term         term = factor( );
    factor       fact = number( ); /* = 5, token = ")" */

Recursive Descent Example (16, continued)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "/"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = factor( );
    factor       fact = expr( ); match(')'); return fact;
    expr         expr = 4 - 5; return expr
    term         term = 5; return term;
    factor       return 5
Code:
    term( ) {
        term = factor( );
        while ( token=='*' || token=='/' ) {
            if ( token=='*' ) { ... }

Recursive Descent Example (17)
Input: 2 * 3 + ( 4 - 5 ) / 6
Progress: token = "6"
Call stack:
    input line   ans = expr( );
    expr         expr = expr + term( );
    term         term = -1; match('/'); term = term / factor( );
Code:
    term( ) {
        term = factor( );
        while ( token=='*' || token=='/' ) {
            if ( token=='*' ) { match('*'); term = term * factor( ); }
            else { match('/'); term = term / factor( ); }
        }
        return term;
    }
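To check the trace by running it, the calculator functions can be collected into one small Java class. This is a sketch rather than the course's implementation: the class name TraceCalc, the helper tokenIs, and the character-level nextToken are assumptions, and the global token of the C slides becomes a private field. Errors are reported with an unchecked exception to keep the sketch short.

```java
/** Sketch of the traced calculator; the token is kept as parser state (hypothetical class). */
class TraceCalc {
    private final String input;
    private int pos = 0;     // index of the next unread character
    private String token;    // the next unread token, or null at end of input

    private TraceCalc(String input) { this.input = input; nextToken(); }

    /** Corresponds to the "input line" frame in the trace: ans = expr(). */
    static double evaluate(String line) {
        TraceCalc calc = new TraceCalc(line);
        double ans = calc.expr();
        if (calc.token != null)
            throw new IllegalStateException("unexpected token " + calc.token);
        return ans;
    }

    /** Reads the next token (a number or a one-character symbol); false at end of input. */
    private boolean nextToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) { token = null; return false; }
        int start = pos;
        if (Character.isDigit(input.charAt(pos))) {
            while (pos < input.length()
                    && (Character.isDigit(input.charAt(pos)) || input.charAt(pos) == '.')) pos++;
        } else {
            pos++;
        }
        token = input.substring(start, pos);
        return true;
    }

    private boolean tokenIs(String what) { return what.equals(token); }

    /** Requires the current token to equal 'what', then consumes it. */
    private void match(String what) {
        if (!tokenIs(what))
            throw new IllegalStateException("expected " + what + " got " + token);
        nextToken();
    }

    // expr: term { (+|-) term }
    private double expr() {
        double value = term();
        while (tokenIs("+") || tokenIs("-")) {
            if (tokenIs("+")) { match("+"); value = value + term(); }
            else { match("-"); value = value - term(); }
        }
        return value;
    }

    // term: factor { (*|/) factor }
    private double term() {
        double value = factor();
        while (tokenIs("*") || tokenIs("/")) {
            if (tokenIs("*")) { match("*"); value = value * factor(); }
            else { match("/"); value = value / factor(); }
        }
        return value;
    }

    // factor: '(' expr ')' | number
    private double factor() {
        if (tokenIs("(")) {
            match("(");
            double value = expr();
            match(")");
            return value;
        }
        return number();
    }

    private double number() {
        if (token == null || !Character.isDigit(token.charAt(0)))
            throw new IllegalStateException("invalid number: " + token);
        double value = Double.parseDouble(token);
        nextToken();
        return value;
    }
}
```

For the traced input, TraceCalc.evaluate("2 * 3 + ( 4 - 5 ) / 6") computes 6 + (-1)/6, about 5.8333.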

Imperative Approach to Parsing
• In the generic algorithm, the token is a global variable, and the results of the parse are a side effect (a change to global variables or structures).
• bison and flex operate this way, too.
• Such programs are difficult to understand and maintain.
• There is no error recovery in the generic algorithm.
    /* yylex uses global variables / constants. */
    int yylex( ) {
        ...
        if ( isdigit(c) ) {
            ungetc(c, stdin);
            scanf("%lf", &yylval);
            return INT;
        }
    }

O-O Approach to Parsing
• In the O-O approach, we can return an object, which allows a scanner and parser without global variables.
• First, let's look at the overall design.
    <<interface>> Iterator
        hasNext()
        next()
    <<enum>> TokenType
        regex : Pattern
        IDENTIFIER, OPERATOR, NUMBER
    Scanner
        instream: InputStream
        token: Token
        hasNext( ) : boolean
        next( ) : Token
    Parser
        parseTree: TreeSet
        token: Token
        scanner: Iterator
        expression( ) : Node
        term( ) : Node
        factor( ) : Node
        match( String ) : boolean
    Token
        type
        value

O-O Scanner
• The Scanner should provide two services: test for more tokens and return the next token.
• In this view, a Scanner looks like an Iterator<Token>.
• A "token" has both a type and a value.
    /** Token class */
    public class Token {
        Type type;             /* consider an enumeration */
        public Object value;   /* can be anything */
        public Token(Type type, Object value) {...}
        public Object getValue( ) { ... }
    }
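One way to realize a scanner that "looks like an Iterator<Token>" is sketched here. The class name ScannerSketch, the regular expression, and the set of operator characters are assumptions made for illustration. Token and TokenType are nested inside the sketch to keep it self-contained, and the scanner reads from a String rather than an InputStream.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a scanner exposed as an Iterator of tokens (hypothetical names). */
class ScannerSketch implements Iterator<ScannerSketch.Token> {
    enum TokenType { NUMBER, IDENTIFIER, OPERATOR }

    /** A token has both a type and a value. */
    static class Token {
        final TokenType type;
        final Object value;   // Double for NUMBER, String otherwise
        Token(TokenType type, Object value) { this.type = type; this.value = value; }
    }

    // One named group per token type; leading whitespace is skipped.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<num>\\d+(?:\\.\\d+)?)|(?<id>[A-Za-z_]\\w*)|(?<op>[-+*/^()=]))");

    private final String text;
    private final Matcher matcher;
    private int pos = 0;      // where the next token search starts

    ScannerSketch(String text) {
        this.text = text;
        this.matcher = TOKEN.matcher(text);
    }

    /** More tokens remain if anything other than whitespace is left. */
    @Override public boolean hasNext() {
        return !text.substring(pos).trim().isEmpty();
    }

    @Override public Token next() {
        if (!hasNext()) throw new NoSuchElementException();
        matcher.region(pos, text.length());
        if (!matcher.lookingAt())
            throw new IllegalStateException("unrecognized input near offset " + pos);
        pos = matcher.end();
        if (matcher.group("num") != null)
            return new Token(TokenType.NUMBER, Double.valueOf(matcher.group("num")));
        if (matcher.group("id") != null)
            return new Token(TokenType.IDENTIFIER, matcher.group("id"));
        return new Token(TokenType.OPERATOR, matcher.group("op"));
    }
}
```

A parser can then hold an Iterator<Token> and never touch the input text directly, which is what removes the need for a global token variable.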

O-O Parser
• The Parser implements the parsing algorithm.
• The result is either a parse tree or a value (calculator application).
• Use an attribute to represent the next token.
    /** Parser class */
    public class Parser {
        Iterator<Token> scanner;
        private Token token;
        private TreeNode result;   /* parse tree */
        TreeNode expression( ) { ... }
        TreeNode term( ) { ... }
        TreeNode factor( ) { ... }
        boolean match( String what ) { ... }
        boolean match( Type what ) { ... }
    }

O-O Parser for Calculator
• For a calculator, the parser can compute the result.
• It can use a primitive data type for expression, etc.
    /** Parser class */
    public class Parser {
        Iterator<Token> scanner;
        private Token token;
        private double result;
        double expression( ) { ... }
        double term( ) { ... }
        double factor( ) { ... }
        boolean match( String what ) {...}
    }

Observation: match
• In the generic algorithm, the token is almost always tested before calling match.
• Eliminate the redundancy by redefining match(value) to return true if the token matches and false otherwise.
• If it matches, then consume the token.
    private boolean match( String what ) {
        if ( ! (token.value instanceof String) ) return false;
        if ( what.equals( (String)(token.value) ) ) {
            token = scanner.next( );
            return true;
        }
        return false;
    }

O-O Parser for Calculator (2)
• Example method: expression
• EBNF: expr ::= term { (+ | -) term }
    private double expression( ) {
        double result = term( );
        while( true ) {
            if ( match("+") ) result += term( );
            else if ( match("-") ) result -= term( );
            else break;   /* why not error( )? */
        }
        return result;
    }
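The boolean match idea can be tried out in a compact, self-contained form. In this sketch the tokens are plain strings supplied by an Iterator<String> instead of Token objects, and the names MatchParserSketch, advance, and evaluate are hypothetical. The evaluate helper assumes tokens are separated by spaces so that no real scanner is needed.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Sketch of a parser built around a boolean match() (hypothetical names). */
class MatchParserSketch {
    private final Iterator<String> scanner;
    private String token;   // next unread token, or null at end of input

    MatchParserSketch(Iterator<String> scanner) {
        this.scanner = scanner;
        advance();
    }

    /** Convenience for the sketch: tokens must be separated by spaces. */
    static double evaluate(String spaced) {
        Iterator<String> tokens = Arrays.asList(spaced.trim().split("\\s+")).iterator();
        MatchParserSketch parser = new MatchParserSketch(tokens);
        double result = parser.expression();
        if (parser.token != null)
            throw new IllegalStateException("unexpected token: " + parser.token);
        return result;
    }

    private void advance() { token = scanner.hasNext() ? scanner.next() : null; }

    /** If the token equals 'what', consume it and return true; otherwise leave it alone. */
    private boolean match(String what) {
        if (!what.equals(token)) return false;
        advance();
        return true;
    }

    // expr ::= term { (+ | -) term }
    private double expression() {
        double result = term();
        while (true) {
            if (match("+")) result += term();
            else if (match("-")) result -= term();
            else break;   // not an error: the token may belong to the caller, e.g. ")"
        }
        return result;
    }

    // term ::= factor { (* | /) factor }
    private double term() {
        double result = factor();
        while (true) {
            if (match("*")) result *= factor();
            else if (match("/")) result /= factor();
            else break;
        }
        return result;
    }

    // factor ::= '(' expr ')' | number
    private double factor() {
        if (match("(")) {
            double result = expression();
            if (!match(")")) throw new IllegalStateException("missing right parenthesis");
            return result;
        }
        if (token == null) throw new IllegalStateException("unexpected end of input");
        double value = Double.parseDouble(token);   // NumberFormatException if not a number
        advance();
        return value;
    }
}
```

The break in expression() also suggests an answer to the "why not error( )?" comment: a token that is neither "+" nor "-" may be perfectly valid for whoever called expression(), so only the caller can decide whether it is an error.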

O-O Parser : Top-Level
• What is the top-level routine of the parser?
• Look at standard bison code for inspiration:
    %%
    /* Bison grammar rules */
    input : /* empty input */
          | input line
          ;
    line  : expr '\n' { output( $1 ); }
          ;
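Read as a program, the two bison rules say "repeat for every line: parse an expression, then output it". A possible Java shape for that top level is sketched below. The class TopLevelSketch and the exprParser parameter are hypothetical: the expression parser is passed in as a function so the sketch stays independent of any particular parser class, the results are collected in a list instead of being printed, and blank lines are skipped, which the bison grammar itself does not allow for.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of a top-level routine modeled on the bison rules (hypothetical names). */
class TopLevelSketch {
    /**
     * input : (empty) | input line   becomes a loop over the lines of the source
     * line  : expr '\n'              becomes one call to the expression parser
     */
    static List<Double> run(Reader source, ToDoubleFunction<String> exprParser) {
        List<Double> outputs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;         // assumption: ignore blank lines
                outputs.add(exprParser.applyAsDouble(line)); // stands in for output( $1 )
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outputs;
    }
}
```

The left-recursive rule "input : input line" turns into a loop in the same way that repetition in EBNF turns into a while loop inside expr and term.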

Parsing Errors
• How are you going to handle parsing errors?
• You might have many levels of function calls...
    input line   result = expr( );
    expr         expr = term( ) { +|- term( ) };
    term         term = factor( ) { *|/ factor( ) };
    factor       factor = '(' expr() ')' | number() ...;
    expr
    term
    factor       <- parse error found here
Using recursive descent, parse errors are usually detected at the bottom of the tree: in factor, number, etc.

Parsing Errors
• If you set an error flag or return an error result, then all the methods must check for this condition...
    input line   if ( error ) print "parse error";
    expr         if ( error ) return /* what value? */;
    term         if ( error ) return /* what value? */;
    factor       if ( error ) return /* what value? */;
    expr         if ( error ) return /* what value? */;
    term         if ( error ) return /* what value? */;
    factor       <- parse error found here
This error checking will make your methods longer and harder to understand.

Throwing an Exception
• Your code will be simpler if the methods simply throw an exception and let the top-most method catch it.
    input line   try { result = expr( ); } catch (ParseException e) { /*error*/ }
    expr         expr( ) throws ParseException { ... }
    term         term( ) throws ParseException { ... }
    factor       factor( ) throws ParseException { ... }
    expr         expr( ) throws ParseException { ... }
    term         term( ) throws ParseException { ... }
    factor       throw new ParseException( )
Let someone else handle it!

Using Java's ParseException
• Java has a ParseException class you can use: java.text.ParseException
• The constructor requires two parameters: new ParseException("error message", offset);
• Example:
    number( ) {
        /* parse a number */
        whitespace();
        token = tokenizer.next();
        if ( token.type != TokenType.NUMBER )
            throw new ParseException( "invalid number", cptr );
        ...
    }

Defining your own ParseException
• You can define a new Exception type for your own use.
    import java.io.IOException;
    class ParseException extends IOException {
        /* constructors */
        ParseException() { super("Parse Error"); }
        ParseException(String msg) { super(msg); }
        ParseException(String msg, int column) {
            super(msg + " in column " + column);
        }
    }

Using ParseException
• You should try to return useful error messages, such as...
    factor( ) {
        if ( match('(') ) {
            result = expr( );
            if ( ! match(')') )
                throw new ParseException("missing right parenthesis");
        }
        ...
    }
• The getMessage( ) method returns the error message...
    try {
        result = expr( );
    } catch(ParseException e) {
        println( e.getMessage() );
    }
• Including the column number in error messages can be helpful.
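The pieces from the exception slides fit together as in the following sketch, which uses java.text.ParseException and its error offset to report where the parse failed. The class name ErrorReportingSketch, the character-level match, and the exact message format are assumptions. Numbers are limited to unsigned integers to keep the sketch short.

```java
import java.text.ParseException;

/** Sketch: parse errors thrown deep in the recursion, caught once at the top (hypothetical names). */
class ErrorReportingSketch {
    private final String input;
    private int pos = 0;   // offset of the next unread character

    private ErrorReportingSketch(String input) { this.input = input; }

    /** Top-level method: the only place that catches the exception. */
    static String evaluate(String line) {
        ErrorReportingSketch p = new ErrorReportingSketch(line);
        try {
            double result = p.expr();
            p.skipSpaces();
            if (p.pos < line.length()) throw new ParseException("unexpected input", p.pos);
            return String.valueOf(result);
        } catch (ParseException e) {
            return e.getMessage() + " at offset " + e.getErrorOffset();
        }
    }

    private void skipSpaces() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
    }

    /** Consumes 'what' and returns true if it is the next non-space character. */
    private boolean match(char what) {
        skipSpaces();
        if (pos < input.length() && input.charAt(pos) == what) { pos++; return true; }
        return false;
    }

    // The intermediate methods only declare the exception; none of them checks an error flag.
    private double expr() throws ParseException {
        double result = term();
        while (true) {
            if (match('+')) result += term();
            else if (match('-')) result -= term();
            else return result;
        }
    }

    private double term() throws ParseException {
        double result = factor();
        while (true) {
            if (match('*')) result *= factor();
            else if (match('/')) result /= factor();
            else return result;
        }
    }

    private double factor() throws ParseException {
        if (match('(')) {
            double result = expr();
            if (!match(')')) throw new ParseException("missing right parenthesis", pos);
            return result;
        }
        return number();
    }

    private double number() throws ParseException {
        skipSpaces();
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        if (start == pos) throw new ParseException("invalid number", pos);
        return Double.parseDouble(input.substring(start, pos));
    }
}
```

With this sketch, evaluate("(1 + 2") returns "missing right parenthesis at offset 6" and evaluate("2 * x") returns "invalid number at offset 4", while a valid line such as "2 * 3" returns "6.0".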

Parsing Unary Minus Sign
• Parsing negative numbers and unary minus can also be tricky. The following are valid expressions in most languages:
    sum = sum + -1;
    sum = sum - -2;
    sum = sum * -x;
• The GNU C compiler (gcc) allows a space after the unary "-":
    sum = sum - - 2;
• Exponentiation has higher precedence than unary minus, so it should be incorporated in a rule at the bottom of your grammar rules:
    -2 ^ 3 means - (2^3)
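One possible way to encode these precedence rules is to add two rules below term, as in the following sketch. The grammar shown in the comments is an assumption, not the grammar the course goes on to use, and the class name UnaryMinusSketch is hypothetical. Spaces are stripped before parsing, which is a simplification (it would merge "1 2" into "12").

```java
/** Sketch: unary minus and right-associative '^' in recursive descent (assumed grammar). */
class UnaryMinusSketch {
    private final String input;   // input with spaces removed
    private int pos = 0;

    private UnaryMinusSketch(String input) { this.input = input.replace(" ", ""); }

    static double evaluate(String text) {
        UnaryMinusSketch p = new UnaryMinusSketch(text);
        double result = p.expr();
        if (p.pos != p.input.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return result;
    }

    private boolean match(char what) {
        if (pos < input.length() && input.charAt(pos) == what) { pos++; return true; }
        return false;
    }

    // expr => term { (+|-) term }
    private double expr() {
        double result = term();
        while (true) {
            if (match('+')) result += term();
            else if (match('-')) result -= term();
            else return result;
        }
    }

    // term => factor { (*|/) factor }
    private double term() {
        double result = factor();
        while (true) {
            if (match('*')) result *= factor();
            else if (match('/')) result /= factor();
            else return result;
        }
    }

    // factor => '-' factor | power      (unary minus binds more loosely than '^')
    private double factor() {
        return match('-') ? -factor() : power();
    }

    // power => primary [ '^' factor ]   (recursing into factor makes '^' right-associative
    //                                    and allows a signed exponent such as 2 ^ -1)
    private double power() {
        double base = primary();
        return match('^') ? Math.pow(base, factor()) : base;
    }

    // primary => '(' expr ')' | number
    private double primary() {
        if (match('(')) {
            double value = expr();
            if (!match(')')) throw new IllegalArgumentException("missing ')' at " + pos);
            return value;
        }
        int start = pos;
        while (pos < input.length()
                && (Character.isDigit(input.charAt(pos)) || input.charAt(pos) == '.')) pos++;
        if (start == pos) throw new IllegalArgumentException("invalid number at " + pos);
        return Double.parseDouble(input.substring(start, pos));
    }
}
```

Under this grammar, "-2 ^ 3" evaluates to -8 because the minus is applied to the result of the power rule, and "5 - - 2" evaluates to 7 because the second minus is consumed by factor rather than by expr.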

What's Next?
Later we will add to the implementation...
• symbol table and assignments
    x = 3.5E7
    a = 5
    b = 0.1
    y = ( a*x + b ) / ( a*x - b )
• built-in functions
    y = sqrt( x )
• user defined functions
    function f(x) = a*x + b
    f(0.5)
