Syntax Description and Parsing in Programming Languages

Programming Languages (CS 550), Lecture 1 Summary: Grammars and Parsing. Jeremy R. Johnson

Theme • Context-free grammars provide a convenient formalism for describing the syntax of programming languages. Moreover, a parser (a recognizer of the valid strings of the grammar) can be constructed automatically from a context-free grammar; typically a few additional restrictions are imposed to make the parser easier to construct and more efficient. In this lecture we review grammars as a means of describing syntax and show how to construct a parser from a grammar, either by hand or using automated tools such as bison.

Outline • Motivating Example • Regular Expressions and Scanning • Context Free Grammars • Derivations and Parse Trees • Ambiguous Grammars • Parsing • Recursive Descent Parsing • Shift Reduce Parsing • Parser Generators • Syntax Directed Translation and Attribute Grammars

Motivating Example • Write a function, L = ReadList(), that reads an arbitrary ordered list and constructs a recursive data structure L to represent it • (a1,…,an), where each ai is an integer or, recursively, a list • Assume the input is a stream of tokens, e.g. '(', integer, ',', ')', and the variable Token contains the current token • Assume the functions • GetToken() – advance to the next token • Match(token) – if token = Token then GetToken() else error • M = Comp(e,L) – construct list M by inserting element e at the front of L, e.g. Comp(1,(2,3)) = (1,2,3) • M = Reverse(L) – M = the reverse of the list L.

List Grammar • < list > → ( < sequence > ) | ( ) • < sequence > → < listelement > , < sequence > | < listelement > • < listelement > → < list > | NUMBER

Derivation and Parse Tree
<list> → ( <sequence> )
→ ( <listelement> , <sequence> )
→ ( NUMBER , <sequence> ) = (1, <sequence>)
→ (1, <listelement> , <sequence>)
→ (1, NUMBER , <sequence>) = (1, 2, <sequence>)
→ (1, 2, <listelement>)
→ (1, 2, NUMBER) = (1,2,3)

Derivation and Parse Tree (parse tree for (1,2,3); children are indented under their parent)
<list>
    (
    <sequence>
        <listelement>
            1
        ,
        <sequence>
            <listelement>
                2
            ,
            <sequence>
                <listelement>
                    3
    )

Parsing and Scanning • Recognizing valid programming language syntax is split into two stages • scanning - group input character stream into tokens • parsing – group tokens into programming language structures • Tokens are described by regular expressions • Programming language structures by context free grammars • Separating into parsing and scanning simplifies both the description and recognition and makes maintenance easier
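The scanning stage for the list language can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token representation (a pair of kind and value) and the names TOKEN_RE and tokenize are hypothetical and do not come from the lecture.

```python
import re

# One regular expression per token class; the group names are illustrative.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<PUNCT>[(),])|(?P<SKIP>\s+)|(?P<ERROR>.)"
)

def tokenize(text):
    """Group a character stream into (kind, value) tokens."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":          # whitespace separates tokens but is not one
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        if kind == "NUMBER":
            tokens.append(("NUMBER", int(m.group())))
        else:                       # punctuation is its own token kind
            tokens.append((m.group(), None))
    return tokens
```

A parser built on top of this only ever sees token kinds such as NUMBER or '(' and never individual characters, which is the simplification the slide refers to.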

Regular Expressions • Alphabet = Σ • A language over Σ is a subset of the strings in Σ* • Regular expressions describe certain types of languages • ∅ is a regular expression • ε, denoting {ε}, is a regular expression • For each a in Σ, a denoting {a} is a regular expression • If r and s are regular expressions denoting languages R and S respectively then (r + s), (rs), and (r*) are regular expressions • E.g. 00, (0+1)*, (0+1)*00(0+1)*, 00*11*22*, (1+10)*
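The example expressions can be tried out with Python's re module, as a rough sketch. The only translation assumed here is that the union operator + of the slide is written | in Python's syntax; the helper name in_language is hypothetical.

```python
import re

# Slide notation -> Python notation ('+' for union becomes '|').
EXAMPLES = {
    "(0+1)*00(0+1)*": r"(0|1)*00(0|1)*",   # strings containing two consecutive 0s
    "00*11*22*": r"00*11*22*",             # one or more 0s, then 1s, then 2s
    "(1+10)*": r"(1|10)*",                 # every 0 is immediately preceded by a 1
}

def in_language(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None
```

For instance, in_language(EXAMPLES["(1+10)*"], "1101") holds because 1101 splits as 1, 10, 1, while "100" is rejected because its last 0 has no 1 in front of it.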

Grammar • Non-terminal symbols • Terminal symbols • Start symbol • Productions (rules) • Context-Free Grammars (rule can not depend on context) • Regular grammar

Exercise 1 • Show a derivation and corresponding parse tree, using the first expression grammar, for the string • A = B*(A+C) • Show that the second expression grammar is ambiguous by showing two distinct parse trees for the string • A = B+C*A

Parse Tree (for A = B * (A + C); children are indented under their parent)
<assign>
    <id> A
    =
    <expr>
        <id> B
        *
        <expr>
            (
            <expr>
                <id> A
                +
                <expr>
                    <id> C
            )

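Ambiguous Grammar • A = B + C * A has two distinct parse trees, both rooted at <assign> → <id> = <expr> • In the first tree the top <expr> is <expr> + <expr>, which groups the input as B + (C * A) • In the second tree the top <expr> is <expr> * <expr>, which groups the input as (B + C) * A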

Unambiguous Expression Grammar • <expr> → <expr> + <term> | <term> • <term> → <term> * <factor> | <factor> • <factor> → ( <expr> ) | <id>

Exercise 2 • Show the derivation and parse tree using the unambiguous expression grammar for • A = B+C*A • Convince yourself that this grammar is unambiguous (ideally give a proof)

Recursive Descent Parser
list() {
    match('(');
    if token ≠ ')' then
        seq();
    endif;
    match(')');
}

Recursive Descent Parser
seq() {
    elt();
    if token = ',' then
        match(',');
        seq();
    endif
}

Recursive Descent Parser
elt() {
    if token = '(' then
        list();
    else
        match(NUMBER);
    endif;
}
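The three procedures translate almost line for line into Python. The following is a sketch under some assumptions: the class name Parser, the helper recognizes, the end marker "$", and the regex-based scanner (which silently skips characters it does not know) are illustrative choices, not part of the lecture.

```python
import re

class Parser:
    """Recognizer for the list grammar, one method per non-terminal."""

    def __init__(self, text):
        # Simplified scanner: every integer becomes the token kind NUMBER.
        words = re.findall(r"\d+|[(),]", text)
        self.tokens = ["NUMBER" if w.isdigit() else w for w in words] + ["$"]
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.token != expected:
            raise SyntaxError(f"expected {expected}, found {self.token}")
        self.pos += 1

    def list(self):
        self.match("(")
        if self.token != ")":
            self.seq()
        self.match(")")

    def seq(self):
        self.elt()
        if self.token == ",":
            self.match(",")
            self.seq()

    def elt(self):
        if self.token == "(":
            self.list()
        else:
            self.match("NUMBER")

def recognizes(text):
    """True if text is a complete, syntactically valid list."""
    parser = Parser(text)
    try:
        parser.list()
        parser.match("$")       # no tokens may be left over
        return True
    except SyntaxError:
        return False
```

Each method consumes exactly the tokens derived from its non-terminal, so the call stack at any moment mirrors a path in the parse tree.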

Parser and Scanner Generators • Tools exist (e.g. yacc/bison for C/C++, PLY for Python, CUP for Java) to automatically construct a parser from a restricted set of context free grammars (LALR(1) grammars for yacc/bison and the derivatives CUP and PLY) • These tools use table driven bottom up parsing techniques (commonly shift/reduce parsing) • Similar tools (e.g. lex/flex for C/C++, JFlex for Java) exist, based on the theory of finite automata, to automatically construct scanners from regular expressions • Note: bison is the GNU version of yacc

Yacc (bison) Example
%token NUMBER /* needed to communicate with scanner */
%%
list: '(' sequence ')' { printf("L -> ( seq )\n"); }
    | '(' ')' { printf("L -> () \n "); }
    ;
sequence: listelement ',' sequence { printf("seq -> LE,seq\n"); }
    | listelement { printf("seq -> LE\n"); }
    ;
listelement: NUMBER { printf("LE -> %d\n",$1); }
    | list { printf("LE -> L\n"); }
    ;
%%
/* since no code here, default main constructed that simply calls parser. */

Lex (flex) Example
%{
#include <stdlib.h> /* for atoi */
#include "list.tab.h"
extern int yylval;
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t\n] ;
"(" return yytext[0];
")" return yytext[0];
"," return yytext[0];
"$" return 0;
%%

Building bison/flex Parser • Tools available on tux • You can download them for free • Available as part of many Linux distributions (if not installed, get the appropriate package) • Can be used through Cygwin under Windows • Build instructions • bison -d paren.y => paren.tab.c and paren.tab.h • flex paren.l => lex.yy.c • gcc paren.tab.c lex.yy.c -ly -lfl => a.out or a.exe

Executing Parser • Program expects user to enter string followed by ctrl D indicating end of file, or to redirect input from a file.
E.g. with valid input
$ ./a.exe
(1,2,3)
LE -> 1
LE -> 2
LE -> 3
seq -> LE
seq -> LE,seq
seq -> LE,seq
L -> ( seq )
E.g. input with syntax error
$ ./a.exe
(1,2,3(
LE -> 1
LE -> 2
LE -> 3
seq -> LE
seq -> LE,seq
seq -> LE,seq
syntax error
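The order of the reductions printed above is characteristic of bottom-up parsing: a production is reported only once its entire right-hand side sits on top of the stack. The sketch below reproduces that order with a hand-coded shift/reduce loop for the list grammar. It is not the table-driven LALR(1) algorithm that bison generates; the shift and reduce decisions are written out by hand for this one grammar, the function name shift_reduce is hypothetical, and the reduction for NUMBER is recorded without its value.

```python
import re

def shift_reduce(text):
    """Return the reductions performed while parsing text as a list."""
    words = re.findall(r"\d+|[(),]", text)      # simplified scanner
    tokens = ["NUMBER" if w.isdigit() else w for w in words] + ["$"]
    stack, trace, pos = [], [], 0
    while True:
        look = tokens[pos]                      # one token of lookahead
        if stack[-1:] == ["NUMBER"]:
            stack[-1:] = ["LE"]
            trace.append("LE -> NUMBER")
        elif stack[-3:] == ["(", "seq", ")"]:
            stack[-3:] = ["L"]
            trace.append("L -> ( seq )")
        elif stack[-2:] == ["(", ")"]:
            stack[-2:] = ["L"]
            trace.append("L -> ()")
        elif stack[-3:] == ["LE", ",", "seq"]:
            stack[-3:] = ["seq"]
            trace.append("seq -> LE,seq")
        elif stack[-1:] == ["LE"] and look != ",":
            stack[-1:] = ["seq"]                # a following ',' means shift instead
            trace.append("seq -> LE")
        elif stack == ["L"] and look == "$":
            return trace                        # accept
        elif stack[-1:] == ["L"]:
            stack[-1:] = ["LE"]                 # a nested list is a list element
            trace.append("LE -> L")
        elif look == "$":
            raise SyntaxError("syntax error")   # nothing to reduce, nothing to shift
        else:
            stack.append(look)                  # shift
            pos += 1
```

Calling shift_reduce("(1,2,3)") yields the three LE reductions first, then seq -> LE, then seq -> LE,seq twice, and finally L -> ( seq ), the same order as the bison trace.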

Recursive Descent Reader
List list() {
    L = NULL;
    match('(');
    if token ≠ ')' then
        L = seq();
    endif;
    match(')');
    return L;
}

Recursive Descent Reader
List seq() {
    x = elt();
    if token = ',' then
        match(',');
        M = seq();
        L = Comp(x,M);
    else
        L = Comp(x,NULL)
    endif
    return L
}

Recursive Descent Reader
Element elt() {
    if token = '(' then
        x = list();
    else
        match(NUMBER);
        x = NUMBER.val;
    endif;
    return x;
}
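A possible Python rendering of ReadList is sketched below. It assumes that nested Python lists stand in for the recursive data structure, that the empty list plays the role of NULL, and that a simple regex scanner supplies the tokens; the names read_list, lst and comp are illustrative.

```python
import re

def read_list(text):
    """Parse text such as '(1,(2,3),())' into nested Python lists."""
    tokens = re.findall(r"\d+|[(),]", text) + ["$"]   # simplified scanner
    pos = 0

    def token():
        return tokens[pos]

    def match(expected):
        nonlocal pos
        if token() != expected:
            raise SyntaxError(f"expected {expected}, found {token()}")
        pos += 1

    def comp(e, rest):
        return [e] + rest           # insert e at the front of rest

    def lst():
        result = []                 # the empty list stands in for NULL
        match("(")
        if token() != ")":
            result = seq()
        match(")")
        return result

    def seq():
        x = elt()
        if token() == ",":
            match(",")
            return comp(x, seq())
        return comp(x, [])

    def elt():
        nonlocal pos
        if token() == "(":
            return lst()
        if not token().isdigit():
            raise SyntaxError(f"expected a number, found {token()}")
        value = int(token())        # read the value before advancing
        pos += 1
        return value

    result = lst()
    match("$")
    return result
```

Because seq() builds its result after the recursive call returns, the elements come out in input order and Reverse is not needed in this version.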

Attribute Grammars • Associate attributes with symbols • Associate attribute computation rules with productions • Fill in values as input parsed (decorate parse tree) • Synthesized vs. inherited attributes

Example Attribute Grammar (each production is followed by its attribute rule; seq0 is the <sequence> on the left-hand side and seq1 the one on the right-hand side)
<list> → ( <sequence> ) : list.val = sequence.val
<list> → ( ) : list.val = NULL
<sequence> → <listelement> , <sequence> : seq0.val = Comp(listelement.val,seq1.val)
<sequence> → <listelement> : seq0.val = Comp(listelement.val,NULL)
<listelement> → <list> : listelement.val = list.val
<listelement> → NUMBER : listelement.val = NUMBER.val
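Since every attribute in this grammar is synthesized, val can be computed by a single bottom-up walk over the parse tree. The sketch below assumes a hypothetical tree encoding (a pair of node kind and list of children, with terminals as plain strings) and uses Python lists for list values; none of these names come from the lecture.

```python
def val(node):
    """Compute the synthesized attribute val of a parse-tree node."""
    kind, children = node
    if kind == "NUMBER":
        return children                     # NUMBER.val comes from the scanner
    if kind == "listelement":
        return val(children[0])             # listelement.val = list.val or NUMBER.val
    if kind == "sequence":
        head = val(children[0])
        # <listelement> , <sequence> has three children, <listelement> has one
        tail = val(children[2]) if len(children) == 3 else []
        return [head] + tail                # Comp(listelement.val, ...)
    if kind == "list":
        # ( <sequence> ) has three children, ( ) has two
        return val(children[1]) if len(children) == 3 else []
    raise ValueError(f"unknown node kind {kind}")

def num(n):
    """Build the subtree <listelement> -> NUMBER with value n."""
    return ("listelement", [("NUMBER", n)])

# Parse tree for (1,2,3)
TREE = ("list", ["(",
        ("sequence", [num(1), ",",
            ("sequence", [num(2), ",",
                ("sequence", [num(3)])])]),
        ")"])
```

Evaluating val(TREE) decorates the tree from the leaves upward and returns the list value for the root.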

Decorated Parse Tree (parse tree for (1,2,3) with the val attribute shown at each node)
<list> Val = (1,2,3)
    (
    <sequence> Val = (1,2,3)
        <listelement> Val = 1
            NUMBER Val = 1
        ,
        <sequence> Val = (2,3)
            <listelement> Val = 2
                NUMBER Val = 2
            ,
            <sequence> Val = (3)
                <listelement> Val = 3
                    NUMBER Val = 3
    )

Yacc Example with Attributes
/* This grammar is ambiguous and will cause shift/reduce conflicts */
%token NUMBER
%%
statement_list: statement '\n'
    | statement_list statement '\n'
    ;
statement: expression { printf("= %d\n", $1); };
expression: expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { if ($3 == 0) yyerror("division by zero"); else $$ = $1 / $3; }
    | '(' expression ')' { $$ = $2; }
    | NUMBER { $$ = $1; }
    ;
%%
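To see what the ambiguity means for the computed attribute, one can enumerate every parse tree that the rules expression → expression op expression | NUMBER allow for a given input and evaluate each one. The sketch below does this by brute force; the function name all_values is hypothetical, and division and parentheses are left out to keep it short.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def all_values(tokens):
    """Values of every parse tree for tokens under E -> E op E | NUMBER.

    tokens alternates numbers and operator strings, e.g. [2, "+", 3, "*", 4].
    """
    if len(tokens) == 1:
        return [tokens[0]]
    values = []
    for i in range(1, len(tokens), 2):      # any operator may be the root
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                values.append(OPS[tokens[i]](left, right))
    return values
```

For 2 + 3 * 4 there are two trees, one evaluating to 14 and one to 20, so without further rules the grammar does not determine the meaning of the input.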

Yacc Example (precedence rules)
/* precedence rules added to resolve conflicts and remove ambiguity */
%token NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
%%
statement_list: statement '\n'
    | statement_list statement '\n'
    ;
statement: expression { printf("= %d\n", $1); };
expression: expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { if ($3 == 0) yyerror("division by zero"); else $$ = $1 / $3; }
    | '-' expression %prec UMINUS { $$ = -$2; }
    | '(' expression ')' { $$ = $2; }
    | NUMBER { $$ = $1; }
    ;

Exercise 3 • Removing left recursion • Rules of the form S → S α (left recursive) cause an infinite loop in a recursive descent parser • Left recursion can be systematically removed • <fee> → <fee> α | β is replaced by • <fee> → β <fie> • <fie> → α <fie> | ε • Remove left recursion from the unambiguous expression grammar
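The fee/fie transformation is mechanical enough to write as a small function. The sketch below handles only immediate left recursion in a single non-terminal; the grammar encoding (each alternative is a list of symbols, with the empty list standing for ε) and the function name are assumptions made for illustration.

```python
def remove_left_recursion(nt, alternatives, new_nt):
    """Rewrite nt -> nt alpha | beta as nt -> beta new_nt, new_nt -> alpha new_nt | epsilon."""
    # The alpha parts: what follows nt in each left-recursive alternative.
    alphas = [alt[1:] for alt in alternatives if alt[:1] == [nt]]
    # The beta parts: alternatives that do not start with nt.
    betas = [alt for alt in alternatives if alt[:1] != [nt]]
    if not alphas:
        return {nt: alternatives}           # nothing to do
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],   # [] is epsilon
    }
```

Applied to the rule from the slide, remove_left_recursion("fee", [["fee", "alpha"], ["beta"]], "fie") gives fee → beta fie and fie → alpha fie | ε.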

Exercise 4 • Show that the following grammar is ambiguous. • <stmt> → <ifstmt> | <basicstmt> • <ifstmt> → IF <expr> THEN <stmt> | IF <expr> THEN <stmt> ELSE <stmt> • This is called the “dangling else” problem • See if.y for a yacc/bison version of this grammar, in which <expr> and <basicstmt> are replaced by the tokens EXP and BS
stmt: ifstmt { printf("stmt -> ifstmt\n"); }
    | BS { printf("stmt -> BS\n"); }
    ;
ifstmt: IF EXP THEN stmt { printf("ifstmt -> IF EXP THEN stmt\n"); }
    | IF EXP THEN stmt ELSE stmt { printf("ifstmt -> IF EXP THEN stmt ELSE stmt\n"); }
    ;

First Parse Tree

Second Parse Tree

Shift/Reduce Conflict

Output from bison
$ bison -d if.y
if.y: conflicts: 1 shift/reduce

Exercise 5 • Can you use yacc's precedence rules to remove the ambiguity?

Solution 5 • Convention is to associate the ELSE clause with the nearest if statement. • Force ELSE to have higher precedence than THEN • This removes the shift/reduce conflict and forces yacc to shift on the previous example
%token IF THEN ELSE EXP BS
%nonassoc THEN
%nonassoc ELSE
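The nearest-if convention is also what a hand-written recursive descent parser produces if it consumes an ELSE as soon as it sees one, which corresponds to preferring the shift. The sketch below illustrates this on the token-level grammar with EXP and BS; the function names, the tuple encoding of the tree, and the end marker "$" are assumptions.

```python
def stmt(tokens, pos):
    """Parse one statement starting at pos; return (tree, next position)."""
    if tokens[pos] == "BS":
        return "BS", pos + 1
    if tokens[pos:pos + 3] != ["IF", "EXP", "THEN"]:
        raise SyntaxError("syntax error")
    then_part, pos = stmt(tokens, pos + 3)
    if tokens[pos] == "ELSE":               # greedily attach ELSE to this IF
        else_part, pos = stmt(tokens, pos + 1)
        return ("IF", then_part, else_part), pos
    return ("IF", then_part), pos

def parse(text):
    """Parse a whitespace-separated token string into a nested tuple."""
    tokens = text.split() + ["$"]           # "$" marks the end of input
    tree, pos = stmt(tokens, 0)
    if tokens[pos] != "$":
        raise SyntaxError("syntax error")
    return tree
```

For the input IF EXP THEN IF EXP THEN BS ELSE BS the result is ("IF", ("IF", "BS", "BS")): the inner IF owns the ELSE and the outer IF has none.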

Shift/Reduce Conflict Removed

Exercise 6 • Can you come up with an unambiguous grammar for if statements that always associates the else with the closest if?

Solution 6 • Separate if statements into matched (with ELSE clause and recursively matched stmts) and unmatched • The statement between THEN and ELSE must be matched, so an unmatched if statement can only appear at the end
stmt: matched { printf("stmt -> matched \n "); }
    | unmatched { printf("stmt -> unmatched \n "); }
    ;
matched: BS { printf("matched -> BS \n"); }
    | IF EXP THEN matched ELSE matched { printf("matched -> IF EXP THEN matched ELSE matched \n"); }
    ;
unmatched: IF EXP THEN stmt { printf("unmatched -> IF EXP THEN stmt \n"); }
    | IF EXP THEN matched ELSE unmatched { printf("unmatched -> IF EXP THEN matched ELSE unmatched \n"); }
    ;

Unambiguous Parse Tree

No Shift/Reduce Conflict

Exercise 7 • Can you change the syntax for if statements to remove the ambiguity? Hint: try to use syntax to denote the beginning and end of the statements in the if statement.

Solution 7 • This is the best solution since the matching IF statement and ELSE clause are visually clear. You do not have to remember unnatural precedence rules. • Such a language choice helps prevent logic bugs
stmt: ifstmt { printf("stmt -> ifstmt\n"); }
    | BS { printf("stmt -> BS\n"); }
    ;
ifstmt: IF EXP THEN '{' stmt '}' { printf("ifstmt -> IF EXP THEN { stmt} \n"); }
    | IF EXP THEN '{' stmt '}' ELSE '{' stmt '}' { printf("ifstmt -> IF EXP THEN { stmt } ELSE { stmt }\n"); }
    ;
