Programming Languages (CS 550) Lecture 1 Summary: Grammars and Parsing. Jeremy R. Johnson
Theme • Context free grammars provide a nice formalism for describing syntax of programming languages. Moreover, there is a mechanism for automatically constructing a parser (a recognizer of valid strings in the grammar) from context free grammars (typically a few additional restrictions are enforced to make it easier to construct the parser and the parser more efficient). In this lecture we review grammars as a means of describing syntax and show how, either by hand or using automated tools such as bison, to construct a parser from the grammar.
Outline • Motivating Example • Regular Expressions and Scanning • Context Free Grammars • Derivations and Parse Trees • Ambiguous Grammars • Parsing • Recursive Descent Parsing • Shift Reduce Parsing • Parser Generators • Syntax Directed Translation and Attribute Grammars
Motivating Example • Write a function, L = ReadList(), that reads an arbitrary ordered list and constructs a recursive data structure L to represent it • (a1,…,an), where each ai is an integer or, recursively, a list • Assume the input is a stream of tokens, e.g. '(', integer, ',', ')', and the variable Token contains the current token • Assume the functions • GetToken(): advance to the next token • Match(token): if token = Token then GetToken() else error • M = Comp(e,L): construct list M by inserting element e at the front of L. E.g. Comp(1,(2,3)) = (1,2,3) • M = Reverse(L): M = the reverse of the list L.
List Grammar • < list > → ( < sequence > ) | ( ) • < sequence > → < listelement > , < sequence > | < listelement > • < listelement > → < list > | NUMBER
Derivation and Parse Tree
<list> → ( <sequence> )
→ ( <listelement> , <sequence> )
→ ( NUMBER , <sequence> ) = (1, <sequence> )
→ (1, <listelement> , <sequence> )
→ (1, NUMBER , <sequence> ) = (1, 2, <sequence> )
→ (1, 2, <listelement> )
→ (1, 2, NUMBER ) = (1,2,3)
Derivation and Parse Tree (children are indented under their parent)
<list>
  (
  <sequence>
    <listelement>
      1
    ,
    <sequence>
      <listelement>
        2
      ,
      <sequence>
        <listelement>
          3
  )
Parsing and Scanning • Recognizing valid programming language syntax is split into two stages • scanning - group input character stream into tokens • parsing – group tokens into programming language structures • Tokens are described by regular expressions • Programming language structures by context free grammars • Separating into parsing and scanning simplifies both the description and recognition and makes maintenance easier
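The scanning stage described above can be sketched by hand for the list language. This is a minimal illustration, not the course's code: the names Token, TokenKind, next_token and count_tokens are assumptions introduced here, and the token set is limited to '(', ')', ',' and unsigned integers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token kinds for the list language (illustrative names). */
typedef enum { T_LPAREN, T_RPAREN, T_COMMA, T_NUMBER, T_EOF, T_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    int value;              /* meaningful only when kind == T_NUMBER */
} Token;

/* Scan one token starting at *p and advance *p past it. */
Token next_token(const char **p) {
    assert(p != NULL && *p != NULL);
    Token t = { T_EOF, 0 };
    while (isspace((unsigned char)**p)) (*p)++;     /* skip whitespace */
    char c = **p;
    if (c == '\0') return t;                        /* end of input */
    if (isdigit((unsigned char)c)) {                /* [0-9]+ */
        t.kind = T_NUMBER;
        while (isdigit((unsigned char)**p)) {
            t.value = t.value * 10 + (**p - '0');
            (*p)++;
        }
        return t;
    }
    (*p)++;                                         /* single-character token */
    switch (c) {
    case '(': t.kind = T_LPAREN; break;
    case ')': t.kind = T_RPAREN; break;
    case ',': t.kind = T_COMMA;  break;
    default:  t.kind = T_ERROR;  break;
    }
    return t;
}

/* Count the tokens in s, or return -1 on an unrecognized character. */
int count_tokens(const char *s) {
    int n = 0;
    for (;;) {
        Token t = next_token(&s);
        if (t.kind == T_EOF) return n;
        if (t.kind == T_ERROR) return -1;
        n++;
    }
}
```

A parser would call next_token repeatedly and never look at individual characters, which is the separation of concerns the slide argues for.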
Regular Expressions • Alphabet = Σ • A language over Σ is a subset of the strings in Σ* • Regular expressions describe certain types of languages • ∅ is a regular expression • ε = {ε} is a regular expression • For each a in Σ, a denoting {a} is a regular expression • If r and s are regular expressions denoting languages R and S respectively then (r + s), (rs), and (r*) are regular expressions • E.g. 00, (0+1)*, (0+1)*00(0+1)*, 00*11*22*, (1+10)*
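One of the example expressions, (0+1)*00(0+1)*, denotes the binary strings that contain two consecutive zeros. A recognizer for it can be sketched as a three-state finite automaton; the function name contains_00 and the state numbering are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* DFA for (0+1)*00(0+1)*.
   State 0: no progress, state 1: the last symbol was a single '0',
   state 2: "00" has been seen (accepting, and never left). */
bool contains_00(const char *s) {
    assert(s != NULL);
    int state = 0;
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1') return false;   /* symbol outside the alphabet */
        if (state == 2) continue;                   /* stay in the accepting state */
        state = (*s == '0') ? state + 1 : 0;        /* '0' advances, '1' resets */
    }
    return state == 2;
}
```

Scanner generators build automata of this kind from the regular expressions automatically, typically as transition tables rather than hand-written branches.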
Grammar • Non-terminal symbols • Terminal symbols • Start symbol • Productions (rules) • Context-Free Grammars (rule can not depend on context) • Regular grammar
Example • <if_stmt> → if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt> • <ident_list> → identifier | identifier, <ident_list> • <program> → begin <stmt_list> end • <stmt_list> → <stmt> | <stmt> ; <stmt_list> • <stmt> → <var> = <expression> • <var> → A | B | C • <expression> → <var> + <var> | <var> - <var> | <var>
Expression Grammars • Both grammars share the rules <assign> → <id> = <expr> and <id> → A | B | C • First grammar: <expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id> • Second grammar: <expr> → <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>
Exercise 1 • Show a derivation and corresponding parse tree, using the first expression grammar, for the string • A = B*(A+C) • Show that the second expression grammar is ambiguous by showing two distinct parse trees for the string • A = B+C*A
Parse Tree for A = B * (A + C) (children are indented under their parent)
<assign>
  <id>
    A
  =
  <expr>
    <id>
      B
    *
    <expr>
      (
      <expr>
        <id>
          A
        +
        <expr>
          <id>
            C
      )
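Ambiguous Grammar: two parse trees for A = B + C * A (children are indented under their parent)
Tree 1 (+ at the root of the expression)
<assign>
  <id>
    A
  =
  <expr>
    <expr>
      <id>
        B
    +
    <expr>
      <expr>
        <id>
          C
      *
      <expr>
        <id>
          A
Tree 2 (* at the root of the expression)
<assign>
  <id>
    A
  =
  <expr>
    <expr>
      <expr>
        <id>
          B
      +
      <expr>
        <id>
          C
    *
    <expr>
      <id>
        A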
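Ambiguity matters because the two trees for B + C * A assign different meanings to the same string. The sketch below builds both expression trees by hand and evaluates them; the Node type and the function names are illustrative assumptions, not part of the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Expression tree node: op is '+' or '*', or 0 for a leaf holding value. */
typedef struct Node {
    char op;
    int value;
    const struct Node *left, *right;
} Node;

/* Evaluate a tree bottom-up. */
int eval(const Node *n) {
    assert(n != NULL);
    if (n->op == 0) return n->value;
    int l = eval(n->left), r = eval(n->right);
    return n->op == '+' ? l + r : l * r;
}

/* Tree with '+' at the root: B + (C * A). */
int eval_plus_at_root(int a, int b, int c) {
    Node A = {0, a, NULL, NULL}, B = {0, b, NULL, NULL}, C = {0, c, NULL, NULL};
    Node mul = {'*', 0, &C, &A};
    Node add = {'+', 0, &B, &mul};
    return eval(&add);
}

/* Tree with '*' at the root: (B + C) * A. */
int eval_times_at_root(int a, int b, int c) {
    Node A = {0, a, NULL, NULL}, B = {0, b, NULL, NULL}, C = {0, c, NULL, NULL};
    Node add = {'+', 0, &B, &C};
    Node mul = {'*', 0, &add, &A};
    return eval(&mul);
}
```

With A = 4, B = 2, C = 3 the first tree yields 14 and the second yields 20, so a translator driven by the parse tree needs the grammar to pick one.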
Unambiguous Expression Grammar • <expr> → <expr> + <term> | <term> • <term> → <term> * <factor> | <factor> • <factor> → ( <expr> ) | <id>
Exercise 2 • Show the derivation and parse tree using the unambiguous expression grammar for • A = B+C*A • Convince yourself that this grammar is unambiguous (ideally give a proof)
Recursive Descent Parser
list()
{
  match('(');
  if token ≠ ')' then
    seq();
  endif;
  match(')');
}
Recursive Descent Parser
seq()
{
  elt();
  if token = ',' then
    match(',');
    seq();
  endif;
}
Recursive Descent Parser
elt()
{
  if token = '(' then
    list();
  else
    match(NUMBER);
  endif;
}
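The three procedures list(), seq() and elt() translate almost line for line into C. The sketch below is one possible rendering under stated assumptions: the scanner is a tiny hand-written one, the character 'n' stands for the NUMBER token, errors set a flag instead of aborting, and is_list is a hypothetical entry point added for testing.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static const char *input;   /* remaining input characters */
static char token;          /* '(', ')', ',', 'n' (NUMBER), '\0' (end), '?' (bad char) */
static bool ok;             /* cleared on the first syntax error */

/* Advance to the next token. */
static void get_token(void) {
    while (*input == ' ') input++;
    if (*input == '\0') { token = '\0'; return; }
    if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input)) input++;
        token = 'n';
        return;
    }
    token = (*input == '(' || *input == ')' || *input == ',') ? *input : '?';
    input++;
}

/* If the current token is the expected one, advance; otherwise record an error. */
static void match(char expected) {
    if (token == expected) get_token();
    else ok = false;
}

static void list(void);

/* <listelement> -> <list> | NUMBER */
static void elt(void) {
    if (token == '(') list();
    else match('n');
}

/* <sequence> -> <listelement> , <sequence> | <listelement> */
static void seq(void) {
    elt();
    if (ok && token == ',') { match(','); seq(); }
}

/* <list> -> ( <sequence> ) | ( ) */
static void list(void) {
    match('(');
    if (ok && token != ')') seq();
    match(')');
}

/* True when s is exactly one valid list. */
bool is_list(const char *s) {
    assert(s != NULL);
    input = s;
    ok = true;
    get_token();
    list();
    return ok && token == '\0';
}
```

Each function decides what to do from the single current token, which works here because the alternatives of every rule begin with different tokens once the common <listelement> prefix of <sequence> has been parsed.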
Parser and Scanner Generators • Tools exist (e.g. yacc/bison [1] for C/C++, PLY for Python, CUP for Java) to automatically construct a parser from a restricted set of context free grammars (LALR(1) grammars for yacc/bison and the derivatives CUP and PLY) • These tools use table driven bottom up parsing techniques (commonly shift/reduce parsing) • Similar tools (e.g. lex/flex for C/C++, JFlex for Java) exist, based on the theory of finite automata, to automatically construct scanners from regular expressions • [1] bison is the GNU version of yacc
Yacc (bison) Example
%{
#include <stdio.h>             /* printf */
int yylex(void);               /* supplied by the flex scanner */
int yyerror(const char *msg);  /* supplied by liby when linking with -ly */
%}
%token NUMBER /* needed to communicate with scanner */
%%
list: '(' sequence ')' { printf("L -> ( seq )\n"); }
    | '(' ')' { printf("L -> ()\n"); }
    ;
sequence: listelement ',' sequence { printf("seq -> LE,seq\n"); }
    | listelement { printf("seq -> LE\n"); }
    ;
listelement: NUMBER { printf("LE -> %d\n", $1); }
    | list { printf("LE -> L\n"); }
    ;
%%
/* since no code here, default main constructed that simply calls parser. */
Lex (flex) Example
%{
#include <stdlib.h>     /* atoi */
#include "paren.tab.h"  /* token definitions generated by bison -d paren.y */
extern int yylval;
%}
%%
[0-9]+   { yylval = atoi(yytext); return NUMBER; }
[ \t\n]  ;
"("      return yytext[0];
")"      return yytext[0];
","      return yytext[0];
"$"      return 0;
%%
Building bison/flex Parser • Tools available on tux • You can download them for free • Available as part of many Linux distributions (if not installed, get the appropriate package) • Can be used through Cygwin under Windows • Build instructions • bison -d paren.y => paren.tab.c and paren.tab.h • flex paren.l => lex.yy.c • gcc paren.tab.c lex.yy.c -ly -lfl => a.out or a.exe
Executing Parser
The program expects the user to enter a string followed by ctrl D indicating end of file, or to redirect input from a file.
E.g. with valid input
$ ./a.exe
(1,2,3)
LE -> 1
LE -> 2
LE -> 3
seq -> LE
seq -> LE,seq
seq -> LE,seq
L -> ( seq )
E.g. input with syntax error
$ ./a.exe
(1,2,3(
LE -> 1
LE -> 2
LE -> 3
seq -> LE
seq -> LE,seq
seq -> LE,seq
syntax error
Recursive Descent Reader
List list()
{
  match('(');
  if token ≠ ')' then
    L = seq();
  else
    L = NULL;
  endif;
  match(')');
  return L;
}
Recursive Descent Reader
List seq()
{
  x = elt();
  if token = ',' then
    match(',');
    M = seq();
    L = Comp(x,M);
  else
    L = Comp(x,NULL);
  endif;
  return L;
}
Recursive Descent Reader
Element elt()
{
  if token = '(' then
    x = list();
  else
    match(NUMBER);
    x = NUMBER.val;
  endif;
  return x;
}
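The reader can be made concrete by choosing a representation for lists. The sketch below is one possible choice and not the lecture's own code: Cell, Element, comp (playing the role of Comp) and read_list are assumed names, NULL represents the empty list, 'n' stands for the NUMBER token, and no memory is freed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* A list is a chain of cells; each cell holds an integer or a nested list. */
typedef struct Cell {
    int is_list;            /* 1: element is sub, 0: element is num */
    int num;
    struct Cell *sub;
    struct Cell *next;
} Cell;

typedef struct { int is_list; int num; Cell *sub; } Element;

static const char *input;
static char token;          /* '(', ')', ',', 'n' (NUMBER), '\0' (end), '?' (bad char) */
static int token_val;       /* value of the current NUMBER token */
static int failed;          /* set on the first syntax error */

static void get_token(void) {
    while (*input == ' ') input++;
    if (isdigit((unsigned char)*input)) {
        token_val = 0;
        while (isdigit((unsigned char)*input)) token_val = token_val * 10 + (*input++ - '0');
        token = 'n';
    } else if (*input == '\0') {
        token = '\0';
    } else {
        token = (*input == '(' || *input == ')' || *input == ',') ? *input : '?';
        input++;
    }
}

static void match(char expected) {
    if (token == expected) get_token(); else failed = 1;
}

/* Comp(e, L): a new list with element e in front of L. */
static Cell *comp(Element e, Cell *rest) {
    Cell *c = malloc(sizeof *c);
    if (c == NULL) abort();
    c->is_list = e.is_list; c->num = e.num; c->sub = e.sub; c->next = rest;
    return c;
}

static Cell *list(void);

static Element elt(void) {
    Element x = {0, 0, NULL};
    if (token == '(') { x.is_list = 1; x.sub = list(); }
    else { x.num = token_val; match('n'); }   /* save the value before advancing */
    return x;
}

static Cell *seq(void) {
    Element x = elt();
    Cell *rest = NULL;
    if (!failed && token == ',') { match(','); rest = seq(); }
    return comp(x, rest);
}

static Cell *list(void) {
    Cell *l = NULL;
    match('(');
    if (!failed && token != ')') l = seq();
    match(')');
    return l;
}

/* Returns 0 and stores the list in *out on success, -1 on a syntax error. */
int read_list(const char *s, Cell **out) {
    assert(s != NULL && out != NULL);
    input = s;
    failed = 0;
    get_token();
    *out = list();
    return (failed || token != '\0') ? -1 : 0;
}
```

Because seq() builds the tail before calling comp, the elements come out in input order and Reverse is not needed in this formulation.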
Attribute Grammars • Associate attributes with symbols • Associate attribute computation rules with productions • Fill in values as input parsed (decorate parse tree) • Synthesized vs. inherited attributes
Example Attribute Grammar (each production is followed by its attribute rule; seq0 is the <sequence> on the left-hand side and seq1 the one on the right-hand side)
• <list> → ( <sequence> )
  • list.val = sequence.val
• <list> → ( )
  • list.val = NULL
• <sequence> → <listelement> , <sequence>
  • seq0.val = Comp(listelement.val,seq1.val)
• <sequence> → <listelement>
  • seq0.val = Comp(listelement.val,NULL)
• <listelement> → <list>
  • listelement.val = list.val
• <listelement> → NUMBER
  • listelement.val = NUMBER.val
Decorated Parse Tree for (1,2,3) (children are indented under their parent)
<list> Val = (1,2,3)
  (
  <sequence> Val = (1,2,3)
    <listelement> Val = 1
      NUMBER Val = 1
    ,
    <sequence> Val = (2,3)
      <listelement> Val = 2
        NUMBER Val = 2
      ,
      <sequence> Val = (3)
        <listelement> Val = 3
          NUMBER Val = 3
  )
Yacc Example with Attributes
/* This grammar is ambiguous and will cause shift/reduce conflicts */
%{
#include <stdio.h>             /* printf */
int yylex(void);
int yyerror(const char *msg);
%}
%token NUMBER
%%
statement_list: statement '\n'
    | statement_list statement '\n'
    ;
statement: expression { printf("= %d\n", $1); };
expression: expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { if ($3 == 0) yyerror("division by zero"); else $$ = $1 / $3; }
    | '(' expression ')' { $$ = $2; }
    | NUMBER { $$ = $1; }
    ;
%%
Shift Reduce Parsing • Bottom up parsing • LR(1), LALR(1) • Conflicts & ambiguities • Trace for 1+2*3, where | separates the parse stack from the remaining input and the bracket names the action just taken • |1+2*3 • 1|+2*3 [shift] • <exp>|+2*3 [reduce] • <exp>+|2*3 [shift] • <exp>+2|*3 [shift] • <exp>+<exp>|*3 [reduce] • <exp>+<exp>|*3 [shift/reduce conflict: either shift * or reduce <exp>+<exp>] • <exp>+<exp>*|3 [shift] • <exp>+<exp>*3| [shift] • <exp>+<exp>*<exp>| [reduce] • <exp>+<exp>| [reduce] • <exp>| [reduce & accept]
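The shift/reduce decision in the trace can be illustrated with a small stack-based evaluator. This is a simplified operator-precedence loop, not the table-driven LALR(1) algorithm that yacc/bison generates; sr_eval, prec and reduce are names assumed for this sketch, and the input is restricted to unsigned integers separated by + - * / with no spaces or parentheses.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Binding strength of an operator; 0 means "not an operator" (including end of input). */
static int prec(char op) {
    return (op == '+' || op == '-') ? 1 : (op == '*' || op == '/') ? 2 : 0;
}

/* Reduce: pop <exp> op <exp> from the stacks and push the resulting <exp>. */
static void reduce(int *vals, int *nvals, char *ops, int *nops) {
    int r = vals[--*nvals];
    int l = vals[--*nvals];
    char op = ops[--*nops];
    vals[(*nvals)++] = op == '+' ? l + r : op == '-' ? l - r : op == '*' ? l * r : l / r;
}

/* Returns 0 and stores the value in *result, or -1 on malformed input or division by zero. */
int sr_eval(const char *s, int *result) {
    assert(s != NULL && result != NULL);
    int vals[64];
    char ops[64];
    int nvals = 0, nops = 0;
    for (;;) {
        if (!isdigit((unsigned char)*s) || nvals == 64) return -1;
        int v = 0;
        while (isdigit((unsigned char)*s)) v = v * 10 + (*s++ - '0');
        vals[nvals++] = v;                  /* shift NUMBER and reduce it to <exp> */
        char look = *s;                     /* one token of lookahead */
        /* Conflict point: the stack ends in <exp> op <exp>.  Reduce while the
           stacked operator binds at least as tightly as the lookahead (this
           also gives left associativity); otherwise fall through and shift. */
        while (nops > 0 && prec(ops[nops - 1]) >= prec(look)) {
            if (ops[nops - 1] == '/' && vals[nvals - 1] == 0) return -1;
            reduce(vals, &nvals, ops, &nops);
        }
        if (look == '\0') break;            /* everything has been reduced */
        if (prec(look) == 0) return -1;     /* unexpected character */
        ops[nops++] = look;                 /* shift the operator */
        s++;
    }
    assert(nvals == 1);
    *result = vals[0];
    return 0;
}
```

On 1+2*3 the loop shifts * because + binds less tightly, which is the same choice that precedence declarations ask a generated parser to make.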
Yacc Example (precedence rules)
/* precedence rules added to resolve conflicts and remove ambiguity */
%{
#include <stdio.h>             /* printf */
int yylex(void);
int yyerror(const char *msg);
%}
%token NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
%%
statement_list: statement '\n'
    | statement_list statement '\n'
    ;
statement: expression { printf("= %d\n", $1); };
expression: expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { if ($3 == 0) yyerror("division by zero"); else $$ = $1 / $3; }
    | '-' expression %prec UMINUS { $$ = -$2; }
    | '(' expression ')' { $$ = $2; }
    | NUMBER { $$ = $1; }
    ;
Exercise 3 • Removing left recursion • Rules of the form S → S α [left recursive] cause an infinite loop in a recursive descent parser • Left recursion can be systematically removed • The left recursive rule <fee> → <fee> α | β • is replaced by <fee> → β <fie> • <fie> → α <fie> | ε • Remove left recursion from the unambiguous expression grammar
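To see the transformation at work without giving away the exercise, take the smallest instance: S → S a | b, which generates b followed by any number of a's. After removing left recursion it becomes S → b R and R → a R | ε, and each rule turns into a function that consumes input before it recurses. The function names below are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* R -> 'a' R | epsilon.  Returns a pointer just past what R matched. */
static const char *parse_r(const char *p) {
    if (*p == 'a') return parse_r(p + 1);   /* consumes one symbol before recursing */
    return p;                               /* epsilon: match nothing */
}

/* S -> 'b' R.  Returns NULL when the input does not start with an S. */
static const char *parse_s(const char *p) {
    if (*p != 'b') return NULL;
    return parse_r(p + 1);
}

/* True when s is derived from S, i.e. s is b followed by zero or more a's. */
bool accepts(const char *s) {
    assert(s != NULL);
    const char *end = parse_s(s);
    return end != NULL && *end == '\0';
}
```

A direct rendering of the original rule would have parse_s call itself before reading any input, which is the infinite loop the slide warns about.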
Exercise 4 • Show that the following grammar is ambiguous. • <stmt> → <ifstmt> | <basicstmt> • <ifstmt> → IF <expr> THEN <stmt> | IF <expr> THEN <stmt> ELSE <stmt> • This is called the “dangling else” problem • See if.y for a yacc/bison version of this grammar, in which <expr> and <basicstmt> are replaced by the tokens EXP and BS
stmt: ifstmt { printf("stmt -> ifstmt\n"); }
    | BS { printf("stmt -> BS\n"); }
    ;
ifstmt: IF EXP THEN stmt { printf("ifstmt -> IF EXP THEN stmt\n"); }
    | IF EXP THEN stmt ELSE stmt { printf("ifstmt -> IF EXP THEN stmt ELSE stmt\n"); }
    ;
Output from bison
$ bison -d if.y
if.y: conflicts: 1 shift/reduce
Exercise 5 • Can you use yacc's precedence rules to remove the ambiguity?
Solution 5 • Convention is to associate the ELSE clause with the nearest if statement. • Force ELSE to have higher precedence than THEN • This removes the shift/reduce conflict and forces yacc to shift on the previous example
%token IF THEN ELSE EXP BS
%nonassoc THEN
%nonassoc ELSE
Exercise 6 • Can you come up with an unambiguous grammar for if statements that always associates the else with the closest if?
Solution 6 • Separate if statements into matched (every IF has an ELSE clause and the nested statements are recursively matched) and unmatched • Only a matched statement may appear between THEN and ELSE, so an ELSE can only pair with the closest IF that does not yet have one
stmt: matched { printf("stmt -> matched\n"); }
    | unmatched { printf("stmt -> unmatched\n"); }
    ;
matched: BS { printf("matched -> BS\n"); }
    | IF EXP THEN matched ELSE matched { printf("matched -> IF EXP THEN matched ELSE matched\n"); }
    ;
unmatched: IF EXP THEN stmt { printf("unmatched -> IF EXP THEN stmt\n"); }
    | IF EXP THEN matched ELSE unmatched { printf("unmatched -> IF EXP THEN matched ELSE unmatched\n"); }
    ;
Exercise 7 • Can you change the syntax for if statements to remove the ambiguity? Hint: try to use syntax to denote the beginning and end of the statements in the if statement.
Solution 7 • This is the best solution since the pairing of each IF statement with its ELSE clause is visually clear. You do not have to remember unnatural precedence rules. • Such a language choice helps prevent logic bugs
stmt: ifstmt { printf("stmt -> ifstmt\n"); }
    | BS { printf("stmt -> BS\n"); }
    ;
ifstmt: IF EXP THEN '{' stmt '}' { printf("ifstmt -> IF EXP THEN { stmt }\n"); }
    | IF EXP THEN '{' stmt '}' ELSE '{' stmt '}' { printf("ifstmt -> IF EXP THEN { stmt } ELSE { stmt }\n"); }
    ;