This lecture explores context-free grammars for syntax description and parsing in programming languages. Learn how to construct parsers manually or using tools like Bison. Topics include regular expressions, ambiguous grammars, parsing techniques, and parser generators.
Programming Languages (CS 550) • Lecture 1 Summary: Grammars and Parsing • Jeremy R. Johnson
Theme • Context free grammars provide a nice formalism for describing syntax of programming languages. Moreover, there is a mechanism for automatically constructing a parser (a recognizer of valid strings in the grammar) from context free grammars (typically a few additional restrictions are enforced to make it easier to construct the parser and the parser more efficient). In this lecture we review grammars as a means of describing syntax and show how, either by hand or using automated tools such as bison, to construct a parser from the grammar.
Outline • Motivating Example • Regular Expressions and Scanning • Context Free Grammars • Derivations and Parse Trees • Ambiguous Grammars • Parsing • Recursive Descent Parsing • Shift Reduce Parsing • Parser Generators • Syntax Directed Translation and Attribute Grammars
Motivating Example • Write a function, L = ReadList(), that reads an arbitrary order list and constructs a recursive data structure L to represent it • (a1,…,an), ai an integer or recursively a list • Assume the input is a stream of tokens - e.g. ‘(‘, integer, ‘,’, ‘)’ and the variable Token contains the current token • Assume the functions • GetToken() – advance to the next token • Match(token) – if token = Token then GetToken() else error • M = Comp(e,L) – construct list M by inserting element e in the front of L. E.g. Comp(1,(2,3)) = (1,2,3) • M = Reverse(L) – M = the reverse of the list L.
List Grammar • < list > → ( < sequence > ) | ( ) • < sequence > → < listelement > , < sequence > | < listelement > • < listelement > → < list > | NUMBER
Derivation and Parse Tree <list> → ( < sequence > ) → ( < listelement > , < sequence > ) → ( NUMBER, < sequence > ) = (1, < sequence > ) → (1, < listelement > , < sequence >) → (1, NUMBER, < sequence >) = (1, 2,< sequence > ) → (1, 2, < listelement>) → (1, 2, NUMBER) = (1,2,3)
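Derivation and Parse Tree • Parse tree for (1,2,3), read top down: • <list> has children ( <sequence> ) • the top <sequence> has children <listelement> , <sequence>, and its <listelement> derives 1 • the second <sequence> has children <listelement> , <sequence>, and its <listelement> derives 2 • the innermost <sequence> has the single child <listelement>, which derives 3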
Parsing and Scanning • Recognizing valid programming language syntax is split into two stages • scanning - group input character stream into tokens • parsing – group tokens into programming language structures • Tokens are described by regular expressions • Programming language structures by context free grammars • Separating into parsing and scanning simplifies both the description and recognition and makes maintenance easier
Regular Expressions • Alphabet = Σ • A language over Σ is a subset of the strings in Σ* • Regular expressions describe certain types of languages • ∅ is a regular expression denoting the empty language • ε is a regular expression denoting {ε} • For each a in Σ, a is a regular expression denoting {a} • If r and s are regular expressions denoting languages R and S respectively, then (r + s), (rs), and (r*) are regular expressions denoting R ∪ S, RS, and R* • E.g. 00, (0+1)*, (0+1)*00(0+1)*, 00*11*22*, (1+10)*
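The example expressions on this slide can be tried out directly. The following is a minimal sketch in Python, assuming the standard `re` module, where the slide's "+" (union) is written "|". The names `EXAMPLES`, `in_language`, and `tokenize` are illustrative and not part of the lecture. The second half shows how a regular expression can act as a scanner for the list tokens used in the motivating example.

```python
import re

# The slide's "+" (union) is written "|" in Python's re syntax.
EXAMPLES = {
    "00": r"00",
    "(0+1)*": r"(0|1)*",
    "(0+1)*00(0+1)*": r"(0|1)*00(0|1)*",
    "00*11*22*": r"00*11*22*",
    "(1+10)*": r"(1|10)*",
}

def in_language(slide_regex, s):
    """True if the whole string s is in the language of the slide's expression."""
    return re.fullmatch(EXAMPLES[slide_regex], s) is not None

# Scanner for the list language: NUMBER is [0-9]+, punctuation stands for itself.
TOKEN = re.compile(r"\s*(?:([0-9]+)|([(),]))")

def tokenize(text):
    """Group a character stream into ('NUMBER', n) and punctuation tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character at position {pos}")
        if m.group(1) is not None:
            tokens.append(("NUMBER", int(m.group(1))))
        else:
            tokens.append((m.group(2), None))
        pos = m.end()
    return tokens
```

For instance, `in_language("(0+1)*00(0+1)*", s)` asks whether the binary string `s` contains two consecutive zeros.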
Grammar • Non-terminal symbols • Terminal symbols • Start symbol • Productions (rules) • Context-Free Grammars (rule can not depend on context) • Regular grammar
Example • <if_stmt> → if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt> • <ident_list> → identifier | identifier, <ident_list> • <program> → begin <stmt_list> end • <stmt_list> → <stmt> | <stmt> ; <stmt_list> • <stmt> → <var> = <expression> • <var> → A | B | C • <expression> → <var> + <var> | <var> - <var> | <var>
Expression Grammars • Both grammars share the rules <assign> → <id> = <expr> and <id> → A | B | C • First grammar: <expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id> • Second grammar: <expr> → <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>
Exercise 1 • Show a derivation and corresponding parse tree, using the first expression grammar, for the string • A = B*(A+C) • Show that the second expression grammar is ambiguous by showing two distinct parse trees for the string • A = B+C*A
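Parse Tree • Parse tree for A = B * (A + C) using the first expression grammar: • <assign> has children <id> = <expr>, and its <id> derives A • that <expr> has children <id> * <expr>, and its <id> derives B • the next <expr> has children ( <expr> ) • the parenthesized <expr> has children <id> + <expr>, and its <id> derives A • the last <expr> has the single child <id>, which derives C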
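Ambiguous Grammar • Two distinct parse trees for A = B + C * A using the second expression grammar; in both, <assign> has children <id> = <expr> and its <id> derives A • Tree 1: the top <expr> has children <expr> + <expr>; the left <expr> derives B through <id>, and the right <expr> has children <expr> * <expr>, which derive C and A • Tree 2: the top <expr> has children <expr> * <expr>; the left <expr> has children <expr> + <expr>, which derive B and C, and the right <expr> derives A through <id>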
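One way to check ambiguity mechanically is to count parse trees. The sketch below, in Python, counts the parse trees that the second expression grammar assigns to a token sequence derived from <expr>; a count above 1 means the string is ambiguous. The function name `count_parse_trees` and the one-character token encoding are assumptions made for illustration.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for <expr> -> <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways tokens[i:j] can be derived from <expr>.
        if j <= i:
            return 0
        if j - i == 1:
            return 1 if tokens[i] in ("A", "B", "C") else 0   # <expr> -> <id>
        total = 0
        if tokens[i] == "(" and tokens[j - 1] == ")":          # <expr> -> ( <expr> )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                          # operator at the root
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

Here `count_parse_trees(list("B+C*A"))` considers the right-hand side of the assignment from the slide, one character per token.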
Unambiguous Expression Grammar • <expr> → <expr> + <term> | <term> • <term> → <term> * <factor> | <factor> • <factor> → ( <expr> ) | <id>
Exercise 2 • Show the derivation and parse tree using the unambiguous expression grammar for • A = B+C*A • Convince yourself that this grammar is unambiguous (ideally give a proof)
Recursive Descent Parser
list() {
  match('(');
  if token ≠ ')' then
    seq();
  endif;
  match(')');
}
seq() {
  elt();
  if token = ',' then
    match(',');
    seq();
  endif
}
elt() {
  if token = '(' then
    list();
  else
    match(NUMBER);
  endif;
}
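The three procedures translate almost line for line into a runnable recognizer. This is a minimal sketch in Python, not the lecture's implementation: the class name `ListRecognizer`, the helper `is_valid_list`, the regex-based tokenizer, and the "EOF" end marker are all assumptions made so the example is self-contained.

```python
import re

class ListRecognizer:
    """Recognizer for: list -> ( seq ) | ( );  seq -> elt , seq | elt;  elt -> list | NUMBER."""

    def __init__(self, text):
        # Tokens are digit strings or single non-space characters; "EOF" ends the input.
        self.tokens = re.findall(r"[0-9]+|\S", text) + ["EOF"]
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if expected == "NUMBER":
            ok = self.token[0] in "0123456789"
        else:
            ok = self.token == expected
        if not ok:
            raise SyntaxError(f"expected {expected}, found {self.token}")
        self.pos += 1          # GetToken(): advance to the next token

    def parse_list(self):
        self.match("(")
        if self.token != ")":
            self.parse_seq()
        self.match(")")

    def parse_seq(self):
        self.parse_elt()
        if self.token == ",":
            self.match(",")
            self.parse_seq()

    def parse_elt(self):
        if self.token == "(":
            self.parse_list()
        else:
            self.match("NUMBER")

def is_valid_list(text):
    """True if text is a single well-formed list followed by end of input."""
    parser = ListRecognizer(text)
    try:
        parser.parse_list()
        parser.match("EOF")
        return True
    except SyntaxError:
        return False
```

Each grammar non-terminal becomes one method, and each choice between alternatives is decided by looking only at the current token.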
Parser and Scanner Generators • Tools exist (e.g. yacc/bison for C/C++, PLY for Python, CUP for Java) to automatically construct a parser from a restricted class of context free grammars (LALR(1) grammars for yacc/bison and the derivatives CUP and PLY) • These tools use table-driven bottom-up parsing techniques (commonly shift/reduce parsing) • Similar tools (e.g. lex/flex for C/C++, JFlex for Java), based on the theory of finite automata, automatically construct scanners from regular expressions • Note: bison is the GNU version of yacc
Yacc (bison) Example
%{
#include <stdio.h>   /* printf */
int yylex(void);     /* scanner generated by lex/flex */
%}
%token NUMBER /* needed to communicate with scanner */
%%
list: '(' sequence ')' { printf("L -> ( seq )\n"); }
    | '(' ')' { printf("L -> ()\n"); }
    ;
sequence: listelement ',' sequence { printf("seq -> LE,seq\n"); }
    | listelement { printf("seq -> LE\n"); }
    ;
listelement: NUMBER { printf("LE -> %d\n",$1); }
    | list { printf("LE -> L\n"); }
    ;
%%
/* no code here: the default main from the yacc library (-ly) simply calls the parser */
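Lex (flex) Example
%{
#include <stdlib.h>     /* atoi */
#include "list.tab.h"   /* token definitions generated by bison -d from list.y */
extern int yylval;
%}
%%
[0-9]+   { yylval = atoi(yytext); return NUMBER; }
[ \t\n]  ;
"("      return yytext[0];
")"      return yytext[0];
","      return yytext[0];
"$"      return 0;   /* treat $ as end of input */
%%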
Building the bison/flex Parser • Tools are available on tux • You can download them for free • Available as part of many Linux distributions (if not installed, get the appropriate package) • Can be used through Cygwin under Windows • Build instructions (the scanner must #include the header that bison -d generates, which is paren.tab.h for paren.y) • bison -d paren.y => paren.tab.c and paren.tab.h • flex paren.l => lex.yy.c • gcc paren.tab.c lex.yy.c -ly -lfl => a.out or a.exe
Executing Parser
The program expects the user to enter a string followed by ctrl-D to indicate end of file, or to redirect input from a file.
E.g. with valid input:
$ ./a.exe
(1,2,3)
LE -> 1
LE -> 2
LE -> 3
seq -> LE
seq -> LE,seq
seq -> LE,seq
L -> ( seq )
E.g. input with a syntax error:
$ ./a.exe
(1,2,3(
LE -> 1
LE -> 2
LE -> 3
seq -> LE
seq -> LE,seq
seq -> LE,seq
syntax error
Recursive Descent Reader
List list() {
  match('(');
  if token ≠ ')' then
    L = seq();
  else
    L = NULL;
  endif;
  match(')');
  return L;
}
List seq() {
  x = elt();
  if token = ',' then
    match(',');
    M = seq();
    L = Comp(x,M);
  else
    L = Comp(x,NULL);
  endif
  return L;
}
Element elt() {
  if token = '(' then
    x = list();
  else
    match(NUMBER);
    x = NUMBER.val;
  endif;
  return x;
}
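The reader can be sketched in Python by letting nested Python lists stand in for the recursive data structure and the empty list stand in for NULL. This is an illustration under those assumptions rather than the lecture's code; `comp` mirrors Comp(e,L) from the motivating example, while `read_list`, the tokenizer, and the "EOF" marker are hypothetical names.

```python
import re

def comp(e, lst):
    """Comp(e, L): build a new list with element e in front of L."""
    return [e] + lst

def read_list(text):
    """ReadList(): parse text such as "(1,(2,3))" into nested Python lists."""
    tokens = re.findall(r"[0-9]+|\S", text) + ["EOF"]
    pos = 0

    def match(expected):
        nonlocal pos
        if tokens[pos] != expected:
            raise SyntaxError(f"expected {expected}, found {tokens[pos]}")
        pos += 1

    def parse_list():
        match("(")
        result = []                       # NULL: the empty list
        if tokens[pos] != ")":
            result = parse_seq()
        match(")")
        return result

    def parse_seq():
        x = parse_elt()
        if tokens[pos] == ",":
            match(",")
            return comp(x, parse_seq())
        return comp(x, [])

    def parse_elt():
        nonlocal pos
        if tokens[pos] == "(":
            return parse_list()
        if tokens[pos][0] not in "0123456789":
            raise SyntaxError(f"expected NUMBER, found {tokens[pos]}")
        value = int(tokens[pos])          # read the value before advancing
        pos += 1
        return value

    result = parse_list()
    match("EOF")
    return result
```

Because `seq` is right recursive, the list is assembled back to front by `comp`, so the Reverse function from the motivating example is not needed in this version.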
Attribute Grammars • Associate attributes with symbols • Associate attribute computation rules with productions • Fill in values as input parsed (decorate parse tree) • Synthesized vs. inherited attributes
Example Attribute Grammar • < list > → ( < sequence > ) | ( ) • list.val = sequence.val [for ( < sequence > )] • list.val = NULL [for ( )] • < sequence > → < listelement > , < sequence > | < listelement > • seq0.val = Comp(listelement.val,seq1.val), where seq0 is the <sequence> on the left-hand side and seq1 the one on the right-hand side • seq0.val = Comp(listelement.val,NULL) • < listelement > → < list > | NUMBER • listelement.val = list.val • listelement.val = NUMBER.val
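Decorated Parse Tree • Parse tree for (1,2,3) with the val attribute filled in: • <list>: Val = (1,2,3) • top <sequence>: Val = (1,2,3), with children <listelement> (Val = 1, from NUMBER with Val = 1) and <sequence> (Val = (2,3)) • second <sequence>: Val = (2,3), with children <listelement> (Val = 2, from NUMBER with Val = 2) and <sequence> (Val = (3)) • innermost <sequence>: Val = (3), with child <listelement> (Val = 3, from NUMBER with Val = 3)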
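Decorating a tree with a synthesized attribute amounts to a bottom-up traversal. The sketch below, in Python, assumes a hypothetical tree encoding (tuples of the form `(symbol, child, ...)` with integers as NUMBER leaves) and computes `val` using the attribute rules of the list grammar, with Python lists standing in for the list values.

```python
PUNCTUATION = ("(", ")", ",")

def val(node):
    """Synthesized attribute val, computed from the val of the node's children."""
    if isinstance(node, int):                  # NUMBER leaf carries NUMBER.val
        return node
    symbol = node[0]
    children = [c for c in node[1:] if c not in PUNCTUATION]
    if symbol == "list":
        # list -> ( sequence ) gives sequence.val; list -> ( ) gives NULL
        return val(children[0]) if children else []
    if symbol == "sequence":
        # Comp(listelement.val, seq1.val) or Comp(listelement.val, NULL)
        rest = val(children[1]) if len(children) == 2 else []
        return [val(children[0])] + rest
    if symbol == "listelement":
        return val(children[0])                # list.val or NUMBER.val
    raise ValueError(f"unknown symbol {symbol}")

# Parse tree for (1,2,3), written out by hand.
TREE = ("list", "(",
        ("sequence", ("listelement", 1), ",",
         ("sequence", ("listelement", 2), ",",
          ("sequence", ("listelement", 3)))),
        ")")
```

Since every rule reads only attributes of children, a single post-order pass is enough; inherited attributes would need values passed down from the parent as well.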
Yacc Example with Attributes
/* This grammar is ambiguous and will cause shift/reduce conflicts */
%token NUMBER
%%
statement_list: statement '\n'
    | statement_list statement '\n'
    ;
statement: expression { printf("= %d\n", $1); };
expression: expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { if ($3 == 0) yyerror("division by zero"); else $$ = $1 / $3; }
    | '(' expression ')' { $$ = $2; }
    | NUMBER { $$ = $1; }
    ;
%%
Shift Reduce Parsing • Bottom up parsing • LR(1), LALR(1) • Conflicts & ambiguities • Trace for 1+2*3, where | separates the stack (left) from the remaining input (right) and the bracket names the action that produced each line • |1+2*3 • 1|+2*3 [shift] • <exp>|+2*3 [reduce] • <exp>+|2*3 [shift] • <exp>+2|*3 [shift] • <exp>+<exp>|*3 [reduce] • <exp>+<exp>|*3 [shift/reduce conflict: reduce <exp>+<exp> or shift *] • <exp>+<exp>*|3 [shift] • <exp>+<exp>*3| [shift] • <exp>+<exp>*<exp>| [reduce] • <exp>+<exp>| [reduce] • <exp>| [reduce & accept]
Yacc Example (precedence rules)
/* precedence rules added to resolve conflicts and remove ambiguity */
%token NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
%%
statement_list: statement '\n'
    | statement_list statement '\n'
    ;
statement: expression { printf("= %d\n", $1); };
expression: expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { if ($3 == 0) yyerror("division by zero"); else $$ = $1 / $3; }
    | '-' expression %prec UMINUS { $$ = -$2; }
    | '(' expression ')' { $$ = $2; }
    | NUMBER { $$ = $1; }
    ;
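The effect of the two %left declarations on a shift/reduce conflict can be imitated by hand. The following Python sketch is a simplified illustration, not how yacc is implemented: it handles only numbers and the four binary operators (no parentheses or unary minus), pushes numbers directly as values, ignores characters outside that token set, uses integer division, and lets Python's ZeroDivisionError stand in for the yyerror call. The names `PRECEDENCE`, `APPLY`, and `shift_reduce_eval` are hypothetical.

```python
import operator
import re

# Higher number binds tighter; mirrors %left '-' '+' followed by %left '*' '/'.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
APPLY = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.floordiv}

def shift_reduce_eval(text):
    """Evaluate NUMBER (op NUMBER)* and return (value, list of parser actions)."""
    tokens = re.findall(r"[0-9]+|[-+*/]", text)
    well_formed = len(tokens) % 2 == 1 and all(
        (tok in PRECEDENCE) == (i % 2 == 1) for i, tok in enumerate(tokens))
    if not well_formed:
        raise SyntaxError("expected NUMBER (operator NUMBER)*")
    stack, trace = [], []
    for tok in tokens + ["$"]:                  # "$" marks end of input
        if tok in PRECEDENCE or tok == "$":
            # Conflict point: stack holds exp op exp and the look-ahead is an operator.
            # Reduce if the stacked operator has higher precedence, or equal
            # precedence (left associativity); otherwise shift.
            while len(stack) >= 3 and (
                    tok == "$" or PRECEDENCE[stack[-2]] >= PRECEDENCE[tok]):
                right, op, left = stack.pop(), stack.pop(), stack.pop()
                stack.append(APPLY[op](left, right))
                trace.append("reduce")
            if tok != "$":
                stack.append(tok)
                trace.append("shift")
        else:
            stack.append(int(tok))
            trace.append("shift")
    return stack[0], trace
```

On 1+2*3 the look-ahead * outranks the stacked +, so the sketch shifts, which matches the choice made in the shift/reduce trace.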
Exercise 3 • Removing left recursion • Rules of the form S → S α [left recursive] cause an infinite loop in a recursive descent parser • Left recursion can be systematically removed • <fee> → <fee> α | β becomes • <fee> → β <fie> • <fie> → α <fie> | ε • Remove left recursion from the unambiguous expression grammar
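The standard transformation for immediate left recursion rewrites A → A α | β as A → β A' and A' → α A' | ε. A minimal Python sketch of that rewrite follows; the grammar encoding (each right-hand side is a list of symbols, the empty list is ε) and the primed name for the new non-terminal are assumptions chosen for illustration.

```python
def remove_left_recursion(nonterminal, alternatives):
    """Remove immediate left recursion from the rules of one non-terminal.

    alternatives is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping non-terminals to their new right-hand sides.
    """
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    betas = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not alphas:                      # nothing to do
        return {nonterminal: alternatives}
    new_name = nonterminal + "'"        # fresh non-terminal playing the role of A'
    return {
        nonterminal: [beta + [new_name] for beta in betas],
        new_name: [alpha + [new_name] for alpha in alphas] + [[]],   # [] is epsilon
    }
```

Applied once per left-recursive non-terminal, this handles rules such as <expr> → <expr> + <term> | <term>; indirect left recursion needs a more general algorithm that is outside this sketch.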
Exercise 4 • Show that the following grammar is ambiguous. • <stmt> → <ifstmt> | <basicstmt> • <ifstmt> → IF <expr> THEN <stmt> | IF <expr> THEN <stmt> ELSE <stmt> • This is called the “dangling else” problem • See if.y for a yacc/bison version of this grammar, in which <expr> and <basicstmt> are replaced by the tokens EXP and BS
stmt: ifstmt { printf("stmt -> ifstmt\n"); }
    | BS { printf("stmt -> BS\n"); }
    ;
ifstmt: IF EXP THEN stmt { printf("ifstmt -> IF EXP THEN stmt\n"); }
    | IF EXP THEN stmt ELSE stmt { printf("ifstmt -> IF EXP THEN stmt ELSE stmt\n"); }
    ;
Output from bison
$ bison -d if.y
if.y: conflicts: 1 shift/reduce
Exercise 5 • Can you use yacc's precedence rules to remove the ambiguity?
Solution 5 • Convention is to associate the ELSE clause with the nearest IF. • Force ELSE to have higher precedence than THEN • This removes the shift/reduce conflict and forces yacc to shift on the previous example
%token IF THEN ELSE EXP BS
%nonassoc THEN
%nonassoc ELSE
Exercise 6 • Can you come up with an unambiguous grammar for if statements that always associates the else with the closest if?
Solution 6 • Separate if statements into matched (with ELSE clause and recursively matched stmts) and unmatched • The statement between THEN and ELSE must be matched, so an ELSE can only pair with the closest IF that is still unmatched
stmt: matched { printf("stmt -> matched\n"); }
    | unmatched { printf("stmt -> unmatched\n"); }
    ;
matched: BS { printf("matched -> BS\n"); }
    | IF EXP THEN matched ELSE matched { printf("matched -> IF EXP THEN matched ELSE matched\n"); }
    ;
unmatched: IF EXP THEN stmt { printf("unmatched -> IF EXP THEN stmt\n"); }
    | IF EXP THEN matched ELSE unmatched { printf("unmatched -> IF EXP THEN matched ELSE unmatched\n"); }
    ;
Exercise 7 • Can you change the syntax for if statements to remove the ambiguity? Hint: try to use syntax to denote the beginning and end of the statements in the if statement.
Solution 7 • This is the best solution since the matching IF statement and ELSE clause are visually clear. You do not have to remember unnatural precedence rules. • Such a language choice helps prevent logic bugs
stmt: ifstmt { printf("stmt -> ifstmt\n"); }
    | BS { printf("stmt -> BS\n"); }
    ;
ifstmt: IF EXP THEN '{' stmt '}' { printf("ifstmt -> IF EXP THEN { stmt }\n"); }
    | IF EXP THEN '{' stmt '}' ELSE '{' stmt '}' { printf("ifstmt -> IF EXP THEN { stmt } ELSE { stmt }\n"); }
    ;