Lesson 10 CDT301 – Compiler Theory, Spring 2011 Teacher: Linus Källberg
Outline • Flex • Bison • Abstract syntax trees
Flex • Tool for automatic generation of scanners • Open-source version of Lex • Takes regular expressions as input • Outputs a C (or C++) file for the scanner
Flex • Workflow: mylexer.l (regexps) → Flex → mylexer.c (int yylex() …) → C compiler → mylexer.obj (01101000110101010…)
The input file to Flex • Definitions • %% • Rules • %% • User code
The definitions section • Macro definitions: • Specify a letter: letter [A-Za-z] • Specify a delimiter: delimiter [ ,:;.] • Specify a digit: digit [0-9] • Specify an identifier (macros are referenced inside braces): id {letter}({letter}|{digit})*
The definitions section • User code: %{ #include <stdio.h> int a_nice_global_variable = 0; int my_favourite_function(void) { return 42; } %}
The rules section • Rule = regexp + C code • Longest matching pattern is used • If two equally long patterns match, the first one in the file is used • Examples: =|>=?|<(=|>)? { return RELOP; } {id} { return ID; }
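The two selection rules above (longest match, then first rule in the file) can be sketched in plain C. This is only an illustration of the policy, not the table-driven code Flex actually generates. The hand-written matchers, the `rules` table, `next_token()` and the token codes are all hypothetical names introduced for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; in a real project Bison would define them. */
enum { NONE = 0, WHILE, ID, RELOP };

/* Each matcher returns the length of the prefix of s it matches (0 = none). */
static size_t match_while(const char *s) {
    return strncmp(s, "while", 5) == 0 ? 5 : 0;
}

static size_t match_id(const char *s) {        /* {letter}({letter}|{digit})* */
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

static size_t match_relop(const char *s) {     /* =|>=?|<(=|>)? */
    if (s[0] == '=') return 1;
    if (s[0] == '>') return s[1] == '=' ? 2 : 1;
    if (s[0] == '<') return (s[1] == '=' || s[1] == '>') ? 2 : 1;
    return 0;
}

struct rule { size_t (*match)(const char *); int token; };

/* Rules in "file order": the keyword rule comes before the identifier rule. */
static const struct rule rules[] = {
    { match_while, WHILE },
    { match_id,    ID    },
    { match_relop, RELOP },
};

/* Pick the longest match. On a tie the earliest rule wins, because only a
   strictly longer match replaces the current best. */
int next_token(const char *s, size_t *len) {
    int best = NONE;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) { *len = n; best = rules[i].token; }
    }
    return best;
}
```

With this ordering, the input `while (` yields WHILE (a tie of length 5, resolved by rule order), whereas `whilex` yields ID because the identifier rule matches one character more.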
The regexp language of Flex • ? Previous regexp is optional • {} Macro expansion (defined in the definitions section) • . Matches any character that is not end of line • $ Matches the end of a line • ^ Matches the beginning of a line • [] Matches any enclosed character
The [] syntax • Similar to | but more powerful • Example: digit [0123456789] is the same as digit 0|1|2|3|4|5|6|7|8|9 • Special characters inside the brackets: - (range) and ^ (negation, when placed first) digit [0-9] letter [A-Za-z] non_digit [^0-9]
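To make the meaning of `-` and `^` inside brackets concrete, here is a minimal sketch of a membership test for the text between `[` and `]`. The function name `in_class()` is hypothetical, and the sketch deliberately ignores escapes and named classes, which real Flex patterns also support.

```c
#include <stdbool.h>

/* Returns true if c is matched by the bracket body `spec` (the text between
   '[' and ']'). Supports ranges such as a-z and a leading '^' for negation.
   Escapes and named classes are not handled in this sketch. */
bool in_class(const char *spec, char c) {
    bool negate = false, found = false;
    if (*spec == '^') { negate = true; spec++; }
    while (*spec) {
        if (spec[1] == '-' && spec[2] != '\0') {   /* range lo-hi */
            if (spec[0] <= c && c <= spec[2]) found = true;
            spec += 3;
        } else {                                   /* single character */
            if (*spec == c) found = true;
            spec++;
        }
    }
    return negate ? !found : found;
}
```

For example, `in_class("A-Za-z", 'q')` is true and `in_class("^0-9", '3')` is false.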
The user code section • Only C code valid here • Will be copied unchanged to the generated C file
The generated scanner • By default, a function called yylex() is defined • Works similarly to your GetNextToken() from lab 1 • The name can be changed with options • Some globals are defined as well (can be changed into local variables with options): • yyin: the file to read from • yytext: the matched lexeme (char*) • yyleng: the length of yytext • yylineno: line number of the match
The yywrap() function • Called upon end-of-file • Should be supplied by the user • Suppressed with %option noyywrap or --noyywrap
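If the option is not used, a minimal user-supplied yywrap() for a scanner that reads a single input could look like the sketch below. It assumes no further input files need to be chained, so it simply reports that scanning is finished.

```c
/* Called by the generated scanner when it reaches end of input.
   Returning 1 tells the scanner that there is no more input to process;
   a version that returns 0 would first have to set up a new input source. */
int yywrap(void) {
    return 1;
}
```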
Scanner states in Flex • Affect which tokens are recognized • Example from the language ALF: { fref 32 DEADC0DE } <- Identifier { hex_val DEADC0DE } <- Hex constant
Scanner states in Flex • Declare state: %x READ_HEX • Use the state to make rules conditional: hex_val { BEGIN(READ_HEX); return HEX_VAL_KW; } [a-zA-Z_][a-zA-Z0-9_]* { return ID; } <READ_HEX>[0-9a-fA-F]+ { BEGIN(INITIAL); return NUM; }
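The effect of the READ_HEX state can be imitated with a small hand-written scanner that keeps an explicit state variable. This is a sketch of the idea only: the `struct scanner`, the `next()` function, the END and OTHER codes, and the blank-skipping in every state are assumptions made for the example, not something Flex generates or the slides specify.

```c
#include <ctype.h>
#include <string.h>

enum state { INITIAL, READ_HEX };
enum token { END = 0, HEX_VAL_KW, ID, NUM, OTHER };

struct scanner { const char *p; enum state state; };

static int is_id_char(int c) { return isalnum(c) || c == '_'; }

enum token next(struct scanner *sc) {
    while (*sc->p == ' ') sc->p++;      /* skip blanks (a simplification) */
    if (*sc->p == '\0') return END;
    const char *start = sc->p;

    /* <READ_HEX>[0-9a-fA-F]+ : only active right after hex_val */
    if (sc->state == READ_HEX && isxdigit((unsigned char)*sc->p)) {
        while (isxdigit((unsigned char)*sc->p)) sc->p++;
        sc->state = INITIAL;            /* corresponds to BEGIN(INITIAL) */
        return NUM;
    }

    /* hex_val and [a-zA-Z_][a-zA-Z0-9_]* : active in the INITIAL state */
    if (sc->state == INITIAL &&
        (isalpha((unsigned char)*sc->p) || *sc->p == '_')) {
        while (is_id_char((unsigned char)*sc->p)) sc->p++;
        if (sc->p - start == 7 && strncmp(start, "hex_val", 7) == 0) {
            sc->state = READ_HEX;       /* corresponds to BEGIN(READ_HEX) */
            return HEX_VAL_KW;
        }
        return ID;
    }

    sc->p++;                            /* anything else: one character */
    return OTHER;
}
```

Scanning `hex_val DEADC0DE` produces HEX_VAL_KW followed by NUM, while `DEADC0DE` on its own is scanned as ID, which is the distinction the ALF example needs.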
Online resources http://flex.sourceforge.net/manual/index.html
Bison • Tool for automatic generation of parsers • Open-source alternative to Yacc • Takes an SDT scheme as input • Outputs C (or C++) source code for an LALR parser • Commonly used together with Flex
Bison • Workflow: myparser.y (SDT scheme) → Bison → myparser.c (int yyparse() …) and myparser.h (token definitions) → C compiler → myparser.obj (01101000110101010…)
The input file to Bison • Definitions • %% • SDT scheme • %% • User code
Definitions section • Define tokens • Define operator precedence • Define operator associativity • Define the types of grammar symbol attributes • Write C code between %{ and %} • Issue certain commands to Bison
Token definition • Normal case: %token IDENTIFIER %token WHILE • Token, precedence, associativity, and type: %left <Operator> RELOP %left <Operator> MINUSOP PLUSOP %right <Operator> NOTOP • Enables use of ambiguous grammars!
Defining types • Just enter the type inside <> before the list of tokens: %left <Operator> RELOP %left <Operator> MULOP %right <Operator> NOTOP UNOP %token <String> ID STRING • Or the same for non-terminals: %type <Node> stmnt expr actuals exprs
The variable yylval • Used by the lexical analyzer to store token attributes • Default type is int • May be given one or more other types using %union: %union { int Operator; char *String; NODE_TYPE Node; } • The type (member name) is then used like this: %token <String> ID STRING
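As a rough picture of what the %union declaration amounts to on the C side, the sketch below spells out the union by hand and shows a scanner action storing an attribute in yylval before returning the token code. In a real build Bison generates the union type and the token codes; here `NODE_TYPE` is assumed to be a pointer type, and the numeric token codes, `OP_LT`, `scan_identifier()` and `scan_relop()` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

typedef struct node *NODE_TYPE;   /* assumption: NODE_TYPE is a pointer type */

/* Hand-written stand-in for the type Bison derives from %union. */
typedef union {
    int Operator;
    char *String;
    NODE_TYPE Node;
} YYSTYPE;

YYSTYPE yylval;                   /* normally defined by the generated parser */

enum { ID = 258, RELOP = 259 };   /* hypothetical token codes */
enum { OP_LT = 1 };               /* hypothetical operator code */

/* What an {id} action might do: copy the lexeme, because the scanner's
   text buffer is reused for the next match. Returns 0 if allocation fails. */
int scan_identifier(const char *text, size_t len) {
    char *copy = malloc(len + 1);
    if (copy == NULL) return 0;
    memcpy(copy, text, len);
    copy[len] = '\0';
    yylval.String = copy;         /* member chosen by %token <String> ID */
    return ID;
}

/* What a relational-operator action might do. */
int scan_relop(int op) {
    yylval.Operator = op;         /* member chosen by %left <Operator> RELOP */
    return RELOP;
}
```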
Code provided by the user • yyerror(char* msg) • Function called on syntax errors • yylex() • Function called to get the next token
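A minimal yyerror() might look like the following sketch. The `error_count` variable is a hypothetical addition for the example, and the `const` qualifier on the parameter is a common variation of the signature shown above.

```c
#include <stdio.h>

int error_count = 0;   /* hypothetical counter, not required by Bison */

/* Called by the generated parser when it detects a syntax error. */
void yyerror(const char *msg) {
    error_count++;
    fprintf(stderr, "error: %s\n", msg);
}
```

Keeping a counter lets the caller of the parser decide afterwards whether later compiler phases should run at all.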
Options to Bison • Given on the command line or in the grammar file • --defines or %defines: Output a C header file with definitions useful to a scanner • Tokens (#defines) and the type of yylval • %error-verbose: More detailed error messages • --name-prefix or %name-prefix: Change the default “yy” prefix on all names • %define api.pure: Do not use globals • --verbose or %verbose: Write detailed information to an extra output file
Translation scheme section decl : BASIC_TYPE idents ';' ; idents : idents ',' ident | ident ; ident : ID ;
Semantic actions • Written in C • Executed when the production is used in a reduction • $$, $1, $2, etc. refer to the attributes of the grammar symbols • Can be used as regular C variables • $$ refers to the attribute of the head, $1 to the attribute of the first symbol in the body, etc. E : E '+' T { $$ = $1 + $3; } ;
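One way to picture $$, $1 and $3 is as slots on a value stack that the parser keeps in parallel with its state stack. The sketch below imitates a single reduction by E : E '+' T on an integer stack. The `struct value_stack`, `push()` and `reduce_add()` names are hypothetical, bounds checking is omitted, and Bison's real stack handling is more involved.

```c
#include <stddef.h>

#define STACK_MAX 64

struct value_stack { int v[STACK_MAX]; size_t top; };

/* Shifting a symbol pushes its attribute (no overflow check in this sketch). */
void push(struct value_stack *s, int attr) { s->v[s->top++] = attr; }

/* Reduce by E : E '+' T. The attributes of the three body symbols are the
   top three stack entries: body[0] is $1, body[1] is $2, body[2] is $3. */
void reduce_add(struct value_stack *s) {
    int *body = &s->v[s->top - 3];
    int result = body[0] + body[2];   /* $$ = $1 + $3 */
    s->top -= 3;                      /* pop the body */
    push(s, result);                  /* push the attribute of the head */
}
```

After pushing the attributes 2, '+' and 40 and calling `reduce_add()`, the stack holds the single value 42.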
Using ambiguous grammars in Bison • Default actions: • Reduce/reduce: choose first rule in file • Shift/reduce: always shift • With explicit precedence and associativity: • Shift/reduce: Compare the precedence and associativity of the rule with those of the lookahead token
The %expect declaration • To suppress shift/reduce warnings: %expect n where n is the exact number of conflicts
Contextual precedence • The same token might have different precedence depending on context: expr → expr - expr | expr * expr | - expr | id • Stack: … - expr • Input: * expr …
Contextual precedence • Define dummy token: %left '-' %left '*' %left UMINUS • Use the %prec modifier: expr → - expr %prec UMINUS
Examples of parser configurations • Stack: … if (cond) stmt; Input: else …; Action: shift • Stack: … expr + expr; Input: * …; Action: shift • Stack: … expr * expr; Input: + …; Action: reduce expr → expr * expr • Stack: … expr * expr; Input: * …; Action: reduce expr → expr * expr
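The decisions in these configurations follow from a small comparison, sketched here in C. The numeric precedence levels (0 meaning "none declared", higher meaning tighter binding), the enum names and `resolve()` are hypothetical. The sketch assumes the rule's precedence is that of its last terminal or of its %prec token.

```c
enum assoc  { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE };
enum action { SHIFT, REDUCE, SYNTAX_ERROR };

/* Resolve a shift/reduce conflict between a rule and the lookahead token.
   tok_assoc is the associativity declared for the token's precedence level. */
enum action resolve(int rule_prec, int tok_prec, enum assoc tok_assoc) {
    if (rule_prec == 0 || tok_prec == 0)
        return SHIFT;                 /* no precedence declared: default shift */
    if (rule_prec > tok_prec)
        return REDUCE;                /* the rule binds tighter */
    if (rule_prec < tok_prec)
        return SHIFT;                 /* the lookahead binds tighter */
    switch (tok_assoc) {              /* equal precedence: use associativity */
        case ASSOC_LEFT:  return REDUCE;
        case ASSOC_RIGHT: return SHIFT;
        default:          return SYNTAX_ERROR;
    }
}
```

With '+' at level 1 and '*' at level 2, both left-associative, `resolve(1, 2, ASSOC_LEFT)` gives SHIFT for the stack `expr + expr` with lookahead `*`, and `resolve(2, 2, ASSOC_LEFT)` gives REDUCE for `expr * expr` with lookahead `*`.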
Online resources http://www.gnu.org/software/bison/manual/html_node/index.html
Abstract syntax trees • “AST” or just “syntax tree” • [Figure: a parse tree with E nodes next to the corresponding syntax tree, for an expression over a, 5 and b using + and *]
Syntax trees vs. parse trees • Parse trees: • Interior nodes are nonterminals, leaves are terminals • Rarely constructed as an explicit data structure • Represent the concrete syntax • Syntax trees: • Interior nodes are “operators”, leaves are operands • Commonly constructed as an explicit data structure • Represent the abstract syntax
Why syntax trees? • Simplify subsequent analyses • Independent of the parsing strategy • Make it easier to add new analysis passes without having to modify the parser • More compact representation than parse trees
Syntax tree example if (a < 1) b = 2 + 3; else { c = d * 4; e(f, 5); } • [Figure: syntax tree rooted at an if node, with subtrees for the condition a < 1, the assignment b = 2 + 3, and the else branch containing the assignment c = d * 4 and the call e(f, 5); unused links are shown as null]
Exercise (1) • Draw an abstract syntax tree for the statement while (i < 100) { x = 2 * x; i = i + 1; }
Constructing a syntax tree in Bison expr : expr '+' expr { $$ = createOpNode($1, '+', $3); } | expr '*' expr { $$ = createOpNode($1, '*', $3); } | ID { $$ = createIdNode($1.name); } ;
Constructing a syntax tree in Bison stmt : RETURN expr ';' { $$ = mReturn($2, $1); } ; stmts : stmts stmt { $$ = connectStmts($1, $2); } | { $$ = NULL; } ;
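The slides do not show how createOpNode() and createIdNode() are implemented. A minimal sketch under assumed definitions could look like this: the `struct node` layout, the `kind` tags and `freeTree()` are hypothetical, and mReturn() and connectStmts() are left out.

```c
#include <stdlib.h>
#include <string.h>

enum kind { NODE_ID, NODE_OP };

/* Hypothetical node layout: operators are interior nodes, names are leaves. */
struct node {
    enum kind kind;
    char op;                    /* '+' or '*' for NODE_OP */
    char *name;                 /* identifier name for NODE_ID */
    struct node *left, *right;  /* operands for NODE_OP */
};

static struct node *new_node(enum kind kind) {
    struct node *n = calloc(1, sizeof *n);
    if (n != NULL) n->kind = kind;
    return n;
}

/* Leaf node; the name is copied so the node owns its own string. */
struct node *createIdNode(const char *name) {
    struct node *n = new_node(NODE_ID);
    if (n == NULL) return NULL;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) { free(n); return NULL; }
    strcpy(n->name, name);
    return n;
}

/* Interior node for a binary operator. */
struct node *createOpNode(struct node *left, char op, struct node *right) {
    struct node *n = new_node(NODE_OP);
    if (n != NULL) { n->op = op; n->left = left; n->right = right; }
    return n;
}

/* Release a whole tree. */
void freeTree(struct node *n) {
    if (n == NULL) return;
    freeTree(n->left);
    freeTree(n->right);
    free(n->name);
    free(n);
}
```

Because each reduction passes the already built subtrees up as $1 and $3, the tree for the whole expression is complete once the start symbol has been reduced.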
Conclusion • Flex generates C source code for a scanner given a set of regular expressions • Bison generates C source code for a bottom-up parser given a syntax-directed translation scheme • Building syntax trees simplifies subsequent analyses of the program • Syntax trees can be built in semantic actions
Next time • Syntax-directed definitions and translation schemes • Semantic analysis and type analysis