Compiler Tools Lex/Yacc – Flex & Bison
Compiler Front End (from Engineering a Compiler)
Scanner (Lexical Analyzer)
• Maps a stream of characters into words, the basic unit of syntax
• x = x + y ; becomes <id,x> <eq,=> <id,x> <plus_op,+> <id,y> <sc,;>
• The actual text of a word is its lexeme
• Its part of speech (or syntactic category) is called its token type
• The scanner discards white space and (often) comments
• Speed is an issue in scanning, so use a specialized recognizer
[Diagram: source code → Scanner → tokens → Parser → intermediate representation; Scanner and Parser both report errors]
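The scanner's job can be illustrated with a small hand-written C function. This is only a sketch under stated assumptions: the names Token, TokenType, and next_token are hypothetical and are not taken from Engineering a Compiler or from flex, and only the four token types in the slide's example are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token types, named after the slide's example. */
typedef enum { TOK_EOF, TOK_ID, TOK_EQ, TOK_PLUS_OP, TOK_SC, TOK_UNKNOWN } TokenType;

typedef struct {
    TokenType type;   /* part of speech (token type) */
    char lexeme[32];  /* the actual word */
} Token;

/* Read one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;
    Token tok = { TOK_EOF, "" };

    while (isspace((unsigned char)*p))   /* the scanner discards white space */
        p++;

    if (*p == '\0') {                    /* nothing left: report end of input */
        *cursor = p;
        return tok;
    }

    size_t len = 1;
    if (isalpha((unsigned char)*p)) {
        /* identifier: a letter followed by letters or digits */
        while (isalnum((unsigned char)p[len]))
            len++;
        tok.type = TOK_ID;
    } else if (*p == '=') {
        tok.type = TOK_EQ;
    } else if (*p == '+') {
        tok.type = TOK_PLUS_OP;
    } else if (*p == ';') {
        tok.type = TOK_SC;
    } else {
        tok.type = TOK_UNKNOWN;
    }

    /* copy the lexeme, truncating if it is longer than the buffer */
    size_t copy = len < sizeof tok.lexeme - 1 ? len : sizeof tok.lexeme - 1;
    memcpy(tok.lexeme, p, copy);
    tok.lexeme[copy] = '\0';

    *cursor = p + len;
    return tok;
}
```

Calling next_token repeatedly on the string "x = x + y ;" yields the six (type, lexeme) pairs listed on the slide, followed by TOK_EOF. A generated scanner does the same work from regular expressions instead of hand-written tests.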
The Front End (from Engineering a Compiler)
Parser
• Checks the stream of classified words (parts of speech) for grammatical correctness
• Determines whether the code is syntactically well-formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code
• Parsing is harder than scanning, so it is better to put more of the rules (white space, etc.) in the scanner
[Diagram: source code → Scanner → tokens → Parser → IR; Scanner and Parser both report errors]
Flex – Fast Lexical Analyzer
• Here's where we'll put the regular expressions to good use!
• Input: regular expressions and C-code rules
• FLEX (scanner generator) produces lex.yy.c, which contains the yylex() scanner (a program that recognizes patterns in text)
• Compile lex.yy.c to get an executable that analyzes its input and executes the matching actions
Flex input file
• 3 sections, separated by %% lines:
definitions
%%
rules
%%
user code
Definition Section Examples
• Each line has the form: name definition
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
• A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to ([0-9])+"."([0-9])*
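To see what the expanded pattern accepts, here is a hand-written C sketch of the same match. The function name match_float_prefix is hypothetical; flex generates a table-driven recognizer rather than code like this.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Length of the longest prefix of s that matches {DIGIT}+"."{DIGIT}*,
 * or 0 if s does not start with such a prefix.
 */
size_t match_float_prefix(const char *s)
{
    assert(s != NULL);
    size_t i = 0;

    while (isdigit((unsigned char)s[i]))   /* {DIGIT}+ : one or more digits */
        i++;
    if (i == 0)
        return 0;                          /* the + requires at least one digit */

    if (s[i] != '.')                       /* "." : a literal dot */
        return 0;
    i++;

    while (isdigit((unsigned char)s[i]))   /* {DIGIT}* : zero or more digits */
        i++;

    return i;
}
```

With this pattern "3.14" and "12." both match in full, while ".5" and "12" do not match at all, because the leading digits and the dot are both required.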
C Code
• Can include C code in the definitions section:
%{
/* This is a comment inside the definition */
#include <math.h>    // may need headers
#include <stdio.h>   // for printf in BB
#include <stdlib.h>  // for exit(0) in BB
%}
Rules • The rules section of the flex input contains a series of rules of the form: pattern action • In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the output (with the %{ %}'s removed). The %{ %}'s must appear unindented on lines by themselves.
Example: Simple Pascal-like recognizer
• Definitions section:
/* scanner for a toy Pascal-like language */
%{
#include <math.h>
#include <stdlib.h>   /* declares atoi() and atof(), used in the rules below */
%}
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
• Remember that %{ and %} are on lines by themselves, unindented! The lines between them are inserted as-is into the resulting code
• DIGIT and ID are definitions that can be used in the rules section
Example continued
• Rules section (each rule is a pattern followed by an action; yytext, a char*, is the text that matched the pattern):
%%
{DIGIT}+    { printf("An integer: %s (%d)\n", yytext, atoi(yytext)); }
{DIGIT}+"."{DIGIT}*    { printf("A float: %s (%g)\n", yytext, atof(yytext)); }
if|then|begin|end|procedure|function    { printf("A keyword: %s\n", yytext); }
{ID}    { printf("An identifier: %s\n", yytext); }
"+"|"-"|"*"|"/"    { printf("An operator: %s\n", yytext); }
"{"[^}\n]*"}"    /* eat up one-line comments */
[ \t\n]+    /* eat up whitespace */
.    { printf("Unrecognized character: %s\n", yytext); }
Example continued
• User code (required for flex, in library for lex):
%%
/* yywrap is needed to link unless libfl.a is available,
   OR put %option noyywrap at the top of the flex file. */
int yywrap(void) { return 1; }   /* 1 means there is no more input */

int main(int argc, char **argv)
{
    ++argv, --argc;   /* skip over program name */
    if (argc > 0)
        yyin = fopen(argv[0], "r");
    else
        yyin = stdin;
    yylex();
    return 0;
}
• yyin is the lex input file; yylex() is the lexer function produced by lex
Lex techniques
• Hardcoding lists of words in the rules is not very effective; scanners often use a symbol table instead. There is an example in lex & yacc. It is not covered in class, but see me if you're interested.
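One simple version of the table idea is a sorted keyword list that a single identifier rule consults. The sketch below is an assumption about how such a lookup might look, not the example from the book; is_keyword and KEYWORDS are hypothetical names, and a real symbol table would also store user-defined names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Keywords from the toy Pascal-like scanner, kept sorted for bsearch(). */
static const char *const KEYWORDS[] = {
    "begin", "end", "function", "if", "procedure", "then"
};

/* bsearch() comparison: key is the word, element points at a table entry. */
static int compare_words(const void *key, const void *element)
{
    const char *word = key;
    const char *const *entry = element;
    return strcmp(word, *entry);
}

/* Return 1 if word is a keyword, 0 if it should be treated as an identifier. */
int is_keyword(const char *word)
{
    assert(word != NULL);
    size_t count = sizeof KEYWORDS / sizeof KEYWORDS[0];
    return bsearch(word, KEYWORDS, count, sizeof KEYWORDS[0], compare_words) != NULL;
}
```

With a helper like this, the action for the {ID} rule could call is_keyword(yytext) and report a keyword or an identifier accordingly, so adding a keyword means editing the table rather than the patterns.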
Bison – like Yacc (yet another compiler compiler)
• Input: context-free grammar in BNF form, LALR(1)*
• Output: Bison parser (a C program) that groups tokens according to the grammar rules
• The Bison parser provides yyparse
• You must provide:
  • the lexical analyzer (e.g., flex)
  • an error-handling routine named yyerror
  • a main routine that calls yyparse
* LALR: Look-Ahead LR (Left-to-right scan of the input, Rightmost derivation in reverse)
Bison Parser • Same sections as flex (yacc came first): definitions, rules, C-Code • We’ll discuss rules first, then definitions and C-Code
Bison Parser – Rule Section
• Consider the CFG <statement> -> ID = <expression>
• It would be written in the bison "rules" section as:
statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;
expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
    | NUMBER '-' NUMBER { $$ = $1 - $3; }
    | NUMBER { $$ = $1; }
    ;
• Use : between the lhs and rhs, and place ; at the end. White space is free-form.
• What is $$? See the next slide.
• NOTE: The first rule in statement won't be operational yet…
More on bison Rules and Actions
• $1, $3 refer to RHS values. $$ sets the value of the LHS.
• In expression: NUMBER '+' NUMBER, expression is $$, the first NUMBER is $1, '+' is $2, and the second NUMBER is $3
• $$ = $1 + $3 sets the value of the lhs (expression) to NUMBER ($1) + NUMBER ($3)
• When is an action executed? When the parser reduces that rule (it will have recognized both NUMBER symbols)
• The lexer should have returned a value via yylval (next slide)
statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;
expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
    | NUMBER '-' NUMBER { $$ = $1 - $3; }
    ;
Coordinating flex and bison
• Example to return an int value:
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
• yylval = atoi(yytext) sets the value for use in actions. This one is just the numeric value of the string stored in yytext (atoi is the C function that converts a string to an integer)
• return NUMBER returns the recognized token
• In prior flex examples we just returned tokens, not values
• Also need to skip whitespace and return symbols:
[ \t]    ;    /* ignore white space */
\n    return 0;    /* logical EOF */
.    return yytext[0];
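The four rules above can be read as a hand-written lexer. The following C sketch is an illustration only: sketch_lex, sketch_lval, and set_input are hypothetical stand-ins for yylex, yylval, and the input setup, the input comes from a string, and the NUMBER code is defined by hand here although bison normally generates it in the .tab.h header.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define NUMBER 259                 /* token code; normally comes from the bison header */

int sketch_lval;                   /* stand-in for yylval */
static const char *input_cursor = "";

/* Point the sketch lexer at a new input string. */
void set_input(const char *text)
{
    assert(text != NULL);
    input_cursor = text;
}

/* Return the next token code, or 0 at the logical end of input. */
int sketch_lex(void)
{
    while (*input_cursor == ' ' || *input_cursor == '\t')   /* [ \t] ;  ignore white space */
        input_cursor++;

    if (*input_cursor == '\0' || *input_cursor == '\n')     /* \n return 0;  logical EOF */
        return 0;

    if (isdigit((unsigned char)*input_cursor)) {            /* [0-9]+ */
        int value = 0;
        while (isdigit((unsigned char)*input_cursor)) {
            value = value * 10 + (*input_cursor - '0');
            input_cursor++;
        }
        sketch_lval = value;                                /* value for use in actions */
        return NUMBER;                                      /* token type */
    }

    return (unsigned char)*input_cursor++;                  /* . return yytext[0]; */
}
```

For the input "12 + 7" followed by a newline, successive calls return NUMBER (with sketch_lval set to 12), the character code of '+', NUMBER (with sketch_lval set to 7), and then 0. The token code tells the parser what kind of word it saw; the separate value variable carries what the word was.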
Bison Rule Details • Unlike flex, bison doesn’t care about line boundaries, so add white space for readability • Symbol on lhs of first rule is start symbol, can override with %start declaration in definition section • Symbols in bison have values, must be “declared” as some type • YYSTYPE determines type • Default for all values is int • We’ll be using different types for YYSTYPE in the SimpleCalc exercises
Bison Parser – Definition Section
• Tokens used in the grammar should be defined. Example rule:
expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
• The token NUMBER should be defined. Later we'll see cases where expression should also be defined, and how to define tokens with other data types. The keyword %token must be lowercase, e.g.:
%token NUMBER
• From the tokens that are defined, Bison will create an appropriate header file (when run with -d)
• Single-quoted characters can be used as tokens without declaring them, e.g., '+', '='
Lex - Definition Section
• Must include the header created by bison
• Must declare yylval as extern
%{
#include "simpleCalc.tab.h"
extern int yylval;
#include <math.h>
%}
Bison Parser – C Section
• At a minimum, provide yyerror and main routines:
void yyerror(const char *errmsg)
{
    fprintf(stderr, "%s\n", errmsg);
}

int main(void)
{
    yyparse();
    return 0;
}
Bison Intro Exercise
• Download SimpleCalc.y, SimpleCalc.l and mbison.bat
• Create the calculator executable:
mbison simpleCalc
• FYI, mbison includes these steps:
bison -d simpleCalc.y
flex -L -osimpleCalc.c simpleCalc.l
gcc -c simpleCalc.c
gcc -c simpleCalc.tab.c
gcc -Lc:\progra~1\gnuwin32\lib simpleCalc.o simpleCalc.tab.o -osimpleCalc.exe -lfl -ly
• Test with valid sentences (e.g., 3+6-4) and invalid sentences.
Understanding simpleCalc
simpleCalc.l:
%{
#include "simpleCalc.tab.h"
extern int yylval;
%}
%%
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
[ \t]    ;    /* ignore white space */
\n    return 0;    /* logical EOF */
.    return yytext[0];
%%
/*---------------------------------------*/
/* 5. Other C code that we need. */
void yyerror(const char *errmsg) { fprintf(stderr, "%s\n", errmsg); }
int main(void) { yyparse(); return 0; }
simpleCalc.tab.h:
#ifndef YYTOKENTYPE
# define YYTOKENTYPE
/* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */
enum yytokentype { NAME = 258, NUMBER = 259 };
#endif
/* Tokens. */
#define NAME 258
#define NUMBER 259
Explanation: When the lexer recognizes a number ([0-9]+) it returns the token NUMBER and sets yylval to the corresponding integer value. When the lexer sees a newline it returns 0. If it sees a space or tab it ignores it. When it sees any other character it returns that character (the first character in the yytext buffer). If yyparse recognizes it, good! Otherwise the parser can generate an error.
Understanding simpleCalc, continued
simpleCalc.y:
%token NAME NUMBER
%%
statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;
expression: expression '+' NUMBER { $$ = $1 + $3; }
    | expression '-' NUMBER { $$ = $1 - $3; }
    | NUMBER { $$ = $1; }
    ;
Explanation: execute simpleCalc and enter the expression 1+2
• The main program calls yyparse.
• yyparse calls lex, which recognizes 1 as a NUMBER (puts 1 in yylval); the parser reduces NUMBER to expression and sets $$ = $1.
• It calls lex, which returns +; this matches '+' in the first expression rhs.
• It calls lex, which recognizes 2 as a NUMBER (puts 2 in yylval).
• It recognizes expression '+' NUMBER and "reduces" this rule, doing the action { $$ = $1 + $3 }.
• It recognizes expression as a statement, so it does the printf action.
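The trace above can be mimicked by a hand-written evaluator. Note that this sketch swaps in a different technique: it is a plain loop, not an LALR(1) parser, and it only shows that the left-recursive rules apply their actions left to right. The names read_number and evaluate_line are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip blanks, then read a run of digits as an int. Returns 0 if no digit is present. */
static int read_number(const char **p, int *value)
{
    while (**p == ' ' || **p == '\t')
        (*p)++;
    if (!isdigit((unsigned char)**p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)**p)) {
        *value = *value * 10 + (**p - '0');
        (*p)++;
    }
    return 1;
}

/*
 * Evaluate one line using the simpleCalc expression rules.
 * Returns 1 and stores the value in *result on success, 0 on a syntax error.
 */
int evaluate_line(const char *line, int *result)
{
    assert(line != NULL && result != NULL);
    const char *p = line;
    int acc, number;

    if (!read_number(&p, &acc))          /* expression: NUMBER  { $$ = $1; } */
        return 0;

    for (;;) {
        while (*p == ' ' || *p == '\t')
            p++;
        if (*p == '\0' || *p == '\n')    /* logical EOF: the statement is complete */
            break;
        char op = *p++;
        if (op != '+' && op != '-')
            return 0;                    /* unexpected token */
        if (!read_number(&p, &number))
            return 0;                    /* operator without a NUMBER after it */
        if (op == '+')
            acc = acc + number;          /* expression '+' NUMBER  { $$ = $1 + $3; } */
        else
            acc = acc - number;          /* expression '-' NUMBER  { $$ = $1 - $3; } */
    }

    *result = acc;
    return 1;
}
```

For the exercise input 3+6-4 this computes (3+6)-4, which is 5: because the grammar is left-recursive, each reduction folds one more NUMBER into the running expression value. An input such as 3+ is rejected, which corresponds to the point where the generated parser would call yyerror.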
Adding other variable types*
• YYSTYPE determines the data type of the values returned by the lexer.
• If the lexer returns different types depending on what is read, include a union:
%union {          // C feature, allows one memory area to
    char cval;    // be interpreted in different ways.
    char *sval;   // For bison, will be used with yylval
    int ival;
}
• The union is placed at the top of your .y file (in the definitions section)
• Tokens and non-terminals should be defined using the union
* relates to SimpleCalc exercise 2
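For readers new to unions, here is the same declaration as plain C. This is a sketch only: SemanticValue, make_int_value, and make_string_value are hypothetical names, and in a real grammar bison generates the YYSTYPE union from the %union declaration.

```c
#include <assert.h>
#include <stddef.h>

/* Plain C counterpart of the %union on the slide: the members share one memory area. */
typedef union {
    char  cval;
    char *sval;
    int   ival;
} SemanticValue;

/* Store an integer, as a lexer action would with yylval.ival = ... */
SemanticValue make_int_value(int number)
{
    SemanticValue v;
    v.ival = number;
    return v;
}

/* Store a string pointer, as a lexer action would with yylval.sval = ... */
SemanticValue make_string_value(char *text)
{
    assert(text != NULL);
    SemanticValue v;
    v.sval = text;
    return v;
}
```

Only the member that was written last should be read back. The union itself does not record which one that is, which is why the grammar has to state the member to use for each token and non-terminal.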
Adding other variable types - Example
• Definitions in simpleCalc.y:
%union {
    float fval;
    int ival;
}
%token <ival> NUMBER
%token <fval> FNUMBER
%type <fval> expression
• Use the union member in the rules in simpleCalc.l:
{DIGIT}+    { yylval.ival = atoi(yytext); return NUMBER; }
Processing lexemes in flex* • Sometimes you want to modify a lexeme before it is passed to bison. This can be done by putting a function call in the flex rules • Example: to convert input to lower case • put a prototype for your function in the definition section (above first %%) • write the function definition in the C-code section (bottom of file) • call your function when the token is recognized. Use strdup to pass the value to bison. * relates to SimpleCalc exercise 3
Example continued
%{
#include <ctype.h>     /* tolower */
#include <string.h>    /* strdup */
#include "example.tab.h"
void make_lower(char *text_in);    /* need prototype here */
%}
%%
[a-zA-Z]+    {
    make_lower(yytext);              /* function call to process text */
    yylval.sval = strdup(yytext);    /* make duplicate using strdup */
    return KEYWORD;                  /* return token type */
}
%%
/* function code in C section */
void make_lower(char *text_in)
{
    int i;
    for (i = 0; text_in[i] != '\0'; ++i)
        text_in[i] = tolower((unsigned char)text_in[i]);
}
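The lower-casing and the copy can also be combined into one helper that leaves the original text untouched. This is a sketch under stated assumptions: lower_copy is a hypothetical name, and it uses malloc instead of strdup because strdup is a POSIX function rather than part of older C standards.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated lower-case copy of text,
 * or NULL if allocation fails. The caller must free() the result.
 */
char *lower_copy(const char *text)
{
    assert(text != NULL);
    size_t len = strlen(text);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    for (size_t i = 0; i < len; ++i)
        copy[i] = (char)tolower((unsigned char)text[i]);
    copy[len] = '\0';
    return copy;
}
```

A copy is needed in either version because yytext is reused for the next match, so a pointer into it would not stay valid once the parser asks for another token. Whoever receives the copy is responsible for freeing it.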
Adding actions to rules *
• For more complex processing, functions can be added to bison
• Remember to add a prototype at the top, and the function at the bottom
* relates to SimpleCalc exercise 4
Processing more than one line *
• To process more than one line, ensure the \n is simply ignored
• Use a recursive rule to allow multiple inputs
* relates to SimpleCalc exercise 4
Summary of steps (from online manual) The actual language-design process using Bison, from grammar specification to a working compiler or interpreter, has these parts: • Formally specify the grammar in a form recognized by Bison (i.e., machine-readable BNF). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements. • Write a lexical analyzer to process input and pass tokens to the parser. • Write a controlling function (main) that calls the Bison-produced parser. • Write error-reporting routines.