Compiler Tools Lex/Yacc – Flex & Bison
Compiler Front End (from Engineering a Compiler) Scanner (Lexical Analyzer) • Maps stream of characters into words • Basic unit of syntax • x = x + y ; becomes <id,x> <eq,=> <id,x> <plus_op,+> <id,y> <sc,;> • The actual characters of a word are its lexeme • Its part of speech (or syntactic category) is called its token type • Scanner discards white space & (often) comments • Speed is an issue in scanning, so use a specialized recognizer [Figure: source code → Scanner → tokens → Parser → Intermediate Representation; Scanner and Parser both report errors]
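To make the lexeme / token type distinction concrete, here is a minimal hand-written sketch in C of a scanner for the x = x + y ; example. It is not taken from the slides or from Engineering a Compiler: the Token struct and the next_token function are hypothetical names, and only the token types shown on the slide (id, eq, plus_op, sc) are handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token record: part of speech plus the matched characters. */
typedef struct {
    const char *type;   /* token type, e.g. "id" */
    char lexeme[32];    /* lexeme, NUL-terminated */
} Token;

/* Scans one token starting at *src and advances *src past it.
   Returns 1 on success, 0 at end of input or on an unknown character. */
int next_token(const char **src, Token *tok)
{
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p))   /* the scanner discards white space */
        p++;
    if (*p == '\0') {
        *src = p;
        return 0;
    }

    if (isalpha((unsigned char)*p)) {
        /* identifier: a letter followed by letters or digits
           (anything past 31 characters would start a new token) */
        while (isalnum((unsigned char)*p) && n < sizeof tok->lexeme - 1)
            tok->lexeme[n++] = *p++;
        tok->type = "id";
    } else {
        switch (*p) {
        case '=': tok->type = "eq";      break;
        case '+': tok->type = "plus_op"; break;
        case ';': tok->type = "sc";      break;
        default:  *src = p; return 0;    /* not part of this toy language */
        }
        tok->lexeme[n++] = *p++;
    }
    tok->lexeme[n] = '\0';
    *src = p;
    return 1;
}
```

Calling next_token repeatedly on "x = x + y ;" yields the six pairs listed on the slide. A generated scanner does the same job from a table instead of hand-written branches.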
The Front End (from Engineering a Compiler) Parser • Checks stream of classified words (parts of speech) for grammatical correctness • Determines if code is syntactically well-formed • Guides checking at deeper levels than syntax • Builds an IR representation of the code Parsing is harder than scanning, so it is better to put more rules in the scanner (whitespace etc.). [Figure: source code → Scanner → tokens → Parser → IR; Scanner and Parser both report errors]
The Big Picture • Language syntax is specified with parts of speech, not words • Syntax checking matches parts of speech against a grammar
  S = goal
  T = { number, id, +, - }
  N = { goal, expr, term, op }
  P = { 1, 2, 3, 4, 5, 6, 7 }
  1. goal -> expr
  2. expr -> expr op term
  3.       | term
  4. term -> number
  5.       | id
  6. op   -> +
  7.       | -
Parts of speech, not words! No words here!
Why study lexical analysis? • We want to avoid writing scanners by hand • Finite automata are used in other applications: grep, website filtering, various “find” commands Goals: • To simplify specification & implementation of scanners • To understand the underlying techniques and technologies [Figure: specifications, written as “regular expressions”, feed a Scanner Generator, which produces tables or code for the Scanner; the Scanner turns source code into parts of speech & words, representing words as indices into a global table]
Finite Automata Formally a finite automaton is a five-tuple (S, Σ, δ, s0, SF) where • S is the set of states, including the error state se. S must be finite. • Σ is the alphabet or character set used by the recognizer. Typically the union of edge labels (transitions between states). • δ(s,c) is a function that encodes transitions (i.e., reading character c in Σ while in state s in S moves the automaton to state δ(s,c)). • s0 is the designated start state • SF is the set of final states, drawn with double circles in a transition diagram
Finite Automata Finite automaton to recognize fee and fie: [Transition diagram: s0 -f-> s1, s1 -e-> s2 -e-> s3, s1 -i-> s4 -e-> s5] • S = {s0, s1, s2, s3, s4, s5, se} • Σ = {f, e, i} • δ(s,c) is the set of transitions shown in the diagram • s0 = s0 • SF = {s3, s5} The set of words accepted by a finite automaton F forms a language L(F). It can also be described by regular expressions.
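The five-tuple maps quite directly onto code. The sketch below is one possible C encoding of the fee/fie recognizer, assuming the transitions read off the diagram are s0 -f-> s1, s1 -e-> s2, s2 -e-> s3, s1 -i-> s4 and s4 -e-> s5, with every other (state, character) pair leading to the error state se. The names delta and accepts are illustrative and not from the slides.

```c
/* States of the automaton; SE is the error state se. */
enum { S0, S1, S2, S3, S4, S5, SE };

/* Transition function delta(s, c): any pair not listed goes to SE. */
static int delta(int s, char c)
{
    switch (s) {
    case S0: return c == 'f' ? S1 : SE;
    case S1: return c == 'e' ? S2 : (c == 'i' ? S4 : SE);
    case S2: return c == 'e' ? S3 : SE;
    case S4: return c == 'e' ? S5 : SE;
    default: return SE;   /* s3, s5 and se have no outgoing edges */
    }
}

/* Returns 1 if word is in L(F), i.e. the run from s0 ends in a final state. */
int accepts(const char *word)
{
    int s = S0;                          /* start state s0 */
    for (; *word != '\0'; word++)
        s = delta(s, *word);
    return s == S3 || s == S5;           /* SF = { s3, s5 } */
}
```

With these assumptions, accepts("fee") and accepts("fie") return 1, while "fe", "feee" and "foe" are rejected. A scanner generator emits essentially this loop, with delta stored as a table.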
Finite Automata Quick Exercise • Draw a finite automaton that can recognize CU | CSU | CSM | DU (the fee/fie transition diagram is included on the slide for reference)
Regular Expressions in Lex* The characters that form regular expressions include: • . matches any single character except newline • * matches zero or more copies of preceding expression • [] a character class that matches any character within the brackets. If first character is ^ will match any character except those within brackets. A dash can be used for character range, e.g., [0-4] is equivalent to [01234]. more in book… • ^ matches beginning of line as first character of expression (also negation within [], as listed above). • $ matches end of line as last character of expression • {} indicates how many times previous pattern is allowed to match, e.g., A{1,3} matches one to three occurrences of A. • \ used to escape metacharacters, e.g., \* is literal asterisk, \" is a literal quote, \{ is literal open brace, etc. * from lex & yacc by Levine, Mason & Brown
Regular Expressions, continued • + matches one or more occurrences of preceding expression, e.g., [0-9]+ matches “1” “11” or “1234” but not empty string • ? matches zero or one occurrence of preceding expression, e.g., -?[0-9]+ matches signed number with optional leading minus sign • | matches either preceding or following expression, e.g., cow|pig|sheep matches any of the three words • "…" interprets everything inside quotation marks literally • / matches preceding expression only if followed by following expression, e.g., 0/1 matches “0” in “01” but not in “02”. Material in pattern following the / is not “consumed” • () Groups a series of regular expressions into a new regular expression, e.g., (01) becomes character sequence 01. Useful when building up complex patterns with *, + and |.
Regular Expression Examples • digit: [0-9] • int with at least 1 digit: [0-9]+ • int that can have 0 digits: [0-9]* • What about float? • [0-9]*\.[0-9]+ // literal ., at least 1 digit after . – what about 0 or 2? • ([0-9]+)|([0-9]*\.[0-9]+) // combine int and float, notice use of (); what about unary -? • -?(([0-9]+)|([0-9]*\.[0-9]+)) • Note: no spaces inside a pattern; in lex an unquoted space ends the pattern and starts the action
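The final pattern can be tried outside of lex. The sketch below uses the POSIX regex.h interface (regcomp, regexec, regfree), which is a different engine from lex but accepts the same syntax for this pattern. The function name is_number is hypothetical, and the ^ and $ anchors are added here to force a whole-string match, since lex would instead match the longest prefix of its input.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole string is an optionally signed int or float
   according to -?(([0-9]+)|([0-9]*\.[0-9]+)), and 0 otherwise
   (including the case where the pattern fails to compile). */
int is_number(const char *text)
{
    static const char *pattern = "^-?(([0-9]+)|([0-9]*\\.[0-9]+))$";
    regex_t re;
    int matched;

    /* REG_EXTENDED enables +, ?, | and (); REG_NOSUB skips submatch capture. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Under these assumptions "42", "-3.14" and ".5" are accepted, while "3." is rejected because the float alternative demands at least one digit after the dot.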
More Regular Expression Examples • What’s a regular expression for matching quotes? • \".*\" won’t work for lines like "mine" and "yours" because lex matches the largest possible pattern, so the match would run from the first " to the last " on the line. • \"[^"\n]*["\n] will work by excluding " (forces lex to stop as soon as " is reached). The \n keeps a quoted string from exceeding one line.
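What the second pattern does can be spelled out by hand. This is a minimal C sketch, not lex output, of the match \"[^"\n]*["\n] performs at the start of a string. The function name match_quoted is made up for illustration.

```c
#include <stddef.h>

/* Length of the text matched at the start of s by \"[^"\n]*["\n],
   or 0 if s does not start with such a match. */
size_t match_quoted(const char *s)
{
    size_t i = 1;

    if (s[0] != '"')                     /* must open with a quote */
        return 0;
    while (s[i] != '\0' && s[i] != '"' && s[i] != '\n')
        i++;                             /* [^"\n]* : anything but " or newline */
    if (s[i] == '\0')
        return 0;                        /* input ended before " or newline */
    return i + 1;                        /* include the closing " or \n */
}
```

On the input "mine" and "yours" the match stops after the first closing quote (6 characters), which is the behaviour the slide asks for. An unterminated string is cut off at the end of its line.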
Flex – Fast Lexical Analyzer Here’s where we’ll put the regular expressions to good use! [Figure: regular expressions & C-code rules → FLEX → lex.yy.c, which contains the yylex() scanner (a program to recognize patterns in text) → compile → executable that analyzes and executes input]
Flex input file • 3 sections
definitions
%%
rules
%%
user code
Definition Section Examples • name definition
DIGIT [0-9]
ID [a-z][a-z0-9]*
• A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to: ([0-9])+"."([0-9])*
C Code • Can include C code in definitions
%{
/* This is a comment inside the definition */
#include <math.h> /* may need headers */
%}
Rules • The rules section of the flex input contains a series of rules of the form: pattern action • In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the output (with the %{ %}'s removed). The %{ %}'s must appear unindented on lines by themselves.
Example: Simple Pascal-like recognizer • Definitions section:
/* scanner for a toy Pascal-like language */
%{
/* needed for the calls to atoi() and atof() below */
#include <stdlib.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
• Remember that %{ and %} are on a line by themselves, unindented! The lines between them are inserted as-is into the resulting code • DIGIT and ID are definitions that can be used in the rules section
Example continued • Rules section: each rule is a pattern followed by an action; yytext is the text that matched the pattern (a char*)
%%
{DIGIT}+ { printf("An integer: %s (%d)\n", yytext, atoi(yytext)); }
{DIGIT}+"."{DIGIT}* { printf("A float: %s (%g)\n", yytext, atof(yytext)); }
if|then|begin|end|procedure|function { printf("A keyword: %s\n", yytext); }
{ID} { printf("An identifier: %s\n", yytext); }
"+"|"-"|"*"|"/" { printf("An operator: %s\n", yytext); }
"{"[^}\n]*"}" /* eat up one-line comments */
[ \t\n]+ /* eat up whitespace */
. { printf("Unrecognized character: %s\n", yytext); }
Example continued • User code (required for flex, in library for lex)
%%
int main(int argc, char **argv)
{
    ++argv, --argc; /* skip over program name */
    if (argc > 0)
        yyin = fopen(argv[0], "r"); /* yyin is the lex input file */
    else
        yyin = stdin;
    yylex(); /* lexer function produced by lex */
    return 0;
}
Flex exercise #1 • Download pascal.l • From a command prompt (Start->Run->cmd): • flex -opascal.c -L pascal.l • NOTE: without the -o option the output file will be called lex.yy.c • The -L option suppresses #line directives that cause problems with some compilers (e.g. DevC++) • Compile and execute pascal.c (batch file on Blackboard) • gcc -opascal.exe -Lc:\progra~1\gnuwin32\lib pascal.c -lfl -ly • Execute program. Type in digits, ids, keywords etc. End program with Ctrl-Z
Flex exercise #2 • Copy words.l (from lex & yacc) • Use flex then compile and execute • What does it do? • Extend the example with 1 new part of speech. • Recognize lexemes R0-R9 as register names • Recognize complex numbers, including for example -3+4i, +5-6i, +7i, 8i, -12i, but not 3++4i (hint: print newline before displaying your complex number, lexer may display 3+ and then recognize +4i)
Lex techniques • Hardcoding lists of words in the rules is not very effective; a symbol table is often used instead. There is an example in lex & yacc, not covered in class, but see me if you’re interested.
And now… • Let’s continue with chapter 4!
Bison – like Yacc (yet another compiler compiler) • Input: context-free grammar in BNF form, LALR(1)* • Output: Bison parser (C program) that groups tokens according to grammar rules • Bison parser provides yyparse • You must provide: • the lexical analyzer (e.g., flex) • an error-handling routine named yyerror • a main routine that calls yyparse * LALR(1): Look-Ahead LR, i.e. left-to-right scan, rightmost derivation in reverse, with 1 token of lookahead
Bison Parser • Same sections as flex (yacc came first): definitions, rules, C-Code
Bison Parser – Definition Section • Tokens used in grammar, values used on parser stack, may include C code within %{ %} • Single quoted characters can be used as tokens without declaring them, e.g., '+', '=' etc. • List tokens, Bison will create header with defines (when run with -d)
%token NAME NUMBER
• YYSTYPE determines the data type of the values returned by the lexer. If lexer returns different types depending on what is read, include a union:
%union {
    char cval;
    char *sval;
    int ival;
}
• Types declared in union can be used to specify types for tokens and also for non-terminals
%token <ival> NUMBER
%type <sval> bibKey
Bison Parser – Rule Section • Use : between lhs and rhs, place ; at end.
statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;
expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
    | NUMBER '-' NUMBER { $$ = $1 - $3; }
    | NUMBER { $$ = $1; }
    ;
• Unlike flex, bison doesn’t care about line boundaries, so add white space for readability • Symbol on lhs of first rule is start symbol, can override with %start declaration in definition section • $1, $3 refer to RHS values. $$ sets value of LHS. • In expression, $$ = $1 + $3 means it sets the value of lhs (expression) to NUMBER ($1) + NUMBER ($3)
More on Symbol Values and Actions • Symbols in bison have values. • The YYSTYPE typedef determines the value types • Default for all values is int • A rule's action is executed when the parser reduces that rule (by then it will have recognized both NUMBER symbols, and the lexer should have returned a value for each via yylval).
expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
    | NUMBER '-' NUMBER { $$ = $1 - $3; }
    ;
More on Symbol Values and Actions • Example to return int value:
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
• yylval = atoi(yytext) sets the value for use in actions; this one is just the numeric value of the string stored in yytext • return NUMBER returns the recognized token • In prior examples we just returned tokens, not values
Bison Parser – C Section • At a minimum, provide yyerror and main routines
void yyerror(const char *errmsg)
{
    fprintf(stderr, "%s\n", errmsg);
}
int main(void)
{
    return yyparse();
}
Bison Intro Exercise • Download simpleCalc.y and simpleCalc.l • Create calculator program: • bison -d simpleCalc.y • flex -L -osimpleCalc.c simpleCalc.l • gcc -c simpleCalc.c • gcc -c simpleCalc.tab.c • gcc -Lc:\progra~1\gnuwin32\lib simpleCalc.o simpleCalc.tab.o -osimpleCalc.exe -lfl -ly • As a convenience, you can use the batch file mbison.bat instead of typing all the above: • mbison simpleCalc • Test with valid sentences (e.g., 3+6-4) and invalid sentences.
Understanding simpleCalc
simpleCalc.l:
%{
#include "simpleCalc.tab.h"
extern int yylval;
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
%%
/*---------------------------------------*/
/* 5. Other C code that we need. */
yyerror(char *errmsg) { fprintf(stderr, "%s\n", errmsg); }
main() { yyparse(); }
simpleCalc.tab.h:
#ifndef YYTOKENTYPE
# define YYTOKENTYPE
/* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */
enum yytokentype { NAME = 258, NUMBER = 259 };
#endif
/* Tokens. */
#define NAME 258
#define NUMBER 259
Explanation: When the lexer recognizes a number ([0-9]+) it returns the token NUMBER and sets yylval to the corresponding integer value. When the lexer sees a newline it returns 0, which the parser treats as end of input. If it sees a space or tab it ignores it. When it sees any other character it returns that character (the first character in the yytext buffer). If yyparse recognizes it, good! Otherwise the parser can generate an error.
Understanding simpleCalc, continued
simpleCalc.y:
%token NAME NUMBER
%%
statement: NAME '=' expression
    | expression { printf("= %d\n", $1); }
    ;
expression: expression '+' NUMBER { $$ = $1 + $3; }
    | expression '-' NUMBER { $$ = $1 - $3; }
    | NUMBER { $$ = $1; }
    ;
Explanation: When you execute simpleCalc and type an expression such as 1+2, the main program calls yyparse. This calls the lexer (yylex) to recognize 1 as a NUMBER (puts 1 in yylval), which is reduced to an expression. It calls the lexer again, which returns +, and once more to recognize 2 as a NUMBER. At this point it will recognize expression + NUMBER and “reduce” this rule, meaning it does the action { $$ = $1 + $3; }. When the lexer returns 0 at the end of the line, it recognizes expression as a statement, so it does the printf action.
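The order in which the actions fire can be imitated without bison. The sketch below is a hand-written C loop, not the LALR(1) parser that bison generates, that applies the same three actions left to right. The names eval_simple, read_number and skip_blanks are hypothetical, and integer overflow is ignored.

```c
#include <ctype.h>

/* Skips spaces and tabs, like the [ \t] rule in simpleCalc.l. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Reads one NUMBER ([0-9]+) at *pp; returns 0 if there is none. */
static int read_number(const char **pp, int *value)
{
    const char *p = skip_blanks(*pp);

    if (!isdigit((unsigned char)*p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)*p))
        *value = *value * 10 + (*p++ - '0');
    *pp = p;
    return 1;
}

/* Evaluates NUMBER followed by any number of (+|-) NUMBER pairs.
   Returns 1 and stores the value on success, 0 on a syntax error. */
int eval_simple(const char *s, int *result)
{
    int acc, rhs;

    if (!read_number(&s, &acc))          /* expression: NUMBER  { $$ = $1; } */
        return 0;
    for (;;) {
        char op;

        s = skip_blanks(s);
        if (*s == '\0' || *s == '\n')    /* the lexer would return 0 here */
            break;
        op = *s++;
        if ((op != '+' && op != '-') || !read_number(&s, &rhs))
            return 0;
        /* expression: expression '+' NUMBER  or  expression '-' NUMBER */
        acc = (op == '+') ? acc + rhs : acc - rhs;
    }
    *result = acc;                       /* statement: expression */
    return 1;
}
```

For "3+6-4" the accumulator goes 3, 9, 5. This is the left-to-right grouping the left-recursive expression rule produces, and "3++4" is rejected as a syntax error.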
Even more detail (if you’re curious) Running flex creates simpleCalc.c. This contains the following case statement (I added the printf statements):
case 1:
YY_RULE_SETUP
printf("returning number value %d\n", atoi(yytext));
{ yylval = atoi(yytext); return NUMBER; }
YY_BREAK
case 2:
YY_RULE_SETUP
printf("ignoring white space\n");
; /* ignore white space */
YY_BREAK
case 3:
YY_RULE_SETUP
printf("recognized eof\n");
return 0; /* logical EOF */
YY_BREAK
case 4:
YY_RULE_SETUP
printf("returning other character %c\n", yytext[0]);
return yytext[0];
YY_BREAK
Continuing more detail Running bison creates simpleCalc.tab.c
switch (yyn) {
case 3:
#line 4 "simpleCalc.y"
    { printf("= %d\n", (yyvsp[0])); ;}
    break;
case 4:
#line 7 "simpleCalc.y"
    { (yyval) = (yyvsp[-2]) + (yyvsp[0]); ;}
    break;
case 5:
#line 8 "simpleCalc.y"
    { (yyval) = (yyvsp[-2]) - (yyvsp[0]); ;}
    break;
case 6:
#line 9 "simpleCalc.y"
    { (yyval) = (yyvsp[0]); ;}
    break;
Notice the use of the value-stack pointer yyvsp for $ values: in a rule with three symbols on the right-hand side, $1 is yyvsp[-2] and $3 is yyvsp[0]. NOTE: I added extra printf statements to each case, which is what you can see in the trace.
Continuing more detail • In exercise 2 you define a union. This gets translated to code within simpleCalc.tab.h (excerpt):
#if ! defined (YYSTYPE) && ! defined (YYSTYPE_IS_DECLARED)
#line 1 "simpleCalcEx2.y"
typedef union YYSTYPE {
    float fval;
    int ival;
} YYSTYPE;
extern YYSTYPE yylval;
This is what makes yylval a union, so the lexer returns a value through one of its members (e.g., yylval.ival)
Continuing more detail • Symbols you define in bison’s CFG are added to a symbol table:
static const char *const yytname[] = {
    "$end", "error", "$undefined", "NUMBER", "FNUMBER", "NAME",
    "'='", "'+'", "'*'", "'('", "')'", "$accept",
    "statement", "expression", "term", "factor", 0
};
Continuing more detail • New rules make use of union:
switch (yyn) {
case 3:
#line 15 "simpleCalcEx2.y"
    { printf("= %f\n", (yyvsp[0].fval)); ;}
    break;
case 4:
#line 18 "simpleCalcEx2.y"
    { (yyval.fval) = (yyvsp[-2].fval) + (yyvsp[0].fval); ;}
    break;
case 5:
#line 19 "simpleCalcEx2.y"
    { (yyval.fval) = (yyvsp[0].fval); ;}
    break;
expression is defined as <fval>, so is NUMBER
Bison Exercise #1 • Change simpleCalc to handle + and * with correct precedence using the grammar with terms and factors presented in chapter 4 of text:
  Expr -> Expr + Term | Term
  Term -> Term * Factor | Factor
  Factor -> ( Expr ) | NUMBER
(changed id to NUMBER for simplicity)
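To see why this layering gives * higher precedence than +, it can help to evaluate the grammar by hand. The sketch below is a recursive-descent evaluator in C, which is a different technique from the LALR(1) parser bison builds and is not a solution to the exercise. The left-recursive rules are rewritten as loops, and all names (eval_expr, parse_expr, parse_term, parse_factor, cur, ok) are made up for illustration.

```c
#include <ctype.h>

static const char *cur;   /* current position in the input */
static int ok;            /* cleared when a syntax error is seen */

static int parse_expr(void);

static void skip_blanks(void)
{
    while (*cur == ' ' || *cur == '\t')
        cur++;
}

/* Factor -> ( Expr ) | NUMBER */
static int parse_factor(void)
{
    int v = 0;

    skip_blanks();
    if (*cur == '(') {
        cur++;
        v = parse_expr();
        skip_blanks();
        if (*cur == ')')
            cur++;
        else
            ok = 0;                      /* missing closing parenthesis */
    } else if (isdigit((unsigned char)*cur)) {
        while (isdigit((unsigned char)*cur))
            v = v * 10 + (*cur++ - '0');
    } else {
        ok = 0;                          /* neither ( nor a digit */
    }
    return v;
}

/* Term -> Term * Factor | Factor, written as Factor (* Factor)* */
static int parse_term(void)
{
    int v = parse_factor();

    for (skip_blanks(); ok && *cur == '*'; skip_blanks()) {
        cur++;
        v *= parse_factor();
    }
    return v;
}

/* Expr -> Expr + Term | Term, written as Term (+ Term)* */
static int parse_expr(void)
{
    int v = parse_term();

    for (skip_blanks(); ok && *cur == '+'; skip_blanks()) {
        cur++;
        v += parse_term();
    }
    return v;
}

/* Returns 1 and stores the value if s is a complete Expr, else 0. */
int eval_expr(const char *s, int *result)
{
    int v;

    cur = s;
    ok = 1;
    v = parse_expr();
    skip_blanks();
    if (!ok || *cur != '\0')
        return 0;
    *result = v;
    return 1;
}
```

Because every + operand is a whole Term, 2+3*4 evaluates to 14 rather than 20. Parentheses re-enter at Expr, so (2+3)*4 gives 20.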
Bison Exercise #2
• Change simpleCalc.l to accept floating point values OR integers.
  • Remove extern int yylval; (yylval is no longer simply an int)
  • Modify the #include of simpleCalc.tab.h if you change the name of your file.
  • Use atof for floating point values
  • You will create a union in simpleCalc.y. Use the member names of that union in simpleCalc.l, for example yylval.ival = atoi(yytext); would be used to set the ival member of the union to an integer value.
• Change simpleCalc.y to accept floating point values.
  • Create a union, example: %union { float fval; int ival; }
  • Add %token statements for every token and %type statements for your non-terminals, for example: %token <ival> NUMBER %type <fval> expression
  • Update factor to accept NUMBER or a floating point type of number (e.g., FNUMBER)
  • The printf in statement needs to print a floating point value (printf("= %f\n", $1);)
Bison Exercise #3
• Update simpleCalc to accept statements like @myVar = 3.4*4
• Output will be: myVar = 13.6
• Purpose:
  • adding another type to union (char*). I called this sval.
  • using a C function as part of lexer to preprocess yytext before setting yylval.
• Steps in simpleCalc.l:
  • Add a prototype for a function named extract_name. The parameter to this function is a char* (you will pass in yytext). You can either return a char* or just modify the parameter, since it’s an array. The prototype goes in the declaration section.
  • Add function extract_name to the C section. This function will just remove the @ from the front of the variable name. HINT: remember that C strings end in '\0'.
  • You can modify this string in place, but for more extensive processing you might need to create your own C strings. You can use malloc, strdup and free in such a case.
  • When you have recognized a variable (@ followed by upper or lower case letters, in our simple example), you will set yylval.sval = extract_name(yytext);
• Steps in simpleCalc.y:
  • Be sure you still have NAME = expression in your grammar, and add an action so it prints both the variable and the expression result.
  • Declare NAME as a token of type <sval> (or whatever name you used in your union)
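As a starting point for the in-place variant described in the steps, one possible shape for extract_name is sketched below. The exercise leaves the details open, so treat the signature (modify the argument and also return it) as one assumption among several workable ones.

```c
#include <string.h>

/* Removes a leading '@' from name in place and returns name.
   A string without a leading '@' is returned unchanged. */
char *extract_name(char *name)
{
    if (name[0] == '@') {
        /* Shift the rest of the string left by one. strlen(name) bytes
           covers the characters after '@' plus the terminating '\0'. */
        memmove(name, name + 1, strlen(name));
    }
    return name;
}
```

One thing to watch when wiring this into the lexer: yytext is reused for the next match, so a pointer into it may no longer hold the name by the time the NAME '=' expression action runs. Copying the result first (for example with strdup, as the slide hints) is a likely fix.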
Bison Exercise #4 • Modify simpleCalc.l so that it accepts input from a file. The last slide contains a main method that will read from a file. • Create a small input file with a single line of input, something like: @myVar = 8+3*2.5+6
Summary of steps (from online manual) The actual language-design process using Bison, from grammar specification to a working compiler or interpreter, has these parts: • Formally specify the grammar in a form recognized by Bison (i.e., machine-readable BNF). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements. • Write a lexical analyzer to process input and pass tokens to the parser. • Write a controlling function (main) that calls the Bison-produced parser. • Write error-reporting routines.
Using files with Bison • The lexer reads its input from the FILE pointer yyin, which defaults to standard input. The following code can be used to take an optional command-line argument:
int main(int argc, char **argv)
{
    FILE *file;
    if (argc == 2) {
        file = fopen(argv[1], "r");
        if (!file) {
            fprintf(stderr, "Couldn't open %s\n", argv[1]);
            exit(1);
        }
        yyin = file;
    }
    yyparse(); /* parse the file, or standard input if no argument was given */
    return 0;
}