Tutorial On Lex & Yacc Presented By Dewan Tanvir Ahmed Lecturer, CSE Bangladesh University of Engineering and Technology
Purpose of Tutorial • Provide a brief, non-technical, black-box introduction to lex and yacc. • Show how to run lex and yacc. • Show how to run them in a Windows environment. More study will be needed beyond this tutorial: the upcoming assignments are on lex and yacc, and the material may be included in CSE309.
Lex: what is it? • Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l file) • A lexer or scanner is used to perform lexical analysis, or the breaking up of an input stream into meaningful units, or tokens. • For example, consider breaking a text file up into individual words.
Skeleton of a lex specification (.l file). A .c file is generated after running lex on x.l:
%{ < C global variables, prototypes, comments > %}    (this part is embedded verbatim into the generated .c)
[DEFINITION SECTION]                                  (substitutions, code and start states; copied into the .c)
%%
[RULES SECTION]                                       (defines how to scan and what action to take for each token)
%%
< C auxiliary subroutines >                           (any user code, for example a main function to call the scanning function yylex())
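As a concrete illustration of this skeleton, here is a minimal, self-contained specification; the behaviour (counting words and lines) and all names in it are an assumed example, not part of the tutorial's assignments:

%{
/* assumed example: count words and lines in the input */
#include <stdio.h>
int words = 0, lines = 0;
%}
%%
[a-zA-Z]+   { words++;   /* one or more letters = a word */ }
\n          { lines++;   /* count newlines */ }
.           { ;          /* ignore everything else */ }
%%
int yywrap(void) { return 1; }   /* no more input after EOF */
int main(void)
{
    yylex();
    printf("%d words, %d lines\n", words, lines);
    return 0;
}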
The rules section
%%    [RULES SECTION]
<pattern>   { <action to take when matched> }
<pattern>   { <action to take when matched> }
…
%%
Patterns are specified by regular expressions. For example:
%%
[A-Za-z]*   { printf("this is a word"); }
%%
Regular Expression Basics
 .   : matches any single character except \n
 *   : matches 0 or more instances of the preceding regular expression
 +   : matches 1 or more instances of the preceding regular expression
 ?   : matches 0 or 1 of the preceding regular expression
 |   : matches the preceding or the following regular expression
[ ]  : defines a character class
( )  : groups the enclosed regular expression into a new regular expression
"…"  : matches everything within the quotes literally
Lex Reg Exp (cont)
x|y      : x or y
{i}      : the definition of i (from the definition section)
x/y      : x, but only if followed by y (y is not removed from the input)
x{m,n}   : m to n occurrences of x
^x       : x, but only at the beginning of a line
x$       : x, but only at the end of a line
"s"      : exactly what is in the quotes (except for "\" and the following character)
A regular expression finishes with a space, tab or newline.
Meta-characters • meta-characters (do not match themselves, because they are used in the preceding reg exps): • ( ) [ ] { } < > + / , ^ * | . \ " $ ? - % • to match a meta-character, prefix with "\" • to match a backslash, tab or newline, use \\, \t, or \n
Regular Expression Examples • an integer: 12345 • [1-9][0-9]* • a word: cat • [a-zA-Z]+ • a (possibly) signed integer: 12345 or -12345 • [-+]?[1-9][0-9]* • a floating point number: 1.2345 • [0-9]*"."[0-9]+
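As a hypothetical rules-section fragment (not from the tutorial) using these patterns, where yytext, introduced on a later slide, holds the matched text:

%%
[-+]?[1-9][0-9]*      { printf("integer: %s\n", yytext);   /* (possibly) signed integer */ }
[0-9]*"."[0-9]+       { printf("float: %s\n", yytext);     /* floating point number     */ }
[a-zA-Z]+             { printf("word: %s\n", yytext);      /* a word                    */ }
[ \t\n]+              { ;                                  /* skip white space          */ }
%%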
Lex Regular Expressions
Lex uses an extended form of regular expression (c: character; x, y: regular expressions; s: string; m, n: integers; i: identifier):
c       any character except meta-characters (see below)
[...]   the list of enclosed chars (may be a range)
[^...]  the list of chars not enclosed
 .      any ASCII char except newline
xy      x followed by y (concatenation)
x*      zero or more occurrences of x
x+      one or more occurrences of x (i.e. xx*, but not the empty string)
x?      an optional x (x or nothing)
Two Rules
1. lex will always match the longest (in number of characters) token possible.
2. If two or more possible tokens are of the same length, then the token with the regular expression that is defined first in the lex specification is favored.
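For instance, a hypothetical pair of rules for the keyword if and for identifiers (an assumed example, not from the tutorial) shows both rules at work:

%%
if                    { printf("keyword IF\n");            /* listed first: wins ties (rule 2) */ }
[a-zA-Z][a-zA-Z0-9]*  { printf("identifier %s\n", yytext); }
%%
/* On input "if" both patterns match 2 characters, so the keyword rule wins because it
   comes first.  On input "ifdef" the identifier pattern matches 5 characters while the
   keyword pattern matches only 2, so the longest match (rule 1) makes it an identifier. */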
Regular Expression Examples • a delimiter for an English sentence • "." | "?" | ! OR • [".""?"!] • C++ comment: // call foo() here!! • "//".* • white space • [ \t]+ • English sentence: Look at this! • ([ \t]+|[a-zA-Z]+)+("."|"?"|!)
Special Functions
• yytext: where the text most recently matched is stored
• yyleng: the number of characters in the text most recently matched
• yylval: the associated value of the current token
• yymore(): append the next matched string to the current contents of yytext
• yyless(n): remove from yytext all but the first n characters
• unput(c): return character c to the input stream
• yywrap(): may be replaced by the user; yywrap() is called by the lexical analyser whenever it inputs an EOF as the first character when trying to match a regular expression
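A small, assumed rules fragment (not from the tutorial) showing yytext, yyleng and a user-supplied yywrap() together:

%%
[a-zA-Z]+   { printf("matched \"%s\" (%d characters)\n", yytext, yyleng); }
.|\n        { ; }
%%
/* Returning 1 tells the scanner there is no more input after EOF,
   so yylex() stops instead of asking for another file. */
int yywrap(void) { return 1; }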
Yacc: what is it? Yacc: a tool for automatically generating a parser given a grammar written in a yacc specification (.y file). A grammar specifies a set of production rules, which define a language. A production rule specifies a sequence of symbols, a sentence, that is legal in the language.
Skeleton of a yacc specification (.y file). A .c file is generated after running yacc on x.y:
%{ < C global variables, prototypes, comments > %}    (this part is embedded verbatim into the generated .c)
[DEFINITION SECTION]                                  (contains token declarations; the tokens are recognized in the lexer)
%%
[PRODUCTION RULES SECTION]                            (defines how to "understand" the input language, and what actions to take for each "sentence")
%%
< C auxiliary subroutines >                           (any user code, for example a main function to call the parser function yyparse())
Structure of yacc File • Definition section • declarations of tokens • type of values used on parser stack • Rules section • list of grammar rules with semantic routines • User code
The Production Rules Section
%%
production : symbol1 symbol2 …   { action }
           | symbol3 symbol4 …   { action }
           | …

production : symbol1 symbol2     { action }
%%
An example
%%
statement  : expression                  { printf(" = %g\n", $1); }
expression : expression '+' expression   { $$ = $1 + $3; }
           | expression '-' expression   { $$ = $1 - $3; }
           | NUMBER                      { $$ = $1; }
%%
According to these two productions, 5 + 4 - 3 + 2 is parsed into the parse tree shown on the slide: statement at the root, with nested expression nodes over the numbers 5, 4, 3 and 2 and the operators +, - and +.
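For this fragment to work, the .y file also needs a definition section that declares NUMBER and makes the semantic values doubles (because of the %g format), and the lexer must store the number's value in yylval before returning NUMBER. A minimal sketch, assuming the common #define YYSTYPE convention and a %left declaration; none of these details are shown on the original slide:

/* definition section of the .y file */
%{
#include <stdio.h>
#define YYSTYPE double       /* semantic values on the parser stack are doubles */
%}
%token NUMBER
%left '+' '-'                /* assumed: both operators group left to right */

/* matching rule in the .l file (which must also define YYSTYPE double
   before including y.tab.h, and include <stdlib.h> for atof) */
[0-9]+("."[0-9]+)?   { yylval = atof(yytext); return NUMBER; }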
Choosing a Grammar
One option is a grammar with the precedence of the operators built in:
S -> E
E -> E + T
E -> E - T
E -> T
T -> T * F
T -> T / F
T -> F
F -> ( E )
F -> ID
Another option is an ambiguous grammar, relying on yacc's precedence and associativity declarations (next slide) to resolve the conflicts:
S -> E
E -> E + E
E -> E - E
E -> E * E
E -> E / E
E -> ( E )
E -> ID
Precedence and Associativity
%right '='
%left '-' '+'
%left '*' '/'
%right '^'
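The declarations are listed from lowest precedence to highest, so here '=' binds loosest and '^' tightest; '-' and '+' (and '*' and '/') group left to right, while '=' and '^' group right to left. They also let the ambiguous grammar from the previous slide be used directly; a sketch, where the token ID and the rule name expr are assumed to come from the rest of the specification:

%token ID
%right '='
%left '-' '+'
%left '*' '/'
%right '^'
%%
expr : expr '=' expr      /* the shift/reduce conflicts are resolved by the   */
     | expr '+' expr      /* declarations above, not by extra nonterminals    */
     | expr '-' expr      /* such as term and factor                          */
     | expr '*' expr
     | expr '/' expr
     | expr '^' expr
     | '(' expr ')'
     | ID
     ;
%%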
Defining Values
expr   : expr '+' term    { $$ = $1 + $3; }
       | term             { $$ = $1; }
       ;
term   : term '*' factor  { $$ = $1 * $3; }
       | factor           { $$ = $1; }
       ;
factor : '(' expr ')'     { $$ = $2; }
       | ID
       | NUM
       ;
In a rule such as factor : '(' expr ')', $1 is the value of '(', $2 is the value of expr and $3 is the value of ')'. When no action is given, the default is $$ = $1;
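As a sketch of how these values flow, consider parsing the (assumed) input 2 + 3 * 4, with NUM tokens whose values the lexer places in yylval; the reductions happen bottom-up:

/* Input: 2 + 3 * 4
   factor -> NUM               $$ = 2
   term   -> factor            $$ = 2
   expr   -> term              $$ = 2
   factor -> NUM               $$ = 3
   term   -> factor            $$ = 3
   factor -> NUM               $$ = 4
   term   -> term '*' factor   $$ = 3 * 4 = 12
   expr   -> expr '+' term     $$ = 2 + 12 = 14   */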
scanner.l Example: Lex
%{
#include <stdio.h>
#include "y.tab.h"
%}
id      [_a-zA-Z][_a-zA-Z0-9]*
wspc    [ \t\n]+
semi    [;]
comma   [,]
%%
int        { return INT; }
char       { return CHAR; }
float      { return FLOAT; }
{comma}    { return COMMA; }    /* Necessary? */
{semi}     { return SEMI; }
{id}       { return ID; }
{wspc}     { ; }
decl.y Example: Definitions
%{
#include <stdio.h>
#include <stdlib.h>
%}
%start line
%token CHAR, COMMA, FLOAT, ID, INT, SEMI
%%
decl.y Example: Rules
/* This production is not part of the "official" grammar. Its primary purpose is to
   recover from parser errors, so it's probably best if you leave it here. */
line : /* lambda */
     | line decl
     | line error  { printf("Failure :-(\n");
                     yyerrok;
                     yyclearin;
                   }
     ;
decl.y Example: Rules
decl : type ID list  { printf("Success!\n"); }
     ;
list : COMMA ID list
     | SEMI
     ;
type : INT
     | CHAR
     | FLOAT
     ;
%%
decl.y Example: Supplementary Code
extern FILE *yyin;

int main(void)
{
    do {
        yyparse();
    } while (!feof(yyin));
    return 0;
}

int yyerror(char *s)
{
    /* Don't have to do anything! */
    return 0;
}
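One typical way to build and run this example on a Unix-like system, assuming the classic lex/yacc commands (with flex and bison the library flag is usually -lfl instead of -ll, and bison is commonly invoked as bison -dy to get the same file names):

yacc -d decl.y                      # generates y.tab.c and y.tab.h
lex scanner.l                       # generates lex.yy.c
cc -o decl y.tab.c lex.yy.c -ll     # -ll supplies the default yywrap()
./decl < some_declarations.txt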