Tools for building compilers Clara Benac Earle
Tools to help build a compiler
• C
  • Lexical analyzer generators: Lex, flex
  • Syntax analyzer generator: yacc
• Java
  • Lexical analyzer generators: JLex, JFlex
  • Syntax analyzer generator: CUP
These tools and their documentation can be found on the internet.
Lex: Lexical Analyzer Generator
example.l → Lex compiler → lex.yy.c → C compiler → a.exe
Description
• A tool for generating scanners
• The scanner is described as pairs of regular expressions and C code
• Flex generates as output a C source file, lex.yy.c, which defines a routine yylex(). This file is compiled and linked to produce an executable
• When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code
Format of the input file
The flex input file consists of three sections separated by %%:
    Definitions
    %%
    Rules
    %%
    User code
Skeleton of a lex specification (.l file)
    %{
    < C global variables, prototypes, comments >
    %}
    [DEFINITION SECTION]
    %%
    [RULES SECTION]
    %%
    < C auxiliary subroutines >
• %{ ... %}: this part will be embedded into *.c
• Definition section: substitutions, code and start states; will be copied into *.c
• Rules section: defines how to scan and what action to take for each token
• Auxiliary subroutines: any user code. For example, a main function to call the scanning function yylex().
The definition section
• Contains name definitions and declarations of start conditions
• Name definitions have the form:
    name definition
• Examples:
    DIGIT [0-9]
    ID    [a-z][a-z0-9]*
The rules section
• Form:
    %%
    <pattern> { <action to take when matched> }
    <pattern> { <action to take when matched> }
    …
    %%
• Patterns are specified by regular expressions
• Example:
    %%
    [A-Za-z]+ { printf("this is a word"); }
    %%
Extended regular expressions
    x        match the character "x"
    .        any character except newline
    []       a character class
    [xy]     match either an "x" or a "y"
    [a-z]    match any letter from "a" to "z"
    [^a-z]   any character but those in the class
    r*       zero or more r's
    r+       one or more r's
    r?       zero or one r
    {name}   the expansion of the name definition
Extended regular expressions
    x|y      x or y
    x/y      x, only if followed by y (y not removed from input)
    x{m,n}   m to n occurrences of x
    ^x       x, but only at beginning of line
    x$       x, but only at end of line
    "s"      exactly what is in the quotes (except for "\" and following character)
A regular expression finishes with a space, tab or newline.
Meta-characters
• Meta-characters do not match themselves, because they are used in the preceding regular expressions:
    ( ) [ ] { } < > + / , ^ * | . \ " $ ? - %
• To match a meta-character, prefix it with "\"
• To match a backslash, tab or newline, use \\, \t, or \n
Regular Expression Examples
• an integer: 12345
    [1-9][0-9]*
  (this pattern does not match 0 or numbers with leading zeros)
• a word: cat
    [a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
    [-+]?[1-9][0-9]*
• a floating point number: 1.2345
    [0-9]*"."[0-9]+
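The example patterns can be tried outside of lex. The following is a minimal sketch that assumes a POSIX system with <regex.h>; the helper name matches_whole and the *_RE macros are hypothetical, not part of lex. The patterns are rewritten in POSIX extended syntax: each is anchored with ^ and $ so the whole string must match, and the lex quoted "." becomes \. here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The slide patterns, translated to POSIX extended regular expressions. */
#define INTEGER_RE "^[1-9][0-9]*$"
#define WORD_RE    "^[a-zA-Z]+$"
#define SIGNED_RE  "^[-+]?[1-9][0-9]*$"
#define FLOAT_RE   "^[0-9]*\\.[0-9]+$"

/* Returns 1 if text matches pattern, 0 if it does not (or if the
 * pattern fails to compile). */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int ok;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                       /* invalid pattern */
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For instance, matches_whole(INTEGER_RE, "12345") returns 1, while matches_whole(INTEGER_RE, "0") returns 0 because the pattern requires a leading digit from 1 to 9.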
Two Rules
1. lex will always match the longest (number of characters) token possible.
2. If two or more possible tokens are of the same length, then the token with the regular expression that is defined first in the lex specification is favored.
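The two rules can be sketched in plain C. This is not how lex is implemented (lex builds a finite automaton); it is an illustration that assumes three hand-written matchers standing in for the rules "if", [a-z][a-z0-9]* and [0-9]+, listed in that order. All names (next_token, TOK_IF, and so on) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_NONE, TOK_IF, TOK_ID, TOK_NUM };

/* Each matcher returns how many leading characters of s it accepts
 * (0 means the rule does not match at this position). */
static size_t match_if(const char *s)            /* "if" */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)            /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0]))
        return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_num(const char *s)           /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

struct rule { enum token tok; size_t (*match)(const char *); };

/* Order matters: earlier entries win ties, as in a lex specification. */
static const struct rule rules[] = {
    { TOK_IF, match_if }, { TOK_ID, match_id }, { TOK_NUM, match_num },
};

/* Picks the rule with the longest match at the start of s. The strict
 * ">" keeps the first rule when two rules match the same length. */
enum token next_token(const char *s, size_t *len)
{
    enum token best = TOK_NONE;
    size_t best_len = 0, i;

    assert(s != NULL && len != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > best_len) {
            best_len = n;
            best = rules[i].tok;
        }
    }
    *len = best_len;
    return best;
}
```

On the input "if" both the keyword rule and the identifier rule match two characters, so the keyword rule wins because it is listed first. On "iffy" the identifier rule matches four characters and wins by length.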
How the input is matched
Once the match is determined, the text corresponding to the match is made available in the global character pointer yytext, and its length in the global integer yyleng. The action corresponding to the matched pattern is then executed, and then the remaining input is scanned for another match.
Actions
• Can be any arbitrary C statement
• Normally they are written between {}
• If the action is empty, then when the pattern is matched the input token is simply discarded
• The action "|" means "same as the action for the next rule"
Actions: examples
    %%
    [ \t\n]+   ;
    ":="       return ASIG;
    "<"        return MINOR;
    "if"       return IF;
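To make the effect of these actions concrete, here is a hand-written C function that behaves roughly like the scanner lex would generate for the four rules. It is only a sketch: the function name scan_token, the numeric token codes, and the UNKNOWN token are assumptions made for illustration (a real lex scanner echoes unmatched characters by default, and the token codes normally come from the parser generator).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; 0 marks the end of the input. */
enum { END_OF_INPUT = 0, ASIG = 258, MINOR, IF, UNKNOWN };

/* Returns the next token starting at *cursor and advances *cursor
 * past it. */
int scan_token(const char **cursor)
{
    const char *p;

    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;

    /* Empty action: whitespace is consumed and nothing is returned. */
    p += strspn(p, " \t\n");

    if (*p == '\0') {
        *cursor = p;
        return END_OF_INPUT;
    }
    if (strncmp(p, ":=", 2) == 0) {     /* ":="  return ASIG;  */
        *cursor = p + 2;
        return ASIG;
    }
    if (*p == '<') {                    /* "<"   return MINOR; */
        *cursor = p + 1;
        return MINOR;
    }
    if (strncmp(p, "if", 2) == 0) {     /* "if"  return IF;    */
        *cursor = p + 2;
        return IF;
    }
    *cursor = p + 1;                    /* assumption: flag anything else */
    return UNKNOWN;
}
```

A caller would invoke scan_token repeatedly with the same cursor until it returns END_OF_INPUT, in the same way a parser calls yylex() repeatedly.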
Start conditions
A mechanism for conditionally activating rules
    %s comment
    %%
    "/*"            { BEGIN comment; }
    <comment>"*/"   { BEGIN INITIAL; /* same as BEGIN 0; */ }
    <comment>.|\n   { }
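The idea behind start conditions can be sketched as an explicit state variable in C. The function below is a hand-written approximation, not lex output: it assumes the input is a NUL-terminated string, discards everything while in the comment state, and copies everything else (mirroring the default echo action). The names strip_comments, INITIAL_SC and COMMENT_SC are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum start_condition { INITIAL_SC, COMMENT_SC };

/* Copies src to dst with C-style comments removed. dst must have room
 * for strlen(src) + 1 bytes. Returns the number of characters written. */
size_t strip_comments(const char *src, char *dst)
{
    enum start_condition state = INITIAL_SC;
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (src[i] != '\0') {
        if (state == INITIAL_SC && src[i] == '/' && src[i + 1] == '*') {
            state = COMMENT_SC;         /* comment opener: BEGIN comment */
            i += 2;
        } else if (state == COMMENT_SC && src[i] == '*' && src[i + 1] == '/') {
            state = INITIAL_SC;         /* comment closer: BEGIN 0 */
            i += 2;
        } else if (state == COMMENT_SC) {
            i++;                        /* inside a comment: discard */
        } else {
            dst[out++] = src[i++];      /* outside a comment: copy */
        }
    }
    dst[out] = '\0';
    return out;
}
```

The comment-closing test is only active in the comment state, just as a rule prefixed with <comment> is only active under that start condition.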
Special variables and functions
• yytext
  • where the text matched most recently is stored
• yyleng
  • number of characters in the text most recently matched
• yylval
  • associated value of the current token
• yymore()
  • append the next string matched to the current contents of yytext
• yyless(n)
  • keep only the first n characters in yytext and return the rest to the input stream
• unput(c)
  • return character c to the input stream
• yywrap()
  • may be replaced by the user
  • The yywrap function is called by the lexical analyzer whenever it reads an EOF as the first character when trying to match a regular expression. It returns 1 if scanning should stop, or 0 if more input has been made available
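The relationship between yytext, yyleng and yyless(n) can be illustrated with a small cursor-based scanner in C. This is a sketch under simplifying assumptions: the whole input is one in-memory string, the only "rule" matches a run of non-space characters, and the matched text is not NUL-terminated (unlike the real yytext). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct scanner {
    const char *input;  /* whole input string */
    size_t pos;         /* index of the next unread character */
    const char *text;   /* start of the latest match (role of yytext) */
    size_t leng;        /* length of the latest match (role of yyleng) */
};

void scanner_init(struct scanner *sc, const char *input)
{
    assert(sc != NULL && input != NULL);
    sc->input = input;
    sc->pos = 0;
    sc->text = input;
    sc->leng = 0;
}

/* Skips spaces, then matches a run of non-space characters.
 * Returns 1 on a match, 0 at end of input. */
int scan_word(struct scanner *sc)
{
    while (sc->input[sc->pos] == ' ')
        sc->pos++;
    if (sc->input[sc->pos] == '\0')
        return 0;
    sc->text = sc->input + sc->pos;
    sc->leng = strcspn(sc->text, " ");
    sc->pos += sc->leng;
    return 1;
}

/* Role of yyless(n): keep the first n matched characters and give the
 * rest back, so the next scan starts right after those n characters. */
void scanner_less(struct scanner *sc, size_t n)
{
    assert(n <= sc->leng);
    sc->pos -= sc->leng - n;
    sc->leng = n;
}
```

With the input "foobar baz", the first scan matches "foobar"; after scanner_less(&sc, 3) the current match is "foo", and the next scan matches "bar" before moving on to "baz".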