160 likes | 520 Views
Lex /Flex. COP 3402 – System Software. Lex : a tool for automatically generating a lexer or scanner given a lex specification (.l file ) Flex: Free software alternative to lex
E N D
Lex/Flex COP 3402 – System Software
Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l file) Flex: Free software alternative to lex A lexer or scanner is used to perform lexical analysis, or the breaking up of an input stream into meaningful units, or tokens. For example, consider breaking a text file up into individual words. Lex/Flex: what is it?
*.c is generated after running lex/flex FILENAME.l %{ < C global variables, includes prototypes, comments > %} [DEFINITION SECTION] %% [RULES SECTION] %% < C auxiliary subroutines> This part will be embedded into *.c substitutions, code and start states; will be copied into *.c define how to scan and what action to take for each token any user code. For example, a main function to call the scanning function yylex(). Skeleton of a lex/flex specification (.l file)
The rules section %% [RULES SECTION] <pattern> { <action to take when matched> } <pattern> { <action to take when matched> } … %% Patterns are specified by regular expressions. For example: %% [A-Za-z]* { printf(“this is a word”); } %%
Regular Expression Basics . : matches any single character except \n *: matches 0 or more instances of the preceding regular expression +: matches 1 or more instances of the preceding regular expression ?: matches 0 or 1 of the preceding regular expression |: matches the preceding or following regular expression []: defines a character class (): groups enclosed regular expression into a new regular expression “…” : matches everything within the “ ”, literally
Lex/Flex RegExp (cont) x|yxor y {i} definition of i x/yx, only if followed by y (y not removed from input) x{m,n}mto n occurrences of x xx, but only at beginning of line x$x, but only at end of line "s" exactly what is in the quotes (except for "\" and following character) A regular expression finishes with a space, tab or newline
Meta-characters • meta-characters (do not match themselves, because they are used in the preceding reg exps): • ( ) [ ] { } < > + / , ^ * | . \ " $ ? - % • to match a meta-character, prefix with "\" • to match a backslash, tab or newline, use \\, \t, or \n
Regular Expression Examples • an integer: 12345 • [1-9][0-9]* • a word: cat • [a-zA-Z]+ • a (possibly) signed integer: 12345 or -12345 • [-+]?[1-9][0-9]* • a floating point number: 1.2345 • [0-9]*”.”[0-9]+
Lex/Flex Regular Expressions Lex/Flex uses an extended form of regular expressions: • c any character except meta-characters (see below) • [...] the list of enclosed chars (may be a range) • [...] the list of chars not enclosed • . any ASCII char except newline • xy concatenation of x and y • x* • x+x* but notε • x? an optional x (same as x|ε)
Two Rules • lex/flex will always match the longest (number of characters) token possible. • 2. If two or more possible tokens are of the same length, then the token with the regular expression that is defined first in the lex/flex specification is favored.
Regular Expression Examples • a delimiter for an English sentence • “.” | “?” | ! OR • [“.”“?”!] • C++ comment: // call foo() here • “//”.* • white space • [ \t]+ • English sentence: • ([ \t]+|[a-zA-Z]+)+(“.”|“?”|!)
Special Functions • yytext • where text matched most recently is stored • yyleng • number of characters in text most recently matched • yylval • associated value of current token • yymore () • append next string matched to current contents of yytext • yyless (n) • remove from yytext all but the first n characters • unput (c) • return character c to input stream • yywrap () • may be replaced by user • The yywrap method is called by the lexical analyzer whenever it inputs an EOF as the first character when trying to match a regular expression
Example Flex Code flex.l