Lex COP 3401, Fall 2009
What is Lex?
• A tool for building lexical analyzers (lexers).
• A lexer (scanner) performs lexical analysis: it breaks an input stream into meaningful units, called tokens.
• E.g., consider breaking a text file up into individual words.
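To make the "breaking a text file up into individual words" example concrete, here is a minimal hand-written sketch in C of the kind of scanning loop that Lex generates for you. The names `Token` and `split_words` are hypothetical and are not part of Lex; the sketch assumes a word is simply a maximal run of letters.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token record (not part of Lex): a slice of the input. */
typedef struct {
    const char *start;  /* first character of the token */
    size_t      length; /* number of characters in the token */
} Token;

/* Hypothetical helper: finds the words (maximal runs of letters) in
 * `input`, stores at most `max` of them in `out`, and returns how many
 * were stored. */
size_t split_words(const char *input, Token *out, size_t max)
{
    size_t count = 0;
    const char *p = input;

    while (*p != '\0' && count < max) {
        /* Skip anything that is not a letter (spaces, digits, punctuation). */
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        /* Consume the longest run of letters: this is one token. */
        const char *begin = p;
        while (isalpha((unsigned char)*p))
            p++;

        out[count].start = begin;
        out[count].length = (size_t)(p - begin);
        count++;
    }
    return count;
}
```

Calling `split_words("Lex builds lexers.", tokens, 8)` on an array of eight `Token` values yields three tokens. With Lex, the same behaviour is described by a pattern and an action instead of hand-written loops.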
Usage
Lex source program -> Lex -> lex.yy.c
lex.yy.c -> C compiler -> a.out
input -> a.out -> tokens
Skeleton of a lex specification (.l file)
x.l (*.c is generated after running Lex on x.l):
%{
< C global variables, prototypes, comments >
%}
[DEFINITION SECTION]
%%
[RULES SECTION]
%%
< C auxiliary subroutines >
• %{ ... %}: this part will be embedded into *.c.
• Definition section: substitutions, code and start states; will be copied into *.c.
• Rules section: defines how to scan and what action to take for each token.
• Auxiliary subroutines: any user code. For example, a main function to call the scanning function yylex().
The rules section
%%
[RULES SECTION]
<pattern> { <action to take when matched> }
<pattern> { <action to take when matched> }
…
%%
Patterns are specified by regular expressions. For example:
%%
[A-Za-z]+ { printf("this is a word"); }
%%
Regular Expression Basics
.    : matches any single character except \n
*    : matches 0 or more instances of the preceding regular expression
+    : matches 1 or more instances of the preceding regular expression
?    : matches 0 or 1 of the preceding regular expression
|    : matches the preceding or following regular expression
[ ]  : defines a character class
( )  : groups the enclosed regular expression into a new regular expression
"…"  : matches everything within the quotes literally
Lex Reg Exp (cont)
x|y     x or y
{i}     definition of i
x/y     x, only if followed by y (y not removed from input)
x{m,n}  m to n occurrences of x
^x      x, but only at beginning of line
x$      x, but only at end of line
"s"     exactly what is in the quotes (except for "\" and following character)
A regular expression finishes with a space, tab or newline.
Meta-characters
• Meta-characters do not match themselves, because they are used in the preceding regular expressions:
  ( ) [ ] { } < > + / , ^ * | . \ " $ ? - %
• To match a meta-character, prefix it with "\".
• To match a backslash, tab or newline, use \\, \t, or \n.
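The escaping rule above can be sketched as a small C helper that turns literal text into a pattern fragment. The function name `escape_for_lex` and the `LEX_META` table are hypothetical and are not provided by Lex. The table is copied from the list on this slide, so characters outside that list (for example a space, which would end the pattern) are passed through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Meta-characters exactly as listed on the slide. */
static const char LEX_META[] = "()[]{}<>+/,^*|.\\\"$?-%";

/* Hypothetical helper: copies `text` into `out`, putting a backslash in
 * front of every meta-character and writing tab and newline as \t and \n.
 * Returns 0 on success, or -1 if `out` (of size `out_size`) is too small. */
int escape_for_lex(const char *text, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;

    for (const char *p = text; *p != '\0'; p++) {
        char second = *p;
        int needs_escape;

        if (*p == '\t') {
            second = 't';
            needs_escape = 1;
        } else if (*p == '\n') {
            second = 'n';
            needs_escape = 1;
        } else {
            needs_escape = strchr(LEX_META, *p) != NULL;
        }

        /* Reserve room for this character plus the final terminator. */
        size_t need = needs_escape ? 2 : 1;
        if (used + need + 1 > out_size)
            return -1;

        if (needs_escape)
            out[used++] = '\\';
        out[used++] = second;
    }
    out[used] = '\0';
    return 0;
}
```

For example, the literal text `1+1=2?` becomes `1\+1=2\?`, which matches only that exact string when used as a pattern.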
Regular Expression Examples
• an integer: 12345
  [1-9][0-9]*
• a word: cat
  [a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
  [-+]?[1-9][0-9]*
• a floating point number: 1.2345
  [0-9]*"."[0-9]+
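One way to experiment with these patterns without running Lex is the POSIX `regcomp`/`regexec` interface from `<regex.h>`. This is only an approximation: POSIX extended regular expressions do not support Lex's `"."` quoting, so the literal dot is written `\.` here. The helper name `matches_whole` is hypothetical, and the sketch assumes a POSIX system.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the whole of `text` matches the POSIX
 * extended regular expression `pattern`, 0 if it does not, and -1 if the
 * pattern is too long or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;

    /* Anchor the pattern so that it must cover the entire string. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

With this helper, `matches_whole("[-+]?[1-9][0-9]*", "-12345")` returns 1, while `matches_whole("[1-9][0-9]*", "0")` returns 0: the integer pattern on the slide requires a non-zero leading digit, so it accepts neither `0` nor `012`.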
Lex Regular Expressions
Lex uses an extended form of regular expression (c: character; x, y: regular expressions; s: string; m, n: integers; i: identifier).
• c       any character except meta-characters (see the Meta-characters slide)
• [...]   the list of enclosed chars (may be a range)
• [^...]  the list of chars not enclosed
• .       any ASCII char except newline
• xy      concatenation of x and y
• x*      zero or more occurrences of x
• x+      one or more occurrences of x (i.e. x* but not the empty string)
• x?      an optional x (x or the empty string)
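Rules
• Lex matches each part of the input only once: characters consumed by one match are not scanned again.
• Lex executes the action for the longest possible match for the current input.
• If several patterns match the same longest text, the rule listed first is used.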
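The longest-match rule can be illustrated with a small hand-written C sketch. It is a toy model, not the table-driven code that Lex generates. All names (`Matcher`, `pick_rule`, the three rule functions) are hypothetical. The sketch assumes three rules in this order: the keyword `if`, a lowercase identifier, and an unsigned integer. On a tie it keeps the earlier rule, which is how Lex resolves equal-length matches.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical rule reports how many characters it can match at the
 * start of `s` (0 means the rule does not match there). */
typedef size_t (*Matcher)(const char *s);

static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_int(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in priority order, as they would appear in the rules section. */
enum { RULE_IF, RULE_IDENT, RULE_INT, RULE_COUNT, NO_RULE = -1 };
static const Matcher RULES[RULE_COUNT] = { match_if, match_ident, match_int };

/* Returns the index of the rule that wins at the start of `s` and stores
 * the match length in `*length`; returns NO_RULE if nothing matches. */
int pick_rule(const char *s, size_t *length)
{
    int best = NO_RULE;
    size_t best_len = 0;

    for (int i = 0; i < RULE_COUNT; i++) {
        size_t n = RULES[i](s);
        /* Strictly longer wins; on a tie the earlier rule is kept. */
        if (n > best_len) {
            best = i;
            best_len = n;
        }
    }
    *length = best_len;
    return best;
}
```

On the input `iffy` the identifier rule wins with four characters, even though the keyword rule also matches the first two. On the input `if` both rules match two characters, so the keyword rule wins because it is listed first.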
Regular Expression Examples
• a delimiter for an English sentence
  "."|"?"|!   OR   [.?!]
• C++ comment: // call foo() here!!
  "//".*
• white space
  [ \t]+
• English sentence: Look at this!
  ([ \t]+|[a-zA-Z]+)+("."|"?"|!)
Special Functions
• yytext: where the text matched most recently is stored
• yyleng: number of characters in the text most recently matched
• yylval: associated value of the current token
• yymore(): append the next string matched to the current contents of yytext
• yyless(n): remove from yytext all but the first n characters (the rest are returned to the input)
• unput(c): return character c to the input stream
• yywrap(): may be replaced by the user. The lexical analyser calls yywrap whenever it reads an EOF as the first character when trying to match a regular expression.
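The relationship between yytext, yyleng and yyless(n) can be pictured with a toy model in C. It is a sketch for illustration only and does not reflect how Lex implements these facilities. The type `ToyScanner` and the functions `toy_init`, `toy_match` and `toy_less` are hypothetical names; the single built-in "pattern" is assumed to be a run of letters and digits.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy scanner state (hypothetical, not Lex's real data structures). */
typedef struct {
    const char *input;  /* the whole input string */
    size_t      pos;    /* index of the next unread character */
    char        text[64]; /* plays the role of yytext */
    size_t      leng;   /* plays the role of yyleng */
} ToyScanner;

void toy_init(ToyScanner *s, const char *input)
{
    s->input = input;
    s->pos = 0;
    s->text[0] = '\0';
    s->leng = 0;
}

/* Matches a run of letters and digits at the current position and records
 * it in text/leng. Returns 1 if at least one character matched, else 0. */
int toy_match(ToyScanner *s)
{
    size_t n = 0;
    while (n < sizeof s->text - 1 &&
           isalnum((unsigned char)s->input[s->pos + n]))
        n++;

    memcpy(s->text, s->input + s->pos, n);
    s->text[n] = '\0';
    s->leng = n;
    s->pos += n;
    return n > 0;
}

/* Models yyless(n): keep the first n matched characters and hand the
 * remaining ones back to the input so they are scanned again. */
void toy_less(ToyScanner *s, size_t n)
{
    if (n >= s->leng)
        return;              /* nothing to give back */
    s->pos -= s->leng - n;   /* rewind over the returned characters */
    s->text[n] = '\0';
    s->leng = n;
}
```

On the input `abc123 x`, the first `toy_match` stores `abc123` with length 6. After `toy_less(&s, 3)` the stored text is `abc`, and the next `toy_match` picks up `123` from the characters that were handed back.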