Regular Expressions

The Role of the Lexical Analyzer
• SKIP: delete comments and compact whitespace characters.
• TOKEN: use regular expressions to specify tokens.
• Reasons for separating lexical analysis from parsing:
  • Simplicity of design is the most important consideration.
  • Compiler efficiency is improved.
  • Compiler portability is enhanced.

3.1.2 Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value.
• A pattern is a description of the form that the lexemes of a token may take.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
• Example: in printf("Total = %d\n", score); the lexemes printf and score are instances of the token id, and "Total = %d\n" is an instance of the token literal.
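
The relationship between lexemes, patterns, and token names can be sketched in a few lines of Python. The two patterns below are assumptions chosen for illustration (the slide names the tokens id and literal but does not give their patterns), and classify is a hypothetical helper.

```python
import re

# Hypothetical patterns for the two token names on the slide.
PATTERNS = [
    ("literal", r'"(?:[^"\\]|\\.)*"'),      # double-quoted string, escapes allowed
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),      # letter or underscore, then letters/digits
]

def classify(lexeme):
    """Return the name of the first token whose pattern matches the whole lexeme."""
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None  # no pattern matches: not an instance of any known token

# The three lexemes highlighted in printf("Total = %d\n", score);
lexemes = ["printf", '"Total = %d\\n"', "score"]
tokens = [classify(x) for x in lexemes]
print(tokens)  # ['id', 'literal', 'id']
```

Here each lexeme is a concrete character sequence, each regular expression is a pattern, and the returned name is the token name.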

Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases with additional information about the particular lexeme that matched.
• The token names and associated attribute values for the FORTRAN statement E = M * C ** 2:
  <id, pointer to symbol-table entry for E>
  <assign_op>
  <id, pointer to symbol-table entry for M>
  <mult_op>
  <id, pointer to symbol-table entry for C>
  <exp_op>
  <number, integer value 2>
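
A minimal sketch of how a scanner might produce these pairs, assuming Python's re module as the matching engine. The token names follow the slide; the patterns, the tokenize function, and the use of a dictionary index in place of a symbol-table pointer are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op", r"\*\*"),      # listed before mult_op so ** is not read as two *
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),           # whitespace is skipped, never returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = {}         # lexeme -> entry index (stands in for a pointer)
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}")
        name, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if name == "ws":
            continue
        if name == "id":
            # Many lexemes match id, so the attribute says which one was seen.
            tokens.append((name, symbol_table.setdefault(lexeme, len(symbol_table))))
        elif name == "number":
            tokens.append((name, int(lexeme)))
        else:
            tokens.append((name, None))   # operators need no attribute
    return tokens, symbol_table

tokens, symbol_table = tokenize("E = M * C ** 2")
```

Only id and number carry an attribute here, because only their patterns are matched by more than one lexeme.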

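Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. E.g., in fi ( a == f(x)) ... the lexeme fi is a valid identifier, so the lexical analyzer cannot tell whether it is a misspelling of if or the name of a function.
• The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
• Other possible error-recovery actions:
  • Delete one character from the remaining input.
  • Insert a missing character into the remaining input.
  • Replace a character by another character.
  • Transpose two adjacent characters.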
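
Panic-mode recovery can be sketched as a scanner loop that discards one character whenever no token pattern matches at the current position. The token set and the function name scan_with_panic_mode are assumptions for illustration, not part of the slides.

```python
import re

# A small, assumed token set: identifiers, integers, a few operators, whitespace.
TOKEN = re.compile(
    r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>==|[()=+*-])|(?P<ws>\s+)"
)

def scan_with_panic_mode(source):
    """Return (tokens, errors); errors lists each (position, character) skipped."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            # Panic mode: drop the offending character and try again.
            errors.append((pos, source[pos]))
            pos += 1
            continue
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors

# 'fi' is a perfectly good identifier, so the scanner reports no error here.
tokens, errors = scan_with_panic_mode("fi ( a == f(x))")

# '@' matches no pattern, so it is skipped and recorded.
tokens2, errors2 = scan_with_panic_mode("a @ b")
```

The first call returns an empty error list, which illustrates why the misspelled keyword has to be caught by a later phase such as the parser.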

Languages
• An alphabet is any finite set of symbols.
  • Binary alphabet: {0, 1}
  • ASCII: 128 characters (256 in 8-bit extended encodings)
  • Unicode: approximately 100,000 characters from the world's writing systems
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
  • Synonyms in language theory: sentence, word
  • |s|: the length of a string s
  • ε: the empty string, of length zero
• A language is any countable set of strings over some fixed alphabet.
  • The definition is broad: C, Java, and English are all languages in this sense.

Operations on Strings
• The concatenation of two strings x and y is xy.
  • x = dog, y = house, xy = doghouse
• The empty string is the identity under concatenation: εs = sε = s.
• Exponentiation of strings:
  • s^0 = ε
  • For all i > 0, s^i = s^(i-1)s
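
The recursive definition of exponentiation translates directly into code. The sketch below models strings as Python str values and ε as the empty string; the function name power is an assumption.

```python
def power(s, i):
    """Return s^i using the definition s^0 = ε and s^i = s^(i-1)s for i > 0."""
    if i < 0:
        raise ValueError("exponent must be non-negative")
    if i == 0:
        return ""                 # s^0 is the empty string ε
    return power(s, i - 1) + s    # s^i = s^(i-1) followed by s

x, y = "dog", "house"
print(x + y)            # doghouse
print(power("ab", 3))   # ababab
```

Because ε is the identity under concatenation, power(s, 1) is s itself: "" + s == s.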

Operations on Languages
• Let L = {A, B, ..., Z, a, b, ..., z} and D = {0, 1, ..., 9}.
• L ∪ D is the set of letters and digits.
• LD is the set of 520 strings of length 2, each consisting of one letter followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including the empty string, ε.
• L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
• D^+ is the set of all strings of one or more digits.
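
These operations can be checked on finite sets. In the sketch below, languages are Python sets of strings; concat, lang_power, and closure_up_to are hypothetical helper names. Since L* is infinite, the closure is only approximated by bounding the number of concatenated factors.

```python
import string

L = set(string.ascii_letters)   # the 52 upper- and lowercase letters
D = set(string.digits)          # the 10 digits

def concat(A, B):
    """Concatenation AB = { xy | x in A, y in B }."""
    return {x + y for x in A for y in B}

def lang_power(A, i):
    """A^i: A^0 = {ε}, A^i = A^(i-1) A."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result

def closure_up_to(A, k):
    """Finite approximation of A*: the union of A^0, A^1, ..., A^k."""
    result = set()
    for i in range(k + 1):
        result |= lang_power(A, i)
    return result

print(len(L | D))          # 62 letters and digits
print(len(concat(L, D)))   # 520 strings: one letter followed by one digit
print(len(L) ** 4)         # size of L^4, computed without enumerating it
```

The count 520 is simply 52 × 10, one string for each letter paired with each digit.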

Regular Expressions
• Rules define the regular expressions (REs) over some alphabet Σ and the languages those expressions denote.
• Basis:
  • ε is an RE, and L(ε) is {ε}.
  • If a is a symbol in Σ, then a is an RE, and L(a) = {a}.
• Induction: suppose r and s are REs denoting languages L(r) and L(s), respectively.
  • (r)|(s) is an RE denoting the language L(r) ∪ L(s).
  • (r)(s) is an RE denoting the language L(r)L(s).
  • (r)* is an RE denoting the language (L(r))*.
  • (r) is an RE denoting the language L(r).
• Parentheses can be dropped by adopting precedence and associativity conventions: * has the highest precedence, then concatenation, then |, and all three are left associative.
  • (a)|((b)*(c)) is a|b*c
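
The basis and induction rules can be read as a recursive function from expressions to languages. The sketch below assumes a nested-tuple encoding of REs (an illustrative choice, not from the slides) and computes only the strings of length at most n, since L(r) may be infinite. It then cross-checks the result for a|b*c against Python's re module.

```python
import re
from itertools import product

def lang(r, n):
    """Strings of length <= n in L(r).

    r is ("eps",), ("sym", a), ("or", r, s), ("cat", r, s), or ("star", r),
    where a is a single character.
    """
    kind = r[0]
    if kind == "eps":                       # L(ε) = {ε}
        return {""}
    if kind == "sym":                       # L(a) = {a}
        return {r[1]} if n >= 1 else set()
    if kind == "or":                        # L(r) ∪ L(s)
        return lang(r[1], n) | lang(r[2], n)
    if kind == "cat":                       # L(r)L(s)
        return {x + y for x in lang(r[1], n) for y in lang(r[2], n)
                if len(x + y) <= n}
    if kind == "star":                      # (L(r))*
        base = lang(r[1], n) - {""}
        result, frontier = {""}, {""}
        while frontier:                     # stops once no new short string appears
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown RE node: {kind}")

# a|b*c with its structure made explicit: (a)|((b)*(c))
expr = ("or", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))
denoted = lang(expr, 3)
print(sorted(denoted))  # ['a', 'bbc', 'bc', 'c']

# Cross-check against Python's regex engine on every string over {a, b, c}
# of length 0 to 3.
candidates = ("".join(p) for k in range(4) for p in product("abc", repeat=k))
via_re = {s for s in candidates if re.fullmatch(r"a|b*c", s)}
print(via_re == denoted)  # True
```

The agreement with re.fullmatch also shows that Python's engine applies the same precedence conventions when parentheses are dropped.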

Regular Definitions
• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
    d_1 → r_1
    d_2 → r_2
    ...
    d_n → r_n
  where
  • each d_i is a new symbol, not in Σ and not the same as any other of the d's, and
  • each r_i is a regular expression over the alphabet Σ ∪ {d_1, d_2, ..., d_(i-1)}.
• By restricting r_i to Σ and the previously defined d's, we avoid recursive definitions, and we can construct a regular expression over Σ alone for each r_i.
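
The last point, that every r_i can be rewritten over Σ alone, amounts to substituting earlier definitions into later ones in order. The sketch below assumes a hypothetical three-line definition for identifiers written in Python regex syntax, with {name} marking a reference to a previously defined symbol. The letter/digit/id example and the expand helper are illustrations, not part of the slides.

```python
import re

# Hypothetical regular definition, in order d_1 -> r_1, d_2 -> r_2, d_3 -> r_3.
DEFINITIONS = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]

def expand(definitions):
    """Rewrite each r_i as a regular expression over the basic alphabet only."""
    expanded = {}
    for name, body in definitions:
        if name in expanded:
            raise ValueError(f"{name!r} is defined twice")

        def substitute(match):
            ref = match.group(1)
            # Only d_1 .. d_(i-1) may be used, which rules out recursion.
            if ref not in expanded:
                raise ValueError(f"{name!r} uses {ref!r} before it is defined")
            return "(?:" + expanded[ref] + ")"

        # {name} references start with a letter, so quantifiers like {2} are untouched.
        expanded[name] = re.sub(r"\{([A-Za-z_]\w*)\}", substitute, body)
    return expanded

patterns = expand(DEFINITIONS)
print(patterns["id"])
print(bool(re.fullmatch(patterns["id"], "x1")))   # True
print(bool(re.fullmatch(patterns["id"], "1x")))   # False
```

A self-referential entry such as ("a", "{a}b") is rejected, because a is not yet in expanded when its own body is processed.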

Extensions of Regular Expressions
• One or more instances
  • The postfix operator + denotes the positive closure of a regular expression and its language: (r)+ denotes (L(r))^+.
  • Same precedence and associativity as the operator *.
  • r* = r+|ε, r+ = rr* = r*r
• Zero or one instance
  • The postfix operator ? means "zero or one occurrence."
  • r? = r|ε
• Character classes
  • a_1|a_2|...|a_n can be replaced by [a_1a_2...a_n].
  • When a_1, a_2, ..., a_n form a logical sequence, such as consecutive lowercase letters, the class can be written [a_1-a_n]; e.g., [a-z] stands for a|b|...|z.
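
The identities above can be spot-checked with Python's re module, which supports +, ?, and character classes. The language helper below is an assumption. It enumerates every string up to a small length bound, so the comparisons are evidence on short strings, not proofs.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """All strings over alphabet, of length <= max_len, that pattern matches fully."""
    strings = ("".join(p) for n in range(max_len + 1)
               for p in product(alphabet, repeat=n))
    return {s for s in strings if re.fullmatch(pattern, s)}

r = "(?:ab)"   # the sample expression r, grouped so postfix operators apply to all of it

# r+ = rr* = r*r
positive_closure_ok = (language(r + "+") == language(r + r + "*")
                       == language(r + "*" + r))

# r* = r+|ε  (an empty alternative plays the role of ε)
star_ok = language(r + "*") == language(r + "+|")

# r? = r|ε
optional_ok = language(r + "?") == language("(?:" + r + "|)")

# [a1a2...an] = a1|a2|...|an, and [a-c] abbreviates [abc]
class_ok = language("[abc]", alphabet="abcd") == language("a|b|c", alphabet="abcd")
range_ok = language("[a-c]", alphabet="abcd") == language("[abc]", alphabet="abcd")

print(positive_closure_ok, star_ok, optional_ok, class_ok, range_ok)
```

With r = ab and the default bound of four characters, language(r + "+") is {"ab", "abab"}, while language(r + "*") additionally contains the empty string.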
