
Regular Expressions

Regular Expressions. The role of the lexical analyzer. SKIP: deletion of comments and compaction of whitespace characters. TOKEN: regular expressions are used to specify tokens. Reasons for separating lexical analysis from parsing: simplicity of design is the most important consideration.

breena

Presentation Transcript


  1. Regular Expressions

  2. The Role of the Lexical Analyzer
     • SKIP: deletion of comments and compaction of whitespace characters
     • TOKEN: regular expressions are used to specify tokens
     • Reasons for separating lexical analysis from parsing:
       • Simplicity of design is the most important consideration.
       • Compiler efficiency is improved.
       • Compiler portability is enhanced.

  3. Tokens, Patterns, and Lexemes (3.1.2)
     • Example: printf("Total = %d\n", score);
       • lexeme printf: token = id
       • lexeme "Total = %d\n": token = literal
       • lexeme score: token = id
     • A token is a pair consisting of a token name and an optional attribute value.
     • A pattern is a description of the form that the lexemes of a token may take.
     • A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
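The token/pattern/lexeme distinction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the scanner the slides have in mind: the pattern table `TOKEN_SPEC`, the extra `punct` and `ws` token names, and the function `tokenize` are all assumptions made for the example.

```python
import re

# Hypothetical pattern table: each entry pairs a token name with the
# pattern (a regular expression) that its lexemes must match.
TOKEN_SPEC = [
    ("literal", r'"(?:\\.|[^"\\])*"'),        # double-quoted string literal
    ("id",      r"[A-Za-z_][A-Za-z_0-9]*"),   # identifier
    ("punct",   r"[(),;]"),                   # punctuation (assumed token name)
    ("ws",      r"\s+"),                      # whitespace, skipped below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no token pattern matches at position {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# The C statement from the slide; the raw string keeps \n as two characters,
# exactly as it appears in the source program.
tokens = tokenize(r'printf("Total = %d\n", score);')
for name, lexeme in tokens:
    print(f"{name:8} {lexeme}")
```

Here `printf` and `score` are two different lexemes that match the same pattern and so yield the same token name, `id`.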

  4. Attributes for Tokens
     • When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases with additional information about the particular lexeme that matched.
     • The token names and associated attribute values for the Fortran statement E = M * C ** 2:
       <id, pointer to symbol-table entry for E>
       <assign_op>
       <id, pointer to symbol-table entry for M>
       <mult_op>
       <id, pointer to symbol-table entry for C>
       <exp_op>
       <number, integer value 2>
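A rough sketch of how a scanner might attach attribute values for the Fortran statement above. The names `PATTERN` and `scan` are hypothetical, and a list index into `symbol_table` stands in for the "pointer to symbol-table entry"; a real compiler would use a richer symbol-table structure.

```python
import re

# Assumed patterns for just the tokens in E = M * C ** 2.
# exp_op is listed before mult_op so that ** is not read as two * tokens.
PATTERN = re.compile(r"""
    (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<number>[0-9]+)
  | (?P<exp_op>\*\*)
  | (?P<mult_op>\*)
  | (?P<assign_op>=)
  | (?P<ws>\s+)
""", re.VERBOSE)


def scan(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []   # identifier names; the index plays the role of a pointer
    tokens = []
    pos = 0
    while pos < len(source):
        m = PATTERN.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append((kind, int(lexeme)))   # attribute: integer value
        else:
            tokens.append((kind, None))          # operators need no attribute
    return tokens, symbol_table


tokens, symbol_table = scan("E = M * C ** 2")
```

Only `id` and `number` carry an attribute here, because those are the tokens whose pattern matches more than one lexeme.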

  5. Lexical Errors
     • It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
       • E.g., fi ( a == f(x)) ...
       • Here fi is a valid lexeme for the token id, so the lexical analyzer cannot tell whether it is a misspelling of if or an undeclared function name.
     • The simplest recovery strategy is "panic mode" recovery.
     • Other possible error-recovery actions:
       • Delete one character from the remaining input.
       • Insert a missing character into the remaining input.
       • Replace a character by another character.
       • Transpose two adjacent characters.
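Panic-mode recovery is usually described as deleting successive characters from the remaining input until a well-formed token can be found. The sketch below assumes that reading; the small token set and the name `scan_with_panic_mode` are made up for the example, and the trailing `@ y` is added to the slide's input only to trigger an error.

```python
import re

# A deliberately small, assumed token set.
TOKEN = re.compile(
    r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>==|[()=])|(?P<ws>\s+)"
)


def scan_with_panic_mode(source):
    """Tokenize source; when no pattern matches, drop one character and retry."""
    tokens, skipped = [], []
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            # Panic mode: delete the offending character and keep scanning.
            skipped.append((pos, source[pos]))
            pos += 1
            continue
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, skipped


tokens, skipped = scan_with_panic_mode("fi ( a == f(x)) @ y")
```

Note that `fi` comes back as an ordinary `id` token: only `@` is reported, which is the slide's point that the scanner alone cannot detect the misspelled keyword.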

  6. Languages
     • An alphabet is any finite set of symbols.
       • Binary alphabet: {0, 1}
       • ASCII: 128 characters (256 in 8-bit extended variants)
       • Unicode: approximately 100,000 characters from symbols of the world's languages
     • A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
       • Synonyms in language theory: sentence, word
       • |s|: the length of a string s
       • ε: the empty string
     • A language is any countable set of strings over some fixed alphabet.
       • The definition is broad.
       • Examples: C, Java, English

  7. Operations on Strings
     • The concatenation of two strings, x and y, is xy.
       • x = dog, y = house, xy = doghouse
     • The empty string is the identity under concatenation: εs = sε = s.
     • The exponentiation of strings:
       • s^0 = ε
       • For all i > 0, s^i = s^(i-1) s
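The recursive definition of string exponentiation translates directly into code. In this small sketch the function name `power` is an illustrative choice, and Python's empty string `""` plays the role of ε.

```python
def power(s, i):
    """Return s^i, following s^0 = ε and s^i = s^(i-1) s for i > 0."""
    if i < 0:
        raise ValueError("exponent must be non-negative")
    if i == 0:
        return ""                  # ε, the empty string
    return power(s, i - 1) + s     # s^(i-1) followed by s


x, y = "dog", "house"
xy = x + y                         # concatenation
```

In everyday Python the same result is available as `s * i`; the recursive form is written out only to mirror the definition.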

  8. Operations on Languages
     • L = {A, B, ..., Z, a, b, ..., z}, D = {0, 1, ..., 9}
     • L ∪ D is the set of letters and digits.
     • LD is the set of 520 strings of length 2, each consisting of one letter followed by one digit.
     • L^4 is the set of all 4-letter strings.
     • L* is the set of all strings of letters, including the empty string, ε.
     • L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
     • D+ is the set of all strings of one or more digits.
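For finite languages these operations can be tried out with Python sets. The helper names `concat`, `lang_power`, and `closure_up_to` are assumptions for this sketch, and because a Kleene closure is infinite, the last helper only builds strings up to a chosen number of concatenations.

```python
import string

L = set(string.ascii_letters)    # {A, ..., Z, a, ..., z}: 52 letters
D = set(string.digits)           # {0, ..., 9}: 10 digits


def concat(A, B):
    """Concatenation AB: every string of A followed by every string of B."""
    return {a + b for a in A for b in B}


def lang_power(A, i):
    """A^i, with A^0 = {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result


def closure_up_to(A, n):
    """Finite slice of A*: the union of A^0, A^1, ..., A^n."""
    result = set()
    for i in range(n + 1):
        result |= lang_power(A, i)
    return result


letters_and_digits = L | D       # union: 62 one-character strings
LD = concat(L, D)                # 52 * 10 = 520 strings of length 2
print(len(letters_and_digits), len(LD), len(closure_up_to(D, 2)))
```

The count of 520 on the slide falls out of the concatenation as 52 letters times 10 digits.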

  9. Regular Expressions
     • Rules define the regular expressions (REs) over some alphabet Σ and the languages those expressions denote.
     • Basis
       • ε is an RE, and L(ε) is {ε}.
       • If a is a symbol in Σ, then a is an RE, and L(a) = {a}.
     • Induction: suppose r and s are REs denoting languages L(r) and L(s), respectively.
       • (r)|(s) is an RE denoting the language L(r) ∪ L(s).
       • (r)(s) is an RE denoting the language L(r)L(s).
       • (r)* is an RE denoting the language (L(r))*.
       • (r) is an RE denoting the language L(r).
     • Parentheses can be dropped by adopting precedence and associativity conventions: * has the highest precedence, then concatenation, then |, and all three are left associative.
       • (a)|((b)*(c)) is a|b*c
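Python's `re` module follows the same precedence conventions for `|`, concatenation, and `*`, so the parenthesized and unparenthesized forms can be compared on a few strings. This is a spot check on an arbitrary sample list, not a proof of equivalence; `in_language` is a helper name chosen for the example.

```python
import re


def in_language(pattern, s):
    """True if the whole string s is in the language denoted by pattern."""
    return re.fullmatch(pattern, s) is not None


samples = ["a", "c", "bc", "bbbc", "ac", "ab", ""]

without_parens = [s for s in samples if in_language(r"a|b*c", s)]
with_parens = [s for s in samples if in_language(r"(a)|((b)*(c))", s)]
other_grouping = [s for s in samples if in_language(r"(a|b)*c", s)]

print(without_parens)    # same strings as with_parens
print(other_grouping)    # a different language: includes "ac", excludes "a"
```

The third expression shows what the conventions rule out: a|b*c is not read as (a|b)*c.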

  10. Regular Expressions

  11. Regular Definitions
     • If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
         d_1 → r_1
         d_2 → r_2
         ...
         d_n → r_n
       where
       • each d_i is a new symbol, not in Σ and not the same as any other of the d's, and
       • each r_i is a regular expression over the alphabet Σ ∪ {d_1, d_2, ..., d_(i-1)}.
     • By restricting r_i to Σ and the previously defined d's, we avoid recursive definitions, and we can construct a regular expression over Σ alone for each r_i.
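The last bullet, that each r_i can be turned into a regular expression over Σ alone, amounts to substituting earlier definitions into later ones. The sketch below assumes a made-up `{name}` syntax for referring to a previously defined d_i; the function `expand` and the sample definitions `letter_`, `digit`, and `id` are illustrative only.

```python
import re


def expand(definitions):
    """Expand an ordered list of (name, body) pairs into plain regexes.

    A body may refer to an earlier definition as {name}. A reference to a
    name not yet defined (a forward or recursive reference) is an error.
    """
    expanded = {}
    for name, body in definitions:
        def substitute(match):
            ref = match.group(1)
            if ref not in expanded:
                raise ValueError(f"{name!r} refers to undefined {ref!r}")
            return "(?:" + expanded[ref] + ")"
        expanded[name] = re.sub(r"\{(\w+)\}", substitute, body)
    return expanded


definitions = [
    ("letter_", r"[A-Za-z_]"),
    ("digit",   r"[0-9]"),
    ("id",      r"{letter_}(?:{letter_}|{digit})*"),
]
patterns = expand(definitions)
print(patterns["id"])
```

Because a name becomes available only after its own body has been expanded, a definition such as `("bad", "{bad}x")` is rejected, which matches the slide's restriction against recursion.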

  12. Extensions of Regular Definitions
     • One or more instances
       • The postfix +: positive closure of a regular expression and its language.
       • (r)+ denotes (L(r))+
       • Same precedence and associativity as the operator *.
       • r* = r+ | ε, r+ = rr* = r*r
     • Zero or one instance
       • The postfix ? means "zero or one occurrence."
       • r? = r | ε
     • Character classes
       • a_1 | a_2 | ... | a_n can be replaced by [a_1 a_2 ... a_n]
       • When a_1, a_2, ..., a_n form a logical sequence, the class can be written as a range, e.g., [a-z]
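These identities can be spot-checked by comparing two patterns on every string up to a small length. The check below is bounded, so it is evidence rather than proof; `same_language`, the alphabet, and the length limit are all choices made for this sketch. Python's `re` accepts an empty alternative, written here as `(?:a|)`, which stands in for r | ε.

```python
import re
from itertools import product


def same_language(p, q, alphabet="ab", max_len=4):
    """True if p and q accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


checks = {
    "r+ = rr*":      same_language(r"(?:ab)+", r"(?:ab)(?:ab)*"),
    "r+ = r*r":      same_language(r"(?:ab)+", r"(?:ab)*(?:ab)"),
    "r* = r+ | ε":   same_language(r"(?:ab)*", r"(?:(?:ab)+|)"),
    "r? = r | ε":    same_language(r"a?", r"(?:a|)"),
    "[ab] = a|b":    same_language(r"[ab]", r"a|b"),
    "a+ = a* (not)": same_language(r"a+", r"a*"),
}
for name, result in checks.items():
    print(name, result)
```

The final entry is a deliberate counterexample: a+ and a* differ on the empty string, so the check reports a mismatch.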
