Specification of tokens using regular expressions

Strings and Languages
• An alphabet is any finite set of symbols.
• Examples of symbols are letters, digits, and punctuation.
• The set {0,1} is the binary alphabet.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• "Sentence" and "word" are often used as synonyms for "string."
• The empty string, denoted ε, is the string of length zero.
• A language is any countable set of strings over some fixed alphabet.
• The empty set ∅ and the set {ε}, which contains only the empty string, are both languages.
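The definitions above can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper name `strings_of_length` is hypothetical, and Python sets of `str` values stand in for languages.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """Return every string of exactly n symbols drawn from the alphabet."""
    return {"".join(symbols) for symbols in product(sorted(alphabet), repeat=n)}

binary = {"0", "1"}      # the binary alphabet
empty_string = ""        # the string of length zero

length_two = strings_of_length(binary, 2)   # a finite language over {0, 1}
only_empty = strings_of_length(binary, 0)   # the language containing only the empty string
no_strings = set()                          # the empty language
```

Note that `only_empty` and `no_strings` are different languages: the first has one member (the empty string), the second has none.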

Regular expression
• Regular expressions are an important notation for specifying lexeme patterns. They are effective at specifying the kinds of patterns that we need for tokens.
• The regular expression notation for identifiers is: identifier = letter (letter | digit)*
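As a rough illustration, the identifier pattern can be written with Python's `re` module. Restricting `letter` to ASCII letters is an assumption made here, and the function name `is_identifier` is hypothetical.

```python
import re

# letter (letter | digit)*, with letter assumed to be an ASCII letter
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(text):
    """True if the whole string is a lexeme matching the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None
```

`fullmatch` is used so that the entire string must fit the pattern: `is_identifier("count1")` holds, while `is_identifier("1count")` does not.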

Regular expressions are built recursively out of smaller regular expressions, using the construction rules below.
• ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
• If a is a symbol in ∑ (the alphabet), then a is a regular expression denoting {a}, the language with only one string, a.
• If r and s are regular expressions denoting languages L(r) and L(s) respectively, then
  • (r)|(s) is a regular expression denoting L(r) ∪ L(s)
  • (r).(s) is a regular expression denoting L(r).L(s)
  • (r)* is a regular expression denoting (L(r))*
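The three language operations behind these rules can be sketched on Python sets. The function names are hypothetical, and because (L(r))* is infinite in general, `closure` only approximates it by capping the number of repetitions.

```python
def union(l1, l2):
    """L(r) U L(s): strings that belong to either language."""
    return l1 | l2

def concat(l1, l2):
    """L(r).L(s): every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def closure(lang, max_repeats):
    """Approximate L* using at most max_repeats concatenations of lang."""
    result = {""}          # zero repetitions give the empty string
    power = {""}
    for _ in range(max_repeats):
        power = concat(power, lang)
        result |= power
    return result

EPSILON = {""}   # the language denoted by the regular expression for the empty string
A = {"a"}        # the language denoted by the symbol a
B = {"b"}        # the language denoted by the symbol b
```

For instance, `closure(union(A, B), 2)` contains the empty string, the two strings of length one, and the four strings of length two over a and b.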

Precedence of operations
• The unary operator * has the highest precedence and is left associative.
• Concatenation has the second-highest precedence and is left associative.
• | has the lowest precedence and is left associative.
• For any regular expressions R, S, and T, the following axioms hold:
  • R|S = S|R (| is commutative)
  • R|(S|T) = (R|S)|T (| is associative)
  • R(ST) = (RS)T (concatenation is associative)
  • R(S|T) = RS|RT (concatenation distributes over |)
  • εR = Rε = R (ε is the identity for concatenation)
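Python's `re` module follows the same precedence, so a small check can make the conventions tangible. The sample strings are arbitrary choices, and the comparison is a spot check on those samples, not a proof of equivalence.

```python
import re

# a|bc* should be read as (a)|((b)((c)*)) under the precedence rules
implicit = re.compile(r"a|bc*")
explicit = re.compile(r"(a)|((b)((c)*))")

# R(S|T) = RS|RT: concatenation distributes over |
factored = re.compile(r"a(b|c)")
expanded = re.compile(r"ab|ac")

samples = ["", "a", "b", "bc", "bccc", "ac", "ab", "bcbc", "abc"]

def same_language_on(p, q, strings):
    """True if p and q accept exactly the same strings from the sample list."""
    return all((p.fullmatch(s) is None) == (q.fullmatch(s) is None) for s in strings)

precedence_agrees = same_language_on(implicit, explicit, samples)
distribution_agrees = same_language_on(factored, expanded, samples)
```

Both flags come out true on these samples; note that `implicit` accepts "bccc" but rejects "ac", which is what the reading a|(b(c*)) predicts.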

The regular expression a|b denotes the language {a, b}.
• (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of length two over the alphabet {a, b}.
• a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
• (a|b)* denotes the set of all strings consisting of zero or more instances of a or b.
• a|a*b denotes the language {a, b, ab, aab, aaab, ...}.
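These example languages can be enumerated mechanically, at least up to a length bound. The sketch below assumes the alphabet {a, b} and a cutoff of three symbols; the helper name `language` is hypothetical.

```python
import re
from itertools import product

def language(pattern, max_len, alphabet="ab"):
    """List strings of up to max_len symbols that the pattern matches in full."""
    compiled = re.compile(pattern)
    matched = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            candidate = "".join(symbols)
            if compiled.fullmatch(candidate):
                matched.append(candidate)
    return matched

print(language(r"a|b", 3))         # ['a', 'b']
print(language(r"(a|b)(a|b)", 3))  # ['aa', 'ab', 'ba', 'bb']
print(language(r"a*", 3))          # ['', 'a', 'aa', 'aaa']
print(language(r"a|a*b", 3))       # ['a', 'b', 'ab', 'aab']
```

The empty string appears as `''` in the third result, and the infinite languages are simply truncated at the length bound.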

Regular definition
• We may wish to give names to certain regular expressions and use those names in subsequent expressions as if the names were themselves symbols.
• A definition has the form d_i -> r_i, where d_i is the new name and r_i is the regular expression it stands for.
• E.g., for the language of C identifiers:
  • letter_ -> A|B|…|Z|a|b|…|z|_
  • digit -> 0|1|…|9
  • id -> letter_(letter_|digit)*
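A regular definition maps naturally onto named pattern fragments that later patterns reuse. The following is one possible rendering in Python; the variable `id_` carries a trailing underscore only to avoid shadowing the built-in `id`, and the character classes are shorthand for the alternations in the definition.

```python
import re

# Each name stands for a regular expression and can be used in later definitions.
letter_ = r"[A-Za-z_]"                       # letter_ -> A|B|...|Z|a|b|...|z|_
digit = r"[0-9]"                             # digit   -> 0|1|...|9
id_ = rf"{letter_}(?:{letter_}|{digit})*"    # id      -> letter_(letter_|digit)*

ID = re.compile(id_)
```

With this, `ID.fullmatch("_tmp1")` succeeds, while `ID.fullmatch("9lives")` returns `None` because an identifier cannot start with a digit.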

Extensions of regular expressions
• + one or more instances
• * zero or more instances
• ? zero or one instance
• [ ] character classes, e.g. [a-z]
• ws -> (blank|tab|newline)+
• When ws is recognized, we do not return a token to the parser; instead, lexical analysis restarts at the character following the white space.
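The white-space rule can be sketched as a tiny scanner that discards ws and resumes at the next character. This is a minimal illustration under stated assumptions: the `NUMBER` token, the token names, and the function `tokenize` are hypothetical and do not come from the slides.

```python
import re

WS = re.compile(r"[ \t\n]+")    # ws -> (blank | tab | newline)+

# Token patterns, tried in order; NUMBER is an assumed extra token for the demo.
TOKEN_SPECS = [
    ("ID", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
]

def tokenize(text):
    """Return (token_name, lexeme) pairs, skipping white space between lexemes."""
    tokens = []
    pos = 0
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:
            pos = ws.end()      # return nothing; restart after the white space
            continue
        for name, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens
```

For example, `tokenize("x1  \t 42\nfoo")` yields three tokens, and none of the blanks, tabs, or newlines between them show up in the result.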
