Specification of tokens using regular expressions
Strings and Languages • An alphabet is any finite set of symbols. • Examples of symbols are letters, digits, and punctuation. • The set {0,1} is the binary alphabet. • A string over an alphabet is a finite sequence of symbols drawn from that alphabet. • "Sentence" and "word" are often used as synonyms for "string." • The empty string, denoted ε, is the string of length zero. • A language is any countable set of strings over some fixed alphabet. • The empty set ∅ and the set {ε}, which contains only the empty string, are both languages.
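The definitions above can be made concrete with a short sketch. It enumerates every string over a given alphabet up to a chosen length, starting with the empty string. The helper name `strings_up_to` and the length bound are assumptions made for illustration only; the set of all strings over an alphabet is infinite, so any program can list only a finite part of it.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Return all strings over `alphabet` of length 0..max_len, shortest first."""
    result = []
    for n in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple, i.e. the empty string
        for symbols in product(sorted(alphabet), repeat=n):
            result.append("".join(symbols))
    return result

binary = {"0", "1"}                 # the binary alphabet
print(strings_up_to(binary, 2))     # ['', '0', '1', '00', '01', '10', '11']

# Any set of such strings is a language, including these two:
empty_language = set()              # the empty set
only_empty_string = {""}            # {ε}
```

Note that `empty_language` has no members at all, while `only_empty_string` has exactly one member, the string of length zero.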
Regular expression • Regular expressions are an important notation for specifying lexeme patterns. They are effective in specifying the kinds of patterns that we need for tokens. • The regular expression notation for identifiers is identifier = letter (letter | digit)*
Regular expressions are built recursively out of smaller regular expressions, using the rules described below. • Regular expression construction rules • ε is a regular expression denoting {ε}, that is, the language containing only the empty string. • If a is a symbol in Σ (the alphabet), then a is a regular expression denoting {a}, the language with only one string. • If r and s are regular expressions denoting languages L(r) and L(s) respectively, then • (r)|(s) is a regular expression denoting L(r) ∪ L(s) • (r).(s) is a regular expression denoting the concatenation L(r)L(s) • (r)* is a regular expression denoting (L(r))*
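The three construction rules correspond to three operations on languages. The sketch below models a language as a Python set of strings. The function names `union`, `concat`, and `star` are assumptions for illustration, and `star` is truncated at `max_len` because the real closure is infinite whenever the language contains a non-empty string.

```python
def union(l1, l2):
    """L(r) ∪ L(s): strings that are in either language."""
    return l1 | l2

def concat(l1, l2):
    """L(r)L(s): every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def star(lang, max_len):
    """Closure of lang, truncated to strings of length <= max_len."""
    result = {""}        # zero repetitions give the empty string
    frontier = {""}
    while frontier:
        # extend the newest strings by one more string from lang
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result

a, b = {"a"}, {"b"}
print(sorted(union(a, b)))                        # ['a', 'b']
print(sorted(concat(union(a, b), union(a, b))))   # ['aa', 'ab', 'ba', 'bb']
print(sorted(star(a, 3)))                         # ['', 'a', 'aa', 'aaa']
```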
Precedence of operations • The unary operator * has the highest precedence and is left associative. • Concatenation has the second highest precedence and is left associative. • | has the lowest precedence and is left associative. • For any regular expressions R, S and T the following axioms hold • R|S = S|R (| is commutative) • R|(S|T) = (R|S)|T (| is associative) • R(ST) = (RS)T (concatenation is associative) • R(S|T) = RS|RT (concatenation distributes over |) • εR = Rε = R (ε is the identity for concatenation)
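Python's `re` module follows the same precedence for `*`, concatenation, and `|`, so the rules and a few of the axioms can be spot-checked by comparing the strings that two patterns accept. This is only a finite check on short strings, not a proof. The helper `language`, the three-letter alphabet, and the length bound are assumptions for illustration.

```python
import re
from itertools import product

def language(pattern, alphabet="abc", max_len=4):
    """Strings over `alphabet` up to max_len that the whole pattern matches."""
    compiled = re.compile(pattern)
    return {
        "".join(p)
        for n in range(max_len + 1)
        for p in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(p))
    }

# Precedence: a|bc* groups as (a)|((b)((c)*)), not as (a|b)c*.
assert language("a|bc*") == language("(a)|((b)((c)*))")
assert language("a|bc*") != language("(a|b)c*")

# Axioms, checked on strings of length <= 4 only.
assert language("a|b") == language("b|a")            # | is commutative
assert language("a|(b|c)") == language("(a|b)|c")    # | is associative
assert language("a(bc)") == language("(ab)c")        # concatenation is associative
assert language("a(b|c)") == language("ab|ac")       # distribution over |
```

Because of these rules, redundant parentheses can be dropped: (a)|((b)*(c)) can be written as a|b*c.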
Examples • Let Σ = {a, b}. • The regular expression a|b denotes the language {a, b}. • (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of length two over the alphabet Σ. • a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}. • (a|b)* denotes the set of all strings consisting of zero or more instances of a or b. • a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a together with all strings of zero or more a's followed by a single b.
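Regular definition • We may wish to give names to certain regular expressions and use those names in subsequent expressions as if the names were themselves symbols. • A regular definition is a sequence of definitions of the form d_i -> r_i, where each d_i is a new name and each r_i is a regular expression over the alphabet symbols and the previously defined names. • e.g. for the language of C identifiers • letter_ -> A|B|…|Z|a|b|…|z|_ • digit -> 0|1|…|9 • id -> letter_(letter_|digit)*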
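A regular definition can be imitated in Python by building each pattern string from the ones defined before it. The sketch below mirrors the three definitions for C identifiers. Writing the letter and digit alternatives as character classes, the ASCII-only letters, and the names `id_` and `is_c_identifier` are assumptions of this sketch. It also does not distinguish reserved words from identifiers.

```python
import re

# Each definition uses only names that were defined before it.
letter_ = r"[A-Za-z_]"                        # letter_ -> A|B|...|Z|a|b|...|z|_
digit = r"[0-9]"                              # digit   -> 0|1|...|9
id_ = rf"{letter_}(?:{letter_}|{digit})*"     # id      -> letter_(letter_|digit)*

ID = re.compile(id_)

def is_c_identifier(text):
    """True if the whole of `text` matches the id definition."""
    return ID.fullmatch(text) is not None

print(is_c_identifier("_tmp1"))    # True
print(is_c_identifier("9lives"))   # False: starts with a digit
```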
Extensions of regular expressions • + one or more instances • * zero or more instances • ? zero or one instance • [ ] character classes, e.g. [a-z] • ws -> (blank|tab|newline)+ • When ws is recognized, the lexical analyzer does not return a token; it restarts at the character following the whitespace.
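The whitespace rule can be sketched as a small scanning loop: when ws matches, the position moves past it and no token is produced. The `id` and `number` patterns, the token names, and the function `tokenize` are hypothetical and exist only to give the loop something to return. The `number` pattern also shows `+`, `?`, and character classes together.

```python
import re

WS = re.compile(r"[ \t\n]+")                   # ws -> (blank|tab|newline)+
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")     # hypothetical id pattern
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")    # digits, then an optional fraction

def tokenize(text):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = WS.match(text, pos)
        if m:
            pos = m.end()      # return nothing; restart after the whitespace
            continue
        for name, pattern in (("id", ID), ("number", NUMBER)):
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens

print(tokenize("count  42\n\tx_1 3.14"))
# [('id', 'count'), ('number', '42'), ('id', 'x_1'), ('number', '3.14')]
```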