Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College
Scanning • Input: characters from the source code • Output: Tokens • Keywords: IF, THEN, ELSE, FOR … • Symbols: PLUS, LBRACE, SEMI … • Variable tokens: ID, NUM • Augment with string or numeric value
TokenType • Enumerated type (a C++ construct): typedef enum {IF, THEN, ELSE …} TokenType; • IF, THEN, ELSE (etc.) are now constants (enumerators) of type TokenType
Using TokenType
    void someFun(TokenType tt) {
        …
        switch (tt) {
            case IF: … break;
            case THEN: … break;
            …
        }
    }
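Token Class (partial)
    class Token {
    public:
        TokenType tokenval;      // which kind of token this is
        std::string tokenchars;  // the characters that make up the token
        double numval;           // numeric value (for NUM tokens)
    };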
Interlude: References and Pointers • Java has primitives and references • Primitives are int, char, double, etc. • References “point to” objects • In C++, every variable holds its value directly, whether primitive or object • But one of the built-in types is the pointer, which holds an address and serves the purpose of a Java reference.
Interlude: References and Pointers • To declare a pointer, put * after the type
    char x;    // a character
    char *y;   // a pointer to a character
• Using pointers:
    x = 'a';
    y = &x;    // y gets the address of x
    *y = 'b';  // the thing pointed at by y becomes 'b'
               // note that x is now also 'b'!
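Interlude: References and Pointers • Continuing the example…
    cout << x << endl;                        // prints b
    cout << *y << endl;                       // prints b
    cout << static_cast<void *>(y) << endl;   // prints a hex address
    cout << static_cast<void *>(&x) << endl;  // same as above
    cout << &y << endl;                       // a different address: where the pointer itself is stored
• Note: cout treats a bare char * as a C string, so the casts to void * are needed to print the address of a char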
GetToken(): A scanning function • Token *GetToken(istream &sin) • Read characters from sin until a complete token is extracted, then return (a pointer to) the token • Usually called by the parser • Note: the version in the book uses global variables and returns only the token type
Using GetToken
    Token *myToken = GetToken(cin);
    while (myToken != NULL) {
        // process the token
        switch (myToken->tokenval) {
            // cases for each token type
        }
        myToken = GetToken(cin);
    }
Tokens and Languages • The set of valid tokens of a particular type is a Language (in the formal sense) • More specifically, it is a Regular Language
Language Formalities • Language: set of strings • String: sequence of symbols • Alphabet: set of legal symbols for strings • Generally Σ is used to denote an alphabet
Example Languages • L1 = {aa, ab, bb}, Σ = {a, b} • L2 = {ε, ab, abab, …}, Σ = {a, b} • L3 = {strings of N a’s where N is an odd integer}, Σ = {a} • L4 = {ε} (one string with no symbols) • L5 = { } (no strings at all) • L5 = Ø
Denoting Languages • Expressions (regular languages only) • Grammars • Set of rewrite rules that express all and only the strings in the language • Automata • Machines that “accept” all and only the strings in the language
Primitive Regular Expressions • Ø • L(Ø) = { } (no strings) • ε • L(ε) = {ε} (one string, no symbols) • a, where a is a member of Σ • L(a) = {a} (one string, one symbol)
Combining Regular Expressions • Choice: r | s (sometimes r+s) • L(r | s) = L(r) ∪ L(s) • Concatenation: rs • L(rs) = L(r)L(s) • All combinations of one string from L(r) followed by one string from L(s) • Repetition: r* • L(r*) = {ε} ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ … • Zero or more strings from L(r), concatenated
Precedence • Repetition before concatenation • Concatenation before choice • Use parentheses to override • aa* vs. (aa)* • ab|c vs. a(b|c)
Example Languages • L1 = {aa, ab, bb}, Σ = {a, b} • L2 = {ε, ab, abab, …}, Σ = {a, b} • L3 = {strings of N a’s where N is an odd integer}, Σ = {a} • L4 = {ε} (one string with no symbols) • L5 = { } (no strings at all) • L5 = Ø
R.E.’s for Examples • L1 = aa | ab | bb • L1 = a(a|b) | bb • L1 = aa | (a|b) b • L2 = (ab)* not ab* ! • L3 = a(aa)*
What are these languages? • a* | b* | c* • a*b*c* • (a*b*)* • a(a|b)*c • (a|b|c)*bab(a|b|c)*
What are the RE’s? • In the alphabet {a,b,c}: • All strings that are in alphabetical order • All strings that have the first a before the first b, before the first c, e.g. ababbabca • All strings that contain “abc” • All strings that do not contain “abc”
Extended Reg. Exp’s • Additional operations for convenience • r+ = rr* (one or more repetitions) • . = any single character in the alphabet • .* = any possible string over the alphabet • [a-z] = a|b|c|…|z • [^aeiou] = any character in the alphabet except a, e, i, o, u: b|c|d|f|g|h|j...