Scanner. 許良全, Associate Professor, Computer Center, Chung Cheng Institute of Technology.

  1. Scanner. 許良全, Associate Professor, Computer Center, Chung Cheng Institute of Technology

  2. Overview of Scanning • The purpose of a scanner is to group input characters into tokens. • A scanner is sometimes called a lexical analyzer. • A precise definition of tokens is necessary to ensure that lexical rules are properly enforced. • Scanners normally seek to make a token as long as possible, e.g., ABC is scanned as one identifier rather than three. • All scanners perform much the same function. • The point of using a scanner generator is to limit the effort of building a scanner from scratch.
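The "as long as possible" rule can be illustrated with a minimal sketch. The function name `scan_longest_alnum` is hypothetical, and the sketch covers only runs of letters and digits rather than a full token set.

```python
def scan_longest_alnum(text, start=0):
    """Return the longest run of letters/digits beginning at `start`.

    Hypothetical helper: it keeps consuming characters while they can
    still extend the current token, so "ABC" comes back as one lexeme.
    """
    end = start
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[start:end]


print(scan_longest_alnum("ABC := 1"))  # prints ABC
```

A scanner that stopped at the first acceptable prefix would instead report A, B, and C as three separate identifiers.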

  3. Finite State Systems • The finite state automaton is a mathematical model of a system with discrete inputs and outputs.

  4. Examples of Finite State Systems • Elevators • do not remember all previous requests for service but only the current floor, the direction of motion, and the collection of not yet satisfied requests for service • Vending machines • insert enough coins and you’ll get a Pepsi eventually • Computers • the state of the CPU, main memory, and auxiliary storage at any time is one of a very large but finite number of states • Human brains • 2^35 cells or neurons at most

  5. Definition of Finite Automata • A finite automaton (FA) is an idealized computer, specified by a 5-tuple (Q, Σ, δ, q0, F), that recognizes strings belonging to regular sets. • A finite set of states, Q. • A finite input alphabet, Σ, or vocabulary, V. • A special start, or initial, state q0, with q0 ∈ Q. • A set of final, or accepting, states F, with F ⊆ Q. • A transition function, δ, that maps Q × Σ to Q.
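As a sketch of how the five components fit together, the following encodes one small automaton directly as Python data. The automaton itself (binary strings with an even number of 1s) and all the names are illustrative assumptions, not taken from the slides.

```python
# Hypothetical DFA: accepts binary strings containing an even number of 1s.
Q = {"even", "odd"}                      # finite set of states
SIGMA = {"0", "1"}                       # input alphabet
DELTA = {                                # transition function: (state, symbol) -> state
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
Q0 = "even"                              # start state, a member of Q
F = {"even"}                             # accepting states, a subset of Q


def accepts(s):
    """Run the automaton over s and report whether it halts in an accepting state."""
    state = Q0
    for ch in s:
        if ch not in SIGMA:
            return False                 # symbol outside the alphabet: reject
        state = DELTA[(state, ch)]
    return state in F


print(accepts("1001"))  # True
print(accepts("10"))    # False
```

The dictionary `DELTA` plays the role of a transition table: one row per state, one column per input symbol.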

  6. FA and Transition Diagrams

  7. FA and Transition Tables

  8. Regular Expressions • The languages accepted by finite automata are easily described by simple expressions called regular expressions. • Strings are built from characters in V via catenation • e.g., !=, for, while • An empty or null string, denoted by λ, is allowed. • The characters (, ), ', *, +, and | are called meta-characters. They must be quoted when used as ordinary characters in order to avoid ambiguity. E.g., Delim = ('('|')'|:=|;|,|'+'|-|'*'|/|=|$$$)

  9. Definition of Regular Expression • A regular expression denotes a set of strings: • ∅ is a regular expression denoting the empty set (the set containing no strings). • λ is a regular expression denoting the set that contains only the empty string. • Note that this set contains one element. • A string s is a regular expression denoting a set containing only s. If s contains meta-characters, s can be quoted to avoid ambiguity. • If A and B are regular expressions, then A|B, AB, and A* are also regular expressions, corresponding to alternation, catenation, and Kleene closure respectively.

  10. Properties of Regular Expressions • Let P and Q be sets of strings. • The string s ∈ (P|Q) iff s ∈ P or s ∈ Q. • The string s ∈ P* iff s can be broken into zero or more pieces: s = s1s2s3…sn such that each si ∈ P. • P+ denotes all strings consisting of one or more strings in P catenated together. • P* = (P+|λ) and P+ = PP* = P*P • If A is a set of characters, Not(A) denotes (V − A), i.e., all characters in V not included in A. • If k is a constant, the set A^k represents all strings formed by catenating k strings from A, i.e., A^k = (AAA…) (k copies)
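These identities can be checked on a small example. The sketch below models sets of strings as Python sets. Since P* is infinite, it only builds the strings made from at most n pieces, and the helper names (`cat`, `power`, `star_upto`, `plus_upto`) are hypothetical.

```python
def cat(P, Q):
    """Catenation of two sets of strings: every p followed by every q."""
    return {p + q for p in P for q in Q}


def power(A, k):
    """A^k: all strings formed by catenating k strings from A."""
    result = {""}                # A^0 contains only the empty string
    for _ in range(k):
        result = cat(result, A)
    return result


def star_upto(P, n):
    """The part of P* built from at most n pieces (P* itself is infinite)."""
    result = set()
    for k in range(n + 1):
        result |= power(P, k)
    return result


def plus_upto(P, n):
    """The part of P+ built from 1..n pieces, computed as P catenated with P*."""
    return cat(P, star_upto(P, n - 1))


P = {"a", "bc"}
print(sorted(power(P, 2)))                                  # ['aa', 'abc', 'bca', 'bcbc']
print(star_upto(P, 3) == plus_upto(P, 3) | {""})            # True: P* = (P+ | λ)
print(cat(P, star_upto(P, 2)) == cat(star_upto(P, 2), P))   # True: PP* = P*P
```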

  11. Examples of Regular Expressions • Let D = (0|…|9), L = (A|…|Z) • A comment that begins with -- and ends with Eol • Comment = --Not(Eol)*Eol • A fixed decimal literal • Lit = D+.D+ • An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores • ID = L(L|D)*(_(L|D)+)*
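For experimentation, the three definitions can be transcribed into Python's `re` syntax. This is a rough translation under stated assumptions: Eol is taken to be "\n", L is upper-case A to Z as on the slide, and the names `COMMENT`, `LIT`, `IDENT`, and `is_match` are illustrative.

```python
import re

D = "[0-9]"                                   # D = (0|...|9)
L = "[A-Z]"                                   # L = (A|...|Z)

COMMENT = r"--[^\n]*\n"                       # --Not(Eol)*Eol
LIT = rf"{D}+\.{D}+"                          # D+.D+ (the dot is quoted)
IDENT = rf"{L}({L}|{D})*(_({L}|{D})+)*"       # L(L|D)*(_(L|D)+)*


def is_match(pattern, text):
    """True if the entire text belongs to the set denoted by pattern."""
    return re.fullmatch(pattern, text) is not None


print(is_match(IDENT, "A_B1"))   # True
print(is_match(IDENT, "A__B"))   # False: consecutive underscores
print(is_match(IDENT, "A_"))     # False: must end with a letter or digit
print(is_match(LIT, "3.14"))     # True
print(is_match(LIT, "3."))       # False: digits are required after the point
```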

  12. Using a Scanner Generator: Lex • Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Labs, written in C and running under UNIX. • Lex produces an entire scanner module that can be compiled and linked with other compiler modules. • Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed. • A typical Lex program contains three sections separated by %% delimiters.

  13. First Section of Lex • The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67) • [] delimits character classes. • - denotes ranges: [xyz] == [x-z] • \ denotes the escape character, as in C. • ^ complements a character class (Not): • [^xy] denotes all characters except x and y. • |, *, and + (alternation, Kleene closure, and positive closure) are provided. • () can be used to control grouping of subexpressions. • (expr)? == (expr)|λ, i.e., it matches expr zero times or once. • {} signals the macro expansion of a symbol defined in the first section.

  14. First Section of Lex, cont. • Catenation is specified by the juxtaposition of two expressions; no explicit operator is used. • [ab][cd] will match any of ad, ac, bc, and bd. • begin == "begin" == [b][e][g][i][n]
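Most of this notation can be tried out in Python's `re` module, which shares the character-class, grouping, and closure operators. The sketch below is an analogy rather than Lex itself. Python has no `{name}` macro expansion or double-quoted literals, so string composition and `re.escape` stand in for them here, and the helper `matches` is hypothetical.

```python
import re


def matches(pattern, text):
    """True if the whole of `text` matches `pattern` (Python `re` syntax)."""
    return re.fullmatch(pattern, text) is not None


# [ab][cd]: catenation of two character classes, no explicit operator.
pairs = [a + c for a in "abcd" for c in "abcd" if matches("[ab][cd]", a + c)]
print(pairs)                              # ['ac', 'ad', 'bc', 'bd']

# [xyz] and [x-z] denote the same class; [^xy] is its complement form.
same_class = all(matches("[xyz]", ch) == matches("[x-z]", ch) for ch in "wxyz{")
print(same_class)                         # True
print(matches("[^xy]", "z"))              # True
print(matches("[^xy]", "x"))              # False

# (expr)? matches expr zero times or once.
print(matches("(ab)?", ""), matches("(ab)?", "ab"), matches("(ab)?", "abab"))
# True True False

# Stand-in for a first-section definition and its {DIGIT} expansion.
DIGIT = "[0-9]"
NUMBER = f"{DIGIT}+"
print(matches(NUMBER, "2024"))            # True

# begin, "begin", and [b][e][g][i][n] all denote the same single string.
forms = ["begin", re.escape("begin"), "[b][e][g][i][n]"]
print(all(matches(f, "begin") for f in forms))   # True
```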

  15. Second Section of Lex • The second section of Lex defines a table of regular expressions and corresponding commands. • When an expression is matched, its associated command is executed. • Auxiliary functions may be defined in the third section. • Input that is matched is stored in the string variable yytext, whose length is yyleng. • Lex creates an integer function yylex() that may be called from the parser. • The value returned is usually the token code of the token scanned by Lex. • When yylex() encounters end of file, it calls a user-supplied integer function named yywrap() to wrap up input processing.
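The table-of-rules idea can be imitated in a few lines of Python. This is a sketch of the concept, not of the code Lex generates: the token codes, the `RULES` table, and the generator `tokens` are all assumptions made for illustration. Like Lex, it prefers the longest match and breaks ties in favour of the earlier rule.

```python
import re

# Hypothetical token codes returned to the "parser".
ID, NUM, ASSIGN, PLUS = 1, 2, 3, 4

# Each rule pairs a regular expression with an action; an action returning
# None means "discard this lexeme and keep scanning".
RULES = [
    (re.compile(r"[ \t\n]+"), lambda yytext: None),
    (re.compile(r"[A-Za-z][A-Za-z0-9]*"), lambda yytext: ID),
    (re.compile(r"[0-9]+"), lambda yytext: NUM),
    (re.compile(r":="), lambda yytext: ASSIGN),
    (re.compile(r"\+"), lambda yytext: PLUS),
]


def tokens(source):
    """Yield (token_code, yytext) pairs, roughly one per successful yylex() call."""
    pos = 0
    while pos < len(source):
        best = None
        for pattern, action in RULES:
            m = pattern.match(source, pos)
            # Strictly longer matches win; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[0].end()):
                best = (m, action)
        if best is None:
            raise ValueError(f"illegal character {source[pos]!r} at offset {pos}")
        m, action = best
        yytext = m.group()               # the matched input; len(yytext) plays yyleng
        pos = m.end()
        code = action(yytext)
        if code is not None:
            yield code, yytext


print(list(tokens("x1 := y + 42")))
# [(1, 'x1'), (3, ':='), (1, 'y'), (4, '+'), (2, '42')]
```

Running off the end of `source` ends the generator, which is roughly the point at which a real Lex scanner would call yywrap().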

  16. Dealing with Multiple Input Files • yylex() uses three user-defined functions to handle character I/O: • input(): retrieve a single character, 0 on EOF • output(c): write a single character to the output • unput(c): put a single character back on the input to be re-read
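The contract of these three routines can be sketched as a small Python class. The class `CharSource` and its internals are hypothetical; only the method names and the "0 on EOF" convention come from the slide.

```python
import sys


class CharSource:
    """Hypothetical character source mirroring input(), output(c), and unput(c)."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._pushed = []                # characters pushed back, most recent last

    def input(self):
        """Return the next character, or 0 at end of input."""
        if self._pushed:
            return self._pushed.pop()    # pushed-back characters are re-read first
        if self._pos >= len(self._text):
            return 0
        ch = self._text[self._pos]
        self._pos += 1
        return ch

    def unput(self, ch):
        """Put a character back so that the next input() returns it."""
        self._pushed.append(ch)

    def output(self, ch):
        """Write a single character to standard output."""
        sys.stdout.write(ch)


src = CharSource("ab")
first = src.input()      # 'a'
src.unput(first)         # look-ahead not needed: push it back
print(src.input(), src.input(), src.input())   # a b 0
```

Swapping in a different class with the same three methods is one way to picture how a scanner could be redirected to another input file without changing the matching logic.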

  17. Translating Regular Expressions into Finite Automata • Remember the relationship between REs and FAs. • The main job of a scanner generator is to transform a regular expression definition into an equivalent (D)FA. • A regular expression is first translated into a nondeterministic finite automaton (NFA), and the NFA is then translated into a DFA (two steps). • An NFA, when reading a particular input, is not required to make a unique (deterministic) choice of which state to visit.

  18. Translating RE into NFA • Any regular expression can be transformed into an NFA with the following properties: • There is a unique final state. • The final state has no successors. • Every other state has either one or two successors. • Regular expressions are built out of the atomic regular expressions a (where a is a character in V) and λ by using the three operations AB, A|B, and A*.

  19. NFA for a and λ

  20. An NFA for A|B

  21. An NFA for AB

  22. An NFA for A*
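Since the diagrams are not reproduced here, the following sketch shows one way to build such NFAs. It uses a Thompson-style construction, which may differ in detail from the figures on the slides, and the class and method names are hypothetical. A fragment is a (start, final) pair, and a label of None stands for a λ transition.

```python
import itertools


class NFA:
    """Builds NFA fragments; each fragment is a (start, final) pair of states."""

    def __init__(self):
        self.edges = {}                  # state -> list of (label, target); None = λ
        self._ids = itertools.count()

    def _new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def atom(self, label):
        """NFA for a single character a, or for λ when label is None."""
        start, final = self._new_state(), self._new_state()
        self.edges[start].append((label, final))
        return start, final

    def cat(self, a, b):
        """NFA for AB: link A's final state to B's start with a λ edge."""
        self.edges[a[1]].append((None, b[0]))
        return a[0], b[1]

    def alt(self, a, b):
        """NFA for A|B: a new start branches to both; both join a new final."""
        start, final = self._new_state(), self._new_state()
        self.edges[start] += [(None, a[0]), (None, b[0])]
        self.edges[a[1]].append((None, final))
        self.edges[b[1]].append((None, final))
        return start, final

    def star(self, a):
        """NFA for A*: A may be skipped entirely or repeated."""
        start, final = self._new_state(), self._new_state()
        self.edges[start] += [(None, a[0]), (None, final)]
        self.edges[a[1]] += [(None, a[0]), (None, final)]
        return start, final

    def _closure(self, states):
        """All states reachable from `states` using λ edges only."""
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for label, target in self.edges[s]:
                if label is None and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def accepts(self, frag, text):
        """Simulate the fragment by tracking every state it could be in."""
        start, final = frag
        current = self._closure({start})
        for ch in text:
            moved = {t for s in current for (lab, t) in self.edges[s] if lab == ch}
            current = self._closure(moved)
        return final in current


n = NFA()
frag = n.cat(n.star(n.alt(n.atom("a"), n.atom("b"))), n.atom("c"))   # (a|b)*c
print(n.accepts(frag, "abbac"))   # True
print(n.accepts(frag, "ab"))      # False
```

In this sketch the final state of every fragment has no successors and every other state has one or two, which matches the properties listed for the construction.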

  23. Translating NFA into DFA • Each state of the DFA (M) corresponds to a set of states of the NFA (N). • Transforming N into M is done by subset construction. • M will be in state {x,y,z} after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses. • M keeps track of all the possible routes N might take and runs them in parallel.
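A minimal sketch of the subset construction follows, assuming the NFA is given as a dictionary from (state, symbol) to a set of successor states, with the symbol None standing for λ. The example NFA (strings over {a, b} that end in ab) and all function names are illustrative assumptions.

```python
def closure(states, delta):
    """All NFA states reachable from `states` through λ (None) transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction(start, finals, delta, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    d_start = closure({start}, delta)
    d_delta, d_finals = {}, set()
    worklist, seen = [d_start], {d_start}
    while worklist:
        S = worklist.pop()
        if S & finals:                   # S contains at least one NFA final state
            d_finals.add(S)
        for a in alphabet:
            moved = {t for s in S for t in delta.get((s, a), ())}
            T = closure(moved, delta)
            if not T:
                continue                 # no move possible: leave the entry undefined
            d_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return d_start, d_finals, d_delta


def dfa_accepts(dfa, text):
    """Run the constructed DFA; a missing transition means rejection."""
    state, finals, delta = dfa
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals


# Hypothetical NFA for strings over {a, b} ending in "ab".
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa = subset_construction(0, {2}, nfa_delta, "ab")

print(dfa_accepts(dfa, "aab"))   # True
print(dfa_accepts(dfa, "abb"))   # False
```

For this NFA the construction produces three DFA states, {0}, {0,1}, and {0,2}, each recording the NFA states that could be active after the input read so far.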

