Chap. 3, Theory and Practice of Scanning By J. H. Wang Mar. 8, 2011
Outline • Overview of a Scanner • Regular Expressions • Examples • Finite Automata and Scanners • The Lex Scanner Generator • Other Scanner Generators • Practical Considerations of Building Scanners • Regular Expressions and Finite Automata • Summary
Overview of a Scanner • Interactions between the scanner and the parser • The parser calls getNextToken; the scanner reads the source program and returns the next token • Both components consult the symbol table • The parser's output goes on to semantic analysis
Overview of a Scanner • Lexical analyzer, or lexer • Token structure can be more detailed and subtle than one might expect • String constants: "" • Escape sequences: \", \n, … • Null string • Rational constants • 0.1, 10.01 • .1, 10. vs. 1..10 • Possible to examine a language for design flaws • A scanner generator avoids reimplementing common components
Programming a scanner generator: declarative programming • What to scan, not how to scan • E.g. database query languages, Prolog, … • Performance of scanners matters for production compilers, for example: • 30,000 lines per minute (500 lines per second) • 10,000 characters per second (for an average line of 20 characters) • On a processor that executes 10,000,000 instructions per second, this budget allows 1,000 instructions per input character • Considering the compiler's other tasks, 250 instructions per character is more realistic
Regular Expressions • Convenient way to specify various simple sets of strings • Search patterns in the Unix utility grep • Context search in most editors
Regular set: a set of strings defined by a regular expression • Lexeme: an instance of a token class • E.g.: identifier • Vocabulary (Σ): a finite character set • ASCII, Unicode • Empty or null string (λ) • Meta-characters: ( ) ' * + | • E.g.: ('('|')'|;|,)
Operations • Catenation: joining individual characters to form a string • sλ ≡ λs ≡ s • If s1 ∈ P and s2 ∈ Q, then s1s2 ∈ (P Q) • Alternation (|): to separate alternatives • E.g. D = (0|1|2|3|4|5|6|7|8|9) • The string s ∈ (P|Q) iff s ∈ P or s ∈ Q • e.g. (LC|UC) • Kleene closure (*): postfix Kleene closure operator • P*: the catenation of zero or more selections from P • s ∈ P* iff s = s1s2…sn such that si ∈ P (1 ≤ i ≤ n)
Regular expressions can be defined as follows: • ∅ is a regular expression denoting the empty set • λ is a regular expression for the set that contains only the empty string • s is a regular expression denoting {s} • If A and B are regular expressions, then A|B, AB, and A* are also regular expressions
Additional operations • P+: positive closure • P* = (P+ | λ), P+ = P* P • E.g.: (0|1)+ • Not(A): all characters in Σ not included in A (Σ - A) • E.g. Not(Eol) • Not(S): (Σ* - S) if S is a set of strings • A^k: all strings formed by catenating k strings from A • E.g. (0|1)^32
Examples • D: the set of the ten single digits • L: the set of all upper- and lower-case letters • Java or C++ single-line comment • Comment = // (Not(Eol))* Eol • Fixed-decimal literal • Lit = D+.D+ • Optionally signed integer literal • IntLiteral = ('+' | '-' | λ) D+ • Comments delimited by ## markers, allowing single #'s within the comment • Comment2 = ## ((# | λ) Not(#))* ##
All finite sets are regular • Some infinite sets are regular, but not all • E.g.: {[m ]m | m ≥ 1}, the set of m opening brackets followed by m closing brackets, is not regular • All regular sets can be defined by CFGs • Regular expressions are quite adequate for specifying token-level syntax • For every regular expression we can create an efficient device (a finite automaton) that recognizes exactly those strings matching the regular expression's pattern
Finite Automata and Scanners • A finite automaton (FA) can recognize the tokens specified by a regular expression • A finite set of states • A finite vocabulary • A set of transitions (or moves) from one state to another • A start state • A subset of the states called the accepting (or final) states • E.g. Fig. 3.1 - (abc+)+
Deterministic Finite Automata • DFA: an FA in which each state has at most one transition per input character • Transition table T: a two-dimensional array indexed by a DFA state s and a vocabulary symbol c • T[s,c] • E.g.: Fig. 3.2 - // (Not(Eol))* Eol
Full transition table contains one column for each character • To save space, table compression is utilized where only nonerror entries are explicitly represented (using hashing or linked structures) • Any regular expression can be translated into a DFA that accepts the set of strings denoted by the regular expression
Coding the DFA • A DFA can be coded in one of two forms • Table-driven • Transition table is explicitly represented in a runtime table that is “interpreted” by a driver program • Token independent • E.g. Fig. 3.3 • Explicit control • Transition table appears implicitly as the control logic of the program • Easy to read, more efficient, but specific to a single token definition • E.g. Fig. 3.4
Two more examples of regular expressions • Fortran-like real literal • RealLit=(D+(λ|.)) | (D*.D+) • Fig. 3.5(a) • Identifier • ID=L(L|D)*(_(L|D)+)* • Fig. 3.5(b)
Transducers • An FA that analyzes or transforms its input beyond simply accepting tokens • E.g. identifier processing in symbol table • An action table can be formulated that parallels the transition table
The Lex Scanner Generator • Lex • Developed by M.E. Lesk and E. Schmidt, AT&T Bell Labs • Flex: a free reimplementation that produces faster and more reliable scanners • JFlex: for Java • (Fig. 3.6)
Steps • Scanner specification • Lex generates a scanner in C • The scanner is compiled and linked with other compiler components
Defining Tokens in Lex • Lex allows the user to associate regular expressions with commands coded in C (or C++) • Lex creates a file lex.yy.c that contains an integer function yylex() • It’s normally called from the parser when a token is needed • It returns the token code of the token scanned by Lex
It's important that the token codes returned are identical to those expected by the parser • The definitions of token codes are shared through the file y.tab.h
The Character Class • A set of characters treated identically in a token definition • identifier, number • Delimited by [ ] • \, ^, ], - must be escaped • [\])] • Range: - • [x-z], [0-9], [a-zA-Z] • Escape character: \ • \t, \n, \\, \010 • Complement: ^ (Not() operation) • [^xy], [^0-9], [^]
Using Regular Expressions to Define Tokens • Catenation: juxtaposition of two expressions • [ab][cd] • Alternation: | • Case is significant • (w|W)(h|H)(i|I)(l|L)(e|E) • Kleene closure * and positive closure + • Optional inclusion: ? (zero times or once) • expr?, expr|λ • . (any single character other than a newline) • ^ (beginning of a line), $ (end of line) • ^A.*e$
Three sections • First section • Symbolic names associated with character classes and regular expressions • Source code: %{ … %} • Variable, procedure, and type declarations • E.g. %{ #include "tokens.h" %}
Second section: a table of regular expressions and corresponding commands • Input that is matched is stored in a global string variable yytext (whose length is yyleng) • The default size of yytext is determined by YYLMAX (default: 200) • May need to redefine YYLMAX to avoid overflow • The content of yytext is overwritten as each new token is scanned • It's safer to copy the contents of yytext (using strcpy()) before the next call to yylex() • When expressions overlap • The longest possible match is preferred • Between matches of equal length, the earlier expression is preferred
Character Processing Using Lex • A general-purpose character processing tool • Definitions of subroutines may be placed in the final section • E.g. {Non_f_i_p} { insert(yytext); return(ID); } • insert() could also be placed in a separate file • End-of-file is not handled by regular expressions • A predefined token EndFile, with token code zero, is automatically returned by yylex() • yylex() uses input(), output(), unput() • When end-of-file is encountered, yylex() calls yywrap() • yywrap() returns 1 if there's no more input
The longest possible match can sometimes be a problem • E.g. 1..10 vs. 1. and .10 • Lex allows us to define a regular expression that applies only if some other expression immediately follows it • r/s: match r only if s immediately follows it • s: the right-context • E.g. [0-9]+/".." • Symbols may have different meanings in a regular expression and in a character class • Fig. 3.13
Summary of Lex • Lex is a very flexible generator • Difficult part: learning its notation and rules • Lex’s notation for representing regular expressions is used in other programs • E.g. grep utility • Lex can also transform input as a preprocessor • Code segments must be written in C • Not language-independent
Creating a Lexical Analyzer with Lex • Lex source program lex.l → Lex compiler → lex.yy.c • lex.yy.c → C compiler → a.out • Input stream → a.out → sequence of tokens
Another Example • Patterns for tokens in the grammar • digit → [0-9] • digits → digit+ • number → digits (. digits)? (E [+-]? digits)? • letter → [A-Za-z] • id → letter (letter | digit)* • if → if • then → then • else → else • relop → < | > | <= | >= | = | <> • ws → (blank | tab | newline)+
Example Lex Program
%{
  /* definitions of manifest constants
     LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = (int) installID(); return(ID); }
{number}  { yylval = (int) installNum(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }
"<="      { yylval = LE; return(RELOP); }
"="       { yylval = EQ; return(RELOP); }
"<>"      { yylval = NE; return(RELOP); }
">"       { yylval = GT; return(RELOP); }
">="      { yylval = GE; return(RELOP); }
%%
int installID() {}
int installNum() {}
Other Scanner Generators • Flex: free • It produces scanners that are faster than the ones produced by Lex • Options that allow tuning of scanner size vs. speed • JFlex: in Java • GLA: Generator for Lexical Analyzers • It produces a directly executable scanner in C • It's typically twice as fast as Flex, and competitive with the best hand-written scanners • re2c • It produces directly executable scanners • Alex, Lexgen, … • Others are parts of complete suites of compiler development tools • DLG: part of the PCCTS suite • Coco/R • Rex: part of the Karlsruhe/CocoLab cocktail toolbox
Practical Considerations of Building Scanners • Finite automata sometimes fall short • Efficiency concerns • Error handling
Processing Identifiers and Literals • Identifiers can be used in many contexts • The scanner cannot know when to enter an identifier into the symbol table for the current scope or when to return a pointer to an instance from an earlier scope • String space: an extendable block of memory used to store the text of identifiers • It avoids frequent calls to new or malloc, and the space overhead of storing multiple copies of the same string • Hash table: to assign a unique serial number to each identifier
Literals require processing before they are returned • Numeric conversion can be tricky: overflow or roundoff errors • Standard library routines: atoi(), atof() • Ex. (in C): a(*b) • A call to procedure a with argument *b • Or a declaration of an identifier b that is a pointer variable (if a has been declared in a typedef) • Solution: create a table of currently visible identifiers and return a special token typeid for typedef declarations
Processing Reserved Words • Keywords: if, while, … • Most programming languages choose to make keywords reserved • To simplify parsing • To make programs more readable • Ex. (in Pascal and Ada) • begin begin; end; end; begin; end