
Chap. 3, Theory and Practice of Scanning




  1. Chap. 3, Theory and Practice of Scanning By J. H. Wang Mar. 8, 2011

  2. Outline • Overview of a Scanner • Regular Expressions • Examples • Finite Automata and Scanners • The Lex Scanner Generator • Other Scanner Generators • Practical Considerations of Building Scanners • Regular Expressions and Finite Automata • Summary

  3. Overview of a Scanner • Interactions between the scanner and the parser (figure): the parser calls getNextToken, the scanner reads the source program and returns a token; both consult the symbol table, and the parser's output goes on to semantic analysis

  4. Overview of a Scanner • Lexical analyzer, or lexer • Token structure can be more detailed and subtle than one might expect • String constants: "" • Escape sequences: \", \n, … • Null string • Rational constants • 0.1, 10.01 • .1, 10. vs. 1..10 • Possible to examine a language for design flaws • Scanner generator avoids reimplementing common components

  5. Programming a scanner generator: declarative programming • You specify what to scan, not how to scan it • E.g. database query languages, Prolog, … • Performance of scanners is important for production compilers, for example: • 30,000 lines per minute (500 lines per second) • 10,000 characters per second (for an average line of 20 characters) • For a processor that executes 10,000,000 instructions per second, this allows 1,000 instructions per input character • Considering the other tasks in a compiler, 250 instructions per character is more realistic

  6. Regular Expressions • Convenient way to specify various simple sets of strings • Search patterns in the Unix utility grep • Context search in most editors

  7. Regular set: a set of strings defined by regular expressions • Lexeme: an instance of a token class • E.g.: identifier • Vocabulary (Σ): a finite character set • ASCII, Unicode • Empty or null string (λ) • Meta-characters: ( ) ' * + | • E.g.: ('('|')'|;|,)

  8. Operations • Catenation: joining individual characters to form a string • sλ ≡ λs ≡ s • If s1 ∈ P and s2 ∈ Q, then s1s2 ∈ (P Q) • Alternation (|): to separate alternatives • E.g. D = (0|1|2|3|4|5|6|7|8|9) • The string s ∈ (P|Q) iff s ∈ P or s ∈ Q • e.g. (LC|UC) • Kleene closure (*): postfix Kleene closure operator • P*: the catenation of zero or more selections from P • s ∈ P* iff s = s1s2…sn such that si ∈ P (1 ≤ i ≤ n)

  9. Regular expressions can be defined as follows: • ∅ is a regular expression denoting the empty set • λ is a regular expression for the set that contains only the empty string • s is a regular expression denoting {s} • If A and B are regular expressions, then A|B, AB, and A* are also regular expressions

  10. Additional operations • P+: positive closure • P* = (P+|λ), P+ = P*P • E.g.: (0|1)+ • Not(A): all characters in Σ not included in A (Σ - A) • E.g. Not(Eol) • Not(S): (Σ* - S) if S is a set of strings • A^k: all strings formed by catenating k strings from A • E.g. (0|1)^32

  11. Examples • D: the set of the ten single digits • L: the set of all upper- and lower-case letters • Java or C++ single-line comment • Comment = // (Not(Eol))* Eol • Fixed-decimal literal • Lit = D+.D+ • Optionally signed integer literal • IntLiteral = ('+'|'-'|λ) D+ • Comments delimited by ## markers, allowing single #'s within the comment • Comment2 = ## ((#|λ) Not(#))* ##
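Patterns like Lit above can be checked mechanically by translating them into POSIX extended-regex syntax. A minimal sketch in C; the pattern string "^[0-9]+\\.[0-9]+$" is our translation of Lit = D+.D+, not notation from the slides.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 iff the whole string s matches the POSIX extended regex. */
int matches(const char *pattern, const char *s) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                       /* bad pattern: treat as no match */
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, matches("^[0-9]+\\.[0-9]+$", "10.01") accepts, while ".1" and "10." are rejected, exactly as the fixed-decimal definition requires digits on both sides of the point.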

  12. All finite sets are regular • Some infinite sets are regular, but not all • E.g.: {[^m]^m | m ≥ 1} (m '['s followed by m ']'s) is not regular • All regular sets can be defined by CFGs • Regular expressions are quite adequate for specifying token-level syntax • For every regular expression we can create an efficient device (finite automaton) that recognizes exactly those strings matching the regular expression's pattern

  13. Finite Automata and Scanners • A finite automaton (FA) can recognize the tokens specified by a regular expression • A finite set of states • A finite vocabulary • A set of transitions (or moves) from one state to another • A start state • A subset of the states called the accepting (or final) states • E.g. Fig. 3.1 - (abc+)+

  14. Deterministic Finite Automata • DFA: an FA that always has a unique transition • Transition table T: two-dimensional array indexed by a DFA state s and a vocabulary symbol c • T[s,c] • E.g.: Fig. 3.2 - // (Not(Eol))* Eol

  15. Full transition table contains one column for each character • To save space, table compression is utilized where only nonerror entries are explicitly represented (using hashing or linked structures) • Any regular expression can be translated into a DFA that accepts the set of strings denoted by the regular expression

  16. Coding the DFA • A DFA can be coded in one of two forms • Table-driven • Transition table is explicitly represented in a runtime table that is "interpreted" by a driver program • Token independent • E.g. Fig. 3.3 • Explicit control • Transition table appears implicitly as the control logic of the program • Easy to read, more efficient, but specific to a single token definition • E.g. Fig. 3.4
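As a concrete illustration of the table-driven form, here is a minimal sketch in C of a DFA for the comment pattern // (Not(Eol))* Eol. This is not the book's Fig. 3.3 code; the state numbering and the char_class helper are our own.

```c
/* Character classes for the comment DFA: '/', end-of-line, anything else. */
enum { C_SLASH, C_EOL, C_OTHER, NUM_CLASSES };

static int char_class(int c) {
    if (c == '/')  return C_SLASH;
    if (c == '\n') return C_EOL;
    return C_OTHER;
}

/* Transition table T[state][class]; -1 is the error entry.
   State 0 = start, 1 = saw one '/', 2 = inside comment, 3 = accept. */
static const int T[4][NUM_CLASSES] = {
    {  1, -1, -1 },   /* 0: expect first '/' */
    {  2, -1, -1 },   /* 1: expect second '/' */
    {  2,  3,  2 },   /* 2: Not(Eol)* until Eol */
    { -1, -1, -1 }    /* 3: accepting, no moves out */
};

/* Driver: returns 1 iff the whole string matches // (Not(Eol))* Eol. */
int is_comment(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        state = T[state][char_class((unsigned char)*s)];
        if (state < 0) return 0;
    }
    return state == 3;
}
```

The driver loop never changes; only T and the accepting-state test depend on the token definition, which is why the table-driven form is token independent.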

  17. (figure slide)

  18. Two more examples of regular expressions • Fortran-like real literal • RealLit=(D+(λ|.)) | (D*.D+) • Fig. 3.5(a) • Identifier • ID=L(L|D)*(_(L|D)+)* • Fig. 3.5(b)
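The identifier pattern also lends itself to the explicit-control style, where the pattern's structure becomes the program's control flow. A hedged C sketch (ours, not the automaton of Fig. 3.5) of ID = L(L|D)*(_(L|D)+)*:

```c
#include <ctype.h>

/* Explicit-control recognizer for ID = L (L|D)* ( _ (L|D)+ )* :
   a letter, then letters/digits, with embedded single underscores
   that must each be followed by at least one letter or digit. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;   /* must start with L */
    s++;
    while (isalnum((unsigned char)*s)) s++;      /* (L|D)* */
    while (*s == '_') {                          /* each _ needs (L|D)+ */
        s++;
        if (!isalnum((unsigned char)*s)) return 0;
        while (isalnum((unsigned char)*s)) s++;
    }
    return *s == '\0';
}
```

Note how the loop structure mirrors the regular expression directly: this readability is the appeal of explicit control, at the cost of being specific to this one token definition.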

  19. Transducers • An FA that analyzes or transforms its input beyond simply accepting tokens • E.g. identifier processing in symbol table • An action table can be formulated that parallels the transition table

  20. The Lex Scanner Generator • Lex • Developed by M. E. Lesk and E. Schmidt at AT&T Bell Labs • Flex: a free reimplementation that produces faster and more reliable scanners • JFlex: a version for Java • (Fig. 3.6)

  21. The Operation of the Lex Scanner Generator

  22. Steps • Scanner specification • Lex generates a scanner in C • The scanner is compiled and linked with other compiler components

  23. Defining Tokens in Lex • Lex allows the user to associate regular expressions with commands coded in C (or C++) • Lex creates a file lex.yy.c that contains an integer function yylex() • It’s normally called from the parser when a token is needed • It returns the token code of the token scanned by Lex

  24. It's important that the token codes returned are identical to those expected by the parser • Token code definitions can be shared through the file y.tab.h

  25. The Character Class • A set of characters treated identically in a token definition • identifier, number • Delimited by [ ] • \, ^, ], - must be escaped • [\])] • Range: - • [x-z], [0-9], [a-zA-Z] • Escape character: \ • \t, \n, \\, \010 • Complement: ^ (Not() operation) • [^xy], [^0-9], [^]

  26. Using Regular Expressions to Define Tokens • Catenation: juxtaposition of two expressions • [ab][cd] • Alternation: | • Case is significant • (w|W)(h|H)(i|I)(l|L)(e|E) • Kleene closure * and positive closure + • Optional inclusion: ? (zero times or once) • expr?, equivalent to expr|λ • . (any single character other than a newline) • ^ (beginning of a line), $ (end of line) • ^A.*e$

  27. Three sections • First section • symbolic names associated with character classes and regular expressions • Source code: %{ … %} • Variable, procedure, and type declarations • E.g. %{ #include "tokens.h" %}

  28. Second section: table of regular expressions and corresponding commands • Input that is matched is stored in a global string variable yytext (whose length is yyleng) • The default size of yytext is determined by YYLMAX (default: 200) • May need to redefine YYLMAX to avoid overflow • Content of yytext is overwritten as each new token is scanned • It's safer to copy the contents of yytext (using strcpy()) before the next call to yylex() • When two patterns overlap • The longest possible match is preferred • If matches are of equal length, the earlier expression is preferred

  29. Character Processing Using Lex • A general-purpose character processing tool • Definitions of subroutines may be placed in the final section • E.g. {Non_f_i_p} { insert(yytext); return(ID); } • insert() could also be placed in a separate file • End-of-file is not handled by regular expressions • A predefined token EndFile, with a token code of zero, is automatically returned by yylex() • yylex() uses input(), output(), unput() • When end-of-file is encountered, yylex() calls yywrap() • yywrap() returns 1 if there is no more input

  30. The longest possible match can sometimes be a problem • E.g. 1..10 vs. 1. and .10 • Lex allows us to define a regular expression that applies only if some other expression immediately follows it • r/s: match r only if s immediately follows it • s: the right-context • E.g. [0-9]+/".." • Symbols might have different meanings in a regular expression and in a character class • Fig. 3.13
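A hypothetical flex fragment showing how trailing context resolves the 1..10 case; the token names INT_LIT and REAL_LIT are our own, not from the slides.

```lex
[0-9]+/".."       { return INT_LIT; }   /* "1" in "1..10": integer, since ".." follows */
[0-9]+"."[0-9]+   { return REAL_LIT; }  /* "1.5": a real literal */
```

Without the first rule, the longest-match policy would scan "1." as a real literal and leave ".10" behind.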

  31. Summary of Lex • Lex is a very flexible generator • Difficult part: learning its notation and rules • Lex’s notation for representing regular expressions is used in other programs • E.g. grep utility • Lex can also transform input as a preprocessor • Code segments must be written in C • Not language-independent

  32. Creating a Lexical Analyzer with Lex (figure): the Lex source program lex.l goes through the Lex compiler to produce lex.yy.c; lex.yy.c goes through the C compiler to produce a.out; the input stream goes through a.out to produce the sequence of tokens

  33. Another Example • Patterns for tokens in the grammar:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
ws → (blank | tab | newline)+

  34. Example Lex Program
%{
 /* definitions of manifest constants
    LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     {}
if       { return(IF); }
then     { return(THEN); }
else     { return(ELSE); }

  35.
{id}     { yylval = (int) installID(); return(ID); }
{number} { yylval = (int) installNum(); return(NUMBER); }
"<"      { yylval = LT; return(RELOP); }
"<="     { yylval = LE; return(RELOP); }
"="      { yylval = EQ; return(RELOP); }
"<>"     { yylval = NE; return(RELOP); }
">"      { yylval = GT; return(RELOP); }
">="     { yylval = GE; return(RELOP); }
%%
int installID() {}
int installNum() {}

  36. Other Scanner Generators • Flex: free • It produces scanners that are faster than the ones produced by Lex • Options allow tuning of scanner size vs. speed • JFlex: in Java • GLA: Generator for Lexical Analyzers • It produces a directly executable scanner in C • It's typically twice as fast as Flex, and competitive with the best hand-written scanners • re2c • It produces directly executable scanners • Alex, Lexgen, … • Others are parts of complete suites of compiler development tools • DLG: part of the PCCTS suite • Coco/R • Rex: part of the Karlsruhe/CocoLab cocktail toolbox

  37. Practical Considerations of Building Scanners • Finite automata sometimes fall short • Efficiency concerns • Error handling

  38. Processing Identifiers and Literals • Identifiers can be used in many contexts • The scanner cannot know whether to enter an identifier into the symbol table for the current scope or to return a pointer to an instance from an earlier scope • String space: an extendable block of memory used to store the text of identifiers • It avoids frequent calls to new or malloc, and the space overhead of storing multiple copies of the same string • Hash table: to assign a unique serial number to each identifier
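A minimal sketch of the string-space-plus-hash-table idea in C; the sizes, the linear-probing scheme, and all names here are our assumptions, not the book's code.

```c
#include <string.h>

#define SPACE_SIZE 4096     /* string space: one shared text buffer */
#define TABLE_SIZE 256      /* open-addressed hash table of offsets */

static char space[SPACE_SIZE];
static int  space_used = 0;
static int  table[TABLE_SIZE];    /* offset into space, -1 = empty slot */
static int  serials[TABLE_SIZE];
static int  next_serial = 0;
static int  table_ready = 0;

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Returns a unique serial number per distinct spelling; the text is
   stored once in the string space, avoiding per-identifier malloc. */
int intern(const char *name) {
    if (!table_ready) {
        for (int k = 0; k < TABLE_SIZE; k++) table[k] = -1;
        table_ready = 1;
    }
    unsigned i = hash(name);
    while (table[i] != -1) {
        if (strcmp(space + table[i], name) == 0)
            return serials[i];        /* seen before: reuse stored copy */
        i = (i + 1) % TABLE_SIZE;     /* linear probing */
    }
    table[i] = space_used;
    serials[i] = next_serial++;
    strcpy(space + space_used, name);
    space_used += (int)strlen(name) + 1;
    return serials[i];
}
```

Every occurrence of the same identifier yields the same serial number, so later compiler phases can compare identifiers by integer rather than by string.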

  39. Literals require processing before they are returned • Numeric conversion can be tricky: overflow or roundoff errors • Standard library routines: atoi(), atof() • Ex. (in C): a (*b) • A call to procedure a, or • A declaration of an identifier b that is a pointer variable (if a has been declared in a typedef) • One solution: create a table of currently visible identifiers and return a special token typeid for typedef-declared names
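For the overflow problem, C's strtol() reports range errors through errno, unlike atoi(), which silently returns a wrong value. A hedged sketch of a conversion helper a scanner might use:

```c
#include <errno.h>
#include <stdlib.h>

/* Converts an integer literal's lexeme; returns 1 on success, 0 if the
   lexeme overflows a long or is not a clean decimal integer, so the
   scanner can issue a lexical error instead of a silently wrong value. */
int convert_int(const char *lexeme, long *out) {
    char *end;
    errno = 0;
    long v = strtol(lexeme, &end, 10);
    if (errno == ERANGE) return 0;                 /* out of range */
    if (end == lexeme || *end != '\0') return 0;   /* no digits or junk */
    *out = v;
    return 1;
}
```

The errno = 0 reset before the call matters: strtol() only sets errno on failure, so a stale ERANGE from an earlier call would otherwise be misread as an overflow here.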

  40. Processing Reserved Words • Keywords: if, while, … • Most programming languages choose to make keywords reserved • To simplify parsing • To make programs more readable • Ex. (in Pascal and Ada) • begin begin; end; end; begin; end
