1. Implementing a Scanner
SC4312 Compiler Construction, Lecture 2b
Originally prepared by Dr. Songsak Channarukul
Modified by Dr. Kwankamol Nongpong
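2. Scanner as a DFA
A scanner is a big DFA.
[Figure: DFA with states 0 to 5 and transitions on blank, letter, digit, "(", ">" and "=", recognizing the tokens ident, number, lpar, gtr and geq.]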
3. Scanning
Scanners tend to be built in one of two ways:
ad hoc (handwritten), using goto or nested case statements (for more details see Figure 2.11)
table-driven DFA
4. Scanning
Ad hoc scanners generally yield the fastest, most compact code by doing lots of special-purpose things, though good automatically generated scanners come very close.
Table-driven DFAs are what scanner generators like lex and scangen produce:
lex (flex) in the form of C code
scangen in the form of numeric tables and a separate driver (for details see Figure 2.12)
5. The Rule (almost universal)
The scanner always accepts the longest possible token from the input.
One legitimate token could be a prefix of another token.
However, we can't tell whether a longer token is possible without peeking at one or more characters ahead (lookahead).
6. Lookahead
Take Pascal, for example: when you have seen a 3 and then see a dot, you need to peek at the character beyond the dot.
3.14 (a single token designating a real number)
3..5 (three tokens designating a range)
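The Pascal case above can be sketched in Python as follows. The function name `scan_number` and the token kind names are made up for illustration; the only point is that the scanner looks at the character after the dot before it decides whether the dot belongs to the number.

```python
def scan_number(text, pos=0):
    """Scan a Pascal-style number starting at text[pos].

    Returns (kind, lexeme, next_pos). All names here are illustrative.
    """
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == start:
        raise ValueError("expected a digit at position %d" % start)
    # A dot continues the number only if a digit follows it, so peek one
    # character beyond the dot before consuming anything.
    if (pos + 1 < len(text) and text[pos] == "."
            and text[pos + 1].isdigit()):
        pos += 1  # consume the dot
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return ("real", text[start:pos], pos)
    return ("integer", text[start:pos], pos)
```

With this sketch, `scan_number("3.14")` consumes all four characters as one real number, while `scan_number("3..5")` stops after the `3` and leaves `..` for the next call of the scanner.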
7. DFA Implementation (Hand-Coded)
A DFA can be implemented by hand-coding the states in the source code.
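A minimal sketch of the hand-coded style, written in Python. It recognizes the five tokens named on the "Scanner as a DFA" slide (ident, number, lpar, gtr, geq); the exact transitions, the function name `next_token`, and the result format are assumptions made for illustration.

```python
def next_token(text, pos=0):
    """Hand-coded DFA: every state is a branch in the code.

    Returns (kind, lexeme, next_pos), or ("eof", "", pos) at end of input.
    """
    # State 0: skip blanks.
    while pos < len(text) and text[pos] == " ":
        pos += 1
    if pos >= len(text):
        return ("eof", "", pos)
    start = pos
    ch = text[pos]
    if ch.isalpha():                  # state 1: letter {letter | digit}
        pos += 1
        while pos < len(text) and text[pos].isalnum():
            pos += 1
        kind = "ident"
    elif ch.isdigit():                # state 2: digit {digit}
        pos += 1
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        kind = "number"
    elif ch == "(":                   # state 3
        pos += 1
        kind = "lpar"
    elif ch == ">":                   # state 4, or state 5 if "=" follows
        pos += 1
        if pos < len(text) and text[pos] == "=":
            pos += 1
            kind = "geq"
        else:
            kind = "gtr"
    else:
        raise ValueError("unexpected character %r" % ch)
    return (kind, text[start:pos], pos)
```

Each branch keeps consuming characters for as long as it can, which is how this style implements the longest-match rule. Calling `next_token` repeatedly with the returned position yields the token stream.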
8. DFA Implementation (Table-Driven)
A DFA can be implemented as a matrix of δ (the transition function).
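A minimal table-driven sketch in Python, assuming a small DFA for the tokens ident, number, lpar, gtr and geq. The table contents, the names `DELTA`, `ACCEPTING`, `char_class` and `next_token`, and the use of nested dictionaries as a sparse matrix are all illustrative choices, not the output of any particular generator.

```python
def char_class(ch):
    """Map a character to its column in the transition matrix."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return ch

# DELTA[state][column] -> next state; a missing entry means "no transition".
DELTA = {
    0: {"letter": 1, "digit": 2, "(": 3, ">": 4},
    1: {"letter": 1, "digit": 1},
    2: {"digit": 2},
    3: {},
    4: {"=": 5},
    5: {},
}
# Token kind recognized in each accepting state.
ACCEPTING = {1: "ident", 2: "number", 3: "lpar", 4: "gtr", 5: "geq"}

def next_token(text, pos=0):
    """Generic driver: all language knowledge lives in the tables."""
    while pos < len(text) and text[pos] == " ":
        pos += 1
    if pos >= len(text):
        return ("eof", "", pos)
    state, start = 0, pos
    while pos < len(text):
        nxt = DELTA[state].get(char_class(text[pos]))
        if nxt is None:
            break  # stuck: the longest possible token ends here
        state = nxt
        pos += 1
    if state not in ACCEPTING:
        raise ValueError("unexpected character %r" % text[start])
    return (ACCEPTING[state], text[start:pos], pos)
```

In this small DFA every state other than the start state is accepting, so stopping when no transition exists is enough to get the longest match. A DFA with non-accepting intermediate states would also need to remember the last accepting state and back up to it.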
9. Scanner Generator
A scanner generator is a program that generates the source code of a scanner, in a target language, from a specification file.
Compiler generators usually include a scanner generator (e.g., lex, flex) and a parser generator (e.g., yacc, bison).
In this class, we will use a compiler generator, Coco/R, which generates both the scanner and the parser in C++, C#, Java, etc.
10. Overview of Coco/R
Coco/R takes an attributed grammar of a source language as input.
Its output includes Scanner, Parser, and related classes.
Other semantic classes (e.g., symbol table, code generator) will have to be hand-coded.
The main program will create a Parser object and start the compilation process from there.
11. Generated Scanner
The scanner generated by Coco/R is implemented as a DFA.
Therefore, the lexical rules must be specified by a regular EBNF grammar (no recursion).
Tokens must be made up of characters from the extended ASCII set (256 values).
The scanner can be made case-sensitive or case-insensitive.
12. EBNF in Coco/R
= separates the sides of a production.
. terminates a production.
| separates alternatives.
() groups alternatives.
[] specifies an option.
{} specifies an iteration (zero or more).
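Because lexical rules are regular, each EBNF operator has a direct counterpart in ordinary regular-expression notation: `{x}` corresponds to `(x)*` and `[x]` to `(x)?`. The Python sketch below shows that correspondence with the standard `re` module; the production `signed` is a made-up example, and the sketch says nothing about how Coco/R itself processes EBNF.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# ident = letter {letter | digit}.
# The iteration {x} becomes (x)*, the alternative | stays |.
IDENT = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")

# signed = ["+" | "-"] digit {digit}.
# The option [x] becomes (x)?.
SIGNED = re.compile(rf"(?:\+|-)?{DIGIT}(?:{DIGIT})*")
```

For example, `IDENT.fullmatch("x1y2")` succeeds and `IDENT.fullmatch("1x")` does not, because an identifier must start with a letter.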
13. Structure of Coco/R Specification
The Coco/R specification has the following structure:
Cocol =
  [Imports]
  "COMPILER" ident
  [GlobalFieldsAndMethods]
  ScannerSpecification
  ParserSpecification
  "END" ident "." .
14. Scanner Specification
The scanner specification consists of six optional parts:
ScannerSpecification =
  ["IGNORECASE"]
  ["CHARACTERS" {SetDecl}]
  ["TOKENS" {TokenDecl}]
  ["PRAGMAS" {PragmaDecl}]
  {CommentDecl}
  {WhiteSpaceDecl}.
15. Character Sets
Character sets are defined so that they can be used in later sections.
Examples:
digit = "0123456789".
hexDigit = digit + "ABCDEF".
letter = 'A' .. 'Z'.
eol = '\r'.
noDigit = ANY - digit.
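The character-set operators behave like ordinary set operations: `+` is union, `-` is difference, and `..` denotes a range. A minimal Python sketch of the declarations above, assuming the 256-value extended ASCII set for `ANY`; the snake_case variable names are just Python renderings of the Cocol names.

```python
import string

digit = set("0123456789")
hex_digit = digit | set("ABCDEF")       # '+' in Cocol is set union
letter = set(string.ascii_uppercase)    # 'A' .. 'Z'
eol = {"\r"}
ANY = {chr(i) for i in range(256)}      # assumed: extended ASCII, 256 values
no_digit = ANY - digit                  # '-' in Cocol is set difference
```

Here `hex_digit` has 16 members and `no_digit` has 246, since the ten digits are removed from the 256 characters.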
16. Tokens
Tokens may be divided into two groups.
Literals have a fixed representation in the source language (e.g., while, >=).
Token classes have a certain structure that must be explicitly declared by a regular expression in EBNF.
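The difference between the two groups can be sketched in Python. One common approach, assumed here for illustration and not taken from the Coco/R sources, is to look a lexeme up in a table of literals first and only then try the token classes. The names `LITERALS`, `IDENT` and `classify` are hypothetical.

```python
import re

# Literals: fixed spellings, mapped to a token kind.
LITERALS = {"while": "while", "public": "public", ">=": "geq"}

# Token class: a structure described by a regular expression.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def classify(lexeme):
    """Return the token kind of a complete lexeme."""
    if lexeme in LITERALS:
        return LITERALS[lexeme]      # a literal wins over a token class
    if IDENT.fullmatch(lexeme):
        return "ident"
    raise ValueError("not a token: %r" % lexeme)
```

With this sketch, `classify("while")` yields the keyword token, whereas `classify("whilst")` falls through to the `ident` class.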
17. Tokens Specification
A token declaration has the following form:
TokenDecl = Symbol ['=' TokenExpr '.'].
TokenExpr = TokenTerm {'|' TokenTerm}.
TokenTerm = TokenFactor {TokenFactor}
  ["CONTEXT" '(' TokenExpr ')'].
TokenFactor = Symbol
  | '(' TokenExpr ')'
  | '{' TokenExpr '}'
  | '[' TokenExpr ']'.
Symbol = ident | string | char.
18. Examples
ident = letter {letter | digit | '_'}.
number = digit {digit}
  | "0x" hexDigit hexDigit hexDigit hexDigit.
float = digit {digit} '.' {digit}
  ['E' ['+'|'-'] digit {digit}].
while = "while".
public = "public".
19. Comments
Comments in programming languages are usually hard to specify with regular expressions.
Nested comments cannot be specified with regular expressions at all.
Coco/R allows us to specify comments easily as follows:
COMMENTS FROM "//" TO eol
COMMENTS FROM "(*" TO "*)" NESTED
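Nested comments need a counter for the nesting depth, which is exactly what a finite automaton lacks. A minimal Python sketch of how a scanner might skip such a comment; the function name `skip_nested_comment` and its parameters are hypothetical and do not describe the code Coco/R generates.

```python
def skip_nested_comment(text, pos=0, open_="(*", close="*)"):
    """Return the position just after the comment starting at text[pos]."""
    if not text.startswith(open_, pos):
        raise ValueError("no comment starts at position %d" % pos)
    depth = 0
    while pos < len(text):
        if text.startswith(open_, pos):
            depth += 1               # entering a (possibly nested) comment
            pos += len(open_)
        elif text.startswith(close, pos):
            depth -= 1               # leaving one nesting level
            pos += len(close)
            if depth == 0:
                return pos
        else:
            pos += 1
    raise ValueError("unterminated comment")
```

The scanner would call this when it sees the opening delimiter and resume normal scanning at the returned position.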
20. White Spaces
White space is not significant in a source program, so the scanner must discard it.
The specification of white space in Coco/R is as follows:
IGNORE '\t' + '\r' + '\n'
21. The Main Class of a Compiler
using System;

public class Compiler {
    public static void Main(string[] arg) {
        // Scanner and Parser are the classes generated by Coco/R.
        Scanner scanner = new Scanner(arg[0]);
        Parser parser = new Parser(scanner);
        parser.Parse();
        Console.WriteLine(parser.errors.count + " errors detected.");
    }
}