Automata and Regular Expression Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn
MiniC Formal Grammar
prog -> stm prog | stm
stm -> id = exp ; | print ( exp ) ;
exp -> exp + exp | exp - exp | exp * exp | exp / exp | id | num | ( exp )
• Nonterminal symbols: prog, stm, exp
• Terminal symbols: id, num, print, =, ;, (, ), +, -, *, /
ident and intconst
• For most programming languages, the terminal symbols represent the basic punctuation symbols, keywords, and operators
• <id> and <num> are special in that they represent (infinite) sets of terminal symbols
id ::= letter idRest
letter ::= _ | a | … | z | A | … | Z
idRest ::= ε | letter idRest | digit idRest
digit ::= 0 | 1 | 2 | … | 9
<exp> ::= … | <ident> | <intconst>
Lexical Analysis • The lexical analyzer translates the source program into a stream of lexical tokens • Source program: • stream of (ASCII or Unicode) characters • Lexical token: • internal data structure that represents the occurrence of a terminal symbol
Example
Source program:
x = 11;
y = 2;
z = x + y;
print (z);
After lexical analysis:
IDENT(x) ASSIGN INTCONST(11) SEMICOLON NEWLINE
IDENT(y) ASSIGN INTCONST(2) SEMICOLON NEWLINE
IDENT(z) ASSIGN IDENT(x) PLUS IDENT(y) SEMICOLON NEWLINE
PRINT LPAREN IDENT(z) RPAREN SEMICOLON EOF
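The translation from characters to tokens in this example can be sketched as a small hand-written scanner. This is only an illustration under stated assumptions: the function `lex_to_string` and the helper `punct_name` are hypothetical names, the token names TIMES and DIVIDE do not appear in the slides, and the tokens are written out as text rather than built as token structures.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Token name for a one-character operator or punctuation mark, or NULL
   if the character is not a MiniC terminal symbol. TIMES and DIVIDE are
   assumed names; the slides only show the other ones. */
static const char *punct_name(char c)
{
    switch (c) {
    case '=': return "ASSIGN";
    case '+': return "PLUS";
    case '-': return "MINUS";
    case '*': return "TIMES";
    case '/': return "DIVIDE";
    case '(': return "LPAREN";
    case ')': return "RPAREN";
    case ';': return "SEMICOLON";
    default:  return NULL;
    }
}

/* Hypothetical helper: scans src and writes the token stream into out,
   tokens separated by single spaces. Returns 0 on success and -1 on an
   illegal character or when out is too small. */
int lex_to_string(const char *src, char *out, size_t cap)
{
    size_t len = 0;
    const char *p = src;

    if (cap == 0) return -1;
    out[0] = '\0';
    for (;;) {
        int n;
        if (*p == ' ' || *p == '\t') { p++; continue; }  /* skip blanks */
        if (*p == '\0') {                                /* end of input */
            n = snprintf(out + len, cap - len, "EOF");
            return (n < 0 || (size_t)n >= cap - len) ? -1 : 0;
        }
        if (*p == '\n') {
            n = snprintf(out + len, cap - len, "NEWLINE ");
            p++;
        } else if (isdigit((unsigned char)*p)) {         /* intconst */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            n = snprintf(out + len, cap - len, "INTCONST(%.*s) ",
                         (int)(p - start), start);
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;                       /* ident or keyword */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            if (p - start == 5 && strncmp(start, "print", 5) == 0)
                n = snprintf(out + len, cap - len, "PRINT ");
            else
                n = snprintf(out + len, cap - len, "IDENT(%.*s) ",
                             (int)(p - start), start);
        } else {                                         /* punctuation */
            const char *name = punct_name(*p);
            if (name == NULL) return -1;                 /* lexical error */
            n = snprintf(out + len, cap - len, "%s ", name);
            p++;
        }
        if (n < 0 || (size_t)n >= cap - len) return -1;  /* out too small */
        len += (size_t)n;
    }
}
```

Calling `lex_to_string` on a source string with a 256-byte buffer fills the buffer with a token stream in the format shown on the slide. An input containing a character such as `@` makes the function return -1.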
Practical Issues
• What are tokens?
  • Recall the tagged union in our previous slides
• What if the input characters are illegal?
• The lexer performs only limited checking of the grammatical structure of the input
  • it only checks that the input stream can be viewed as a stream of terminal symbols
Lexical Errors
• Not lexical errors (every character sequence still forms a valid token):
x = 11 y = = = = = = 2;
• Lexical errors (@ and # are not terminal symbols):
z =@ x + y;
print (#z);
Position Info
• For the purpose of later phases, it is useful to attach position information to each token
  • we will see how to make use of this information in later slides
LPAREN(1,4) IDENT(x,4,5) MINUS(6,7) …
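Tokens in C
#ifndef TOKEN_H
#define TOKEN_H

enum tokenKind {ID, NUM, ASSIGN, LPAREN, …};

/* a token is a pointer to a tagged union */
typedef struct tokenStruct *token;
struct tokenStruct {
  enum tokenKind kind;  /* tag: which terminal symbol this is */
  union {…} u;          /* data carried by the token */
  int line;             /* position info */
  int column;
};

#endif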
Lexer Interface
#ifndef LEXER_H
#define LEXER_H

#include "token.h"

/* returns the next token read from the named file */
token nextToken (char *fileName);

#endif
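Client Code
#include "lexer.h"

int main()
{
  // we want to analyze the file "test.c"
  token t = nextToken ("test.c");
  // TOKEN_EOF is an assumed name for the end-of-file token kind,
  // which is elided ("…") in enum tokenKind
  while (t->kind != TOKEN_EOF)
  {
    …
    t = nextToken ("test.c");
    …
  }
  return 0;
}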
Finite-state Automata (FAs)
Input String → M → {Yes, No}
M = (Σ, S, s0, F, f)
• Σ: input alphabet
• S: state set
• s0: initial state
• F: final states
• f: transition function
A deterministic finite automaton (DFA)
• Transition function: f: S × Σ → S
• which can be extended to strings: f’: S × Σ* → S
• or, in an inductive form (a ∈ Σ, w ∈ Σ*):
  • f’(q, ε) = q
  • f’(q, aw) = f’(f(q, a), w)
DFA Example
• States: 0, 1, 2 (diagram: 0 -a-> 1 -a-> 2, with a b-loop on 0, a b-loop on 1, and an a,b-loop on 2)
• Which strings of a's and b's are accepted?
• Transition function:
  • { (s0,a)→s1, (s0,b)→s0, (s1,a)→s2, (s1,b)→s1, (s2,a)→s2, (s2,b)→s2 }
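A minimal sketch of running this DFA, with the transition function stored as a matrix indexed by state and input symbol. The slide text does not say which states are final, so the code assumes that state 2 is the only final state. Under that assumption the machine accepts exactly the strings containing at least two a's. The name `dfa_accepts` is hypothetical.

```c
#include <stddef.h>

enum { DFA_STATES = 3, DFA_SYMBOLS = 2 };

/* delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b'.
   Rows follow the transition function listed on the slide. */
static const int delta[DFA_STATES][DFA_SYMBOLS] = {
    { 1, 0 },   /* s0: a -> s1, b -> s0 */
    { 2, 1 },   /* s1: a -> s2, b -> s1 */
    { 2, 2 },   /* s2: a -> s2, b -> s2 */
};

/* Returns 1 if the DFA accepts input, 0 otherwise.
   Assumption: s2 is the only final state (not stated on the slide). */
int dfa_accepts(const char *input)
{
    int state = 0;                          /* initial state s0 */
    for (const char *p = input; *p != '\0'; p++) {
        if (*p != 'a' && *p != 'b')
            return 0;                       /* symbol outside the alphabet */
        state = delta[state][*p - 'a'];     /* one step of f */
    }
    return state == 2;
}
```

The loop is the extended transition function f’ unrolled: it consumes one symbol at a time and applies f to the current state.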
Nondeterministic FAs (NFAs)
• NFAs can transition to more than one state on any input
  • f: S × Σ → ℘(S)
• As before, f can be extended to strings:
  • f’: S × Σ* → ℘(S)
• Inductively (a ∈ Σ, w ∈ Σ*):
  f’(q, ε) = {q}
  f’(q, aw) = ∪_{p ∈ f(q, a)} f’(p, w)
NFA Example
• States: 0, 1
• Transition function: { (s0,a)→{s0,s1}, (s0,b)→{s1}, (s1,a)→∅, (s1,b)→{s0,s1} }
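One common way to run an NFA is to track the set of states it could be in, here encoded as a bitmask (bit q set means state q is in the set). The sketch below does this for the example NFA. The slide text does not identify the final states, so the code assumes that s1 is the only final state. The name `nfa_accepts` is hypothetical.

```c
#include <stddef.h>

enum { NFA_STATES = 2 };

/* nfa_delta[state][symbol] is a set of states encoded as a bitmask.
   Symbol 0 = 'a', symbol 1 = 'b'. */
static const unsigned nfa_delta[NFA_STATES][2] = {
    { 0x3u, 0x2u },   /* s0: a -> {s0,s1}, b -> {s1}    */
    { 0x0u, 0x3u },   /* s1: a -> {},      b -> {s0,s1} */
};

/* Assumption: s1 is the only final state (not stated on the slide). */
static const unsigned NFA_FINAL = 1u << 1;

/* Returns 1 if some run of the NFA on input ends in a final state. */
int nfa_accepts(const char *input)
{
    unsigned current = 1u << 0;             /* start in {s0} */
    for (; *input != '\0'; input++) {
        if (*input != 'a' && *input != 'b')
            return 0;                       /* symbol outside the alphabet */
        int sym = *input - 'a';
        unsigned next = 0;
        for (int q = 0; q < NFA_STATES; q++)
            if (current & (1u << q))
                next |= nfa_delta[q][sym];  /* union over all p in the set */
        current = next;
    }
    return (current & NFA_FINAL) != 0;
}
```

The inner loop is the union in the inductive definition of f’: the next set is the union of f(p, a) over every state p in the current set.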
Regular Expressions
• A regular language can always be described using a regular expression.
• Examples
  • (01)*
  • 00
  • ε
  • (a|b)*ab
  • this | that | theother
  • 0*1*2*
  • 01*|0 = 01*
  • 00*11*22* = 0⁺1⁺2⁺
  • (1|0)*00(0|1)*
Regular Expressions and Tokens
• Regular expressions are convenient for describing lexical tokens
  • intconst: [0-9][0-9]*
  • ident: [_a-zA-Z][_a-zA-Z0-9]*
  • others: = | print | + | …
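To try these token patterns out, one option is the POSIX `<regex.h>` interface (`regcomp`, `regexec`, `regfree`). The sketch below classifies a complete lexeme. The names `full_match` and `classify` are hypothetical, the patterns are anchored with `^` and `$` so the whole lexeme must match, and the keyword is tested before `ident` because `print` also matches the identifier pattern. A real lexer would not recompile the patterns on every call.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text matches the POSIX extended regex pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
static int full_match(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Hypothetical helper: maps a lexeme to a token name.
   Keywords are checked first so that "print" is not an IDENT. */
const char *classify(const char *lexeme)
{
    if (full_match("^print$", lexeme) == 1)
        return "PRINT";
    if (full_match("^[0-9][0-9]*$", lexeme) == 1)
        return "INTCONST";
    if (full_match("^[_a-zA-Z][_a-zA-Z0-9]*$", lexeme) == 1)
        return "IDENT";
    if (full_match("^=$", lexeme) == 1)
        return "ASSIGN";
    if (full_match("^\\+$", lexeme) == 1)   /* + must be escaped in a regex */
        return "PLUS";
    return "ERROR";
}
```

For instance, `classify("printer")` falls through the keyword test and is reported as an identifier, while `classify("1x")` matches none of the patterns.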
Regular Expressions
• Let Σ = {a,b}.
• ε is a regular expression
  • L_ε = {ε}
• ∅ is a regular expression
  • L_∅ = {}
• a is a regular expression
  • L_a = {a}
• R|S is a regular expression if R and S are
  • L_{R|S} = L_R ∪ L_S
• RS is a regular expression if R and S are
  • L_{RS} = {uv | u ∈ L_R and v ∈ L_S}
• R* is a regular expression if R is
  • L_{R*} = ∪_{i ≥ 0} (L_R)^i
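As a worked illustration of these rules, the language of the earlier example (a|b)*ab over Σ = {a,b} can be computed step by step. The derivation below uses only the union, concatenation, and star rules:

```latex
\begin{align*}
L_{a|b} &= L_a \cup L_b = \{a\} \cup \{b\} = \{a, b\} \\
L_{(a|b)^*} &= \bigcup_{i \ge 0} (L_{a|b})^i
            = \{\varepsilon\} \cup \{a, b\} \cup \{aa, ab, ba, bb\} \cup \cdots
            = \Sigma^* \\
L_{ab} &= \{\, uv \mid u \in L_a,\ v \in L_b \,\} = \{ab\} \\
L_{(a|b)^*ab} &= \{\, uv \mid u \in L_{(a|b)^*},\ v \in L_{ab} \,\}
              = \{\, u\,ab \mid u \in \Sigma^* \,\}
\end{align*}
```

So (a|b)*ab describes exactly the strings over {a,b} that end in ab.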
Regular Expressions
• The language described by a regular expression can be accepted by an FA.
  • RE → NFA
  • NFA → DFA
• A regular grammar can always be described using a regular expression.
  • RG → RE
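The NFA → DFA step is usually done with the subset construction: each DFA state is a set of NFA states. The sketch below applies it to the two-state NFA from the earlier NFA example, encoding each subset as a bitmask. That NFA has no ε-transitions, so no ε-closure step is needed here. The type `Dfa` and the functions `move_set` and `subset_construction` are hypothetical names.

```c
#include <string.h>

enum { N_NFA = 2, N_SYM = 2, MAX_SUBSETS = 1 << N_NFA };

/* NFA transitions as bitmasks; symbol 0 = 'a', symbol 1 = 'b'. */
static const unsigned nfa_trans[N_NFA][N_SYM] = {
    { 0x3u, 0x2u },   /* s0: a -> {s0,s1}, b -> {s1}    */
    { 0x0u, 0x3u },   /* s1: a -> {},      b -> {s0,s1} */
};

/* The resulting DFA; its states are subsets of NFA states (bitmasks). */
typedef struct {
    unsigned trans[MAX_SUBSETS][N_SYM]; /* trans[S][c] = successor subset */
    int reachable[MAX_SUBSETS];         /* 1 if S is reachable from {s0}  */
    int count;                          /* number of reachable subsets    */
} Dfa;

/* Union of the NFA successors of every state in set on symbol sym. */
static unsigned move_set(unsigned set, int sym)
{
    unsigned result = 0;
    for (int q = 0; q < N_NFA; q++)
        if (set & (1u << q))
            result |= nfa_trans[q][sym];
    return result;
}

/* Worklist version of the subset construction, starting from {s0}. */
void subset_construction(Dfa *dfa)
{
    unsigned worklist[MAX_SUBSETS];     /* each subset is pushed once */
    int top = 0;
    unsigned start = 1u << 0;

    memset(dfa, 0, sizeof *dfa);
    dfa->reachable[start] = 1;
    dfa->count = 1;
    worklist[top++] = start;

    while (top > 0) {
        unsigned set = worklist[--top];
        for (int c = 0; c < N_SYM; c++) {
            unsigned target = move_set(set, c);
            dfa->trans[set][c] = target;
            if (!dfa->reachable[target]) {   /* newly discovered subset */
                dfa->reachable[target] = 1;
                dfa->count++;
                worklist[top++] = target;
            }
        }
    }
}
```

For this NFA all four subsets turn out to be reachable, including the empty set, which acts as a dead state. A subset would be final in the DFA whenever it contains a final NFA state.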
Building FAs
• An FA is a directed graph
  • How large is the input alphabet?
  • How many states?
  • How fast must it run?
  • How to get the lowest constant factor?
  • How to minimize space?
• Representations
  • Matrix
  • Array of lists
  • Hashtable
  • Switch statement
    • For simplicity, we recommend this method in the assignment
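A minimal sketch of the switch-statement representation, using the three-state machine from the DFA example slide. Each state becomes a `case`, and each outgoing edge becomes a `case` of an inner switch on the input character. As before, treating state 2 as the only final state is an assumption, and the names `dfa_step` and `switch_dfa_accepts` are hypothetical.

```c
/* One transition: returns the next state, or -1 if c is not in {a,b}. */
static int dfa_step(int state, char c)
{
    switch (state) {
    case 0:
        switch (c) {
        case 'a': return 1;
        case 'b': return 0;
        default:  return -1;
        }
    case 1:
        switch (c) {
        case 'a': return 2;
        case 'b': return 1;
        default:  return -1;
        }
    case 2:
        switch (c) {
        case 'a':
        case 'b': return 2;
        default:  return -1;
        }
    default:
        return -1;                  /* unknown state */
    }
}

/* Returns 1 if the input is accepted, 0 otherwise.
   Assumption: state 2 is the only final state. */
int switch_dfa_accepts(const char *input)
{
    int state = 0;                  /* initial state */
    for (; *input != '\0'; input++) {
        state = dfa_step(state, *input);
        if (state < 0)
            return 0;               /* illegal input character */
    }
    return state == 2;
}
```

Compared with a matrix, the switch form needs no table and keeps the automaton readable in the source, at the cost of more code as the number of states grows.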
Lex -- Automatic Lexer Generation Tools
History • Lexical analysis was once a performance bottleneck • certainly not true today! • As a result, early research investigated methods for efficient lexical analysis • While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use
History: A long-standing Goal
• In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler)
declarative compiler specification → compiler-compiler → compiler
History: Unix and C
• In the late 1960s and early 1970s at Bell Labs, Thompson, Ritchie, and others were developing Unix
• A key part of this project was the development of C and a compiler for it
• Johnson et al., in 1968, proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968]
• Lex, developed at Bell Labs by Lesk and Schmidt in the mid-1970s, realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers
The Lex tool
lexical analyzer specification → Lex → fast lexical analyzer
• The original Lex generated lexers written in C. Today every major language has its own lex tool(s):
  • flex, sml-lex, ocamllex, JLex, JFlex, C#Flex, …