Automata and Regular Expressions Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn
MiniC Formal Grammar
• Nonterminal symbols: prog, stm, exp
• Terminal symbols: id, num, print, =, ;, (, ), +, -, *, /
prog -> stm prog | stm
stm  -> id = exp ; | print ( exp ) ;
exp  -> exp + exp | exp - exp | exp * exp | exp / exp
      | id | num | ( exp )
ident and intconst
• For most programming languages, the terminal symbols represent the basic punctuation symbols, keywords, and operators
• <id> and <num> are special in that they represent (infinite) sets of terminal symbols
id     ::= letter idRest
letter ::= _ | a | … | z | A | … | Z
idRest ::= ε | letter idRest | digit idRest
digit  ::= 0 | 1 | 2 | … | 9
<exp> ::= … | <ident> | <intconst>
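The grammar for <id> above can be checked directly in code. The following is a minimal sketch (the function name `is_ident` is our own choice, not part of the slides): a valid identifier is a letter or underscore followed by any run of letters, digits, or underscores.

```c
#include <ctype.h>

/* Checks whether the whole string s is a valid <id> per the grammar:
   id ::= letter idRest, where letter is '_' or a..z or A..Z,
   and idRest is any sequence of letters, digits, and underscores. */
int is_ident(const char *s)
{
    if (!s || !(*s == '_' || isalpha((unsigned char)*s)))
        return 0;                       /* must start with letter or '_' */
    for (s++; *s; s++)                  /* idRest: letters, digits, '_' */
        if (!(*s == '_' || isalnum((unsigned char)*s)))
            return 0;
    return 1;
}
```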
Lexical Analysis • The lexical analyzer translates the source program into a stream of lexical tokens • Source program: • stream of (ASCII or Unicode) characters • Lexical token: • internal data structure that represents the occurrence of a terminal symbol
Example
  x = 11;
  y = 2;
  z = x + y;
  print (z);
lexical analysis ⇒
  IDENT(x) ASSIGN INTCONST(11) SEMICOLON NEWLINE
  IDENT(y) ASSIGN INTCONST(2) SEMICOLON NEWLINE
  IDENT(z) ASSIGN IDENT(x) PLUS IDENT(y) SEMICOLON NEWLINE
  PRINT LPAREN IDENT(z) RPAREN SEMICOLON EOF
Practice Issues
• What are tokens?
  • recall the tagged union in our previous slides
• What if the input characters are illegal?
  • the lexer performs only limited checking of the grammatical structure of the input
  • it only checks that the input stream can be viewed as a stream of terminal symbols
Lexical Errors
Not lexical errors (the missing ';' is a syntax error, caught later by the parser):
  x = 11
  y = 2;
Lexical errors ('@' and '#' cannot begin any token):
  z =@ x + y;
  print (#z);
Position Info
• For the purpose of later phases, it is useful to attach position information to each token
• we'll see how to make use of this kind of info in later slides
  LPAREN(1,4) IDENT(x,4,5) MINUS(6,7) …
Tokens in C
#ifndef TOKEN_H
#define TOKEN_H

enum tokenKind {ID, NUM, ASSIGN, LPAREN, …};

typedef struct tokenStruct *token;
struct tokenStruct {
  enum tokenKind kind;
  union {…} u;    // semantic value, e.g., the name of an ID
  int line;
  int column;
};

#endif
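A constructor for this token type might look as follows. This is a simplified sketch: the `union` field is omitted so the code compiles on its own, and the name `newToken` is our own choice, not part of the slides.

```c
#include <stdlib.h>

enum tokenKind { ID, NUM, ASSIGN, LPAREN };

typedef struct tokenStruct *token;
struct tokenStruct {
    enum tokenKind kind;
    int line;          /* position info, used by later phases */
    int column;
};

/* Allocates and initializes one token. */
token newToken(enum tokenKind kind, int line, int column)
{
    token t = malloc(sizeof(*t));
    if (!t) return NULL;
    t->kind = kind;
    t->line = line;
    t->column = column;
    return t;
}
```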
Lexer Interface
#ifndef LEXER_H
#define LEXER_H

#include "token.h"

token nextToken (char *fileName);

#endif
Client Code
#include "lexer.h"

int main() {
  // we want to analyze the file "test.c"
  token t = nextToken ("test.c");
  // compare the token's kind, not the pointer, and note there must be
  // no stray ';' after the while condition
  while (t->kind != EOF) {
    …
    t = nextToken ("test.c");
    …
  }
  return 0;
}
Finite-state Automata (FAs)
M = (Σ, S, s0, F, f)
  Σ  : input alphabet
  S  : state set
  s0 : initial state
  F  : final states
  f  : transition function
M takes an input string and answers {Yes, No}
Transition Functions
• A deterministic finite automaton (DFA) has a transition function
  f : S × Σ → S
which can be extended to
  f' : S × Σ* → S
or in an inductive form:
• f'(q, ε) = q
• f'(q, aω) = f'(f(q, a), ω)
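The inductive definition of f' is just a loop: consume the input one symbol at a time, applying f at each step. The following is a minimal sketch; `run_dfa`, `step_fn`, and the parity example DFA are our own illustrations, not part of the slides.

```c
/* f is passed in as a function pointer; run_dfa computes f'(start, input). */
typedef int (*step_fn)(int state, char symbol);

int run_dfa(step_fn f, int start, const char *input)
{
    int q = start;
    for (; *input; input++)      /* f'(q, aw) = f'(f(q, a), w) */
        q = f(q, *input);
    return q;                    /* f'(q, epsilon) = q */
}

/* Example f: a two-state DFA over {a,b} tracking the parity of a's. */
static int parity_step(int state, char symbol)
{
    return symbol == 'a' ? 1 - state : state;
}
```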
DFA Example
(figure: states s0, s1, s2; s0 --a--> s1 --a--> s2; b loops on s0 and s1; a,b loop on s2)
• Which strings of a's and b's are accepted?
• Transition function:
  { (s0,a)→s1, (s0,b)→s0, (s1,a)→s2, (s1,b)→s1, (s2,a)→s2, (s2,b)→s2 }
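This transition function can be encoded as a matrix, one of the representations discussed later. Assuming s2 is the accepting state (the slide does not mark it explicitly), this DFA accepts exactly the strings containing at least two a's.

```c
/* Transition table: rows are states s0..s2, columns are inputs a, b. */
static const int delta[3][2] = {
    /* s0 */ {1, 0},
    /* s1 */ {2, 1},
    /* s2 */ {2, 2},   /* sink: both inputs stay in s2 */
};

int accepts(const char *input)
{
    int q = 0;                              /* start in s0 */
    for (; *input; input++)
        q = delta[q][*input == 'a' ? 0 : 1];
    return q == 2;                          /* assumed final state s2 */
}
```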
Nondeterministic FAs (NFAs)
• NFAs can transition to more than one state on any input
  f : S × Σ → P(S)    (P(S) is the powerset of S)
• As before, can extend:
  f' : S × Σ* → P(S)
• Inductively:
  f'(q, ε) = {q}
  f'(q, aω) = ∪_{p ∈ f(q,a)} f'(p, ω)
NFA Example
(figure: states s0, s1; a loops on s0 and also moves s0 to s1; b moves s0 to s1; b moves s1 to both s0 and s1)
• Transition function:
  { (s0,a)→{s0,s1}, (s0,b)→{s1}, (s1,a)→∅, (s1,b)→{s0,s1} }
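An NFA can be simulated directly by tracking the *set* of current states; with only two states a bitmask works well (bit i represents state si). This is a sketch of that idea for the example above; the slide does not mark a final state, so we assume s1 is accepting.

```c
/* f for the example NFA, returning a set of states as a bitmask. */
static unsigned nfa_f(int state, char symbol)
{
    if (state == 0) return symbol == 'a' ? 0x3 /* {s0,s1} */ : 0x2 /* {s1} */;
    else            return symbol == 'a' ? 0x0 /* {}      */ : 0x3 /* {s0,s1} */;
}

int nfa_accepts(const char *input)
{
    unsigned cur = 0x1;                       /* start set {s0} */
    for (; *input; input++) {
        unsigned next = 0;
        for (int s = 0; s < 2; s++)           /* union over current states */
            if (cur & (1u << s))
                next |= nfa_f(s, *input);
        cur = next;
    }
    return (cur & 0x2) != 0;                  /* accept if s1 is in the set */
}
```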
Regular Expressions
• A regular language can always be described using a regular expression.
• Examples
  (01)*
  00
  (a|b)*ab
  this | that | theother
  0*1*2*
  01*|0 = 01*
  00*11*22* = 0+1+2+
  (1|0)*00(0|1)*
Regular Expressions and Tokens
• Regular expressions are convenient for describing lexical tokens
• intconst: [0-9][0-9]*
• ident: [_a-zA-Z][_a-zA-Z0-9]*
• others: = | print | + | …
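A lexer built from such regular expressions typically applies the longest-match rule: it matches as many characters as the pattern allows. Here is a sketch for intconst ([0-9][0-9]*); the function name `match_intconst` is our own illustration, not a standard API.

```c
#include <ctype.h>

/* Returns the number of characters at the front of s matched by
   [0-9][0-9]* (0 if the string does not start with a digit). */
int match_intconst(const char *s)
{
    int n = 0;
    while (isdigit((unsigned char)s[n]))   /* consume the longest digit run */
        n++;
    return n;
}
```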
Regular Expressions
• Let Σ = {a,b}.
• ∅ is a regular expression
  L(∅) = {} (the empty language)
• ε is a regular expression
  L(ε) = {ε}
• a is a regular expression, for each a ∈ Σ
  L(a) = {a}
• R|S is a regular expression if R and S are
  L(R|S) = L(R) ∪ L(S)
• RS is a regular expression if R and S are
  L(RS) = {uv | u ∈ L(R) & v ∈ L(S)}
• R* is a regular expression if R is
  L(R*) = ∪_{i≥0} L(R)^i
Regular Expressions
• The language described by a regular expression can be accepted by an FA:
  RE → NFA,  NFA → DFA
• A regular grammar can always be described using a regular expression:
  RG → RE
Building FAs
• An FA is a directed graph. Design considerations:
  • How large is the input alphabet?
  • How many states?
  • How fast must it run?
  • How to get the lowest constant factor?
  • How to minimize space?
• Representations
  • Matrix
  • Array of lists
  • Hashtable
  • Switch statement
    • For simplicity, we recommend this method in the assignment
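The switch-statement representation encodes the transition table in control flow, with one case per state. This is a sketch using the earlier three-state DFA example (strings with at least two a's, assuming s2 is accepting); `accepts_switch` is our own name for it.

```c
/* One case per state; each case computes the next state from the input. */
int accepts_switch(const char *input)
{
    int q = 0;                               /* start in s0 */
    for (; *input; input++) {
        switch (q) {
        case 0: q = (*input == 'a') ? 1 : 0; break;
        case 1: q = (*input == 'a') ? 2 : 1; break;
        case 2: /* sink accepting state */   break;
        }
    }
    return q == 2;
}
```

Compared with an explicit matrix, this form needs no table storage and lets the compiler optimize each state's dispatch, which is why it is convenient for a hand-written assignment lexer.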
Lex -- Automatic Lexer Generation Tools
History • Lexical analysis was once a performance bottleneck • certainly not true today! • As a result, early research investigated methods for efficient lexical analysis • While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use
History: A Long-standing Goal
• In this early period, a considerable amount of study went into the goal of creating an automatic lexer generator (aka compiler-compiler):
  declarative specification → compiler-compiler → compiler
History: Unix and C
• Around 1970 at Bell Labs, Ritchie and others were developing Unix
• A key part of this project was the development of C and a compiler for it
• Johnson et al., in 1968, had proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968]; Lex, developed later at Bell Labs by Lesk, built on this line of work
• Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers
The Lex tool lexical analyzer specification fast lexical analyzer Lex • The original Lex generated lexers written in C. Today every major language has its own lex tool(s): • flex, sml-lex, ocamllex, JLex, JFlex, C#Flex, …