Lexical Analysis
by Neng-Fa Zhou
• Why separate lexical and syntax analyses?
  • simpler design
  • efficiency
  • portability

Tokens, Patterns, Lexemes
• Tokens
  • Terminal symbols in the grammar
• Patterns
  • Description of a class of tokens
• Lexemes
  • Words in the source program

Languages
• Fixed and finite alphabet (vocabulary)
• Finite length sentences
• Possibly infinite number of sentences
• Examples
  • Natural numbers {1,2,3,...10,11,...}
  • Strings over {a,b}: a^n b a^n
• Terms on parts of a string
  • prefix, suffix, substring, proper ....

Operations on Languages

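The slide title names the operations without listing them. Assuming the usual four (union, concatenation, Kleene closure, positive closure), their standard definitions for languages L and M can be written as follows, with L^0 = {ε} and L^i = L^(i-1) L:

```latex
\begin{align*}
L \cup M &= \{\, s \mid s \in L \text{ or } s \in M \,\} \\
LM       &= \{\, st \mid s \in L \text{ and } t \in M \,\} \\
L^{*}    &= \bigcup_{i=0}^{\infty} L^{i} \\
L^{+}    &= \bigcup_{i=1}^{\infty} L^{i}
\end{align*}
```
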
Examples
L = {A,B,...,Z,a,b,...,z}
D = {0,1,...,9}
L ∪ D : the set of letters and digits
LD : a letter followed by a digit
L^4 : four-letter strings
L* : all strings of letters, including ε
L(L ∪ D)* : strings of letters and digits beginning with a letter
D+ : strings of one or more digits

Regular Expression (RE)
• ε is a RE
• a symbol in Σ is a RE
• Let r and s be REs.
  • (r) | (s) : or
  • (r)(s) : concatenation
  • (r)* : zero or more instances
  • (r)+ : one or more instances
  • (r)? : zero or one instance

Examples
Σ = {a,b}
1. a|b
2. (a|b)(a|b)
3. a*
4. (a|b)*
5. a|a*b

Precedence of Operators (all left associative)
high    r*  r+  r?
        rs
low     r|s

Algebraic Properties of RE

Regular Definitions
d1 → r1
d2 → r2
....
dn → rn
• Each ri is a RE over Σ ∪ {d1,d2,...,di-1}
• not recursive

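Example-1
%{
#include <stdio.h>
int num_lines = 0, num_chars = 0;
%}
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main(void) {
    yylex();
    printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
    return 0;
}
/* Return 1 at end of input so the scanner stops; 0 would mean more input follows. */
int yywrap(void) { return 1; }
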
Example-2
%{
#include <stdio.h>
%}
D       [0-9]
INT     {D}{D}*
%%
{INT}("."{INT}((e|E)("+"|-)?{INT})?)?   {printf("valid %s\n",yytext);}
.                                       {printf("unrecognized %s\n",yytext);}
%%
int main(int argc, char *argv[]){
    ++argv, --argc;                     /* skip the program name */
    if (argc>0) yyin = fopen(argv[0],"r");
    else yyin = stdin;
    yylex();
    return 0;
}
/* Return 1 at end of input so the scanner stops. */
int yywrap(void){return 1;}

java.util.regex
import java.util.regex.*;

class Number {
    public static void main(String[] args){
        String regExNum = "\\d+(\\.\\d+((e|E)(\\+|-)?\\d+)?)?";
        if (Pattern.matches(regExNum,args[0]))
            System.out.println("valid");
        else
            System.out.println("invalid");
    }
}

String Pattern Matching in Perl
print "Input a string :";
$_ = <STDIN>;
chomp($_);
if (/^[0-9]+(\.[0-9]+((e|E)(\+|-)?[0-9]+)?)?$/){
    print "valid\n";
} else {
    print "invalid\n";
}

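The same number pattern can also be checked from C. The following is a minimal sketch, assuming a POSIX system that provides <regex.h>; the function name is_valid_number is illustrative and does not come from the slides. The pattern is the Perl one rewritten as a POSIX extended regular expression.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s has the form  digits [ . digits [ (e|E) [+|-] digits ] ],
   0 if it does not, and -1 if the pattern fails to compile. */
int is_valid_number(const char *s)
{
    /* Same pattern as the Perl version, as a POSIX extended RE. */
    static const char *pattern =
        "^[0-9]+(\\.[0-9]+((e|E)(\\+|-)?[0-9]+)?)?$";
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, s, 0, NULL, 0);   /* 0 means "matched" */
    regfree(&re);
    return status == 0;
}
```

For instance, is_valid_number("6.02e+23") returns 1, while is_valid_number("1e5") returns 0, because the pattern only allows an exponent after a fractional part.
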
Finite Automata
• Nondeterministic finite automaton (NFA)
  NFA = (S,T,s0,F)
  • S: a set of states
  • T: a transition mapping
  • s0: the start state
  • F: final states or accepting states

Example

Deterministic Finite Automata (DFA)
T: a transition function
There is only one arc going out from each node on each symbol.

Simulating a DFA
s = s0;
c = nextchar;
while (c != eof) {
    s = move(s,c);
    c = nextchar;
}
if (s is in F) return "yes";
else return "no";

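The loop above can be made concrete with a transition table. The following is a minimal C sketch, assuming a two-state DFA for (a|b)*a, the regular expression used as an example elsewhere in the deck. The names move_table, accepting, and dfa_accepts are illustrative, and the end of the string plays the role of eof.

```c
enum { N_STATES = 2, N_SYMBOLS = 2 };

/* Assumed DFA for (a|b)*a:
   state 0 (start): the input so far does not end in 'a'
   state 1 (accepting): the input so far ends in 'a' */
static const int move_table[N_STATES][N_SYMBOLS] = {
    /*          a  b */
    /* 0 */  {  1, 0 },
    /* 1 */  {  1, 0 },
};
static const int accepting[N_STATES] = { 0, 1 };

/* Returns 1 ("yes") if the DFA accepts input, 0 ("no") otherwise.
   Characters outside the alphabet {a,b} cause rejection. */
int dfa_accepts(const char *input)
{
    int s = 0;                                   /* s = s0 */
    for (const char *p = input; *p != '\0'; ++p) {
        if (*p != 'a' && *p != 'b')
            return 0;
        s = move_table[s][*p - 'a'];             /* s = move(s,c) */
    }
    return accepting[s];                         /* is s in F? */
}
```

With this table, dfa_accepts("abba") returns 1 and dfa_accepts("ab") returns 0.
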
From RE to NFA
• ε
• a in Σ
• s|t

From RE to NFA (cont.)
• st
• s*

Example
(a|b)*a

Building Lexical Analyzer
• RE to NFA: Algorithm 3.23 (Thompson's construction)
• NFA to DFA: Algorithm 3.32 (Subset construction)
• Emulator

Conversion of an NFA into a DFA
• Intuition
  • move(s,a) is a function in a DFA
  • move(s,a) is a mapping (to a set of states) in an NFA
A state reachable from s0 in the DFA on an input string corresponds to a set of states in the NFA that are reachable on the same string.

Computation of ε-Closure
ε-Closure(T): Set of NFA states reachable from some NFA state s in T by ε-transitions alone.

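One common way to compute this set is a worklist: push every state of T, then repeatedly pop a state and add any ε-successors not yet seen. The C sketch below assumes at most 32 NFA states, with a set of states stored as a bitmask. The names StateSet, epsilon_closure, and example_eps (a small made-up NFA) are illustrative, not from the slides.

```c
#include <stdint.h>

typedef uint32_t StateSet;   /* bit i set <=> NFA state i is in the set */

/* Hypothetical ε-edges of a 5-state NFA:
   0 -ε-> 1, 0 -ε-> 3, 1 -ε-> 2, 4 -ε-> 0 */
static const StateSet example_eps[5] = {
    (1u << 1) | (1u << 3),   /* from state 0 */
    (1u << 2),               /* from state 1 */
    0,                       /* from state 2 */
    0,                       /* from state 3 */
    (1u << 0),               /* from state 4 */
};

/* eps[s] is the set of states reachable from s by one ε-transition.
   Requires n_states <= 32. */
StateSet epsilon_closure(StateSet T, const StateSet eps[], int n_states)
{
    int stack[32];           /* each state is pushed at most once */
    int top = 0;
    StateSet closure = T;

    for (int s = 0; s < n_states; ++s)
        if (T & (1u << s))
            stack[top++] = s;

    while (top > 0) {
        int t = stack[--top];
        for (int u = 0; u < n_states; ++u) {
            StateSet bit = 1u << u;
            if ((eps[t] & bit) && !(closure & bit)) {
                closure |= bit;          /* u is newly reached */
                stack[top++] = u;
            }
        }
    }
    return closure;
}
```

For the made-up NFA, the closure of {0} is {0,1,2,3} and the closure of {2} is just {2}.
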
From an NFA to a DFA (The subset construction)

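A compact way to see the subset construction is to let each DFA state be a bitmask of NFA states. The C sketch below assumes at most 32 NFA states and the alphabet {a,b}. The NFA built by example_nfa is a Thompson-style NFA for (a|b)*abb, chosen as an assumed example; all type and function names are illustrative.

```c
#include <stdint.h>

#define MAX_NFA 32
#define MAX_DFA 64
#define N_SYM   2                       /* 0 = 'a', 1 = 'b' */
#define BIT(i)  (1u << (i))

typedef uint32_t StateSet;              /* bit i set <=> NFA state i */

typedef struct {
    int n_states;
    StateSet eps[MAX_NFA];              /* targets of ε-moves */
    StateSet delta[MAX_NFA][N_SYM];     /* targets on each symbol */
} Nfa;

typedef struct {
    int n_states;
    StateSet subset[MAX_DFA];           /* NFA set behind each DFA state */
    int next[MAX_DFA][N_SYM];
} Dfa;

static StateSet eclosure(const Nfa *n, StateSet T)
{
    StateSet closure = T, prev;
    do {                                /* iterate to a fixed point */
        prev = closure;
        for (int s = 0; s < n->n_states; ++s)
            if (closure & BIT(s)) closure |= n->eps[s];
    } while (closure != prev);
    return closure;
}

static StateSet move_set(const Nfa *n, StateSet T, int c)
{
    StateSet out = 0;
    for (int s = 0; s < n->n_states; ++s)
        if (T & BIT(s)) out |= n->delta[s][c];
    return out;
}

/* Fills *d; DFA state 0 is the start state. Returns the number of DFA
   states, or -1 if more than MAX_DFA would be needed. */
int subset_construction(const Nfa *n, int start, Dfa *d)
{
    d->n_states = 1;
    d->subset[0] = eclosure(n, BIT(start));
    for (int i = 0; i < d->n_states; ++i) {      /* i = next unmarked state */
        for (int c = 0; c < N_SYM; ++c) {
            StateSet U = eclosure(n, move_set(n, d->subset[i], c));
            int j = 0;
            while (j < d->n_states && d->subset[j] != U) ++j;
            if (j == d->n_states) {              /* U is a new DFA state */
                if (d->n_states == MAX_DFA) return -1;
                d->subset[d->n_states++] = U;
            }
            d->next[i][c] = j;
        }
    }
    return d->n_states;
}

/* Assumed example: Thompson-style NFA for (a|b)*abb, start 0, accepting 10. */
Nfa example_nfa(void)
{
    Nfa n = {0};
    n.n_states = 11;
    n.eps[0] = BIT(1) | BIT(7);
    n.eps[1] = BIT(2) | BIT(4);
    n.eps[3] = BIT(6);
    n.eps[5] = BIT(6);
    n.eps[6] = BIT(1) | BIT(7);
    n.delta[2][0] = BIT(3);             /* 2 -a-> 3  */
    n.delta[4][1] = BIT(5);             /* 4 -b-> 5  */
    n.delta[7][0] = BIT(8);             /* 7 -a-> 8  */
    n.delta[8][1] = BIT(9);             /* 8 -b-> 9  */
    n.delta[9][1] = BIT(10);            /* 9 -b-> 10 */
    return n;
}
```

Calling subset_construction(&n, 0, &d) on the example NFA yields five DFA states. A DFA state is accepting when its subset contains an accepting NFA state, here state 10.
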
Example: NFA to DFA

Algorithm 3.39 (DFA state minimization)
P = {F, S-F};
do begin
    P0 = P;
    for each group G in P do begin
        partition G into subgroups such that two states s and t of G
        are in the same subgroup iff for all input symbols a, s and t
        have transitions on a to states in the same group;
        replace G in P by the set of all subgroups formed;
    end
    if (P == P0) return;
end;

Example
State   a    b
AC      B    AC
B       B    D
D       B    E
E       B    AC

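The partition-refinement idea can be sketched in C as below. The input is an assumed five-state DFA (A to E, numbered 0 to 4) for (a|b)*abb with E as the only accepting state. On that input the sketch puts A and C in the same group, which is consistent with the AC row of the example table. The names minimize_partition, example_next, and example_accepting are illustrative.

```c
#define MAX_STATES 32
#define N_SYM 2                        /* 0 = 'a', 1 = 'b' */

/* Assumed DFA for (a|b)*abb: A=0, B=1, C=2, D=3, E=4 (E accepting). */
static const int example_next[5][N_SYM] = {
    /* A */ { 1, 2 },
    /* B */ { 1, 3 },
    /* C */ { 1, 2 },
    /* D */ { 1, 4 },
    /* E */ { 1, 2 },
};
static const int example_accepting[5] = { 0, 0, 0, 0, 1 };

/* On return, group[s] is the group number of state s in the final
   partition. Returns the number of groups, or -1 if n is out of range. */
int minimize_partition(int n, const int next[][N_SYM],
                       const int accepting[], int group[])
{
    int n_groups = 0, changed;

    if (n <= 0 || n > MAX_STATES) return -1;

    /* Initial partition P = {F, S-F}. */
    for (int s = 0; s < n; ++s) group[s] = accepting[s] ? 1 : 0;

    do {
        int new_group[MAX_STATES];
        int rep[MAX_STATES];           /* one representative per new group */
        int n_new = 0;

        for (int s = 0; s < n; ++s) {
            int g;
            for (g = 0; g < n_new; ++g) {
                int r = rep[g];
                /* s joins r's subgroup iff both are in the same old group
                   and go to the same old group on every symbol. */
                int same = (group[r] == group[s]);
                for (int c = 0; same && c < N_SYM; ++c)
                    same = (group[next[r][c]] == group[next[s][c]]);
                if (same) break;
            }
            if (g == n_new) rep[n_new++] = s;
            new_group[s] = g;
        }
        /* Refinement only splits groups, so an unchanged count means P == P0. */
        changed = (n_new != n_groups);
        for (int s = 0; s < n; ++s) group[s] = new_group[s];
        n_groups = n_new;
    } while (changed);

    return n_groups;
}
```

On the assumed DFA the function returns 4, with states A and C sharing a group and B, D, and E each alone in theirs.
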
Construct a DFA Directly from a Regular Expression

Implementation Issues
• Input buffering
  • Read in characters one by one
    • Unable to look ahead
    • Inefficient
  • Read in a whole string and store it in memory
    • Requires a big buffer
  • Buffer pairs

Buffer Pairs

Use Sentinels

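The point of a sentinel is that the common case costs one comparison per character: the scanner only checks which buffer boundary it hit after it has seen the sentinel. The C sketch below illustrates this under several assumptions: the input comes from an in-memory string rather than a file, the sentinel is '\0' and never occurs in the text, the halves are only 8 characters long so reloads are easy to trace, and no lexeme-start pointer is tracked. All names are illustrative.

```c
#include <string.h>

#define N 8                  /* half-buffer size (tiny, for illustration) */
#define SENTINEL '\0'        /* assumed never to occur in the source text */

typedef struct {
    /* [0..N-1] half 1, [N] sentinel, [N+1..2N] half 2, [2N+1] sentinel */
    char buf[2 * N + 2];
    char *forward;           /* next character to read */
    const char *src;         /* unread remainder of the in-memory input */
} Scanner;

/* Fill one half with up to N characters. A short read puts a sentinel
   right after the last character, marking the real end of input. */
static void reload(Scanner *sc, char *half)
{
    size_t n = strlen(sc->src);
    if (n > N) n = N;
    memcpy(half, sc->src, n);
    sc->src += n;
    if (n < N) half[n] = SENTINEL;
}

void scanner_init(Scanner *sc, const char *text)
{
    sc->src = text;
    sc->buf[N] = SENTINEL;               /* end of half 1 */
    sc->buf[2 * N + 1] = SENTINEL;       /* end of half 2 */
    reload(sc, sc->buf);
    sc->forward = sc->buf;
}

/* Returns the next character, or -1 at end of input. */
int next_char(Scanner *sc)
{
    char c = *sc->forward++;
    if (c != SENTINEL)
        return (unsigned char)c;         /* common case: a single test */

    if (sc->forward == sc->buf + N + 1) {          /* ran off half 1 */
        reload(sc, sc->buf + N + 1);
        return next_char(sc);
    }
    if (sc->forward == sc->buf + 2 * N + 2) {      /* ran off half 2 */
        reload(sc, sc->buf);
        sc->forward = sc->buf;
        return next_char(sc);
    }
    --sc->forward;                       /* sentinel inside a half: real end */
    return -1;
}

/* Usage check: reads text through the scanner and confirms that every
   character comes back in order. Returns 1 on success, 0 otherwise. */
int scanner_roundtrips(const char *text)
{
    Scanner sc;
    size_t i = 0;
    int c;

    scanner_init(&sc, text);
    while ((c = next_char(&sc)) != -1)
        if (text[i++] != (char)c) return 0;
    return text[i] == '\0';
}
```

A real lexer would read each half from a file and would also keep a pointer to the start of the current lexeme, so that a reload never overwrites characters that are still needed.
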
Lexical Analyzer

Lex
• A tool for automatically generating lexical analyzers

Lex Specifications
declarations
%%
translation rules
%%
auxiliary procedures

Translation rules have the form:
p1 {action1}
p2 {action2}
...
pn {actionn}

Lex Regular Expressions

yylex()
yylex(){
    switch (pattern_match()){
        case 1: {action1} break;
        case 2: {action2} break;
        ...
        case n: {actionn} break;
    }
}

Example
%{
#include <stdio.h>
#include <stdlib.h>   /* atoi, atof */
%}
DIGIT   [0-9]
ID      [a-z][a-z0-9]*
%%
{DIGIT}+                {printf("An integer:%s(%d)\n",yytext,atoi(yytext));}
{DIGIT}+"."{DIGIT}*     {printf("A float: %s (%g)\n",yytext,atof(yytext));}
if|then|begin|end|procedure|function    {printf("A keyword: %s\n",yytext);}
{ID}                    {printf("An identifier %s\n",yytext);}
"+"|"-"|"*"|"/"         {printf("An operator %s\n",yytext);}
"{"[^}\n]*"}"           {/* eat up one-line comments */}
[ \t\n]+                {/* eat up white space */}
.                       {printf("Unrecognized character: %s\n", yytext);}
%%
int main(int argc, char *argv[]){
    ++argv, --argc;     /* skip the program name */
    if (argc>0) yyin = fopen(argv[0],"r");
    else yyin = stdin;
    yylex();
    return 0;
}
/* Return 1 at end of input so the scanner stops. */
int yywrap(void){return 1;}