Lexical Analysis Part 1 CMSC 431 Shon Vick
Lexical Analysis – What’s to come • Programs could be made from characters, and parse trees would go down to the character level • Machine specific, obfuscates parsing, cumbersome • Lexical analysis is a firewall between program representation and parsing actions • A prior lexical analysis phase obtains tokens consisting of a type (ID) and a value (the lexeme matched) • In Principle – simple transition diagrams (finite state automata) characterize each of the “things” that can be recognized • In Practice – a program combines the multiple automata definitions into an efficient state machine
Lexical Phase • Simple (non-recursive) • Efficient (special purpose code) • Portable (ignore character-set and architecture differences) • Use JavaCC, lex, flex, etc. • Used in practice with Bison/Yacc, etc.
Lexical Processing • Token: a terminal symbol in a grammar. At the lexical level this is a symbolic constant, and in print it is represented in bold • Pattern: the set of matching strings. For a keyword it is a constant. For a variable or value it can be represented by a regular expression • Lexeme: the character sequence matched by an instance of the token
Lexical Processing • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc. • Languages may have special rules (e.g., PL/I does not have reserved words and Fortran allows spaces in variable names; both are obscure design choices)
Lexical Analysis – sequences • Expression • base * base - 0x4 * height * width • Token sequence • name:base operator:times name:base operator:minus hexConstant:4 operator:times name:height operator:times name:width • Lexical phase returns token and value (yylval, yytext, etc.)
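A minimal sketch of how such a token stream could be produced (hypothetical token names and regular expressions, not the course's scanner), using java.util.regex with one named group per token class:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: one alternative per token class, scanned left to right.
public class TinyLexer {
    private static final Pattern TOKEN = Pattern.compile(
        "(?<HEX>0[xX][0-9a-fA-F]+)"            // hexConstant
        + "|(?<NAME>[a-zA-Z_][a-zA-Z0-9_]*)"   // name
        + "|(?<OP>[*+-])");                    // operator

    public static void main(String[] args) {
        String input = "base * base - 0x4 * height * width";
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {                     // whitespace (and errors) are silently skipped in this sketch
            if (m.group("HEX") != null)        System.out.println("hexConstant:" + m.group("HEX"));
            else if (m.group("NAME") != null)  System.out.println("name:" + m.group("NAME"));
            else                               System.out.println("operator:" + m.group("OP"));
        }
    }
}
```

Run on the slide's expression, this prints name:base, operator:*, name:base, operator:-, hexConstant:0x4, and so on, i.e. the token/lexeme pairs listed above.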
Tokens • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc. • Formal specification of tokens is by regular expressions; this requires defining alphabets, strings, and languages
Regular Expression Notation • a: an ordinary letter from our alphabet • ε: the empty string • r1 | r2: choice between r1 and r2 • r1r2: concatenation of r1 and r2 • r*: zero or more times (Kleene closure) • r+: one or more times • r?: zero or one occurrence • [a-zA-Z]: character class (choice among the listed characters) • . : any single character except newline
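The same operators appear with the same meaning in Java's java.util.regex engine. A small illustrative sketch checking whole-string matches:

```java
import java.util.regex.Pattern;

// Each line exercises one operator from the notation above.
public class RegexOps {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("a|b", "b"));       // choice: true
        System.out.println(Pattern.matches("ab", "ab"));       // concatenation: true
        System.out.println(Pattern.matches("a*", ""));         // Kleene closure: true (zero occurrences)
        System.out.println(Pattern.matches("a+", ""));         // one or more: false
        System.out.println(Pattern.matches("a?", "a"));        // zero or one: true
        System.out.println(Pattern.matches("[a-zA-Z]", "Q"));  // character class: true
        System.out.println(Pattern.matches(".", "\n"));        // '.' excludes newline: false
    }
}
```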
Semantics of Regular Expressions • L(ε) = { ε } • L(a) = { a } for all a in Σ • L(r1 | r2) = L(r1) ∪ L(r2) • L(r1 r2) = { xy | x in L(r1), y in L(r2) } • L(r*) = { ε } ∪ L(r) ∪ { x1x2 | x1, x2 in L(r) } ∪ … ∪ { x1 … xn | each xi in L(r) } ∪ …
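As a quick worked example of these rules (my own illustration, not from the slides), the language denoted by (a|b)c* unfolds as:

```latex
\begin{align*}
L\big((a \mid b)\,c^{*}\big)
  &= \{\, xy \mid x \in L(a \mid b),\ y \in L(c^{*}) \,\} \\
  &= \{\, xy \mid x \in \{a, b\},\ y \in \{\varepsilon, c, cc, \dots\} \,\} \\
  &= \{\, a,\ b,\ ac,\ bc,\ acc,\ bcc,\ \dots \,\}
\end{align*}
```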
For Homework • Suppose Σ is {a, b}. What is the regular expression for: • All strings beginning and ending in a? • All strings with an odd number of a's? • All strings without two consecutive a's? • All strings with an odd number of b's followed by an even number of a's? • What's the description for a Java floating point number? • What's the description of a variable name in Java?
Why we care about Regular Expressions • Lexical Specification of Tokens → Regular expressions → NFA → DFA → Table-driven Implementation of DFA • For every regular expression, there is a deterministic finite-state machine that defines the same language, and vice versa
Regular Expressions • An automaton is a good “visual” aid • but is not suitable as a specification (its textual description is too clumsy) • However, regular expressions are a suitable specification • a compact way to define a language that can be accepted by an automaton
RegExp Use and Construction • Used as the input to a scanner generator like lex or flex or JavaCC • define each token, and also • define white-space, comments, etc. • these do not correspond to tokens, but must be recognized and ignored • An NFA can be constructed from a RegExp via Thompson's Construction
Thompson’s Construction • There are building blocks for each regular expression operator • More complex RegExps are constructed by composing smaller building blocks • Assumes that the NFAs at each step of the construction will have a single accepting state
Regular Expressions to NFA (1) • For each kind of rexp, define an NFA • Notation: a box labelled M stands for the NFA for rexp M • For ε • For input a • (diagrams of the two base-case NFAs)
Regular Expressions to NFA (2) • For A B (concatenation) • For A | B (choice) • (diagrams composing the NFAs for A and B)
Regular Expressions to NFA (3) • For A* • (diagram: the NFA for A wrapped with ε-transitions for skipping and looping)
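The three slides above translate into code almost directly. A sketch of Thompson's construction in Java (class and method names are mine, not the course's; each fragment keeps the single-accepting-state property mentioned earlier, and ε-edges are stored under the null label):

```java
import java.util.*;

class NFA {
    static int nextId = 0;

    static class State {
        final int id = nextId++;                       // for printing/debugging only
        // label -> successor states; the null key holds the ε-edges
        final Map<Character, List<State>> edges = new HashMap<>();
        void edge(Character label, State to) {
            edges.computeIfAbsent(label, k -> new ArrayList<>()).add(to);
        }
    }

    final State start, accept;
    NFA(State start, State accept) { this.start = start; this.accept = accept; }

    // Base case: a single input symbol a
    static NFA symbol(char a) {
        State s = new State(), f = new State();
        s.edge(a, f);
        return new NFA(s, f);
    }

    // Base case: ε
    static NFA epsilon() {
        State s = new State(), f = new State();
        s.edge(null, f);
        return new NFA(s, f);
    }

    // Concatenation A B: glue A's accept state to B's start state with an ε-edge
    static NFA concat(NFA a, NFA b) {
        a.accept.edge(null, b.start);
        return new NFA(a.start, b.accept);
    }

    // Choice A | B: new start/accept states, ε-edges into and out of both branches
    static NFA alt(NFA a, NFA b) {
        State s = new State(), f = new State();
        s.edge(null, a.start);  s.edge(null, b.start);
        a.accept.edge(null, f); b.accept.edge(null, f);
        return new NFA(s, f);
    }

    // Kleene closure A*: allow skipping A entirely or looping back through it
    static NFA star(NFA a) {
        State s = new State(), f = new State();
        s.edge(null, a.start);       s.edge(null, f);
        a.accept.edge(null, a.start); a.accept.edge(null, f);
        return new NFA(s, f);
    }
}
```

Because each operator returns a new fragment with one start and one accepting state, composite expressions are built simply by nesting calls.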
Others • What would be the representation for A+? • What would be the representation for A?? • What about for [a-z]?
Example of RegExp -> NFA conversion • Consider the regular expression (1|0)*1 • The NFA: (diagram with states A–J and transitions on 0, 1, and ε)
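Using the Thompson's-construction sketch above, this same NFA could be assembled mechanically (a hypothetical usage line; the generated states will not carry the slide's A–J labels):

```java
// (1|0)*1: star of the alternation, concatenated with the symbol 1
NFA n = NFA.concat(NFA.star(NFA.alt(NFA.symbol('1'), NFA.symbol('0'))), NFA.symbol('1'));
```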
More Homework Problems • What is the NFA for the following RE? (a(b+c))* a • What is the NFA for the following RE? ((a|b)*c) | (a b c*)
Lexical Analyzer • Can be programmed in a high-level language • Can be generated using tools like LEX/Flex • Integrate these tools with C/C++ or Java code • In Java there are other tools, JFlex for example
How can a tool like LEX or JAVACC work? • Translate regular expressions to Non-deterministic Finite Automata (NFA) • Easier expressive form than the DFA • Automata theory tells us how to optimize • Run the automata • Simulate the NFA, or • Translate the NFA to a DFA: a new DFA where each state corresponds to a set of NFA states (see pages 28-29 of Appel for the set construction) • Have the DFA move between states in simulation of the NFA's states
Non-deterministic FA • An NFA is modified to allow zero, one or MORE transitions from a state on the same input symbol • Easier to express complex patterns as an NFA • Harder to mechanically simulate an NFA: what transition do we make on an input? (simulate all of them, then confirm which worked) • DFA and NFA are functionally equivalent
NFA with null moves • The model of the NFA can be extended to include transitions on ε (null) input • Change the state without reading any symbol from the input stream • e-closure(q): set of all states reachable from q without reading any input symbol (following the null edges)
eClosure Operator • The eClosure operator is defined as eClosure(s) = { s } ∪ states reachable from s using ε-transitions • Example (diagram: states 1–5, start state 1, edges labelled a, b, and ε, including an ε-edge from 1 to 3): eClosure(1) = {1, 3}
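A sketch of the same operator over the NFA.State representation from the Thompson's-construction example (assumed names, not the course's code); it is just a graph search that follows only ε-edges, i.e. the null label:

```java
import java.util.*;

class Closure {
    // eClosure(S): every state reachable from S without consuming input
    static Set<NFA.State> eClosure(Set<NFA.State> states) {
        Set<NFA.State> closure = new HashSet<>(states);
        Deque<NFA.State> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            NFA.State s = work.pop();
            for (NFA.State t : s.edges.getOrDefault(null, List.of()))
                if (closure.add(t)) work.push(t);   // newly reached via ε: keep exploring
        }
        return closure;
    }
}
```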
RE to FA • If we write an expression as an RE (easy for people), how do we turn it into an FA (something a machine can simulate)? • Use Thompson's Construction • At most twice as many states as there are symbols and operators in the regular expression • Results in an NFA (needs a non-deterministic computer to run most efficiently, hmm….)
NFA to DFA • Build “super states” in a DFA, where each “super state” represents the set of states the NFA could be in after the transitions it could make from a state on a symbol • e-closure(q): states that can be arrived at from q with just null transitions • move(S, a): states that can be reached on scanning a symbol a (from the input) • e-closure(S): states that can be reached with ε-transitions from states in S
NFA to DFA (cont….) • Subset Construction (alg 3.2)
S = e-closure(q0); add S to FAStates, unmarked
while (some S in FAStates is unmarked) {
  mark S
  for each a in alphabet {
    T = e-closure( move(S, a) );
    if (T not in FAStates) FAStates.include( T );
    FATran[S, a] = T;
  }
}
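A runnable sketch of the same loop in Java, reusing the NFA.State and Closure sketches from earlier (all names are my own assumptions; empty transitions are simply skipped here rather than routed to an explicit dead state):

```java
import java.util.*;

class SubsetConstruction {
    // move(S, a): NFA states reachable from S on input symbol a
    static Set<NFA.State> move(Set<NFA.State> S, char a) {
        Set<NFA.State> out = new HashSet<>();
        for (NFA.State s : S)
            out.addAll(s.edges.getOrDefault(a, List.of()));
        return out;
    }

    // Each DFA state is a set of NFA states; FATran maps (DFA state, symbol) -> DFA state.
    static Map<Set<NFA.State>, Map<Character, Set<NFA.State>>> build(NFA nfa, Set<Character> alphabet) {
        Map<Set<NFA.State>, Map<Character, Set<NFA.State>>> FATran = new HashMap<>();
        Set<NFA.State> start = Closure.eClosure(Set.of(nfa.start));
        Deque<Set<NFA.State>> unmarked = new ArrayDeque<>();
        unmarked.push(start);
        FATran.put(start, new HashMap<>());
        while (!unmarked.isEmpty()) {               // while an unmarked DFA state remains
            Set<NFA.State> S = unmarked.pop();      // mark S
            for (char a : alphabet) {
                Set<NFA.State> T = Closure.eClosure(move(S, a));
                if (T.isEmpty()) continue;          // no transition on a from S in this sketch
                if (!FATran.containsKey(T)) {       // T not yet in FAStates: include it
                    FATran.put(T, new HashMap<>());
                    unmarked.push(T);
                }
                FATran.get(S).put(a, T);
            }
        }
        return FATran;
    }
}
```

Each key of FATran is one DFA “super state”, i.e. a set of NFA states, exactly as described on the previous slide.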
DFA vs. NFA • The NFA is smaller, O(|r|) space, but needs more time for simulation, O(|r|*|x|) time, even with the nice properties of Thompson's construction • The DFA is faster, O(|x|) time, but is not space efficient, O(2^|r|) space
NFA to DFA • What is the difference between the two? • Is there a single DFA for a corresponding NFA? • Why do we want to do this anyway?
Subset Construction for NFA -> DFA • Compute A = eClosure(start) • Compute the set of states reachable from A on transition a, call this new set S' • Compute eClosure(S') – this is the new state; label it with the next available label • Continue for all possible transitions from the current state, for all applicable symbols of the alphabet • Repeat steps 2-4 for each new state
Example: a c* b • (diagram: NFA with states 1–6 and transitions on a, c, b and ε)
References • Compilers: Principles, Techniques, and Tools, Aho, Sethi, Ullman, Chapter 3 • http://www.cs.columbia.edu/~lerner/CS4115 • Modern Compiler Implementation in Java, Andrew Appel, Cambridge University Press, 2003