
Lexical Analysis Part 1

CMSC 431, Shon Vick


Presentation Transcript


  1. Lexical Analysis Part 1 • CMSC 431 • Shon Vick

  2. Lexical Analysis – What’s to come • Programs could be defined directly in terms of characters, with parse trees going down to the character level • That approach is machine specific, obfuscates parsing, and is cumbersome • Lexical analysis is a firewall between program representation and parsing actions • A prior lexical analysis phase produces tokens, each consisting of a type (e.g. ID) and a value (the lexeme matched) • In principle: simple transition diagrams (finite state automata) characterize each of the “things” that can be recognized • In practice: a program combines the multiple automata definitions into one efficient state machine

  3. Lexical Phase • Simple (non-recursive) • Efficient (special-purpose code) • Portable (ignores character-set and architecture differences) • Use JavaCC, lex, flex, etc. • Used in practice with Bison/Yacc, etc.

  4. Lexical Processing • Token: a terminal symbol in a grammar. At the lexical level it is a symbolic constant; in print it is shown in bold • Pattern: the set of strings that match. For a keyword it is a constant. For a variable or value it can be represented by a regular expression • Lexeme: the character sequence matched by an instance of the token

  5. Lexical Processing • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc. • Languages may have special rules (e.g., PL/I does not have reserved words, and Fortran allows spaces in variable names; both are obscure design choices)

  6. Lexical Analysis – sequences • Expression: base * base - 0x4 * height * width • Token sequence: name:base operator:times name:base operator:minus hexConstant:4 operator:times name:height operator:times name:width • Lexical phase returns token and value (yylval, yytext, etc.)
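The token sequence above can be reproduced with a small hand-written scanner. This is a minimal sketch in Python, not what lex or flex generate; the token names follow the slide, while `TOKEN_SPEC`, `OPERATOR_NAMES`, and `tokenize` are names made up for illustration.

```python
import re

# Hypothetical token specification: (token type, pattern) pairs.
# The token names mirror the slide; the patterns are assumptions.
TOKEN_SPEC = [
    ("hexConstant", r"0[xX][0-9a-fA-F]+"),
    ("name",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",    r"[*\-+/]"),
    ("skip",        r"\s+"),   # white space is recognized but not returned
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))
OPERATOR_NAMES = {"*": "times", "-": "minus", "+": "plus", "/": "divide"}

def tokenize(text):
    """Return a list of (token type, value) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue                      # drop white space
        if kind == "operator":
            tokens.append((kind, OPERATOR_NAMES[lexeme]))
        elif kind == "hexConstant":
            tokens.append((kind, int(lexeme, 16)))
        else:
            tokens.append((kind, lexeme))
    return tokens

for token in tokenize("base * base - 0x4 * height * width"):
    print(token)
```

Each pair plays the role of the token type plus the value a lex-style scanner would hand to the parser.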

  7. Tokens • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc. • Formal specification of tokens by regular expressions, define alphabet, strings, languages

  8. Regular Expression Notation • a: an ordinary letter from our alphabet • ε: the empty string • r1 | r2: choosing from r1 or r2 • r1 r2: concatenation of r1 and r2 • r*: zero or more times (Kleene closure) • r+: one or more times • r?: zero or one occurrence • [a-zA-Z]: character class (choice) • . (period): any single character except newline
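Most of this notation carries over directly to practical regex libraries. The sketch below uses Python's standard `re` module to check a few strings against each operator; the helper `matches` and the sample strings are illustrative choices.

```python
import re

def matches(pattern, s):
    """True when the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

examples = {
    "a|b":      ["a", "b"],          # r1 | r2
    "ab":       ["ab"],              # r1 r2
    "a*":       ["", "a", "aaa"],    # r*: zero or more
    "a+":       ["a", "aaa"],        # r+: one or more
    "a?":       ["", "a"],           # r?: zero or one
    "[a-zA-Z]": ["q", "Q"],          # character class
    ".":        ["x", "7"],          # any single character except newline
}
for pattern, strings in examples.items():
    for s in strings:
        assert matches(pattern, s), (pattern, s)

assert not matches("a+", "")     # r+ needs at least one occurrence
assert not matches(".", "\n")    # period does not match newline
```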

  9. Semantics of Regular Expressions • L(ε) = {ε} • L(a) = {a} for all a in Σ • L(r1 | r2) = L(r1) ∪ L(r2) • L(r1 r2) = { xy | x in L(r1), y in L(r2) } • L(R*) = {ε} ∪ L(R) ∪ { x1 x2 | x1, x2 in L(R) } ∪ … ∪ { x1 … xn | x1, …, xn in L(R) } ∪ …
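These equations can be executed directly if the (possibly infinite) languages are cut off at a maximum string length. The following is a rough sketch under that assumption: every function returns "the strings of L(r) with length at most n", and the names `lit`, `eps`, `alt`, `cat`, and `star` are invented for the example.

```python
def lit(a):
    """L(a) = {a}"""
    return lambda n: {a} if len(a) <= n else set()

def eps():
    """L(ε) = {ε}; the empty string is written "" here."""
    return lambda n: {""}

def alt(r1, r2):
    """L(r1 | r2) = L(r1) ∪ L(r2)"""
    return lambda n: r1(n) | r2(n)

def cat(r1, r2):
    """L(r1 r2) = { xy | x in L(r1), y in L(r2) }"""
    return lambda n: {x + y for x in r1(n) for y in r2(n) if len(x + y) <= n}

def star(r):
    """L(R*) = {ε} ∪ L(R) ∪ L(R)L(R) ∪ ..., truncated at length n."""
    def lang(n):
        result = {""}
        frontier = {""}
        while frontier:
            # append one more string from L(R) to everything found last round
            frontier = {x + y for x in frontier for y in r(n)
                        if y and len(x + y) <= n} - result
            result |= frontier
        return result
    return lang

a_or_b_star = star(alt(lit("a"), lit("b")))
print(sorted(a_or_b_star(2)))
```

The loop in `star` stops because only finitely many strings of length at most n exist, so eventually no new string is produced.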

  10. For Homework • Suppose Σ is {a, b}. What is the regular expression for: • All strings beginning and ending in a? • All strings with an odd number of a’s? • All strings without two consecutive a’s? • All strings with an odd number of b’s followed by an even number of a’s? • What’s the description for a Java floating point number? • What’s the description of a variable name in Java?

  11. Why we care about Regular Expressions • For every regular expression, there is a deterministic finite-state machine that defines the same language, and vice versa • [Figure: Lexical Specification of Tokens → Regular expressions → NFA → DFA → Table-driven Implementation of DFA]

  12. Regular Expressions • Automaton is a good “visual” aid • but is not suitable as a specification (its textual description is too clumsy) • However regular expressions are a suitable specification • a compact way to define a language that can be accepted by an automaton.

  13. RegExp Use and Construction • Used as the input to a scanner generator like lex or flex or JavaCC • define each token, and also • define white-space, comments, etc. • these do not correspond to tokens, but must be recognized and ignored • An NFA can be constructed from a RegExp via Thompson’s Construction

  14. Thompson’s Construction • There are building blocks for each regular expression operator • More complex RegExps are constructed by composing smaller building blocks • Assumes that the NFAs at each step of the construction will have a single accepting state

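  15. Regular Expressions to NFA (1) • For each kind of rexp, define an NFA • Notation: NFA for rexp M [figure omitted] • For ε [figure omitted] • For input a [figure omitted]

  16. Regular Expressions to NFA (2) • For A B [figure omitted: NFAs for A and B joined by an ε-transition] • For A | B [figure omitted: NFAs for A and B in parallel, joined by ε-transitions]

  17. Regular Expressions to NFA (3) • For A* [figure omitted: NFA for A with ε-transitions added]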

  18. Others • What would be the representation for A+? • What would be the representation for A?? • What about for [a-z]?

  19. Example of RegExp -> NFA conversion • Consider the regular expression (1|0)*1 • The NFA is [figure omitted: NFA with states A through J, transitions on 0 and 1, and ε-transitions]
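The building blocks can be written as small functions that each return a fragment with one start state and one accepting state. The sketch below is one possible rendering of Thompson's construction, not the figures from the slides: the `Builder` class, the edge-list representation, and the use of `None` as the ε label are all assumptions made for illustration.

```python
from itertools import count

EPS = None  # label used for ε-transitions (an assumption of this sketch)

class Builder:
    def __init__(self):
        self.ids = count()
        self.edges = []              # (source, label, target) triples

    def new_state(self):
        return next(self.ids)

    def symbol(self, a):             # NFA for a single input symbol a
        s, f = self.new_state(), self.new_state()
        self.edges.append((s, a, f))
        return s, f

    def concat(self, A, B):          # A B: ε-edge from A's accept to B's start
        self.edges.append((A[1], EPS, B[0]))
        return A[0], B[1]

    def union(self, A, B):           # A | B: new start and accept around both
        s, f = self.new_state(), self.new_state()
        for frag in (A, B):
            self.edges.append((s, EPS, frag[0]))
            self.edges.append((frag[1], EPS, f))
        return s, f

    def star(self, A):               # A*: allow skipping A or repeating it
        s, f = self.new_state(), self.new_state()
        self.edges += [(s, EPS, A[0]), (A[1], EPS, f),
                       (A[1], EPS, A[0]), (s, EPS, f)]
        return s, f

def accepts(edges, start, accept, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in edges:
                if src == q and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({start})
    for ch in text:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == ch})
    return accept in current

# (1|0)*1 built from the blocks above
b = Builder()
nfa = b.concat(b.star(b.union(b.symbol("1"), b.symbol("0"))), b.symbol("1"))
print(accepts(b.edges, *nfa, "0101"))
```

With these particular blocks the NFA for (1|0)*1 ends up with ten states, since every symbol, union, and star contributes two new states and concatenation adds none.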

  20. More Homework Problems • What is the NFA for the following RE? (a(b+c))* a • What is the NFA for the following RE? ((a|b)*c) | (a b c*)

  21. Lexical Analyzer • Can be programmed in a high-level language • Can be generated using tools like Lex/Flex • Integrate these tools with C/C++ or Java code • In Java there are other tools, JFlex for example

  22. How can a tool like LEX or JavaCC work? • Translate regular expressions to Non-deterministic Finite Automata (NFA) • Easier expressive form than the DFA • Automata theory tells us how to optimize • Run the automata • Simulate NFA, or • Translate NFA to DFA: a new DFA where each state corresponds to a set of NFA states (see pages 28-29 of Appel for set construction) • Have the DFA move between states in simulation of the NFA’s states

  23. Non-deterministic FA • An NFA allows zero, one, or MORE transitions from a state on the same input symbol • Easier to express complex patterns as an NFA • Harder to mechanically simulate an NFA: which transition do we make on an input symbol? (simulate all of them, then check whether any path accepted) • DFAs and NFAs are equivalent in power: they recognize the same class of languages
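"Simulate all of them" can be done by tracking the set of states the NFA might currently be in. A minimal sketch follows, assuming a hypothetical two-state NFA (no null moves) for strings over {0, 1} that end in 1; `DELTA`, `START`, and `ACCEPTING` are illustrative names.

```python
# Hypothetical NFA for strings over {0, 1} that end in 1.
# State 0 has TWO transitions on input 1, which a DFA would not allow,
# and state 1 has ZERO transitions on any input.
DELTA = {
    (0, "0"): {0},
    (0, "1"): {0, 1},
}
START, ACCEPTING = 0, {1}

def nfa_accepts(text):
    current = {START}                       # every state the NFA could be in
    for ch in text:
        # follow all possible transitions at once
        current = {t for q in current for t in DELTA.get((q, ch), set())}
    return bool(current & ACCEPTING)        # did any path end in acceptance?

print(nfa_accepts("0011"), nfa_accepts("10"))
```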

  24. NFA with null moves • The NFA model can be extended to include transitions on <null> (ε) input • Such a transition changes the state without reading any symbol from the input stream • e-closure(q): set of all states reachable from q without reading any input symbol (following only the null edges)

  25. eClosure Operator • The eClosure operator is defined as eClosure(s) = { s } ∪ states reachable from s using ε-transitions • Example: eClosure(1) = {1,3} [figure omitted: NFA with start state 1, states 1 through 5, and transitions labeled a, b, a/b, and ε]
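The definition translates into a short graph search. In the sketch below the ε-edges are hypothetical: they are chosen only so that eClosure(1) = {1, 3} as in the example, and are not a reconstruction of the slide's figure.

```python
def e_closure(states, eps_edges):
    """Return the given states plus everything reachable by ε-edges alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for t in eps_edges.get(q, ()):
            if t not in closure:       # visit each state only once
                closure.add(t)
                stack.append(t)
    return closure

# Hypothetical ε-edges: a single null move from state 1 to state 3.
EPS_EDGES = {1: {3}}
print(e_closure({1}, EPS_EDGES))
```

The `closure` set doubles as the visited set, so cycles of ε-edges do not cause an infinite loop.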

  26. RE to FA • If we write an expression as an RE (easy for people), how do we turn it into an FA (something a machine can simulate)? • Use Thompson’s Construction • At most twice as many states as there are symbols and operators in the regular expression • Results in an NFA (needs a non-deterministic computer to run most efficiently, hmm…)

  27. NFA to DFA • Build “super states” in a DFA, where each “super state” represents the set of NFA states that the NFA could be in after reading the same input • e-closure(q): states that can be reached from q with just null transitions • move(S, a): states that can be reached from states in S on scanning a symbol a (from the input) • e-closure(S): states that can be reached with ε-transitions from states in S

  28. NFA to DFA (cont.) • Subset Construction (alg 3.2)
      Find e-closure(q0) and add it to FAStates, unmarked
      while (some S in FAStates is unmarked) {
          mark S
          for each a in alphabet {
              T = e-closure( move(S, a) );
              if (T ∉ FAStates) FAStates.include( T );   // T starts out unmarked
              FATran[S, a] = T;
          }
      }
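The pseudocode maps almost line for line onto runnable code. The following sketch applies it to a hypothetical six-state NFA for a c* b; the NFA, the use of `""` as the ε label, and the helper names are assumptions made for illustration. One deliberate difference from the pseudocode: empty target sets are skipped, so the dead state is left out of the table.

```python
from collections import deque

EPS = ""   # label for ε-edges in this sketch

# Hypothetical NFA for a c* b; NFA[state][label] is a set of target states.
NFA = {
    1: {"a": {2}},
    2: {EPS: {3, 5}},
    3: {"c": {4}},
    4: {EPS: {3, 5}},
    5: {"b": {6}},
    6: {},
}
START, ACCEPT, ALPHABET = 1, 6, "abc"

def e_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        for t in NFA[stack.pop()].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)      # hashable, so it can serve as a DFA state

def move(states, a):
    return {t for q in states for t in NFA[q].get(a, ())}

def subset_construction():
    start = e_closure({START})
    fa_states = {start}            # FAStates in the pseudocode
    fa_tran = {}                   # FATran[S, a] = T
    unmarked = deque([start])
    while unmarked:
        S = unmarked.popleft()     # taking S off the queue "marks" it
        for a in ALPHABET:
            T = e_closure(move(S, a))
            if not T:
                continue           # no move on a: leave the entry undefined
            if T not in fa_states:
                fa_states.add(T)
                unmarked.append(T)
            fa_tran[S, a] = T
    return start, fa_states, fa_tran

def dfa_accepts(text):
    S, _, fa_tran = subset_construction()
    for ch in text:
        S = fa_tran.get((S, ch))
        if S is None:
            return False
    return ACCEPT in S             # a super state accepts if it contains 6

print(dfa_accepts("accb"), dfa_accepts("abb"))
```

For this NFA the construction yields four super states: {1}, {2,3,5}, {3,4,5}, and {6}.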

  29. DFA vs. NFA • NFA is smaller, O(|r|) space, but needs more time for simulation, O(|r|*|x|), even with the nice properties of Thompson’s construction • DFA is faster, O(|x|) time, but is not space efficient, O(2^|r|) space in the worst case
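The O(|x|) bound comes from the DFA doing a single table lookup per input character. A minimal table-driven sketch, assuming a hypothetical two-state DFA for (1|0)*1 (state 0: the last symbol read was not 1; state 1: the last symbol read was 1):

```python
# Hypothetical transition table for a DFA recognizing (1|0)*1.
TABLE = [
    {"0": 0, "1": 1},   # transitions out of state 0
    {"0": 0, "1": 1},   # transitions out of state 1
]
ACCEPTING = {1}

def dfa_accepts(text):
    state = 0                      # start state
    for ch in text:                # exactly one lookup per character
        row = TABLE[state]
        if ch not in row:
            return False           # symbol outside the alphabet
        state = row[ch]
    return state in ACCEPTING

print(dfa_accepts("0101"), dfa_accepts("10"))
```

The cost is in the table: after subset construction it can need one row per subset of NFA states, which is where the exponential space bound comes from.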

  30. NFA to DFA • What is the difference between the two? • Is there a single DFA for a corresponding NFA? • Why do we want to do this anyway?

  31. Subset Construction for NFA -> DFA • Step 1: Compute A = eClosure(start) • Step 2: Compute the set of states reachable from the current state on transition a, call this new set S’ • Step 3: Compute eClosure(S’); this is the new state, label it with the next available label (unless the same set already has a label) • Step 4: Continue for all possible transitions from the current state, for all applicable elements of Σ • Step 5: Repeat steps 2-4 for each new state

  32. Example: a c*b [figure omitted: NFA with states 1 through 6 and transitions labeled a, c, b, and ε]

  33. References • Compilers: Principles, Techniques, and Tools, Aho, Sethi, Ullman, Chapter 3 • http://www.cs.columbia.edu/~lerner/CS4115 • Modern Compiler Implementation in Java, Andrew Appel, Cambridge University Press, 2003
