Lexical Analysis (2 Lectures). Overview. Basic Concepts Regular Expressions Language Lexical analysis by hand Regular Languages Tools NFA DFA Scanning tools Lex / Flex / JFlex / ANTLR. Scanning Perspective. Purpose Transform a stream of symbols Into a stream of tokens.
Lexical Analysis (2 Lectures)
Overview • Basic Concepts • Regular Expressions • Language • Lexical analysis by hand • Regular Languages Tools • NFA • DFA • Scanning tools • Lex / Flex / JFlex / ANTLR
Scanning Perspective • Purpose • Transform a stream of symbols • Into a stream of tokens
Lexical Analyzer Responsibilities • Lexical analyzer [Scanner] • Scan input • Remove white spaces • Remove comments • Manufacture tokens • Generate lexical errors • Pass token to parser
Modular design • Rationale • Separate the two analyses (lexical and syntactic) • High cohesion / Low coupling • Improve efficiency • Improve portability / maintainability • Enable integration of third-party lexers • [lexer = lexical analysis tool]
Terminology • Token • A classification for a common set of strings • Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen,.... • Pattern • The rules that characterize the set of strings for a token • Examples: [0-9]+ • Lexeme • Actual sequence of characters that matches a pattern and has a given Token class. • Examples: • Identifier: Name,Data,x • Integer: 345,2,0,629,....
Examples • [slide table not recovered]
Lexical Errors • Error Handling is very localized, w.r.t. Input Source • Example: fi(a==f(x)) …generates no lexical error in C • In what situations do errors occur? • Prefix of remaining input doesn’t match any defined token • Possible error recovery actions: • Deleting or Inserting Input Characters • Replacing or Transposing Characters • Or, skip over to next separator to ignore problem
Basic Scanning technique • Use 1 character of look-ahead • Obtain char with getc() • Do a case analysis • Based on lookahead char • Based on current lexeme • Outcome • If char can extend lexeme, all is well, go on. • If char cannot extend lexeme: • Figure out what the complete lexeme is and return its token • Put the lookahead back into the symbol stream
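The technique above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `LookaheadScanner`, the token spellings `INT(...)`, `ID(...)`, `ERROR(...)`, and the choice of two token classes are all hypothetical, and `PushbackReader` stands in for `getc()` plus "put the lookahead back".

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LookaheadScanner {
    // Hypothetical helper: splits the input into INT and ID tokens,
    // skipping white space, using one character of lookahead.
    static List<String> scan(String input) {
        PushbackReader in = new PushbackReader(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) continue;      // remove white space
                StringBuilder lexeme = new StringBuilder();
                if (Character.isDigit(c)) {
                    // The lookahead extends the lexeme as long as it is a digit.
                    while (c != -1 && Character.isDigit(c)) {
                        lexeme.append((char) c);
                        c = in.read();
                    }
                    if (c != -1) in.unread(c);                // put the lookahead back
                    tokens.add("INT(" + lexeme + ")");
                } else if (Character.isLetter(c)) {
                    while (c != -1 && Character.isLetterOrDigit(c)) {
                        lexeme.append((char) c);
                        c = in.read();
                    }
                    if (c != -1) in.unread(c);
                    tokens.add("ID(" + lexeme + ")");
                } else {
                    // No pattern starts with this character: a lexical error.
                    tokens.add("ERROR(" + (char) c + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

Calling `LookaheadScanner.scan("x1 42+y")` yields the tokens `ID(x1)`, `INT(42)`, `ERROR(+)`, `ID(y)`: the `+` is reported as an error only because this sketch defines no operator tokens.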
Language Concepts • A language, L, is simply any set of strings over a fixed alphabet. • Alphabet {0,1}: languages {0,10,100,1000,10000,…} and {0,1,100,000,111,…} • Alphabet {a,b,c}: language {abc,aabbcc,aaabbbccc,…} • Alphabet {A…Z}: languages {TEE,FORE,BALL…} and {FOR,WHILE,GOTO…} • Alphabet {A…Z,a…z,0…9,+,-,…,<,>,…}: languages {All legal PASCAL progs} and {All grammatically correct English Sentences} • Special Languages: • ∅, the empty language (contains no strings) • {ε}, the language that contains only the empty string ε
Regular Languages • Many of the examples above are • Quite expressive • Simple languages • But also... • Belong to a special class: regular languages (not all of them do: {abc,aabbcc,aaabbbccc,…} and the set of legal PASCAL programs are not regular) • A Regular Expression is a set of rules for constructing sequences of symbols (strings) from an alphabet. • Let Σ be an alphabet and r a regular expression. Then L(r) is the language characterized by the rules of r
Rules • Fix an alphabet Σ • ε is a regular expression denoting {ε} • If a is in Σ, a is a regular expression that denotes {a} • Let r and s be R.E. for L(r) and L(s). Then • (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s) • (b) (r)(s) is a regular expression denoting L(r) L(s) • (c) (r)* is a regular expression denoting (L(r))* • (d) (r) is a regular expression denoting L(r) • All are left-associative. • Parentheses are dropped as allowed by precedence. • Precedence (highest first): *, concatenation, |
More Examples • All Strings that start with “tab” or end with “bat”: tab{A,…,Z,a,...,z}*|{A,…,Z,a,....,z}*bat • All Strings in Which {1,2,3} exist in ascending order: {A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*
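As a quick check of the first example, here is a small sketch using Java's `java.util.regex`. The class name `TabBatPattern` is hypothetical, and the set notation {A,…,Z,a,...,z} is written as the character class `[A-Za-z]`, which is an assumption about how the slide's notation maps to Java syntax.

```java
import java.util.regex.Pattern;

class TabBatPattern {
    // "tab" followed by letters, OR letters followed by "bat".
    static final Pattern TAB_OR_BAT =
        Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");

    // matches() requires the whole string to belong to the language.
    static boolean matches(String s) {
        return TAB_OR_BAT.matcher(s).matches();
    }
}
```

With this pattern, `table`, `combat`, and `tabbat` are in the language, while `bath` is not.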
Tokens as R.E. • [slide figure not recovered; it used the “+” and “?” operators]
Tokens as Patterns • Patterns are ??? • Tokens are ???
Throw Away Tokens • Fact • Some languages define tokens that the parser never needs • Example: C • White space, tabs, carriage returns, and comments can be discarded without affecting the program’s meaning.
Automaton • A tool to specify a token
What about keywords ? • Easy! • Use the “Identifier” token • After a match, lookup the keyword table • If found, return a token for the matched keyword • If not, return a token for the true identifier
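A minimal sketch of the keyword-table lookup, assuming a hypothetical table with three keywords and token names spelled as plain strings (a real scanner would list every reserved word of its language and return proper token objects):

```java
import java.util.Map;

class KeywordLookup {
    // Hypothetical keyword table: lexeme -> keyword token name.
    static final Map<String, String> KEYWORDS =
        Map.of("if", "IF", "while", "WHILE", "return", "RETURN");

    // Called after the "Identifier" pattern has matched `lexeme`.
    static String classify(String lexeme) {
        // If found, return the keyword token; if not, it is a true identifier.
        return KEYWORDS.getOrDefault(lexeme, "IDENTIFIER");
    }
}
```

Here `classify("while")` gives `WHILE`, whereas the misspelling `fi` is simply classified as `IDENTIFIER`, which is why `fi(a==f(x))` produces no lexical error.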
Yes... But how to scan? • Remember the algorithm? • Acquire 1 character of lookahead • Case analysis based • On lookahead • On state of automaton
Scanner code
class Scanner {
    InputStream _in;
    int _state;        // current automaton state (declaration assumed; not on the slide)
    char _la;          // the lookahead character
    char[] _window;    // lexeme window
    Token nextToken() {
        startLexeme(); // reset window at start
        _state = 0;    // assumed: every token starts in state 0
        while (true) {
            switch (_state) {
            case 0:
                _la = getChar();
                if (_la == '<') _state = 1;
                else if (_la == '=') _state = 5;
                else if (_la == '>') _state = 6;
                else failure(_state);
                break;
            // states 1 to 5 are not shown on the slide
            case 6:
                _la = getChar();
                if (_la == '=') _state = 7;
                else _state = 8;
                break;
            case 7:
                return new Token(GEQUAL);
            case 8:
                pushBack(_la);  // the lookahead is not part of ">"
                return new Token(GREATER);
            }
        }
    }
}
Handling Failures • Meaning • The automaton for this token failed • Solution • If another automaton is available • “Rewind” the input to the beginning of the last lexeme • Jump to the start state of the next automaton • Start recognizing again • If no other automaton • This is a true lexical error. • Discard lexeme (or at least first char of lexeme) • Start from state 0 again
Overview • Basic Concepts • Regular Expressions • Language • Lexical analysis by hand • Regular Languages Tools • NFA / DFA • Scanning with DFAs • Scanning tools • Lex / Flex / JFlex
Automata & Language Theory • Terminology • FSA • A recognizer that takes an input string and determines whether it’s a valid string of the language. • Non-Deterministic FSA (NFA) • Has several alternative actions for the same input symbol • Deterministic FSA (DFA) • Has at most 1 action for any given input symbol • Bottom Line • expressive power(NFA) == expressive power(DFA) • Conversion can be automated
NFA • An NFA is a mathematical model that consists of: • S, a set of states • Σ, the symbols of the input alphabet • move, a transition function • move(state, symbol) → set of states • move : S × (Σ ∪ {ε}) → Pow(S) • A state s0 ∈ S, the start state • F ⊆ S, a set of final or accepting states.
Representing NFA • Transition Diagrams: number states (circles), arcs, final states, … • Transition Tables: more suitable to representation within a computer • We’ll see examples of both!
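Example NFA • [Figure: transition diagram with states 0, 1, 2, 3, start state 0, arcs labeled a and b] • S = { 0, 1, 2, 3 } • s0 = 0 • F = { 3 } • Σ = { a, b } • What language is defined? • What is the transition table? • Transition table (state: on input a, on input b) • 0: { 0, 1 }, { 0 } • 1: --, { 2 } • 2: --, { 3 } • ε (null) moves are possible: an ε arc from state i to state j switches state but does not use any input symbol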
Epsilon-Transitions • Given the regular expression : (a (b*c)) | (a (b | c+)?) • Find a transition diagram NFA that recognizes it. • Solution ?
NFA Construction • Automatic construction example • a(b*c) • a(b|c+)? • Build a disjunction
Working NFA • [Figure: the example NFA with states 0, 1, 2, 3] • Given an input string, we trace moves • If no more input & in final state, ACCEPT • EXAMPLE: Input: ababb • move(0, a) = 0, move(0, b) = 0, move(0, a) = 1, move(1, b) = 2, move(2, b) = 3: ACCEPT! • -OR- • move(0, a) = 1, move(1, b) = 2, move(2, a) = ? (undefined): REJECT!
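One way to trace every choice at once is to keep the set of states the NFA could be in after each symbol. The sketch below does this for the example NFA with S = { 0, 1, 2, 3 }, s0 = 0, F = { 3 }. The class and method names are hypothetical, and the transitions are hard-coded from the example's transition table (no ε-moves are modeled here).

```java
import java.util.HashSet;
import java.util.Set;

class NfaSimulation {
    // move(state, symbol) for the example NFA.
    static Set<Integer> move(int state, char symbol) {
        if (state == 0 && symbol == 'a') return Set.of(0, 1);
        if (state == 0 && symbol == 'b') return Set.of(0);
        if (state == 1 && symbol == 'b') return Set.of(2);
        if (state == 2 && symbol == 'b') return Set.of(3);
        return Set.of();                 // undefined transition
    }

    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>(Set.of(0));   // start in s0
        for (char symbol : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                next.addAll(move(state, symbol));          // follow every alternative
            }
            current = next;
        }
        return current.contains(3);      // ACCEPT if some path ends in F
    }
}
```

For the input `ababb` the tracked sets are {0}, {0,1}, {0,2}, {0,1}, {0,2}, {0,3}, so the string is accepted regardless of which single path one might have guessed.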
Handling Undefined Transitions • [Figure: the example NFA extended with a state 4 that loops on every symbol of Σ] • We can handle undefined transitions by defining one more state, a “death” state, and redirecting all previously undefined transitions to this death state.
Worse still... • [Figure: the example NFA with states 0, 1, 2, 3] • Not all paths result in acceptance! • aabb is accepted along the path: 0 → 0 → 1 → 2 → 3 • BUT… it is not accepted along the equally valid path: 0 → 0 → 0 → 0 → 0
The NFA “Problem” • Two problems • Valid input may not be accepted • Non-deterministic behavior from run to run... • Solution?
The DFA Saves the Day • A DFA is an NFA with a few restrictions • No epsilon transitions • For every state s, there is only one transition (s,x) from s for any symbol x in Σ • Corollaries • Easy to implement a DFA with an algorithm! • Deterministic behavior
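To illustrate why a DFA is easy to implement, here is a table-driven sketch. The four-state table is an assumption: it is a hand-derived DFA for strings over { a, b } that end in abb (the language of the example NFA), not a table taken from the slides, and the names are hypothetical.

```java
class DfaSimulation {
    // NEXT[state][symbol]: column 0 is 'a', column 1 is 'b'.
    static final int[][] NEXT = {
        {1, 0},   // state 0: nothing useful seen yet
        {1, 2},   // state 1: input so far ends in "a"
        {1, 3},   // state 2: input so far ends in "ab"
        {1, 0},   // state 3: input so far ends in "abb" (accepting)
    };

    static boolean accepts(String input) {
        int state = 0;
        for (char symbol : input.toCharArray()) {
            if (symbol != 'a' && symbol != 'b') return false;  // not in Σ
            state = NEXT[state][symbol - 'a'];                 // exactly one move
        }
        return state == 3;
    }
}
```

Each input symbol costs one array lookup, and the same input always follows the same path, which is the deterministic behavior the slide refers to.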
NFA vs. DFA • NFA • Smaller number of states |Qnfa| • Simulating it requires O(|Qnfa|) computation for each input symbol • DFA • Larger number of states |Qdfa| • Simulating it requires constant computation for each input symbol • Caveat: with the generic NFA ⇒ DFA construction, |Qdfa| can grow to about 2^|Qnfa| • But: DFAs can be minimized (i.e., you can find the smallest possible |Qdfa|)
One catch... • NFA-DFA comparison
NFA to DFA Conversion • Idea • Look at the states reachable without consuming any input • Aggregate them in macro states
Final Result • A macro state is final • IFF at least one of the NFA states it contains is final
Preliminary Definitions • NFA N = ( S, Σ, s0, F, MOVE ) • ε-Closure(s) : s ∈ S • Set of states in S that are reachable from s via ε-moves of N that originate from s. • ε-Closure(T) : T ⊆ S • NFA states reachable from some t ∈ T on ε-moves only. • move(T,a) : T ⊆ S, a ∈ Σ • Set of states to which there is a transition on input a from some t ∈ T
Algorithm computing the ε-closure
forall (t in T) push(t);
initialize ε-closure(T) to T;
while stack is not empty do begin
    t = pop();
    for each u ∈ S with edge t→u labeled ε
        if u is not in ε-closure(T)
            add u to ε-closure(T);
            push u onto stack
end
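The stack-based algorithm translates almost line for line into Java. In this sketch the representation of ε-edges as a `Map` from a state to the states one ε-move away is an assumption made for illustration, as are the class and parameter names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EpsilonClosure {
    // epsilonEdges.get(t) lists every u with an edge t→u labeled ε.
    static Set<Integer> closure(Set<Integer> T,
                                Map<Integer, List<Integer>> epsilonEdges) {
        Set<Integer> result = new HashSet<>(T);       // initialize ε-closure(T) to T
        Deque<Integer> stack = new ArrayDeque<>(T);   // push every t in T
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : epsilonEdges.getOrDefault(t, List.of())) {
                // add() returns true only if u was not yet in the closure.
                if (result.add(u)) {
                    stack.push(u);
                }
            }
        }
        return result;
    }
}
```

Because a state is pushed only when it is first added, the loop terminates even when the ε-edges form a cycle.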
DFA construction • Computes the set of DFA states (D) and the transitions (T)
let Q = ε-closure(s0);
D = { Q };
enQueue(Q)
while queue not empty do
    X = deQueue();
    for each a ∈ Σ do
        Y = ε-closure(move(X,a));
        T[X,a] = Y
        if Y is not in D
            D = D ∪ { Y }
            enQueue(Y);
    end
end
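A compact Java sketch of this construction follows. Everything about the representation is an assumption for illustration: the NFA is a nested `Map` (state, then symbol, then target states), ε-edges are stored under the marker character `EPS`, and DFA states are the sets of NFA states themselves. The keys of the returned table play the role of D.

```java
import java.util.*;

class SubsetConstruction {
    static final char EPS = '\u03b5';   // hypothetical marker for ε-edges

    static Set<Integer> edges(Map<Integer, Map<Character, Set<Integer>>> nfa,
                              int s, char c) {
        return nfa.getOrDefault(s, Map.of()).getOrDefault(c, Set.of());
    }

    static Set<Integer> closure(Set<Integer> T,
                                Map<Integer, Map<Character, Set<Integer>>> nfa) {
        Set<Integer> result = new TreeSet<>(T);
        Deque<Integer> stack = new ArrayDeque<>(T);
        while (!stack.isEmpty()) {
            for (int u : edges(nfa, stack.pop(), EPS)) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    static Set<Integer> move(Set<Integer> X, char a,
                             Map<Integer, Map<Character, Set<Integer>>> nfa) {
        Set<Integer> result = new TreeSet<>();
        for (int s : X) result.addAll(edges(nfa, s, a));
        return result;
    }

    // Returns T, where T.get(X).get(a) is the macro state Y reached from X on a.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int s0, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> table = new LinkedHashMap<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> start = closure(Set.of(s0), nfa);
        table.put(start, new HashMap<>());
        queue.add(start);
        while (!queue.isEmpty()) {
            Set<Integer> X = queue.remove();
            for (char a : alphabet.toCharArray()) {
                Set<Integer> Y = closure(move(X, a, nfa), nfa);
                table.get(X).put(a, Y);
                if (!table.containsKey(Y)) {      // Y is not in D yet
                    table.put(Y, new HashMap<>());
                    queue.add(Y);
                }
            }
        }
        return table;
    }
}
```

Applied to the example NFA (0 goes to {0,1} on a and to {0} on b, 1 goes to {2} on b, 2 goes to {3} on b) with alphabet "ab", this produces the four macro states {0}, {0,1}, {0,2}, and {0,3}. When move(X,a) is empty, the empty set itself becomes a macro state, which behaves like a death state.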
Summary • We can • Specify tokens with R.E. • Use DFA to scan an input and recognize token • Transform an NFA into a DFA automatically • What we are missing • A way to transform an R.E. into an NFA • Then, we will have a complete solution • Build a big R.E. • Turn the R.E. into an NFA • Turn the NFA into a DFA • Scan with the obtained DFA
R.E. To NFA • Process • Inductive definition • Use the structure of the R.E. • Use atomic automata for atomic R.E. • Use composition rules for each R.E. operator • Recall • RE ::= ε • ::= s in Σ • ::= rs • ::= r | s • ::= r*
Epsilon Construction • RE ::= ε