Scanning, or Lexical Analysis
• Regular Grammars
  • Non-terminals (arbitrary names)
  • Terminals (characters)
  • Productions limited to the following forms:
    • Non-terminal ::= terminal
    • Non-terminal ::= terminal Non-terminal
  • Treat a character class (e.g. digit) as a terminal
• Regular grammars cannot count: size limits on identifiers and literals are impractical to express (each count needs its own non-terminal)
• Regular grammars cannot express proper nesting (parentheses)
Department of Software & Media Technology
Regular Grammars
• Grammar for real literals with no exponent:
  • digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • REALVAL ::= digit REALVAL1
  • REALVAL1 ::= digit REALVAL1 (arbitrary size)
  • REALVAL1 ::= . INTEGERVAL
  • INTEGERVAL ::= digit INTEGERVAL (arbitrary size)
  • INTEGERVAL ::= digit
• Start symbol is ?
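The grammar above can be transcribed almost rule for rule into a recognizer, with one function per non-terminal. This is a minimal sketch, assuming REALVAL is the start symbol; the function names and the DIGITS set are illustrative and not part of the slides.

```python
DIGITS = set("0123456789")  # the character class "digit", treated as one terminal


def integerval(s: str, i: int) -> bool:
    # INTEGERVAL ::= digit INTEGERVAL | digit
    if i < len(s) and s[i] in DIGITS:
        # Either this digit is the last character, or more INTEGERVAL follows.
        return i + 1 == len(s) or integerval(s, i + 1)
    return False


def realval1(s: str, i: int) -> bool:
    # REALVAL1 ::= digit REALVAL1 | . INTEGERVAL
    if i >= len(s):
        return False
    if s[i] in DIGITS:
        return realval1(s, i + 1)
    if s[i] == ".":
        return integerval(s, i + 1)
    return False


def realval(s: str) -> bool:
    # REALVAL ::= digit REALVAL1
    return len(s) > 0 and s[0] in DIGITS and realval1(s, 1)
```

Under these assumptions, realval("3.14") holds, while "3.", ".5" and "12" are rejected because the grammar requires at least one digit on each side of the point.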
Regular Expressions
• REs are defined by an alphabet (terminal symbols) and three operations:
  • Alternation: RE1 | RE2
  • Concatenation: RE1 RE2
  • Repetition: RE* (zero or more occurrences of RE)
• The languages described by REs are exactly the languages generated by regular grammars
• Regular expressions are more convenient for some applications
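The three operations can be tried out with Python's standard re module. This sketch builds an RE for real literals with no exponent using only alternation, concatenation and repetition; the names DIGIT, REAL and is_real are illustrative assumptions.

```python
import re

# Alternation: a digit is any one of ten characters.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# Concatenation and repetition: digit digit* . digit digit*
# (the point is escaped because "." is a metacharacter in Python's syntax).
REAL = DIGIT + DIGIT + "*" + r"\." + DIGIT + DIGIT + "*"


def is_real(text: str) -> bool:
    # fullmatch requires the whole string to belong to the language.
    return re.fullmatch(REAL, text) is not None
```

Practical RE notations add shorthands such as [0-9] and +, but these do not enlarge the set of languages that can be described.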
Finite State Machines or Finite Automata (FSM or FA)
• A language defined by a grammar is a (possibly infinite) set of strings
• An automaton is a computation that determines whether a given string belongs to a specified language
• A finite state machine (FSM) is an automaton that recognizes regular languages (those described by regular expressions)
• Simplest automaton: its memory is a single number (the state)
Specifying a Finite State Machine (FSM)
• A set of labeled states, with directed arcs between states, each labeled with a character
• One or more states may be terminal (accepting)
• Start is a distinguished state
• The automaton makes a transition from state S1 to S2 if and only if the arc from S1 to S2 is labeled with the next character in the input
• A token is legal if the automaton stops in a terminal state
FA from Grammar
• One state for each non-terminal
• A rule of the form Nt1 ::= terminal generates a transition from the state for Nt1 to a final state
• A rule of the form Nt1 ::= terminal Nt2 generates a transition from the state for Nt1 to the state for Nt2 on an arc labeled by the terminal
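The construction can be sketched in a few lines, assuming productions are written as (non-terminal, terminal, next non-terminal or None) triples. The names FINAL, build_fa, classify and accepts are illustrative. Because INTEGERVAL has two rules that start with digit, the resulting machine is non-deterministic, so the sketch tracks a set of current states.

```python
FINAL = "<final>"  # the extra final state used by rules Nt ::= terminal

# The real-literal grammar: (left-hand side, terminal, right-hand non-terminal).
PRODUCTIONS = [
    ("REALVAL", "digit", "REALVAL1"),
    ("REALVAL1", "digit", "REALVAL1"),
    ("REALVAL1", ".", "INTEGERVAL"),
    ("INTEGERVAL", "digit", "INTEGERVAL"),
    ("INTEGERVAL", "digit", None),
]


def build_fa(productions):
    """Map (state, terminal) to the set of states reachable on that arc."""
    arcs = {}
    for lhs, terminal, rhs in productions:
        target = FINAL if rhs is None else rhs
        arcs.setdefault((lhs, terminal), set()).add(target)
    return arcs


def classify(ch):
    # Treat the character class "digit" as a single terminal.
    return "digit" if ch in "0123456789" else ch


def accepts(arcs, start, text):
    current = {start}
    for ch in text:
        label = classify(ch)
        # Follow every arc labeled with the next input character.
        current = set().union(*(arcs.get((s, label), set()) for s in current))
    return FINAL in current
```

For example, accepts(build_fa(PRODUCTIONS), "REALVAL", "12.5") holds, while "12." is rejected because no digit follows the point.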
Graphic representation of FA
[Figure: state diagram with nodes S and identifier, and arcs labeled letter, digit, and underscore]
FA from RE
• Each RE corresponds to a grammar
• For every RE, a natural translation to an FSM exists
• Alternation often leads to non-deterministic machines
Deterministic Finite Automata (DFA)
• For every state S and every character C, there is at most one arc leaving S that is labeled with C
• Easier to implement
• No backtracking
Conventions for DFA:
• Error transitions are not explicitly shown
• Input symbols that result in the same transition are grouped together (this set can even be given a name)
• Still not displayed: stopping conditions and actions
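A table-driven sketch shows why a DFA is easy to implement: each (state, symbol) pair has at most one entry, and a missing entry plays the role of the error transitions that the diagrams leave out. The DFA below, for real literals with no exponent, and its state names are illustrative assumptions.

```python
# At most one target per (state, symbol): this is what makes the table a DFA.
DFA = {
    ("start", "digit"): "whole",
    ("whole", "digit"): "whole",
    ("whole", "."): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"frac"}


def char_class(ch):
    # Symbols with the same transitions are grouped under one name.
    return "digit" if ch in "0123456789" else ch


def accepts(text):
    state = "start"
    for ch in text:
        state = DFA.get((state, char_class(ch)))
        if state is None:  # implicit error transition
            return False
    return state in ACCEPTING
```

The loop never revisits an input character, which is the "no backtracking" property in practice.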
Non-Deterministic Finite Automata (NFA)
• A non-deterministic FA has at least one state with two arcs, to two distinct states, labeled with the same character
• Example: from the start state, a digit can begin an integer literal or a real literal
• A direct implementation requires backtracking
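The integer-or-real example can be sketched as a backtracking recognizer. The NFA below is a hypothetical one written for this illustration (state names and token names are assumptions): from start, a digit leads both to an integer branch and to a real branch, and the recognizer tries each arc in turn.

```python
# (state, symbol) maps to a list of possible next states.
NFA = {
    ("start", "digit"): ["int", "real_whole"],  # the non-deterministic choice
    ("int", "digit"): ["int"],
    ("real_whole", "digit"): ["real_whole"],
    ("real_whole", "."): ["after_dot"],
    ("after_dot", "digit"): ["frac"],
    ("frac", "digit"): ["frac"],
}
ACCEPTING = {"int": "INT", "frac": "REAL"}  # accepting state -> token kind


def label(ch):
    return "digit" if ch in "0123456789" else ch


def recognize(text, state="start", pos=0):
    """Return the token kind for text, or None if no path accepts it."""
    if pos == len(text):
        return ACCEPTING.get(state)
    for nxt in NFA.get((state, label(text[pos])), []):
        token = recognize(text, nxt, pos + 1)
        if token is not None:
            return token
        # This choice led to a dead end: back up and try the next arc.
    return None
```

On "4.2" the integer branch is tried first, fails at the point, and the recognizer backs up to take the real branch instead.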
Lookahead & Backtracking in NFA
[Figure: automaton with states start, in_id, finish; arc labels letter, digit, [other]; action: return id]
Implementation of FA
[Figure: automaton with states start, in_id, finish; arc labels letter, digit, [other]; action: return id]
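One common way to implement such an automaton is a loop over an explicit state variable. The sketch below assumes the usual reading of the figure: a letter moves start to in_id, letters and digits loop on in_id, and any [other] character moves to finish without being consumed. The function name, the use of str.isalpha for "letter", and the empty string as an end-of-input marker are assumptions.

```python
def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier at pos, or None."""
    state = "start"
    begin = pos
    while state != "finish":
        # "" stands for end of input; it is neither a letter nor a digit.
        ch = text[pos] if pos < len(text) else ""
        if state == "start":
            if ch.isalpha():
                state = "in_id"
                pos += 1
            else:
                return None  # error transition: no identifier starts here
        elif state == "in_id":
            if ch.isalpha() or ch.isdigit():
                pos += 1  # stay in in_id and consume the character
            else:
                # [other]: lookahead only, so the character is not consumed
                state = "finish"
    return text[begin:pos], pos
```

Returning the position alongside the lexeme lets the caller resume scanning at the lookahead character.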
From RE to DFA & RE to NFA
[Figure: automaton with states start, in_id, finish; arc labels letter, digit, [other]; action: return id]
NFA to DFA
• There is an algorithm for converting a non-deterministic machine to a deterministic one
• The result may have exponentially more states
• Intuitively: new states are needed to express uncertainty about the token (int or real)
• Other algorithms exist for minimizing the number of states of an FSM, for showing equivalence, etc.
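The conversion is usually done with the subset construction, in which each DFA state is a set of NFA states. The sketch below assumes an NFA without empty (epsilon) transitions; the example NFA for integer and real literals and all names are illustrative.

```python
from collections import deque

NFA = {
    ("start", "digit"): {"int", "real_whole"},
    ("int", "digit"): {"int"},
    ("real_whole", "digit"): {"real_whole"},
    ("real_whole", "."): {"after_dot"},
    ("after_dot", "digit"): {"frac"},
    ("frac", "digit"): {"frac"},
}
NFA_ACCEPTING = {"int", "frac"}


def nfa_to_dfa(nfa, start, accepting):
    start_set = frozenset([start])
    dfa, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:  # accepting if any member state accepts
            dfa_accepting.add(current)
        labels = {lab for (state, lab) in nfa if state in current}
        for lab in labels:
            # The DFA target is the union of all NFA targets on this label.
            target = frozenset(t for s in current for t in nfa.get((s, lab), ()))
            dfa[(current, lab)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, start_set, dfa_accepting


def dfa_accepts(dfa, start_set, dfa_accepting, labels):
    state = start_set
    for lab in labels:
        if (state, lab) not in dfa:
            return False
        state = dfa[(state, lab)]
    return state in dfa_accepting
```

Here the combined state {int, real_whole} is the "uncertain" state: after one or more digits the machine does not yet know whether the token is an integer or a real. This example yields only four DFA states; the exponential growth is a worst case.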
Example DFA
Another view of the same DFA
Yet another view of the same DFA
State Minimization in DFA
TINY DFA:
Lex for Scanner
• Lex Conventions for RE
• Format of a Lex Input File