Lexical Analysis - Scanner • The lexical analyzer reads source text and produces tokens, which are the basic lexical units of the language. • Example: System.out.println("Hello Class"); has the tokens System, dot, out, dot, println, left parenthesis, the string Hello Class, right parenthesis, and semicolon.
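The token split for the example statement can be reproduced with a few lines of Python. This is only a sketch: the token class names (IDENT, DOT, STRING, and so on) are made up for illustration, and a real Java scanner recognizes many more token kinds.

```python
import re

# Hypothetical token classes, just enough for the example statement.
TOKEN_SPEC = [
    ("STRING", r'"[^"\n]*"'),
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("DOT", r"\."),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI", r";"),
    ("WS", r"\s+"),
]
# One master pattern with a named group per token class.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, skipping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for kind, lexeme in tokenize('System.out.println("Hello Class");'):
    print(kind, lexeme)
```

The loop looks at every character exactly once and throws white space away, which is the scanner's job as described on the slides.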
Role of Lexical Analyzer/Scanner (1) • Lexical Analyzer also keeps track of the source-coordinates of each token - which file name, line number and position. This is useful for debugging purposes. • Lexical Analyzer is the only part of a compiler that looks at each character of the source text.
Role of Lexical Analyzer/Scanner (2) • The scanner also removes white space • White space: characters that are ignored between tokens • Ex: spaces, tabs, newlines • Definitions of tokens and white space vary among languages • The lexical analyzer also filters out comments
Why is the scanner a separate phase? • Simpler design, based on: • Token description using regular grammars or regular definitions • Elimination of white space and comments (what if they were handled by the parser?) • Distinguishing between keywords and identifiers • Faster: lexical analysis is time-consuming in many compilers (40-60% ?); restricting the job of the lexical analyzer usually makes a faster implementation feasible, using buffering techniques • Portability is enhanced: the representation of non-standard symbols can be isolated in this phase
Difficulties of lexical analysis • Insignificant blanks in statements, for example in FORTRAN: • DO 5 I = 1.25 Is it a DO loop or an assignment? (With blanks ignored it is an assignment to the variable DO5I; DO 5 I = 1,25 would be a DO loop.) • Fixed-format source lines, as in COBOL; now almost all programming languages are free-format. • Keywords used as identifiers, for example in PL/I: if then then then else else = then
Lexical Errors • Lexical errors are violations of token definitions, for example using a$ as an identifier in languages other than BASIC. • Not all errors are detected by the scanner: • In fi (x), the scanner cannot tell whether fi is a misspelled if or a function name. • There are different strategies for handling lexical errors: • Deleting characters until a token is recognized • Inserting a missing character • Replacing an incorrect character with a correct one
Recognizing Tokens • How are tokens defined and recognized? By using regular expressions, which define each token as a formal regular language. • What are regular expressions and how are they defined? There are simple and structured regular expressions. Each regular expression E denotes a language L(E). What is a language? • Let us look at some basic definitions.
Formal Language concepts • Alphabet - a finite set of symbols, ASCII is a computer alphabet. • String - finite sequence of symbols from the alphabet. • Empty string = special string of length 0 • Language = set of strings over a given alphabet (e.g., set of all programs)
Language operations • Let L1 and L2 be two languages over some alphabet Σ. The language operations are: • L1|L2 (union) = {s | s ∈ L1 or s ∈ L2} • L1L2 (concatenation) = {st | s ∈ L1 and t ∈ L2}. We use the notation L^2 to denote LL and L^0 to denote {ε}, the language whose only string is the empty string. • L* = L^0 ∪ L ∪ L^2 ∪ … ∪ L^i ∪ … • L+ = L ∪ L^2 ∪ … ∪ L^i ∪ …
Regular Expressions • 1. Simple regular expressions: • An alphabet symbol a is a regular expression that denotes {a}. • The empty symbol ε is also a regular expression; it denotes {ε}. • 2. Structured regular expressions: if E1 and E2 are regular expressions denoting languages L(E1) and L(E2), then • E1 | E2 is a regular expression denoting L(E1) ∪ L(E2). • E1 E2 is a regular expression denoting the concatenation L(E1)L(E2). • E* (E star) is a regular expression denoting L(E*) = the Kleene closure of L(E).
Names of regular expressions • A regular expression can be given a name. For example: • If the alphabet Σ = {0, 1, 2, …, 9}, then each digit is a RE. • To specify any digit we write 0|1|…|9, or we can give it a name: digit → 0|1|…|9, so digit can be used as a RE. • Also two_digits → digit digit, and so on.
Examples • Specify the set of unsigned numbers as a regular expression. Examples: 1997, 19.97 • Solution: note the use of regular definitions as intermediate names for regular subexpressions. • digit → 0 | 1 | 2 | 3 | … | 9 • digits → digit digit* (often written as digit+). This is the positive closure: one or more digits.
Example Contd • optional_fraction → . digits | ε • num → digits optional_fraction • Note that we have used all the definitions of a regular expression. • One can define similar regular expressions for identifiers, comments, strings, operators and delimiters. • Question: How to write a regular expression for identifiers? (An identifier is a letter followed by zero or more letters or digits.)
Identifiers contd • letter → a|A|b|B| … |z|Z • digit → 0|1|2| … |9 • letter_or_digit → letter | digit • identifier → letter letter_or_digit*
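For finite languages the operations can be tried out directly with Python sets of strings. This is a sketch only: L* is infinite whenever L contains a non-empty string, so the hypothetical helper star_up_to stops at a chosen power n.

```python
from itertools import product


def union(l1, l2):
    """L1 | L2: every string that is in L1 or in L2."""
    return l1 | l2


def concat(l1, l2):
    """L1L2: every string st with s in L1 and t in L2."""
    return {s + t for s, t in product(l1, l2)}


def power(lang, n):
    """L^n, where L^0 is the language containing only the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def star_up_to(lang, n):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(lang, i)
    return out


L1 = {"a", "ab"}
L2 = {"b", ""}
print(sorted(union(L1, L2)))
print(sorted(concat(L1, L2)))
print(sorted(star_up_to({"a"}, 3)))
```

Here the empty Python string "" plays the role of the empty string, and {""} plays the role of L^0.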
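The regular definitions for unsigned numbers and identifiers translate almost name for name into Python's re syntax. A minimal sketch, assuming ASCII letters and digits only; the constant names simply mirror the names on the slides.

```python
import re

# Regular definitions, built up from named subexpressions.
DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"                      # digit digit*
OPTIONAL_FRACTION = rf"(?:\.{DIGITS})?"    # . digits | epsilon
NUM = re.compile(rf"{DIGITS}{OPTIONAL_FRACTION}")

LETTER = r"[A-Za-z]"
LETTER_OR_DIGIT = r"[A-Za-z0-9]"           # letter | digit
IDENTIFIER = re.compile(rf"{LETTER}{LETTER_OR_DIGIT}*")

for text in ["1997", "19.97", "19.", "x1", "1x"]:
    print(text,
          "num" if NUM.fullmatch(text) else "-",
          "identifier" if IDENTIFIER.fullmatch(text) else "-")
```

fullmatch is used so that the whole string must belong to the language, not just a prefix of it.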
Building a recognizer A General Approach • Build Nondeterministic Finite Automaton (NFA) from Regular Expression E. • Simulate execution of NFA to determine whether an input string belongs to L(E). • The simulation can be much simplified if you convert your NFA to Deterministic Finite Automaton (DFA).
NFA Definition • An NFA is a system that consists of five components: • An input alphabet Σ. • A set of states S. • A designated state i, the start state. • A subset F of S containing the final states. • A transition function δ: (Σ ∪ {ε}) × S → 2^S, which maps a symbol (or ε) and a state to a subset of S.
NFA • A transition graph represents an NFA. • Nodes represent states. There is a distinguished start state and one or more final states. • Edges represent state transitions. • An edge can be labeled by an alphabet symbol or by the empty symbol ε.
NFA cont. • From a state (node), there may be more than one edge labeled with the same alphabet symbol, and there may be no edge labeled with a given input symbol. • An NFA accepts an input string iff (if and only if) there is a path in the transition graph from the start node to some final state such that the labels along the edges spell out the input string.
Nondeterministic automata [Figure: NFA over {0, 1} with states s0, s1, s2, s3; s0 loops on 0,1; s0 → s1 on 1; s1 → s2 on 0,1; s2 → s3 on 0,1.] Allows more than one transition from a state with the same label. There does not have to be a transition from every state with every label. Allows multiple initial states. Allows ε transitions.
Nondeterministic runs [Figure: the same NFA over states s0, s1, s2, s3.] Input: 0100. Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept. Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts. The automaton accepts an input when there exists at least one accepting run.
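Instead of guessing one run at a time, a program can follow all runs at once by tracking the set of states the NFA could be in. The sketch below does that for the four-state example; the transition table is inferred from the runs shown above (s0 loops on 0 and 1, s0 goes to s1 on 1, s1 to s2 and s2 to s3 on either symbol), so treat it as an assumption about the figure.

```python
# Transition relation of the example NFA: (state, symbol) -> set of next states.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
START = "s0"
FINAL = {"s3"}


def accepts(word):
    """True iff at least one run over `word` ends in a final state."""
    current = {START}
    for ch in word:
        # Follow every possible transition from every current state.
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
    return bool(current & FINAL)


print(accepts("0100"))  # Run 2 above is an accepting run
print(accepts("0000"))
```

A missing table entry simply contributes no successor states, which matches the rule that not every state needs a transition for every label.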
Deterministic Finite Automaton (DFA) A finite automaton is deterministic if • It has no edges/transitions labeled with epsilon. • For each state and for each symbol in the alphabet, there is at most one edge leaving that state & labeled with that symbol. Such a transition graph is called a state graph.
DFA’s Cont. • NFAs are quicker to build but slower to simulate. • DFAs are slower to build but quicker to simulate. • The number of states in a DFA may be exponential in the number of states in the equivalent NFA.
Deterministic Finite Automata Includes: States {s1,s2,…,s5}. Initial states {s1}. Accepting states {s3,s5}. Alphabet {a, b, c}. Transitions: {(s1,a,s2), (s2, a, s3), …}. [Figure: transition graph over states s1 to s5 with edges labeled a, b, c.] Deterministic?
Automaton. What is the language? [Figure: two-state automaton with states s0 and s1 and edges labeled a and b.] An input is a word over the alphabet Σ. A run over a word is an alternating sequence of states and letters, starting from the initial state. An accepting run ends in an accepting state.
Example [Figure: two-state automaton with states s0 and s1 and edges labeled a and b.] Input: aabbb. Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts. Input: aba. Run: s0 a s0 b s1 a s0. Does not accept.
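A run of a deterministic automaton is easy to compute because each state has exactly one successor per letter. The sketch below reproduces the two runs above; the transition table and the accepting state s1 are read off those runs, so they are an assumption about the figure rather than a transcription of it.

```python
# Deterministic transitions: (state, letter) -> next state.
DFA = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
START = "s0"
ACCEPTING = {"s1"}


def run(word):
    """Return the run as an alternating list of states and letters."""
    state = START
    trace = [state]
    for ch in word:
        state = DFA[(state, ch)]
        trace += [ch, state]
    return trace


def accepts(word):
    """A word is accepted when its run ends in an accepting state."""
    return run(word)[-1] in ACCEPTING


for w in ["aabbb", "aba"]:
    print(" ".join(run(w)), "->", "accepts" if accepts(w) else "does not accept")
```

With this table the automaton accepts exactly the words that end in b, which answers the "what is the language?" question for this particular reading of the figure.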
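Automaton. What is the language? [Figure: two-state automaton with states s0 and s1 and edges labeled a and b.]
Automaton. What is the language? [Figure: two-state automaton with states s0 and s1 and edges labeled a and b.]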
Identifying tokens [Figure: automaton with branches spelling the keywords IF, THEN and ELSE, and a branch for identifiers: letter followed by letter|digit.]
Determinizing Automata [Figure: the NFA N over states s0, s1, s2, s3.] • Each state of D is a set of states of N. • (S, a) → T when T = {t | s ∈ S and (s, a) → t}. • The initial state of D is the set containing the initial state(s) of N. • The accepting states of D are the sets that include at least one accepting state of N.
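Determinization [Figure: DFA obtained from the NFA over s0, s1, s2, s3. Its states are the sets {s0}, {s0,s1}, {s0,s2}, {s0,s3}, {s0,s1,s2}, {s0,s1,s3}, {s0,s2,s3}, {s0,s1,s2,s3}; edges are labeled 0 and 1.]
Determinization [Figure: the same DFA with its states relabeled as the bit strings 000, 001, 010, 011, 100, 101, 110, 111; edges are labeled 0 and 1.]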
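The determinization rules can be sketched as a worklist algorithm that only builds the subsets actually reachable from the initial set. The NFA table below is the four-state example, inferred from the runs shown earlier; this version has no ε transitions, and the function name determinize is just an illustrative choice.

```python
from collections import deque

NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}


def determinize(nfa, start, finals, alphabet):
    """Subset construction for an NFA without epsilon transitions."""
    start_set = frozenset({start})
    dfa = {}                      # (set of NFA states, symbol) -> set of NFA states
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # T collects every t reachable from some s in S on symbol a.
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    # A DFA state accepts if it contains at least one accepting NFA state.
    accepting = {S for S in seen if S & finals}
    return dfa, start_set, accepting


dfa, start, accepting = determinize(NFA, "s0", {"s3"}, "01")
states = {S for (S, _) in dfa}
print(len(states), "DFA states,", len(accepting), "accepting")
```

For this NFA the construction reaches eight subsets, each containing s0, which agrees with the eight states listed in the determinization figure.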
Examples of Constructing NFA from a R.E • An NFA for a regular expression can be constructed by the following cases: • 1. For a simple regular expression r (an alphabet symbol or epsilon), the NFA is constructed by: • Creating two states, the start state and the final state. • Adding one edge/transition from the start state to the final state, labeled with r.
NFA Cont. 2. For E1E2, given NFA(E1) and NFA(E2) constructed for each expression: • Construct a new start state and a new final state. • From the new start state, add an epsilon transition to the start state of NFA(E1). • From the final state of NFA(E1), add an epsilon transition to the start state of NFA(E2). • From the final state of NFA(E2), add an epsilon transition to the new final state.
NFA Cont. • 3. For E1|E2: • Construct a new start state i and a new final state f. • Add epsilon transitions from the start state i to the start states of NFA(E1) and NFA(E2). • Add epsilon transitions from the final states of NFA(E1) and NFA(E2) to the final state f.
NFA Cont. 4. For E*, • Construct a new start state i and a new final state f. • Add an epsilon transition from the start state i to the start state of NFA(E), an epsilon transition from the final state of NFA(E) to f, and an epsilon transition from the final state f back to the start state i (to allow repetition). • Also add an epsilon transition from i to f, so that the empty string is accepted.
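The four construction cases can be sketched as functions that glue NFA fragments together with epsilon edges. Everything here (the Frag class, the edge-list representation, the function names) is an illustrative assumption. Note that star also adds an epsilon edge from the new start state to the new final state: without it the fragment would require at least one pass through NFA(E) and so would not accept the empty string.

```python
import itertools

EPS = None                      # label used for epsilon edges in this sketch
_ids = itertools.count()        # fresh state numbers


class Frag:
    """An NFA fragment: one start state, one final state, a list of edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges


def symbol(a):                  # case 1: a single alphabet symbol
    s, f = next(_ids), next(_ids)
    return Frag(s, f, [(s, a, f)])


def concat(x, y):               # case 2: E1 E2
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (x.final, EPS, y.start), (y.final, EPS, f)]
    return Frag(s, f, x.edges + y.edges + extra)


def union(x, y):                # case 3: E1 | E2
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (s, EPS, y.start),
             (x.final, EPS, f), (y.final, EPS, f)]
    return Frag(s, f, x.edges + y.edges + extra)


def star(x):                    # case 4: E*
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, x.start), (x.final, EPS, f),
             (f, EPS, s),       # loop back for repetition
             (s, EPS, f)]       # skip edge so the empty string is accepted
    return Frag(s, f, x.edges + extra)


def closure(states, edges):
    """All states reachable from `states` using epsilon edges only."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for (src, label, dst) in edges:
            if src == p and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(frag, word):
    current = closure({frag.start}, frag.edges)
    for ch in word:
        moved = {dst for (src, label, dst) in frag.edges
                 if src in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.final in current


two = concat(union(symbol("a"), symbol("b")), union(symbol("a"), symbol("b")))
print(accepts(two, "ab"), accepts(two, "a"))
```

The fragment built in the last lines corresponds to the expression (a|b).(a|b), which accepts exactly the four two-letter words over {a, b}.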
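Translating regular expressions into automata [Figure: automata for L1 and L2 combined into automata for the union of L1 and L2, the concatenation L1.L2, and the closure L*.]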
Automatic translation • (a|b).(a|b) = (a∪b)(a∪b) = (a+b).(a+b) [Figure: the ε-NFA built for this expression, with edges labeled a and b.]
Determinization with ε transitions [Figure: ε-NFA with states s0 to s11 and edges labeled a and b, and the resulting DFA with states {s0,s1,s2}, {s3,s5,s6,s7,s8}, {s4,s5,s6,s7,s8}, {s9,s11}, {s10,s11}.] Add to each set the states reachable using ε transitions.
Example • Let Σ be the ASCII character set. Construct an NFA that recognizes each of the following: • Identifiers • Integers • Floating constants.
Simulation of NFA • The epsilon closure of a state x is the set of states that can be reached from x (including x itself) using only transitions labeled with epsilon. • We want to get the next token from the input stream. Properties: 1. The token is the longest sequence of characters starting at the current position that matches a regular expression for a token. 2. The input buffer is repositioned to the first character following the token. 3. Nothing is read after the end-of-file.
Algorithm (page 126 of text, alg. 3.3)
getNextToken() {
    t.error = true;   // t is the token that will be found
    S = epsilon_closure({start});
    while (true) {
        if (S is empty) break;
        if (S contains a final state) {
            t.error = false;
            // fill in t.line and other attributes
        }
        if (end_of_file) break;
        c = getchar();
        T = move(S, c);
        S = epsilon_closure(T);
    }
    reset_inputbuffer(t.line, t.lastcol + 1);
    return t;
}
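The same loop can be written as runnable Python. This is a sketch under assumptions, not the textbook's code: the small NFA below (identifiers and unsigned numbers joined by epsilon edges from a common start state), the character classes, and the names next_token and classify are all made up for illustration. Instead of a token record with line and column fields, the function returns the token and the position where scanning should resume.

```python
START = 0
EPS_EDGES = {0: {1, 3}}                       # epsilon edges from the start state
DELTA = {                                     # (state, char class) -> next states
    (1, "letter"): {2}, (2, "letter"): {2}, (2, "digit"): {2},
    (3, "digit"): {4}, (4, "digit"): {4}, (4, "."): {5},
    (5, "digit"): {6}, (6, "digit"): {6},
}
FINAL = {2: "id", 4: "num", 6: "num"}         # final state -> token kind


def classify(c):
    if c.isascii() and c.isalpha():
        return "letter"
    if c in "0123456789":
        return "digit"
    return c


def epsilon_closure(states):
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in EPS_EDGES.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def move(states, c):
    return {t for s in states for t in DELTA.get((s, classify(c)), ())}


def next_token(text, pos):
    """Return ((kind, lexeme), new_pos) for the longest match, or (None, pos)."""
    S = epsilon_closure({START})
    last = None                               # (kind, end) of the longest match so far
    i = pos
    while S:
        kinds = sorted(FINAL[s] for s in S if s in FINAL)
        if kinds:
            last = (kinds[0], i)              # remember the most recent final state
        if i == len(text):                    # end of input: read nothing further
            break
        S = epsilon_closure(move(S, text[i]))
        i += 1
    if last is None:
        return None, pos                      # lexical error at pos
    kind, end = last
    return (kind, text[pos:end]), end         # resume right after the token


print(next_token("19.97+x", 0))
print(next_token("19.x", 0))
```

The second call shows why the input position is reset: the scanner reads past "19." looking for a fraction, fails, and falls back to the last position where a final state was seen.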