CS375 Compilers Lexical Analysis 4th February, 2010
Outline • Overview of a compiler. • What is lexical analysis? • Writing a Lexer • Specifying tokens: regular expressions • Converting regular expressions to NFA, DFA • Optimizations.
How It Works Program representations at each stage:
• Source code (character stream): if (b == 0) a = b;
• after Lexical Analysis → token stream: if ( b == 0 ) a = b ;
• after Syntax Analysis (Parsing) → abstract syntax tree (AST): an if node whose children are the test (== b 0) and the assignment (= a b)
• after Semantic Analysis → decorated AST: the same tree annotated with types (the == test is boolean over int b and int 0; the assignment target a is an int lvalue assigned int b)
What is a lexical analyzer? • What? • Reads in a stream of characters and groups them into “tokens” or “lexemes”. • The language definition describes what tokens are valid. • Why? • Makes writing the parser a lot easier; the parser operates on “tokens”. • Isolates input-dependent functionality such as character codes, EOF, and newline characters.
First Step: Lexical Analysis (Diagram: the source code, a character stream, if (b == 0) a = b; passes through Lexical Analysis to produce the token stream if ( b == 0 ) a = b ; which then feeds Syntax Analysis and Semantic Analysis.)
What should it do? (Diagram: a token description and an input string w feed into the Lexer, which answers “token? yes/no”.) We want some way to describe tokens, and have our Lexer take that description as input and decide whether a string is a token or not.
Tokens • Logical grouping of characters. • Identifiers: x y11 elsen _i00 • Keywords: if else while break • Constants: • Integer: 2 1000 -500 5L 0x777 • Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10 • String: ”x” ”He said, \”Are you?\”\n” • Character: ’c’ ’\000’ • Symbols: + * { } ++ < << [ ] >= • Whitespace (typically recognized and discarded): • Comment: /** don’t change this **/ • Space: <space> • Format characters: <newline> <return>
Ad-hoc Lexer • Hand-write code to generate tokens • How to read identifier tokens?

  Token readIdentifier() {
    String id = "";
    while (true) {
      char c = input.read();
      if (!identifierChar(c))
        return new Token(ID, id, lineNumber);
      id = id + c;   // repeated concatenation
    }
  }

• Problems • How to start? • What to do with the following character? • How to avoid the quadratic complexity of repeated concatenation? • How to recognize keywords?
Look-ahead Character • Scan the text one character at a time • Use a look-ahead character (next) to determine what kind of token to read and when the current token ends

  char next;
  …
  while (identifierChar(next)) {
    id = id + next;
    next = input.read();
  }

(Diagram: scanning “elsen”, with next pointing at the look-ahead character.)
Ad-hoc Lexer: Top-level Loop

  class Lexer {
    InputStream s;
    char next;
    Lexer(InputStream _s) { s = _s; next = s.read(); }
    Token nextToken() {
      if (identifierFirstChar(next))  // starts with a letter:
        return readIdentifier();      //   it is an identifier
      if (numericFirstChar(next))     // starts with a digit:
        return readNumber();          //   it is a number
      if (next == '"')
        return readStringConst();
      …
    }
  }
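To make the ad-hoc approach concrete, here is a minimal sketch (not the lecture's code) of readIdentifier using the look-ahead character; it also addresses the quadratic-concatenation and keyword problems listed above. Token, TokenType, and the keyword set are illustrative stand-ins, not part of the course material.

  enum TokenType { ID, IF, ELSE, WHILE }
  record Token(TokenType type, String text, int line) {}

  class IdentLexer {
      private final java.io.Reader in;
      private int next;            // look-ahead character; -1 at end of input
      private int lineNumber = 1;

      private static final java.util.Map<String, TokenType> KEYWORDS =
          java.util.Map.of("if", TokenType.IF, "else", TokenType.ELSE,
                           "while", TokenType.WHILE);

      IdentLexer(java.io.Reader in) throws java.io.IOException {
          this.in = in;
          next = in.read();        // prime the look-ahead
      }

      Token readIdentifier() throws java.io.IOException {
          StringBuilder id = new StringBuilder();  // avoids quadratic concatenation
          while (next != -1 && identifierChar((char) next)) {
              id.append((char) next);
              next = in.read();    // look-ahead now holds the *following* character
          }
          String s = id.toString();
          // Keyword recognition: look the lexeme up after reading it.
          return new Token(KEYWORDS.getOrDefault(s, TokenType.ID), s, lineNumber);
      }

      private static boolean identifierChar(char c) {
          return Character.isLetterOrDigit(c) || c == '_';
      }
  }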
Problems • We might not know what kind of token we are going to read from seeing the first character • if a token begins with “i”, is it an identifier? (what about int, if?) • if a token begins with “2”, is it an integer constant? • interleaved tokenizer code is hard to write correctly, and harder to maintain • in general, unbounded look-ahead may be needed
Problems (cont.) • How to specify tokens unambiguously? • Once specified, how to implement them in a systematic way? • How to implement them efficiently?
Problems (cont.) • How to describe tokens unambiguously? For instance: 2.e0 20.e-01 2.0000 and ”” ”x” ”\\” ”\”\’” • How to break up text into tokens? if (x == 0) a = x<<1; vs. if (x == 0) a = x<1; (is << one token or two?) • How to tokenize efficiently • tokens may have similar prefixes • we want to look at each character ~1 time
Principled Approach • Need a principled approach • Lexer Generators • a lexer generator (a.k.a. scanner generator) generates an efficient tokenizer automatically (e.g., lex, flex, JLex) • Your own Lexer • Describe the programming language’s tokens with a set of regular expressions • Generate the scanning automaton from that set of regular expressions
Top level idea… • Have a formal language to describe tokens. • Use regular expressions. • Have a mechanical way of converting this formal description to code. • Convert regular expressions to finite automata (acceptors/state machines). • Run the code on actual inputs. • Simulate the finite automaton.
An Example: Integers • Consider integers. We can describe integers using the following grammar:

  Num -> ‘-’ Pos
  Num -> Pos
  Pos -> 0 | 1 | … | 9
  Pos -> 0 | 1 | … | 9 Pos

• Or, in a more compact notation: Num -> -? [0-9]+
An Example: Integers • Using Num -> -? [0-9]+ we can generate integers such as -12, 23, 0. • We can also represent the above regular expression as a state machine. • This will be useful for simulating the regular expression.
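As a quick way to check the Num -> -? [0-9]+ description against sample strings, here is a small sketch using java.util.regex (the class name is invented for the demo):

  import java.util.regex.Pattern;

  public class IntTokenCheck {
      private static final Pattern NUM = Pattern.compile("-?[0-9]+");

      public static void main(String[] args) {
          for (String s : new String[] { "-12", "23", "0", "-", "1a" }) {
              // matches() must consume the whole string, so "-" and "1a" are rejected
              System.out.println(s + " -> " + NUM.matcher(s).matches());
          }
      }
  }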
An Example: Integers • The non-deterministic finite automaton is as follows. (Diagram: state 0 goes to state 1 on ‘-’ or ε; state 1 goes to state 2 on 0-9; state 2 loops on 0-9 and goes to accepting state 3 on ε.) • We can verify that -123, 65, 0 are accepted by the state machine. • But which path to take? ε-paths?
An Example: Integers • The NFA can be converted to an equivalent deterministic FA, as below. We shall see later how. (Diagram: the DFA states are sets of NFA states: {0,1} goes to {1} on ‘-’ and to {2,3} on 0-9; {1} goes to {2,3} on 0-9; {2,3} loops on 0-9.) • It accepts the same tokens: -123, 65, 0.
An Example: Integers • The deterministic finite automaton makes implementation much easier, as we shall see later. • So, all we have to do is: • Express tokens as regular expressions • Convert the RE to an NFA • Convert the NFA to a DFA • Simulate the DFA on inputs
The larger picture… • Regular expression R describing tokens → RE→NFA conversion → NFA → NFA→DFA conversion → DFA • DFA simulation on input string w → Yes, if w is a valid token; No, if not
Language Theory Review • Let Σ be a finite set • Σ is called an alphabet • a ∈ Σ is called a symbol • Σ* is the set of all finite strings consisting of symbols from Σ • A subset L ⊆ Σ* is called a language • If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively
Language Theory Review, ctd. • Let L ⊆ Σ* be a language • Then • L0 = {“”} • Ln+1 = L Ln for all n ≥ 0 • Examples • if L = {a, b} then • L1 = L = {a, b} • L2 = {aa, ab, ba, bb} • L3 = {aaa, aab, aba, abb, baa, bab, bba, bbb} • …
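As a sanity check of the pair-wise concatenation definition, a tiny sketch that computes L2 for L = {a, b} as in the example above (class name invented):

  import java.util.LinkedHashSet;
  import java.util.Set;

  public class ConcatDemo {
      public static void main(String[] args) {
          Set<String> L = Set.of("a", "b");
          Set<String> L2 = new LinkedHashSet<>();
          for (String x : L)
              for (String y : L)
                  L2.add(x + y);       // all pair-wise concatenations
          System.out.println(L2);      // [aa, ab, ba, bb], in some order
      }
  }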
Syntax of Regular Expressions • The set of regular expressions (RE) over alphabet Σ is defined inductively by • Let a ∈ Σ and R, S ∈ RE. Then: • a ∈ RE • ε ∈ RE • ∅ ∈ RE • R|S ∈ RE • RS ∈ RE • R* ∈ RE • In concrete syntactic form: precedence rules, parentheses, and abbreviations
Semantics of Regular Expressions • A regular expression R ∈ RE denotes the language L(R) ⊆ Σ* given according to the inductive structure of R: • L(a) = {a} the string “a” • L(ε) = {“”} the empty string • L(∅) = {} the empty set • L(R|S) = L(R) ∪ L(S) alternation • L(RS) = L(R) L(S) concatenation • L(R*) = {“”} ∪ L(R) ∪ L(R2) ∪ L(R3) ∪ L(R4) ∪ … Kleene closure
Simple Examples • L(R) = the “language” defined by R • L( abc ) = { abc } • L( hello|goodbye ) = {hello, goodbye} • | is the OR operator, so L(a|b) is the language containing the string a or the string b: {a, b} • L( 1(0|1)* ) = all non-zero binary numerals beginning with 1 • Kleene star: zero or more repetitions of the enclosed expression.
Convenient RE Shorthand

  R+        one or more strings from L(R): R(R*)
  R?        optional R: (R|ε)
  [abce]    one of the listed characters: (a|b|c|e)
  [a-z]     one character from this range: (a|b|c|d|e|…|y|z)
  [^ab]     anything but one of the listed chars
  [^a-z]    one character not from this range
  ”abc”     the string “abc”
  \(        the character ’(’
  …
  id=R      named non-recursive regular expressions
More Examples

  Regular Expression R                 Strings in L(R)
  digit = [0-9]                        “0” “1” “2” “3” …
  posint = digit+                      “8” “412” …
  int = -? posint                      “-42” “1024” …
  real = int (. posint)?               “-1.56” “12” “1.0”
       = (-|ε)([0-9]+)((. [0-9]+)|ε)
  [a-zA-Z_][a-zA-Z0-9_]*               C identifiers
  else                                 the keyword “else”
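The shorthand maps almost directly onto java.util.regex syntax; a small illustrative check of two rows of the table (a sketch, with an invented class name):

  import java.util.regex.Pattern;

  public class ShorthandDemo {
      public static void main(String[] args) {
          Pattern real = Pattern.compile("-?[0-9]+(\\.[0-9]+)?");   // real = int (. posint)?
          Pattern cId  = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"); // C identifiers
          System.out.println(real.matcher("-1.56").matches()); // true
          System.out.println(real.matcher("12").matches());    // true: the fraction is optional
          System.out.println(cId.matcher("_i00").matches());   // true
          System.out.println(cId.matcher("2abc").matches());   // false: cannot start with a digit
      }
  }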
Historical Anomalies • PL/I • Keywords not reserved • IF IF THEN THEN ELSE ELSE; • FORTRAN • Whitespace stripped out prior to scanning • DO 123 I = 1 . 2 (an assignment to the variable DO123I) • DO 123 I = 1 , 2 (a DO loop) • the two cannot be distinguished until the , or . is seen • By and large, modern language design intentionally makes scanning easier
Writing a Lexer • Regular expressions can be very useful in describing languages (tokens). • Use an automatic Lexer generator (Flex, Lex) to generate a Lexer from a language specification. • Have a systematic way of writing a Lexer from a specification such as regular expressions.
How To Use Regular Expressions • Given R ∈ RE and an input string w, we need a mechanism to determine if w ∈ L(R) • Such a mechanism is called an acceptor (Diagram: R ∈ RE, describing a token family, and the input string w, from the program, feed the acceptor “?”, which answers: Yes, if w is a token; No, if w is not a token.)
Acceptors • An acceptor determines if an input string belongs to a language L • Finite automata are acceptors for languages described by regular expressions (Diagram: a description of the language L yields a finite automaton; the acceptor reads the input string w and answers: Yes, if w ∈ L; No, if w ∉ L.)
Finite Automata • Informally, a finite automaton consists of: • A finite set of states • Transitions between states • An initial state (start state) • A set of final states (accepting states) • Two kinds of finite automata: • Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character • Non-deterministic finite automata (NFA): there may be multiple possible choices, and some “spontaneous” transitions without input
DFA Example • A finite automaton that accepts the strings in the language denoted by the regular expression ab*a • It can be represented as a graph or a transition table. • As a graph: read a symbol, follow the outgoing edge. (Diagram: state 0 goes to state 1 on a; state 1 loops on b; state 1 goes to accepting state 2 on a.)
DFA Example (cont.) • Representing the FA as a transition table makes the implementation very easy. The above FA can be represented as:

  state   a      b
  0       1      Error
  1       2      1
  2       Error  Error

• The current state and the current symbol determine the next state. • Continue until • the error state, or • the end of input.
Simulating the DFA • Determine if the DFA accepts an input string

  trans_table[NumSTATES][NumCHARS]
  accept_states[NumSTATES]

  state = INITIAL
  while (state != Error) {
    c = input.read();
    if (c == EOF) break;
    state = trans_table[state][c];
  }
  return (state != Error) && accept_states[state];
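A runnable transcription of this loop for the ab*a DFA, hard-coding the transition table from the previous slide; -1 stands in for the Error state, and the names are my own:

  public class DfaSim {
      static final int ERROR = -1;
      // rows: states 0..2; columns: inputs 'a', 'b' (from the table above)
      static final int[][] TRANS = {
          { 1, ERROR },     // state 0: a -> 1
          { 2, 1 },         // state 1: a -> 2, b -> 1
          { ERROR, ERROR }  // state 2: accepting, no outgoing edges
      };
      static final boolean[] ACCEPT = { false, false, true };

      static boolean accepts(String w) {
          int state = 0;                              // INITIAL
          for (char c : w.toCharArray()) {
              if (state == ERROR) return false;
              if (c != 'a' && c != 'b') return false; // symbol outside the alphabet
              state = TRANS[state][c - 'a'];
          }
          return state != ERROR && ACCEPT[state];
      }

      public static void main(String[] args) {
          System.out.println(accepts("aa"));     // true: ab*a with zero b's
          System.out.println(accepts("abbba"));  // true
          System.out.println(accepts("ab"));     // false: ends in a non-accepting state
      }
  }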
RE → Finite automaton? • Can we build a finite automaton for every regular expression? • Strategy: build the finite automaton inductively, based on the definition of regular expressions. (Base cases: an automaton accepting only the empty string ε, and, for each symbol a, an automaton accepting only “a”.)
RE → Finite automaton? • Alternation R|S: an optional move into either the R automaton or the S automaton. • Concatenation RS: the R automaton followed by the S automaton. • Recall: “?” implies an optional move.
NFA Definition • A non-deterministic finite automaton (NFA) is an automaton where: • There may be ε-transitions (transitions that do not consume input characters) • There may be multiple transitions from the same state on the same input character (Example diagram: an NFA with edges labeled a, b, and ε, including two a-edges leaving the same state.)
RE → NFA intuition: -?[0-9]+ (Diagram: the integer NFA, with a ‘-’ or ε edge followed by digit edges.) When to take the ε-path?
NFA construction (Thompson) • The NFA only needs one stop state (why?) • Canonical NFA form: a single start state and a single accept state. • Use this canonical form to inductively construct NFAs for regular expressions
Inductive NFA Construction • R|S: a new start state with ε-edges into the R and S automata, and ε-edges from their accept states into a new accept state. • RS: an ε-edge from R’s accept state to S’s start state. • R*: a new start state and accept state, with ε-edges that allow skipping R entirely, entering R, and looping back from R’s accept state for further repetitions.
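One possible encoding of these rules in code, a sketch rather than the course's implementation: each construction returns a fragment with a single start and a single accept state, wired together with ε-edges exactly as described above.

  import java.util.ArrayList;
  import java.util.List;

  class NfaBuilder {
      static final char EPS = 0;                    // label for ε-transitions
      final List<int[]> edges = new ArrayList<>();  // each edge is {from, label, to}
      private int stateCount = 0;

      int newState() { return stateCount++; }
      void edge(int from, char label, int to) { edges.add(new int[]{ from, label, to }); }

      // Each construction returns a fragment {start, accept}.
      int[] symbol(char a) {                        // base case: accepts exactly "a"
          int s = newState(), t = newState();
          edge(s, a, t);
          return new int[]{ s, t };
      }
      int[] alt(int[] r, int[] s) {                 // R|S
          int st = newState(), ac = newState();
          edge(st, EPS, r[0]);  edge(st, EPS, s[0]);   // ε into both branches
          edge(r[1], EPS, ac);  edge(s[1], EPS, ac);   // ε out of both branches
          return new int[]{ st, ac };
      }
      int[] seq(int[] r, int[] s) {                 // RS: ε from R's accept to S's start
          edge(r[1], EPS, s[0]);
          return new int[]{ r[0], s[1] };
      }
      int[] star(int[] r) {                         // R*
          int st = newState(), ac = newState();
          edge(st, EPS, r[0]);   edge(st, EPS, ac);    // enter R, or skip it
          edge(r[1], EPS, r[0]); edge(r[1], EPS, ac);  // loop back, or leave
          return new int[]{ st, ac };
      }
  }

For instance, ab*a would be built as b.seq(b.seq(b.symbol('a'), b.star(b.symbol('b'))), b.symbol('a')) for an NfaBuilder b.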
DFA vs NFA • DFA: the action of the automaton on each input symbol is fully determined • obvious table-driven implementation • NFA: • the automaton may have a choice on each step • the automaton accepts a string if there is any way to make choices that arrives at an accepting state • every path from the start state to an accept state spells out an accepted string • not obvious how to implement!
Simulating an NFA • Problem: how to execute NFA? “strings accepted are those for which there is some corresponding path from start state to an accept state” • Solution: search all paths in graph consistent with the string in parallel • Keep track of the subset of NFA states that search could be in after seeing string prefix • “Multiple fingers” pointing to graph
Example • Input string: -23 (Using the integer NFA: state 0 goes to 1 on ‘-’ or ε; 1 goes to 2 on 0-9; 2 loops on 0-9, with an ε-edge to accepting state 3.) • NFA states: • Start: {0,1} • “-”: {1} • “2”: {2,3} • “3”: {2,3} • But this is very difficult to implement directly.
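The “multiple fingers” idea can nevertheless be coded directly by tracking a set of states; a sketch, reusing the hypothetical NfaBuilder fragment encoding from earlier (not the course's code):

  import java.util.HashSet;
  import java.util.Set;

  class NfaSim {
      // ε-closure: add every state reachable via ε-edges alone.
      static Set<Integer> closure(NfaBuilder nfa, Set<Integer> states) {
          Set<Integer> result = new HashSet<>(states);
          boolean changed = true;
          while (changed) {
              changed = false;
              for (int[] e : nfa.edges)
                  if (e[1] == NfaBuilder.EPS && result.contains(e[0]) && result.add(e[2]))
                      changed = true;
          }
          return result;
      }

      static boolean accepts(NfaBuilder nfa, int start, int accept, String w) {
          Set<Integer> current = closure(nfa, Set.of(start));  // e.g., {0,1} above
          for (char c : w.toCharArray()) {
              Set<Integer> next = new HashSet<>();
              for (int[] e : nfa.edges)
                  if (e[1] == c && current.contains(e[0]))
                      next.add(e[2]);       // one "finger" per reachable state
              current = closure(nfa, next); // follow ε-edges after each step
          }
          return current.contains(accept);
      }
  }

This is exactly the state-set trace shown above: start at the ε-closure of the start state, step the whole set on each input character, and accept if the final set contains the accept state.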