
CS375

CS375. Compilers Lexical Analysis 4 th February, 2010. Outline. Overview of a compiler. What is lexical analysis? Writing a Lexer Specifying tokens: regular expressions Converting regular expressions to NFA, DFA Optimizations. How It Works. Program representation. Source code

weylin




Presentation Transcript


  1. CS375 Compilers Lexical Analysis 4th February, 2010

  2. Outline • Overview of a compiler. • What is lexical analysis? • Writing a Lexer • Specifying tokens: regular expressions • Converting regular expressions to NFA, DFA • Optimizations.

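  3. How It Works • Program representation at each stage: • Source code (character stream): if (b == 0) a = b; • Lexical analysis produces a token stream: if ( b == 0 ) a = b ; • Syntax analysis (parsing) produces an abstract syntax tree (AST): an if node whose children are == (over b and 0) and = (over a and b) • Semantic analysis produces a decorated AST: the == node has type boolean with operands int b and int 0; the = node has type int with operands int a (an lvalue) and int b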

  4. What is a lexical analyzer? • What? • Reads in a stream of characters and groups them into “tokens” or “lexemes”. • Language definition describes what tokens are valid. • Why? • Makes writing the parser a lot easier; the parser operates on “tokens”. • Handles input-dependent details such as character codes, EOF, and newline characters.

  5. First Step: Lexical Analysis Source code (character stream) if (b == 0) a = b; Lexical Analysis Token stream if ( b == 0 ) a = b ; Syntax Analysis Semantic Analysis

  6. What should it do? • Inputs: a token description and an input string w • Output: is w a token? {Yes/No} • We want some way to describe tokens, and have our lexer take that description as input and decide whether a string is a token or not.

  7. Starting off

  8. Tokens • Logical grouping of characters. • Identifiers: x y11 elsen _i00 • Keywords: if else while break • Constants: • Integer: 2 1000 -500 5L 0x777 • Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10 • String: "x" "He said, \"Are you?\"\n" • Character: 'c' '\000' • Symbols: + * { } ++ < << [ ] >= • Whitespace (typically recognized and discarded): • Comment: /** don't change this **/ • Space: <space> • Format characters: <newline> <return>

  9. Ad-hoc Lexer • Hand-write code to generate tokens • How to read identifier tokens?

      Token readIdentifier() {
          String id = "";
          while (true) {
              char c = (char) input.read();
              if (!identifierChar(c))
                  return new Token(ID, id, lineNumber);  // c has already been consumed
              id = id + c;  // repeated concatenation
          }
      }

     • Problems • How to start? • What to do with the following character? • How to avoid quadratic complexity of repeated concatenation? • How to recognize keywords?

  10. Look-ahead Character • Scan text one character at a time • Use look-ahead character (next) to determine what kind of token to read and when the current token ends

      char next;
      …
      while (identifierChar(next)) {
          id = id + next;
          next = (char) input.read();
      }

     • Example: while scanning the input elsen, next always holds the look-ahead character that has been read but not yet added to id.

  11. Ad-hoc Lexer: Top-level Loop

      class Lexer {
          InputStream s;
          char next;  // look-ahead character

          Lexer(InputStream _s) {
              s = _s;
              next = (char) s.read();
          }

          Token nextToken() {
              if (identifierFirstChar(next))  // starts with a letter
                  return readIdentifier();    // is an identifier
              if (numericFirstChar(next))     // starts with a digit
                  return readNumber();        // is a number
              if (next == '\"')
                  return readStringConst();
              …
          }
      }
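The look-ahead idea from the last few slides can be put together into a small runnable sketch. This is only an illustration under stated assumptions, not the course's lexer: the class name `AdHocLexer`, the `KIND:text` string representation of tokens, and the keyword set are all hypothetical, and every non-alphanumeric character is treated as a one-character symbol (so `==` comes out as two `=` tokens).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class AdHocLexer {
    // Hypothetical keyword set, taken from the keywords listed on the Tokens slide.
    private static final Set<String> KEYWORDS = Set.of("if", "else", "while", "break");

    private final Reader input;
    private int next; // look-ahead character, or -1 at end of input

    AdHocLexer(Reader input) throws IOException {
        this.input = input;
        this.next = input.read(); // prime the look-ahead before the first token
    }

    private static boolean identifierFirstChar(int c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean identifierChar(int c) {
        return identifierFirstChar(c) || (c >= '0' && c <= '9');
    }

    /** Returns the next token as "KIND:text", or null at end of input. */
    String nextToken() throws IOException {
        while (next == ' ' || next == '\t' || next == '\n' || next == '\r') {
            next = input.read(); // whitespace is recognized and discarded
        }
        if (next == -1) return null;
        StringBuilder text = new StringBuilder(); // avoids quadratic concatenation
        if (identifierFirstChar(next)) {
            while (identifierChar(next)) {
                text.append((char) next);
                next = input.read();
            }
            String id = text.toString();
            // Keywords are recognized by a table lookup after the identifier is read.
            return (KEYWORDS.contains(id) ? "KEYWORD:" : "ID:") + id;
        }
        if (next >= '0' && next <= '9') {
            while (next >= '0' && next <= '9') {
                text.append((char) next);
                next = input.read();
            }
            return "NUM:" + text;
        }
        text.append((char) next); // any other character: one-character symbol
        next = input.read();
        return "SYM:" + text;
    }

    /** Convenience wrapper: tokenizes a whole string. */
    static List<String> tokenize(String source) {
        try {
            AdHocLexer lexer = new AdHocLexer(new StringReader(source));
            List<String> tokens = new ArrayList<>();
            for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
                tokens.add(t);
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `AdHocLexer.tokenize("if (elsen == 10) x = y1;")` classifies `if` as a keyword and `elsen` as an identifier, because the keyword check happens only after the whole identifier has been read.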

  12. Problems • Might not know what kind of token we are going to read from seeing the first character • if token begins with “i”, is it an identifier? (what about int, if?) • if token begins with “2”, is it an integer constant? • interleaved tokenizer code is hard to write correctly, harder to maintain • in general, unbounded look-ahead may be needed

  13. Problems (cont.) • How to specify (unambiguously) tokens. • Once specified, how to implement them in a systematic way? • How to implement them efficiently?

  14. Problems (cont.) • For instance, consider: • How to describe tokens unambiguously: 2.e0 20.e-01 2.0000 "" "x" "\\" "\"\'" • How to break up text into tokens: if (x == 0) a = x<<1; if (x == 0) a = x<1; • How to tokenize efficiently • tokens may have similar prefixes • want to look at each character ~1 time
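To make the `x<<1` versus `x<1` question concrete, here is a minimal sketch of one common answer: peek one character ahead and prefer the longer operator. The class name `OperatorScanner` and the operator subset (`<`, `<<`, `<=`) are assumptions chosen for illustration; the slides do not prescribe this approach.

```java
import java.util.ArrayList;
import java.util.List;

class OperatorScanner {
    /** Splits text into alphanumeric runs and the operators <, << and <= (hypothetical subset). */
    static List<String> scan(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '<') {
                // Peek one character ahead and prefer the longer operator.
                char peek = i + 1 < text.length() ? text.charAt(i + 1) : '\0';
                if (peek == '<' || peek == '=') {
                    tokens.add("" + c + peek);
                    i += 2;
                } else {
                    tokens.add("<");
                    i += 1;
                }
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
                tokens.add(text.substring(start, i));
            } else {
                i++; // this sketch skips everything else, e.g. spaces
            }
        }
        return tokens;
    }
}
```

With this rule, `scan("x<<1")` yields the three tokens `x`, `<<`, `1`, while `scan("x<1")` yields `x`, `<`, `1`. Each character is still examined about once.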

  15. Principled Approach • Need a principled approach • Lexer Generators • lexer generator that generates efficient tokenizer automatically (e.g., lex, flex, Jlex) a.k.a. scanner generator • Your own Lexer • Describe programming language’s tokens with a set of regular expressions • Generate scanning automaton from that set of regular expressions

  16. Top level idea… • Have a formal language to describe tokens. • Use regular expressions. • Have a mechanical way of converting this formal description to code. • Convert regular expressions to finite automata (acceptors/state machines). • Run the code on actual inputs. • Simulate the finite automaton.

  17. An Example : Integers • Consider integers. • We can describe integers using the following grammar: • Num -> '-' Pos • Num -> Pos • Pos -> 0 | 1 | … | 9 • Pos -> (0 | 1 | … | 9) Pos • Or in a more compact notation, we have: • Num -> -? [0-9]+

  18. An Example : Integers • Using Num -> -? [0-9]+ we can generate integers such as -12, 23, 0. • We can also represent the above regular expression as a state machine. • This is useful for simulating the regular expression.

  19. An Example : Integers • The non-deterministic finite automaton (NFA) is as follows. [Figure: NFA with states 0, 1, 2, 3; edge labels: -, ε, 0-9, ε, 0-9] • We can verify that -123, 65, 0 are accepted by the state machine. • But which path to take? ε-paths?

  20. An Example : Integers • The NFA can be converted to an equivalent deterministic FA (DFA) as below. [Figure: DFA with states {0,1}, {1}, {2,3}; edge labels: -, 0-9, 0-9, 0-9] • We shall see later how. • It accepts the same tokens. • -123 • 65 • 0
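A DFA of this size can be coded directly as a state-transition function. The sketch below assumes the edges are {0,1} on `-` to {1}, {0,1} on a digit to {2,3}, {1} on a digit to {2,3}, and {2,3} on a digit to {2,3}, with {2,3} accepting; that reading is inferred from the figure's edge labels and is not spelled out in the transcript. The names `IntegerDfa`, `S01`, `S1` and `S23` are made up for the example.

```java
class IntegerDfa {
    // DFA states, named after the NFA state sets {0,1}, {1} and {2,3}.
    enum State { S01, S1, S23, ERROR }

    /** One transition: current state and current symbol determine the next state. */
    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case S01: return c == '-' ? State.S1 : digit ? State.S23 : State.ERROR;
            case S1:  return digit ? State.S23 : State.ERROR;
            case S23: return digit ? State.S23 : State.ERROR;
            default:  return State.ERROR;
        }
    }

    /** True if w matches -?[0-9]+ under the assumed transitions. */
    static boolean accepts(String w) {
        State s = State.S01;
        for (int i = 0; i < w.length() && s != State.ERROR; i++) {
            s = step(s, w.charAt(i));
        }
        return s == State.S23; // {2,3} is the only accepting state
    }
}
```

`IntegerDfa.accepts` returns true for `-123`, `65` and `0`, and false for inputs such as `-` or the empty string, which never reach the accepting state.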

  21. An Example : Integers • The deterministic finite automaton makes implementation much easier, as we shall see later. • So, all we have to do is: • Express tokens as regular expressions • Convert RE to NFA • Convert NFA to DFA • Simulate the DFA on inputs

  22. The larger picture… • Input: regular expression R describing tokens • Step 1: RE → NFA conversion • Step 2: NFA → DFA conversion • Step 3: DFA simulation on input string w • Output: Yes, if w is a valid token; No, if not

  23. Quick Language Theory Review…

  24. Language Theory Review • Let Σ be a finite set • Σ is called an alphabet • a ∈ Σ is called a symbol • Σ* is the set of all finite strings consisting of symbols from Σ • A subset L ⊆ Σ* is called a language • If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively

  25. Language Theory Review, ctd. • Let L ⊆ Σ* be a language • Then • L^0 = {ε} • L^(n+1) = L L^n for all n ≥ 0 • Examples • if L = {a, b} then • L^1 = L = {a, b} • L^2 = {aa, ab, ba, bb} • L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb} • …
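For finite languages, the definitions of concatenation and powers can be checked mechanically. The following is a small sketch with strings as Java `String`s and languages as sets; the class name `LanguagePower` and its method names are assumptions made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

class LanguagePower {
    /** Concatenation L1 L2: every string of L1 followed by every string of L2. */
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> result = new TreeSet<>();
        for (String x : l1) {
            for (String y : l2) {
                result.add(x + y);
            }
        }
        return result;
    }

    /** Power of a language: L^0 = {ε} and L^(n+1) = L L^n. */
    static Set<String> power(Set<String> l, int n) {
        Set<String> result = new TreeSet<>();
        result.add(""); // L^0 contains only the empty string ε
        for (int i = 0; i < n; i++) {
            result = concat(l, result);
        }
        return result;
    }
}
```

For L = {a, b}, `LanguagePower.power(Set.of("a", "b"), 2)` gives the four strings aa, ab, ba, bb, and the third power has eight distinct strings.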

  26. Syntax of Regular Expressions • Set of regular expressions (RE) over alphabet Σ is defined inductively by • Let a ∈ Σ and R, S ∈ RE. Then: • a ∈ RE • ε ∈ RE • ∅ ∈ RE • R|S ∈ RE • RS ∈ RE • R* ∈ RE • In concrete syntactic form: precedence rules, parentheses, and abbreviations

  27. Semantics of Regular Expressions • Regular expression T ∈ RE denotes the language L(T) ⊆ Σ* given according to the inductive structure of T: • L(a) = {"a"} the string "a" • L(ε) = {""} the empty string • L(∅) = {} the empty set • L(R|S) = L(R) ∪ L(S) alternation • L(RS) = L(R) L(S) concatenation • L(R*) = {""} ∪ L(R) ∪ L(R^2) ∪ L(R^3) ∪ L(R^4) ∪ … Kleene closure
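The inductive definition translates almost line by line into a (slow but direct) membership test. The sketch below is not how a lexer would be built in practice. It assumes a hypothetical `Re` class with one factory method per case of the definition, and for each expression it computes the set of positions where a match starting at position `i` can end.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

abstract class Re {
    /** All positions j such that the substring w[i..j) belongs to L(this). */
    abstract Set<Integer> ends(String w, int i);

    /** True if the whole string w belongs to L(this). */
    boolean matches(String w) {
        return ends(w, 0).contains(w.length());
    }

    static Re sym(char a) { // L(a) = {"a"}
        return new Re() {
            Set<Integer> ends(String w, int i) {
                Set<Integer> out = new HashSet<>();
                if (i < w.length() && w.charAt(i) == a) out.add(i + 1);
                return out;
            }
        };
    }

    static Re eps() { // L(ε) = {""}
        return new Re() {
            Set<Integer> ends(String w, int i) {
                Set<Integer> out = new HashSet<>();
                out.add(i);
                return out;
            }
        };
    }

    static Re empty() { // L(∅) = {}
        return new Re() {
            Set<Integer> ends(String w, int i) {
                return new HashSet<>();
            }
        };
    }

    static Re alt(Re r, Re s) { // L(R|S) = L(R) ∪ L(S)
        return new Re() {
            Set<Integer> ends(String w, int i) {
                Set<Integer> out = new HashSet<>(r.ends(w, i));
                out.addAll(s.ends(w, i));
                return out;
            }
        };
    }

    static Re cat(Re r, Re s) { // L(RS) = L(R) L(S)
        return new Re() {
            Set<Integer> ends(String w, int i) {
                Set<Integer> out = new HashSet<>();
                for (int j : r.ends(w, i)) out.addAll(s.ends(w, j));
                return out;
            }
        };
    }

    static Re star(Re r) { // L(R*) = {""} ∪ L(R) ∪ L(R^2) ∪ …
        return new Re() {
            Set<Integer> ends(String w, int i) {
                Set<Integer> out = new HashSet<>();
                Deque<Integer> work = new ArrayDeque<>();
                out.add(i); // zero repetitions
                work.push(i);
                while (!work.isEmpty()) { // each position is processed at most once
                    for (int k : r.ends(w, work.pop())) {
                        if (out.add(k)) work.push(k);
                    }
                }
                return out;
            }
        };
    }
}
```

For example, `Re.cat(Re.sym('1'), Re.star(Re.alt(Re.sym('0'), Re.sym('1'))))` builds the expression 1(0|1)*, and its `matches` method accepts `1010` but rejects `01`.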

  28. Simple Examples • L(R) = the “language” defined by R • L( abc ) = { abc } • L( hello|goodbye ) = {hello, goodbye} • OR operator: L(a|b) = {a, b}, the language containing the string a and the string b. • L( 1(0|1)* ) = all non-zero binary numerals beginning with 1 • Kleene star: zero or more repetitions of strings matched by the expression enclosed in the parentheses.

  29. Convenient RE Shorthand

      R+        one or more strings from L(R): R(R*)
      R?        optional R: (R|ε)
      [abce]    one of the listed characters: (a|b|c|e)
      [a-z]     one character from this range: (a|b|c|d|e|…|y|z)
      [^ab]     anything but one of the listed chars
      [^a-z]    one character not from this range
      "abc"     the string "abc"
      \(        the character '('
      . . .
      id=R      named non-recursive regular expressions

  30. More Examples

      Regular expression R                    Strings in L(R)
      digit = [0-9]                           "0" "1" "2" "3" …
      posint = digit+                         "8" "412" …
      int = -? posint                         "-42" "1024" …
      real = int ((. posint)?)                "-1.56" "12" "1.0"
           = (-|ε)([0-9]+)((. [0-9]+)|ε)
      [a-zA-Z_][a-zA-Z0-9_]*                  C identifiers
      else                                    the keyword "else"
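These named expressions can be tried out with Java's standard `java.util.regex` package, which supports the same shorthand (`+`, `?`, character classes). The sketch below is only a way to experiment with the definitions: the class name `TokenPatterns` is hypothetical, the `.` in `real` is assumed to mean a literal decimal point (hence the escape), and `java.util.regex` is a general-purpose matcher, not the automaton-based approach these slides build toward.

```java
import java.util.regex.Pattern;

class TokenPatterns {
    // The slide's named expressions, transcribed into java.util.regex syntax.
    static final String DIGIT = "[0-9]";
    static final String POSINT = DIGIT + "+";
    static final String INT = "-?" + POSINT;
    static final String REAL = INT + "(\\." + POSINT + ")?"; // assumes '.' is a literal dot
    static final String IDENT = "[a-zA-Z_][a-zA-Z0-9_]*";

    /** True if the whole string w is in the language of the pattern. */
    static boolean matches(String regex, String w) {
        return Pattern.matches(regex, w);
    }
}
```

For instance, `TokenPatterns.matches(TokenPatterns.REAL, "-1.56")` is true, while `"1."` is rejected because `posint` requires at least one digit after the point. Building the patterns by string concatenation mirrors the non-recursive `id=R` naming convention.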

  31. Historical Anomalies • PL/I • Keywords not reserved • IF IF THEN THEN ELSE ELSE; • FORTRAN • Whitespace stripped out prior to scanning • DO 123 I = 1 • DO 123 I = 1 , 2 • By and large, modern language design intentionally makes scanning easier

  32. Writing a lexer

  33. Writing a Lexer • Regular Expressions can be very useful in describing languages (tokens). • Use an automatic Lexer generator (Flex, Lex) to generate a Lexer from language specification. • Have a systematic way of writing a Lexer from a specification such as regular expressions.

  34. Writing your own lexer

  35. How To Use Regular Expressions • Given R ∈ RE and input string w, need a mechanism to determine if w ∈ L(R) • Such a mechanism is called an acceptor • Inputs: R ∈ RE (that describes a token family) and input string w (from the program) • Output: Yes, if w is a token; No, if w is not a token

  36. Acceptors • Acceptor determines if an input string belongs to a language L • Finite automata are acceptors for languages described by regular expressions • Inputs: description of language L (a finite automaton) and input string w • Output: Yes, if w ∈ L; No, if w ∉ L

  37. Finite Automata • Informally, a finite automaton consists of: • A finite set of states • Transitions between states • An initial state (start state) • A set of final states (accepting states) • Two kinds of finite automata: • Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character • Non-deterministic finite automata (NFA): there may be multiple possible choices, and some “spontaneous” transitions without input

  38. DFA Example • Finite automaton that accepts the strings in the language denoted by regular expression ab*a • Can be represented as a graph or a transition table. • A graph. • Read symbol • Follow outgoing edge [Figure: graph with states 0, 1, 2; edges 0 -a→ 1, 1 -b→ 1, 1 -a→ 2]

  39. DFA Example (cont.) • Representing FA as transition tables makes the implementation very easy. • The above FA can be represented as:

      State | a     | b
      0     | 1     | Error
      1     | 2     | 1
      2     | Error | Error

     • Current state and current symbol determine next state. • Continue until: • error state. • End of input.

  40. Simulating the DFA • Determine if the DFA accepts an input string

      trans_table[NumSTATES][NumCHARS]
      accept_states[NumSTATES]
      state = INITIAL
      while (state != Error) {
          c = input.read();
          if (c == EOF) break;
          state = trans_table[state][c];
      }
      return (state != Error) && accept_states[state];

     [Figure: the DFA for ab*a with states 0, 1, 2; edge labels: a, b, a]
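The simulation loop can be made runnable for the ab*a automaton using the transition table from the previous slide. This is a minimal sketch: the class name `TableDfa`, the encoding of Error as -1, and the two-column alphabet mapping are assumptions for the example, and the input is a `String` and not a stream.

```java
class TableDfa {
    static final int ERROR = -1;

    // Rows are states 0..2; columns are the symbols a (0) and b (1).
    static final int[][] TRANSITIONS = {
        {1, ERROR},      // state 0
        {2, 1},          // state 1
        {ERROR, ERROR},  // state 2
    };
    static final boolean[] ACCEPTING = {false, false, true};

    /** Maps an input character to its table column, or -1 if it is not in the alphabet. */
    static int column(char c) {
        return c == 'a' ? 0 : c == 'b' ? 1 : -1;
    }

    /** Runs the DFA until the input ends or the error state is reached. */
    static boolean accepts(String w) {
        int state = 0; // initial state
        for (int i = 0; i < w.length() && state != ERROR; i++) {
            int col = column(w.charAt(i));
            state = col < 0 ? ERROR : TRANSITIONS[state][col];
        }
        return state != ERROR && ACCEPTING[state];
    }
}
```

`TableDfa.accepts("abbba")` is true and `TableDfa.accepts("ab")` is false. Each input character costs one table lookup, which is why the table-driven form is attractive.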

  41. RE → Finite automaton? • Can we build a finite automaton for every regular expression? • Strategy: build the finite automaton inductively, based on the definition of regular expressions [Figure: automata for the base cases ε and a]

  42. RE → Finite automaton? • Alternation: R|S [Figure: R automaton and S automaton, joined by edges marked ?] • Concatenation: RS [Figure: R automaton followed by S automaton, joined by an edge marked ?] • Recall ? implies optional move.

  43. NFA Definition • A non-deterministic finite automaton (NFA) is an automaton where: • There may be ε-transitions (transitions that do not consume input characters) • There may be multiple transitions from the same state on the same input character • Example: [Figure: NFA with edge labels a, b, ε]

  44. RE → NFA intuition • -?[0-9]+ [Figure: NFA with edge labels -, ε, 0-9, ε, 0-9] • When to take the ε-path?

  45. NFA construction (Thompson) • NFA only needs one stop state (why?) • Canonical NFA form: • Use this canonical form to inductively construct NFAs for regular expressions

  46. Inductive NFA Construction [Figure: NFAs for R|S, RS, and R*, each built from the NFAs for R and S connected by ε-transitions]

  47. Inductive NFA Construction [Figure: the same constructions for R|S, RS, and R*, with one fewer ε-transition in the RS case]
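The inductive construction can be sketched in code as one function per regular-expression operator, each returning an NFA fragment with a single start state and a single accept state. The version below is one common textbook formulation of Thompson's construction and may differ in detail from the figures on these slides; the names `Thompson`, `State` and `Nfa` are assumptions. A small set-based `accepts` method is included only so the result can be exercised.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Thompson {
    static class State {
        final List<State> epsilon = new ArrayList<>(); // ε-transitions
        char symbol;   // label of the single character transition, if any
        State target;  // null when the state has no character transition
    }

    /** An NFA fragment in canonical form: one start state, one accept state. */
    static class Nfa {
        final State start, accept;
        Nfa(State start, State accept) { this.start = start; this.accept = accept; }
    }

    static Nfa symbol(char a) { // base case: a
        State s = new State(), f = new State();
        s.symbol = a;
        s.target = f;
        return new Nfa(s, f);
    }

    static Nfa epsilon() { // base case: ε
        State s = new State(), f = new State();
        s.epsilon.add(f);
        return new Nfa(s, f);
    }

    static Nfa alt(Nfa r, Nfa s) { // R|S
        State start = new State(), accept = new State();
        start.epsilon.add(r.start);
        start.epsilon.add(s.start);
        r.accept.epsilon.add(accept);
        s.accept.epsilon.add(accept);
        return new Nfa(start, accept);
    }

    static Nfa concat(Nfa r, Nfa s) { // RS
        r.accept.epsilon.add(s.start);
        return new Nfa(r.start, s.accept);
    }

    static Nfa star(Nfa r) { // R*
        State start = new State(), accept = new State();
        start.epsilon.add(r.start);
        start.epsilon.add(accept);      // zero repetitions
        r.accept.epsilon.add(r.start);  // repeat
        r.accept.epsilon.add(accept);
        return new Nfa(start, accept);
    }

    /** All states reachable from the given states using only ε-transitions. */
    static Set<State> closure(Set<State> states) {
        Set<State> out = new HashSet<>(states);
        Deque<State> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (State t : work.pop().epsilon) {
                if (out.add(t)) work.push(t);
            }
        }
        return out;
    }

    static boolean accepts(Nfa nfa, String w) {
        Set<State> current = closure(Set.of(nfa.start));
        for (char c : w.toCharArray()) {
            Set<State> moved = new HashSet<>();
            for (State s : current) {
                if (s.target != null && s.symbol == c) moved.add(s.target);
            }
            current = closure(moved);
        }
        return current.contains(nfa.accept);
    }
}
```

Because the combinators link the fragments they are given in place, each fragment should be used only once. For example, `Thompson.concat(Thompson.symbol('a'), Thompson.concat(Thompson.star(Thompson.symbol('b')), Thompson.symbol('a')))` builds an NFA for ab*a.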

  48. DFA vs NFA • DFA: action of automaton on each input symbol is fully determined • obvious table-driven implementation • NFA: • automaton may have choice on each step • automaton accepts a string if there is any way to make choices to arrive at accepting state • every path from start state to an accept state is a string accepted by automaton • not obvious how to implement!

  49. Simulating an NFA • Problem: how to execute NFA? “strings accepted are those for which there is some corresponding path from start state to an accept state” • Solution: search all paths in graph consistent with the string in parallel • Keep track of the subset of NFA states that search could be in after seeing string prefix • “Multiple fingers” pointing to graph

  50. Example [Figure: NFA with states 0, 1, 2, 3; edge labels: -, ε, 0-9, ε, 0-9] • Input string: -23 • NFA states: • Start: {0,1} • “-”: {1} • “2”: {2, 3} • “3”: {2, 3} • But this is very difficult to implement directly.
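The state sets listed for the input -23 can be reproduced with a short simulation that tracks the subset of NFA states after each character. The edges used below are an assumption reconstructed from those state sets: 0 on `-` to 1, 0 on ε to 1, 1 on a digit to 2, 2 on a digit to 2, and 2 on ε to 3, with state 3 accepting. The class name `IntegerNfa` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class IntegerNfa {
    /** Adds every state reachable by ε-edges (assumed: 0 -ε-> 1 and 2 -ε-> 3). */
    static Set<Integer> epsilonClosure(Set<Integer> states) {
        Set<Integer> out = new TreeSet<>(states);
        // One pass is enough here because states 1 and 3 have no ε-edges of their own.
        if (out.contains(0)) out.add(1);
        if (out.contains(2)) out.add(3);
        return out;
    }

    /** States reachable from the given set by consuming character c. */
    static Set<Integer> move(Set<Integer> states, char c) {
        boolean digit = c >= '0' && c <= '9';
        Set<Integer> out = new TreeSet<>();
        for (int s : states) {
            if (s == 0 && c == '-') out.add(1);
            if ((s == 1 || s == 2) && digit) out.add(2);
        }
        return out;
    }

    /** The state set at the start and after each input character. */
    static List<Set<Integer>> trace(String w) {
        List<Set<Integer>> sets = new ArrayList<>();
        Set<Integer> current = epsilonClosure(Set.of(0));
        sets.add(current);
        for (char c : w.toCharArray()) {
            current = epsilonClosure(move(current, c));
            sets.add(current);
        }
        return sets;
    }

    static boolean accepts(String w) {
        List<Set<Integer>> sets = trace(w);
        return sets.get(sets.size() - 1).contains(3); // state 3 is accepting
    }
}
```

`IntegerNfa.trace("-23")` yields the sets {0,1}, {1}, {2,3}, {2,3}, matching the slide. The distinct sets that appear are the states of the equivalent DFA for integers.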
