Winter 2012-2013 • Compiler Principles • Lexical Analysis (Scanning) • Mayer Goldberg and Roman Manevich • Ben-Gurion University
General stuff • Topics taught by me • Lexical analysis (scanning) • Syntax analysis (parsing) • … • Dataflow analysis • Register allocation • Slides will be available from web-site after lecture • Request: please mute mobiles, tablets, super-cool squeaking devices
Today • Understand role of lexical analysis • Lexical analysis theory • Implementing modern scanner
Role of lexical analysis • First part of compiler front-end • Convert stream of characters into stream of tokens • Split text into most basic meaningful strings • Simplify input for syntax analysis • [Figure: compiler pipeline: High-level language (Scheme) → Lexical analysis → Syntax analysis (parsing) → AST, symbol table, etc. → Intermediate representation (IR) → Code generation → Executable code]
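From scanning to parsing • Program text: 5 + (7 * x) • Lexical analyzer turns the program text into a token stream • Grammar: E → id, E → num, E → E + E, E → E * E, E → ( E ) • Parser consumes the token stream and reports either a syntax error or, for valid input, an Abstract Syntax Tree • AST for the example: + (num, * (num, x))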
Javascript example • Identify basic units in this code
    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }
Javascript example • Identify basic units in this code • Kinds of units labeled on the slide: keyword (e.g. var), identifier (e.g. currOption), operator (e.g. =), numeric literal (e.g. 0), string literal (e.g. "about-me"), punctuation (e.g. ;), whitespace
Scanner output
    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }
Stream of tokens, written as LINE: ID(value)
    1: VAR
    1: ID(currOption)
    1: EQ
    1: INT_LITERAL(0)
    1: SEMI
    3: FUNCTION
    3: ID(choose)
    3: LP
    3: ID(id)
    3: RP
    3: LCB
    ...
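A minimal sketch of how such a token stream could be produced, assuming Python's `re` module and one pattern per token kind. The kind names (VAR, ID, EQ, INT_LITERAL, SEMI, FUNCTION, LP, LCB) come from the slide; RP for the right parenthesis, the patterns, and the names `TOKEN_SPEC`, `KEYWORDS`, and `scan` are assumptions made for illustration, and only the token kinds needed for the first three source lines are covered.

```python
import re

# Token kinds follow the slide; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),      # consumed, never reported
    ("WS", r"\s+"),                # consumed, never reported
    ("INT_LITERAL", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("EQ", r"="),
    ("SEMI", r";"),
    ("LP", r"\("),
    ("RP", r"\)"),
    ("LCB", r"\{"),
]
# Reserved words are scanned as identifiers first, then reclassified.
KEYWORDS = {"var": "VAR", "function": "FUNCTION"}
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def scan(text):
    """Yield (line, token) pairs in the LINE: ID(value) spirit of the slide."""
    line, pos = 1, 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"line {line}: unexpected character {text[pos]!r}")
        kind, lexeme = match.lastgroup, match.group()
        if kind == "ID" and lexeme in KEYWORDS:
            yield line, KEYWORDS[lexeme]
        elif kind in ("ID", "INT_LITERAL"):
            yield line, f"{kind}({lexeme})"
        elif kind not in ("WS", "COMMENT"):
            yield line, kind
        line += lexeme.count("\n")   # only whitespace lexemes contain newlines
        pos = match.end()
```

Calling `list(scan(source))` on the first three lines of the example yields the pairs listed on the slide, with whitespace and the comment silently dropped.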
What is a token? • Lexeme – substring of original text constituting an identifiable unit • Identifiers, Values, reserved words, … • Record type storing: • Kind • Value (when applicable) • Start-position/end-position • Any information that is useful for the parser • Different for different languages
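The record described above could look like the following sketch. The field names `kind`, `value`, `start`, and `end` mirror the bullets; the class name `Token`, the helper `lexeme_of`, and the choice of character offsets for positions are assumptions, since the slide leaves the representation open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    kind: str                # e.g. "ID", "INT_LITERAL", "VAR"
    value: Optional[str]     # None when the kind alone is enough (e.g. "SEMI")
    start: int               # offset of the first character of the lexeme
    end: int                 # offset one past the last character


def lexeme_of(token: Token, text: str) -> str:
    """Recover the lexeme, i.e. the substring of the original text."""
    return text[token.start:token.end]
```

Keeping positions in the token lets later phases report errors against the original text without rescanning it.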
C++ example 1 • Splitting text into tokens can be tricky • How should the code below be split? vector<vector<int>> myVector • Is >> a single operator token, or two > tokens?
C++ example 2 • Splitting text into tokens can be tricky • How should the code below be split? vector<vector<int> > myVector • With the space, > > is scanned as two > tokens
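The two C++ examples can be reproduced with a small sketch. It assumes a longest-match rule (a scanner convention often called "maximal munch"), which the slides do not name, and a hypothetical operator set containing only `>>`, `>`, and `<`; `split_tokens` is an illustrative name.

```python
OPERATORS = [">>", ">", "<"]   # hypothetical subset, enough for the example


def split_tokens(text):
    """Split text into operators and identifier-like runs, longest operator first."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for op in sorted(OPERATORS, key=len, reverse=True):
            if text.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            # No operator matched: read a run of identifier characters.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            if j == i:
                raise ValueError(f"unexpected character {text[i]!r}")
            tokens.append(text[i:j])
            i = j
    return tokens
```

Under this rule `vector<vector<int>> myVector` ends with a single `>>` token, while `vector<vector<int> > myVector` ends with two `>` tokens, which is exactly the ambiguity the slides point at.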
Separating tokens • Separator lexemes (whitespace, comments) are recognized but get consumed rather than transmitted to the parser • Examples: if versus i f versus i/*comment*/f
Designing a scanner • Define each type of lexeme • Reserved words: var, if, for, while • Operators: < = ++ • Identifiers: myFunction • Literals: 123 "hello" • Annotations: @SuppressWarnings • But how do we define lexemes of unbounded length? • Regular expressions
Regular languages refresher • Formal languages • Alphabet = finite set of letters • Word = finite sequence of letters • Language = set of words • Regular languages defined equivalently by • Regular expressions • Finite-state automata
Regular expressions • Empty string: Є • Letter: a • Concatenation: R1 R2 • Union: R1 | R2 • Kleene-star: R* • Shorthand: R+ stands for R R* • Scope: (R) • Example: (0* 1*) | (1* 0*) • What is this language?
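One way to explore the example is to try it on a few words. The sketch below uses Python's `re` syntax, in which concatenation, `|`, `*`, and parentheses mean the same as on the slide; `in_language` is an illustrative name. The expression describes words in which all 0s come before all 1s, or all 1s come before all 0s.

```python
import re

# (0* 1*) | (1* 0*) written in Python's regular-expression syntax.
PATTERN = re.compile(r"(0*1*)|(1*0*)")


def in_language(word: str) -> bool:
    """True when the whole word (not just a prefix) matches the expression."""
    return PATTERN.fullmatch(word) is not None
```

For instance, the empty word, `0011`, and `1100` are in the language, while `010` and `101` are not, because they switch letters twice.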
Exercise 1 - Question • Language of Java identifiers • Identifiers start with either an underscore ‘_’ or a letter • Continue with either underscore, letter, or digit
Exercise 1 - Answer • Language of Java identifiers • Identifiers start with either an underscore ‘_’ or a letter • Continue with either underscore, letter, or digit • (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)* • Using shorthand macros: First = _|a|b|…|z|A|…|Z • Next = First|0|…|9 • R = First Next*
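The macro form of the answer translates almost literally into a character-class regular expression. This is a sketch following the exercise's simplified definition of identifiers (underscore, ASCII letters, digits); the names `FIRST`, `NEXT`, and `is_identifier` are illustrative.

```python
import re

FIRST = r"[_a-zA-Z]"        # First = _ | a | ... | z | A | ... | Z
NEXT = r"[_a-zA-Z0-9]"      # Next  = First | 0 | ... | 9
IDENTIFIER = re.compile(FIRST + NEXT + "*")   # R = First Next*


def is_identifier(word: str) -> bool:
    return IDENTIFIER.fullmatch(word) is not None
```

Words such as `_x1` and `myFunction` are accepted; `9lives` is rejected because a digit cannot come first, and the empty word is rejected because First is mandatory.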
Exercise 2 - Question • Language of rational numbers in decimal representation (no leading or trailing zeros) • 0 • 123.757 • .933333 • Not 007 • Not 0.30
Exercise 2 - Answer • Language of rational numbers in decimal representation (no leading or trailing zeros) • Digit = 1|2|…|9 • Digit0 = 0|Digit • Num = Digit Digit0* • Frac = Digit0* Digit • Pos = Num | .Frac | 0.Frac | Num.Frac • PosOrNeg = (Є|-)Pos • R = 0 | PosOrNeg
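The macros can be checked against the exercise's own examples by transcribing them one by one into Python's `re` syntax. This is a sketch: the macro names are taken from the answer, and `is_rational` is an illustrative name.

```python
import re

DIGIT = "[1-9]"                       # Digit  = 1 | 2 | ... | 9
DIGIT0 = "[0-9]"                      # Digit0 = 0 | Digit
NUM = f"{DIGIT}{DIGIT0}*"             # Num    = Digit Digit0*
FRAC = f"{DIGIT0}*{DIGIT}"            # Frac   = Digit0* Digit
# Pos = Num | .Frac | 0.Frac | Num.Frac
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
# R = 0 | PosOrNeg, where PosOrNeg = (empty | -) Pos
R = re.compile(rf"0|-?{POS}")


def is_rational(word: str) -> bool:
    return R.fullmatch(word) is not None
```

The positive examples `0`, `123.757`, and `.933333` match; `007` fails because Num cannot start with 0, and `0.30` fails because Frac must end in a non-zero digit.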
Exercise 3 - Question • Equal number of opening and closing brackets, nested: [^n ]^n = [], [[]], [[[]]], …
Exercise 3 - Answer • Equal number of opening and closing brackets, nested: [^n ]^n = [], [[]], [[[]]], … • Not regular • Context-free • Grammar: S ::= [] | [S]
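The grammar can be read directly as a recursive recognizer, which is a sketch of why this language needs more than a finite automaton: the recursion depth grows with the word and plays the role of unbounded memory. The function name `derives_s` is illustrative.

```python
def derives_s(word: str) -> bool:
    """Recognize words generated by S ::= [] | [S]."""
    if word == "[]":                       # S ::= []
        return True
    return (                               # S ::= [S]
        len(word) >= 4
        and word[0] == "["
        and word[-1] == "]"
        and derives_s(word[1:-1])
    )
```

The words `[]` and `[[[]]]` are accepted, while `[[]` (unbalanced) and `[][]` (balanced but not nested) are rejected.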
Finite automata • An automaton is defined by states and transitions • [Figure: automaton with a start state, an accepting state, and transitions labeled a, b, b, c]
Automaton running example • Words are read left-to-right • Word accepted when the run ends in an accepting state • [Figure: the same automaton, shown reading an input word one letter per step]
Word outside of language • Missing transition means non-acceptance • [Figure: the same automaton, stuck on a letter with no matching transition]
Exercise - Question • What is the language defined by the automaton below? • [Figure: automaton with a start state, an accepting state, and transitions labeled a, b, b, c]
Exercise - Answer • What is the language defined by the automaton below? • a b* c • Generally: all paths leading to accepting states • [Figure: the same automaton]
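Running a deterministic automaton amounts to a table lookup per letter. The sketch below encodes one automaton for a b* c; the state numbers 0, 1, 2 and the names `TRANSITIONS` and `dfa_accepts` are assumptions, and the table may not match the slide's drawing state for state. A missing table entry is treated as non-acceptance, as stated earlier.

```python
START = 0
ACCEPTING = {2}
# (state, letter) -> next state; anything absent is a missing transition.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}


def dfa_accepts(word: str) -> bool:
    """Read the word left to right; accept if the run ends in an accepting state."""
    state = START
    for letter in word:
        key = (state, letter)
        if key not in TRANSITIONS:
            return False              # missing transition means non-acceptance
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

With this table `ac` and `abbc` are accepted, `ab` is rejected because the run ends in a non-accepting state, and `abca` is rejected because no transition leaves the accepting state.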
Non-deterministic automata • Allow multiple transitions from given state labeled by same letter • [Figure: NFA with a start state and transitions labeled a, a, b, b, c, c]
NFA run example • Maintain set of states • Accept word if any of the states in the set is accepting • [Figure: the same NFA, shown reading an input word one letter per step]
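The set-of-states idea can be sketched as follows. The automaton here is hypothetical, since the slide's drawing is not recoverable from the text: from state 0 the letter a leads to both state 1 and state 3, giving the language a b* c | a c* b. All state numbers and names are assumptions.

```python
START = 0
ACCEPTING = {2, 4}
# (state, letter) -> set of possible next states (hypothetical NFA).
NFA = {
    (0, "a"): {1, 3},     # two transitions on the same letter
    (1, "b"): {1},
    (1, "c"): {2},
    (3, "c"): {3},
    (3, "b"): {4},
}


def nfa_accepts(word: str) -> bool:
    """Track every state the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        current = {t for s in current for t in NFA.get((s, letter), set())}
    return bool(current & ACCEPTING)   # accept if any tracked state is accepting
```

After reading `a` the tracked set is {1, 3}; the run never has to guess which branch was the right one, it simply carries both forward.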
NFA+Є automata • Є transitions can “fire” without reading the input • [Figure: automaton with a start state and transitions labeled a, b, c, Є]
NFA+Є run example • An Є transition can non-deterministically take place • Word accepted • [Figure: the same automaton, shown reading an input word one letter per step]
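A common way to simulate Є transitions is to extend the tracked set with every state reachable through Є moves alone (an "Є-closure"), before and after each letter. The sketch below uses a hypothetical automaton for a b* c with one Є transition; the state numbers, the use of the empty string as the Є label, and the function names are assumptions.

```python
EPS = ""                      # label used here for Є transitions (an assumption)
START = 0
ACCEPTING = {3}
MOVES = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, EPS): {2},            # fires without reading input
    (2, "c"): {3},
}


def closure(states):
    """All states reachable from `states` using only Є transitions."""
    seen, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in MOVES.get((state, EPS), set()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def eps_nfa_accepts(word: str) -> bool:
    current = closure({START})
    for letter in word:
        step = set()
        for state in current:
            step |= MOVES.get((state, letter), set())
        current = closure(step)
    return bool(current & ACCEPTING)
```

After reading `a` the tracked set is {1, 2}: state 2 is included because the Є transition may or may not have fired.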
Reg-exp vs. automata • Regular expressions are declarative • Offer a compact way for humans to define a regular language • Don’t offer a direct way to check whether a given word is in the language • Automata are operational • Define an algorithm for deciding whether a given word is in a regular language • Not a natural notation for humans
From reg. exp. to automata • Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression • Proof: by induction on the structure of the regular expression • For each sub-expression R we build an automaton with exactly one start state and one accepting state • Start state has no incoming transitions • Accepting state has no outgoing transitions
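Base cases • R = Є: a start state with a single Є transition to the accepting state • R = a: a start state with a single transition labeled a to the accepting state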
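The base cases can be sketched as small automaton fragments, assuming the usual two-state shape in which a single edge connects the start state to the accepting state. The names `Fragment`, `base_case`, and `EPSILON`, and the representation of edges as (source, label, target) triples, are assumptions for illustration; the inductive cases are not covered here.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional, Tuple

EPSILON = None                 # label standing for an Є transition (an assumption)
_fresh_state = count()         # hands out unused state numbers

Edge = Tuple[int, Optional[str], int]


@dataclass
class Fragment:
    start: int                 # exactly one start state
    accept: int                # exactly one accepting state
    edges: List[Edge]


def base_case(label: Optional[str]) -> Fragment:
    """Build the fragment for R = Є (label EPSILON) or R = a (label "a")."""
    start, accept = next(_fresh_state), next(_fresh_state)
    return Fragment(start, accept, [(start, label, accept)])
```

Both fragments satisfy the invariant used in the proof: nothing enters the start state and nothing leaves the accepting state, which is what lets larger expressions be assembled from them.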