Winter 2012-2013 Compiler Principles Lexical Analysis (Scanning)

Winter 2012-2013 Compiler Principles: Lexical Analysis (Scanning). Mayer Goldberg and Roman Manevich, Ben-Gurion University

General stuff • Topics taught by me • Lexical analysis (scanning) • Syntax analysis (parsing) • … • Dataflow analysis • Register allocation • Slides will be available from web-site after lecture • Request: please mute mobiles, tablets, super-cool squeaking devices

Today • Understand role of lexical analysis • Lexical analysis theory • Implementing modern scanner

Role of lexical analysis • First part of compiler front-end • Convert stream of characters into stream of tokens • Split text into most basic meaningful strings • Simplify input for syntax analysis • Compiler pipeline (figure): High-level Language (Scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code

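From scanning to parsing • Program text: 5 + (7 * x) • Lexical Analyzer: program text → token stream • Grammar: E ::= id | num | E + E | E * E | ( E ) • Parser: token stream → Abstract Syntax Tree if the input is valid, syntax error otherwise • Abstract Syntax Tree (figure): + with children num and *, where * has children num and x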

Javascript example • Identify basic units in this code
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}

Javascript example • Identify basic units in this code • Kinds of units labeled in the same code (figure): keyword, identifier, operator, numeric literal, string literal, punctuation, whitespace

Scanner output • The scanner turns the same code into a stream of tokens, each written as LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...

What is a token? • Lexeme – substring of original text constituting an identifiable unit • Identifiers, values, reserved words, … • Token – record type storing: • Kind • Value (when applicable) • Start-position/end-position • Any information that is useful for the parser • Different for different languages
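The record described above could be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `Token`, its field names, the `line` field, and the helper `show` are all assumptions chosen to mirror the "LINE: ID(value)" notation used in the scanner output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: kind, optional value, and source position."""
    kind: str              # e.g. "VAR", "ID", "INT_LITERAL"
    value: Optional[str]   # the lexeme text, when applicable
    start: int             # offset of the first character of the lexeme
    end: int               # offset just past the last character
    line: int              # source line, useful for error messages


def show(token: Token) -> str:
    """Render a token in the LINE: KIND(value) style."""
    if token.value is None:
        return f"{token.line}: {token.kind}"
    return f"{token.line}: {token.kind}({token.value})"


# Tokens for the beginning of `var currOption = 0;`
var_token = Token("VAR", None, 0, 3, 1)
id_token = Token("ID", "currOption", 4, 14, 1)
```

Keeping positions in the record lets later compiler phases report errors against the original text.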

C++ example 1 • Splitting text into tokens can be tricky • How should the code below be split? vector<vector<int>> myVector • Is >> a single operator token, or two tokens >, >?

C++ example 2 • Splitting text into tokens can be tricky • How should the code below be split? vector<vector<int> > myVector • Here > > is two tokens: >, >

Example tokens

Separating tokens • Some lexemes (such as whitespace and comments) are recognized but get consumed rather than transmitted to the parser • Compare: if • i f • i/*comment*/f

Preprocessor directives in C

Designing a scanner • Define each type of lexeme • Reserved words: var, if, for, while • Operators: < = ++ • Identifiers: myFunction • Literals: 123 "hello" • Annotations: @SuppressWarnings • But how do we define lexemes of unbounded length?

Designing a scanner • Define each type of lexeme • Reserved words: var, if, for, while • Operators: < = ++ • Identifiers: myFunction • Literals: 123 "hello" • Annotations: @SuppressWarnings • But how do we define lexemes of unbounded length? • Regular expressions
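To make the idea concrete, here is a rough sketch that defines each lexeme type from the list above with a regular expression and scans a string with Python's `re` module. The kind names (`RESERVED`, `ANNOTATION`, and so on), the exact patterns, and the function `tokenize` are illustrative assumptions, not a definition taken from any particular language.

```python
import re

# Hypothetical lexeme definitions; earlier entries take priority,
# so reserved words are tried before general identifiers.
TOKEN_SPEC = [
    ("RESERVED",   r"\b(?:var|if|for|while)\b"),
    ("ANNOTATION", r"@[A-Za-z_][A-Za-z_0-9]*"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("INT",        r"[0-9]+"),
    ("STRING",     r'"[^"\n]*"'),
    ("OPERATOR",   r"\+\+|<|="),
    ("SKIP",       r"\s+"),  # whitespace is recognized but not emitted
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, or raise on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For example, `tokenize('var x = 123')` yields a reserved word, an identifier, an operator, and an integer literal, while `variable` is scanned as a single identifier because the reserved-word pattern requires a word boundary.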

Regular languages refresher • Formal languages • Alphabet = finite set of letters • Word = finite sequence of letters • Language = set of words • Regular languages defined equivalently by • Regular expressions • Finite-state automata

Regular expressions • Empty string: ε • Letter: a • Concatenation: R1 R2 • Union: R1 | R2 • Kleene-star: R* • Shorthand: R+ stands for R R* • Scope: (R) • Example: (0* 1*) | (1* 0*) • What is this language?
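One way to explore the example expression is to test short words against it. The sketch below uses Python's `re` syntax, in which concatenation, `|`, and `*` have the same meaning as in the list above; the helper names `in_language` and `words_up_to` are made up for this illustration.

```python
import re
from itertools import product

# The example expression (0* 1*) | (1* 0*) written in Python's regex syntax.
PATTERN = re.compile(r"(0*1*)|(1*0*)")


def in_language(word):
    """True if the whole word matches the expression."""
    return PATTERN.fullmatch(word) is not None


def words_up_to(max_length):
    """Enumerate all words over the alphabet {0, 1} up to the given length."""
    for length in range(max_length + 1):
        for letters in product("01", repeat=length):
            yield "".join(letters)


# Words of length at most 3 that are NOT in the language.
rejected = [w for w in words_up_to(3) if not in_language(w)]
```

Inspecting `rejected` gives a hint about which words the expression excludes.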

Exercise 1 - Question • Language of Java identifiers • Identifiers start with either an underscore '_' or a letter • Continue with either underscore, letter, or digit

Exercise 1 - Answer • Language of Java identifiers • Identifiers start with either an underscore '_' or a letter • Continue with either underscore, letter, or digit • (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)* • Using shorthand macros: • First = _|a|b|…|z|A|…|Z • Next = First|0|…|9 • R = First Next*
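The macro-based answer translates almost directly into Python's `re` syntax. In this sketch character classes stand in for the long unions, and the names `FIRST`, `NEXT`, and `is_identifier` are illustrative; it covers only the simplified identifier rule stated in the exercise.

```python
import re

# First = _ | a..z | A..Z, written as a character class.
FIRST = r"[_a-zA-Z]"
# Next = First | 0..9
NEXT = rf"(?:{FIRST}|[0-9])"
# R = First Next*
IDENTIFIER = re.compile(rf"{FIRST}{NEXT}*")


def is_identifier(word):
    """True if the whole word is an identifier under the exercise's rule."""
    return IDENTIFIER.fullmatch(word) is not None
```

For instance, `is_identifier("_x1")` holds, while `is_identifier("9lives")` does not because the word starts with a digit.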

Exercise 2 - Question • Language of rational numbers in decimal representation (no leading or trailing zeros) • 0 • 123.757 • .933333 • Not 007 • Not 0.30
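One possible answer, offered only as a sketch, can be checked against the five examples above. The slide does not say whether a form such as 0.3 should be allowed; this sketch assumes it is, and that assumption (along with the names `DECIMAL` and `is_decimal`) is not from the source.

```python
import re

# Integer part: either a lone 0 or digits not starting with 0.
# Fraction: a dot followed by digits ending in a non-zero digit.
# Assumption: "0.3" is accepted; only a trailing zero as in "0.30" is rejected.
DECIMAL = re.compile(r"(?:0|[1-9][0-9]*)(?:\.[0-9]*[1-9])?|\.[0-9]*[1-9]")


def is_decimal(word):
    """True if the whole word matches the sketched pattern."""
    return DECIMAL.fullmatch(word) is not None
```

A stricter reading of the exercise would need a different pattern, so treat this as a starting point for discussion.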

Exercise 3 - Question • Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …

Exercise 3 - Answer • Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], … • Not regular • Context-free • Grammar: S ::= [] | [S]
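The grammar S ::= [] | [S] can be recognized with a small recursive function, which needs the call stack to remember how many brackets are still open (something a finite automaton cannot do). The function names `parse_s` and `is_nested` are hypothetical, and this is a sketch rather than a general parser.

```python
def parse_s(word, pos=0):
    """Try to match S starting at pos; return the position after it, or -1."""
    if pos < len(word) and word[pos] == "[":
        # Alternative 1: S ::= []
        if pos + 1 < len(word) and word[pos + 1] == "]":
            return pos + 2
        # Alternative 2: S ::= [S]
        inner_end = parse_s(word, pos + 1)
        if inner_end != -1 and inner_end < len(word) and word[inner_end] == "]":
            return inner_end + 1
    return -1


def is_nested(word):
    """True if the whole word is derived from S."""
    return parse_s(word) == len(word)
```

Note that `is_nested("[][]")` is false: the word has equally many opening and closing brackets but is not of the form [ⁿ]ⁿ.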

Finite automata • An automaton is defined by states and transitions [Figure: automaton with transitions labeled a, b, c; annotations mark the start state, a transition, and the accepting state]

Automaton running example • Words are read left-to-right • Word accepted [Figure: the automaton with transitions labeled a, b, c reading a word]

Word outside of language • Missing transition means non-acceptance [Figure: the automaton with transitions labeled a, b, c reading a word]

Exercise - Question • What is the language defined by the automaton below? [Figure: automaton with transitions labeled a, b, c]

Exercise - Answer • What is the language defined by the automaton below? • a b* c • Generally: the language consists of the words labeling paths from the start state to an accepting state [Figure: automaton with transitions labeled a, b, c]
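Running a deterministic automaton on a word can be sketched as a table lookup per letter. The automaton below is one that accepts a b* c; since the slide's figure is not reproduced here, the state names `q0`, `q1`, `q2` and the exact shape of the table are assumptions made for illustration.

```python
# Transition table for an automaton accepting a b* c (state names are illustrative).
DFA_DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}
DFA_START = "q0"
DFA_ACCEPTING = {"q2"}


def dfa_accepts(word):
    """Read the word left-to-right, following one transition per letter."""
    state = DFA_START
    for letter in word:
        key = (state, letter)
        if key not in DFA_DELTA:
            return False  # missing transition means non-acceptance
        state = DFA_DELTA[key]
    return state in DFA_ACCEPTING
```

With this table, `dfa_accepts("abbc")` holds, while `dfa_accepts("ab")` fails because the run ends in a non-accepting state.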

Non-deterministic automata • Allow multiple transitions from a given state labeled by the same letter [Figure: non-deterministic automaton with transitions labeled a, b, c]

NFA run example • Maintain set of states • Accept word if any of the states in the set is accepting [Figure: non-deterministic automaton with transitions labeled a, b, c reading a word]
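The set-of-states idea can be sketched in a few lines. The automaton used here is hypothetical (it is not the one from the slide's figure): from `q0` the letter a leads to both `q1` and `q2`, so the run has to track several states at once.

```python
# A hypothetical NFA: each (state, letter) pair maps to a SET of successor states.
NFA_DELTA = {
    ("q0", "a"): {"q1", "q2"},  # two transitions on the same letter
    ("q1", "b"): {"q1"},
    ("q1", "c"): {"q3"},
    ("q2", "c"): {"q2"},
    ("q2", "b"): {"q3"},
}
NFA_START = "q0"
NFA_ACCEPTING = {"q3"}


def nfa_accepts(word):
    """Maintain the set of states the NFA could be in after each letter."""
    current = {NFA_START}
    for letter in word:
        successors = set()
        for state in current:
            successors |= NFA_DELTA.get((state, letter), set())
        current = successors
    # Accept if any state in the final set is accepting.
    return bool(current & NFA_ACCEPTING)
```

If the set becomes empty, every possible run has hit a missing transition and the word is rejected.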

NFA+ε automata • ε transitions can "fire" without reading the input [Figure: automaton with transitions labeled a, b, c and an ε transition]

NFA+ε run example • When the current state has an ε transition, it can non-deterministically take place • Word accepted [Figure: automaton with transitions labeled a, b, c and an ε transition reading a word]
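A run with ε transitions can be simulated by extending the set of current states with everything reachable through ε transitions alone. The automaton below is a hypothetical one accepting a b* c, and the `EPS` marker and function names are assumptions for this sketch.

```python
EPS = ""  # assumed label for ε transitions in the table below

# A hypothetical NFA+ε: q1 can silently move to q2.
EPS_DELTA = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q1"},
    ("q1", EPS): {"q2"},
    ("q2", "c"): {"q3"},
}
EPS_START = "q0"
EPS_ACCEPTING = {"q3"}


def eps_closure(states):
    """All states reachable from the given ones using only ε transitions."""
    closure = set(states)
    pending = list(states)
    while pending:
        state = pending.pop()
        for target in EPS_DELTA.get((state, EPS), set()):
            if target not in closure:
                closure.add(target)
                pending.append(target)
    return closure


def eps_nfa_accepts(word):
    """Track the ε-closed set of possible states while reading the word."""
    current = eps_closure({EPS_START})
    for letter in word:
        moved = set()
        for state in current:
            moved |= EPS_DELTA.get((state, letter), set())
        current = eps_closure(moved)
    return bool(current & EPS_ACCEPTING)
```

Taking the closure after every letter accounts for ε transitions firing without consuming input.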

Reg-exp vs. automata • Regular expressions are declarative • Offer a compact way for humans to define a regular language • Don't offer a direct way to check whether a given word is in the language • Automata are operational • Define an algorithm for deciding whether a given word is in a regular language • Not a natural notation for humans

From reg. exp. to automata • Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression • Proof: by induction on the structure of the regular expression • For each sub-expression R we build an automaton with exactly one start state and one accepting state • Start state has no incoming transitions • Accepting state has no outgoing transitions

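Base cases • R = ε: a start state with a single ε transition to the accepting state • R = a: a start state with a single transition labeled a to the accepting state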
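The two base cases can be sketched as small builder functions that each return an automaton with one start state and one accepting state. The `Fragment` record, the `EPS` marker, and the function names are assumptions for illustration; only the base cases are covered here.

```python
from dataclasses import dataclass, field
from itertools import count

EPS = None          # assumed marker for an ε transition
_fresh = count()    # source of fresh, unused state ids


@dataclass
class Fragment:
    """Hypothetical automaton fragment with a single start and accepting state."""
    start: int
    accept: int
    transitions: list = field(default_factory=list)  # (source, label, target) triples


def base_epsilon():
    """Automaton for R = ε: start --ε--> accept."""
    start, accept = next(_fresh), next(_fresh)
    return Fragment(start, accept, [(start, EPS, accept)])


def base_letter(letter):
    """Automaton for R = a: start --a--> accept."""
    start, accept = next(_fresh), next(_fresh)
    return Fragment(start, accept, [(start, letter, accept)])
```

Both fragments satisfy the invariants stated in the proof: nothing enters the start state and nothing leaves the accepting state.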
