Fall 2014-2015 Compiler Principles
Lecture 1: Lexical Analysis
Roman Manevich, Ben-Gurion University
Agenda • Understand role of lexical analysis in a compiler • Lexical analysis theory • Implementing professional scanner via scanner generator
JavaScript example
Can you identify some basic units in this code?

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }
JavaScript example: basic units
The opening lines of the same code already show most kinds of basic units:
• keyword: var
• identifier: currOption
• operator: =
• numeric literal: 0
• punctuation: ;
• comment: // Choose content to display in lower pane.
• string literal: "about-me"
• whitespace: the blanks and line breaks between the other units
Role of lexical analysis
Compiler pipeline: High-level language (scheme) → Lexical analysis → Syntax analysis (parsing) → AST, symbol table, etc. → Intermediate representation (IR) → Code generation → Executable code
• First part of compiler front-end
• Convert stream of characters into stream of tokens
• Split text into most basic meaningful strings
• Simplify input for syntax analysis
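From scanning to parsing
• Program text: 59 + (1257 * xPosition)
• Lexical analyzer: reports a lexical error, or passes a valid token stream to the parser
• Grammar: E → id | num | E + E | E * E | ( E )
• Parser: reports a syntax error, or builds a valid abstract syntax tree
• Abstract syntax tree for the example: + with children num and *, where * has children num and x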
Scanner output
Input: the JavaScript example (line 1 is the var declaration, line 2 the comment, line 3 the function header).
Stream of tokens, written as LINE: ID(value):
    1: VAR
    1: ID(currOption)
    1: EQ
    1: INT_LITERAL(0)
    1: SEMI
    3: FUNCTION
    3: ID(choose)
    3: LP
    3: ID(id)
    3: RP
    3: LCB
    ...
What is a token?
• Lexeme – substring of original text constituting an identifiable unit
  • Identifiers, values, reserved words, …
• Token – record type storing:
  • Kind
  • Value (when applicable)
  • Start-position/end-position
  • Any information that is useful for the parser
• Different for different languages
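The record described above could be sketched as follows. This is only an illustration: the class name `Token`, the field names, and the choice of 0-based columns are assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; all field names are assumptions."""
    kind: str              # e.g. "VAR", "ID", "INT_LITERAL"
    value: Optional[str]   # lexeme text when it matters (identifiers, literals)
    line: int              # 1-based line of the lexeme
    start: int             # 0-based column where the lexeme starts
    end: int               # 0-based column just past the lexeme

    def __str__(self) -> str:
        # Same "LINE: ID(value)" layout as the scanner-output slide.
        body = self.kind if self.value is None else f"{self.kind}({self.value})"
        return f"{self.line}: {body}"


tok = Token("ID", "currOption", line=1, start=4, end=14)
print(tok)  # 1: ID(currOption)
```

Tokens whose kind says everything (a keyword, a semicolon) simply carry no value.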
C++ example 1
Splitting text into tokens can be tricky. How should the code below be split?
    vector<vector<int>> myVector
Is >> one operator token, or the two tokens > and >?
C++ example 2
Splitting text into tokens can be tricky. How should the code below be split?
    vector<vector<int> > myVector
Here > and > are two tokens.
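Many scanners resolve this kind of ambiguity by always taking the longest lexeme that matches. The sketch below shows that convention on the two C++ examples. The operator set and the function name `split_operators` are hypothetical, and the slides do not say which rule a given C++ compiler applies.

```python
OPERATORS = [">>", ">", "<"]  # hypothetical operator set for this sketch


def split_operators(text):
    """Split a string of operator characters, preferring the longest match."""
    by_length = sorted(OPERATORS, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1              # whitespace only separates tokens
            continue
        for op in by_length:
            if text.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character {text[i]!r}")
    return tokens


print(split_operators(">>"))   # ['>>']
print(split_operators("> >"))  # ['>', '>']
```

Under longest match the closing brackets of example 1 come out as a single shift operator, which is why older C++ code is written with the space shown in example 2.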
Separating tokens
• Some lexemes (whitespace, comments) only separate tokens: they are recognized but consumed rather than transmitted to the parser
• Examples: if versus i f versus i/*comment*/f
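A small sketch of this behaviour, reading the slide's examples as the three inputs `if`, `i f` and `i/*comment*/f`. The patterns and the function name `words` are assumptions made for the illustration.

```python
import re

# Separators are matched like any other lexeme, but never emitted.
SEPARATOR = re.compile(r"\s+|/\*.*?\*/", re.DOTALL)
WORD = re.compile(r"[A-Za-z_]\w*")


def words(text):
    """Return the word lexemes of text, dropping whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        m = SEPARATOR.match(text, pos)
        if m:
            pos = m.end()       # consume the separator, emit nothing
            continue
        m = WORD.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        tokens.append(m.group())
        pos = m.end()
    return tokens


print(words("if"))              # ['if']
print(words("i f"))             # ['i', 'f']
print(words("i/*comment*/f"))   # ['i', 'f']
```

The blank and the comment have the same effect: each splits one lexeme into two without producing a token of its own.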
First step of designing a scanner
• Define each type of lexeme
  • Reserved words: var, if, for, while
  • Operators: < = ++
  • Identifiers: myFunction
  • Literals: 123 "hello"
  • Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?
  • Regular expressions
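To make the idea concrete, here is a minimal hand-written scanner that defines each lexeme type by a regular expression, using Python's `re` module. It is a sketch covering only a small subset of JavaScript; the token names follow the scanner-output slide where they appear there (RP for the right parenthesis and RCB are assumed names), and the helper names `TOKEN_SPEC`, `MASTER` and `scan` are made up for the example.

```python
import re

KEYWORDS = {"var", "function", "for", "if", "else"}

# One (name, regex) pair per lexeme type; order matters for ties.
TOKEN_SPEC = [
    ("COMMENT",     r"//[^\n]*"),
    ("NEWLINE",     r"\n"),
    ("SKIP",        r"[ \t]+"),
    ("INT_LITERAL", r"\d+"),
    ("ID",          r"[A-Za-z_]\w*"),
    ("EQ",          r"="),
    ("SEMI",        r";"),
    ("LP",          r"\("),
    ("RP",          r"\)"),
    ("LCB",         r"\{"),
    ("RCB",         r"\}"),
    ("ERROR",       r"."),   # anything else is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def scan(text):
    """Return tokens as 'LINE: KIND' or 'LINE: KIND(value)' strings."""
    line, out = 1, []
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "NEWLINE":
            line += 1
        elif kind in ("SKIP", "COMMENT"):
            continue                      # consumed, not passed on
        elif kind == "ERROR":
            raise ValueError(f"line {line}: unexpected character {lexeme!r}")
        elif kind == "ID" and lexeme in KEYWORDS:
            out.append(f"{line}: {lexeme.upper()}")
        elif kind in ("ID", "INT_LITERAL"):
            out.append(f"{line}: {kind}({lexeme})")
        else:
            out.append(f"{line}: {kind}")
    return out


source = ("var currOption = 0;\n"
          "// Choose content to display in lower pane.\n"
          "function choose ( id ) {")
print("\n".join(scan(source)))
```

Reserved words are matched by the identifier pattern and then looked up in a table, a common way to keep the pattern list short. A scanner generator automates this construction from a similar list of patterns.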
Agenda • Understand role of lexical analysis in a compiler • Convert text to stream of tokens • Lexical analysis theory • Implementing professional scanner via scanner generator
Regular languages refresher
• Formal languages
  • Alphabet = finite set of letters
  • Word = finite sequence of letters
  • Language = set of words
• Regular languages defined equivalently by
  • Regular expressions
  • Finite-state automata
Regular expressions
• Empty string: ε
• Letter: a
• Concatenation: R1 R2
• Union: R1 | R2
• Kleene-star: R*
• Shorthand: R+ stands for R R*
• Scope: (R)
• Example: (0* 1*) | (1* 0*)
  • What is this language?
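One way to explore the example is to test a few words against it. The sketch below uses Python's `re.fullmatch`; the helper name `in_language` is made up for the illustration.

```python
import re

# (0* 1*) | (1* 0*), written without the spaces used on the slide.
R = re.compile(r"(0*1*)|(1*0*)")


def in_language(word):
    """True if the whole word matches the regular expression."""
    return R.fullmatch(word) is not None


for w in ["", "0011", "1100", "000", "010", "101"]:
    print(repr(w), in_language(w))
```

The words that pass are exactly those over {0, 1} in which the letter changes at most once: a block of 0s followed by a block of 1s, or the other way round.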
Exercise 1 - Question
• Language of Java identifiers
  • Identifiers start with either an underscore ‘_’ or a letter
  • Continue with either underscore, letter, or digit

Exercise 1 - Answer
• (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Exercise 1 – Better answer
• Using shorthand macros:
    First = _|a|b|…|z|A|…|Z
    Next = First|0|…|9
    R = First Next*
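The macro style carries over directly to code. A sketch with Python's `re` module, where `FIRST`, `NEXT` and `IDENT` mirror the macros First, Next and R, and the helper name `is_identifier` is an assumption:

```python
import re
import string

FIRST = "[_" + string.ascii_letters + "]"                   # First = _|a|…|z|A|…|Z
NEXT = "[_" + string.ascii_letters + string.digits + "]"    # Next = First|0|…|9
IDENT = re.compile(FIRST + NEXT + "*")                      # R = First Next*


def is_identifier(word):
    """True if the whole word is an identifier under the slide's rules."""
    return IDENT.fullmatch(word) is not None


print(is_identifier("currOption"))  # True
print(is_identifier("_tmp1"))       # True
print(is_identifier("9lives"))      # False
```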
Exercise 2 - Question
• Language of rational numbers in decimal representation (no leading or trailing zeros)
• Positive examples: 0, 123.757, .933333, 0.7
• Negative examples: 007, 0.30
Exercise 2 - Answer
• Language of rational numbers in decimal representation (no leading or trailing zeros)
    Digit = 1|2|…|9
    Digit0 = 0|Digit
    Num = Digit Digit0*
    Frac = Digit0* Digit
    Pos = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (ε|-)Pos
    R = 0 | PosOrNeg
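The answer can be checked against the examples from the question by transcribing the macros one by one. This is a sketch using Python's `re` module; the helper name `is_rational` is made up.

```python
import re

DIGIT = "[1-9]"                       # Digit = 1|2|…|9
DIGIT0 = "[0-9]"                      # Digit0 = 0|Digit
NUM = f"{DIGIT}{DIGIT0}*"             # Num = Digit Digit0*
FRAC = f"{DIGIT0}*{DIGIT}"            # Frac = Digit0* Digit
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
POS_OR_NEG = f"-?{POS}"               # PosOrNeg = (ε|-)Pos
R = re.compile(f"0|{POS_OR_NEG}")     # R = 0 | PosOrNeg


def is_rational(word):
    """True if the whole word is in the language defined by R."""
    return R.fullmatch(word) is not None


for w in ["0", "123.757", ".933333", "0.7", "007", "0.30"]:
    print(w, is_rational(w))
```

The four positive examples are accepted and the two negative ones are rejected: `Num` cannot start with 0 and `Frac` cannot end with 0.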
Exercise 3 - Question
• Equal number of opening and closing parentheses: [ⁿ ]ⁿ = [], [[]], [[[]]], …
Exercise 3 - Answer
• Equal number of opening and closing parentheses: [ⁿ ]ⁿ = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]
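The grammar translates almost literally into a recursive check. A sketch, with the function name `derives` chosen for the example:

```python
def derives(word):
    """True if word can be derived from S ::= [] | [S]."""
    if word == "[]":
        return True                       # S ::= []
    return (len(word) > 2
            and word[0] == "[" and word[-1] == "]"
            and derives(word[1:-1]))      # S ::= [S]


print(derives("[[[]]]"))  # True
print(derives("[][]"))    # False
print(derives("[[]"))     # False
```

The recursion depth grows with n, so the check needs an unbounded amount of memory. A finite automaton has only a fixed number of states, which is the intuition behind the language not being regular.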
Finite automata
• An automaton is defined by states and transitions
[Figure: automaton with a start state, an accepting state, and transitions labeled a, b and c]
Automaton running example
• Words are read left-to-right
[Figure: step-by-step run of the automaton on an input word; the run ends in the accepting state, so the word is accepted]
Word outside of language
• Missing transition means non-acceptance
• Final state is not an accepting state
[Figure: two runs of the automaton on words outside the language, one stuck on a missing transition, one ending in a non-accepting state]
Exercise - Question
• What is the language defined by the automaton below?
[Figure: automaton with a start state and transitions labeled a, b and c]
Exercise - Answer
• What is the language defined by the automaton below?
  • a b* c
• Generally: all paths leading to accepting states
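Running such an automaton is a short loop over the input. The figure is not reproduced here, so the transition table below is an assumed three-state automaton for a b* c, with made-up state names q0, q1 and q2.

```python
# Assumed automaton for a b* c: q0 --a--> q1, q1 --b--> q1, q1 --c--> q2.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}
START, ACCEPTING = "q0", {"q2"}


def accepts(word):
    """Read the word left-to-right and report whether it is accepted."""
    state = START
    for letter in word:
        state = TRANSITIONS.get((state, letter))
        if state is None:
            return False              # missing transition: not accepted
    return state in ACCEPTING         # must end in an accepting state


print(accepts("abbc"))  # True
print(accepts("abb"))   # False: final state is not accepting
print(accepts("ba"))    # False: missing transition
```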
A little about me • Joined Ben-Gurion University two years ago • Research interests • Advanced compilation and synthesis techniques • Language-supported parallelism • Static analysis and verification
I am here for • Teaching you theory and practice of popular compiler algorithms • Hopefully make you think about solving problems by examples from the compilers world • Answering questions about material • Contacting me • e-mail: romanm@cs.bgu.ac.il • Office hours: see course web-page • Announcements • Forums (per assignment)
Non-deterministic automata
• Allow multiple transitions from given state labeled by same letter
[Figure: non-deterministic automaton with a start state and transitions labeled a, b and c, some letters labeling more than one transition]

NFA run example
• Maintain set of states
• Accept word if any of the states in the set is accepting
[Figure: step-by-step run of the NFA on an input word]
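The set-of-states idea can be sketched in a few lines. The automaton below is hypothetical, not a transcription of the figure: from the start state s, the letter a leads to both p and q; the p branch accepts a b* c and the q branch accepts a c* b. All state names are made up.

```python
# Hypothetical NFA: two transitions leave s on the same letter a.
NFA = {
    ("s", "a"): {"p", "q"},
    ("p", "b"): {"p"},
    ("p", "c"): {"f"},
    ("q", "c"): {"q"},
    ("q", "b"): {"f"},
}
START, ACCEPTING = "s", {"f"}


def nfa_accepts(word):
    """Track the set of states the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        current = {nxt
                   for state in current
                   for nxt in NFA.get((state, letter), set())}
    return bool(current & ACCEPTING)   # any accepting state suffices


print(nfa_accepts("abbc"))  # True, via the p branch
print(nfa_accepts("accb"))  # True, via the q branch
print(nfa_accepts("abcb"))  # False, the set of states becomes empty
```

An empty set plays the role of the missing transition in the deterministic case: once no state is possible, the word cannot be accepted.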