
Winter 2012-2013 Compiler Principles Lexical Analysis (Scanning)




  1. Winter 2012-2013 • Compiler Principles • Lexical Analysis (Scanning) • Mayer Goldberg and Roman Manevich, Ben-Gurion University

  2. General stuff • Topics taught by me • Lexical analysis (scanning) • Syntax analysis (parsing) • … • Dataflow analysis • Register allocation • Slides will be available from web-site after lecture • Request: please mute mobiles, tablets, super-cool squeaking devices

  3. Today • Understand role of lexical analysis • Lexical analysis theory • Implementing modern scanner

  4. Role of lexical analysis • First part of compiler front-end • Convert stream of characters into stream of tokens • Split text into most basic meaningful strings • Simplify input for syntax analysis [Figure: compiler pipeline: High-level language (Scheme) → Lexical analysis → Syntax analysis (parsing) → AST, symbol table, etc. → Intermediate representation (IR) → Code generation → Executable code]

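  5. From scanning to parsing • Program text: 5 + (7 * x) • Lexical analyzer turns the program text into a token stream • Grammar: E → id, E → num, E → E + E, E → E * E, E → ( E ) • Parser checks the token stream against the grammar (valid / syntax error) and builds the abstract syntax tree [Figure: abstract syntax tree with nodes +, num, *, num, x]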
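
The scanning step on this slide can be sketched in a few lines of Python. This is only an illustration, not the course's scanner: the token kind names (NUM, ID, PLUS, TIMES, LP, RP) and the helper `tokenize` are assumptions made up for the example.

```python
import re

# Hypothetical token kinds for the expression grammar on this slide.
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("LP", r"\("),
    ("RP", r"\)"),
    ("SKIP", r"\s+"),  # whitespace is recognized but not passed on
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(text):
    """Convert a string of characters into a list of (kind, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("5 + (7 * x)")` yields the kinds NUM, PLUS, LP, NUM, TIMES, ID, RP, which is the kind of stream a parser for the grammar above would consume.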

  6. JavaScript example • Identify basic units in this code var currOption = 0; // Choose content to display in lower pane. function choose ( id ) { var menu = ["about-me", "publications", "teaching", "software", "activities"]; for (i = 0; i < menu.length; i++) { currOption = menu[i]; var elt = document.getElementById(currOption); if (currOption == id && elt.style.display == "none") { elt.style.display = "block"; } else { elt.style.display = "none"; } } }

  7. JavaScript example • Identify basic units in this code var currOption = 0; // Choose content to display in lower pane. function choose ( id ) { var menu = ["about-me", "publications", "teaching", "software", "activities"]; for (i = 0; i < menu.length; i++) { currOption = menu[i]; var elt = document.getElementById(currOption); if (currOption == id && elt.style.display == "none") { elt.style.display = "block"; } else { elt.style.display = "none"; } } }

  8. JavaScript example • Identify basic units in this code: keyword, identifier, operator, numeric literal, string literal, punctuation, whitespace var currOption = 0; // Choose content to display in lower pane. function choose ( id ) { var menu = ["about-me", "publications", "teaching", "software", "activities"]; for (i = 0; i < menu.length; i++) { currOption = menu[i]; var elt = document.getElementById(currOption); if (currOption == id && elt.style.display == "none") { elt.style.display = "block"; } else { elt.style.display = "none"; } } }

  9. Scanner output var currOption = 0; // Choose content to display in lower pane. function choose ( id ) { var menu = ["about-me", "publications", "teaching", "software", "activities"]; for (i = 0; i < menu.length; i++) { currOption = menu[i]; var elt = document.getElementById(currOption); if (currOption == id && elt.style.display == "none") { elt.style.display = "block"; } else { elt.style.display = "none"; } } } • Stream of tokens, written as LINE: ID(value) • 1: VAR • 1: ID(currOption) • 1: EQ • 1: INT_LITERAL(0) • 1: SEMI • 3: FUNCTION • 3: ID(choose) • 3: LP • 3: ID(id) • 3: RP • 3: LCB • ...

  10. What is a token? • Lexeme – substring of original text constituting an identifiable unit • Identifiers, Values, reserved words, … • Record type storing: • Kind • Value (when applicable) • Start-position/end-position • Any information that is useful for the parser • Different for different languages
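
The record described on this slide could look roughly like the following Python sketch. The field names and the helper `make_token` are assumptions for illustration; a real scanner would choose whatever fields its parser needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    kind: str              # e.g. "ID", "VAR", "INT_LITERAL" (hypothetical names)
    value: Optional[str]   # the lexeme, when applicable
    start: int             # start position in the original text (inclusive)
    end: int               # end position in the original text (exclusive)


def make_token(kind, text, start, end, keep_value=True):
    """Build a token for the lexeme text[start:end]."""
    lexeme = text[start:end]
    return Token(kind, lexeme if keep_value else None, start, end)
```

For example, `make_token("ID", "var currOption = 0;", 4, 14)` stores the lexeme `currOption` together with its positions, while a keyword token such as VAR can be built with `keep_value=False` because its kind already says everything.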

  11. C++ example 1 • Splitting text into tokens can be tricky • How should the code below be split? vector<vector<int>> myVector • Is >> a single operator, or two > tokens?

  12. C++ example 2 • Splitting text into tokens can be tricky • How should the code below be split? vector<vector<int> > myVector • Here > and > are two tokens

  13. Example tokens

  14. Separating tokens • Separator lexemes are recognized but consumed rather than passed to the parser • if • i f • i/*comment*/f

  15. Preprocessor directives in C

  16. Designing a scanner • Define each type of lexeme • Reserved words: var, if, for, while • Operators: < = ++ • Identifiers: myFunction • Literals: 123 “hello” • Annotations: @SuppressWarnings • But how do we define lexemes of unbounded length?

  17. Designing a scanner • Define each type of lexeme • Reserved words: var, if, for, while • Operators: < = ++ • Identifiers: myFunction • Literals: 123 “hello” • Annotations: @SuppressWarnings • But how do we define lexemes of unbounded length? • Regular expressions

  18. Regular languages refresher • Formal languages • Alphabet = finite set of letters • Word = sequence of letters • Language = set of words • Regular languages defined equivalently by • Regular expressions • Finite-state automata

  19. Regular expressions • Empty string: ε • Letter: a • Concatenation: R1 R2 • Union: R1 | R2 • Kleene-star: R* • Shorthand: R+ stands for R R* • Scope: (R) • Example: (0* 1*) | (1* 0*) • What is this language?
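
One way to explore the example expression is to test small words against it. The sketch below uses Python's `re` module as a stand-in for the regular-expression notation on the slide; the helper names `in_language` and `words_up_to` are made up for this illustration.

```python
import re
from itertools import product

# (0* 1*) | (1* 0*) written in Python regex syntax.
R = re.compile(r"(0*1*)|(1*0*)")


def in_language(word):
    """True if the whole word matches the expression."""
    return R.fullmatch(word) is not None


def words_up_to(n):
    """All words over the alphabet {0, 1} of length at most n."""
    for length in range(n + 1):
        for letters in product("01", repeat=length):
            yield "".join(letters)


accepted = [w for w in words_up_to(3) if in_language(w)]
rejected = [w for w in words_up_to(3) if not in_language(w)]
```

Among words of length at most 3, only `010` and `101` end up in `rejected`, which hints at the answer: the expression accepts words that switch between the two letters at most once.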

  20. Exercise 1 - Question • Language of Java identifiers • Identifiers start with either an underscore ‘_’ or a letter • Continue with either underscore, letter, or digit

  21. Exercise 1 - Answer • Language of Java identifiers • Identifiers start with either an underscore ‘_’ or a letter • Continue with either underscore, letter, or digit • (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)* • Using shorthand macros: • First = _|a|b|…|z|A|…|Z • Next = First|0|…|9 • R = First Next*
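
The macro-based answer translates almost directly into Python's regex syntax. This is a sketch of the identifier language exactly as the exercise states it (underscore or ASCII letter first, then underscores, letters, or digits); the names `FIRST`, `NEXT`, and `is_identifier` are chosen here for illustration.

```python
import re

FIRST = r"[_a-zA-Z]"        # First = _|a|...|z|A|...|Z
NEXT = r"[_a-zA-Z0-9]"      # Next  = First|0|...|9
IDENT = re.compile(rf"{FIRST}{NEXT}*")   # R = First Next*


def is_identifier(word):
    """True if the whole word is an identifier as defined in the exercise."""
    return IDENT.fullmatch(word) is not None
```

With this definition `is_identifier("_x1")` and `is_identifier("myFunction")` hold, while `is_identifier("9lives")` and `is_identifier("")` do not.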

  22. Exercise 2 - Question • Language of rational numbers in decimal representation (no leading or trailing zeros) • 0 • 123.757 • .933333 • Not 007 • Not 0.30

  23. Exercise 2 - Answer • Language of rational numbers in decimal representation (no leading or trailing zeros) • Digit = 1|2|…|9 • Digit0 = 0|Digit • Num = Digit Digit0* • Frac = Digit0* Digit • Pos = Num | .Frac | 0.Frac | Num.Frac • PosOrNeg = (ε|-) Pos • R = 0 | PosOrNeg
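
The macros in this answer can be checked against the examples from the question by transcribing them into Python regex syntax. The sketch below follows the slide's definitions one for one; only the Python variable names and the helper `is_rational` are assumptions.

```python
import re

DIGIT = r"[1-9]"                       # Digit  = 1|2|...|9
DIGIT0 = r"[0-9]"                      # Digit0 = 0|Digit
NUM = rf"{DIGIT}{DIGIT0}*"             # Num    = Digit Digit0*
FRAC = rf"{DIGIT0}*{DIGIT}"            # Frac   = Digit0* Digit
# Pos = Num | .Frac | 0.Frac | Num.Frac
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
# R = 0 | PosOrNeg, where PosOrNeg = (ε|-) Pos
R = re.compile(rf"(?:0|-?{POS})")


def is_rational(word):
    """True if the whole word is in the language defined by R."""
    return R.fullmatch(word) is not None
```

The three positive examples (`0`, `123.757`, `.933333`) are accepted, and the two negative ones (`007`, `0.30`) are rejected, because `Num` cannot start with 0 and `Frac` cannot end with 0.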

  24. Exercise 3 - Question • Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …

  25. Exercise 3 - Answer • Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], … • Not regular • Context-free • Grammar: S ::= [] | [S]
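
Although no regular expression defines this language, the grammar S ::= [] | [S] is easy to check with a recursive function that mirrors its two productions. The function name `derives` is made up for this sketch.

```python
def derives(word):
    """True if word can be derived from S ::= [] | [S]."""
    if word == "[]":                      # production S ::= []
        return True
    # production S ::= [S]: strip one matching pair and recurse
    return (
        len(word) >= 4
        and word[0] == "["
        and word[-1] == "]"
        and derives(word[1:-1])
    )
```

The recursion depth grows with the nesting depth n, which reflects the unbounded counting that a finite automaton cannot do.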

  26. Finite automata • An automaton is defined by states and transitions [Figure: automaton diagram with callouts for the start state, a transition, and the accepting state; edge labels a, b, c]

  27. Automaton running example • Words are read left-to-right [Figure: automaton with edge labels a, b, c]

  28. Automaton running example • Words are read left-to-right [Figure: automaton with edge labels a, b, c]

  29. Automaton running example • Words are read left-to-right [Figure: automaton with edge labels a, b, c]

  30. Automaton running example • Words are read left-to-right • Word accepted [Figure: automaton with edge labels a, b, c]

  31. Word outside of language [Figure: automaton with edge labels a, b, c]

  32. Word outside of language • Missing transition means non-acceptance [Figure: automaton with edge labels a, b, c]

  33. Exercise - Question • What is the language defined by the automaton below? [Figure: automaton with edge labels a, b, c]

  34. Exercise - Answer • What is the language defined by the automaton below? • a b* c • Generally: all paths leading to accepting states [Figure: automaton with edge labels a, b, c]
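
Running a deterministic automaton on a word takes only a loop over its letters. The sketch below encodes an automaton for a b* c, the language given in the answer; the state names q0, q1, q2 are assumptions, since the transcript does not preserve the original diagram.

```python
# Transition table: (state, letter) -> next state. State names are hypothetical.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}


def accepts(word):
    """Read the word left to right; a missing transition means non-acceptance."""
    state = START
    for letter in word:
        key = (state, letter)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

Here `accepts("abbc")` and `accepts("ac")` hold, `accepts("ab")` fails because the run ends in a non-accepting state, and `accepts("abca")` fails on a missing transition.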

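  35. Non-deterministic automata • Allow multiple transitions from given state labeled by same letter [Figure: non-deterministic automaton with edge labels a, b, c]

  36. NFA run example [Figure: non-deterministic automaton with edge labels a, b, c]

  37. NFA run example • Maintain set of states [Figure: non-deterministic automaton with edge labels a, b, c]

  38. NFA run example [Figure: non-deterministic automaton with edge labels a, b, c]

  39. NFA run example • Accept word if any of the states in the set is accepting [Figure: non-deterministic automaton with edge labels a, b, c]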
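
The "maintain a set of states" idea from the run example can be sketched as follows. The automaton here is a made-up NFA, not the one from the slides (whose diagram is not preserved in the transcript); it has two transitions labeled a leaving the start state.

```python
# Hypothetical NFA: (state, letter) -> set of possible next states.
TRANSITIONS = {
    ("q0", "a"): {"q1", "q2"},   # two transitions labeled a from q0
    ("q1", "b"): {"q1"},
    ("q1", "c"): {"q3"},
    ("q2", "c"): {"q2"},
    ("q2", "b"): {"q3"},
}
START = "q0"
ACCEPTING = {"q3"}


def accepts(word):
    """Track every state the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        following = set()
        for state in current:
            following |= TRANSITIONS.get((state, letter), set())
        current = following
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)
```

For this particular NFA, `accepts("abbc")` and `accepts("acb")` hold, while `accepts("a")` and `accepts("abcb")` do not; in the last case the set of states becomes empty.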

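  40. NFA+ε automata • ε transitions can “fire” without reading the input [Figure: automaton with edge labels a, b, c and an ε transition]

  41. NFA+ε run example [Figure: automaton with edge labels a, b, c and an ε transition]

  42. NFA+ε run example • Now ε transition can non-deterministically take place [Figure: automaton with edge labels a, b, c and an ε transition]

  43. NFA+ε run example [Figure: automaton with edge labels a, b, c and an ε transition]

  44. NFA+ε run example [Figure: automaton with edge labels a, b, c and an ε transition]

  45. NFA+ε run example [Figure: automaton with edge labels a, b, c and an ε transition]

  46. NFA+ε run example • Word accepted [Figure: automaton with edge labels a, b, c and an ε transition]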
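
Simulating an automaton with ε transitions adds one step to the set-of-states run: after every move, follow all ε transitions that can fire. The automaton below is a hypothetical one for a b* c with a single ε transition, chosen only to illustrate the technique; ε is represented by `None`.

```python
EPSILON = None  # label used for ε transitions in this sketch

# Hypothetical automaton: (state, label) -> set of next states.
TRANSITIONS = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q1"},
    ("q1", EPSILON): {"q2"},   # can fire without reading input
    ("q2", "c"): {"q3"},
}
START = "q0"
ACCEPTING = {"q3"}


def epsilon_closure(states):
    """All states reachable from `states` using only ε transitions."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for target in TRANSITIONS.get((state, EPSILON), set()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return closure


def accepts(word):
    current = epsilon_closure({START})
    for letter in word:
        moved = set()
        for state in current:
            moved |= TRANSITIONS.get((state, letter), set())
        current = epsilon_closure(moved)
    return bool(current & ACCEPTING)
```

After reading `a`, the set of states is {q1, q2}: the automaton may stay in q1 to read more b's, or take the ε transition and wait for c.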

  47. Reg-exp vs. automata • Regular expressions are declarative • Offer compact way to define a regular language by humans • Don’t offer direct way to check whether a given word is in the language • Automata are operative • Define an algorithm for deciding whether a given word is in a regular language • Not a natural notation for humans

  48. From reg. exp. to automata • Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression • Proof: by induction on the structure of the regular expression • For each sub-expression R we build an automaton with exactly one start state and one accepting state • Start state has no incoming transitions • Accepting state has no outgoing transitions

  49. From reg. exp. to automata • Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression • Proof: by induction on the structure of the regular expression [Figure: automaton fragment with a start state]

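  50. Base cases • R = ε [Figure: start state with an ε transition to the accepting state] • R = a [Figure: start state with an a transition to the accepting state]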
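
The two base cases can be sketched as small automaton fragments, each with exactly one start state and one accepting state as the proof requires. The `Fragment` type, the integer state ids, and the function names are assumptions for illustration; the inductive cases (concatenation, union, star) are not covered here.

```python
from dataclasses import dataclass
from itertools import count

EPSILON = None        # label for ε transitions in this sketch
_state_ids = count()  # fresh integer state ids


@dataclass
class Fragment:
    start: int
    accept: int
    transitions: list  # list of (source, label, target) triples


def _single_edge(label):
    """Fragment with two fresh states joined by one labeled transition."""
    start, accept = next(_state_ids), next(_state_ids)
    return Fragment(start, accept, [(start, label, accept)])


def epsilon_fragment():
    """Base case R = ε."""
    return _single_edge(EPSILON)


def letter_fragment(letter):
    """Base case R = a, for a single letter a."""
    return _single_edge(letter)
```

Both fragments satisfy the invariants stated in the proof: nothing enters the start state and nothing leaves the accepting state.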
