Week 2 – Lecture 1



  1. Week 2 – Lecture 1: Compiler Construction • Lexical Analysis • The language of Lexical Analysis • Regular Expressions • DFAs and NFAs • Errors in Lexical Analysis • Reading: section 2.6, Chapter 3

  2. Lexical Analysis • Why split it from parsing? • Simplifies design • Parsers with whitespace and comments are more awkward • Efficiency • Only use the most powerful technique that works • And nothing more • No parsing sledgehammers for lexical nuts • Portability • More modular code • More code re-use

  3. Source Code Characteristics • Code • Identifiers • Count, max, get_num • Language keywords • switch, if .. then.. else, printf, return, void • Mathematical operators • +, *, >> …. • <=, =, != … • Literals • “Hello World” • Comments • Whitespace

  4. Language of Lexical Analysis • Tokens • Patterns • Lexemes

  5. Tokens are not enough… • Clearly, if we replaced every occurrence of a variable with a token, we would lose other valuable information • Other data items are attributes of the tokens • Stored in the symbol table

  6. Token delimiters • When does a token/lexeme end? e.g. xtemp=ytemp

  7. Ambiguity in identifying tokens • A programming language definition will state how to resolve uncertain token assignment • <> Is it 1 or 2 tokens? • Disambiguating rules state what to do • Reserved keywords (e.g. if) take precedence over identifiers • ‘Principle of longest substring’
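The two disambiguating rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production scanner: the keyword set, the operator list, and the token names `KEYWORD`, `ID` and `OP` are assumptions chosen for the example.

```python
# Hypothetical keyword and operator sets, chosen only for this sketch.
KEYWORDS = {"if", "then", "else", "return"}
# Longer operators are listed first so that "<>" is tried before "<".
OPERATORS = ["<=", "<>", "<", "=", "+", "*"]


def tokenize(text):
    """Split text into (kind, lexeme) pairs using the longest possible lexeme."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch.isalpha() or ch == "_":
            # Longest substring: keep reading while the identifier can grow.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            lexeme = text[i:j]
            # Reserved keywords take precedence over identifiers.
            kind = "KEYWORD" if lexeme in KEYWORDS else "ID"
            tokens.append((kind, lexeme))
            i = j
            continue
        for op in OPERATORS:
            if text.startswith(op, i):
                tokens.append(("OP", op))
                i += len(op)
                break
        else:
            raise ValueError(f"illegal character {ch!r} at position {i}")
    return tokens


print(tokenize("xtemp=ytemp"))
print(tokenize("a<>b"))
print(tokenize("if iffy"))
```

With these rules `<>` comes out as one token, and `iffy` is an identifier even though it begins with the keyword `if`.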

  8. Regular Expressions • To represent patterns of strings of characters • REs • Alphabet – set of legal symbols • Meta-characters – characters with special meanings • ε is the empty string • 3 basic operations • Choice – choice1|choice2 • a|b matches either a or b • Concatenation – firstthing secondthing • (a|b)c matches the strings { ac, bc } • Repetition (Kleene closure) – repeatme* • a* matches { ε, a, aa, aaa, aaaa, … } • Precedence: * is highest, | is lowest • Thus a|bc* is a|(b(c*))

  9. Regular Expressions (2) • We can add in regular definitions • digit = 0|1|2|…|9 • And then use them: • digit digit* • A sequence of 1 or more digits • One or more repetitions: • (a|b)(a|b)* ≡ (a|b)+ • Any character in the alphabet: . • .*b.* – strings containing at least one b • Ranges: [a-z], [a-zA-Z], [0-9] (assume character set ordering) • Not: ~a or [^a]
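The operators on the last two slides can be tried out with Python's standard `re` module. This is only a convenient way to experiment: `re` is a backtracking matcher rather than the automaton-based approach of this lecture, and it writes negation as `[^a]` only (the `~a` form is not Python syntax). The helper name `matches` is an assumption for the example.

```python
import re


def matches(pattern, s):
    """True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None


examples = [
    ("a|b", "a"),            # choice
    ("(a|b)c", "bc"),        # concatenation
    ("a*", ""),              # repetition includes the empty string
    ("a|bc*", "bccc"),       # parsed as a|(b(c*))
    ("a|bc*", "ac"),         # would only match under (a|b)c*
    ("[0-9][0-9]*", "2024"),  # digit digit*
    ("[0-9]+", "2024"),      # the same language written with +
    (".*b.*", "aabaa"),      # contains at least one b
    ("[^a]", "b"),           # any single character except a
]

for pattern, s in examples:
    print(f"{pattern!r} vs {s!r}: {matches(pattern, s)}")
```

Every pair prints `True` except `a|bc*` against `ac`, which shows that `*` binds tighter than concatenation and `|` binds loosest.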

  10. Limitations of REs • REs can describe many language constructs but not all • For example Alphabet = {a,b}, describe the set of strings consisting of a single a surrounded by an equal number of b’s S= {a, bab, bbabb, bbbabbb, …}
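The set S is out of reach for REs because matching it requires remembering how many b's were seen, and that count is unbounded. A recognizer with a counter handles it easily, as in this small sketch (the function name `balanced_around_a` is made up for the example):

```python
def balanced_around_a(s):
    """True when s is a single 'a' with the same number of b's on each side."""
    n = 0
    # Count the leading b's; this unbounded counter is what an RE lacks.
    while n < len(s) and s[n] == "b":
        n += 1
    # The rest must be one 'a' followed by exactly n more b's.
    return s[n:] == "a" + "b" * n


for candidate in ["a", "bab", "bbabb", "babb", "ba"]:
    print(candidate, balanced_around_a(candidate))
```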

  11. Transition Diagrams • Algorithm to match REs • [Diagram: start → state 1; digit → state 2 (accepting)] • Double lines mean an accepting state • Matches all single digits • Anything else goes to an ‘error state’, not usually shown

  12. Lookahead • <=, <>, < • When we read a token delimiter to establish a token we need to make sure that it is still available • It is the start of the next token! • This is lookahead • Decide what to do based on the character we ‘haven’t read’ • Sometimes implemented by reading from a buffer and then pushing the input back into the buffer • And then starting with recognizing the next token

  13. Classic Fortran example • DO 99 I=1,10 becomes DO99I=1,10 versus DO99I=1.10 • When can the lexical analyzer assign a token? • Push back into input buffer • or ‘backtracking’

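  14. Transition Diagrams (2) • Attach return values with accepting states • [Diagram: start → state 1; < → state 2; from state 2: = → state 3 (returns less_eq), > → state 4 (returns not_eq), other → state 5* (returns less_than)] • other is context sensitive • * = lookahead on [other]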
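The diagram for `<`, `<=` and `<>` can be coded directly with one character of lookahead. The sketch below is one possible reading of it: the function name `relop` and the choice to return a `(token, new_position)` pair are assumptions, while the token names come from the slide.

```python
def relop(text, pos):
    """Recognize <, <= or <> starting at text[pos].

    Returns (token_name, new_pos), or (None, pos) when no operator starts here.
    """
    if pos >= len(text) or text[pos] != "<":
        return None, pos
    # Peek at the next character without consuming it (empty at end of input).
    lookahead = text[pos + 1] if pos + 1 < len(text) else ""
    if lookahead == "=":
        return "less_eq", pos + 2
    if lookahead == ">":
        return "not_eq", pos + 2
    # "other": the lookahead character starts the next token, so it stays in the input.
    return "less_than", pos + 1


print(relop("<=1", 0))
print(relop("<>1", 0))
print(relop("<1", 0))
```

In the `less_than` case the returned position advances by one only, which plays the same role as pushing the lookahead character back into the buffer.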

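  15. Transition Diagrams (3) • Examples: 3, +2, -45, +379, 1001, … • [Diagram: start → state 1; + or - → state 2; digit from state 1 or state 2 → state 3 (accepting); digit loops on state 3] • (+|-)? digit digit* = + digit digit* | - digit digit* | digit digit*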
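A transition diagram like this one maps naturally onto a lookup table. The following is a minimal table-driven sketch for (+|-)? digit digit*, assuming the state numbers 1 to 3 as read from the slide; the names `TRANSITIONS`, `char_class` and `accepts` are invented for the example.

```python
def char_class(ch):
    """Map a character to the label used on the diagram's edges."""
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"


# (state, input class) -> next state; a missing entry means the error state.
TRANSITIONS = {
    (1, "sign"): 2,
    (1, "digit"): 3,
    (2, "digit"): 3,
    (3, "digit"): 3,
}
ACCEPTING = {3}


def accepts(s):
    """Run the automaton over s and report whether it stops in an accepting state."""
    state = 1
    for ch in s:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING


for candidate in ["3", "+2", "-45", "+379", "1001", "+", "12a"]:
    print(candidate, accepts(candidate))
```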

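  16. DFAs • [Diagram: from start state 1, three separate edges labelled < lead towards LE (< then =), NE (< then >) and LT (<)] • This is not a DFA because we have 3 different possible moves from state 1 on the same input character <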

  17. NFAs • ε-transitions • ε-transitions can ‘glue together’ automata, enabling us to build large automata easily from lots of small ones

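  18. RE -> NFA (Thompson’s Construction) • [Diagrams: NFA fragments for a and b; ab, with the two fragments joined by an ε-transition; a|b, with ε-transitions branching into and merging out of the a and b fragments; a*, with ε-transitions for skipping and repeating a]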
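The fragments on this slide can be sketched as small combinator functions, each gluing its arguments together with ε-transitions. This is an illustrative sketch rather than the textbook's algorithm verbatim: the class `Fragment`, the function names, and the use of `None` as the ε label are all assumptions. The matcher tracks the set of reachable states, which is the idea the subset construction later turns into a DFA.

```python
import itertools

_ids = itertools.count()  # supplies fresh state numbers


class Fragment:
    """NFA fragment with one start state and one accepting state."""

    def __init__(self, start, accept, edges):
        self.start = start
        self.accept = accept
        self.edges = edges  # (source, label, target); label None means epsilon


def symbol(ch):
    s, a = next(_ids), next(_ids)
    return Fragment(s, a, [(s, ch, a)])


def concat(f, g):
    # Glue f's accepting state to g's start state with an epsilon-transition.
    return Fragment(f.start, g.accept, f.edges + g.edges + [(f.accept, None, g.start)])


def union(f, g):
    s, a = next(_ids), next(_ids)
    glue = [(s, None, f.start), (s, None, g.start),
            (f.accept, None, a), (g.accept, None, a)]
    return Fragment(s, a, f.edges + g.edges + glue)


def star(f):
    s, a = next(_ids), next(_ids)
    glue = [(s, None, f.start), (s, None, a),            # enter or skip
            (f.accept, None, f.start), (f.accept, None, a)]  # repeat or leave
    return Fragment(s, a, f.edges + glue)


def closure(states, edges):
    """All states reachable from `states` using epsilon-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def nfa_accepts(frag, s):
    current = closure({frag.start}, frag.edges)
    for ch in s:
        moved = {dst for src, label, dst in frag.edges
                 if src in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.accept in current


# (a|b)*c built from the small fragments
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("c"))
for candidate in ["c", "abac", "ab", "ca"]:
    print(candidate, nfa_accepts(nfa, candidate))
```

Each call to `symbol` creates fresh states, so a fragment should be built once per use rather than shared between two places in a larger expression.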

  19. Overall Picture • Regular Expression → NFA: Thompson’s construction algorithm (Algorithm 3.3, pg 122) • NFA → DFA: subset construction algorithm (Algorithm 3.2, pg 118) • DFA → Program • Or write an ad-hoc lexical analyzer • Fig 3.22, pg 116 • RE -> DFA, pg 135 • Flex

  20. Lexical Errors • Only a small percentage of errors can be recognised during lexical analysis • Consider: • fi (good == bad) … • if = good;

  21. Examples from the Oberon language (QUT) • Line ends inside literal string • Illegal character in input file • Input file ends inside a comment • Invalid exponent in REAL constant • Number too long • Illegal use of underscore in identifier

  22. In general • What does a lexical error mean? • Strategies: • “Panic mode” • Delete chars from input until something matches • Inserting characters • Re-ordering characters • Replacing characters • An error like “illegal character” should be reported sensibly
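Panic-mode recovery can be sketched with a scanner that, when nothing matches, reports the offending character, deletes it, and tries again. The token pattern below is a deliberately tiny, made-up language (identifiers, integers, `==` and a few single-character symbols), and the names `TOKEN_RE` and `scan` are assumptions for this example.

```python
import re

# Whitespace, identifiers, integers, "==", then single-character symbols.
TOKEN_RE = re.compile(r"\s+|[A-Za-z_][A-Za-z0-9_]*|[0-9]+|==|[=()+*;<>]")


def scan(text):
    """Return (tokens, errors); illegal characters are reported and skipped."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m:
            if not m.group().isspace():
                tokens.append(m.group())
            pos = m.end()
        else:
            # Panic mode: report the character, delete it, and retry.
            errors.append(f"illegal character {text[pos]!r} at position {pos}")
            pos += 1
    return tokens, errors


print(scan("fi (good == bad)"))
print(scan("x = 3 $ 4;"))
```

The first call reports no errors at all: `fi` is a perfectly good identifier to the scanner, so the misspelt keyword can only be caught later by the parser. The second call reports the `$` with its position and carries on scanning.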
