Week 2 – Lecture 1 Compiler Construction • Lexical Analysis • The language of Lexical Analysis • Regular Expressions • DFAs and NFAs • Errors in Lexical Analysis Reading: section 2.6, Chapter 3
Lexical Analysis • Why split it from parsing? • Simplifies design • Parsers with whitespace and comments are more awkward • Efficiency • Only use the most powerful technique that works • And nothing more • No parsing sledgehammers for lexical nuts • Portability • More modular code • More code re-use
Source Code Characteristics • Code • Identifiers • Count, max, get_num • Language keywords • switch, if .. then.. else, printf, return, void • Mathematical operators • +, *, >> …. • <=, =, != … • Literals • “Hello World” • Comments • Whitespace
Language of Lexical Analysis • Tokens – the categories the analyzer emits (identifier, number, operator, …) • Patterns – the rules (e.g. regular expressions) describing the strings belonging to each token • Lexemes – the actual character strings matched
Tokens are not enough… • Clearly, if we replaced every occurrence of a variable with a token we would lose other valuable information (which variable was it?) • Other data items are attributes of the tokens • Stored in the symbol table
Token delimiters • When does a token/lexeme end? e.g. xtemp=ytemp
Ambiguity in identifying tokens • A programming language definition will state how to resolve uncertain token assignment • <> Is it 1 or 2 tokens? • Disambiguating rules state what to do • Reserved keywords (e.g. if) take precedence over identifiers • ‘Principle of longest substring’
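These two disambiguating rules can be sketched in a few lines of Python (the helper name and keyword set are hypothetical illustrations, not from the slides):

```python
# Sketch: 1. 'principle of longest substring' (maximal munch),
#         2. reserved keywords take precedence over identifiers.
KEYWORDS = {"if", "then", "else"}

def next_word_token(source, pos):
    """Scan the longest run of letters starting at pos (maximal munch)."""
    end = pos
    while end < len(source) and source[end].isalpha():
        end += 1
    lexeme = source[pos:end]
    kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
    return kind, lexeme, end

print(next_word_token("ifx = 1", 0))   # longest match wins: IDENT "ifx", not keyword "if"
print(next_word_token("if x", 0))      # exact keyword: KEYWORD "if"
```

Note that the keyword check is applied only after the longest match is taken, so `ifx` is an identifier even though it starts with `if`.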
Regular Expressions • To represent patterns of strings of characters • REs • Alphabet – set of legal symbols • Meta-characters – characters with special meanings • ε is the empty string • 3 basic operations • Choice – choice1|choice2, • a|b matches either a or b • Concatenation – firstthing secondthing • (a|b)c matches the strings { ac, bc } • Repetition (Kleene closure) – repeatme* • a* matches { ε, a, aa, aaa, aaaa, … } • Precedence: * is highest, | is lowest • Thus a|bc* is a|(b(c*))
Regular Expressions (2) • We can add in regular definitions • digit = 0|1|2|…|9 • And then use them: • digit digit* • A sequence of 1 or more digits • One or more repetitions: • (a|b)(a|b)* is written (a|b)+ • Any character in the alphabet: . • .*b.* – strings containing at least one b • Ranges [a-z], [a-zA-Z], [0-9] (assume character set ordering) • Not: ~a or [^a]
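The patterns above can be tried directly in Python's `re` dialect (one of many RE notations; its syntax differs slightly from the slides' abstract notation):

```python
import re

# The slides' patterns rewritten in Python's re syntax:
assert re.fullmatch(r"[0-9][0-9]*", "2024")     # digit digit*  (i.e. [0-9]+)
assert re.fullmatch(r"(a|b)+", "abba")          # one or more repetitions
assert re.fullmatch(r".*b.*", "aaabaa")         # strings containing at least one b
assert re.fullmatch(r"[a-zA-Z]+", "Count")      # letter ranges
assert re.fullmatch(r"[^a]", "z")               # 'not a'
assert re.fullmatch(r"[^a]", "a") is None       # ...so 'a' itself is rejected
print("all patterns behave as described")
```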
Limitations of REs • REs can describe many language constructs but not all • For example Alphabet = {a,b}, describe the set of strings consisting of a single a surrounded by an equal number of b’s S= {a, bab, bbabb, bbbabbb, …}
Transition Diagrams • Algorithm to match REs • Example: start → state 1 –digit→ state 2 • Double lines mean an accepting state • Matches all single digits • Anything else goes to an 'error state' not usually shown
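The diagram above translates almost mechanically into code; here is a minimal sketch (function name is illustrative, not from the slides):

```python
# Direct encoding of the diagram: state 1 is the start, state 2 accepts,
# anything unexpected falls into the implicit 'error state'.
def match_single_digit(s):
    state = 1
    for ch in s:
        if state == 1 and ch.isdigit():
            state = 2
        else:
            return False        # the 'error state' not usually shown
    return state == 2           # double circle: accepting state

print(match_single_digit("7"))    # True
print(match_single_digit("77"))   # False: no transition out of state 2 on a digit
print(match_single_digit("x"))    # False
```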
Lookahead • <=, <>, < • When we read a token delimiter to establish a token we need to make sure that it is still available • It is the start of the next token! • This is lookahead • Decide what to do based on the character we ‘haven’t read’ • Sometimes implemented by reading from a buffer and then pushing the input back into the buffer • And then starting with recognizing the next token
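The buffer-with-pushback idea can be sketched as follows (a minimal illustration under assumed names; real lexers buffer file input rather than a string):

```python
# Sketch of an input buffer supporting one-character pushback.
class Buffer:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def next(self):
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def push_back(self):
        self.pos -= 1           # un-read the character we 'haven't read'

buf = Buffer("<a")
first = buf.next()              # '<'
second = buf.next()             # 'a' -- the delimiter ending the '<' token
buf.push_back()                 # it is the start of the next token
print(first, buf.next())        # < a
```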
Classic Fortran example • DO 99 I=1,10 becomes DO99I=1,10 (a DO loop) versus DO99I=1.10 (an assignment to the variable DO99I) • Because Fortran ignores blanks, the analyzer cannot tell whether DO is a keyword until it reaches the ',' or '.' • When can the lexical analyzer assign a token? • Push back into input buffer • or 'backtracking'
Transition Diagrams (2) • Attach return values to accepting states • Example for <, <=, <>: start → state 1 –'<'→ state 2 • From state 2: '=' → state 3, return less_eq • '>' → state 4, return not_eq • [other] → state 5, return less_than (states marked * push the lookahead character back) • [other] needs lookahead: what to return is context sensitive
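This diagram, with its one character of lookahead, can be sketched directly (the function name is hypothetical):

```python
# Sketch of the <, <=, <> diagram: read '<', then decide on one lookahead character.
def scan_relop(src, pos):
    """Return (token, next_pos); next_pos leaves the pushed-back char unconsumed."""
    if src[pos] == "<":
        lookahead = src[pos + 1] if pos + 1 < len(src) else ""
        if lookahead == "=":
            return "less_eq", pos + 2
        if lookahead == ">":
            return "not_eq", pos + 2
        return "less_than", pos + 1   # [other]: push the character back, emit '<'
    raise ValueError("not a relational operator")

print(scan_relop("<=", 0))   # ('less_eq', 2)
print(scan_relop("<>", 0))   # ('not_eq', 2)
print(scan_relop("<5", 0))   # ('less_than', 1) -- the '5' stays for the next token
```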
Transition Diagrams (3) • Examples: 3, +2, -45, +379, 1001 … • start → state 1; '+' or '-' → state 2; then digit → state 3 (accepting, loops on digit); state 1 also moves straight to state 3 on a digit • As an RE: (+|-)? digit digit* = + digit digit* | - digit digit* | digit digit*
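The signed-integer pattern checks out against the slide's examples when written in Python's `re` syntax (`+` must be escaped there):

```python
import re

# (+|-)? digit digit*  in Python re notation
signed_int = re.compile(r"(\+|-)?[0-9][0-9]*")

for lexeme in ["3", "+2", "-45", "+379", "1001"]:
    assert signed_int.fullmatch(lexeme)       # every example from the slide matches
assert signed_int.fullmatch("+-3") is None    # at most one optional sign
assert signed_int.fullmatch("+") is None      # a sign alone is not a number
print("signed integer pattern OK")
```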
DFAs • Example: from state 1, '<' followed by '=' returns LE, '<' followed by '>' returns NE, and '<' alone returns LT • Drawn naively, this is not a DFA because we have 3 different possible moves on '<' from state 1
NFAs • ε-transitions • ε-transitions can 'glue together' automata, enabling us to build large automata easily from lots of small ones
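The key operation when simulating such glued-together NFAs is the ε-closure: the set of states reachable through ε-transitions alone. A minimal sketch on a toy automaton (the state numbering is hypothetical):

```python
# Toy NFA: only the epsilon edges are stored here.
eps_edges = {1: {2, 4}, 2: {3}, 3: set(), 4: set()}

def eps_closure(states):
    """All states reachable from `states` using epsilon-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps_edges.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

print(sorted(eps_closure({1})))   # [1, 2, 3, 4]
```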
RE -> NFA (Thompson's Construction) • Build a tiny NFA for each symbol (a, b), then compose: • Concatenation chains the NFAs for a and b to give ab • Choice joins the NFAs for a and b as parallel branches via ε-transitions to give a|b • Repetition wraps the NFA for a with ε-transitions for the loop and the empty case to give a*
Overall Picture • Regular Expression → NFA via Thompson's construction (Algorithm 3.3, pg 122) • NFA → DFA via the subset construction (Algorithm 3.2, pg 118) • DFA → Program • Alternatives: write an ad-hoc Lexical Analyzer (Fig 3.22, pg 116), convert the RE to a DFA directly (pg 135), or use Flex
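The subset-construction idea can be sketched by simulating a toy NFA with sets of states, building the DFA states on the fly (the NFA here is a hypothetical example for (a|b)*ab, not from the slides):

```python
# Toy NFA for (a|b)*ab: state 0 loops on a and b, 0 -a-> 1, 1 -b-> 2 (accepting).
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
accept = {2}

def dfa_accepts(word):
    """Each DFA state is a frozenset of NFA states (the subset construction)."""
    current = frozenset({0})
    for ch in word:
        current = frozenset(t for s in current for t in nfa.get((s, ch), ()))
    return bool(current & accept)   # accept if any NFA state in the set accepts

print(dfa_accepts("bab"))    # True:  ends in ab
print(dfa_accepts("abba"))   # False: ends in a
```

A real implementation would also apply the ε-closure at each step and cache the generated subsets as named DFA states; this sketch recomputes them per input instead.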
Lexical Errors • Only a small percentage of errors can be recognised during Lexical Analysis • Consider fi (good == bad) … if = good; • Both are sequences of perfectly legal tokens – fi is just an identifier, and if = good is valid tokens in an invalid order – so the lexical analyzer cannot flag them
Examples from the oberon language (QUT) • Line ends inside literal string • Illegal character in input file • Input file ends inside a comment • Invalid exponent in REAL constant • Number too long • Illegal use of underscore in identifier
In general • What does a lexical error mean? • Strategies: • 'Panic-mode' • Delete characters from input until something matches • Inserting characters • Re-ordering characters • Replacing characters • For an error like 'illegal character' we should report it sensibly