
Lexical Analysis



  1. Lexical Analysis Cheng-Chia Chen

  2. Outline • Introduction to lexical analyzer • Tokens • Regular expressions (RE) • Finite automata (FA) • deterministic and nondeterministic finite automata (DFA and NFA) • from RE to NFA • from NFA to DFA • from DFA to optimized DFA

  3. Outline • Informal sketch of lexical analysis • Identifies tokens in input string • Issues in lexical analysis • Lookahead • Ambiguities • Specifying lexers • Regular expressions • Examples of regular expressions

  4. The Structure of a Compiler • Source → Lexical analysis → Tokens → Parsing → Interm. Language → Optimization → Code Gen. → Machine Code

  5. Lexical Analysis • What do we want to do? Example: if (i == j) z = 0; else z = 1; • The input is just a sequence of characters: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Goal: Partition the input string into substrings • And determine the categories (tokens) to which the substrings belong

  6. What’s a Token? • Output of lexical analysis is a stream of tokens • A token is a syntactic category • In English: noun, verb, adjective, … • In a programming language: Identifier, Integer, Keyword, Whitespace, … • Parser relies on the token distinctions: identifiers are treated differently than keywords

  7. Aspects of Token • Language view: the set of strings belonging to that token • [while], [identifier], [arithOp] • Pattern (grammar): a rule defining a token • [while]: while • [identifier]: letter followed by letters and digits • [arithOp]: + or - or * or / • Lexeme (member): a string matched by the pattern of a token • while, var23, count, +, *

  8. Attributes of Tokens • Attributes are used to distinguish different lexemes in a token • [while,] • [identifier, var35] • [arithOp, +] • [integer, 26] • [string, “26”] • positional information: start/end line/position • Tokens affect syntax analysis and attributes affect semantic analysis and error handling

  9. Lexical Analyzer: Implementation • An implementation must do two things: • Recognize substrings corresponding to tokens • Return the attributes (lexeme and positional information) of the token • The lexeme is either the substring itself or some data object constructed from the substring.

  10. Example • input lines: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Token-lexeme pairs returned by the lexer: • [Whitespace, “\t”] • [if, ] • [OpenPar, “(”] • [Identifier, “i”] • [Relation, “==”] • [Identifier, “j”] • …
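
A minimal sketch of such a lexer loop in Python (not the slides' code; the rule names and patterns are illustrative assumptions). It scans left to right and emits token–lexeme pairs like the ones above:

```python
import re

# Illustrative rules; order matters (keywords before Identifier, cf. slide 42).
# Note: this sketch takes the first rule that matches at each position;
# longest-match handling is discussed on slides 41-42.
TOKEN_RULES = [
    ("Whitespace", r"[ \t\n]+"),
    ("if",         r"if"),
    ("else",       r"else"),
    ("Relation",   r"==|!=|<=|>=|<|>"),
    ("Assign",     r"="),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
    ("Semi",       r";"),
    ("Integer",    r"[0-9]+"),
    ("Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
]

def tokenize(text):
    """Yield (token, lexeme) pairs, scanning left to right."""
    pos = 0
    while pos < len(text):
        for token, pattern in TOKEN_RULES:
            m = re.match(pattern, text[pos:])
            if m:
                yield (token, m.group())
                pos += len(m.group())
                break
        else:
            raise SyntaxError(f"no rule matches at position {pos}")

print(list(tokenize("\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;")))
```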

  11. Lexical Analyzer: Implementation • The lexer usually discards “uninteresting” tokens that don’t contribute to parsing. • Examples: Whitespace, Comments • Question: What happens if we remove all whitespace and all comments prior to lexing?

  12. Lexical Analysis in FORTRAN • FORTRAN rule: Whitespace is insignificant • E.g., VAR1 is the same as VA R1 • Footnote: the FORTRAN whitespace rule was motivated by the inaccuracy of punch-card operators

  13. Example: a terrible design! • Consider • DO 5 I = 1,25 • DO 5 I = 1.25 • The first is a DO loop: DO 5 I = 1 , 25 • The second is an assignment: DO5I = 1.25 • Reading left-to-right, we cannot tell whether DO5I is a variable or a DO statement until the “,” (or “.”) is reached

  14. Lexical Analysis in FORTRAN. Lookahead. • Two important points: • The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time • “Lookahead” may be required to decide where one token ends and the next token begins • Even our simple example has lookahead issues: i vs. if, = vs. ==
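
A tiny sketch (an illustration, not the slides' code) of how one character of lookahead separates = from ==:

```python
def scan_operator(text, pos):
    """Decide between '=' and '==' by peeking one character ahead."""
    if text[pos] == "=":
        # Lookahead: peek at the next character without consuming it.
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("Relation", "==", pos + 2)   # consumed two characters
        return ("Assign", "=", pos + 1)          # consumed one character
    raise ValueError("not an operator at this position")

print(scan_operator("i == j", 2))   # ('Relation', '==', 4)
print(scan_operator("z = 0", 2))    # ('Assign', '=', 3)
```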

  15. Lexical Analysis in PL/I • PL/I keywords are not reserved: IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN • PL/I declarations: DECLARE (ARG1, ..., ARGN) • Can’t tell whether DECLARE is a keyword or an array reference until after the “)” • Requires arbitrary lookahead!

  16. Review • The goal of lexical analysis is to • Partition the input string into lexemes • Identify the token of each lexeme • Left-to-right scan => lookahead sometimes required

  17. Next • We still need • A way to describe the lexemes of each token • A way to resolve ambiguities • Is if two variables i and f? • Is == two equal signs = =?

  18. Regular Expressions and Regular Languages • There are several formalisms for specifying tokens • Regular languages are the most popular • Simple and useful theory • Easy to understand • Efficient implementations

  19. Languages • Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ (Σ is called the alphabet)

  20. Examples of Languages • Alphabet = English characters, Language = English words • Not every string of English characters is an English word: likes, school, … are words; beee, yykk, … are not • Alphabet = ASCII, Language = C programs • Note: the ASCII character set is different from the English character set

  21. Notation • Languages are sets of strings. • Need some notation for specifying which sets we want • For lexical analysis we care about regular languages, which can be described using regular expressions.

  22. Regular Expressions and Regular Languages • Each regular expression is a notation for a regular language (a set of words) • If A is a regular expression then we write L(A) to refer to the language denoted by A

  23. Atomic Regular Expressions • Single character: c, L(c) = { c } (for any c ∈ Σ) • Epsilon (empty string): ε, L(ε) = { ε }

  24. Compound Regular Expressions • Union (or choice): L( A | B ) = { s | s ∈ L(A) or s ∈ L(B) } • Concatenation: A·B (where A and B are reg. exp.), L(A·B) = { ab | a ∈ L(A) and b ∈ L(B) } • Note: • A·B (set concatenation) and a·b (string concatenation) will be abbreviated to AB and ab, respectively. • Examples: • if | then | else = { if, then, else } • 0 | 1 | … | 9 = { 0, 1, …, 9 } • Another example: (0 | 1)(0 | 1) = { 00, 01, 10, 11 }
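
To make the set definitions concrete, here is an illustrative sketch that models finite regular languages as Python sets, so union and concatenation can be computed directly (the helper names are assumptions):

```python
# Illustrative only: finite regular languages modelled as Python sets.
def union(A, B):
    # L(A | B) = { s | s in L(A) or s in L(B) }
    return A | B

def concat(A, B):
    # L(A·B) = { ab | a in L(A) and b in L(B) }
    return {a + b for a in A for b in B}

bit = {"0", "1"}
print(union({"if", "then"}, {"else"}))  # {'if', 'then', 'else'}
print(concat(bit, bit))                 # {'00', '01', '10', '11'}
```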

  25. More Compound Regular Expressions • So far we do not have a notation for infinite languages • Iteration: A* L(A*) = { ε } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ … • Examples: 0* = { ε, 0, 00, 000, … } 1 0* = { strings starting with 1 and followed by 0’s }
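
A* is infinite, so a sketch can only enumerate a finite approximation; this illustration builds the words of A* reachable with at most n concatenations:

```python
def star_up_to(A, n):
    """Words of L(A)* built from at most n concatenations
    (a finite approximation of the infinite language A*)."""
    words = {""}                  # epsilon: zero concatenations
    layer = {""}
    for _ in range(n):
        layer = {w + a for w in layer for a in A}   # L(A^(k+1)) from L(A^k)
        words |= layer
    return words

print(sorted(star_up_to({"0"}, 3)))        # ['', '0', '00', '000']
print(sorted(star_up_to({"0", "1"}, 2)))   # ['', '0', '00', '01', '1', '10', '11']
```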

  26. Example: Keyword • Keyword: else or if or begin… else | if | begin | …

  27. Example: Integers • Integer: a non-empty string of digits ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )* • Problem: reusing a complicated expression is tedious • Improvement: define intermediate reg. expr. digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 number = digit digit* • Abbreviation: A+ = A A*

  28. Regular Definitions • Names for regular expressions • d1 = r1 • d2 = r2 • ... • dn = rn where ri is a regular expression over the alphabet Σ ∪ { d1, d2, ..., di-1 } • Note: recursion is not allowed.

  29. Example • Identifier: strings of letters or digits, starting with a letter digit = 0 | 1 | ... | 9 letter = A | … | Z | a | … | z identifier = letter (letter | digit)* • Is (letter* | digit*) the same? (see the sketch below)
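
It is not the same. A quick illustration with Python's re module (the test strings are arbitrary):

```python
import re

identifier  = r"[A-Za-z][A-Za-z0-9]*"   # letter (letter | digit)*
alternative = r"[A-Za-z]*|[0-9]*"       # (letter* | digit*)

for s in ["var23", "23var", "x", "42", ""]:
    print(f"{s!r}: identifier={bool(re.fullmatch(identifier, s))}, "
          f"alternative={bool(re.fullmatch(alternative, s))}")
# (letter* | digit*) rejects "var23" but accepts "42" and even ""
```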

  30. Example: Whitespace • Whitespace: a non-empty sequence of blanks, newlines, CRLF pairs, and tabs WS = ( \  | \t | \n | \r\n )+

  31. Example: Email Addresses • Consider chencc@cs.nccu.edu.tw • Σ = letters ∪ { ‘.’, ‘@’ } name = letter+ address = name ‘@’ name (‘.’ name)*
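
The same definition transcribed as a Python regular expression (an illustrative translation that assumes letter means an ASCII letter):

```python
import re

name = r"[A-Za-z]+"                         # name = letter+
address = rf"{name}@{name}(?:\.{name})*"    # address = name '@' name ('.' name)*

print(bool(re.fullmatch(address, "chencc@cs.nccu.edu.tw")))  # True
print(bool(re.fullmatch(address, "chencc@")))                # False
```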

  32. Notational Shorthands • One or more instances • r+ = r r* • r* = r+ | ε • Zero or one instance • r? = r | ε • Character classes • [abc] = a | b | c • [a-z] = a | b | ... | z • [ac-f] = a | c | d | e | f • [^ac-f] = Σ – [ac-f]
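
These shorthands carry over verbatim to most regex libraries; a brief illustration in Python's re:

```python
import re

print(bool(re.fullmatch(r"[0-9]+", "123")))    # one or more digits -> True
print(bool(re.fullmatch(r"-?[0-9]+", "-7")))   # optional minus sign -> True
print(bool(re.fullmatch(r"[ac-f]", "d")))      # class a|c|d|e|f -> True
print(bool(re.fullmatch(r"[^ac-f]", "b")))     # complement class -> True
```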

  33. Summary • Regular expressions describe many useful languages • Regular languages are a language specification • We still need an implementation • Problem: Given a string s and a rexp R, is s ∈ L(R) ?

  34. Implementation of Lexical Analysis

  35. Outline • Specifying lexical structure using regular expressions • Finite automata • Deterministic Finite Automata (DFAs) • Non-deterministic Finite Automata (NFAs) • Implementation of regular expressions RegExp => NFA => DFA => optimized DFA => Tables

  36. Regular Expressions in Lexical Specification • Last lecture: a specification for the predicate s ∈ L(R) • But a yes/no answer is not enough! • Instead: partition the input into lexemes • We will adapt regular expressions to this goal

  37. Regular Expressions => Lexical Spec. (1) • Select a set of tokens • Number, Keyword, Identifier, ... • Write a rexp for the lexemes of each token • Number = digit+ • Keyword = if | else | … • Identifier = letter (letter | digit)* • OpenPar = ‘(’ • …

  38. Regular Expressions => Lexical Spec. (2) • Construct R, matching all lexemes for all tokens R = Keyword | Identifier | Number | … = R1 | R2 | R3 | … • Facts: If s ∈ L(R) then s is a lexeme • Furthermore s ∈ L(Ri) for some “i” • This “i” determines the token that is reported
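
One common way to implement this (a sketch, not the slides' prescribed method) is to combine the per-token expressions into a single alternation with one named group per Ri, so the group that matched identifies the “i”:

```python
import re

# R = R1 | R2 | R3 | ...  with one named group per token.
# Keyword is listed before Identifier so rule order decides ties, but note:
# re's alternation is first-match, not longest-match (see slides 41-42).
R = re.compile(
    r"(?P<Keyword>if|else)"
    r"|(?P<Identifier>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<Number>[0-9]+)"
)

m = R.match("x42")
print(m.lastgroup, m.group())   # Identifier x42
m = R.match("if")
print(m.lastgroup, m.group())   # Keyword if
```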

  39. Regular Expressions => Lexical Spec. (3) • Let the input be x1…xn (x1 ... xn are characters in the language alphabet) • For 1 ≤ i ≤ n check x1…xi ∈ L(R) ? • It must be that x1…xi ∈ L(Rj) for some j • Remove x1…xi from the input and go to (4)

  40. How to Handle Spaces and Comments? • We could create a token Whitespace Whitespace = (‘ ’ + ‘\n’ + ‘\r’ + ‘\t’)+ • We could also add comments in there • An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace • Lexer skips spaces (preferred) • Modify step 5 from before as follows: It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace) • Parser is not bothered with spaces

  41. Ambiguities (1) • There are ambiguities in the algorithm • How much input is used? What if • x1…xi ∈ L(R) and also • x1…xK ∈ L(R) • Rule: Pick the longest possible substring • The maximal match

  42. Ambiguities (2) • Which token is used? What if • x1…xi ∈ L(Rj) and also • x1…xi ∈ L(Rk) • Rule: use the rule listed first (j if j < k) • Example: • R1 = Keyword and R2 = Identifier • “if” matches both. • Treat “if” as a keyword, not an identifier
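
Both disambiguation rules together in one sketch (the rule list and helper name are illustrative assumptions): try every rule at the current position, keep the longest match, and break length ties by the order in which the rules are listed:

```python
import re

RULES = [("Keyword",    r"if|else"),              # listed first: wins ties
         ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
         ("Number",     r"[0-9]+")]

def next_token(text, pos):
    """Maximal munch: longest match wins; the earlier rule wins a tie."""
    best = None                       # (length, rule_index, token, lexeme)
    for idx, (token, pattern) in enumerate(RULES):
        m = re.match(pattern, text[pos:])
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), idx, token, m.group())
    return best

print(next_token("if", 0))     # (2, 0, 'Keyword', 'if')      tie -> first rule
print(next_token("iffy", 0))   # (4, 1, 'Identifier', 'iffy') longest match
```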

  43. Error Handling • What if no rule matches a prefix of the input? • Problem: Can’t just get stuck … • Solution: • Write a rule matching all “bad” strings • Put it last • Lexer tools allow the writing of: R = R1 | ... | Rn | Error • Token Error matches if nothing else matches
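
Continuing the sketch above, an illustrative catch-all rule: a pattern matching any single character, appended last so it fires only when nothing else matches:

```python
# '.' matches any one character (except newline), so the lexer can report
# an Error token and keep going instead of getting stuck.
RULES.append(("Error", r"."))

print(next_token("?oops", 0))  # (1, 3, 'Error', '?')
```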

  44. Summary • Regular expressions provide a concise notation for string patterns • Use in lexical analysis requires small extensions • To resolve ambiguities • To handle errors • Good algorithms known (next) • Require only single pass over the input • Few operations per character (table lookup)

  45. Finite Automata • Regular expressions = specification • Finite automata = implementation • A finite automaton consists of • An input alphabet Σ • A set of states S • A start state n • A set of accepting states F ⊆ S • A set of transitions: state →(input) state

  46. Finite Automata • Transition: s1 →(a) s2 • Is read: in state s1, on input “a”, go to state s2 • If at end of input (or no transition is possible) • If in an accepting state => accept • Otherwise => reject
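
A finite automaton is directly executable; here is a small simulator sketch (the table encoding and state names are assumptions, not the slides' notation):

```python
def run_dfa(start, accepting, delta, s):
    """Simulate a DFA: delta maps (state, char) -> state."""
    state = start
    for ch in s:
        if (state, ch) not in delta:   # no transition possible => reject
            return False
        state = delta[(state, ch)]
    return state in accepting          # end of input: accept iff accepting

# The automaton of slide 48, accepting only the string "1".
delta = {("A", "1"): "B"}
print(run_dfa("A", {"B"}, delta, "1"))   # True
print(run_dfa("A", {"B"}, delta, "11"))  # False
```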

  47. Finite Automata State Graphs • A state • The start state • An accepting state • A transition, labeled with an input character (e.g. a) • (each element is drawn as a diagram on the slide)

  48. A Simple Example • A finite automaton that accepts only “1” (drawn on the slide as a start state with a 1-transition into an accepting state)

  49. Another Simple Example • A finite automaton accepting any number of 1’s followed by a single 0 • Alphabet: {0,1} (drawn on the slide as a start state with a self-loop on 1 and a 0-transition into an accepting state)
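
That machine, encoded for the run_dfa sketch above (the state names S and F are assumptions):

```python
# 1*0: loop on '1' in the start state, then one '0' into the accepting state.
delta = {("S", "1"): "S", ("S", "0"): "F"}
for w in ["0", "1110", "110", "11", "100"]:
    print(w, run_dfa("S", {"F"}, delta, w))
# 0 True, 1110 True, 110 True, 11 False, 100 False
```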

  50. And Another Example • Alphabet: {0,1} • What language does this recognize? (the automaton is given only as a diagram on the slide, with transitions labeled 0 and 1)
