Programming Languages, 2nd edition, Tucker and Noonan
Chapter 3: Lexical and Syntactic Analysis
"Syntactic sugar causes cancer of the semicolon." (A. Perlis)
Contents 3.1 Chomsky Hierarchy 3.2 Lexical Analysis 3.3 Syntactic Analysis
Lexical Analysis • 3.1 Chomsky Hierarchy of Languages • 3.2 Purpose of Lexical Analysis • Regular Expressions • Regular expressions for the Clite lexicon • Finite State Automata (FSA) • FSA as a basis for a lexical analyzer • Lexical Analyzer (Lexer) Code
3.1 Chomsky Hierarchy • Each grammar class corresponds to a language class • Regular grammars → lexical grammars • Context-free grammars → programming language syntax • Context-sensitive grammars → able to express some type rules • Unrestricted grammars → most powerful; can express all features of languages such as C/C++
Chomsky Hierarchy • Context-sensitive and unrestricted grammars are not appropriate for developing translators • Given a terminal string ω and an unrestricted grammar G, it is undecidable whether ω is in the language defined by G; even for a context-sensitive grammar G, it is undecidable whether L(G) contains any strings at all. • A problem is decidable if you can write an algorithm that is guaranteed to solve it in a finite number of steps.
Regular Grammars (for Lexical Analysis) • In terms of expressive power, equivalent to: • Regular expressions • Finite-state automata
Context-Free Grammars • Capable of expressing the concrete syntax of programming languages • Equivalent in power to a pushdown automaton • Other grammar levels (beyond the scope of this course; see CS 403 or 603) also correspond to theoretical machines
3.2 Lexical Analysis • Input: a sequence of characters (the program) • Discard: whitespace, comments • Output: tokens • Define: A token is a logically cohesive sequence of characters representing a single symbol; e.g. • Identifiers: numberVal • Literals: 123, 5.67, 'x', true • Keywords: bool | char ... • Operators: + - * / ... • Punctuation: ; , ( ) { }
Character Sequences to Be Recognized by the Clite Lexer (tokens + other)
Tokens: • Identifiers • Literals • Keywords • Operators • Punctuation
Other (discarded): • Whitespace: space or tab • Comments: // to end-of-line • End-of-line • End-of-file
Ways to Describe Lexical Elements • Natural language descriptions • Regular grammars • Regular expressions • Context free grammars
Regular Expressions • Regular expressions (regexps) are patterns that describe a particular class of strings • Used for pattern matching • One regexp can describe or match many strings • Used in many text-processing applications • Python, Perl, Tcl, and UNIX utilities such as grep all use regular expressions
Using Regular Expressions • An alternative to regular grammars for expressing lexical syntax • Lexical-analyzer generator programs (e.g. Lex) take regular expressions as input and produce C/C++ programs that tokenize text.
With Regular Expressions You Can (http://msdn2.microsoft.com/en-us/library/101eysae(VS.80).aspx) • Test for a pattern within a string (data validation) • For example, you can test an input string to see if a telephone number pattern or a credit card number pattern occurs within the string. • Replace text: use a regular expression to identify specific text in a document and either remove it completely or replace it with other text. • Extract a substring from a string based upon a pattern match. • Find specific text within a document or input field.
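A small sketch of these three uses in Java, via java.util.regex (the phone-number pattern and sample strings are invented for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String text = "Call 555-1234 or 555-9876 after 5pm.";
        Pattern phone = Pattern.compile("\\d{3}-\\d{4}");

        // Test for a pattern within a string
        System.out.println(phone.matcher(text).find());               // true

        // Replace matched text
        System.out.println(phone.matcher(text).replaceAll("<redacted>"));

        // Extract each substring that matches
        Matcher m = phone.matcher(text);
        while (m.find())
            System.out.println(m.group());      // 555-1234, then 555-9876
    }
}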
Regular Expression Notation (page 62; metacharacters were shown in red)
RegExpr : Meaning
x : the character x
\x : an escape character, e.g., \n or \t
{name} : a reference to a name
M | N : M or N
M N : M followed by N
M* : zero or more occurrences of M
RegExpr : Meaning
M+ : one or more occurrences of M
M? : zero or one occurrence of M
[aeiou] : the set of vowels; matches any one of them
[0-9] : the set of digits; matches any one ('-' is a metacharacter here)
. : any single character (1-character wildcard)
\d : same as [0-9]
\w : same as [a-zA-Z0-9_]
\s : whitespace: [ \t\n]
Note: some of these shorthands differ across regex implementations.
Simple Example • gr[ae]y, (gray|grey), and gr(a|e)y are equivalent regexps. • All three match either "gray" or "grey".
Pattern to Match a Date in the Form yyyy-mm-dd, yyyy.mm.dd, or yyyy/mm/dd
(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
• (19|20)\d\d : matches "19" or "20" followed by two digits
• [- /.] : matches '-', ' ', '/', or '.'
• (0[1-9]|1[012]) : the first option matches 01 through 09; the second matches 10, 11, or 12
• (0[1-9]|[12][0-9]|3[01]) : the first option matches 01-09, the second 10-29, and the third 30 or 31
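Here is that pattern in Java (a sketch; the class and field names are ours, and the backslashes are doubled for the string literal):

import java.util.regex.Pattern;

public class DateMatch {
    // The date pattern from the slide; [- /.] allows '-', ' ', '/', or '.'
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])");

    public static void main(String[] args) {
        System.out.println(DATE.matcher("2024-02-29").matches()); // true
        System.out.println(DATE.matcher("1999/13/01").matches()); // false: month 13
    }
}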
Clite Lexical Syntax: Ancillary Definitions
Category Name : Definition
anyChar : [-~] // all printable ASCII chars; blank through tilde
letter : [a-zA-Z]
digit : [0-9]
whitespace : [ \t] // blank or tab
eol : \n
eof : \004
Clite Lexical Syntax (metacharacters were shown in red)
Category : Definition
keyword : bool | char | else | false | float | if | int | main | true | while
identifier : {letter}({letter} | {digit})*
integerLit : {digit}+
floatLit : {digit}+\.{digit}+
charLit : '{anyChar}'
operator : = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
separator : ; | , | { | } | ( | )
comment : //({anyChar} | {whitespace})*{eol}
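For instance, the identifier and floatLit definitions above translate directly into java.util.regex patterns (a sketch; the class and field names are ours):

import java.util.regex.Pattern;

public class CliteLexPatterns {
    // identifier : {letter}({letter} | {digit})*
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]([a-zA-Z]|[0-9])*");
    // floatLit : {digit}+\.{digit}+
    static final Pattern FLOAT_LIT  = Pattern.compile("[0-9]+\\.[0-9]+");

    public static void main(String[] args) {
        System.out.println(IDENTIFIER.matcher("numberVal").matches()); // true
        System.out.println(FLOAT_LIT.matcher("5.67").matches());       // true
        System.out.println(FLOAT_LIT.matcher("123").matches());        // false
    }
}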
Lexical Analyzer Generators • Input: regular expressions • Output: a lexical analyzer • C/C++: Lex, Flex • Java: JLex • Regular grammars or regular expressions are converted to a deterministic finite state automaton (DFSA) and then to a lexical analyzer.
Elements of a Finite State Automaton • A set of states: represented by graph nodes • An input alphabet, plus a unique end-of-input symbol • A state transition function, represented as labelled, directed edges (arcs) connecting graph nodes • A unique start state • One or more final states
Deterministic FSA • Definition: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labelled with the input symbol.
A Finite State Automaton for Identifiers • Figure 3.2 (p. 64)
Use a DFSA to recognize (accept) or reject a string • Process the string, one character at a time, by making a series of moves: • Follow the exit arc that corresponds to the leftmost input symbol, thereby consuming it. • If no such arc exists, then either the input is accepted (if all input has been consumed and you are in a final state) or there is an error. • An input is accepted if, beginning from the start state, the automaton consumes all the input and halts in a final state.
Example: accepting the identifier a2i, with $ as the end-of-input symbol
(S, a2i$) ├ (I, 2i$)
├ (I, i$)
├ (I, $)
├ (F, )
Thus: (S, a2i$) ├* (F, )
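A hand-coded Java version of this automaton (a sketch: the state names S and I follow the trace above, and reaching the end of the input while in state I plays the role of the final state F, since $ is only a conceptual end marker):

public class IdentifierDFSA {
    public static boolean accepts(String input) {
        char state = 'S';                        // start state
        for (char c : input.toCharArray()) {
            switch (state) {
                case 'S':
                    if (Character.isLetter(c)) state = 'I';
                    else return false;           // no matching arc: reject
                    break;
                case 'I':
                    if (Character.isLetterOrDigit(c)) state = 'I';
                    else return false;           // no matching arc: reject
                    break;
            }
        }
        return state == 'I';   // all input consumed while in the accepting state
    }

    public static void main(String[] args) {
        System.out.println(accepts("a2i"));  // true
        System.out.println(accepts("2ab"));  // false: cannot start with a digit
    }
}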
Practical Issues • Explicit terminator (end-of-input symbol) is used only at end of program, not each token. • The symbols l and d represent an arbitrary letter and digit, respectively. • An unlabelled arc represents any valid input symbol (other than those on labelled arcs leaving the same state).
Practical Issues • When a token is recognized, move to a final state (one with no exit arcs) • When a non-token (whitespace or a comment) is recognized, move back to the start state • Recognizing EOF means the end of the source code • The automaton must be deterministic • Recognize keywords as identifiers, then do a table look-up (see the sketch below)
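A minimal sketch of that table look-up in Java (the class and method names are ours; the keyword set is the one given in the Clite lexical syntax above):

import java.util.Set;

public class KeywordLookup {
    private static final Set<String> KEYWORDS = Set.of(
        "bool", "char", "else", "false", "float",
        "if", "int", "main", "true", "while");

    // After the automaton recognizes an identifier, one table look-up
    // decides whether its spelling is actually a keyword.
    static String classify(String spelling) {
        return KEYWORDS.contains(spelling) ? "Keyword" : "Identifier";
    }

    public static void main(String[] args) {
        System.out.println(classify("while"));  // Keyword
        System.out.println(classify("count"));  // Identifier
    }
}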
How It's Used • The lexer is called from the parser. • Parser: • Get next token • Parse next token • The lexer enters the start state each time the parser calls for a new token. • The lexer enters a final state when a legal token has been recognized. The character that causes the transition to the final state may be whitespace, or it may be the first character of the next token.
Lexer Code • Parser calls lexer when it needs a new token. • Lexer must remember where it left off. • Sometimes the lexer gets one character ahead in the input; compare ab=13; to ab = 13 ; • In the first case, the identifier ab isn’t recognized until the next token, =, is read. • In the second case, blanks signify ends of tokens
Lexer Code • Solutions: • a peek function • a pushback function • consume no symbol when moving out of the start state; i.e., always have the next character available • when the parser calls the lexer, the lexer already has the first character of the next token, probably in a variable ch
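A hedged sketch of this "always one character ahead" convention in Java (the class and method names here are ours, not the book's):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LookaheadReader {
    private final Reader in;
    private char ch = ' ';   // initialized to a space, as in Figure 3.4

    public LookaheadReader(Reader in) { this.in = in; }

    public char current() { return ch; }          // peek without consuming

    public char advance() throws IOException {    // consume; refill the lookahead
        int c = in.read();
        ch = (c == -1) ? '\004' : (char) c;       // \004 marks eof, per the slides
        return ch;
    }

    public static void main(String[] args) throws IOException {
        LookaheadReader r = new LookaheadReader(new StringReader("ab"));
        System.out.println(r.current());  // ' ' (the initial space)
        System.out.println(r.advance());  // 'a'
        System.out.println(r.advance());  // 'b'
    }
}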
3.2.3 From Design to Code

private char ch = ' ';
public Token next( ) {
    do {
        switch (ch) {
        ...
        }
    } while (true);
}

Figure 3.4: Outline of the Next Token Routine
Remarks • Exit do-while loop only when a token is found • Loop exited via a return statement which returns control to the parser • Variable ch must be initialized to a space character; thereafter it always holds the next character to be processed.
Translation Rules • Pages 67,68 give rules for translating the DFSA into code. • A Java Tokenizer Method for Clite is shown on page 69 (Figure 3.5) • Auxiliary functions described on page 68 and 70.
private boolean isLetter(char c) {
    return c >= 'a' && c <= 'z' ||
           c >= 'A' && c <= 'Z';
}
private String concat(String set) {
    StringBuffer r = new StringBuffer("");
    do {
        r.append(ch);
        ch = nextChar( );
    } while (set.indexOf(ch) >= 0);
    return r.toString( );
}
// keyword, concat, mkIntLiteral, mkFloatLiteral are auxiliary methods
public Token next( ) {
    do {
        if (isLetter(ch)) { // ident or keyword
            String spelling = concat(letters + digits);
            return Token.keyword(spelling);
        } else if (isDigit(ch)) { // numeric literal
            String number = concat(digits);
            if (ch != '.') // int literal
                return Token.mkIntLiteral(number);
            number += concat(digits);
            return Token.mkFloatLiteral(number);
        }
        else switch (ch) {
        case ' ': case '\t': case '\r': case eolnCh:
            ch = nextChar( ); break;
        // omitted: '/', comments, '\'
        case eofCh: return Token.eofTok;
        case '+': ch = nextChar( );
            return Token.plusTok;
        ...
        case '&': check('&'); return Token.andTok;
        case '=': return chkOpt('=', Token.assignTok,
                                Token.eqeqTok);
Source:
// a first program
// with 3 comments
int main ( ) {
    char c;
    int i;
    c = 'h';
    i = c + 3;
} // main

Tokens (Token Type, Token):
Keyword int
Keyword main
Punctuation (
Punctuation )
Punctuation {
Keyword char
Identifier c
Punctuation ;
etc.
Contents 3.1 Chomsky Hierarchy 3.2 Lexical Analysis 3.3 Syntactic Analysis
Syntactic Analysis (The Parser) • Purpose: to recognize source code structure • Input: tokens • Output: parse tree or abstract syntax tree
Parsing Algorithms – two types • Top-down: (recursive descent, LL) • LL = Left-to-right scan of input, Leftmost derivation • Based directly on the BNF grammar for the language • Builds the parse tree in preorder: • begin with the start symbol as the root of the tree • expand downward using the BNF rules; intermediate tree nodes correspond to the language's nonterminals • the leaves of the parse tree are terminal symbols (tokens) • The representation of the parse tree may be converted to abstract syntax as parsing proceeds.
Parsing Algorithms – two types • Bottom-up: (LR) • LR = Left-to-right scan of input, Rightmost derivation • start with the leaves (tokens) • group them together to form interior tree nodes that match rules in the grammar • end up at the root of the parse tree • Equivalent to a rightmost derivation in reverse
Partial example: parsing x*y + z • Top-down: starting from the root, expand Exp → Exp + Term • Bottom-up: starting from the leaves, group x * y into a Term, then combine it with + z on the way up to Exp
Recursive Descent Parsing • A recursive descent parser “builds” the parse tree in a top-down manner • Defines a method/function for each non-terminal to recognize input derivable from that nonterminal • Each method should • Recognize the longest sequence of tokens (in the input stream) derivable from that non-terminal • Return an object which is the root of a subtree.
Token Implementation • Tokens have two parts: • a type (e.g., Identifier, Literal) • a value (e.g., xyz, 3.45)
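A minimal sketch of such a two-part token in Java (the book's actual Token class also supplies factory methods such as keyword() and mkIntLiteral(), seen in the lexer code earlier):

public class Token {
    public enum Type { Identifier, Keyword, Literal, Operator, Punctuation }

    private final Type type;     // e.g., Identifier, Literal
    private final String value;  // e.g., "xyz", "3.45"

    public Token(Type type, String value) {
        this.type = type;
        this.value = value;
    }

    public Type type()    { return type; }
    public String value() { return value; }
}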
Auxiliary Functions for the Parser • match(t) compares the current token to the expected token type t • If they match, get the next token and return; the return value is the token's value • Else display a syntax error message • error( ) displays the error message and exits
private String match (TokenType t) {
    String value = token.value();
    if (token.type().equals(t))
        token = lexer.next();
        // token is a global variable
    else
        error(t); // function to report an error
    return value;
}
Grammar for Parsing Example (recursion removed, for recursive descent parsing)
Assignment → Identifier = Expression
Expression → Term { AddOp Term }
AddOp → + | -
Term → Factor { MulOp Factor }
MulOp → * | /
Factor → [ UnaryOp ] Primary
UnaryOp → - | !
Primary → Identifier | Literal | ( Expression )
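A self-contained Java sketch of recursive descent over this grammar, one method per nonterminal. It is a simplification, not the Clite parser: it evaluates single-digit arithmetic as it parses instead of building an abstract syntax tree, and its character-level match() stands in for the token-level match(t) above. All names here are ours.

public class ExprParser {
    private final String input;
    private int pos = 0;

    ExprParser(String input) { this.input = input.replace(" ", ""); }

    private char peek() { return pos < input.length() ? input.charAt(pos) : '$'; }

    private char match(char expected) {        // like the parser's match(t)
        if (peek() != expected)
            throw new RuntimeException("syntax error at " + pos);
        return input.charAt(pos++);
    }

    // Expression → Term { AddOp Term }
    int expression() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = match(peek());
            value = (op == '+') ? value + term() : value - term();
        }
        return value;
    }

    // Term → Factor { MulOp Factor }
    int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = match(peek());
            value = (op == '*') ? value * factor() : value / factor();
        }
        return value;
    }

    // Factor → [ UnaryOp ] Primary ; Primary → Literal | ( Expression )
    int factor() {
        if (peek() == '-') { match('-'); return -factor(); }   // [ UnaryOp ]
        if (peek() == '(') {
            match('(');
            int value = expression();
            match(')');
            return value;
        }
        char c = peek();
        if (!Character.isDigit(c))
            throw new RuntimeException("expected digit at " + pos);
        match(c);
        return c - '0';    // single-digit Literal, for brevity
    }

    public static void main(String[] args) {
        System.out.println(new ExprParser("2*3 + 4").expression()); // 10
    }
}

Note how the { ... } repetition in Expression and Term becomes a while loop, and the [ ... ] option in Factor becomes a single if: this is exactly why the left recursion was removed from the grammar before coding.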