Chapter 4

Chapter 4 Lexical and Syntax Analysis

Chapter 4Outline • Lexical Analysis • The Parsing Problem • Introduction to Parsing • Top-Down Parsers • Bottom-Up Parsers • The Complexity of Parsing • Recursive-Descent Parsing • The Recursive-Descent Parsing Process • The LL Grammar Class • Bottom-Up Parsing • The Parsing Problem for Bottom-Up Parsers • Shift-Reduce Algorithms • LR Parsers

Introduction • Language implementation systems analyze source code • Syntax analysis is based on a formal description of the syntax of the source language (BNF) • The syntax analysis portion of a language processor consists of two parts: • 1. A low-level part called a lexical analyzer • 2. A high-level part called a syntax analyzer, • or parser

Introduction Reasons to use BNF to describe syntax: 1. Provides a clear and concise syntax description 2. The parser can be based directly on the BNF 3. Parsers based on BNF are easy to maintain Reasons to separate lexical and syntax analysis: 1. Simplicity 2. Efficiency 3. Portability

Lexical Analysis • A lexical analyzer is a pattern matcher for character strings • A lexical analyzer is a “front-end” for the parser • Identifies substrings of the source program that belong together • lexemes • match a character pattern • associated with a lexical category called a token • sum is a lexeme; its token may be IDENT

Lexical Analysis • A lexical analyzer is a function that returns the next token in an input string (the program) • Three approaches to building a lexical analyzer: • Use a software-generated table-driven lexical analyzer • Write a program implementing a state diagram • Hand-construct a table-driven implementation of a state diagram

State Diagram Design A naïve state diagram would have a transition from every state on every character in the source language Such a diagram would be very large! Transitions can be combined to simplify the state diagram

Example State Diagram Design • When recognizing an identifier, • if all uppercase and lowercase letters are equivalent, • use a character class that includes all letters • When recognizing an integer literal, • all digits are equivalent • Reserved words and identifiers can be recognized together: • a table lookup can determine identifier vs. reserved word

ExampleConvenient Utility Subprograms • getChar • gets the next character of input • puts it in nextChar • determines its class and puts the class in charClass • addChar • moves the character from nextChar into the variable where the lexeme is being accumulated: lexeme • lookup • determines whether the string in lexeme is a reserved word (returns a code)

Example

Example Implementationassume initialization int lex() { switch (charClass) { case LETTER: addChar(); /* nextChar -> lexeme */ getChar(); /* input -> nextChar */ while (charClass == LETTER || charClass == DIGIT) { addChar(); /* nextChar -> lexeme */ getChar(); /* input -> nextChar */ } return lookup(lexeme); break; case DIGIT: addChar(); /* nextChar -> lexeme */ getChar(); /* input -> nextChar */ while (charClass == DIGIT) { addChar(); /* nextChar -> lexeme */ getChar(); /* input -> nextChar */ } return INT_LIT; break; } /* End of switch */ } /* End of function lex */

The Parsing Problem • Find all syntax errors, • produce a diagnostic message, • and recover quickly! • End result: • a trace of the parse tree for the program • Two categories of parsers • Top down: • builds parse tree starting at the root • Bottom up: • builds parse tree starting at the leaves • Parsers look only one token ahead in the input

Top-Down Parsers Given a sentential form, xA , where x is a string of terminals, A is a non-terminal, and is a string of terminals and non-terminals, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A The most common top-down parsing algorithms: 1. Recursive descent - a coded implementation 2. LL parsers - table driven implementation

Bottom-Up Parsers Given a right sentential form, , what substring of  is the right-hand side of the rule in the grammar that must be reduced (production in reverse) to produce the previous sentential form in the right derivation The most common bottom-up parsing algorithms are in the LR family: SLR LALR (parses more grammars than SLR) Canonical LR (parses more than LALR)

The Complexity of Parsing Parsers that work for any unambiguous grammar are complex and inefficient O(n3), where n is the length of the input Compilers use parsers that only work for a subset of all unambiguous grammars, but do it in linear time O(n), where n is the length of the input

Recursive-Descent Parsing Use a subprogram for each nonterminal in the grammar to parse sentences containing that nonterminal Since a grammar expressed in EBNF minimizes the number of nonterminals, EBNF is a good basis for recursive-descent parsing Assume a lexical analyzer named lex puts the token code of the next terminal of the input stream into nextToken By convention, every parsing routine ends calling lex to read the next input token and load nextToken

Recursive-Descent Parsing A grammar for simple expressions: <expr>  <term> {(+ | -) <term>} <term>  <factor> {(* | /) <factor>} <factor>  id | ( <expr> )

Recursive-Descent Parsing The coding process when there is only one RHS for a particular nonterminal: For each terminal symbol in the RHS, compare it to nextToken: match? continue no match? error For each nonterminal symbol in the RHS, call its associated parsing subprogram

Recursive-Descent Parsingfunction for expression /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Function expr Parses strings in the language generated by the rule: <expr> → <term> {(+ | -) <term>} * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ void expr() { /* Parse sequence of additive operations */ term(); /* Parse the first term */ while (nextToken==PLUS_CODE || nextToken==MINUS_CODE) { /* As long as the next token is + or - */ lex(); /* call lex to get the next token*/ term(); /* parse the next term */ }/* End of while */ }/* End of function expr */ This particular routine does not detect errors

Recursive-Descent Parsing The coding process when there is more than one RHS for a particular nonterminal: Call a process that determines whichRHS to parse The correct RHS is chosen by comparing nextToken (the lookahead) with the first token generated by each RHS until a match is found If no match is found, it is a syntax error Once a particular RHS is chosen, procede as in the single RHS case

Recursive-Descent Parsingfunction for factor /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Function factor Parses strings in the language generated by the rule: <factor> -> id | (<expr>) * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ void factor() { /* Determine which RHS */ if (nextToken)==ID_CODE) /* If RHS is id */ lex(); /* just call lex to load nextToken */ else if (nextToken==LEFT_PAREN_CODE){ /*If RHS==(<expr>)*/ lex(); /* call lex to pass over the left paren */ expr(); /* then call expr */ if (nextToken==RIGHT_PAREN_CODE) /* check right paren */ lex(); /* call lex to load nextToken */ else error(); }/* END OF else if (nextToken == LEFT_PAREN_CODE) */ else error(); /* since neither RHS matches */ }/* END OF FUNCTION factor() */

The LL Grammar ClassLeft-to-right scan, Leftmost derivation • The Left Recursion Problem • Two common characteristics of grammars • disallow top-down parsing: • 1. If a grammar has left recursion • (either direct or indirect) • A grammar can be modified to remove left recursion • 2. If a grammar lacks pairwise disjointness

Pairwise Disjointness The ability to choose a RHS using one token lookahead Definition: FIRST() = {a |  =>* a } (If  =>* ,  Є FIRST()) Pairwise Disjointness Test: For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A => iand A => j, it must be true that FIRST(i) ∩ FIRST(j) =  Examples: A => a | bB | cAb GOOD! A => a | aBBAD!

Pairwise Disjointness Left factoring can resolve the problem Replace <variable> => ident | ident [<expr>] with <variable> => ident <new> <new> =>  |[<expr>] or <variable> => ident [[<expr>]] (outer brackets are metasymbols in EBNF)

Bottom-Up Parsing The parsing problem: dealing only with right-sentential forms, find the correct RHS to reduce to get the previous rightmost derivation step reduce = replace with the LHS of a production rule

LR ParsersLeft-to-right scan, Rightmost derivation • Advantages of LR parsers • They work for most grammars that describe programming languages • They work on a larger class of grammars than other bottom-up algorithms, but are just as efficient as any other bottom-up parser • They can detect syntax errors as soon as possible • The LR class of grammars is a superset of the class parsable by LL parsers

Bottom-Up Parsers Given a right sentential form, , what substring of  is the right-hand side of the rule in the grammar that must be reduced (production in reverse) to produce the previous sentential form in the right derivation The most common bottom-up parsing algorithms are in the LR family: SLR LALR (parses more grammars than SLR) Canonical LR (parses more than LALR)

Bottom-Up Parsing The parsing problem: dealing only with right-sentential forms, find the correct RHS to reduce to get the previous rightmost derivation step reduce = replace with the LHS of a production rule

LR ParsersLeft-to-right scan, Rightmost derivation • Advantages of LR parsers • They work for most grammars that describe programming languages • They work on a larger class of grammars than other bottom-up algorithms, but are just as efficient as any other bottom-up parser • They can detect syntax errors as soon as possible • The LR class of grammars is a superset of the class parsable by LL parsers

Bottom-Up ParsingDefinitions β is a phrase of the right sentential form γ if and only if S =>*γ = α1Aα2=>+α1βα2 corresponding to the internal nodes of the parse tree β is a simple phrase of the right sentential form γ if and only if S =>*γ = α1Aα2=>α1βα2 corresponding to a RHS of the grammar β is the handle of the right sentential form γ = αβw if and only if S =>*rmαAw =>rmαβw

Shift-Reduce Algorithms Handles The handle of a right sentential form is its leftmost simple phrase Given a parse tree, it is now easy to find the handle Bottom-up parsing can be thought of as handle pruning

Figure 4.2 A parse tree for E + T * id

Shift-Reduce Algorithms 1. Shift moving the next token to the top of the parse stack 2. Reduce replacing the handle on the top of the parse stack with its corresponding LHS

Figure 4.3 The structure of an LR parser

LR Parsers • LR parsers must be constructed with a tool • Knuth’s insight: • A bottom-up parser can use the entire history of the parse, up to the current point, to make parsing decisions. • Only a finite and relatively small number of different parse situations can occur • so the history can be stored in a parser state, on the parse stack

Bottom-up Parsing An LR configuration stores the state of an LR parser in a pair of data structures, a stack and a queue: The stack contains pairs of either a lexeme or a token paired with a state. ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)) The queue contains the input string of lexemes (terminals): (ai ai+1 … an $) Initial configuration: (S0) and (a1 … an $)

LR Configuration • LR parsers are table driven, where the table has two components, an ACTION table and a GOTO table • The ACTION table specifies the action of the parser, given the parser state and the next token • Rows = state names • Columns = terminals • The GOTO table specifies which state to put on top of the parse stack after a reduction action is done • Rows = state names • Columns = nonterminals

Bottom-up ParsingParser Actions Assume conditions (during the parse): stack is ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)) where Xi is a single terminal or nonterminal and Si is a state name queue is (ai ai+1 … an $), where all elements are unparsed terminals. Then fetch top of stack (Sm) and front of queue (ai) in order to determine ACTION[Sm,ai]

Bottom-up ParsingParser Actions switch (ACTION[Sm,ai]) { caseShift(S): the next configuration is: stack: ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)(ai,S)) queue: (ai+1 ai+2 … an $) caseReduce(A): the next configuration is: stack: ((-,S0)(X1,S1)(X2,S2)…(Xm-r,Sm-r)(A,S)) queue: (ai ai+1 … an $) (unchanged) where S = GOTO[Sm-r,A], and r = length() case Accept: parse complete; no errors found default: Error, call an error-handling routine }

Shift-ReduceParsing Example Traditional Grammar for Arithmetic Expressions 1. E => E + T 2. E => T 3. T => T * F 4. T => F 5. F => ( E ) 6. F => id Consider the expression: id + id * id $

Figure 4.5 The LR parsing table for an arithmetic expression grammar

Summary • Syntax analysis is a common part of language implementation • A lexical analyzer is a pattern matcher that isolates small-scale parts of a program • Detects syntax errors • Produces a parse tree • A recursive-descent parser is an LL parser • EBNF • Parsing problem for bottom-up parsers: find the substring of current sentential form • The LR family of shift-reduce parsers is the most common bottom-up parsing approach

Chapter 4

Chapter 4

Presentation Transcript

Chapter 4

Chapter 4

Chapter 4

Chapter 4

Chapter 4

Chapter 4

Chapter 4

Chapter 4

Chapter 4-4

Chapter 4

Chapter 4

Chapter 4 - 4

Chapter 4

CHAPTER 4

Chapter 4

Chapter 4

CHAPTER 4

Chapter 4

Chapter 4

CHAPTER 4

Chapter 4

Chapter 4