Lexical and syntax analysis

CSci210.BA4 Lexical and syntax analysis

Chapter 4 Topics • Introduction • Lexical and Syntax Analysis • The Parsing Problem • Recursive-Descent Parsing • Bottom-Up Parsing

Introduction • Syntax analyzers are almost always based on a formal description of the syntax of the source language (a grammar) • Almost all compilers separate syntax analysis into two parts: • Lexical analysis – low-level (small-scale constructs such as names and literals) • Syntax analysis – high-level (expressions, statements, and program units)

Reasons to Separate Syntax and Lexical Analysis • Simplicity – lexical analysis is less complex, so both parts are simpler when it is separated out • Efficiency – separation allows the lexical analyzer to be optimized on its own • Portability – the lexical analyzer is somewhat platform dependent, whereas the syntax analyzer is more platform independent

Lexical Analysis • A pattern matcher for character strings • Performs syntax analysis at the lowest level of the program structure • Extracts lexemes from a given input string and produces the corresponding tokens

Lexical Analysis (continued) • Example statement: result = oldsum - value / 100;

Token        Lexeme
IDENT        result
ASSIGN_OP    =
IDENT        oldsum
SUB_OP       -
IDENT        value
DIV_OP       /
INT_LIT      100
SEMICOLON    ;
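
The token/lexeme pairs above can be produced by a small hand-written scanner. The following is a minimal sketch in C, not a definitive implementation. The token names come from the slide; the `Token` struct, the `next_token` function, and the extra `END_OF_INPUT` and `UNKNOWN` codes are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token codes named after the slide; END_OF_INPUT and UNKNOWN are
   additions made for this sketch. */
typedef enum {
    IDENT, INT_LIT, ASSIGN_OP, SUB_OP, DIV_OP, SEMICOLON,
    END_OF_INPUT, UNKNOWN
} TokenType;

typedef struct {
    TokenType type;
    char lexeme[32];   /* longer lexemes are truncated in this sketch */
} Token;

/* Reads one token starting at *cursor and advances *cursor past it. */
Token next_token(const char **cursor) {
    assert(cursor != NULL && *cursor != NULL);
    Token tok = { END_OF_INPUT, "" };
    const char *p = *cursor;
    size_t len = 0;

    while (isspace((unsigned char)*p)) p++;          /* skip white space */

    if (isalpha((unsigned char)*p)) {                /* letter (letter | digit)* */
        tok.type = IDENT;
        while (isalnum((unsigned char)*p)) {
            if (len < sizeof tok.lexeme - 1) tok.lexeme[len++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {         /* digit+ */
        tok.type = INT_LIT;
        while (isdigit((unsigned char)*p)) {
            if (len < sizeof tok.lexeme - 1) tok.lexeme[len++] = *p;
            p++;
        }
    } else if (*p != '\0') {                         /* single-character tokens */
        switch (*p) {
        case '=': tok.type = ASSIGN_OP; break;
        case '-': tok.type = SUB_OP;    break;
        case '/': tok.type = DIV_OP;    break;
        case ';': tok.type = SEMICOLON; break;
        default:  tok.type = UNKNOWN;   break;
        }
        tok.lexeme[len++] = *p++;
    }
    tok.lexeme[len] = '\0';
    *cursor = p;
    return tok;
}
```

Calling `next_token` repeatedly on a pointer to the example statement yields the tokens in the order listed, followed by `END_OF_INPUT` once the string is exhausted.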

Building a Lexical Analyzer • Write a formal description of the tokens and use a software tool that constructs lexical analyzers when given such a description • Design a state transition diagram that describes the tokens and write a program that implements the diagram • Design a state transition diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram

State (Transition) Diagram Design • A directed graph with nodes labeled with state names and arcs labeled with input characters • A diagram with states and transitions for every individual token pattern would be too large and complex • Transitions can be combined to simplify the state diagram (for example, one arc for any letter and one arc for any digit)
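
One way to picture combined transitions is a table-driven recognizer that first maps each character to a class and then looks up the next state by class. The sketch below assumes a tiny diagram that recognizes only identifiers and integer literals; the state names, the `classify` helper, and the `run_diagram` function are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Character classes: one arc per class instead of one arc per character. */
typedef enum { LETTER, DIGIT, OTHER, NUM_CLASSES } CharClass;

/* States of a small diagram for identifiers and integer literals. */
typedef enum { START, IN_ID, IN_INT, REJECT, NUM_STATES } State;

static CharClass classify(char c) {
    if (isalpha((unsigned char)c)) return LETTER;
    if (isdigit((unsigned char)c)) return DIGIT;
    return OTHER;
}

/* transition[state][class] is the state reached along that arc. */
static const State transition[NUM_STATES][NUM_CLASSES] = {
    /*             LETTER  DIGIT   OTHER  */
    /* START  */ { IN_ID,  IN_INT, REJECT },
    /* IN_ID  */ { IN_ID,  IN_ID,  REJECT },
    /* IN_INT */ { REJECT, IN_INT, REJECT },
    /* REJECT */ { REJECT, REJECT, REJECT },
};

/* Runs the diagram over a whole string and returns the final state:
   IN_ID for an identifier, IN_INT for an integer literal,
   START for the empty string, REJECT for anything else. */
State run_diagram(const char *s) {
    assert(s != NULL);
    State state = START;
    for (; *s != '\0'; s++)
        state = transition[state][classify(*s)];
    return state;
}
```

With classes, the table has three columns rather than one per character, which is the simplification the slide describes. A full lexical analyzer would add states and classes for operators and other tokens.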

The Parsing Problem • Two goals of syntax analysis: • Check the input program for any syntax errors, produce a diagnostic message if an error is found, and recover • Produce the parse tree, or at least a trace of the parse tree, for the program • Two Classes of parsers: • Top-down • Bottom-up

Top-Down Parsers • Trace or build a parse tree in preorder (corresponding to a leftmost derivation) • The most common top-down parsing algorithms: • Recursive descent • LL parsers

Bottom-Up Parsers • Produce the parse tree by beginning at the leaves and progressing towards the root • Most common bottom-up parsers are in the LR family

Complexity of Parsing • Parsing algorithms that work for any unambiguous grammar are complex and inefficient: O(n^3), where n is the length of the input • Compilers use parsers that only work for a subset of all unambiguous grammars, but do it in linear time: O(n)

Recursive-Descent Parsing • Top-Down Parser • EBNF is ideal as the basis of a recursive-descent parser • Each nonterminal maps to a function • For a nonterminal with more than one RHS, look at the next token to determine which RHS to choose • No matching RHS = syntax error

Recursive-Descent Parsing • Grammar for an expression:
<expr> → <term> {+ <term>}
<term> → <factor> {* <factor>}
<factor> → id | int_constant | ( <expr> )
• How do we parse? Expression: 1 + 2
<expr> => <term> + <term> => <factor> + <term> => 1 + <term> => 1 + <factor> => 1 + 2

Recursive-Descent Parsing • Grammar for an expression:
<expr> → <term> {+ <term>}
<term> → <factor> {* <factor>}
<factor> → id | int_constant | ( <expr> )
• What does code look like?

/* Parses <expr> → <term> {+ <term>} */
void expr() {
    term();                          /* parse the first <term> */
    while (nextToken == ADD_OP) {    /* repeat for each "+ <term>" */
        lex();                       /* consume the + */
        term();
    }
}
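
The expr() routine above depends on term(), lex(), and nextToken, which the slide does not show. The following sketch fills them in under stated assumptions: the token codes other than ADD_OP, the `ok` error flag, the character-level scanner, and the `parse_expression` entry point are all hypothetical choices for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    ID, INT_CONSTANT, ADD_OP, MULT_OP, LEFT_PAREN, RIGHT_PAREN, END, BAD
} TokenCode;

static const char *input;      /* remaining source text */
static TokenCode nextToken;    /* one-token lookahead */
static bool ok;                /* cleared on the first syntax error */

/* Minimal scanner: stores the code of the next token in nextToken. */
static void lex(void) {
    while (isspace((unsigned char)*input)) input++;
    char c = *input;
    if (c == '\0') { nextToken = END; return; }
    if (isalpha((unsigned char)c)) {
        while (isalnum((unsigned char)*input)) input++;
        nextToken = ID;
    } else if (isdigit((unsigned char)c)) {
        while (isdigit((unsigned char)*input)) input++;
        nextToken = INT_CONSTANT;
    } else {
        input++;
        switch (c) {
        case '+': nextToken = ADD_OP;      break;
        case '*': nextToken = MULT_OP;     break;
        case '(': nextToken = LEFT_PAREN;  break;
        case ')': nextToken = RIGHT_PAREN; break;
        default:  nextToken = BAD;         break;
        }
    }
}

static void expr(void);

/* <factor> -> id | int_constant | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID || nextToken == INT_CONSTANT) {
        lex();
    } else if (nextToken == LEFT_PAREN) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN) lex();
        else ok = false;               /* missing ')' */
    } else {
        ok = false;                    /* no RHS starts with this token */
    }
}

/* <term> -> <factor> {* <factor>} */
static void term(void) {
    factor();
    while (ok && nextToken == MULT_OP) { lex(); factor(); }
}

/* <expr> -> <term> {+ <term>} */
static void expr(void) {
    term();
    while (ok && nextToken == ADD_OP) { lex(); term(); }
}

/* Returns true if text is a complete expression in the grammar. */
bool parse_expression(const char *text) {
    assert(text != NULL);
    input = text;
    ok = true;
    lex();
    expr();
    return ok && nextToken == END;
}
```

Each nonterminal has its own function, and each function decides what to do by inspecting only `nextToken`. This parser merely accepts or rejects its input; building a parse tree or reporting diagnostics is left out of the sketch.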

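Recursive-Descent Parsing • The Left Recursion Problem (LL grammar class) • A rule whose RHS begins with its own LHS makes the parsing function call itself without consuming any input, so the expansion never terminates:
<expr> => <expr> + <term> => <expr> + <term> + <term> => <expr> + <term> + <term> + <term> => ...
• How do we fix it? • Modify the grammar to remove left recursion
Before: <expr> → <expr> + <term> | <term>
After:  <expr> → <term> <expr'>
        <expr'> → + <term> <expr'> | ε
(in EBNF: <expr> → <term> {+ <term>})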

Recursive-Descent Parsing • The Pairwise Disjointness Problem • If the grammar is not pairwise disjoint, how do you know which RHS to pick based on the next token?
<variable> → identifier | identifier [<expr>]
• How do we fix it? • Left factoring
<variable> → identifier <new>
<new> → ε | [<expr>]
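
After left factoring, a single character of lookahead is enough to choose an RHS for <new>. The sketch below illustrates this under simplifying assumptions: it works directly on characters with no white space, and <expr> is reduced to an integer literal or a nested <variable>. The names `new_part` and `is_variable` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static const char *pos;    /* current position in the input */

static bool variable(void);

/* Stand-in for <expr>: an integer literal or a nested <variable>
   (a simplification made for this sketch). */
static bool expr(void) {
    if (isdigit((unsigned char)*pos)) {
        while (isdigit((unsigned char)*pos)) pos++;
        return true;
    }
    return variable();
}

/* <new> -> ε | [ <expr> ]   The next character alone selects the RHS. */
static bool new_part(void) {
    if (*pos != '[') return true;    /* ε alternative: consume nothing */
    pos++;                           /* consume '[' */
    if (!expr()) return false;
    if (*pos != ']') return false;   /* missing ']' */
    pos++;
    return true;
}

/* <variable> -> identifier <new> */
static bool variable(void) {
    if (!isalpha((unsigned char)*pos)) return false;
    while (isalnum((unsigned char)*pos)) pos++;   /* identifier */
    return new_part();
}

/* Returns true if text is exactly one <variable>. */
bool is_variable(const char *text) {
    assert(text != NULL);
    pos = text;
    return variable() && *pos == '\0';
}
```

With the original, unfactored rule, both alternatives begin with an identifier, so `variable` could not pick one before reading past it. In the factored form the shared prefix is parsed once and the decision is deferred to `new_part`.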

Bottom-Up Parsing • Parsing is based on reduction • Reverse of a rightmost derivation • At each step, find the correct RHS that reduces to the previous step in the derivation • Example
Grammar:
<S> → <A>b
<A> → a
<A> → b
Input: ab
Step 1: <A>b
Step 2: <S>

Bottom-Up Parsing • Most bottom-up parsers are shift-reduce algorithms • Shift – move the next input token onto the parse stack • Reduce – replace the RHS (the handle) on top of the stack with its LHS

Bottom-Up Parsing • Handles • Def: β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw (where =>rm is one rightmost derivation step and =>*rm is zero or more such steps) • The handle of a right sentential form is its leftmost simple phrase • Bottom-up parsing is essentially looking for handles and replacing them with their LHS

Bottom-Up Parsing • Advantages of LR (Shift-Reduce) Parsers • They can be built for all programming languages • They can detect syntax errors as soon as it is possible in a left-to-right scan • The LR class of grammars is a proper superset of the class parsable by LL parsers (for example, many left recursive grammars are LR, but none are LL)

Bottom-Up Parsing • Components of a Shift-Reduce Parser • Input Sequence – the tokens still to be parsed • Parse Stack – input tokens are shifted onto it, interleaved with state symbols • ACTION Table – tells the parser what to do for the current state and next input token • GOTO Table – holds the state symbols to be pushed onto the stack when a reduction is completed

Bottom-Up Parsing • ACTION Table (one half of the parse table) • Rows = State symbols • Columns = Terminal symbols • Values • Shift – push the next token onto the stack • Reduce – replace the handle with its LHS • Accept – the stack holds only the start symbol and the input is empty • Error – the input contains a syntax error

Bottom-Up Parsing • GOTO Table (the other half of the parse table) • Rows = State symbols • Columns = Nonterminal symbols • Values indicate which state symbol should be pushed onto the parse stack after a reduction has been completed
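
To see how the ACTION and GOTO tables drive a parse, here is a minimal table-driven sketch for the small example grammar <S> → <A>b, <A> → a, <A> → b. The state numbering and table contents were worked out by hand for this illustration (an SLR-style construction with an added rule 0, <S'> → <S>) and are not taken from the slides. Only state symbols are kept on the stack, and `lr_parse` and the other identifiers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Rules: 1. <S> -> <A> b    2. <A> -> a    3. <A> -> b */
enum { T_A, T_B, T_END, NUM_TERMINALS };    /* ACTION columns: a, b, end of input */
enum { N_S, N_A, NUM_NONTERMINALS };        /* GOTO columns: <S>, <A> */
enum { NUM_STATES = 6, STACK_SIZE = 64 };

typedef enum { ACT_ERROR, ACT_SHIFT, ACT_REDUCE, ACT_ACCEPT } Kind;
typedef struct { Kind kind; int arg; } Action;   /* arg: next state or rule number */

#define SH(n) { ACT_SHIFT, n }
#define RD(n) { ACT_REDUCE, n }
#define ERR   { ACT_ERROR, 0 }
#define ACC   { ACT_ACCEPT, 0 }

static const Action action_table[NUM_STATES][NUM_TERMINALS] = {
    /*          a      b      end   */
    /* 0 */ { SH(3), SH(4), ERR   },
    /* 1 */ { ERR,   ERR,   ACC   },
    /* 2 */ { ERR,   SH(5), ERR   },
    /* 3 */ { ERR,   RD(2), ERR   },
    /* 4 */ { ERR,   RD(3), ERR   },
    /* 5 */ { ERR,   ERR,   RD(1) },
};

/* Only state 0 has GOTO entries in this grammar; -1 marks unused cells. */
static const int goto_table[NUM_STATES][NUM_NONTERMINALS] = {
    { 1, 2 }, { -1, -1 }, { -1, -1 }, { -1, -1 }, { -1, -1 }, { -1, -1 },
};

static const int rule_lhs[] = { -1, N_S, N_A, N_A };   /* LHS of each rule */
static const int rule_len[] = {  0,  2,   1,   1  };   /* RHS length of each rule */

/* Returns true if input (a string over 'a' and 'b') is derivable from <S>. */
bool lr_parse(const char *input) {
    assert(input != NULL);
    int stack[STACK_SIZE];
    int top = 0;
    stack[0] = 0;                                  /* start state */
    for (;;) {
        int term;
        if (*input == 'a') term = T_A;
        else if (*input == 'b') term = T_B;
        else if (*input == '\0') term = T_END;
        else return false;                         /* not in the alphabet */

        Action act = action_table[stack[top]][term];
        switch (act.kind) {
        case ACT_SHIFT:                            /* push state, consume token */
            if (top + 1 >= STACK_SIZE) return false;
            stack[++top] = act.arg;
            input++;
            break;
        case ACT_REDUCE: {                         /* pop the handle, then GOTO */
            top -= rule_len[act.arg];
            int next = goto_table[stack[top]][rule_lhs[act.arg]];
            stack[++top] = next;
            break;
        }
        case ACT_ACCEPT:
            return true;
        default:
            return false;                          /* syntax error */
        }
    }
}
```

For the input ab, the parser shifts a, reduces it to <A> by rule 2, shifts b, reduces <A>b to <S> by rule 1, and then accepts, which matches the two reduction steps shown in the earlier example. Note that the same b token is shifted in state 2 but would be reduced to <A> in state 4; the state on top of the stack is what lets the parser pick the right handle.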
