Introduction to Parsing

CIS 461Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan, Keith Cooper, and Linda TorczonLecture-Module #7 Introduction to Parsing

First Phase: Lexical Analysis (Scanning) Scanner Maps stream of characters into tokens Basic unit of syntax Characters that form a word are its lexeme Its syntactic category is called its token Scanner discards white space and comments Scanner works as a subroutine of the parser token IR Source code Parser Scanner get next token Errors

Lexical Analysis • Specify tokens using Regular Expressions • Translate Regular Expressions to Finite Automata • Use Finite Automata to generate tables or code for the scanner source code tokens Scanner tables or code specifications (regular expressions) Scanner Generator

Automating Scanner Construction To build a scanner: • Write down the RE that specifies the tokens • Translate the RE to an NFA • Build the DFA that simulates the NFA • Minimize the DFA • Turn it into code or table Scanner generators • Lex , Flex, Jlex work along these lines • Algorithms are well-known and well-understood • Interface to parser is important

The Cycle of Constructions minimal DFA RE NFA DFA Automating Scanner Construction RENFA (Thompson’s construction) • Build an NFA for each term • Combine them with -moves NFA DFA(subset construction) • Build the simulation DFA Minimal DFA • Hopcroft’s algorithm DFA RE • All pairs, all paths problem • Union together paths from s0to a final state

Scanner Generators: JLex, Lex, FLex directly copied to the output file user code %% JLex directives %% regular expression rules macro (regular) definitions (e.g., digits = [0-9]+) and state names each rule: optinal state list, regular expression, action • States can be mixed with regular expressions • For each regular expression we can define a set of states where it is valid (JLex, Flex) • Typical format of regular expression rules: • <optional_state_list> regular_expression { actions }

JLex, FLex, Lex Automata for regular expression r_1 Java code for JLex, C code for FLex and Lex new final states Regular expression rules: r_1 { action_1 } r_2 { action_2 } . . . r_n { action_n } Ar_1 error new start sate  Ar_2  s0 error . Rules used by scanner generators 1) Continue scanning the input until reaching an error state 2) Accept the longest prefix that matches to a regular expression and execute the corresponding action 3) If two patterns match the longest prefix, then the action which is specified earlier will be executed 4) After a match, go back to the end of the accepted prefix in the input and start scanning for the next token  . . Ar_n error For faster scanning, convert this NFA to a DFA and minimize the states

Limits of Regular Languages Advantages of Regular Expressions • Simple & powerful notation for specifying patterns • Automatic construction of fast recognizers • Many kinds of syntax can be specified with REs If REs are so useful … Why not use them for everything? Example — an expression grammar Id  [a-zA-Z] ([a-zA-z] | [0-9])* Num  [0-9]+ Term  Id | Num Op “+” | “-” | “” | “/” Expr ( Term Op )* Term

Limits of Regular Languages If we add balanced parentheses to the expressions grammar, we cannot represent it using regular expressions: A DFA of size n cannot recognize balanced parentheses with nesting depth greater than n Not all languages are regular: RL’s  CFL’s  CSL’s Solution: Use a more powerful formalism, context-free grammars • Id  [a-zA-Z] ([a-zA-z] | [0-9])* • Num  [0-9]+ • Term  Id | Num • Op “+” | “-” | “” | “/” • Expr Term | Expr Op Expr | “(“ Expr “)”

The Front End: Parser Parser Input: a sequence of tokens representing the source program Output: A parse tree (in practice an abstract syntax tree) While generating the parse tree parser checks the stream of tokens for grammatical correctness Checks the context-free syntax Parser builds an IR representation of the code Generates an abstract syntax tree Guides checking at deeper levels than syntax token Source code IR Type Checker IR Scanner Parser get next token Errors

The Study of Parsing Need a mathematical model of syntax — a grammar G Context-free grammars Need an algorithm for testing membership in L(G) Parsing algorithms Parsing is the process of discovering a derivation for some sentence from the rules of the grammar Equivalently, it is the process of discovering a parse tree Natural language analogy Lexical rules correspond to rules that define the valid words Grammar rules correspond to rules that define valid sentences

An Example Grammar 1 Start  Expr 2 Expr  Expr Op Expr 3 | num 4 | id 5 Op  + 6 | - 7 | * 8 | / Start Symbol: S = Start Nonterminal Symbols: N = { Start, Expr, Op } Terminal symbols: T = { num, id, +, -, *, / } Productions: P = { 1, 2, 3, 4, 5, 6, 7, 8 } (shown above)

Specifying Syntax with a Grammar Context-free syntax is specified with a context-free grammar Formally, a grammar is a four tuple, G = (S,N,T,P) T is a set of terminal symbols These correspond to tokens returned by the scanner For the parser tokens are indivisible units of syntax N is a set of non-terminal symbols These are syntactic variables that can be substituted during a derivation Variables that denote sets of substrings occurring in the language S is the start symbol : S  N All the strings in L(G) are derived from the start symbol P is a set of productions or rewrite rules : P : N  (N  T)*

Production Rules Restriction on production rules determines the expressive power • Regular grammars: productions are either left-linear or right-linear • Right-linear: Productions are of the form A  wB, or A  w where A,B are nonterminals and w is a string of terminals • Left-linear: Productions are of the form A  Bw, or A  w where A,B are nonterminals and w is a string of terminals • Regular grammars recognize regular sets • One can automatically construct a regular grammar from an NFA that accepts the same language (and visa versa) • Context-free grammars: Productions are of the form A   where A is a nonterminal symbol and  is a string of terminal and nonterminal symbols • Context-sensitive grammars: Productions are of the form   where  and  are arbitrary strings of terminal and nonterminal symbols with   and ||  || • Unrestricted grammars: Productions are of the form   where  and  are arbitrary strings of terminal and nonterminal symbols with   • Unrestricted grammars are as powerful as Turing machines

An NFA can be translated to a Regular Grammar • For each state i of the NFA create a nonterminal symbol Ai • If state i has a transition to state j on symbol a, introduce the production Ai a Aj • If state i goes to state j on symbol , introduce the production Ai Aj • If i is an accepting state, introduce Ai • If i is the start state make Ai be the start symbol of the grammar • 1 A0 A1 • 2 A1 a A1 • 3 | b A1 • 4 | a A2 • 5 A2  b A3 • 6 A3  b A4 • 7 A4   a  a b b S0 S1 S2 S3 S4 b

Derivations Such a sequence of rewrites is called a derivation Process of discovering a derivation is called parsing An example derivation for x - 2* y An example grammar Rule Sentential Form — S 1 Expr 2Expr Op Expr 4<id,x> Op Expr 6<id,x> - Expr 2<id,x> - Expr Op Expr 3<id,x> - <num,2> Op Expr 7<id,x> - <num,2> * Expr 4<id,x> - <num,2> * <id,y> 1S Expr 2Expr Expr Op Expr 3 | num 4 | id 5Op  + 6 | - 7 | * 8 | / We denote this as: S* id - num * id AB meansAderives Bafter applying one production A*BmeansAderives Bafter applying zero or more productions

Sentences and Sentential Forms Given a grammar G with a start symbol S • A string of terminal symbols than can be derived from S by applying the productions is called a sentence of the grammar • These strings are the members of set L(G), the languagedefined by the grammar • A string of terminal and nonterminal symbols that can be derived from S by applying the productions of the grammar is called a sentential form of the grammar • Each step of derivation forms a sentential form • Sentences are sentential forms with no nonterminal symbols

Derivations At each step, we make two choices Choose a non-terminal to replace Choose a production to apply Different choices lead to different derivations Two types of derivation are of interest Leftmost derivation — replace leftmost non-terminal at each step Rightmost derivation — replace rightmost non-terminal at each step These are the two systematic derivations (the first choice is fixed) The example on the earlier slide was a leftmost derivation Of course, there is a rightmost derivation (next slide)

Two Derivations for x - 2 * y In both cases, S * id - num * id Note that, these two derivations produce different parse trees The parse trees imply different evaluation orders! Rule Sentential Form — S 1 Expr 2Expr Op Expr 4<id,x> Op Expr 6<id,x> - Expr 2<id,x> - Expr Op Expr 3<id,x> - <num,2> Op Expr 7<id,x> - <num,2> * Expr 4<id,x> - <num,2> * <id,y> Rule Sentential Form — S 1 Expr 2Expr Op Expr 4Expr Op <id,y> 7Expr * <id,y> 2Expr Op Expr * <id,y> 3Expr Op <num,2> * <id,y> 6Expr - <num,2> * <id,y> 4<id,x> - <num,2> * <id,y> Leftmost derivation Rightmost derivation

Derivations and Parse Trees Leftmost derivation S Expr <id,y> Rule Sentential Form — S 1 Expr 2Expr Op Expr 4<id,x> Op Expr 6<id,x> - Expr 2<id,x> - Expr Op Expr 3<id,x> - <num,2> Op Expr 7<id,x> - <num,2> * Expr 4<id,x> - <num,2> * <id,y> Expr Expr Expr Op Op <id,x> Expr - This evaluates as x - ( 2 * y ) <num,2> *

Derivations and Parse Trees Rightmost derivation S E E Op E E Op E Rule Sentential Form — S 1 Expr 2Expr Op Expr 4Expr Op <id,y> 7Expr * <id,y> 2Expr Op Expr * <id,y> 3Expr Op <num,2> * <id,y> 6Expr - <num,2> * <id,y> 4<id,x> - <num,2> * <id,y> <id,y> * This evaluates as ( x - 2 ) * y <id,x> <num,2> -

S Expr <id,y> Another Rightmost Derivation Another Rightmost derivation Rule Sentential Form — S 1 Expr 2Expr Op Expr 2Expr Op Expr Op Expr 4Expr Op Expr Op <id,y> 7Expr Op Expr * <id,y> 3Expr Op <num,2> * <id,y> 6 Expr - <num,2> * Expr 4<id,x> - <num,2> * <id,y> Expr Expr Expr Op Op <id,x> Expr - This evaluates as x - ( 2 * y ) <num,2> * This parse tree is different than the parse tree for the previous rightmost derivation, but it is the same as the parse tree for the earlier leftmost derivation

Derivation and Parse Trees • A parse tree does not show in which order the productions were applied, it ignores the variations in the order • Each parse tree has a corresponding unique leftmost derivation • Each parse tree has a corresponding unique rightmost derivation

Parse Trees and Precedence These two parse trees point out a problem with the expression grammar: It has no notion of precedence (implied order of evaluation between different operators) To add precedence Create a non-terminal for each level of precedence Isolate the corresponding part of the grammar Force parser to recognize high precedence subexpressions first For algebraic expressions Multiplication and division, first Subtraction and addition, next

S E E Op E E <num,2> Another Problem: Parse Trees and Associativity S E E E E Op Op E <num,5> Op - E <num,2> - - - <num,5> <num,2> <num,2> Result is 1 Result is 5

Precedence and Associativity Adding the standard algebraic precedence and using left recursion produces: • This grammar is slightly larger • Takes more rewriting to reach • some of the terminal symbols • Encodes expected precedence • Enforces left-associativity • Produces same parse tree • under leftmost & rightmost • derivations • Let’s see how it parses our example 1S Expr 2Expr Expr + Term 3 | Expr - Term 4 | Term 5Term  Term * Factor 6 | Term / Factor 7| Factor 8 Factor num 9 | id

S E E - T T T * F F F <id,y> <id,x> <num,2> Precedence Rule Sentential Form S 1 Expr 3 Expr - Term 7 Term - Term 8 Factor - Term 3 <id,x> - Term 7 <id,x> - Term * Factor 8 <id,x> - Factor * Factor 4 <id,x> - <num,2> * Factor 7 <id,x> - <num,2> * <id,y> The leftmost derivation Its parse tree This produces x - ( 2 * y ) , along with an appropriate parse tree. Both the leftmost and rightmost derivations give the same parse tree and the same evaluation order, because the grammar directly encodes the desired precedence.

Associativity S Rule Sentential Form S 1 Expr 3 Expr - Term 7 Expr - Factor 8 Expr - <num,2> 3 Expr - Term - <num,2> 7 Expr - Factor - <num,2> 8 Expr - <num,2> - <num,2> 4 Term - <num,2> - <num,2> 7 Factor - <num,2> - <num,2> 8 <num,5> - <num,2> - <num,2> E E - T E - F T T F <num,2> F <num,2> <num,5> The rightmost derivation Its parse tree This produces ( 5 - 2 ) - 2 , along with an appropriate parse tree. Both the leftmost and rightmost derivations give the same parse tree and the same evaluation order

Ambiguous Grammars What was the problem with the original grammar? Rule Sentential Form — S 1 Expr 2Expr Op Expr 4Expr Op <id,y> 7Expr * <id,y> 2Expr Op Expr * <id,y> 3Expr Op <num,2> * <id,y> 6Expr - <num,2> * <id,y> 4<id,x> - <num,2> * <id,y> Rule Sentential Form — S 1 Expr 2Expr Op Expr 2Expr Op Expr Op Expr 4Expr Op Expr Op <id,y> 7Expr Op Expr * <id,y> 3Expr Op <num,2> * <id,y> 6Expr - <num,2> * Expr 4<id,x> - <num,2> * <id,y> 1S Expr 2Expr Expr Op Expr 3 | num 4 | id 5Op  + 6 | - 7 | * 8 | / • This grammar allows multiple rightmost derivations for x - 2 * y • Equivalently, this grammar generates multiple parse trees for x - 2 * y • The grammar is ambiguous different choices

Ambiguous Grammars If a grammar has more than one leftmost derivation for some sentence (or sentential form), then the grammar is ambiguous If a grammar has more than one rightmost derivation for some sentence (or sentential form), then the grammar is ambiguous If a grammar produces more than one parse tree for some sentence (or sentential form), then it is ambiguous Classic example — the dangling-else problem • 1 Stmt  if Expr then Stmt • 2 | if Expr then Stmt else Stmt • | … other stmts …

Ambiguity The following sentential form has two parse trees: if Expr1then if Expr2 then Stmt1 else Stmt2 Stmt Stmt if Stmt2 Expr1 then Stmt else if Expr1 then Stmt if Expr2 then Stmt1 if Expr2 then Stmt1 else Stmt2 production 2, then production 1 production 1, then production 2

Ambiguity Removing the ambiguity Must rewrite the grammar to avoid generating the problem Match each else to innermost unmatched if(common sense rule) With this grammar, the example has only one parse tree 1Stmt Matched 2| Unmatched 3Matched  If Expr then Matchedelse Matched 4 |… other kinds of stmts … 5Unmatched If Expr then Stmt 6| If Expr then Matchedelse Unmatched

Ambiguity if Expr1then if Expr2 then Stmt1 else Stmt2 This binds the else to the inner if RuleSentential Form —Stmt 2Unmatched 5if Expr then Stmt ?if Expr1 then Stmt 1if Expr1 then Matched 3if Expr1 then if Expr then Matched else Matched ?if Expr1 then if Expr2 then Matched else Matched 4if Expr1 then if Expr2 then Stmt1else Matched 4if Expr1 then if Expr2 then Stmt1else Stmt2

Ambiguity Theoretical results: • It is undecidable whether an arbitrary CFG is ambiguous • There exists CFLs for which every CFG is ambiguous. These are called inherenlty ambiguous CFLs. • Example: { 0i 1j2k| i = j or j = k }

Ambiguity Ambiguity usually refers to confusion in the CFG Overloading can create deeper ambiguity a = f(17) In many Algol-like languages, f could be either a function or a subscripted variable Disambiguating this one requires context Need values of declarations Really an issue of type, not context-free syntax Requires an extra-grammatical solution (not in CFG) Must handle these with a different mechanism Step outside grammar rather than use a more complex grammar

Ambiguity Ambiguity can arise from two distinct sources Confusion in the context-free syntax Confusion that requires context to resolve Resolving ambiguity To remove context-free ambiguity, rewrite the grammar To handle context-sensitive ambiguity takes cooperation Knowledge of declarations, types, … Accept a superset of the input language then check it with other means (type checking, context-sensitive analysis) This is a language design problem Sometimes, the compiler writer accepts an ambiguous grammar Parsing algorithms can be kludged so that they “do the right thing”

Introduction to Parsing