Chapter 4 Syntax Analysis
Topics to cover:
• Context-Free Grammars: Concepts and Notation
• Writing and rewriting a grammar
• Syntax Error Handling and Recovery
Compiler Construction

Introduction
• Why CFG?
    • A CFG gives a precise syntactic specification of a programming language.
    • An efficient parser can be generated automatically from a CFG.
    • A CFG enables automatic generation of translators.
    • Language extensions become easier.
• The role of the parser
    • Takes tokens from the scanner, parses them, and reports syntax errors.
    • In a syntax-directed translator the parser does more than parse: it also drives type checking, semantic analysis and IR generation.

Example of CFG
• A C– program is made out of functions, a function out of declarations and blocks, a block out of statements, a statement out of expressions, etc.
    <program> → <global_decl_list>
    <global_decl_list> → <global_decl_list> <global_decl> | ε
    <global_decl> → <decl_list> <function_decl>
    <function_decl> → <type> id ( <param_list> ) { <block> }
    <block> → <decl_list> <statement_list> | ε
    <decl_list> → <decl_list> <decl> | <decl> | ε
    <decl> → <type_decl> | <var_decl>
    <type> → void | int | float
    <statement_list> → ….
    <statement> → { <block> }

Notational Conventions
• The following symbols are terminals:
    • Lower-case letters early in the alphabet, such as a, b, c
    • Operators (+, -, etc.) and punctuation symbols (parentheses, commas, etc.)
    • Digits such as 0, 1, 2, etc.
    • Boldface strings such as id or if

Notational Conventions (cont.)
• Nonterminals
    • Upper-case letters early in the alphabet, such as A, B, C
    • The letter S, the start symbol
    • Lower-case italic names such as expr or stmt
• Grammar symbols (either terminals or nonterminals)
    • Upper-case letters late in the alphabet, such as X, Y, Z
• Strings of terminals
    • Lower-case letters late in the alphabet, such as u, v, …, z
• Strings of grammar symbols
    • Lower-case Greek letters, such as α, β, γ

Example
    expr → expr op expr
    expr → ( expr )
    expr → - expr
    expr → id
    op → +
    op → -
    op → *
    op → /
    op → ↑
Using the notational shorthand:
    E → E A E | ( E ) | - E | id
    A → + | - | * | / | ↑
Nonterminals: E and A
Start symbol: E

Derivation
• Given a string αAβ, if A → γ is a production, then we can replace αAβ by αγβ, written αAβ ⇒ αγβ
• ⇒ means derives in one step
• ⇒+ means derives in one or more steps
• ⇒* means derives in zero or more steps
The language L(G) generated by G is the set of terminal strings w such that S ⇒+ w. The string w is called a sentence of G.
If S ⇒* α, where α may contain nonterminals, we say α is a sentential form of G.

Exercise
• What is a sentence of the language L defined by the C++ grammar G?
    Answer: a C++ program.
• Is the following string a sentence or a sentential form?
    int parse(<parameter_list>) {}
    Answer: a sentential form, because it still contains the nonterminal <parameter_list>.

Derivation (cont.)
Consider the following grammar G0:
    E → E + E | E * E | ( E ) | - E | id
The string -(id + id) is a sentence of G0 because there is a derivation
    E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Leftmost derivation: at every step only the leftmost nonterminal is replaced.
Rightmost derivation: at every step only the rightmost nonterminal is replaced.
Exercise: is id - id a sentence of G0? Is -id + id a sentence?
    id - id: No. G0 has only unary minus, so no production places "-" between two operands.
    -id + id: Yes. E ⇒ E + E ⇒ -E + E ⇒ -id + E ⇒ -id + id.

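The derivation of -(id + id) can be replayed mechanically. The following is a minimal Python sketch, not part of the original slides: the list-of-symbols representation and the names `G0`, `leftmost_step` and `derive` are assumptions made for illustration.

```python
# Grammar G0 from the slide; "E" is its only nonterminal.
G0 = {"E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["-", "E"], ["id"]]}

def leftmost_step(form, rhs, grammar=G0):
    """Perform one leftmost derivation step: replace the leftmost nonterminal by rhs."""
    for i, sym in enumerate(form):
        if sym in grammar:
            if rhs not in grammar[sym]:
                raise ValueError(f"{sym} -> {' '.join(rhs)} is not a production")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left: the form is already a sentence")

def derive(start, choices, grammar=G0):
    """Apply the chosen right-hand sides in order and return every sentential form."""
    forms = [[start]]
    for rhs in choices:
        forms.append(leftmost_step(forms[-1], rhs, grammar))
    return forms

# E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
steps = derive("E", [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]])
for form in steps:
    print(" ".join(form))
```

Every list in `steps` except the last still contains E and is therefore a sentential form; the last one contains only terminals and is a sentence of G0.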
Parse Tree and Derivation
A parse tree can be viewed as a graphical representation of a derivation that ignores the replacement order.
    E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Parse tree for -(id + id):
    E
    ├── -
    └── E
        ├── (
        ├── E
        │   ├── E ── id
        │   ├── +
        │   └── E ── id
        └── )
• Interior node: a nonterminal
• Leaves: terminals
• Children of a node: the right-hand side of the production applied at that node

CFG is more powerful than RE
• Every RE can be described by a CFG
• Example: (a|b)*abb
    A → aA | bA | abb
• Converting an NFA into a CFG
    • For each state i of the NFA, create a nonterminal symbol Ai
    • If state i goes to state j on input a, add the production Ai → a Aj
    • If state i goes to state j on ε, add the production Ai → Aj
    • If state i is an accepting state, add the production Ai → ε

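The NFA-to-CFG rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the transition-triple encoding, the use of `None` for an ε-move, the function name `nfa_to_cfg` and the four-state NFA for (a|b)*abb are assumptions, not material from the slides.

```python
def nfa_to_cfg(transitions, accepting):
    """Build a grammar from an NFA.

    transitions: iterable of (state, symbol, next_state); symbol None is an epsilon move.
    accepting:   set of accepting states.
    Returns {nonterminal: [right-hand sides]}; an empty list stands for epsilon.
    """
    grammar = {}
    for i, sym, j in transitions:
        # Ai -> a Aj for a move on a, Ai -> Aj for an epsilon move
        rhs = [f"A{j}"] if sym is None else [sym, f"A{j}"]
        grammar.setdefault(f"A{i}", []).append(rhs)
        grammar.setdefault(f"A{j}", [])
    for i in accepting:
        # Ai -> epsilon for every accepting state
        grammar.setdefault(f"A{i}", []).append([])
    return grammar

# Assumed NFA for (a|b)*abb: state 0 is the start state, state 3 is accepting.
nfa = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
cfg = nfa_to_cfg(nfa, accepting={3})
for head, bodies in cfg.items():
    print(head, "->", " | ".join(" ".join(b) or "ε" for b in bodies))
```

The start symbol of the resulting grammar is the nonterminal of the NFA's start state, here A0.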
Why do we need RE?
• REs are sufficiently powerful for lexical rules
• REs are more concise and easier to understand
• A more efficient lexical analyzer can be constructed from REs than from a CFG
• Separating the lexical from the nonlexical part has advantages such as modularity and easier porting
• Exercise: what if we don't have token definitions?

Defects in CFG
• Useless nonterminals
    S → A | B
    A → a
    B → Bb    (B derives no terminal string)
    C → c     (C is unreachable from S)
• Ambiguity
• Top-down parsing issues
    • Left recursion
    • Left factoring

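Both kinds of useless nonterminal in the example can be found with two simple fixed-point computations. The sketch below is an illustration under assumed names (`generating`, `reachable`) and an assumed dictionary encoding of the grammar; it is not taken from the slides.

```python
def generating(grammar, terminals):
    """Nonterminals that derive at least one terminal string."""
    gen = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in gen:
                continue
            # head is generating if some body consists only of terminals
            # and nonterminals already known to be generating
            if any(all(s in terminals or s in gen for s in body) for body in bodies):
                gen.add(head)
                changed = True
    return gen

def reachable(grammar, start):
    """Nonterminals that appear in some sentential form derived from start."""
    seen, stack = {start}, [start]
    while stack:
        for body in grammar.get(stack.pop(), []):
            for s in body:
                if s in grammar and s not in seen:
                    seen.add(s)
                    stack.append(s)
    return seen

g = {"S": [["A"], ["B"]], "A": [["a"]], "B": [["B", "b"]], "C": [["c"]]}
terms = {"a", "b", "c"}
print("derives no terminal string:", set(g) - generating(g, terms))
print("unreachable:", set(g) - reachable(g, "S"))
```

When cleaning a grammar, the usual order is to drop the non-generating nonterminals first and then recompute reachability, since removing a production can make further symbols unreachable.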
Ambiguity
• A grammar is ambiguous if it produces more than one parse tree for some sentence
• Example 1: A+B+C (is it (A+B)+C or A+(B+C)?)
    • Improper production: expr → expr + expr | id
• Example 2: A+B*C (is it (A+B)*C or A+(B*C)?)
    • Improper production: expr → expr + expr | expr * expr
• Example 3: if E1 then if E2 then S1 else S2 (which then does the else match?)
    • Improper production: stmt → if expr then stmt | if expr then stmt else stmt

Two parse trees of example 3
• Tree 1 (else matches the inner then):
    stmt → if E1 then stmt
    inner stmt → if E2 then S1 else S2
• Tree 2 (else matches the outer then):
    stmt → if E1 then stmt else S2
    inner stmt → if E2 then S1

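For examples 1 and 2, the ambiguity can be confirmed by counting parse trees. The following Python sketch assumes the combined grammar expr → expr + expr | expr * expr | id and a hypothetical helper `count_trees`; it counts trees by trying every operator as the root of the tree.

```python
from functools import lru_cache

def count_trees(tokens):
    """Number of parse trees for tokens under expr -> expr + expr | expr * expr | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # number of ways expr derives tokens[lo:hi]
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):
                # tokens[k] is the operator at the root of the tree
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_trees("id + id + id".split()))
print(count_trees("id + id * id".split()))
```

A count greater than one for some sentence shows that the grammar is ambiguous; a count of one for a particular sentence proves nothing about the grammar as a whole.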
Eliminating Ambiguity
• Operator associativity
    expr → expr + term | term
• Operator precedence
    expr → expr + term | term
    term → term * factor | factor
• Dangling else
    stmt → matched | unmatched
    matched → if expr then matched else matched
    unmatched → if expr then stmt | if expr then matched else unmatched

Eliminating Left Recursion
• Immediate left recursion
    • Example: A → Aα | β
• Transformation
    A → Aα1 | Aα2 | … | β1 | β2 | …
  where no βi begins with A. We replace the A-productions by
    A → β1A' | β2A' | …
    A' → α1A' | α2A' | … | ε

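The transformation above is mechanical, as the following minimal Python sketch suggests. The function name `eliminate_immediate`, the list-of-symbols encoding (with `[]` for ε) and the convention of naming the new nonterminal by appending a prime are assumptions of this sketch.

```python
def eliminate_immediate(head, bodies):
    """Rewrite A -> A a1 | A a2 | ... | b1 | b2 | ... without immediate left recursion.

    Bodies are lists of symbols; [] stands for epsilon.
    The new nonterminal head + "'" is assumed not to be in use already.
    """
    alphas = [b[1:] for b in bodies if b and b[0] == head]   # the a_i parts
    betas = [b for b in bodies if not b or b[0] != head]     # the b_j parts
    if not alphas:
        return {head: bodies}            # nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in betas],              # A  -> b_j A'
        new: [alpha + [new] for alpha in alphas] + [[]],     # A' -> a_i A' | epsilon
    }

result = eliminate_immediate("A", [["A", "a"], ["b"]])
for h, bs in result.items():
    print(h, "->", " | ".join(" ".join(b) or "ε" for b in bs))
```

Both grammars generate the same strings (one β followed by zero or more α's), but the new one builds them with right recursion, which a top-down parser can handle.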
Indirect Left Recursion
• Example:
    S → Aa | b
    A → Ac | Sd | ε
• Transformation (assuming no cycles A ⇒+ A)
    • Arrange the nonterminals in some order A1, A2, …, An
    • Then:
        for i := 1 to n do begin
            for j := 1 to i-1 do begin
                replace each production Ai → Ajγ by Ai → δ1γ | δ2γ | …,
                where Aj → δ1 | δ2 | … are the current Aj-productions
            end
            eliminate the immediate left recursion among the Ai-productions
        end

In the above example,
    S → Aa | b
    A → Ac | Sd | ε
the production A → Sd is replaced by A → Aad | bd, giving
    A → Ac | Aad | bd | ε
Eliminating the immediate left recursion among the A-productions then yields
    S → Aa | b
    A → bdA' | A'
    A' → cA' | adA' | ε

Algorithm 4.1 Eliminating Left Recursion
• This algorithm systematically eliminates left recursion from a grammar, including indirect left recursion.
• Precondition: the grammar has no cycles and no ε-productions.
    • A cycle means A ⇒+ A.
    • The precondition avoids producing productions of the form A → A during nonterminal replacement. For example, with A → BA and B → Ab | ε, replacing B by ε in A → BA gives A → A, a cycle.
    • ε-productions also make the algorithm more complex: A → BCD may effectively become A → CD, so handling only the leftmost nonterminal is not sufficient.

Indirect Left Recursion
    A → Bb | a
    B → Cc | b
    C → Dd | c
    D → Aa | d
    A ⇒ Bb ⇒ Ccb ⇒ Ddcb ⇒ Aadcb
    C ⇒ Dd ⇒ Aad ⇒ Bbad ⇒ Ccbad
We need to expose the immediate left recursions and then eliminate them, and some ordering is needed to do so. Suppose we replace A → Bb by A → Ccb and then start from B: B ⇒ Cc ⇒ Ddc ⇒ Aadc ⇒ Ccbadc. Substituting in this order would never expose the immediate left recursion in this example.

Algorithm 4.1
    for i := 1 to n do begin
        for j := 1 to i-1 do begin
            replace each production of the form Ai → Ajγ by the productions
            Ai → δ1γ | δ2γ | …, where Aj → δ1 | δ2 | … are the current Aj-productions
        end
        eliminate the immediate left recursion among the Ai-productions
    end
Key idea: for each nonterminal Ai, every production that begins with a lower-numbered nonterminal Aj (j < i) is rewritten so that it no longer begins with Aj. After the pass, Ai-productions begin only with terminals or with higher-numbered nonterminals.

[Figure: the nonterminals A1, A2, …, Ai-1, Ai, …, An listed in order, with sample productions Ai-1 → γ | Ai+k η | … and Ai → Ai-1 α | A2 β | …]
After replacement, there will be no backward references.

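Algorithm 4.1 can be sketched in Python as follows. This is an illustrative sketch, not the slides' own code: the dictionary encoding, the function names and the primed-name convention are assumptions, and the helper for immediate left recursion is included only to keep the block self-contained.

```python
def eliminate_immediate(head, bodies):
    """A -> A a_i | b_j  becomes  A -> b_j A',  A' -> a_i A' | epsilon ([] is epsilon)."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]
    if not alphas:
        return {head: bodies}
    betas = [b for b in bodies if not b or b[0] != head]
    new = head + "'"   # assumed to be an unused name
    return {head: [b + [new] for b in betas],
            new: [a + [new] for a in alphas] + [[]]}

def eliminate_left_recursion(grammar, order):
    """Run the two nested loops of Algorithm 4.1 over the nonterminals in `order`."""
    g = {h: [list(b) for b in bs] for h, bs in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for body in g[ai]:
                if body and body[0] == aj:
                    # replace Ai -> Aj gamma by Ai -> delta gamma for each Aj -> delta
                    expanded.extend(delta + body[1:] for delta in g[aj])
                else:
                    expanded.append(body)
            g[ai] = expanded
        g.update(eliminate_immediate(ai, g[ai]))
    return g

# The example from the slides: S -> Aa | b,  A -> Ac | Sd | epsilon
grammar = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
fixed = eliminate_left_recursion(grammar, ["S", "A"])
for h, bs in fixed.items():
    print(h, "->", " | ".join(" ".join(b) or "ε" for b in bs))
```

The example grammar contains A → ε even though the stated precondition excludes ε-productions; in this particular case the ε-production turns out to be harmless, but the sketch makes no attempt to handle cycles or ε-productions in general.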
Left Factoring
Consider the following grammar:
    A → αβ1 | αβ2
On seeing input that begins with α, a top-down parser cannot easily determine whether to expand A to αβ1 or to αβ2. A transformation called left factoring defers the decision until α has been read. The grammar becomes:
    A → αA'
    A' → β1 | β2

Exercise
    stmt → if expr then stmt | if expr then stmt else stmt
Match this against the grammar form A → αβ1 | αβ2. What is α? β1? β2?
    α: if expr then stmt
    β1: ε
    β2: else stmt

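One round of left factoring can be sketched as below. The helper names `common_prefix` and `left_factor`, the grouping of alternatives by their first symbol and the primed naming of new nonterminals are assumptions of this sketch; a full implementation would repeat the step until no two alternatives share a prefix.

```python
def common_prefix(bodies):
    """Longest sequence of symbols that every body starts with."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(head, bodies):
    """One round of left factoring for the alternatives of `head` ([] is epsilon)."""
    groups = {}
    for body in bodies:
        groups.setdefault(body[0] if body else None, []).append(body)
    result = {head: []}
    count = 0
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[head].extend(group)        # nothing to factor
            continue
        count += 1
        new = head + "'" * count              # assumed to be an unused name
        alpha = common_prefix(group)
        result[head].append(alpha + [new])    # A  -> alpha A'
        result[new] = [body[len(alpha):] for body in group]   # A' -> beta_1 | beta_2 | ...
    return result

bodies = [["if", "expr", "then", "stmt"],
          ["if", "expr", "then", "stmt", "else", "stmt"]]
factored = left_factor("stmt", bodies)
for h, bs in factored.items():
    print(h, "->", " | ".join(" ".join(b) or "ε" for b in bs))
```

On the exercise grammar the sketch recovers α = if expr then stmt, β1 = ε and β2 = else stmt.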
Syntax Error Handling
• Different types of errors
    • Lexical
    • Syntactic
    • Semantic
    • Logical
• Error handling goals
    • Report errors clearly and accurately
    • Recover quickly
    • Keep the compiler fast

Error Handling Strategies
• Don't quit after detecting the first error.
• Avoid introducing "spurious" errors.
• Inhibit error messages that stem from errors uncovered too close together.
• Simple error repair is usually sufficient, given the increasing emphasis on interactive computing and good programming environments.

Error Recovery Strategies
• Panic mode
    • Delete input tokens until one of a designated set of synchronizing tokens is found.
• Phrase level
    • Apply a local correction, for example to repair punctuation errors.
• Error productions
    • Augment the grammar with productions that describe common errors.
• Global correction
    • Find a globally least-cost correction to the input string; costly to implement.
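Panic-mode recovery is the easiest of these strategies to illustrate. The sketch below assumes a toy language whose only statement form is `id = id ;` and uses `;` as the sole synchronizing token; the function name `parse_statements` and its return value are likewise assumptions made for the example.

```python
def parse_statements(tokens, sync=frozenset({";"})):
    """Parse a sequence of `id = id ;` statements with panic-mode recovery.

    Returns (number of well-formed statements, positions where errors were reported).
    """
    pattern = ["id", "=", "id", ";"]
    pos, ok, errors = 0, 0, []
    while pos < len(tokens):
        if tokens[pos:pos + 4] == pattern:
            ok += 1
            pos += 4
            continue
        errors.append(pos)                    # report the error once
        while pos < len(tokens) and tokens[pos] not in sync:
            pos += 1                          # panic mode: discard input tokens
        pos += 1                              # consume the synchronizing token
    return ok, errors

tokens = "id = id ; id id = ; id = id ;".split()
ok, errors = parse_statements(tokens)
print(ok, "statements parsed; errors at token positions", errors)
```

Because parsing resumes only after a synchronizing token, the malformed second statement produces a single error report and the third statement is still parsed, which matches the goals of not quitting after the first error and not introducing spurious errors.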