Sections 4.1-4.4: Syntax Analysis

Sections 4.1-4.4: Syntax Analysis Aggelos Kiayias Computer Science & Engineering Department The University of Connecticut 371 Fairfield Road Storrs, CT 06269-1155 aggelos@cse.uconn.edu http://www.cse.uconn.edu/~akiayias

Syntax Analysis - Parsing • An overview of parsing : • Functions & Responsibilities • Context Free Grammars • Concepts & Terminology • Writing and Designing Grammars • Resolving Grammar Problems / Difficulties • Top-Down Parsing • Recursive Descent & Predictive LL • Bottom-Up Parsing • LR & LALR • Concluding Remarks/Looking Ahead

An Overview of Parsing • Why are Grammars to formally describe Languages Important ? • Precise, easy-to-understand representations • Compiler-writing tools can take grammar and generate a compiler • allow language to be evolved (new statements, changes to statements, etc.) Languages are not static, but are constantly upgraded to add new features or fix “old” ones • ADA  ADA9x, C++ Adds: Templates, exceptions, How do grammars relate to parsing process ?

Parsing During Compilation errors token lexical analyzer rest of front end source program parse tree parser get next token symbol table regular expressions interm repres • also technically part or parsing • includes augmenting info on tokens in source, type checking, semantic analysis • uses a grammar to check structure of tokens • produces a parse tree • syntactic errors and recovery • recognize correct syntax • report errors

Parsing Responsibilities • Syntax Error Identification / Handling • Recall typical error types: • Lexical : Misspellings • Syntactic : Omission, wrong order of tokens • Semantic : Incompatible types • Logical : Infinite loop / recursive call • Majority of error processing occurs during syntax analysis • NOTE: Not all errors are identifiable !! Which ones?

Key Issues – Error Processing • Detecting errors • Finding position at which they occur • Clear / accurate presentation • Recover (pass over) to continue and find later errors • Don’t impact compilation of “correct” programs

What are some Typical Errors ? #include<stdio.h> int f1(int v) { int i,j=0; for (i=1;i<5;i++) { j=v+f2(i) } return j; } int f2(int u) { int j; j=u+f1(u*u); return j; } int main() { int i,j=0; for (i=1;i<10;i++) { j=j+i*i printf(“%d\n”,i); } printf("%d\n",f1(j)); return 0; } As reported by MS VC++ 'f2' undefined; assuming extern returning intsyntax error : missing ';' before '}‘syntax error : missing ';' before identifier 'printf' Which are “easy” to recover from? Which are “hard” ?

Error Recovery Strategies Panic Mode– Discard tokens until a “synchro” token is found ( end, “;”, “}”, etc. ) -- Decision of designer -- Problems: skip input miss declaration – causing more errors miss errors in skipped material -- Advantages: simple suited to 1 error per statement Phrase Level – Local correction on input -- “,” ”;” – Delete “,” – insert “;” -- Also decision of designer -- Not suited to all situations -- Used in conjunction with panic mode to allow less input to be skipped

Error Recovery Strategies – (2) Error Productions: -- Augment grammar with rules -- Augment grammar used for parser construction / generation -- example: add a rule for := in C assignment statements Report error but continue compile -- Self correction + diagnostic messages Global Correction: -- Adding / deleting / replacing symbols is chancy – may do many changes ! -- Algorithms available to minimize changes costly - key issues

Motivating Grammars Reg. Lang. CFLs • Regular Expressions •  Basis of lexical analysis •  Represent regular languages • Context Free Grammars •  Basis of parsing •  Represent language constructs •  Characterize context free languages EXAMPLE: anbn , n  1 : Is it regular ?

Context Free Grammars : Concepts & Terminology Definition: A Context Free Grammar, CFG, is described by T, NT, S, PR, where: T: Terminals / tokens of the language NT: Non-terminals to denote sets of strings generated by the grammar & in the language S: Start symbol, SNT, which defines all strings of the language PR: Production rules to indicate how T and NT are combined to generate valid strings of the language. PR: NT  (T | NT)* Like a Regular Expression / DFA / NFA, a Context Free Grammar is a mathematical model

Context Free Grammars : A First Look assign_stmt id := expr ; expr  expr operator term expr  term term id term real term integer operator + operator - What do “blue” symbols represent? Derivation: A sequence of grammar rule applications and substitutions that transform a starting non-term into a sequence of terminals / tokens. Simply stated: Grammars / production rules allow us to “rewrite” and “identify” correct syntax.

Derivation Let’s derive: id := id + real – integer ; using production: • assign_stmt assign_stmt id := expr ; • id := expr ; expr  expr operator term • id := expr operator term; expr  expr operator term • id := expr operator term operator term; expr  term • id := term operator term operator term; term id • id := id operator term operator term; operator + • id := id + term operator term; term real • id := id +real operator term; operator - • id := id +real - term; term integer • id := id +real - integer;

Example Grammar expr  expr op expr expr ( expr ) expr - expr expr id op + op - op * op / op  Black : NT Blue : T expr : S 9 Production rules To simplify / standardize notation, we offer a synopsis of terminology.

Example Grammar - Terminology Terminals: a,b,c,+,-,punc,0,1,…,9, blue strings Non Terminals: A,B,C,S, black strings T or NT: X,Y,Z Strings of Terminals: u,v,…,z in T* Strings of T / NT:  ,  in ( T  NT)* Alternatives of production rules: A 1; A 2; …; A k;  A  1 | 2 | … | 1 First NT on LHS of 1st production rule is designated as start symbol ! E  E A E | ( E ) | -E | id A  + | - | * | / | 

Grammar Concepts A step in a derivation is zero or one action that replaces a NT with the RHS of a production rule. EXAMPLE: E  -E (the  means “derives” in one step) using the production rule: E  -E EXAMPLE: E  E A E  E * E  E * ( E ) DEFINITION:  derives in one step  derives in  one step  derives in  zero steps + * EXAMPLES:  A      if A  is a production rule 1  2 …  n  1  n ;    for all  If    and    then    * * * *

How does this relate to Languages? + Let G be a CFG with start symbol S. Then S W (where W has no non-terminals) represents the language generated by G, denoted L(G). So WL(G)  S W. + W : is a sentence of G When S  (and  may have NTs) it is called a sentential form of G. EXAMPLE: id * id is a sentence Here’s the derivation: E  E A E E * E  id * E  id * id E id * id Sentential forms *

Other Derivation Concepts   lm rm Leftmost: Replace the leftmost non-terminal symbol E  E A E  id A E  id * E  id * id Rightmost: Replace the leftmost non-terminal symbol E  E A E  E A id  E *id  id * id lm lm lm lm rm rm rm rm Important Notes: A   If A    , what’s true about  ? If A    , what’s true about  ? Derivations: Actions to parse input can be represented pictorially in a parse tree.

Examples of LM / RM Derivations E  E A E | ( E ) | -E | id A  + | - | * | / |  A leftmost derivation of : id + id * id A rightmost derivation of : id + id * id

Derivations & Parse Tree E E E E * E E E E A A A A E E E E id * id id * E  E A E  E * E  id * E  id * id

Parse Trees and Derivations E E E E E + E  id+ E E E E E + + * + E E E E id+ E  id+ E * E id id Consider the expression grammar: E  E+E | E*E | (E) | -E | id Leftmost derivations of id + id * id E  E + E

Parse Tree & Derivations - continued id+ E * E  id+id* E id id id id id E E E E E E E E + * * + E E E E id+id* E  id+id* id

Alternative Parse Tree & Derivation E E E * E E + E id id id E  E * E  E + E * E  id + E * E  id + id * E  id + id * id WHAT’S THE ISSUE HERE ? Two distinct leftmost derivations!

Resolving Grammar Problems/Difficulties Reg. Lang. CFLs Regular Expressions : Basis of Lexical Analysis Reg. Expr.  generate/represent regular languages Reg. Languages  smallest, most well defined class of languages Context Free Grammars: Basis of Parsing CFGs  represent context free languages CFLs  contain more powerful languages anbn – CFL that’s not regular! Try to write DFA/NFA for it !

Resolving Problems/Difficulties – (2) a start a b b 0 2 1 3 b Since Reg. Lang.  Context Free Lang., it is possible to go from reg. expr. to CFGs via NFA. Recall: (a | b)*abb

Resolving Problems/Difficulties – (3) a start a b b 1 2 0 3 b b a j i j i Construct CFG as follows: • Each State I has non-terminal Ai : A0, A1, A2, A3 • If then Aia Aj • If then Ai bAj • If I is an accepting state, Ai  : A3   • If I is a starting state, Ai is the start symbol : A0 : A0 aA0, A0 aA1 : A0 bA0,A1 bA2 : A2 bA3 T={a,b}, NT={A0, A1, A2, A3}, S = A0 PR ={ A0 aA0 | aA1 | bA0 ; A1  bA2 ; A2  bA3 ; A3   }

How Does This CFG Derive Strings ? a start a b b 0 2 1 3 b vs. A0 aA0, A0 aA1 A0 bA0, A1 bA2 A2 bA3, A3  How is abaabb derived in each ?

Regular Expressions vs. CFGs • Regular expressions for lexical syntax • CFGs are overkill, lexical rules are quite simple and straightforward • REs – concise / easy to understand • More efficient lexical analyzer can be constructed • RE for lexical analysis and CFGs for parsing promotes modularity, low coupling & high cohesion. CFGs : Match tokens “(“ “)”, begin / end, if-then-else, whiles, proc/func calls, … Intended for structural associations between tokens ! Are tokens in correct order ?

Resolving Grammar Difficulties : Motivation • ambiguity • -moves • cycles • left recursion • left factoring • Humans write / develop grammars • Different parsing approaches have different needs Top-Down vs. Bottom-Up • For: 1  remove “errors” • For: 2  put / redesign grammar Grammar Problems

Resolving Problems: Ambiguous Grammars stmt expr if then stmt else stmt stmt else stmt expr then if E2 S1 S3 E1 S2 Consider the following grammar segment: stmt if exprthen stmt | if exprthen stmtelse stmt | other (any other statement) What’s problem here ? Let’s consider a simple parse tree: Else must match to previous then. Structure indicates parse subtree for expression.

Example : What Happens with this string? If E1then if E2then S1else S2 How is this parsed ? if E1then if E2then S1 else S2 if E1then if E2then S1 else S2 vs. What’s the issue here ?

Parse Trees for Example stmt expr if then stmt stmt else stmt expr then if stmt expr if then stmt else stmt stmt expr then if E1 E1 S2 E2 S2 E2 S1 S1 Form 1: Form 2: What’s the issue here ?

Removing Ambiguity Take Original Grammar: stmt if exprthen stmt | if exprthen stmtelse stmt | other (any other statement) Revise to remove ambiguity: stmt  matched_stmt | unmatched_stmt matched_stmt if exprthen matched_stmt else matched_stmt | other unmatched_stmt if exprthen stmt | if exprthen matched_stmt else unmatched_stmt How does this grammar work ?

Resolving Difficulties : Left Recursion A left recursive grammar has rules that support the derivation : A  A, for some . + Top-Down parsing can’t reconcile this type of grammar, since it could consistently make choice which wouldn’t allow termination. A  A  A  A … etc. A A |  Take left recursive grammar: A  A |  To the following: A A’ A’  A’ | 

Why is Left Recursion a Problem ? Derive : id + id + id E  E + T  Consider: E  E + T | T T  T * F | F F  ( E ) | id How can left recursion be removed ? E  E + T | T What does this generate? E  E + T  T + T E  E + T  E + T + T  T + T + T  How does this build strings ? What does each string have to start with ?

Resolving Difficulties : Left Recursion (2) For our example: E  E + T | T T  T * F | F F  ( E ) | id E  TE’ E’  + TE’ |  T  FT’ T’  * FT’ |  F  ( E ) | id Informal Discussion: Take all productions for A and order as: A  A1 | A2 | … | Am | 1 | 2 | … | n Where no i begins with A. Now apply concepts of previous slide: A  1A’ | 2A’ | … | nA’ A’  1A’ | 2A’| … | m A’ | 

Resolving Difficulties : Left Recursion (3) S  Aa | b A  Ac | Sd |  S  Aa  Sda Problem: If left recursion is two-or-more levels deep, this isn’t enough Algorithm: • Input: Grammar G with ordered Non-Terminals A1, ..., An • Output: An equivalent grammar with no left recursion • Arrange the non-terminals in some order A1=start NT,A2,…An • for i := 1 to n do begin • for j := 1 to i – 1 do begin • replace each production of the form Ai  Aj • by the productions Ai  1 | 2 | … | k • where Aj  1|2|…|k are all current Aj productions; • end • eliminate the immediate left recursion among Ai productions • end

Using the Algorithm Apply the algorithm to: A1  A2a | b|  A2  A2c | A1d i = 1 For A1 there is no left recursion i = 2 for j=1 to 1 do Take productions: A2 A1 and replace with A2  1  | 2  | … | k | where A1 1 | 2 | … | k are A1 productions in our case A2  A1d becomes A2  A2ad | bd | d What’s left: A1 A2a | b |  A2  A2 c | A2ad | bd | d Are we done ?

Using the Algorithm (2) No ! We must still remove A2 left recursion ! A1 A2a | b |  A2  A2 c | A2ad | bd | d Recall: A  A1 | A2 | … | Am | 1 | 2 | … | n A  1A’ | 2A’ | … | nA’ A’  1A’ | 2A’| … | m A’ |  Apply to above case. What do you get ?

Removing Difficulties : Cycles How would cycles be removed ? Make sure every production is addingsome terminal(s) (except a single  -production in the start NT)… e.g. S  SS | ( S ) |  Has a cycle: S  SS  S Transformation: substitute A uBv by the productions A ua1v| ... | uanv assuming that all the productions for B are B a1| ... | an S   Transform to: S  S ( S ) | ( S ) | 

Removing Difficulties : Left Factoring Problem : Uncertain which of 2 rules to choose: stmt if exprthen stmt else stmt | if exprthen stmt When do you know which one is valid ? What’s the general form of stmt ? A  1 | 2  : if exprthen stmt 1: else stmt 2 :  Transform to: A   A’ A’  1 | 2 EXAMPLE: stmt if exprthen stmt rest rest else stmt | 

Resolving Grammar Problems Reg. Lang. CFLs • Note: Not all aspects of a programming language can be represented by context free grammars / languages. • Examples: • 1. Declaring ID before its use • 2. Valid typing within expressions • 3. Parameters in definition vs. in call • These features are call context-sensitive and define yet another language class, CSL. CSLs

Context-Sensitive Languages - Examples • Examples: • L1 = { wcw | w is in (a | b)* } : Declare before use • L2 = { an bm cn dm | n  1, m  1 } • an bm : formal parameter • cn dm : actual parameter What does it mean when L is context-sensitive ?

How do you show a Language is a CFL? L3 = { w c wR | w is in (a | b)* } L4 = { an bm cm dn | n  1, m  1 } L5 = { an bn cm dm | n  1, m  1 } L6 = { an bn | n  1 }

Solutions L3 = { w c wR | w is in (a | b)* } L4 = { an bm cm dn | n  1, m  1 } L5 = { an bn cm dm | n  1, m  1 } L6 = { an bn | n  1 } S  a S a | b S b | c S  a S d | a A d A  b A c | bc S  XY X  a X b | ab Y  c Y d | cd S  a S b | ab

Example of CS Grammar L2 = { an bm cn dm | n  1, m  1 } S  a A c HA  a A c | B B  b B D | b DD c  cDDDH D H dc DH cd

Top-Down Parsing • Identify a leftmost derivation for an input string • Why ? • By always replacing the leftmost non-terminal symbol via a production rule, we are guaranteed of developing a parse tree in a left-to-right fashion that is consistent with scanning the input. • A  aBc  adDc  adec (scan a, scan d, scan e, scan c - accept!) • Recursive-descent parsing concepts • Predictive parsing • Recursive / Brute force technique • non-recursive / table driven • Error recovery • Implementation

Recursive Descent Parsing Concepts S S cad cad c d A c d A a b Problem: backtrack S S cad cad c d A c d A a b a • General category of Parsing Top-Down • Choose production rule based on input symbol • May require backtracking to correct a wrong choice. • Example: S  c A d • A  ab | a input: cad S cad c d A a

Predictive Parsing : Recursive • To eliminate backtracking, what must we do/be sure of for grammar? • no left recursion • apply left factoring • (frequently) when grammar satisfies above conditions:current input symbol in conjunction with current non-terminal uniquely determines the production that needs to be applied. • Utilize transition diagrams: • For each non-terminal of the grammar do following: • 1. Create an initial and final state • 2. If A X1X2…Xn is a production, add path with edges X1, X2, … , Xn • Once transition diagrams have been developed, apply a straightforward technique to algorithmicize transition diagrams with procedure and possible recursion.

Sections 4.1-4.4: Syntax Analysis