Chapter 4: Syntax Analysis, Part 1: Grammar Concepts. Prof. Steven A. Demurjian, Computer Science & Engineering Department, The University of Connecticut, 371 Fairfield Way, Unit 2155, Storrs, CT 06269-3155. steve@engr.uconn.edu http://www.engr.uconn.edu/~steve (860) 486 - 4818. Material for course thanks to: Laurent Michel, Aggelos Kiayias, Robert LeBarre
Syntax Analysis - Parsing • An overview of parsing : Functions & Responsibilities • Context Free Grammars: Concepts & Terminology • Writing and Designing Grammars • Resolving Grammar Problems / Difficulties • Top-Down Parsing • Recursive Descent & Predictive LL • Bottom-Up Parsing • LR & LALR • Key Issues in Error Handling • Concluding Remarks/Looking Ahead
An Overview of Parsing • Why are grammars that formally describe languages important ? • Precise, easy-to-understand representations • Compiler-writing tools can take a grammar and generate a compiler • Allow a language to evolve (new statements, changes to statements, etc.): languages are not static, but are constantly upgraded to add new features or fix “old” ones • Ada → Ada 9X; C++ adds templates, exceptions • How do grammars relate to the parsing process ?
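Parsing During Compilation [Figure: source program → lexical analyzer → parser → rest of front end → intermediate representation. The lexical analyzer hands a token to the parser on each “get next token” request, the parser passes a parse tree to the rest of the front end, all phases share the symbol table, and errors are reported along the way.] • Lexical analyzer: based on regular expressions • Parser: uses a grammar to check the structure of tokens; produces a parse tree; handles syntactic errors and recovery; recognizes correct syntax; reports errors • Rest of front end: also technically part of parsing; includes augmenting info on tokens in source, type checking, semantic analysis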
Parsing Responsibilities • Syntax Error Identification / Handling • Recall typical error types: • Lexical : Misspellings • Syntactic : Omission, wrong order of tokens • Semantic : Incompatible types • Logical : Infinite loop / recursive call • Majority of error processing occurs during syntax analysis • NOTE: Not all errors are identifiable !! Which ones?
Key issues in Error Handling • Detection • Actual detection • Finding the position at which errors occur • Reporting • Clear / accurate presentation • Recovery • How to skip over the error to continue and find later errors • Cannot impact compilation of correct programs
What are some Typical Errors ?
    #include<stdio.h>
    int f1(int v)
    { int i,j=0;
      for (i=1;i<5;i++)
      { j=v+f2(i) }
      return j; }
    int f2(int u)
    { int j;
      j=u+f1(u*u)
      return j; }
    int main()
    { int i,j=0;
      for (i=1;i<10;i++)
      { j=j+i*I;
      printf("%d\n",i);
      printf("%d\n",f1(j));
      return 0;
    }
As reported by MS VC++:
    'f2' undefined; assuming extern returning int
    syntax error : missing ';' before '}'
    syntax error : missing ';' before 'return'
    fatal error : unexpected end of file found
Which are “easy” to recover from? Which are “hard” ?
Error Recovery Strategies • Panic Mode: discard tokens until a “synchro” (synchronizing) token is found ( end, “;”, “}”, etc. ) -- The choice of synchronizing tokens is a decision of the designer -- Problems: skipped input may contain a declaration, causing more errors later; errors in the skipped material are missed -- Advantages: simple; suited to 1 error per statement • Phrase Level: local correction on the input -- Example: “,” where “;” is expected: delete “,” and insert “;” -- Also a decision of the designer -- Not suited to all situations -- Used in conjunction with panic mode to allow less input to be skipped
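Panic mode can be sketched in a few lines. The example below is only an illustration under stated assumptions: the toy statement form `id = id ;`, the function name `parse_statements`, and the synchronizing set `SYNC_TOKENS` are all hypothetical choices, not something taken from the slides.

```python
# Assumed synchronizing set; in a real compiler the designer chooses it.
SYNC_TOKENS = {";", "}"}

def parse_statements(tokens):
    """Check a token list against the toy statement form  id = id ;
    and recover from each error in panic mode."""
    errors = []
    expected = ["id", "=", "id", ";"]
    pos = 0
    while pos < len(tokens):
        ok = True
        for want in expected:
            if pos < len(tokens) and tokens[pos] == want:
                pos += 1
            else:
                ok = False
                break
        if not ok:
            errors.append(f"syntax error at token {pos}")
            # Panic mode: discard tokens until a synchronizing token is found.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1  # consume the synchronizing token and resume
    return errors

# The second statement is malformed; the third is still checked normally.
tokens = ["id", "=", "id", ";", "id", "id", ";", "id", "=", "id", ";"]
print(parse_statements(tokens))
```

Only one error is reported per statement, which matches the "suited to 1 error per statement" remark: anything else wrong between the error and the next `;` is skipped silently.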
Error Recovery Strategies (2) • Error Productions: -- Augment the grammar with rules that describe common errors -- The augmented grammar is used for parser construction / generation -- Example: add a rule for := in C assignment statements; report the error but continue compiling -- Self correction + diagnostic messages -- What are other rules ? (supported by yacc) • Global Correction: -- Adding / deleting / replacing symbols is chancy: it may make many changes ! -- Algorithms are available to minimize changes, but their cost is the key issue • What do you think of each approach? What approach do compilers you’ve used take ?
Motivating Grammars • Regular Expressions • Basis of lexical analysis • Represent regular languages • Context Free Grammars • Basis of parsing • Represent language constructs • Characterize context free languages [Figure: the regular languages drawn inside the context free languages (CFLs)] EXAMPLE: a^n b^n, n ≥ 1 : Is it regular ?
Context Free Languages [Figure: the regular languages drawn inside the context free languages. Regular languages: token structure, scanner generation. Context free languages: sentence structure, parser generation.]
Context Free Grammars : Concepts & Terminology Definition: A Context Free Grammar, CFG, is described by T, NT, S, PR, where: T: Terminals / tokens of the language. NT: Non-terminals, which denote sets of strings that can be generated by the grammar and are in the language. S: Start symbol, S ∈ NT, which defines all strings of the language. PR: Production rules, which indicate how T and NT are combined to generate valid strings of the language. PR: NT → (T | NT)* Like a Regular Expression / DFA / NFA, a Context Free Grammar is a mathematical model !
Context Free Grammars : A First Look
assign_stmt → id := expr ;
expr → term operator term
term → id
term → real
term → integer
operator → +
operator → -
What do “BLUE” symbols represent? What do “BLACK” symbols represent? Derivation: A sequence of grammar rule applications and substitutions that transform a starting non-terminal into a collection of terminals / tokens. Simply stated: Grammars / production rules allow us to “rewrite” and “identify” correct syntax.
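The four components T, NT, S, PR can be written down directly as data. The sketch below encodes the assignment grammar of this slide; the class name `CFG`, its field names, and the `is_well_formed` check are illustrative assumptions, not part of the course material.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    terminals: frozenset      # T
    nonterminals: frozenset   # NT
    start: str                # S, must be a member of NT
    productions: tuple        # PR: pairs (lhs, rhs), rhs is a tuple of symbols

    def is_well_formed(self):
        """Check S in NT, T and NT disjoint, and PR: NT -> (T | NT)*."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.productions
            )
        )

ASSIGN = CFG(
    terminals=frozenset({"id", ":=", ";", "real", "integer", "+", "-"}),
    nonterminals=frozenset({"assign_stmt", "expr", "term", "operator"}),
    start="assign_stmt",
    productions=(
        ("assign_stmt", ("id", ":=", "expr", ";")),
        ("expr", ("term", "operator", "term")),
        ("term", ("id",)),
        ("term", ("real",)),
        ("term", ("integer",)),
        ("operator", ("+",)),
        ("operator", ("-",)),
    ),
)

print(ASSIGN.is_well_formed())
```

Every left-hand side is a single non-terminal, which is exactly what makes the grammar context free.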
How is Grammar Used ? Given the rules on the previous slide, suppose id := real + int; is input. Is it syntactically correct? How do we know? expr is represented as: expr → term operator term. Is this accurate / complete? Alternative: expr → expr operator term and expr → term. How does this affect the derivations which are possible?
Derivation and Parse Tree Let’s derive: id := id + real – integer ; What’s its parse tree ?
Derivation Example • Let’s derive: id := id + real - integer ;
Rule applied, followed by the resulting sentential form:
(start)                          assign_stmt
assign_stmt → id := expr ;       ⇒ id := expr ;
expr → expr operator term        ⇒ id := expr operator term ;
expr → expr operator term        ⇒ id := expr operator term operator term ;
expr → term                      ⇒ id := term operator term operator term ;
term → id                        ⇒ id := id operator term operator term ;
operator → +                     ⇒ id := id + term operator term ;
term → real                      ⇒ id := id + real operator term ;
operator → -                     ⇒ id := id + real - term ;
term → integer                   ⇒ id := id + real - integer ;
Grammar used:
assign_stmt → id := expr ;
expr → expr operator term
expr → term
term → id
term → real
term → integer
operator → +
operator → -
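The derivation can be replayed mechanically. In the sketch below the rule numbering and the helper names `apply_rule` and `derive` are assumptions made for illustration; each step rewrites the leftmost occurrence of the chosen rule's left-hand side.

```python
# The grammar of this slide, with hypothetical rule numbers.
RULES = {
    1: ("assign_stmt", ["id", ":=", "expr", ";"]),
    2: ("expr", ["expr", "operator", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["id"]),
    5: ("term", ["real"]),
    6: ("term", ["integer"]),
    7: ("operator", ["+"]),
    8: ("operator", ["-"]),
}

def apply_rule(form, rule_no):
    """Replace the leftmost occurrence of the rule's LHS in the sentential form."""
    lhs, rhs = RULES[rule_no]
    i = form.index(lhs)  # raises ValueError when the rule does not apply
    return form[:i] + rhs + form[i + 1:]

def derive(start, rule_sequence):
    """Return every sentential form produced by applying the rules in order."""
    forms = [[start]]
    for rule_no in rule_sequence:
        forms.append(apply_rule(forms[-1], rule_no))
    return forms

# Same order of rule applications as in the table on the slide.
for form in derive("assign_stmt", [1, 2, 2, 3, 4, 7, 5, 8, 6]):
    print(" ".join(form))
```

A rule sequence that does not fit the current sentential form fails with `ValueError`, which is a crude stand-in for "this input is not derivable this way".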
Example Grammar
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
Black : NT. Blue : T. expr : S. 9 production rules. To simplify / standardize notation, we offer a synopsis of terminology.
Example Grammar - Terminology Terminals: a, b, c, +, -, punc, 0, 1, …, 9, blue strings. Non Terminals: A, B, C, S, black strings. T or NT: X, Y, Z. Strings of Terminals: u, v, …, z in T*. Strings of T / NT: α, β, γ in (T ∪ NT)*. Alternatives of production rules: A → α1; A → α2; …; A → αk is written A → α1 | α2 | … | αk. First NT on LHS of 1st production rule is designated as start symbol ! Example: E → E A E | ( E ) | -E | id and A → + | - | * | / | ↑
Grammar Concepts A step in a derivation is zero or one action that replaces a NT with the RHS of a production rule. EXAMPLE: E ⇒ -E (the ⇒ means “derives” in one step) using the production rule: E → -E. EXAMPLE: E ⇒ E A E ⇒ E * E ⇒ E * ( E ). DEFINITION: ⇒ derives in one step; ⇒+ derives in one or more steps; ⇒* derives in zero or more steps. EXAMPLES: αAβ ⇒ αγβ if A → γ is a production rule; α1 ⇒ α2 ⇒ … ⇒ αn implies α1 ⇒* αn; α ⇒* α for all α; if α ⇒* β and β ⇒ γ then α ⇒* γ.
How does this relate to Languages? Let G be a CFG with start symbol S. Then S ⇒+ W (where W has no non-terminals) represents the language generated by G, denoted L(G). So W ∈ L(G) ⇔ S ⇒+ W. W is a sentence of G. When S ⇒* α (and α may have NTs), α is called a sentential form of G. EXAMPLE: id * id is a sentence. Here’s the derivation: E ⇒ E A E ⇒ E * E ⇒ id * E ⇒ id * id, so E ⇒* id * id. Every string in this derivation is a sentential form.
Leftmost and Rightmost Derivations. Leftmost: replace the leftmost non-terminal symbol: E ⇒lm E A E ⇒lm id A E ⇒lm id * E ⇒lm id * id. Rightmost: replace the rightmost non-terminal symbol: E ⇒rm E A E ⇒rm E A id ⇒rm E * id ⇒rm id * id. Important Notes, for a production A → δ: If βAγ ⇒lm βδγ, what’s true about β ? If βAγ ⇒rm βδγ, what’s true about γ ? Derivations: Actions to parse input can be represented pictorially in a parse tree.
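The only difference between the two kinds of derivation is which non-terminal gets rewritten at each step. A minimal sketch, assuming the grammar E → E A E | ( E ) | -E | id with A → + | - | * | / (the names `step` and `run` are hypothetical):

```python
PRODUCTIONS = {
    "E": [["E", "A", "E"], ["(", "E", ")"], ["-", "E"], ["id"]],
    "A": [["+"], ["-"], ["*"], ["/"]],
}

def step(form, rhs, leftmost=True):
    """Rewrite the leftmost (or rightmost) non-terminal of `form` with `rhs`."""
    positions = [i for i, sym in enumerate(form) if sym in PRODUCTIONS]
    i = positions[0] if leftmost else positions[-1]
    if rhs not in PRODUCTIONS[form[i]]:
        raise ValueError(f"{form[i]} -> {' '.join(rhs)} is not a production")
    return form[:i] + rhs + form[i + 1:]

def run(choices, leftmost):
    """Apply the chosen right-hand sides in order, starting from E."""
    form = ["E"]
    trace = [" ".join(form)]
    for rhs in choices:
        form = step(form, rhs, leftmost)
        trace.append(" ".join(form))
    return trace

choices = [["E", "A", "E"], ["id"], ["*"], ["id"]]
print(" => ".join(run(choices, leftmost=True)))
print(" => ".join(run(choices, leftmost=False)))
```

For this particular sentence the same list of right-hand sides works in both directions; in general the order of choices differs between a leftmost and a rightmost derivation of the same string.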
Examples of LM / RM Derivations. Grammar: E → E A E | ( E ) | -E | id and A → + | - | * | / | ↑. A leftmost derivation of : id + id * id. A rightmost derivation of : id + id * id.
Derivations & Parse Tree: E ⇒ E A E ⇒ E * E ⇒ id * E ⇒ id * id [Figure: the parse tree after each derivation step, from the single node E to the complete tree for id * id]
Parse Trees and Derivations. Consider the expression grammar: E → E+E | E*E | (E) | -E | id. Leftmost derivation of id + id * id: E ⇒ E + E ⇒ id + E ⇒ id + E * E [Figure: the parse tree after each of these steps]
Parse Tree & Derivations - continued: id + E * E ⇒ id + id * E ⇒ id + id * id [Figure: the parse tree after each of these steps, ending with + at the root and the E * E subtree as its right operand]
Alternative Parse Tree & Derivation: E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id [Figure: parse tree with * at the root and the E + E subtree as its left operand] WHAT’S THE ISSUE HERE ?
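One way to make the issue concrete is to count how many distinct parse trees a sentence has. The sketch below does this for the grammar E → E+E | E*E | (E) | -E | id by trying every split point; the function name `count_parse_trees` is an assumption, and the counting approach is an illustration rather than a technique from the slides.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E+E | E*E | (E) | -E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees that derive tokens[i:j] from E.
        n = 0
        if j - i == 1 and tokens[i] == "id":
            n += 1                                  # E -> id
        if j - i >= 2 and tokens[i] == "-":
            n += count(i + 1, j)                    # E -> - E
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += count(i + 1, j - 1)                # E -> ( E )
        for k in range(i + 1, j - 1):               # k is the operator position
            if tokens[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)  # E -> E + E | E * E
        return n

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))
print(count_parse_trees(["(", "id", "+", "id", ")", "*", "id"]))
```

A count above one means the sentence has more than one parse tree, so the grammar is ambiguous. Adding parentheses to the input removes the choice for that sentence but does not repair the grammar.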
Example Pascal Grammar in Yacc
Start : Program_Header Declarations Block T_PERIOD
Program_Header : T_PROGRAM T_IDENTIFIER T_LPAREN Id_List T_RPAREN T_SEMI
Block : T_BEGIN Stt_List T_END
Declarations : Const_Def_Part Type_Def_Part Var_Decl_Part Routine_Decl_Part
Const_Def_Part : T_CONST Const_Def_List | /* epsilon */ ;
Const_Def_List : Const_Def | Const_Def_List Const_Def ;
Const_Def : T_IDENTIFIER T_EQ Constant T_SEMI
Type_Def_Part : T_TYPE Type_Def_List | /* epsilon */
Type_Def_List : Type_Def | Type_Def_List Type_Def
Example Pascal Grammar in Yacc
Type_Def : T_IDENTIFIER T_EQ Type T_SEMI
Var_Decl_Part : T_VAR Var_Decl_List | /* epsilon */
Var_Decl_List : Var_Decl_List Var_Decl | Var_Decl
Var_Decl : Id_List T_COLON Type T_SEMI
Routine_Decl_Part : Routine_Decl_Part Routine_Decl | /* epsilon */
Routine_Decl : Routine_Head T_IDENTIFIER T_SEMI
  | Routine_Head Declarations Block T_SEMI
Routine_Head : T_PROCEDURE T_IDENTIFIER Params T_SEMI
  | T_FUNCTION T_IDENTIFIER Params Function_Type T_SEMI
Params : T_LPAREN Params_List T_RPAREN | /* epsilon */
Param_Group : Id_List T_COLON Type | T_VAR Id_List T_COLON Type
Example Pascal Grammar in Yacc
Function_Type : T_COLON Type | /* epsilon */
Params_List : Param_Group | Params_List T_SEMI Param_Group
Constant : T_STRING | Number | T_PLUS Number | T_MINUS Number
Number : T_IDENTIFIER | T_INTEGER | T_REAL
Const_List : Constant | Const_List T_COMMA Constant
Type : Simple_Type | T_UPARROW T_IDENTIFIER | Structured_Type
Simple_Type : T_IDENTIFIER | T_LPAREN Id_List T_RPAREN | Constant T_RANGE Constant
Structured_Type : T_ARRAY T_LBRACK Simple_Type_List T_RBRACK T_OF Type
  | T_SET T_OF Simple_Type
  | T_RECORD Field_List T_END
Simple_Type_List : Simple_Type | Simple_Type_List T_COMMA Simple_Type
Example Pascal Grammar in Yacc
Field_List : Fixed_Part /* Variant_part */
Fixed_Part : Field | Fixed_Part T_SEMI Field
Field : /* epsilon */ | Id_List T_COLON Type
Variant_part : /* epsilon */
  | T_CASE T_IDENTIFIER T_OF Variant_List
  | T_CASE T_IDENTIFIER T_COLON T_IDENTIFIER T_OF Variant_List
Variant_List : Variant | Variant_List T_SEMI Variant
Variant : /* epsilon */ | Const_List T_COLON T_LPAREN Field_List T_RPAREN
Stt_List : Statement | Stt_List T_SEMI Statement
Case_Stt_List : Case_Statement | Case_Stt_List T_SEMI Case_Statement
Case_Statement : Const_List T_COLON Statement | /* epsilon */
Example Pascal Grammar in Yacc
Statement : /* epsilon */
  | T_IDENTIFIER
  | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN
  | Variable T_ASSIGN Expression
  | T_BEGIN Stt_List T_END
  | T_CASE Expression T_OF Case_Stt_List T_END
  | T_WHILE Expression T_DO Statement
  | T_REPEAT Stt_List T_UNTIL Expression
  | T_FOR Variable T_ASSIGN Expression T_TO Expression T_DO Statement
  | T_FOR Variable T_ASSIGN Expression T_DOWNTO Expression T_DO Statement
  | T_IF Expression T_THEN Statement
  | T_IF Expression T_THEN Statement T_ELSE Statement
Expression : Expression relop Expression %prec T_LT
  | T_PLUS Expression %prec UNARYSIGN
  | T_MINUS Expression %prec UNARYSIGN
  | Expression addop Expression %prec T_PLUS
  | Expression divop Expression %prec T_MULT
  | T_NIL
  | T_STRING
  | T_INTEGER
  | T_REAL
  | Variable
  | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN
  | T_LPAREN Expression T_RPAREN
  | negop Expression %prec T_NOT
  | T_LBRACK Element_List T_RBRACK
  | T_LBRACK T_RBRACK
Example Pascal Grammar in Yacc
Element_List : Element | Element_List T_COMMA Element
Element : Expression | Expression T_RANGE Expression
Variable : T_IDENTIFIER
  | Variable T_LBRACK Expr_List T_RBRACK
  | Variable T_PERIOD T_IDENTIFIER
  | Variable T_UPARROW
Expr_List : Expression | Expr_List T_COMMA Expression
relop : T_EQ | T_LT | T_GT | T_NE | T_LE | T_GE | T_IN
addop : T_PLUS | T_MINUS | T_OR
divop : T_MULT | T_RDIV | T_DIV | T_MOD | T_AND ;
negop : T_NOT
Resolving Grammar Problems/Difficulties. Regular Expressions : Basis of Lexical Analysis. Reg. Expr. generate/represent regular languages. Reg. Languages: smallest, most well defined class of languages. Context Free Grammars: Basis of Parsing. CFGs represent context free languages. CFLs include languages that regular expressions cannot describe: a^n b^n is a CFL that’s not regular! Try to write a DFA/NFA for it ! [Figure: the regular languages drawn inside the CFLs]
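A grammar for a^n b^n is short, for example S → a S b | a b (this particular grammar is an assumption chosen for illustration, as are the function names). The recursive sketch below follows it directly; the nesting depth of the recursion plays the role of the unbounded counter that a finite automaton does not have.

```python
def match_S(s, i=0):
    """Try S -> a S b | a b starting at index i.
    Return the index just past the match, or None if there is no match."""
    if i < len(s) and s[i] == "a":
        inner = match_S(s, i + 1)                   # S -> a S b
        if inner is not None and inner < len(s) and s[inner] == "b":
            return inner + 1
        if i + 1 < len(s) and s[i + 1] == "b":      # S -> a b
            return i + 2
    return None

def in_language(s):
    """True when s is a^n b^n for some n >= 1."""
    return match_S(s) == len(s)

for candidate in ["ab", "aabb", "aaabbb", "aab", "abb", "ba", ""]:
    print(repr(candidate), in_language(candidate))
```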
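Resolving Problems/Difficulties (2) Since Reg. Lang. ⊆ Context Free Lang., it is possible to go from reg. expr. to CFGs via NFA. Recall: (a | b)*abb [Figure: NFA with states 0 to 3. State 0 is the start state and loops on a and b; 0 goes to 1 on a; 1 goes to 2 on b; 2 goes to 3 on b; state 3 is accepting.]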
Resolving Problems/Difficulties (3) Construct CFG as follows: • Each state i has a non-terminal Ai : A0, A1, A2, A3 • If state i goes to state j on input a, then Ai → a Aj : A0 → aA0, A0 → aA1; likewise on input b : A0 → bA0, A1 → bA2, A2 → bA3 • If state i goes to state j on ε, then Ai → Aj • If i is an accepting state, Ai → ε : A3 → ε • If i is the starting state, Ai is the start symbol : A0. Result: T = {a, b}, NT = {A0, A1, A2, A3}, S = A0, PR = { A0 → aA0 | aA1 | bA0 ; A1 → bA2 ; A2 → bA3 ; A3 → ε }
How Does This CFG Derive Strings ? [Figure: the NFA for (a | b)*abb with states 0 to 3] vs. the grammar A0 → aA0, A0 → aA1, A0 → bA0, A1 → bA2, A2 → bA3, A3 → ε. How is abaabb derived in each ?
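The NFA-to-CFG construction is mechanical enough to sketch in code. The transition-list encoding and the names `nfa_to_cfg` and `generates` are assumptions for illustration; `generates` only handles the right-linear grammars this construction produces and assumes there are no ε-cycles.

```python
def nfa_to_cfg(transitions, start, accepting):
    """transitions: list of (state, symbol, next_state); symbol None is an epsilon move.
    Returns (start_symbol, productions), productions maps Ai to a list of RHS lists."""
    states = {start, *accepting}
    for i, _, j in transitions:
        states.update((i, j))
    productions = {f"A{q}": [] for q in sorted(states)}
    for i, sym, j in transitions:
        # i --a--> j gives Ai -> a Aj; an epsilon move gives Ai -> Aj
        rhs = [f"A{j}"] if sym is None else [sym, f"A{j}"]
        productions[f"A{i}"].append(rhs)
    for q in accepting:
        productions[f"A{q}"].append([])             # accepting state: Ai -> epsilon
    return f"A{start}", productions

def generates(productions, symbol, s):
    """True if `symbol` derives the string s, trying each alternative in turn."""
    for rhs in productions[symbol]:
        if not rhs:
            if s == "":
                return True
        elif len(rhs) == 1:
            if generates(productions, rhs[0], s):
                return True
        elif s and s[0] == rhs[0] and generates(productions, rhs[1], s[1:]):
            return True
    return False

# NFA for (a | b)*abb
NFA = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
start_symbol, grammar = nfa_to_cfg(NFA, start=0, accepting={3})
for lhs, alternatives in grammar.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "epsilon" for rhs in alternatives))
print(generates(grammar, start_symbol, "abaabb"))
```

Each successful path through `generates` mirrors one accepting run of the NFA: choosing A0 → aA0 corresponds to staying in state 0, and choosing A0 → aA1 corresponds to guessing that the final abb has started.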
Regular Expressions vs. CFGs • Regular expressions for lexical syntax • CFGs are overkill, lexical rules are quite simple and straightforward • REs – concise / easy to understand • More efficient lexical analyzer can be constructed • RE for lexical analysis and CFGs for parsing promotes modularity, low coupling & high cohesion. Why? CFGs : Match tokens “(“ “)”, begin / end, if-then-else, whiles, proc/func calls, … Intended for structural associations between tokens ! Are tokens in correct order ?
Resolving Grammar Difficulties : Motivation. Grammar Problems: • ambiguity • ε-moves • cycles • left recursion • left factoring. Why they arise: 1. Humans write / develop grammars. 2. Different parsing approaches have different needs: top-down (LL(k), recursive descent) vs. bottom-up (LR(k), LALR(k)). • For 1: remove “errors” • For 2: put / redesign the grammar
Resolving Problems: Ambiguous Grammars. Consider the following grammar segment: stmt → if expr then stmt | if expr then stmt else stmt | other (any other statement). What’s the problem here ? Consider the program: if e1 then if e2 then s1 else s2. Else must match to previous then. Structure indicates parse subtree for expression.
Resulting Parse Tree • Easy case • Else must match to previous then. if e1 then s1 else if e2 then s2 else s3
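Example : What Happens with this string? if E1 then if E2 then S1 else S2. How is this parsed ? Reading 1: if E1 then [ if E2 then S1 else S2 ], with the else paired with the inner if, vs. Reading 2: if E1 then [ if E2 then S1 ] else S2, with the else paired with the outer if. What’s the issue here ?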
Parse Trees for Example if e1 then if e2 then s1 else s2 Form 1: Form 2: What’s the issue here ?
Removing Ambiguity. Take the original grammar: stmt → if expr then stmt | if expr then stmt else stmt | other (any other statement). Revise to remove ambiguity: stmt → matched_stmt | unmatched_stmt; matched_stmt → if expr then matched_stmt else matched_stmt | other; unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt. How does this grammar work ?
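The revised grammar forces each else to pair with the closest unmatched then. A recursive-descent sketch that greedily attaches an else to the innermost open if produces the same pairing; it is not generated from the grammar, conditions are simplified to single tokens, and the name `parse_stmt` and the tuple-based tree are assumptions for illustration.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at pos; return (tree, next_pos).
    An else is attached to the nearest if that does not have one yet."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        then_part, pos = parse_stmt(tokens, pos + 3)
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_part, else_part), pos
        return ("if", cond, then_part), pos
    return tokens[pos], pos + 1                     # "other": any other statement

tree, _ = parse_stmt("if e1 then if e2 then s1 else s2".split())
print(tree)
```

Because the inner call sees the else first, the outer if is left without one, which corresponds to deriving the statement through unmatched_stmt → if expr then stmt.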
Resolving Difficulties : Left Recursion. A left recursive grammar has rules that support the derivation A ⇒+ Aα, for some α. Top-down parsing can’t reconcile this type of grammar, since it could consistently make a choice which wouldn’t allow termination: with A → Aα | β, A ⇒ Aα ⇒ Aαα ⇒ Aααα … etc. Take the left recursive grammar A → Aα | β and transform it to the following: A → βA' and A' → αA' | ε
Why is Left Recursion a Problem ? Consider: E → E + T | T; T → T * F | F; F → ( E ) | id. Derive : id + id + id, starting with E ⇒ E + T. How can left recursion be removed ? Look at E → E + T | T. What does this generate? E ⇒ E + T ⇒ T + T, and E ⇒ E + T ⇒ E + T + T ⇒ T + T + T. How does this build strings ? What does each string have to start with ?
Resolving Difficulties : Left Recursion (2) For our example: E → E + T | T; T → T * F | F; F → ( E ) | id becomes E → TE'; E' → + TE' | ε; T → FT'; T' → * FT' | ε; F → ( E ) | id. Informal Discussion: Take all productions for A and order as: A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn, where no βi begins with A. Now apply concepts of previous slide: A → β1A' | β2A' | … | βnA' and A' → α1A' | α2A' | … | αmA' | ε
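The rewrite for immediate left recursion can be sketched as a small function. The representation (each alternative is a list of symbols, with `[]` standing for ε) and the name `eliminate_immediate` are assumptions; the sketch also assumes the input has no production A → A.

```python
def eliminate_immediate(nt, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
    A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}                   # nothing to do
    new = nt + "'"
    return {
        nt: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [[]],
    }

grammar = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}
result = {}
for nt, alternatives in grammar.items():
    result.update(eliminate_immediate(nt, alternatives))
for lhs, alternatives in result.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "epsilon" for rhs in alternatives))
```

The generated language is unchanged; only the shape of the derivations changes, with the repetition moved to the right so a top-down parser always consumes a token before recursing.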
Resolving Difficulties : Left Recursion (3) Problem: If left recursion is two-or-more levels deep, this isn’t enough. Example: S → Aa | b; A → Ac | Sd | ε, where S ⇒ Aa ⇒ Sda. Algorithm: • Input: Grammar G with no cycles or ε-productions • Output: An equivalent grammar with no left recursion • Arrange the non-terminals in some order A1, A2, …, An • for i := 1 to n do begin • for j := 1 to i – 1 do begin • replace each production of the form Ai → Aj γ • by the productions Ai → δ1 γ | δ2 γ | … | δk γ • where Aj → δ1 | δ2 | … | δk are all current Aj productions; • end • eliminate the immediate left recursion among Ai productions • end
Using the Algorithm. Apply the algorithm to: A1 → A2a | b; A2 → A2c | A1d | ε. i = 1: For A1 there is no left recursion. i = 2: for j = 1 to 1 do: take productions A2 → A1 γ and replace with A2 → δ1 γ | δ2 γ | … | δk γ, where A1 → δ1 | δ2 | … | δk are the A1 productions. In our case A2 → A1d becomes A2 → A2ad | bd. What’s left: A1 → A2a | b; A2 → A2c | A2ad | bd | ε. Are we done ?
Using the Algorithm (2) No ! We must still remove A2 left recursion ! A1 → A2a | b; A2 → A2c | A2ad | bd | ε. Recall: A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn becomes A → β1A' | β2A' | … | βnA' and A' → α1A' | α2A' | … | αmA' | ε. Apply to the above case. What do you get ?
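The whole algorithm, substitution followed by removal of immediate left recursion, can be sketched as follows and run on the A1 / A2 grammar. The list-of-symbols representation (`[]` for ε) and the function names are assumptions made for illustration. Note that this grammar contains an ε-production although the algorithm's stated input excludes them; it happens to be harmless in this instance.

```python
def eliminate_immediate(nt, alternatives):
    """A -> A a | b  becomes  A -> b A'  and  A' -> a A' | epsilon."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}
    new = nt + "'"
    return {
        nt: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [[]],
    }

def eliminate_left_recursion(grammar, order):
    """Process the non-terminals in the given order A1, A2, ..., An."""
    g = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for alt in g[ai]:
                if alt and alt[0] == aj:
                    # Ai -> Aj gamma becomes Ai -> delta gamma for every Aj -> delta
                    expanded.extend(delta + alt[1:] for delta in g[aj])
                else:
                    expanded.append(alt)
            g[ai] = expanded
        g.update(eliminate_immediate(ai, g[ai]))
    return g

grammar = {
    "A1": [["A2", "a"], ["b"]],
    "A2": [["A2", "c"], ["A1", "d"], []],
}
result = eliminate_left_recursion(grammar, ["A1", "A2"])
for lhs, alternatives in result.items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "epsilon" for rhs in alternatives))
```

The result depends on the chosen order of non-terminals; a different order gives a different but equivalent grammar.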