Compiler Construction Principles & Implementation Techniques

Compiler Construction Principles & Implementation Techniques Dr. Ying JIN Associate Professor Oct. 2007

Role of Parsing in a Compiler Target Code Generation Lexical Analysis scanning Intermediate Code Optimization Syntax Analysis Parsing Intermediate Code Generation Semantic Analysis analysis/front end synthesis/back end

What will be introduced • About Parsing • General information about a Parser (parsing) • Functional requirement (input, output, main function) • Process • General techniques in developing a scanner • How to define the syntax of a programming language? • Why not regular expressions? (not enough) • Context free grammar (上下文无关文法) • How to implement a parser with respect to its definition of syntax? • Top-down Parsing • Bottom-up Parsing

Knowledge Relation Graph Top-down using Syntax definition Context Free Grammar implement basing on Bottom-up Develop a Parser

§ 3 Context Free Grammar & Parsing 3.1 The Parsing Process （语法分析过程） 3.2 Context-free Grammars （上下文无关文法） 3.3 Parse Trees and Abstract Syntax Tree （语法分析树和抽象语法树） 3.4 Ambiguous （二义性） 3.5 Syntax of Sample Language （简单语言的语法）

3.1 The Parsing Process • Function of a Parser • Input : Token / TokenList • Output: internal representation of syntactic structure • Process • Read tokens • Establish syntactic structure – parse tree/syntax tree, according to the syntax definition (context free grammar); • Check syntactical errors

Syntactic Structures • Rules for describing the structure of a well-defined program • (1) Program • (2) Declaration • - Constant declaration • - Type declaration • - Variable Declaration • - Procedure/function declaration • (3) Body • (4) Statements (assignment, conditional, loop, function call) • (5) Expressions (arithmetic, logical, boolean)

Syntax Errors • Different Types of Syntax Errors • Following token for a syntactic structure is wrong (后继单词错) • Identifier/constant error (标识符或者常量错) • Keyword error(关键字错) • Start token for a syntactic structure is wrong(开始单词错) • Unbalanced parentheses(括号配对错)

Syntax Errors 后继单词错 int GetMax( int x ; int y) { if (x>y) {return x } else return y; } } vod main() { real 10, y; 10 + ; GetMax( x, y) ; } 括号配对错关键字错,开始单词错标识符错开始单词错

Dealing with Errors • Once an error is detected, how to deal with it? • Quit immediately, not practical • Error recovery • Error repair • Error correction • There is no “perfect” way to do it!

Different Types of Parsing Methods • Universal parsing methods for any grammars • Cocke-Younger-Kasami algorithm • Earley’s algorithm • Top-down parsing methods (limited grammars)- predictive • Recursive descendent parsing (递归下降法) • LL(k) -- k=1 • Bottom-up parsing methods (limited grammars) – shift-reduce • SLR(k) • LR(k) • LALR(k) • Operator-precedence parsing (简单优先关系法) inefficient k=1

3.2 Context-free Grammars • What is a Grammar? • Chomsky classification of Grammars; • Context Free Grammar (some concepts)

What is a Grammar? • To define syntactic structure? • A grammar G is a quadruple (VT,VN,S,P) • VT is a finite set of terminal symbols(有限的终极符集合) • VN是is a finite set of non-terminal symbols(有限的非终极符集合) • S is start symbol，S VN • P is a set of production rules(产生式的集合)， each production rule has following form：   ，where ， (VTVN )*

文法的分类 • O型文法: 也称为短语文法，其产生式具有形式: →，其中,(VTVN)*，并且至少含一个非终极符; • 1型文法: 也称为上下文有关文法。它是0型文法的特例，即要求||  || (S→除外，但S不得出现于产生式右部) • 2型文法: 也称为上下文无关文法。它是1型文法的特例，即要求产生式左部是一个非终极符: A→ • 3型文法: 也称为正则文法。它是2型文法的特例，即其产生式的右部至多有两个符号，而且具有下面形式之一: A →a ，A →a B, 其中A,BVN ，aVT  

文法的分类 • 描述能力 O型文法 > 1型文法> 2型文法> 3型文法 • 对应自动机 • O型文法:图灵机 • 1型文法:线性有界自动机 • 2型文法:下推自动机 • 3型文法:有限自动机

Context Free Grammar (CFG) • 定义为四元组(VT,VN,S,P) • VT是有限的终极符集合 • VN是有限的非终极符集合 • S是开始符，S VN • P是产生式的集合，且具有下面的形式： AX1X2…Xn 其中AVN，Xi(VTVN) ，右部可空。

Example VT = {i, n, +, *, (, )} P: E  T E  E + T T  F T  T * F F  (E) F  i F  n VN = {E, T, F } S = E

Derivation (直接推导)： if there is a production A, we can have A , where  represents one step derivation (用 A →一步推导). We can say that  is derived from A; • 的含义是，使用一条规则，代替左边的某个符号，产生右端的符号串

 + : represents one or more steps derivation (通过一步或多步可推导出 ) •  *  : 表示 通过0步或多步可推导出

Example P: (1) E  T (2) E  E + T (3) T  F (4) T  T * F (5) F  (E) (6) F  i (7) F  n • (E)  (E+T) • E + (i+n) • E+E+T * E+i+F

G = (VT,VN, S, P) • 句型：if S * ，则称符号串为G的句型。我们用SF(G)表示文法G的所有句型的集合; • Sentence(句子)：只包含终极符的句型被称为G的句子 • Language(语言)： L( G ) = { u | S + u , u  VT* } the set of all sentences of G;

Example P: (1) E  T (2) E  E + T (3) T  F (4) T  T * F (5) F  (E) (6) F  i (7) F  n • 句型: E, T, E+T, F, T*F, i, n, (E), ……. • 句子: i, n, (i), (n), i+i, i+n, …… • 语言:{i, n, (i), (n), i+i, i+n , ……}

Leftmost(rightmost) derivation 最左（右）推导： 如果进行推导时选择的是句型中的最左(右）非终极符，则称这种推导为最左(右)推导，并用符号lm（r m）表示最左（右）推导。 • 左（右）句型：用最左推导方式导出的句型，称为左句型，而用最右推导方式导出的句型，称为右句型(规范句型)。 • conclusion： each sentence has its rightmost or leftmost derivation（但对句型此结论不成立）

Example P: (1) E  T (2) E  E + T (3) T  F (4) T  T * F (5) F  (E) (6) F  i (7) F  n • 最左推导: i+T*n +F lm i + F*n + F • 最右推导: i+T*n +Flm i + T*n +i • 左句型: i + i*F • 右句型: E+(i) • 特例: i + (T*i)

Some Notes • CFG (VT,VN, S, P) will be used to define syntactic structure of a programming language; • Normally VT will be set of tokens of the programming language; • one terminal symbol might be one token, one token type, or one symbol representing certain structure; • Non-terminal symbols act as intermediate representation of certain structure; • Productions are rules on how to derive syntactic structure;

Example(1) • Arithmetic Expressions P: Exp  Exp + Exp Exp  Exp – Exp Exp  Exp * Exp Exp  Exp / Exp Exp  (Exp) Exp  id Exp  num VT = {id, num, (, ), +, -, *, /} VN = {Exp} S = Exp

Example(2) • Program VT = {VarDec, TypeDec, ConstDec, MainFun, FunDec } VN = {Program, Dec, Decs} S = Program P: Program  Decs MainFun Decs Dec  Dec  VarDec Dec  ConstDec Dec  FunDec Dec TypeDec Decs  Dec Decs  Dec Decs

Example(3) • Variable Declaration VT = {Type, id, ; , , } VN = {VarDec, VarDef, VarDefList, idList} S = VarDec P: VarDec   VarDec  VarDef ; VarDec  VarDef; VarDec P: 变量声明  变量声明  一个变量定义; 变量声明  一个变量定义 ; 变量声明变量定义 类型标识符序列标识符序列 一个标识符标识符序列 一个标识符 , 标识符序列 VarDef  Type idList idList id idList  id, idList

Extended BNF (扩展巴克斯范式) • Extended Notations • Optional | A   |  |  A   A   A   • Repetition * or {} A  A |  (left recursive) A    * A   A |  (right recursive) A   * 

Some algorithms on Grammar Transformation • 文法等价变化: L(G1) = L(G2) • 增补文法: the start symbol does not appear in the right part of any productions; (定理2.1) • Z  S

Some algorithms on Grammar Transformation • 文法G中除了S以外的空产生式(A)可以消除; (定理2.2) • 找出所有可以推导出的非终极符,记为S; • 对于所有产生式 • A  X1 X2 …… Xi-1XiXi+1……Xn • Xi S • 增加A  X1 X2 …… Xi-1Xi+1……Xn • 直到没有新的产生式产生 • 删除对应空产生式, 除了S

Example SA a BB A BB Sa BB S | A a BB A  B B | a B   |b S a BB A B S a B S A a B SA a B SA a {S, B, A} Sa B S aB S a S a S Aa

Some algorithms on Grammar Transformation • 消除无用产生式(定理2.3) • 无用产生式: A , A不出现在任何句型中; • How to find such non-terminal symbols? • 生成会出现在句型中的非终极符集合SS • SS = {S} •  Si SS, • Si 1 ,……, Si n • 把1 ,……, n中的所有非终极符加入SS中; • 重复直到没有新的加入; • 不属于SS的非终极符,它们的产生式是无用产生式,删除掉;

Example S | A a BB A  B B | a B   |b C  c {S} {S, A, B} {S, A, B} C  c是无用产生式

Some algorithms on Grammar Transformation • 消除特型产生式: (定理2.4) • 特型产生式: A  B • Algorithm • 对每个非终极符A,构造 SA = {B| A+B, BVN} • 如果C  SA,而且C  不是特型产生式,则增加 A ; • 删除特型产生式; • 删除无用产生式;

Example S: {A,B} S  a S  b • S | A • A  B | a • B  b A: {B} A  b • S | a | b • A  a | b • B  b

Some algorithms on Grammar Transformation • 消除公共前缀(left factoring) • 公共前缀 • A  1 | … | n| 1|…| m • 提取公因子 • A  A’| 1|…| m • A’  1 | … | n

Some algorithms on Grammar Transformation • 消除左递归(left recursion) • 直接左递归:A  A1 | … | A n|  1|…| m • 消除方法: A  1A’|…| mA’ A  1A’ | … | nA’|

Example P: (1) E  T (2) E  E + T (3) T  F (4) T  T * F (5) F  (E) (6) F  i (7) F  n E  T E’ E’  +T E’ |  E  E + T | T  = + T 1 = T

Some algorithms on Grammar Transformation • 消除左递归(left recursion) • 间接左递归: • 消除方法: • Pre-conditions • Algorithm S  A b A  S a | b 1:S 2:A A  Aba | b A  bA’ A’  baA’ | 

3.3 Parse Trees & Abstract Syntax Tree • Problem: Derivation(推导) is a way to construct a sentence from the start symbol; • Many derivation for the same sentence; • Not uniquely represent the structure of the sentence • Parse tree: one way to represent the structure of a sentence;

Example sentence : i + i * n P: (1) E  T (2) E  E + T (3) T  F (4) T  T * F (5) F  (E) (6) F  i (7) F  n Several derivations for it Parse Tree

Parse Tree • A labeled tree for a CFG • The root must be labeled with the start symbol; • Each node has a symbol associated with it; • Each leaf must be labeled with a terminal symbol ; • For each node which is associated with a non-terminal symbol A, has n sons, from left to right they are associated with symbols B1, …, Bn, then there must be a production A  B1 … Bn

Abstract Syntax Tree • Problem with Parse tree • Includes much more than necessary nodes • Abstract Syntax Tree • contains only those nodes necessary for compilation sentence : i + i * n

3.4 Ambiguous Grammar • 二义性文法 • For a Grammar G, if there exists one sentence which has more than one parse tree, G is called ambiguous Grammar;

Example P: Exp  Exp + Exp Exp  Exp – Exp Exp  Exp * Exp Exp  Exp / Exp Exp  (Exp) Exp  id Exp  num id + id + id

Compiler Construction Principles & Implementation Techniques