Understanding Parsing Techniques and Context-Free Grammars

Parsing • Parsing is the process of determining the syntactic structure of a program • Parsing is frequently called “syntactic analysis” • The syntax of a programming language is frequently expressed as a collection of grammar rules • Almost always, the grammars are context-free (type 2) grammars

Parsing… • Grammar rules almost always are recursive • This means that rules may make use of rules that make use of themselves • E.g., blocks may be nested within blocks…

Parsing… • The result of parsing a program usually is a parse tree – also called a syntax tree • It shows the precise structure of the original program, but in tree form instead of in lexicographic form

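Where we are • [Figure: the lexical analyzer passes tokens to the parser. For the input Total := price + tax, the parser builds an assignment node over Total, := and Expr, where Expr is id + id (price, tax).]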

Parsing Techniques • While there’s fundamentally only one way to look at scanning… • There are two fundamentally different ways to look at parsing: • Top-down • Bottom-up • We’ll examine both techniques • Both have advantages and disadvantages…

The Parsing Process • A parser is a function that takes its input as a series of tokens (output from the scanner) and it produces a parse tree as its output • parseTree = parse( );

The Parsing Process • The reason for creating a parse tree is that it enables us to “walk the tree” (i.e., to traverse the tree) more than once, extracting information with each pass so that eventually code may be generated

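The role of parser • [Figure: compiler structure. Source code passes through the lexical analyzer, syntax analyzer, and semantic analyzer (analysis), then through the improve-code and generate-code stages (synthesis) to produce object code. The symbol table and the error handler sit alongside these phases.]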

Error Handling • In the case of a scanner, when an illegal input is detected the usual action is: • Issue an error message • Consume the illegal input and continue to scan • Possibly return a “guess” as to what the offending input should have been to better enable the scanner to continue

Error Handling… • A parser must not only issue an error message indicating that illegal syntax has been encountered, but it must recover in a manner so as to best ensure that the remainder of the input will be parsed correctly • This is extremely difficult to do in general

Error Handling… • So a parser, upon encountering illegal syntax, should attempt to repair the error as best it can • This is called “error repair” • It’s so difficult to do in general, that usually only the simplest kinds of repairs are attempted • For example: inserting a missing semicolon or balancing parens…

Context-Free Grammars • Context-free grammars (also known as Chomsky type 2 grammars) consist of a collection of syntax rules (called rewriting rules or productions), a starting symbol, and a vocabulary • The “context-free” description means that there are some restrictions on the kinds of rules one may use

Context-Free Grammars… • CF Grammars always have a single symbol on the LHS of every production • No CF Grammar may make use of lambda (epsilon) on the LHS as a symbol • CF Grammars may be (almost always are) recursive

Context-Free Grammars… • Although the text uses slightly different notation for RE rules and CFG rules… We’ll not bother to do so • I’ll leave it up to you to “understand from context” what’s going on…

Context free grammars and languages • Many languages are not regular. Thus we need to consider larger classes of languages; • Context Free Language (CFL) played a central role in natural languages since 1950’s (Chomsky) and in compilers since 1960’s (Backus); • Context Free Grammar (CFG) is the basis of BNF syntax; • CFG is increasingly important for XML and DTD (XML Schema).

Informal Example of CFG • Palindrome: madamimadam • Consider L_pal = { w | w is a palindrome on symbols 0 and 1 } • Example: 1001, 11 are palindromes • How to represent a palindrome of any length? • Basis: ε, 0 and 1 are palindromes; • P → ε • P → 0 • P → 1 • Induction: if w is a palindrome, so are 0w0 and 1w1. Nothing else is a palindrome. • P → 0P0 • P → 1P1

The informal example (cont.) • CFG is a formal mechanism for definitions such as the one for L_pal • 1. P → ε • 2. P → 0 • 3. P → 1 • 4. P → 0P0 • 5. P → 1P1 • 0 and 1 are terminals • P is a variable (or nonterminal, or syntactic category) • P is also the start symbol. • 1-5 are productions (or rules)

Formal definition of CFG • A CFG is a 4-tuple G = (Σ, N, P, S) where • Σ is a finite set of terminals; • N is a finite set of variables (non-terminals); • P is a finite set of productions of the form A → α, where A is a nonterminal and α is a string of symbols from Σ and N; • A is called the head of the production; • α is called the body of the production; • S is a designated non-terminal called the start symbol. • Example: • G_pal = ({0,1}, {P}, A, P), where A = {P → ε, P → 0, P → 1, P → 0P0, P → 1P1} • Sometimes we group productions with the same head. • e.g., A = {P → ε | 0 | 1 | 0P0 | 1P1}.
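The 4-tuple for G_pal can be written down directly as data. The following is a minimal Python sketch, not part of the original slides: the names TERMINALS, NONTERMINALS, PRODUCTIONS, START and generate are illustrative assumptions, and the empty string stands for ε. It enumerates every terminal string of bounded length that the grammar derives.

```python
from collections import deque

# G_pal = (Sigma, N, P, S); all Python names here are illustrative only.
TERMINALS = {"0", "1"}
NONTERMINALS = {"P"}
PRODUCTIONS = {"P": ["", "0", "1", "0P0", "1P1"]}  # "" stands for epsilon
START = "P"


def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from START."""
    results = set()
    queue = deque([START])
    seen = {START}
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal in the sentential form, if any.
        pos = next((i for i, c in enumerate(form) if c in NONTERMINALS), None)
        if pos is None:
            results.add(form)
            continue
        for body in PRODUCTIONS[form[pos]]:
            new_form = form[:pos] + body + form[pos + 1:]
            # Terminals never disappear, so forms that are already too long are pruned.
            n_terminals = sum(c in TERMINALS for c in new_form)
            if n_terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=lambda s: (len(s), s))
```

Calling generate(3) lists the palindromes over 0 and 1 of length at most 3, which is one way to check the grammar against the informal definition.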

Another CFG example: arithmetic expression • G = ({+, *, (, ), a, b, 0, 1}, {E, R, S}, P, E) • E → R • E → E+E • E → E*E • E → (E) • R → aS • R → bS • S → ε • S → aS • S → bS • S → 0S • S → 1S

Two level descriptions • Context-free syntax of arithmetic expressions • E → R • E → E+E • E → E*E • E → (E) • Lexical syntax of arithmetic expressions • R → aS • R → bS • S → ε • S → aS • S → bS • S → 0S • S → 1S • Equivalently: R → (a|b)(a|b|0|1)* • Why two levels • We think that way (sentence, word, character). • With this understanding, we have BNF and EBNF.
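The lexical level of this example can be handled with an ordinary regular expression. The sketch below is an illustration only, using Python's standard re module. The function name tokenize and the token kinds "id" and "op" are assumptions, not names from the slides.

```python
import re

# Lexical level: R -> (a|b)(a|b|0|1)* from the slide, plus the operator symbols.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[ab][ab01]*)|(?P<op>[+*()]))")


def tokenize(text):
    """Split text into (kind, lexeme) pairs; raise ValueError on illegal input."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"illegal input near position {pos}")
        kind = m.lastgroup  # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens
```

The parser for the context-free level would then work on the (kind, lexeme) pairs rather than on individual characters, which is the modularization the slide describes.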

Why use regular expression • [Figure: the Chomsky hierarchy. Level 0 (unrestricted grammars), level 1 (context-sensitive grammars), level 2 (context-free grammars), level 3 (regular grammars, equivalent to regular expressions).] • Every regular set is a context-free language. • Since we can use a CFG to describe a regular language, why use regular expressions to define the lexical syntax of a language? Why not use a CFG to describe everything? • Lexical rules are simpler; • Regular expressions are concise and easier to understand; • More efficient lexical analyzers can be constructed from regular expressions; • It is good to modularize the compiler into two separate components.

BNF and EBNF • BNF and EBNF are commonly accepted ways to express the productions of a context-free grammar. • BNF • Introduced by John Backus, first used to describe Algol 60. Backus won the Turing Award in 1977. • BNF originally stood for "Backus Normal Form". However, it was pointed out that this is not literally a "normal" form, and the acronym has come to stand for "Backus Naur Form". • EBNF stands for Extended BNF. • BNF format • lhs ::= rhs • Quote terminals or non-terminals • <> to delimit non-terminals, • bold or quotes for terminals, or ‘as is’ • vertical bar for alternatives • There are many different notations; here is an example: • opt-stats ::= stats-list | EMPTY . • stats-list ::= statement | statement ‘;’ stats-list .

EBNF • An extension of BNF that uses regular-expression-like constructs on the right-hand side of the rules: • Write [A] to say that A is optional • Write {A} or A* to denote 0 or more repetitions of A (i.e., the Kleene closure of A). • Using EBNF has the advantage of simplifying the grammar by reducing the need to use recursive rules. • Both BNF and EBNF are equivalent to CFG • Example:

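Languages, Grammars, and automata • [Figure: the hierarchy of languages, grammars, and automata, with one direction labeled “closer to machine” and the other “more expressive”.]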

Another arithmetic expression example • (P1) exp → exp + digit • (P2) exp → exp - digit • (P3) exp → digit • (P4) digit → 0|1|2|3|4|5|6|7|8|9 • The “|” means OR • So the rules can be simplified as: • (P1-3) exp → exp + digit | exp - digit | digit • (P4) digit → 0|1|2|3|4|5|6|7|8|9

Derivation • One-step derivation (the ⇒ relation) • If A → γ is a production and αAβ is a string of terminals and variables, then αAβ ⇒ αγβ • Example: 9 - digit + digit ⇒ 9 - digit + 2, using rule P4: digit → 2 • Zero or more steps of derivation (⇒*) • Basis: α ⇒* α • Induction: if α ⇒* β and β ⇒ γ, then α ⇒* γ. • Example: • 9 - digit + digit ⇒* 9 - digit + digit • 9 - digit + digit ⇒* 9 - 5 + 2 • 9 - digit + digit ⇒ 9 - 5 + digit ⇒ 9 - 5 + 2 • One or more steps of derivation (⇒+) • Example: 9 - digit + digit ⇒+ 9 - 5 + 2

Example of derivation • We can derive the string 9 - 5 + 2 as follows: • exp ⇒ exp + digit (P1: exp → exp + digit) • ⇒ exp - digit + digit (P2: exp → exp - digit) • ⇒ digit - digit + digit (P3: exp → digit) • ⇒ 9 - digit + digit (P4: digit → 9) • ⇒ 9 - 5 + digit (P4: digit → 5) • ⇒ 9 - 5 + 2 (P4: digit → 2) • Hence exp ⇒+ 9 - 5 + 2 and exp ⇒* 9 - 5 + 2
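Each derivation step is a mechanical rewrite, so the six steps for 9 - 5 + 2 can be replayed in a few lines. This is a hedged Python sketch: derive_step, steps and trace are hypothetical names, and sentential forms are modelled as lists of symbols.

```python
# Productions P1-P4 from the slides; each body is a list of grammar symbols.
PRODUCTIONS = {
    "exp": [["exp", "+", "digit"], ["exp", "-", "digit"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}


def derive_step(form, index, body):
    """Replace the nonterminal at form[index] by body (one step of =>)."""
    head = form[index]
    if body not in PRODUCTIONS.get(head, []):
        raise ValueError(f"no production {head} -> {' '.join(body)}")
    return form[:index] + body + form[index + 1:]


# The six steps of the derivation: (position to rewrite, production body).
steps = [
    (0, ["exp", "+", "digit"]),  # P1
    (0, ["exp", "-", "digit"]),  # P2
    (0, ["digit"]),              # P3
    (0, ["9"]),                  # P4
    (2, ["5"]),                  # P4
    (4, ["2"]),                  # P4
]

form = ["exp"]
trace = [" ".join(form)]
for index, body in steps:
    form = derive_step(form, index, body)
    trace.append(" ".join(form))
```

After the loop, trace holds the seven sentential forms from exp to 9 - 5 + 2 in order. Passing a body that is not a production of the chosen nonterminal raises ValueError, which mirrors the rule that only productions may be applied.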

Derivations and Languages… • Note that the choice of which nonterminal symbol to replace is entirely up to you • The order of replacement is entirely up to you • The resulting (final) string (or sentence) is one of the legal strings in the language • A language is the set of all strings that can be derived in this way

Example • Here’s a simple CFG for a language… E = ( E ) | a • The language that it produces is: L(E) = { a, (a), ((a)), (((a))), … } • I.e., L(E) = { (^n a )^n }, where n = 0, 1, 2,…
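A recognizer for this language follows the grammar rule almost word for word. The sketch below is illustrative only. The function names parse_E and in_language are assumptions, and the input is taken to be a plain string of the characters (, ) and a.

```python
def parse_E(s, pos=0):
    """Match E = ( E ) | a starting at s[pos]; return the end position or -1."""
    if pos < len(s) and s[pos] == "a":
        return pos + 1
    if pos < len(s) and s[pos] == "(":
        end = parse_E(s, pos + 1)  # the recursive use of E inside the parentheses
        if end != -1 and end < len(s) and s[end] == ")":
            return end + 1
    return -1


def in_language(s):
    """True when the whole string s is derived from E."""
    return parse_E(s) == len(s)
```

The recursion in parse_E mirrors the recursion in the rule, which is why the function can check that the parentheses are balanced to any depth.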

Another Example • Consider the CFG: E = ( E ) • What is the language that it produces? • The rule never terminates… • So, L(E) = { } = EMPTY!
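Whether a grammar's language is empty can be decided mechanically by finding the nonterminals that derive at least one terminal string. The following Python sketch shows that fixed-point computation under assumed names (generating_nonterminals, language_is_empty), with productions given as lists of symbols.

```python
def generating_nonterminals(productions, nonterminals):
    """Return the nonterminals that can derive at least one terminal string."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            for body in bodies:
                # A body qualifies if every symbol is a terminal or already generating.
                if all(sym not in nonterminals or sym in generating for sym in body):
                    generating.add(head)
                    changed = True
                    break
    return generating


def language_is_empty(productions, nonterminals, start):
    """The language is empty exactly when the start symbol is not generating."""
    return start not in generating_nonterminals(productions, nonterminals)
```

For E = ( E ) the only body contains E itself, so E never enters the generating set and the language is empty. Adding the alternative a, as in the previous example, makes E generating.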

Left most and right most derivations • Rightmost derivation: • exp ⇒rm exp + digit • ⇒rm exp + 2 • ⇒rm exp - digit + 2 • ⇒rm exp - 5 + 2 • ⇒rm digit - 5 + 2 • ⇒rm 9 - 5 + 2 • exp ⇒+rm 9 - 5 + 2 • exp ⇒*rm 9 - 5 + 2 • Leftmost derivation: • exp ⇒lm exp + digit • ⇒lm exp - digit + digit • ⇒lm digit - digit + digit • ⇒lm 9 - digit + digit • ⇒lm 9 - 5 + digit • ⇒lm 9 - 5 + 2 • exp ⇒+lm 9 - 5 + 2 • exp ⇒*lm 9 - 5 + 2

Left and Right Recursion • If a rule has its LHS symbol (its goal symbol) as the first symbol in the RHS, it is said to be left recursive exp = exp + number • If a rule has its goal symbol as the last symbol in the RHS, it is said to be right recursive exp = number + exp

Left and Right Recursion… • Say we want to generate the language a* • There are many ways to do this… • Among these ways are: A = A a | ε and A = a A | ε • How about generating a+ ? • How about generating a^(2n), where n = 0, 1, 2,…

Parse tree • This derivation could also be represented via a parse tree. • (P1) exp → exp + digit • (P2) exp → exp - digit • (P3) exp → digit • (P4) digit → 0|1|2|3|4|5|6|7|8|9 • [Figure: parse tree for 9 - 5 + 2, written with children in parentheses: exp( exp( exp( digit(9) ) - digit(5) ) + digit(2) )]

Tree terminologies • Trees are collections of nodes, with a parent-child relationship. • A node has at most one parent, drawn above the node; • A node has zero or more children, drawn below the node. • There is one node, the root, that has no parent. This node appears at the top of the tree • Nodes with no children are called leaves. • Nodes that are not leaves are interior nodes.

Formal definition of parse tree • A parse tree shows the derivation of a string using a grammar. • Properties of a parse tree: • The root is labeled by the start symbol; • Each leaf is labeled by a terminal or ε; • Each interior node is labeled by a nonterminal; • If A is the label of an interior node and X1, …, Xn are the labels of its children, then A → X1 … Xn is a production. • Yield of a parse tree • Look at the leaves of a parse tree, from left to right, and concatenate them • Example: the yield of the tree exp( exp( exp( digit(9) ) - digit(5) ) + digit(2) ) is 9-5+2
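The production property and the yield can both be checked on a small tree data structure. This is a minimal Python sketch under stated assumptions: a node is a (label, children) pair, and the helper names node, tree_yield and is_parse_tree are illustrative. The check covers only the production property, not the root and leaf conditions.

```python
# Productions P1-P4 from the slides, stored as (head, body) pairs.
PRODUCTIONS = {
    ("exp", ("exp", "+", "digit")),
    ("exp", ("exp", "-", "digit")),
    ("exp", ("digit",)),
} | {("digit", (d,)) for d in "0123456789"}


def node(label, *children):
    """A tree node is a (label, children) pair; leaves have no children."""
    return (label, list(children))


def tree_yield(tree):
    """Collect the leaf labels from left to right."""
    label, children = tree
    if not children:
        return [label]
    leaves = []
    for child in children:
        leaves.extend(tree_yield(child))
    return leaves


def is_parse_tree(tree):
    """Check that every interior node and its children form a production."""
    label, children = tree
    if not children:
        return True
    body = tuple(child[0] for child in children)
    return (label, body) in PRODUCTIONS and all(is_parse_tree(c) for c in children)


# The parse tree for 9 - 5 + 2.
tree = node("exp",
            node("exp",
                 node("exp", node("digit", node("9"))),
                 node("-"),
                 node("digit", node("5"))),
            node("+"),
            node("digit", node("2")))
```

Joining tree_yield(tree) with no separator gives back the string 9-5+2, and is_parse_tree(tree) confirms that each interior node corresponds to one of P1-P4.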

The language of a grammar • If G is a grammar, the language of the grammar, denoted as L(G), is the set of terminal strings that have derivations from the start symbol. • If a language L is the language of some context free grammar, then L is said to be a context free language. • Example: the set of palindromes is a context free language.

Derivation and the parse tree • For a nonterminal A and a terminal string w, the following are equivalent: • A ⇒* w; • A ⇒*lm w; • A ⇒*rm w; • There is a parse tree with root A and yield w.

Applications of CFG • Parser and parser generator; • Markup languages.

Ambiguity of Grammar • What is ambiguous grammar; • How to remove ambiguity;

Example of ambiguous grammar • Ambiguous sentence: • Fruit flies like a banana • Consider a slightly modified grammar for expressions: expr → expr + expr | expr * expr | digit • Derivations of 9+5*2: • expr ⇒ expr + expr ⇒ expr + expr * expr ⇒+ 9+5*2 • expr ⇒ expr * expr ⇒ expr + expr * expr ⇒+ 9+5*2 • There are different derivations

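Ambiguity of a grammar • Ambiguous grammar: produces more than one parse tree for some sentence • expr → expr + expr | expr * expr | digit • [Figure: two parse trees for 9 + 5 * 2. The first, expr( expr(9) + expr( expr(5) * expr(2) ) ), groups the sentence as 9 + (5 * 2). The second, expr( expr( expr(9) + expr(5) ) * expr(2) ), groups it as (9 + 5) * 2.] • Problems of ambiguous grammar • One sentence has different interpretations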

Using the expr grammar, 3+4 has many derivations, for example: • E ⇒ E+E ⇒ D+E ⇒ 3+E ⇒ 3+D ⇒ 3+4 • E ⇒ E+E ⇒ E+D ⇒ E+4 ⇒ D+4 ⇒ 3+4 • From the existence of two derivations alone, we cannot deduce that the grammar is ambiguous • It is not the multiplicity of derivations that causes ambiguity; it is the existence of more than one parse tree • In this example, the two derivations produce the same tree: E( E( D(3) ) + E( D(4) ) )

Derivation and parse tree • In an unambiguous grammar, the leftmost derivation of a string is unique, and so is the rightmost derivation • How about an ambiguous grammar? • First leftmost derivation: E ⇒lm E+E ⇒lm D+E ⇒lm 9+E ⇒lm 9+E*E ⇒lm 9+D*E ⇒lm 9+5*E ⇒lm 9+5*D ⇒lm 9+5*2 • Second leftmost derivation: E ⇒lm E*E ⇒lm E+E*E ⇒lm D+E*E ⇒lm 9+E*E ⇒lm 9+D*E ⇒lm 9+5*E ⇒lm 9+5*D ⇒lm 9+5*2 • A string has two parse trees iff it has two distinct leftmost derivations (or rightmost derivations).
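For this particular grammar the number of parse trees of a string can be counted directly, which makes the ambiguity concrete. The Python sketch below is an illustration, not a general ambiguity test. The function name count_parse_trees is an assumption, tokens are single characters, and every operand is a single digit.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E+E | E*E | D, D a single digit."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving tokens[i:j] from E."""
        if j - i == 1:
            return 1 if tokens[i].isdigit() else 0
        total = 0
        # Try each operator position k as the root production E -> E op E.
        for k in range(i + 1, j - 1):
            if tokens[k] in "+*":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

count_parse_trees("9+5*2") finds one tree per choice of root operator, matching the two leftmost derivations above, while a string with a single operator such as 3+4 has exactly one tree.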

Remove ambiguity • Some theoretical results (bad news) • Is there an algorithm to remove the ambiguity in a CFG? • The answer is no • Is there an algorithm to tell us whether a CFG is ambiguous? • The answer is also no. • There are CFLs that have nothing but ambiguous CFGs. • That kind of language is called an inherently ambiguous language; • If a language has at least one unambiguous grammar, then it is called an unambiguous language. • In practice, there are well-known techniques to remove ambiguity • Two causes of the ambiguity in the expr grammar • The precedence of the operators is not respected. “*” should be grouped before “+”; • A sequence of identical operators can be grouped either from the left or from the right. 3+4+5 can be grouped either as (3+4)+5 or 3+(4+5).

Remove ambiguity • Enforce precedence by introducing several different variables, each representing the expressions that share a level of binding strength • factor: a digit is a factor • term: factor * factor * factor ... is a term • expression: term + term + term ... is an expression • So we have a new grammar: • E → T | E+T • T → F | T*F • F → D • Compare the original grammar: • E → E+E • E → E*E • E → D • The parse tree for D+D*D is: E( E( T( F(D) ) ) + T( T( F(D) ) * F(D) ) )
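The layered grammar maps onto one parsing function per variable. The sketch below is a hedged Python illustration and swaps in a different formulation: the left-recursive rules E → E+T and T → T*F are written as loops (the EBNF forms E → T {+ T} and T → F {* F}), because a naive recursive function for a left-recursive rule would never terminate. The names parse, expression, term and factor are assumptions.

```python
def parse(text):
    """Parse single digits with + and *; return nested tuples showing the grouping."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def factor():
        # F -> D
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isdigit():
            value = int(tokens[pos])
            pos += 1
            return value
        raise SyntaxError(f"digit expected at token {pos}")

    def term():
        # T -> F {* F}, grouping to the left as T -> T*F does
        nonlocal pos
        tree = factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            tree = ("*", tree, factor())
        return tree

    def expression():
        # E -> T {+ T}, grouping to the left as E -> E+T does
        nonlocal pos
        tree = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            tree = ("+", tree, term())
        return tree

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For 9+5*2 the result groups 5*2 first, and for 3+4+5 it groups from the left, so both causes of ambiguity named earlier are resolved in exactly one way.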

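Another ambiguous grammar • stmt → if expr then stmt | if expr then stmt else stmt | other • Sentence: if E1 then if E2 then S1 else S2 • [Figure: two parse trees for this sentence. In the first, the else belongs to the inner if: stmt( if expr(E1) then stmt( if expr(E2) then stmt(S1) else stmt(S2) ) ). In the second, the else belongs to the outer if: stmt( if expr(E1) then stmt( if expr(E2) then stmt(S1) ) else stmt(S2) ).]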

Remove the ambiguity • Match “else” with the closest previous unmatched “then” • How to incorporate this rule into the grammar? • Original grammar: stmt → if expr then stmt | if expr then stmt else stmt | other • New grammar: • stmt → matched_stmt | unmatched_stmt • matched_stmt → if expr then matched_stmt else matched_stmt | other • unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt

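The parse tree of if-stmt example • [Figure: the unique parse tree for if E1 then if E2 then S1 else S2 under the new grammar: stmt( unmatched_stmt( if expr(E1) then stmt( matched_stmt( if expr(E2) then matched_stmt(S1) else matched_stmt(S2) ) ) ) )]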
