Parsing

Parsing Giuseppe Attardi Università di Pisa

Parsing
To extract the grammatical structure of a sentence, where:
• sentence = program
• words = tokens
For further information: Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)

Outline of coverage
• Context-free grammars
• Parsing
• Tabular Parsing Methods
• One pass
• Top-down
• Bottom-up
• Yacc

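Grammatical structure of program
[Figure: parse tree of a function definition. function-def has children name (main), arguments (()), and stmt-list; the stmt is an expression of the form expression operator expression, built from the variable cout, the operator <<, and the string “hello, world\n”.]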

Context-free languages
Grammatical structure defined by context-free grammar:
statement → assignment ;
statement → expression ;
statement → compound-statement
assignment → ident = expression
compound-statement → { declaration-list statement-list }
Definition (Context-free). A grammar in which the left-hand side of every production is a single non-terminal symbol.

Context Free Grammar
G = (V, Σ, P, S)
• V is a finite set of non-terminal symbols
• Σ is a finite set of terminal symbols
• P is a finite set of productions A → α, with A in V and α in (V ∪ Σ)*
• S is the start symbol, a member of V

Parse trees
Parse tree: tree labeled with grammar symbols, such that: if a node is labeled A and its children are labeled x1...xn, then there is a production A → x1...xn
• “Parse tree from A” = root labeled with A
• “Complete parse tree” = all leaves labeled with tokens

Parse trees and sentences
• Frontier of tree = labels on leaves (in left-to-right order)
• Frontier of tree from S is a sentential form
Definition. A sentence is the frontier of a complete tree from S.
[Figure: a parse tree with nodes L, L, ;, E, E, a; its leaves are marked “Frontier”.]

Example
G: L → L ; E | E
   E → a | b
Syntax trees from start symbol (L), with their sentential forms:
[Figure: a sequence of parse trees from L; the surviving frontier labels include a;E and a;b;b.]

Derivations
Alternate definition of sentential form:
• Given strings of grammar symbols α and β, say α ⇒ β is a derivation step if α = α′Aα″ and β = α′γα″, where A → γ is a production
• β is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ β (alternatively, we say that S ⇒* β)
Two definitions are equivalent, but note that there are many derivations corresponding to each parse tree
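A derivation can be replayed mechanically. The following is a minimal sketch in Python, assuming grammar G (L → L ; E | E, E → a | b) is encoded as a dictionary; the names GRAMMAR, leftmost_step and derive are illustrative and do not come from the slides. It always rewrites the leftmost non-terminal, which is only one of the many derivations that correspond to a given parse tree.

```python
# Grammar G: each non-terminal maps to its list of right-hand sides.
GRAMMAR = {
    "L": [["L", ";", "E"], ["E"]],
    "E": [["a"], ["b"]],
}


def leftmost_step(form, production_index):
    """Rewrite the leftmost non-terminal of `form` with the chosen production."""
    for pos, symbol in enumerate(form):
        if symbol in GRAMMAR:
            rhs = GRAMMAR[symbol][production_index]
            return form[:pos] + rhs + form[pos + 1:]
    raise ValueError("no non-terminal left to expand")


def derive(start, choices):
    """Return every sentential form visited, starting from [start]."""
    forms = [[start]]
    for choice in choices:
        forms.append(leftmost_step(forms[-1], choice))
    return forms


for form in derive("L", [0, 1, 0, 1]):
    print(" ".join(form))
```

With the choices 0, 1, 0, 1 the sketch walks through L ⇒ L ; E ⇒ E ; E ⇒ a ; E ⇒ a ; b.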

Another example
H: L → E ; L | E
   E → a | b
[Figure: parse trees from L under grammar H, from the single chain L, E, a up to longer lists that nest to the right.]

Ambiguity
• A sentence can have more than one parse tree
• A grammar is ambiguous if there is a sentence with more than one parse tree
• Example 1: E → E+E | E*E | id
[Figure: two parse trees for the same sentence of three ids, one with + at the root and one with * at the root.]
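Ambiguity of Example 1 can be checked by counting parse trees. The sketch below is an illustration only, assuming the input is already split into the tokens id, + and *; the function name count_parse_trees is made up for this example. It counts the trees that derive a token span from E by trying every operator as the root.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees deriving tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # Try each operator strictly inside the span as the root production.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees("id + id * id".split()))
```

For id + id * id the count is 2, which is exactly what makes the grammar ambiguous.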

Notes
• if e then if b then d else f
• { int x; y = 0; }
• a.b.c = d;
• Id → s | s.id
• E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id

Ambiguity
• Ambiguity is a feature of the grammar rather than the language
• Certain ambiguous grammars may have equivalent unambiguous ones

Grammar Transformations
• Grammars can be transformed without affecting the language generated
• Three transformations are discussed next:
  • Eliminating Ambiguity
  • Eliminating Left Recursion (i.e. productions of the form A → Aα)
  • Left Factoring

Eliminating Ambiguity
• Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
• For example, the grammar of Example 1 can be written as follows:
  E → E + T | T
  T → T * id | id
• The language generated by this grammar is the same as that generated by the previous grammar. Both generate id (+ id | * id)*
• However, this grammar is not ambiguous

Eliminating Ambiguity (Cont.)
• One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions
[Figure: parse tree of id + id * id, with E → E + T at the root and the product id * id nested under the right-hand T.]

Eliminating Ambiguity (Cont.)
• An example of ambiguity in a programming language is the dangling else
• Consider
  S → if b then S else S | if b then S | a

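Eliminating Ambiguity (Cont.)
• When there are two nested ifs and only one else, the sentence has two parse trees
[Figure: the two parse trees of if b then if b then a else a, one attaching the else to the outer if and one attaching it to the inner if.]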

Eliminating Ambiguity (Cont.)
• In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else.
• This association is expressed in the following (unambiguous) grammar:
  S → Matched
    | Unmatched
  Matched → if b then Matched else Matched
    | a
  Unmatched → if b then S
    | if b then Matched else Unmatched

Eliminating Ambiguity (Cont.)
• Ambiguity is a property of the grammar
• It is undecidable whether a context free grammar is ambiguous
• The proof is done by reduction from Post’s correspondence problem
• Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars

Eliminating Ambiguity (Cont.)
• For example, a grammar containing the production A → AA | a would be ambiguous, because the string aaa has two parses
[Figure: the two parse trees of aaa, one grouping the string as (aa)a and the other as a(aa).]
• This ambiguity disappears if we use the productions
  A → AB | B and B → a
  or the productions
  A → BA | B and B → a.

Eliminating Ambiguity (Cont.)
• Examples of ambiguous productions:
  A → AαA
  A → αA | Aβ
  A → αA | αAβA
• A CF language is inherently ambiguous if it has no unambiguous CFG
• An example of such a language is L = { a^i b^j c^m | i = j or j = m }, which can be generated by the grammar:
  S → AB | DC
  A → aA | ε
  C → cC | ε
  B → bBc | ε
  D → aDb | ε

Elimination of Left Recursion
• A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
• Immediate left recursion (productions of the form A → Aα) can be easily eliminated:
  1. Group the A-productions as
     A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
     where no βi begins with A
  2. Replace the A-productions by
     A → β1A' | β2A' | … | βnA'
     A' → α1A' | α2A' | … | αmA' | ε
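The two steps for immediate left recursion can be sketched in Python as follows. This is an illustration under assumptions, not a definitive implementation: right-hand sides are lists of symbols, the empty list stands for ε, the new non-terminal is named by appending an apostrophe, and the grammar is assumed to have no production A → A. The function name is hypothetical.

```python
def eliminate_immediate_left_recursion(nonterminal, productions):
    """Apply the group-and-replace steps to the productions of one non-terminal.

    productions: list of right-hand sides, each a list of symbols ([] is epsilon).
    Returns a dict with the rewritten productions.
    """
    # Step 1: split into A -> A alpha_i and A -> beta_i.
    alphas = [rhs[1:] for rhs in productions if rhs[:1] == [nonterminal]]
    betas = [rhs for rhs in productions if rhs[:1] != [nonterminal]]
    if not alphas:
        return {nonterminal: productions}  # nothing to do

    # Step 2: A -> beta_i A'  and  A' -> alpha_i A' | epsilon.
    fresh = nonterminal + "'"
    return {
        nonterminal: [beta + [fresh] for beta in betas],
        fresh: [alpha + [fresh] for alpha in alphas] + [[]],
    }


print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

Applied to E → E + T | T, the sketch yields E → T E' and E' → + T E' | ε.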

Elimination of Left Recursion (Cont.)
• The previous transformation, however, does not eliminate left recursion involving two or more steps
• For example, consider the grammar
  S → Aa | b
  A → Ac | Sd | ε
• S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive

Elimination of Left Recursion (Cont.)
Algorithm. Left recursion elimination.
Arrange nonterminals in some order A1, A2, …, An
for i = 1 to n {
  for j = 1 to i - 1 {
    replace each production of the form
      Ai → Aj γ
    by
      Ai → δ1 γ | … | δk γ
    where Aj → δ1 | … | δk are the current productions for Aj
  }
  eliminate the immediate left recursion among the Ai-productions
}
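The algorithm above can be sketched in Python as shown next. The encoding is an assumption made for illustration: a grammar is a dict from non-terminal to a list of right-hand sides, [] stands for ε, and fresh non-terminals get an apostrophe suffix. The function names are hypothetical. The textbook statement of this algorithm assumes a grammar without cycles or ε-productions, so treat the sketch as a demonstration on the example grammar rather than a general tool.

```python
def remove_immediate(a, prods):
    """Eliminate immediate left recursion among the productions of `a`."""
    alphas = [p[1:] for p in prods if p[:1] == [a]]
    betas = [p for p in prods if p[:1] != [a]]
    if not alphas:
        return {a: prods}
    fresh = a + "'"
    return {a: [b + [fresh] for b in betas],
            fresh: [al + [fresh] for al in alphas] + [[]]}


def eliminate_left_recursion(grammar, order):
    """Run the nested-loop algorithm over the non-terminals listed in `order`."""
    g = {a: [list(p) for p in ps] for a, ps in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for p in g[ai]:
                if p[:1] == [aj]:
                    # Ai -> Aj gamma becomes Ai -> delta gamma for each Aj -> delta.
                    expanded.extend(d + p[1:] for d in g[aj])
                else:
                    expanded.append(p)
            g[ai] = expanded
        g.update(remove_immediate(ai, g[ai]))
    return g


example = {"S": [["A", "a"], ["b"]],
           "A": [["A", "c"], ["S", "d"], []]}
for head, bodies in eliminate_left_recursion(example, ["S", "A"]).items():
    print(head, "->", " | ".join(" ".join(b) or "ε" for b in bodies))
```

On the grammar S → Aa | b, A → Ac | Sd | ε with the order S, A, the sketch leaves S unchanged and produces A → b d A' | A' and A' → c A' | a d A' | ε.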

Elimination of Left Recursion (Cont.)
• Notice that iteration i only changes productions with Ai on the left-hand side, and that afterwards every Ai-production that begins with a nonterminal Am has m > i
• Correctness induction proof:
  • Clearly true for i = 1
  • If true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Aj γ with j < i
  • Finally, after the elimination of self recursion, m in any Ai → Am γ production will be > i
• At the end of the algorithm, all derivations of the form Ai ⇒+ Am α will have m > i and therefore left recursion will not be present

Left Factoring
• Left factoring helps transform a grammar for predictive parsing
• For example, if we have the two productions
  S → if b then S else S
    | if b then S
  on seeing the input token if, we cannot immediately tell which production to choose to expand S
• In general, if we have A → αβ1 | αβ2 and the input begins with a string derived from α, we do not know (without looking further) which production to use to expand A

Left Factoring (Cont.)
• However, we may defer the decision by expanding A to αA'
• Then after seeing the input derived from α, we may expand A' to β1 or to β2
• Left-factored, the original productions become
  A → αA'
  A' → β1 | β2
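One round of left factoring can be sketched in Python as below. The encoding is an assumption for illustration: right-hand sides are lists of symbols, [] stands for ε, and each new non-terminal is named by adding apostrophes. The function names are hypothetical, and a full left-factoring pass would repeat this step until no two alternatives share a prefix.

```python
def common_prefix(bodies):
    """Longest sequence of symbols shared at the start of every body."""
    prefix = []
    for column in zip(*bodies):
        if any(symbol != column[0] for symbol in column):
            break
        prefix.append(column[0])
    return prefix


def left_factor(nonterminal, productions):
    """Factor out the common prefix of alternatives that start with the same symbol."""
    groups = {}
    for rhs in productions:
        groups.setdefault(rhs[0] if rhs else None, []).append(rhs)

    result = {nonterminal: []}
    fresh = nonterminal
    for first_symbol, group in groups.items():
        if first_symbol is None or len(group) == 1:
            result[nonterminal].extend(group)      # nothing to factor
            continue
        fresh += "'"
        alpha = common_prefix(group)
        result[nonterminal].append(alpha + [fresh])              # A  -> alpha A'
        result[fresh] = [rhs[len(alpha):] for rhs in group]      # A' -> beta_1 | beta_2
    return result


print(left_factor("S", [["if", "b", "then", "S", "else", "S"],
                        ["if", "b", "then", "S"],
                        ["a"]]))
```

For the dangling-else productions this gives S → if b then S S' | a and S' → else S | ε, so the choice between the two forms is deferred until after the inner S has been read.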

Non-Context-Free Language Constructs
• Examples of non-context-free languages are:
  • L1 = { wcw | w is of the form (a|b)* }
  • L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }
  • L3 = { a^n b^n c^n | n ≥ 0 }
• Languages similar to these that are context free:
  • L'1 = { w c w^R | w is of the form (a|b)* } (w^R stands for w reversed)
    This language is generated by the grammar
    S → aSa | bSb | c
  • L'2 = { a^n b^m c^m d^n | n ≥ 1 and m ≥ 1 }
    This language is generated by the grammar
    S → aSd | aAd
    A → bAc | bc

Non-Context-Free Language Constructs (Cont.)
• L''2 = { a^n b^n c^m d^m | n ≥ 1 and m ≥ 1 }
  is generated by the grammar
  S → AB
  A → aAb | ab
  B → cBd | cd
• L'3 = { a^n b^n | n ≥ 1 }
  is generated by the grammar
  S → aSb | ab
• This language is not definable by any regular expression
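The grammar S → aSb | ab for L'3 translates directly into a recursive membership test. The following is a small sketch in Python; the function name in_l3 is made up for this example. Each call peels one a from the front and one b from the back, mirroring one use of S → aSb, and the base case mirrors S → ab.

```python
def in_l3(s):
    """Return True iff s is in { a^n b^n | n >= 1 }, following S -> aSb | ab."""
    if s == "ab":                      # S -> ab
        return True
    # S -> aSb: strip the outer a ... b and recurse on the middle.
    return len(s) >= 4 and s[0] == "a" and s[-1] == "b" and in_l3(s[1:-1])


for word in ["ab", "aaabbb", "aab", "abab", ""]:
    print(repr(word), in_l3(word))
```

The recursion depth grows with n, which is the unbounded counting that a machine with a fixed number of states cannot do.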

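CFG vs DFSA
• L'4 = { a^n b^m | n > 0, m > 0 }
[Figure: transition diagram of a finite automaton for L'4, with a start arrow, states 0 and 1, and edges labeled a and b.]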

Non-Context-Free Language Constructs (Cont.)
• Suppose we could construct a DFSM D accepting L'3.
• D must have a finite number of states, say k.
• Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
• Since D only has k states, two of the states in the sequence have to be equal. Say, si = sj (i ≠ j), with the indices named so that i ≥ 1.
• From si, a sequence of i b's leads to an accepting (final) state. Therefore, the same sequence of i b's will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L'3. A contradiction.

Parsing
The parsing problem is: Given a string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w from S.)
A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
Two classes of algorithms for parsing:
• Top-down
• Bottom-up

Parser generators
• A parser generator is a program that reads a grammar and produces a parser
• The best known parser generator is yacc. It produces bottom-up parsers
• Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator

Top-down (predictive) parsing
• Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal.

Top-down parsing algorithm
Let input = a1a2...an
current sentential form (csf) = S
loop {
  suppose csf = a1…ak A β
  based on ak+1…, choose production A → γ
  csf becomes a1…ak γ β
}

Top-down parsing example
Grammar: H: L → E ; L | E
            E → a | b
Input: a;b
Sentential form    Input
L                  a;b
E;L                a;b
a;L                a;b
[Figure: the parse tree after each step: L alone; then L with children E ; L; then the first E expanded to a.]

Top-down parsing example (cont.)
Sentential form    Input
a;E                a;b
a;b                a;b
[Figure: the parse tree after each step: the inner L expanded to E; then that E expanded to b.]

LL(1) parsing
• Efficient form of top-down parsing
• Use only first symbol of remaining input (ak+1) to choose next production. That is, employ a function M: Σ × N → P in “choose production” step of algorithm.
• When this is possible, grammar is called LL(1)

LL(1) examples
• Example 1:
  H: L → E ; L | E
     E → a | b
  Given input a;b, so next symbol is a. Which production to use? Can’t tell. ⇒ H not LL(1)

LL(1) examples
• Example 2:
  Exp → Term Exp'
  Exp' → $ | + Exp
  Term → id
  (Use $ for “end-of-input” symbol.)
  Grammar is LL(1): Exp and Term have only one production; Exp' has two productions but only one is applicable at any time.
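Because one token decides every choice in Example 2, the grammar can be parsed by one small function per non-terminal. The sketch below is a recursive-descent illustration in Python, assuming the input is a list of tokens that ends with "$"; the function name parse_exp and the string form of the reported productions are choices made for this example.

```python
def parse_exp(tokens):
    """Parse a token list ending in "$"; return the productions used, in order."""
    if not tokens or tokens[-1] != "$":
        raise SyntaxError('input must end with "$"')
    pos = 0
    used = []

    def match(expected):
        nonlocal pos
        if tokens[pos] != expected:
            raise SyntaxError(f"expected {expected!r}, found {tokens[pos]!r}")
        pos += 1

    def exp():                       # Exp -> Term Exp'
        used.append("Exp -> Term Exp'")
        term()
        exp_prime()

    def exp_prime():                 # Exp' -> $ | + Exp, chosen by the next token
        if tokens[pos] == "$":
            used.append("Exp' -> $")
            match("$")
        elif tokens[pos] == "+":
            used.append("Exp' -> + Exp")
            match("+")
            exp()
        else:
            raise SyntaxError(f"unexpected token {tokens[pos]!r}")

    def term():                      # Term -> id
        used.append("Term -> id")
        match("id")

    exp()
    if pos != len(tokens):
        raise SyntaxError("tokens remain after end-of-input")
    return used


for production in parse_exp(["id", "+", "id", "$"]):
    print(production)
```

The only decision point is in exp_prime, where the next token ($ or +) selects the production, which is the LL(1) property in action.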

Nonrecursive predictive parsing
• Maintain a stack explicitly, rather than implicitly via recursive calls
• Key problem during predictive parsing: determining the production to be applied for a non-terminal

Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Set ip to point to the first symbol of w$.
Push S onto the stack.
repeat
  Let X be the top of the stack symbol and a the symbol pointed to by ip
  if X is a terminal or $ then
    if X == a then
      pop X from the stack and advance ip
    else error()
  else // X is a nonterminal
    if M[X,a] == X → Y1 Y2 … Yk then
      pop X from the stack
      push Yk, Yk-1, …, Y1 onto the stack with Y1 on top
      (push nothing if Y1 Y2 … Yk is ε)
      output the production X → Y1 Y2 … Yk
    else error()
until X == $
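The algorithm above can be sketched in Python for the grammar Exp → Term Exp', Exp' → $ | + Exp, Term → id. The table TABLE below was filled in by hand for that grammar and is an assumption of this example, as are the names NONTERMINALS and predictive_parse. Following the convention of these slides, $ appears in the grammar itself, so only the start symbol is pushed initially.

```python
NONTERMINALS = {"Exp", "Exp'", "Term"}

# Parsing table M[X, a] -> right-hand side of the production to apply.
TABLE = {
    ("Exp", "id"): ["Term", "Exp'"],
    ("Exp'", "$"): ["$"],
    ("Exp'", "+"): ["+", "Exp"],
    ("Term", "id"): ["id"],
}


def predictive_parse(tokens, start="Exp"):
    """Parse a token list ending in "$"; return the productions output, in order."""
    if not tokens or tokens[-1] != "$":
        raise SyntaxError('input must end with "$"')
    stack = [start]
    ip = 0
    output = []
    while True:
        x, a = stack[-1], tokens[ip]
        if x not in NONTERMINALS:              # X is a terminal or $
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            stack.pop()
            ip += 1
        else:                                  # X is a nonterminal
            rhs = TABLE.get((x, a))
            if rhs is None:
                raise SyntaxError(f"no table entry for ({x}, {a})")
            stack.pop()
            stack.extend(reversed(rhs))        # leaves Y1 on top of the stack
            output.append(f"{x} -> {' '.join(rhs)}")
        if x == "$":                           # until X == $
            break
    if ip != len(tokens):
        raise SyntaxError("tokens remain after end-of-input")
    return output


for production in predictive_parse(["id", "+", "id", "$"]):
    print(production)
```

The explicit stack plays the role that the call stack plays in a recursive parser; the productions come out in leftmost-derivation order.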

LL(1) grammars
• No left recursion
  A → Aα : If this production is chosen, parse makes no progress.
• No common prefixes
  A → αβ | αγ
  Can fix by “left factoring”:
  A → αA'
  A' → β | γ

LL(1) grammars (cont.)
• No ambiguity
  Precise definition requires that production to choose be unique (“choose” function M very hard to calculate otherwise)

LL(1) definition
• Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations
  S ⇒* wAγ ⇒ wαγ ⇒* wtx
  S ⇒* wAγ ⇒ wβγ ⇒* wty
  it follows that α = β
• Leftmost-derivation: where the leftmost non-terminal is always expanded first
• In other words, given
  1. a string wAγ in (N ∪ Σ)* and
  2. t, the first terminal symbol to be derived from Aγ
  there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt

Checking LL(1)-ness
• For any sequence of grammar symbols α, define set FIRST(α) ⊆ Σ to be
  FIRST(α) = { a | α ⇒* aβ for some β }
• FIRST sets can often be calculated by inspection

FIRST Sets ExpTerm Exp’ Exp’$ | +Exp Termid (Use $ for “end-of-input” symbol) FIRST($) = {$} FIRST(+Exp) = {+} FIRST($)  FIRST(+Exp) = {}  grammar is LL(1)

FIRST Sets L E ; L | EE a | b FIRST(E ; L) = {a, b} = FIRST(E) FIRST(E ; L)  FIRST(E)  {}  grammar not LL(1).
